Reliability of Computer Memory?
olddoc writes "In the days of 512MB systems, I remember reading about cosmic rays causing memory errors and how errors become more frequent with more RAM. Now, home PCs are stuffed with 6GB or 8GB and no one uses ECC memory in them. Recently I had consistent BSODs with Vista64 on a PC with 4GB; I tried memtest86 and it always failed within hours. Yet when I ran 64-bit Ubuntu at 100% load and using all memory, it ran fine for days. I have two questions: 1) Do people trust a memtest86 error to mean a bad memory module or motherboard or CPU? 2) When I check my email on my desktop 16GB PC next year, should I be running ECC memory?"
Recently I had consistent BSODs with Vista64 on a PC with 4GB...
This was a surprise?
My experience with memtest is that you can trust the results if it says the memory is bad; however, if the memory passes, it could still be bad. Troubleshooting your scenario should involve replacing the DIMMs in question with known-good modules while running Windows.
brandelf -t FreeBSD
wrap your _whole_ computer in tinfoil to deflect those pesky cosmic rays. it also works to keep them out of your head too.
It's the lowest of the low end of the market that doesn't use ECC, or at least parity RAM. For anything where reliability and veracity are important, you simply must use ECC.
If a system gives memtest86 errors, I break it down and swap components until it doesn't. The test patterns it uses can find subtle errors you're unlikely to run into with any application-based testing, even when run for a few days. Any failures it reports should be taken seriously. Also: you should pay attention to the memory speed value it reports; that's a surprisingly effective simple benchmark for figuring out if you've set up your RAM optimally. The last system I built, I ended up purchasing 4 different sets of RAM, and there was about a 30% delta between how well the best and worst performed on the memtest86 results--correlated extremely well with other benchmarks I ran too.
At the same time, I've had memory that memtest86 said was fine, but the system itself still crashed under a heavy Linux-based test. I consider both a full memtest86 test and a moderate workload Linux test to be necessary before I consider a new system to have baseline usable reliability.
There are a few separate problems here that are worthwhile to distinguish among. A significant amount of RAM doesn't work reliably when tested fully. Once you've culled those out, only using the good stuff, some of that will degrade over time to where it will no longer pass a repeat of the initial tests; I recently had a perfectly good set of RAM degrade to useless in only 3 months here. After you take out those two problematic sources for bad RAM, is the remainder likely enough to have problems that it's worth upgrading to ECC RAM? I don't think it is for my home systems, because I'm OK with initial and periodic culling to kick out borderline modules. And things like power reliability cause me more downtime than RAM issues do. If you don't know how or have the time to do that sort of thing yourself though, you could easily be better off buying more redundant RAM.
1) Yes
2) No
Now to be serious. Home PCs do not yet come with 6GB or 8GB. Most new home PCs still seem to have between 1GB and 4GB, and the 4GB variety is rare because most home PCs still come with a 32-bit operating system. 3GB seems to be the sweet spot for higher-end home PCs. Your home PC will most likely not have 16GB next year. Your workstation at work, perhaps, but even then only perhaps.
At the risk of sounding like "640KByte is enough for everyone", I have to ask why you think you need 16GB to check your email next year. I'm typing this on a 6 year old computer, I'm running quite a few applications at the same time and I know a second user is logged in. Current memory usage: 764Meg RAM. As a general rule, I know that Windows XP runs fine on 512Meg RAM and is comfortable with 1GB RAM. The same is true for GNU/Linux running Gnome.
Now, at work with Eclipse loaded, a couple of application servers, a database and a few VMs... Yeah, there indeed you get memory starved quickly. You have to keep in mind that such usage pattern is not that of a typical office worker. I can imagine that a heavy Photoshop user would want every bit of RAM he can get too. The Word-wielding-office-worker? I don't think so.
Now, I can't speak for Vista. I heard it runs well on 2GB systems, but I can't say. I got a new work laptop last week and booted briefly in Vista. It felt extremely sluggish and my machine does have 4Gig RAM. Anyway, I didn't bother and put Debian Lenny/amd64 on it and didn't look back.
In my view, you have quite a twisted sense of reality regarding the computers people actually use.
Oh, and frankly... If cosmic rays would be a big issue by now with huge memories, don't you think that more people would be complaining? I can't say why Ubuntu/amd64 ran fine on your machine. Perhaps GNU/Linux has built-in error correction and marks bad RAM as "bad".
Ahhh...the great dumpster continuum. Many a free computer will be found there. -- sowth (748135)
All of my computers that run for days on end without rebooting have ECC ram in them (Home server, workstation at work). Others must be rebooted every now and then.
Are there laptops that use ECC RAM? I wish I could buy some.
In Soviet Russia, articles before post read *you*!
Is ECC memory worth the money in a machine you use to check your E-mail? Can't you just reboot and/or replace the memory if errors occur?
I could see it happening when ECC memory costs no more than normal memory and has little or no impact on performance. Until then, I don't expect to start seeing it in desktop machines.
If you want ECC memory on your desktop, feel free to build your own machine with a motherboard that supports ECC memory. Some high end desktops do support ECC memory already.
My first computer was an 80286 with 1 MB of RAM. That RAM was all parity memory. Cheaper than ECC, but still good enough to positively identify a genuine bit flip with great accuracy. My 80386SX had parity RAM, so did my 486DX4 120. I ran a computer shop for some years, so I went through at least a dozen machines ranging from the 386 era through the Pentium II era, at which point I sold the shop and settled on an AMD K6-2 450. And right about the time that the Pentium was giving way to the Pentium II, non-parity memory started to take hold.
What protection did parity memory provide, anyway? Not much, really. It would detect with 99.99...? % accuracy when a memory bit had flipped, but provided no answer as to which one. The result was that if parity failed, you'd see a generic "MEMORY FAILURE" message and the system would instantly lock up.
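To make the "detects a flip but can't say which bit" point concrete, here is a minimal sketch of per-byte even parity. The function names are made up for illustration and this is not how any particular chipset implements it; it also shows the known blind spot that an even number of flips in the same byte goes unnoticed.

```python
def parity_bit(byte):
    """Even-parity bit for an 8-bit value: 1 if the number of set bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def store(byte):
    """Return (data, parity) as a parity DIMM would hold them (9 bits per byte)."""
    return byte & 0xFF, parity_bit(byte)


def check(stored):
    """True if the stored parity still matches the data; says nothing about which bit."""
    data, parity = stored
    return parity_bit(data) == parity


word = store(0b10110010)
one_flip = (word[0] ^ 0b00000100, word[1])   # a single bit flipped
two_flips = (word[0] ^ 0b00000110, word[1])  # two bits flipped in the same byte

print(check(word))       # intact data passes
print(check(one_flip))   # single flip is detected
print(check(two_flips))  # double flip slips through
```

A failed check only tells you that some bit in that byte is wrong, which is why the machine could do nothing better than halt.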
I saw this message perhaps three times - it didn't really help much. I had other problems, but when I've had problems with memory, it's usually been due to mismatched sticks, or sticks that are strangely incompatible with a specific motherboard, etc. none of which caused a parity error. So, if it matters, spend the money and get ECC RAM to eliminate the small risk of parity error. If it doesn't, don't bother, at least not now.
Note: having more memory increases your error rate assuming a constant rate of error (per megabyte) in the memory. However, if the error rate drops as technology advances, adding more memory does not necessarily result in a higher system error rate. And based on what I've seen, this most definitely seems to be the case.
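The arithmetic behind that note, as a rough sketch. The rates below are invented placeholders chosen to make the point, not measurements: if capacity grows 16x but the per-gigabyte error rate falls 16x, the whole-system rate is unchanged.

```python
def system_error_rate(capacity_gb, errors_per_gb_per_year):
    """Expected errors per year for the whole system, assuming errors scale linearly with capacity."""
    return capacity_gb * errors_per_gb_per_year


# Hypothetical numbers for illustration only.
old_system = system_error_rate(0.5, 1.0)           # 512MB at some baseline rate
bigger_same_tech = system_error_rate(8, 1.0)       # 8GB, same per-GB rate
bigger_better_tech = system_error_rate(8, 1.0 / 16)  # 8GB, per-GB rate 16x lower

print(old_system, bigger_same_tech, bigger_better_tech)
```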
Remember this blog article about the end of RAID 5 in 2009? Come on... are you really going to think that Western Digital is going to be OK with near 100% failure of their drives in a RAID 5 array? They'll do whatever it takes to keep it working because they have to - if the error rate became anywhere near that high, their good name would be trashed because some other company (Seagate, Hitachi, etc) would do the research and pwn3rz the marketplace.
I have no problem with your religion until you decide it's reason to deprive others of the truth.
With memory becoming so plentiful these days (I haven't seen many home PC's with 6 or 8GB granted, but we're getting there) it seems that a single error on a large capacity chip is getting more and more trivial. Isn't it a waste to throw away a whole DIMM? Why isn't it possible to "remap" this known-bad address, or allocate some amount of RAM for parity the way software like PAR2 works? Hard drive manufacturers already remap bad blocks on new drives. Also it seems to me that, being a solid state device, small failures in RAM aren't necessarily indicative of a failing component like bad sectors on a hard drive are. Am I missing something really obvious here or is it really just easier/cheaper to throw it away?
First, it was not cosmic rays; memory was tested in a lead vault and showed the same error rate. Turns out to have been alpha particles emitted by the epoxy / ceramic that the memory chips were encapsulated in.
That said: Quite clearly given your experience, Vista and Ubuntu load the memory subsystem quite differently. It is possible that Vista, with its all-over-the-map program flow, is missing cache a lot more often and so is hitting DRAM harder; I don't have the background to really know. I believe that Memtest86, in order to put the most strain on memory and thus test it in the most pessimal conditions, tries to access memory in patterns that equally hit physical memory hardest. But, what I have found is that some OSs, apparently including Ubuntu, will run on memory that is marginal, memory that Memtest86 picks up as bad.
As for ECC in memory... The problem is that ECC carries a heavy performance hit on write. If you only want to write 1 byte, you still have to read in the whole QWord, change the byte, and write it back to get the ECC to recalculate correctly. It is because of that performance hit that ECC was deprecated. The problem goes away to a large extent if your cache is write-back rather than write-through; though there will be still a significant number of cases where you have to write a set of bytes that has not yet been read into cache and does not comprise a whole ECC word.
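Here is a toy model of the read-modify-write described above. Everything in it is a stand-in: the class and method names are hypothetical, and the XOR checksum is NOT real ECC (real modules use a Hamming-style SECDED code over 64 data bits). It only shows why a one-byte store to an ECC word costs a read plus a write.

```python
def toy_check(data):
    """Stand-in check value over an 8-byte word (XOR of bytes). Not real SECDED."""
    value = 0
    for b in data:
        value ^= b
    return value


class EccMemory:
    """Memory organised as 8-byte words, each with a check value covering the whole word."""

    def __init__(self, n_words):
        self.words = [bytearray(8) for _ in range(n_words)]
        self.checks = [0] * n_words
        self.reads = 0
        self.writes = 0

    def read_word(self, index):
        self.reads += 1
        data = bytes(self.words[index])
        if toy_check(data) != self.checks[index]:
            raise ValueError("check mismatch in word %d" % index)
        return data

    def write_word(self, index, data):
        self.writes += 1
        self.words[index][:] = data
        self.checks[index] = toy_check(data)

    def write_byte(self, address, value):
        """A partial write: fetch the full word, patch one byte, store the word with a new check."""
        index, offset = divmod(address, 8)
        word = bytearray(self.read_word(index))  # read
        word[offset] = value                     # modify
        self.write_word(index, word)             # write back, check recomputed


mem = EccMemory(4)
mem.write_byte(10, 0xAB)
print(mem.reads, mem.writes)
```

A full-word write skips the read entirely, which is why a write-back cache that evicts whole lines hides most of the cost.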
That said, it is still used on servers...
But I don't expect it will reappear on desktops any time soon. Apparently they have managed to control the alpha radiation to a great extent, and so the actual radiation-caused errors are now occurring at a much lower rate, significantly lower than software-induced BSODs.
My experience with a server that recorded about 15TB of data is something like 6 bit-errors per year that could not be traced to any source. This was a server with ECC RAM, so the problem likely occurred in busses, network cards, and the like, not in RAM.
For non-ECC memory, I would strongly suggest running memtest86+ for at least a day before using the system, and if it gives you errors, replace the memory. I had one very persistent bit-error in a PC in a cluster that actually required 2 days of memtest86+ to show up once, but did occur about once per hour for some computations. I also had one other bit-error that memtest86+ did not find, but the Linux command-line memory tester found after about 12 hours.
The problem here is that different testing/usage patterns result in different occurrence probabilities for weak bits, i.e. bits that only sometimes fail. Any failure in memtest86+ or any other RAM tester indicates a serious problem. The absence of errors in a RAM test does not indicate the memory is necessarily fine.
That said, I do not believe memory errors have become more common on a per computer basis. RAM has become larger, but also more reliable. Of course, people participating in the stupidity called "overclocking" will see a lot more memory errors and other errors as well. But a well-designed system with quality hardware and a thorough initial test should typically not have memory issues.
However there is "quality" hardware that gets it wrong. My ASUS board sets the timing for 2 and 4 memory modules to the values for 1 module. This resulted in stable 1 and 2 module operation, but got flaky with 4 modules. I finally moved to ECC memory before I figured out that I had to manually set the correct timings. (No BIOS upgrade available that fixed this...) This board has a "professional" in its name, but apparently "professional" does not include use of generic (Kingston, no less) memory modules. Other people have memory issues with this board as well that they could not fix this way; it seems that sometimes a design is just bad, or that even reputable manufacturers do not spend a lot of effort to fix issues in some cases. I can only advise you to do a thorough forum search before buying a specific mainboard.
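A small simulation of why the pattern matters. This is only a sketch: a user-space Python script cannot exercise physical RAM the way memtest86+ does, and `StuckBitBuffer` is a made-up fault model (one bit that always reads back as 0) rather than a true intermittent weak bit. It does show how one pattern sails through while another exposes the fault.

```python
class StuckBitBuffer:
    """Hypothetical faulty memory: one bit at one offset always reads back as 0."""

    def __init__(self, size, bad_offset, bad_bit):
        self._cells = bytearray(size)
        self._bad_offset = bad_offset
        self._mask = 0xFF & ~(1 << bad_bit)

    def __len__(self):
        return len(self._cells)

    def __setitem__(self, index, value):
        self._cells[index] = value

    def __getitem__(self, index):
        value = self._cells[index]
        return value & self._mask if index == self._bad_offset else value


def pattern_test(buf, pattern):
    """Fill the buffer with one byte pattern, read it back, return the failing offsets."""
    for i in range(len(buf)):
        buf[i] = pattern
    return [i for i in range(len(buf)) if buf[i] != pattern]


faulty = StuckBitBuffer(size=64, bad_offset=17, bad_bit=3)

print(pattern_test(faulty, 0x00))  # pattern never sets the bad bit: passes
print(pattern_test(faulty, 0xF7))  # every bit except the bad one: passes
print(pattern_test(faulty, 0x08))  # pattern sets the bad bit: offset 17 fails
```

Two of the three patterns report a clean pass on memory that is definitely broken, which is the "no errors does not mean fine" point in miniature.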
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Then it would proba%ly alter not just one byte, b%t a chain of them. The cha%n of modified bytes would be stru%g out, in a regular patter%. Now if only there were so%e way to read memory in%a chain of bytes, as if it w%re a string, to visu%lize the cosmic ray mod%fication. hmmm...
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
Not all memory is created equal. Memory can be bad if Memtest detects errors, or you can simply be running it at the wrong settings. Usually there are both "normal" and "performance" settings for memory on higher end motherboards, or sometimes you can tweak all sorts of cycle-level stuff manually (CAS latency etc.).
Try running your memory with the most conservative settings before you assume it's bad.
.: Max Romantschuk
Depending on where it fails (if it fails in the same spot), you can relatively easily work around it and not throw out the remaining good portion of the stick. I wrote a howto:
http://gquigs.blogspot.com/2009/01/bad-memory-howto.html
I've been running on Option 3 for quite some time now. No, it's not as good as ECC, but it doesn't cost you anything.
1) Do people trust a memtest86 error to mean a bad memory module or motherboard or CPU?
Well, I'd add some other possibilities such as:
Bad power supply,
Memory isn't seated properly in its socket.
Incorrect timing set in BIOS.
Memory is incompatible with your motherboard.
etc..
But yeah, if memtest86 says there's a problem then there really is something wrong.
is to swap the memory modules to find out which is causing the problem, if not the motherboard. Also, I don't see how memory tests running inside an OS can be effective; I'd much rather boot off a smaller system on a DVD, USB stick or floppy to run a memory test. Dell servers have those Dell Diagnostics CDs that are very small in memory footprint just in order to run diagnostics on memory. But even they're not perfect, so you often have to take memory out and see if you can reproduce errors.
The probability of a cosmic ray at precisely the right angle and speed to cause a single bit error and cause an app to crash is somewhere on the same order as your chances of getting hit by a car, getting struck by lightning, getting torn apart by rabid wolves, and having sex in the back of a red 1948 Buick convertible at a drive-in movie theater on Tuesday night, Feb. 29th under a blue moon... all at the same time.... Sure, given enough bits, it's bound to happen sooner or later, but it isn't something I'd worry about. :-)
The probability of RAM just plain being defective---failing to operate correctly due to bugs in handling of certain low power states, having actual bad bits, having insufficient decoupling capacitance to work correctly in the presence of power supply rail noise, etc---is probably several hundred thousand orders of magnitude greater (probably on the order of a one in several thousand chance of a given part being bad versus happening to a given part a few times before the heat death of the universe).
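Memory test failures (other than mapping errors) are pretty much always caused by hardware failing. If running memtest86 in Linux works correctly for days, this probably means one of three things:
A. Linux is detecting the bad part and is mapping out the RAM in question.
B. The Linux VM system doesn't move things around RAM as much as Windows.
C. Linux power management isn't as rough on the RAM or CPU as Windows.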
I couldn't tell you which of these is the case without swapping out parts, of course. You should definitely take the time to replace whatever is bad even if it seems to be "working" in Linux. In the worst case, you have a few bad bits of RAM, they're somewhere in the middle of your disk cache in Linux, and you are slowly and silently corrupting data periodically on its way out to disk.... You definitely need to figure out what's wrong with the hardware and why it is only failing in Windows, and it sounds like the only way to do that is to swap out parts, boot into Windows, and see if the problem is still reproducible in under a couple of days, repeating with different part swaps until the problem goes away. Don't forget to try a different power supply.
Check out my sci-fi/humor trilogy at PatriotsBooks.
Memtest86 is the usual test tool for a couple of reasons (and only one of those is price).
Chances are very good you have a problem. Definitely worth checking it out.
1) Re-run the test and see if the error is in the same place. If it is, you can pretty much guarantee the RAM is bad at that position.
2) Swap the memory out and try again. You're best to do this while you still can under warranty.
Bottom line is you're not paranoid and you probably do have a problem. You can either deal with it up front or live with a compromised system that eventually bites you on the backside.
These posts express my own personal views, not those of my employer
Please, please tell me this is an early April fool's joke. If not, dear submitter, I hope that you're either very tired or very drunk right now because you literally just asked:
"Windows is crashing randomly and the program that I ran to test the memory is reporting errors. Does that mean the memory in my computer is bad?"
You should have also tried running a hacked version of OS X to serve as a tie-breaker.
several hundred thousand orders of magnitude
We've crossed beyond the realm of the astronomical and into something else entirely. Surely you meant several orders of magnitude, aka, hundreds of thousands of times? Let's keep things on this side of the googol.
Evidently, the key to understanding recursion is to begin by understanding recursion. The rest is easy.
It has been completely stupid to dispense with parity memory and ECC memory in PC's. Apple was the first to go to 8-bit memory bytes long ago (and they still cost more!) and now it seems everyone below the server level is happy playing without a net. Even GPU cards, if used for highly parallel FP calculations should have the ability to detect when a memory error has happened and signal the application to handle. Completely stupid, and beyond completely stupid, that we trust our calculations to a system now that can't even determine if it has made an error!
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
really, it's not that much more expensive. Search newegg for unbuffered ecc, if you are using a desktop class system that can't handle registered ram.
You wouldn't put data you care about on a hard drive without raid, would you?
Was it cosmic rays, or Alpha particle decay from impure materials that was going to do in our memory soon? IIRC it was the latter.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
Yes. I do, anyway; I've never had it report a false-positive, and it's always been one of the three (and even if it was cosmic rays, it wouldn't consistently come up bad, then, would it?). Then again, it could also mean that you could be using RAM requiring a higher voltage than what your motherboard is giving it. If it's brand-name RAM, you should look up the model number and see what voltage the RAM requires. Things like Crucial Ballistix and Corsair Dominator usually require around 2.1v.
Depends. If you're doing really important stuff then sure. ECC memory is quite a boon in that case. If you're just using your desktop for word processing and web browsing, it's a waste of money.
Screw the rules, I have green hair!
and having sex in the back of a red 1948 Buick convertible at a drive-in movie theater on Tuesday night, Feb. 29th under a blue moon... all at the same time....
Mom?
My ASUS mobo (A8N-SLI) would reduce the memory timings if I put 4 memory modules in automatically. I hated that so I used the BIOS to undo it. I ran MemTest to make sure it was okay.
Oddly, the only RAM I've ever really had problems with was some bad-ass Corsair memory I bought for my 800 FSB P4 early on. The timings in the SPD would prevent the system from booting, even if it was the only RAM in the system. I overrode this in the BIOS (on one of the rare occasions that it booted) and it was okay unless I cleared CMOS. After a few months I removed that RAM to send it back and put in the cheapest PC3200 RAM I could find at Fry's. That fixed it, and by altering the settings in the CMOS over time, I could overclock this RAM to the same speed as the Corsair stuff. And it would work. And if I cleared CMOS it would just slow back down instead of failing.
To Corsair's credit, they replaced my RAM, although the replacement was in the same series, it did not have the same timings as the original RAM. But at least it worked.
It's funny, if you look up DDR SDRAM on Wikipedia, it has pictures of essentially the Corsair RAM I used. The version number is the same as the later RAM I got that worked; the earlier stuff was v1.1 or something but otherwise looked the same. The C2 in the name is supposed to mean it is CAS latency 2 RAM, but as I mentioned, the replacement RAM I received was not actually as fast, it was CL3.
http://lkml.org/lkml/2005/8/20/95
Most failures will appear when a PC is heavily stressed.
Combine the memtest86 test with some continuously running programs that use memory, hard disk, CPU and network.
If a system survives this test for a whole day, it's in perfect shape.
While lots of people are making fun of your seemingly paranoid concern about the destructive, deadly cosmic ray, I'm here to support you. We get hit by cosmic rays every second, and our skulls are just not thick enough to resist all those penetrations; that's why we lose memory from time to time. Have you ever found yourself forgetting something that happened just an hour ago? That's why. Wearing a tinfoil hat is the only safeguard against unexpected memory degradation.
Besides cosmic rays, other forms of harmful radiation should not be neglected. The radiation emitted from computer processors and monitors can also cause deformation of your unborn children, therefore you should buy anti-radiation suits for your pregnant loved ones. Remember, the harmful effect is irreversible; you don't want to take the risk.
Last but not least, reports have shown a high correlation between impotence and prolonged computer use. I have friends who had their balls fried by sitting near a computer with a 2.4GHz CPU, because the frequency is exactly the same as your microwave oven's. We just can't tell how many poor dudes have had their balls disabled this way. Sad.
You can't be too careful. At the end of days, when the streets are stuffed with mindless, impotent zombies, guess who has the last laugh.
I've seen lots of RAM errors as the speed of memory has increased, especially with the AMD64 Hammer chips. What it usually boils down to is someone not manufacturing their components such that they truly meet their spec.
If you slow down your memory and the errors go away, it's not cosmic rays. AFAIK, cosmic rays will flip bits regardless of how fast the RAM is being run at.
That's only a little more than a few orders of magnitude of orders of magnitude.
paintball
I have to agree with you that Linux must be mapping out the bad areas. I have an old P3 500 with 128MB RAM; running Linux (Slackware) it ran fine, but when I installed XP on it I got a BSOD, "Hardware Parity Error". It's likely memtest is not lying to you and Ubuntu is just making the best of a bad situation.
A loop, by its nature, continues. If that didn't make sense, start reading this sentence again.
I'm old enough to remember that rubbish in the press as well. But it started before 512Meg; I remember asking a clerk if my $450 4Meg SIMM would degrade in potentially high-radiation environments. Like the crap that comes from my microwave.
For a cosmic ray to have enough energy to flip a bit of memory would be fairly impressive. Has to hit the right spot + be the right energy to stimulate a relatively large device to think it's got a fresh signal. Not too likely.
I'd be more worried about that same cosmic ray causing a DNA error and giving me cancer.
Now of course, if you were designing interstellar probes, you would have a definite concern on your hands. Once out past the Oort cloud (OK, farther than that) you no longer have the magnetic shield of our sun. Now we are in the cosmic ray bath. This is where the odds of a bit flip start to get high. Now let's add the fact that your little probe is going to be out there a LONG time. I can bet a few bucks that yeah, you are going to suffer from a bit flip or 7. :)
---
Oh Vista 64bit is a nightmare. 3 machines of mine have had it. ALL had major issues. back to XP 32 and ZERO issues. All of them just sing along now. My home server and a few minor devices run Ubuntu NEVER had an issue.
Am I looking forward to Windows 7? Nope. It means Win XP will really die and MS won't patch it. ( Prediction. Win 7 will suck as bad as Vista when people figure out it doesn't actually work on a EEEEEEeeeeeeeeeeePC after they install crap. ) ( Second prediction. The much touted touch screen interface additions in Win7 actually are really annoying to use. )
----
Back to the Oort cloud. Why are you sending a PC out there again?
If running memtest86 in Linux works correctly for days, this probably means one of three things:
First of all, you don't run Memtest86 under Windows, Linux, or any other operating system. Why? Because you can't test memory that is in use by any other program. This already tells us that you probably haven't used Memtest86 recently enough to remember you would run this from a bootable CD or Floppy. It's downhill from here.
A. Linux is detecting the bad part and is mapping out the RAM in question.
No. Linux doesn't do this. Can you imagine the extra overhead of double checking every single read and write to RAM? Jesus Christ.
B. The Linux VM system doesn't move things around RAM as much as Windows.
Nice, baseless troll argument.
C. Linux power management isn't as rough on the RAM or CPU as Windows.
Isn't as rough? Because half the time it doesn't work as intended? So now a negative becomes a plus? Give us a break.
"When you see a unixer brainwashed beyond saving, kick him out of the door." - Xah Lee
The stuff is so cheap now, it only costs a tiny bit more to buy the brand name stuff so you're fine.
When it was 800$ AUD for 64mb of ram, the cheap '500$ stuff!!' was an option, sadly it was the wrong one but all we could afford back then. :/
If anything needs more quality control it's either hard disks or high end gaming video cards, which literally seem to burn out between 3 and 24 months nowadays
The word you're looking for here is "hyperbole". :-)
Check out my sci-fi/humor trilogy at PatriotsBooks.
I disagree - cosmic rays do happen all the time, and do interact with silicon. I was at a demo of a low-light level camera, and every few seconds there'd be a cosmic ray artefact on the monitor.
I usually wear medieval armour. Not only does that work as efficient as tinfoil, it's also very fashionable.
Thanks for clarifying that about memtest86. I was thinking of a different memory tester that basically does an mmap of gigabytes of RAM and beats on it. I forget the name of that one, not that it matters.
The reason for my confusion was that the original post says that in Ubuntu, "it ran fine for days". The problem is that the antecedent of the word "it" is unclear and could refer either to the computer and Ubuntu combination itself or to memtest86 running in Ubuntu. Without that critical piece of information about memtest86, it wasn't at all clear which of these was the intended meaning.
Check out my sci-fi/humor trilogy at PatriotsBooks.
I've built many a system, and discovered that
1. removing & re-socketing the DIMM in question, sometimes 3 or 4 times ( testing between ),
*almost always* fixes the problem:
it's a CONNECTION problem, not a DIMM problem.
( dust in the socket? slight pressure-differences between "pins" and contact-pads?
whatever you do, don't wobble the DIMM when socketing it, or that'll push the contacts away *just enough* to make the connection erratic ).
2. Also, ASUS makes AMD based motherboards that DO have ECC, and some of us, who LIVE by our computers, insist on 'em.
( *drastically* cheaper than an Opteron/Xeon system, but still ECC reliability? Yowza, baby! )
3. The PSU oft is the culprit: RAM that tests fine, when the disks aren't in use, is being under-volted by the PSU, so it *becomes* erratic, when the *whole system* is under load.
Cheers, people!
sorry buddy, but random crashes on a vista machine are a feature not a bug. I've been using linux for the last 6 years and not once have I had an unexplained crash. I started a new job in a developer environment which is exclusively windows and within the first week my vista machine had to be wiped and reinstalled to solve the problem of repeated and unexplained crashes. Colleagues have had their machines simply reboot for no reason, and the performance of the machine is painful. I can unzip the same file faster in a virtual box linux guest than I can in the vista host, pardon my language, but it's a f**king joke that vista is being pushed as a "replacement"
prepare the survey weasels.
When rudely swiping at other people, at least stop dribbling nonsense like "several hundred thousand orders of magnitude greater". I don't think you know what you are talking about. >>10^100000?
So I discount the rest of your "contribution" accordingly. Actually, several other parts of your answer are independently rubbish too: have you considered a career in tabloid journalism? Wish I had mod points...
Rgds
Damon
http://m.earth.org.uk/
Not only silicon... NASA astronauts consistently observe bright flashes in orbit, whether their eyes are open or closed. It is believed that these flashes are the result of cosmic rays interacting with the astronauts' retinas.
Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
Umm... given that you're talking about CCDs, odds are most of what you saw was thermal noise from the equipment itself and/or tiny fluctuations in the power rails, not cosmic rays. Also, you're talking about radically different voltages here---a device with an open face designed specifically to detect single photons versus a device inside a sealed package designed to reject outside interference operating at much higher threshold levels that has to swing by a volt or two to change states....
Check out my sci-fi/humor trilogy at PatriotsBooks.
"Sure, given enough bits, it's bound to happen sooner or later, but it isn't something I'd worry about. :-)"
The last numbers I saw said something like 1 bit-flip per gigabyte-month of RAM, but this is on the ground.
In space applications (and to some extent in high-altitude airplanes as well), bit-flips are extremely common and usually occur more than once per day per 2 megabytes of RAM (I saw some statistics on this from one mission, I believe it was SMART).
In general, I would say that the following should use ECC:
Servers and mainframes
Avionics and space computers
Laptops (they are frequently used in high altitude airplanes)
Desktops do not in general need them, since a bit error usually does not have catastrophic effects and the likelihood of one happening is not that great.
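For a feel of what "1 bit-flip per gigabyte-month" would mean if taken at face value, here is a back-of-the-envelope sketch. The rate is just the figure recalled above, not a verified measurement, and treating flips as independent Poisson events is my own simplifying assumption.

```python
import math


def expected_flips(capacity_gb, months, flips_per_gb_month=1.0):
    """Expected number of bit flips, scaling linearly with capacity and time."""
    return capacity_gb * months * flips_per_gb_month


def prob_at_least_one(capacity_gb, months, flips_per_gb_month=1.0):
    """Chance of one or more flips, assuming flips arrive as a Poisson process."""
    return 1.0 - math.exp(-expected_flips(capacity_gb, months, flips_per_gb_month))


for gb in (1, 4, 16):
    print(gb, "GB:", expected_flips(gb, 12), "flips/year,",
          round(prob_at_least_one(gb, 1), 3), "chance of a flip in a month")
```

Most of those flips would land in memory that is unused or holds data nobody notices, which fits the observation that desktops rarely suffer visibly.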
I'd mod you down but you're already at -1. Stop whining about kdawson and whine about the posts instead! n00b
Requiem for the American Dream
You're right that I've never run memtest86 at all. I hadn't regularly worked with any hardware based on an Intel architecture until about two years ago, and haven't experienced any RAM problems in that relatively short period. That is the sole valid criticism in your post, and even that was redundant. The rest of your post consists of you putting words in my mouth that I did not say.
Regarding point A., many Linux systems do perform at least rudimentary RAM checks. What I said was that it is remotely possible that it got lucky and detected the problem during such screening, then flagged that page of physical RAM as defective. I never said anything about checking every write to RAM. That was you putting words in my mouth, and completely ludicrous words that I'd have to know almost nothing about hardware to say, at that. Nice straw man.
Regarding point B., that's not a baseless troll argument by any stretch of the imagination. First, running a lean Linux distro will almost certainly thrash pages around far less than 64-bit Vista simply because the OS uses far less RAM. Second, last time I used it, Linux wired down a -lot- of pages in the kernel. All of those pages are just going to sit there. If anything, this was a criticism of Linux's tendency to wire too many pages, not any sort of "pro-Linux" comment. Maybe it might be taken to mean that Linux is less likely to eject pages belonging to one process in favor of another process---indeed, my experience has been that it does seem to do so less frequently than some other operating systems, though this can either be good or bad depending on the workload in question---but that was in no way implied by my previous comment, nor certainly was there any value judgment on my part as to whether such behavior is good or bad.
Likewise on point C., I was actually being harshly critical of Linux's power management, albeit without coming right out and saying it. Nowhere in my statement did I in ANY way insinuate that failing to switch into the lowest power states was in any way a good thing. It isn't. Poor power management leads to diminished battery life in portables and increased electric bills from computers of all types.
Before you go painting me as a pro-Linux troll, you need to learn some reading comprehension skills and stop trying to put words in my mouth. It only makes you look like a troll yourself.
Check out my sci-fi/humor trilogy at PatriotsBooks.
much MUCH smaller.
Travelling 1 um through paper doesn't get you to the other side. It will get you through several bit sites in modern RAM.
I will add that memory is not necessarily bad.
I had issues once because of a buggy BIOS setting a bad voltage on the memory chip.
Check for BIOS updates.
Also memtest that memory on another computer if possible.
I noticed the same behaviour: Windows wouldn't even install in optimal performance condition (not overclocking, simply set up for dual-channel) while Ubuntu would install.
However, some programs wouldn't work because some buffers got corrupted during install and the binaries were then corrupted on disk.
So the fact that Ubuntu doesn't crash doesn't mean it can be considered reliable.
In the case of a bad memory chip :
Linux doesn't do the "A" part of the parent post, but you can tell the kernel that some memory range is bad (other posts are giving instructions)
And I have a "D." item :
Linux apps use many more shared libraries than Windows ones, so they have a smaller "unique" memory footprint and are therefore less likely to land on a bad memory region.
I have discovered a truly marvelous proof of killer sig, which this margin is too narrow to contain.
I always run Memtest86 (or Memtest86+) for at least a couple of full passes (preferably overnight) whenever I build a new system or upgrade RAM. It gives you a pretty good "zeroth order" indication of whether your motherboard and RAM are stable. I don't even bother trying to install an OS until I can get Memtest86 to run clean. As others have already noted, if it reports errors, you can be pretty certain that you have a problem; if it runs clean overnight the RAM is probably OK, but there is still a slight chance that there may be some issues.
Regarding ECC, I think it is a travesty that most consumer desktop motherboards do not even have the option of enabling it in the BIOS. When I put together PCs, I try to use motherboards which I know support ECC, and pay the extra few dollars for ECC memory modules. I've found that Asus desktop motherboards often do support ECC (unlike most of the other brands).
The cosmic rays (CR) probably do not interact with the retina. A CR would only interact with a few cells, not enough to be called a flash. Plus, the cells in the retina are tuned to work with photons in a specific spectrum. We can't see IR or UV light even when very intense. So the interaction with the CR would probably not give a signal.
My bet is that it is Cerenkov radiation, as a high-speed charged particle breaks the speed of light in the fluid in the eyeball.
I would be interested to see if you could use memory errors as a method to detect cosmic rays.
xterm -n 8
Since nobody's mentioned it yet:
More recent versions of Red Hat come with EDAC (formerly known as bluesmoke) enabled and will throw parity errors to the syslog ...
http://bluesmoke.sf.net/
http://buttersideup.com/edacwiki/Main_Page
Its predecessor, Linux-ECC, also has a plug by DJB for its use with some decent details:
http://cr.yp.to/hardware/ecc.html
http://www.anime.net/~goemon/linux-ecc/
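For anyone who wants to peek at what EDAC is counting without grepping the syslog, here is a minimal Python sketch. It assumes the EDAC driver for your chipset is loaded and exposes per-controller `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/`, which is the usual sysfs layout but may differ on your kernel; the function name is made up for illustration.

```python
from pathlib import Path

# Assumed default location of the EDAC memory-controller entries in sysfs.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(root: str = EDAC_ROOT) -> dict:
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    ce_count = corrected errors, ue_count = uncorrected errors.
    An empty dict means no EDAC controllers were found under `root`.
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    counts = {}
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # file missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    print(read_edac_counts() or "no EDAC memory controllers found")
```

An empty result only tells you the driver isn't exposing anything; it does not tell you the RAM is error-free.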
o/~ Join us now and share the software
When you look at computers from a movie's perspective, such as Tron, you can see the reason why we are having so many issues with high error rates: it's just the programs not wanting to work with each other. ECC is nice for servers and big data crunchers that can't have an error except once every million or ten bytes. If you're willing to shell out the cash for it, more power to you. But 80% if not more of computer users don't even notice the errors, because they almost never end in a blue screen of death. Personally, I blame it all on Microsoft.
This is a Mac, what you have there is an embarrassment to your fellow computer users.
http://www.ida.liu.se/~abdmo/SNDFT/docs/ram-soft.html
This references an IBM study, which is what I think I actually remember but could not find quickly this morning.
"In a study by IBM, it was noted that errors in cache memory were twice as common above an altitude of 2600 feet as at sea level. The soft error rate of cache memory above 2600 feet was five times the rate at sea level, and the soft error rate in Denver (5280 feet) was ten times the rate at sea level."
My bet is that it is Cerenkov radiation, as a high-speed charged particle breaks the speed of light in the fluid in the eyeball.
Indeed, these flashes have pretty much already been identified as the result of Cerenkov radiation.
Looks like these truths are not so self-evident after all...
Second time I've posted this, but it's an interesting paper:
http://www.ida.liu.se/~abdmo/SNDFT/docs/ram-soft.html
If you buy cheap shit RAM, you'll get exactly the same as the OP. Get decent RAM and you'll not have an issue. For the last decade, I've bought nothing but Crucial and never had an issue with it. The odd time I've bought cheapass generic in that time, it's bitten me in the ass without fail.
I only please one person per day. Today is not your day. Tomorrow isn't looking good either. - Scott Adams
The probability of a cosmic ray at precisely the right angle and speed to cause a single bit error and cause an app to crash is somewhere on the same order as your chances of getting hit by a car, getting struck by lightning, getting torn apart by rabid wolves, and having sex in the back of a red 1948 Buick convertible at a drive-in movie theater on Tuesday night, Feb. 29th under a blue moon... all at the same time....
That was one hell of a night, wasn't it?
A CR would only interact with a few cells, not enough to be called a flash.
I bet you tell the kids that there's not really molten lava in their baking soda volcano too. :(
:D
But I forgive you because Cerenkov radiation is way, WAY cooler than just stupid gamma rays.
Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
At least on all computers I've used, there's a multitude of small programs running, each of which seems to want to wake up briefly every second or so. I don't know why - most of them probably just want to realize there's nothing for them to do and go back to wait mode again. However, each of these wakeups is potentially causing cache misses and memory swap-ins and swap-outs.
This is not necessarily the fault of Windows itself, but of the applications running on a normal Windows box. But the end result is the same - an "idle" Windows box exercises the memory more than it really would need to, just because of the behaviour of its applications.
MacPro Towers can use ECC memory, lots of it.
music lover since 1969
At what point in time was that true? Back when 1KB of memory was the norm? 16KB of memory? 64KB? 640KB? 64MB? 1GB?
Has the energy required to alter a single bit changed in the last 30 years? I would guess yes due to the fact that everything seems to be getting smaller and running on lower voltages, but maybe shielding is better these days?
But the main thing is that in the last 30 years we have increased the average amount of memory in the average computer by somewhere around a million times. We also have many many more computers running much more of the time. Everything else being equal (which it probably isn't), there must be a lot more of this going on than you think... (memory corruption caused by cosmic rays that is, not rabid wolves having sex in the back of cars at drive-ins)
I was recently running a server that archived about 2 terabytes of data and got something like one or two bit errors per year that could not be traced to any known source. This server boasted ECC RAM, so the problem didn't occur in the ram, and it was unlikely for it to have occurred in the FSB.
If you go with non-ECC, I would suggest running memtest86+. If you get errors, swap the memory. If swapping the memory still doesn't take care of it, swap motherboards! I recently had a memory problem in one of my customers' racks, and running memtest86+ got nothing until I had it running on my bench for over a week. There may be some problems with memtest86+...I even had another bit-error that memtest86+ did not find, but a Linux commandline memory tester found a problem almost immediately
The problem here is that different testing/usage patterns result in different probabilities of finding potentially bad words, e.g. words that may only be bad if you read from them a hundred cycles consecutively. But, if you do see a failure in memtest86+ or the CLI tester, you got yourself a serious problem. The point to take from this is that if you don't see errors, that doesn't mean you don't have errors!
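To make the point about access patterns concrete, here is a toy user-space sketch in Python. It is not what memtest86+ does internally: a Python process can't pick physical addresses, and the CPU caches absorb much of the traffic, so treat it purely as an illustration of "write a pattern, read it back several times, report mismatches". The function names, the pattern set and the `reads` parameter are all assumptions made for the example.

```python
def find_mismatches(buf: bytearray, pattern: int) -> list:
    """Return (offset, expected, actual) for every byte that differs from `pattern`."""
    return [(i, pattern, b) for i, b in enumerate(buf) if b != pattern]


def pattern_test(size_bytes: int, patterns=(0x00, 0xFF, 0xAA, 0x55), reads: int = 1) -> list:
    """Fill a buffer with each pattern, read it back `reads` times, collect errors.

    Repeated reads matter because some marginal cells only misbehave
    after many consecutive accesses.
    """
    buf = bytearray(size_bytes)
    errors = []
    for p in patterns:
        expected = bytes([p]) * size_bytes
        buf[:] = expected
        for _ in range(reads):
            if buf != expected:  # cheap whole-buffer check first
                errors.extend(find_mismatches(buf, p))
    return errors


if __name__ == "__main__":
    bad = pattern_test(16 * 1024 * 1024, reads=10)
    print(f"{len(bad)} mismatching bytes found")
```

A clean run here means very little, for exactly the reason given above: a different pattern or a different read count might still trip over a marginal word.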
Having said this, I still don't think memory errors among PCs are that common. We have more RAM on machines these days, but at the same time, the manufacturing processes have become better. My personal conviction is that although the likelihood of a word error has increased with the number of words in memory, the RAM itself has become so much more "solid" that the effect of the increase in memory is negligible. Now, if you do dumb things with your computer like running it without a case or not giving it ventilation (learned this the hard way) or overclocking it, you *WILL* still run into problems. But if you design a system with quality and integrity, you typically shouldn't have these issues with memory!
One last thing to point out: there is quality hardware, and there is cheap hardware. My PC-Chips motherboard ran for three months and two days, and I didn't have a problem. Two days out of warranty. Now, take my MSI motherboard. It sets the timing for all memory modules to have the values of a single module. This resulted in stable single module operation, but got flaky with all four modules. I finally moved to ECC before I figured out that I had to manually set the correct timings. This board is an ultra board, but apparently, it does not include use of generic (Micron, Corsair, etc!! - tried 'em all) memory modules. People on the Newegg reviews board have memory issues with this board as well that they could not fix with a BIOS update, and it appears that sometimes a design just is bad! Even the "good" manufacturers do not spend a lot of effort to fix issues in some cases.
My words of advice: Do your homework. Read through the reviews. AND DON'T BUY HARDWARE AS SOON AS IT COMES OUT!
why do you think the bsod is a memory glitch? particularly after running another OS that ran fine? Don't they teach troubleshooting anywhere? You should go learn it.
-- Programming with boost is like building a house with lego. It's a cool but I wouldn't want to live in it
Actually. 3GB isn't as sweet a spot as people seem to assume.
In a system where you have 2 DIMM slots (eg a laptop), it's very much advisable to put in 4GB, being 2x2GB modules and still have dual-channel access to your memory.
Dual-channel doesn't work with DIMMs that are different in size.
For most 'normal' boards, this is the same. Using 4x1 or 2x2 is performance-wise better than using 1+2. The cost involved in this is negligible nowadays, so there's no reason not to do it, even if you do 'lose' 1G.
The 3GB limit also entirely depends on whether your system maps 1G to PCI/Graphics/etc. Some systems only map 512M, some 768, etc. So depending on that, you might actually end up with 3.5 or 3.25 usable.
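The arithmetic behind those numbers is simple enough to sketch in Python. This assumes a plain 32-bit physical address space with no memory remapping above 4GB, and the function name is invented for the example.

```python
# Assumption: 32-bit physical address space (4096 MB), no remapping above 4GB.
ADDRESS_SPACE_MB = 4096


def usable_ram_mb(installed_mb: int, mmio_mb: int) -> int:
    """RAM left visible after `mmio_mb` of address space goes to PCI/graphics/etc."""
    return min(installed_mb, ADDRESS_SPACE_MB - mmio_mb)


if __name__ == "__main__":
    for mmio in (1024, 768, 512):
        print(f"4096 MB installed, {mmio} MB mapped -> {usable_ram_mb(4096, mmio)} MB usable")
```

With 2GB installed nothing is lost, since the mapped region and the RAM don't collide in the address space.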
Regards,
Splut
Coz eternity my friend, is a long *ing time.
2) A prerequisite (and an expensive and hard-to-find one) for ECC memory is having a motherboard/chipset that *supports* ECC memory. Usually this means a server-class motherboard, but Intel usually has at least one high-end desktop motherboard on their line-up with ECC capability.
3) I always buy machines with support for ECC memory on the motherboard/chipset (except notebooks, where it doesn't seem to be an option), and always used ECC memory on them.
Best Regards,
Durval Menezes.
I have never met a computer that didn't like me.
Are you sure your motherboard applies the correct voltage for the memory modules? This can be verified in the BIOS I believe?
I'm not sure, but I will say this. Some memory brands definitely have more issues than others. I bought (2) 1GB sticks of Patriot memory and had to RMA both after about six months, and the pair I got back lasted two months before they too started crashing my machine. I finally just yanked them and am running the original (2) sticks of 1GB Crucial that I've had since the beginning. I had bought them to run VMs, but I'm not running them now so I don't need the extra memory now anyhow.
AMD desktop CPUs can use ECC. Intel needs a Xeon, and Intel Xeon boards cost more and have less I/O than desktop i7 boards, i.e. only one PCIe x16 slot; some don't even have one full x16 slot.
And why is it posted as a response to a "fp" type post? This is usually what karma whores do to game the system. An implicit confirmation of flaws in the comment system?
...if you catch it while hibernating
Be careful. Vista hibernates with one eye open. It can wake itself up from hibernation to do updates. I dual boot my laptop with Linux Mint (an Ubuntu variant). Every week, I'd go to turn on my computer only to find that the battery was dead. Checking the startup logs showed that linux was starting up at about 3:00 in the morning. After googling, I found out that many people were having that problem. The suggested solution was to turn off Vista automatic updates. I checked my Vista, and sure enough, it was set to update at 3am. I turned that setting off: no battery issue. I turned it back on: battery drained.
My CMOS settings pages do not have any facility for waking the laptop at a specific time, so I don't know how Vista manages it. I only know that it can. So beware! Vista hibernates with one eye open.
When our name is on the back of your car, we're behind you all the way!
If you want to prevent that kind of thing, remove the battery and power supply.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Both Vista and Ubuntu have trouble with my older GeForce 6200 graphics card... Vista sometimes "snows" up when the screen goes black due to idle, but it stays up for weeks at a time and I use it for development day in and out. Ubuntu, on the other hand, blows away the nvidia driver every time I get a kernel update, leaving me stuck at 800x600, and that -really- sucks.
This is my sig.
One thing that many people seem to forget is that you have the OS talking through the chipset to various components in the system. If you are using an Intel processor from the pre-i7 days, the chipset driver may be the source of many problems if you are on the Intel side of things.
Now, the fact that Linux works just shows that Linux may be doing things in a saner and more organized way when it comes to accessing the CPU and chipset (no matter how crazy the kernel code may seem). I have noticed that 64-bit driver support is rather weak on the Windows side of things, so that could be the source of your problems as well.
I would not jump to the conclusion that memory is your problem, but look for other factors. How good is your power supply since power issues can cause all sorts of unpredictable behavior?
I have to ask why you think you need 16GB to check your email next year.
Between the sex spammers throwing in virus-laden teaser videos and the 419 spammers throwing in virus-laden bootleg Hollywood videos once 64-bit OSes become standard, I'm quite sure I'll need more than 16GB to read email next year.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
if you look at the username it's not him at all, it's someone with ID 1344097 pretending to be him. Still, what he says is sensible, and what's wrong with this piece? If it doesn't interest you, why are you reading the comments?
which is totally what she said
OS is irrelevant. I've had more than a handful of memory modules go bad over the years. If memtest86 (there's a memtest86+ now too) detects an error get new memory. Running different OS or tasks or switching modules around may seem to "fix" it but it's not. It's either avoiding the problem spot or the memory error didn't cause whatever to blow up. It eventually will. You might be writing bad data to your disk, miscalculating something, etc. Sometimes it looks like the OS is corrupted. Writing this makes me think it a good idea to run memtest once a month or at least once a quarter.
Since the rest of the post indulges in hyperbole, why not the scientific notation as well?
So you're saying there's a chance?
I like you, Mary Samsonite.
If you do what you always did, you get what you always got.
Long story short: things other than cosmic rays can cause memory errors.
I once had a box that wouldn't pass memtest86 if run with 3 DIMMs installed. The sticks all tested fine individually, in any socket or combination, except for when all 3 were installed. It turned out that some cabling in the box was hanging nearby, and when moved away from the RAM the setup became rock solid.
Stability does not mean only "no crashes". It also means that you can run the same installation for years without having to re-install your OS for whatever reason.
I know it is possible to do that. I have a Windows XP installation from 2002 still going without a re-install, but it does require expert knowledge, really cautious browsing and constant tinkering and cleaning up of garbage from the registry, temp files, residues of programs that I have uninstalled (or upgraded to newer versions) etc.
Since I have switched to Mac, I have to baby sit my OS installation way less if at all. It's much more resilient to user using it :D.
As the island of our knowledge grows, so does the shore of our ignorance.
I usually ignore sigs. And obviously some people still thought it was KDawson. You can't act like he didn't think that would ever happen.
which is totally what she said
But the OP was talking about a *HARDWARE* problem.
you had me at #!
The real issue with memory cells flipping is not cosmic rays -- at least not with terrestrially deployed memory, it's alpha particle emissions from radioactive decay of the plastics in the memory package. Yes, the plastics surrounding the silicon.
A lot of work has been done to reduce the radioactivity of plastics used in IC packaging from normal background levels that you don't worry about in day-to-day life, to as quiet as possible, by carefully selecting source materials that have few naturally occurring radioisotopes.
From my chip-designer days I recall that the minimum charge required on a dynamic memory cell (like the ones in your computer's DRAM) to prevent spurious bit flips is one million electrons, give-or-take. The various designs back then were coming up with ways to reduce the footprint of the elements used to store that charge.
That said, it's been about 10 years since I've been in that line of work, and things have probably changed -- strike that, they've definitely changed -- substantially.
Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
I do get it, but some people (like me) automatically skip past sigs, and the guy created his frickin name as KDawson's name plus his userID, what about that don't you get?
which is totally what she said
You must be unlucky or the cause.
This would make a great slogan for Microsoft's new ad campaign:
"Not an actor, but he plays one on TV."
"...no one uses ECC memory..." You calling me a nobody? ECC came in this Mac Pro here. Comes in all Mac Pro models.
If I didn't have absolutely NOTHING to do, I wouldn't be here.
I had a problem with bad RAM on a brand new machine and nearly ended up sending the DIMMs back. But I wondered how that could happen, since they were Corsair Dominator, supposedly a reliable brand. Turns out that when my BIOS "loaded" the "extreme" profile to run the memory at 1600 instead of 1333, it didn't actually set the correct voltages, but only displayed them. Since the values were grayed out, I had assumed it would just set them, and so didn't check further at first. Later, after some corrupted data and many many memtest errors it occurred to me to look at the BIOS settings again. Raising the voltage on the DIMM to the correct value (which BIOS was displaying as its "extreme profile") eliminated all the errors right away. (This was on GA-EX58-UD5, by the way)
I work for a large academic HPC organization which operates ~10k cores. our typical config has 2G/core, so we have a lot of dimms, all ECC. the majority of our systems have no corrected errors (CE); a few have modest rates (few hundred/day). we replace dimms which cause either uncorrected errors or > 1k CE/day. these are typically 8 GB machines with 1G ecc dimms - the bios hides details like whether the ECC is chipkill or not, or whether it's scrubbing. but the fact that large samples of COTS dimms generate no ECCs implies that a smaller-memory desktop stands a good chance of operating without random corruption. dimms are from Micron; systems are 1U servers in pretty decent machinerooms, at close to sealevel.
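The replacement rule described above boils down to a two-condition check. Here is a minimal Python sketch of it; the threshold comes straight from the post (any uncorrected error, or more than 1k corrected errors per day), while the function and constant names are hypothetical and say nothing about how the counts are actually collected on those machines.

```python
# Threshold taken from the post above; names are illustrative only.
CE_PER_DAY_LIMIT = 1000


def should_replace_dimm(ce_per_day: float, uncorrected_errors: int) -> bool:
    """True if a DIMM meets the replacement policy described above.

    ce_per_day         -- corrected (CE) errors per day reported for the DIMM
    uncorrected_errors -- count of uncorrected errors seen on the DIMM
    """
    return uncorrected_errors > 0 or ce_per_day > CE_PER_DAY_LIMIT


if __name__ == "__main__":
    print(should_replace_dimm(300, 0))   # modest CE rate, kept in service
    print(should_replace_dimm(1500, 0))  # over the CE limit
    print(should_replace_dimm(0, 1))     # any uncorrected error
```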
I find that when a Windows machine, from Windows 2000 on up, when taken care not to install too many programs and/or immature or junk-ware, then Windows remains quite stable and usable.
So basically, Windows is great and stable as long as you don't try to, y'know, run programs on it? So then why would it ever be considered useful or, going a step further, worth half the cost of a peecee with Windows installed? You can get the same hardware as parts and assemble the machine yourself, or buy the whole thing from a store that will put it together for you, save the Windows Tax, and install Ubuntu on it for free. On a simple PC, which these days has 1 GB of memory and a pretty fast processor, that's a huge discount on the price, plus you get a more stable system than you would get by paying a lot more to clog up the machine with Windows.
I had some awful experiences running just Microsoft Access on Windows NT and 2000, and I generally had to switch off the power due to a serious crash at least once a day. Due to the ridiculous instability, I closed everything except Access, and the install was pretty clean, because I've never liked the various additional toolbars and launchers and stuff. So is Microsoft's own Office "too many programs and/or immature or junk-ware"?
There is a lot of software misbehavior in Windows-world. (To be fair, there is software misbehavior in MacOS and Linux as well, but I see it far less often.)
I see crappy software for Linux and OSX all the time. The difference is that I don't see the crappy software for those platforms bringing down the whole OS.
In addition to those dark Windows-and-Access days, I also have some experience writing programs to solve systems of coupled nonlinear partial differential equations representing models of certain polymer systems. Those programs were written in C on Sun and SGI workstations (and on some dumb terminals talking to those workstations). In those programs, I had to do a lot of big matrix calculations, which involved me allocating and deallocating memory quite a bit. My understanding is that those are "dangerous" operations if you do them wrong, and I know I am not a great programmer. Even so, I can only remember two problems, neither one of which came close to a BSOD or a crash requiring a hard restart.
One was that I would occasionally make a small mistake and the program (my own program, which I don't mind calling immature or junk-ware) would crash, giving me something like "Segmentation fault. Core dumped." That problem was more common when my programs were new, in about 1993-1995. When I went back in 2000 to finish my Ph.D. after having left and worked at two different companies for a while, many of the old workstations had been replaced by beige box peecees running I-don't-remember-which distro of Linux. I still modified my programs some for different situations, but I got very few segmentation faults (I don't remember any from that time, but I wouldn't be shocked if there were one or a few that I had forgotten). I do remember one occasion when the UI died on a machine I was using. I went to another and managed to determine that the machine on which I had been working was still OK and my processes were still running there. I went to the computing services guy responsible for that computer lab. He verified what I'd told him, reset the X Server on the machine I'd been using, and showed me how to do the same thing in case it happened again. After that one time, I never even had UI problems again, in stark contrast to the unpleasant experiences I've had dealing with Windows.
In stability and security (another big subject) terms, there is still a huge difference between the 40 year old UNIX model on which OSX and Linux are based and the Windows "this time for sure" advertising gimmick of the year. Remember when Vista was going to be better than OSX and Linux? Great days. Now even Microsoft itself is talking about what an utter piece of crap Vista is, and trying to get y'all to hang on for Windows 7. You gonna fall for that again?
"It is nice to know that the computer understands the problem. But I would like to understand it too." --Eugene Wigner
On the other hand performance-wise, Windows XP running on a Core i7 with 3GB of triple channel DDR3 ram should be sufficient for checking your email. Personally I am using a 1x2GB module. This was recommended to me so I could upgrade easily when 64 bit becomes more common and/or RAM drops in price. In the meanwhile I save a few watts of power by only having one module of RAM.
When you buy memory from an outlet, watch how they assume ESD damage is not possible as long as they grab the memory by the ends with thumb and forefinger. I have had to tell shops that they would be sacked on the spot if they were building or handling memory like that at the manufacturer. They proudly tell you they've never damaged memory, just because purchasers have not hot-footed it back to the shop minutes later to say it's broken. The damage they've caused may not result in failure for months, but it will. With a large antistatic mat and a monitored wrist strap, I have found memory is the most frequent failing component.
If you haven't yet, download Prime95 right now and run it on your Windows systems.
http://www.mersenne.org/freesoft/
I've found that the prime95 stress test will catch errors that memtest won't. You need to run prime95 twice. Once right after booting the system, and a second time after the system has been running for more than a week (and in use for that week). Each run should be at least 1 hour.
In theory, memtest could pass this location automatically, or at the click of a button, which would be a lot more time efficient than replacement... assuming the bad RAM doesn't waste your time again later.
There are a number of reasons that might come in handy. I find modern motherboards very easy to damage however, so IMHO anything that involves opening the case should be avoided. I've kept gaming rigs that had minor hardware faults. Also while waiting for the RAM to arrive a "just don't use the bad bits" may come in handy.
but it doesn't report correctable ECC errors if it doesn't know the chipset or if it doesn't have reliable support for the chipset (that's what it means if ecc is set to 'no' in memtest on a board with ecc ram)
that said, you are right, some motherboards don't support ECC. I'm just saying that's not what the 'ecc' 'on/off' field in memtest86 means.
Zowie. I had no idea (living in Denver) that it was 10x sea-level. Now I'd like to run that test during the summer, when I'm living in Leadville, elevation 10,000ft.
That's a very cool article. Thanks for posting it.
Nostalgia's not what it used to be.
There are two of them:
http://www.memtest86.com/
http://www.memtest.org/
Or use both?
Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
Memtest86 will find some memory issues, but not all. I find that you have to really stress the memory interface, itself (i.e.: not just typical reads/writes) in order to see truly "interesting" failures...ones that could easily cause "unexplained" lockups, etc. For this sort of thing, I recommend this. This will slam the memory i/f much more than just reads/writes...which is what you really ought to be doing anyway.
"...The smart and lazy ones I make my commanders." - Erwin Rommel
The probability of RAM just plain being defective---failing to operate correctly due to bugs in handling of certain low power states, having actual bad bits, having insufficient decoupling capacitance to work correctly in the presence of power supply rail noise, etc---is probably several hundred thousand orders of magnitude greater
Several hundred thousand orders of magnitude? Are you kidding?
Suppose that there is one cosmic ray particle that strikes the Earth per trillion years (orders of magnitude longer than the Earth will even exist). Suppose further that it strikes a completely random place distributed uniformly over the Earth's surface, and must hit a box one attometer square to flip the bit (orders of magnitude smaller than an atomic nucleus).
A square attometer is (10^-18)^2 = 10^-36 m^2. The Earth's surface is about 10^9 km^2 = 10^15 m^2. The rate of errors would then be about 10^-51 per trillion years, or 10^-63 per year.
This isn't even close to a hundred orders of magnitude below unity, let alone hundreds of thousands. You have to get into questions like "What's the probability that all air on Earth will spontaneously gather in one cubic centimeter over Nebraska, instantly causing all life to die?" to get more than a couple hundred orders of magnitude difference between anything.
(I compute about a 10^-10^50 probability of that happening, by the way, although I'm sure I'm off by a number of orders of magnitude of orders of magnitude. There's a reason entropy involves the natural log of the number of possible microstates: it gets you down to nice small numbers like a googol . . .)
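The arithmetic in this comment can be checked in a few lines. This is only a sketch of the same back-of-the-envelope estimate, using the comment's own deliberately absurd assumptions (one strike per trillion years, a one-square-attometer target, Earth's surface rounded to 10^15 m^2); the variable names are made up for illustration.

```python
import math

# Assumptions copied from the comment (an intentionally extreme worst case).
strikes_per_trillion_years = 1.0
target_area_m2 = (1e-18) ** 2      # one square attometer
earth_surface_m2 = 1e15            # the comment's rounded figure

# Chance that a uniformly placed strike lands on the target.
hit_probability = target_area_m2 / earth_surface_m2

rate_per_trillion_years = strikes_per_trillion_years * hit_probability
rate_per_year = rate_per_trillion_years / 1e12


def order_of_magnitude(x):
    """Nearest power of ten, which is all this estimate claims."""
    return round(math.log10(x))


print(order_of_magnitude(rate_per_trillion_years))  # exponent per trillion years
print(order_of_magnitude(rate_per_year))            # exponent per year
```

Even with these inputs the exponent stays in the dozens, which is the point being made: "hundreds of thousands of orders of magnitude" is not a number that shows up in this kind of estimate.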
MediaWiki developer, Total War Center sysadmin
http://www.ida.liu.se/~abdmo/SNDFT/docs/ram-soft.html
This references an IBM study, which is what I think I actually remember but could not find quickly this morning.
"In a study by IBM, it was noted that errors in cache memory were twice as common above an altitude of 2600 feet as at sea level. The soft error rate of cache memory above 2600 feet was five times the rate at sea level, and the soft error rate in Denver (5280 feet) was ten times the rate at sea level."
IBM research is a wonderful resource in the area of soft errors. I do remember reading exactly your quote. I didn't bother to track down the exact article, but it should be part of this special issue (http://www.research.ibm.com/journal/rd40-1.html); the banner article mentions Denver but doesn't have the exact quote. A web search suggests it would be "Terrestrial Cosmic Rays", the second article in that issue. They have a more recent special issue on the same subject (http://www.research.ibm.com/journal/rd52-3.html).
I bought the cheapest DDR400 RAM I could find at Fry's and it failed the memtest86. I had to manually change the BIOS to DDR333 for it to run reliably and pass the memory test. It has worked fine ever since, which has been almost a year.
"Meaningless!, Meaningless!" says the Teacher. "Utterly meaningless!"
I have found that if Memtest86 reports an error, that error is real and will matter sooner or later.
Memtest86 does a rather thorough scan including writing and reading patterns that are by design worst case. For example, some memory faults are such that a particular zero bit may flip to a one if all bits surrounding it on the chip are set to one. Memtest86 uses a series of bit patterns meant to trigger such problems. That's why it may find problems that don't or haven't yet crashed your system in practice. Note well that it's pure luck if there's no crash and there COULD be silent data corruption going on. Imagine a single I/O buffer out of thousands that flips 1 bit occasionally if just the wrong bit pattern is stored.
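To make the "worst case pattern" idea concrete, here is a toy simulation. It is not Memtest86's actual algorithm, and the fault model (one cell that reads as 1 whenever both neighbours hold 1) is a hypothetical stand-in for the kind of pattern-sensitive fault described above. It only shows why a naive all-zeros/all-ones check can pass while a neighbour-aware pattern catches the bad cell.

```python
class FaultyMemory:
    """Toy bit-addressable memory with one pattern-sensitive cell.

    Hypothetical fault model: the bit at `victim` reads back as 1
    whenever both of its neighbours hold 1, whatever was written to it.
    """

    def __init__(self, size, victim):
        self.bits = [0] * size
        self.victim = victim

    def write(self, addr, value):
        self.bits[addr] = value

    def read(self, addr):
        if addr == self.victim and 0 < addr < len(self.bits) - 1:
            if self.bits[addr - 1] == 1 and self.bits[addr + 1] == 1:
                return 1
        return self.bits[addr]


def simple_test(mem, size):
    """Fill with all zeros, then all ones, reading back each time."""
    errors = []
    for value in (0, 1):
        for addr in range(size):
            mem.write(addr, value)
        for addr in range(size):
            if mem.read(addr) != value:
                errors.append(addr)
    return errors


def neighbour_pattern_test(mem, size):
    """Hold one cell at 0 while every other cell is 1, for each cell in turn."""
    errors = []
    for target in range(size):
        for addr in range(size):
            mem.write(addr, 0 if addr == target else 1)
        if mem.read(target) != 0:
            errors.append(target)
    return errors


SIZE, VICTIM = 16, 5
print(simple_test(FaultyMemory(SIZE, VICTIM), SIZE))
print(neighbour_pattern_test(FaultyMemory(SIZE, VICTIM), SIZE))
```

The simple test reports nothing because the faulty cell only misbehaves when it is supposed to hold 0 next to 1s, a combination the uniform fills never create. Ordinary application data may or may not create it either, which is the "pure luck" mentioned above.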
That's why I run my memory in RAID mode (Redundant Array of Inexpensive Dram)!
today is spelling optional day.
I have once in my career seen a bit flip that might have been a cosmic ray or alpha particle. The machine was sorting massive (for the time) indexes and produced a list that was out of order. Because it was batch processing, I was able to re-run the exact input repeatedly and compare the output. The error never repeated itself after dozens of runs. It was single threaded and the only process on the system (this was in the single tasking DOS days), so it wasn't a race condition. No hardware test showed an error and the machine remained in error free service for years after that. But for that one run, the output was unquestionably wrong.
That's not to say such an event has happened only once, just that it happened one time under conditions that I could reliably repeat.
Some simple tests:
Being one who has maintained an 1100 node cluster with 8800 pieces of ECC RAM, I can tell you we chase bad RAM sticks ALL THE TIME! It's not necessarily due to cosmic activity; the RAM just exhibits bad behavior as the circuits get older and things start to separate and break down due to thermal load over time. Even a small defect that would let the RAM pass the manufacturer's tests will eventually lead to a DIMM failure down the road. Most average human beings will never determine why their machine crashes every few days if it is a RAM issue. Some power users will even overlook it because they have too much faith in RAM that *was* good when they bought it, but now that it's two or three years old ...
I wouldn't trust a single app to verify your RAM. Run a couple different tests and see if you can nail down the problem. I can look and see how we're tracking that and get back to you.
Would that mean people using notebook computers in airplanes should expect to see more errors than they do on the ground?
Or do the airplanes themselves shield enough of the alpha particles?
Look at the Apple Mac Pro. It comes with ECC memory. It's not the only computer either.
Most people buy PCs based on price. If they can save $5 they will. These people don't get ECC but then they aren't doing anything critical with their computers, just games or the Internet. Anyone running a critical service would have some kind of redundant setup to either deal with the crashes or prevent/reduce them with things like ECC, RAID, dual power supplies and so on. Sun's servers even allow you to "boot around" failed hardware.
Memory testing will NOT protect from cosmic rays and flipped bits, what they call "soft errors".
Sheesh. Yes, I was purposefully exaggerating the odds for humorous effect. You're only the eightieth person to note that, starting with the very first reply to my post.
Check out my sci-fi/humor trilogy at PatriotsBooks.
That is not normal behavior for either XP or Vista. Are you running Windows in an emulator or something?
...which as you note is brought up often on these boards. What is not brought up is that most people replace their computers long before the HD is even thinking about going south on you.
I've had hard drives physically fail on me before, but in my personal experience it's been pretty rare, and I'm used to working on computers that are over half a decade old.
I would suspect you've got problematic hardware. That's the most likely cause of Windows FUBAR.
My experience with Linux, Windows, and glitchy hardware is fairly epic. Back in late '96, I got my first personal computer. It was from a fly-by-night shop, with sub-par parts: an integrated SiS motherboard, cheap RAM, Quantum Bigfoot drive. In retrospect, this computer likely had RAM problems when I got it, as in W95 it was more unstable than I was expecting.
Years went on, and I kept using this computer: I was a young geek who couldn't afford another $1500+ for a computer, after all. I heard that there was this linux thing that was rock-solid stable, so I decided to give it a try.
Long story short, the instability in the system got worse and worse - to the point where Windows would occasionally crash while or shortly after booting, but always several times an hour. I was down to using Linux for 99% of everything, and had mostly stopped gaming. Linux, while it wasn't rock-solid-stable, would only crash 2-3 times a day at the outside (in '99).
In '99, I did manage to patch and build the kernel with BadRAM, and that improved things measurably. (But it was time for a new system, anyway.)
Might try patching your kernel with BadRAM (not sure if Ubuntu does by default): https://help.ubuntu.com/community/BadRAM
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
This is either a troll, or you asked your question in the worst possible way. It's not a windows/linux question. Bad ram is bad, period. It sounds like your ram is bad. Probably the way linux is allocating that chunk of ram just happens to be non-crucial and not crashing your box. It's probably using it for cache and trashing your files or something. ;)
If it's taking a couple hours, it may be heat related, or it might just be a borderline chip.
Swap ram, run memtest overnight, if good, you're in the clear. If bad, you swapped the wrong piece, or your board just doesn't like the ram. (Not to brand bash, but this used to happen to me quite frequently with samsung ram)
No need to find out why it crashes in one and not the other.
Linux won't be mapping it out, so it's sitting around somewhere breaking things silently.
Just because it doesn't immediately crap out doesn't mean that it's dealt with the issue in any way. It's still broken. You can map out the bad ram manually with kernel boot parameters if you must use bad ram, but it doesn't happen automatically.
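As a rough illustration of the manual approach: mainline Linux accepts a boot parameter of the form memmap=nn[KMG]$ss[KMG], which marks the region from ss to ss+nn as reserved so the kernel won't use it. The helper below is a hypothetical sketch, not part of any tool, that turns failing physical addresses (as reported by a memory tester) into such a parameter. The 4096-byte page size and the one-page padding on each side are assumptions.

```python
PAGE = 4096  # assumed page size in bytes


def memmap_reserve_arg(bad_addresses, pad_pages=1):
    """Build a memmap=<size>K$<start> argument covering all bad addresses.

    Hypothetical helper: rounds out to page boundaries and pads by
    `pad_pages` pages on each side, then emits one contiguous region.
    """
    first_page = max(min(bad_addresses) // PAGE - pad_pages, 0)
    end_page = max(bad_addresses) // PAGE + pad_pages + 1  # exclusive
    start = first_page * PAGE
    length = (end_page - first_page) * PAGE
    return "memmap=%dK$0x%x" % (length // 1024, start)


# Example: a single failing address, made up for illustration.
print(memmap_reserve_arg([0x7A3C1234]))
```

If the failing addresses are scattered, one region per cluster is a better fit than a single span. Also note that the `$` typically needs escaping when the parameter is written into a bootloader config file, so check how your bootloader handles it before relying on this.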
Mom?
We'll need to have the forensics guys analyze the contents of the wolves' stomachs to be sure. But don't worry, kid -- the car's still in great shape.
Breakfast served all day!
...write a better one. :)
So true.
I think we should bring back segmented addressing while we are at it too. I was so much more productive when I had to use DS, SS, GS, FS, and CS. I miss them so. All those segments made the computer run so fast, too!
Many people seem unaware of the fact that mixing memory modules of different speeds or different timings can and does cause problems. Even the engineers in our IT department at work don't get it.
http://www.research.ibm.com/journal/rd/401/ziegler.pdf
The article does mention concrete shielding, but nothing about metal?
"Recently, experiments on cosmic ray effects at airplane altitudes have been published by IBM, Boeing, and others [14, 15]. We do not review this specialized field, other than to note that the fail rate of electronics at airplane altitudes is about one hundred times worse than at terrestrial altitudes, as was predicted in an IBM paper 15 years earlier [16]."
So you're telling me I'm much more likely to have a RAM problem on February 29th?
A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
According to this paper, "Impact of DRAM process technology on neutron-induced soft errors" (unfortunately not free), the basic strategy is still the same. As the process scales down, the memory cells get smaller and present less of a target to cosmic rays. However, logic becomes more susceptible to upset due to decreased capacitance.
In the paper they placed various sets of DIMMs into a neutron beam and recorded the MTBE for 1TB of memory. From the data, susceptibility decreases with smaller process nodes. However this effect is offset by the increasing amount of memory installed in computers.
Windows Internals has a pretty fun tool which by the click of a button will do bad things. One of the 'bad' things it can do is randomly overwrite kernel memory.
What is fun about the tool is that it is like Russian Roulette: You can click the button several times corrupting memory, and eventually you will corrupt something important and bring the machine down. Or you can click the button a couple of times, and see how long you can use your machine before that memory path is hit and your system comes down.
The tool consists of two components, Notmyfault.exe and Myfault.sys (which, IIRC, is embedded within the exe and launched into the kernel when the tool is run with admin rights). Notmyfault.exe cannot itself take down the machine, as it has no 'rights' to stomp on memory outside of its sandbox. This is why it has to request the bad deed to be done by the kernel component, myfault.sys.
You can find a link to the tool below:
http://download.sysinternals.com/Files/Notmyfault.zip
Related to this thread - you can corrupt memory and not see any adverse effects if nothing important is located in that memory space. This would easily explain why Windows might crash, but not Linux. It just depends on where modules are loaded, and what code/data is corrupted.
Run Memtest or use Ultimate Boot CD. If it has 2 sticks of memory, take out the 2nd stick and re-test, then take out the 1st stick and add the 2nd stick in the 1st slot and re-test. It is usually only one or the other, or a seating problem. Good luck.
Just palpate your memories and check for lumps.
I was just researching ECC memory and memory errors a few weeks ago. I am planning to build a computer with so much RAM that it can basically keep all my most frequently accessed files and programs in main memory, all the time. However, this made me seriously consider the reliability of RAM. I was pleasantly surprised that ECC RAM is quite easy to find and not that much more expensive than non-ECC RAM. However, figuring out which motherboards and/or CPUs support ECC RAM was so tedious that I gave up.
I have not been able to find any clear information about whether ECC support is purely a motherboard issue, purely a CPU issue, or a combination, for most kinds of CPU. My best guess is that if the CPU does not have an integrated memory controller (Intel CPUs and some AMD CPUs), ECC support is a motherboard issue, and if the CPU does have an integrated memory controller (many AMD CPUs), then ECC needs to be supported by both the CPU and the motherboard. In all cases, ECC needs to be supported by the RAM (although, in theory, this doesn't have to be the case).
With these somewhat shaky assumptions, I went to investigate what combination of 64-bit CPU and motherboard I could buy that would support at least 8 and preferably at least 16 GB of ECC RAM, preferably non-registered. As it turns out, whether a particular motherboard or CPU supports ECC memory is often not listed by sellers. Searching the web for this kind of information is difficult, because searching for "ECC" will give you many hits that are, in fact, "non-ECC" - which means that the component supports non-ECC memory, but doesn't mean it supports _only_ non-ECC memory.
All this really confused me. If it is so easy to find ECC-memory, it seems reasonable to assume there is a fairly strong demand for it. But if that is the case, why is it so difficult to find out which CPU/motherboard combinations support ECC? And, actually, why aren't we all using ECC memory nowadays? I would say that the probability of memory errors occurring must have increased with increased megabits per chip, both because there are more bits that can be flipped and because the energy needed to flip a bit is smaller. Given that, I would say memory reliability is a real issue. But it is hard to even find up to date statistics on this.
Now, I know all of my failures to find information can just be attributed to me not searching in the right way, but I really want to know the answers. So, any help and advice appreciated.
As for the computer I want to build:
- 64-bit CPU with virtualization extensions, preferably multi-core
- 8 or 16 GB of ECC RAM
- 80 or more GB solid state disk
- Video card with fast 3D using open-source drivers (this probably means AMD)
- I care about power consumption, especially when the system is idle
- Preferably no moving parts
I know this is ambitious, but everything besides figuring out if it will support ECC seems feasible.
Please correct me if I got my facts wrong.
I've especially run into issues with DDR2 sticks in that they may not use the default 1.8/1.9V setting on most systems, but require 2.0-2.3V to operate, especially if they're "high-performance" memory meant to run at 1066 speed. Default timings can also be an issue with the speed levels programmed into the chips: you can check for this by setting the RAM to run 1 or 2 speeds lower (say, DDR2-800 running with a 333MHz clock (DDR2-667) instead of 400MHz).
Life is irony, and nothing ever goes as planned.
Apple's Mac Pro uses (and shipped with) fully buffered ECC... at least my Jan 2008 model does.
I guess that extra cost *does* go somewhere. ;)
What interests me is the fact that DDR costs at least twice as much as DDR2. Now why is that?
Wanted : A Signature.
It is serious.
I think people get their ideas from maths. There, things are perfect, without catches.
There are no such cases in the real world. This gentleman has just pointed out that our systems cannot get bigger and bigger without rethinking reliability.
Take things seriously. Recheck from time to time whether one of the assumptions has failed.
This is a really good question and it points to the need for more utilities to check and fix bit level errors in files and memory images.
I wish there was a set of Linux utilities for generating and using error correction checksums.
The usual checksum tool simply reports if a file has an error.
What I would like to see is a checksum tool capable of fixing a multi-byte, single byte or single bit error.
These are checksums using Hamming codes, I think?
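For what such a correcting checksum looks like at the smallest scale, here is a sketch of the textbook Hamming(7,4) code: 4 data bits are stored with 3 parity bits, and any single flipped bit in the 7-bit block can be located and repaired. The function names are made up for illustration, and this is not an existing Linux utility. It handles one bit error per block only; multi-byte or burst damage would need a stronger code such as Reed-Solomon.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected codeword, 1-based error position or 0 if clean)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # syndrome spells out the bad position
    if position:
        c[position - 1] ^= 1          # flip the bad bit back
    return c, position


def hamming74_decode(codeword):
    """Correct up to one flipped bit, then extract the 4 data bits."""
    c, _ = hamming74_correct(codeword)
    return [c[2], c[4], c[5], c[6]]


original = [1, 0, 1, 1]
stored = hamming74_encode(original)
damaged = list(stored)
damaged[4] ^= 1                        # simulate a single bit flip
print(hamming74_decode(damaged) == original)
```

ECC memory works on the same principle with wider words, typically adding one more parity bit so that double-bit errors can at least be detected.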
I strongly suspect that huge hunks of don't care data like JPG photos and DVI movies develop bit level errors during storage, handling, copying, and editing.
A utility like gpg will spot errors. And diff spots character level errors. But they are both real clumsy at finding and precisely repairing individual bits. With diff you wind up unsure if your reference or original file changed.
This Linux system I am writing on had a bad RAM memory bit and the random number generator loaded over that location.
The random number generator didn't work but Linux mostly ran OK. I thought it was a software problem for several weeks. Finally after replacing the random number code and even checking bug reports, I finally looked at the RAM memory. Bing!
Ever think that the faster a computer runs and the more memory it handles, the higher the probability of an error? If OSes hadn't improved, imagine the hell we would be in now. Win 95 crashed every 30 minutes or more on a 486 running at 33MHz. At 3,000MHz that could be seen as 30m/90.9, or a crash every 20 seconds. Windows isn't more sensitive to a memory error; it just depends on where in RAM the error is. Vista loads the OS files in different memory locations with each boot to prevent 'badware' from finding a sweet spot in RAM to attach to. Hence a memory error which in XP may affect some incidental program and go unnoticed can in Vista cause errors that vary with each boot cycle. Yani
1) I usually trust the results of memtest after a cycle or three. It is true that this doesn't give any conclusive evidence; but generally when memtest tells me a module bit the dust; the machine can be revived by replacing that module. If memtest has nothing to say and the machine is still wonky; I would investigate the hard drive; processor etc. first before re-evaluating the memory.
2) It depends how important your email is to you. Whether you've got 64k or 16GB of RAM; every single word on its own can be corrupted. So; even with better production; the end product is still dependent on each word of memory being correct. Each corruption may lead to an error. If that error is likely to crash a system you depend on; go ECC. If it's more likely to mean that you just need to reboot your gaming rig; you don't need ECC. When running your machine in Ubuntu; the faulty words of your memory were allocated for some random number or something not addressed enough to cause corruption.
The probability of a cosmic ray at precisely the right angle and speed to cause a single bit error and cause an app to crash is somewhere on the same order as your chances of getting hit by a car, getting struck by lightning, getting torn apart by rabid wolves, and having sex in the back of a red 1948 Buick convertible at a drive-in movie theater on Tuesday night, Feb. 29th under a blue moon... all at the same time....
Nice analogies, but there are never any blue moons in February, as the lunar month is 29.53 days and the longest February is only 29 days ...
(For somebody who doesn't know, a blue moon is when you get two full moons in one month.)
I love obscure references as much as the next guy, but WTH was that?
I know tobacco is bad for you, so I smoke weed with crack.
Prolly referred to this. (And her character's last name was "Swanson", not "Samsonite" -- GP needs to get his facts straight before posting to Slashdot, like everyone else around here always does. ;)
Attention zealots and haters: 00100 00100