Reliability of Computer Memory?
olddoc writes "In the days of 512MB systems, I remember reading about cosmic rays causing memory errors and how errors become more frequent with more RAM. Now, home PCs are stuffed with 6GB or 8GB and no one uses ECC memory in them. Recently I had consistent BSODs with Vista64 on a PC with 4GB; I tried memtest86 and it always failed within hours. Yet when I ran 64-bit Ubuntu at 100% load and using all memory, it ran fine for days. I have two questions: 1) Do people trust a memtest86 error to mean a bad memory module or motherboard or CPU? 2) When I check my email on my desktop 16GB PC next year, should I be running ECC memory?"
My experience with memtest is that you can trust the results if it says the memory is bad; however, if the memory passes it could still be bad. Troubleshooting your scenario should involve replacing the DIMMs in question with known good modules while running Windows.
brandelf -t FreeBSD
It's the lowest of the low end of the market that doesn't use ECC, or at least parity RAM. For anything where reliability and veracity are important, you simply must use ECC.
If a system gives memtest86 errors, I break it down and swap components until it doesn't. The test pattern it uses can find subtle errors you're unlikely to run into with any application-based testing even when run for a few days. Any failures it reports should be taken seriously. Also: you should pay attention to the memory speed value it reports; that's a surprisingly effective simple benchmark for figuring out whether you've set up your RAM optimally. The last system I built, I ended up purchasing 4 different sets of RAM, and there was about a 30% delta between how well the best and worst performed on the memtest86 results--correlated extremely well with other benchmarks I ran too.
At the same time, I've had memory that memtest86 said was fine, but the system itself still crashed under a heavy Linux-based test. I consider both a full memtest86 test and a moderate workload Linux test to be necessary before I consider a new system to have baseline usable reliability.
There are a few separate problems here that are worthwhile to distinguish among. A significant amount of RAM doesn't work reliably when tested fully. Once you've culled those out, only using the good stuff, some of that will degrade over time to where it will no longer pass a repeat of the initial tests; I recently had a perfectly good set of RAM degrade to useless in only 3 months here. After you take out those two problematic sources for bad RAM, is the remainder likely enough to have problems that it's worth upgrading to ECC RAM? I don't think it is for my home systems, because I'm OK with initial and periodic culling to kick out borderline modules. And things like power reliability cause me more downtime than RAM issues do. If you don't know how, or don't have the time, to do that sort of thing yourself, though, you could easily be better off buying more redundant RAM.
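For anyone wondering what a "test pattern" looks like in practice, here is a rough sketch of the general idea: write a known bit pattern everywhere, read it back, and note what doesn't match. This is not memtest86's actual algorithm. The names pattern_test and StuckBitMemory are made up for illustration, and the "faulty" memory is simulated, since a Python bytearray lives in cached virtual memory and says nothing about your real DIMMs.

```python
class StuckBitMemory:
    """Simulated memory (hypothetical) where one cell has bits stuck at 0."""

    def __init__(self, size, bad_offset, stuck_mask):
        self._cells = bytearray(size)
        self._bad_offset = bad_offset
        self._stuck_mask = stuck_mask

    def __len__(self):
        return len(self._cells)

    def __setitem__(self, i, value):
        self._cells[i] = value

    def __getitem__(self, i):
        value = self._cells[i]
        if i == self._bad_offset:
            # The faulty cell always reads the stuck bits back as 0.
            value &= ~self._stuck_mask & 0xFF
        return value


def pattern_test(mem):
    """Write walking-one patterns and their inverses; return failing offsets."""
    bad = set()
    for bit in range(8):
        for pattern in (1 << bit, (1 << bit) ^ 0xFF):
            # Fill everything first, then verify, so each cell sits between
            # neighbours holding the same pattern while it is checked.
            for i in range(len(mem)):
                mem[i] = pattern
            for i in range(len(mem)):
                if mem[i] != pattern:
                    bad.add(i)
    return sorted(bad)


healthy = bytearray(64)
faulty = StuckBitMemory(64, bad_offset=10, stuck_mask=0x04)
print(pattern_test(healthy), pattern_test(faulty))
```

A real tester runs on bare metal against physical addresses and uses many more patterns and access orders, which is part of why it can catch things an application workload never trips over.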
First, it was not cosmic rays; memory was tested in a lead vault and showed the same error rate. Turns out to have been alpha particles emitted by the epoxy / ceramic that the memory chips were encapsulated in.
That said: Quite clearly, given your experience, Vista and Ubuntu load the memory subsystem quite differently. It is possible that Vista, with its all-over-the-map program flow, is missing cache a lot more often and so is hitting DRAM harder; I don't have the background to really know. I believe that Memtest86, in order to put the most strain on memory and thus test it in the most pessimal conditions, tries to access memory in patterns that hit physical memory hardest. But what I have found is that some OSs, apparently including Ubuntu, will run on memory that is marginal, memory that Memtest86 picks up as bad.
As for ECC in memory... The problem is that ECC carries a heavy performance hit on write. If you only want to write 1 byte, you still have to read in the whole QWord, change the byte, and write it back to get the ECC to recalculate correctly. It is because of that performance hit that ECC was deprecated. The problem goes away to a large extent if your cache is write-back rather than write-through, though there will still be a significant number of cases where you have to write a set of bytes that has not yet been read into cache and does not comprise a whole ECC word.
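The read-modify-write cost described above can be sketched roughly like this. It is a toy model, not how any real memory controller works: the EccWord class, its counters, and the XOR-based check_bits function are all invented for illustration (real ECC DIMMs use a SECDED code, typically 8 check bits over 64 data bits).

```python
WORD_BYTES = 8  # one 64-bit ECC word (the "QWord" in the comment above)


def check_bits(word):
    """Toy stand-in for ECC check bits: XOR of all bytes in the word."""
    c = 0
    for b in word:
        c ^= b
    return c


class EccWord:
    """Hypothetical model of one ECC-protected word with access counters."""

    def __init__(self):
        self.data = bytearray(WORD_BYTES)
        self.check = check_bits(self.data)
        self.reads = 0
        self.writes = 0

    def read(self):
        self.reads += 1
        if check_bits(self.data) != self.check:
            raise ValueError("check bits do not match stored data")
        return bytes(self.data)

    def write_full(self, word):
        # A full-word write can compute the check bits from the new data alone.
        if len(word) != WORD_BYTES:
            raise ValueError("expected a full ECC word")
        self.writes += 1
        self.data[:] = word
        self.check = check_bits(word)

    def write_byte(self, offset, value):
        # A partial write needs the other 7 bytes to recompute the check bits,
        # so it turns into read + modify + write.
        word = bytearray(self.read())
        word[offset] = value
        self.write_full(bytes(word))


w = EccWord()
w.write_full(bytes(range(8)))   # 0 reads, 1 write
w.write_byte(3, 0xFF)           # 1 extra read, 1 extra write
print(w.reads, w.writes)
```

With a write-back cache, most partial writes hit a line that is already in cache and go out later as full words, which is the mitigation the comment mentions.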
That said, it is still used on servers...
But I don't expect it will reappear on desktops any time soon. Apparently they have managed to control the alpha radiation to a great extent, and so the actual radiation-caused errors are now occurring at a much lower rate, significantly lower than software-induced BSODs.
That's absolutely true. As Samuel Johnson remarked, "Depend upon it, sir, when a man knows he is to be drowned in ice water, it concentrates his mind wonderfully." Of course, Boswell made a few errors in transcription.
Not all memory is created equal. Memory can be bad if Memtest detects errors, or you can simply be running it at the wrong settings. Usually there are both "normal" and "performance" settings for memory on higher end motherboards, or sometimes you can tweak all sorts of cycle-level stuff manually (CAS latency etc.).
Try running your memory with the most conservative settings before you assume it's bad.
.: Max Romantschuk
Depending on where it fails (if it fails in the same spot) you can relatively easily work around it and not throw out the remaining good portion of the stick. I wrote a howto..
http://gquigs.blogspot.com/2009/01/bad-memory-howto.html
I've been running on Option 3 for quite some time now. No, it's not as good as ECC, but it doesn't cost you anything.
1) Do people trust a memtest86 error to mean a bad memory module or motherboard or CPU?
Well, I'd add some other possibilities such as:
Bad power supply.
Memory isn't seated properly in its socket.
Incorrect timing set in BIOS.
Memory is incompatible with your motherboard.
etc..
But yeah, if memtest86 says there's a problem then there really is something wrong.
is to swap the memory modules to find out which is causing the problem, if not the motherboard. Also, I don't see how memory tests running inside an OS can be effective; I'd much rather boot off a smaller system on a DVD, USB stick or floppy to run a memory test. Dell servers have those Dell Diagnostics CDs that are very small in memory footprint precisely in order to run diagnostics on memory. But even they're not perfect, so you often have to take memory out and see if you can reproduce errors.
Actually there exists at least one way to do it on Linux(TM), namely the "BadRAM Linux(TM) kernel patch":
http://rick.vanrein.org/linux/badram/index.html
Was it cosmic rays, or Alpha particle decay from impure materials that was going to do in our memory soon? IIRC it was the latter.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
An alpha particle whizzing through your RAM will take out several bits if it hits the memory array at the right angle.
Did you figure out how alpha particles cannot travel through paper but can travel through RAM?
Yes. Vista is rock solid on solid hardware. Seriously. Vista is as reliable as Linux. Some people wreck their Vista installation, some people wreck their Linux installation.
This is your sig. There are thousands more, but this one is yours.
I disagree - cosmic rays do happen all the time, and do interact with silicon. I was at a demo of a low-light level camera, and every few seconds there'd be a cosmic ray artefact on the monitor.
... 3GB seems to be the sweet spot for higher-end-home-pcs.
3GB is not so much a "sweet spot" as it is a limitation based on a 32 bit OS.
You can address 4GB max using 32 bits. Now take out the address space needed for your video card and any other cards you may put on the bus and you are looking at a 3GB max for useable memory.
So instead of "sweet spot" you really mean "maximum that can be used by Windows XP 32 Bit" (the most commonly used OS today).
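The arithmetic behind that 3GB figure, as a quick sketch. The 1GB reservation for video and other devices is only an assumed round number for illustration; the real hole depends on the hardware installed. The function name usable_below_4g is made up, and the model ignores any remapping of RAM above 4GB.

```python
GIB = 2**30
ADDRESS_SPACE_32BIT = 2**32  # 4 GiB reachable with a 32-bit address


def usable_below_4g(installed_bytes, device_hole_bytes):
    """RAM visible below 4 GiB when devices claim part of the address space."""
    return min(installed_bytes, ADDRESS_SPACE_32BIT - device_hole_bytes)


# Assumed example: 4 GiB installed, 1 GiB of address space taken by devices.
print(usable_below_4g(4 * GIB, 1 * GIB) // GIB, "GiB usable")
```

With less RAM installed than the space left under the hole, nothing is lost, which is why the limit only started to bite once 4GB machines became common.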
I've built many a system, and discovered that
1. removing & re-socketing the DIMM in question, sometimes 3 or 4 times ( testing between ),
*almost always* fixes the problem:
it's a CONNECTION problem, not a DIMM problem.
( dust in the socket? slight pressure-differences between "pins" and contact-pads?
whatever you do, don't wobble the DIMM when socketing it, or that'll push the contacts away *just enough* to make the connection erratic ).
2. Also, ASUS makes AMD based motherboards that DO have ECC, and some of us, who LIVE by our computers, insist on 'em.
( *drastically* cheaper than an Opteron/Xeon system, but still ECC reliability? Yowza, baby! )
3. The PSU oft is the culprit: RAM that tests fine, when the disks aren't in use, is being under-volted by the PSU, so it *becomes* erratic, when the *whole system* is under load.
Cheers, people!
Agreed. People who will sit and tell me with a straight face that Vista, in their experience, is unstable are either very unlucky, or liars. Windows stopped being generally unstable years ago. Get with the times.
"16MB (fuck off, MiB fascists)" - The Mighty Buzzard
Not only silicon... NASA astronauts consistently observe bright flashes in orbit, whether their eyes are open or closed. It is believed that these flashes are the result of cosmic rays interacting with the astronauts' retinas.
Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
I fail to see how the parent is a troll, regardless of whether he is right or not.
Nevertheless my experience with Vista is the same, I run home premium on a newish laptop I use for music production and haven't had a glitch on it for months. My first intention was to wipe out the drive and install XP, but I abandoned the idea some time ago.
Your head a splode
Just FYI, 32-bit Intel processors from the Pentium Pro generation and forward (with the exception of most, if not all, of the Pentium-M's) have 36 physical address pins or more.
Many, but not all, chipsets have a facility for breaking the physical address presentation of the system RAM into a configurably-sized contiguous block below the 4GB limit and then making the rest available above the 4GB limit. If you're curious, the register (in Intel parlance) is often called TOLUD (Top of Low Usable DRAM).
Yes, furthermore, given modern OS designs on the x86 architecture, a process cannot utilize more than 2GB (Windows without the /3GB boot option) or 3GB (Linux, most BSDs, and Windows with /3GB and apps specially built to use the 3/1 instead of the 2/2 split).
However, that limitation does not preclude you from having a machine running eight processes using 2GB of physical memory each.
The processor feature is called PAE (Physical Address Extension). It works, basically, by adding an extra level of processor pagetable indirection.
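To make the "extra level of indirection" a bit more concrete, here is a small sketch of how a 32-bit linear address is carved into table indexes with and without PAE, assuming 4KB pages. The field widths (10/10/12 without PAE, 2/9/9/12 with it) are the standard x86 layout; the function names are just made up for this example.

```python
def split_legacy(addr):
    """Non-PAE paging: 10-bit directory index, 10-bit table index, 12-bit offset."""
    return (addr >> 22) & 0x3FF, (addr >> 12) & 0x3FF, addr & 0xFFF


def split_pae(addr):
    """PAE paging: 2-bit PDPT index, 9-bit directory, 9-bit table, 12-bit offset."""
    return (addr >> 30) & 0x3, (addr >> 21) & 0x1FF, (addr >> 12) & 0x1FF, addr & 0xFFF


addr = 0xC0101234  # arbitrary example address
print(split_legacy(addr))
print(split_pae(addr))

# The virtual address is still 32 bits either way; what PAE widens is the
# physical address stored in each (now 64-bit) table entry.
print(2**36 // 2**30, "GiB reachable with 36 physical address bits")
```

That is why each process is still stuck with a 32-bit view while the machine as a whole can hold far more than 4GB.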
Incidentally, I have a quad P3-700 (it's a Dell PowerEdge 6450) propping a door open that could support 8GB of RAM if you had enough registered, ECC PC-133 SDRAM to populate the sixteen DIMM slots.
Anyways, here's a snippet from the beginning of the boot log of a 32-bit machine running Linux which has 4GB of RAM:
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 0000000000097c00 (usable)
[ 0.000000] BIOS-e820: 0000000000097c00 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000defafe00 (usable)
[ 0.000000] BIOS-e820: 00000000defb1e00 - 00000000defb1ea0 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000defb1ea0 - 00000000e0000000 (reserved)
[ 0.000000] BIOS-e820: 00000000f4000000 - 00000000f8000000 (reserved)
[ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fed40000 (reserved)
[ 0.000000] BIOS-e820: 00000000fed45000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 000000011c000000 (usable)
The title of that list should really be "Physical Address Space map." Either way, notice that the majority of the RAM is available up until 0xDEFAFE00 and the rest is available from 0x100000000 to 0x11c000000 - a range that's clearly above the 4GB limit.
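If you'd rather not do the hex subtraction by eye, a few lines of Python will total the usable ranges in the map above and split them at the 4GB line. This is only a sketch: it assumes the second address on each line is an exclusive end, which is how this older kernel format appears to print it, and the function name usable_split is made up.

```python
import re

E820_LOG = """\
[ 0.000000] BIOS-e820: 0000000000000000 - 0000000000097c00 (usable)
[ 0.000000] BIOS-e820: 0000000000097c00 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000defafe00 (usable)
[ 0.000000] BIOS-e820: 00000000defb1e00 - 00000000defb1ea0 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000defb1ea0 - 00000000e0000000 (reserved)
[ 0.000000] BIOS-e820: 00000000f4000000 - 00000000f8000000 (reserved)
[ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fed40000 (reserved)
[ 0.000000] BIOS-e820: 00000000fed45000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 000000011c000000 (usable)
"""

LINE_RE = re.compile(r"BIOS-e820:\s+([0-9a-f]+)\s+-\s+([0-9a-f]+)\s+\((.+?)\)")
FOUR_GIB = 1 << 32


def usable_split(log):
    """Return (usable bytes below 4 GiB, usable bytes above 4 GiB)."""
    below = above = 0
    for start_hex, end_hex, kind in LINE_RE.findall(log):
        if kind != "usable":
            continue
        start, end = int(start_hex, 16), int(end_hex, 16)
        below += max(0, min(end, FOUR_GIB) - start)
        above += max(0, end - max(start, FOUR_GIB))
    return below, above


below, above = usable_split(E820_LOG)
print(f"below 4G: {below / 2**20:.1f} MiB, above 4G: {above / 2**20:.1f} MiB")
```

The chunk above 4GB works out to 0x1c000000 bytes, i.e. 448MB that a non-PAE kernel would never see.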
Yes, it's running a bigmem kernel... But that's what bigmem kernels are for.
Oh, incidentally, even Windows 2000 supported PAE. The bigger problem is the chipset. Not all of them support remapping a portion of RAM above 4GB.
fnord.
"Sure, given enough bits, it's bound to happen sooner or later, but it isn't something I'd worry about. :-)"
The last numbers I saw said something like 1 bit-flip per gigabyte month of RAM, but this is on ground.
In space applications (and to some extent in high-altitude airplanes as well), bit-flips are extremely common and usually occur more than once per day per 2 megabytes of RAM (I saw some statistics on this from one mission, I believe it was SMART).
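To put the ground-level number in perspective, here is a back-of-the-envelope sketch. It takes the "1 bit-flip per gigabyte-month" figure above at face value (it is a remembered number, not a measured one) and assumes flips are independent events, so a Poisson model applies; both are assumptions, and the function names are made up.

```python
import math

FLIPS_PER_GB_MONTH = 1.0  # figure quoted above, unverified


def expected_flips(ram_gb, months, rate=FLIPS_PER_GB_MONTH):
    """Expected number of bit-flips for a given amount of RAM and time."""
    return rate * ram_gb * months


def prob_at_least_one(ram_gb, months, rate=FLIPS_PER_GB_MONTH):
    """Chance of one or more flips, assuming independent (Poisson) events."""
    return 1.0 - math.exp(-expected_flips(ram_gb, months, rate))


for gb in (0.5, 4, 16):
    print(f"{gb} GB for one month: expect {expected_flips(gb, 1):.1f} flips, "
          f"P(at least one) = {prob_at_least_one(gb, 1):.3f}")
```

Whether a flip actually matters depends on what the bit was holding at the time, which is the point made about desktops just below.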
In general, I would say that the following should use ECC:
Servers and mainframes
Avionics and space computers
Laptops (they are frequently used in high altitude airplanes)
Desktops do not in general need it, since a bit error usually does not have catastrophic effects and the likelihood of one happening is not that great.
much MUCH smaller.
Travelling 1 um through paper doesn't get you to the other side. It will get you through several bit sites in modern RAM.
I fail to see how the parent is a troll, regardless of whether he is right or not.
That's because I wasn't trolling. Yes, I do know people here on slashdot don't like to hear positive opinions on Vista, but in fact Vista isn't all that bad.
I use Linux exclusively on my desktop PC at home and at work. I've been using Linux for over a decade. When I bought a laptop a year and a half ago, it came with Vista. Vista is IMHO a great improvement over XP. It's not even slow on decent hardware. I have yet to receive my first BSOD since SP1 was released. SP0 gave me a few BSODs, maybe 5 in total.
That being said, I use Linux for work and Vista for play. So the comparison may not be entirely fair.
This is your sig. There are thousands more, but this one is yours.
3GB is not so much a "sweet spot" as it is a limitation based on a 32 bit OS.
You can address 4GB max using 32 bits.
I beg your pardon, but this is limited by your version of Windows, not by hardware nor even by 32-bit systems.
http://www.microsoft.com/whdc/system/platform/server/PAE/PAEdrv.mspx
http://www.ida.liu.se/~abdmo/SNDFT/docs/ram-soft.html
This references an IBM study, which is what I think I actually remember but could not find quickly this morning.
"In a study by IBM, it was noted that errors in cache memory were twice as common above an altitude of 2600 feet as at sea level. The soft error rate of cache memory above 2600 feet was five times the rate at sea level, and the soft error rate in Denver (5280 feet) was ten times the rate at sea level."
My bet is that it is Cerenkov radiation, as a high-speed charged particle breaks the speed of light in the fluid in the eyeball.
Indeed, these flashes have pretty much already been identified as the result of Cerenkov radiation.
Looks like these truths are not so self-evident after all...
Well, if it takes your corporate IT staff that long to rebuild a computer, they're probably doing it by hand while putting out other fires, which is foolish. Better IT departments have standard images that have been made for and tested upon the computer models that they've standardized upon. Barring hardware failure, the result is a stable Windows environment with few software problems that aren't user-inflicted. In addition, rebuilding a system takes less than an hour: Gigabit Ethernet drops to the benches make backing up a system and restoring a clean image to it go very quickly. Rebuilds for purely remote users are a priority as well. They have access to their email and calendar via OWA, but not to any corporate systems that require VPN access, so getting their laptop repaired and back to them quickly is important: We try to get them repaired and sent out the day we receive them, and have been known to work Saturdays as well to get a system back to someone by the next Monday. We also maintain a hot spare pool: One laptop of every model that we support is on hand to overnight to someone whose laptop is broken. So, in all cases except where the hard drive is broken or the software on it borked, we can have a person up and running the next day. They then send us their computer, and we handle the warranty issues and return it to them.
We also don't permit anyone (ourselves included), to run Windows as Administrator or equivalent except for purposes of installing software or patching. While the computers are joined to our domain, remote/traveling users also have a local user account that is Administrator-equivalent whose name is "[their domain login name].local". They are given the password to it (which is different than their domain password) and told not to use it except to install software or in emergencies (but if they get to that point, they're expected to call: We have a person whose main job is to support remote/traveling users, and she's very good - not only is she an intelligent person, she's a skilled technician and knows our systems inside and out).
It sounds to me as though there are a number of things going on: First, you're getting poor Windows installations. Secondly, there's probably a degree of PEBKAC going on as well. You say that you use Macs at home, so there's almost certainly more than a little resistance to using Windows stemming from attitude: "Macs are better and so I don't have to/won't learn how to use Windows". I've seen this more than once in our company: People that have Macs at home tend to be smug about them and pounce upon every problem (whether real or perceived) with their Windows computers at work. That's OK: After a while you learn which people are your "problem children", and accommodate them as best you can.
In any event, I am sorry for your difficulties, and hope that they are remedied soon.
Nothing is 100% stable. That's an awfully high standard to reach. And I get uptimes of a month on my Vista machine too, so I fail to see how you're demonstrating a point of how Windows is so far behind.
"16MB (fuck off, MiB fascists)" - The Mighty Buzzard
Again this is a culture issue - there's nothing stopping Windows applications from running from a single folder (and indeed, plenty of them do). Conversely, I don't see why one couldn't make a Linux or OS X application that installed some system files (they do have shared libraries, right?)
And indeed, it's worth noting that Quicktime on Windows is as bad as an offender as any other application when it comes to installing background rubbish and insisting on running all the time, so Apple don't get off lightly here.
I think you hit on it there. Windows itself has gotten better over the years. It's buggy, but I've never seen an OS that isn't. From my years of working on customer walk-in and corporate contract machines, Windows "bugginess" usually comes from 3 vectors (in order of severity): flaky drivers, flaky software, or PEBKAC. Mac has fewer "crashes" only due to a controlled hardware pool. Start attaching lots of 3rd party hardware and see how your mileage goes. Linux has the advantage of having mostly open drivers, so you get geeks tinkering and putting back. But you still need those geeks to have the hardware and time to fix it. Windows does not have those advantages, because of the market they want to participate in. If this bugs you that much, use another OS. God knows there are plenty. No OS is perfect. I personally use 3 on a daily basis (XP, OS X, Ubuntu Linux). And yes, they all crash occasionally.
"To Do Is To Be" - Socrates, "To Be Is To Do" - Sartre, "Do Be Do Be Do" - Sinatra
And why is it posted as a response to a "fp" type post? This is usually what karma whores do to game the system. An implicit confirmation of flaws in the comment system?
...if you catch it while hibernating
Be careful. Vista hibernates with one eye open. It can wake itself up from hibernation to do updates. I dual boot my laptop with Linux Mint (an Ubuntu variant). Every week, I'd go to turn on my computer only to find that the battery was dead. Checking the startup logs showed that linux was starting up at about 3:00 in the morning. After googling, I found out that many people were having that problem. The suggested solution was to turn off Vista automatic updates. I checked my Vista, and sure enough, it was set to update at 3am. I turned that setting off: no battery issue. I turned it back on: battery drained.
My CMOS settings pages do not have any facility for waking the laptop at a specific time, so I don't know how Vista manages it. I only know that it can. So beware! Vista hibernates with one eye open.
When our name is on the back of your car, we're behind you all the way!
I get uptimes of 4-5 weeks on Vista. I have to reboot on the Wednesday after the second Tuesday every month for updates.
I have an uptime of about 6 months on Ubuntu since the last time I rebooted to put an extra hard drive in. I don't have to reboot for updates.
Some simple tests:
Being one who has maintained an 1100 node cluster with 8800 pieces of ECC RAM, I can tell you we chase bad RAM sticks ALL THE TIME! It's not necessarily due to cosmic activity; the RAM just exhibits bad behavior as the circuits get older and things start to separate and break down due to thermal load over time. Even a small defect that would let the RAM pass the manufacturer's tests will eventually lead to a DIMM failure down the road. Most average human beings will never determine why their machine crashes every few days if it is a RAM issue. Some power users will even overlook it because they have too much faith in RAM that *was* good when they bought it, but now that it's two or three years old ...
I wouldn't trust a single app to verify your RAM. Run a couple different tests and see if you can nail down the problem. I can look and see how we're tracking that and get back to you.
I had a box once that kept crashing randomly. I thought it was the OS, the memory, couldn't figure it out. Finally realized that one of the memory slots was bad. Kept that one empty and it was solid.
Granted, it was a $30 motherboard.