Slashdot Mirror


Reliability of Computer Memory?

olddoc writes "In the days of 512MB systems, I remember reading about cosmic rays causing memory errors and how errors become more frequent with more RAM. Now, home PCs are stuffed with 6GB or 8GB and no one uses ECC memory in them. Recently I had consistent BSODs with Vista64 on a PC with 4GB; I tried memtest86 and it always failed within hours. Yet when I ran 64-bit Ubuntu at 100% load and using all memory, it ran fine for days. I have two questions: 1) Do people trust a memtest86 error to mean a bad memory module or motherboard or CPU? 2) When I check my email on my desktop 16GB PC next year, should I be running ECC memory?"

29 of 724 comments

  1. Memtest not perfect. by Galactic+Dominator · · Score: 5, Informative

    My experience with memtest is you can trust the results if it says the memory is bad, however if the memory passed it could still be bad. Troubleshooting your scenario should involve replacing the DIMMs in question with known good modules while running Windows.

    --
    brandelf -t FreeBSD /brain
    1. Re:Memtest not perfect. by 0100010001010011 · · Score: 5, Funny

      I bet Windows will love you replacing the DIMMs while running.

    2. Re:Memtest not perfect. by Anonymous Coward · · Score: 5, Interesting

      Another nice tool is prime95. I've used it when doing memory overclocking and it seemed to find the threshold fairly quickly. Of course your comment still stands - even if a software tool says the memory is good, it might not necessarily be true.

    3. Re:Memtest not perfect. by Antidamage · · Score: 5, Funny

      I've often had it pick up bad ram, usually within the first five minutes. One time, the memory in question had been through a number of unprotected power surges. The motherboard and power supply were dead too.

      You can reliably replicate my results by removing the ram, snapping it in half and putting it back in. No need to wait for a power surge to see memtest86 shine.

    4. Re:Memtest not perfect. by Hal_Porter · · Score: 5, Funny

      Memtestx86 is bögus. My machine alwayS generated errors when I run the test but it works fOne otherwise ÿ

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
    5. Re:Memtest not perfect. by Idaho · · Score: 5, Interesting

      My experience with memtest is you can trust the results if it says the memory is bad, however if the memory passed it could still be bad.

      I wonder how strongly RAM stability depends on power fluctuations. While you're testing memory using Memtest, the GPU is not used at all, for example. When playing a game and/or running some heavy compile-jobs, on the other hand, overall power usage will be much higher. I wonder if this may reflect on RAM stability, especially if the power supply is not really up to par?

      If so, you might never find out about such a problem by using (only) memtest.

      --
      Every expression is true, for a given value of 'true'
  2. tinfoil is the answer by Anonymous Coward · · Score: 5, Funny

    wrap your _whole_ computer in tinfoil to deflect those pesky cosmic rays. it also works to keep them out of your head too.

  3. Answers by jawtheshark · · Score: 5, Interesting

    1) Yes

    2) No

    Now to be serious. Home PCs do not yet come with 6GB or 8GB. Most new home PCs still seem to have between 1GB and 4GB, and the 4GB variety is rare because most home PCs still come with a 32-bit operating system. 3GB seems to be the sweet spot for higher-end home PCs. Your home PC will most likely not have 16GB next year. Your workstation at work, perhaps, but even then only perhaps.

    At the risk of sounding like "640KByte is enough for everyone", I have to ask why you think you need 16GB to check your email next year. I'm typing this on a 6 year old computer, I'm running quite a few applications at the same time and I know a second user is logged in. Current memory usage: 764Meg RAM. As a general rule, I know that Windows XP runs fine on 512Meg RAM and is comfortable with 1GB RAM. The same is true for GNU/Linux running Gnome.

    Now, at work with Eclipse loaded, a couple of application servers, a database and a few VMs... Yeah, there indeed you get memory starved quickly. You have to keep in mind that such usage pattern is not that of a typical office worker. I can imagine that a heavy Photoshop user would want every bit of RAM he can get too. The Word-wielding-office-worker? I don't think so.

    Now, I can't speak for Vista. I heard it runs well on 2GB systems, but I can't say. I got a new work laptop last week and booted briefly into Vista. It felt extremely sluggish and my machine does have 4Gig RAM. Anyway, I didn't bother and put Debian Lenny/amd64 on it and didn't look back.

    In my opinion, you have quite a twisted sense of reality regarding the computers people actually use.

    Oh, and frankly... If cosmic rays were a big issue by now with huge memories, don't you think that more people would be complaining? I can't say why Ubuntu/amd64 ran fine on your machine. Perhaps GNU/Linux has built-in error correction and marks bad RAM as "bad".

    --
    Ahhh...the great dumpster continuum. Many a free computer will be found there. -- sowth (748135)
    1. Re:Answers by megabeck42 · · Score: 5, Informative

      Just FYI, 32-bit Intel processors from the Pentium Pro generation and forward (with the exception of most, if not all, of the Pentium-Ms) have 36 physical address pins or more.

      Many, but not all, chipsets have a facility for breaking the physical address presentation of the system RAM into a configurably-sized contiguous block below the 4GB limit and then making the rest available above the 4GB limit. If you're curious, the register (in Intel parlance) is often called TOLUD (Top of Low Usable DRAM).

      Yes, furthermore, given modern OS designs on the x86 architecture, a process cannot use more than 2GB (Windows without the /3GB boot option) or 3GB (Linux, most BSDs, and Windows with /3GB and apps specially built to use the 3/1 instead of the 2/2 split).

      However, that limitation does not preclude you from having a machine running eight processes using 2GB of physical memory each.

      The processor feature is called PAE (Physical Address Extension). It works, basically, by adding an extra level of processor pagetable indirection.
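
To make the "extra level of pagetable indirection" concrete, here is a minimal sketch of how a 32-bit linear address is carved up under PAE with 4 KiB pages: 2 bits select a page-directory-pointer entry, 9 bits a page-directory entry, 9 bits a page-table entry, and 12 bits the offset within the page. The function and variable names are made up for illustration, and the 36-bit physical width is the figure mentioned above, not a property of every CPU.

```python
# Illustrative only: field widths follow the x86 PAE layout for 4 KiB pages;
# the names below are hypothetical, not taken from any kernel source.

PHYS_BITS = 36                       # physical address width assumed above
MAX_PHYSICAL_BYTES = 2 ** PHYS_BITS  # 64 GiB of addressable physical space


def split_pae_linear_address(addr: int) -> dict:
    """Split a 32-bit linear address into its PAE paging indices."""
    if not 0 <= addr < 2 ** 32:
        raise ValueError("linear address must fit in 32 bits")
    return {
        "pdpt_index": (addr >> 30) & 0x3,  # top 2 bits: the extra level
        "pd_index": (addr >> 21) & 0x1FF,  # next 9 bits: page directory
        "pt_index": (addr >> 12) & 0x1FF,  # next 9 bits: page table
        "offset": addr & 0xFFF,            # low 12 bits: byte within page
    }


if __name__ == "__main__":
    print(split_pae_linear_address(0xC0100ABC))
    print(MAX_PHYSICAL_BYTES // 2 ** 30, "GiB physical")
```

Each process still sees only 32 bits of linear address; the wider table entries are what let those pages live above the 4GB mark in physical memory.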

      Incidentally, I have a quad P3-700 (It's a Dell PowerEdge 6450) propping a door open that could support 8GB of RAM if you had enough registered, ECC PC-133 SDRAM to populate the sixteen DIMM slots.

      Anyways, here's a snippet from the beginning of the boot log of a 32-bit machine running Linux which has 4GB of RAM:
      [ 0.000000] BIOS-provided physical RAM map:
      [ 0.000000] BIOS-e820: 0000000000000000 - 0000000000097c00 (usable)
      [ 0.000000] BIOS-e820: 0000000000097c00 - 00000000000a0000 (reserved)
      [ 0.000000] BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
      [ 0.000000] BIOS-e820: 0000000000100000 - 00000000defafe00 (usable)
      [ 0.000000] BIOS-e820: 00000000defb1e00 - 00000000defb1ea0 (ACPI NVS)
      [ 0.000000] BIOS-e820: 00000000defb1ea0 - 00000000e0000000 (reserved)
      [ 0.000000] BIOS-e820: 00000000f4000000 - 00000000f8000000 (reserved)
      [ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fed40000 (reserved)
      [ 0.000000] BIOS-e820: 00000000fed45000 - 0000000100000000 (reserved)
      [ 0.000000] BIOS-e820: 0000000100000000 - 000000011c000000 (usable)

      The title of that list should really be "Physical Address Space map." Either way, notice that the majority of the RAM is available up until 0xDEFAFE00 and the rest is available from 0x100000000 to 0x11c000000 - a range that's clearly above the 4GB limit.
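
As a rough check on that reading of the map, the sketch below totals the "usable" ranges below and above the 4GB boundary. The regular expression assumes the exact line shape shown in the listing above, and the helper name is invented for this example; only a few lines of the listing are repeated as sample input.

```python
import re

# Matches lines shaped like the listing above (an assumption about the format):
#   BIOS-e820: <start hex> - <end hex> (<type>)
E820_LINE = re.compile(r"BIOS-e820:\s+([0-9a-f]+)\s+-\s+([0-9a-f]+)\s+\((.+)\)")
FOUR_GIB = 1 << 32

SAMPLE = """\
[ 0.000000] BIOS-e820: 0000000000000000 - 0000000000097c00 (usable)
[ 0.000000] BIOS-e820: 0000000000097c00 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000defafe00 (usable)
[ 0.000000] BIOS-e820: 0000000100000000 - 000000011c000000 (usable)
"""


def usable_ram(log_text: str) -> tuple:
    """Return (bytes usable below 4 GiB, bytes usable at or above 4 GiB)."""
    below = above = 0
    for match in E820_LINE.finditer(log_text):
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)
        if match.group(3) != "usable":
            continue
        # Clip each range at the 4 GiB line and count both halves separately.
        below += max(0, min(end, FOUR_GIB) - start)
        above += max(0, end - max(start, FOUR_GIB))
    return below, above


if __name__ == "__main__":
    low, high = usable_ram(SAMPLE)
    print(f"below 4 GiB: {low / 2**20:.1f} MiB, above 4 GiB: {high / 2**20:.1f} MiB")
```

The part above the boundary works out to 0x1c000000 bytes, i.e. 448 MiB that a non-PAE 32-bit kernel could not reach.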

      Yes, it's running a bigmem kernel... But that's what bigmem kernels are for.

      Oh, incidentally, even Windows 2000 supported PAE. The bigger problem is the chipset. Not all of them support remapping a portion of RAM above 4GB.

      --
      fnord.
  4. RAID(?) for RAM by Xyde · · Score: 5, Interesting

    With memory becoming so plentiful these days (I haven't seen many home PCs with 6 or 8GB granted, but we're getting there) it seems that a single error on a large capacity chip is getting more and more trivial. Isn't it a waste to throw away a whole DIMM? Why isn't it possible to "remap" this known-bad address, or allocate some amount of RAM for parity the way software like PAR2 works? Hard drive manufacturers already remap bad blocks on new drives. Also it seems to me that, being a solid state device, small failures in RAM aren't necessarily indicative of a failing component like bad sectors on a hard drive are. Am I missing something really obvious here or is it really just easier/cheaper to throw it away?

  5. Joking aside... by BabaChazz · · Score: 5, Informative

    First, it was not cosmic rays; memory was tested in a lead vault and showed the same error rate. Turns out to have been alpha particles emitted by the epoxy / ceramic that the memory chips were encapsulated in.

    That said: Quite clearly given your experience, Vista and Ubuntu load the memory subsystem quite differently. It is possible that Vista, with its all-over-the-map program flow, is missing cache a lot more often and so is hitting DRAM harder; I don't have the background to really know. I believe that Memtest86, in order to put the most strain on memory and thus test it in the most pessimal conditions, tries to access memory in patterns that likewise hit physical memory hardest. But, what I have found is that some OSs, apparently including Ubuntu, will run on memory that is marginal, memory that Memtest86 picks up as bad.

    As for ECC in memory... The problem is that ECC carries a heavy performance hit on write. If you only want to write 1 byte, you still have to read in the whole QWord, change the byte, and write it back to get the ECC to recalculate correctly. It is because of that performance hit that ECC was deprecated. The problem goes away to a large extent if your cache is write-back rather than write-through; though there will be still a significant number of cases where you have to write a set of bytes that has not yet been read into cache and does not comprise a whole ECC word.

    That said, it is still used on servers...

    But I don't expect it will reappear on desktops any time soon. Apparently they have managed to control the alpha radiation to a great extent, and so the actual radiation-caused errors are now occurring at a much lower rate, significantly lower than software-induced BSODs.

    1. Re:Joking aside... by bertok · · Score: 5, Insightful

      As for ECC in memory... The problem is that ECC carries a heavy performance hit on write. If you only want to write 1 byte, you still have to read in the whole QWord, change the byte, and write it back to get the ECC to recalculate correctly. It is because of that performance hit that ECC was deprecated. The problem goes away to a large extent if your cache is write-back rather than write-through; though there will be still a significant number of cases where you have to write a set of bytes that has not yet been read into cache and does not comprise a whole ECC word.

      AFAIK, on modern computer systems all memory is always written in chunks larger than a byte. I seriously doubt there's any system out there that can perform single-bit writes either in the instruction set, or physically down the bus. ECC is most certainly not "deprecated" -- all standard server memory is always ECC, I've certainly never seen anything else in practice from any major vendor.

      The real issue is that ECC costs a little bit more than standard memory, including additional traces and logic in the motherboard and memory controller. The differential cost of the memory is some fixed percentage (it needs extra storage for the check bits), but the additional cost in the motherboard is some tiny fixed $ amount. Apparently for most desktop motherboard and memory controllers that few $ extra is far too much, so consumers don't really have a choice. Even if you want to pay the premium for ECC memory, you can't plug it into your desktop, because virtually none of them support it. This results in a situation where the "next step up" is a server class system, which is usually at least 2x the cost of the equivalent speed desktop part for reasons unrelated to the memory controller. Also, because no desktop manufacturers are buying ECC memory in bulk, it's a "rare" part, so instead of, say, 20% more expensive, it's 150% more expensive.

      I've asked around for ECC motherboards before, and the answer I got was: "ECC memory is too expensive for end-users, it's an 'enterprise' part, that's why we don't support it." - Of course, it's an expensive 'enterprise' part BECAUSE the desktop manufacturers don't support it. If they did, it'd be only 20% more expensive. This is the kind of circular marketing logic that makes my brain hurt.

  6. Depends by gweihir · · Score: 5, Interesting

    My experience with a server that recorded about 15TB of data is something like 6 bit-errors per year that could not be traced to any source. This was a server with ECC RAM, so the problem likely occurred in busses, network cards, and the like, not in RAM.

    For non-ECC memory, I would strongly suggest running memtest86+ for at least a day before using the system, and if it gives you errors, replace the memory. I had one very persistent bit-error in a PC in a cluster that actually required 2 days of memtest86+ to show up once, but did occur about once per hour for some computations. I also had one other bit-error that memtest86+ did not find, but the Linux commandline memory tester found after about 12 hours.

    The problem here is that different testing/usage patterns result in different occurrence probabilities for weak bits, i.e. bits that only sometimes fail. Any failure in memtest86+ or any other RAM tester indicates a serious problem. The absence of errors in a RAM test does not indicate the memory is necessarily fine.
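
The point about weak bits can be put into rough numbers. As a simplified sketch, assume a weak bit fails independently on each test pass with some fixed probability p (an idealisation; real failures depend on pattern, temperature and load). The chance of catching it at least once in n passes is then 1 - (1 - p)^n. Both function names below are invented for this illustration.

```python
import math


def detection_probability(p_fail_per_pass: float, passes: int) -> float:
    """Chance that a weak bit shows up at least once in `passes` test passes."""
    if not 0.0 <= p_fail_per_pass <= 1.0 or passes < 0:
        raise ValueError("need 0 <= p <= 1 and passes >= 0")
    return 1.0 - (1.0 - p_fail_per_pass) ** passes


def passes_needed(p_fail_per_pass: float, confidence: float) -> int:
    """Smallest number of passes that reaches the requested detection chance."""
    if not 0.0 < p_fail_per_pass < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail_per_pass))


if __name__ == "__main__":
    # Hypothetical numbers: a bit that misbehaves in 1% of passes.
    for n in (1, 10, 100, 500):
        print(n, "passes ->", round(detection_probability(0.01, n), 3))
    print("passes for 99% detection:", passes_needed(0.01, 0.99))
```

Under that assumption a clean short run says very little: a bit that fails rarely needs many passes before a tester is likely to see it, which matches the two-day memtest86+ run described above.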

    That said, I do not believe memory errors have become more common on a per computer basis. RAM has become larger, but also more reliable. Of course, people participating in the stupidity called "overclocking" will see a lot more memory errors and other errors as well. But a well-designed system with quality hardware and a thorough initial test should typically not have memory issues.

    However, there is "quality" hardware that gets it wrong. My ASUS board sets the timing for 2 and 4 memory modules to the values for 1 module. This resulted in stable 1 and 2 module operation, but got flaky for 4 modules. I finally moved to ECC memory before I figured out that I had to manually set the correct timings. (No BIOS upgrade available that fixed this...) This board has a "professional" in its name, but apparently "professional" does not include use of generic (Kingston, no less) memory modules. Other people have memory issues with this board as well that they could not fix this way. It seems that sometimes a design is just bad, or even reputable manufacturers do not spend a lot of effort to fix issues in some cases. I can only advise you to do a thorough forum search before buying a specific mainboard.

     

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  7. Settings matter too by Max+Romantschuk · · Score: 5, Informative

    Not all memory is created equal. Memory can be bad if Memtest detects errors, or you can simply be running it at the wrong settings. Usually there are both "normal" and "performance" settings for memory on higher end motherboards, or sometimes you can tweak all sorts of cycle-level stuff manually (CAS latency etc.).

    Try running your memory with the most conservative settings before you assume it's bad.

    --
    .: Max Romantschuk :: http://max.romantschuk.fi/
  8. Re:The truth by Mr+Z · · Score: 5, Interesting

    Note: having more memory increases your error rate assuming a constant rate of error (per megabyte) in the memory. However, if the error rate drops as technology advances, adding more memory does not necessarily result in a higher system error rate. And based on what I've seen, this most definitely seems to be the case.

    Actually, error rates per bit are increasing, because bits are getting smaller and fewer electrons are holding the value for your bit. An alpha particle whizzing through your RAM will take out several bits if it hits the memory array at the right angle. Previously, the bits were so large that there was a good chance the bit wouldn't flip. Now they're small enough that multiple bits might flip.

    This is why I run my systems with ECC memory and background scrubbing enabled. Scrubbing is where the system actively picks up lines and proactively fixes bit-flips as a background activity. I've actually had a bit-flip translate into persistent corruption on the hard drive. I don't want that again.
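
The idea behind ECC plus scrubbing can be sketched with a toy Hamming(7,4) code: 4 data bits are stored with 3 check bits, any single flipped bit can be located and repaired, and a scrub pass simply walks memory applying that repair. This is an illustration only. Real ECC DIMMs typically protect 64-bit words with 8 check bits and can also detect double-bit errors, and the scrubbing is done by the memory controller, not by software like this. All names here are hypothetical.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits as a 7-bit Hamming codeword (bit i = position i+1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(bit << i for i, bit in enumerate(bits))


def syndrome(codeword: int) -> int:
    """XOR of the 1-based positions of all set bits; 0 means no error seen."""
    s = 0
    for pos in range(1, 8):
        if (codeword >> (pos - 1)) & 1:
            s ^= pos
    return s


def correct(codeword: int) -> tuple:
    """Return (repaired codeword, whether a single-bit flip was fixed)."""
    s = syndrome(codeword)
    if s:
        codeword ^= 1 << (s - 1)  # the syndrome is the position of the bad bit
    return codeword, s != 0


def decode(codeword: int) -> int:
    """Extract the 4 data bits from positions 3, 5, 6 and 7."""
    b = [(codeword >> i) & 1 for i in range(7)]
    return b[2] | (b[4] << 1) | (b[5] << 2) | (b[6] << 3)


def scrub(memory: list) -> int:
    """Walk every stored codeword, repair single-bit flips in place, count them."""
    fixed = 0
    for i, word in enumerate(memory):
        repaired, was_bad = correct(word)
        if was_bad:
            memory[i] = repaired
            fixed += 1
    return fixed


if __name__ == "__main__":
    memory = [encode(n) for n in range(16)]
    memory[5] ^= 1 << 3  # simulate one particle strike
    print("repaired words:", scrub(memory))
    print("data intact:", [decode(w) for w in memory] == list(range(16)))
```

Scrubbing matters because this kind of code only survives one flip per word: repairing flips as they appear keeps a second, later flip in the same word from turning a correctable error into an uncorrectable one.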

    FWIW, I work in the embedded space architecting chips with large amounts of on-chip RAM. These chips go into various infrastructure pieces, such as cell phone towers. These days we can't sell such a part without ECC, and customers are always wanting more. We actually characterize our chip's RAM's bit-flip behavior by actively trying to cause bit-flips in a radiation-filled environment. Serious business.

    Now, other errors that parity/ECC used to catch, such as signal integrity issues from mismatched components or devices pushed beyond their margins... Yeah, I can see improved technology helping that.

  9. Re:Paranoia? by Anonymous Coward · · Score: 5, Funny

    and having sex in the back of a red 1948 Buick convertible at a drive-in movie theater on Tuesday night, Feb. 29th under a blue moon... all at the same time....

    Mom?

  10. Re:Surprise? by Erik+Hensema · · Score: 5, Informative

    Yes. Vista is rock solid on solid hardware. Seriously. Vista is as reliable as Linux. Some people wreck their Vista installation, some people wreck their Linux installation.

    --

    This is your sig. There are thousands more, but this one is yours.

  11. Re:Surprise? by bigstrat2003 · · Score: 5, Informative

    Agreed. People who will sit and tell me with a straight face that Vista, in their experience, is unstable are either very unlucky, or liars. Windows stopped being generally unstable years ago. Get with the times.

    --
    "16MB (fuck off, MiB fascists)" - The Mighty Buzzard
  12. Re:Surprise? by alexandre_ganso · · Score: 5, Funny

    ... Vista is way too slow for my, and many others', tastes.

    Now you got what he meant with "rock solid"....

  13. Re:Surprise? by erroneus · · Score: 5, Insightful

    I find that a Windows machine, from Windows 2000 on up, remains quite stable and usable when care is taken not to install too many programs, immature software or junk-ware. The trouble with Windows is the culture. It seems everything wants to install and run a background process or a quick-launcher or a taskbar icon. It seems many don't care about loading old DLLs over newer ones. There is a lot of software misbehavior in Windows-world. (To be fair, there is software misbehavior in MacOS and Linux as well, but I see it far less often.) But Windows by itself is typically just fine.

    Since the problem is Windows culture and not Windows itself, one has to educate oneself in order to avoid the pitfalls that people tend to associate with Windows itself.

  14. Here's the article I remember RE alpha particles. by Jay+Tarbox · · Score: 5, Informative

    http://www.ida.liu.se/~abdmo/SNDFT/docs/ram-soft.html

    This references an IBM study, which is what I think I actually remember but could not find quickly this morning.

    "In a study by IBM, it was noted that errors in cache memory were twice as common above an altitude of 2600 feet as at sea level. The soft error rate of cache memory above 2600 feet was five times the rate at sea level, and the soft error rate in Denver (5280 feet) was ten times the rate at sea level."

  15. Re:Paranoia? by redirect+'slash'+nil · · Score: 5, Informative

    My bet is that it is Cerenkov radiation as a high speed charged particle breaks the speed of light in the fluid in the eyeball.

    Indeed, these flashes have pretty much already been identified as the result of Cerenkov radiation.

    --
    Looks like these truths are not so self-evident after all...
  16. Re:Surprise? by andy9o · · Score: 5, Funny

    PEBKAC

  17. Re:Surprise? by MobyDisk · · Score: 5, Insightful

    People who will sit and tell me with a straight face that Vista, in their experience, is stable are either very lucky, or Microsoft shills.

    See? I can say the opposite, and provide just as much evidence. Do I get modded to 5 as well? Where are your statistics on the stability of Vista? Did it work well for you, and therefore it works well for everyone else?

    I worked for a company that bought a laptop of every brand, so that when the higher-ups went into meetings with Dell, HP, Apple, etc. they had laptops that weren't made by a competitor. They have had problems like laptops not starting up the first time due to incompatible software. That was as recent as 6 months ago. My mother-in-law bought a machine that has plenty of Vista-related problems (audio cutting out, USB devices not working, random crashes in explorer) on new mid-range hardware that came with Vista. But I have a neighbor who found it fixed lots of problems with gaming under XP.

    There are plenty of issues. Vista's problems weren't just made-up because you didn't experience them.

    Everybody's experience is different. Quit making blanket statements based on nothing.

  18. Re:Surprise? by Anonymous Coward · · Score: 5, Informative

    I have to send the machine to corporate headquarters so they can do the reinstall, leaving us without one for a week.

    Well, if it takes your corporate IT staff that long to rebuild a computer, they're probably doing it by hand while putting out other fires, which is foolish. Better IT departments have standard images that have been made for and tested upon the computer models that they've standardized upon. Barring hardware failure, the result is a stable Windows environment with few software problems that aren't user-inflicted. In addition, rebuilding a system takes less than an hour: Gigabit Ethernet drops to the benches make backing up a system and restoring a clean image to it go very quickly. Rebuilds for purely remote users are a priority as well. They have access to their email and calendar via OWA, but not to any corporate systems that require VPN access, so getting their laptop repaired and back to them quickly is important: We try to get them repaired and sent out the day we receive them, and have been known to work Saturdays as well to get a system back to someone by the next Monday. We also maintain a hot spare pool: One laptop of every model that we support is on hand to overnight to someone whose laptop is broken. So, in all cases except where the hard drive is broken or the software on it borked, we can have a person up and running the next day. They then send us their computer, and we handle the warranty issues and return it to them.

    We also don't permit anyone (ourselves included), to run Windows as Administrator or equivalent except for purposes of installing software or patching. While the computers are joined to our domain, remote/traveling users also have a local user account that is Administrator-equivalent whose name is "[their domain login name].local". They are given the password to it (which is different than their domain password) and told not to use it except to install software or in emergencies (but if they get to that point, they're expected to call: We have a person whose main job is to support remote/traveling users, and she's very good - not only is she an intelligent person, she's a skilled technician and knows our systems inside and out).

    It sounds to me as though there are a number of things going on: First, you're getting poor Windows installations. Secondly, there's probably a degree of PEBKAC going on as well. You say that you use Macs at home, so there's almost certainly more than a little resistance to using Windows stemming from attitude: "Macs are better and so I don't have to/won't learn how to use Windows". I've seen this more than once in our company: People that have Macs at home tend to be smug about them and pounce upon every problem (whether real or perceived) with their Windows computers at work. That's OK: After a while you learn which people are your "problem children", and accommodate them as best you can.

    In any event, I am sorry for your difficulties, and hope that they are remedied soon.

  19. Re:Surprise? by inasity_rules · · Score: 5, Insightful

    Dude! Take a chill pill. This is not FUD. The gp is just relating his experience, and here's a shock, YMMV! So just sit back and have another beer.

    BTW, I've also had major hassles with Windows - mostly related to viruses. As it happens this forced me to switch 100% to Linux and I'm happy here, but not everyone who switches is. Personally I like the bandwidth I save from not constantly downloading AV updates, and the speed increase from not running AV. But hey, where you are computing power and bandwidth are probably cheap. Again, YMMV.

    --
    I have determined that my sig is indeterminate.
  20. Re:Surprise? by unoengborg · · Score: 5, Insightful

    You are right and you are wrong. Yes, it's true that Vista, XP or even Windows 2k are rock solid, but only as long as you don't add third party hardware drivers of dubious quality. Unfortunately many hardware vendors don't spend as much effort as they should to develop good drivers. Just using the drivers that come with Windows leaves you with a rather small set of supported hardware, so people install whatever drivers come with the hardware they buy, and as a result they get a BSOD if they are unlucky, and then they blame Microsoft.

    --
    God is REAL! Unless explicitly declared INTEGER
  21. New Microsoft ad slogan by mkcmkc · · Score: 5, Funny

    You must be unlucky or the cause.

    This would make a great slogan for Microsoft's new ad campaign:

    • Windows: if it doesn't work perfectly, it's your fault.
    --
    "Not an actor, but he plays one on TV."
  22. Re:Surprise? by Simetrical · · Score: 5, Insightful

    I worked for a company that bought a laptop of every brand, so that when the higher-ups went into meetings with Dell, HP, Apple, etc. they had laptops that weren't made by a competitor. They have had problems like laptops not starting up the first time due to incompatible software. That was as recent as 6 months ago. My mother-in-law bought a machine that has plenty of Vista-related problems (audio cutting out, USB devices not working, random crashes in explorer) on new mid-range hardware that came with Vista. But I have a neighbor who found it fixed lots of problems with gaming under XP.

    On the other hand, my Linux server freezes up and needs to be reset (sometimes even reboot -f doesn't work) every few days due to a kernel bug, probably some unfortunate interaction with the hardware or BIOS. (I'm using no third-party drivers, only stock Ubuntu 8.04.) And hey, in the ext4 discussions that popped up recently, it emerged that some people had their Linux box freeze every time they quit their game of World of Goo. Just yesterday I had to kill X via SSH on my desktop because the GUI became totally unresponsive, and even the magic SysRq keys didn't seem to work. Computers screw up sometimes.

    What's definitely true is that Windows 9x was drastically less stable than any Unix. Nobody could use it and claim otherwise with a straight face. Blue screens were a regular experience for everyone, and even Bill Gates once blue-screened Windows during a freaking tech demo.

    This is just not true of NT. I don't know if it's quite as stable as Linux, but reasonably stable, sure. Nowhere near the hell of 9x. I used XP for several years and now Linux for about two years, and in my experience, they're comparable in stability. The only unexpected reboots I had on a regular basis in XP were Windows Update forcing a reboot without permission. Of course there were some random screwups, as with Linux. And of course some configurations showed particularly nasty behavior, as with Linux (see above). But they weren't common.

    Of course, you're right that none of us have statistics on any of this, but we all have a pretty decent amount of personal experience. Add together enough personal experience and you get something approaching reality, with any luck.

    --
    MediaWiki developer, Total War Center sysadmin