Slashdot Mirror


Salvaging Defective DRAM

An anonymous reader writes "Ever wonder what happens to DRAM that fails quality assurance testing during manufacturing? Turns out a lot of it is sold as 'downgrade' memory and ends up in OEM memory modules. Last resort: use it in an answering machine, where the sampled audio can be very tolerant of bit errors."

18 of 211 comments

  1. Alternatively, you could use the... by Ari Rahikkala · · Score: 5, Informative
  2. Buying ram on the internet..... by Ogrez · · Score: 5, Informative

    This is a prime example of why I tell people I know not to buy RAM off the internet unless it's from a major company that has good support. Too many people buy RAM with a 15-90 day warranty because it's cheap, and when it fails they are upset that they have to replace it. If you pay a bit more money you get lifetime-warranty RAM... and why do you think they are willing to warranty it that long? Because they know it works. People don't understand the testing process and think they are getting the same product when they buy cheap RAM, as opposed to inexpensive RAM...

    --


    Fire in the hands of the village idiot is no tool, but a weapon of mass destruction
    1. Re:Buying ram on the internet..... by Anonymous Coward · · Score: 1, Informative

      Like every other post on Slashdot says: memtest86 is your friend. Get the memory from the online seller, stick it into a machine, and run memtest86 for a few days (or a few hours, depending on how paranoid you really are).

  3. Some updates to the article... by IvyMike · · Score: 5, Informative

    There are some things in the article that are pretty out of date:

    To reduce the test time, parallel chip testing usually is accomplished with eight to 16 chips in a row.

    That's pretty low parallelism; there are memory testers out there that test over 200 devices at a time right now. And even the older, more common systems are probably testing 64 in parallel.

    A special ink jet color marks the good dies.

    This hasn't been true for years. Each device's pass/fail status is stored in a database, along with all other test results, and the whole process is automated enough that good die are binned out automatically. No need to physically mark the chip.

    Due to the imperfection of the process, a percentage of the DRAM die contains some faulty cells.

    That percentage is 100%. At modern memory sizes, you never get a perfect device without going through repair.

  4. I am seeing a lot of this by acidrain69 · · Score: 5, Informative

    There are a lot of people complaining about substandard RAM. If you had RTFA, you'd realize that the downgrade RAM is reconfigured to skip the bad parts of the chips, so that it comes out as a normal module. Just because there is a faulty bit or 10 in a module doesn't mean the rest of that module is bound to fail. It could just have been an imperfection in the silicon or the circuit process.

    The downgrade RAM has to pass further tests to ensure the detours around the bad parts worked.

    Granted, I probably wouldn't use this stuff in a mission critical server, but if you are buying for a mission critical server, you should be getting ECC registered with lifetime warranties anyway. Now for a small web or file server, or even a desktop, I'd use this.

    Other people have mentioned memtest86. This program is your friend. Don't even bother with BIOS POST tests of RAM, just use this every once in a while if you REALLY want to find the problems. Too bad it won't run on my alpha server :(

    --
    -- Having a Creationist Museum is like having an Atheist place of worship
    1. Re:I am seeing a lot of this by Anonymous Coward · · Score: 3, Informative

      "If the chip is half-bad, there are good chances that it has defects in the other half."

      Actually, no, that is not correct. Errors are caused by a localized defect which affects what is really (in human terms) a small point on the die: a particle of contaminant, for instance, only a micron or two in size.

      Ever wonder why NAND flash (used in SmartMedia, CompactFlash, etc.) is cheaper than NOR flash (also called linear flash, used for BIOS and other code storage)? Because not only is it designed to be fault correcting, but the spec allows up to a certain number of sectors to be completely bad (uncorrectable by the on-board ECC bits). This means higher yield, since many more parts get to pass in spite of defects.

      J

    2. Re:I am seeing a lot of this by Anonymous Coward · · Score: 1, Informative

      No, you buy a mission-critical server that uses hot-swap RAM and failover RAM.

      My Compaq ML-530 has 12 DIMM slots for its RAM, and each pair of slots is redundant, i.e. two 256 MB DIMMs make 256 MB of RAM, with everything duplicated in the other DIMM. Plus there is a set of 4 backup DIMM slots that get activated when a DIMM is detected as failing. Then I can open the box, remove the one with the red LED lit next to its slot, pop in a new one, tell the server that a new one is installed, and activate that slot again.

      Mission critical uses hardware that is NOTHING like the consumer-level junk most of you have. Hot-swap PCI slots, hot-swap RAM, hot-swap SCSI, etc.

      Basically, if you can't hot swap everything then your server is low quality.

      And yes, the only thing I can't hot swap is a processor, but the server can disable up to two of them before the next one that dies puts it into a fail mode (4-processor Xeon motherboard).

    3. Re:I am seeing a lot of this by tintruder · · Score: 3, Informative
      If the chip is half-bad, there are good chances that it has defects in the other half. Usually, it's a problem with the process and not just random quirks.

      Not true. All processes are subject to variation.

      When a wafer is produced with hundreds or thousands of discrete die on it, some are always better than others. For instance, in the 5" process where the first Pentiums were fabricated, you could have a yield of 60%-80% good die, with those 60%-80% spanning a whole range of marked chip speeds. Same process, same wafer, different MHz. Different price when sold.

      If you've ever seen a fab in production, you would also see steps where manual (vacuum wand) handling is needed. Even in the filtered air of a clean room, the open movement of a wafer handled like this often leads to particles becoming affixed to the surface. The smaller the process (e.g. 0.09u vs. 0.9u), the more damage a single particle can do.

      Process washings with chemicals or pure water do a good job of assuring no (well, few) particles stay affixed, but even so, some steps of metrology show that all cannot be avoided.

      Will a single particle hurt a single die? Maybe. Maybe not. It depends on where it lands and at what step in the process.

      Once the die are tested for yield and function and sorted by this performance, they are sold in batches.

      Not every die is tested completely though, but rather a restrictive set of "tell-tale" measurements are taken on most (at good fabs) and exhaustive testing done only on a small sample. Lots of statistical analysis helps know what to test and how hard to test it.

      Move to the final assembler, and all sorts of production glitches can cause bad modules. Primarily though, either minimally qualifying RAM or random sample tested RAM makes it into generic modules. Still, all the other components, the circuit board, connectors and solder itself can contribute to problems.

      In any case, the bad part in any chip is likely local because even minimal QA testing will eliminate obvious or widespread failures.

      Of course, piss-poor process does yield chips more prone to failure by breakdown of the traces or local thermal failure due to bubbles, impurities, or poor assembly.

  5. How to identify DIMMs using bad RAMs by udif · · Score: 5, Informative
    It's quite simple. Really.

    DRAM chips usually have either 4, 8 or 16 bits per word. In order to construct a DIMM, 64 bits are needed. This means that with 4-bit DRAMs you need 16 chips, with 8-bit DRAMs you need 8 chips, and with 16-bit DRAMs you need 4 chips. Usually you will see only the 4- or 8-bit DRAMs, because these occupy less board area for the same capacity. 16-bit DRAMs are only used for low-capacity DIMMs.

    When your DIMM supports ECC, it's 72 bits wide, which makes it more complicated. Usually it's made of 18 4-bit chips, or 9 8-bit chips.

    (Back in the 30- and 72-pin SIMM days, when memories were 8 or 32 bits wide, you could see ECC SIMMs that used 3 chips for 2x4+1=9 bits, or 2x16+4=36 bits.)

    If you see a DIMM with 12 chips, it is usually a cheap OEM DIMM using partially good DRAMs.

    The Best way to identify such a DIMM, is to write down the marking on ALL the chips on it, and look them up in the internet. You then sum up all the DRAM bit widths, and see what you come up with:

    If it's 64 bits, it's a normal DIMM.

    If it's 72 bits, it's probably an ECC DIMM.

    If it's more, it's probably a DIMM using partially good DRAMs.
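
A rough sketch of that rule of thumb in Python, for anyone who wants to script the lookup. The function name and the example chip counts are made up for illustration, and the per-chip widths still have to come from looking up the chip markings by hand.

```python
def classify_dimm(chip_widths):
    """Sum the data width (in bits) of every DRAM chip on a DIMM and
    apply the 64 / 72 / more-than-72 rule of thumb from the comment above."""
    total = sum(chip_widths)
    if total == 64:
        verdict = "normal DIMM"
    elif total == 72:
        verdict = "probably an ECC DIMM"
    elif total > 72:
        verdict = "probably built from partially good DRAMs"
    else:
        # Anything else is not covered by the rule of thumb.
        verdict = "unclear, recheck the chip markings"
    return total, verdict


if __name__ == "__main__":
    print(classify_dimm([8] * 8))    # eight 8-bit chips
    print(classify_dimm([4] * 18))   # eighteen 4-bit chips
    print(classify_dimm([8] * 12))   # twelve 8-bit chips
```

The 12-chip case is the one to watch for: twelve 8-bit chips add up to 96 bits, which only makes sense if part of each chip is unused.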

  6. Re:I figured by AvitarX · · Score: 4, Informative

    Real advice for Linux users with bad RAM.

    Run memtest86 and use the output for the BadRAM patch for the kernel.

    That will actually work, and it cuts only a very minimal amount of RAM out.
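
To give a feel for why so little RAM gets cut out, here is a small Python sketch of the address/mask idea. As far as I understand it, the patch is told about bad memory as pairs of (address, mask), and an address counts as bad when it agrees with the pattern on every bit set in the mask. The 4 KiB page size, the helper names and the printed format are assumptions for illustration, not the patch's exact syntax.

```python
PAGE_MASK = 0xFFFFF000  # assumption: 4 KiB pages, 32-bit physical addresses


def page_patterns(bad_addresses):
    """Collapse failing addresses into one (base, mask) pattern per page."""
    pages = sorted({addr & PAGE_MASK for addr in bad_addresses})
    return [(page, PAGE_MASK) for page in pages]


def is_excluded(address, patterns):
    """True if the address matches any pattern on all bits set in its mask."""
    return any((address & mask) == (base & mask) for base, mask in patterns)


def format_patterns(patterns):
    """Render the patterns as a comma-separated list of hex pairs."""
    return ",".join(f"0x{base:08x},0x{mask:08x}" for base, mask in patterns)


if __name__ == "__main__":
    # Hypothetical failing addresses, as a memory tester might report them.
    failing = [0x01234567, 0x01234999, 0x0ABC0010]
    patterns = page_patterns(failing)
    print(format_patterns(patterns))
    print(len(patterns) * 4, "KiB excluded")
```

Three failing addresses here cost two pages, i.e. 8 KiB, and the rest of the module stays usable.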

    --
    Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
  7. Re:Does it get worse? by Ogrez · · Score: 3, Informative

    Yes, RAM will develop faults from use. It's just not very common. It's mostly caused by overclocking, voltage spikes, and power surges.

    --


    Fire in the hands of the village idiot is no tool, but a weapon of mass destruction
  8. Re:Use it for Linux ;-) by WolfWithoutAClause · · Score: 2, Informative
    I dunno, sounds like it'd be good if it were able to detect the error, carry on with what it was doing, and alert the sysadmin to the problem, who can then schedule downtime to fix it, rather than having an unexpected hardware error in the middle of something important ...

    No. The idea of the patch wasn't to stop it crashing, you probably can't do that; the idea was to analyse the memory when the system booted and work around the faults then. It's perfectly possible to send the admin an email summarising it, though.

    There's something very cool about the concept of buying a tonne of memory for a tenth of the price and suddenly having a system with nearly four gigabytes of memory ;-)

    Then again, isn't that what ECC memory is for?

    No. AFAIK ECC memory can correct only bit errors within a word; but addressing errors slip right past it. The patch can handle addressing errors, blocks that just don't work, blocks that mirror back to the same location etc. etc.

    --

    -WolfWithoutAClause

    "Gravity is only a theory, not a fact!"
  9. I just figured it was at Fry's by Artifex · · Score: 5, Informative

    Seriously, I've had some of their OEM memory as part of a package deal, and it was very nasty stuff.

    What's worse, before they would take it back, they wanted to "test" it, testing being limited to a couple runs of PC-Doctor, which is totally lightweight.

    To make a long story short, they refused to take it back the first time, later it blew up my motherboard. They replaced the motherboard (it was part of the package) and sent me home, where I discovered my Athlon XP was also damaged. I took it up there, and they wanted to run PC-Doctor on it, but the "technician" (hah!) cracked the CPU while putting it in a "test board," so "oops, I guess we're replacing that."

    P.S. One of the guys at the return desk who I got to know quite well told me, when I asked him why the "test boards" they were using always changed, that he thought they were boards that belonged to customers. Whether that meant boards in for repairs, or returned boards, I don't know or care - either is bad news.

    P.P.S. This was at the Fry's in Wilsonville, Oregon. There is also an idiotic troll in the service department there who, after ignoring me waiting at an empty counter for 10 minutes while he chatted on the phone, wanted to charge me for a "missing" monitor stand on a monitor I was returning, refusing for 15 minutes to look in the bottom of the box under the styrofoam because monitor stands always come attached to the monitors, didn't you know? He finally looked when I demanded to talk to the manager, and of course it was there. I had a long discussion with the manager anyway over his, and their, incompetence (I reminded him of the memory fiasco) but the troll was still lurking there the last time I dropped by for consumables, which is all I will ever buy from Fry's, now. You can't miss him - he looks like he'd feel more at home in a raincoat, instead of his cheesy lab coat, roaming a playground on a sunny day.

    --
    Get off my launchpad!
  10. Re:Cheap memory. by Anonymous Coward · · Score: 4, Informative

    No. Most major manufacturers use quality RAM.

    Compaq and IBM both use Kingston Memory. They also like to jack up prices for their "rebranded" Compaq/IBM ram which is just really a Kingston module with an even higher price.

    Toshiba uses Samsung. I'm not sure about manufacturers like Dell or Gateway.

  11. Re:ECC worth it? by Fulcrum of Evil · · Score: 3, Informative

    But it's also theoretically possible for any number of other things to break, and spontaneous RAM failure seems very, very low on the list of things to worry about.

    Well, the thing about RAM failure is that, unless you do something like ECC, you won't detect the errors until they cause a crash. Probably, you'll lose some data to corruption first. The other thing is that RAM errors can be induced by bad power or other transient problems. Finally, it does happen, so better safe than sorry - you're spending $2k on a server, so why cheap out on a $50 part?

    --
    "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
  12. Re:HP HP-UX memory. by Teancom · · Score: 2, Informative

    What is this "Kingston" memory of which you speak? AFAIK, Kingston does not make RAM; they put other people's die on their modules (and sometimes they just buy the modules whole). It's pretty much a crapshoot whether you're getting Samsung, Hynix, Micron (who just signed a deal to start selling to them again), etc. So saying Kingston memory is crap would be akin to saying Dell makes crappy hard drives...

    Not a flame, just a clarification :-)

  13. Re:Cheap memory. by Stonent1 · · Score: 3, Informative

    Dell uses Micron and Infineon (Siemens) for SDRAM and DDR. For RDRAM I think they mainly use Toshiba. I always recommend Crucial to people because it is just the retail branch of Micron. Lifetime warranty and I've never had a failed stick.

  14. Re:ECC worth it? by PurpleFloyd · · Score: 3, Informative

    ECC isn't there for the tiny chance that one, and only one, chip on the module would catch fire and die. It's there so that any random "bit rot" (single-bit errors) is caught and corrected before it causes damage. All RAM is susceptible to this; it can be caused by cosmic rays (!) or by radioactive decay (can't remember if it's alpha or beta) of minute quantities of radioisotopes in the chips' substrate. While it will only happen once in every ten years or so on average, it does happen and can cause a system crash. ECC is about reducing the possible risk (it would have to flip 3 bits simultaneously to fool ECC RAM).
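
For the curious, the "3 bits to fool it" behaviour can be seen in a toy version of the kind of code involved. The Python sketch below is an extended Hamming code protecting 4 data bits with 4 check bits. Real ECC modules protect 64 data bits with 8 check bits, so treat this as a scaled-down illustration with made-up function names, not what any particular memory controller does.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (list of bits).
    Index 0 is the overall parity bit; indices 1, 2, 4 are Hamming check bits."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # make the total number of 1 bits even
    return code


def decode(received):
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    code = list(received)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions holding a 1
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        code[syndrome] ^= 1          # treat as a single-bit error and flip it back
        status = "corrected"
    else:
        return None, "uncorrectable"  # even parity but nonzero syndrome: two flips
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    damaged = list(code)
    for pos in positions:
        damaged[pos] ^= 1
    return damaged


if __name__ == "__main__":
    word = encode(0b1011)
    print(decode(flip(word, 5)))        # one flipped bit
    print(decode(flip(word, 2, 6)))     # two flipped bits
    print(decode(flip(word, 1, 2, 4)))  # three flipped bits
```

One flip is silently repaired, two flips are detected but cannot be repaired, and three flips look exactly like a single flip somewhere else, so the decoder "corrects" the word into the wrong value without noticing.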

    --

    That's it. I'm no longer part of Team Sanity.