Patch To Allow Linux To Use Defective DIMMs
BtG writes: "BadRAM is a patch to Linux 2.2 which allows it to make use of faulty memory by marking the bad pages as unallocatable at boot time. If there were a source of cheap faulty DIMMs this would make building Linux boxes with buckets of memory significantly cheaper; it also demonstrates another advantage of having the source code to one's operating system." The BadRAM page has a great explanation of the project's motivation and status. Now where can I pick up some faulty-but-fixable 512MB RAM sticks?
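As a rough illustration of what "marking the bad pages as unallocatable" means, here is a minimal sketch that maps a list of faulty physical addresses to the page frames that would have to be withheld from the allocator. It is not the patch's actual code or input format; the 4 KiB page size and the names `bad_pages` and `usable_fraction` are assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on i386


def bad_pages(fault_addresses, page_size=PAGE_SIZE):
    """Return the sorted page frame numbers containing at least one faulty address."""
    return sorted({addr // page_size for addr in fault_addresses})


def usable_fraction(total_bytes, fault_addresses, page_size=PAGE_SIZE):
    """Fraction of RAM still allocatable once every faulty page is excluded."""
    total_pages = total_bytes // page_size
    return 1 - len(bad_pages(fault_addresses, page_size)) / total_pages


if __name__ == "__main__":
    faults = [0x1000, 0x1004, 0x2FFF, 0x5000]  # hypothetical memory-tester output
    print(bad_pages(faults))
    print(usable_fraction(64 * 1024 * 1024, faults))
```

The point of the sketch is that a handful of bad cells costs only a handful of pages: several faults inside the same page cost that one page once, and the rest of the module stays usable.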
I must be reading way too much Slashdot... I read the headline, and thought
"Of course Signal 11 is no more.. He left after a big blowout with Rob..."
--
This message brought to you by Colin Davis
Colin Davis
You'll probably get better results simply by cleaning off the contacts with a pencil eraser (remembering to brush away all the eraser dust first) and firmly re-inserting them into the socket.
--
"Open source is good." - Steve Jobs
"Open source is evil." - Microsoft
I'm amazed by how little this crowd knows about the details of semiconductor manufacturing. Defects are unavoidable! There, I said it. With the transistor sizes that we are pushing today, a speck of dust ruins an entire block. All you can do is *limit* the extent to which this happens by being as strict as possible with your clean room. But *some* contaminants will always get through. Perfection is unachievable. You have to accept this.
Alright, so we've accepted that some dies are necessarily going to be damaged. Why not make the hardware such that it can resist imperfections? Well, actually we do. RAM, being as simple and homogeneous as it is, lends itself well to this approach. Here's the idea: you add extra "blocks" of memory to a decode line. Then, if one of the "regular" blocks is destroyed by a process imperfection, the post-fab die can be modified with a laser to reroute data to the extra backup block. So you invest some die room in backup structures, so that a die with only a few errors can be "corrected" and will still function as intended. This is basically like keeping a spare tire. If you get one blowout, you're still in business, but two and you are in trouble. Of course, you can package as many extras as necessary, but it may not make economic sense. Here you calculate the appropriate trade-off between die size and yield to make the decision.
Anyway, long story short: your DRAM is already "bad". Quite a few RAM chips contain process errors that are rerouted around in hardware so that you, the consumer, need never know. To you, the process is transparent. All you should care about is that you get your *functional* RAM cheaper, because the manufacturer would have had to scrap that die otherwise.
This post discusses software "rerouting" around blocks that had more errors than could be corrected in hardware, but somehow still made it out the door. What's wrong with that?
Will semiconductor manufacturers suddenly think "Gee...let's not worry about yield anymore?" You'd better bet they won't. And even if they did, if the software rerouting is so clean as to not be noticeable (which is the only way it would fly), what do you care? You'd get your RAM cheaper.
--Lenny
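The spare-block trade-off described above can be put into rough numbers. This is only a sketch under simplifying assumptions that are not from the post: defects per die follow a Poisson distribution, each defect destroys exactly one block, each spare repairs exactly one block, and every spare adds a fixed fraction to the die area. The function names and the figures in the example are hypothetical.

```python
import math


def repairable_yield(mean_defects, spares):
    """Probability that a die has no more defects than it has spare blocks."""
    return sum(
        math.exp(-mean_defects) * mean_defects ** k / math.factorial(k)
        for k in range(spares + 1)
    )


def good_dies_per_wafer(base_dies, base_defects, spares, area_per_spare=0.02):
    """Sellable dies per wafer when each spare enlarges the die by area_per_spare."""
    area = 1 + area_per_spare * spares   # relative die area
    dies = base_dies / area              # a bigger die means fewer dies per wafer
    defects = base_defects * area        # and more area for dust to land on
    return dies * repairable_yield(defects, spares)


if __name__ == "__main__":
    for spares in range(6):
        print(spares, round(good_dies_per_wafer(1000, 1.0, spares), 1))
```

Under these assumptions the first spare or two pay for themselves many times over, and each further spare recovers fewer dies while still costing area, which is the "it may not make economic sense" point.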
Doesn't this make Linux look like a throwback to the old hobbyist days, like amateur radio operators building QRP rigs in sardine tins?
"Hello, Kingston, I'm looking for any old cruddy defective RAM, got any? Uh.. No.. I won't be reselling it to Linux users, I swear that I am with a major US ISP and we want to put it into our servers! Call Rambus, you say? Hello? Hello?"
--
A feeling of having made the same mistake before: Deja Foobar
Check out the 'mem=exactmap' boot-time option in the 2.4 kernel series - it got added a couple of weeks ago. That way you can specify and exclude faulty RAM via boot parameters.
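For anyone who wants to try this by hand, here is a small sketch that turns a list of faulty address ranges into a boot line. It assumes the 2.4 i386 syntax of `mem=exactmap` followed by one `mem=<size>@<start>` per usable region; check the kernel-parameters documentation for your kernel before relying on it. The helper names are made up for the example, and a real map would also have to mirror the BIOS layout (the 640K-1M hole, ACPI regions), which this ignores.

```python
def good_ranges(total, bad):
    """Return the (start, end) byte ranges of [0, total) not covered by bad ranges."""
    ranges = []
    cursor = 0
    for start, end in sorted(bad):
        if start > cursor:
            ranges.append((cursor, min(start, total)))
        cursor = max(cursor, end)
    if cursor < total:
        ranges.append((cursor, total))
    return ranges


def exactmap_args(total, bad, page=4096):
    """Build the boot parameters, widening each bad range to whole pages first."""
    aligned = [(s // page * page, -(-e // page) * page) for s, e in bad]
    parts = ["mem=exactmap"]
    for start, end in good_ranges(total, aligned):
        parts.append(f"mem={(end - start) // 1024}K@{start // 1024}K")
    return " ".join(parts)


if __name__ == "__main__":
    # Hypothetical 64 MB machine with a few bad bytes just past the 16 MB mark.
    print(exactmap_args(64 * 1024 * 1024, [(0x1000000, 0x1000100)]))
```

Bad ranges are rounded outward to page boundaries because the kernel hands out memory in whole pages, so a single bad byte costs the page it sits in.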
which allows it to make use of faulty memory... *sigh* ....of course my wife had to be reading over my shoulder and asked "Great, now is there anything I can install in you to make use of YOUR faulty memory...." She thinks she's funny. =)
Every time the topic of bad RAM comes up I can't help but tell this story:
We had just installed an Exchange server and were rolling out the Exchange client to all the desktop PCs. Unfortunately, no one had thought to ask if they could take it, and many of them couldn't. So we were feverishly digging up all the RAM we could find and sticking it into machines as fast as we could. I happened to find a 32MB stick (glory be!) in an unused PC. I said to my boss: "Hey, I found a big one!" He turns around and asks "Is it any good?" while simultaneously reaching for it, and ZAP, audibly discharges static electricity right into the thing. We look at each other for a moment and then I say "Not anymore."
I was wrong, though--it was fine.
--
An abstained vote is a vote for Bush and Gore.
Non-meta-modded "Overrated" mods are killing Slashdot
(Hey Ryan! Here's your proof!)
Sure, you wouldn't want to intentionally put bad memory into a production machine, but what if good memory goes bad? This patch, if further developed to perform periodic testing and updating of the bad memory map *during operation*, could actually harden the Linux kernel against spontaneous hardware failure!
If we ever want to see Linux used in mission-critical systems like air traffic control, embedded medical devices, or military applications, then projects like this are the key. Fault tolerance now exists for memory (this project), storage (RAID), and communication (redundant NICs). The next target should be the CPU.
How about projects to detect the types of errors a failing (typically, overheated) cpu produces, and adjust the scheduler accordingly to insert idle time and cool down the cpu? Or to use one cpu to monitor another in multiprocessor systems, and avoid using a processor that starts producing faulty results?
Actually, memory fails for many different reasons. I personally work in the test department at a large semiconductor company that makes SDRAM. All memory gets tested before it gets soldered to the PCB, but it can still fail after it leaves. Single-bit failures and the like are actually fairly common. Most people don't even notice them. There are also speed-related, heat-related, and mechanical problems that come up. For example, the early AMD chipsets had problems with certain memory. Memory also has clock issues and other little details that can affect things dramatically. However, this project seems a little far-fetched, since most memory gets a little worse over time. This is okay for a temporary fix, but your memory will slowly get worse with time. Usually within 6 months the memory is almost totally bad. Another problem with using bad memory is that in several cases it will draw a larger idle current than other modules, and the more bad modules you have, the higher the current load. This can lead to damaged parts on your motherboard. Another thing to realize is that the type of load can affect your stability. In several situations it has been found that Windows can run on top of a memory error because it tends not to stress the memory quite as much as your basic high-load Unix setup. That's my $.02 on the issue, I guess. It seems like this is basically like using a hard drive that is whining and sputtering. Not a smart move stability-wise.
Actually this is quite handy. Windows always worked better with dodgy memory than Linux did, because Linux always tries to use as much memory as possible for caching whereas Windows didn't.
That made Linux notorious for not working with dodgy memory, failing to boot half of the time. I've seen people blame Linux for bad hardware because the same machine would work with Windows.
It's nice that Linux now could just go
*ARGH YOU HAVE CRAP MEMORY*
shrug its shoulders and chug along anyway.
Handful of busted 256MB DIMMs: $10.71 with tax
6 reboots, a little math, and a partial kernel compile: 21min
The look on my roommate's face when I typed "top": priceless!
You must have some sort of problem with linux. This is a valuable, and technically interesting addition to the Linux kernel, and all you can do is act like everybody in the world who needs 256MB DIMMs also has $135 ready.
I know you're just trolling, and I shouldn't respond, but for students, and anybody who has access to memory modules that are experiencing known, predictable faults, this would be great. Not everybody has some fancy $30,000/year job, y'know.
--
"Don't trolls get tired?"
Modern DRAM doesn't have much trouble with bad cells, and the yields are quite good. So there isn't a big supply of DRAM with bad cells that fail solidly. Most DRAM problems today are at the edges: at the buffers, the connectors, or clock synchronization - the things that can be messed up during installation.
Personally, I get ECC RAM even on desktops, just so I know it's working. It eliminates arguments with tech support when the hardware really is broken.
- A.P.
--
* CmdrTaco is an idiot.
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
You can check out Best Buy or CompUSA for some faulty RAM. They seem to have a never-ending supply of it. Not only that, but you get to pay the same price that you'd pay on the net for good RAM!
huh?