Patch To Allow Linux To Use Defective DIMMs

Cool! Put it in the installer! by Moe+Yerca · 2000-10-25 00:01 · Score: 2

If a distro install could test your memory on the install boot and autoconfigure the kernel to ignore bad memory, gee golly! I know I've got a few sticks somewhere that are suspect... never bothered to test them.

Now if we can just get some kernel drivers that can bypass other bad hardware... umm... uh... ok... so I don't have any examples, but dammit! Don't you love that Snickers commercial with the guy wanting to go to lunch with his poster of the panda bear?! Pretty pretty panda! Pretty pretty panda!

I INVENTED PANTS!

Re:Now there's a point to the BIOS memory test? by Ex-NT-User · 2000-10-25 02:46 · Score: 2

Actually the BIOS memory test does a "rough" memory test. The test itself is a bit different from one BIOS manufacturer to another, but for instance with the Phoenix BIOS the test is as follows:

Write a pattern to ram.
Read it back. And compare to what it should be.
Write an aliasing pattern to ram.
Read it back and make sure you got what you expected.
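
The four steps above can be sketched in a few lines of Python. This is only an illustration, not Phoenix's actual routine: `FakeRam`, its fault model, and `quick_test` are hypothetical names, and the "aliasing pattern" is assumed here to mean writing each cell's own address, so that a bad address line shows up as two addresses sharing one cell.

```python
class FakeRam:
    """Toy byte-addressable RAM with optional faults (hypothetical model)."""

    def __init__(self, size, stuck_bits=None, dead_addr_line=None):
        self.cells = [0] * size
        self.stuck_bits = stuck_bits or {}      # addr -> bit mask stuck at 0
        self.dead_addr_line = dead_addr_line    # address bit that always reads 0

    def _decode(self, addr):
        # A dead address line makes two addresses land on the same cell.
        if self.dead_addr_line is not None:
            addr &= ~(1 << self.dead_addr_line)
        return addr

    def write(self, addr, value):
        addr = self._decode(addr)
        self.cells[addr] = value & ~self.stuck_bits.get(addr, 0) & 0xFF

    def read(self, addr):
        return self.cells[self._decode(addr)]


def quick_test(ram, size):
    """Return a sorted list of addresses that failed either pass."""
    bad = set()
    # Pass 1: fixed patterns catch stuck or dropped data bits.
    for pattern in (0x55, 0xAA):
        for addr in range(size):
            ram.write(addr, pattern)
        for addr in range(size):
            if ram.read(addr) != pattern:
                bad.add(addr)
    # Pass 2: address-derived values catch aliasing between addresses.
    for addr in range(size):
        ram.write(addr, addr & 0xFF)
    for addr in range(size):
        if ram.read(addr) != addr & 0xFF:
            bad.add(addr)
    return sorted(bad)
```

Note that the fixed-pattern pass alone would never notice the dead address line, since every cell holds the same value; that is why the second pass is needed. A test this short will still miss the intermittent, timing-related failures.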

This will catch quite a few serious memory problems. The 2 cases I saw recently were:

1. PC100 memory that wasn't quite up to par. (Dropping bits randomly)

2. A friend of mine put a PC100 dimm in a mobo set for PC133 dimms. The PC100 ram worked.. almost.

In both cases the results were random lockups and application crashes. Turning on the BIOS ram test quickly identified the problem. Which was resolved by putting quality memory in the box.

These tests are only really useful the first time you boot your box or if you are suspecting bad RAM. (It's a quick way to test for serious memory problems without having to pull out a RamChecker)

Why bad ram may not be sold by Demon-Xanth · 2000-10-25 00:54 · Score: 2

Say a manufacturer has a DIMM with a bad address line, visually it's a normal DIMM. They decided to sell it for a massively low price. Some unscrupulous buyer utilizing "sell and run" tactics buys a bunch and sells them as normal DIMMs. Buyers call the manufacturer complaining that the DIMMs they bought are bad and the retailer no longer exists.

Note: This has happened to me, I bought two hard to find keyboards only to find both had water damage upon arrival (packaging still good) and the retailer disappeared.

--
If you think education is expensive, you should try ignorance -- Derek Bok, president of Harvard

I'm not so sure about this by John+Jorsett · 2000-10-25 00:56 · Score: 2

I can see a lot of obstacles to becoming a dealer in bad RAM, including the hassles of having to test it and characterize the nature and extent of its problems. In order for someone to know whether the price and product were right, s/he'd have to have some detailed info about the defects, and it would vary from stick to stick. I'd think that the dealer would have to list each stick independently, along with the defect info. The customer would have to buy each stick as a unique entity, presumably. The logistics of doing all this and keeping the inventory records up to date would be quite costly. The economics of this are questionable, IMO, although given my parsimonious nature, I'd love to be proven wrong.

Re:Signal 11 no more? by Signal+11 · 2000-10-25 02:50 · Score: 2

at least have the balls to say it logged in you piece of shit.

CmdrTaco that really does not become you.

Bad Memory doesn't go to waste by IvyMike · 2000-10-25 00:58 · Score: 3

There isn't this huge supply of bad memory out there (Radio Shack jokes aside) because memory manufacturers are pretty clever. Bad memory is put into things like:

Audio storage devices, like answering machines and mp3 players, where a bit or two of failure will just end up as a teeny bit more noise.

Cheap digital cameras (once again, a bad pixel here or there....)

Toys. They actually call bad memory "toy memory" sometimes.

SIMMs. You take (for example) 4 bad chips and 1 good chip and get the equivalent of 4 good chips (by replacing bad io's on the bad chips with io's on the good chip). There are jillions of ways to do this, and companies have pretty much done them all.

Sell them at CompUSA to people who don't know any better. (Sorry, couldn't resist)

If I were you, I'd download memtest86 right now.

Re:Finally! by Yardley · 2000-10-25 02:51 · Score: 2

This RAM idea is great. It shows the true spirit of open source. We can fix anything from broken RAM to Microsoft.

Now what about something to make me burn less coasters? ;)

There's new error-prevention technology available, but I believe it relies on hardware and software, to keep you from burning coasters.

#1) Sanyo's BURN-Proof technology (available on the newest Creative, QPS, Plextor, LaCie etc. writers)

#2) Ricoh's JustLink technology (available on its CD-R/RW/DVD-ROM combination drive among others)

Both technologies automatically prevent buffer under-run errors, which are the leading cause of coasters.

If I were in the market for a new burner, I'd go with the $349 Ricoh combination drive. It does 12x CD-R, 10x CD-RW, 32x CD-ROM, and 8x DVD-ROM all in one device. That's smart.

--
He lives in a world where those who do not run the client software of the omnipresent meme are unacceptable.

Re:Bad Ram by imp · 2000-10-25 01:02 · Score: 2

Actually, these sorts of marking-bad-memory things have been around for a long time. The trouble with them is that RAM isn't as deterministic as you'd like to believe. Once you get one or two bad pages in RAM, more tend to break rather quickly. This has been tried in the past and I think that these systems will still be flaky even when not touching the bad RAM.

I've also seen machines that had bad ram lock up randomly, even when the bad pages are never touched.

Let me just say that I have my doubts.

Re:Why bother? by John_Booty · 2000-10-25 09:53 · Score: 2

"It's worth doing because it keeps a working system up, and Linux should have that"

Huh? Isn't ECC functionality handled in the BIOS, not the OS? So... Linux does have that functionality, eh?

--

OtakuBooty.com: Smart, funny, sexy nerds.

Re:Oh, sure, Linux users are this desperate by dboyles · 2000-10-25 00:02 · Score: 2

Doesn't this make Linux look like a throwback to those old days of hobbies, like Amateur Radio making QRP rigs in sardine tins?

Sounds to me like you're describing what the true definition of "hacking" is. Let's see, if you can get a certain amount of RAM by doing a little hacking for less than you'd pay in a store, what's wrong with that? People do this in their everyday lives. As I type this, I have a penny in my car, wedged between the stereo head unit and the side where it mounts to hold the thing in place. No, it doesn't look pretty. Yes, it did the job (and the price was right).

Perhaps in corporate, "everything must look nice and neat" environments, this isn't a valid solution for adding RAM. But for the CS student who has an old DIMM sitting around, it's pretty damn cool.

--
-- "Complacency is a far more dangerous attitude than outrage." -Naomi Littlebear

Re:Signal 11 no more? by E1ven · 2000-10-25 00:03 · Score: 5

I must be reading way too much Slashdot.. I read the headline, and thought
"Of course Signal 11 is no more.. He left after a big blowout with Rob..."

--

This message brought to you by Colin Davis

Linux is not the be all and end all by FallLine · 2000-10-25 10:13 · Score: 2

If we ever want to see linux used in mission critical systems like air traffic control, embedded medical devices, or military applications, then projects like this are the key.

IANAESE, but Linux will never be used in a life critical medical device, never mind implantable medical devices. Firstly, the FDA requirements are simply too strict to allow Linux's usage. Secondly, it's both overkill and underkill at once. Linux may be relatively efficient compared to systems like Windows, but it's not anywhere near small enough for traditional embedded systems. Third, Linux simply does more than it would ever need to, so why use it? Fourth, it's not set up for DSP type operations. Fifth, do you really want to unnecessarily trust your life to Linux just so you can make a statement?

Re:Oh, sure, Linux users are this desperate by ackthpt · 2000-10-25 00:03 · Score: 2

Yes, but are you really going to impress management to adopt Linux when demonstrating that it can run on a machine with defective parts?

It's like saying, "This new Mercedes E320 is as good as the one without a dent in the door." Both run, both are equally safe (assuming it's just a superficial dent), but it just doesn't sell itself.

--

A feeling of having made the same mistake before: Deja Foobar

BTW, a bit of ancient, related trivia by Mr+Z · 2000-10-25 10:31 · Score: 2

By the way, here's some ancient related trivia. The INTV Productions video game cartridge "Triple Challenge" integrated the previously-released Chess, Checkers and Backgammon on a single game cartridge. In its original form, the Chess cartridge came equipped with a 1K SRAM onboard, as the game required extra memory.

At the time INTV went to produce the Triple Challenge carts, they discovered that since RAM had grown in capacity over the years, 1K SRAMs weren't available in quantity for reasonable prices, and larger SRAMs were too expensive as well. They almost had to cancel the Triple Challenge cart.

That is, until they found someone with a stack of 2K SRAMs, in which half the RAM was good, the other half was bad. Since the game only needed 1K, it ignored the bad half, and off they went.

Cool, eh?

--Joe
--
Program Intellivision!

Re:Why bother? by Animats · 2000-10-25 10:39 · Score: 2

Isn't ECC functionality handled in the BIOS, not the OS?

On x86 systems, the memory controller handles the ECC error correction, and you get an interrupt which allows you to log the event. Often this interrupt is handled by the BIOS. But the BIOS typically doesn't do anything but log the event. The OS can do more; it can map the bad block out, probably without a shutdown.

Just how useful is this, really? by b1t+r0t · 2000-10-25 00:06 · Score: 5

See, the thing about bad RAM and SIMMs/DIMMs is that they can test the chips before soldering them onto the circuit board. If they want, they can even test them before putting the chips in a plastic case. So if you have "bad" RAM, it's more likely to be a defect in the soldering process that renders the whole stick (or an entire column of data bits) useless, or bad contact with the socket.

You'll probably get better results simply by cleaning off the contacts with a pencil eraser (remembering to brush away all the eraser dust first) and firmly re-inserting them into the socket.

--
"Open source is good." - Steve Jobs
"Open source is evil." - Microsoft

Re:Just how useful is this, really? by b1t+r0t · 2000-10-25 01:32 · Score: 2

Moreover, most of the problems I've seen with bad memory have been intermittent random failures related (apparently) to thermal stress and/or specific timing patterns, not fixed "sticky bits".

Ah yes, timing and all that other rot. Two stories here.

My first ever memory upgrade was many years ago, adding 4116 RAM to a TRS-80 Expansion Interface. I put it in and it didn't work. When I looked at the address range with the TRSDOS debugger, it contained random values that changed every time I looked at it! I managed to get it to work right by cranking the power supply voltage down into the low 4.x volt range.

And a few months ago, I got some old IBM 72-pin 8MB SIMMs at a computer show that were probably pulls from old PS/2 machines. Some of them worked, some didn't. I didn't realize until a few days later that this was 80ns RAM, and the motherboard only supported 60/70ns RAM. I was lucky to get any of them to work right.

And then there was all that talk about "cosmic rays" messing up DRAM, until eventually it was discovered that the radiation was coming from within the chip itself!

--
"Open source is good." - Steve Jobs
"Open source is evil." - Microsoft

Re:Just how useful is this, really? by egnor · 2000-10-25 00:20 · Score: 2

Moreover, most of the problems I've seen with bad memory have been intermittent random failures related (apparently) to thermal stress and/or specific timing patterns, not fixed "sticky bits".

I thought it was already relatively common for RAM manufacturers to test for single bit errors in the factory and route around the affected cell, which would negate the economic value of doing this in software. (They should certainly be incented to do this, otherwise they'd have terrible yield issues.)

This sounds like the "bad block" detectors that used to be necessary for hard drives, but aren't any longer (hard drives these days remap bad blocks internally)...

Re:Just how useful is this, really? by alhaz · 2000-10-25 01:56 · Score: 2

Well, here's my story.

I have a Toshiba ultraportable. Nice little box, 3 pounds, magnesium case, didn't cost a whole lot, nice bright screen, 96 megs of ram (maxed out)

64 of that 96 megs of ram are on an add-on card. For whatever reason, the cost of this proprietary memory card for this particular notebook has skyrocketed.

I've had the notebook about 10 months when, out of the blue, it starts rebooting spontaneously. I fire up memtest86, there's a few chunks of bad ram.

I take it out and start frantically searching for a source of a replacement. Currently, Kingston is the only manufacturer i can find shipping it, and they want 1/2 what the whole notebook goes for on eBay, including the memory card. About $350 for 64 megs of ram. No Effin Way. Just not going to happen. I'm not poor but that's just plain stupid. I'd be disgusted with myself if i spent that much on that little memory when i'd be better off selling the whole notebook and buying another.

So i found the badram patch. Patched my kernel. Found that my lilo was too old to allow the whole commandline. Downloaded and installed new lilo.

And, it doesn't work.

Well, it sort of works. Now, with that ram, it randomly locks up, instead of randomly reboots. Big improvement, right? Wrong.

I don't know what the problem is. A friend with a background in the semiconductor industry says that memtest86 is written from an outsider's point of view regarding memory, but that he's under a prior NDA, and Motorola would probably be Quite Upset if he leaked old documents that would tell the author how to improve it. Maybe i just need a better memtest86. Maybe i ought to expand the ranges so that an area around each affected area is also blocked out. I don't know.
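
The idea of widening the blocked-out ranges can be sketched as below. This is a hypothetical helper, not part of the BadRAM patch or memtest86: it assumes you have a plain list of failing byte addresses, assumes a 4096-byte page, and pads each failing page by `margin_pages` on both sides before merging neighbours. Turning the result into whatever format your kernel patch expects is left out.

```python
PAGE = 4096  # assumed page size in bytes


def padded_ranges(bad_addrs, margin_pages=1, page=PAGE):
    """Turn failing addresses into merged, page-aligned [start, end) ranges."""
    spans = []
    for addr in sorted(bad_addrs):
        first = max(addr // page - margin_pages, 0)
        last = addr // page + margin_pages      # inclusive page index
        start, end = first * page, (last + 1) * page
        if spans and start <= spans[-1][1]:
            # Overlaps or touches the previous span: extend it.
            spans[-1][1] = max(spans[-1][1], end)
        else:
            spans.append([start, end])
    return [tuple(s) for s in spans]
```

For example, `padded_ranges([0x5000, 0x5010, 0x9000])` yields two ranges, one around page 5 and one around page 9. Whether padding helps at all depends on whether the failures are really localized, which the story above leaves open.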

All I know is, in my case, it didn't really help. And that a notebook with 32 megs of ram and a really slow harddrive is useful for little more than an xterminal.

--
This is just like television, only you can see much further.

Re:Hello this was on Kernel Traffic a long time ag by jarek · 2000-10-25 02:54 · Score: 2

Well, I read kernel traffic and kernel mailing list but somehow this escaped me. Thanks slashdot. /jarek great stuff btw

How to find bad ram cheap by The+Dev · 2000-10-25 03:01 · Score: 2

Now where can I pick up some faulty-but-fixable 512MB RAM sticks?

Oops, now you can't :(

Does Slashdot readership know nothing of hardware? by slothbait · 2000-10-25 01:09 · Score: 5

I'm amazed by how little this crowd knows about the details of semiconductor manufacturing. Defects are unavoidable! There, I said it. With the transistor sizes that we are pushing today, a speck of dust ruins an entire block. All you can do is *limit* the extent to which this happens by being as strict as possible with your clean room. But *some* contaminants will always get through. Perfection is unachievable. You have to accept this.

Alright, so we've accepted that some dies are necessarily going to be damaged. Why not make the hardware such that it can resist imperfections? Well, actually we do. RAM, being as simple and homogeneous as it is, lends itself well to this approach. Here's the idea: you add extra "blocks" of memory to a decode line. Then, if one of the "regular" blocks is destroyed by a process imperfection, the post-fab die can be modified with a laser to reroute data to the extra backup block. So you invest some die room in backup structures, so that a die with only a few errors can be "corrected" and will still function as intended. This is basically like keeping a spare tire. If you get one blowout, you're still in business, but two and you are in trouble. Of course, you can package as many extras as necessary, but it may not make economic sense. Here you calculate the appropriate trade off between die size and yield to make the decision.
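
The spare-block scheme described above can be modelled with a toy remap table. This is a minimal sketch with made-up names (`RedundantArray`, `repair`, `physical_block`); real parts do the rerouting with fuses in the decoder, not a lookup table, and the sketch assumes the spare blocks themselves are good.

```python
class RedundantArray:
    """Toy memory array with spare blocks (names are illustrative)."""

    def __init__(self, regular_blocks, spare_blocks):
        self.regular_blocks = regular_blocks
        # Spares are numbered just past the regular blocks.
        self.free_spares = list(range(regular_blocks, regular_blocks + spare_blocks))
        self.remap = {}  # defective regular block -> spare block

    def repair(self, defective_blocks):
        """Assign a spare to each defective block; return True if the die is usable."""
        for block in sorted(set(defective_blocks)):
            if not self.free_spares:
                return False  # more defects than spares: scrap the die
            self.remap[block] = self.free_spares.pop(0)
        return True

    def physical_block(self, logical_block):
        """Block actually selected by the decoder after repair."""
        return self.remap.get(logical_block, logical_block)
```

With one spare, a die with one bad block repairs cleanly and a die with two does not, which is the spare tire analogy in code.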

Anyway, long story short: your DRAM is already "bad". Quite a few RAM chips contain process errors that are rerouted around in hardware so that you, the consumer, need never know. To you, the process is transparent. All you should care about is that you get your *functional* RAM cheaper, because the manufacturer would have had to scrap that die otherwise.

This post discusses software "rerouting" around blocks that had more errors than could be corrected in hardware, but somehow still made it out the door. What's wrong with that?

Will semiconductor manufacturers suddenly think "Gee...let's not worry about yield anymore?" You'd better bet they won't. And even if they did, if the software rerouting is so clean as to not be noticeable (which is the only way it would fly), what do you care? You'd get your RAM cheaper.

--Lenny

Finally! by AntiPasto · 2000-10-24 23:47 · Score: 3

I've *always* thought that software could make up for bad hardware (err... well I guess that's the point to bad sectors marked on disks, fault tolerance, and network routing)... but this is getting back to basics in a great way. Now what about something to make me burn less coasters? ;)

Re:Finally! by 11223 · 2000-10-25 00:11 · Score: 2

Hrmn, only burnt one coaster? I've burnt quite a bit more, but they were all my fault: if the ISO is set up correctly (hah!) then I never make a single coaster. And this is on BeOS, where I don't even get to set the priority of cdrecord. And I've got an off-brand CDR drive.

Re:Finally! by Yardley · 2000-10-25 04:22 · Score: 2

If major speed is on your mind, Yamaha just announced some 16x writers. In conjunction with Oak Technology, Yamaha is bringing out a 16x/16x/40x CD-R/RW and just came out with (in Japan and parts of Europe) a 16x/10x/40x CD-R/RW.

"Yamaha first to market with 16X CD-RW drive designed around Oak's controller that reduces CD burn time to under 5 minutes"
16X Write
16X ReWrite
40X Read / Audio Ripping

Yamaha's CRW2100:
16X Write
10X ReWrite
40X Read / Audio Ripping

These drives use an 8MB Memory Buffer for their high speed and to avoid buffer under-run. I can't find any indication if they use either Sanyo's or Ricoh's error prevention technology. I don't think they do.

An interesting article on Plextor's newest drive talks about a newer form of BURN-proof and also JustLink hints that 24x write drives may be down the road.

--
He lives in a world where those who do not run the client software of the omnipresent meme are unacceptable.

Re:Finally! by GoRK · 2000-10-25 04:25 · Score: 2

It's called BURNproof and it's a Hardware/Software combo deal. The new Plextors support it and other drives can't be that far behind. One time I accidentally started burning about 50K extremely small files onto a CD on the fly on my new burner at 12X before realizing OOPS it's gonna UNDERRUN! Then it underran, the light kicked from blinky yellow (write) to green (you just underran you commie pig) and then back to blinky yellow (foiled again!)

Amazing!

Now if you're trying to correct a meatspace error you're having, that's a different story. Use rewritables!

The reason you couldn't do this on a normal burner is that the change requires a bit of extra code in the firmware to handle the ready to restart and a laser that can switch from read to write very fast. CDRW drives can (and do) correct errors such as this when they are writing to RW media.

~GoRK

This IS good for Linux's rep by cirne · 2000-10-25 01:10 · Score: 2

No, it seems to be more like "Your new Ford Explorer will automatically check your tires and reinforce any weak spots it finds". Honestly, (in my experience) integrated circuits are generally the second part of any electronic component to go, after moving parts (of which computers have very few). So, when memory goes bad, what would you rather do? Have the computer fix it for you, or go out and buy a new module? I personally have a pair of SIMMs which aren't in top shape anymore... since I don't have the money, the time, or the ambition to get replacements, I just put them in a low-load router box, which occasionally gets rebooted to clear out memory problems. I'd much rather have the system check my memory for me. So, unless someone decides to put out a major FUD about it, I don't see how it could be advertised in any but a good light.

Re:Hello this was on Kernel Traffic a long time ag by ttyRazor · 2000-10-25 01:12 · Score: 2

This is news to ME, and I'm glad it was here to hear it. Sites like /. are meant to bring attention to a wide range of topics, while others aim to provide prompt coverage of narrower topics. Sure, it's annoying to see a story about something you already heard elsewhere a while ago, but it's important for those that missed it the first time.

Great, deliberate instability :-/ by Nick+Driver · 2000-10-24 23:48 · Score: 2

Now we have a way to deliberately make Linux unstable.... if you subscribe to the theory that if a DIMM has bad areas then that increases the probability that more of its areas will fail in the future.

reliable ? by cookieman · 2000-10-24 23:48 · Score: 2

Only if the DIMMs don't degrade further slowly, right?
Anyway it's a nice thing.

--
Just another coder...

Oh, sure, Linux users are this desperate by ackthpt · 2000-10-24 23:48 · Score: 5

Doesn't this make Linux look like a throwback to those old days of hobbies, like Amateur Radio making QRP rigs in sardine tins?

"Hello, Kingston, I'm looking for any old cruddy defective RAM, got any? Uh.. No.. I won't be reselling it to Linux users, I swear that I am with a major US ISP and we want to put it into our servers! Call Rambus, you say? Hello? Hello?"

--

A feeling of having made the same mistake before: Deja Foobar

Re:Oh, sure, Linux users are this desperate by Ed+Avis · 2000-10-25 16:21 · Score: 2

The idea of defective hard disks has been around for years. Modern disks will map out defects automatically so they're hidden from you, but they're still there (a lot of the time). So if you don't have to throw a disk away because of one error, why do the same for memory? People didn't suddenly junk all their Pentiums when the F00F bug was discovered, they just worked around it in software.

Still, I can appreciate there is a psychological problem with knowing that your system is not 100% flawless - plus, if RAM has some bad bits, might it not have others you don't know about?

The only way BadRAM would take off, I believe, is if RAM manufacturers started shipping each DIMM with a list of known defects, as used to be done for disks. At present, a single defect means the RAM is not used, so the only defective memory modules are dodgy no-name ones you might not want to trust anyway. If, OTOH, the manufacturer guarantees that there are no flaws other than the handful given in the defect list, there'd be no reason not to use the memory provided you trust the manufacturer.

--
-- Ed Avis ed@membled.com

Re:Oh, sure, Linux users are this desperate by Ed+Avis · 2000-10-26 17:47 · Score: 2

I wouldn't use RAM with intermittent faults. But if it had a handful of known bad bits, with a guarantee that all the others were solid, I wouldn't have a problem with mapping out the bad 0.001% (with say 0.1% wasted space) and using the rest.

usually second rate parts will not even be allowed to feature the original manufacturer's name or part numbers, after all, they don't want to get stuck supporting it.

That's the problem - the perception that RAM with any defects at all is 'second rate'. In the past this has certainly been true because it wasn't possible to map bits out. If RAM starts being seen more like hard disks (and until a few years ago, floppies and LCDs) - seen as something which may have known defects without being unreliable - then manufacturers will be only too keen to improve their yields by selling the chips that are only almost-perfect.

(I wonder - could you do this with other hardware? If one of the registers on your CPU is broken, could you sell it and tell the customer to use a compiler that won't use that register? That would be totally infeasible today, but in the future I can see it _could_ happen. For example, if the whole system were in bytecode with a small native-code bootstrap that finds out about the CPU's defects and sets up the JIT compiler appropriately. There have been cheaper chips which were rumoured to be defective versions of more expensive ones - eg the Intel 486SX may originally have been a use for 486DXes where the FPU turned out defective.)

--
-- Ed Avis ed@membled.com

Re:Oh, sure, Linux users are this desperate by ackthpt · 2000-10-25 00:38 · Score: 2

...once it's rolled into production, we justify the expenditure for new/better hardware.

Lucky you, where I once worked (two places actually) this is how these things played out:

If you have it running on that [cobbled together pile of leftovers] then we'll just leave it the way it is.
It eventually breaks because the upgrade never happens (due to always fewer priorities than necessities met)
The [cobbled together pile of leftovers] never should have been done, it makes us look bad when a [cobbled together pile of leftovers] failure deprives users of one of our key services.

In retrospect it's funny, but it wasn't at the time, each time innovation was thus mishandled.

This too often seemed to resemble the institutional project model:

1. Plan is proposed
2. Wild enthusiasm
3. Plan is put into action
4. Process fails
5. Feelings of hurt, loss and disillusionment
6. Search for the guilty
7. Punishment of the innocent
8. Promotion of non-participants

--

A feeling of having made the same mistake before: Deja Foobar

Re:Oh, sure, Linux users are this desperate by gaudior · 2000-10-24 23:54 · Score: 2

To a large extent, Linux is still in the hobbyist category. That's not a bad thing, it's just a recognition that it's not shrink-wrapped for general consumption. There is still a very high bar a person needs to jump over before Linux becomes truly accessible. Those of us using Linux for business purposes recognize this, and we are willing to take this into account when we make risk assessments.

Re:Oh, sure, Linux users are this desperate by ackthpt · 2000-10-25 00:49 · Score: 2

Nothing wrong with that. My perspective is of one who tries to convince an employer to look beyond packaged solutions. The patch is fine for the hobbyist, but be honest: how far would you trust a defective DIMM? When you first install, no problem. When you are playing a few games, no problem. When you are setting up your own server to run on a DSL/CableModem, minor problem. When you depend upon it, bigger problem.

I'd be more impressed with on the fly patching of bad memory to keep a server going rather than having it hang. That would be a selling point.

--

A feeling of having made the same mistake before: Deja Foobar

Re:Oh, sure, Linux users are this desperate by ackthpt · 2000-10-25 02:25 · Score: 2

I'm not taking a swipe at Amateur Radio, or even QRP. The reference was to building it in a sardine tin (can't get a decent metal box from Radio Shack anymore, but you can get Cue Cats for free (then tear em apart))

I'm well aware of the ingenuity of radio amateurs, my father has been one for decades, and the innovation which made early repeaters work (retuning salvaged military/commercial radio equipment.) This is great as a hobby, great if it helps out in an emergency, but, as with Linux seeking acceptance, try not to overlook those who opt to ditch the throwback for a stable, professional package with reliability. I can't see calling up Icom and asking for technical support after I nibbled a hole in the casing of my UHF handheld and wired in a customization any more than I can see calling up a tech at 1AM on a Friday because the cut-price 512M DIMM just flaked out a little more and took the system down.

When you buy good memory, it's expected to be 100% good, no intermittency. If it flakes you replace it, hopefully under warranty or field service agreement. When you buy "iffy" memory, you accept that it is known to be broken, but you have no guarantee how it is broken and whether that state of broken is stable.

Anecdote: A disreputable computer technician, who often overcharged for simple repairs and patches, is hit by a truck and killed immediately. He appears in hell and a demon welcomes him, and directs him down a passageway to his eternal punishment. The tech inquires as to what it will be. The demon indicates the punishment fits the crime and opens a door. The tech looks in and sees an enormous cavern filled with PCs, all tagged as broken. The tech says, oh, I'll spend eternity fixing broken computers? The demon says, yes, but since this is hell, they all have intermittent problems.

--

A feeling of having made the same mistake before: Deja Foobar

Similar solution exists in the 2.4 kernel already! by Anonymous Coward · 2000-10-25 01:15 · Score: 4

Check out the 'mem=exactmap' boot-time option in the 2.4 kernel series - it got added a couple of weeks ago. That way you can specify and exclude faulty RAM via boot parameters.
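
As a rough sketch of what such a boot line might look like, the hypothetical helper below subtracts bad ranges from the usable ones and prints the result. It assumes the 2.4-era `mem=size@start` region syntax following `mem=exactmap`, and that every range is a multiple of 1 KB; check the kernel-parameters documentation shipped with your kernel before trusting the exact spelling. The function name and the example ranges are made up.

```python
def exactmap_args(usable, bad):
    """Build boot arguments covering `usable` minus `bad` ranges.

    Both arguments are lists of (start, end) byte ranges, end exclusive.
    The "mem=<size>K@<start>K" spelling is an assumption about the syntax.
    """
    args = ["mem=exactmap"]
    for start, end in sorted(usable):
        cursor = start
        for b_start, b_end in sorted(bad):
            if b_end <= cursor or b_start >= end:
                continue  # bad range does not touch what is left
            if b_start > cursor:
                args.append(f"mem={(b_start - cursor) // 1024}K@{cursor // 1024}K")
            cursor = max(cursor, b_end)
        if cursor < end:
            args.append(f"mem={(end - cursor) // 1024}K@{cursor // 1024}K")
    return " ".join(args)
```

For a 64 MB box with the usual hole between 640K and 1M and an 8 KB bad spot at 16 MB, this produces three good regions. Unlike BadRAM, every excluded spot adds another region to the command line, so it only suits memory with a few contiguous bad areas.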

Anything similar? by mbadolato · 2000-10-24 23:50 · Score: 5

which allows it to make use of faulty memory... *sigh* ....of course my wife had to be reading over my shoulder and asked "Great, now is there anything I can install in you to make use of YOUR faulty memory...." She thinks she's funny. =)

My bad RAM story by OlympicSponsor · 2000-10-25 00:10 · Score: 4

Every time the topic of bad RAM comes up I can't help but tell this story:

We had just installed an Exchange server and were rolling out the Exchange client to all the desktop PCs. Unfortunately, no one had thought to ask if they could take it--which many of them couldn't. So we were feverishly digging up all the RAM we could find and sticking it into machines as fast as we could. I happened to find a 32MB stick (glory be!) in an unused PC. I said to my boss: "Hey, I found a big one!" He turns around and asked "Is it any good?" while simultaneously reaching for it, and ZAP audibly discharges static electricity right into the thing. We look at each other for a moment and then I say "Not anymore."

I was wrong, though--it was fine.
--
An abstained vote is a vote for Bush and Gore.

--
Non-meta-modded "Overrated" mods are killing Slashdot
(Hey Ryan! Here's your proof!)

Re:What about intermittent failures? by fgodfrey · 2000-10-25 13:39 · Score: 2

There are ways of addressing this issue too. The group I work in at SGI is responsible for (among many other things) Irix and Linux RAS features on our hardware. RAS is Reliability, Availability, and Serviceability. One of the things we observed is that the most likely case to get a double bit error (we have ECC memory in all our Origin servers/supercomputers) was grabbing a new page off the freelist and then bzero'ing it.

The theory on why this occurs is that the memory on the freelist isn't being accessed (well, ok, we have some bugs occasionally, but... :) and it degrades because of this. Since you don't care about the data on the page, it kinda sucks to panic during the bzero. So, Irix, starting with 6.5.7, knows how to "nofault" this bzero operation and if it fails, it grabs a new page off the freelist and discards the bad page. This is a feature we are thinking of adding to Linux. Other types of pages for which this same recovery can work are mapped files (ie, program and library text/read only data) and clean user data (ie, just swapped in and not yet modified). This is about the best way to solve the intermittent failure problem.
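
The recovery loop described above might look roughly like this. It is a sketch of the idea only, not Irix code: `PageFault`, `alloc_zeroed_page`, and the `zero_page` callback are invented names, and a Python exception stands in for the uncorrectable-error trap that a real kernel would catch.

```python
class PageFault(Exception):
    """Stands in for an uncorrectable memory error during the zeroing pass."""


def alloc_zeroed_page(freelist, zero_page, retired):
    """Pop pages off `freelist` until one zeroes cleanly; retire the rest."""
    while freelist:
        page = freelist.pop()
        try:
            zero_page(page)       # the "nofault" bzero: errors are caught, not fatal
            return page
        except PageFault:
            retired.append(page)  # never hand this page out again
    raise MemoryError("freelist exhausted")
```

The point is that the old contents of the page are worthless at this moment, so a failure costs nothing but the page itself. The same reasoning covers the clean, re-readable pages mentioned above.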

One other interesting thing we've noticed with RAM is that failure rates over time stay pretty much constant since, while manufacturing techniques are getting much better, the memories are getting larger. This causes failure rates to stay nearly flat.

--
Go Badgers! -- #include "std/disclaimer.h"

The Environment by Overnight+Delivery · 2000-10-25 14:39 · Score: 2

Ethical motivation:
I like to preserve as much of the environment as I can. The production of chips is a very resource intensive process, and the complexity of a chip means that a lot of the produced chips are incorrect. I dislike wasting good materials, and even if they are merely `good enough' they should be taken seriously. By allowing the use of such `good enough' memory chips, I hope to help preserve the environment.

Computers are pretty darn unbiodegradable, yet the pace of progress makes them obsolete at an ever increasing pace. How many 386's are somewhere other than landfill? A 386 is not actually that old when you compare it to a washing machine or a fridge.

A lot of people are slamming this because it has some practical limitations, so what!

This guy has done a pretty cool hack, but has also done something positive about a side of our industry that most of us don't think about very often.

--

When it absolutely positively has to be there.

Testing methods by skoda · 2000-10-25 01:17 · Score: 2

I don't know anything about memory industry testing methods in particular, but given that RAM chips are produced in volume, I doubt every chip is inspected by a human. More likely, there is an automated inspection system that checks for surface defects, perhaps runs quick functionality tests, and batch sampling inspection by human inspectors.

Not that that would lead to lower quality overall. It might even be better since people might get sloppy after looking at a few hundred identical chips every day, whereas machines don't get bored. (well, except for my computer. It insists I play Unreal Tourn. now and again :)
-----
D. Fischer

--
ShoutingMan.com

Some things NEVER CHANGE... by mcrbids · 2000-10-25 01:18 · Score: 2

I'm reminded of a Digital VAX 7/1150 (or was it 711/50? I don't remember) I worked on. It was the size of two refrigerators, and required TWO room air conditioners to keep temperatures in the room reasonable - to deliver roughly the processing power of a '286...

But it was an AWESOME machine. And, mapping out memory that was bad was something it did on the fly! It would find a bad memory spot, and do one of several things with it:

1) Stop using it;

2) If the problem was intermittent, it would only store PROGRAM CODE there - which, if the memory was bad, it could re-load from the hard disk!

3) If the memory tested good for a while doing program code, (a few days, I think) it would return that RAM to general use.

An amazing machine - with some features that put even a big, powerful *nix box today to shame. For example, versioning of just about EVERYTHING... *:1, *:2, *:3, etc, and while there was a "root" user (called admin on this system) there could be more than one! (My login, "dirdisb" was a "root" login too, and you could always tell when you looked at a file whether admin or dirdisb actually did it - much better than *nix style, IMHO)

I seem to recall that there was a patch or something you could apply that would make it use ALL hard disk space to create as many versions as possible of documents - or just 10. (we used the latter)

This machine, as slow as it was, would comfortably handle 20 simultaneous users! (granted, no X-windows or GUI at all)

With patches such as this badram patch (which IMHO should be added to the kernel by default) we are getting some of these really cool features back...

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.

Your bad RAM has to... by BornInASmallTown · 2000-10-25 01:19 · Score: 2

...be bad in the right way. If, for example, the most significant bit of the addressing bus were damaged, you would only have access to half of the chip's memory at a maximum.

To fix this problem, you'd have to use 2 "half-working" chips to get the same amount of memory that 1 of the non-damaged ones would have provided.

It seems that buying several damaged chips to make up for the one non-damaged chip would not be very cost effective in the long run.
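
To make the "half of the chip" claim concrete, here is a small sketch that counts how many distinct cells stay reachable when one address bit is stuck. It assumes a flat, non-multiplexed address space, which is a simplification; `reachable_cells` is a name invented for this example.

```python
def reachable_cells(address_bits, stuck_bit, stuck_value=0):
    """Return the set of cell addresses still reachable with one stuck line."""
    reachable = set()
    for addr in range(1 << address_bits):
        if stuck_value:
            effective = addr | (1 << stuck_bit)   # line always reads 1
        else:
            effective = addr & ~(1 << stuck_bit)  # line always reads 0
        reachable.add(effective)
    return reachable


# A toy 1024-cell part (10 address bits) with the top bit stuck low.
cells = reachable_cells(address_bits=10, stuck_bit=9)
```

Whichever bit is stuck, exactly half the cells remain; with the top bit stuck they are at least contiguous, while a stuck low bit leaves every other cell unreachable.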

Re:Your bad RAM has to... by b1t+r0t · 2000-10-25 01:59 · Score: 2

...be bad in the right way. If, for example, the most significant bit of the addressing bus were damaged, you would only have access to half of the chip's memory at a maximum.
It's worse than that. DRAM is addressed by rows and columns, so each address line controls two bits, and not necessarily two adjacent bits.
If an entire address line is bad, it's time to make a keychain holder. At the factory, this type of problem won't even make it out of initial die testing, much less all the way to manufacturing a DIMM.

--
"Open source is good." - Steve Jobs
"Open source is evil." - Microsoft

Re:Now there's a point to the BIOS memory test? by Megane · 2000-10-25 01:22 · Score: 2

I actually had a friend who ran DOS debug on his BIOS and noted that it *did* actually test every one of the X86's registers during boot

The Atari 7800 had lots more ROM space than it could possibly use for just the 960-bit digital signature lockout code, so they included a full 6502 CPU test in there. Sheesh.

--
#naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }

Re:What about Quality Control rejects? by Lord+Kano · 2000-10-25 03:52 · Score: 2

Or for that matter, anything that is a half-way decent color copy of the original. If I didn't hate the bastards so much, I'd love to be an FBI badge at one of those shows. Particularly one on commission :).

Bogus software wouldn't bother me at all, at least if it worked when I got it home. I'm talking about defective modems that were taken from a trash heap somewhere and new boxes made for them and put out for sale.

I could give a damn if someone is selling bootlegs of Freddy the Fish or something like that.

LK

--
"Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano

mod this guy up by Barbarian · 2000-10-25 01:40 · Score: 2

mod this guy up, he knows what he's talking about.

--

My Article by SomeOtherGuy · 2000-10-25 00:16 · Score: 2

{ Joke Mode On }

From the "Why did this article get posted...and not mine" department:
Ohh..This sounds a lot like the story I submitted a week ago called:
"How to make your good memory make use of bad software."

{Joke Mode Off }

--
(+1 Funny) only if I laugh out loud.

A better solution... by Anonymous Coward · 2000-10-25 00:34 · Score: 2

Buy a real computer. Pay a lot of money for it. Get a warranty/service contract (they're often free for a certain level of support). Bad RAM? Send it back. Bad mainboard? Send it back. Bad anything for the first three years? You guessed it...send it back. Get the replacement by next-day air. Free.

A lot of vendors offer service contracts and warranties. But peecee vendors, accustomed to dealing with...shall we say, less than reliable operating systems, will try to make you go through 543 steps and tests before allowing you to send your hardware for replacement, because most problems in that world are either OS bugs or user error. In the real computer market, they don't fuck around. You paid a lot for your system, and you can expect it to work. When you call them up and say you have a bad foowhatzit, they send you a new one (unless they're Sun, in which case they make you sign an NDA first - bad Sun, bad!). They expect, and rightly so in most cases, that you know what you're doing and it isn't a software problem. No runaround, no bullshit, no cost to you. This is one of several reasons I'll never own another peecee. The service just ain't the same.

I understand the concept of trying to get the most you can out of any hardware you might have. But I also think that people stuck with such hardware ought to learn their lesson next time instead of relying on hacks, however clever, to work around their poor buying decisions. Anyone actually seeking out bad memory to use with this is insane. Firstly, there's good reason to believe that if memory is failing, other areas in the same part may fail as well, perhaps with less frequency or at a later time. Second, even if the cost is half that of a good part, is it really worth saving 50 bucks and having to configure this thing, test it, and make sure periodically that no other memory areas fail? I would suggest that technical work of this type is worth at least 50 bucks an hour...so if you value your time fairly, it's unlikely that you'll win out of this. I'll gladly pay some extra money to know that I won't get a sig11 the next time I go to compile something...and if I do, I can get replacement parts the next morning at no cost, without any hassle. I don't work for any vendors. I'm just a sysadmin who'd rather read slashdot than argue with tech support.

Re:Now there's a point to the BIOS memory test? by Anonymous Coward · 2000-10-25 00:20 · Score: 2

Does it run every possible combination of CPU instructions on boot up?

It can't. Running every possible combination would take an indefinitely long period of time (infinity).

Does it check every single block on the hard drive? No!

This is because the hard drive is not essential to the functioning of the computer. With modern operating systems, usually a hard drive is required, but again, it's not essential.

Does it check all the blocks of floppies, CDs, DVDs, etc to make sure they work?

This would be absurd. "Please insert a DVD, CD and floppy to boot".

If the memory test is essential to the functioning of the system, why do they let you skip it?

You then go on to contradict yourself by saying "Obviously, the smart thing to do is to _wait_ for the memory to fail rather than test the whole lot for a minute or two."

make sure nobody replaces linux by jannic · 2000-10-25 00:34 · Score: 2

If you're operating a linux server and someone wants to replace it with w2k, well, let him try. But don't tell him that the RAM is defective. (It works with linux, so what?) :-)

Re:What about Quality Control rejects? by Lord+Kano · 2000-10-25 00:37 · Score: 2

As much as I despise the racial/cultural epithets that you just used, I have a policy as it relates to computer show shopping. Never buy from someone who doesn't speak English as their primary language. Never buy from someone from more than 1 state away. Never buy something that comes in a box that is a low quality black and white copy of the original.

LK

--
"Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano

SIMMs too, right? by Straker+Skunk · 2000-10-25 01:45 · Score: 2

Ahh, this would have been useful with an old P90/32MB motherboard/memory combo I recently gave away...

It was quite fun, running a system (FreeBSD) with a single-bit memory error. Sure, gcc would die on occasion, but then there was the oddness of having a script break because a file http_log was missing (mysteriously renamed to httx_log). The best part was actually figuring out which bit was bad...
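
For anyone wondering how you would figure out which bit was bad from a symptom like that rename, XORing the expected and observed bytes points straight at it. This is only an illustrative sketch; `flipped_bits` is a made-up helper name.

```python
def flipped_bits(expected: bytes, observed: bytes):
    """Return (byte_offset, bit_number) for every bit that differs."""
    diffs = []
    for offset, (a, b) in enumerate(zip(expected, observed)):
        delta = a ^ b  # set bits mark the positions that changed
        for bit in range(8):
            if delta & (1 << bit):
                diffs.append((offset, bit))
    return diffs


diffs = flipped_bits(b"http_log", b"httx_log")
```

'p' is 0x70 and 'x' is 0x78, so the only difference is bit 3 of the fourth byte, consistent with a single-bit fault.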

--
iSKUNK!

No, this *is* good for production use! by Random+Q.+Hacker · 2000-10-25 00:23 · Score: 5

Sure, you wouldn't want to intentionally put bad memory into a production machine, but what if good memory goes bad? This patch, if further developed to perform periodic testing and updating of the bad memory map *during operation*, could actually harden the linux kernel against spontaneous hardware failure!

If we ever want to see linux used in mission critical systems like air traffic control, embedded medical devices, or military applications, then projects like this are the key. Fault tolerance now exists for memory (this project), storage (RAID), and communication (redundant NICs). The next target should be the CPU.

How about projects to detect the types of errors a failing (typically, overheated) cpu produces, and adjust the scheduler accordingly to insert idle time and cool down the cpu? Or to use one cpu to monitor another in multiprocessor systems, and avoid using a processor that starts producing faulty results?

Re:No, this *is* good for production use! by mmontour · 2000-10-25 00:41 · Score: 2

I'd suggest one slight change to your post:

Fault tolerance now exists for memory ( ECC RAM ), storage (RAID), and communication (redundant NICs).

Imperfect knowledge, but ... them's the breaks. by timothy · 2000-10-25 00:23 · Score: 3

I knew from the badRAM website that it was discussed on kt (and so read that earlier today), but I hadn't noticed it there when it first appeared -- sometimes I'm too interested in other topics, sometimes I don't read it all the way through, whatever. There's a lot of information in the world. I'm glad that someone sent in the link and explained it a bit (so I was intrigued and looked through it), which is what this site is about.

But how many people saw it on kt? For purely selfish reasons, I'd like to see a lot more people know about this project, because I find it very interesting and useful-looking. Plus, I think it's just a neat hack in general, and I'd like to point it out.

If it's too old for you, then ... don't read it or waste your own time commenting :) There are a lot of projects out there that have been laboring quietly which may have spectacular results at any time -- do you not want them discussed because they're "old news"? The in-progress Tux2 filesystem was no secret, for instance, (that, too, was discussed on kt), but how many people had heard of it before ALS? Not nearly as many as would have been interested, I warrant, and the comments on the slashdot story about it indicate that.

YMMV, whaddya do?

OK.

timothy

--
jrnl: http://tinyurl.com/c2l8yr / foes: http://tinyurl.com/ckjno5

Re:Now there's a point to the BIOS memory test? by sab39 · 2000-10-25 00:26 · Score: 2

I actually had a friend who ran DOS debug on his BIOS and noted that it *did* actually test every one of the X86's registers during boot - obviously this didn't get printed in the boot sequence (because if it had, the message would flash up for about 8 clock cycles) but it actually did a sequence like:

put arbitrary number in first register
copy first register to second register
...
copy second-last register to last register
compare last register to first register
if (different) HALT

(I have a feeling it did this twice, with 01010... and then with 10101...)

He thought this was quite clever until he realized that a bad bit in the first register would still pass the test (and also negate the test of that bit on all the other registers)...
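
The blind spot is easy to reproduce in a small simulation. This is a sketch under assumptions: 16-bit registers, stuck-at-0 faults only, and a `Register` class invented for the example; it is not the actual BIOS code.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


class Register:
    """A register whose `stuck_low` bits always read back as 0."""

    def __init__(self, stuck_low=0):
        self.stuck_low = stuck_low
        self.value = 0

    def write(self, v):
        self.value = v & MASK & ~self.stuck_low

    def read(self):
        return self.value


def chain_test(registers, pattern):
    # Load the first register, copy down the chain, compare last to first.
    registers[0].write(pattern)
    for src, dst in zip(registers, registers[1:]):
        dst.write(src.read())
    return registers[-1].read() == registers[0].read()


def bios_style_test(registers):
    # Run the chain with both alternating patterns, as described above.
    return all(chain_test(registers, p) for p in (0x5555, 0xAAAA))


healthy = [Register() for _ in range(4)]
bad_middle = [Register(), Register(stuck_low=1), Register(), Register()]
bad_first = [Register(stuck_low=1), Register(), Register(), Register()]
```

A fault in a middle register is caught, but a fault in the first register corrupts the reference value itself, so the comparison still matches and the test passes.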

Re:Bad Ram by rlowe69 · 2000-10-25 00:37 · Score: 2

To be honest, doing this does nothing but ENCOURAGE the manufacturers to continue to make bad ram. Why not make these manufacturing companies make GOOD ram so we don't have to do this in the first place?

I think this comment is a result of not knowing how difficult it is to make "good" RAM (or *any* good electronics for that matter).

The complex process behind making RAM means that there will ALWAYS be defective ones in the batches that can't meet standards set by the manufacturer to cover their ass on warranties, etc. If they are going to end up throwing this hardware out (or recycling the pieces if that's cheap) then they might be able to make more money selling the defective RAM.

While having "partially defective" RAM on the market may seem bad, if the price point is right it could be useful for some people. Like if I could get 512MB with 50MB defective for 100 bucks (w/ maybe a 1 year warranty on the 462), I'd jump on it in a second. But that's just me.

--
----- rL

Still practical... by Ungrounded+Lightning · 2000-10-25 00:26 · Score: 2

Now we have a way to deliberately make Linux instable.... if you subscribe to the theory that if a DIMM has bad areas then that increases the probability that more of its areas will fail in the future.

Good point! DRAMs are made with extra rows ... during manufacture the memory is checked and the chips are modified to allow the spare rows to replace the faulty rows. That's how they get the yield up so high. You can get faulty DRAM, but it is used mostly in voice-recording applications where the human brain can tolerate a few crackles.

But what happens when there are more faulty rows than spares? Answer: They sell it to the crackling-audio people, for cheap. Such chips might not have a higher tendency toward progressive RAM-cancer than those with fewer faults (though I will be happy to stand corrected if someone has contrary data.)

By marking the bad rows bad, Linux never allocates them. With virtual memory in fixed-size pages and memory-mapped I/O there's no penalty for scattering your data all over the place and hopping over the occasional chuckhole.
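
As a minimal sketch of what "marking bad rows bad" amounts to at page granularity: map each faulty byte address to its page frame and leave those frames out of the allocatable set. The 4096-byte page size, the example fault addresses, and the function names are assumptions for illustration, not the patch's actual interface.

```python
PAGE_SIZE = 4096


def bad_page_frames(fault_addresses, page_size=PAGE_SIZE):
    """Map faulty byte addresses to the page frame numbers containing them."""
    return {addr // page_size for addr in fault_addresses}


def usable_frames(total_bytes, fault_addresses, page_size=PAGE_SIZE):
    """List every page frame that contains no known-bad address."""
    bad = bad_page_frames(fault_addresses, page_size)
    return [pfn for pfn in range(total_bytes // page_size) if pfn not in bad]


# Three made-up fault addresses; the first two fall in the same page.
faults = [0x00123456, 0x00123FFF, 0x01F00000]
frames = usable_frames(32 * 1024 * 1024, faults)
```

On a 32MB machine this costs two frames out of 8192, and nothing above the allocator needs to know about the holes.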

Downside would be if there's a flakey cell and the memory test misses it. So a persistent bad-page map might be useful, as would beefing up the startup test if the feature is enabled, and adding a background memory test on the currently unallocated pages, to pick up any really-low-density faults.

If an intermittent cell gives you a hit on a read-only or unmodified page, a hack in the parity-error recovery code could move and refresh it. A read hit on a modified page not yet written back to disk is bad news. (Another background hack could be writing modified pages back part of the time the disk is otherwise idle, to reduce that window.)

--
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way

Re:Bad Ram by mmontour · 2000-10-25 00:26 · Score: 2

Don't RAM chips already have some internal ability to map out bad areas, like the way that IDE/SCSI hard drives come with spare sectors that are automatically mapped in when a regular one fails?

[google]

DRAMs typically improve yield by using spare rows and columns to replace those occupied by defective cells. The repairs are performed using current-blown fuses, laser-blown fuses, or laser-annealed resistor connections. [Coc94] references one case in which memory repair increased the yield from 1% to 51%.

from http://www.cs.berkeley.edu/~rfromm/Courses/SP96/cs294-4/project1/dram-test.html

oh, bull by legLess · 2000-10-25 00:27 · Score: 2

It's a lot more like saying, "This old Mercedes will only run on hi-test gasoline, but the new one burns any old crap you put in the tank and runs just as well."

And if that doesn't impress management, take your faulty DIMM and throw it in a Win2k box. Sit back and watch the fireworks.

--
This isn't as much "normalization" as it is "don't take so many drugs when you're designing tables."

Re:PS/2 by rnturn · 2000-10-25 00:38 · Score: 2

``Didn't PS/2s do this way back in the day?''

PS/2?! Heck, didn't PDP-11s do this? I can still remember (barely, though) PDP memory boards with socketed discrete memory chips that included two or three spare chips so you could replace the bad chips yourself after locating them using an XXDP+ diagnostic. I don't recall bad chips bringing down the system but in those days, when a big PDP had 4MB of memory, every little bit counted and the OS worked around it.

I heard somewhere several years ago that Windows got around flaky memory not by marking pages as `bad' but by forcing additional wait states if questionable memory was detected at boot time. Any truth to this?

--
CUR ALLOC 20195.....5804M

Bad RAM for cache by wmoyes · 2000-10-25 00:40 · Score: 2

Flaky RAM would be very useful for caching. Simply add a checksum to all data stored in the flaky RAM. If the checksum is invalid when reading the data, just pretend that it was a cache miss. Most of the functionality and almost no chance of failure. Note that only the storage of the actual cached data should be put into the bad RAM, not the translation tables.

I once had a motherboard that had a problem refreshing RAM above the 1MB boundary. You could write and read just fine, but as time passed you would watch individual bits revert back to 1's. It was kind of amusing watching all the graphics in doom change ;^). I 'fixed' the problem by writing a TSR that tricked all software into thinking I only had 640k of memory. That memory would have been fine for cache if the data was protected by a checksum/CRC.
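
A minimal sketch of that checksummed-cache idea, using CRC-32 from Python's `zlib`. The `ChecksummedCache` class and its `corrupt` helper (which simulates a bit flipping in the flaky RAM) are invented for this example; the bytearray stands in for the unreliable storage, while the CRC and the lookup table are assumed to live in good memory.

```python
import zlib


class ChecksummedCache:
    def __init__(self):
        # key -> (crc kept in good RAM, payload kept in flaky RAM)
        self._slots = {}

    def put(self, key, data: bytes):
        self._slots[key] = (zlib.crc32(data), bytearray(data))

    def get(self, key):
        entry = self._slots.get(key)
        if entry is None:
            return None  # ordinary miss
        crc, buf = entry
        if zlib.crc32(buf) != crc:
            del self._slots[key]  # damaged entry: drop it, report a miss
            return None
        return bytes(buf)

    def corrupt(self, key, offset, bit):
        """Test helper: flip one bit of the stored payload."""
        self._slots[key][1][offset] ^= 1 << bit


cache = ChecksummedCache()
cache.put("block-17", b"some cached disk block")
before = cache.get("block-17")
cache.corrupt("block-17", offset=5, bit=2)
after = cache.get("block-17")
```

The caller cannot tell a corrupted entry from one that was never cached, so it simply refetches from the real source.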

Real information... by Anonymous Coward · 2000-10-25 00:41 · Score: 5

Actually memory fails for many different reasons. I personally work in the test department at a large semiconductor company that makes SDRAM. All memory gets tested before it gets soldered to the PCB, but it can still develop a failure after it leaves. Single-bit failures and the like are actually fairly common. Most people don't even notice them. There are also speed-related problems, heat-related problems, and mechanical problems that come up. For example, the early AMD chipsets had problems with certain memory. Memory also has clock issues and other little details that can affect things dramatically. However, this project seems a little far-fetched, since most memory gets a little worse over time. This is okay for a temporary fix, but your memory will slowly get worse with time. Usually within 6 months the memory is almost totally bad. Another problem with using bad memory is that in several cases it will draw a larger idle current than other modules, and the more bad modules you have, the higher the current load. This can lead to damaged parts on your motherboard. Another thing to realize is that load style can affect your stability. In several situations it has been found that Windows can run over the top of a memory error because it tends not to stress the memory quite as much as your basic high-load Unix setup. That's my $.02 on the issue, I guess. It seems like this is basically like using a hard drive that is whining and sputtering. Not a smart move stability-wise.

mod parent up please by XNormal · 2000-10-25 04:53 · Score: 2

Answering machines use ARAM chips which are actually faulty DRAMs. They map the defects and avoid using those areas.

----

--
Stop worrying about the risks of nuclear power and start worrying about the risks of not using nuclear power.

Umm.. by Auckerman · 2000-10-24 23:52 · Score: 2

Forgive my ignorance, but can you actually purchase RAM that is "bad" and labeled as such? Also, anyone who wants "loads of RAM" is going to be in a position to need to buy good RAM, because their business and research will depend on it.

With that said, if I can get a 512MB DIMM for $100 because 50MB of it is unusable, I'll buy it and install this hack, even though having anywhere near that much RAM on my Linux box will not help me (it's little more than an mp3 player and web browser; most of my work is on my Mac).

--

Burn Hollywood Burn

Re:Umm.. by misleb · 2000-10-24 23:58 · Score: 2

We are not talking about 50MB out of 512MB bad. I believe we are talking about single bits. Single 4k pages of addressable memory. That's 4k out of 512MB. What if you could get a 512MB DIMM for $30, wouldn't you take it?
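
Putting numbers on that, assuming one bad bit costs exactly one 4k page (the page size is the one named above; the variable names are just for this example):

```python
DIMM_BYTES = 512 * 1024 * 1024  # 512MB module
PAGE_BYTES = 4 * 1024           # one 4k page

total_pages = DIMM_BYTES // PAGE_BYTES
lost_percent = 100 * PAGE_BYTES / DIMM_BYTES
```

That is one page out of 131072, or well under a thousandth of a percent of the module.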

-matthew

--
"THERE IS NO JUSTICE, THERE IS ONLY ME." -Death

Re:Umm.. by banky · 2000-10-24 23:59 · Score: 2

We have a local reseller of used machines and parts that has buckets and buckets full of RAM marked as bad. There's probably a company like this in every large town; they seem to do quite well, making a nice buck selling "throw away" boxes and other stuff like that.

--
ZOMG I WOULD LOVE TO KNOW ABOUT YOUR FEELINGS ON MACINTOSH VERSUS WINDOWS, VI VERSUS EMACS, AND HOW YOU'RE NOT A DORK

Signal 11 no more? by Mel · 2000-10-24 23:52 · Score: 4

Actually this is quite handy. Windows always worked better with dodgy memory than Linux did because Linux always tries to use as much memory as possible for caching whereas Windows didn't.

That made Linux notorious for not working with dodgy memory, failing to boot half of the time. I've seen people blame Linux for bad hardware because it would work with Windows.

It's nice that Linux now could just go

*ARGH YOU HAVE CRAP MEMORY*

shrug its shoulders and chug along anyway.

Re:Is this good for Linux's rep? by Abcd1234 · 2000-10-24 23:54 · Score: 3

Okay, no offense, but unless you're joking, that's one of the most ridiculous things I've ever heard. What this does is associate Linux with the ability to compensate for bad hardware. It makes Linux look MORE robust, not less... yeesh, some people...

Chips may still not work by Fervent · 2000-10-24 23:54 · Score: 2

512MB sticks are still expensive, faulty or not.

Plus, some of the motivation is a little askew. If you want to push these chips into an old machine, you still have the problems of RAM limitations due to motherboard design. A fat lot of good 512MB of semi-faulty memory is going to do in a board that can only support up to 32MB (or better yet, an older chipset that supports up to 8 or 2).

--

- I don't care if they globalize against free speech. All my best free thoughts are done in my head.

Better hurry... by darial · 2000-10-25 00:28 · Score: 5

I beat feet to my local purveyor of crappy used hardware as soon as I saw this, and all I have to say is:

handful of busted 256MB DIMMs: $10.71 with tax

6 reboots, a little math, and a partial kernel compile: 21min

The look on my roommate's face when I typed "top": priceless!

Swiss Cheese by twisty · 2000-10-25 00:30 · Score: 3

Those who know me realize my memory is already Swiss Cheese. ;-) But I think that this latest breakthrough takes Exception Handling to new levels of fault tolerance...

Linux forced its way into our IT Department when it could restore a trashed system into something useful. Here at The Salvation Army, we endeavor to be good stewards of what we are given. We have an IBM PC Server 350 (now named "Methusela") that crashed one day for no apparent reason. It refused to run Windows anymore... not even Win98 or Win95!

But it ran Linux flawlessly. Well, actually it did point out one flaw on its own: The internal Ethernet controller was getting an unusually high number of bad packets. It would receive DHCP assignments, even do some web work in Linux... but it was enough to shut Windows down completely. Even after installing a working NIC, Windows could not run due to the faulty internal NIC, but Linux ran fine!

Likewise, we found an instant way to crash every WinNT system in the building. Someone was re-arranging the hubs and switches, and accidentally created a packet loop by plugging a switch back to itself... in three seconds every WinNT system on the network went straight to the Blue Screen of Death.

It's one thing to handle the rules well, but quite another to deal with the exceptions!

anti-linux? by mosch · 2000-10-25 00:33 · Score: 4

You must have some sort of problem with linux. This is a valuable, and technically interesting addition to the Linux kernel, and all you can do is act like everybody in the world who needs 256MB DIMMs also has $135 ready.

I know you're just trolling, and I shouldn't respond, but for students, and anybody who has access to memory modules that are experiencing known, predictable faults, this would be great. Not everybody has some fancy $30,000/year job, y'know.

-- "Don't trolls get tired?"

Coming soon to Mac OS-X by burris · 2000-10-25 00:44 · Score: 2

Considering that the kernel of Mac OS-X is Open Source, you'll eventually be able to have a hack like this on your Mac as well.

Burris

One step further: by Soko · 2000-10-25 00:47 · Score: 2

If he can only get the Kernel to do this on the fly...

--
"Depression is merely anger without enthusiasm." - Anonymous

Re:What about Quality Control rejects? by Detritus · 2000-10-25 00:48 · Score: 2

It's a bad idea. Defective parts have an annoying habit of being resold as good parts by unscrupulous vendors. That is why many manufacturers make a point of destroying defective parts, so they can't sneak back into the supply chain.

--
Mea navis aericumbens anguillis abundat

get real, this is only useful for high reliability by Splork · 2000-10-25 06:07 · Score: 2

The use of a patch like this is only on systems with ECC ram for moving data out of pages which had non-catastrophic memory errors so that the system can keep running without using the flakey bits until that DIMM can be replaced.

bad ram is just that, bad, and is likely to have more failures over time.

Hello this was on Kernel Traffic a long time ago by Squeezer · 2000-10-24 23:56 · Score: 2

Hello Slashdot??? This is old news. The kernel bad ram patch was discussed on the weekly kernel traffic digest several weeks ago. Do the slashdot story posters read any news sites other than slashdot? Other news sites are out there...

--
Does the name Pavlov ring a bell?

Could have used this last year.... by MattW · 2000-10-24 23:56 · Score: 2

I built a box from a hole-in-the-wall parts reseller that did volume, volume, volume in Silicon Valley, and started having some stability issues. So I started doing mass kernel recompiles (100 at a time) as a test, and sure enough, gcc exited with errors, at random points. However, I've heard that this is not necessarily 'bad bits' on the memory sticks, but rather an inability of the memory to actually keep up with the 100MHz bus, even though it was billed as PC-100 RAM. Anyhow, after that, I always sprung for the premium Toshiba lifetime-guarantee RAM at Fry's, and I just got the other parts elsewhere.

Re:Is this good for Linux's rep? by British · 2000-10-24 23:57 · Score: 2

Both my almost-never-used linux boxes are entirely composed of 2nd hand hardware. I'd say it would come in handy. Dumpster divers would appreciate it.

A long time ago... by Richy_T · 2000-10-25 06:36 · Score: 2

I suggested this very thing on a Linux newsgroup and it was pooh-poohed as being not worth the effort. Kinda nice to see my idea vindicated. Even though I now would tend to agree with the pooh-poohers.

Rich

Re:Bad Ram by Abcd1234 · 2000-10-24 23:57 · Score: 2

I think the point this guy is trying to make, on the economic side of things, is that there's a limit where the ratio of good to bad RAM is as high as it's going to be. IOW, there's always going to be a certain percentage of bad RAM in a given production run. So, why not make this semi-broken RAM viable, by selling it cheap for commodity PCs? This could help reduce prices on cheap PCs and make computing power and the Internet more accessible to those with tight funding (poorer folks, libraries, schools, non-profit orgs).

Now there's a point to the BIOS memory test? by kyz · 2000-10-24 23:58 · Score: 3

I've never trusted PCs because the BIOS 'tests the memory' before booting up. Why do they do this?

Does it run every possible combination of CPU instructions on boot up? No!
Does it check every single block on the hard drive? No!
Does it check all the blocks of floppies, CDs, DVDs, etc to make sure they work? No!
If the memory test is essential to the functioning of the system, why do they let you skip it?

Obviously, the smart thing to do is to _wait_ for the memory to fail rather than test the whole lot for a minute or two. After doing a full test once, the first time you boot, you can leave a very low priority memory tester running, or leave the full test to some quiet period with a cron job - a decent memory test of course, not that half-witted test that BIOSes do.

--
Does my bum look big in this?

... by slothbait · 2000-10-25 08:06 · Score: 2

No flame here, just clarification...

> we are talking about the die that passed the initial wafer probe and then were packaged and then fail somehow at or after the packaging or even shipping stages.

Yes, we definitely are. I only brought up the fact that the industry already routes around process errors in DRAMs to demonstrate a point. He seemed frightened by the possibility that future DRAMs we buy might not be 100% "clean". I wanted to demonstrate that 100% clean isn't necessary, and in fact isn't produced now, by and large. What *is* necessary is 100% functional parts. DRAM manufacturers know this, and use it to improve yields, thus driving down cost. No foul play there.

> So in conclusion, I think there are plenty of us at ./ that have a clue about memories, memory protection architecture, fabrication, production, and testing techniques.

I expect there are, but none of them were posting. Instead, most of the posts demonstrated a clear lack of understanding of the process. The consequences of techniques like Linux's remapping seemed to worry the original poster, and I wanted to explain why it wasn't something to worry about since a similar process is already performed quite successfully. Further, I wanted to emphasize that this is a perfectly valid technique for increasing yield, and is transparent to the user. It isn't like the manufacturers are trying to rip people off.

> Why don't you just allow us to think that it's really cool to use software to lock out blocks of RAM?

I'm not saying it isn't cool. Some seemed wary of the idea, though, and I wanted to point out that their present DIMMs use something very much like this. As long as the software can do this transparently (as the hardware does), what does it matter to the user?

> but I seriously doubt that they are applicable to such dense cells as memories

Believe me: they are. Think of it: a massive die area that will be completely destroyed by a single speck. Wouldn't you prefer an ever so slightly (say 5-10%) larger die that can withstand one or two specks? The redundancy is very easy to use in a homogeneous structure like DRAM (millions of identical cells). All that has to be done to "swap in" the replacement RAM block is to modify some address lines. That can be done by electrically blowing fuses on die, or through laser modification. One of my former employers was *most* fond of the laser approach. It added quite a bit of flexibility to their designs.
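To make the "swap in a spare block" idea concrete, here is a toy model in Python. It is a sketch under simplifying assumptions, not how any particular part is built: `DramArray`, its one-word-per-row layout, and the `remap` dictionary standing in for blown fuses are all invented for illustration.

```python
class DramArray:
    """Toy DRAM with a few spare rows that can replace defective ones."""

    def __init__(self, rows, spare_rows):
        self.rows = rows
        # Spare rows live just past the normally addressable rows.
        self.free_spares = list(range(rows, rows + spare_rows))
        self.remap = {}  # stands in for the fuses: bad row -> spare row
        self.cells = [0] * (rows + spare_rows)  # one word per row, for simplicity

    def repair(self, bad_row):
        """Route bad_row to a spare; return False if no spares are left."""
        if bad_row in self.remap:
            return True
        if not self.free_spares:
            return False
        self.remap[bad_row] = self.free_spares.pop(0)
        return True

    def _physical(self, row):
        if not 0 <= row < self.rows:
            raise IndexError("row out of range")
        return self.remap.get(row, row)

    def write(self, row, value):
        self.cells[self._physical(row)] = value

    def read(self, row):
        return self.cells[self._physical(row)]


if __name__ == "__main__":
    dram = DramArray(rows=8, spare_rows=2)
    dram.repair(3)       # factory test found row 3 defective
    dram.write(3, 42)    # the user still addresses row 3 ...
    print(dram.read(3), dram.cells[3], dram.cells[8])  # ... but the data lives in spare row 8
```

The user-visible address space never changes, which is why the repair is transparent; once the spares run out, the die is a reject.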

--Lenny

Why bother? by Animats · 2000-10-25 00:51 · Score: 4

It would make more sense to use ECC RAM, which can tolerate even intermittent bad bits. You get an interrupt on an ECC correction, at which point the OS should stop using that memory, without crashing. Mainframes were doing that decades ago. It's worth doing because it keeps a working system up, and Linux should have that feature. It's a big win for server farms.
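For anyone who hasn't seen how single-bit correction works, a minimal sketch in Python using a Hamming(7,4) code. This is chosen for brevity only; ECC modules typically protect 64 data bits with 8 check bits using a wider code that also detects double-bit errors. The function names are hypothetical. A non-zero syndrome is the point where the hardware would raise the interrupt and the OS could retire the page.

```python
def encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (bit i = position i+1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    # Positions 1, 2 and 4 hold parity; positions 3, 5, 6, 7 hold d[0]..d[3].
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(bit << i for i, bit in enumerate(bits))


def decode(word):
    """Return (nibble, syndrome); syndrome is the corrected position, or 0 if clean."""
    bits = [(word >> i) & 1 for i in range(7)]
    # XOR-ing the positions of all set bits gives 0 for a valid codeword,
    # and the position of the flipped bit after a single-bit error.
    syndrome = 0
    for position in range(1, 8):
        if bits[position - 1]:
            syndrome ^= position
    if syndrome:
        bits[syndrome - 1] ^= 1  # correct the single-bit error
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, syndrome


if __name__ == "__main__":
    word = encode(0b1011)
    damaged = word ^ (1 << 4)  # flip the bit at position 5
    print(decode(word), decode(damaged))
```

Note that this small code can only correct one flipped bit per word; two flips would be silently "corrected" into the wrong value, which is why real ECC adds an extra parity bit for double-error detection.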

Modern DRAM doesn't have much trouble with bad cells, and the yields are quite good. So there isn't a big supply of DRAM with bad cells that fail solidly. Most DRAM problems today are at the edges: at the buffers, the connectors, or clock synchronization - the things that can be messed up during installation.

Personally, I get ECC RAM even on desktops, just so I know it's working. It eliminates arguments with tech support when the hardware really is broken.

They throw them away now by A+nonymous+Coward · 2000-10-25 00:52 · Score: 2

So *any* price they can get, above and beyond shipping and handling, is pure profit to them. And there's a cost to throwing them out which they wouldn't have any more.

--
Infuriate left and right

Predictable faults? by ackthpt · 2000-10-25 02:35 · Score: 2

So tell me, how, without a memory tester, do you know what's predictable vs. unpredictable? As far as I'm concerned with DIMMs, 5% broken is 100% broken.

--

A feeling of having made the same mistake before: Deja Foobar

Err... by Wakko+Warner · 2000-10-24 23:59 · Score: 4

*how* much cheaper can faulty RAM be? I mean, 256MB SDRAM DIMMs are already $135 apiece... Would it really be worth it to get a dodgy piece of memory if the difference in price is negligible?

- A.P.

--
* CmdrTaco is an idiot.

--
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"

Best Buy by FraggleMI · 2000-10-25 00:00 · Score: 4

You can check out Best Buy or CompUSA for some faulty RAM. They seem to have a never-ending supply of it. Not only that, but you can pay the same price you'd pay on the net for good RAM!

--
huh?

If only it made sense.. by verbatim · 2000-10-25 00:01 · Score: 2

Say that current-day RAM modules would cost $100. These are the 100% correct modules, all others are wasted. By selling a decent subset of the wasted ones for a profit of, say, $20 each, a new audience is addressed, namely the one that thinks $100 is a lot (mostly home users). The professional market has the tendency to go for high-quality materials, and they will continue to be willing to pay $100 per module.

Okay, so your car dealer marks down the car 80% because ONE of the pistons doesn't work right. However, the rest work fine and he installed a thing-a-ma-bob to make the engine ignore that piston.. ummm..

What kind of warranties will the end user get for the memory? How much performance is eaten by this patch? Does the memory run up to spec? Will it still work in 2 months?

There could be a niche market for "used" memory sticks, but "damaged" or "defective" may not sell all too well...

I agree, however, that this does seem like a cool way to resurrect older systems into useful appliances (print servers, routers/gateways, etc).

Verbatim

--
Price, Quality, Time. Pick none. What, you thought you had a choice?
