Tracking Down a Single-Bit RAM Error

Ugh, single bit errors by Kufat · 2010-06-24 11:15 · Score: 3, Interesting

One of my computers had an intermittent failure in a RAM chip/line/something somewhere that mostly manifested as SHA/MD5 failures when I was checksumming large files that I'd downloaded. Never showed up in Memtest86, but eventually I eliminated every other possibility. IIRC, I solved it by underclocking the machine and then replacing it when I was able.
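A rough way to reproduce this kind of symptom is to hash the same file several times and see whether the digests agree. The sketch below is only an illustration; `file_digest` and `repeated_digests` are made-up helper names, and SHA-256 is an arbitrary choice rather than anything the post specifies.

```python
import hashlib


def file_digest(path, algo="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large downloads fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def repeated_digests(path, runs=5, algo="sha256"):
    """Hash the same file several times and return the set of distinct digests.

    On healthy hardware the set has exactly one element.
    """
    return {file_digest(path, algo) for _ in range(runs)}
```

More than one distinct digest for an unchanged file points at the machine rather than the download. Note that repeated reads are usually served from the OS file cache, so a bit that flipped in the cached copy would give the same wrong digest every time; comparing against the publisher's checksum covers that case.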

Re:Ugh, single bit errors by Kepesk · 2010-06-24 11:57 · Score: 1

And I thought debugging a MythTV install was hard...

--
Help me fix my brother's injured butt!
Re:Ugh, single bit errors by sortius_nod · 2010-06-24 12:27 · Score: 0

I've had almost the same problem in the past with an old DDR2/AMD machine. Clocked to full speed, it'd fall over repeatedly, clocked down it was fine. I took one of the RAM sticks out, ran fine at full speed.
I'm not sure why you'd want ECC ram in a desktop, unless it's some sort of business-critical machine that you're willing to spend 5 or 6 times what a normal desktop costs. For day to day use, ECC is overkill. You can get warranty on most chips for 1 to 3 years depending on the manufacturer, and if it's out of warranty, either buy a new machine or buy new ram. All in all, it'll cost more to run ECC due to the board required to effectively utilise the ECC capabilities. I'm not even sure some consumer boards are capable of taking ECC (ASRock and the like that come in cheap desktops).
As someone who has worked in support for 15 years, the troubleshooting shouldn't be "interesting", it's basic diagnosis. The idea that it was "cosmic radiation" is just, well, bullshit. Chips die, have manufacturing faults, or just get old. Nothing new here.
Re:Ugh, single bit errors by rudy_wayne · 2010-06-24 12:36 · Score: 3, Informative

I'm not sure why you'd want ECC ram in a desktop, unless it's some sort of business-critical machine that you're willing to spend 5 or 6 times what a normal desktop costs.
This may have been true at one time, but ECC RAM is no longer that expensive. I just looked at prices on Newegg:
8 GB DDR3 $214.99
8 GB DDR3 ECC $274.99
In some cases, depending on the brand and the speed, ECC is actually *CHEAPER*.
Re:Ugh, single bit errors by hawguy · 2010-06-24 12:41 · Score: 4, Interesting

I think the original article showed why you'd want ECC in a desktop machine -- random bit errors do happen in real life. I don't see how a warranty makes this less of an issue -- if my machine silently corrupts data due to a bit error, getting a $50 replacement DIMM isn't really going to satisfy me. Does ECC really cost 5X over non-ECC?
If he was processing data or editing a spreadsheet, then that bit error could have corrupted his data. If he was compiling a program for distribution (perhaps to thousands of machines), that bit error could have corrupted his executable, causing errors on all of the machines it was deployed to.
After reading this article, the question that comes to mind is why am I *not* running ECC on my desktop?
Re:Ugh, single bit errors by hawguy · 2010-06-24 12:50 · Score: 2, Informative

I went to Dell's site and configured a few Dell Desktops (non-ECC) and Workstations (with ECC), and prices were similar for comparable systems. Though the Workstations that supported ECC didn't support many low-end processors, so if I didn't want ECC and didn't care about processor performance I could have gotten a desktop for about 60% of the price of the cheapest workstation with ECC. But I didn't see a 5x increase for ECC.
Re:Ugh, single bit errors by Timothy+Brownawell · 2010-06-24 12:55 · Score: 5, Insightful

I'm not sure why you'd want ECC ram in a desktop, unless it's some sort of business-critical machine that you're willing to spend 5 or 6 times what a normal desktop costs. For day to day use, ECC is overkill.
My desktop has 8GB of ECC in it. This cost I think $40 more than non-ECC, and meant I got an Athlon II X4 instead of a Core i5. That "5 or 6 times what a normal desktop costs" is either bullshit or Intel-onlyism (which is just another kind of bullshit).
Re:Ugh, single bit errors by besalope · 2010-06-24 13:04 · Score: 2, Informative

You'll also need a consumer-level motherboard with ECC support, and those are not common. That means you'll be stuck with a server-grade motherboard, which costs more and can change CPU compatibility, case compatibility, and the features on the board itself.
There's a lot more to making the change from non-ECC to ECC than just swapping out your RAM.
Re:Ugh, single bit errors by Timothy+Brownawell · 2010-06-24 13:09 · Score: 2, Informative

You'll also need a consumer-level motherboard with ECC support, and those are not common. That means you'll be stuck with a server-grade motherboard
Or, you know, go AMD. Because they don't limit ECC to only server parts.
Re:Ugh, single bit errors by AHuxley · 2010-06-24 13:26 · Score: 1

It's not too bad, for 1066MHz ECC.
Have a look at Mac Pro RAM options for 8 GB.
http://eshop.macsales.com/shop/memory/Mac-Pro-Memory#1066-memory

--
Domestic spying is now "Benign Information Gathering"
Re:Ugh, single bit errors by nabsltd · 2010-06-24 13:40 · Score: 2, Informative

Or, you know, go AMD. Because they don't limit ECC to only server parts.
Or, just buy any one of a half-dozen motherboards costing less than $200 and add a Xeon that is priced within 5% of the equivalent spec non-Xeon.
Sure, these might not be the best motherboards for gaming (although they are pretty competitive compared to other socket 1156 motherboards), but for a workstation doing everything else, they're great.
And, this way you get a motherboard that is thoroughly tested with ECC RAM (as that's what is expected to be used), and likely far better BIOS control of the ECC.
Re:Ugh, single bit errors by Anonymous Coward · 2010-06-24 14:48 · Score: 0

How did Intel come into this?
Re:Ugh, single bit errors by Z00L00K · 2010-06-24 16:14 · Score: 1

Another issue that may be here is the latency clocking. Not all control chips are able to handle the latency that the memory has.
I did experience weird memory troubles in one machine but when I moved the modules to another it was fine.

--
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
Re:Ugh, single bit errors by billcopc · 2010-06-24 17:24 · Score: 3, Informative

Depends on the type of desktop. ECC these days doesn't cost much more than non-ECC... Dell and HP may not want to admit it, but I buy ECC DDR3 all the time as I build a lot of white-box servers, and frankly even the lamest "gaming" Ram carries a higher premium than ECC.
The tricky thing is that while most (all?) current AMD boards can take ECC ram (unbuffered, not registered), no consumer Intel boards can handle ECC - you need to step up to a Xeon processor and chipset. Luckily the single-processor setups don't cost all that much more than their mid-range consumer equivalents, but you do have to sacrifice buzzy features like USB 3.0, SLI/Crossfire, eSATA and overclocking. One exception to this is the EVGA Classified SR-2, which has absolutely everything, but it's $600 and requires a special oversized chassis (or a lot of dremel work).
I'm going to put this out there: if someone is genuinely concerned about bit errors to a degree where the loss of work due to a minor crash or reboot is significant enough, go ahead and spend an extra 10% on ECC. Even if you pack that board with 96gb of memory, it's still cheaper than six months of therapy and thorazine :P

--
-Billco, Fnarg.com
Re:Ugh, single bit errors by sjames · 2010-06-24 17:39 · Score: 2, Interesting

One off single bit flips DO happen in otherwise perfectly good hardware.
I saw one years ago. A '386 running a single batch process (under DOS). It was supposed to be a massive sorting operation (500MB was a lot back then) and the results came out terribly scrambled. Each entry was fine except that they were not in order as they should have been. Since it was a batch I had the luxury of running it again. The error NEVER repeated. The same machine ran flawlessly for the rest of its natural life after passing every test I could throw at it.
It could have been a cosmic ray, alpha decay or even a really unfortunately timed power spike.
ECC wouldn't have helped though, it had parity and that didn't detect anything, so the flipped bit was likely in the CPU itself.
Re:Ugh, single bit errors by billcopc · 2010-06-24 17:49 · Score: 1

ECC ram carries about a 15% premium, which means you're only paying for that 1/8th extra memory to store the check bits. Single-socket Xeons aren't that much costlier than their i5/i7 analogs. It's the dual-socket boards and CPUs that cost 2-3 times more.
Single-socket Supermicro board: $200 to $250
Single-socket Xeon W3520 2.66 GHz: $270, or about 10-15% more than the i7-920
Dual-socket Supermicro board: $500 to $750
Dual-socket Xeon quad 2.66 GHz: $950 each!
Me, I want a quad-socket hexa-core system with hyper-threading, just so I can have 48 penguins when I boot. And, y'know, get KDE compiled before they release another version with even more regressions :P
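The "1/8th extra memory" figure above matches how a typical ECC DIMM is laid out: 72 bits per word, of which 64 are data and 8 are check bits. As a sketch, and assuming a standard single-error-correct, double-error-detect (SECDED) Hamming code, which the post does not name, the check-bit count can be worked out like this. `secded_check_bits` is a hypothetical helper name.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by a SECDED Hamming code for a given data width.

    Single-error correction needs the smallest r with 2**r >= data_bits + r + 1;
    one extra overall parity bit adds double-error detection.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


data_bits = 64
check_bits = secded_check_bits(data_bits)
overhead = check_bits / data_bits  # extra storage relative to the data bits
```

For a 64-bit data word this gives 8 check bits, so the raw storage overhead is one eighth, a little under the roughly 15% price premium quoted above.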

--
-Billco, Fnarg.com
Re:Ugh, single bit errors by bzipitidoo · 2010-06-24 18:00 · Score: 1

Even a decade ago, ECC was more available than you might think. I still have a 350 MHz Pentium II with 192M of ECC RAM. The motherboard was nothing special. ECC RAM cost rather more, not quite 50% more as I recall, but noticeable. When I bought it in 1997, I put in only 64M and waited for prices to come down before adding more.
ECC wasn't worth it though. Haven't had any troubles with any computers that could be traced to RAM errors. Seems that if a situation came up where ECC was making a difference, the problem could be better fixed by removing whatever was causing the problems, be that bad RAM, a bad power supply, or some other noisy source of EMI and RFI. Or add more shielding to the case.

--
Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
Re:Ugh, single bit errors by afidel · 2010-06-24 18:08 · Score: 1

That's why the Nehalems have ECC all through the data bus and internally.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:Ugh, single bit errors by Anonymous Coward · 2010-06-24 18:55 · Score: 2, Informative

By only supporting ECC on their expensive server processors.
Re:Ugh, single bit errors by Anonymous Coward · 2010-06-25 02:41 · Score: 0

I was checksumming large files that I'd downloaded.
Were they from a .xxx website?
Re:Ugh, single bit errors by WuphonsReach · 2010-06-25 04:06 · Score: 1

One of my computers had an intermittent failure in a RAM chip/line/something somewhere that mostly manifested as SHA/MD5 failures when I was checksumming large files that I'd downloaded. Never showed up in Memtest86, but eventually I eliminated every other possibility. IIRC, I solved it by underclocking the machine and then replacing it when I was able.

Try Prime95 next time if you're trying to track down intermittent errors like that with CPU/RAM.

(Those of us in the know have been using it for 10-15 years now, because it pegs the CPU at 100% and does very complex calculations with cross-checks that use up lots and lots of memory.)

--
Wolde you bothe eate your cake, and have your cake?
Re:Ugh, single bit errors by Anonymous Coward · 2010-06-25 04:32 · Score: 0

This may have been true at one time, but ECC RAM is no longer that expensive.
It's not that ECC RAM is no longer expensive, it's that regular RAM prices skyrocketed to the level of ECC RAM.
Re:Ugh, single bit errors by Anonymous Coward · 2010-06-25 06:46 · Score: 0

After reading this article, the question that comes to mind is why am I *not* running ECC on my desktop?
Probably the same reason you don't use a RAID array and redundant power supplies in your desktop machine. There's a lot of redundancy that servers use which is overkill on a desktop system. Switching to more expensive ECC memory because you might end up with a flipped bit at some point is just being paranoid. Your hard disk could easily end up with a flipped bit when it's writing data. What about your video memory on your graphics card? It's not ECC either! If you start worrying pretty soon you'll need to have redundant checks on everything.
If a flipped bit in memory was such a regular occurrence you'd hear about it corrupting data all the time. If you do have a bad DIMM, you'll know because your system will do weird things (like the segfault issue in the article). We don't have to panic.
Re:Ugh, single bit errors by jon3k · 2010-06-25 09:43 · Score: 1

It's not even close to 5x; it's not even 2x. Feel free to check out newegg.com and compare. In fact there are some cost breakdowns earlier in the comments now.
Re:Ugh, single bit errors by hawguy · 2010-06-27 16:09 · Score: 1

But my hard disk already uses ECC to check for bit errors at the block level, so I'm already paying for ECC on the hard drive. And the SATA path to the hard drive also uses some CRC or ECC checking.
While my video card may have a flipped bit, it's not going to silently corrupt my data.
I've seen estimates ranging from 1 flipped bit per month to 4 flipped bits per day for 4GB of RAM -- do you have evidence that bit flips are rare aside from anecdotal evidence that people aren't reporting corruption all the time?

WOW! Cosmic rays? by aqk · 2010-06-24 11:16 · Score: 0

No wonder I have so many errors on my old laptop! This explains everything!

--

.
- aqk
F U /. illiterates.

ask Voyager 2 program managers by jschen · 2010-06-24 11:16 · Score: 1

I was hoping this would be more info about the Voyager 2 incident that occurred recently. No doubt, a detailed account of what they recently went through to find and fix the problem would be most interesting.

Re:ask Voyager 2 program managers by Anonymous Coward · 2010-06-24 18:57 · Score: 0

I was hoping this would be more info about the Voyager 2 incident that occurred recently. No doubt, a detailed account of what they recently went through to find and fix the problem would be most interesting.
Yeah I couldn't understand why people were so mystified by the flipped bit on Voyager 2 - attributing it to alien meddling - when the thing is probably intersecting with all sorts of cosmic rays. I'm surprised this hadn't happened earlier...

Takes me back by tsotha · 2010-06-24 11:16 · Score: 4, Interesting

When I was in college one of my physics professors told us he doubted programs would ever get bigger than a few hundred kilobytes because cosmic rays would cause the larger programs to fail too frequently.

Re:Takes me back by griffjon · 2010-06-24 13:11 · Score: 1

There's a Redmond joke in here somewhere. Regardless, I'm going to start blaming all my typos on bitflips caused by cosmic rays.

--
Returned Peace Corps IT Volunteer
Re:Takes me back by pjy · 2010-06-24 14:49 · Score: 1

640K is enough for anybody!
Re:Takes me back by kcelery · 2010-06-24 16:00 · Score: 1

to be fair, the price tag for 640K was big.
Re:Takes me back by Thanatos81 · 2010-06-24 18:45 · Score: 2, Informative

To be fair, Gates never said that line. http://en.wikiquote.org/wiki/Bill_Gates#Misattributed
Re:Takes me back by Jurily · 2010-06-24 21:56 · Score: 2, Funny

larger programs to fail too frequently
We showed him right, huh?
Re:Takes me back by roman_mir · 2010-06-24 23:45 · Score: 1

That is amusing. Of course almost any statement has something in it that can seem insightful, especially looking at it from the future.
Applications are modularized, most success in operating systems comes from simplifying the individual components and increasing connections between those components. Of course the cosmic rays do not cause as much damage to our machines as some had imagined they would, but our large programs have plenty of problems that are not due to any cosmic rays but are there simply because of how large the programs are, and that is a function of the amount of stuff that our programs do nowadays.
The biggest mistake of your ex-professor is this fundamental misunderstanding of reality, where bugs or any other failures are accepted as long as the failures and problems are offset by the usefulness of the application (product/service) itself.
I mean think about it, the BP Oil leak can actually destroy the Gulf of Mexico and even other larger parts of the ocean and of the coastal lines, but we still are going to use oil, gas and coal just because of how useful they are to us.

--
You can't handle the truth.
Re:Takes me back by BikeHelmet · 2010-06-28 16:21 · Score: 1

When I was in college one of my physics professors told us he doubted programs would ever get bigger than a few hundred kilobytes because cosmic rays would cause the larger programs to fail too frequently.
To be fair to him, a lot of modern programs aren't all that large. Oh sure, there's icons and text and tons of data - but executable code is usually a couple megabytes or less, even for large games. And usually those megabytes can be stripped away by turning off optimizations like inlining, or using one of those strip tools that removes duplicate code or debugging data.
Even though modern OS's are very complex, and guzzle down memory, it's still a whole bunch of tiny programs running, and surprisingly little of that memory used is executable code.
And that's good, because big programs inevitably crash. :P

Easter Earthquake by ushering05401 · 2010-06-24 11:17 · Score: 5, Interesting

I don't know about cosmic rays, but immediately following the Easter day Earthquake in Guadalupe Victoria (about three hundred miles from where I was located) I tried to fire up my laptop and then my desktop, both of which had been suspended to RAM. Neither one would wake up, though the lappie displayed a garbled screen. No errors in the log files (Ubuntu 9.10 on the sys76 lappie, Deb Lenny on desktop).

Re:Easter Earthquake by Darkness404 · 2010-06-24 11:21 · Score: 3, Insightful

Wouldn't that be more likely caused by fluctuations in the power supply though? I'm not an electrical engineer nor an expert on earthquakes, but wouldn't it be possible that a quick loss of power or too high of power for a split second could mess up the data on the RAM?

--
Taxation is legalized theft, no more, no less.
Re:Easter Earthquake by ushering05401 · 2010-06-24 11:24 · Score: 1

I was thinking EMP related to the seismic activity. IIRC that is still somewhat controversial though.
Re:Easter Earthquake by timeOday · 2010-06-24 13:02 · Score: 1

In all my years of attempting to use suspend-to-RAM on Linux, it has always been, ahem, highly probabilistic.
Re:Easter Earthquake by Culture20 · 2010-06-24 13:54 · Score: 2, Interesting

Or RAM contact points during shaking: contact, no contact, contact, no contact, different contacts at different milliseconds.
Re:Easter Earthquake by drinkypoo · 2010-06-24 23:06 · Score: 1

Or RAM contact points during shaking: contact, no contact, contact, no contact, different contacts at different milliseconds.
A computer that experiences sufficient vibration during a quake to separate a spring-loaded pin from a memory module is utterly likely to have more problems than being unable to resume from suspend mode. It's more likely that they failed to resume due to running Linux :)
(I kid, but only one of the half-dozen machines I am using right now resumes correctly when running Linux)

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"

RAM error? by Camel+Pilot · 2010-06-24 11:19 · Score: 5, Interesting

Forget a RAM error, I have seen a bit on a file on the disk flip.

After years of successful operation a Perl script quite working. On investigation a G was transformed to a W a difference of one bit. The file mod date was years old.
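The "difference of one bit" claim is easy to check from the ASCII codes. This is a small sketch; `differing_bits` is a made-up helper name.

```python
def differing_bits(a: str, b: str) -> int:
    """Count the bit positions in which two single characters differ."""
    return bin(ord(a) ^ ord(b)).count("1")


# 'G' is 0x47 (0100 0111) and 'W' is 0x57 (0101 0111): only bit 4 differs.
g_vs_w = differing_bits("G", "W")

# Flipping that one bit in 'G' yields 'W'.
flipped = chr(ord("G") ^ 0x10)
```

Here `g_vs_w` is 1 and `flipped` is `"W"`, so a single flipped bit is enough to turn one letter into the other.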

Re:RAM error? by hhedeshian · 2010-06-24 11:25 · Score: 1

Yes, so have I (though not quite as entertaininWly as your account). I keep secretly hoping that people's disks corrupt themselves every time they laugh at me for suggesting that ZFS and BTRFS have a place in this world.
Re:RAM error? by marcansoft · 2010-06-24 11:25 · Score: 5, Interesting

I experienced almost exactly that issue with a RAM error. My system was apparently stable, and then one day I got a syntax error in a system Perl script: one character had changed. The script was owned by root and otherwise untouched. After puzzling over it for quite a while I realized it could be a RAM error and ran memtest86. It reported a single permanently stuck bit in my 512MB of RAM. I found a kernel patch to manually mark problem RAM areas as reserved and kept on running with that RAM for a few years.
Are you sure that perl script issue was caused by a drive error? A RAM error can cause the same apparent problem, if the corruption happens in the kernel's cache. However, it shouldn't be permanent as it will not be written back to disk (the cache won't be dirty) unless someone actually modifies the file.
Re:RAM error? by History's+Coming+To · 2010-06-24 11:32 · Score: 2, Funny

Aha, my plan worked perfectly *rubs hands in delight*. I hack the entire internet at once by flipping single bits on a large number of machines. The maths is kind of chaotic. It's fun to track viruses as ant-algorithm analogies too.

--
Please consider this account deleted, I just can't be bothered with the spam anymore.
Re:RAM error? by Saint+Stephen · 2010-06-24 11:32 · Score: 1

Would the perl script be loaded at the same address in RAM every time? Wouldn't that likely be a one-time unrepeatable problem?
Re:RAM error? by marcansoft · 2010-06-24 11:43 · Score: 2, Informative

The perl script will stay cached until something else pushes it out of RAM or until you reboot the system. In general, files are loaded once and stick around for quite a while unless you're low on RAM. In my case, it stayed cached while I investigated it, and I could see the broken character with various viewers. Bad RAM could also cause an intermittent issue if it happened to affect memory used by the Perl interpreter to load the file (that would change each time), but in this case it affected the kernel's file cache, which is quite persistent in the medium or even long term.
I probably had the RAM error for a long time and never noticed. It likely caused a few kernel panics and segfaults along the way, but I probably attributed those to stuff like buggy X11 drivers. The broken Perl script was the first odd thing that I could directly attribute to a RAM problem, later confirmed with memtest86 (the broken bit also matched the change that happened to the character).
Re:RAM error? by Chris+Burke · 2010-06-24 11:44 · Score: 2, Informative

Would the perl script be loaded at the same address in RAM every time? Wouldn't that likely be a one-time unrepeatable problem?
If the stuck bit was in the file cache, then it would be repeatable for as long as the script stayed cached, plus you could load the file up in a text editor and see the changed character, etc. Then it would mysteriously go away.

--

The enemies of Democracy are
Re:RAM error? by Vellmont · 2010-06-24 11:54 · Score: 1

How did you verify it was actually on the disk, and not read from disk cache in memory?
Disk sectors have CRC checksums on them, so it's just extremely unlikely the bits flipped on the physical medium. It seems even less likely the bit got flipped somehow that caused a write to disk (and your file mod date would suggest this was unlikely as well).

--
AccountKiller
Re:RAM error? by JWSmythe · 2010-06-24 11:58 · Score: 1

I'd second the idea of a filesystem error. I had a mystery error show up similar to what he described. Someone modified one of my files, only changing one character. I was the only one with access to the machine. I fixed it, and voila, problem solved. A few weeks later, filesystem errors started showing up in the system log. It was a failing drive, not just a dirty filesystem. It must have been cosmic radiation damaged the disk. :)

--
Serious? Seriousness is well above my pay grade.
Re:RAM error? by Anonymous Coward · 2010-06-24 12:01 · Score: 1, Funny

Just goes to show you, computers are a bit pedantic.
Re:RAM error? by hondo77 · 2010-06-24 12:06 · Score: 1

Forget a RAM error, I have seen a bit on a file on the disk flip.
After years of successful operation a Perl script quite working. On investigation a G was transformed to a W a difference of one bit. The file mod date was years old.
Ditto, except it was something like a w to a 7.

--
I live ze unknown. I love ze unknown. I am ze unknown.
Re:RAM error? by Rinikusu · 2010-06-24 12:16 · Score: 2, Funny

/*After years of successful operation a Perl script quite working*/
And a bit flipped to an e?

--
If you were me, you'd be good lookin'. - six string samurai
Re:RAM error? by Xyrus · 2010-06-24 12:29 · Score: 1

Quit working? I'm surprised that didn't turn your perl script into pong.

--
~X~
Re:RAM error? by Anonymous Coward · 2010-06-24 12:39 · Score: 0

RAID-1 with checksumming + ZFS + full scrub every week.
It will detect possible flips and restore proper file content / metadata from second hard drive.
Actually ZFS, even without RAID-1 (mirror), automatically keeps 2 or 3 copies of metadata (depending on importance) even on a single disk (they are on different areas of the disk for additional safety), so even with a single disk it can help prevent bit flips. This can be enabled for data using copies=2 or copies=3 (paranoia) attributes.
Re:RAM error? by Anonymous Coward · 2010-06-24 12:46 · Score: 0

I'm sorry. I messed with my butterflies and flipped your bit. I'll take more care next time.
Re:RAM error? by agw · 2010-06-24 14:56 · Score: 1

A few years ago I noticed that a file on a disk (probably Windows) had a slightly changed file name (one character was different). I checked and the character was just one bit off the one it should have. Of course I don't have proof the disk was at fault. I'm sure anything inside the disks is protected. Most likely the "disk errors" are just memory errors written back to the disk.
Re:RAM error? by jimmydevice · 2010-06-24 16:33 · Score: 1

The question of whether btrfs or any other complex, thinly redundant, state-dependent file system is inherently fragile is interesting. I would like to subscribe to your mailing list.
Re:RAM error? by petsounds · 2010-06-24 18:19 · Score: 2, Funny

And 10,000 years from now, your Perl script has become the complete works of Shakespeare...
Re:RAM error? by prockcore · 2010-06-24 19:15 · Score: 1

On investigation a G was transformed to a W a difference of one bit.
That is unlikely due to the way magnetic media actually stores data. Bits are stored as changes in polarity. No change in polarity means 0, change means 1 (or vice-versa), and for many low-level format types, all bytes in a sector are XORed with the previous byte. Changing the polarity of 1 tiny part of the disk will change at least 2 bits, and corrupt the entire sector.
So a disk that's +++-+++ is actually 001100. Change that - to a +, and it becomes 000000. In order to change just 1 bit, you have to reverse the polarity of every bit in the sector after it... like trying to untwist a rope from the middle.
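The polarity example above can be sketched in a few lines. This models only the simplified "a transition means 1" scheme as the comment describes it, not the encoding of any real drive, and `decode_transitions` is a made-up name.

```python
def decode_transitions(polarity: str) -> str:
    """Decode a string of '+'/'-' domains: a polarity change is 1, no change is 0."""
    return "".join(
        "1" if left != right else "0"
        for left, right in zip(polarity, polarity[1:])
    )


original = decode_transitions("+++-+++")     # "001100"
one_domain = decode_transitions("+++++++")   # "000000": two bits changed
rest_flipped = decode_transitions("+++----") # "001000": only one bit changed
```

Flipping the single `-` domain alters two decoded bits, while changing just one decoded bit requires inverting every domain after that point, which is the "untwisting a rope from the middle" problem described above.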
Re:RAM error? by noidentity · 2010-06-24 20:02 · Score: 1

One time I heard a glitch in some music files. I examined them with a hex editor and there were some blocks of zeroes where there shouldn't be, beginning and ending on multiples of 512. I'm guessing that was some corrupt hard disk sectors or a bad transfer. Sucky either way, but fortunately just music. Overall, hard disks are quite impressive. I suppose if they could use similar error-detecting and remapping technology in primary RAM, they'd be using much lower quality memory chips.
Re:RAM error? by Megane · 2010-06-24 23:44 · Score: 1

It was probably a bad bit in your disk cache. Did you check it again after a reboot?
Last year I ended up with a random stick of Patriot PC3200 that someone had discarded. So I stuck it in my old G4, which uses PC2700. A few days later, BitTorrent was complaining that the contents of a file were wrong. It knew this because a .torrent file is a very good checksum of the file that you are downloading. I think I even rebooted and it did it again a couple of days later.
That's when I realized that it must have been due to a bad RAM bit, and that bit kept ending up in the disk cache area. Since BT was downloading data equivalent to a sizable proportion of the installed RAM, most of the disk cache would have contained the downloaded files.
I took the stick out and the problem hasn't come back since. Maybe that had something to do with why the stick was discarded. (And I can't run memtest86 since a G4 has no "86" in it.)

--
#naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
Re:RAM error? by Tolkien · 2010-06-25 03:13 · Score: 1

That's simple hard disk corruption and it happens every day without you realizing it. Why do you think you have to re-download music files once in a while because they start getting skippy? :)

--
how is babby formed?

It's not cosmic. It's from the die/package by EmagGeek · 2010-06-24 11:19 · Score: 5, Informative

Soft errors in DRAM are far more likely to be the result of alpha particle decay from materials in the die and packaging.

Re:It's not cosmic. It's from the die/package by cusco · 2010-06-24 11:24 · Score: 2, Interesting

People don't realize that lead is mildly radioactive, and the decay from solders on the connectors or chassis can also cause bit flips. Very old processed lead, such as that used for the roofs of some European cathedrals, has been used to build supercomputers since more of the radioactivity has decayed.

--
"Think about how stupid the average person is. Now, realise that half of them are dumber than that." - George Carlin
Re:It's not cosmic. It's from the die/package by Vellmont · 2010-06-24 11:36 · Score: 2, Interesting

Very old processed lead, such as that used for the roofs of some European cathedrals, has been used to build supercomputers since more of the radioactivity has decayed.

That sounds a bit fishy.
I _think_ I might be willing to believe the radioactivity of lead, presumably from contamination through some other source radioactive mineral in the ore that decays into radioactive lead. What I have a hard time believing though is that supercomputer makers wouldn't just use non-lead solder, which has been around for years and has actually been mandated for use in recent years in electronics.

--
AccountKiller
Re:It's not cosmic. It's from the die/package by Anonymous Coward · 2010-06-24 11:38 · Score: 1, Interesting

And in lead-free solders, frequently the indium. Sure, it decays really, really, really slowly, but when you're looking at literally single particles potentially causing an issue, it's just another possible cause.
Re:It's not cosmic. It's from the die/package by JesseL · 2010-06-24 11:47 · Score: 1

Never worked with lead-free solder have you?
It's only very recently that it's become practical for widespread use and it's still not settled how well it will work in applications that require maximum reliability. The problems with higher melting points, reduced wetting, tin whiskers, appropriate fluxes, etc. took a long time to sort out.
I'm sure that when a lot of early supercomputers were being built the components used would have been destroyed by the temperatures required to solder without lead.

--
"Prefiero morir de pie que vivir siempre arrodillado!"
Re:It's not cosmic. It's from the die/package by rolfeb · 2010-06-24 11:51 · Score: 1

People don't realize that lead is mildly radioactive, and the decay from solders on the connectors or chassis can also cause bit flips. Very old processed lead, such as that used for the roofs of some European cathedrals, has been used to build supercomputers since more of the radioactivity has decayed.
I'm unclear as to how this "processing" of the lead has reduced its natural radioactivity...

--
"Think about how stupid the average person is. Now, realise that half of them are dumber than that." - George Carlin
OK, I guess you win!
Re:It's not cosmic. It's from the die/package by Vellmont · 2010-06-24 12:26 · Score: 2, Interesting

Maybe. It just sounds like an urban legend to me. I was also able to find a 25-year-old patent claiming that gold-tin solder assured high reliability in chip making.
http://www.google.com/patents/about?id=MZY1AAAAEBAJ&dq=4512950

--
AccountKiller
Re:It's not cosmic. It's from the die/package by Anonymous Coward · 2010-06-24 13:20 · Score: 3, Informative

People don't realize that lead is mildly radioactive, and the decay from solders on the connectors or chassis can also cause bit flips. Very old processed lead, such as that used for the roofs of some European cathedrals, has been used to build supercomputers since more of the radioactivity has decayed.
I'm unclear as to how this "processing" of the lead has reduced its natural radioactivity...
Pb-210 is in the U-238 and Rn-222 decay chains, so lead ore in the ground has a constant source of Pb-210 being generated due to uranium contamination. Likewise, radon gas can seep into the lead ore deposits and provide a fresh influx of Pb-210. Once the lead is smelted and purified, the uranium contamination is removed and it's no longer being exposed to radon, so the number of Pb-210 atoms in the sample starts decreasing significantly.
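To put rough numbers on that last point, here is a minimal sketch. It assumes a Pb-210 half-life of about 22.3 years (a commonly cited figure that is not stated in this thread) and no fresh supply from uranium or radon once the lead has been smelted. The function name is made up for illustration.

```python
# Fraction of the original Pb-210 still present after a given time, assuming
# plain exponential decay with no replenishment (i.e. after smelting).
PB210_HALF_LIFE_YEARS = 22.3  # assumed value, not taken from the thread


def pb210_fraction_remaining(years):
    """Return the surviving fraction of Pb-210 after `years` years."""
    return 0.5 ** (years / PB210_HALF_LIFE_YEARS)


for age in (0, 22.3, 100, 500, 2000):
    frac = pb210_fraction_remaining(age)
    print(f"{age:>6} years: {frac:.3e} of the original Pb-210 remains")
```

Under these assumptions, lead that has sat on a roof for several centuries retains only a vanishing fraction of its original Pb-210, which would be consistent with the appeal of very old lead.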
Re:It's not cosmic. It's from the die/package by Anonymous Coward · 2010-06-24 13:24 · Score: 0

I think maybe the point is more to find metals smelted before we detonated thousands of nuclear bombs in the atmosphere. Smelting takes a lot of air to work, and modern metals will have incorporated radionuclides during that process.
Re:It's not cosmic. It's from the die/package by cusco · 2010-06-24 13:50 · Score: 1

Thanks, that's it. Couldn't remember the process, just that the process stopped after refining.

--
"Think about how stupid the average person is. Now, realise that half of them are dumber than that." - George Carlin
Re:It's not cosmic. It's from the die/package by locofungus · 2010-06-24 22:26 · Score: 1

Steel is recovered from German warships sunk at Scapa Flow in 1919. Because the steel was forged prior to the nuclear age and the ships have been lying in 20+m of water this steel is extremely low in radioactive carbon.
Tim.

--
God said, "div D = rho, div B = 0, curl E = -@B/@t, curl H = J + @D/@t," and there was light.
Re:It's not cosmic. It's from the die/package by Rigrig · 2010-06-25 04:25 · Score: 1

There are actual radioactive lead isotopes, and they can interfere with sensitive particle experiments, so neutrino hunters are apparently very happy with Roman lead.
I've never heard of it being a problem in supercomputers though, and if your computer flips bits from the tiny bit of radiation lead produces I'd imagine you might be doing something wrong.

--
**TODO** [X] Steal someone elses sig.

Re:erm.... by JesseL · 2010-06-24 11:21 · Score: 3, Informative

Would it really be so hard to read the article before posting?

--
"Prefiero morir de pie que vivir siempre arrodillado!"

faulty RAM by mojo-raisin · 2010-06-24 11:21 · Score: 4, Interesting

I've been working with some large microarray datasets recently, and so had to double my computer's memory to 8GB.

As I've done for years, I went to Fry's to get some Corsair chips... installed F13 64bit to replace my older 32bit distro... and crash-o-matic began. Mostly from Chrome and Mercurial.

I ran memtest86+ and, sure enough, verified my first purchase of faulty memory.

So, I went back to Fry's and exchanged for another pair of Corsair 2GB chips. This time, I ran memtest86+ first thing... ANOTHER bad set, so back it went to Fry's.

*Third* set of memory was Kingston, and a trip through memtest86+ verified no errors. Yay!

Computer has been stable, too.

With more and more RAM in computers, my next box will have ECC.

Re:faulty RAM by Anonymous Coward · 2010-06-24 11:33 · Score: 0

Got bitten by "performance" RAM twice. First, one stick of 2*1GB Kingston HyperX DDR developed a single-bit error after about a year. Second was OCZ DDR2: the box suddenly locked up and wouldn't even POST with that stick installed; swap sticks, memtest, 1000s of errors = dead chip. After that I got Kingston ValueRAM ECC for my AMD64 boxes, no problems ever since.
Re:faulty RAM by jd · 2010-06-24 11:55 · Score: 1

As RAM gets ever-larger, densities get ever-greater, and the energy requirements for corruption get ever-smaller, the amount of error-correction needed is going to increase. That seems obvious. Well, to an extent. There are space-rated chips that use lead-lined casing to make them radiation-resistant. Having the motherboard run cooler will decrease the thermally-generated random noise in the system. If you're using a full-immersion system, the coolant might easily absorb some of the cosmic rays not otherwise blocked. So you have plenty of options in that direction.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:faulty RAM by Burdell · 2010-06-24 12:31 · Score: 2, Informative

Did you buy all new RAM, or add to existing? If you added to existing, did you test just the new RAM, or with the existing in there as well?
Lots of RAM has different timings these days, and even when the timing is supposed to be the same, I've seen new RAM cause problems with old RAM to surface (possibly also from temperature changes). I had a system with 2G (2x1G) Corsair RAM, and then I added another 2G (2x1G) of the same model Corsair; the system started crashing. I assumed (as most would) that the problem was the new RAM. I ran memtest86+ for about 18 hours on just the new RAM and had no problems. I stuck the original 2G back in and the system crashed; I ran memtest86+ on just the old RAM; no problem. With all 4 sticks in, memtest86+ would show errors. By moving sticks around and figuring out the address mapping on my system, I tracked it down to one of the original sticks. I then ran memtest86+ for about 48 hours on just that stick, and it did eventually show an error (Corsair replaced it and I have had no more problems).
RAM generates a good bit of heat these days, and adding RAM generates even more heat in a small space. My faulty RAM has the heat spreaders included, but the motherboard puts the RAM slots so close together there's still little space for heat to dissipate.
Re:faulty RAM by DigiShaman · 2010-06-24 14:51 · Score: 1

Almost always it's a memory timing issue. This is especially true for "performance" type memory where the timing tables stored on the SPD chip are not optimal. If you have an ASUS motherboard or some such, they will generally have a plethora of RAM timings and voltages you can adjust to compensate. Unless you're a tuner/gamer I would avoid this genre of RAM.
Not trying to slashvertise here, but purchase your memory from Crucial.com. They provide a nice drill-down menu that will display the *exact* module you need for the make/model of PC or motherboard. From servers, desktops, and laptops. I've never had problems with their memory. It's not l337 fast memory, but it's rock solid and dependable.

--
Life is not for the lazy.
Re:faulty RAM by Anonymous Coward · 2010-06-24 17:05 · Score: 0

Welcome to Fry's QC. Best prices, worst retardibility.
Re:faulty RAM by dargaud · 2010-06-24 21:29 · Score: 1

My previous mobo had ECC, but on the current one I decided that it was too expensive as I wanted 8GB. But when I run at the nominal RAM speed, I get random system freezes. I've had to downclock the CPU and RAM a notch. Yeah, the more bits you have, the more likely ECC will be useful; that was a bad move on my part.

--
Non-Linux Penguins ?
Re:faulty RAM by Agripa · 2010-06-25 04:19 · Score: 1

As RAM gets ever-larger, densities get ever-greater, and the energy requirements for corruption get ever-smaller, the amount of error-correction needed is going to increase.
It is not quite that simple. The struggle to make smaller DRAM storage capacitors without lowering the stored charge means that the charge stored per volume has gone up while the volume has gotten smaller making the cell a smaller target and more resistant. Any given ionization event ends up spreading the same total charge over a greater number of increasingly resistant DRAM cells. The result is that the susceptibility to radiation induced soft errors has leveled off or even gone down per bit of storage in the last couple of generations while the number of bits installed in each system has gone up thereby making up for any improvement.
In contrast to the DRAM arrays themselves, the DRAM sense amplifiers and associated logic have become more susceptible, such that they increasingly contribute to the soft error rate. For similar reasons, SRAM cache is an order of magnitude more sensitive to radiation-induced soft errors than the DRAM arrays, but it has been protected by ECC almost since it was integrated with the CPU.
Re:faulty RAM by mojo-raisin · 2010-06-25 05:32 · Score: 1

I was adding to existing memory, but I tested the 2 old sticks and 2 new sticks separately.
The 2 old sticks were faultless, while the 2 new sticks gave >1000 errors in tests 6 & 7 of memtest86+.
When I tested batch #2, same deal: I left out my previous sticks that I would be adding to, and once again tests 6&7 revealed many errors.
Batch #3 tested well on its own. I haven't run the test with all 4 sticks plugged in at once... perhaps I should.
Re:faulty RAM by Anonymous Coward · 2010-06-25 07:01 · Score: 0

Whilst I normally find Crucial memory to be good, I have had one bad experience with some 1GB DDR memory bought in 2006 or so. This memory would consistently fail with 1000s of memtest errors after anything from a few weeks to a year or so. However, the last replacement (the 5th or 6th, IIRC) does seem to have lasted longer. At least Crucial were good about replacing it each time. It would appear that plenty of other people had exactly the same problem with this particular memory from Crucial, so I assume early models had a design or process flaw.

fascinating by vux984 · 2010-06-24 11:23 · Score: 4, Insightful

It's interesting to me because my first instinct would have been to assume something got corrupted, and my first step would have been to reboot. If the problem persisted through a reboot then I might have gone down the rabbit hole in similar fashion to try and find and fix the root cause.

There are enough software bugs, kernel bugs, driver bugs, hardware hiccups due to marginal equipment, power fluctuation, interference, random noise... and I suppose even cosmic radiation that I would rarely think to spend the time to trace a transient problem unless it was reproducible across reboots, or at least happened on multiple separate occasions.

Re:fascinating by amentajo · 2010-06-25 05:39 · Score: 1

Note that the article is a blog post on ksplice.com (Ksplice is a service that provides kernel updates without having to reboot... we've seen mention of this on Slashdot before.)
Yes, my first instinct would also be to reboot and try again. And I'm going to take a wild guess that the author (Nelson Elhage) had that thought too.
But Elhage is the main engineer behind Ksplice, and his business focuses on improving Linux's performance between reboots.
This tells me three things:
(1) He has a business interest to try, as much as possible, to avoid rebooting without determining the root cause,
(2) He's really good at it, and
(3) He's rooting for a hardware problem :-)
This sentence from the article's introduction suggests to me that this is what is going on:

I spent about half an hour digging to discover what had gone wrong, and eventually determined, conclusively, that my problem was a single undetected flipped bit in RAM.

Too bad many consumer mainboards don't support ECC by Goyuix · 2010-06-24 11:23 · Score: 1

Some of the nicer boards will tolerate ECC memory being inserted, but won't actually do any meaningful error correction (like scrubbing) - but a disturbingly large number of consumer boards (BIOS limitation perhaps?) don't actually do ANYTHING with ECC memory, and the really cheap ones won't even boot with it present. I used to have the same mindset of purchasing only ECC RAM for the same reason - but the unfortunate truth is that hardware support for it just isn't there without spending $$$ on a decent board too.

radioactive isotope in the chip by mirix · 2010-06-24 11:24 · Score: 3, Interesting

I would think it's more likely there are trace radioactive elements in the epoxy the chip is encapsulated in.

Actually, I recall reading that in the early solid state memory days, they had problems with this. I don't remember what the solution was, but I thought it was to make the circuit somewhat resilient to it, as it was impossible to get 100% neutral epoxy, there's always going to be traces of something radioactive.

I think they tested the cosmic ray theory by running the same chip with and without lead shielding, and did not find a significant difference in errors, they then assumed it was impurities in the chips themselves decaying.

--
Sent from my PDP-11

Old, old story by jmichaelg · 2010-06-24 11:30 · Score: 5, Interesting

Back in the early 80's, HP published a paper on random bit errors in RAM. They looked at chips from a variety of vendors and determined that the RAM coming out of Japan was the most reliable. That paper caused a lot of US RAM vendors to shutter their doors as there was a sea change in purchasing habits.

A few years later, I ran into John Sculley while we were waiting for a flight. I mentioned the paper to him and asked him how Apple could seriously expect to sell a Macintosh specifically aimed at the scientific community if it didn't have ECC. He blithely said "it's not a problem..." 20+ years hence and most of us still don't have ECC, so it seems he was right.

Re:Old, old story by Anonymous Coward · 2010-06-24 12:00 · Score: 5, Informative

For a more recent analysis (by folks at Google and U.Toronto) see "DRAM Errors in the Wild: A Large-Scale Field Study" in ACM SIGMETRICS/Performance 09.
They did an extensive analysis of DRAM failures from many vendors and debunk several myths as well as indicating that the soft error rate can be much higher than previously thought.
Well worth a read...
http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
Re:Old, old story by Timothy+Brownawell · 2010-06-24 12:38 · Score: 2, Informative

as well as indicating that the soft error rate can be much higher than previously thought.
I'm not sure it really does; true they had enormous average (mean) error rates, but it sounded like this was misleading due to an incredibly skewed distribution. Going by the number of servers with zero errors, one error, and multiple errors over a year, and the failures-vs-age data, I came to the conclusion that there's about a 1/5 chance that you'll see one random single-bit error over a typical lifetime (I think I used 5-6 years), but also a similar chance that part of your ram will go bad after a couple years and give you a sudden flood of errors. It would have been very nice if they'd counted servers with 0,1,2,3,...10-20, 20-50, ... etc errors/year (preferably with a pretty graph), instead of only breaking it into zero, one, many.
Re:Old, old story by antifoidulus · 2010-06-24 13:10 · Score: 1

Actually all of Apple's "pro" products (i.e. Mac Pros and Xserves) DO have ECC RAM, a decision that actually caused quite an uproar in the Mac community when it was first introduced (with the G5 Power Mac, IIRC). However, it has yet to trickle down into any of Apple's other products, which are all 100%* based on laptop components. Do laptops even have ECC RAM? With RAM densities increasing, bit errors like the one mentioned in the article are only going to increase.

*The quad-core iMacs have desktop CPUs in them, but every other component (including memory) is a part made for laptops.

--
Monstar L
Re:Old, old story by Trogre · 2010-06-24 13:42 · Score: 1

Except for the Mac Book "Pro", that is :)

--
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
Re:Old, old story by klashn · 2010-06-24 17:32 · Score: 1

Laptops with SO-DIMM memory currently do not have the option for ECC. There is no real demand for it yet with DDR3. Since we're getting into smaller form factor blade servers, I'd expect to see it starting with DDR4.
Re:Old, old story by Anonymous Coward · 2010-06-24 19:26 · Score: 0

A few years later, I ran into John Sculley while we were waiting for a flight. I mentioned the paper to him and asked him how Apple could seriously expect to sell a Macintosh specifically aimed at the scientific community if it didn't have ECC. He blithely said "it's not a problem..." 20+ years hence and most of us still don't have ECC, so it seems he was right.
The Macintosh IIfx (probably the most science-y Macintosh ever released under John Sculley) had parity-bit memory.
It really must not have been a sales issue, because after it was discontinued, Apple didn't bother with any sort of error tolerant memory until the G5 models about five years ago.
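For anyone who hasn't met parity memory, here is a minimal sketch of the idea: one extra bit per byte records whether the byte has an even or odd number of 1 bits. This is a generic illustration with made-up function names, not a description of the IIfx's actual memory controller. Parity can detect a single flipped bit but cannot say which bit flipped, so it cannot correct anything, and it misses double flips entirely.

```python
def parity_bit(byte):
    """Even parity: 1 if the byte has an odd number of 1 bits, else 0."""
    return bin(byte).count("1") % 2


def store(byte):
    """Model a stored byte as (data, parity) captured at write time."""
    return (byte, parity_bit(byte))


def check(stored):
    """Return True if the data still matches its recorded parity bit."""
    byte, parity = stored
    return parity_bit(byte) == parity


word = store(0x5A)
single_flip = (word[0] ^ 0x08, word[1])   # one bit flipped after the write
double_flip = (word[0] ^ 0x18, word[1])   # two bits flipped after the write

print(check(word))         # intact data passes
print(check(single_flip))  # a single flip is detected
print(check(double_flip))  # a double flip slips through
```

ECC memory goes one step further by storing enough extra bits to locate and fix a single flipped bit, at the cost of wider modules and controller support.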
Re:Old, old story by Anonymous Coward · 2010-06-24 19:48 · Score: 0

"I mentioned the paper to him and asked him how Apple could seriously expect to sell a Macintosh specifically aimed at the Scientific community if it didn't have ECC. "
Just avoid holding it in that way.

Cosmic Ray Protection... by r00tyroot · 2010-06-24 11:33 · Score: 2, Funny

I'm putting tinfoil hats on all of my servers, right away!

Re:Cosmic Ray Protection... by treeves · 2010-06-24 12:21 · Score: 1

Thick sheets of lead would work better than tin foil. (Pre-emptive whooosh.)

--
...the future crusty old bastards are already drinking the Kool-Aid.
Re:Cosmic Ray Protection... by fishexe · 2010-06-24 16:14 · Score: 1

I'm putting tinfoil hats on all of my servers, right away!
Aluminum foil, man! Aluminum! Aluminum is for the cosmic rays. Tin is for the orbital mind control lasers!

--
"I don't care about the Constitution!" --Bill O'Reilly, November 17, 2009

Google Study of DRAM Error Rates (Link Inside) by Anonymous Coward · 2010-06-24 11:41 · Score: 0

Link to Google's PDF is contained in this story : http://www.computerworld.com/s/article/9139161/Google_DRAM_error_rates_vastly_higher_than_previously_thought

From the article,

" A study released this week by Google Inc. and the University of Toronto showed that data error rates on dynamic RAM memory modules are vastly higher than previously thought and may be more responsible for system shutdowns and service interruptions.

The study (download .pdf), which used tens of thousands of Google's servers, showed that about 8.2% of all dual in-line memory modules (DIMM) are affected by correctable errors and that an average DIMM experiences about 3,700 correctable errors per year.

"Our first observation is that memory errors are not rare events. About a third of all machines in the fleet experience at least one memory error per year, and the average number of correctable errors per year is over 22,000," the report states.

"These numbers vary across platforms, with some platforms seeing nearly 50% of their machines affected by correctable errors, while in others only 12%-27% are affected."

The median number of errors per year on a Google server that had at least one error ranged from 25 to 611..."

Re:Google Study of DRAM Error Rates (Link Inside) by Timothy+Brownawell · 2010-06-24 12:49 · Score: 1

About a third of all machines in the fleet experience at least one memory error per year, and the average number of correctable errors per year is over 22,000," the report states.
But also, 93% of those with errors have multiple errors. This permits a bit of number crunching, to conclude that about 3% have single random errors in a year and about 30% probably have bad RAM or other hardware issues.
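The number crunching can be reproduced in a few lines. This sketch only uses the two figures quoted above (about a third of machines with at least one error, 93% of those with more than one); the variable names are mine, and treating "multiple errors" as a sign of bad hardware is the poster's interpretation rather than something the study states in those words.

```python
# Figures as quoted in the post above.
frac_with_errors = 1 / 3   # "about a third of all machines" see >= 1 error/year
frac_multiple = 0.93       # share of those machines that see more than one error

single = frac_with_errors * (1 - frac_multiple)   # exactly one error in a year
multiple = frac_with_errors * frac_multiple       # repeated errors in a year

print(f"machines with a single error per year:   {single:.1%}")
print(f"machines with multiple errors per year: {multiple:.1%}")
```

That works out to roughly 2% and 31% of the fleet, in line with the rounded figures in the post.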

Re:Too bad many consumer mainboards don't support by Mad+Merlin · 2010-06-24 11:42 · Score: 2, Insightful

This is one area where AMD is light years ahead of Intel. With Intel, you have to buy a Xeon and a server chipset to have ECC support, which basically is going to run you at least a grand or two just for the CPU and motherboard (at least if you want an i7 based Xeon). AMD on the other hand supports ECC across the board, and you just need a motherboard which supports it, which is most of them (total cost: <$500).

Thanks for the gouging Intel!

--
Game! - Where the stick is mightier than the sword!

All data channels are noisy by Anonymous Coward · 2010-06-24 11:46 · Score: 0

Any electronics/communications engineer will tell you that every data channel is noisy and you must expect corruption at some point, even if the odds seem vanishingly small. And no doubt in these super high transistor count and clock frequency CPUs and chips we are using these days there must be devices and methods used inside them to keep the logic transfer and computation validity on the straight and narrow.

Re:All data channels are noisy by Chris+Burke · 2010-06-24 11:52 · Score: 2, Informative

And no doubt in these super high transistor count and clock frequency CPUs and chips we are using these days there must be devices and methods used inside them to keep the logic transfer and computation validity on the straight and narrow.
Other than ECC on the cache arrays... No. Not a scrap.
If you want reliability on every internal signal and register against cosmic ray strikes, because you're a military or aerospace contractor, you pay boku bucks for it and settle for having way less than what we would currently call performance. And even then I highly doubt anyone is actually putting ECC on each and every bus or set of latches. You just radiation-harden the device as much as possible, and then use three of them so if one gets the wrong answer because of a particle strike, the other two will out-vote it.
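The "use three and let them vote" scheme is usually called triple modular redundancy. Below is a minimal software sketch of a bitwise majority voter, assuming each replica produces its result as an integer. Real systems do this in hardware, and the helper names here are hypothetical.

```python
def vote3(a, b, c):
    """Bitwise majority of three replica outputs: each result bit takes the
    value that at least two of the three inputs agree on."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(value, bit):
    """Simulate a particle strike by inverting one bit of a replica's output."""
    return value ^ (1 << bit)


correct = 0b1011_0010

# One replica gets the wrong answer; the other two out-vote it.
print(bin(vote3(correct, flip_bit(correct, 5), correct)))

# Two replicas are each hit in a *different* bit; per-bit voting still recovers.
print(bin(vote3(flip_bit(correct, 1), flip_bit(correct, 5), correct)))
```

The voter fails only when two replicas are wrong in the same bit position, which is why the replicas are kept as independent as possible.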

--

The enemies of Democracy are
Re:All data channels are noisy by Anonymous Coward · 2010-06-24 12:41 · Score: 1, Funny

you pay boku bucks for it
Is "boku" some sort of retarded mangling of beaucoup?
Re:All data channels are noisy by StikyPad · 2010-06-24 13:00 · Score: 2, Funny

Walla!

--
https://www.eff.org/https-everywhere
Re:All data channels are noisy by Chris+Burke · 2010-06-24 14:11 · Score: 2, Funny

Uh, no, not at all... *shifty eyes*

--

The enemies of Democracy are
Re:All data channels are noisy by Asic+Eng · 2010-06-24 15:54 · Score: 1

Devices like that are available in the automotive field, as well. Freescale makes this one: http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=MPC564xL It has ECC on RAMs and Flash, Logic BIST and Memory BIST at reset and two CPUs running in lockstep which are constantly monitored by hardware.
Re:All data channels are noisy by fishexe · 2010-06-24 16:15 · Score: 1

you pay boku bucks for it
Is "boku" some sort of retarded mangling of beaucoup?
Yes.

--
"I don't care about the Constitution!" --Bill O'Reilly, November 17, 2009
Re:All data channels are noisy by Chris+Burke · 2010-06-25 04:19 · Score: 1

Oh hey thanks for bringing that up. Mainstream processors have some BIST too, at least on the cache arrays. I didn't think of it because I was focused on detecting errors during operation.
You say they use two CPUs in lockstep, which got me thinking: why not three, so you can correct errors rather than just detect them? But automotive applications are very cost sensitive, and since a bad answer probably isn't going to make the car blow up*, flagging an error so the 'check engine' light can be turned on and the car taken in for repair is probably good enough. Is that a good guess?
* Or blow up at the wrong time, or not at all, if we're comparing to certain military applications. :)

--

The enemies of Democracy are
Re:All data channels are noisy by Asic+Eng · 2010-06-26 20:36 · Score: 1

Well you don't necessarily need to recover on the fly. Seeing that your device has somehow entered an invalid state you could go through reset, run BIST (to make sure that nothing is permanently damaged) and restart the application. Log the event, so that if it happens frequently you can conclude something is damaged. It really depends on what you want to do though - if you have a mechanical backup system (e.g. power-steering, power-brakes) then "fail silent" is good enough (manual steering and manual braking would still be available). If you only have that system (e.g. steer-by-wire) you would probably want "fail operational" (like triple-voting).
Of course, most of the errors are going to be in the software (that's about a factor 10 for the complexity of the overall system, so between 10 and 10^2 times more likely to attract errors), so realistically you'd probably be better off to have two independent algorithms checking each other. That's very expensive in terms of development effort though, and hardware is comparably cheap.

not cosmic most likely by Anonymous Coward · 2010-06-24 11:51 · Score: 0

People talking about bits flipping in RAM or on disk -- these are external bus errors due to noise before the data gets to the memory or disk drive.

Re:erm.... by JWSmythe · 2010-06-24 11:54 · Score: 1

I was ready to send him a link to purchase a tinfoil hat (and tinfoil server cover too), but in his article, he says it could be cosmic radiation, or flaky hardware. I'd lay money on the second, and not the first.

I used to joke that cosmic radiation made particular servers crash. We couldn't find any other reason for it, even with a fresh OS (that was identical to our other servers), and swapping various parts. Ya, cosmic radiation went through the building above us, to the server about 30 feet underground, and hit one in the middle of the rack, and not all the ones over it. It was good for the wheel of excuses, but (obviously) not a real answer. Oddly enough, the cosmic radiation stopped messing with that server when we finally took it out of service, and the one that went in the same position, with the same job, running the same OS did fine. :)

Ya, ya, I know, it's probably whatever part we didn't replace (the motherboard), but cosmic radiation sounded better. :) At the time there were quite a few news stories about it, so I was able to link to those in my report blaming cosmic radiation. :)

They call me crazy. I call myself eccentric with a sense of humor. :) My girlfriend at the time even made me a tinfoil hat, that I'd sometimes wear around the house as I babbled nonsense about impending alien invasions. :)

--
Serious? Seriousness is well above my pay grade.

Re:Too bad many consumer mainboards don't support by X0563511 · 2010-06-24 11:54 · Score: 1

Wrong. A few Dell PE servers have P4s in them, and -require- ECC memory.

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...

Also by Sycraft-fu · 2010-06-24 11:56 · Score: 5, Informative

Disks have a lot, and I mean a LOT of ECC on them. It is not a situation of "I need to write a 1 so I'll place one at this location on the drive." They use a complex encoding scheme so that bit errors on the disk don't yield data errors to the user.

Then there's the fact that bits aren't even stored as bits really. All current drives use (E)PRML which is (Enhanced) Partial Response Maximum Likelihood. What this means is bits aren't encoded as a high-low state or FM wave or any of that. They are written using flux reversals, but the level is not carefully controlled, it can't be. So when you read the data the drive actually looks at an analogue wave. It encodes the partial response it gets, and then finds the most likely pattern that matches.

Sounds like voodoo but works really well. Things are not simple thresholds or the like, it is a complex system and ends up being quite robust and resilient to error.

So it is highly unlikely that you had a bit flipped on a disk. Would require some amazing circumstances to happen. The RAM error is far more likely. Not just the cosmic ray thing but, as the parent noted, bad RAM. Normally when RAM fails, it fails catastrophically and it is immediately apparent. Not always though. It can not only fail on single bit locations, but only during certain ops. That is why memtest does so many different tests. One kind might work fine, another might fail. Rare, but I've seen it on a few systems.

Re:Also by marcansoft · 2010-06-24 12:24 · Score: 2, Informative

However, single-bit errors are possible with faulty disk hardware. The cache RAM on the disk or its interface can be flaky, and for PATA disks a bad cable can cause single-bit errors. SATA disks usually catch IO errors since they use a more complicated encoding and make use of checksums.
Re:Also by Scaba · 2010-06-24 12:54 · Score: 5, Funny

Then there's the fact that bits aren't even stored as bits really. All current drives use (E)PRML which is (Enhanced) Partial Response Maximum Likelihood. What this means is bits aren't encoded as a high-low state or FM wave or any of that. They are written using flux reversals, but the level is not carefully controlled, it can't be. So when you read the data the drive actually looks at an analogue wave. It encodes the partial response it gets, and then finds the most likely pattern that matches.
I doubt this is true. The disk would have to be spinning at 88 mph in order to activate the flux capacitor, and the power brick would need to supply 1.21 gigawatts to the drive, which exceeds the capacity of even the most tricked-out gaming PC. I think you'd better check your science, my friend.
Re:Also by Vellmont · 2010-06-24 13:17 · Score: 1

However, single-bit errors are possible with faulty disk hardware.

I'm sure you're right, but in this case there's essentially no way a disk hardware failure is going to cause the same bit to fail the same way, but no other bits fail.
In this case, I'd expect it's a bit flip in the OS disk cache.

--
AccountKiller
Re:Also by marcansoft · 2010-06-24 13:42 · Score: 1

The bit fails while it is read from the disk, then persists in the OS cache. The end result is the same (a corrupted OS cache), but the cause is different, as the bit flipped before it ever made it to the cache.
Re:Also by DigiShaman · 2010-06-24 14:38 · Score: 1

We use Dell PowerEdge servers with Dell OpenManage software installed. When a single-bit error occurs, it will log a warning identifying the module at fault. I've cleared the log and reseated the ECC RAM module only for it to happen again within a few minutes.
So yes, silicon chips (or gates inside) go bad. I can't tell you why or how exactly, just that they do.

--
Life is not for the lazy.
Re:Also by turing_m · 2010-06-24 15:15 · Score: 1

So it is highly unlikely that you had a bit flipped on a disk. Would require some amazing circumstances to happen.
Single bit errors happened 10% of the time at CERN. And if we discount a one-off problem with WD drive firmware that caused 80% of errors, this would shoot up to 50%.
http://www.zdnet.com/blog/storage/data-corruption-is-worse-than-you-know/191

--
If I have seen further it is by stealing the Intellectual Property of giants.
Re:Also by Bengie · 2010-06-24 15:36 · Score: 1

One of the popular Open Source clustered file systems, I think gluster, had/has a write up about how they moved to ECC memory on their nodes.
They had an issue where, about every TB of data written to the cluster, one of the nodes would report having different checksums than the other nodes. At first they thought it might be a hardware fault. Turns out switching to ECC completely removed the issue. They said the end result was that every so much data will inherently acquire an error when traveling through the memory.
It's not a matter of "if", but when.
Re:Also by Lehk228 · 2010-06-24 16:22 · Score: 1

if the disk cache has a bad bit you should get a fairly steady stream of errors at a rate of about (dataReadingRate/diskCacheSize)* numberBitsDamagedInCache
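A minimal sketch of that estimate, assuming a stuck bit corrupts whatever data is cycled through its cache location, so the error rate is the cache turnover rate times the number of damaged bits. The function name and the figures (50 MB/s of reads, an 8 GiB cache, one bad bit) are hypothetical, chosen only to give a sense of scale.

```python
def expected_error_rate(read_rate_bytes_per_s, cache_size_bytes, damaged_bits):
    """Estimated corrupted reads per second for a cache with stuck bits.

    Implements (dataReadingRate / diskCacheSize) * numberBitsDamagedInCache:
    the first factor is how many times per second the cache is cycled through.
    """
    turnovers_per_s = read_rate_bytes_per_s / cache_size_bytes
    return turnovers_per_s * damaged_bits


# Hypothetical figures, not measurements from the article.
rate = expected_error_rate(50e6, 8 * 2**30, damaged_bits=1)
print(f"about {rate * 3600:.1f} errors per hour")
```

By this estimate a permanently bad bit shows up many times a day under steady I/O, which is the point: a one-off corruption looks more like a transient flip than a stuck cell.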

--
Snowden and Manning are heroes.
Re:Also by noidentity · 2010-06-24 20:06 · Score: 1

I think your point about hard disk storage is that a single flipped bit is next to impossible, whereas an entire scrambled sector is much more likely (and even more likely still is an error that is silently corrected). So a single-bit error is much more likely due to RAM than magnetic storage.
Re:Also by Anonymous Coward · 2010-06-24 21:37 · Score: 0

Assuming a 12 cm platter diameter, it only takes (88 mph) / (2 * pi * (6 cm)) = 104.351318 hertz (revolutions per second) for the outer edge of the platter to achieve 88 mph.
Re:Also by SharpFang · 2010-06-24 22:27 · Score: 1

The essential difference is disk surface is not really discrete bit-wise.
In RAM, you get a path, a circuit of transistors and capacitors, and a bit of memory is built from them. They are a specific structure that holds one bit, and they can't hold 1.1 bit, or 0.8 bit - the volume is the circuit is the bit. On disk you have a continuous surface which you magnetize to your whim, and then read the layout of magnetic levels and interpret them - the fact the 0.0001mm length of surface corresponds to 1 bit is your own convention and you can change it at will, within noise levels and read speeds of your hardware. So you can reduce length of a disk bit, and work around reduced reliability with error corrections. You can't reduce the size of a RAM bit other than reducing the size of the electronics...

--
45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2
Re:Also by mattj452 · 2010-06-25 00:25 · Score: 1

The problem he is experiencing is most likely a hard drive problem. 1. He is consistently experiencing it. A restart of the computer would most likely have 'expr' end up in a different location in RAM, so the error should not be consistent. 2. There is nothing here indicating a single bit error. He sees that he suddenly tries to read from an invalid location. The problem can be either that the read address has been corrupted, or the instruction itself (i.e. it shouldn't have been a read). Also, there is nothing saying that this is the only place the fault occurs, only that this is the first place it occurs. The rest of the program may very well be corrupt, as can the entire sector.
Re:Also by Ambiguous+Coward · 2010-06-25 02:30 · Score: 1

You're still missin' all them jiggywhats.

--
Their may be a grammatical error, misspeling, or evn a typo in this post.
Re:Also by noidentity · 2010-06-25 05:10 · Score: 1

I was referring to the guy who had a single character change in his script, where the character change was a single bit flipping. I was noting that with a single-bit error like this, it's extremely unlikely to happen on a disk device.
Re:Also by Just+Some+Guy · 2010-06-25 07:02 · Score: 1

and the power brick would need to supply 1.21 gigawatts to the drive
I have an external 2.5GW PSU for my Radeon.

--
Dewey, what part of this looks like authorities should be involved?

Re:erm.... by sakdoctor · 2010-06-24 11:57 · Score: 2, Funny

My RAM is shielded against cosmic rays by my mother's basement.

"Could not reproduce" by Anonymous Coward · 2010-06-24 11:58 · Score: 0

Yeah, but can he find a way to reproduce the error?

Interesting tracking, unimportant issue. by Anonymous Coward · 2010-06-24 12:03 · Score: 0

Sure, ECC will be nice at some point on desktop boards.
But unlike what most of these studies *speculate*, RAM error rates are still negligible, even over years. And let's face it, there's no point panicking over a possible bit flip every month. The probability that your OS goes bad for another reason (disk corruption, bus errors, buggy software) is incredibly higher.
In any case, the point is that unless your data is bit-important, it does not matter. And very few applications need bit important data, and practically none that any desktop computer should be dealing with.

The one thing that would be interesting would be statistics over large sample sets, broken down by manufacturer. After all, we all remember the cheap generic DDR debacle a couple years ago.

Cosmic rays, my ass. Occam's Razor time. by Anonymous Coward · 2010-06-24 12:04 · Score: 5, Insightful

You are on the right track. As someone with over a quarter century of background in combined embedded software and hardware design (the most recent decade for life-dependent systems), it always amazes me how quickly pseudo-technical people jump to wild speculation for observations that they cannot explain.

They fail to understand that a hardware system is an imperfect representation of the theory (probably the biggest failure in the schooling of software developers, and even some hardware developers, is not getting this message into their heads). While they feel comfort in the theory of a binary system, they utterly fail to understand that our real systems, like us, are imperfect and, like us, live in an analog world. Simple things like temperature variations, noise from common (rather than cosmic) sources, marginal design timing, imperfect components, simple intermittents, etc., are 10^24 times more likely the cause.

But they're not as fascinating as wild speculation, are they?

Had a MySQL problem once. (Once... ha.) by falzer · 2010-06-24 12:14 · Score: 2, Interesting

I had a mysql replication server which was reading SQL commands from a binary log on a master server. One day after years of operation I noticed an update failed. I didn't see anything at first by looking at the query, but when I looked closely I noticed the query had a single character changed, and of that character only one bit had changed. It was something like a P becoming a Q and thus giving a syntax error.

True story.
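For anyone who wants to check a case like this, XOR-ing the two character codes shows exactly which bits differ. A short sketch; the helper name is made up for illustration:

```python
def flipped_bits(a: int, b: int) -> list:
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [i for i in range(8) if (diff >> i) & 1]


# 'P' is 0x50 and 'Q' is 0x51, so they differ only in the lowest bit.
print(flipped_bits(ord("P"), ord("Q")))  # [0]
```

A result with exactly one entry means the corruption is consistent with a single flipped bit rather than a mangled byte.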

Re:Had a MySQL problem once. (Once... ha.) by amentajo · 2010-06-25 05:03 · Score: 1

I guess that this shows that you have to... mind your P's and Q's when writing a program!
Re:Had a MySQL problem once. (Once... ha.) by noidentity · 2010-06-25 05:23 · Score: 1

I looked closely I noticed the query had a single character changed, and of that character only one bit had changed. It was something like a P becoming a Q and thus giving a syntax error.

You know what they say; mind your P's and Q's. They obviously foresaw binary computers and flipped bits.

Radioactive packaging by overshoot · 2010-06-24 12:18 · Score: 2, Interesting

I recall reading that in the early solid state memory days, they had problems with this. I don't remember what the solution was, but I thought it was to make the circuit somewhat resilient to it, as it was impossible to get 100% neutral epoxy,

The worst problem was with ceramic DIP packages -- the really good ones for when you needed reliability (partly because the plastic ones tended to allow moisture to get in, and then condensation on thermal cycling.) The standard ceramic packaging material contained trace amounts of thorium, which is an alpha emitter. The alpha bombardment was enough to flip bits.

There have been several fixes since then. Using materials that don't contain radioactive species was one. The one you're probably remembering is that the manufacturers apply a polymer coating to the surface of the die, which is enough to stop a lot of alpha particles and a fair number of electrons. Getting rid of lead in packaging is also good, because lead tends to contain some radioactive traces.

On the other hand, there's flat nothing to be done about cosmic rays and damn little to be done about X-rays and thermal noise (you do keep your memory cold, don't you? Thermal noise is proportional to KT/qe after all.) So at some point we get to where there are too many bits which need minimal energy to flip them -- and then you have errors.

Pity that so few mobos actually support ECC, though.

--
Lacking <sarcasm> tags, /. substitutes moderation as "Troll."

Re:Radioactive packaging by Anonymous Coward · 2010-06-24 13:46 · Score: 1, Interesting

It's also worth pointing out that it's not galactic cosmic rays (i.e. heavy ions) that are hitting the memory device. While traversing the miles of atmosphere, GCRs collide with atomic oxygen or hydrogen causing a particle / photon cascade. It's the neutron by-products that make it to the surface and in turn collide with atoms within the device. That final displaced atom is what will ionize a track through a memory cell transistor and cause the upset.
Re:Radioactive packaging by timnbron · 2010-06-24 16:22 · Score: 1

Very interesting. We had a problem in telephone exchanges about 25 years ago. All data was held in memory - no disk drives (except for billing records I recall). Some little old granny would mysteriously acquire a premium service. It only affected the lines that were hardly ever used. It was tracked down to "alpha particle corruption", which gradually eroded the charge, which effectively flipped the bit to a 1 and gave the subscriber a random service.
Don't know any more than that, but the old hand that described it to me, did so with unusual glee...

--
There are some who call me ... Tim.

Cosmic Rays Tend to Flip Multiple Bits by bezenek · 2010-06-24 12:20 · Score: 1

Cosmic ray events tend to affect multiple neighboring transistors. For this reason, they tend to affect multiple bits. However, by laying out memory cells so immediate neighbors are from different locations, the ability of single-bit-correction-double-bit-detection (SECDED) methods to detect most events is usually preserved.
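To make the layout point concrete, here is a toy sketch using an (8,4) extended Hamming code, which is a SECDED code, and a two-way bit interleave. Real memory uses much wider words (typically 64 data bits) and the controller's actual code and layout are not given in this thread, so everything below is an illustrative assumption.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    code = [0] * 8  # index 0: overall parity; indices 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); nibble is None when the word is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if sum(code) % 2 == 1:
        # Odd overall parity: treat as a single flip at index `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        return None, "double-bit error detected"
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


def interleave(word_a, word_b):
    """Lay out two codewords so physical neighbours belong to different words."""
    return [bit for pair in zip(word_a, word_b) for bit in pair]


def deinterleave(cells):
    return cells[0::2], cells[1::2]


# One particle strike upsets two physically adjacent cells.
cells = interleave(encode(0b1011), encode(0b0110))
cells[6] ^= 1
cells[7] ^= 1
a, b = deinterleave(cells)
print(decode(a), decode(b))
```

Because the two damaged cells belong to different words, each word sees a single-bit error and both are corrected. Had both flips landed in the same word, the decoder could only report a double-bit error.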

The main concern is for structures with no error correction, such as the gates in the processor pipeline. Several research ideas have been put forward. See here (PDF) for a good overview of the issues.

-Todd

--
Omne ignotum pro magnifico.

This would be important by overshoot · 2010-06-24 12:22 · Score: 0, Troll

People don't realize that lead is mildly radioactive

That is an important consideration for old computers (prior to 2005 or so.) The newer ones are pretty much lead-free.

Very old processed lead, such as that used for the roofs of some European cathedrals, has been used to build supercomputers since more of the radioactivity has decayed.

Billions of years in the ground, and only a few centuries on the roof and all of the radioactivity is gone! Wow!

--
Lacking <sarcasm> tags, /. substitutes moderation as "Troll."

Re:This would be important by Vellmont · 2010-06-24 12:45 · Score: 4, Interesting

Billions of years in the ground, and only a few centuries on the roof and all of the radioactivity is gone! Wow!

The author needs to provide a reference, but there's a few ways I can think of that a processing stage, and a few centuries would produce something less radioactive than something produced more recently. I think all of them stem from the ore containing a source material that gets separated through the refining process, but the daughter products from the source don't. Here's one scenario:
Ore = Lead + radio-isotope A + radio-isotope B.
radio-isotope A decays to radio-isotope B
radio-isotope A: 4 billion year half-life.
radio-isotope B: 20 year half-life, decays to stable isotope C.
During refining, radio-isotope A gets nearly completely refined out to parts per trillion. Radio-isotope B is similar to lead chemically, and remains at 1 part per million (at time of refining).
200 years go by. (10 half-lives of radio-isotope B)
Radio-isotope B is now at 1/2^10 of its original concentration, or about 1 part per billion. Significantly less than when it was first refined. The added radioactivity from radio-isotope A decaying into B is negligible due to the long half-life of A.
These numbers and process are obviously made up to show how it MIGHT work. It still remains to be seen if it's actually true or not.
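The arithmetic for that made-up scenario can be checked in a few lines. This is only a sketch of plain exponential decay using the hypothetical 20-year half-life and 1 ppm starting concentration from the example, not data about real lead.

```python
def remaining_fraction(elapsed_years, half_life_years):
    """Fraction of an isotope left after simple exponential decay."""
    return 0.5 ** (elapsed_years / half_life_years)


initial_ppm = 1.0  # hypothetical concentration of isotope B at refining time
fraction = remaining_fraction(200, 20)  # ten half-lives: 1/1024
print(f"{initial_ppm * fraction * 1000:.2f} parts per billion remain")
```

Ten half-lives leave 1/1024 of the starting amount, so 1 part per million drops to roughly 1 part per billion.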

--
AccountKiller
Re:This would be important by robthebloke · 2010-06-24 13:21 · Score: 1

Billions of years in the ground, and only a few centuries on the roof and all of the radioactivity is gone! Wow!
it was blessed....
Re:This would be important by Anonymous Coward · 2010-06-24 15:20 · Score: 0

If the cathedral tops are from Eastern Europe, the extra radioactivity from contamination from Chernobyl may well negate whatever tiny bonus of being 'older' they'd get.
There is also the matter of the lead being "older"... it isn't, unless some comparison lead was made in a lab somehow. All the lead ore might have been mined at different times but all the lead was present during the formation of the earth. One sample might come from older or newer asteroids during the earth's formation, I guess, but they'd still be billions of years old in any case.

Re:Too bad many consumer mainboards don't support by Anonymous Coward · 2010-06-24 12:33 · Score: 0

He's referring to all embedded memory controller Intel Parts. Supposedly the Core i3/i5 chips have ECC support enabled on them, but unfortunately none of the consumer boards support them (you'd need an 1156 server mobo + core i3/i5 cpu).

AMD's chips since Socket 939 have supported ECC out of the box. I haven't had a chance to test it myself, but if the Nforce M430 mobo I have will run ECC with a cheap low-end sempron, all my future cpu/mobo purchases will be AMD for just this reason.

Re:Too bad many consumer mainboards don't support by Anonymous Coward · 2010-06-24 12:34 · Score: 0

You do have to carefully check whether the motherboard manufacturer has included the BIOS support. The upper end models from ASUS and Gigabyte do generally support ECC, but the lower end models and the models from other popular "consumer grade" motherboard manufacturers don't generally include the support. Intel's recent Westmere i3 and i5 models do support ECC, but once again you need BIOS support, which in the case of an Intel based "consumer grade" motherboard is even more difficult to find. I don't know a single one. OEMs have their own BIOS versions and support for ECC with their single socket server models, like the other comment states. I'm writing this with my consumer oriented Phenom 9750 and 8GB of ECC memory, so any typos are most likely not my computer's fault. ;)

TFA by talcite · 2010-06-24 12:36 · Score: 1

I just read the article and it's quite good. The author goes into detail about how he used a series of checksums and source verification to find the bug, isolate it and fix it. I found it quite fascinating and I recommend reading it if you have a few minutes of time.

Just being pedantic by solarium_rider · 2010-06-24 12:37 · Score: 1

There is no such thing as ECC RAM. The ECC (usually Hamming) is performed by the memory controller. You can't just buy a 72-bit DIMM and use that in any old PC. You have to have a memory controller that supports ECC. It should also be noted that this kills performance by increasing latency (decode and encode the ECC bits) and may also require read-modify-writes.

--
-- How many sigs are as useless as this one?

Re:Just being pedantic by Timothy+Brownawell · 2010-06-24 13:08 · Score: 1

It should also be noted that this kills performance
By something like 1-5% if I remember correctly, which only matters in benchmarks and dicksize contests.

and may also require read-modify-writes
Um, yeah... that's only possible if you haven't and don't read anything on that same cache line, and even then mightn't happen based on what assumptions your cache makes or might be no different than non-ECC if your cache is only able to talk to your memory in units of a full cache line anyway.
Re:Just being pedantic by erice · 2010-06-24 16:11 · Score: 1

There is no such thing as ECC RAM. The ECC (usually Hamming) is performed by the memory controller. You can't just buy a 72-bit DIMM and use that in any old PC. You have to have a memory controller that supports ECC. It should also be noted that this kills performance by increasing latency (decode and encode the ECC bits) and may also require read-modify-writes.
Quite true and a bit of a lost opportunity. Internal to the DRAM, an entire row is accessible at one time. More bits would allow more efficient error correction methods. ECC computation would only need to be done when a row is opened or closed. The whole thing could be done transparently aside from somewhat longer delays to open and close rows. It would be useful, however, to have a method of informing the host controller if there were uncorrectable errors.
I doubt this will happen any time soon. The prevailing design philosophy has been to keep complexity out of the DRAM parts in order to control cost. Adding high speed logic for ECC would be a significant departure.
Re:Just being pedantic by Lehk228 · 2010-06-24 16:27 · Score: 1

only matters in benchmarks and dicksize contests.

any ECC wins over any Non-ECC in a DSW, even PC66 ECC

--
Snowden and Manning are heroes.
Re:Just being pedantic by klashn · 2010-06-24 17:27 · Score: 1

ECC RAM essentially means that there's an extra DRAM component that stores the ECC data, but yes, the memory controller has to support it.
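The size of that extra storage follows from the code itself. A rough sketch, assuming a Hamming code extended with one overall parity bit (SECDED), which is the usual textbook construction; actual controllers may use different codes.

```python
def secded_check_bits(data_bits):
    """Check bits needed for a Hamming SECDED code over `data_bits` data bits.

    A Hamming code needs the smallest r with 2**r >= data_bits + r + 1;
    one more overall parity bit adds double-error detection.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


for width in (8, 32, 64):
    print(width, "data bits ->", secded_check_bits(width), "check bits")
```

For a 64-bit data bus this gives 8 check bits, which is why ECC modules are 72 bits wide and carry an extra DRAM chip.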

Re:erm.... by Moodie-1 · 2010-06-24 12:37 · Score: 1

You live below your mother's basement??? (LOL!!) You'd have to, to be well-shielded from cosmic rays. Living in a basement doesn't shield you from rays coming from above. And even so, some rays are so energetic that they'll reach you even if you lived a mile underground in a mine.

Re:erm.... by Thing+1 · 2010-06-24 12:43 · Score: 2, Funny

My girlfriend at the time even made me a tinfoil hat, that I'd sometimes wear around the house as I babbled nonsense about impending alien invasions. :)

I am both shocked and amazed that you eventually broke up.

--
I feel fantastic, and I'm still alive.

Reboot? by fava · 2010-06-24 12:46 · Score: 1

The article author has obviously never used Windows. SOP would be a reboot, which would have solved the problem.

The whole thing would have taken minutes.

Re:Reboot? by Anonymous Coward · 2010-06-24 13:08 · Score: 1, Insightful

And leave you in a state of utter ignorance. It isn't about solving it, it's about understanding it.
Re:Reboot? by tomhudson · 2010-06-24 23:28 · Score: 1

And leave you in a state of utter ignorance. It isn't about solving it, it's about understanding it.

This is the 3rd ksplice blog article - and we're now 3 for 3 in crap. It wasn't cosmic rays - it was bad ram.
Who keeps submitting this shit, anyways?

Re:Cosmic rays, my ass. Occam's Razor time. by GNUALMAFUERTE · 2010-06-24 12:47 · Score: 1

Not only that, but they are also systems we can only approach from a very abstract perspective when it comes to debugging. Our options to debug complex hardware are very abstract, inaccurate, and incomplete.

--
WTF am I doing replying to an AC at 5 A.M on a Friday night?

Ah, Grasshoppah by gyrogeerloose · 2010-06-24 12:51 · Score: 1

"What is the sound of one bit flipping?"

Or

"If a disk crashes in a server farm and there's no one there to hear it, does it make a sound?"

--
This ain't rocket surgery.

Re:Ah, Grasshoppah by robthebloke · 2010-06-24 13:15 · Score: 1

no, because we run ssd's.... :p

Re:erm.... by JWSmythe · 2010-06-24 12:52 · Score: 1

Well fear not, it's been a series of upgrades since then. :) My girlfriend now is perfect, I can't imagine a better upgrade from here.

--
Serious? Seriousness is well above my pay grade.

Ksplice ... go figure by GNUALMAFUERTE · 2010-06-24 13:00 · Score: 5, Interesting

The guy that posted this is a Ksplice developer. In case you didn't know, Ksplice allows you to patch your running kernel without rebooting. Nice.

Anyway, this guy sees a random memory error. He conveniently goes on a debugging rampage, while we all know the most logical first step would be rebooting that damn machine. Random memory errors do happen.

He says he "hasn't gotten around" to memtesting his RAM yet. So, let me get this straight ... he implies that random cosmic rays caused the error, but he hasn't yet tested his RAM for what is the most likely cause of the issue?

Then he goes on to explain that you don't even need to reboot your machine due to damn cosmic radiation. Or kernel updates. Because you have Ksplice.

Come on.

--
WTF am I doing replying to an AC at 5 A.M on a Friday night?

Re:Ksplice ... go figure by fishexe · 2010-06-24 16:17 · Score: 1

The guy that posted this is a Ksplice developer...Then he goes on to explain that you don't even need to reboot your machine due to damn cosmic radiation. Or kernel updates. Because you have Ksplice.
Come on.
Hey, there have been weirder, more circuitous attempts to advertise on Slashdot.

--
"I don't care about the Constitution!" --Bill O'Reilly, November 17, 2009
Re:Ksplice ... go figure by b1t+r0t · 2010-06-24 23:28 · Score: 1

He says he "hasn't gotten around" to memtesting his RAM yet.
But if he did that, he'd lose his uptime!
Sounds like we need a Linux kernel module that does memtest86 type stuff during idle time, looking for RAM pages with hard errors. Or at least a checksum/ECC for the disk cache pages that can be checked during idle time. Once a bad page is found, the kernel can then lock it out.
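The checksum half of that idea can be illustrated in user space. This is a toy sketch, not kernel code: the `ChecksummedCache` class and its methods are invented for illustration, and CRC32 stands in for whatever checksum or ECC a real implementation would use.

```python
import zlib


class ChecksummedCache:
    """Toy page cache that records a CRC32 per page and can re-verify them."""

    def __init__(self):
        self.pages = {}  # key -> bytearray holding the cached data
        self.sums = {}   # key -> CRC32 recorded when the page was stored

    def store(self, key, data):
        self.pages[key] = bytearray(data)
        self.sums[key] = zlib.crc32(data)

    def scrub(self):
        """Re-check every page; evict and report the ones that changed."""
        bad = [k for k, page in self.pages.items()
               if zlib.crc32(page) != self.sums[k]]
        for k in bad:
            # Drop the page so the next read goes back to disk.
            del self.pages[k]
            del self.sums[k]
        return bad


cache = ChecksummedCache()
cache.store("expr", b"\x00" * 4096)
cache.store("libc", b"\x7f" * 4096)
cache.pages["expr"][100] ^= 0x04  # simulate a single flipped bit in RAM
print(cache.scrub())  # ['expr']
```

A scrub pass run at idle time would have caught the corrupted page from the article and forced a clean re-read, without a reboot. Locking out the physical page, as suggested above, needs kernel support and is not shown here.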

--
"Open source is good." - Steve Jobs
"Open source is evil." - Microsoft
Re:Ksplice ... go figure by tomhudson · 2010-06-24 23:33 · Score: 1

Like their other 2 articles that appeared on slashdot, it's just shit.
Not even high grade shit - it's the sort of stuff you get from some kid who "sort of" has a clue - a demonstration that a little knowledge is a dangerous thing.
Next ksplice article will be about the "black screen of death" virus that overtakes their laptops every day after an hour or two ... and how they were able to fix it by hardening their system doing blah blah blah ... because it couldn't be that their batteries would die after a couple hours use.
We need a -1 Fucktards mod for story submissions.
Re:Ksplice ... go figure by ShakaUVM · 2010-06-25 00:51 · Score: 1

>>He says he "hasn't gotten around" to memtesting his RAM yet. So, let me get this straight ... he implies that random cosmic rays caused the error, but he hasn't yet tested his RAM for what is the most likely cause of the issue?
Yeah, nerds are kind of weird like that. We'll "waste time" figuring out something simply because it is interesting.
>>Then he goes on to explain that you don't even need to reboot your machine due to damn cosmic radiation. Or kernel updates. Because you have Ksplice.
That's probably a more elegant tool than what our netops guy at UCSD used to do, which was to manually patch the kernel on the fly using a debugger, because he couldn't reboot the machine. I think it was because a custom application had been running on it for years, but no longer existed on disk anywhere.
Dude was pretty hardcore.
Re:Ksplice ... go figure by beleriand · 2010-06-25 02:01 · Score: 1

No, he didn't "see" a random memory error. Maybe you can see them, i sure as hell can not.
He noticed a program had stopped working for unexplained reasons, and did some pretty nifty debugging to get to the bottom of it. Yes, i'd have rebooted too. But not because i think that what he did was a waste of time, just wouldn't have known how to do it.
And as for memtest, yes you can use it to find out if you have really shitty ram. But it's no magic wand, what if the ram is working mostly fine, and an error appears on average only once a year? You'd have to run memtest for years to test for that with certainty, and what's the point of having a computer then if you can't use it?
What i found funny btw is the hostname "psychotique", no wonder it is giving problems.
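To put a number on the once-a-year case: a rough sketch, assuming errors arrive as a Poisson process at a constant average rate and that the test catches every error that occurs while it runs. Both are simplifying assumptions, and the function name is made up.

```python
import math

HOURS_PER_YEAR = 365 * 24


def detection_probability(errors_per_year, test_hours):
    """Chance of seeing at least one error during a test of the given length."""
    expected_errors = errors_per_year * test_hours / HOURS_PER_YEAR
    return 1.0 - math.exp(-expected_errors)


for hours in (24, 24 * 7, HOURS_PER_YEAR):
    p = detection_probability(1, hours)
    print(f"{hours:5d} h of testing: {p:.1%} chance of catching it")
```

An overnight or even week-long run has only a small chance of seeing a once-a-year fault, and a full year of testing still misses it about a third of the time. A clean memtest pass rules out badly broken RAM, not rare errors.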
Re:Ksplice ... go figure by Anonymous Coward · 2010-06-25 05:17 · Score: 0

Just because memtest isn't perfect is a silly reason not to run it when you witness a memory error. Certainly, I'd run it before telling the world I'd seen a cosmic ray.
Re:Ksplice ... go figure by dannys42 · 2010-07-06 05:36 · Score: 1

That actually sounds like an awesome idea to me!

Re:erm.... by gandhi_2 · 2010-06-24 13:00 · Score: 1

I did read it. I liked the article, actually.

I didn't take into account that he probably never reboots, thereby always using the cached copy.

The k-splice ad on TFA made me laugh in this case.

guess we should put echo 3 > /proc/sys/vm/drop_caches in crontab.

--
THL phish sticks

to increase ram or not to increase ram? by Anonymous Coward · 2010-06-24 13:08 · Score: 0

So, here I am with my paltry 2 Gigs of ram in my system drooling over the idea of having some much larger amount, like this fellow's 12 Gigs, and then find out that it's a likely source of errors due to persistent caching of hard drive reads.

Memory failures due to alpha particle switching, one of my faves, or cosmic rays (are we sure we can't get neutrinos in there as well?) were a known evil but it looks like having the cache more frequently overwritten might be an advantage to having smaller amounts of memory. (at least, non-ECC memory.)

Now I have to run off and see if my motherboard will accept ECC memory before I go out and buy more memory.

Re:to increase ram or not to increase ram? by klashn · 2010-06-24 17:23 · Score: 1

It's not the motherboard that will prevent you from using ECC memory, it's most likely the fusing of your CPU. If you don't have a Server CPU, ECC will not be enabled. Keep in mind, currently there is no solution for SO-DIMM ECC memory, so you must be talking about desktop ECC (UDIMM) memories.

A cheaper solution.. by Anonymous Coward · 2010-06-24 13:11 · Score: 0

I found ECC RAM was too expensive for my home server..

so does anybody know where I can get a cheap, THICK lead sheet?

Re:Cosmic rays, my ass. Occam's Razor time. by Anonymous Coward · 2010-06-24 13:18 · Score: 5, Interesting

On the subject of the imperfect nature of machines, I found this post by Richard D. James (aka Aphex Twin, a noted electronic music composer) quite interesting. He describes how the physical machinery of analog electronic music machines means it is near impossible to duplicate them in digital programs.

link

Author: analord
Date: 02-07-05 03:14

some people bought the analogue equipment when it was unfashionable and very cheap though.
some of us are over 30 you know!
anyone remember when 303`s were £50? and coke was 16p a tin? crisps 5p

also you have overlooked A LOT of other points because its not all about the overall frequency response of the recording system its how the sound gets there in the first place.
here are some things which you can`t get from a plugin,they are often emulated but due to their hugely complex nature are always pretty crass aproximations..

the sound of analogue equpiment including EQ, changes very noticably over even a few hours due to temperature changes within a circuit.
Anyone who has tried to make tracs on a few analogue synths and make them stay in tune can tell you this,you leave a trac running for a few hours come back and think Im sure I didnt fucking write that,I must be going mental!

this affects all the components in a synth/EQ in an almost infinte amount of tiny ways.
and the amount differs from circuit to circuit depending on the design.

the interaction of different channels and their respective signals with an analogue mixer are very complex,EQ,dynamics....
any fx, analogue or digital that are plugged into it all have their own special complex characteristics and all interact with each other differently and change depending on their routing.
Nobody that ive heard of has even begun to start emulating analogue mixer circuitry in software,just the aesthetics,it will come but im sure it will be a crap half hearted effort like most pretend synth plugins are.
they should be called PST synths, P for pretend not virtual.

Every piece of outboard gear has its own sound ,reverbs,modulation effects etc
real room reverb, this in itself companies have spent decades trying to emulate and not even got close in my opinion, even the best attempts like Quantec and EMT only scratch the surface.

analogue EQ is currently impossible in theory to be emulated digitally,quite intense maths shit involed in this if youre really that interested,you could look it up...good luck.

your soundcard will always make things sound like its come from THAT soundcard..they ALL impose their different sound characteristics onto whatever comes out of them they are far from being totally neutral devices.

all the components of a circuit like resistors and capacitors subtley differ from each other depending on their quality but even the most high quality milatary spec ones are never EXACTLY the same.

no two analogue synths can ever be built exactly the same,there are tiny human/automated errors in building the circuits,tweaking the trimpots for example which is usually done manually in a lot of analogue shit.
just compare the sound of 2 808 drum machines next to each other and you will see what I mean,you always thought an 808 was an 808 right?
same goes for 303`s they all sound subltey different,different voltage scaling of the oscillator is usually quite noticable.

VST plugins are restricted by a finite number of calculations per second these factors are WAY beyond their CURRENT capability.

Then there is the question of the physicallity of the instrument this affects the way a human will emotionally interact with it and therfore affect what they will actually do with it! often overlooked from the maths heads,this is probably the biggest factor I think.
for example the smell of analogue stuff as well as the look of it puts y

I've seen this by Eil · 2010-06-24 13:21 · Score: 2, Informative

A few years ago I came across a thread on a FreeBSD mailing list where a build of some package was failing and the submitter couldn't tell why because he wasn't a developer. The failure was unusual and no one else could reproduce it. Eventually, the problem was traced back to a character in the source differing from the original. The character was a one-bit difference from the correct character, and it was suggested to the submitter that he reboot and memtest his memory. Sure enough, one single bad bit out of around 512MB.

Re:Too bad many consumer mainboards don't support by Anonymous Coward · 2010-06-24 13:32 · Score: 0

Uh...no. I've got a Dell server from the Ark with PIII chips that demands ECC.

ha! by serbanp · 2010-06-24 13:32 · Score: 5, Insightful

The really impressive thing is that this guy resisted the urge to just reboot his machine. Otherwise, the clues would have vanished and the expr binary would have run again without any issue.

Maybe that's why the first step one takes when something behaves weird on a Windows system is to reboot it...

Re:ha! by mpoon · 2010-06-24 13:48 · Score: 2, Informative

If you take a look at the website hosting the blog (Ksplice), you might notice that "this guy" works for a company that produces software which eliminates the need for reboots...
Re:ha! by serbanp · 2010-06-24 13:59 · Score: 1

What happened to slashdot, isn't reading the FA enough anymore to get the medal?
Seriously speaking, I figured out ksplice is a special place when reading the blog comments.
Re:ha! by doomy · 2010-06-24 19:50 · Score: 1

I'm not sure why he didn't run memtest right off. He says he has not run it and was meaning to do it since that happened. I think mostly he wanted to write a story, one that has nothing to do with cosmic rays (btw).

--
...free your source and the rest would follow...

Re:erm.... by Anonymous Coward · 2010-06-24 13:33 · Score: 0

Actually, he lives in the "oil" reservoir that the Deep Water Horizon hit - so he's got 1 mile of seawater and 2 miles of rock and silt above him to protect him a bit better from cosmic rays.

Now, if he would just stop eating bean burritos, he could save BP a lot of money and public embarrassment.

Re:Too bad many consumer mainboards don't support by AHuxley · 2010-06-24 13:36 · Score: 1

A eccforme.com site with current ~consumer priced motherboard lists would be a fun project.

--
Domestic spying is now "Benign Information Gathering"

Same thing just happened to Voyager 2 by psoriac · 2010-06-24 13:39 · Score: 1

http://www.jpl.nasa.gov/news/news.cfm?release=2010-151

Mission managers at NASA's Jet Propulsion Laboratory in Pasadena, Calif., had been operating the spacecraft in engineering mode since May 6. They took this action as they traced the source of the pattern shift to the flip of a single bit in the flight data system computer that packages data to transmit back to Earth.

--
I browse Slashdot at +3, Funny

enterprise hardware/software by eclectus · 2010-06-24 13:56 · Score: 1

I sure am glad my OS and hardware can detect and correct memory errors on the fly and disable the dimms if need be. I know this is a linux-fest, but Solaris fault management is pretty awesome. I've seen it detect a failing cpu, evacuate the memory attached to it and disable the cpu without a hiccup.

--
This signature is a waste of 42 characters

Re:enterprise hardware/software by Anonymous Coward · 2010-06-24 14:15 · Score: 0

Ride that dead horse. Get over it, Oracle is driving Solaris into the ground as fast as it can.

Re:erm.... by Thing+1 · 2010-06-24 14:00 · Score: 1

Me, I've upgraded to my heart. (Read comment history.) Best, upgrade, ever. Takes a bit of practice though.

--
I feel fantastic, and I'm still alive.

Tin foil hat by ITI_guy · 2010-06-24 14:14 · Score: 1

Very interesting article indeed but I wish the author would have included one more detail. Does he believe in tin foil hats? He could only speculate this was caused by a cosmic ray and not a bad memory stick. Instead of running memtest I recommend he wrap his desktop in tinfoil for a week or so and see if this prevents any further bit flips.

Roman ingots to shield particle detector by drerwk · 2010-06-24 14:20 · Score: 3, Informative

Roman ingots to shield particle detector
http://www.nature.com/news/2010/100415/full/news.2010.186.html

Re:Cosmic rays, my ass. Occam's Razor time. by bitflip · 2010-06-24 14:20 · Score: 2, Funny

It was me.

Sorry 'bout that.

Re:erm.... by fishexe · 2010-06-24 14:22 · Score: 2, Funny

You live below your mother's basement???

Sure. In his mother's sub-basement.

--
"I don't care about the Constitution!" --Bill O'Reilly, November 17, 2009

Memory Refresh Timing more likely by spydum · 2010-06-24 14:26 · Score: 2, Informative

I know, cosmic rays sound so much cooler, but it's far more likely he has some crappy memory and/or his memory refresh timings are too high.

DRAM cells have to be refreshed pretty often (anywhere from 7.8 usec to 12 usec), otherwise they become unreliable. If his BIOS has the memory timings set to something obscurely long, it may be that there are specific rows/cells on his DRAM modules that are too weak to read after bleeding off a bit of charge. Changing the refresh timing would likely improve the situation, causing the memory to refresh its state more often.
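
As a rough sketch of where the 7.8 usec figure comes from: the numbers below are my assumption of the usual values (a 64 ms window in which every row must be refreshed, split into 8192 refresh commands), not anything read off his BIOS.

```python
# Assumed typical values, not taken from the machine in the article.
RETENTION_WINDOW_S = 64e-3   # every row should be refreshed within 64 ms
REFRESH_COMMANDS = 8192      # refresh commands issued per window

# Average spacing between refresh commands, in microseconds
t_refi_us = RETENTION_WINDOW_S / REFRESH_COMMANDS * 1e6
print(f"{t_refi_us:.4f} us between refresh commands")
```

So under these assumptions the short interval is the spacing between refresh commands, while any individual cell has to hold its charge for the whole window. Stretching the interval stretches that window, which is where weak cells would start to lose data.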

+1 Informative by fishexe · 2010-06-24 14:31 · Score: 2, Funny

I shouldn't have spent all my mod points yesterday. I guess my hardware knowledge is obsolete; I had no idea modern HDDs don't store individual bits anymore.

--
"I don't care about the Constitution!" --Bill O'Reilly, November 17, 2009

Re:+1 Informative by Sycraft-fu · 2010-06-24 15:52 · Score: 2, Interesting

Been common with all kinds of things for some time. CD-ROMs, for example, use EFM, eight-to-fourteen modulation at their most base level. Eight logical bits are actually written as fourteen pits on the disc. Again the reason is error correction. They have more error correction at higher levels, and even more for data CDs. That's why data and audio CDs don't add up. An 80 minute CD holds 700MB of data. But do the math on 44.1kHz, 16-bit, 2-track audio and it takes 800MB of data to hold. Well in data mode the CD has additional ECC and that takes up the space.
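To make the "do the math" part concrete, here is a quick sketch of the arithmetic. The sector figures (75 sectors per second, 2352 raw bytes per audio sector, 2048 user bytes per Mode 1 data sector) are the standard CD layout as far as I know; I'm adding them here, they aren't in the numbers above.

```python
SAMPLE_RATE_HZ = 44_100
BYTES_PER_SAMPLE = 2      # 16-bit
CHANNELS = 2
SECONDS = 80 * 60         # an 80 minute disc

# Raw audio payload
audio_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * SECONDS

# The same disc in data mode: same number of sectors, fewer user bytes each
SECTORS_PER_SECOND = 75
MODE1_USER_BYTES = 2048   # the rest of the 2352-byte sector goes to sync, header and ECC
sectors = SECONDS * SECTORS_PER_SECOND
data_bytes = sectors * MODE1_USER_BYTES

MIB = 1024 * 1024
print(f"audio: {audio_bytes} bytes = {audio_bytes / MIB:.1f} MiB")
print(f"data:  {data_bytes} bytes = {data_bytes / MIB:.1f} MiB")
```

That works out to roughly 807 MiB of audio against roughly 703 MiB of data, so the "800MB vs 700MB" gap is the per-sector overhead of the extra error correction.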
As our media densities have increased, there is just no way to have 100% bit accurate recovery of data all the time. So, we just don't try. Instead, write it in a fashion that the errors can be dealt with.
For sure the PRML that HDDs use is the coolest and the most voodoo. It really seems like the kind of shit that shouldn't work. I mean taking samples (PCM style) of an analogue wave and comparing it to what it is likely to be? GTFO...
But it works, and works great.
There's all sorts of nifty shit like that at the hardware level that just fascinates me. Another one is 8b/10b encoding, which most popular buses (PCIe, 1394, etc) use. You take each byte and encode it as 10 bits, instead of 8. Why? Because it gives multiple different ways of encoding it. In particular, you can have different patterns of 1s and 0s or, on the wire, different high and low voltages. You then have the system balance those out, giving you no DC component to the signal. If you didn't, and you got a lot of 1s (high voltage) you'd have DC on the wire and that could cause all kinds of trouble. However 8b/10b solves that problem quite nicely. You get 0 DC and pass all your data. Does incur overhead though. There are newer schemes of a similar fashion coming online with more complexity to get the same effect with less overhead.
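A small sketch of why the two extra bits buy you balance. This is not the real 8b/10b code table, just a count of how many 10-bit words are balanced or nearly balanced, plus a hypothetical disparity() helper to show what an unencoded run of 1s does.

```python
from math import comb


def disparity(bits: str) -> int:
    """Number of 1s minus number of 0s in a bit string."""
    return bits.count("1") - bits.count("0")


# 10-bit words with five 1s (perfectly balanced) or with four or six 1s (off by two)
balanced = comb(10, 5)
off_by_two = comb(10, 4) + comb(10, 6)
print(balanced, off_by_two, balanced + off_by_two)

# Unencoded, four 0xFF bytes in a row push the line hard toward "high"
print(disparity("11111111" * 4))

# A nearly balanced word followed by its opposite-leaning twin cancels out
print(disparity("1110101100" + "0001010011"))
```

There are 672 such words for only 256 byte values, which leaves room to give unbalanced values two encodings, one leaning each way, and have the encoder pick whichever pulls the running total back toward zero. As far as I know the real scheme also restricts long runs of identical bits, which this sketch ignores.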
Re:+1 Informative by indeterminator · 2010-06-24 22:53 · Score: 1

For sure the PRML that HDDs use is the coolest and the most voodoo. It really seems like the kind of shit that shouldn't work. I mean taking samples (PCM style) of an analogue wave and comparing it to what it is likely to be? GTFO...
Actually, if you look at what your cell phone does all the time to get the bits out of the air... that is comparable at least, probably even worse. There is some particularly nasty stuff in carrier frequency + phase estimation (w/ noise present!) before you even get to calculate any likelihoods on your actual payload. Whatever the case, there is a strong math basis behind that voodoo.

Re:erm.... by tuxicle · 2010-06-24 14:43 · Score: 1

With a magma-powered computer?

Re:Cosmic rays, my ass. Occam's Razor time. by Peach+Rings · 2010-06-24 14:52 · Score: 1

Electronics are designed well within tolerances for temperature and EM interference. At least, good ones are. Since my fans are broken, I've been running the GPU in my Thinkpad to 107C every day for a few years when I play games. No problems yet.

As someone with over a quarter century of background in

As someone who hasn't been in school in 30 years, memory loss, sits on the porch with a shotgun hollering at kids, has to call his grandson to install the newfangled Norton Internet Security because you've been screwing around with FPGAs for decades and last used a web browser in 1995, etc

210Pb 1/2 life 22 years by drerwk · 2010-06-24 15:07 · Score: 1

http://esp.cr.usgs.gov/info/lacs/lead.htm

Re:210Pb 1/2 life 22 years by Vellmont · 2010-06-24 15:29 · Score: 1

Heh. So I guess my scenario was more accurate than I could even guess. With the exception of a few intermediate decay products, the wild-ass-guess was amazingly accurate.

--
AccountKiller
Re:210Pb 1/2 life 22 years by Anonymous Coward · 2010-06-24 21:52 · Score: 0

*golf clap*

windows crashes by prkamath · 2010-06-24 15:20 · Score: 2, Funny

And we used to blame Microsoft engineering team for all the crashes we experienced !!

Re:erm.... by The_mad_linguist · 2010-06-24 15:48 · Score: 1

No, he just lives beneath-

blah blah his mom's fat.

Re:Cosmic rays, my ass. Occam's Razor time. by jimmydevice · 2010-06-24 16:06 · Score: 1

Yep, we had 2 old wavetec audio distortion meters used for calibrating aviation ILS tone frequencies. One was purchased new. The other I picked up at R2D5 surplus in Portland. Both were calibrated by an outside service and the deviation was about 20% right after calibration. I don't know if we got ripped off, or the meter just wasn't accurate. The R2D5 box was owned by the FAA before I picked it up (complete with FAA cal stickers on the screws). Analog is just that, ANALOG. FWIW I'm buzzed too.

Use people's puters as gamma ray telescope? by lotho+brandybuck · 2010-06-24 16:08 · Score: 1

I wonder if anyone's considered using a large set of networked computers (volunteers) as a gamma ray telescope.

Re:erm.... by jimmydevice · 2010-06-24 16:17 · Score: 1

I'm offended that you push idolatry to the next level. The Politically correct response would be

The period^w White space at the end of this sentence represents Mohamed in drag ==> " "

I'm assuming that the quotes don't offend anyone.

Re: Mod parent up by Cochonou · 2010-06-24 16:22 · Score: 1

RAM upsets at ground level (and in aircraft avionics, for that matter) are primarily caused by neutrons created by cosmic ray decay in the upper atmosphere, through indirect ionization. Galactic Cosmic Rays (heavy ions) are more a concern to satellite designers.

Re:erm.... by jimmydevice · 2010-06-24 16:23 · Score: 1

Now I understand why Portland State University's computers were in the sub-basement back in the 70's.
On the other hand, an IBM 1130 and Honeywell H300 built from discrete transistors and core memory would have probably survived a direct nuclear blast.

Jumping straight to the cosmic ray conclusion... by fishexe · 2010-06-24 16:29 · Score: 1

Lister: Your explanation for anything slightly peculiar is [cosmic rays], isn't it? You lose your keys, it's [cosmic rays]. A [bit] falls off the [RAM], it's [cosmic rays]. That time we used up a whole bog roll in a day, you thought that was [cosmic rays] as well.

Rimmer: Well we didn't use it all, Lister. Who did?

Lister: Rimmer, [COSMIC RAYS] used our bog roll?

Rimmer: Just cause they're [cosmic rays] doesn't mean to say they don't have to visit the little boys' room. Only they probably do something weird and [cosmic ray]-esque, like it comes out of the top of their [waveforms] or something.

--
"I don't care about the Constitution!" --Bill O'Reilly, November 17, 2009

Re:Cosmic rays, my ass. Occam's Razor time. by steelfood · 2010-06-24 16:37 · Score: 1

Most of the memory problems turned out to be power supply or otherwise some kind of power problem. Very rarely is it the memory itself. From my experience, a faulty power supply will first manifest as memory issues, and then gradually increase in severity to affect much more of the system. And from experience, bad power supplies are often a result of dirty (i.e. inconsistent, unstable) power going into the machine, irrespective of any "surge protectors" that may be between the wall and the machine.

Even a bad BIOS battery can throw off something like the system clock and cause issues further down the line.

A few times, I had capacitors on the actual mobo blow out on me (and it's possible some of the PS problems I had were due to faulty capacitors), but that's easily spotted.

--
"If a nation expects to be ignorant and free in a state of civilization, it expects what never was and never will be."

I wonder if he would have been right by NotSoHeavyD3 · 2010-06-24 16:40 · Score: 1

Except he didn't take into consideration that shrinking the size of computer hardware probably means that a huge program now actually fits into a smaller space in physical reality than an old program did on older hardware. (Any hardware guru want to do the math and find out how much physical area a 100k program took back in the 80's vs a 150mb program now?)

--
Did you know 80 to 90% of the moderators on slashdot wouldn't recognize a troll even if one dragged them under a bridge.

Re:Cosmic rays, my ass. Occam's Razor time. by w0mprat · 2010-06-24 16:55 · Score: 2, Informative

Occam's DIMM I'm afraid. I had a stick of DDR2 that had a stuck bit that caused almost exactly the same issue as TFA. As a hardware geek, not a *nix geek with time to waste, I went straight to memtest86, and there it was, one single stuck bit.

Although interesting, TFA is without a doubt the most pedantic and roundabout way I've ever read of establishing your rig is not stable.

From TFSA:

And in fact, since that incident, I've had several other, similar problems. I haven't gotten around to memtesting my machine, but that does suggest I might just have a bad RAM chip on my hands.

Yeah, he has a stuck or semi-stuck bit and an hour or two of his life he won't get back.

In such a circumstance I've found underclocking and overvolting the DIMM might coax it to work again but it's best to RMA or bin it.

--
After logging in slashdot still does not take you back to the page you were on. It's been that way for 20 years.

Cosmic Fibonacci Failure! by Anonymous Coward · 2010-06-24 17:06 · Score: 0

Circa 1994 I was transferring all of my data from low density 5.25" floppies to, gasp, CD-R! The disks were stored in some old huge recipe filing box that just happened to fit about 100 deep.

The following discs failed:

1st, 2nd, 3rd, 5th, 8th, 13th, 21st, 34th, 55th and the 89th.

Only in a long boring backup scenario could anyone ever figure that out.

I test memory on a daily basis by klashn · 2010-06-24 17:19 · Score: 1

Working in a lab environment I test DDR3 memory on a daily basis and we run into a lot of failures, from JEDEC violations to blatant byte/word/dword corruption and even single bit failures. Single bit failures are by far the worst to debug. Kudos to this guy for tracking it down. I am going to add these debug procedures to my arsenal!

When I encounter a failure, logging all information is of course the first thing I do, but reproducibility is key! With reproducibility, like the article says, you're able to throw as many experiments at it as you can think up. We will run memtest86+ among other tools to gather data on whether the failure reproduces with other tests. In the case we believe it is a DRAM part failure, we will utilize Logic analyzers and Oscilloscopes to determine and prove that the failure is on a specific component.

Sometimes failures we encounter are DIMM vendor issues, sometimes our own, induced by bad in house memory test software/hardware

Learned this lesson years ago by OrangeTide · 2010-06-24 17:31 · Score: 1

We were debugging a problem showing up in the field. Turns out the developer building the system image had a bad bit in his machine's memory that was consistently introducing a bug in every build in the same component. After a very frantic week, we realized we could only reproduce the bug if he touched it. (Poor guy, probably felt pretty insecure about his abilities at first, even though it turned out not to be him.) We replaced all developers' machines with ECC capable equipment loaded with ECC memory as soon as possible.

As for ECC's cost. ECC is not available in the same varieties of price and performance. ECC that is just slightly more expensive than good quality but average RAM is also about the same performance. You can find really cheap RAM that is half the price of what I would consider "good quality but average RAM", so ECC is considered twice as expensive as "normal [cheap ass] ram".

If I were just going to play games on my computer, or even write up documents I think non-ECC would be perfectly reasonable. As a developer I now realize that debugging software problems that are really just bad hardware is a huge waste of my time and sanity. I'd pay 10x more for a system if it meant I didn't have to do that crap. Luckily a good quality server motherboard and some ECC ram is not too much more expensive than a fast and cheap computer.

Hard drives are another issue. Obviously some sort of RAID (1, 10, Z, etc.) can be great at dealing with day to day corruption. And backups are great at dealing with catastrophic events such as drive failures, controller malfunctions, fires, or malevolent software. People often forget that controllers can go berserk when they set up their awesome elite fail-proof RAID. The controllers, be they a smart RAID or a dumb multiport SATA controller, can corrupt the data they copy, write to the wrong disk sectors, and cause numerous other kinds of systematic corruption. It helps to pay for a good quality card, but it helps more to never trust your equipment 100% and have a backup plan.

Luckily when CPUs glitch they usually stop running because of the cascade of interdependent transistor states that can make further execution impossible without a hard reset. CPUs can misbehave when their power supplies are not up to the demands. I use the term power supply in a generic electrical engineer's sense. The big metal fan box with a switching power supply in it is the main component of your CPU's power supply; the other component is the voltage converters on the motherboard (where they are surrounded by metal capacitors), which are equally important to the health of your system. A bad PSU can weaken your motherboard's circuitry. And bad motherboard circuitry can be more susceptible to damage. If either fails your CPU can glitch, produce incorrect computations (hard to debug!) or in rare cases cease to function.

--
“Common sense is not so common.” — Voltaire

Re:Too bad many consumer mainboards don't support by klashn · 2010-06-24 17:37 · Score: 1

This is how vendors keep their market segmentation. ECC supported only on servers. Consumers don't need it, so prevent them from using ECC so server customers can't buy cheap setups with ECC!

Re:Cosmic rays, my ass. Occam's Razor time. by Anonymous Coward · 2010-06-24 17:56 · Score: 0

Electronics are designed well within tolerances for temperature and EM interference. At least, good ones are. Since my fans are broken, I've been running the GPU in my Thinkpad to 107C every day for a few years when I play games. No problems yet.

No offense intended, but your comment portrays a complete lack of understanding of the subject. It might be best if you sit this one out.

As someone who hasn't been in school in 30 years, memory loss, sits on the porch with a shotgun hollering at kids, has to call his grandson to install the newfangled Norton Internet Security because you've been screwing around with FPGAs for decades and last used a web browser in 1995, etc

Sorry, I can't decipher this gibberish. Like I said, it might be best if you sit this one out.

Re:Cosmic rays, my ass. Occam's Razor time. by afidel · 2010-06-24 18:00 · Score: 1

Across a few thousand DIMMs in my datacenter we tend to lose about 1-2 per year, more than we lose PSUs. Of course we control temperature, humidity, and have double conversion UPSes and only use ECC systems so it's kind of an ideal environment for avoiding all but the most serious of problems.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.

Re:Too bad many consumer mainboards don't support by Anonymous Coward · 2010-06-24 18:56 · Score: 0

You know, there are many, many other spots a bit can flip and ruin your day outside of RAM, anywhere on the processor for one. The Xeon line can detect and correct a lot more problems than regular desktop chips. There is a name for these types of features: RAS. It is costly, and overlooked by probably most of /. It's the reason Power and SPARC systems cost so much, and mainframes still exist. If you ever wondered what makes a Xeon, Opteron, or Itanium so special, RAS is part of it. Look up Machine Check Exception.

A desktop Athlon is not going to have the same RAS features a Xeon, Itanium, Opteron, etc will. A processor with weak RAS features but ECC RAM is well, a good start I guess, but ghetto. Find out what kind of MCEs an Athlon can live to report.

evoting: spontaneous inversion of a bit in the RAM by Anonymous Coward · 2010-06-24 19:18 · Score: 0

http://catless.ncl.ac.uk/Risks/23.47.html#subj7

The event occurred in the election held on 18 May 2003. An expert review
determined that as no software defects had been found on inspecting the
source code and no test had been able to reproduce the error, it was
probably attributable to a spontaneous inversion of a bit in the RAM of the
PC (no explicit mention of cosmic rays). However the report concluded that
even if the voting system under review was not perfect the totality of
controls was sufficient to be confident in the overall result. I wonder.

Re:erm.... by Anonymous Coward · 2010-06-24 19:23 · Score: 0

Heck, sometimes I reply before reading the comments

AMD vs. Intel with ECC, prices in Germany by Lonewolf666 · 2010-06-24 20:00 · Score: 2, Informative

Checking at alternate.de (not the cheapest online shop, but good for comparisons because they have both consumer and server parts):

Intel:
The only Socket 775 boards that support ECC seem to be those with the 32xx MCH chipset. Starting at 195 Euros (Asus P5BV-C).
For Socket 1156, the consumer chipsets allow ECC but you still need to find a board with BIOS support. Sadly Alternate does not list the ECC support status, but you might find one that supports ECC among the cheaper ones for 80-90 Euros. You do, however, need a Xeon which starts at 213 Euros (Xeon X3430, 4 x 2.4 GHz)
So mainboard plus a quad CPU costs you around 300 Euros at Alternate.

AMD:
Board situation (Socket AM3) similar to Intel's Socket 1156, boards with ECC support are available for 80-90 Euros.
Unlike Intel, even cheap desktop CPUs support ECC. As a cheap quad, Alternate offers the Athlon II X4 635 for 108 Euros.
So mainboard plus quad CPU costs you around 200 Euros, 100 less than with Intel.

--
C - the footgun of programming languages

Re:AMD vs. Intel with ECC, prices in Germany by dave420 · 2010-06-25 00:25 · Score: 1

But that Intel setup will beat the pants off the AMD setup. Well worth the extra 100.
Re:AMD vs. Intel with ECC, prices in Germany by dwinks616 · 2010-06-25 03:30 · Score: 1

That depends on what tasks you use it for. The AMD is quite competitive for gaming, 3d modeling/CAD and a number of other tasks. Not to mention that when you compare a $300 AMD to a $300 Intel setup, you end up with a hexacore processor running against a quad. If you run programs that can utilize those extra cores, the performance difference is pretty much non-existent. Not to mention you don't have to triple check to make sure AMD hasn't disabled hardware virtualization for no good reason or otherwise crippled their chips. Spend $50 more on the AMD processor and add a $50 aftermarket cooler on it and overclock it a good bit. I have a significantly overclocked AMD quad core that beats the pants off a similarly priced Intel of the same era and spent about $50 less. If I were to upgrade right now, I'd go with a 12-core Opteron mostly because I use my workstation for virtualization a lot, and 12 cores means I can boot up a virtual server with 2 cores, a few virtual workstations on a core each, and still have plenty of cores left idle for other tasks.
Re:AMD vs. Intel with ECC, prices in Germany by jon3k · 2010-06-25 09:42 · Score: 1

Yeah but you only overclocked the AMD - that's a silly comparison. Take any similarly priced Intel vs AMD chip, overclock both of them, see who comes out ahead. AMD has some great deals at the low end of the market but they just can't compete at the "enthusiast desktop" end of the spectrum. Compare overclocked 875K vs 1090t and you'll find that the 875k not only wins out in performance but also has a lower overall power consumption.
Re:AMD vs. Intel with ECC, prices in Germany by Lonewolf666 · 2010-06-26 00:56 · Score: 1

Citing the first benchmark I could find, the Passmark CPU list
(http://www.cpubenchmark.net/cpu_lookup.php?cpu=Intel+Xeon+X3430+%40+2.40GHz) rates
-the Xeon X3430 @ 2.40GHz with 2,962 points
-and the AMD Athlon II X4 635 at 3,360 points
So without claiming that this benchmark is the final word, I'd expect the AMD to be at least on a similar performance level to the Xeon. If you have evidence to the contrary, please post it ;-)

--
C - the footgun of programming languages
Re:AMD vs. Intel with ECC, prices in Germany by nabsltd · 2010-07-02 03:00 · Score: 1

So without claiming that this benchmark is the final word, I'd expect the AMD to be at least on a similar performance level to the Xeon.
Passmark is an excellent benchmark that thoroughly utilizes every core. So, it measures the absolute limit of the processor (in general...there could be slight variations based on the motherboard, RAM, etc.).
As such, it's a pretty good indicator of general server performance, and does give you a feel for "normal" desktop performance (where at least some multitasking is involved), but it's not a good indicator of single-thread performance.

ECC RAM alone does not help much. by Hurricane78 · 2010-06-24 20:20 · Score: 1

"I know I'm never buying a desktop without ECC RAM ever again!"

There are still the CPU, the cache, the hard disk, the network, and a ton of buses in-between, where a bit could be flipped.
Unless you add ECC data right at the creation of the data, and pass it through all the way to the end, you can’t be sure of anything.

--
Any sufficiently advanced intelligence is indistinguishable from stupidity.

Re:ECC RAM alone does not help much. by pankkake · 2010-06-24 22:15 · Score: 1

Hard drives have ECC, some filesystems too, and there's RAID. I don't know about all network transports, but most have ECC at various layers.

--
Kill all hipsters.

good article by amn108 · 2010-06-24 20:24 · Score: 1

Not much to say here, except that it was a wonderful article to read!

God of the gaps? by Rogerborg · 2010-06-24 20:30 · Score: 1

He couldn't figure it out, so he attributed the fault to a No See Um. Might as well blame it on goblins, or declare that A Wizard Did It.

--
If you were blocking sigs, you wouldn't have to read this.

Nasty by Anonymous Coward · 2010-06-24 20:33 · Score: 0

Someday, somewhere, the No-Execute bit will be flipped and it will be exploited by Cowboy Neal and the world will end.

Damn those cosmic rays and cowboy hax0rs.

Re:erm.... by Anonymous Coward · 2010-06-24 21:41 · Score: 0

You must be new here

Ha Ha by marqs · 2010-06-24 22:15 · Score: 1

They all laughed when i put my tinfoil hat on and encased my computer in led.
But who is laughing now?

Re:Ha Ha by shiftless · 2010-06-25 22:22 · Score: 1

They all laughed when i put my tinfoil hat on and encased my computer in led.
Does the bright, cheerful light ward away bit-flipping spirits?
I don't get it.

Am I missing something? by gladish · 2010-06-24 22:21 · Score: 1

Do these two numbers actually differ by one bit? 0x0000000000001a70 0x401a70 It looks to me like a byte is getting zero'd somewhere. Bad ram.

Re:Am I missing something? by multipartmixed · 2010-06-25 05:14 · Score: 1

0x401a70 - 0x001a70 = 0x400000
So, yes, they differ by one bit.
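
If you'd rather not do the subtraction in your head, here is a quick sketch that XORs the two addresses and counts the set bits. The variable names are mine; the two values are the ones quoted above.

```python
a = 0x401a70
b = 0x0000000000001a70

diff = a ^ b                        # 1 bits mark the positions where a and b disagree
flipped_bits = bin(diff).count("1")
bit_index = diff.bit_length() - 1   # position of the highest differing bit, counting from 0

print(hex(diff), flipped_bits, bit_index)
```

XOR is a slightly safer check than subtraction, since a difference that happens to be a power of two doesn't always mean only one bit changed (0x10 - 0x0f = 0x01, but five bits differ).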

--

Do daemons dream of electric sleep()?

Prior art by Anonymous Coward · 2010-06-24 23:27 · Score: 0

Did this about 8 years ago on a mail server for Windows. It was a multithreaded application with thread trap detection and restart. On error, the thread protection code would generate a disassembly of the current state of the trapped thread and email it back to support. In one case, the disassembly showed a definite single bit error in the ram affecting code. The customer didn't believe it until we showed him the disassembly. The fault went away when they swapped ram.

Re:erm.... by Anonymous Coward · 2010-06-24 23:42 · Score: 0

PURCHASE a tinfoil hat? Are you crazy? You build them yourself..who knows what they include in the prebuilt ones.

Stray cosmic ray? by yamum · 2010-06-24 23:45 · Score: 1

... as opposed to a cosmic ray on a leash?

Dude you have a lot of faith by AbbeyRoad · 2010-06-25 00:04 · Score: 1

From the article:

"I can't prove this was due to a cosmic ray, or even a hardware error. It could have been some OS bug in my kernel that accidentally did a wild write into my memory in a way that only flipped a single bit. But that'd be a pretty weird bug."

Dude you have a lot of faith.

-paul

Speed on the outside of a hard drive's platter by tepples · 2010-06-25 00:06 · Score: 1

But hard drives aren't 5"; they're 3.5" or smaller. Assume 10800 RPM on a high-end 3.5" desktop drive, with the actual platter being slightly smaller in diameter, so 3.0" * Pi * 10800 RPM * 60 min/hr / 63360 in/mi = 96.4 mph.
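
The same calculation as a small sketch, in case anyone wants to plug in other platter sizes or spindle speeds. The function name is made up; the 3.0" effective diameter is the assumption from above.

```python
import math

INCHES_PER_MILE = 63_360


def rim_speed_mph(diameter_in: float, rpm: float) -> float:
    """Linear speed at the outer edge of a spinning platter, in miles per hour."""
    inches_per_minute = diameter_in * math.pi * rpm   # circumference times revolutions per minute
    return inches_per_minute * 60 / INCHES_PER_MILE


print(round(rim_speed_mph(3.0, 10800), 1))
```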

Spent 2 weeks dealing with bad ECC memory by Hohlraum · 2010-06-25 00:19 · Score: 1

My advice to anyone who ever buys a complete set of memory for a new computer. If you have any problems just demand that all the sticks be replaced or you want a full refund. Memory issues are some of the most time consuming BS wastes of time there is when it comes to computer repair. I could have replaced the memory 4 times over in the amount of time I spent working on it only to find out it was more than one bad stick in the same batch. I'll never get that time back again :(

From parity to ECC to ... crap by surfsys · 2010-06-25 00:32 · Score: 1

Back in the day (~1992), we sold Intel 486 desktops with parity memory. When PCs went to a 64-bit data path (not to be confused with a 64-bit OS), we sold Intel desktops with ECC memory. (I remember seeing an IBM white paper that claimed that ECC memory is more reliable than non-ECC memory by a factor of ~2000.) Then Intel pulled the memory controller inside the CPU, and didn't bother to implement ECC on their line of desktop processors, apparently having decided that nobody on a workstation gives a damn about data integrity. Thanks, Intel!

Re:Cosmic rays, my ass. Occam's Razor time. by ockegheim · 2010-06-25 00:56 · Score: 1

My absurdly clever ex-housemate was an electronics engineer, and in his spare time would tinker with analog electronics because it was much more 'interesting' than digital.

--
I’m old enough to remember 16K of memory being described as “whopping”

Re:Cosmic rays, my ass. Occam's Razor time. by JediTrainer · 2010-06-25 01:26 · Score: 1

I agree, but I would start thinking even simpler. My wife and I had all sorts of weird issues with our computers a few years back.

My biggest clues were that the issues all appeared shortly after we moved, and with 2 out of 3 of our machines.

Long story short, after much hair pulling, a decent UPS solved the problem. Our machines were acting weird and random things weren't working because of unclean power, and apparently the PSUs weren't tolerating this all that well.

--

You can accomplish anything you set your mind to. The impossible just takes a little longer.

Re:Cosmic rays, my ass. Occam's Razor time. by hitmark · 2010-06-25 01:29 · Score: 1

107C? that could be used as a hot plate for tea water.

--
comment first, facts later. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm

Re:Cosmic rays, my ass. Occam's Razor time. by boristdog · 2010-06-25 02:06 · Score: 1

Actually, if you've taken your computer on any air travel you have greatly increased the chances of damage by cosmic rays.
My company has to take this loss into account every time we ship a load of wafers overseas.

Of course, we're usually shipping hundreds of thousands to millions of die at a time, so we see it a lot more often.

My userid, FTW! by RandomBitFlipper · 2010-06-25 02:06 · Score: 2, Funny

Woohoo!

Re:He never said that by Zinho · 2010-06-25 02:08 · Score: 1

To be fair, Gates never said that line.

http://en.wikiquote.org/wiki/Bill_Gates#Misattributed

He didn't have to; he designed that principle into his systems so we all had to live with it for the last 35+ years. DOS was limited to 640kiB of RAM, resulting in users needing to move programs to "upper memory" (640kiB-1MiB) or "extended memory" (1MiB+) addresses by tricking the OS once larger memory cards became inexpensive. XP (32 bit) is limited to 3.1GiB, making it pointless to install even 4GiB of RAM in an XP box since nearly 1/4 of it will never be addressed. Microsoft continues to make the same mistake to this day; there's still a memory limit of ~192GiB in Win7 64 Bit. I expect that in about 5 years RAM will be cheap to buy in quantities larger than 192GiB and Microsoft will start looking silly again because we'll have to resort to DOS-era tricks to make it usable.

Code speaks louder than words. I don't care if he said it or not, he wrote it. And his employees continue to re-write it with every Windows release.

--
"Space Exploration is not endless circles in low earth orbit." -Buzz Aldrin

Not really definitive by name_already_taken · 2010-06-25 02:17 · Score: 1

To be fair, Gates never said that line. http://en.wikiquote.org/wiki/Bill_Gates#Misattributed

Doesn't that just say that Bill Gates says that he never said that line?

It doesn't provide proof that he didn't say it, any more than a defendant in court saying "no, your honor, I did not do that crime." is considered as proof of innocence.

--
Putting moderation advice in your .sig lowers your karma!

Re:Not really definitive by Your.Master · 2010-06-25 19:34 · Score: 1

But having no evidence whatsoever that a person committed the crime *is* considered proof of evidence.
It's unreasonable to ask people to prove the negative, even if the premise is widely held to be true. Nobody can track down a primary source for the quote.

Re:Not really definitive by Your.Master · 2010-06-25 19:36 · Score: 1

*sigh* proof of innocence, not proof of evidence.
This is proof that I don't proof my posts.

Controversial by name_already_taken · 2010-06-25 02:22 · Score: 1

I was thinking EMP related to the seismic activity. IIRC that is still somewhat controversial though.

In that it doesn't exist, yes it's controversial.

The only credible papers written about electromagnetism from or prior to earthquakes talk about resistivity changes (which are not emitted EM) or waves. No pulses. (The P in EMP stands for pulse.)

--
Putting moderation advice in your .sig lowers your karma!

Re:He never said that by Anonymous Coward · 2010-06-25 02:34 · Score: 0

Only pedantic fags use that GiB shit.

Re:He never said that by MikeBabcock · 2010-06-25 03:12 · Score: 1

Only complete idiots don't realize SI exists.

--
- Michael T. Babcock (Yes, I blog)

Re:He never said that by drsmithy · 2010-06-25 03:14 · Score: 1

He didn't have to; he designed that principle into his systems so we all had to live with it for the last 35+ years. DOS was limited to 640kiB of RAM, resulting in users needing to move programs to "upper memory" (640kiB-1MiB) or "extended memory" (1MiB+) addresses by tricking the OS once larger memory cards became inexpensive.

That limit exists because the 8088 CPU can only address 1MB of RAM, and some memory must be reserved for other hardware devices.

Further, as newer systems became available, with higher limits, OSes were updated or created to take advantage of that - OS/2, Windows 95, Windows NT, etc, can all utilise the additional address space provided with the 286 and then 386.

XP (32 bit) is limited to 3.1GiB, making it pointless to install even 4GiB of RAM in an XP box since nearly 1/4 of it will never be addressed.

That also exists because a 32-bit x86 CPU can only address 4GB of RAM (without resorting to hacks like PAE that are a) typically unstable with consumer-level hardware drivers and b) require special programming to take advantage of). Out of that 4GB, some amount (varying on several factors) must be reserved for other hardware devices - which is why the amount of visible RAM can vary from ~2.5 to ~3.7GB.
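The arithmetic behind that range can be sketched roughly as follows. This is a simplified model, not how any particular chipset or Windows release does it: the function name `visible_ram` and the reservation sizes are made up for illustration, and it assumes the reserved device windows simply hide whatever RAM sits behind them.

```python
# Simplified address-space arithmetic for a 32-bit CPU.
# Assumption: reserved device windows hide the RAM behind them (no remapping).
GIB = 2**30
MIB = 2**20

def visible_ram(installed_bytes, reserved_bytes, address_bits=32):
    """Return the RAM an OS could see given the address width and reserved space."""
    address_space = 2**address_bits
    usable_space = address_space - reserved_bytes
    return min(installed_bytes, usable_space)

if __name__ == "__main__":
    # Hypothetical reservation sizes, chosen only to show the spread.
    for reserved_mib in (300, 900, 1536):
        seen = visible_ram(4 * GIB, reserved_mib * MIB)
        print(f"reserved {reserved_mib} MiB -> visible {seen / GIB:.2f} GiB")
```

With less than the full 4GB installed, the same reservation may cost nothing, which is why the problem only became visible once 4GB machines were common.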

Microsoft continues to make the same mistake to this day; there's still a memory limit of ~192GiB in Win7 64 Bit. I expect that in about 5 years RAM will be cheap to buy in quantities larger than 192GiB and Microsoft will start looking silly again because we'll have to resort to DOS-era tricks to make it usable.

Even in *10* years it's unlikely 192GB of RAM in a desktop PC (or even "workstation") will be at all common. Further, in 5 years Windows 7 will have been replaced (and its successor probably be close to replacement at that), or that limit will have been increased. Windows 2008R2 has the same kernel and will address up to 2TB, the limit isn't inherent or architectural.

ECC ram is slower, gives very little benefit by bored · 2010-06-25 03:18 · Score: 1

Besides costing more (due to required extra/wider ram chips), ECC ram is slower.

This is primarily caused by extra read/modify/write cycles done by the memory controller to keep the ECC in sync for short writes. This RMW sequence can cause a fair amount of performance loss (I've seen 8% on a custom application doing a lot of pointer chasing/updating).

Furthermore, as anyone who monitors a lot of servers with ECC will attest, it's really rare to see a soft ECC correction (I've personally never seen one). When bit errors are being corrected or detected, it has always been a full-blown hardware failure.
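The read/modify/write cost described above can be sketched with a toy model. Everything here is an assumption made for illustration: the class `EccMemory` is hypothetical, and a simple XOR of the eight bytes stands in for real ECC check bits (actual controllers use a SECDED code). The point is only that a one-byte store turns into a full-word read plus a full-word write, because the check bits cover the whole word.

```python
WORD_MASK = 2**64 - 1

def check_bits(word):
    """Toy stand-in for ECC check bits: XOR of the eight bytes of a 64-bit word."""
    c = 0
    for lane in range(8):
        c ^= (word >> (8 * lane)) & 0xFF
    return c

class EccMemory:
    """Hypothetical memory of 64-bit words, each stored with its check bits."""

    def __init__(self, words):
        self.data = [0] * words
        self.check = [check_bits(0)] * words
        self.reads = 0
        self.writes = 0

    def write_word(self, index, value):
        value &= WORD_MASK
        self.data[index] = value
        self.check[index] = check_bits(value)  # check bits cover the full word
        self.writes += 1

    def read_word(self, index):
        self.reads += 1
        value = self.data[index]
        if check_bits(value) != self.check[index]:
            raise ValueError("check bits mismatch")
        return value

    def write_byte(self, index, byte_lane, byte):
        """A short write: read the word, merge the byte, write the word back."""
        old = self.read_word(index)                                  # read
        shift = 8 * byte_lane
        new = (old & ~(0xFF << shift)) | ((byte & 0xFF) << shift)    # modify
        self.write_word(index, new)                                  # write
```

Without check bits the controller could write the single byte lane directly; with them, every short write pays for an extra read first.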

Re:He never said that by Anonymous Coward · 2010-06-25 04:00 · Score: 0

I'm not even sure MS wrote the 640K memory limitation in there at all. I believe it had more to do with IBM's initial shortsightedness and the product MS bought to create DOS, CP/M. The 32 bit limit is a limitation of the hardware being able to address memory.

Single bit errors by mollog · 2010-06-25 04:27 · Score: 2, Interesting

While working as a failure analysis technician at a company that made a disk controller, I came across a single-bit error in static RAM cache that was repeatable. I was lucky to have the software and hardware tools available and I eventually tracked down the failure mode. Setting a bit at a certain location would cause another, different location's bit to get set. Just that one bit. And only if you set it. Resetting it did not cause the other bit to reset.

This turned out to be a manufacturing problem with a particular run of RAM. I started finding more of these bad parts and could reproduce the failures. I guess what I'm saying is that this could well be a manufacturing defect in the RAM.
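A failure like this (setting one bit also sets another, but clearing it does not clear the other) can be hunted for with a brute-force test. The sketch below runs against a simulated part, not real hardware: `FaultyRam` and `find_coupling_faults` are hypothetical names, and the aggressor and victim addresses are invented. Real memory testers use far more efficient patterns than this quadratic loop.

```python
class FaultyRam:
    """Simulated bit-addressable RAM with one optional coupling fault."""

    def __init__(self, size_bits, aggressor=None, victim=None):
        self.bits = [0] * size_bits
        self.aggressor = aggressor
        self.victim = victim

    def write(self, addr, value):
        self.bits[addr] = value
        # The fault: setting the aggressor bit also sets the victim bit.
        # Clearing the aggressor leaves the victim alone.
        if addr == self.aggressor and value == 1:
            self.bits[self.victim] = 1

    def read(self, addr):
        return self.bits[addr]

def find_coupling_faults(ram, size_bits):
    """Clear everything, set one bit, and report any other bit that changed."""
    faults = []
    for aggressor in range(size_bits):
        for addr in range(size_bits):
            ram.write(addr, 0)
        ram.write(aggressor, 1)
        for victim in range(size_bits):
            if victim != aggressor and ram.read(victim) == 1:
                faults.append((aggressor, victim))
    return faults

if __name__ == "__main__":
    ram = FaultyRam(64, aggressor=5, victim=40)
    print(find_coupling_faults(ram, 64))
```

A test that only writes and reads back each location on its own would pass this part, which may be one reason such defects slip past simpler checks.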

--
Best regards.

Re:Cosmic rays, my ass. Occam's Razor time. by Anonymous Coward · 2010-06-25 05:55 · Score: 0

That's entirely true. Even the worst software interface couldn't possibly be as difficult and time consuming as programming a song into a real 303.

Re:erm.... by Moodie-1 · 2010-06-25 06:59 · Score: 1

Hah! Good one! Here's the ultimate "Your mom's so fat..." joke: "Your mom's so fat she blocks neutrinos!"

Re:He never said that by Bungie · 2010-06-25 07:45 · Score: 1

It has nothing to do with Microsoft at all, it's all because of Intel and IBM. The original 8088 IBM systems could address 1MB of memory, and they decided to reserve the upper 384KB for hardware addressing, leaving 640K of conventional memory for programs. Then when the 286 came out and could address 16MB of memory they decided to use the same memory mapping so they wouldn't break compatibility with older applications. During the DOS days Microsoft was trying to work around the 640K barrier using things like XMS and UMBs.
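For anyone who hasn't seen where the 1MB and 640K figures come from, here is a rough sketch of the arithmetic. The 8088 forms a 20-bit address from a 16-bit segment and a 16-bit offset; the helper names below are made up for illustration, and the two-way split into "conventional" and "reserved" is a simplification of the real PC memory map.

```python
ADDRESS_SPACE = 2**20              # 20 address lines -> 1 MiB
CONVENTIONAL_LIMIT = 640 * 1024    # 0xA0000, start of the reserved area

def physical_address(segment, offset):
    """8088 real-mode address: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF

def region(addr):
    """Classify an address using the simplified two-way split."""
    return "conventional" if addr < CONVENTIONAL_LIMIT else "reserved"

if __name__ == "__main__":
    print(hex(physical_address(0xA000, 0x0000)))         # first reserved byte
    print((ADDRESS_SPACE - CONVENTIONAL_LIMIT) // 1024)  # KiB held back for hardware
```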

The 3GB barrier you complained about is because of a similar thing: memory-mapped I/O address space being reserved at the top of the memory area. Under 32-bit Windows they tried using PAE to allow the extra memory to be accessed but it broke a lot of drivers which expected pointers to always be 32 bits in size. Rather than break a ton of drivers they decided to keep the 4 GB limit on Windows XP (I think Windows Server may be able to address the full memory using PAE because it has more stable drivers).
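The driver breakage is easy to picture with a bit of arithmetic. PAE widens physical addresses beyond 32 bits (36 bits in the original implementation), so a driver that stores one in a 32-bit field silently drops the top bits. The function name and the example address below are hypothetical; this only illustrates the truncation, not any actual driver code.

```python
GIB = 2**30

def truncate_to_32_bits(physical_address):
    """What a 32-bit field keeps of a wider physical address."""
    return physical_address & 0xFFFFFFFF

if __name__ == "__main__":
    # A buffer at 5 GiB + 4 KiB, reachable only with PAE (made-up address).
    addr = 5 * GIB + 0x1000
    print(hex(addr), "->", hex(truncate_to_32_bits(addr)))
```

The truncated value is still a valid address, just the wrong one, so the device ends up reading or writing somebody else's memory rather than failing cleanly.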

Finally, the 192GB limit in 64-bit Windows is because of overhead involved for the Windows Memory Manager to keep track of pages beyond the 192GB barrier (like requiring larger internal data structures). Instead of having the memory manager waste resources so it can track insane amounts of memory which most people are nowhere near using, they set a limitation.

--
The clash of honour calls, to stand when others fall.

Another example by snoone · 2010-06-25 08:08 · Score: 1

For those interested, I tracked down a single bit issue on a Windows machine a few months ago and recorded the adventure here: http://analyze-v.com/?p=558 -scott

Re:He never said that by Anonymous Coward · 2010-06-25 09:55 · Score: 0

If you don't see the difference between being unable to predict the explosive growth of computers 35+ years ago and saying flat out "No one ever needs more than X RAM" you're a fucking moron. Sorry.

Re:Cosmic rays, my ass. Occam's Razor time. by inf0stud · 2010-06-25 13:02 · Score: 1

Great post, but what date was it posted? 02-07-05 could be 2002-07-05, 2002-05-07, 2005-07-02 or 2005-02-07? Please read http://w3.org/QA/Tips/iso-date
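To make the ambiguity concrete, here is a small sketch that enumerates the readings of a two-digit date string. The function name and the layout labels are made up for this example, and it assumes every two-digit year falls in the 2000s.

```python
from datetime import date

def interpretations(text, century=2000):
    """Return every valid calendar reading of an 'xx-xx-xx' date string."""
    a, b, c = (int(part) for part in text.split("-"))
    candidates = {
        "YY-MM-DD": (century + a, b, c),
        "YY-DD-MM": (century + a, c, b),
        "DD-MM-YY": (century + c, b, a),
        "MM-DD-YY": (century + c, a, b),
    }
    result = {}
    for layout, (year, month, day) in candidates.items():
        try:
            result[layout] = date(year, month, day).isoformat()
        except ValueError:
            pass  # not a real calendar date under this layout
    return result

if __name__ == "__main__":
    for layout, iso in interpretations("02-07-05").items():
        print(layout, iso)
```

A four-digit year first, as in 2005-07-02, leaves only one sensible reading.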

Re:Cosmic rays, my ass. Occam's Razor time. by Spugglefink · 2010-06-25 14:17 · Score: 1

10^24 times more likely the cause.

But they're not as fascinating as wild speculation, are they?

You're all missing the point that Raptor Jesus flipped the bit, because he is angry.

Re:Too bad many consumer mainboards don't support by Anonymous Coward · 2010-06-25 16:33 · Score: 0

One can find the ECC support information quite quickly from the BIOS section of the motherboard manual at the latest. Memory manufacturers' recommendation pages are very useful in this respect as well. I mostly buy Kingston's JEDEC memory and used their service to find a nice board for their cheap (at that time) ECC kit.

Re:He never said that by Zinho · 2010-06-25 19:30 · Score: 1

Thanks for the thoughtful reply! I seem to have rousted some trolls from under their bridges, and I appreciate the time you took to give a polite response.

I like a good argument, though, so I'm going to reply to you and keep this discussion going =)

That limit exists because the 8088 CPU can only address 1MB of RAM, and some memory must be reserved for other hardware devices.

The problem wasn't the limits of the 8088, it's that DOS was written assuming that those hardware limits would always be there. Specifically, instead of checking for memory availability and putting those reserved addresses at the end of addressable memory, DOS instead specified the range between 640kiB and 1MB as reserved. I believe that being forced to live with that poor design choice was the source of the fictitious "no-one will ever . . ." quote.

Further, as newer systems became available, with higher limits, OSes were updated or created to take advantage of that - OS/2, Windows 95, Windows NT, etc, can all utilise the additional address space provided with the 286 and then 386.

Unfortunately, DOS was the order of the day for nearly 20 years; the operating systems you listed were all released in the early-to-mid 90's. Until then we were stuck with DOS and its 16-bit limits even on the 32-bit 386 & 486.

. . . a 32-bit x86 CPU can only address 4GB of RAM (without resorting to hacks like PAE that are a) typically unstable with consumer-level hardware drivers and b) require special programming to take advantage of).

As another poster pointed out, the instability of drivers under PAE is largely due to driver programmers making the "no one will ever . . ." assumption again. I'd argue that Microsoft set the precedent, and the 3rd party developers followed it.

Even in *10* years it's unlikely 192GB of RAM in a desktop PC (or even "workstation") will be at all common. Further, in 5 years Windows 7 will have been replaced . . .

That sounds suspiciously like "no one will ever . . ." to me. Moore's law disagrees with you on RAM availability; 10 years is enough time for 6 or 7 doublings of circuit density, I hope to have 1024 GiB of memory in my desktop by then. The histories of Windows 3, Windows 95, and Windows XP also contradict your "will have been replaced" assertion - Microsoft's strongest historic competitor has been its own obsolete software, including versions that are officially unsupported. I expect that Win7 will still be alive and twitching 10 years from now, having only recently left its official support period.
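The "6 or 7 doublings" figure can be checked with a couple of lines. The doubling periods below (18 and 24 months) are the usual rule-of-thumb values, not measurements, and the 8 GiB starting point is an assumption about a well-equipped desktop of the day; the function names are made up for this sketch.

```python
def doublings(years, months_per_doubling):
    """Number of doublings that fit in the given span."""
    return years * 12 / months_per_doubling

def projected_capacity(start_gib, years, months_per_doubling):
    """Capacity after compounding the doublings from a starting size."""
    return start_gib * 2 ** doublings(years, months_per_doubling)

if __name__ == "__main__":
    for months in (18, 24):
        n = doublings(10, months)
        gib = projected_capacity(8, 10, months)
        print(f"{months}-month doubling: {n:.1f} doublings, about {gib:.0f} GiB")
```

Seven full doublings from 8 GiB lands exactly on 1024 GiB; a slower 24-month cadence gives only five doublings and 256 GiB, so the estimate is quite sensitive to the assumed period.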

Regardless, the real issue is that the design of DOS left Microsoft poorly positioned for the transition from 16 to 32 bit hardware. It seems that instead of learning from the users' pain during that upgrade Microsoft continued to use coding practices that left their OS poorly positioned for the 32 to 64 bit upgrade.

There's a good argument to be made that Microsoft shouldn't have to support the installation of Windows on hardware it wasn't designed for; eg. XP shouldn't have been expected to run gracefully on 64 bit systems. The counter-argument to that is that Vista, which began development after 64-bit chips were available on the market, also failed to gracefully bridge the 64-bit divide.

Microsoft should know better. Its developers cannot have been ignorant of Moore's Law, and should have seen the 64-bit transition coming. Despite being staffed with some of the world's smartest programmers Microsoft seems mired in its own legacy of poor initial decisions. Fair or not, justified or not, the perception of those who have watched its history and used other systems without the same frustrations is that Microsoft products are not designed in a future-proof manner. No one will be surp

--
"Space Exploration is not endless circles in low earth orbit." -Buzz Aldrin

Re:Cosmic rays, my ass. Occam's Razor time. by Dputiger · 2010-06-26 05:20 · Score: 1

I'm familiar with Day-Month-Year and Month-Day-Year. If putting the year in back confuses someone, they're a moron.

Re:He never said that by drsmithy · 2010-06-26 16:18 · Score: 1

The problem wasn't the limits of the 8088, it's that DOS was written assuming that those hardware limits would always be there. Specifically, instead of checking for memory availability and putting those reserved addresses at the end of addressable memory, DOS instead specified the range between 640kiB and 1MB as reserved. I believe that being forced to live with that poor design choice was the source of the fictitious "no-one will ever . . ." quote.

This is like arguing Linux was badly designed because it couldn't use more than 4GB of RAM in 1991.

Unfortunately, DOS was the order of the day for nearly 20 years; the operating systems you listed were all released in the early-to-mid 90's. Until then we were stuck with DOS and its 16-bit limits even on the 32-bit 386 & 486.

The first version of OS/2 was released in 1987, and Windows/286 and Windows/386 in 1988.

That people kept using DOS, does not mean that other OSes capable of using the protected modes of the 286+ didn't exist.

As another poster pointed out, the instability of drivers under PAE is largely due to driver programmers making the "no one will ever . . ." assumption again. I'd argue that Microsoft set the precedent, and the 3rd party developers followed it.

Microsoft set no such precedent, your basic premise is flawed. The memory limitations of DOS exist because of the fundamental design of the hardware it was designed to run on. Problems with drivers on PAE systems exist because developers didn't bother to test that configuration. These two scenarios are completely different.

That sounds suspiciously like "no one will ever . . ." to me.

It's nothing of the sort. You can buy machines today with more than 192GB of RAM in them, so clearly desktops will have that sort of memory eventually. My argument is that even in 10 years, it's unlikely to be a configuration seen in a consumer desktop.

Moore's law disagrees with you on RAM availability; 10 years is enough time for 6 or 7 doublings of circuit density, I hope to have 1024 GiB of memory in my desktop by then.

I said nothing about availability. As I said, you can already buy systems today with more than that much RAM in them. I expect we'll be able to buy "standard" x86 servers with ~1TB of RAM by the end of next year. However, I don't think that in 10 years a *desktop PC* with 192GB of RAM in it will be common. 10 years ago a high-end desktop PC had a gig of RAM. Five years ago it was 4GB. Today it's 8GB - maybe 16GB - of RAM. Further, there is a point of diminishing returns - the benefits of going from 1GB to 2GB are clear and obvious, even for relatively light users. The difference between 2 and 4GB for most people is minor, and from 4GB to 8GB essentially nonexistent. My predictions are that a typical PC in 5 years will have 8-16GB of RAM, and in 10 years 48-64GB, with high-end machines having 50-100% more.

Also, this is before even getting into the general market shift away from desktop PCs and towards laptops and other mobile devices, which tend to be significantly more limited in RAM capacity purely due to the physical form factor.

The histories of Windows 3, Windows 95, and Windows XP also contradict your "will have been replaced" assertion - Microsoft's strongest historic competitor has been its own obsolete software, including versions that are officially unsupported. I expect that Win7 will still be alive and twitching 10 years from now, having only recently left its official support period.

I'm also sure it will be "alive and twitching". However, it will have had at least two, probably three Service Packs released, and at least one, quite possibly two, successors. The limitations in Windows 7 _today_, are not relevant to system configurations that might exist in a decade, any more than the memory limitations of Linux 1.0 mean my ~150 64-bit Linux servers with 8-32GB+ of RAM don't exist.

Cosmic Rays? by Anonymous Coward · 2010-06-28 11:18 · Score: 0

I would always suspect a bad memory chip before cosmic rays, since any interaction with them is orders of magnitude less likely. I have been running 3 computers over the last 10 or so years (no, they are not that old) and have had no such errors. In all honesty, one of them is a SuperMicro workstation board with ECC, and one is RDRAM. Also, I never buy "cheap" memory. Having said that, if I thought I had a memory error, I'd let memtest run on it for a day or so just to see if I could fault it before I thought of cosmic rays.
However, thanks for a great article! I learned some stuff, and am pretty sure lots of others did too!

Re:He never said that by Zinho · 2010-07-01 15:51 · Score: 1

You've probably figured out from my long delay in responding that I don't have a good comeback for that. =)

Thanks for the replies! As I said before, my opinions on this aren't necessarily rational, and you do a good job presenting evidence of that. It's tragic that knowing that I'm wrong doesn't soothe the aches caused by years of Microsoft hate coloring my interpretation of their actions.

Meanwhile, you've earned a fan. Anyone who can doggedly persist at politely correcting someone who's clearly letting anger cloud his reason is worthy of my attention. Perhaps next time we get into a Slashdot back-and-forth I'll have a better position to argue from =)

--
"Space Exploration is not endless circles in low earth orbit." -Buzz Aldrin

Slashdot Mirror

277 comments