Tracking Down a Single-Bit RAM Error
Hanji writes "We have discussed here before the potential effects of and protections against cosmic ray radiation, but for the average computer user, it's an obscure threat that doesn't affect them in any real way. Well, here's a blog post that describes a strange segfault and, after extensive debugging, traces it down to a single bit flip, probably caused by a stray cosmic ray. Lots of helpful descriptions of Linux debugging techniques in this one, and a pretty clear demonstration that this can be a real problem. I know I'm never buying a desktop without ECC RAM ever again!" The author acknowledges that it might not have been a cosmic ray-based error, but the troubleshooting steps are interesting no matter what the cause.
One of my computers had an intermittent failure in a RAM chip/line/something somewhere that mostly manifested as SHA/MD5 failures when I was checksumming large files that I'd downloaded. Never showed up in Memtest86, but eventually I eliminated every other possibility. IIRC, I solved it by underclocking the machine and then replacing it when I was able.
No wonder I have so many errors on my old laptop! This explains everything!
- aqk
F U
I was hoping this would be more info about the Voyager 2 incident that occurred recently. No doubt, a detailed account of what they recently went through to find and fix the problem would be most interesting.
When I was in college one of my physics professors told us he doubted programs would ever get bigger than a few hundred kilobytes because cosmic rays would cause the larger programs to fail too frequently.
I don't know about cosmic rays, but immediately following the Easter day Earthquake in Guadalupe Victoria (about three hundred miles from where I was located) I tried to fire up my laptop and then my desktop, both of which had been suspended to RAM. Neither one would wake up, though the lappie displayed a garbled screen. No errors in the log files (Ubuntu 9.10 on the sys76 lappie, Deb Lenny on desktop).
Forget a RAM error, I have seen a bit on a file on the disk flip.
After years of successful operation, a Perl script quit working. On investigation, a G had been transformed to a W, a difference of one bit. The file mod date was years old.
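The G-to-W claim is easy to check: XOR the two character codes and see which bits are set. A minimal sketch, assuming plain ASCII; the helper names are made up for illustration:

```python
def differing_bits(a, b):
    """Return the bit positions (0 = least significant) where two characters differ."""
    diff = ord(a) ^ ord(b)  # XOR leaves a 1 wherever the two codes disagree
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


def is_single_bit_flip(a, b):
    """True when the two characters differ in exactly one bit."""
    return len(differing_bits(a, b)) == 1


# 'G' is 0x47 and 'W' is 0x57, so they differ only in bit 4 (0x10).
print(differing_bits("G", "W"))      # [4]
print(is_single_bit_flip("G", "W"))  # True
```

So a single flipped bit on disk or in a cached copy is enough to turn one letter into the other.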
Soft errors in DRAM are far more likely to be the result of alpha particle decay from materials in the die and packaging.
Would it really be so hard to read the article before posting?
"Prefiero morir de pie que vivir siempre arrodillado!"
I've been working with some large microarray datasets recently, and so had to double my computer's memory to 8GB.
As I've done for years, I went to Fry's to get some Corsair chips... installed F13 64bit to replace my older 32bit distro... and crash-o-matic began. Mostly from Chrome and Mercurial.
I ran memtest86+ and, sure enough, verified my first purchase of faulty memory.
So, I went back to Fry's and exchanged for another pair of Corsair 2GB chips. This time, I ran memtest86+ first thing... ANOTHER bad set, so back it went to Fry's.
*Third* set of memory was Kingston, and a trip through memtest86+ verified no errors. Yay!
Computer has been stable, too.
With more and more RAM in computers, my next box will have ECC.
It's interesting to me because my first instinct would have been to assume something got corrupted, and my first step would have been to reboot. If the problem persisted through a reboot, then I might have gone down the rabbit hole in similar fashion to try to find and fix the root cause.
There are enough software bugs, kernel bugs, driver bugs, hardware hiccups due to marginal equipment, power fluctuation, interference, random noise... and I suppose even cosmic radiation that I would rarely think to spend the time to trace a transient problem unless it was reproducible across reboots, or at least happened on multiple separate occasions.
Some of the nicer boards will tolerate ECC memory being inserted, but won't actually do any meaningful error correction (like scrubbing) - but a disturbingly large number of consumer boards (BIOS limitation perhaps?) don't actually do ANYTHING with ECC memory, and the really cheap ones won't even boot with it present. I used to have the same mindset of purchasing only ECC RAM for the same reason - but the unfortunate truth is that hardware support for it just isn't there without spending $$$ on a decent board too.
I would think it's more likely there are trace radioactive elements in the epoxy the chip is encapsulated in.
Actually, I recall reading that in the early solid state memory days, they had problems with this. I don't remember what the solution was, but I thought it was to make the circuit somewhat resilient to it, as it was impossible to get 100% neutral epoxy, there's always going to be traces of something radioactive.
I think they tested the cosmic ray theory by running the same chip with and without lead shielding, and did not find a significant difference in errors, they then assumed it was impurities in the chips themselves decaying.
Sent from my PDP-11
Back in the early 80's, HP published a paper on random bit errors in RAM. They looked at chips from a variety of vendors and determined that the RAM coming out of Japan was the most reliable. That paper caused a lot of US RAM vendors to shutter their doors as there was a sea change in purchasing habits.
A few years later, I ran into John Sculley while we were waiting for a flight. I mentioned the paper to him and asked him how Apple could seriously expect to sell a Macintosh specifically aimed at the scientific community if it didn't have ECC. He blithely said "it's not a problem..." 20+ years hence, most of us still don't have ECC, so it seems he was right.
I'm putting tinfoil hats on all of my servers, right away!
Link to Google's PDF is contained in this story : http://www.computerworld.com/s/article/9139161/Google_DRAM_error_rates_vastly_higher_than_previously_thought
From the article,
" A study released this week by Google Inc. and the University of Toronto showed that data error rates on dynamic RAM memory modules are vastly higher than previously thought and may be more responsible for system shutdowns and service interruptions.
The study (download .pdf), which used tens of thousands of Google's servers, showed that about 8.2% of all dual in-line memory modules (DIMM) are affected by correctable errors and that an average DIMM experiences about 3,700 correctable errors per year.
"Our first observation is that memory errors are not rare events. About a third of all machines in the fleet experience at least one memory error per year, and the average number of correctable errors per year is over 22,000," the report states.
"These numbers vary across platforms, with some platforms seeing nearly 50% of their machines affected by correctable errors, while in others only 12%-27% are affected."
The median number of errors per year on a Google server that had at least one error ranged from 25 to 611..."
This is one area where AMD is light years ahead of Intel. With Intel, you have to buy a Xeon and a server chipset to have ECC support, which basically is going to run you at least a grand or two just for the CPU and motherboard (at least if you want an i7 based Xeon). AMD on the other hand supports ECC across the board, and you just need a motherboard which supports it, which is most of them (total cost: <$500).
Thanks for the gouging Intel!
Game! - Where the stick is mightier than the sword!
Any electronics/communications engineer will tell you that every data channel is noisy and you must expect corruption at some point, even if the odds seem vanishingly small. And no doubt in these super high transistor count and clock frequency CPUs and chips we are using these days there must be devices and methods used inside them to keep the logic transfer and computation validity on the straight and narrow.
People talking about bits flipping in RAM or on disk -- these are external bus errors due to noise before the data gets to the memory or disk drive.
I was ready to send him a link to purchase a tinfoil hat (and tinfoil server cover too), but in his article, he says it could be cosmic radiation, or flaky hardware. I'd lay money on the second, and not the first.
I used to joke that cosmic radiation made particular servers crash. We couldn't find any other reason for it, even with a fresh OS (that was identical to our other servers), and swapping various parts. Ya, cosmic radiation went through the building above us, to the server about 30 feet underground, and hit one in the middle of the rack, and not all the ones over it. It was good for the wheel of excuses, but (obviously) not a real answer. Oddly enough, the cosmic radiation stopped messing with that server when we finally took it out of service, and the one that went in the same position, with the same job, running the same OS did fine. :)
Ya, ya, I know, it's probably whatever part we didn't replace (the motherboard), but cosmic radiation sounded better. :) At the time there were quite a few news stories about it, so I was able to link to those in my report blaming cosmic radiation. :)
They call me crazy. I call myself eccentric with a sense of humor. :) My girlfriend at the time even made me a tinfoil hat, that I'd sometimes wear around the house as I babbled nonsense about impending alien invasions. :)
Serious? Seriousness is well above my pay grade.
Wrong. A few Dell PE servers have P4s in them, and -require- ECC memory.
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Disks have a lot, and I mean a LOT of ECC on them. It is not a situation of "I need to write a 1 so I'll place one at this location on the drive." They use a complex encoding scheme so that bit errors on the disk don't yield data errors to the user.
Then there's the fact that bits aren't even stored as bits really. All current drives use (E)PRML, which is (Enhanced) Partial Response Maximum Likelihood. What this means is bits aren't encoded as a high-low state or FM wave or any of that. They are written using flux reversals, but the level is not carefully controlled; it can't be. So when you read the data, the drive actually looks at an analogue wave. It samples the partial response it gets, and then finds the most likely bit pattern that matches.
Sounds like voodoo but works really well. Things are not simple thresholds or the like, it is a complex system and ends up being quite robust and resilient to error.
So it is highly unlikely that you had a bit flipped on a disk. It would require some amazing circumstances to happen. The RAM error is far more likely. Not just the cosmic ray thing but, as the parent noted, bad RAM. Normally when RAM fails, it fails catastrophically and it is immediately apparent. Not always though. It can fail not only on single bit locations, but only during certain ops. That is why memtest does so many different tests. One kind might work fine, another might fail. Rare, but I've seen it on a few systems.
My RAM is shielded against cosmic rays by my mother's basement.
Yeah, but can he find a way to reproduce the error?
Sure, ECC will be nice at some point on desktop boards.
But unlike what most of these studies *speculate*, RAM error rates are still negligible, even over years. And let's face it, there's no point panicking over a possible bit flip every month. The odds that your OS goes bad for another reason (disk corruption, bus errors, buggy software) are far higher.
In any case, the point is that unless your data is bit-important, it does not matter. And very few applications need bit important data, and practically none that any desktop computer should be dealing with.
The one thing that would be interesting would be large sets statistics depending on the manufacturers. After all, we all remember the cheap generic DDR debacle a couple years ago.
You are on the right track. As someone with over a quarter century of background in combined embedded software and hardware design (the most recent decade for life-dependent systems), it always amazes me how quickly pseudo-technical people jump to wild speculation for observations that they cannot explain.
They fail to understand that a hardware system is an imperfect representation of the theory (probably the biggest failure in the schooling of software developers and even some hardware is to get this message into their heads). While they feel comfort in the theory of a binary system, they utterly fail to understand that our real systems, like us, are imperfect and, like us, live in an analog world. Simple things like temperature variations, noise from common (rather than cosmic) sources, marginal design timing, imperfect components, simple intermittents, etc., are 10^24 times more likely the cause.
But they're not as fascinating as wild speculation, are they?
I had a MySQL replication server which was reading SQL commands from a binary log on a master server. One day, after years of operation, I noticed an update failed. I didn't see anything at first by looking at the query, but when I looked closely I noticed the query had a single character changed, and of that character only one bit had changed. It was something like a P becoming a Q and thus giving a syntax error.
True story.
The worst problem was with ceramic DIP packages -- the really good ones for when you needed reliability (partly because the plastic ones tended to allow moisture to get in, and then condensation on thermal cycling.) The standard ceramic packaging material contained trace amounts of thorium, which is an alpha emitter. The alpha bombardment was enough to flip bits.
There have been several fixes since then. Using materials that don't contain radioactive species was one. The one you're probably remembering is that the manufacturers apply a polymer coating to the surface of the die, which is enough to stop a lot of alpha particles and a fair number of electrons. Getting rid of lead in packaging is also good, because lead tends to contain some radioactive traces.
On the other hand, there's flat nothing to be done about cosmic rays and damn little to be done about X-rays and thermal noise (you do keep your memory cold, don't you? Thermal noise is proportional to KT/qe after all.) So at some point we get to where there are too many bits which need minimal energy to flip them -- and then you have errors.
Pity that so few mobos actually support ECC, though.
Lacking <sarcasm> tags,
Cosmic ray events tend to affect multiple neighboring transistors. For this reason, they tend to affect multiple bits. However, by laying out memory cells so immediate neighbors are from different locations, the ability of single-bit-correction-double-bit-detection (SECDED) methods to detect most events is usually preserved.
The main concern is for structures with no error correction, such as the gates in the processor pipeline. Several research ideas have been put forward. See here (PDF) for a good overview of the issues.
-Todd
Omne ignotum pro magnifico.
That is an important consideration for old computers (prior to 2005 or so.) The newer ones are pretty much lead-free.
Billions of years in the ground, and only a few centuries on the roof and all of the radioactivity is gone! Wow!
He's referring to all embedded memory controller Intel Parts. Supposedly the Core i3/i5 chips have ECC support enabled on them, but unfortunately none of the consumer boards support them (you'd need an 1156 server mobo + core i3/i5 cpu).
AMD's chips since Socket 939 have supported ECC out of the box. I haven't had a chance to test it myself, but if the Nforce M430 mobo I have will run ECC with a cheap low-end sempron, all my future cpu/mobo purchases will be AMD for just this reason.
You do have to carefully check whether the motherboard manufacturer has included the BIOS support. The upper end models from ASUS and Gigabyte do generally support ECC, but the lower end models and the models from other popular "consumer grade" motherboard manufacturers don't generally include the support. Intel's recent Westmere i3 and i5 models do support ECC, but once again you need BIOS support, which in the case of an Intel based "consumer grade" motherboard is even more difficult to find. I don't know a single one. OEMs have their own BIOS versions and support for ECC with their single socket server models, like the other comment states. I'm writing this with my consumer oriented Phenom 9750 and 8GB of ECC memory, so any typos are most likely not my computer's fault. ;)
I just read the article and it's quite good. The author goes into detail about how he used a series of checksums and source verification to find the bug, isolate it and fix it. I found it quite fascinating and I recommend reading it if you have a few minutes of time.
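For anyone wanting to try the checksum approach on their own files, here is a minimal sketch in Python. It is not the article author's procedure, just the general idea: hash the file in chunks and compare against a digest you trust. The function names and the 1 MiB chunk size are assumptions made for illustration.

```python
import hashlib


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in fixed-size chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    # iter() with a sentinel keeps calling read() until it returns b"" at end of file.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_reference(path, expected_hex):
    """Compare a file's SHA-256 against a trusted reference digest (hex string)."""
    with open(path, "rb") as f:
        return sha256_of_stream(f) == expected_hex.strip().lower()
```

A mismatch only says that the bytes that were read differ from the reference. It does not say whether the disk, the page cache, or the RAM is at fault, which is why the article needed further steps to narrow it down.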
There is no such thing as ECC RAM. The ECC (usually Hamming) is performed by the memory controller. You can't just buy a 72-bit-wide DIMM and use that in any old PC. You have to have a memory controller that supports ECC. It should also be noted that this hurts performance by increasing latency (decoding and encoding the ECC bits) and may also require read-modify-writes.
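To make the Hamming/SECDED idea concrete, here is a toy sketch in Python. It assumes a 4-bit data word protected by a Hamming(7,4) code plus one overall parity bit, which is far smaller than the 64-data-bit words a real memory controller handles in hardware, and the function names are invented for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of bits; index 0 = overall parity)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d     # data bits sit at non-power-of-two positions
    code[1] = code[3] ^ code[5] ^ code[7]      # parity over positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]      # parity over positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]      # parity over positions with bit 2 set
    code[0] = sum(code[1:]) % 2                # overall parity makes the total count of 1s even
    return code


def decode(code):
    """Return (nibble, status) where status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                    # XOR of set positions is 0 for a valid codeword
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        code[syndrome] ^= 1                    # single error; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"           # even parity but nonzero syndrome: two bits flipped
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[6] ^= 1                                   # simulate one flipped bit
print(decode(word))                            # (11, 'corrected')
```

Flipping any two bits of a codeword yields `(None, 'uncorrectable')`, which is the "double-bit detection" half of SECDED.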
-- How many sigs are as useless as this one?
You live below your mother's basement??? (LOL!!) You'd have to, to be well-shielded from cosmic rays. Living in a basement doesn't shield you from rays coming from above. And even so, some rays are so energetic that they'll reach you even if you lived a mile underground in a mine.
I am both shocked and amazed that you eventually broke up.
I feel fantastic, and I'm still alive.
The article author has obviously never used Windows. SOP would be a reboot, which would have solved the problem.
The whole thing would have taken minutes.
Not only that, but they are also systems we can only approach from a very abstract perspective when it comes to debugging. Our options to debug complex hardware are very abstract, inaccurate, and incomplete.
WTF am I doing replying to an AC at 5 A.M on a Friday night?
"What is the sound of one bit flipping?"
Or
"If a disk crashes in a server farm and there's no one there to hear it, does it make a sound?"
This ain't rocket surgery.
Well fear not, it's been a series of upgrades since then. :) My girlfriend now is perfect, I can't imagine a better upgrade from here.
The guy that posted this is a Ksplice developer. In case you didn't know, Ksplice allows you to patch your running kernel without rebooting. Nice.
Anyway, this guy sees a random memory error. He conveniently goes on a debugging rampage, while we all know the most logical first step would be rebooting that damn machine. Random memory errors do happen.
He says he "hasn't gotten around" to memtesting his RAM yet. So, let me get this straight ... he implies that random cosmic rays caused the error, but he hasn't yet tested his RAM, which is the most probable cause of the issue?
Then he goes on to explain that you don't even need to reboot your machine due to damn cosmic radiation. Or kernel updates. Because you have Ksplice.
Come on.
I did read it. I liked the article, actually.
I didn't take into account that he probably never reboots, thereby always using the cached copy.
The Ksplice ad on TFA made me laugh in this case.
Guess we should put echo 3 > /proc/sys/vm/drop_caches in crontab.
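For reference, the same cache drop can be done from a script. This is only a sketch, assuming Linux and root privileges: the `/proc/sys/vm/drop_caches` file and its values 1, 2 and 3 are the kernel's, while the function name and its parameters are made up here (the `path` argument exists only so the function can be tried against an ordinary file).

```python
import os


def drop_page_cache(path="/proc/sys/vm/drop_caches", mode=3):
    """Ask the kernel to drop clean caches: 1 = page cache, 2 = dentries and inodes, 3 = both."""
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    os.sync()  # flush dirty pages first; drop_caches only discards clean ones
    with open(path, "w") as f:
        f.write(f"{mode}\n")
```

Dropping the cache forces the next read of a file to come from disk again, so a copy that was corrupted in RAM is replaced by a fresh one. It does nothing for a module that is actually faulty.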
THL phish sticks
So, here I am with my paltry 2 Gigs of ram in my system drooling over the idea of having some much larger amount, like this fellow's 12 Gigs, and then find out that it's a likely source of errors due to persistent caching of hard drive reads.
Memory failures due to alpha particle switching, one of my faves, or cosmic rays (are we sure we can't get neutrinos in there as well?) were a known evil, but it looks like having the cache overwritten more frequently might be an advantage of having smaller amounts of memory (at least, non-ECC memory).
Now I have to run off and see if my motherboard will accept ECC memory before I go out and buy more memory.
I found ECC RAM was too expensive for my home server..
so does anybody know where I can get a cheap, THICK lead sheet?
On the subject of the imperfect nature of machines, I found this post by Richard D. James (aka Aphex Twin, a noted electronic music composer) quite interesting. He describes how the physical machinery of analog electronic music machines means it is near impossible to duplicate them in digital programs.
link
Author: analord
Date: 02-07-05 03:14
some people bought the analogue equipment when it was unfashionable and very cheap though.
some of us are over 30 you know!
anyone remember when 303`s were £50? and coke was 16p a tin? crisps 5p
also you have overlooked A LOT of other points because its not all about the overall frequency response of the recording system its how the sound gets there in the first place.
here are some things which you can`t get from a plugin,they are often emulated but due to their hugely complex nature are always pretty crass aproximations..
the sound of analogue equpiment including EQ, changes very noticably over even a few hours due to temperature changes within a circuit.
Anyone who has tried to make tracs on a few analogue synths and make them stay in tune can tell you this,you leave a trac running for a few hours come back and think Im sure I didnt fucking write that,I must be going mental!
this affects all the components in a synth/EQ in an almost infinte amount of tiny ways.
and the amount differs from circuit to circuit depending on the design.
the interaction of different channels and their respective signals with an analogue mixer are very complex,EQ,dynamics....
any fx, analogue or digital that are plugged into it all have their own special complex characteristics and all interact with each other differently and change depending on their routing.
Nobody that ive heard of has even begun to start emulating analogue mixer circuitry in software,just the aesthetics,it will come but im sure it will be a crap half hearted effort like most pretend synth plugins are.
they should be called PST synths, P for pretend not virtual.
Every piece of outboard gear has its own sound ,reverbs,modulation effects etc
real room reverb, this in itself companies have spent decades trying to emulate and not even got close in my opinion, even the best attempts like Quantec and EMT only scratch the surface.
analogue EQ is currently impossible in theory to be emulated digitally,quite intense maths shit involed in this if youre really that interested,you could look it up...good luck.
your soundcard will always make things sound like its come from THAT soundcard..they ALL impose their different sound characteristics onto whatever comes out of them they are far from being totally neutral devices.
all the components of a circuit like resistors and capacitors subtley differ from each other depending on their quality but even the most high quality milatary spec ones are never EXACTLY the same.
no two analogue synths can ever be built exactly the same,there are tiny human/automated errors in building the circuits,tweaking the trimpots for example which is usually done manually in a lot of analogue shit.
just compare the sound of 2 808 drum machines next to each other and you will see what I mean,you always thought an 808 was an 808 right?
same goes for 303`s they all sound subltey different,different voltage scaling of the oscillator is usually quite noticable.
VST plugins are restricted by a finite number of calculations per second these factors are WAY beyond their CURRENT capability.
Then there is the question of the physicallity of the instrument this affects the way a human will emotionally interact with it and therfore affect what they will actually do with it! often overlooked from the maths heads,this is probably the biggest factor I think.
for example the smell of analogue stuff as well as the look of it puts y
A few years ago I came across a thread on a FreeBSD mailing list where a build of some package was failing and the submitter couldn't tell why because he wasn't a developer. The failure was unusual and no one else could reproduce it. Eventually, the problem was traced back to a character in the source differing from the original. The character was a one-bit difference from the correct character, and it was suggested to the submitter that he reboot and memtest his memory. Sure enough, one single bad bit out of around 512MB.
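The comparison described in that thread, a local copy against a known-good original, can be sketched in a few lines of Python. This is an illustration under assumptions, not what the FreeBSD list actually ran: both copies are taken to be the same length and already loaded as bytes, and the function name is invented.

```python
def single_bit_diffs(original, suspect):
    """Return (offset, bit) pairs for bytes that differ from the original by exactly one bit."""
    results = []
    for offset, (a, b) in enumerate(zip(original, suspect)):
        diff = a ^ b
        # A nonzero value with diff & (diff - 1) == 0 is a power of two: exactly one bit set.
        if diff and diff & (diff - 1) == 0:
            results.append((offset, diff.bit_length() - 1))
    return results


print(single_bit_diffs(b"SELECT P", b"SELECT Q"))  # [(7, 0)]
```

A lone hit like this, in a file that was never edited, points at hardware rather than at a bad patch or a typo.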
Uh...no. I've got a Dell server from the Ark with PIII chips that demands ECC.
The really impressive thing is that this guy resisted the urge to just reboot his machine. Otherwise, the clues would have vanished and the expr binary would have run again without any issue.
Maybe that's why the first step one takes when something behaves weird on a Windows system is to reboot it...
Actually, he lives in the "oil" reservoir that the Deepwater Horizon hit - so he's got 1 mile of seawater and 2 miles of rock and silt above him to protect him a bit better from cosmic rays.
Now, if he would just stop eating bean burritos, he could save BP a lot of money and public embarrassment.
An eccforme.com site with current ~consumer priced motherboard lists would be a fun project.
Domestic spying is now "Benign Information Gathering"
http://www.jpl.nasa.gov/news/news.cfm?release=2010-151
Mission managers at NASA's Jet Propulsion Laboratory in Pasadena, Calif., had been operating the spacecraft in engineering mode since May 6. They took this action as they traced the source of the pattern shift to the flip of a single bit in the flight data system computer that packages data to transmit back to Earth.
I browse Slashdot at +3, Funny
I sure am glad my OS and hardware can detect and correct memory errors on the fly and disable the DIMMs if need be. I know this is a Linux-fest, but Solaris fault management is pretty awesome. I've seen it detect a failing CPU, evacuate the memory attached to it and disable the CPU without a hiccup.
This signature is a waste of 42 characters
Me, I've upgraded to my heart. (Read comment history.) Best, upgrade, ever. Takes a bit of practice though.
Very interesting article indeed but I wish the author would have included one more detail. Does he believe in tin foil hats? He could only speculate this was caused by a cosmic ray and not a bad memory stick. Instead of running memcheck I recommend he wrap his desktop in tinfoil for a week or so and see if this prevents any further bit flips.
Roman ingots to shield particle detector
http://www.nature.com/news/2010/100415/full/news.2010.186.html
It was me.
Sorry 'bout that.
You live below your mother's basement???
Sure. In his mother's sub-basement.
"I don't care about the Constitution!" --Bill O'Reilly, November 17, 2009
I know, cosmic rays sound so much cooler, but it's far more likely he has some crappy memory and/or his memory refresh timings are too high.
DRAM memory cells have to be refreshed pretty often (anywhere from 7.8usec-12usec), otherwise they become unreliable. If his BIOS has the memory timings set to something obscurely long, it may be there are specific rows/cells on his DRAM modules that are too weak to read after bleeding off a bit of charge. Changing the refresh timing would likely improve the situation, causing the memory to refresh its state more often.
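The 7.8 usec figure can be reproduced with a little arithmetic. The sketch below assumes the commonly quoted DDR parameters, a 64 ms window in which every row must be refreshed and 8192 refresh commands spread over that window. Neither number comes from the comment above, and a given module or BIOS may use different values.

```python
RETENTION_WINDOW_S = 64e-3  # assumption: every row must be refreshed within 64 ms
REFRESH_COMMANDS = 8192     # assumption: refresh commands issued per window


def refresh_interval_us(window_s=RETENTION_WINDOW_S, commands=REFRESH_COMMANDS):
    """Average spacing between refresh commands, in microseconds."""
    return window_s / commands * 1e6


print(f"{refresh_interval_us():.4f} us")  # 7.8125 us
```

So 7.8 usec is the gap between refresh commands, not the time a cell holds its charge. Under these assumptions each individual row is revisited about every 64 ms, and lengthening the interval stretches that window for the weakest cells.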
I shouldn't have spent all my mod points yesterday. I guess my hardware knowledge is obsolete; I had no idea modern HDDs don't store individual bits anymore.
"I don't care about the Constitution!" --Bill O'Reilly, November 17, 2009
With a magma-powered computer?
Electronics are designed well within tolerances for temperature and EM interference. At least, good ones are. Since my fans are broken, I've been running the GPU in my Thinkpad to 107C every day for a few years when I play games. No problems yet.
As someone who hasn't been in school in 30 years, memory loss, sits on the porch with a shotgun hollering at kids, has to call his grandson to install the newfangled Norton Internet Security because you've been screwing around with FPGAs for decades and last used a web browser in 1995, etc
http://esp.cr.usgs.gov/info/lacs/lead.htm
And we used to blame Microsoft engineering team for all the crashes we experienced !!
No, he just lives beneath-
blah blah his mom's fat.
Yep, we had 2 old wavetec audio distortion meters used for calibrating aviation ILS tone frequencies. One was purchased new. The other I picked up at R2D5 surplus in Portland. Both were calibrated by an outside service, and the deviation was about 20% right after calibration. I don't know if we got ripped off, or the meter just wasn't accurate?! The R2D5 box was owned by the FAA before I picked it up (complete with FAA cal stickers on the screws). Analog is just that, ANALOG. FWIW I'm buzzed too.
I wonder if anyone's considered using a large set of networked computers (volunteers) as a gamma ray telescope.
I'm offended that you push idolatry to the next level. The Politically correct response would be
The period^w White space at the end of this sentence represents Mohamed in drag ==> " "
I'm assuming that the quotes don't offend anyone.
RAM upsets at ground level (and in aircraft avionics, for that matter) are primarily caused by neutrons created by cosmic ray decay in the upper atmosphere, through indirect ionization. Galactic Cosmic Rays (heavy ions) are more a concern to satellite designers.
Now I understand why Portland State University's computers were in the sub-basement back in the 70's.
On the other hand, an IBM 1130 and Honeywell H300 built from discrete transistors and core memory would probably have survived a direct nuclear blast.
Lister: Your explanation for anything slightly peculiar is [cosmic rays], isn't it? You lose your keys, it's [cosmic rays]. A [bit] falls off the [RAM], it's [cosmic rays]. That time we used up a whole bog roll in a day, you thought that was [cosmic rays] as well.
Rimmer: Well we didn't use it all, Lister. Who did?
Lister: Rimmer, [COSMIC RAYS] used our bog roll?
Rimmer: Just cause they're [cosmic rays] doesn't mean to say they don't have to visit the little boys' room. Only they probably do something weird and [cosmic ray]-esque, like it comes out of the top of their [waveforms] or something.
"I don't care about the Constitution!" --Bill O'Reilly, November 17, 2009
Most of the memory problems turned out to be power supply or otherwise some kind of power problem. Very rarely is it the memory itself. From my experience, a faulty power supply will first manifest as memory issues, and then gradually increase in severity to affect much more of the system. And from experience, bad power supplies are often a result of dirty (i.e. inconsistent, unstable) power going into the machine, irrespective of any "surge protectors" that may be between the wall and the machine.
Even a bad BIOS battery can throw off something like the system clock and cause issues further down the line.
A few times, I had capacitors on the actual mobo blow out on me (and it's possible some of the PS problems I had were due to faulty capacitors), but that's easily spotted.
"If a nation expects to be ignorant and free in a state of civilization, it expects what never was and never will be."
Except he didn't take into consideration that shrinking the size of computer hardware probably means that a huge program now actually fits into a smaller space in physical reality than an old program did on older hardware. (Any hardware guru want to do the math and find out how much physical area a 100 KB program took back in the 80's vs a 150 MB program now?)
Did you know 80 to 90% of the moderators on slashdot wouldn't recognize a troll even if one dragged them under a bridge.
Although interesting, TFA is without a doubt the most pedantic and roundabout way I've ever read of establishing that your rig is not stable.
From TFSA:
And in fact, since that incident, I've had several other, similar problems. I haven't gotten around to memtesting my machine, but that does suggest I might just have a bad RAM chip on my hands.
Yeah, he has a stuck or semi-stuck bit and an hour or two of his life he won't get back.
In such a circumstance I've found underclocking and overvolting the DIMM might coax it to work again but it's best to RMA or bin it.
After logging in slashdot still does not take you back to the page you were on. It's been that way for 20 years.
Circa 1994 I was transferring all of my data from low-density 5.25" floppies to, gasp, CD-R! The disks were stored in some old huge recipe filing box that just happened to fit about 100 deep.
The following discs failed:
1st, 2nd, 3rd, 5th, 8th, 13th, 21st, 34th, 55th and the 89th.
Only in a long boring backup scenario could anyone ever figure that out.
Working in a lab environment I test DDR3 memory on a daily basis and we run into a lot of failures, from JEDEC violations to blatant byte/word/dword corruption and even single-bit failures. Single-bit failures are by far the worst to debug. Kudos to this guy for tracking it down. I am going to add these debug procedures to my arsenal!
When I encounter a failure, logging all information is of course the first thing I do, but reproducibility is key! With reproducibility, like the article says, you're able to throw as many experiments at it as you can think up. We will run memtest86+ among other tools to gather data on whether the failure reproduces with other tests. In the case we believe it is a DRAM part failure, we will utilize Logic analyzers and Oscilloscopes to determine and prove that the failure is on a specific component.
Sometimes the failures we encounter are DIMM vendor issues, sometimes they are our own, induced by bad in-house memory test software/hardware.
We were debugging a problem showing up in the field. Turns out the developer building the system image had a bad bit and was consistently introducing a bug in every build in the same component. After a very frantic week, we realized we could only reproduce the bug if he touched it. (Poor guy, he probably felt pretty insecure about his abilities at first, even though it turned out not to be him.) We replaced all developers' machines with ECC-capable equipment loaded with ECC memory as soon as possible.
As for ECC's cost: ECC is not available in the same variety of price and performance. ECC that is just slightly more expensive than good quality but average RAM is also about the same performance. You can find really cheap RAM that is half the price of what I would consider "good quality but average RAM", so ECC is considered twice as expensive as "normal [cheap ass] ram".
If I were just going to play games on my computer, or even write up documents I think non-ECC would be perfectly reasonable. As a developer I now realize that debugging software problems that are really just bad hardware is a huge waste of my time and sanity. I'd pay 10x more for a system if it meant I didn't have to do that crap. Luckily a good quality server motherboard and some ECC ram is not too much more expensive than a fast and cheap computer.
Hard drives are another issue. Obviously some sort of RAID (1, 10, Z, etc.) can be great at dealing with day-to-day corruption, and backups are great at dealing with catastrophic events such as drive failures, controller malfunctions, fires, or malevolent software. People often forget that controllers can go berserk when they set up their awesome elite fail-proof RAID. A controller, be it a smart RAID card or a dumb multiport SATA controller, can corrupt the data it copies, write to the wrong disk sectors, and cause numerous other kinds of systematic corruption. It helps to pay for a good quality card, but it helps more to never trust your equipment 100% and have a backup plan.
Luckily when CPUs glitch they usually stop running because of the cascade of interdependent transistor states that can make further execution impossible without a hard reset. CPUs can misbehave when their power supplies are not up to the demands; I use the term power supply in a generic electrical engineer's sense. The big metal fan box with a switching power supply in it is the main component of your CPU's power supply; the other component, equally important to the health of your system, is the voltage converters on the motherboard (the part surrounded by metal capacitors). A bad PSU can weaken your motherboard's circuitry, and weakened motherboard circuitry is more easily damaged. If either fails your CPU can glitch, produce incorrect computations (hard to debug!) or in rare cases cease to function.
“Common sense is not so common.” — Voltaire
This is how vendors keep their market segmentation. ECC supported only on servers. Consumers don't need it, so prevent them from using ECC so server customers can't buy cheap setups with ECC!
Electronics are designed well within tolerances for temperature and EM interference. At least, good ones are. Since my fans are broken, I've been running the GPU in my Thinkpad to 107C every day for a few years when I play games. No problems yet.
No offense intended, but your comment portrays a complete lack of understanding of the subject. It might be best if you sit this one out.
As someone who hasn't been in school in 30 years, memory loss, sits on the porch with a shotgun hollering at kids, has to call his grandson to install the newfangled Norton Internet Security because you've been screwing around with FPGAs for decades and last used a web browser in 1995, etc
Sorry, I can't decipher this gibberish. Like I said, it might be best if you sit this one out.
Across a few thousand DIMMs in my datacenter we tend to lose about 1-2 per year, more than we lose PSUs. Of course we control temperature and humidity, have double-conversion UPSes, and only use ECC systems, so it's kind of an ideal environment for avoiding all but the most serious of problems.
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
You know, there are many, many other spots where a bit can flip and ruin your day outside of RAM, anywhere on the processor for one. The Xeon line can detect and correct a lot more problems than regular desktop chips. There is a name for these types of features: RAS. It is costly, and overlooked by probably most of /. It's the reason Power and SPARC systems cost so much, and mainframes still exist. If you ever wondered what makes a Xeon, Opteron, or Itanium so special, RAS is part of it. Look up Machine Check Exception.
A desktop Athlon is not going to have the same RAS features a Xeon, Itanium, Opteron, etc will. A processor with weak RAS features but ECC RAM is well, a good start I guess, but ghetto. Find out what kind of MCEs an Athlon can live to report.
http://catless.ncl.ac.uk/Risks/23.47.html#subj7
The event occurred in the election held on 18 May 2003. An expert review determined that as no software defects had been found on inspecting the source code and no test had been able to reproduce the error, it was probably attributable to a spontaneous inversion of a bit in the RAM of the PC (no explicit mention of cosmic rays). However the report concluded that even if the voting system under review was not perfect the totality of controls was sufficient to be confident in the overall result. I wonder.
Heck, sometimes I reply before reading the comments
Checking at alternate.de (not the cheapest online shop, but good for comparisons because they have both consumer and server parts):
Intel:
The only Socket 775 boards that support ECC seem to be those with the 32xx MCH chipset. Starting at 195 Euros (Asus P5BV-C).
For Socket 1156, the consumer chipsets allow ECC but you still need to find a board with BIOS support. Sadly Alternate does not list the ECC support status, but you might find one that supports ECC among the cheaper ones for 80-90 Euros. You do, however, need a Xeon which starts at 213 Euros (Xeon X3430, 4 x 2.4 GHz)
So mainboard plus a quad CPU costs you around 300 Euros at Alternate.
AMD:
Board situation (Socket AM3) similar to Intel's Socket 1156, boards with ECC support are available for 80-90 Euros.
Unlike Intel, even cheap desktop CPUs support ECC. As a cheap quad, Alternate offers the Athlon II X4 635 for 108 Euros.
So mainboard plus quad CPU costs you around 200 Euros, 100 less than with Intel.
C - the footgun of programming languages
"I know I'm never buying a desktop without ECC RAM ever again!"
There are still the CPU, the cache, the hard disk, the network, and a ton of buses in-between, where a bit could be flipped.
Unless you add ECC data right at the creation of the data, and pass it through all the way to the end, you can’t be sure of anything.
Any sufficiently advanced intelligence is indistinguishable from stupidity.
Not much to say here, except that it was a wonderful article to read!
He couldn't figure it out, so he attributed the fault to a No See Um. Might as well blame it on goblins, or declare that A Wizard Did It.
If you were blocking sigs, you wouldn't have to read this.
Someday, somewhere, the No-Execute bit will be flipped and it will be exploited by Cowboy Neal and the world will end.
Damn those cosmic rays and cowboy hax0rs.
You must be new here
They all laughed when I put my tinfoil hat on and encased my computer in lead.
But who is laughing now?
Do these two numbers actually differ by one bit? 0x0000000000001a70 0x401a70 It looks to me like a byte is getting zero'd somewhere. Bad ram.
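The question above can be checked mechanically: XOR the two values and list the bit positions that are set in the result. A minimal sketch in Python, where the helper name `differing_bits` is made up for illustration and is not from the article:

```python
def differing_bits(a: int, b: int) -> list[int]:
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b  # XOR leaves a 1 exactly where the two values disagree
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


# The two values quoted in the comment above.
a = 0x0000000000001a70
b = 0x401a70

positions = differing_bits(a, b)
print(positions)               # bit positions that differ
print(hex(a ^ b))              # the difference as a mask
```

The values differ only at bit 22 (mask 0x400000). Since the byte 0x40 has just one bit set, a zeroed byte and a single flipped bit look identical here, so this check alone can't tell the two apart.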
Did this about 8 years ago on a mail server for Windows. It was a multithreaded application with thread trap detection and restart. On error, the thread protection code would generate a disassembly of the current state of the trapped thread and email it back to support. In one case, the disassembly showed a definite single bit error in the ram affecting code. The customer didn't believe it until we showed him the disassembly. The fault went away when they swapped ram.
PURCHASE a tinfoil hat? Are you crazy? You build them yourself..who knows what they include in the prebuilt ones.
... as opposed to a cosmic ray on a leash?
From the article:
"I can't prove this was due to a cosmic ray, or even a hardware error. It could have been some OS bug in my kernel that accidentally did a wild write into my memory in a way that only flipped a single bit. But that'd be a pretty weird bug."
Dude you have a lot of faith.
-paul
But hard drives aren't 5"; they're 3.5" or smaller. Assume 10800 RPM on a high-end 3.5" desktop drive, with the actual platter being slightly smaller in diameter, so 3.0" * Pi * 10800 RPM * 60 min/hr / 63360 in/mi = 96.4 mph.
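That arithmetic is easy to sanity-check. A rough sketch in Python, using the same assumed numbers as above (3.0" effective platter diameter, 10800 RPM); the function name `rim_speed_mph` is just for illustration:

```python
import math

INCHES_PER_MILE = 63360  # 5280 ft/mi * 12 in/ft


def rim_speed_mph(diameter_in: float, rpm: float) -> float:
    """Linear speed at the outer edge of a spinning platter, in miles per hour."""
    inches_per_minute = math.pi * diameter_in * rpm  # circumference * revs per minute
    inches_per_hour = inches_per_minute * 60
    return inches_per_hour / INCHES_PER_MILE


speed = rim_speed_mph(3.0, 10800)
print(round(speed, 1))  # → 96.4
```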
My advice to anyone who ever buys a complete set of memory for a new computer: if you have any problems, just demand that all the sticks be replaced or that you get a full refund. Memory issues are some of the most time-consuming BS wastes of time there are when it comes to computer repair. I could have replaced the memory 4 times over in the amount of time I spent working on it, only to find out it was more than one bad stick in the same batch. I'll never get that time back again :(
Back in the day (~1992), we sold Intel 486 desktops with parity memory. When PCs went to a 64-bit data path (not to be confused with a 64-bit OS), we sold Intel desktops with ECC memory. (I remember seeing an IBM white paper that claimed that ECC memory is more reliable than non-ECC memory by a factor of ~2000.) Then Intel pulled the memory controller inside the CPU, and didn't bother to implement ECC on their line of desktop processors, apparently having decided that nobody on a workstation gives a damn about data integrity. Thanks, Intel!
My absurdly clever ex-housemate was an electronics engineer, and in his spare time would tinker with analog electronics because it was much more 'interesting' than digital.
I’m old enough to remember 16K of memory being described as “whopping”
I agree, but I would start thinking even simpler. My wife and I had all sorts of weird issues with our computers a few years back.
My biggest clues were that the issues all appeared shortly after we moved, and with 2 out of 3 of our machines.
Long story short, after much hair pulling, a decent UPS solved the problem. Our machines were acting weird and random things weren't working because of unclean power, and apparently the PSUs weren't tolerating this all that well.
You can accomplish anything you set your mind to. The impossible just takes a little longer.
107C? that could be used as a hot plate for tea water.
comment first, facts later. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
Actually, if you've taken your computer on any air travel you have greatly increased the chances of damage by cosmic rays.
My company has to take this loss into account every time we ship a load of wafers overseas.
Of course, we're usually shipping hundreds of thousands to millions of die at a time, so we see it a lot more often.
Woohoo!
To be fair, Gates never said that line.
http://en.wikiquote.org/wiki/Bill_Gates#Misattributed
He didn't have to; he designed that principle into his systems so we all had to live with it for the last 35+ years. DOS was limited to 640kiB of RAM, resulting in users needing to move programs to "upper memory" (640kiB-1MiB) or "extended memory" (1MiB+) addresses by tricking the OS once larger memory cards became inexpensive. XP (32 bit) is limited to 3.1GiB, making it pointless to install even 4GiB of RAM in an XP box since nearly 1/4 of it will never be addressed. Microsoft continues to make the same mistake to this day; there's still a memory limit of ~192GiB in Win7 64 Bit. I expect that in about 5 years RAM will be cheap to buy in quantities larger than 192GiB and Microsoft will start looking silly again because we'll have to resort to DOS-era tricks to make it usable.
Code speaks louder than words. I don't care if he said it or not, he wrote it. And his employees continue to re-write it with every Windows release.
"Space Exploration is not endless circles in low earth orbit." -Buzz Aldrin
To be fair, Gates never said that line. http://en.wikiquote.org/wiki/Bill_Gates#Misattributed
Doesn't that just say that Bill Gates says that he never said that line?
It doesn't provide proof that he didn't say it, any more than a defendant in court saying "no, your honor, I did not do that crime." is considered as proof of innocence.
Putting moderation advice in your
I was thinking EMP related to the seismic activity. IIRC that is still somewhat controversial though.
In that it doesn't exist, yes it's controversial.
The only credible papers written about electromagnetism from or prior to earthquakes talk about resistivity changes (which are not emitted EM) or waves. No pulses. (The P in EMP stands for pulse.)
Putting moderation advice in your
Only pedantic fags use that GiB shit.
Only complete idiots don't realize SI exists.
- Michael T. Babcock (Yes, I blog)
He didn't have to; he designed that principle into his systems so we all had to live with it for the last 35+ years. DOS was limited to 640kiB of RAM, resulting in users needing to move programs to "upper memory" (640kiB-1MiB) or "extended memory" (1MiB+) addresses by tricking the OS once larger memory cards became inexpensive.
That limit exists because the 8088 CPU can only address 1MB of RAM, and some memory must be reserved for other hardware devices.
Further, as newer systems became available, with higher limits, OSes were updated or created to take advantage of that - OS/2, Windows 95, Windows NT, etc, can all utilise the additional address space provided with the 286 and then 386.
XP (32 bit) is limited to 3.1GiB, making it pointless to install even 4GiB of RAM in an XP box since nearly 1/4 of it will never be addressed.
That also exists because a 32-bit x86 CPU can only address 4GB of RAM (without resorting to hacks like PAE that are a) typically unstable with consumer-level hardware drivers and b) require special programming to take advantage of). Out of that 4GB, some amount (varying on several factors) must be reserved for other hardware devices - which is why the amount of visible RAM can vary from ~2.5 to ~3.7GB.
Microsoft continues to make the same mistake to this day; there's still a memory limit of ~192GiB in Win7 64 Bit. I expect that in about 5 years RAM will be cheap to buy in quantities larger than 192GiB and Microsoft will start looking silly again because we'll have to resort to DOS-era tricks to make it usable.
Even in *10* years it's unlikely 192GB of RAM in a desktop PC (or even "workstation") will be at all common. Further, in 5 years Windows 7 will have been replaced (and its successor probably be close to replacement at that), or that limit will have been increased. Windows 2008R2 has the same kernel and will address up to 2TB, the limit isn't inherent or architectural.
Besides costing more (due to required extra/wider ram chips), ECC ram is slower.
This is primarily caused by extra read/modify/write cycles done by the memory controller to keep the ECC in sync for short writes. This RMW sequence can cause a fair amount of performance loss (I've seen 8% on a custom application doing a lot of pointer chasing/updating).
Furthermore, as anyone who monitors a lot of servers with ECC will attest, it's really rare to see a soft ECC correction (I've personally never seen one). If there are bit errors being corrected/detected, it's always been a full-blown hardware failure.
I'm not even sure MS even wrote the 640K memory limitation in there at all. I believe it had more to do with IBM's initial shortsightedness and the product MS bought to create DOS, CP/M. The 32 bit limit is a limitation of the hardware being able to address memory.
While working as a failure analysis technician at a company that made a disk controller, I came across a single-bit error in static RAM cache that was repeatable. I was lucky to have the software and hardware tools available and I eventually tracked down the failure mode. Setting a bit at a certain location would cause another, different location's bit to get set. Just that one bit. And only if you set it. Resetting it did not cause the other bit to reset.
This turned out to be a manufacturing problem with a particular run of RAM. I started finding more of these bad parts and could reproduce the failures. I guess what I'm saying is that this could well be a manufacturing defect in the RAM.
Best regards.
That's entirely true. Even the worst software interface couldn't possibly be as difficult and time consuming as programming a song into a real 303.
Hah! Good one! Here's the ultimate "Your mom's so fat..." joke: "Your mom's so fat she blocks neutrinos!"
It has nothing to do with Microsoft at all, it's all because of Intel and IBM. The original 8088 IBM systems could address 1MB of memory, and they decided to reserve the upper 384KB for hardware addressing, leaving 640K of conventional memory for programs. Then when the 286 came out and could address 16MB of memory they decided to use the same memory mapping so they wouldn't break compatibility with older applications. During the DOS days Microsoft was trying to work around the 640K barrier using things like XMS and UMBs.
The 3GB barrier you complained about is because of a similar thing: memory-mapped I/O address space being reserved at the top of the memory area. Under 32-bit Windows they tried using PAE to allow the extra memory to be accessed but it broke a lot of drivers which expected pointers to always be 32 bits in size. Rather than break a ton of drivers they decided to keep the 4 GB limit on Windows XP (I think Windows Server may be able to address the full memory using PAE because it has more stable drivers).
Finally, the 192GB limit in 64-bit Windows is because of overhead involved for the Windows Memory Manager to keep track of pages beyond the 192GB barrier (like requiring larger internal data structures). Instead of having the memory manager waste resources so it can track insane amounts of memory which most people are nowhere near using, they set a limitation.
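The hardware figures above follow directly from address width: n address bits reach 2^n bytes. A small illustrative sketch in Python; the names are made up for this example, and the 384KB reservation is the figure quoted above:

```python
KIB, MIB, GIB = 2**10, 2**20, 2**30


def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with the given number of address bits."""
    return 2 ** address_bits


# Address widths: 8088 = 20 bits, 286 = 24 bits, 32-bit x86 without PAE = 32 bits.
limit_8088 = addressable_bytes(20)
limit_286 = addressable_bytes(24)
limit_x86_32 = addressable_bytes(32)

# IBM reserved the top 384KB of the 8088's 1MB for hardware.
conventional_kib = (limit_8088 - 384 * KIB) // KIB

print(limit_8088 // MIB, "MiB on the 8088")
print(limit_286 // MIB, "MiB on the 286")
print(limit_x86_32 // GIB, "GiB on 32-bit x86")
print(conventional_kib, "KiB of conventional memory")
```

The same subtraction explains the XP case: whatever the devices claim out of the 4GB address space is RAM the OS can't see without something like PAE.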
The clash of honour calls, to stand when others fall.
For those interested, I tracked down a single bit issue on a Windows machine a few months ago and recorded the adventure here: http://analyze-v.com/?p=558 -scott
If you don't see the difference between being unable to predict the explosive growth of computers 35+ years ago and saying flat out "No one ever needs more than X RAM" you're a fucking moron. Sorry.
Great post, but what date was it posted? 02-07-05 could be 2002-07-05, 2002-05-07, 2005-07-02 or 2005-02-07. Please read http://w3.org/QA/Tips/iso-date
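To show the ambiguity concretely, here is a quick Python sketch that parses the same string under four field orderings. The `FORMATS` table and the `possible_dates` helper are made up for illustration; `%y` maps two-digit years 00-68 to 2000-2068.

```python
from datetime import datetime

# Four plausible field orderings for a dd-dd-dd date string.
FORMATS = {
    "year-month-day": "%y-%m-%d",
    "year-day-month": "%y-%d-%m",
    "day-month-year": "%d-%m-%y",
    "month-day-year": "%m-%d-%y",
}


def possible_dates(text: str) -> dict[str, str]:
    """Return every ordering under which text is a valid date, as ISO 8601 strings."""
    results = {}
    for name, fmt in FORMATS.items():
        try:
            results[name] = datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            pass  # not a valid date under this ordering
    return results


readings = possible_dates("02-07-05")
for name, iso in readings.items():
    print(f"{name}: {iso}")
```

All four orderings parse cleanly, which is exactly the problem; the ISO form (YYYY-MM-DD) has only one reading.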
10^24 times more likely the cause.
But they're not as fascinating as wild speculation, are they?
You're all missing the point that Raptor Jesus flipped the bit, because he is angry.
One can find the ECC support information quite quickly, from the BIOS section of the motherboard manual at the latest. Memory manufacturers' recommendation pages are very useful in this respect as well. I mostly buy Kingston's JEDEC memory and used their service to find a nice board for their cheap (at that time) ECC kit.
Thanks for the thoughtful reply! I seem to have rousted some trolls from under their bridges, and I appreciate the time you took to give a polite response.
I like a good argument, though, so I'm going to reply to you and keep this discussion going =)
That limit exists because the 8088 CPU can only address 1MB of RAM, and some memory must be reserved for other hardware devices.
The problem wasn't the limits of the 8088, it's that DOS was written assuming that those hardware limits would always be there. Specifically, instead of checking for memory availability and putting those reserved addresses at the end of addressable memory, DOS instead specified the range between 640kiB and 1MB as reserved. I believe that being forced to live with that poor design choice was the source of the fictitious "no-one will ever . . ." quote.
Further, as newer systems became available, with higher limits, OSes were updated or created to take advantage of that - OS/2, Windows 95, Windows NT, etc, can all utilise the additional address space provided with the 286 and then 386.
Unfortunately, DOS was the order of the day for nearly 20 years; the operating systems you listed were all released in the early-to-mid 90's. Until then we were stuck with DOS and its 16-bit limits even on the 32-bit 386 & 486.
. . . a 32-bit x86 CPU can only address 4GB of RAM (without resorting to hacks like PAE that are a) typically unstable with consumer-level hardware drivers and b) require special programming to take advantage of).
As another poster pointed out, the instability of drivers under PAE is largely due to driver programmers making the "no one will ever . . ." assumption again. I'd argue that Microsoft set the precedent, and the 3rd party developers followed it.
Even in *10* years it's unlikely 192GB of RAM in a desktop PC (or even "workstation") will be at all common. Further, in 5 years Windows 7 will have been replaced . . .
That sounds suspiciously like "no one will ever . . ." to me. Moore's law disagrees with you on RAM availability; 10 years is enough time for 6 or 7 doublings of circuit density, I hope to have 1024 GiB of memory in my desktop by then. The histories of Windows 3, Windows 95, and Windows XP also contradict your "will have been replaced" assertion - Microsoft's strongest historic competitor has been its own obsolete software, including versions that are officially unsupported. I expect that Win7 will still be alive and twitching 10 years from now, having only recently left its official support period.
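The "6 or 7 doublings" figure depends entirely on the doubling period you assume. A back-of-the-envelope sketch in Python; the 18- and 24-month periods and the 8 GiB starting point are assumptions picked for illustration, not numbers from the post above:

```python
def doublings(years: float, months_per_doubling: float) -> float:
    """How many doublings fit into the given span of years."""
    return years * 12 / months_per_doubling


def projected_gib(start_gib: float, n_doublings: float) -> float:
    """Capacity after n doublings from a starting size."""
    return start_gib * 2 ** n_doublings


# Assumed doubling periods: 18 and 24 months. Assumed starting point: 8 GiB.
for period in (18, 24):
    n = doublings(10, period)
    print(f"{period} months per doubling: {n:.1f} doublings, "
          f"about {projected_gib(8, n):.0f} GiB from an 8 GiB start")
```

Seven full doublings from 8 GiB gives exactly 1024 GiB; with a 24-month period you only get five doublings in 10 years, or 256 GiB.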
Regardless, the real issue is that the design of DOS left Microsoft poorly positioned for the transition from 16 to 32 bit hardware. It seems that instead of learning from the users' pain during that upgrade Microsoft continued to use coding practices that left their OS poorly positioned for the 32 to 64 bit upgrade.
There's a good argument to be made that Microsoft shouldn't have to support the installation on Windows on hardware it wasn't designed for; eg. XP shouldn't have been expected to run gracefully on 64 bit systems. The counter-argument to that is that Vista, which began development after 64-bit chips were available on the market, also failed to gracefully bridge the 64-bit divide.
Microsoft should know better. Its developers cannot have been ignorant of Moore's Law, and should have seen the 64-bit transition coming. Despite being staffed with some of the world's smartest programmers Microsoft seems mired in its own legacy of poor initial decisions. Fair or not, justified or not, the perception of those who have watched its history and used other systems without the same frustrations is that Microsoft products are not designed in a future-proof manner. No one will be surp
"Space Exploration is not endless circles in low earth orbit." -Buzz Aldrin
I'm familiar with Day-Month-Year and Month-Day-Year. If putting the year in back confuses someone, they're a moron.
The problem wasn't the limits of the 8088, it's that DOS was written assuming that those hardware limits would always be there. Specifically, instead of checking for memory availability and putting those reserved addresses at the end of addressable memory, DOS instead specified the range between 640kiB and 1MB as reserved. I believe that being forced to live with that poor design choice was the source of the fictitious "no-one will ever . . ." quote.
This is like arguing Linux was badly designed because it couldn't use more than 4GB of RAM in 1991.
Unfortunately, DOS was the order of the day for nearly 20 years; the operating systems you listed were all released in the early-to-mid 90's. Until then we were stuck with DOS and its 16-bit limits even on the 32-bit 386 & 486.
The first version of OS/2 was released in 1987, and Windows/286 and Windows/386 in 1988.
That people kept using DOS, does not mean that other OSes capable of using the protected modes of the 286+ didn't exist.
As another poster pointed out, the instability of drivers under PAE is largely due to driver programmers making the "no one will ever . . ." assumption again. I'd argue that Microsoft set the precedent, and the 3rd party developers followed it.
Microsoft set no such precedent, your basic premise is flawed. The memory limitations of DOS exist because of the fundamental design of the hardware it was designed to run on. Problems with drivers on PAE systems exist because developers didn't bother to test that configuration. These two scenarios are completely different.
That sounds suspiciously like "no one will ever . . ." to me.
It's nothing of the sort. You can buy machines today with more than 192GB of RAM in them, so clearly desktops will have that sort of memory eventually. My argument is that even in 10 years, it's unlikely to be a configuration seen in a consumer desktop.
Moore's law disagrees with you on RAM availability; 10 years is enough time for 6 or 7 doublings of circuit density, I hope to have 1024 GiB of memory in my desktop by then.
I said nothing about availability. As I said, you can already buy systems today with more than that much RAM in them. I expect we'll be able to buy "standard" x86 servers with ~1TB of RAM by the end of next year. However, I don't think that in 10 years a *desktop PC* with 192GB of RAM in it will be common. 10 years ago a high-end desktop PC had a gig of RAM. Five years ago it was 4GB. Today it's 8GB - maybe 16GB - of RAM. Further, there is a point of diminishing returns - the benefits of going from 1GB to 2GB are clear and obvious, even for relatively light users. The difference between 2 and 4GB for most people is minor, and from 4GB to 8GB essentially nonexistent. My predictions are that a typical PC in 5 years will have 8-16GB of RAM, and in 10 years 48-64GB, with high-end machines having 50-100% more.
Also, this is before even getting into the general market shift away from desktop PCs and towards laptops and other mobile devices, which tend to be significantly more limited in RAM capacity purely due to the physical form factor.
The histories of Windows 3, Windows 95, and Windows XP also contradict your "will have been replaced" assertion - Microsoft's strongest historic competitor has been its own obsolete software, including versions that are officially unsupported. I expect that Win7 will still be alive and twitching 10 years from now, having only recently left its official support period.
I'm also sure it will be "alive and twitching". However, it will have had at least two, probably three Service Packs released, and at least one, quite possibly two, successors. The limitations in Windows 7 _today_, are not relevant to system configurations that might exist in a decade, any more than the memory limitations of Linux 1.0 mean my ~150 64-bit Linux servers with 8-32GB+ of RAM don't exist.
I would always suspect a bad memory chip before cosmic rays, since any interaction with them is orders of magnitude less likely. I have been running 3 computers over the last 10 or so years (no, they are not that old) and have had no such errors. In all honesty, one of them is a SuperMicro workstation board with ECC, and one is RDRAM. Also, I never buy "cheap" memory. Having said that, if I thought I had a memory error, I'd let memtest run on it for a day or so just to see if I could fault it before I thought of cosmic rays.
However, thanks for a great article! I learned some stuff, and am pretty sure lots of others did too!
You've probably figured out from my long delay in responding that I don't have a good comeback for that. =)
Thanks for the replies! As I said before, my opinions on this aren't necessarily rational, and you do a good job presenting evidence of that. It's tragic that knowing that I'm wrong doesn't soothe the aches caused by years of Microsoft hate coloring my interpretation of their actions.
Meanwhile, you've earned a fan. Anyone who can doggedly persist at politely correcting someone who's clearly letting anger cloud his reason is worthy of my attention. Perhaps next time we get into a Slashdot back-and-forth I'll have a better position to argue from =)
"Space Exploration is not endless circles in low earth orbit." -Buzz Aldrin