Intel Patents On-Chip Cosmic Ray Detectors
holy_calamity writes "Intel has been awarded a patent for building cosmic ray detectors into chips, to guard against soft errors where a high energy particle from space changes a value in a circuit. It's a problem that largely only affects RAM. As component sizes shrink further, "this problem is projected to become a major limiter of computer reliability in the next decade", says the patent. Intel's solution is to build in a detector that responds to cosmic errors by repeating the latest operation, reloading previous instructions, or rolling back to a previous state. You can also read the full patent."
Actually you can prove cosmic rays cause memory errors. IBM did so in the 90's; there was mention of this (and a link) in the article. As memory cells become smaller they WILL become sensitive to ionizing radiation. Intel seems to think we will get there sometime in the next decade or so.
POWER6 has actually been shipping with this for a while - if an instruction fails (cosmic ray or not, although in terms of random bit-flipping events they account for a large percentage), it gets automatically retried, transparently to the rest of the system. Without this sort of thing you generally take a hard fault - so this type of protection is great to see. Same thing on a SPARC64, incidentally (but not UltraSPARC - i.e. Niagara or children). What sets the POWER6 apart from both SPARC64 and this patent is that if the instruction fails repeatedly (possibly indicating a chip fault), in many cases it can actually back the instruction out of the failing core and slap it onto another core, also transparently and avoiding a hard crash. Someone noted that this has been done on mainframes for years - yup, also true. This is another case of UNIX-class technology making inroads up the platform stack.
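To make the retry-then-migrate idea concrete, here is a rough software sketch of it. This is not how POWER6 does it in hardware; `TransientFault`, `run_with_retry`, `make_flaky` and `bad_core_zero` are all made-up names for illustration, and a "core" is just an integer.

```python
class TransientFault(Exception):
    """Raised when a (simulated) checker flags a bad result."""


def run_with_retry(op, cores, max_retries=2):
    """Try op on each core in turn, retrying up to max_retries times per core.

    Returns (result, core, attempts). Moves to the next core only after the
    current one has failed max_retries + 1 times in a row.
    """
    attempts = 0
    for core in cores:
        for _ in range(max_retries + 1):
            attempts += 1
            try:
                return op(core), core, attempts
            except TransientFault:
                continue  # transparent retry on the same core
    raise RuntimeError("operation failed on every core")


def make_flaky(failures):
    """Hypothetical op that faults `failures` times, then returns 42."""
    remaining = [failures]

    def op(core):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientFault(f"bit flip on core {core}")
        return 42

    return op


def bad_core_zero(core):
    """Hypothetical op that always faults on core 0 (a hard chip fault)."""
    if core == 0:
        raise TransientFault("core 0 keeps failing")
    return 42
```

A one-off fault costs one retry on the same core; a core that keeps failing uses up its retries and the work lands on the next core, with the caller seeing only the final result.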
They didn't. They've created a detector which works out whether the chip was hit by a cosmic ray or not. Then the RAM is somehow restored to the state it was in before the last operation, and that operation is repeated. I'm not even sure that "hit" is the right word: they've developed a detector that is capable of knowing when a cosmic ray travels through the same space as the chip, and I don't know that they care whether the ray actually hit something or just traveled through the open space between the atoms.
It's a lot less likely to cause problems than trying to guess which bit it was, and far less expensive than building a RAIMM(TM) to compensate for it.
For RAM - there is really no problem - just use error checking. It's got to be easier to add an extra couple of bits to the width of your RAM to permit error-correction than to have a cosmic ray detector for every single bit.
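As a toy illustration of what "an extra couple of bits" buys you, here is a Hamming(7,4) code, which adds 3 check bits to 4 data bits and can correct any single flipped bit. Real ECC memory uses wider codes (typically 8 check bits per 64 data bits), so treat this as a sketch of the principle only; the function names are my own.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (int 0..15) as a 7-bit codeword (list, positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 unused, positions 1..7
    code[3], code[5], code[6], code[7] = d     # data bits
    code[1] = code[3] ^ code[5] ^ code[7]      # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]      # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]      # parity over positions 4,5,6,7
    return code[1:]


def hamming74_decode(bits):
    """Return (nibble, syndrome). A non-zero syndrome is the 1-based position
    of the single flipped bit, which is corrected before extracting the data."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                    # XOR of positions of all set bits
    if syndrome:
        code[syndrome] ^= 1                    # flip the faulty bit back
    d = [code[3], code[5], code[6], code[7]]
    nibble = sum(bit << i for i, bit in enumerate(d))
    return nibble, syndrome


# Simulate a strike: flip one bit of a stored word, then read it back.
word = hamming74_encode(0b1011)
word[4] ^= 1                                   # position 5 gets hit
value, where = hamming74_decode(word)          # value is restored, where == 5
```

The memory never needs to know when or where the ray arrived: the check bits locate the damaged bit at read time.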
The tricky problem isn't RAM - it's computational elements. There is no single way to error-correct computational elements because they are so diverse. A multiplier would need different protection to an adder which is different from a shift-register. Hence, the idea of rolling back (say) the last instruction executed and having a "do-over".
But for large arrays of homogeneous circuitry - like RAM - this doesn't seem worth the effort.
Well, if the detector is the size of a penny, then yes, it will probably detect cosmic rays pretty rarely... but if the detector is the size of a PC case, it will get hits every few seconds. Cosmic rays ARE very common, and not all of them are magnetically deflected or stopped by the atmosphere. They just happen to be very small, and the frequency of hits to a small target is lower than to a large target. About 8% of the radiation humans are exposed to each year is from cosmic rays. http://en.wikipedia.org/wiki/Cosmic_ray
So clearly, for a human-sized target, the hit rate is significant.
More for laughs than anything else, I started logging them and found that a server with 16GB got maybe one or two hits per week. After that I started to take ECC seriously - for professional quality servers.
You probably don't need it for the domestic appliance quality stuff that people run at home - but for real work, get some decent kit.
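For a sense of scale, the "one or two hits per week on 16GB" figure can be turned into a per-megabit rate in FIT (failures per billion device-hours). This is back-of-envelope only: it takes 1.5 events per week as the midpoint, treats 16GB as 16 GiB, and assumes the rate scales linearly with memory size, none of which the numbers above actually guarantee.

```python
HOURS_PER_WEEK = 7 * 24
BITS_PER_GIB = 8 * 2**30
BITS_PER_MBIT = 2**20


def fit_per_mbit(events_per_week, memory_gib):
    """Observed error rate as failures per 10^9 hours per Mbit of memory."""
    events_per_hour = events_per_week / HOURS_PER_WEEK
    mbits = memory_gib * BITS_PER_GIB / BITS_PER_MBIT
    return events_per_hour / mbits * 1e9


def expected_events_per_week(fit, memory_gib):
    """Invert the calculation: events per week for a machine of a given size."""
    mbits = memory_gib * BITS_PER_GIB / BITS_PER_MBIT
    return fit * mbits / 1e9 * HOURS_PER_WEEK


server_fit = fit_per_mbit(1.5, 16)                     # the 16GB server above
home_rate = expected_events_per_week(server_fit, 4)    # a 4 GiB home machine
```

Under those assumptions the server works out to roughly 68 FIT per Mbit, and a 4 GiB home box would see a flipped bit about once every two to three weeks.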
With cosmic rays, it's not just "gone". Instead, you get a shower of new energetic particles generated by the collision which compounds the risk of operational errors. The patent specifically mentions alpha particles knocked out of the atoms in the chip by the ray which travel through the circuits causing havoc.
The patent also mentions that the detector may sense side effects of the collision (such as voltage spikes) rather than the ray particle itself. Thus, the damage has already been done by the time the detector sees the event.
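Because detection comes after the damage, the only safe response is to throw away whatever the struck operation produced and redo it from a known-good state. A minimal sketch of that checkpoint-and-replay loop follows; the `execute` function, the `strikes` set and the simulated bit flip are all invented for illustration and are not taken from the patent.

```python
def execute(state, ops, strikes=()):
    """Run ops in order against state (a dict), replaying any op whose first
    attempt coincides with a simulated strike. Returns (state, replays)."""
    pending = set(strikes)            # hypothetical: op indices hit on first try
    replays = 0
    for index, op in enumerate(ops):
        while True:
            checkpoint = dict(state)  # known-good copy taken before the op
            op(state)
            if index in pending:
                pending.discard(index)
                state["acc"] ^= 1 << 5    # simulated bit flip: damage is done
                # Detector fires only now, so roll back and repeat the op.
                state.clear()
                state.update(checkpoint)
                replays += 1
                continue
            break
    return state, replays


def add_one(s):
    s["acc"] += 1


def triple(s):
    s["acc"] *= 3


def add_four(s):
    s["acc"] += 4


clean, _ = execute({"acc": 0}, [add_one, triple, add_four])
struck, replays = execute({"acc": 0}, [add_one, triple, add_four], strikes={1})
```

Both runs end with the same accumulator value; the struck run just pays for one extra execution of the second operation.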