Intel Patents On-Chip Cosmic Ray Detectors
holy_calamity writes "Intel has been awarded a patent for building cosmic ray detectors into chips, to guard against soft errors where a high energy particle from space changes a value in a circuit. It's a problem that largely only affects RAM. As component sizes shrink further, "this problem is projected to become a major limiter of computer reliability in the next decade", says the patent. Intel's solution is to build in a detector that responds to cosmic errors by repeating the latest operation, reloading previous instructions, or rolling back to a previous state. You can also read the full patent."
How did they manage to build a detector that can work out whether the cosmic rays collided with the actual bits (no pun intended) that hold the data? According to the oracle, cosmic rays collide with nuclei in an essentially random way, so there's no way a detector could just see a ray passing through and know whether it was on a collision course. Perhaps they are detecting the pions and other subatomic particles that result from a collision actually occurring? If they've found a way to do that then it sounds fairly ingenious to me and a well-deserved patent.
apterous.org
Cosmic ray detector certainly makes for better marketing hype than ECC.
I know at least four people who REALLY could have used this. Oh well, too late now.
SJW: Someone who has run out of real oppression, and has to fake it.
It won't take long for someone to figure out how to detect the gamma errors and create what amounts to a Geiger counter on laptop computers. If this bill passes http://www.villagevoice.com/news/0803,thompson,78873,2.html, will everyone be required to get a permit for their laptop computers? ;-)
I saw a display in the visitors' center at CERN that detected cosmic rays. A cloud chamber, maybe.
Either way, the... 2m by 2m (IIRC) display would detect cosmic rays about once every 2 seconds. This would mean my PC case is perforated by cosmic rays several times each minute. That's not rare.
All rites reversed 2010
Currently, chips (both computational and memory) are protected against soft errors using multiple methods. There are rad-hardening methods (both hardware and software), and most of the latest research involves error-correcting codes. Simply duplicating the output and comparing can only detect a single-bit error. The more times you duplicate, the more errors you can detect (with n copies, up to n-1), and the maximum number of errors that can be corrected is half that. However, duplication takes a lot of space, so generally other codes such as Hamming or BCH codes are used instead.
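The duplication arithmetic above can be checked with a minimal sketch. It assumes the simplest possible scheme, n stored copies of a single bit with comparison for detection and majority vote for correction; the function names are made up for illustration and do not come from any real library or from the patent.

```python
from collections import Counter

def encode(bit, n):
    """Store n identical copies of one bit (plain duplication)."""
    return [bit] * n

def flip(copies, positions):
    """Return a new list with the copies at the given indices inverted."""
    out = list(copies)
    for p in positions:
        out[p] ^= 1
    return out

def detect(copies):
    """An error is detected whenever the copies disagree with each other."""
    return len(set(copies)) > 1

def correct(copies):
    """Majority vote: right only while fewer than half the copies are flipped."""
    return Counter(copies).most_common(1)[0][0]

n = 5
word = encode(1, n)
for k in range(n + 1):
    damaged = flip(word, range(k))
    # Columns: copies flipped, error detected?, value after majority vote
    print(k, detect(damaged), correct(damaged))
```

With n = 5 the loop shows detection working for 1 to 4 flipped copies and failing only when all 5 flip, while the vote returns the original bit for up to 2 flips. That matches the n-1 and "half that" figures, and it also shows why the scheme is expensive: five times the storage to correct two errors.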
The main problem with relying on codes is that cosmic ray errors cause what are called single event upsets (SEUs), and most codes cannot detect 100% of errors where the Hamming weight of the error (the number of ones in the error vector) is larger than the code was designed for. The problem comes when the SEU manifests itself as a multi-bit fault and the error vector cannot be detected by the code. SEUs are the most common type of error in space applications. See http://www.eas.asu.edu/~holbert/eee460/see.html
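To make the point about error weight concrete, here is a small sketch using a Hamming(7,4) code. That code is picked purely for illustration (the comment does not name a specific code or word size), and the helper names are hypothetical. It shows a weight-1 error being corrected, a weight-2 error being "corrected" into the wrong data, and a weight-3 error producing a zero syndrome, so it goes completely unnoticed.

```python
def syndrome(bits):
    """XOR of the 1-based positions that hold a 1; zero means 'looks valid'."""
    s = 0
    for pos, b in enumerate(bits, start=1):
        if b:
            s ^= pos
    return s

def encode(data):
    """Hamming(7,4): data bits in positions 3, 5, 6, 7; parity in 1, 2, 4."""
    code = [0] * 7
    code[2], code[4], code[5], code[6] = data
    s = syndrome(code)
    for p in (1, 2, 4):          # set parity bits so the syndrome becomes zero
        if s & p:
            code[p - 1] = 1
    return code

def apply_error(code, positions):
    """Flip the bits at the given 1-based positions (the error vector)."""
    out = list(code)
    for p in positions:
        out[p - 1] ^= 1
    return out

def decode(received):
    """Single-error correction: flip the bit the syndrome points at."""
    r = list(received)
    s = syndrome(r)
    if s:
        r[s - 1] ^= 1
    return [r[2], r[4], r[5], r[6]], s

data = [1, 0, 1, 1]
code = encode(data)
for error in ([5], [1, 2], [1, 2, 3]):
    decoded, s = decode(apply_error(code, error))
    # Columns: error weight, syndrome seen by the decoder, data recovered?
    print(len(error), s, decoded == data)
```

The third case is the dangerous one: the error vector is itself a valid codeword, so the decoder has nothing to react to. An independent signal that a strike happened is useful exactly in that situation.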
The contribution of the cosmic ray detector is that if you know a cosmic ray hit at some point in time, you can flush and redo your computation (for computation channels, e.g. microprocessors) or flush that line in memory (for memory channels) in case of SEUs, and that is a pretty big deal.
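The flush-and-redo idea can be sketched in software. This is a toy model only: FakeChip, its scripted strikes, and the read_and_clear_flag method are all invented stand-ins for hardware, not anything taken from the patent or from a real API.

```python
class FakeChip:
    """Hypothetical stand-in for hardware with an on-chip strike detector."""

    def __init__(self, strike_script):
        # strike_script: one boolean per operation, True = a strike occurs
        self._script = iter(strike_script)
        self.strike_flag = False

    def add(self, a, b):
        result = a + b
        if next(self._script, False):
            result ^= 1 << 3          # model the upset as one flipped result bit
            self.strike_flag = True   # the detector latches that a strike occurred
        return result

    def read_and_clear_flag(self):
        flag, self.strike_flag = self.strike_flag, False
        return flag

def add_with_redo(chip, a, b, max_attempts=5):
    """Redo the operation whenever the detector fired while it was running."""
    for attempt in range(1, max_attempts + 1):
        result = chip.add(a, b)
        if not chip.read_and_clear_flag():
            return result, attempt
        # Detector fired: discard the result without checking it, then retry.
    raise RuntimeError("detector fired on every attempt")

chip = FakeChip([True, False])        # strike on the first attempt only
print(add_with_redo(chip, 20, 22))    # prints (42, 2)
```

Note that the retry loop never inspects the result itself. It discards work purely because the detector fired, which is what lets it catch a multi-bit upset that a code would miss, at the cost of sometimes redoing work that was fine.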
Legally obligatory sig : My opinions are my own... etc etc
POWER6 has actually been shipping with this for a while - if an instruction fails (cosmic ray or not, although in terms of random bit-flipping events they account for a large percentage), it gets automatically retried, transparently to the rest of the system. Without this sort of thing you generally take a hard fault - so this type of protection is great to see. Same thing on a SPARC64, incidentally (but not UltraSPARC - ie Niagara or children). What sets the POWER6 apart from both SPARC64 and this patent is that if the instruction fails repeatedly (possibly indicating a chip fault), in many cases it can actually back the instruction out of the failing core and slap it onto another core, also transparently and avoiding a hard crash. Someone noted that this has been done on mainframes for years - yup, also true. This is another case of UNIX-class technology making inroads up the platform stack.
For RAM - there is really no problem - just use error checking. It's got to be easier to add an extra couple of bits to the width of your RAM to permit error-correction than to have a cosmic ray detector for every single bit.
The tricky problem isn't RAM - it's computational elements. There is no single way to error-correct computational elements because they are so diverse. A multiplier would need different protection to an adder which is different from a shift-register. Hence, the idea of rolling back (say) the last instruction executed and having a "do-over".
But for large arrays of homogeneous circuitry - like RAM - this doesn't seem worth the effort.
www.sjbaker.org
... Just mount the chips in a vertical fashion. I work in an X-ray crystallography lab and we have a large format CCD detector. It's maybe about half a foot in diameter, but because it is mounted vertically, I see a cosmic ray streak maybe once every 200 or so 40 second exposures. Compare that to a cosmic ray detector of roughly the same size which is mounted horizontally in the other side of the building. It's counting cosmic rays almost constantly.
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0 is the magic number.
it seems painfully inefficient to 'redo' stuff that doesn't seem to be wrong just because a cosmic ray was detected.
1) The likelihood of a cosmic ray is ridiculously small. So small in fact that the cost of rewinding progress when they are detected would be completely unnoticeable.
2) We *do* have the ability to package CPUs such that they are protected from cosmic rays. The problem is that the packages are so large and expensive that no one would buy them given the current probability of soft errors.
So the solution is most definitely NOT to stop shrinking transistors. Even in 10 process technology generations, the mean time to a soft error actually affecting a bit on a CPU is something like 1 million hours. Never mind whether or not that particular soft error is critical.
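For a sense of scale, here is a back-of-the-envelope sketch of what a mean time of 1 million hours works out to. It assumes errors arrive at a constant rate (an exponential model, which is my assumption and not something the comment states), and the fleet size of 10,000 CPUs is a hypothetical number chosen only to illustrate scaling.

```python
import math

MTTF_HOURS = 1_000_000        # figure quoted in the comment above
HOURS_PER_YEAR = 24 * 365     # 8760, ignoring leap years

def p_at_least_one(hours, mttf=MTTF_HOURS):
    """Chance of one or more soft errors in `hours`, assuming a constant rate."""
    return 1 - math.exp(-hours / mttf)

def expected_errors(n_cpus, hours, mttf=MTTF_HOURS):
    """Expected number of soft errors across n_cpus independent CPUs."""
    return n_cpus * hours / mttf

mttf_years = MTTF_HOURS / HOURS_PER_YEAR
print(f"MTTF in years: {mttf_years:.0f}")                              # about 114
print(f"P(error within 1 year, one CPU): {p_at_least_one(HOURS_PER_YEAR):.4f}")
print(f"Expected errors per year, 10,000 CPUs: {expected_errors(10_000, HOURS_PER_YEAR):.1f}")
```

Under those assumptions a single CPU sees well under a 1% chance of a soft error per year, but a large installation would expect dozens per year, which is why the same figure can look negligible on a desktop and significant in a data center.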
The laws of probability forbid it!
well, if the detector is the size of a penny, then yes probably pretty rare to detect cosmic rays... but if the detector is the size of a pc case, it will get hits every few seconds. cosmic rays ARE very common, and not all of them are magnetically deflected, or stopped by the atmosphere. they just happen to be very small, and the frequency of hits to a small target is less than to a large target. about 8% of the radiation humans are exposed to each year is from cosmic rays. http://en.wikipedia.org/wiki/Cosmic_ray
so clearly to a human sized target, the impact ratio is significant.
https://www.gnu.org/philosophy/free-sw.html
More for laughs than anything else, I started logging them and found that a server with 16GB got maybe one or two hits per week. After that I started to take ECC seriously - for professional quality servers.
You probably don't need it for the domestic appliance quality stuff that people run at home - but for real work, get some decent kit.
politicians are like babies' nappies: they should both be changed regularly and for the same reasons
Tin foil hats, for RAM!