Serious Computer Glitches Can Be Caused By Cosmic Rays (computerworld.com)
The Los Alamos National Lab wrote in 2012 that "For over 20 years the military, the commercial aerospace industry, and the computer industry have known that high-energy neutrons streaming through our atmosphere can cause computer errors." Now an anonymous reader quotes Computerworld:
When your computer crashes or phone freezes, don't be so quick to blame the manufacturer. Cosmic rays -- or rather the electrically charged particles they generate -- may be your real foe. While harmless to living organisms, a small number of these particles have enough energy to interfere with the operation of the microelectronic circuitry in our personal devices... particles alter an individual bit of data stored in a chip's memory. Consequences can be as trivial as altering a single pixel in a photograph or as serious as bringing down a passenger jet.
A "single-event upset" was also blamed for an electronic voting error in Schaerbeek, Belgium, back in 2003. A bit flip in the electronic voting machine added 4,096 extra votes to one candidate. The issue was noticed only because the machine gave the candidate more votes than were possible. "This is a really big problem, but it is mostly invisible to the public," said Bharat Bhuva. Bhuva is a member of Vanderbilt University's Radiation Effects Research Group, established in 1987 to study the effects of radiation on electronic systems.
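The 4,096 figure is itself a clue: it is 2^12, exactly what a counter gains when bit 12 flips from 0 to 1. A minimal Python sketch of that arithmetic follows; the starting tally of 514 and the helper name flip_bit are made up for illustration and are not taken from the Schaerbeek incident.

```python
def flip_bit(value: int, position: int) -> int:
    """Return value with the single bit at `position` inverted."""
    return value ^ (1 << position)


# Hypothetical tally, NOT the real Schaerbeek count.
true_count = 514

# Simulate a single-event upset hitting bit 12 of the stored counter.
corrupted = flip_bit(true_count, 12)

print(corrupted - true_count)  # 4096, i.e. 2**12
```

Flipping the same bit a second time restores the original value, which is why a one-bit error is cheap to repair once it has been located.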
Cisco has been researching cosmic radiation since 2001, and in September briefly cited cosmic rays as a possible explanation for partial data losses that customers were experiencing with their ASR 9000 routers.
Whenever a user calls up to ask why his computer rebooted after I installed an update, I say... drumroll, please... gamma radiation.
When your computer crashes or phone freezes, don't be so quick to blame the manufacturer.
If my computer crashes or phone freezes, it's almost certainly the fault of the person who released the software without properly debugging it. Cosmic rays are very low on the list of reasons why your device has malfunctioned.
Anons need not reply. Questions end with a question mark.
There's something you can do about it. It's very easy, but you won't like it.
Make every component in triplicate. Everything in the CPU, everything in the RAM, everything in storage, etc. If the three aren't equal, go with the value shared by two of them and rewrite the different one with that value.
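For anyone who wants to see the voting step spelled out, here is a rough Python sketch. It assumes the three copies are plain integers and votes bit by bit, which is a slight generalization of "go with the value shared by two". The names majority and vote_and_repair are invented for this example, and real hardware would do this in logic gates, not software.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority: each result bit is the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def vote_and_repair(copies: list) -> int:
    """Vote across three redundant copies and overwrite all of them with the winner.

    `copies` is a list of exactly three ints and is modified in place.
    """
    voted = majority(copies[0], copies[1], copies[2])
    for i in range(3):
        copies[i] = voted  # rewrite any copy that disagreed
    return voted


# One copy has suffered a bit flip (bit 2 went from 0 to 1).
copies = [0b1010, 0b1010, 0b1110]
print(bin(vote_and_repair(copies)))  # 0b1010
print(copies)                        # [10, 10, 10]
```

This only masks a fault in one copy at a time. If two copies are hit in the same bit before the repair runs, the vote picks the wrong value.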
Not only is this not actually all that easy (all of your triplicate systems have to be clocked together in sync, you need a shitload of extra hardware to do the comparison, etc.), it's grossly unnecessary. Standard off-the-shelf error detection and correction can (and routinely does) handle radiation-induced errors. It just costs a bit more, because it's a business-level feature. It doesn't matter if that MP3 of Taylor Swift gets mildly corrupted (might even sound better that way, zing), but it very much *does* matter if that bank account gets a flipped bit.
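To give a feel for how error correction gets away with far less than 3x overhead, here is a toy Hamming(7,4) code in Python: 4 data bits plus 3 parity bits, enough to locate and fix any single flipped bit. This is only a sketch of the general idea. Real ECC memory typically uses wider codes that also detect double-bit errors, and the function names encode and decode are made up for the example.

```python
def encode(data_bits):
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct up to one flipped bit; return (data_bits, error_position).

    error_position is 0 when no error was seen, otherwise the 1-based
    position of the bit that was corrected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # the syndrome is the position of the bad bit
    return [c[2], c[4], c[5], c[6]], syndrome


word = encode([1, 0, 1, 1])
word[4] ^= 1                 # simulate a single-event upset at position 5
print(decode(word))          # ([1, 0, 1, 1], 5)
```

Here 3 check bits protect 4 data bits. The overhead shrinks as the word gets wider, which is why this approach is so much cheaper than triplicating everything.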
"None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
You know that several FPGA manufacturers offer this. Xilinx offers a method where this is done in software - when you do design synthesis, more than triple the gates are needed for every circuit allocated in the design. (I think it's done at a higher level - truth tables with the triple redundant bits are generated)
Some do it in hardware, so your design synthesis is the same but the actual software-programmable subunits use triple redundancy.
ECC memory doesn't do anything to help when the bits that get flipped are in the CPU. Or anywhere else that isn't a RAM chip.
Except that RAM has hundreds or thousands of times as many bits as a CPU, Flash may have millions of times as many, and dynamic RAM has a smaller feature size and is more susceptible to SEUs. So correcting RAM and Flash helps, because that is where 99.9% of the problem is.
Even within the CPU, most transistors are used to implement cache, and cache can also be scrubbed (although not with just software).