Slashdot Mirror


Serious Computer Glitches Can Be Caused By Cosmic Rays (computerworld.com)

The Los Alamos National Lab wrote in 2012 that "For over 20 years the military, the commercial aerospace industry, and the computer industry have known that high-energy neutrons streaming through our atmosphere can cause computer errors." Now an anonymous reader quotes Computerworld: When your computer crashes or phone freezes, don't be so quick to blame the manufacturer. Cosmic rays -- or rather the electrically charged particles they generate -- may be your real foe. While harmless to living organisms, a small number of these particles have enough energy to interfere with the operation of the microelectronic circuitry in our personal devices... particles alter an individual bit of data stored in a chip's memory. Consequences can be as trivial as altering a single pixel in a photograph or as serious as bringing down a passenger jet.

A "single-event upset" was also blamed for an electronic voting error in Schaerbeek, Belgium, back in 2003. A bit flip in the electronic voting machine added 4,096 extra votes to one candidate. The issue was noticed only because the machine gave the candidate more votes than were possible. "This is a really big problem, but it is mostly invisible to the public," said Bharat Bhuva. Bhuva is a member of Vanderbilt University's Radiation Effects Research Group, established in 1987 to study the effects of radiation on electronic systems.
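
The figure 4,096 is 2^12, which is what a single flipped bit produces when bit 12 of a binary counter goes from 0 to 1. A minimal Python sketch of that arithmetic follows. The starting tally of 514 and the helper name flip_bit are made up for illustration and are not taken from any report on the incident.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with the given bit position inverted (XOR with a one-bit mask)."""
    return value ^ (1 << bit)


true_count = 514                      # hypothetical tally, not the real figure
corrupted = flip_bit(true_count, 12)  # bit 12 was 0, so the flip adds 2**12
print(corrupted - true_count)         # 4096
```

Had bit 12 already been set, the same flip would have subtracted 4,096 instead, which is part of why such upsets are hard to spot unless the result is impossible.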

Cisco has been researching cosmic radiation since 2001, and in September briefly cited cosmic rays as a possible explanation for partial data losses that customers were experiencing with their ASR 9000 routers.

9 of 264 comments

  1. Why not blame the manufacturer? by Anonymous Coward · · Score: 1, Informative

    When your computer crashes or phone freezes, don't be so quick to blame the manufacturer.

    Why not? According to the article, it is a well-known phenomenon:

    For over 20 years the military, the commercial aerospace industry, and the computer industry have known that high-energy neutrons streaming through our atmosphere can cause computer errors.

    So if it is a well-known problem, and manufacturers are ignoring the problem and creating devices susceptible to such interference, why can I not blame the manufacturer for making hardware with known problems? I would blame the manufacturer if a hearing aid was picking up local radio stations, so why not here?

    1. Re:Why not blame the manufacturer? by ShanghaiBill · · Score: 4, Informative

      Probably b'cos there is nothing that manufacturers can do about cosmic rays

      Except that is not true. Electronic devices can be made more resistant to cosmic rays and other radiation. The easiest way to do so is to use depleted boron instead of "normal" boron as a semiconductor dopant. Boron-10 has a very high neutron absorption cross section, while boron-11 has a very small one. Use boron that has been "depleted" of the boron-10 isotope, and you cut way down on your neutron-induced SEUs.

      Another obvious countermeasure is to use ECC memory, and memory scrubbing.

      The problem is not that there is nothing that manufacturers can do, but that consumers aren't willing to pay the extra cost. Would you be willing to pay an extra $100 for your phone if it meant one fewer reboot every decade or so?

    2. Re:Why not blame the manufacturer? by ShooterNeo · · Score: 5, Informative

      You know that several FPGA manufacturers offer this. Xilinx offers a method where this is done in software - when you do design synthesis, more than triple the gates are needed for every circuit allocated in the design. (I think it's done at a higher level - truth tables with the triple redundant bits are generated)

      Some do it in hardware, so your design synthesis is the same but the actual software-programmable subunits use triple redundancy.
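
      The core of triple redundancy is a 2-of-3 majority voter: keep three copies of each value and let any two outvote a corrupted third. The Python sketch below only illustrates that idea in software; it is not how Xilinx's tools or any particular FPGA implement it, and the name majority is an assumption.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 vote: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


good = 0b1011_0010
upset = good ^ (1 << 5)               # one copy suffers a single-bit upset
print(majority(good, upset, good) == good)   # True
```

      The vote only fails when two copies are wrong in the same bit position, which is why the scheme costs roughly three times the logic plus the voters.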

    3. Re: Why not blame the manufacturer? by ShanghaiBill · · Score: 5, Informative

      ECC memory doesn't do anything to help when the bits that get flipped are in the CPU. Or anywhere else that isn't a RAM chip.

      Except that the RAM has hundreds or thousands of times as many bits as a CPU, and Flash may have millions of times as many, and dynamic RAM has a smaller feature size and is more susceptible to SEUs. So correcting RAM and Flash helps because that is where 99.9% of the problem is.

      Even within the CPU, most transistors are used to implement cache, and cache can also be scrubbed (although not with just software).

    4. Re:Why not blame the manufacturer? by unrtst · · Score: 4, Informative

      Another obvious countermeasure is to use ECC memory ...

      The problem is not that there is nothing that manufacturers can do, but that consumers aren't willing to pay the extra cost. Would you be willing to pay an extra $100 for your phone ...

      ECC memory is not that much more expensive. It's been a few years since I built the desktop I'm using, but I included 16 GB of ECC memory (4x 4 GB DDR3 ECC KVR1333D3E9SK2/8G). At the time, I think it was around $60. The equivalent normal memory was only a couple of bucks cheaper. If Samsung started using ECC memory in all their phones, the cost would be nearly the same with the volume they would be ordering/making.

      FWIW, I did try to do the same comparison just now on newegg and, while it's a bit of a mess, the situation is nearly the same today:
      $34 : Kingston 4GB 240-Pin DDR3 SDRAM ECC Unbuffered DDR3 1333 Server Memory Model KVR13LE9S8/4
      $52 : Kingston 8GB (2 x 4GB) 240-Pin DDR3 SDRAM DDR3 1600 (PC3 12800) Memory Model KVR16N11S8K2/8

      More expensive? Yes.
      $100 more? Nowhere near that much.

    5. Re:Why not blame the manufacturer? by dgatwood · · Score: 3, Informative

      Adding one ECC bit per byte, yes. Adding one parity bit, no. ECC != parity.
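
      To make the ECC-versus-parity distinction concrete: a single parity bit can say that something changed, but not which bit, and it misses an even number of flips entirely. A small Python sketch, with illustrative helper names (parity_bit, parity_ok) and an arbitrary data byte:

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit: 1 if the byte holds an odd number of 1 bits, else 0."""
    return bin(byte & 0xFF).count("1") % 2


def parity_ok(byte: int, stored_parity: int) -> bool:
    """True if the byte still matches the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity


data = 0b0110_1001
p = parity_bit(data)                  # stored next to the byte
one_flip = data ^ 0b0000_0100         # single-bit upset
two_flips = data ^ 0b0001_0100        # two bits upset in the same byte
print(parity_ok(data, p))             # True
print(parity_ok(one_flip, p))         # False: detected, but the bad bit is unknown
print(parity_ok(two_flips, p))        # True: the double flip goes unnoticed
```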

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

  2. preposterous! by Gravis+Zero · · Score: 5, Informative

    When your computer crashes or phone freezes, don't be so quick to blame the manufacturer.

    If my computer crashes or phone freezes, it's almost certainly the fault of the person who released the software without properly debugging it. Cosmic rays are very low on the list of reasons why your device has malfunctioned.

    --
    Anons need not reply. Questions end with a question mark.
  3. Re:@Intel: Why no ECC for consumer-grade processor by drinkypoo · · Score: 4, Informative

    Actually, wouldn't cosmic rays be capable of flipping bits even in ECC memory and processors, thereby making the whole ECC thing useless?

    No, this is what ECC is for. If a bit is flipped, you can detect it. If you have enough parity bits, you can even detect which bit is flipped, and correct it on the fly. Computation occurs as normal and an error shows up in the syslog.
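
    How "enough parity bits" can pinpoint the flipped bit is easiest to see in the textbook Hamming(7,4) code, sketched below in Python. This is an illustration only: real ECC memory typically protects wider words with a larger code, and the function names here are assumptions.

```python
def hamming74_encode(d: list) -> list:
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                 # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4                 # checks positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4                 # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(code: list) -> tuple:
    """Return (data bits, error position); position 0 means no error was found."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3        # syndrome spells out the 1-based bad position
    if pos:
        c[pos - 1] ^= 1               # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], pos


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                          # simulate a single-event upset at position 5
data, pos = hamming74_correct(word)
print(data, pos)                      # [1, 0, 1, 1] 5
```

    This code corrects any single flipped bit per codeword; two flips in the same codeword would be miscorrected, which is why practical schemes add an extra check bit to at least detect double errors.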

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  4. Re:We've always known this. This is why we have EC by kriston · · Score: 3, Informative

    It's also why systems on spacecraft such as the Space Shuttle had what's called the Data Processing System. It consisted of four systems with identical software and an extra one with the same hardware but a different implementation with the same goals. They checked each other's decisions, and a majority "vote" would lock out the differing system.
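
    The vote-and-lock-out idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Shuttle's actual flight logic: the unit names, the vote function, and the locked_out set are all hypothetical.

```python
from collections import Counter


def vote(outputs: dict, locked_out: set):
    """Majority vote among active units; dissenting units are added to locked_out."""
    active = {name: out for name, out in outputs.items() if name not in locked_out}
    if not active:
        raise RuntimeError("no active units left")
    winner, count = Counter(active.values()).most_common(1)[0]
    if count <= len(active) // 2:
        raise RuntimeError("no majority among active units")
    locked_out.update(name for name, out in active.items() if out != winner)
    return winner


locked = set()
print(vote({"A": 42, "B": 42, "C": 41, "D": 42}, locked))   # 42
print(sorted(locked))                                        # ['C']
```

    Once a unit is locked out, later votes ignore it, so a second fault can still be outvoted by the remaining units.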

    --

    Kriston