Slashdot Mirror


Serious Computer Glitches Can Be Caused By Cosmic Rays (computerworld.com)

The Los Alamos National Lab wrote in 2012 that "For over 20 years the military, the commercial aerospace industry, and the computer industry have known that high-energy neutrons streaming through our atmosphere can cause computer errors." Now an anonymous reader quotes Computerworld: When your computer crashes or phone freezes, don't be so quick to blame the manufacturer. Cosmic rays -- or rather the electrically charged particles they generate -- may be your real foe. While harmless to living organisms, a small number of these particles have enough energy to interfere with the operation of the microelectronic circuitry in our personal devices... particles alter an individual bit of data stored in a chip's memory. Consequences can be as trivial as altering a single pixel in a photograph or as serious as bringing down a passenger jet.

A "single-event upset" was also blamed for an electronic voting error in Schaerbeek, Belgium, back in 2003. A bit flip in the electronic voting machine added 4,096 extra votes to one candidate. The issue was noticed only because the machine gave the candidate more votes than were possible. "This is a really big problem, but it is mostly invisible to the public," said Bharat Bhuva. Bhuva is a member of Vanderbilt University's Radiation Effects Research Group, established in 1987 to study the effects of radiation on electronic systems.
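
The figure of 4,096 is exactly 2**12, which is what a single inverted bit at position 12 of a stored counter would add. A minimal sketch of that arithmetic follows; the tally of 514 and the helper name `flip_bit` are made up for illustration, since the article does not give the real numbers or the machine's data layout.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted (bit 0 is the least significant)."""
    return value ^ (1 << bit)


true_votes = 514                       # hypothetical tally, not from the article
corrupted = flip_bit(true_votes, 12)   # bit 12 carries a weight of 2**12 = 4096

print(corrupted - true_votes)          # difference introduced by the single flip
```

Flipping the same bit a second time restores the original value, which is why a lone upset of this kind looks like a clean power-of-two jump rather than random garbage.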

Cisco has been researching cosmic radiation since 2001, and in September briefly cited cosmic rays as a possible explanation for partial data losses that customers were experiencing with their ASR 9000 routers.

30 of 264 comments

  1. This is news...? by __aaclcg7560 · · Score: 5, Funny

    Whenever a user calls up to ask why his computer rebooted after I install an update, I say... drumroll, please... gamma radiation.

    1. Re:This is news...? by arglebargle_xiv · · Score: 5, Funny

      A bit flip in the electronic voting machine added 4,096 extra votes to one candidate. The issue was noticed only because the machine gave the candidate more votes than were possible.

      How could they tell this apart from standard operations on a Diebold machine?

    2. Re:This is news...? by geekmux · · Score: 5, Funny

      Whenever a user calls up to ask why his computer rebooted after I install an update, I say... drumroll, please... gamma radiation.

      Computers and Incredible Hulks don't interface well together, but a Ctrl-Alt-SMASH sequence? I'd buy that.

  2. That is why Excel crashes all the time on OSX by thesjaakspoiler · · Score: 3, Insightful

    I was convinced that it was a lousy programming job by Microsoft, which pays more attention to fancy UX components than to stability. I am waiting for the confirmation that Excel searching every known (network) drive for a license whenever it can't connect to the online subscription service, for every operation, must be due to black matter. Unless it crashes when it tries to display that warning message, then it's just some cosmic ray again. So relieved!

  3. ECC by Bruce+Perens · · Score: 4, Insightful

    This is why ECC is used to protect memory and data busses. At least on the good stuff :-) . One of the issues is die shrink. As the minimum detail size of the IC process gets smaller, the potential for radiation to flip a bit gets higher.

    Silicon-on-sapphire is the main way to implement silicon-on-insulator, which is more protective of radiation bit flips and less likely to latch-up. But since these have historically been required only for space satellites, they have been horribly expensive. Imagine running an entire IC fabrication just to make a few chips. As there are more applications for rad-hard chips, the price could fall.

  4. Re:@Intel: Why no ECC for consumer-grade processor by unixisc · · Score: 2

    Actually, wouldn't cosmic rays be capable of flipping bits even in ECC memory and processors, thereby making the whole ECC thing useless? Particularly in more recent process nodes, where the lithography scale is approaching atoms, and where cosmic rays would have a far greater effect?

  5. Re:Why not blame the manufacturer? by DontBeAMoran · · Score: 3, Insightful

    There's something you can do about it. It's very easy, but you won't like it.

    Make every component in triplicate. Everything in the CPU, everything in the RAM, everything in storage, etc. If the three aren't equal, go with the value shared by two of them and rewrite the different one with that value.
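
    A minimal sketch of that two-out-of-three vote, in Python purely for illustration. Real triple redundancy is done in hardware, and the names `majority_vote` and `read_and_repair` are hypothetical. The vote here is taken bit by bit, which is an assumption: it agrees with "the value shared by two of them" whenever two copies match, and still produces an answer if all three differ.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each result bit is the value at least two copies hold."""
    return (a & b) | (a & c) | (b & c)


def read_and_repair(copies: list[int]) -> int:
    """Vote across three stored copies, then rewrite every copy with the winner."""
    if len(copies) != 3:
        raise ValueError("expected exactly three copies")
    winner = majority_vote(copies[0], copies[1], copies[2])
    copies[:] = [winner] * 3   # overwrite the odd one out in place
    return winner


stored = [0b1011, 0b1011, 0b0011]   # third copy has its top bit flipped
value = read_and_repair(stored)
print(bin(value), stored)
```

    After the call, all three entries of `stored` hold the voted value again, so a later single flip can still be outvoted.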

    --
    #DeleteFacebook
  6. preposterous! by Gravis+Zero · · Score: 5, Informative

    When your computer crashes or phone freezes, don't be so quick to blame the manufacturer.

    If my computer crashes or phone freezes, it's almost certainly the fault of the person who released the software without properly debugging it. Cosmic rays are very low on the list of reasons why your device has malfunctioned.

    --
    Anons need not reply. Questions end with a question mark.
    1. Re:preposterous! by craXORjack · · Score: 4, Funny

      Some pieces of software are just the recipients of more cosmic rays than others. For example, Windows 3.1 used to attract ultra high energy cosmic rays from as far away as Mars and for a time was making astronomers' lives difficult due to the showers of particles released when many of those rays would strike molecules in the atmosphere instead of the Microsoft copyrighted code they were aiming for. Other software that attracts higher than normal numbers of cosmic rays are the Therac-25 and Diebold voting machines.

      --
      Liberals call everyone Nazis yet they are the closest thing to it.
  7. NOT bringing down a passenger jet by Alain+Williams · · Score: 2

    Follow through the links: a cosmic ray caused problems, the jets misbehaved for a bit but the duplicated systems protected them from a crash - as they are supposed to after a malfunction.

  8. Re:@Intel: Why no ECC for consumer-grade processor by drinkypoo · · Score: 4, Informative

    Actually, wouldn't cosmic rays be capable of flipping bits even in ECC memory and processors, thereby making the whole ECC thing useless?

    No, this is what ECC is for. If a bit is flipped, you can detect it. If you have enough parity bits, you can even detect which bit is flipped, and correct it on the fly. Computation occurs as normal and an error shows up in the syslog.
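
    To make "detect which bit is flipped" concrete, here is a small sketch using the textbook Hamming(7,4) code: three parity bits protect four data bits, and the recomputed parity (the syndrome) spells out the position of a single flipped bit. This is an illustration only. Actual ECC memory uses wider codes implemented in hardware, and the function names below are hypothetical.

```python
def hamming74_encode(data: list[int]) -> list[int]:
    """Encode 4 data bits as 7 bits laid out as p1 p2 d1 p3 d2 d3 d4 (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(word: list[int]) -> tuple[list[int], int]:
    """Return (data bits, error position); position is 1..7, or 0 if no error was seen."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # syndrome read as a binary number
    if position:
        c[position - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                          # simulate a single-event upset at position 5
print(hamming74_correct(word))
```

    The returned position is the kind of detail a memory controller could report to the system log while the corrected data is handed on as if nothing happened. Two flips in the same word would defeat this particular code, which is why practical ECC adds an extra parity bit to at least detect double errors.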

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  9. Odds by JBMcB · · Score: 4, Insightful

    The odds of a cosmic ray hitting your memory at the exact right spot to flip a bit are one in hundreds of millions. There are just enough computers out there that it happens from time to time. The odds of FIVE rays hitting just the right locations to flip four bits and a parity bit are, pardon the pun, astronomical.

    --
    My Other Computer Is A Data General Nova III.
  10. Re:Why not blame the manufacturer? by Baloroth · · Score: 5, Interesting

    There's something you can do about it. It's very easy, but you won't like it.

    Make every component in triplicate. Everything in the CPU, everything in the RAM, everything in storage, etc. If the three aren't equal, go with the value shared by two of them and rewrite the different one with that value.

    Not only is this not actually all that easy (all of your triplicate systems have to be clocked together in sync, you need a shitload of extra hardware to do the comparison, etc.) it's grossly unnecessary. Standard off-the-shelf error detection and correction can (and routinely does) handle radiation induced errors. It just costs a bit more, because it's a business-level feature. It doesn't matter if that MP3 of Taylor Swift gets mildly corrupted (might even sound better that way, zing), but it very much *does* matter if that bank account gets a flipped bit.

    --
    "None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
  11. Re: ECC by ewanm89 · · Score: 2

    We are already there:
    http://www.pcworld.com/article...
    http://arstechnica.com/gadgets...

    As the IBM article states, they are working with Samsung and Global Foundries, while the other article is about Intel. That is three of the major chip fab companies stating they are moving to a silicon-germanium hybrid crystal over pure silicon for exactly this reason. Also, the fabs for a new process node take time to set up, and they need to be ready before circuit designs come in to fab prototype batches, so they are usually a couple of years ahead of what is commercially available on the market.

  12. Re:@Intel: Why no ECC for consumer-grade processor by scdeimos · · Score: 2

    Are people really less knowledgeable about computers now than they were in the 80's?

    Yes, absolutely! Have you never sat down with an IT graduate from the 2000's to figure out what they actually know about computer hardware?

  13. Re: "Of course it can," says government by Bruce+Perens · · Score: 2

    The comment I was responding to was regarding HAARP. And that's "except" FYI. :-) ECC is actually more reliable, for its problem domain, than a triple voting system. The probability that you would arrive at a valid ECC code for bad data due to multiple bit flips is much lower than the probability of two out of three systems voting wrong. So, it is at least theoretically possible to design a computer system with data integrity throughout that exceeds that of a voting system.

  14. Re:Why not blame the manufacturer? by ShanghaiBill · · Score: 4, Informative

    Probably b'cos there is nothing that manufacturers can do about cosmic rays

    Except that is not true. Electronic devices can be made more resistant to cosmic rays and other radiation. The easiest way to do so is to use depleted boron instead of "normal" boron as a semiconductor dopant. Boron-10 has a very high neutron absorption cross section while Boron-11 has a very small cross section. Use boron that has been "depleted" of the B10 isotope, and you cut way down on your neutron induced SEUs.

    Another obvious countermeasure is to use ECC memory, and memory scrubbing.

    The problem is not that there is nothing that manufacturers can do, but that consumers aren't willing to pay the extra cost. Would you be willing to pay an extra $100 for your phone if it meant one fewer reboot every decade or so?

  15. Re:Why not blame the manufacturer? by ShooterNeo · · Score: 5, Informative

    You know that several FPGA manufacturers offer this. Xilinx offers a method where this is done in software - when you do design synthesis, more than triple the gates are needed for every circuit allocated in the design. (I think it's done at a higher level - truth tables with the triple redundant bits are generated)

    Some do it in hardware, so your design synthesis is the same but the actual software programmable subunits use ternary redundancy.

  16. Re:Why not blame the manufacturer? by currently_awake · · Score: 2

    Yes it's cheaper. That's why they invented RAID.

  17. bullshit by gravewax · · Score: 2

    Your phone or computer crash is thousands (if not millions) of times more likely to have been caused by the manufacturer's or coder's error or fault than by cosmic rays. Anyone who decides to consider cosmic rays as a more likely answer deserves to continue to experience their issues.

  18. Re: Why not blame the manufacturer? by ShanghaiBill · · Score: 5, Informative

    ECC memory doesn't do anything to help when the bits that get flipped are in the CPU. Or anywhere else that isn't a RAM chip.

    Except that the RAM has hundreds or thousands of times as many bits as a CPU, and Flash may have millions of times as many, and dynamic RAM has a smaller feature size and is more susceptible to SEUs. So correcting RAM and Flash helps because that is where 99.9% of the problem is.

    Even within the CPU, most transistors are used to implement cache, and cache can also be scrubbed (although not with just software).

  19. Re:Why not blame the manufacturer? by unrtst · · Score: 4, Informative

    Another obvious countermeasure is to use ECC memory ...

    The problem is not that there is nothing that manufacturers can do, but that consumers aren't willing to pay the extra cost. Would you be willing to pay an extra $100 for your phone ...

    ECC memory is not that much more expensive. It's been a few years since I built the desktop I'm using, but I included 16gb of ECC memory (4x 4gb DDR3 ECC KVR1333D3E9SK2/8G). At the time, I think it was around $60. The equivalent normal memory was only a couple bucks cheaper. If Samsung started using ECC memory in all their phones, the cost would be nearly the same with the volume they would be ordering/making.

    FWIW, I did try to do the same comparison just now on newegg and, while it's a bit of a mess, the situation is nearly the same today:
    $34 : Kingston 4GB 240-Pin DDR3 SDRAM ECC Unbuffered DDR3 1333 Server Memory Model KVR13LE9S8/4
    $52 : Kingston 8GB (2 x 4GB) 240-Pin DDR3 SDRAM DDR3 1600 (PC3 12800) Memory Model KVR16N11S8K2/8

    More expensive? Yes.
    $100 more? Nowhere near that much.

  20. Re:@Intel: Why no ECC for consumer-grade processor by unrtst · · Score: 2

    Are people really less knowledgeable about computers now than they were in the 80's?

    If you mean on average, I think the answer is probably yes.

    If you mean on average out of the total number of computer users or programmers, then yes (they are less knowledgeable), because that pool has increased by lots and lots.

    If you mean on average out of all people, then no. I suspect there are far more people that know what ECC does now than did in the 80's, and the total population count hasn't gone up as much as that number, so there are more people on average, and in total, that know about the inner workings of computers.

    I think there are just far more people touching stuff they know very little about, and we assume they must know *something*, but they don't.
    Compare it to early cars, where every operator had to know a bunch of stuff about it just to keep it running, but it was simple enough that the average operator could learn that stuff. Now, most cars make maintenance very difficult, and many drivers would be hard pressed to do simple things like changing the oil, flushing the radiator, replacing a brake light, replacing the battery, changing a tire, jump starting, etc. That said, there are WAAAAY more people that know WAAY more about cars now than there were in 1930. It's just shifted more to professional/hobbyist knowledge than something that every operator is required to know.

      More people know how to operate them now, but then, operating them has become orders of magnitude simpler.

  21. Thousands of years, same surprises by holophrastic · · Score: 2

    Is anyone surprised that if you store things once, and reference the one place alone, that you get screwed on occasion?

    Is the word "corroboration" new? How about "validation", "authentication", "verification", and, oh, I don't know, "paper-trail"?

    It's electronic information, not magic. The benefit of not carving into stone is that you can readily duplicate information into multiple places. Use it.

    RAID.

  22. Re:Why not blame the manufacturer? by fuzzyfuzzyfungus · · Score: 2

    If you think that finding a vendor that doesn't keep cutting battery life/SD card slots/headphone jacks/basic safeguards against electrical fire in order to make it thinner, cheaper, or both is hard, just try to find one that builds sufficient borated polyethylene (with something else to sop up the resulting gamma rays) or other neutron shielding into their products.

    There probably are some, making bits for nuclear reactors and industrial, scientific, and medical users of neutron sources; but it's a niche.

  23. We've always known this. This is why we have ECC by kriston · · Score: 2

    We've always known this. This is why we have ECC memory on servers.

    --

    Kriston

  24. Re:We've always known this. This is why we have EC by kriston · · Score: 3, Informative

    It's also why systems on spacecraft such as the Space Shuttle had what's called the Data Processing System. It consisted of four systems with identical software and an extra one with the same hardware but a different implementation with the same goals. They checked each other's decisions, and a majority "vote" would lock out the differing system.

    --

    Kriston

  25. Re:Why not blame the manufacturer? by dgatwood · · Score: 3, Informative

    Adding one ECC bit per byte, yes. Adding one parity bit, no. ECC != parity.

    --

    Check out my sci-fi/humor trilogy at PatriotsBooks.

  26. Re:Why not blame the manufacturer? by Solandri · · Score: 2

    This is actually a fairly recent development. When I was putting together a file server in 2012, I really wanted to use ECC RAM. But 2x4GB ECC cost more than $250 vs $50 for regular 2x4GB RAM. Add in the extra cost of a server motherboard that supported ECC RAM and the processor restrictions, and I gave up and just built the file server using regular RAM.

    A couple years later, the price of ECC RAM had dropped to only about 50% more than the cost of regular RAM.

    If Samsung started using ECC memory in all their phones, the cost would be nearly the same with the volume they would be ordering/making.

    The cost would be 12.5% more. :)
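
    The 12.5% figure matches the usual layout of ECC memory modules, which store 72 bits for every 64 bits of data. A rough sketch of where the 8 extra bits come from, assuming a standard single-error-correct, double-error-detect (SECDED) Hamming code; `secded_check_bits` is a hypothetical helper name, and this counts raw storage only, not price.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed for a SECDED Hamming code over data_bits of payload."""
    r = 0
    # Single-error correction needs 2**r >= data_bits + r + 1, so the syndrome
    # can name every bit position plus the "no error" case.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1   # one extra overall parity bit adds double-error detection


for width in (8, 64):
    extra = secded_check_bits(width)
    print(f"{width} data bits -> {extra} check bits ({100 * extra / width:.1f}% overhead)")
```

    Protecting each byte on its own would cost far more than one bit per byte. Spreading the code over a 64-bit word is what brings the overhead down to 8 bits in 64, which averages out to one extra bit per byte.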

  27. Re: ECC by Bruce+Perens · · Score: 4, Interesting

    Read this paper. He postulates 2 soft errors per year for a Xeon 7500 with 24 MB L3 cache at sea level in New York City. He also gives figures for static RAM, which is the stuff of CPU registers.