Slashdot Mirror


Serious Computer Glitches Can Be Caused By Cosmic Rays (computerworld.com)

The Los Alamos National Lab wrote in 2012 that "For over 20 years the military, the commercial aerospace industry, and the computer industry have known that high-energy neutrons streaming through our atmosphere can cause computer errors." Now an anonymous reader quotes Computerworld: When your computer crashes or phone freezes, don't be so quick to blame the manufacturer. Cosmic rays -- or rather the electrically charged particles they generate -- may be your real foe. While harmless to living organisms, a small number of these particles have enough energy to interfere with the operation of the microelectronic circuitry in our personal devices... particles alter an individual bit of data stored in a chip's memory. Consequences can be as trivial as altering a single pixel in a photograph or as serious as bringing down a passenger jet.

A "single-event upset" was also blamed for an electronic voting error in Schaerbeek, Belgium, back in 2003. A bit flip in the electronic voting machine added 4,096 extra votes to one candidate. The issue was noticed only because the machine gave the candidate more votes than were possible. "This is a really big problem, but it is mostly invisible to the public," said Bharat Bhuva. Bhuva is a member of Vanderbilt University's Radiation Effects Research Group, established in 1987 to study the effects of radiation on electronic systems.
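
The figure of 4,096 is exactly 2^12, which is what one flipped bit in a binary counter would produce. A minimal sketch of that arithmetic follows; the starting count of 514 and the helper name `flip_bit` are illustrative assumptions, not details from the incident.

```python
def flip_bit(value, bit):
    """Return value with the given bit position inverted (bit 0 = least significant)."""
    return value ^ (1 << bit)

true_count = 514                      # hypothetical tally, chosen only for illustration
corrupted = flip_bit(true_count, 12)  # a single upset in bit 12 of the counter

print(corrupted - true_count)         # difference introduced by the one flipped bit
```

Flipping the same bit a second time restores the original value, which is why a single upset leaves no other trace in the stored number.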

Cisco has been researching cosmic radiation since 2001, and in September briefly cited cosmic rays as a possible explanation for partial data losses that customers were experiencing with their ASR 9000 routers.

264 comments

  1. This is news...? by __aaclcg7560 · · Score: 5, Funny

    Whenever a user calls up to ask why his computer rebooted after I install an update, I say... drumroll, please... gamma radiation.

    1. Re:This is news...? by __aaclcg7560 · · Score: 1, Interesting

      I really would like to visit your house. After I eat a lot of fiber, maybe some bran muffins or flax seeds. That way I can take a great big SHIT and put a very large, moist turd in your microwave. You will just LOVE what happens when it's in there on high for about ten minutes!

      You need to have a better diet. Healthy shit has the consistency of toothpaste.

      It makes for an interesting conversation piece.

      I had a roommate who left a squash inside a toaster oven on low heat overnight. The squash was carbonized all the way through. Charred on the outside, charred on the inside. Now that was a conversation piece.

    2. Re:This is news...? by manu0601 · · Score: 1

      gamma radiation

      I also believed that cosmic-ray trouble was about gamma radiation, but TFA says it is all about neutron radiation.

    3. Re:This is news...? by Anonymous Coward · · Score: 0

      And both are wrong. The vast majority are high-energy protons, with other nuclei being less common, all positively charged of course. This is at the top of the atmosphere. After traveling through 10 metric tons per m² over 20+ km, it tends to be muons.

    4. Re: This is news...? by Anonymous Coward · · Score: 0

      I disagree, healthy shit should distinctly form "turds" if it's too soft to be a turd then you need more fiber.

    5. Re:This is news...? by ArchieBunker · · Score: 1

      Nah do an upper decker instead.

      --
      Only the State obtains its revenue by coercion. - Murray Rothbard
    6. Re:This is news...? by Anonymous Coward · · Score: 0

      Protons are charged particles and don't make it to the ground, except at the poles. The Earth's magnetic field deflects them. This is where aurorae at the poles come from. Neutrons are not deflected by magnetic fields and typically impact energetically with other atoms in the atmosphere, resulting in showers of particles on the ground below.

    7. Re:This is news...? by Anonymous Coward · · Score: 0

      If you get tired of repeating gamma, you can go alpha and blame the chip manufacturer's process.

    8. Re:This is news...? by Z00L00K · · Score: 1

      Fibers are for data communication, you shouldn't eat them.

      --
      If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
    9. Re:This is news...? by arglebargle_xiv · · Score: 5, Funny

      A bit flip in the electronic voting machine added 4,096 extra votes to one candidate. The issue was noticed only because the machine gave the candidate more votes than were possible.

      How could they tell this apart from standard operations on a Diebold machine?

    10. Re:This is news...? by Anonymous Coward · · Score: 0

      Whenever a user calls up to ask why his computer rebooted after I install an update, I say... drumroll, please... gamma radiation.

      You can't just go throwing words like "gamma radiation" around - users are too dim for that. If however you show them the space weather forecast and tell them God is angry so he's sending solar flares they buy it 99% of the time AND think they understand.

    11. Re:This is news...? by jellomizer · · Score: 1

      In 2017 anything goes for news. I expect the Los Alamos National Lab is worried about its funding, so it will repurpose one of its old hypotheses and try to get it on Fox News so the president sees it and decides to keep its funding. These organizations, if smart, realize how manipulable the president is, and just a few simple things can cause him to change his mind and course. Just as long as you stroke his ego you can do whatever you want.

      I am sorry, I didn't want to make this political, but we have had a problem with bad science news for a long time, because people eat it up, and they use it to keep their funding. Unfortunately, in some areas such as climate change, some early overzealous hypotheses created a situation of mistrust of science, where the general population and politicians just don't get the scientific process and are unable to weed out what are strong results and poor results, or the difference between a hypothesis and a theory.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    12. Re:This is news...? by oobayly · · Score: 1

      I had to google the term after watching an episode of Archer - it was the first time I'd ever heard it. I have literally never heard of anyone (not even about a friend of a friend of a friend...) doing something like that, and I've heard some pretty fucked stuff from my colleagues.

      Is this an actual thing in the US?

    13. Re:This is news...? by geekmux · · Score: 5, Funny

      Whenever a user calls up to ask why his computer rebooted after I install an update, I say... drumroll, please... gamma radiation.

      Computers and Incredible Hulks don't interface well together, but a Ctrl-Alt-SMASH sequence? I'd buy that.

    14. Re: This is news...? by Entrope · · Score: 1

      Lighten up, Francis.

    15. Re:This is news...? by chainsaw1 · · Score: 1

      Both are right. Protons are deflected in the atmosphere, and neutrons have no charge and (almost) no magnetic moment, so they don't interact with things unless they hit a nucleus head-on (elastic interaction).

      That being said, if a high-energy neutron hits something, the energy may be sufficient to create other particles (like protons and gamma rays). So in a way both theories in this thread are correct (protons produced in the lower atmosphere from neutrons from space).

      If you have the equipment, cosmic-ray bit flips in memory can be determined. For this, one must map the status of every memory cell in the affected region to its physical location in the chip and have a mirrored (RAID 1-like) data set with appreciable physical separation from the data under investigation. If there are many bit flips following a physical straight-line path through the memory device, that is likely a cosmic ray.
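
      The straight-line test described above can be sketched roughly as follows. It assumes the flipped cells have already been mapped to physical (row, column) coordinates on the die, which is the device-specific hard part; the function name `collinear` and the sample coordinates are made up for illustration.

```python
def collinear(points, tol=0):
    """Return True if all (row, col) points lie on one straight line.

    Uses the 2D cross product of each point against the line through the
    first two points. Assumes the first two points are distinct.
    """
    if len(points) < 3:
        return True
    (r0, c0), (r1, c1) = points[0], points[1]
    return all(
        abs((r1 - r0) * (c - c0) - (c1 - c0) * (r - r0)) <= tol
        for r, c in points[2:]
    )

# Hypothetical physical locations of cells that differ from the mirror copy.
track_like = [(0, 0), (2, 1), (4, 2), (6, 3)]
scattered = [(0, 0), (2, 1), (5, 9)]

print(collinear(track_like))  # flips along one line: consistent with a particle track
print(collinear(scattered))   # flips with no common line: more likely unrelated faults
```

      A non-zero `tol` would allow for cells that sit slightly off the ideal line.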

      --
      - Sig
    16. Re:This is news...? by omnichad · · Score: 1

      May I introduce you to the Bristol Stool Chart. "Normal" is within the range of 3-4.

    17. Re:This is news...? by __aaclcg7560 · · Score: 1

      May I introduce you to the Bristol Stool Chart [wikipedia.org]. "Normal" is within the range of 3-4.

      I drop a Type 4 every morning. ;)

    18. Re:This is news...? by werepants · · Score: 1

      For technical accuracy, it would be better to blame terrestrial neutron cascades generated by cosmic rays. Gamma rays don't actually cause bit upsets or things that would cause a momentary glitch - you need something like a neutron or proton for that. Gamma rays do a perfectly fine job at putting dose on parts, but if the computer has accumulated enough dose to cause a failure, then the user probably died quite a while ago.

      Disclaimer: IAAREE (I Am A Radiation Effects Engineer)

      ;)

    19. Re:This is news...? by Anonymous Coward · · Score: 0

      Is this an actual thing in the US?

      Maybe not in the US, but I understand that in Russia they have an instructional video starring Donald Trump...

    20. Re: This is news...? by Anonymous Coward · · Score: 0

      0 out of 5 dentists recommend shit for toothpaste.

    21. Re:This is news...? by igny · · Score: 1
      --
      In theory there is no difference between theory and practice. In practice there is. - Yogi Berra
    22. Re:This is news...? by Anonymous Coward · · Score: 0

      Every IT guy has done ctl-alt-smash at one point or another. But then you spend the next five minutes searching for keys and popping them back into place, so its usefulness is limited.

    23. Re:This is news...? by Anonymous Coward · · Score: 0

      You might want to ask the Russians.

    24. Re:This is news...? by LienRag · · Score: 1

      More seriously, is there a reference to the actual incident?
      All I can find by googling it is copies of TFA...

    25. Re:This is news...? by ivanjager · · Score: 1

      The standard operation is carefully designed to avoid increasing the count to more than the total number of voters in a given district.

  2. That is why Excel crashes all the time on OSX by thesjaakspoiler · · Score: 3, Insightful

    I was convinced that it was a lousy programming job by Microsoft, which pays more attention to fancy UX components than to stability. I am waiting for the confirmation that the fact that Excel starts searching every known (network) drive for a license if it can't connect to the online subscription service, for every operation, must be due to black matter. Unless it crashes when it tries to display that warning message, then it's just some cosmic ray again. So relieved!

    1. Re:That is why Excel crashes all the time on OSX by Anonymous Coward · · Score: 0

      I use Excel 2016 on Mac no problems. Overall I've found Office on Mac to be pretty good.

      Otherwise, yes if some application or data gets corrupted for no obvious reason I tell people it could have been cosmic rays flipping bits.

    2. Re:That is why Excel crashes all the time on OSX by aaarrrgggh · · Score: 1

      Funny how copy-paste operations are constantly and consistently corrupted by those damn cosmic rays...

      Oh well, guess I need better shielding.

    3. Re:That is why Excel crashes all the time on OSX by toddestan · · Score: 1

      I assume cosmic rays is also why Outlook constantly pops up that "Need Password" prompt. All this time I was assuming it was a bug introduced back in Office 2007!

  3. Why not blame the manufacturer? by Anonymous Coward · · Score: 1, Informative

    When your computer crashes or phone freezes, don't be so quick to blame the manufacturer.

    Why not? According to the article, it is a well-known phenomenon:

    For over 20 years the military, the commercial aerospace industry, and the computer industry have known that high-energy neutrons streaming through our atmosphere can cause computer errors.

    So if it is a well-known problem, and manufacturers are ignoring the problem and creating devices susceptible to such interference, why can I not blame the manufacturer for making hardware with known problems? I would blame the manufacturer if a hearing aid was picking up local radio stations, so why not here?

    1. Re:Why not blame the manufacturer? by DontBeAMoran · · Score: 3, Insightful

      There's something you can do about it. It's very easy, but you won't like it.

      Make every component in triplicate. Everything in the CPU, everything in the RAM, everything in storage, etc. If the three aren't equal, go with the value shared by two of them and rewrite the different one with that value.
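
      The vote-and-rewrite step described here is usually called triple modular redundancy. A minimal software sketch, assuming three copies of the same word held in a plain list (the names `majority` and `read_and_repair` are illustrative, and real systems do this in hardware):

```python
def majority(a, b, c):
    """Bitwise 2-of-3 vote: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)

def read_and_repair(copies):
    """Vote across three copies, overwrite any copy that disagrees, return the voted value."""
    voted = majority(copies[0], copies[1], copies[2])
    for i in range(3):
        if copies[i] != voted:
            copies[i] = voted  # rewrite the odd one out with the agreed value
    return voted

copies = [0b10110010, 0b10110010, 0b10010010]  # third copy has one flipped bit
value = read_and_repair(copies)
print(bin(value), copies[0] == copies[1] == copies[2])
```

      The vote only helps while at most one copy is wrong in any given bit position; two copies flipped in the same bit would outvote the good one.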

      --
      #DeleteFacebook
    2. Re:Why not blame the manufacturer? by Baloroth · · Score: 5, Interesting

      There's something you can do about it. It's very easy, but you won't like it.

      Make every component in triplicate. Everything in the CPU, everything in the RAM, everything in storage, etc. If the three aren't equal, go with the value shared by two of them and rewrite the different one with that value.

      Not only is this not actually all that easy (all of your triplicate systems have to be clocked together in sync, you need a shitload of extra hardware to do the comparison, etc.) it's grossly unnecessary. Standard off-the-shelf error detection and correction can (and routinely does) handle radiation induced errors. It just costs a bit more, because it's a business-level feature. It doesn't matter if that MP3 of Taylor Swift gets mildly corrupted (might even sound better that way, zing), but it very much *does* matter if that bank account gets a flipped bit.

      --
      "None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
    3. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      Because you are not willing to pay for a smartphone with redundant memory and CPU chips working in parallel and radiation hardened casing that the aerospace industry has adopted to solve this well-known issue. If I make such a phone, and it is on the shelf next to cheap Chinese junk at 20% of the cost, you are going to pick the cheap Chinese junk every time.

    4. Re:Why not blame the manufacturer? by ShanghaiBill · · Score: 4, Informative

      Probably b'cos there is nothing that manufacturers can do about cosmic rays

      Except that is not true. Electronic devices can be made more resistant to cosmic rays and other radiation. The easiest way to do so is to use depleted boron instead of "normal" boron as a semiconductor dopant. Boron-10 has a very high neutron absorption cross section while Boron-11 has a very small cross section. Use boron that has been "depleted" of the B10 isotope, and you cut way down on your neutron induced SEUs.

      Another obvious countermeasure is to use ECC memory, and memory scrubbing.

      The problem is not that there is nothing that manufacturers can do, but that consumers aren't willing to pay the extra cost. Would you be willing to pay an extra $100 for your phone if it meant one fewer reboot every decade or so?

    5. Re:Why not blame the manufacturer? by ShooterNeo · · Score: 5, Informative

      You know that several FPGA manufacturers offer this. Xilinx offers a method where this is done in software - when you do design synthesis, more than triple the gates are needed for every circuit allocated in the design. (I think it's done at a higher level - truth tables with the triple redundant bits are generated)

      Some do it in hardware, so your design synthesis is the same but the actual software programmable subunits use ternary redundancy.

    6. Re:Why not blame the manufacturer? by ShooterNeo · · Score: 1

      I wonder if this means it's actually cheaper to use 3 separate computers, on cheap off the shelf hardware, than one armored and extra redundant computer. For example, spacecraft guidance or for an autonomous car.

    7. Re:Why not blame the manufacturer? by currently_awake · · Score: 1

      The use of parity on memory can prevent most of this. Unfortunately the higher-density CPUs are highly vulnerable, and using mil-spec parts isn't mandatory for elections.

    8. Re:Why not blame the manufacturer? by currently_awake · · Score: 2

      Yes it's cheaper. That's why they invented RAID.

    9. Re:Why not blame the manufacturer? by ShaunC · · Score: 1

      I recall reading that most space probes are designed this way, due to increased exposure to radiation. Flipped bits are a serious problem in space, the hope is that only one event occurs at a time so that the other two processors maintain quorum.

      --
      Thanks to the War on Drugs, it's easier to buy meth than it is to buy cold medicine!
    10. Re:Why not blame the manufacturer? by ShooterNeo · · Score: 1

      So the reason the Curiosity rover uses FPGAs with ternary logic (and just 2 computers if I recall) is to save on weight. If they were going to optimal cost efficiency they'd have redundant computers and do what the FPGAs are doing in firmware.

    11. Re: Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      ECC memory doesn't do anything to help when the bits that get flipped are in the CPU. Or anywhere else that isn't a RAM chip.

    12. Re:Why not blame the manufacturer? by Applehu+Akbar · · Score: 1

      Probably b'cos there is nothing that manufacturers can do about cosmic rays, which are beyond even gamma rays in the electromagnetic spectrum in terms of wavelength and frequency.

      Not the manufacturer, but the CTOs: just put all data centers into old mines. This could be a great business for rural areas.

    13. Re: Why not blame the manufacturer? by ShanghaiBill · · Score: 5, Informative

      ECC memory doesn't do anything to help when the bits that get flipped are in the CPU. Or anywhere else that isn't a RAM chip.

      Except that the RAM has hundreds or thousands of times as many bits as a CPU, and Flash may have millions of times as many, and dynamic ram has smaller feature size, and is more susceptible to SEUs. So correcting RAM and Flash helps because that is where 99.9% of the problem is.

      Even within the CPU, most transistors are used to implement cache, and cache can also be scrubbed (although not with just software).

    14. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      Blame the buyer!

      Simple use of ECC memory completely prevents single bit flip errors in RAM.
      Yet people are so stupid that they refuse to pay the marginal extra cost.
      So they continue to risk and corrupt their data like morons.
      Ever since seeing ECC hits in the datacenter,
      ALL my workstations are now ECC, and the four or so times a year
      I log a correction... it totally makes my day :)

      AMD-FX , AMD-ZEN, Intel as below, all support ECC.
      http://ark.intel.com/Search/FeatureFilter?productType=processors&FilterCurrentProducts=true&ECCMemory=true&VTD=true&AESTech=true&RetailSkuAvailable=true

    15. Re:Why not blame the manufacturer? by unrtst · · Score: 4, Informative

      Another obvious countermeasure is to use ECC memory ...

      The problem is not that there is nothing that manufacturers can do, but that consumers aren't willing to pay the extra cost. Would you be willing to pay an extra $100 for your phone ...

      ECC memory is not that much more expensive. It's been a few years since I built the desktop I'm using, but I included 16gb of ECC memory (4x 4gb DDR3 ECC KVR1333D3E9SK2/8G). At the time, I think it was around $60. The equivalent normal memory was only a couple bucks cheaper. If Samsung started using ECC memory in all their phones, the cost would be nearly the same with the volume they would be ordering/making.

      FWIW, I did try to do the same comparison just now on newegg and, while it's a bit of a mess, the situation is nearly the same today:
      $34 : Kingston 4GB 240-Pin DDR3 SDRAM ECC Unbuffered DDR3 1333 Server Memory Model KVR13LE9S8/4
      $52 : Kingston 8GB (2 x 4GB) 240-Pin DDR3 SDRAM DDR3 1600 (PC3 12800) Memory Model KVR16N11S8K2/8

      More expensive? Yes.
      $100 more? Nowhere near that much.

    16. Re: Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      Use depleted boron that doesn't include the boron-10 isotope in the silicon dopant, and problem solved.

    17. Re:Why not blame the manufacturer? by Solandri · · Score: 0

      You don't need to have everything in triplicate unless you're in a seriously noisy environment. Most error rates due to cosmic radiation are low enough that simply adding one parity bit per 8 data bits (increasing transistor count by 12.5%, not 200%) is enough to eliminate virtually all bit flip errors.

    18. Re:Why not blame the manufacturer? by fuzzyfuzzyfungus · · Score: 2

      If you think that finding a vendor that doesn't keep cutting battery life/SD card slots/headphone jacks/basic safeguards against electrical fire in order to make it thinner, cheaper, or both is hard, just try to find one that builds sufficient borated polyethylene (with something else to sop up the resulting gamma rays) or other neutron shielding into their products.

      There probably are some, making bits for nuclear reactors and industrial, scientific, and medical users of neutron sources; but it's a niche.

    19. Re:Why not blame the manufacturer? by dbIII · · Score: 1

      If they were going to optimal cost efficiency

      In that special case saving on weight is optimal cost efficiency. I doubt that any of the rover components cost as much as their percentage of the total weight divided by the cost to get the rover on Mars.

    20. Re: Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      You're wrong, parity causes more errors than it detects. And it does not correct.

      SECDED Hamming memory FTW.
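
      For readers who have not met SECDED (single error correction, double error detection), here is a toy sketch on 4 data bits using an extended Hamming code. It is only meant to show the idea; real ECC DIMMs typically protect 64-bit words, and the function names are illustrative.

```python
def encode(data4):
    """Encode 4 data bits (list of 0/1) into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are Hamming(7,4),
    with check bits at positions 1, 2, 4 and data at positions 3, 5, 6, 7.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data4
    for p in (1, 2, 4):
        # Check bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    code[0] = sum(code[1:]) % 2
    return code

def decode(code):
    """Return (status, data4); status is 'ok', 'corrected' or 'uncorrectable'."""
    c = list(code)
    syndrome = 0
    for p in (1, 2, 4):
        if sum(c[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall_even = sum(c) % 2 == 0
    if syndrome == 0 and overall_even:
        status = "ok"
    elif not overall_even:
        c[syndrome] ^= 1          # single error; syndrome 0 means the overall bit itself
        status = "corrected"
    else:
        return "uncorrectable", None  # two flips: detected but cannot be located
    return status, [c[3], c[5], c[6], c[7]]

word = encode([1, 0, 1, 1])
one_flip = list(word); one_flip[5] ^= 1
two_flips = list(word); two_flips[2] ^= 1; two_flips[6] ^= 1
print(decode(word), decode(one_flip), decode(two_flips))
```

      Three or more flips in one word can still be miscorrected, which is one reason scrubbing is used to fix single errors before they accumulate.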

    21. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      The space shuttle used 5 redundant computers, one of which runs code written by a different company.
      Boeing 777s have 3 redundant flight computers. One runs an Intel 80486 CPU, one runs an AMD 29050 CPU, and the third runs a Motorola MC68040 CPU.
      So yes, high reliability systems already run redundant computers.


    22. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      Most unmanned spacecraft have only two redundant hardware chains for cost and weight reasons. Manned missions typically have at least 3, or even more. The space shuttle had 5 redundant general purpose computers. 4 of them ran in parallel, and the 5th one held a backup flight system written by an entirely different company.

    23. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      It's easier than that - parity-checking RAM and caches give a very high rate of protection.

    24. Re:Why not blame the manufacturer? by ShooterNeo · · Score: 1

      Technically correct...the best kind of correct. I was implicitly referring strictly to electronic component cost and development time costs, since there's only going to be a handful of Curiosity rover style projects per decade but there are many thousands of projects to develop safe computerized control systems for cars and robots and everything else.

    25. Re:Why not blame the manufacturer? by dgatwood · · Score: 3, Informative

      Adding one ECC bit per byte, yes. Adding one parity bit, no. ECC != parity.
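
      The distinction can be seen in a few lines. A single even-parity bit, sketched below with illustrative names, flags an odd number of flips but cannot say which bit changed, and it misses an even number of flips entirely.

```python
def parity_bit(byte):
    """Even parity: the stored bit is 1 when the byte has an odd number of 1s."""
    return bin(byte).count("1") % 2

def looks_intact(byte, stored_parity):
    """True if the byte still matches the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity

original = 0b01101001
stored = parity_bit(original)

one_flip = original ^ (1 << 3)
two_flips = original ^ (1 << 3) ^ (1 << 6)

print(looks_intact(original, stored))   # True: nothing changed
print(looks_intact(one_flip, stored))   # False: error detected, location unknown
print(looks_intact(two_flips, stored))  # True: double flip goes unnoticed
```

      Correcting the error takes additional check bits so that the failing position can be identified, which is what ECC adds.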

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    26. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      ECC Memory isn't the only added cost, you also need a motherboard and processor that supports it. The cost of these supporting components is a lot more than the difference between ECC and non-ECC memory.

    27. Re:Why not blame the manufacturer? by Solandri · · Score: 2
      This is actually a fairly recent development. When I was putting together a file server in 2012, I really wanted to use ECC RAM. But 2x4GB ECC cost more than $250 vs $50 for regular 2x4GB RAM. Add in the extra cost of a server motherboard that supported ECC RAM and the processor restrictions, and I gave up and just built the file server using regular RAM.

      A couple years later, the price of ECC RAM had dropped to only about 50% more than the cost of regular RAM.

      If Samsung started using ECC memory in all their phones, the cost would be nearly the same with the volume they would be ordering/making.

      The cost would be 12.5% more. :)
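
      The 12.5% figure matches the storage overhead of a SECDED Hamming code on 64-bit words, which needs 8 check bits per 64 data bits. A small sketch of that count, assuming the standard Hamming bound plus one overall parity bit (the function name is illustrative):

```python
def secded_check_bits(data_bits):
    """Check bits for SECDED: smallest r with 2**r >= data_bits + r + 1, plus one overall parity bit."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

for width in (8, 16, 32, 64):
    extra = secded_check_bits(width)
    print(f"{width:2d} data bits -> {extra} check bits ({100 * extra / width:.1f}% overhead)")
```

      The overhead shrinks as the protected word gets wider, which is why the code is applied to whole 64-bit words rather than to each byte separately.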

    28. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      Yes it's cheaper. That's why they invented RAID.

      You can RAID memory?

      The problem is bit-flips in RAM, not bit-flips on disk. Bit-flipping on disk, due to being so common, is already easily catered for - our filesystems checksum, detect and repair single-bit errors in blocks of bytes. Bit-flips in longer byte-blocks can be detected but not fixed, hence are "fixed" using a backup of that block (you know - RAID).

      There's a project out there (see G+ "C programming" group) called REXEL that is designed to share a running program over multiple different physical computers. One of the proposed use-cases under "further research" is using quorum-based problem-solving to detect bit-flips in RAM. Essentially, RAID for executing functions.

    29. Re:Why not blame the manufacturer? by thegarbz · · Score: 1

      it's grossly unnecessary

      That depends on the application. I agree unnecessary for a general purpose computer, probably also unnecessary for servers.

      However if your electronics store critical financial information or are safety systems in control of hazardous facilities then it becomes a bit of a different story. The comparison model is actually one that is adopted by many safety systems.

    30. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      Those are artificial costs though, that Intel imposes because they can. The hardware is present on all chips, but much like with other features, Intel cripples the functionality. Critical reliability and security features should not be used for market segmentation.

    31. Re:Why not blame the manufacturer? by ewanm89 · · Score: 1

      Marginal extra cost? Want to look up the difference in price between an Intel Core i7 Extreme Edition on an X99 board and the equivalent Intel Xeon, where the difference between the processors is the ECC memory controller? There are a few low-end mobile and embedded processors Intel makes with ECC, but the majority of their consumer range deliberately does not have it; it is a Xeon "feature", with the price tag that comes with it.

    32. Re:Why not blame the manufacturer? by vtcodger · · Score: 1

      and manufacturers are ignoring the problem and creating devices susceptible to such interference

      If it really is a problem, it could be easily dealt with at very modest cost by using extra memory bits for memory error detection/correction. Although as others point out, modern software is so buggy that it might not be worth the effort to actually improve the hardware a little.

      Those who were around 25 years or so ago will remember that the lack of parity/ECC actually can be laid partly on Microsoft. Early PCs had a parity bit on each byte (and they probably needed it). But memory was expensive back then -- $100US a Megabyte. And Windows needed a fair amount of memory. Which meant it needed a costly box to run on. So MS launched a campaign to convince us all that we really no longer needed that extra bit that added roughly 12% to the cost of memory.

      --
      You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
    33. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      I call bullshit. I've been building systems with ECC RAM since the early 2000s (because I'm a paranoid fuck) and the price was never even close to a factor of 5 more expensive. Perhaps you used registered RAM.

    34. Re: Why not blame the manufacturer? by vtcodger · · Score: 1

      I should think that it wouldn't be that hard to add parity to CPU registers, caches, etc

      OTOH, I'm sure Intel could find a way to make the implementation obtuse and even further complicate their CPUs. And in any case, it's unclear to me what the device is supposed to do when it finds the number it is working with is wrong.

      --
      You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
    35. Re:Why not blame the manufacturer? by arth1 · · Score: 1

      Probably b'cos there is nothing that manufacturers can do about cosmic rays, which are beyond even gamma rays in the electromagnetic spectrum in terms of wavelength and frequency.

      In addition to what others have pointed out, stop shrinking dies. The smaller a circuit is, the greater the risk that it will be impacted when hit by a neutron. Components from a decade or two ago are a heck of a lot more resilient against cosmic rays than today's components.

      Sure, at lower speeds, but for a great many things you just need enough speed. Split out speed-requiring jobs on cutting-edge hardware, and run other critical services on more reliable hardware.

    36. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 1

      There are ALREADY high-end servers that do exactly that. In fact, they have been sold and are being sold even today to high-end businesses like banks, stock exchanges and the like.

      One line of products that I am familiar with are : https://en.wikipedia.org/wiki/NonStop_(server_computers)

      Obviously anonymous coward, since I work for the company that makes them (yup the same people that were making the shitty laptops were also making these at the same time.)

    37. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      Yes. Marginal extra cost, an extra 12.5% tacked onto your DIMMs. All those Intel chips already support ECC, because it's cheaper to flip a hardware lock to disable that circuitry than to actually develop a second memory controller that doesn't support it. Now to figure out how to make ECC a standard feature, and force Intel to stop its bullshitery....

    38. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      That's because you're comparing the debacle that was FB-DIMMs, a short lived memory format used on Core2-based Xeons, to regular DIMMs. Of course they cost a shit ton more.

    39. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      Unfortunately the higher density cpu's are highly vulnerable

      I wonder how effective a big chunk of some dense material, like copper, fixed to the CPU, would function as neutron shielding...

    40. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      They can. Use ECC memory. A bit of googling returns that the chance of a cosmic bit error is about 95% for 4GiB over 3 days. At the time scale of memory refreshes, it's nearly non-existent. With ECC memory, the chance of an unrecoverable bit error is about once every 2.7mil years. CPUs use a combination of parity and ECC memory internally. If everything in your computer used ECC or parity memory, the chance of getting an undetected error would be like winning the $400mil Powerball all for yourself.

      This would leave only the CPU registers without protection. We'll assume there are 256 64-bit registers' worth of non-cache storage in your CPU. This will give a 0.00000213% chance per hour of a single bit error, or about 0.02% chance per year of a single bit error.
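
      The register estimate can be reproduced from the quoted 95%-in-3-days figure. The sketch below assumes upsets arrive as a Poisson process at a constant rate proportional to the number of bits, and takes 4 GB as 4×10^9 bytes, which is the reading that reproduces the per-hour number; both are assumptions, not something stated in the comment.

```python
import math

P_FLIP = 0.95            # chance of at least one flip in the reference memory
WINDOW_HOURS = 3 * 24    # over three days
MEMORY_BITS = 4e9 * 8    # assumption: 4 GB read as 4e9 bytes
REGISTER_BITS = 256 * 64 # 256 registers of 64 bits each

# Poisson model: P(at least one event) = 1 - exp(-rate * time)
rate_memory = -math.log(1 - P_FLIP) / WINDOW_HOURS       # events per hour, whole memory
rate_regs = rate_memory * REGISTER_BITS / MEMORY_BITS    # scaled down to the register file

p_hour = -math.expm1(-rate_regs)          # expm1 keeps precision for tiny probabilities
p_year = -math.expm1(-rate_regs * 8760)

print(f"per hour: {100 * p_hour:.8f}%")
print(f"per year: {100 * p_year:.4f}%")
```

      Under these assumptions the per-hour chance comes out near 0.00000213% and the per-year chance near 0.019%.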

    41. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      I bought 4x2GB ECC back in 2010 or so, I don't remember how much it cost but probably not more than 200 eur; maybe a 20-25 % markup over regular RAM.

      Gigabyte had an AM2 motherboard that supported ECC natively and was around 70-80 EUR. (Consumer motherboards that support ECC are indeed rare. It was by luck that I noticed that feature.)

      I get bitflip warnings in the kernel log a few times a year. Might get more if the machine wasn't located in the cellar.

      Oh, and as the other AC noted, you're looking at buffered, server-grade RAM. Mine are regular DIMMs.

    42. Re:Why not blame the manufacturer? by omnichad · · Score: 1

      You can RAID memory? ..... checksum, detect and repair single-bit errors in blocks of bytes

      ECC Memory?

    43. Re:Why not blame the manufacturer? by Khyber · · Score: 1

      Almost every cosmic ray is an atomic nucleus that's been stripped down. It's not an EM-spectrum object.

      --
      Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    44. Re: Why not blame the manufacturer? by Khyber · · Score: 1

      "ECC memory doesn't do anything to help when the bits that get flipped are in the CPU"

      Guess what the SRAM used in CPUs is? Fucking ECC, dumbass.

      --
      Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    45. Re: Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      Well, all of the L1, L2 and L3 caches on Intel are fully covered by ECC, even on desktop parts. The internal buses also do error-checking, and sometimes Intel gets it wrong and there's an erratum listed in that processor's "specification update" document about it :-p

      The only reason for Intel to not do ECC on main DRAM is market segmentation, the RAM cost increase is below 12.5% in the memory itself (no idea on the processor/IMC, but this is all stuff Intel already has working for every arch, since it is mandatory on the Xeon and embedded Xeon and embedded Atom parts).

      Anyone that would buy a motherboard with "solid capacitors", etc. would shell the extra US$10 per 8GiB module for ECC RAM (the cost in the motherboard itself is negligible).

      AMD has ECC available for main DRAM on several *desktop* parts. I am not sure they protect every cache level with ECC like Intel does nowadays, but I would be surprised if they didn't.

    46. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 0

      it's grossly unnecessary

      Except in aircraft, where it IS done this way.

    47. Re:Why not blame the manufacturer? by bws111 · · Score: 1

      Yes, you can RAID memory. See https://www.ibm.com/developerw...

    48. Re:Why not blame the manufacturer? by boristdog · · Score: 1

      Shoot, we have to account for die loss due to cosmic rays every time we ship a bunch of wafers by air. We've been doing that as long as I've been working here.

      semiconductor fab, 22 years.

    49. Re:Why not blame the manufacturer? by Raenex · · Score: 1

      It doesn't matter if that MP3 of Taylor Swift gets mildly corrupted (might even sound better that way, zing), but it very much *does* matter if that bank account gets a flipped bit.

      Sorry to rain on your popular-bashing parade, but ordinary people actually do important things with their computers besides listening to music.

    50. Re:Why not blame the manufacturer? by werepants · · Score: 1

      Xilinx does offer this, but implementing TMR (Triple Majority Redundancy) isn't always straightforward. The thing is, when you've tripled the size of your circuit, you will now accrue 3 times as many errors - you've just created a much larger target. You've also tripled size and power, and probably slowed the whole design down. And unless you are very, very careful, you introduce a new single point of failure: the voter, which has to mediate between the three elements in question.

      Fun fact: A previous generation of Xilinx TMR IP actually made upset rates WORSE than no protection whatsoever. Source - IAAREE - I Am A Radiation Effects Engineer, I do this for a living. ;)
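
      For readers who have not seen a voter before, here is a minimal software sketch of the 2-of-3 majority idea. It is purely illustrative: real TMR is built in hardware logic, this is not Xilinx's IP, and the function name is invented for the example.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)

good = 0b1011_0010
upset = good ^ 0b0000_1000  # one copy suffers a single bit flip

# A single upset copy is outvoted by the two good copies.
assert majority_vote(good, upset, good) == good

# Two copies hit in the same bit position outvote the good copy.
assert majority_vote(upset, upset, good) == upset
```

      Note that the voter itself is unprotected in this sketch, which is exactly the single point of failure the comment warns about.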

    51. Re:Why not blame the manufacturer? by Baloroth · · Score: 1

      If you're doing something important on your computer that requires error checking and correction, you should either a) get your employer to pay for the hardware to run it on (if it's a work thing), or b) pay for the hardware to run it on yourself (if it's a personal/self-employed thing). What you should *not* do, is use the wrong tool for the job, like running important error-sensitive operations on consumer grade hardware.

      --
      "None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
    52. Re:Why not blame the manufacturer? by ChrisMaple · · Score: 1

      CPU registers are likely to have a design that differs from cache cells. The registers will be bigger, faster, and more resistant to bit flips than cache.

      --
      Contribute to civilization: ari.aynrand.org/donate
    53. Re:Why not blame the manufacturer? by ChrisMaple · · Score: 1

      After a little google searching, it appears that density isn't effective for neutron shielding; even lead is a poor choice. A material with a high neutron capture area is needed. Apparently, water works well for neutrons.

      --
      Contribute to civilization: ari.aynrand.org/donate
    54. Re:Why not blame the manufacturer? by Agripa · · Score: 1

      I call bullshit. I've been building systems with ECC RAM since the early 2000's (because I'm a paranoid fuck) and the price was never even close to a factor 5 more expensive. Perhaps you used registered RAM.

      After Intel dropped support for ECC in desktop systems and during the beginning of the DDR2 to DDR3 transition, ECC DDR3 was not available and not supported and new Intel systems which supported ECC used FB-DIMM which were way more expensive than ECC DDR2. The difference in price at the time between a new AMD system using ECC DDR2 and a comparable new Intel system using ECC FB-DIMMs was roughly $1000.

  4. @Intel: Why no ECC for consumer-grade processors? by Anonymous Coward · · Score: 0

    Cosmic rays ate the rest of this comment.

  5. ECC by Anonymous Coward · · Score: 0

    And this is why we have ECC RAM. It can detect and correct a single bit flip. Cosmic rays aren't likely to trigger multiple bit flips simultaneously in the same block of memory.
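
    As a toy illustration of "detect and correct a single bit flip", here is a Hamming(7,4) code. This is an assumption-laden sketch: real ECC DIMMs typically protect 64 data bits with 8 check bits rather than 4 with 3, and the function names are invented for the example.

```python
def hamming74_encode(d: int) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword, positions 1..7 (parity at 1, 2, 4)."""
    d1, d2, d3, d4 = [(d >> i) & 1 for i in range(4)]
    p1 = d1 ^ d2 ^ d4  # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(code: list[int]) -> tuple[int, int]:
    """Return (data, error_position); error_position is 0 when no flip was detected."""
    # XOR of the positions of all set bits is 0 for a valid codeword,
    # and equals the flipped position after a single bit flip.
    syndrome = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= pos
    fixed = list(code)
    if syndrome:
        fixed[syndrome - 1] ^= 1
    d1, d2, d3, d4 = fixed[2], fixed[4], fixed[5], fixed[6]
    return d1 | (d2 << 1) | (d3 << 2) | (d4 << 3), syndrome

word = hamming74_encode(0b1011)
word[5] ^= 1  # simulate a single upset at position 6
data, where = hamming74_correct(word)
print(f"recovered {data:04b}, flipped position {where}")
```

    With two flips in the same codeword this plain Hamming code will "correct" to the wrong value, which is why practical memory ECC adds an extra parity bit to at least detect double errors.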

    1. Re:ECC by unixisc · · Score: 0

      Why not? You're assuming that there are only a few cosmic rays, when in fact, there are plenty. Cosmic rays could easily flip, say, 4 bits of a byte as well as the parity flag, thereby making that data completely useless

    2. Re:ECC by unixisc · · Score: 1

      As they get smaller, I think we are fast approaching the point where it will be thought that a silicon atom is too big to allow for a shrink, and that semiconductor physicists will have to start looking at carbon and maybe even boron

    3. Re:ECC by Anonymous Coward · · Score: 0

      Now if only companies like Intel would actually provide everyday consumers with ECC protected systems. It's a bit baffling to me that such an old technology still hasn't made its way into desktop/laptop category products.

      One of the solutions being used, as IC processes get smaller, is to interleave two different protected chunks of data. When a cell flips, it is very likely that adjacent cells will flip, and if those adjacent cells are protected by a different set of ECC, then this helps prevent double-bit uncorrectable errors from occurring, leaving most multi-bit flips as multiple single-bit ECC errors that can be corrected.
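
      The interleaving idea can be sketched by mapping physical cells to ECC words in two different ways and counting how many upsets each word sees. The layout here (two 8-bit words in one row of 16 cells) and the function name are assumptions for illustration, not any vendor's actual design.

```python
def errors_per_word(hit_cells, n_words, interleaved):
    """Count upset cells per ECC word for a row of cells holding n_words words of 8 bits."""
    bits_per_word = 8
    counts = [0] * n_words
    for cell in hit_cells:
        if interleaved:
            word = cell % n_words          # neighbouring cells alternate between words
        else:
            word = cell // bits_per_word   # each word occupies a contiguous run of cells
        counts[word] += 1
    return counts

burst = [3, 4]  # two physically adjacent cells upset by one particle
contiguous = errors_per_word(burst, n_words=2, interleaved=False)
alternating = errors_per_word(burst, n_words=2, interleaved=True)
print(contiguous, alternating)
```

      In the contiguous layout both upsets fall in one word (an uncorrectable double error for single-error-correcting ECC); in the interleaved layout each word sees one upset, which is correctable.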

    4. Re:ECC by DontBeAMoran · · Score: 1

      Your ECC RAM won't matter much if the cosmic ray hits the CPU registers. Or a cell in a block of your flash storage.

      --
      #DeleteFacebook
    5. Re:ECC by drinkypoo · · Score: 1

      Your ECC RAM won't matter much if the cosmic ray hits the CPU registers.

      Some modern CPUs have ECC cache RAM. Is it not possible to have ECC registers?

      Or a cell in a block of your flash storage.

      Filesystems can have ECC, too. And in fact, so can storage devices.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    6. Re: ECC by ewanm89 · · Score: 2

      We are already there:
      http://www.pcworld.com/article...
      http://arstechnica.com/gadgets...

      As the IBM article states they are working with Samsung and Global Foundries while the other article is about Intel that is 3 of the major chip fab companies stating they are moving to silicon-germanium hybrid crystal over pure silicon for exactly this reason. Also the fabs on a new process node take time to setup and they need to be ready before circuit design comes in to fab prototype batches so they are usually a couple of years ahead of what is commercially available on the market.

    7. Re:ECC by PPH · · Score: 1

      Now if only companies like Intel would actually provide

      Yes, if only ...

      --
      Have gnu, will travel.
    8. Re:ECC by scdeimos · · Score: 1

      Cosmic rays aren't likely to trigger multiple bit flips simultaneously in the same block of memory.

      Maybe that's the case for now, but who knows what will happen with stacked 3D memory?

    9. Re:ECC by Anonymous Coward · · Score: 0

      I'm not sure I understand your post. As far as I can tell the 5150 didn't have any ECC protection.

      Even in most Intel systems ECC is only used in CPU cores and memory controllers. But all of Intel's desktop/consumer south-bridges (PCH) don't even use parity, let alone ECC. The server south bridges have parity protection, and command parity errors are fatal errors that will lead to blue-screen/reboot.

    10. Re:ECC by viperidaenz · · Score: 1

      Costs money.
      It takes up more power, more die space, you need more RAM chips, etc.

    11. Re:ECC by Anonymous Coward · · Score: 0

      I'd be more worried about how frequently we recalculate the error check codes.

      Data on archive storage might be hit by lots of nasty particles before anyone gets around to reading it back, and we only check the error correction codes at the point we read (whether by reading a file or by verifying the whole media device). Of course this is also a very good reason to move infrequently used data onto chunkier archive media like tapes.

      The likelihood of four bits being smacked increases if your disk has been sat on a shelf for 50 years and the ECC might correct it to the wrong thing if enough damage was done. Then again most filesystems have multiple layers of CRC codes all the way up so you'd have to do a lot of very precise damage for bad data to pass the smell test.

    12. Re:ECC by Anonymous Coward · · Score: 0

      Some modern CPUs have ECC cache RAM. Is it not possible to have ECC registers?

      I suppose it's possible, but given how few registers there are compared to even small caches, the performance trade off just isn't worth it.

    13. Re:ECC by Anonymous Coward · · Score: 0

      That's why resilient live systems do a periodic scrub on infrequently accessed data.

    14. Re:ECC by Anonymous Coward · · Score: 0

      Once you get to the point that the size of the atom is the limiting factor, you're not going to be able to just switch atom types, as the bond length is still in the same order of magnitude. I'm also not sure why you'd use boron, as the bond length is longer than that of carbon. It's also a dopant, not a semiconductor.

    15. Re: ECC by Anonymous Coward · · Score: 0

      Are you willing to pay upwards of $500 per gigabyte of RAM?
      For your home computer, just to protect you from maybe having to restart your machine once or twice in your lifetime due to flipped bits?

      Most people aren't going to do that. Because they aren't running anything critical enough to matter.

    16. Re:ECC by Solandri · · Score: 1

      As the minimum detail size of the IC process gets smaller, the potential for radiation to flip a bit gets higher.

      I suspect the math works out the same as Shannon's noisy channel theorem. And that as the chance of bit flips (noise) increases due to die shrinking, you can increase the error correction coding to compensate for it up to some theoretical limit.

      e.g., instead of ECC memory having one parity bit for every 8 data bits, you increase it to two parity bits per 8 data bits, and it can withstand a higher error rate.
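
      To put rough numbers on the overhead side of that trade-off, here is the standard textbook bound for a Hamming-style single-error-correct, double-error-detect (SECDED) code. It is a sketch, not something stated in the comment, and the function name is invented for illustration.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed for a Hamming SECDED code over the given number of data bits."""
    r = 0
    # Single-error correction needs 2**r >= data_bits + r + 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit adds double-error detection

for m in (8, 32, 64, 128):
    c = secded_check_bits(m)
    print(f"{m:3d} data bits -> {c} check bits ({c / m:.1%} overhead)")
```

      The wider the protected word, the lower the relative overhead, which is one reason memory ECC works on wide words rather than on individual bytes.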

    17. Re:ECC by dbIII · · Score: 1

      Indeed, that could be a problem, but the failure to teach computer science graduates mathematics to the level of high school probability and statistics is a far greater problem in my opinion. It results in posts like the above.

    18. Re:ECC by unixisc · · Score: 1

      Yeah, I forgot that boron is a dopant. Although then, you'd have no dopants that are significantly smaller than carbon

    19. Re: ECC by Anonymous Coward · · Score: 0

      A register value is not held in such a fragile state as DRAM. I don't think register bit flips are common at all.

    20. Re:ECC by Anonymous Coward · · Score: 0

      Imagine running an entire IC fabrication just to make a few chips.

      SOI processes are readily available in the fabs. ;)

    21. Re:ECC by Jeremi · · Score: 1

      Your ECC RAM won't matter much if the cosmic ray hits the CPU registers. Or a cell in a block of your flash storage.

      Also, your ECC RAM won't matter much if you get run over by a truck. So what? ECC RAM will help if there is a bitflip in your ECC RAM, that's what it's for and that's what the benefit is. It's not going to solve world hunger either, and nobody ever suggested that it would.

      --
      I don't care if it's 90,000 hectares. That lake was not my doing.
    22. Re:ECC by Bruce+Perens · · Score: 1

      The situation for AMSAT is still pretty bad, as far as I've heard. As a radio amateur group (and one that has launched quite a few satellites as space hitch-hikers) they can't afford the good stuff, but they get some donated by NASA and some of the commercial satellite companies. Only a few years ago they were still using the 1802 as their main vehicle controller, as that was their main choice in silicon-on-sapphire CPUs. They get some donations of space-qualified solar cells. They scrub their memory continuously. They use no boot ROMS. The program is loaded entirely by hardware, and then the CPU is started.

    23. Re: ECC by Bruce+Perens · · Score: 4, Interesting

      Read this paper. He postulates 2 soft errors per year for a Xeon 7500 with 24 MB L3 cache at sea level in New York City. He also gives figures for static RAM, which is the stuff of CPU registers.

    24. Re:ECC by Anonymous Coward · · Score: 0

      As they get smaller, I think we are fast approaching the point where it will be thought that a silicon atom is too big to allow for a shrink, and that semiconductor physicists will have to start looking at carbon and maybe even boron

      The issue with chips isn't transistor sizes, it's that they run too hot to scale up so they're at 13 layers in the high end chips. If we could do something as conceptually simple as embed copper heat pipes into the wafers or something as complex as the newer fluid-powered (and cooled) chips we could scale up along the Z axis and achieve orders of magnitude more power out of practically the same space (to put this into perspective, if an Intel i7 were a cube instead of a square, it would have 52,383,203,414,835 transistors instead of 1,400,000,000 transistors, or about 37,416 times the computing power - with things like liquid cooling and power you might even be able to scale it up higher for the same space because the bulk of the chip package is actually wiring, not CPU - to use the i7 as an example again this would be another 64x improvement with a ~40mm cube [the lower end of CPU sizes.])

      Making transistors smaller is worth it for research but realistically we need to find a better manufacturing process, several decades in and we still can't do more than hold a mask over a light and treat the surface of a light-sensitive material.

    25. Re:ECC by Enter+the+Shoggoth · · Score: 1

      The situation for AMSAT is still pretty bad, as far as I've heard. As a radio amateur group (and one that has launched quite a few satellites as space hitch-hikers) they can't afford the good stuff, but they get some donated by NASA and some of the commercial satellite companies. Only a few years ago they were still using the 1802 as their main vehicle controller, as that was their main choice in silicon-on-sapphire CPUs. They get some donations of space-qualified solar cells. They scrub their memory continuously. They use no boot ROMS. The program is loaded entirely by hardware, and then the CPU is started.

      Bruce, what do you mean by "...no boot ROMS.... loaded entirely by hardware" ?

      --
      Andy Warhol got it right / Everybody gets the limelight
      Andy Warhol got it wrong / Fifteen minutes is too long.
    26. Re:ECC by Anonymous Coward · · Score: 0

      Was that for the LEO CubeSats, or something flying further away?

    27. Re:ECC by Anonymous Coward · · Score: 0

      The program is permanently cooked into a custom chip, rather than stored on some generic writable ROM?

    28. Re: ECC by Entrope · · Score: 1

      You can protect registers with ECC, but any CPU that does that is likely to want some (more complicated) scheme to protect data in the execution units as well, and doing that is more costly. You need something like majority voting of 3+ units, or you need logic that updates ECC bits coherently, which I think is not well-researched.

    29. Re: ECC by Anonymous Coward · · Score: 0

      ECC memory doesn't cost $500 per gigabyte as a simple search on amazon or newegg would show you.

    30. Re:ECC by DontBeAMoran · · Score: 1

      And your truck won't matter much if it gets run over by an asteroid. So what?

      --
      #DeleteFacebook
    31. Re:ECC by Anonymous Coward · · Score: 0

      Exactly! This issue (as well as a related one from radiation from the old ceramic chip packages) has been known since the '70s. It's why IBM designed the PC with parity memory. Somewhere along the line cost became more important than reliability and the parity bit was eliminated.

    32. Re:ECC by Bruce+Perens · · Score: 1

      LEO doesn't have as much of a radiation problem, they are below the Van Allen belts.

    33. Re:ECC by Bruce+Perens · · Score: 1

      No boot ROM means that a hardware device constructed from discrete logic and analog chips directly demodulates digital data from the radio, addresses the memory, and writes the data. Once this process is completed, it de-asserts the RESET line of the CPU and the CPU starts executing from an address in memory. Really no ROM!

    34. Re:ECC by Anonymous Coward · · Score: 0

      There is several orders of magnitude less bit storage in registers than in any other type of memory. Chance of cosmic bit flip in memory, measured in days. Chance of cosmic bit flip in registers, measured in decades.

    35. Re:ECC by Reziac · · Score: 1

      I vaguely recall a study way back when that looked at stray radiation vs the computer case: Metal is helpful as shielding; plastic is not. D'oh!

      --
      ~REZ~ #43301. Who'd fake being me anyway?
    36. Re: ECC by Anonymous Coward · · Score: 0

      Yes, in the L3 cache, the single largest part of the chip. Registers are less than a rounding error in the number of transistors in a CPU. Take your 2 soft-errors per year, divide by a number between 100 and 1000, and that's your chance of a register having an issue. Modern CPUs are like a cache with a small execution unit buried somewhere in there. GPUs are another story.

    37. Re:ECC by Anonymous Coward · · Score: 0

      ZFS was made precisely because of these issues with storage. Undetected errors occur about once every 10TiB read; of course, they could be caused by a faulty write. The only reason anyone even knows about this is because of ZFS or other tests that use ZFS-like checksumming. Even with perfect ECC, your hardware could be faulty. ZFS will detect this, even if it can't fix it.
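
      The detect-but-not-fix behaviour can be sketched with a toy block store that keeps a digest next to each block and verifies it on every read. This is not ZFS's on-disk format or API; the class and method names are hypothetical, and SHA-256 is used only because it ships with the Python standard library.

```python
import hashlib

class ChecksummedStore:
    """Toy block store keeping a SHA-256 digest alongside each block."""

    def __init__(self):
        self.blocks = {}
        self.digests = {}

    def write(self, key, data: bytes):
        self.blocks[key] = bytearray(data)
        self.digests[key] = hashlib.sha256(data).digest()

    def read(self, key) -> bytes:
        data = bytes(self.blocks[key])
        # Silent corruption on the media shows up as a digest mismatch here.
        if hashlib.sha256(data).digest() != self.digests[key]:
            raise IOError(f"checksum mismatch on block {key!r}")
        return data

    def scrub(self):
        """Verify every block now instead of waiting for a normal read; return the bad keys."""
        bad = []
        for key in self.blocks:
            try:
                self.read(key)
            except IOError:
                bad.append(key)
        return bad

store = ChecksummedStore()
store.write("a", b"hello")
store.write("b", b"world")
store.blocks["b"][2] ^= 0x04  # simulate a single flipped bit on the media
print(store.scrub())
```

      With only one copy of the data the store can report the damage but not repair it; repair needs a redundant copy (a mirror or parity) to fall back on.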

    38. Re: ECC by bmk67 · · Score: 1

      Nonsense. I have 16GB of ECC RAM in my NAS at home, and I can assure you I spent almost two orders of magnitude less than $500/GB. IIRC, it was closer to $10/GB.

    39. Re:ECC by Anonymous Coward · · Score: 0

      Oddly, the potential for bit flip as die geometries become smaller does not increase. If you look at Xilinx UG-116, the latest nodes have a lower SEU rate than previous geometries. https://www.xilinx.com/support/documentation/user_guides/ug116.pdf

    40. Re:ECC by dakra137 · · Score: 1

      One full page newspaper ad from the week the IBM PC was announced, called it "The IBM of Personal Computers." The "B" in IBM is for "Business." The IBM PC was a business machine, not a toy. What distinguished it from Apple and Radio Shack computers? It was the first PC to provide parity memory.

      The IBM PC had parity memory, not ECC. On a parity error, BIOS displayed an error message on the top line of the display and stopped the computer (HALT or disabled interrupts then loop, such that the error message remained visible.) If I recall correctly, restart required the BRS (Big Red Switch) to power down and then up, rather than <alt><ctrl><del>. When the computer was powered up again, as always back then, BIOS did a Power On Self Test that among other things, tested memory.

      The parity memory meant that the IBM PC produced results as programmed, or not at all.

      Some customers did not really care. They preferred results, even if incorrect ones. Clone companies started providing a BIOS setup option to disable the parity checking. Next they started saving 11% on memory costs by not including parity memory bits at all. Customers did not care. The IT trade press and personal computing trade press did not raise a ruckus, and almost never included the presence or absence of parity memory in product reviews.

      Then IBM had started to become "Market Driven." The market did not value it, so eventually, IBM dropped parity memory from end-user PC's, but retained ECC memory for servers. At an internal IBM technical conference, I asked IBM Executives whether anything had changed that made parity memory unnecessary in PC's. The heads of both the PC Division and IBM Microelectronics (IBM made its own memory back then) agreed that dropping parity memory was the wrong thing to do, from both technical and validity-of-results perspectives, but was something the market did not value.

      The follow-on was for the industry to create premium "workstation" class end-user computers with ECC. Unfortunately, as far as I know, no company offers laptops, notebooks, tablets, smartphones, or IoT devices with ECC.
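
      The "results as programmed, or not at all" behaviour of parity memory can be sketched as follows. Even parity is assumed here purely for illustration, and the function names are invented. One extra bit per 8 data bits is 1 of every 9 stored bits, which is where the roughly 11% figure comes from.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit: 1 when the byte has an odd number of set bits."""
    return bin(byte & 0xFF).count("1") & 1

def parity_ok(byte: int, stored_parity: int) -> bool:
    """True when the byte still matches the parity bit stored with it."""
    return parity_bit(byte) == stored_parity

value = 0b0110_1001
stored = parity_bit(value)

assert parity_ok(value, stored)
assert not parity_ok(value ^ 0b0000_0100, stored)  # single flip: detected
assert parity_ok(value ^ 0b0001_0100, stored)      # double flip: goes unnoticed
```

      Parity can only say that something in the byte changed, not which bit, so the only safe response is to stop, as the original PC BIOS did. Correcting the error takes the extra check bits of ECC.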

    41. Re:ECC by Enter+the+Shoggoth · · Score: 1

      No boot ROM means that a hardware device constructed from discrete logic and analog chips directly demodulates digital data from the radio, addresses the memory, and writes the data. Once this process is completed, it de-asserts the RESET line of the CPU and the CPU starts executing from an address in memory. Really no ROM!

      Ok! (Very) remote pre-boot DMA, nice!

      Thanks for the expanded explanation.

      --
      Andy Warhol got it right / Everybody gets the limelight
      Andy Warhol got it wrong / Fifteen minutes is too long.
  6. Sun Microsystems cache failure by Andrew+Lindh · · Score: 1

    Sun blamed cosmic rays for causing CPU cache corruption and system crashes in their high-end enterprise systems. http://www.forbes.com/forbes/2...

    1. Re:Sun Microsystems cache failure by Anonymous Coward · · Score: 0

      IIRC, it was instead some radioactive isotopes in the packaging of the cache chips.

      http://www.sparcproductdirectory.com/artic-2001-dec-1.html

  7. ECC by Bruce+Perens · · Score: 4, Insightful

    This is why ECC is used to protect memory and data busses. At least on the good stuff :-) . One of the issues is die shrink. As the minimum detail size of the IC process gets smaller, the potential for radiation to flip a bit gets higher.

    Silicon-on-sapphire is the main way to implement silicon-on-insulator, which is more protective of radiation bit flips and less likely to latch-up. But since these have historically been required only for space satellites, they have been horribly expensive. Imagine running an entire IC fabrication just to make a few chips. As there are more applications for rad-hard chips, the price could fall.

  8. Nice, 45 year old info by Anonymous Coward · · Score: 0
  9. "Of course it can," says government by TheOuterLinux · · Score: 1

    Oh THAT'S what happened to the emails. Global warming is bullshit, but those cosmic rays will getcha every time...yeah -_- Witness the birth of the oncoming onslaught of pathetic excuses. Not doubting the logic at all, especially given how mass power outages have happened because of this, but I've got a feeling someone will do research near "HAARP" and, "Oh no...Why god why!...oh well." ¯\_(ツ)_/¯

    1. Re:"Of course it can," says government by Bruce+Perens · · Score: 1

      Professional computers have metal cases and no silly viewing windows, and proper EMI suppression on the ports. Effectively, Faraday cages. They are quite proof against low-frequency radio transmitters nearby, whatever the power.

    2. Re:"Of course it can," says government by Anonymous Coward · · Score: 0

      I think you have confused "proof" with added attenuation. Transmitters produce non-ionizing radiation but I think to stop gamma rays you might be needing lead walls (like the vests for xray technicians.)

    3. Re:"Of course it can," says government by thinkwaitfast · · Score: 1

      Cosmic rays penetrate many km into the Earth. Very energetic ones can pass straight through the Earth. A metal Faraday cage is not going to help much.

    4. Re:"Of course it can," says government by Bruce+Perens · · Score: 1

      Please read the comment again. Low-frequency transmitters do not make cosmic rays.

    5. Re: "Of course it can," says government by ewanm89 · · Score: 1

      Accept what is being talked about here is not low frequency radiation but extremely high frequency radiation, with wavelengths smaller than the gaps between atoms, that is only stopped by a direct hit, which matters if it happens to hit just the right atom in a circuit. Now these are extraordinarily rare events if the probability for any single ray is calculated, but we are being constantly hit by these rays all day every day, making the probability of causing an issue somewhere on the planet quite high. There are some solutions though: ECC RAM, for example, means individual bit flips can be fixed and is what is used in most server systems; however, support on consumer level gear is nonexistent. If that isn't enough, run systems in triplicate on separate machines, then run a vote on the result; only one machine is likely to have had a bit flip during that specific operation.

    6. Re:"Of course it can," says government by Bruce+Perens · · Score: 1

      Faraday cages are really good for RF, and I was writing about HAARP. The X rays that you get from a radiologist don't have the same energy level as cosmic rays. The best we can do about energetic cosmic rays is to make our equipment less susceptible, because you can never have enough shielding.

    7. Re: "Of course it can," says government by Bruce+Perens · · Score: 2

      The comment I was responding to was regarding HAARP. And that's "except" FYI. :-) ECC is actually more reliable, for its problem domain, than a triple voting system. The probability that you would arrive at a valid ECC code for bad data due to multiple bit flips is much lower than the probability of two out of three systems voting wrong. So, it is at least theoretically possible to design a computer system with data integrity throughout that exceeds that of a voting system.

    8. Re: "Of course it can," says government by ceoyoyo · · Score: 1

      What we're actually talking about is cosmic rays, which are matter particles (mostly protons), not any kind of electromagnetic radiation. Those generally slam into something in the atmosphere, producing showers of secondary particles. Occasionally some of these make it to the ground. The article mentions neutrons, but these seem to be mostly muons.

      Of course Bruce Perens, to whom you replied, was talking about the radio waves from HAARP, which was mentioned by the OP.

    9. Re: "Of course it can," says government by gumbi+west · · Score: 1

      The article mentions neutrons, but these seem to be mostly muons.

      Neutrons are not muons. Nor are muons the problem in bit flips.

    10. Re:"Of course it can," says government by Anonymous Coward · · Score: 1

      Cosmic rays don't even make it through the atmosphere. Not even multi TeV cosmic rays, which are rare. They never "pass straight through the Earth". Particles with energies of about 10^18 eV arrive at the rate of about one per square kilometer of atmosphere per century. These very high energy cosmic rays are detected at the ground by looking for the secondary photons, electrons, muons and neutrons that shower large areas of the ground after the primary particle impacts the atmosphere.

    11. Re: "Of course it can," says government by Anonymous Coward · · Score: 0

      I always enjoy watching people who have no idea who they are arguing with.

    12. Re: "Of course it can," says government by Bruce+Perens · · Score: 1

      The problem is thermal neutrons. These are secondary particles from interaction of cosmic rays with the atmosphere, and you can't shield from them.

    13. Re: "Of course it can," says government by gumbi+west · · Score: 1

      thermals are actually quite easy to shield from, anything high in boron will do it. It's the high energy ones that you can't (economically) shield from. The issue with high energy neutrons is that what people use as gamma shields tend to make more neutrons. Basically, you need a swimming pool over your computer to shield from them.

    14. Re:"Of course it can," says government by Khyber · · Score: 1

      Cosmic rays are essentially depleted atomic nuclei. They don't penetrate much of shit, they tend to smash into another atom and that's the end of it. We detect them on the ground by secondary effects.

      --
      Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    15. Re: "Of course it can," says government by werepants · · Score: 1

      ECC is actually more reliable, for its problem domain, than a triple voting system. The probability that you would arrive at a valid ECC code for bad data due to multiple bit flips is much lower than the probability of two out of three systems voting wrong.

      I'd say it all depends on the architecture of the systems in question, and there are a variety of possible outcomes. If you compare a bit-level TMR system to an ECC system (suppose 8 data bits and 1 ECC bit) in which all bits are equally susceptible to upset, then you clearly have a greater chance to accumulate 2 bits in error in the ECC system just because you've got 9 chances to upset 2 bits instead of only 3 chances. If you're flipping coins, you've got a better chance of seeing heads > twice in 9 flips than you do of seeing heads twice in 3 flips.

      Suppose you've got a 10% chance of accumulating an error in a given bit per day, you end up with ~72% chance of 2 upsets in the ECC architecture (.9*.8, an approximation), vs 6% (0.3*0.2) in the TMR architecture. Granted, you've got to multiply that by 8 to get the same amount of data storage as the ECC example. So at the end of the day in this notional case, you end up with 72% chance of lost data in the ECC architecture, and 48% chance of lost data in the TMR example. I've used imprecise approximations, but they demonstrate that statistically speaking, TMR can provide better protection than ECC in some cases.

      Now, many ECC algorithms provide a single-error correct dual-error detect (SECDED) capability, which does confer a meaningful advantage vs TMR, where you can correct a single bit but you have no idea if you are actually seeing a two bit error. On the whole, though, you're still going to get 2x upsets far more often with ECC simply because there's a larger target area. So you end up with data loss more often with ECC, but at least you know that you've lost data.

      It's also worth noting that if TMR is implemented on a byte level, where you compare the contents of 3 bytes, TMR looks a lot better because it's very unlikely that you'll upset the same bit on two different bytes. So effectively you do end up with something more like SECDED.

      Anyhow, it's a complex topic with lots of potential for statistical hangups. In my experience, ECC is attractive primarily because it is so efficient compared to TMR - you're generally talking a 1-10% memory overhead to provide some very capable protection that will generally bring upsets down by a few orders of magnitude. With TMR, the overhead is 300%. However, TMR can be simple, and for situations where you have memory capacity, board space, and/or power to spare, it can be a superior option.

      Last but not least, in both cases, the rate of uncorrectable errors is highly dependent on data retention time. You have to keep moving data in and out fast enough that the chance of accumulating multiple error bits in a data word is small. So there's a time component to the whole discussion as well.
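
      The notional comparison above (10% chance of upset per bit per day, a 9-bit ECC word versus bit-level TMR) can also be worked out with exact binomial probabilities. This is only a sketch under that comment's assumptions (independent upsets, data lost once two or more bits in a group flip); the function and variable names are made up for illustration. The exact figures come out lower than the rough ones quoted, but the ordering is the same: the eight TMR triplets fail slightly less often than the single 9-bit word.

```python
from math import comb


def p_at_least(k, n, p):
    """Probability that at least k of n independent bits are upset."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


P_BIT = 0.10  # notional per-bit upset probability per day, taken from the comment

# One 9-bit word (8 data bits + 1 check bit): data is lost once 2+ bits flip.
p_ecc_word = p_at_least(2, 9, P_BIT)

# One TMR triplet stores one bit: the vote fails once 2+ of the 3 copies flip.
p_tmr_bit = p_at_least(2, 3, P_BIT)

# Eight triplets hold the same 8 data bits; any single failed vote loses data.
p_tmr_byte = 1 - (1 - p_tmr_bit) ** 8

print(f"9-bit ECC word, 2+ upsets: {p_ecc_word:.3f}")
print(f"TMR, one bit:              {p_tmr_bit:.3f}")
print(f"TMR, eight bits:           {p_tmr_byte:.3f}")
```

      Changing P_BIT shows how quickly both figures fall as the per-bit upset chance (or the retention time) shrinks.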

    16. Re:"Of course it can," says government by TheOuterLinux · · Score: 1

      The HAARP part of my comment was just a joke, but still kinda cool to see intelligent responses though. :)

  10. explains so much by bobmajdakjr · · Score: 1

    "Cosmic rays, man." -- Bethesda

    1. Re:explains so much by sunwukong · · Score: 1

      Client: ... it crashed again! What's going on with the server?
      Me: I've recently become a Herald of Galactus. I may be difficult to reach from now on .... if you're lucky ...

  11. Re:@Intel: Why no ECC for consumer-grade processor by unixisc · · Score: 2

    Actually, wouldn't cosmic rays be capable of flipping bits even in ECC memory and processors, thereby making the whole ECC thing useless? Particularly in more recent process nodes, where the lithography scale is approaching atoms, and where cosmic rays would have a far greater effect?

  12. Acts of God by Anonymous Coward · · Score: 0

    Apple doesn't cover acts of God. It's actually in the warranty.

    1. Re:Acts of God by Anonymous Coward · · Score: 0

      Can Apple prove that God exists and created the cosmic rays?

    2. Re:Acts of God by Bruce+Perens · · Score: 1

      Tee hee.

      The legal definition of Act of God does not itself admit to the existence of a deity. Just natural phenomena which are beyond human agency to predict or prevent.

  13. Bring back analog computing by Anonymous Coward · · Score: 0

    Peace, love, op amps and 33 1/3 rpm records man.

  14. preposterous! by Gravis+Zero · · Score: 5, Informative

    When your computer crashes or phone freezes, don't be so quick to blame the manufacturer.

    If my computer crashes or phone freezes, it's almost certainly the fault of the person who released the software without properly debugging it. Cosmic rays are very low on the list of reasons why your device has malfunctioned.

    --
    Anons need not reply. Questions end with a question mark.
    1. Re:preposterous! by Arkh89 · · Score: 1

      You are right in the lottery sense: if your particular phone or app crashes, it is very unlikely that it is due to cosmic rays. However, it is likely that it happens fairly often around the world. This is similar to the lottery: it is unlikely that you will win, but it is likely that someone will win.

      It's all a matter of cross-section of the devices actually. If we want to compare, the iPhone 4 (an old baseline, smaller than today's generation but close to most of the low-cost devices) measures 0.007 m^2, while the top 10 largest data centers (from this random link) combined measure about 1.7 x 10^6 m^2. I am going to assume only 1% of the surface is occupied by sensitive chips (?). You would need about 2.4 million iPhone 4s to cover the same area. Thus, it is very possible that mobile hardware falls victim to more high energy burps than immobile hardware.

    2. Re:preposterous! by Anonymous Coward · · Score: 0

      >Cosmic rays are very low on the list of reasons why your device has malfunctioned.
      Bit errors are rare but they do happen.

      I've encountered a single bit flip error before on a server with ECC memory, which leads me to believe the error occurred in the processor and not the ram.
      It caused a mysql replication slave to fail on a query where the master succeeded.
      I'm not about to blame mysql devs for this, no amount of debugging could have caught this because it wasn't due to a bug on their part.
      You might then argue that it's still their fault due to inaction, but that's not the same as it being a bug.

    3. Re:preposterous! by ShaunC · · Score: 1

      Low on the list, but certainly nonzero. Given the increasing number of devices out there it's probably happening around the world with some regularity. There just isn't a way for most of us to properly measure or attribute the occurrences.

      Say you're driving down the interstate and your cruise control shuts off, but you're sure you didn't bump the brake. Your $1.49 bag of chips rings up as $9.49 at the grocery store, but re-scans at the correct price after a void. A few pixels go blurry in an otherwise flawless TV broadcast. We tend to chalk these things up as "a glitch" and go on with life, but a few of them really are caused by tiny visitors from outer space...

      --
      Thanks to the War on Drugs, it's easier to buy meth than it is to buy cold medicine!
    4. Re:preposterous! by complete+loony · · Score: 1

      Bit flipping happens often enough just in stored DNS names that it's worth buying up some bit-flipped names.

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
    5. Re:preposterous! by craXORjack · · Score: 4, Funny

      Some pieces of software are just the recipients of more cosmic rays than others. For example, Windows 3.1 used to attract ultra high energy cosmic rays from as far away as Mars and for a time was making astronomers' lives difficult due to the showers of particles released when many of those rays would strike molecules in the atmosphere instead of the Microsoft copyrighted code they were aiming for. Other software that attracts higher than normal numbers of cosmic rays are the Therac-25 and Diebold voting machines.

      --
      Liberals call everyone Nazis yet they are the closest thing to it.
    6. Re:preposterous! by Bruce+Perens · · Score: 1

      Someday I will be able to completely debug a piece of software. It will be a very small piece of software, I am sure.

      People discount the complexity that we face when attempting to fully debug anything.

    7. Re:preposterous! by Bruce+Perens · · Score: 1

      Somewhere on the Internet, it is perfectly possible that your packet was re-written by something that glitched a bit and wrote a proper checksum for the glitched data.

    8. Re:preposterous! by Anonymous Coward · · Score: 0

      You are right but in my case it was on a LAN.

    9. Re:preposterous! by Gravis+Zero · · Score: 1

      People discount the complexity that we face when attempting to fully debug anything.

      As a programmer, I recognize that getting rid of every bug in a large piece of software is a pipedream. I just want them to get the superficial bugs out of the way (which plague every release they make) so that they can actually focus on fixing the deeper bugs in days, not months.

      However, with a large corporate entity like Microsoft, it is not unreasonable to insist on responsible programming practices though Microsoft is slow to adopt these. One of these practices is the reuse of existing code. With every release, it seems like they have rewritten the whole OS because it's always full of bugs at every layer. Like you mentioned, keeping the software small is a way to make software securable and despite their name, Microsoft has no interest in doing that.

      --
      Anons need not reply. Questions end with a question mark.
    10. Re:preposterous! by Khyber · · Score: 1

      Absolutely zero. Cosmic radiation - a depleted nucleus - never reaches the ground. We detect them at ground level through their secondary effects (produced photons, electrons, etc.) They smash into another atom in the atmosphere and are gone.

      --
      Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    11. Re:preposterous! by Raenex · · Score: 1

      Absolutely zero.

      You'd think by chance alone the occasional one would reach ground level.

    12. Re:preposterous! by Anonymous Coward · · Score: 0

      I think you have just invented the shields required for sustainable human deep space travel. The only thing required now is to create the inverse of Windows 3.1 and install it on an inverse Diebold voting machine attached to the spacecraft. Bonus points for the inverse solitaire to entertain the crew and keep the morale up during the long journeys.

  15. NOT bringing down a passenger jet by Alain+Williams · · Score: 2

    Follow through the links: a cosmic ray caused problems, the jets misbehaved for a bit but the duplicated systems protected them from a crash - as they are supposed to after a malfunction.

  16. Imagine when the U.S. weaponizes this into a craft by Anonymous Coward · · Score: 0

    and sends it into orbit. They can just focus beams on anything and hope they'll hit and disrupt, and the target will never know what happened to their electronics. They will wreak havoc and cause conflict anywhere it benefits them.

  17. Hmm. by hey! · · Score: 1

    Shouldn't "News for Nerds" be news to nerds?

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  18. Modern Error Correction by Anonymous Coward · · Score: 0

    Yeah, back in the 90's I definitely remember this being a big issue, but there's so much expected error already in the computations being done in modern CPU's, that gets fixed at the hardware level before it ever impacts software, I honestly haven't thought of it much more than in passing for the last 10 or 15 years. Technically it's always POSSIBLE, but the frequency of these incidents is almost completely negligible.

  19. Re:@Intel: Why no ECC for consumer-grade processor by drinkypoo · · Score: 4, Informative

    Actually, wouldn't cosmic rays be capable of flipping bits even in ECC memory and processors, thereby making the whole ECC thing useless?

    No, this is what ECC is for. If a bit is flipped, you can detect it. If you have enough parity bits, you can even detect which bit is flipped, and correct it on the fly. Computation occurs as normal and an error shows up in the syslog.

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  20. Odds by JBMcB · · Score: 4, Insightful

    The odds of a cosmic ray hitting your memory at the exact right spot to flip a bit are one in hundreds of millions. There are just enough computers out there that it happens from time to time. The odds of FIVE rays hitting just the right locations to flip four bits and a parity bit are, pardon the pun, astronomical.

    --
    My Other Computer Is A Data General Nova III.
    1. Re: Odds by Anonymous Coward · · Score: 0

      When high energy cosmic rays hit the upper atmosphere, they create a shower of particles, each of which may hit further things. The result is a very clumpy distribution of particles. So while your chance of being hit by a single one is low, the chance of being hit by two or more is not a simple power of the single-hit probability, as the events are not uncorrelated. The chances are still lower, but still high enough that there have been recorded cases of a parity bit not being enough.

    2. Re:Odds by Bruce+Perens · · Score: 1

      The odds of a cosmic ray hitting your memory at the exact right spot to flip a bit are one in hundreds of millions.

      Each of my systems has more than hundreds of millions of bits of RAM. Some of them have 128 thousand million bits. There are a lot of places to hit.

    3. Re:Odds by dbraden · · Score: 1

      Odds are low, but the frequency is high. An oft-cited IBM study from the 90s determined that memory will get a cosmic ray bit-flip once per 256MB per month. So, an 8GB system will see about 32 bit-flips per month. Probably more with modern memory. Of course, as you mention, it's not likely that several would occur at the same time in nearly the same place.

      The vast majority will be in unused memory, executable code that never gets executed, or even in code or data that, while corrupted, simply doesn't have a noticeable effect.

      But, what about a single bit flip of a parity bit? Does a good bit get "corrected" to an incorrect value? Serious question, as I really don't know enough about the specifics.

    4. Re: Odds by Entrope · · Score: 1

      Typical error correcting codes decode correctly wherever the errors occur, as long as the number of errors is within the error-correcting capacity of the code. They would be pretty poor if they failed more often when errors occur in the parity bits.

    5. Re:Odds by Khyber · · Score: 1

      "An oft-cited IBM study from the 90s"

      When the cells were much larger and easier targets...

      --
      Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    6. Re:Odds by Anonymous Coward · · Score: 0

      One that I read was 1.3x10^-12 upsets per bit per hour. That works out to a 95% chance of at least 1 bit flipped in 4GiB of memory in 3 days. If you use ECC memory, it's a 96% chance of an unrecoverable, not undetectable, bit error per 2.7 million years, assuming a random distribution.
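
      The 4GiB figure above can be sanity-checked with a quick Poisson estimate. This sketch assumes the quoted rate means upsets per bit per hour and that upsets are independent; the function name is made up for illustration.

```python
import math

RATE_PER_BIT_HOUR = 1.3e-12  # rate quoted in the comment (units assumed)
BITS = 4 * 2**30 * 8         # 4 GiB expressed in bits
HOURS = 3 * 24               # three days


def p_at_least_one_flip(rate, bits, hours):
    """Chance of one or more upsets, treating upsets as a Poisson process."""
    expected_upsets = rate * bits * hours
    return 1 - math.exp(-expected_upsets)


expected = RATE_PER_BIT_HOUR * BITS * HOURS
print(f"expected upsets in 3 days: {expected:.2f}")
print(f"P(at least one flip):      {p_at_least_one_flip(RATE_PER_BIT_HOUR, BITS, HOURS):.3f}")
```

      Under those assumptions the result lands within about a percentage point of the figure quoted, with roughly three expected upsets over the three days.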

    7. Re:Odds by ChrisMaple · · Score: 1

      A properly designed ECC detects errors in the parity bits. That can mean that one more parity bit is required than if the parity bits weren't covered by the ECC.

      For instance, an ECC for 8-bit data with 3 parity bits could correct a single bit error in the 8-bit part, but would cause an error in the 8-bit part if a parity bit was wrong. An ECC for 8-bit data with 4 parity bits can correct any of the total 12 bits, although correcting an error in the parity bits may not be necessary.
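
      A minimal sketch of the 8-data-bit, 4-parity-bit case, using a textbook Hamming(12,8) layout. That layout is an assumption, since the comment does not name a specific code, and the function names are illustrative. It corrects any single flipped bit among the 12, parity bits included, but it does not detect double errors; SECDED would need one more overall parity bit.

```python
PARITY_POS = (1, 2, 4, 8)                # power-of-two positions hold parity
DATA_POS = (3, 5, 6, 7, 9, 10, 11, 12)   # remaining positions hold the data


def encode(byte):
    """Return a 12-bit codeword as a list of bits; index 0 is unused."""
    word = [0] * 13
    for i, pos in enumerate(DATA_POS):
        word[pos] = (byte >> i) & 1
    for p in PARITY_POS:
        # Parity bit p covers every position whose index has bit p set.
        word[p] = sum(word[pos] for pos in range(1, 13) if pos & p) % 2
    return word


def syndrome(word):
    """Return 0 for a clean word, else the position of a single flipped bit."""
    s = 0
    for p in PARITY_POS:
        if sum(word[pos] for pos in range(1, 13) if pos & p) % 2:
            s |= p
    return s


def decode(word):
    """Correct at most one flipped bit; return (byte, corrected_position)."""
    word = list(word)
    s = syndrome(word)
    if s > 12:
        raise ValueError("syndrome points outside the word: multi-bit error")
    if s:
        word[s] ^= 1  # flip the offending bit back, data or parity alike
    byte = sum(word[pos] << i for i, pos in enumerate(DATA_POS))
    return byte, s


codeword = encode(0x5A)
codeword[8] ^= 1          # simulate an upset in a parity bit
print(decode(codeword))   # data survives; the position reported is 8
```

      An upset in a parity position is located and repaired the same way as one in a data position, which is the point made above.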

      --
      Contribute to civilization: ari.aynrand.org/donate
  21. Stock markets and the BOFH by bernywork · · Score: 1

    Even though market participants are warned about this by exchanges, you do have to wonder, if it makes it into the BOFH excuse calendar, can you really take it seriously?

    --
    Curiosity was framed; ignorance killed the cat. -- Author unknown
    1. Re:Stock markets and the BOFH by bernywork · · Score: 1

      Oh, that and solar flares

      --
      Curiosity was framed; ignorance killed the cat. -- Author unknown
  22. Goddamit where was this ... by CaptainDork · · Score: 1

    ... during my IT career?

    I could have used this as a dodge after I fucked something up in the system.

    I did the sunspot thing back in 2012.

    "Russia," seems to work well, though.

    --
    It little behooves the best of us to comment on the rest of us.
  23. Cosmic radiation is often just an excuse. by Anonymous Coward · · Score: 0

    The cosmic radiation excuse has been used by manufacturers for 30 years, and in a lot of cases it boils down to the techs being too lazy, or not sufficiently skilled, to diagnose the actual issue.

    In just about every case when customers have pushed and a more complete diagnosis made, the manufacturer has discovered an actual fault.

    This was the case with the Cisco gear recently: from what I recall, Cisco quickly retracted the explanation in the face of industry ridicule.

    If a particular model of equipment is suffering a high fault rate, then the simplest explanation is usually the most likely: there's a fault in the design.

    1. Re: Cosmic radiation is often just an excuse. by Entrope · · Score: 1

      It's never lupus, and it's never a radiation-induced bit flip. (Until it is one, or both, of those.)

  24. Re:@Intel: Why no ECC for consumer-grade processor by thinkwaitfast · · Score: 1
    This has all been known for over 30 years. I knew about it before I knew what it meant because the old timey computer magazines like BYTE had articles about it.

    Are people really less knowledgeable about computers now than they were in the 80's?

  25. Yep. Cosmic rays. by PPH · · Score: 1

    I'm certain it's on the list somewhere.

    --
    Have gnu, will travel.
  26. Re:Imagine when the U.S. weaponizes this into a cr by thinkwaitfast · · Score: 1

    This was proposed as an SDI weapon in the 1980s. And it wasn't just the US. Russia too, unless you don't believe they do stuff like that or have the capability.

  27. Re:@Intel: Why no ECC for consumer-grade processor by drinkypoo · · Score: 1

    Are people really less knowledgeable about computers now than they were in the 80's?

    If you mean on average, I think the answer is probably yes. More people know how to operate them now, but then, operating them has become orders of magnitude simpler.

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  28. Re:@Intel: Why no ECC for consumer-grade processor by scdeimos · · Score: 2

    Are people really less knowledgeable about computers now than they were in the 80's?

    Yes, absolutely! Have you never sat down with an IT graduate from the 2000's to figure out what they actually know about computer hardware?

  29. Yep, Cosmic rays CAN cause problems by mhkohne · · Score: 1

    But much more frequently, problems are caused by somebody f**king something up. You shouldn't be looking to cosmic rays until you're pretty sure it's not just stupidity in action.

    --
    A thousand pounds of wood moving at 300 feet per minute. Don't get in the way.
  30. Re:@Intel: Why no ECC for consumer-grade processor by Anonymous Coward · · Score: 0

    Yeah, they kept making people after the 80s

  31. I didn't push "send", your Honor... by Anonymous Coward · · Score: 0

    it was a cosmic ray that did the act, not through anything from my intent.

  32. My computers stopped crashing... by Anonymous Coward · · Score: 0

    ---right after I installed Linux over Windows.

  33. Re:@Intel: Why no ECC for consumer-grade processor by Anonymous Coward · · Score: 0

    ECC only protects the things that ECC protects. Long lived data like the contents of RAM and the HDD/SSD should be backed up by some form of error check code simply because they're going to sit around under the assumption that everything is ok (and it might not be).

    ECC checking the cache might be necessary but it's refreshed so frequently that it's unlikely to contain bad data. We also don't need correcting codes here because we can always re-fetch cache lines from RAM if they're bad, so we can use fewer bits.

    ECC checking the registers might be necessary?

    Also, radiation harden the processor die and the whole machine. Then radiation harden the building they're all in.

    None of this guards against data handed off between components. If a bit is in flight on the bus or on a network cable and isn't protected by a check code then it might arrive flipped if a stray high-energy particle smashes into it on the way. Moving data needs to be verified as much as data at rest for complete correctness.

    Nothing guarantees correctness better than resetting the machine to default state from verified storage. How many bits have been smashed in your RAM since you last rebooted? It might be none, or it might be hundreds. You hit an LSB and something is off by one. You hit an MSB and you're potentially off by trillions. You hit bit 12 and you're off by exactly 4096. You hit any bit in a machine instruction and crashes are pretty much inevitable or (worse) every subsequent piece of data moving through those instructions gets turned into garbage. A hard crash is actually the best outcome.
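
    The "bit 12 means off by exactly 4096" arithmetic is easy to see in a couple of lines. This is just an illustration; the vote count is a made-up number, not the real tally from the Belgian incident in the summary.

```python
def flip_bit(value, bit):
    """Return value with the given bit position inverted (bit 0 is the LSB)."""
    return value ^ (1 << bit)


votes = 514                      # hypothetical stored counter
corrupted = flip_bit(votes, 12)  # single-event upset in bit 12

print(votes, "->", corrupted, "difference:", corrupted - votes)

# The error size depends only on the bit position: 2**bit,
# added if the bit was 0 and subtracted if it was 1.
for bit in (0, 12, 40):
    print(f"bit {bit}: off by {abs(flip_bit(0, bit))}")
```

    A flip in bit 40 of a 64-bit integer is already an error of about a trillion, which is where the "off by trillions" remark comes from.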

  34. bullshit by gravewax · · Score: 2

    Your phone or computer crash is thousands of times (if not millions) more likely to have been caused by the manufacturer's or coder's error or fault than by cosmic rays. Anyone that decides to consider cosmic rays as a more likely answer deserves to continue to experience their issues.

    1. Re:bullshit by thegarbz · · Score: 1

      No it isn't. Cosmic rays most definitely have an impact on your phone.

      You can take basic precautions though. I find new phones come with a small amount of EM shielding that blocks cosmic rays. As time progresses this shield gets weak and more and more CPU power is dedicated to keeping it operating properly, which also slows down your phone. However it is often fixed by performing a factory reset (which also resets and recalibrates the EM shielding), making your phone fast and cosmic ray resistant again.

      Of course if you really want to be sure you can just encase your phone inside a lead box. I've never seen a phone encased in lead crash, however it does play havoc on your signal strength.

    2. Re:bullshit by Anonymous Coward · · Score: 0

      IBM has performed some massive studies on this and they concluded that ECC memory (specifically advanced ECC memory) is very much necessary due to cosmic rays. .... unless your electronics are under 50ft of limestone.

      "One example of the magnitude of the cosmic ray soft error phenomenon demonstrated
      that with a certain sample of non-IBM DRAMs the soft error rate as
      measured under purely real life conditions, and with the benefit of millions of
      device hours of testing, the soft error rate at sea level was measured at 5950
      FIT per chip. When the exact same test setup and DRAMs were moved to an
      underground vault, shielded by over 50 feet of rock, which effectually eliminates
      all cosmic rays, absolutely ZERO fails were recorded! Not only does this result
      emphatically validate the existence of a significant soft failure rate due to cosmic
      rays, but it simultaneously eliminates the possibility that alpha-particle soft error
      is even a contributor in the same order of magnitude because of the zero fails
      underground."

  35. Beware the Cosmic Ray Gnomes... by Anonymous Coward · · Score: 0

    1) Many management cultures devolve to a point where slaves can only appear to be
    competent by having an ultra low MTTR, regardless of bug difficulty. This is necessary
    because the devolved management team cannot tell the difference between hard and
    easy bugs (or worse, they can, but they know their bosses are knuckle draggers, so
    they game the metric upwards to avoid getting bogged down on technical debt).

    2) Often, a different engineer can pick up the cosmic ray bug and reproduce
    the problem, either because they have one of the few competent managers
    left (who has a career death wish), or they are pining for the annual layoff package.

    3) Cosmic Ray Gnomes live amongst us. They are those magical creatures who
    can invoke celestial intervention every time they run their bug reproduction. There
    is no actual bug, but they fool peers and management with their dark magic.
    These Gnomes, when they reveal themselves, must be excised from the organization as
    soon as possible, before upper management falls under their magical spell.

    Think of the children!

  36. built cosmic hodoscope back in 1992 by Anonymous Coward · · Score: 1

    As a student intern working at a lab in 1992 my project was to build a cosmic hodoscope to record cosmic rays. It involved scintillators, fiber optics, HV photomultiplier tubes, a timing coincidence system, radiation sources for calibration and testing, etc. When two PMTs fired within the same timing gate window it was the result of a cosmic ray and we could determine the path of the cosmic ray. The interesting thing was that this showed that there are quite a few energetic cosmic rays reaching the surface of the earth and that they have no problem passing through the atmosphere, buildings, etc. It's very real.
    prsdntl

    1. Re:built cosmic hodoscope back in 1992 by Anonymous Coward · · Score: 0

      Was there ever enough energy to start the Delorean?

  37. Re:@Intel: Why no ECC for consumer-grade processor by unrtst · · Score: 2

    Are people really less knowledgeable about computers now than they were in the 80's?

    If you mean on average, I think the answer is probably yes.

    If you mean on average out of the total number of computer users or programmers, then yes (they are less knowledgeable), because that pool has increased by lots and lots.

    If you mean on average out of all people, then no. I suspect there are far more people that know what ECC does now than did in the 80's, and the total population count hasn't gone up as much as that number, so there are more people on average, and in total, that know about the inner workings of computers.

    I think there are just far more people touching stuff they know very little about, and we assume they must know *something*, but they don't.
    Compare it to early cars, where every operator had to know a bunch of stuff about it just to keep it running, but it was simple enough that the average operator could learn that stuff. Now, most cars make maintenance very difficult, and many drivers would be hard pressed to do simple things like changing the oil, flushing the radiator, replacing a brake light, replacing the battery, changing a tire, jump starting, etc. That said, there are WAAAAY more people that know WAAY more about cars now than there were in 1930. It's just shifted more to professional/hobbyist knowledge than something that every operator is required to know.

      More people know how to operate them now, but then, operating them has become orders of magnitude simpler.

  38. Thousands of years, same surprises by holophrastic · · Score: 2

    Is anyone surprised that if you store things once, and reference the one place alone, that you get screwed on occasion?

    Is the word "corroboration" new? How about "validation", "authentication", "verification", and, oh, I don't know, "paper-trail"?

    It's electronic information, not magic. The benefit of not carving into stone is that you can readily duplicate information into multiple places. Use it.

    RAID.

    1. Re:Thousands of years, same surprises by Anonymous Coward · · Score: 0

      *sigh* Another slashdotter who doesn't understand what RAID is for.

      RAID is for high availability, NOT high reliability. BACKUPS are for high reliability.

      If a bit gets flipped and written to one drive, RAID will dutifully duplicate that error to every other copy in the array.

    2. Re:Thousands of years, same surprises by holophrastic · · Score: 1

      *sigh*, I didn't punctuate, and you chose to interpret that I made a mistake, instead of interpreting that I didn't.

      I've been upset with RAID 5, in particular, for exactly that reason -- it has the ability to notice a single bit-flip, but it specifically does not check. I've even built a working prototype of a RAID 5 implementation that does check on read, notices that the parity is amiss, and screams. I've built another (in software), that chains the parities so it can actually repair a single bit-flip 80% of the time.

      When I typed "RAID." I was continuing my complaint that even in electronic data, no one corroborates anything -- the thesis statement of my post.

      I was ambiguous, you could have decided that I was correct.

      So, *sigh* another person who chooses the inference that makes the implication incorrect, instead of the inference that would make the implication correct.
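
      A minimal sketch of the read-time check described above: XOR the data blocks of a stripe and compare the result against the stored parity block. This is an illustration only, not the poster's prototype or any real RAID driver; the block contents and function names are made up. A mismatch shows the stripe is inconsistent, but plain XOR parity alone cannot say which block holds the flipped bit.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity_block):
    """True when the stored parity matches the data actually read back."""
    return xor_blocks(data_blocks) == parity_block


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]  # three data blocks in one stripe
parity = xor_blocks(data)                        # what a write would have stored

# Simulate a single bit flip in the second block after it was written.
damaged = [data[0], bytes([data[1][0] ^ 0x04, data[1][1]]), data[2]]

print(stripe_is_consistent(data, parity))     # clean read
print(stripe_is_consistent(damaged, parity))  # flipped bit is noticed
```

      Repairing the flip takes extra information, such as a per-block checksum or a second, independent parity, to identify the bad block first.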

  39. Re:@Intel: Why no ECC for consumer-grade processor by Anonymous Coward · · Score: 0

    Depends on the ECC algorithms.

    You can design these algorithms to be able to detect an arbitrary number of mistakes in the data (on the bit level). You can also design them to be able to correct an arbitrary number of mistakes in the data.

    Standard, everyday ECC RAM can detect up to two bit flips in every 64-bit word. It can also correct a single bit flip in every word. One bit flip? Fix the problem, log a warning, move on. Two bit flips? Throw an unrecoverable, fatal hardware exception. Depending on where in memory it happened, the OS might kill a process, panic, or just carry on as if nothing happened (if the memory was unallocated, or could be rebuilt from source - eg, an in-memory hard disk cache.) If you see errors coming consistently from a given region of memory - mark it as bad, map it out, and log it so the stick can be replaced.

    With more advanced techniques you can even recover from an entire memory chip failing completely. (Chipkill).

    In any event: better to have at least some error detection, even if it doesn't have error recovery abilities, than none; it might not matter if a couple of frames of a game get corrupted, but it will matter if financial data has a couple of bits flipped...

  40. Serious Computer Glitches Can Be Caused By IDIOTS by dbIII · · Score: 1

    Why didn't that voting machine have ECC memory? Why didn't the software have bounds checking?
    Yes, I know it's common, I use some software (from a very large company that was run by a guy you don't go hunting with) that when it hits some input data with a negative integer IT ATTEMPTS TO ALLOCATE NEGATIVE MEMORY, and of course, crashes - but things that stupid should never happen (especially since it's supposed to deal with very noisy data). If it's out of range for a bit of code to work on then don't let it in! Don't just check in one place and hope that catches everything, check everywhere that out of bounds data is a problem.

  41. Altitude matters by Anonymous Coward · · Score: 0

    LANL is located at 7000', higher than most other supercomputer installations. DOE labs often build the fastest supercomputers in collaboration with a vendor that has won such a bid. On one, SECDED memory was omitted to save money (there is a LOT more total memory in a supercomputer than anything an individual or any corporation would build). LANL experienced transient errors that could not be traced, and finally concluded that the altitude combined with the SECDED mistake was the root cause.

  42. We've always known this. This is why we have ECC by kriston · · Score: 2

    We've always known this. This is why we have ECC memory on servers.

    --

    Kriston

  43. there is a product for this! by ooloorie · · Score: 1

    Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off soft errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read or write to them.[2] Hence, the error rates increase rapidly with rising altitude; for example, compared to the sea level, the rate of neutron flux is 3.5 times higher at 1.5 km and 300 times higher at 10–12 km (the cruising altitude of commercial airplanes).

    https://en.wikipedia.org/wiki/...

    And, whaddaya know, you can buy them pretty much everywhere. For voting machines, medical applications, etc. they should obviously be used.

  44. Re:We've always known this. This is why we have EC by kriston · · Score: 3, Informative

    It's also why systems on spacecraft such as the Space Shuttle had what's called the Data Processing System. It consisted of four systems with identical software and an extra one with the same hardware but a different implementation with the same goals. They checked each other's decisions, and a majority "vote" would lock out the differing system.
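    The voting idea can be sketched roughly like this. It is only an illustration of majority voting among redundant outputs, not how the Shuttle software actually worked; the function name `majority_vote` is an assumption.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (winning_value, indices_of_dissenters).

    Raises RuntimeError when no value has a strict majority, which is
    the tie case an even number of units can run into.
    """
    counts = Counter(outputs)
    value, votes = counts.most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise RuntimeError("no strict majority among redundant units")
    # Units that disagree with the majority are candidates for lock-out.
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


# Four redundant units, one of which returned a different result.
value, dissenters = majority_vote([7, 7, 3, 7])
print(value, dissenters)
```

    With four units split two against two the function raises instead of guessing, so the caller has to decide what a tie means.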

    --

    Kriston

  45. Don't blame cosmic rays... by Anonymous Coward · · Score: 0

    At ground level, SEUs are far more likely to come from terrestrial sources, or even radioactivity in the plastic the devices are packaged in. They are extremely rare. Cosmic ray generated events are even more so. In space, it is a different story. Even at airliner altitudes, radiation levels are 40 times what they are on the ground.

    1. Re: Don't blame cosmic rays... by Anonymous Coward · · Score: 0

      I always put my old hard drives in the basement so the radon will overwrite them with random data.

  46. ZFS by locokamil · · Score: 1

    Isn't this why ZFS exists?

    1. Re: ZFS by Entrope · · Score: 1

      Unless you use ZFS on a ramdisk, no.

  47. wow by Anonymous Coward · · Score: 0

    don't blame the human, often a ecstasy junkie. but never hire the pot smoking non-prostitute fucker who use to have programming as a hobby. besides looking for good porn.

  48. We've known this since the 1980s... by JustAnotherOldGuy · · Score: 1

    We've known this since the 1980s...and the more dense/smaller the transistors get the greater the likelihood of it happening.

    This is news, but it's literally from the previous century.

    --
    Just cruising through this digital world at 33 1/3 rpm...
  49. Cosmic rays are real. by pjv936 · · Score: 1

    All servers should have ECC memory at a minimum.

  50. A new addition to the old joke by Anonymous Coward · · Score: 0

    A manager, an engineer, a software developer and a manufacturer are returning from a convention. As they are driving down the peak of a mountain the brakes fail and the car goes careening down the road, bouncing off several guard rails before stopping at the bottom.
    All four get out of the car, amazingly unhurt.
    The manager says "I think we should hold a meeting to discuss the possible solutions to our problem."
    The engineer says, "I think we should disassemble the car and do a structural analysis on each part to determine the cause."
    The software developer says, "Let's push it up to the top and try it again."
    The manufacturer says, "It must have been the cosmic rays."

  51. Re:Voting system by hackwrench · · Score: 1

    Exactly nothing is done to examine whether the voting system is accurate, yet they expect us to believe in it.

  52. No one puts the phone grade chips in a plane by Anonymous Coward · · Score: 0

    You need to pass ISO 26262. The level of error detection and recovery detail you need to go through to get your chips in a safety system is extreme.

  53. Blame Canada by stooo · · Score: 1

    Yep, it's good news. Very useful.

    Dumb user error can be blamed on IT problems
    IT problems can be blamed on computer glitches
    Computer glitches can be blamed on cosmic rays

    As a result, dumb user errors can and shall always be blamed on cosmic rays

    --
    aaaaaaa
  54. Re:@Intel: Why no ECC for consumer-grade processor by thinkwaitfast · · Score: 1

    It's just shifted more to professional/hobbyist knowledge than something that every operator is required to know.

    Isn't that implied by the site we're on?

  55. Re:Marketing... by hackwrench · · Score: 1

    Well the marketing for high quality stuff is lousy. I buy phones around $100. There are cheaper phones, but they clearly do not meet my requirements. At this price, I can't tell if a given device will meet my needs or not. Perhaps I should go higher, but all devices fail to stand out at any price $100 or higher.

  56. Re:Knowledge by hackwrench · · Score: 1

    Still, there's much to be made by taking advantage of people who have more money than sense that sell equipment with minor issues on eBay because they don't know much about how to fix them. Just scored me what should be quite a deal. Still waiting for the machine to come though, to confirm my expectation.

  57. Cable modem statistics by hackwrench · · Score: 1

    Channel ID                               18            19            20            22
    Total Unerrored Codewords      243285196329  243285196266  243285195305  243285196923
    Total Correctable Codewords            1094          1439          1100          1342
    Total Uncorrectable Codewords         16934         16642         17884         16943
    Don't know what normal values are.
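    For a rough sense of scale, the counters above can be turned into per-codeword error ratios. This sketch assumes the total codeword count is the sum of the three counters, which may not match how a given modem defines them; the function name `error_ratios` is made up.

```python
# Counters copied from the modem statistics above:
# channel id -> (unerrored, correctable, uncorrectable)
channels = {
    18: (243285196329, 1094, 16934),
    19: (243285196266, 1439, 16642),
    20: (243285195305, 1100, 17884),
    22: (243285196923, 1342, 16943),
}


def error_ratios(unerrored, correctable, uncorrectable):
    """Return (correctable_ratio, uncorrectable_ratio) per codeword.

    Assumption: total codewords = unerrored + correctable + uncorrectable.
    """
    total = unerrored + correctable + uncorrectable
    return correctable / total, uncorrectable / total


for channel_id, counts in channels.items():
    corr, uncorr = error_ratios(*counts)
    print(f"channel {channel_id}: correctable {corr:.2e}, uncorrectable {uncorr:.2e}")
```

    On every channel the uncorrectable ratio comes out below one codeword in ten million.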

    1. Re:Cable modem statistics by Bruce+Perens · · Score: 1

      This is not unusual. However, the cable modem channel is many orders of magnitude noisier than the paths inside of your CPU would be from external radiation.

  58. Re:ECC - Why not by hackwrench · · Score: 1

    Because studies have been done to ascertain this information.

  59. Re:@Intel: Why no ECC for consumer-grade processor by Bruce+Perens · · Score: 1

    You hit a LSB and something is off by one. You hit a MSB and you're potentially off by trillions.

    That's a good argument for Gray code.

    I have to take issue with the assumption that nothing clears errors better than a hard reset. There are very many known strategies for dealing with errors on a running system, and a reset only clears persistent and cumulative error, rather than transient ones. Since we can assume that your computer doesn't keep the same data in memory all of the time, most will be transient.

  60. Re:Imagine when the U.S. weaponizes this into a cr by Bruce+Perens · · Score: 1

    Really earlier than that, Fermi expected it and had equipment shielded and double-shielded when testing the first nuclear bomb. But we should not confuse cosmic rays and EMP.

  61. Re:Imagine when the U.S. weaponizes this into a cr by thinkwaitfast · · Score: 1

    This isn't either, but closer to a cosmic ray, just lower energy. Pointing a particle accelerator at warheads to fry their electronics. Which it would.
    We did precisely this for NASA as part of a system we built, and I am very familiar (or was, a long time ago) with radiation damage and failure modes of electronics in space. Sometimes the shielding can make things worse. Instead of going straight through a transistor, a collision can occur upstream, sending a spray of other particles with the right energy to do damage. There are parts of the upper atmosphere that are more radioactive than the area above or below.

  62. "A bit flip added extra votes to one candidate" by Anonymous Coward · · Score: 0

    ...and not a single Trump comment was made. So glad it's finally over.

  63. Weve been told by Anonymous Coward · · Score: 0

    If you're one of the lucky guys who run million-dollar systems from IBM, you have probably been to their Austin facility, where they treat you to a couple of days of presentations on how awesome their stuff is. We also giggled when they showed how they bombard their systems with various cosmic particles. But it's no joke.

  64. Northbridge (and thus memory controller) is in CPU by DrYak · · Score: 1

    ECC Memory isn't the only added cost, you also need a motherboard and processor that supports it.

    For your information, ever since AMD's Athlon 64, most x86 compatible hardware has had its Northbridge *inside the processor package*.

    That means that the memory controller is inside the package of your CPU.
    The mother board is basically only traces that connect your CPU and the memory slots directly.
    A glorified cable/connector.
    (In practice, there is a bit more, regarding powering the RAM slots, etc. but you got the general idea : not much smarts in the motherboard between RAM and CPU.
    Smarts is in the "Southbridge" : between the CPU and peripherals)

    On the AMD side of things, nearly every CPU has ECC capability in its built-in memory controller.
    For a motherboard to support ECC, it basically means just having a few instructions to activate it in the EFI/BIOS.

    On the Intel side of things, it's marketed as an enterprise feature, so it's only available on the more expensive business/workstation hardware.

    --
    "Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
  65. Re: @Intel: Why no ECC for consumer-grade processo by Entrope · · Score: 1

    SECDED codes can detect up to two errored bits per codeword, not per byte. In modern systems, a typical codeword is 64 bits of data plus 8 bits of parity (where multiple parity bits cover each data bit).
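    To make the "correct one, detect two" behaviour concrete, here is a scaled-down sketch: an extended Hamming (8,4) code, i.e. 4 data bits plus 4 check bits, rather than the 64+8 layout mentioned above. The function names and bit layout are illustrative assumptions, not what any particular memory controller does.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1, 2 and 4 hold the
    Hamming parity bits; indices 3, 5, 6 and 7 hold the data bits.
    """
    word = [0] * 8
    word[3], word[5], word[6], word[7] = [(nibble >> i) & 1 for i in range(4)]
    # Parity bit p covers every position whose index has bit p set.
    for p in (1, 2, 4):
        word[p] = sum(word[i] for i in range(1, 8) if i & p) % 2
    word[0] = sum(word[1:]) % 2
    return word


def decode(word: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data), where status is 'ok', 'corrected' or 'double-error'."""
    word = list(word)
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i
    overall_odd = sum(word) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd number of flips: treat as a single error at index `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return "double-error", None
    data = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, data


word = encode(0b1011)
word[5] ^= 1          # one flipped bit
print(decode(word))   # single error is corrected
word[2] ^= 1          # a second flipped bit
print(decode(word))   # double error is detected, not corrected
```

    The same structure scales up: more data bits per codeword means proportionally fewer check bits, which is why 8 check bits are enough for 64 data bits.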

  66. Re:@Intel: Why no ECC for consumer-grade processor by drinkypoo · · Score: 1

    Compare it to early cars, where every operator had to know a bunch of stuff about it just to keep it running, but it was simple enough that the average operator could learn that stuff.

    Are you really going back to early cars here? I mean, I think we can break it into basically three eras. The early age of cars was characterized by horseless carriages. The prior age of cars was ushered in around the 1930s or 1940s, where automatic transmissions appeared, the control layout became standard, and vehicles were pretty much all fully enclosed unless they were specifically designed to be a cabriolet. And the modern age of cars came with the O2 sensor, and self-tuning.

    For the earliest cars, it was common to hire a driver and mechanic, because keeping the car moving was a full-time job. Maybe halfway through the period it became reasonable for people to maintain their own vehicles, as the reliability came up to the level where you didn't have to be an engineer to keep it going.

    Obviously, the middle era was the time when any schmoe with a set of wrenches could fix a car. There was very limited availability of fluids, so vehicles were engineered to use what was ubiquitous, which was all the same. Vehicles were easy to maintain because they wasted a lot of space. On the other hand, reliability was nowhere.

    Most modern cars are staggeringly reliable, but maintenance is a mixed bag. Oil changes tend to remain trivial, but transmission oil changes may be a massive PITA. You have to get the car flat and level and add fluid from the bottom while running on a disturbing percentage of modern vehicles, and there is no dipstick. A radiator flush is exactly as hard as it ever was, and you install a flush tee the same as ever. The battery, on the other hand, might be in the wheel well behind the plastic inner fender. Even if it's someplace supposedly convenient like the trunk, it might be a PITA to get in and out as it is in my A8. And you have to jump start from the battery, too. There's no redundant terminal under the hood. That would have just added weight and crap so they skipped it. The starter takes power from beneath the frame rail on the right side, you can apply power there if you have to but again, what a PITA. On the other hand, even reasonable estimates of the service intervals are all much longer than cars from the prior era. And on the gripping hand, nobody is meant to own cars like that for more than half a decade or so. They are for rich fucks who can afford to turn them over :)

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  67. ECC is cheap by Anonymous Coward · · Score: 0

    Marginal extra cost, want to look up the difference in price between a Intel Core i7 extreme edition on an X99 board and the equivalent Intel Xeon where the difference between the processors is the ECC memory controller.

    I don't pay attention to the prices of extreme editions anymore (they have always looked like ripoffs), but back when Haswell was the thing and I was looking at prices of stuff for my home server (in the mistaken belief that I'd be transcoding video on it, which turned out to not really happen so the CPU ended up being totally overkill), the Xeon E3-1240v3 cost less than the common and similarly specced non-extreme i7 model. (Maybe it was 100 MHz slower or didn't have the integrated graphics, or whatever. But I'm serious that it cost less.) If I paid an "extra" non-negative cost for ECC RAM compatibility, I think it had more to do with the relatively expensive SuperMicro motherboard that I used. And the RAM itself cost a little more, but it really wasn't much.

    Once you get above low-end, ECC is nearly free: low enough that it's totally overwhelmed by all the other costs of your build. Even with Intel. (And as many people will point out, with AMD it's even cheaper.) Shit, I spent more on fans than ECC cost me. I spent more on SFF-8470 cables than ECC cost me. I spent more on the UPS than ECC cost me. ECC is one of those things where if you think the machine's job is important enough, it's trivial and costs less than a lot of other less-sexy, more-dubious things.

  68. Re:Serious Computer Glitches Can Be Caused By IDIO by omnichad · · Score: 1

    Why didn't that voting machine have ECC memory? Why didn't the software have bounds checking?

    Because if one bit-flip changes the totals by more than 1, then the software was designed wrong.

    Each vote should be a separate record - the totals should only be a summation. You can keep a running tally separately as a backup record, but that should not be your only count. If one bit flips, one vote changes - not one bit on the total.
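    A rough sketch of that design: keep one record per vote, recompute totals by summation, and treat any running counter only as a cross-check. The names `tally` and `check_running_tally` are made up, and the corrupted counter below just simulates a flip of bit 12 (value 4,096).

```python
from collections import Counter


def tally(records):
    """Recompute totals from the individual vote records."""
    return Counter(records)


def check_running_tally(records, running):
    """Return {candidate: (running, recomputed)} for every mismatch."""
    recomputed = tally(records)
    names = set(recomputed) | set(running)
    return {
        name: (running.get(name, 0), recomputed.get(name, 0))
        for name in names
        if running.get(name, 0) != recomputed.get(name, 0)
    }


records = ["A"] * 5 + ["B"] * 3
running = {"A": 5, "B": 3}
running["A"] ^= 1 << 12   # simulate one flipped bit in the running counter

print(check_running_tally(records, running))
```

    A flipped bit in one stored record changes at most that one vote, while a flipped bit in the counter shows up as a mismatch against the recomputed total.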

  69. Alternative interpretation by volodymyrbiryuk · · Score: 1

    Only n00bs think it's a glitch...real programmers use butterflies. They open their hands and let their delicate wings flap once. The disturbance ripples outward, changing the flow of the eddy currents in the upper atmosphere. These cause momentary pockets of higher-pressure air to form, which act as lenses that deflect incoming cosmic rays, focusing them to strike the drive platter and flip the desired bit.

    --
    sudo rm -r -f --no-preserve-root /
  70. Blame them for not doing ECC properly by Anonymous Coward · · Score: 0

    Oh yes, DO blame the manufacturer.

    There is no acceptable reason to not have ECC on every layer (hint: even the L1 cache in your processor does ECC, and it is likely the fastest memory there is in your entire system, TCAM SRAM on switch forwarding engines included), and a simple ECC or FEC (heck, even parity will do) on the serial buses. Cosmic rays almost never manage to toggle more than a single bit.

  71. Cosmic rays can cause problems? What are the odds? by AnotherBlackHat · · Score: 1

    No, seriously, what are the odds of a cosmic ray flipping a bit?

    0.1, 0.000000001, 1e-15, 1e-30?

    It's easy to blame cosmic rays, but a subtle bug is far more likely.

  72. Re:Voting system by hesiod · · Score: 1

    Except, you know, all the people who HAVE actually looked for voter fraud and have found nothing that would affect a result.

  73. Re: Voting system by AmazingRuss · · Score: 1

    Doesn't matter when the electorate is flooded with morons.

  74. redundancy for critical systems by Khashishi · · Score: 1

    ECC, RAID and some sort of parallel computation should be in place for voting systems. It should be possible to run the same code on multiple processors and check that the results are the same in real-time.

  75. Old news from the real world by Anonymous Coward · · Score: 0

    Worked for a modem manufacturer for years. Two things we could guarantee would cause lots of dead modems or devices attached to them: thunderstorms and solar flares. This seems to just be an extension of the second.

  76. Re:We've always known this. This is why we have EC by thegarbz · · Score: 1

    Same on any industrial safety system. Often these are triplicated or quadruplicated. I actually prefer triplicated since you don't end up with an even vote on a situation.

  77. Re:Voting system by hackwrench · · Score: 1

    Except that they found irregularities that were ignored. And that wasn't a very widespread effort. You can't trust a flawed system. Most of the time, that doesn't matter, but for the voting system to matter, we must have confidence in it.

  78. Re:@Intel: Why no ECC for consumer-grade processor by Anonymous Coward · · Score: 0

    You hit a LSB and something is off by one. You hit a MSB and you're potentially off by trillions.

    That's a good argument for Gray code.

    No it isn't.

    Gray code guarantees that numbers adjacent in value have encodings that differ by only one bit. It does not guarantee the converse, that numbers that differ by one bit are close in value.
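    A quick check of both halves of that statement, using the standard binary-reflected Gray code. The 8-bit width in the example is just an assumption for illustration.

```python
def to_gray(n: int) -> int:
    """Convert a non-negative integer to binary-reflected Gray code."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Convert a binary-reflected Gray code back to the integer it encodes."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


# Adjacent values always differ in exactly one bit of their encodings.
for n in range(255):
    assert bin(to_gray(n) ^ to_gray(n + 1)).count("1") == 1

# The converse fails: flip the top bit of the 8-bit encoding of 0.
corrupted = to_gray(0) ^ 0x80
print(from_gray(corrupted))
```

    Here a single flipped bit turns 0 into 255, so Gray code does nothing to limit the size of the error from a random bit flip.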

  79. Data formats by Ocrad · · Score: 1

    Data formats are designed without taking bit flips into account even today.

  80. Re:Cosmic rays can cause problems? What are the od by cwsumner · · Score: 1

    No, seriously, what are the odds of a cosmic ray flipping a bit?

    Scientists do study this. The estimate is that a typical computer will have a hit about once a year.

    The circuits get smaller, the chips get bigger and more devices are used. It seems to all cancel out and the odds have been about the same for 40 years or more.

    But you are correct, software errors are far more likely.

  81. The nuclear industry has the answer for this by Anonymous Coward · · Score: 0

    We used to deal with this when electronics were bombarded with neutron radiation when they were installed in the containment building.

    Voting algorithms and redundancy (as well as using less dense memory and processor dies) were the key to preventing these bit flips becoming an issue.

  82. Not exactly news by werepants · · Score: 1

    An entire field of study devoted to this exact problem (Radiation Effects) has been around since the 70's at least for aerospace purposes. That said, neutrons have only become a serious concern for terrestrial applications in recent years, as process geometries have gotten small enough and parts have gotten dense enough that neutron upset becomes less of an occasional annoyance and more of a constant problem.

    Generally you test electronics for Single Event Effects (SEE) at a cyclotron, and it used to be NASA, Lockheed, Boeing and the like who were doing all the testing. More recently, though, Cisco and Intel have begun doing a lot of testing of their own. Cisco is known to put an entire server rack at a time in a neutron beam to see what goes boom.

  83. Re:@Intel: Why no ECC for consumer-grade processor by Anonymous Coward · · Score: 0

    It's called market segmentation, and Intel are WHORES for keeping ECC within the Xeon lineup only.

  84. Re:Serious Computer Glitches Can Be Caused By IDIO by dbIII · · Score: 1

    Good point. Personally I think things like the Diebold voting machines (designed by a convicted fraudster!) fail on many levels. I'm a big fan of very simple paper ballots and big high speed scanners to collate everything. When something is contested (which seems to happen someplace in just about every election anywhere) paper ballots allow a fallback all the way to manual verification if necessary.

  85. ECC much? by MikeBabcock · · Score: 1

    ECC memory has been available for a long time and most servers use it, I have no idea why voting machines and other important devices wouldn't.

    --
    - Michael T. Babcock (Yes, I blog)
  86. Fantastic! by Anonymous Coward · · Score: 0

    Excelsior!

  87. 80/20 Rule by Anonymous Coward · · Score: 0

    Device manufacturers are aware of the source of device errors. Ordinary bugs and malware are responsible for something like 99.999999% of all device errors.

    While cosmic rays are a thing, they aren't a very big thing, and there are very robust, well-developed, and widely available systems that can handle such environmentally induced errors. The biggest, cheapest, and most low-hanging fruit-wise of these has to be ECC technologies.

    Seriously, if you are going to have a geek-induced panic attack about cosmic rays, then equip your devices with ECC RAM. Then move on, you've solved it, it's as simple as that.

    The only device consumers with serious radiation issues are space agencies, the makers of nuclear power plants, and specialty radiation monitoring & cleanup companies.

  88. Cosmic Rays by Anonymous Coward · · Score: 0

    This is not news. Back around the turn of the century, there was a considerable effort to repair CPU cache chips on Sun gear because they could develop serious parity errors. One of the agents listed was 'cosmic rays'...

    Sorry, no link, had to have been 1999 or so.