Slashdot Mirror


Serious Computer Glitches Can Be Caused By Cosmic Rays (computerworld.com)

The Los Alamos National Lab wrote in 2012 that "For over 20 years the military, the commercial aerospace industry, and the computer industry have known that high-energy neutrons streaming through our atmosphere can cause computer errors." Now an anonymous reader quotes Computerworld: When your computer crashes or phone freezes, don't be so quick to blame the manufacturer. Cosmic rays -- or rather the electrically charged particles they generate -- may be your real foe. While harmless to living organisms, a small number of these particles have enough energy to interfere with the operation of the microelectronic circuitry in our personal devices... particles alter an individual bit of data stored in a chip's memory. Consequences can be as trivial as altering a single pixel in a photograph or as serious as bringing down a passenger jet.

A "single-event upset" was also blamed for an electronic voting error in Schaerbeek, Belgium, back in 2003. A bit flip in the electronic voting machine added 4,096 extra votes to one candidate. The issue was noticed only because the machine gave the candidate more votes than were possible. "This is a really big problem, but it is mostly invisible to the public," said Bharat Bhuva. Bhuva is a member of Vanderbilt University's Radiation Effects Research Group, established in 1987 to study the effects of radiation on electronic systems.
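
The 4,096 figure is itself a hint: 4,096 is 2**12, so inverting one bit at position 12 (counting from 0) of a binary counter changes it by exactly that amount. A minimal Python sketch, using a made-up starting tally because the real counts are not given in the summary:

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with the given bit position inverted (XOR with a one-bit mask)."""
    return value ^ (1 << bit)

# Hypothetical tally; the actual vote counts are not stated in the story.
tally = 514
corrupted = flip_bit(tally, 12)  # bit 12 was 0 here, so the flip adds 2**12

print(corrupted - tally)  # difference caused by the single flipped bit
```

Had bit 12 already been set, the same flip would have subtracted 4,096 instead, which is one reason such errors usually go unnoticed unless the result is impossible.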

Cisco has been researching cosmic radiation since 2001, and in September briefly cited cosmic rays as a possible explanation for partial data losses that customers were experiencing with their ASR 9000 routers.

168 of 264 comments

  1. This is news...? by __aaclcg7560 · · Score: 5, Funny

    Whenever a user calls up to ask why his computer rebooted after I install an update, I say... drumroll, please... gamma radiation.

    1. Re:This is news...? by __aaclcg7560 · · Score: 1, Interesting

      I really would like to visit your house. After I eat a lot of fiber, maybe some bran muffins or flax seeds. That way I can take a great big SHIT and put a very large, moist turd in your microwave. You will just LOVE what happens when it's in there on high for about ten minutes!

      You need to have a better diet. Healthy shit has the consistency of toothpaste.

      It makes for an interesting conversation piece.

      I had a roommate who left a squash inside a toaster oven on low heat overnight. The squash was carbonized all the way through. Charred on the outside, charred on the inside. Now that was a conversation piece.

    2. Re:This is news...? by manu0601 · · Score: 1

      gamma radiation

      I also believed that cosmic ray trouble was about gamma radiation, but TFA says it is all about neutron radiation.

    3. Re:This is news...? by ArchieBunker · · Score: 1

      Nah do an upper decker instead.

      --
      Only the State obtains its revenue by coercion. - Murray Rothbard
    4. Re:This is news...? by Z00L00K · · Score: 1

      Fibers are for data communication, you shouldn't eat them.

      --
      If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
    5. Re:This is news...? by arglebargle_xiv · · Score: 5, Funny

      A bit flip in the electronic voting machine added 4,096 extra votes to one candidate. The issue was noticed only because the machine gave the candidate more votes than were possible.

      How could they tell this apart from standard operations on a Diebold machine?

    6. Re:This is news...? by jellomizer · · Score: 1

      In 2017 anything goes for news. I expect the Los Alamos National Lab is worried about its funding, so it will repurpose one of its old hypotheses and try to get it on Fox News so the president sees it and decides to keep its funding. These organizations, if smart, realize how manipulable the president is, and that just a few simple things can cause him to change his mind and course. Just as long as you stroke his ego you can do whatever you want.

      I am sorry, I didn't want to make this political, but we have had a problem with bad science news for a long time, because people eat it up and organizations use it to keep their funding. Unfortunately, in some areas such as climate change, some early overzealous hypotheses created a mistrust of science: the general population and politicians just don't get the scientific process and are unable to weed out strong results from poor ones, or to tell a hypothesis from a theory.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    7. Re:This is news...? by oobayly · · Score: 1

      I had to google the term after watching an episode of Archer - it was the first time I'd ever heard it. I have literally never heard of anyone (not even about a friend of a friend of a friend...) doing something like that, and I've heard some pretty fucked stuff from my colleagues.

      Is this an actual thing in the US?

    8. Re:This is news...? by geekmux · · Score: 5, Funny

      Whenever a user calls up to ask why his computer rebooted after I install an update, I say... drumroll, please... gamma radiation.

      Computers and Incredible Hulks don't interface well together, but a Ctrl-Alt-SMASH sequence? I'd buy that.

    9. Re: This is news...? by Entrope · · Score: 1

      Lighten up, Francis.

    10. Re:This is news...? by chainsaw1 · · Score: 1

      Both are right. Protons are deflected in the atmosphere and neutrons have no charge and no (almost no) magnetic moment so don't interact with things unless they hit a nucleus head-on (elastic interaction).

      That being said, if a high energy neutron hits something the energy may be sufficient to create other particles (like protons and gamma rays). So in a way both theories in this thread are correct (protons produced in lower atmosphere from neutrons from space).

      If you have the equipment, cosmic ray bit flips in memory can be determined. For this, one must map the status of every memory cell in the affected region to its physical location in the chip and have a mirrored (RAID 1 like) data set with appreciable physical separation from the data under investigation. If there are many bit flips following a physical straight line path through the memory device, that is likely a cosmic ray.

      --
      - Sig
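
The straight-line test described above can be sketched roughly as follows. This assumes you already have the physical (row, column) coordinates of each flipped cell, which in practice requires the vendor's logical-to-physical address map. The function name and the exact integer test are illustrative choices, not a standard tool.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # hypothetical (row, column) position of a cell on the die

def flips_are_collinear(cells: List[Cell]) -> bool:
    """True if three or more distinct flipped cells all lie on one straight line."""
    points = sorted(set(cells))  # drop duplicates so the first two points differ
    if len(points) < 3:
        return False  # two points always form a line, so that proves nothing
    (r0, c0), (r1, c1) = points[0], points[1]
    for r, c in points[2:]:
        # Cross product of (p1 - p0) and (p - p0); zero means p is on the same line.
        if (r1 - r0) * (c - c0) - (c1 - c0) * (r - r0) != 0:
            return False
    return True

print(flips_are_collinear([(0, 0), (2, 1), (4, 2), (6, 3)]))
print(flips_are_collinear([(0, 0), (2, 1), (5, 2)]))
```

A real analysis would likely allow some tolerance, since a particle track crossing a grid of cells will rarely hit perfectly collinear cell centres.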
    11. Re:This is news...? by omnichad · · Score: 1

      May I introduce you to the Bristol Stool Chart. "Normal" is within the range of 3-4.

    12. Re:This is news...? by __aaclcg7560 · · Score: 1

      May I introduce you to the Bristol Stool Chart [wikipedia.org]. "Normal" is within the range of 3-4.

      I drop a Type 4 every morning. ;)

    13. Re:This is news...? by werepants · · Score: 1

      For technical accuracy, it would be better to blame terrestrial neutron cascades generated by cosmic rays. Gamma rays don't actually cause bit upsets or things that would cause a momentary glitch - you need something like a neutron or proton for that. Gamma rays do a perfectly fine job at putting dose on parts, but if the computer has accumulated enough dose to cause a failure, then the user probably died quite a while ago.

      Disclaimer: IAAREE (I Am A Radiation Effects Engineer)

      ;)

    14. Re:This is news...? by igny · · Score: 1
      --
      In theory there is no difference between theory and practice. In practice there is. - Yogi Berra
    15. Re:This is news...? by LienRag · · Score: 1

      More seriously, is there a reference to the actual incident?
      All I can find by googling it is copies of TFA...

    16. Re:This is news...? by ivanjager · · Score: 1

      The standard operation is carefully designed to avoid increasing the count to more than the total number of voters in a given district.

  2. That is why Excel crashes all the time on OSX by thesjaakspoiler · · Score: 3, Insightful

    I was convinced that it was a lousy programming job by Microsoft, which pays more attention to fancy UX components than to stability. I am waiting for confirmation that Excel searching every known (network) drive for a license whenever it can't connect to the online subscription service, for every operation, must be due to black matter. Unless it crashes when it tries to display that warning message, then it's just some cosmic ray again. So relieved!

    1. Re:That is why Excel crashes all the time on OSX by aaarrrgggh · · Score: 1

      Funny how copy-paste operations are constantly and consistently corrupted by those damn cosmic rays...

      Oh well, guess I need better shielding.

    2. Re:That is why Excel crashes all the time on OSX by toddestan · · Score: 1

      I assume cosmic rays is also why Outlook constantly pops up that "Need Password" prompt. All this time I was assuming it was a bug introduced back in Office 2007!

  3. Why not blame the manufacturer? by Anonymous Coward · · Score: 1, Informative

    When your computer crashes or phone freezes, don't be so quick to blame the manufacturer.

    Why not? According to the article, it is a well-known phenomenon:

    For over 20 years the military, the commercial aerospace industry, and the computer industry have known that high-energy neutrons streaming through our atmosphere can cause computer errors.

    So if it is a well-known problem, and manufacturers are ignoring the problem and creating devices susceptible to such interference, why can I not blame the manufacturer for making hardware with known problems? I would blame the manufacturer if a hearing aid was picking up local radio stations, so why not here?

    1. Re:Why not blame the manufacturer? by DontBeAMoran · · Score: 3, Insightful

      There's something you can do about it. It's very easy, but you won't like it.

      Make every component in triplicate. Everything in the CPU, everything in the RAM, everything in storage, etc. If the three aren't equal, go with the value shared by two of them and rewrite the different one with that value.

      --
      #DeleteFacebook
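
The "go with the value shared by two of them" rule is a bitwise majority vote. A minimal software sketch of the idea, assuming three copies of one word (real triple redundancy does this in hardware, and the names here are illustrative):

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 vote: each output bit is the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)

def vote_and_repair(copies: list) -> int:
    """Vote across three copies, then rewrite every copy with the winning value."""
    winner = majority(copies[0], copies[1], copies[2])
    copies[:] = [winner, winner, winner]
    return winner

copies = [0b1011_0010, 0b1011_0010, 0b1011_0010]
copies[1] ^= 1 << 5          # simulate a single-event upset in one copy
restored = vote_and_repair(copies)
print(bin(restored))
```

Because the vote is taken per bit, upsets in different bit positions of different copies are still corrected. Two copies hit in the same bit position defeat it.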
    2. Re:Why not blame the manufacturer? by Baloroth · · Score: 5, Interesting

      There's something you can do about it. It's very easy, but you won't like it.

      Make every component in triplicate. Everything in the CPU, everything in the RAM, everything in storage, etc. If the three aren't equal, go with the value shared by two of them and rewrite the different one with that value.

      Not only is this not actually all that easy (all of your triplicate systems have to be clocked together in sync, you need a shitload of extra hardware to do the comparison, etc.) it's grossly unnecessary. Standard off-the-shelf error detection and correction can (and routinely does) handle radiation induced errors. It just costs a bit more, because it's a business-level feature. It doesn't matter if that MP3 of Taylor Swift gets mildly corrupted (might even sound better that way, zing), but it very much *does* matter if that bank account gets a flipped bit.

      --
      "None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
    3. Re:Why not blame the manufacturer? by ShanghaiBill · · Score: 4, Informative

      Probably b'cos there is nothing that manufacturers can do about cosmic rays

      Except that is not true. Electronic devices can be made more resistant to cosmic rays and other radiation. The easiest way to do so is to use depleted boron instead of "normal" boron as a semiconductor dopant. Boron-10 has a very high neutron absorption cross section while Boron-11 has a very small cross section. Use boron that has been "depleted" of the B10 isotope, and you cut way down on your neutron induced SEUs.

      Another obvious countermeasure is to use ECC memory, and memory scrubbing.

      The problem is not that there is nothing that manufacturers can do, but that consumers aren't willing to pay the extra cost. Would you be willing to pay an extra $100 for your phone if it meant one fewer reboot every decade or so?
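
As a rough illustration of what ECC plus scrubbing buys you, here is a toy Hamming(7,4) code in Python. Real ECC DIMMs use wider codes implemented in the memory controller. The function names and the list-of-bits representation are just for illustration.

```python
def encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]  # positions 1..7

def scrub(word):
    """Correct a single flipped bit in place; return the 1-based position fixed, or 0."""
    s1 = word[0] ^ word[2] ^ word[4] ^ word[6]
    s2 = word[1] ^ word[2] ^ word[5] ^ word[6]
    s3 = word[3] ^ word[4] ^ word[5] ^ word[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # binary position of the bad bit
    if syndrome:
        word[syndrome - 1] ^= 1
    return syndrome

def decode(word):
    """Extract the 4 data bits from a (corrected) codeword."""
    return [word[2], word[4], word[5], word[6]]

memory = [encode([1, 0, 1, 1]), encode([0, 1, 1, 0])]
memory[0][4] ^= 1                    # simulate a neutron strike on one stored bit
fixed = [scrub(w) for w in memory]   # a scrubbing pass walks all of memory
print(fixed, decode(memory[0]))
```

Scrubbing matters because this code only corrects one error per word: walking memory periodically repairs single flips before a second one lands in the same word.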

    4. Re:Why not blame the manufacturer? by ShooterNeo · · Score: 5, Informative

      You know that several FPGA manufacturers offer this. Xilinx offers a method where this is done in software - when you do design synthesis, more than triple the gates are needed for every circuit allocated in the design. (I think it's done at a higher level - truth tables with the triple redundant bits are generated)

      Some do it in hardware, so your design synthesis is the same but the actual software programmable subunits use triple redundancy.

    5. Re:Why not blame the manufacturer? by ShooterNeo · · Score: 1

      I wonder if this means it's actually cheaper to use 3 separate computers, on cheap off the shelf hardware, than one armored and extra redundant computer. For example, spacecraft guidance or for an autonomous car.

    6. Re:Why not blame the manufacturer? by currently_awake · · Score: 1

      The use of parity on memory can prevent most of this. Unfortunately the higher density CPUs are highly vulnerable, and using mil spec parts isn't mandatory for elections.

    7. Re:Why not blame the manufacturer? by currently_awake · · Score: 2

      Yes it's cheaper. That's why they invented RAID.

    8. Re:Why not blame the manufacturer? by ShaunC · · Score: 1

      I recall reading that most space probes are designed this way, due to increased exposure to radiation. Flipped bits are a serious problem in space, the hope is that only one event occurs at a time so that the other two processors maintain quorum.

      --
      Thanks to the War on Drugs, it's easier to buy meth than it is to buy cold medicine!
    9. Re:Why not blame the manufacturer? by ShooterNeo · · Score: 1

      So the reason the Curiosity rover uses FPGAs with triple-redundant logic (and just 2 computers if I recall) is to save on weight. If they were going to optimal cost efficiency they'd have redundant computers and do what the FPGAs are doing in firmware.

    10. Re:Why not blame the manufacturer? by Applehu+Akbar · · Score: 1

      Probably b'cos there is nothing that manufacturers can do about cosmic rays, which are beyond even gamma rays in the electromagnetic spectrum in terms of wavelength and frequency.

      Not the manufacturer, but the CTOs: just put all data centers into old mines. This could be a great business for rural areas.

    11. Re: Why not blame the manufacturer? by ShanghaiBill · · Score: 5, Informative

      ECC memory doesn't do anything to help when the bits that get flipped are in the CPU. Or anywhere else that isn't a RAM chip.

      Except that the RAM has hundreds or thousands of times as many bits as a CPU, and Flash may have millions of times as many, and dynamic ram has smaller feature size, and is more susceptible to SEUs. So correcting RAM and Flash helps because that is where 99.9% of the problem is.

      Even within the CPU, most transistors are used to implement cache, and cache can also be scrubbed (although not with just software).

    12. Re:Why not blame the manufacturer? by unrtst · · Score: 4, Informative

      Another obvious countermeasure is to use ECC memory ...

      The problem is not that there is nothing that manufacturers can do, but that consumers aren't willing to pay the extra cost. Would you be willing to pay an extra $100 for your phone ...

      ECC memory is not that much more expensive. It's been a few years since I built the desktop I'm using, but I included 16gb of ECC memory (4x 4gb DDR3 ECC KVR1333D3E9SK2/8G). At the time, I think it was around $60. The equivalent normal memory was only a couple bucks cheaper. If Samsung started using ECC memory in all their phones, the cost would be nearly the same with the volume they would be ordering/making.

      FWIW, I did try to do the same comparison just now on newegg and, while it's a bit of a mess, the situation is nearly the same today:
      $34 : Kingston 4GB 240-Pin DDR3 SDRAM ECC Unbuffered DDR3 1333 Server Memory Model KVR13LE9S8/4
      $52 : Kingston 8GB (2 x 4GB) 240-Pin DDR3 SDRAM DDR3 1600 (PC3 12800) Memory Model KVR16N11S8K2/8

      More expensive? Yes.
      $100 more? Nowhere near that much.

    13. Re:Why not blame the manufacturer? by fuzzyfuzzyfungus · · Score: 2

      If you think that finding a vendor that doesn't keep cutting battery life/SD card slots/headphone jacks/basic safeguards against electrical fire in order to make it thinner, cheaper, or both is hard; just try to find one that ensures sufficient borated polyethylene(with something else to sop up the resulting gamma rays) or other neutron shielding into their products.

      There probably are some, making bits for nuclear reactors and industrial, scientific, and medical users of neutron sources; but it's a niche.

    14. Re:Why not blame the manufacturer? by dbIII · · Score: 1

      If they were going to optimal cost efficiency

      In that special case saving on weight is optimal cost efficiency. I doubt that any of the rover components cost as much as their percentage of the total weight divided by the cost to get the rover on Mars.

    15. Re:Why not blame the manufacturer? by ShooterNeo · · Score: 1

      Technically correct...the best kind of correct. I was implicitly referring strictly to electronic component cost and development time costs, since there's only going to be a handful of Curiosity rover style projects per decade but there are many thousands of projects to develop safe computerized control systems for cars and robots and everything else.

    16. Re:Why not blame the manufacturer? by dgatwood · · Score: 3, Informative

      Adding one ECC bit per byte, yes. Adding one parity bit, no. ECC != parity.

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.
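
To make the distinction concrete: a single parity bit can tell you that a byte is wrong, but not which bit to fix, and it is blind to an even number of flips. A small Python sketch (even parity, illustrative names):

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit: 1 if the byte has an odd number of 1 bits, else 0."""
    return bin(byte & 0xFF).count("1") % 2

def parity_ok(byte: int, stored_parity: int) -> bool:
    """True if the byte still matches the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity

data = 0b0110_1001
p = parity_bit(data)

one_flip = data ^ 0b0000_0100    # single upset: detected, but not locatable
two_flips = data ^ 0b0001_0100   # double upset: passes the check unnoticed

print(parity_ok(one_flip, p), parity_ok(two_flips, p))
```

An error-correcting code spends more check bits so that each single-bit error produces a distinct signature, which is what lets the hardware repair the bit rather than just halt.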

    17. Re:Why not blame the manufacturer? by Solandri · · Score: 2
      This is actually a fairly recent development. When I was putting together a file server in 2012, I really wanted to use ECC RAM. But 2x4GB ECC cost more than $250 vs $50 for regular 2x4GB RAM. Add in the extra cost of a server motherboard that supported ECC RAM and the processor restrictions, and I gave up and just built the file server using regular RAM.

      A couple years later, the price of ECC RAM had dropped to only about 50% more than the cost of regular RAM.

      If Samsung started using ECC memory in all their phones, the cost would be nearly the same with the volume they would be ordering/making.

      The cost would be 12.5% more. :)
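
The 12.5% figure matches the usual 72-bit ECC DIMM layout: 8 check bits alongside every 64 data bits. The check-bit count follows from the Hamming bound for single-error correction, plus one extra bit for double-error detection. A quick Python check (the function name is illustrative):

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for single-error-correct, double-error-detect over data_bits."""
    r = 1
    # Hamming bound: r check bits must address data_bits + r positions plus "no error".
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one more overall parity bit upgrades SEC to SECDED

for width in (8, 64):
    r = secded_check_bits(width)
    print(width, r, f"{100 * r / width:.1f}% overhead")
```

So "one ECC bit per byte" only works out because the code is computed over 64-bit words. Protecting each byte separately would need several check bits per byte.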

    18. Re:Why not blame the manufacturer? by thegarbz · · Score: 1

      it's grossly unnecessary

      That depends on the application. I agree unnecessary for a general purpose computer, probably also unnecessary for servers.

      However if your electronics store critical financial information or are safety systems in control of hazardous facilities then it becomes a bit of a different story. The comparison model is actually one that is adopted by many safety systems.

    19. Re:Why not blame the manufacturer? by ewanm89 · · Score: 1

      Marginal extra cost? Look up the difference in price between an Intel Core i7 Extreme Edition on an X99 board and the equivalent Intel Xeon, where the difference between the processors is the ECC memory controller. There are a few low-end mobile and embedded processors Intel makes with ECC, but the majority of their consumer range deliberately does not have it; it is a Xeon "feature", with the price tag that goes with it.

    20. Re:Why not blame the manufacturer? by vtcodger · · Score: 1

      and manufacturers are ignoring the problem and creating devices susceptible to such interference

      If it really is a problem, it could be easily dealt with at very modest cost by using extra memory bits for memory error detection/correction. Although as others point out, modern software is so buggy that it might not be worth the effort to actually improve the hardware a little.

      Those who were around 25 years or so ago will remember that the lack of parity/ECC actually can be laid partly on Microsoft. Early PCs had a parity bit on each byte (and they probably needed it). But memory was expensive back then -- $100US a Megabyte. And Windows needed a fair amount of memory. Which meant it needed a costly box to run on. So MS launched a campaign to convince us all that we really no longer needed that extra bit that added roughly 12% to the cost of memory.

      --
      You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
    21. Re: Why not blame the manufacturer? by vtcodger · · Score: 1

      I should think that it wouldn't be that hard to add parity to CPU registers, caches, etc

      OTOH, I'm sure Intel could find a way to make the implementation obtuse and even further complicate their CPUs. And in any case, it's unclear to me what the device is supposed to do when it finds the number it is working with is wrong.

      --
      You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
    22. Re:Why not blame the manufacturer? by arth1 · · Score: 1

      Probably b'cos there is nothing that manufacturers can do about cosmic rays, which are beyond even gamma rays in the electromagnetic spectrum in terms of wavelength and frequency.

      In addition to what others have pointed out, stop shrinking dies. The smaller a circuit is, the greater the risk that it will be impacted when hit by a neutron. Components from a decade or two ago are a heck of a lot more resilient against cosmic rays than today's components.

      Sure, at lower speeds, but for a great many things you just need enough speed. Split out speed-requiring jobs on cutting-edge hardware, and run other critical services on more reliable hardware.

    23. Re:Why not blame the manufacturer? by Anonymous Coward · · Score: 1

      There are ALREADY high-end servers that do exactly that. In fact, they have been sold and are being sold even today to high-end businesses like banks, stock exchanges and the like.

      One line of products that I am familiar with are : https://en.wikipedia.org/wiki/NonStop_(server_computers)

      Obviously anonymous coward, since I work for the company that makes them (yup the same people that were making the shitty laptops were also making these at the same time.)

    24. Re:Why not blame the manufacturer? by omnichad · · Score: 1

      You can RAID memory? ..... checksum, detect and repair single-bit errors in blocks of bytes

      ECC Memory?

    25. Re:Why not blame the manufacturer? by Khyber · · Score: 1

      Almost every cosmic ray is an atomic nucleus that's been stripped down. It's not an EM-spectrum object.

      --
      Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    26. Re: Why not blame the manufacturer? by Khyber · · Score: 1

      "ECC memory doesn't do anything to help when the bits that get flipped are in the CPU"

      Guess what the SRAM used in CPUs is? Fucking ECC, dumbass.

      --
      Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    27. Re:Why not blame the manufacturer? by bws111 · · Score: 1

      Yes, you can RAID memory. See https://www.ibm.com/developerw...

    28. Re:Why not blame the manufacturer? by boristdog · · Score: 1

      Shoot, we have to account for die loss due to cosmic rays every time we ship a bunch of wafers by air. We've been doing that as long as I've been working here.

      semiconductor fab, 22 years.

    29. Re:Why not blame the manufacturer? by Raenex · · Score: 1

      It doesn't matter if that MP3 of Taylor Swift gets mildly corrupted (might even sound better that way, zing), but it very much *does* matter if that bank account gets a flipped bit.

      Sorry to rain on your popular-bashing parade, but ordinary people actually do important things with their computers besides listening to music.

    30. Re:Why not blame the manufacturer? by werepants · · Score: 1

      Xilinx does offer this, but implementing TMR (Triple Modular Redundancy) isn't always straightforward. The thing is, when you've tripled the size of your circuit, you will now accrue 3 times as many errors - you've just created a much larger target. You've also tripled size and power, and probably slowed the whole design down. And unless you are very, very careful, you introduce a new single point of failure: the voter, which has to mediate between the three elements in question.

      Fun fact: A previous generation of Xilinx TMR IP actually made upset rates WORSE than no protection whatsoever. Source - IAAREE - I Am A Radiation Effects Engineer, I do this for a living. ;)

    31. Re:Why not blame the manufacturer? by Baloroth · · Score: 1

      If you're doing something important on your computer that requires error checking and correction, you should either a) get your employer to pay for the hardware to run it on (if it's a work thing), or b) pay for the hardware to run it on yourself (if it's a personal/self-employed thing). What you should *not* do, is use the wrong tool for the job, like running important error-sensitive operations on consumer grade hardware.

      --
      "None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
    32. Re:Why not blame the manufacturer? by ChrisMaple · · Score: 1

      CPU registers are likely to have a design that differs from cache cells. The registers will be bigger, faster, and more resistant to bit flips than cache.

      --
      Contribute to civilization: ari.aynrand.org/donate
    33. Re:Why not blame the manufacturer? by ChrisMaple · · Score: 1

      After a little google searching, it appears that density isn't effective for neutron shielding; even lead is a poor choice. A material with a high neutron capture area is needed. Apparently, water works well for neutrons.

      --
      Contribute to civilization: ari.aynrand.org/donate
    34. Re:Why not blame the manufacturer? by Agripa · · Score: 1

      I call bullshit. I've been building systems with ECC RAM since the early 2000's (because I'm a paranoid fuck) and the price was never even close to a factor 5 more expensive. Perhaps you used registered RAM.

      After Intel dropped support for ECC in desktop systems and during the beginning of the DDR2 to DDR3 transition, ECC DDR3 was not available and not supported and new Intel systems which supported ECC used FB-DIMM which were way more expensive than ECC DDR2. The difference in price at the time between a new AMD system using ECC DDR2 and a comparable new Intel system using ECC FB-DIMMs was roughly $1000.

  4. Sun Microsystems cache failure by Andrew+Lindh · · Score: 1

    Sun blamed cosmic rays for causing CPU cache corruption and system crashes in their high-end enterprise systems. http://www.forbes.com/forbes/2...

  5. ECC by Bruce+Perens · · Score: 4, Insightful

    This is why ECC is used to protect memory and data busses. At least on the good stuff :-) . One of the issues is die shrink. As the minimum detail size of the IC process gets smaller, the potential for radiation to flip a bit gets higher.

    Silicon-on-sapphire is the main way to implement silicon-on-insulator, which is more protective of radiation bit flips and less likely to latch-up. But since these have historically been required only for space satellites, they have been horribly expensive. Imagine running an entire IC fabrication just to make a few chips. As there are more applications for rad-hard chips, the price could fall.

  6. "Of course it can," says government by TheOuterLinux · · Score: 1

    Oh, THAT'S what happened to the emails. Global warming is bullshit, but those cosmic rays will getcha every time...yeah -_- Witness the birth of an oncoming onslaught of pathetic excuses. Not doubting the logic at all, especially given how mass power outages have happened because of this, but I've got a feeling someone will do research near "HAARP" and, "Oh no...Why god why!...oh well." ¯\_(ツ)_/¯

    1. Re:"Of course it can," says government by Bruce+Perens · · Score: 1

      Professional computers have metal cases and no silly viewing windows, and proper EMI suppression on the ports. Effectively, Faraday cages. They are quite proof against low-frequency radio transmitters nearby, whatever the power.

    2. Re:"Of course it can," says government by thinkwaitfast · · Score: 1

      Cosmic rays penetrate many km into the Earth. Very energetic ones can pass straight through the Earth. A metal Faraday cage is not going to help much.

    3. Re:"Of course it can," says government by Bruce+Perens · · Score: 1

      Please read the comment again. Low-frequency transmitters do not make cosmic rays.

    4. Re: "Of course it can," says government by ewanm89 · · Score: 1

      Accept what is being talked about here is not low frequency radiation but extremely high frequency radiation, with wavelengths smaller than the gaps between atoms, that is only stopped by a direct hit, which may happen to land on just the right atom in a circuit. Now these are extraordinarily rare events if the probability for any single ray is calculated, but we are being constantly hit by these rays all day every day, making the probability of an issue somewhere on the planet quite high. There are some solutions though: ECC RAM, for example, means individual bit flips can be fixed, and it is what is used in most server systems; however, support on consumer-level gear is nonexistent. If that isn't enough, run systems in triplicate on separate machines, then run a vote on the result; only one machine is likely to have had a bit flip during that specific operation.

    5. Re:"Of course it can," says government by Bruce+Perens · · Score: 1

      Faraday cages are really good for RF, and I was writing about HAARP. The X rays that you get from a radiologist don't have the same energy level as cosmic rays. The best we can do about energetic cosmic rays is to make our equipment less susceptible, because you can never have enough shielding.

    6. Re: "Of course it can," says government by Bruce+Perens · · Score: 2

      The comment I was responding to was regarding HAARP. And that's "except" FYI. :-) ECC is actually more reliable, for its problem domain, than a triple voting system. The probability that you would arrive at a valid ECC code for bad data due to multiple bit flips is much lower than the probability of two out of three systems voting wrong. So, it is at least theoretically possible to design a computer system with data integrity throughout that exceeds that of a voting system.

    7. Re: "Of course it can," says government by ceoyoyo · · Score: 1

      What we're actually talking about is cosmic rays, which are matter particles (mostly protons), not any kind of electromagnetic radiation. Those generally slam into something in the atmosphere, producing showers of secondary particles. Occasionally some of these make it to the ground. The article mentions neutrons, but these seem to be mostly muons.

      Of course Bruce Perens, to whom you replied, was talking about the radio waves from HAARP, which was mentioned by the OP.

    8. Re: "Of course it can," says government by gumbi+west · · Score: 1

      The article mentions neutrons, but these seem to be mostly muons.

      Neutrons are not muons. Nor are muons the problem in bit flips.

    9. Re:"Of course it can," says government by Anonymous Coward · · Score: 1

      Cosmic rays don't even make it through the atmosphere. Not even multi TeV cosmic rays, which are rare. They never "pass straight through the Earth". Particles with energies of about 10^18 eV arrive at the rate of about one per square kilometer of atmosphere per century. These very high energy cosmic rays are detected at the ground by looking for the secondary photons, electrons, muons and neutrons that shower large areas of the ground after the primary particle impacts the atmosphere.

    10. Re: "Of course it can," says government by Bruce+Perens · · Score: 1

      The problem is thermal neutrons. These are secondary particles from interaction of cosmic rays with the atmosphere, and you can't shield from them.

    11. Re: "Of course it can," says government by gumbi+west · · Score: 1

      Thermals are actually quite easy to shield from; anything high in boron will do it. It's the high-energy ones that you can't (economically) shield from. The issue with high-energy neutrons is that what people use as gamma shields tends to make more neutrons. Basically, you need a swimming pool over your computer to shield from them.

    12. Re:"Of course it can," says government by Khyber · · Score: 1

      Cosmic rays are essentially depleted atomic nuclei. They don't penetrate much of shit, they tend to smash into another atom and that's the end of it. We detect them on the ground by secondary effects.

      --
      Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    13. Re: "Of course it can," says government by werepants · · Score: 1

      ECC is actually more reliable, for its problem domain, than a triple voting system. The probability that you would arrive at a valid ECC code for bad data due to multiple bit flips is much lower than the probability of two out of three systems voting wrong.

      I'd say it all depends on the architecture of the systems in question, and there are a variety of possible outcomes. If you compare a bit-level TMR system to an ECC system (suppose 8 data bits and 1 ECC bit) in which all bits are equally susceptible to upset, then you clearly have a greater chance to accumulate 2 bits in error in the ECC system just because you've got 9 chances to upset 2 bits instead of only 3 chances. If you're flipping coins, you've got a better chance of seeing heads twice in 9 flips than you do of seeing heads twice in 3 flips.

      Suppose you've got a 10% chance of accumulating an error in a given bit per day, you end up with ~72% chance of 2 upsets in the ECC architecture (.9*.8, an approximation), vs 6% (0.3*0.2) in the TMR architecture. Granted, you've got to multiply that by 8 to get the same amount of data storage as the ECC example. So at the end of the day in this notional case, you end up with 72% chance of lost data in the ECC architecture, and 48% chance of lost data in the TMR example. I've used imprecise approximations, but they demonstrate that statistically speaking, TMR can provide better protection than ECC in some cases.
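
      The same comparison can be worked out exactly with the binomial distribution. This is only a sketch, assuming independent upsets and the 10% per-bit figure above; `prob_at_least` is a helper name made up for illustration.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent bits are upset,
    each with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p = 0.10  # per-bit upset probability per day (the notional figure above)

# One 9-bit ECC word (8 data bits + 1 check bit): lost once 2 or more bits flip.
ecc_loss = prob_at_least(2, 9, p)

# One TMR triple: lost once 2 or more of the 3 copies flip.
tmr_bit_loss = prob_at_least(2, 3, p)

# Eight independent TMR triples hold the same 8 data bits.
tmr_byte_loss = 1 - (1 - tmr_bit_loss) ** 8

print(f"ECC word: {ecc_loss:.3f}")
print(f"TMR bit:  {tmr_bit_loss:.3f}")
print(f"TMR byte: {tmr_byte_loss:.3f}")
```

      Under these assumptions the exact figures come out lower than the rough ones, but in the same order: the 9-bit word is still somewhat more likely to take two hits than the eight triples are.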

      Now, many ECC algorithms provide a single-error correct dual-error detect (SECDED) capability, which does confer a meaningful advantage vs TMR, where you can correct a single bit but you have no idea if you are actually seeing a two bit error. On the whole, though, you're still going to get 2x upsets far more often with ECC simply because there's a larger target area. So you end up with data loss more often with ECC, but at least you know that you've lost data.

      It's also worth noting that if TMR is implemented on a byte level, where you compare the contents of 3 bytes, TMR looks a lot better because it's very unlikely that you'll upset the same bit on two different bytes. So effectively you do end up with something more like SECDED.
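
      A rough sketch of what byte-level voting could look like, assuming the three copies are compared as whole bytes; `byte_tmr_read` is a hypothetical name, not taken from any real system.

```python
from typing import Optional


def byte_tmr_read(a: int, b: int, c: int) -> Optional[int]:
    """Return the value at least two copies agree on, or None when all
    three copies differ (error detected but not correctable)."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None


original = 0xB2

# One copy upset: the two good copies outvote it.
one_upset = byte_tmr_read(original ^ 0x04, original, original)

# Two copies upset in different bits: no two copies agree, so the error is detected.
two_different = byte_tmr_read(original ^ 0x04, original ^ 0x40, original)

# Two copies upset in the same bit: the bad value wins the vote silently.
two_same = byte_tmr_read(original ^ 0x04, original ^ 0x04, original)

print(one_upset == original, two_different, two_same == original)
```

      The silent failure only happens when the same bit is hit in two copies, which is the unlikely case described above.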

      Anyhow, it's a complex topic with lots of potential for statistical hangups. In my experience, ECC is attractive primarily because it is so efficient compared to TMR - you're generally talking a 1-10% memory overhead to provide some very capable protection that will generally bring upsets down by a few orders of magnitude. With TMR, the overhead is 300%. However, TMR can be simple, and for situations where you have memory capacity, board space, and/or power to spare, it can be a superior option.

      Last but not least, in both cases, the rate of uncorrectable errors is highly dependent on data retention time. You have to keep moving data in and out fast enough that the chance of accumulating multiple error bits in a data word is small. So there's a time component to the whole discussion as well.

    14. Re:"Of course it can," says government by TheOuterLinux · · Score: 1

      The HAARP part of my comment was just a joke, but still kinda cool to see intelligent responses though. :)

  7. explains so much by bobmajdakjr · · Score: 1

    "Cosmic rays, man." -- Bethesda

    1. Re:explains so much by sunwukong · · Score: 1

      Client: ... it crashed again! What's going on with the server?
      Me: I've recently become a Herald of Galactus. I may be difficult to reach from now on .... if you're lucky ...

  8. Re:@Intel: Why no ECC for consumer-grade processor by unixisc · · Score: 2

    Actually, wouldn't cosmic rays be capable of flipping bits even in ECC memory and processors, thereby making the whole ECC thing useless? Particularly in more recent process nodes, where the lithography scale is approaching atoms, and where cosmic rays would have a far greater effect?

  9. Re:ECC by unixisc · · Score: 1

    As they get smaller, I think we are fast approaching the point where it will be thought that a silicon atom is too big to allow for a shrink, and that semiconductor physicists will have to start looking at carbon and maybe even boron.

  10. preposterous! by Gravis+Zero · · Score: 5, Informative

    When your computer crashes or phone freezes, don't be so quick to blame the manufacturer.

    If my computer crashes or phone freezes, it's almost certainly the fault of the person who released the software without properly debugging it. Cosmic rays are very low on the list of reasons why your device has malfunctioned.

    --
    Anons need not reply. Questions end with a question mark.
    1. Re:preposterous! by Arkh89 · · Score: 1

      You are right in the lottery sense: if your particular phone or app crashes, it is very unlikely that it is due to cosmic rays. However, it might be likely that it happens fairly often around the world. This is similar to the lottery: it is unlikely that you will win, but it is likely that someone will win.

      It's all a matter of cross-section of the devices actually. If we want to compare, the iPhone 4 (an old baseline, smaller than today's generation but close to most of the low-cost devices) measures 0.007 m^2, while the top 10 largest data centers (from this random link) combined measure about 1.7 x 10^6 m^2. I am going to assume only 1% of the surface is occupied by sensitive chips (?). You would need about 2.4 million iPhone 4s to cover the same area. Thus, it is very possible that mobile hardware is the victim of more high-energy burps than immobile hardware.
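
      The arithmetic behind that estimate, as a short sketch. The areas and the 1% fraction are the assumptions stated above; the variable names are just for illustration.

```python
iphone4_area_m2 = 0.007      # footprint of one iPhone 4
datacenter_area_m2 = 1.7e6   # top 10 data centers combined
sensitive_fraction = 0.01    # assumed share of floor area covered by chips

# Area of the data centers that is assumed to be sensitive to upsets.
sensitive_area_m2 = datacenter_area_m2 * sensitive_fraction

# Number of phones whose combined footprint matches that area.
phones_needed = sensitive_area_m2 / iphone4_area_m2

print(f"{phones_needed / 1e6:.1f} million phones")
```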

    2. Re:preposterous! by ShaunC · · Score: 1

      Low on the list, but certainly not zero. Given the increasing number of devices out there it's probably happening around the world with some regularity. There just isn't a way for most of us to properly measure or attribute the occurrences.

      Say you're driving down the interstate and your cruise control shuts off, but you're sure you didn't bump the brake. Your $1.49 bag of chips rings up as $9.49 at the grocery store, but re-scans at the correct price after a void. A few pixels go blurry in an otherwise flawless TV broadcast. We tend to chalk these things up as "a glitch" and go on with life, but a few of them really are caused by tiny visitors from outer space...

      --
      Thanks to the War on Drugs, it's easier to buy meth than it is to buy cold medicine!
    3. Re:preposterous! by complete+loony · · Score: 1

      Bit flipping happens often enough just in stored DNS names that it's worth buying up some bit-flipped names.

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
    4. Re:preposterous! by craXORjack · · Score: 4, Funny

      Some pieces of software are just the recipients of more cosmic rays than others. For example, Windows 3.1 used to attract ultra high energy cosmic rays from as far away as Mars and for a time was making astronomers' lives difficult due to the showers of particles released when many of those rays would strike molecules in the atmosphere instead of the Microsoft copyrighted code they were aiming for. Other software that attracts higher than normal numbers of cosmic rays includes the Therac-25 and Diebold voting machines.

      --
      Liberals call everyone Nazis yet they are the closest thing to it.
    5. Re:preposterous! by Bruce+Perens · · Score: 1

      Someday I will be able to completely debug a piece of software. It will be a very small piece of software, I am sure.

      People discount the complexity that we face when attempting to fully debug anything.

    6. Re:preposterous! by Bruce+Perens · · Score: 1

      Somewhere on the Internet, it is perfectly possible that your packet was re-written by something that glitched a bit and wrote a proper checksum for the glitched data.

    7. Re:preposterous! by Gravis+Zero · · Score: 1

      People discount the complexity that we face when attempting to fully debug anything.

      As a programmer, I recognize that getting rid of every bug in a large piece of software is a pipedream. I just want them to get the superficial bugs out of the way (which plague every release they make) so that they can actually focus on fixing the deeper bugs in days, not months.

      However, with a large corporate entity like Microsoft, it is not unreasonable to insist on responsible programming practices though Microsoft is slow to adopt these. One of these practices is the reuse of existing code. With every release, it seems like they have rewritten the whole OS because it's always full of bugs at every layer. Like you mentioned, keeping the software small is a way to make software securable and despite their name, Microsoft has no interest in doing that.

      --
      Anons need not reply. Questions end with a question mark.
    8. Re:preposterous! by Khyber · · Score: 1

      Absolutely zero. Cosmic radiation - a depleted nucleus - never reaches the ground. We detect them at ground level through their secondary effects (produced photons, electrons, etc.) They smash into another atom in the atmosphere and are gone.

      --
      Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    9. Re:preposterous! by Raenex · · Score: 1

      Absolutely zero.

      You'd think by chance alone the occasional one would reach ground level.

  11. Re:ECC by DontBeAMoran · · Score: 1

    Your ECC RAM won't matter much if the cosmic ray hits the CPU registers. Or a cell in a block of your flash storage.

    --
    #DeleteFacebook
  12. NOT bringing down a passenger jet by Alain+Williams · · Score: 2

    Follow through the links: a cosmic ray caused problems, the jets misbehaved for a bit but the duplicated systems protected them from a crash - as they are supposed to after a malfunction.

  13. Hmm. by hey! · · Score: 1

    Shouldn't "News for Nerds" be news to nerds?

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  14. Re:@Intel: Why no ECC for consumer-grade processor by drinkypoo · · Score: 4, Informative

    Actually, wouldn't cosmic rays be capable of flipping bits even in ECC memory and processors, thereby making the whole ECC thing useless?

    No, this is what ECC is for. If a bit is flipped, you can detect it. If you have enough parity bits, you can even detect which bit is flipped, and correct it on the fly. Computation occurs as normal and an error shows up in the syslog.

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  15. Odds by JBMcB · · Score: 4, Insightful

    The odds of a cosmic ray hitting your memory at the exact right spot to flip a bit are one in hundreds of millions. There are just enough computers out there that it happens from time to time. The odds of FIVE rays hitting just the right locations to flip four bits and a parity bit are, pardon the pun, astronomical.

    --
    My Other Computer Is A Data General Nova III.
    1. Re:Odds by Bruce+Perens · · Score: 1

      The odds of a cosmic ray hitting your memory at the exact right spot to flip a bit are one in hundreds of millions.

      Each of my systems has more than hundreds of millions of bits of RAM. Some of them have 128 thousand million bits. There are a lot of places to hit.

    2. Re:Odds by dbraden · · Score: 1

      Odds are low, but the frequency is high. An oft-cited IBM study from the 90s determined that memory will get a cosmic ray bit-flip once per 256MB per month. So, an 8GB system will see about 32 bit-flips per month. Probably more with modern memory. Of course, as you mention, it's not likely that several would occur at the same time in nearly the same place.
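
      A quick sketch of that scaling, taking the one-flip-per-256MB-per-month rate at face value; `expected_flips_per_month` is a made-up helper name.

```python
FLIPS_PER_MONTH_PER_256MB = 1.0  # the rate quoted from the IBM study


def expected_flips_per_month(memory_gb: float) -> float:
    """Scale the quoted rate linearly with installed memory."""
    memory_mb = memory_gb * 1024
    return memory_mb / 256 * FLIPS_PER_MONTH_PER_256MB


flips_8gb = expected_flips_per_month(8)
print(f"8GB: {flips_8gb:.0f} flips per month, about {flips_8gb / 30:.1f} per day")
```

      Linear scaling is itself an assumption here; it ignores differences in cell size and process between memory generations.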

      The vast majority will be in unused memory, executable code that never gets executed, or even in code or data that, while corrupted, simply doesn't have a noticeable effect.

      But, what about a single bit flip of a parity bit? Does a good bit get "corrected" to an incorrect value? Serious question, as I really don't know enough about the specifics.

    3. Re: Odds by Entrope · · Score: 1

      Typical error correcting codes decode correctly wherever the errors occur, as long as the number of errors is within the error-correcting capacity of the code. They would be pretty poor if they failed more often when errors occur in the parity bits.

    4. Re:Odds by Khyber · · Score: 1

      "An oft-cited IBM study from the 90s"

      When the cells were much larger and easier targets...

      --
      Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    5. Re:Odds by ChrisMaple · · Score: 1

      A properly designed ECC detects errors in the parity bits. That can mean that one more parity bit is required than if the parity bits weren't covered by the ECC.

      For instance, an ECC for 8-bit data with 3 parity bits could correct a single bit error in the 8-bit part, but would cause an error in the 8-bit part if a parity bit was wrong. An ECC for 8-bit data with 4 parity bits can correct any of the total 12 bits, although correcting an error in the parity bits may not be necessary.
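
      One way to get the 8-data-bit, 4-parity-bit behaviour described above is a Hamming code over 12 bit positions. The sketch below assumes that textbook layout (parity bits at positions 1, 2, 4 and 8); it is not meant to describe any particular memory controller, and the function names are made up.

```python
PARITY_POSITIONS = (1, 2, 4, 8)
DATA_POSITIONS = (3, 5, 6, 7, 9, 10, 11, 12)


def encode(byte: int) -> list[int]:
    """Build a 12-bit codeword; index 0 is unused so indices match positions 1..12."""
    word = [0] * 13
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (byte >> i) & 1
    for p in PARITY_POSITIONS:
        # Parity bit p covers every other position whose index has bit p set.
        word[p] = sum(word[pos] for pos in range(1, 13) if pos & p and pos != p) % 2
    return word


def syndrome(word: list[int]) -> int:
    """XOR of the positions of all set bits; 0 for a valid codeword,
    otherwise the position of a single flipped bit."""
    s = 0
    for pos in range(1, 13):
        if word[pos]:
            s ^= pos
    return s


def correct(word: list[int]) -> list[int]:
    """Return a copy with a single-bit error repaired, wherever it occurred."""
    fixed = list(word)
    s = syndrome(fixed)
    if 1 <= s <= 12:
        fixed[s] ^= 1
    return fixed


def decode(word: list[int]) -> int:
    """Extract the 8 data bits from a codeword."""
    return sum(word[pos] << i for i, pos in enumerate(DATA_POSITIONS))


codeword = encode(0xA7)
codeword[4] ^= 1          # upset a parity bit
codeword = correct(codeword)
print(hex(decode(codeword)))
```

      A flip in a parity position is located and repaired the same way as a flip in a data position, so the data is never "corrected" into a wrong value by a single upset.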

      --
      Contribute to civilization: ari.aynrand.org/donate
  16. Re:ECC by drinkypoo · · Score: 1

    Your ECC RAM won't matter much if the cosmic ray hits the CPU registers.

    Some modern CPUs have ECC cache RAM. Is it not possible to have ECC registers?

    Or a cell in a block of your flash storage.

    Filesystems can have ECC, too. And in fact, so can storage devices.

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  17. Stock markets and the BOFH by bernywork · · Score: 1

    Even though market participants are warned about this by exchanges, you do have to wonder, if it makes it into the BOFH excuse calendar, can you really take it seriously?

    --
    Curiosity was framed; ignorance killed the cat. -- Author unknown
    1. Re:Stock markets and the BOFH by bernywork · · Score: 1

      Oh, that and solar flares

      --
      Curiosity was framed; ignorance killed the cat. -- Author unknown
  18. Goddamit where was this ... by CaptainDork · · Score: 1

    ... during my IT career?

    I could have used this as a dodge after I fucked something up in the system.

    I did the sunspot thing back in 2012.

    "Russia," seems to work well, though.

    --
    It little behooves the best of us to comment on the rest of us.
  19. Re: ECC by ewanm89 · · Score: 2

    We are already there:
    http://www.pcworld.com/article...
    http://arstechnica.com/gadgets...

    As the IBM article states, they are working with Samsung and Global Foundries, while the other article is about Intel. That is 3 of the major chip fab companies stating they are moving to a silicon-germanium hybrid crystal over pure silicon for exactly this reason. Also, the fabs on a new process node take time to set up, and they need to be ready before circuit designs come in to fab prototype batches, so they are usually a couple of years ahead of what is commercially available on the market.

  20. Re:ECC by PPH · · Score: 1

    Now if only companies like Intel would actually provide

    Yes, if only ...

    --
    Have gnu, will travel.
  21. Re:@Intel: Why no ECC for consumer-grade processor by thinkwaitfast · · Score: 1

    This has all been known for over 30 years. I knew about it before I knew what it meant because the old timey computer magazines like BYTE had articles about it.

    Are people really less knowledgeable about computers now than they were in the 80's?

  22. Yep. Cosmic rays. by PPH · · Score: 1

    I'm certain it's on the list somewhere.

    --
    Have gnu, will travel.
  23. Re:Imagine when the U.S. weaponizes this into a cr by thinkwaitfast · · Score: 1

    This was proposed as an SDI weapon in the 1980s. And it wasn't just the US. Russia too, unless you don't believe they do stuff like that or have the capability.

  24. Re:@Intel: Why no ECC for consumer-grade processor by drinkypoo · · Score: 1

    Are people really less knowledgeable about computers now than they were in the 80's?

    If you mean on average, I think the answer is probably yes. More people know how to operate them now, but then, operating them has become orders of magnitude simpler.

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  25. Re:@Intel: Why no ECC for consumer-grade processor by scdeimos · · Score: 2

    Are people really less knowledgeable about computers now than they were in the 80's?

    Yes, absolutely! Have you never sat down with an IT graduate from the 2000's to figure out what they actually know about computer hardware?

  26. Re:ECC by scdeimos · · Score: 1

    Cosmic rays aren't likely to trigger multiple bit flips simultaneously in the same block of memory.

    Maybe that's the case for now, but who knows what will happen with stacked 3D memory?

  27. Yep, Cosmic rays CAN cause problems by mhkohne · · Score: 1

    But much more frequently, problems are caused by somebody f**king something up. You shouldn't be looking to cosmic rays until you're pretty sure it's not just stupidity in action.

    --
    A thousand pounds of wood moving at 300 feet per minute. Don't get in the way.
  28. Re:ECC by viperidaenz · · Score: 1

    Costs money.
    It takes up more power, more die space, you need more RAM chips, etc.

  29. bullshit by gravewax · · Score: 2

    Your phone or computer crash is thousands of times (if not millions) more likely to have been caused by the manufacturer/coders error or fault than cosmic rays. Anyone that decides to consider cosmic rays as a more likely answer deserves to continue to experience their issues.

    1. Re:bullshit by thegarbz · · Score: 1

      No it isn't. Cosmic rays most definitely have an impact on your phone.

      You can take basic precautions though. I find new phones come with small amount of EM shielding that blocks cosmic rays. As time progresses this shield gets weak and more and more CPU power is dedicated to it operating properly which also slows down your phone. However it is often fixed by performing a factory reset (which also resets and recalibrates the EM shielding) making your phone fast and cosmic ray resistant again.

      Of course if you really want to be sure you can just encase your phone inside a lead box. I've never seen a phone encased in lead crash, however it does play havoc on your signal strength.

  30. built cosmic hodoscope back in 1992 by Anonymous Coward · · Score: 1

    As a student intern working at a lab in 1992 my project was to build a cosmic hodoscope to record cosmic rays. It involved scintillators, fiber optics, HV photomultiplier tubes, a timing coincidence system, radiation sources for calibration and testing, etc. When two PMTs fired within the same timing gate window it was the result of a cosmic ray and we could determine the path of the cosmic ray. The interesting thing was that this showed that there are quite a few energetic cosmic rays reaching the surface of the earth and that they have no problem passing through the atmosphere, buildings, etc. It's very real.

  31. Re:@Intel: Why no ECC for consumer-grade processor by unrtst · · Score: 2

    Are people really less knowledgeable about computers now than they were in the 80's?

    If you mean on average, I think the answer is probably yes.

    If you mean on average out of the total number of computer users or programmers, then yes (they are less knowledgeable), because that pool has increased by lots and lots.

    If you mean on average out of all people, then no. I suspect there are far more people that know what ECC does now than did in the 80's, and the total population count hasn't gone up as much as that number, so there are more people on average, and in total, that know about the inner workings of computers.

    I think there are just far more people touching stuff they know very little about, and we assume they must know *something*, but they don't.
    Compare it to early cars, where every operator had to know a bunch of stuff about it just to keep it running, but it was simple enough that the average operator could learn that stuff. Now, most cars make maintenance very difficult, and many drivers would be hard pressed to do simple things like changing the oil, flushing the radiator, replacing a brake light, replacing the battery, changing a tire, jump starting, etc. That said, there are WAAAAY more people that know WAAY more about cars now than there were in 1930. It's just shifted more to professional/hobbyist knowledge than something that every operator is required to know.

      More people know how to operate them now, but then, operating them has become orders of magnitude simpler.

  32. Re:ECC by Solandri · · Score: 1

    As the minimum detail size of the IC process gets smaller, the potential for radiation to flip a bit gets higher.

    I suspect the math works out the same as Shannon's noisy channel theorem. And that as the chance of bit flips (noise) increases due to die shrinking, you can increase the error correction coding to compensate for it up to some theoretical limit.

    E.g., instead of ECC memory having one parity bit for every 8 data bits, you increase it to two parity bits per 8 data bits, and it can withstand a higher error rate.

  33. Thousands of years, same surprises by holophrastic · · Score: 2

    Is anyone surprised that if you store things once, and reference the one place alone, that you get screwed on occasion?

    Is the word "corroboration" new? How about "validation", "authentication", "verification", and, oh, I don't know, "paper-trail"?

    It's electronic information, not magic. The benefit of not carving into stone is that you can readily duplicate information into multiple places. Use it.

    RAID.

    1. Re:Thousands of years, same surprises by holophrastic · · Score: 1

      *sigh*, I didn't punctuate, and you chose to interpret that I made a mistake, instead of interpreting that I didn't.

      I've been upset with RAID 5, in particular, for exactly that reason -- it has the ability to notice a single bit-flip, but it specifically does not check. I've even built a working prototype of a RAID 5 implementation that does check on read, notices that the parity is amiss, and screams. I've built another (in software), that chains the parities so it can actually repair a single bit-flip 80% of the time.
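
      For illustration, here is roughly what a check-on-read looks like with plain XOR parity. This is a toy sketch over in-memory byte strings with made-up block contents, not my prototype and not how any particular RAID controller is implemented.

```python
from functools import reduce
from operator import xor


def parity_block(blocks: list[bytes]) -> bytes:
    """XOR equal-length data blocks together, byte by byte."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks: list[bytes], parity: bytes) -> bool:
    """Recompute parity on read and compare it with the stored parity."""
    return parity_block(data_blocks) == parity


data = [b"votes=42", b"cand=007", b"site=abc"]
parity = parity_block(data)
print(stripe_is_consistent(data, parity))

# Flip one bit in the second block, as a single upset would.
corrupted = bytearray(data[1])
corrupted[3] ^= 0x10
damaged = [data[0], bytes(corrupted), data[2]]
print(stripe_is_consistent(damaged, parity))
```

      Plain parity can only say that the stripe is inconsistent, not which block is wrong, which is why repairing the flip needs extra information such as the chained parities.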

      When I typed "RAID." I was continuing my complaint that even in electronic data, no one corroborates anything -- the thesis statement of my post.

      I was ambiguous, you could have decided that I was correct.

      So, *sigh* another person who chooses the inference that makes the implication incorrect, instead of the inference that would make the implication correct.

  34. Serious Computer Glitches Can Be Caused By IDIOTS by dbIII · · Score: 1

    Why didn't that voting machine have ECC memory? Why didn't the software have bounds checking?
    Yes, I know it's common, I use some software (from a very large company that was run by a guy you don't go hunting with) that when it hits some input data with a negative integer IT ATTEMPTS TO ALLOCATE NEGATIVE MEMORY, and of course, crashes - but things that stupid should never happen (especially since it's supposed to deal with very noisy data). If it's out of range for a bit of code to work on then don't let it in! Don't just check in one place and hope that catches everything, check everywhere that out of bounds data is a problem.
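
    A minimal sketch of the kind of range check meant here, applied to a vote tally. The function name and the figures are made up for illustration; only the 4,096-vote jump comes from the story.

```python
def checked_tally(count: int, ballots_cast: int) -> int:
    """Return count unchanged, or raise if it cannot be a real vote total."""
    if not 0 <= count <= ballots_cast:
        raise ValueError(f"tally {count} is outside 0..{ballots_cast}")
    return count


true_tally = 514      # made-up figure
ballots_cast = 3000   # made-up figure

# A single upset in bit 12 adds 4096 to the stored tally.
flipped_tally = true_tally ^ (1 << 12)

try:
    checked_tally(flipped_tally, ballots_cast)
    rejected = False
except ValueError:
    rejected = True

print(flipped_tally, rejected)
```

    A check like this only catches values that land out of range; a flip in a low-order bit would still slip through, which is where ECC comes in.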

  35. Re:ECC by dbIII · · Score: 1

    Indeed, that could be a problem, but the failure to teach computer science graduates mathematics to the level of high school probability and statistics is a far greater problem in my opinion. It results in posts like the above.

  36. We've always known this. This is why we have ECC by kriston · · Score: 2

    We've always known this. This is why we have ECC memory on servers.

    --

    Kriston

  37. Re:ECC by unixisc · · Score: 1

    Yeah, I forgot that boron is a dopant. Although then, you'd have no dopants that are significantly smaller than carbon

  38. there is a product for this! by ooloorie · · Score: 1

    Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off soft errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read or write to them.[2] Hence, the error rates increase rapidly with rising altitude; for example, compared to the sea level, the rate of neutron flux is 3.5 times higher at 1.5 km and 300 times higher at 10–12 km (the cruising altitude of commercial airplanes).

    https://en.wikipedia.org/wiki/...

    And, whaddaya know, you can buy them pretty much everywhere. For voting machines, medical applications, etc. they should obviously be used.

  39. Re:We've always known this. This is why we have EC by kriston · · Score: 3, Informative

    It's also why systems on spacecraft such as the Space Shuttle had what's called the Data Processing System. It consisted of four systems with identical software and an extra one with the same hardware but a different implementation with the same goals. They checked each others' decisions, and a majority "vote" would lock out the differing system.

    --

    Kriston

  40. ZFS by locokamil · · Score: 1

    Isn't this why ZFS exists?

    1. Re: ZFS by Entrope · · Score: 1

      Unless you use ZFS on a ramdisk, no.

  41. We've known this since the 1980s... by JustAnotherOldGuy · · Score: 1

    We've known this since the 1980s...and the more dense/smaller the transistors get the greater the likelihood of it happening.

    This is news, but it's literally from the previous century.

    --
    Just cruising through this digital world at 33 1/3 rpm...
  42. Cosmic rays are real. by pjv936 · · Score: 1

    All servers should have ECC memory at a minimum.

  43. Re:Voting system by hackwrench · · Score: 1

    Exactly nothing is done to examine whether the voting system is accurate, yet they expect us to believe in it.

  44. Re:ECC by Jeremi · · Score: 1

    Your ECC RAM won't matter much if the cosmic ray hits the CPU registers. Or a cell in a block of your flash storage.

    Also, your ECC RAM won't matter much if you get run over by a truck. So what? ECC RAM will help if there is a bitflip in your ECC RAM, that's what it's for and that's what the benefit is. It's not going to solve world hunger either, and nobody ever suggested that it would.

    --


    I don't care if it's 90,000 hectares. That lake was not my doing.
  45. Blame Canada by stooo · · Score: 1

    Yep, it's good news. Very useful.

    Dumb user error can be blamed on IT problems
    IT problems can be blamed on computer glitches
    Computer glitches can be blamed on cosmic rays

    As a result, dumb user errors can and shall always be blamed on cosmic rays

    --
    aaaaaaa
  46. Re:@Intel: Why no ECC for consumer-grade processor by thinkwaitfast · · Score: 1

    It's just shifted more to professional/hobbyist knowledge than something that every operator is required to know.

    Isn't that implied by the site we're on?

  47. Re:Marketing... by hackwrench · · Score: 1

    Well the marketing for high quality stuff is lousy. I buy phones around $100. There are cheaper phones, but they clearly do not meet my requirements. At this price, I can't tell if a given device will meet my needs or not. Perhaps I should go higher, but all devices fail to stand out at any price $100 or higher.

  48. Re:Knowledge by hackwrench · · Score: 1

    Still, there's much to be made by taking advantage of people who have more money than sense that sell equipment with minor issues on eBay because they don't know much about how to fix them. Just scored me what should be quite a deal. Still waiting for the machine to come though, to confirm my expectation.

  49. Cable modem statistics by hackwrench · · Score: 1

    Channel ID                     18            19            20            22
    Total Unerrored Codewords      243285196329  243285196266  243285195305  243285196923
    Total Correctable Codewords    1094          1439          1100          1342
    Total Uncorrectable Codewords  16934         16642         17884         16943
    Don't know what normal values are.
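
    One way to put the counters in perspective is to turn them into rates. A small sketch using the figures above; it says nothing about what a healthy modem should report.

```python
# channel id -> (unerrored, correctable, uncorrectable) codewords
channels = {
    18: (243285196329, 1094, 16934),
    19: (243285196266, 1439, 16642),
    20: (243285195305, 1100, 17884),
    22: (243285196923, 1342, 16943),
}

uncorrectable_rate = {}
for channel, (unerrored, correctable, uncorrectable) in channels.items():
    total = unerrored + correctable + uncorrectable
    # Fraction of all codewords the error correction could not repair.
    uncorrectable_rate[channel] = uncorrectable / total
    print(f"channel {channel}: {uncorrectable_rate[channel]:.2e} uncorrectable")
```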

    1. Re:Cable modem statistics by Bruce+Perens · · Score: 1

      This is not unusual. However, the cable modem channel is many orders of magnitude noisier than the paths inside of your CPU would be from external radiation.

  50. Re:ECC - Why not by hackwrench · · Score: 1

    Because studies have been done to ascertain this information.

  51. Re:ECC by Bruce+Perens · · Score: 1

    The situation for AMSAT is still pretty bad, as far as I've heard. As a radio amateur group (and one that has launched quite a few satellites as space hitch-hikers) they can't afford the good stuff, but they get some donated by NASA and some of the commercial satellite companies. Only a few years ago they were still using the 1802 as their main vehicle controller, as that was their main choice in silicon-on-sapphire CPUs. They get some donations of space-qualified solar cells. They scrub their memory continuously. They use no boot ROMS. The program is loaded entirely by hardware, and then the CPU is started.

  52. Re:@Intel: Why no ECC for consumer-grade processor by Bruce+Perens · · Score: 1

    You hit a LSB and something is off by one. You hit a MSB and you're potentially off by trillions.

    That's a good argument for Gray code.

    I have to take issue with the assumption that nothing clears errors better than a hard reset. There are very many known strategies for dealing with errors on a running system, and a reset only clears persistent and cumulative errors, rather than transient ones. Since we can assume that your computer doesn't keep the same data in memory all of the time, most will be transient.

  53. Re:Acts of God by Bruce+Perens · · Score: 1

    Tee hee.

    The legal definition of Act of God does not itself admit to the existence of a deity. Just natural phenomena which are beyond human agency to predict or prevent.

  54. Re:Imagine when the U.S. weaponizes this into a cr by Bruce+Perens · · Score: 1

    Really earlier than that, Fermi expected it and had equipment shielded and double-shielded when testing the first nuclear bomb. But we should not confuse cosmic rays and EMP.

  55. Re: ECC by Bruce+Perens · · Score: 4, Interesting

    Read this paper. He postulates 2 soft errors per year for a Xeon 7500 with 24 MB L3 cache at sea level in New York City. He also gives figures for static RAM, which is the stuff of CPU registers.
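
    For scale, the quoted figure can be converted into the units usually used for soft errors. This is only back-of-the-envelope arithmetic under two assumptions that are not in the comment: that "24 MB" means 24 MiB, and that all of the errors are attributed to the L3 cache bits.

```python
HOURS_PER_YEAR = 365.25 * 24

errors_per_year = 2            # figure quoted in the comment
cache_bytes = 24 * 2**20       # assumption: "24 MB" means 24 MiB
cache_bits = cache_bytes * 8


def fit_rate(per_year: float) -> float:
    """Convert errors per year to FIT (failures per 10^9 device-hours)."""
    return per_year / HOURS_PER_YEAR * 1e9


def per_bit_per_year(per_year: float, bits: int) -> float:
    """Average error rate per bit per year, if errors are spread evenly."""
    return per_year / bits


print(f"{fit_rate(errors_per_year):,.0f} FIT")
print(f"{per_bit_per_year(errors_per_year, cache_bits):.2e} errors per bit per year")
```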

  56. Re:Imagine when the U.S. weaponizes this into a cr by thinkwaitfast · · Score: 1

    This isn't either, but closer to a cosmic ray, just lower energy. Pointing a particle accelerator at warheads to fry their electronics. Which it would.
    We did precisely this for NASA as part of a system we built, and I am (or was, a long time ago) very familiar with radiation damage and failure modes of electronics in space. Sometimes the shielding can make things worse: instead of going straight through a transistor, a particle can collide upstream, sending a spray of other particles with the right energy to do damage. There are parts of the upper atmosphere that are more radioactive than the areas above or below.

  57. Re:ECC by Enter+the+Shoggoth · · Score: 1

    The situation for AMSAT is still pretty bad, as far as I've heard. As a radio amateur group (and one that has launched quite a few satellites as space hitch-hikers) they can't afford the good stuff, but they get some donated by NASA and some of the commercial satellite companies. Only a few years ago they were still using the 1802 as their main vehicle controller, as that was their main choice in silicon-on-sapphire CPUs. They get some donations of space-qualified solar cells. They scrub their memory continuously. They use no boot ROMs. The program is loaded entirely by hardware, and then the CPU is started.

    Bruce, what do you mean by "...no boot ROMS.... loaded entirely by hardware" ?

    --
    Andy Warhol got it right / Everybody gets the limelight
    Andy Warhol got it wrong / Fifteen minutes is too long.
  58. Northbridge (and thus memory controller) is in CPU by DrYak · · Score: 1

    ECC Memory isn't the only added cost, you also need a motherboard and processor that supports it.

    For your information, ever since AMD's Athlon 64, most x86 compatible hardware has had its Northbridge *inside the processor package*.

    That means that the memory controller is inside the package of your CPU.
    The mother board is basically only traces that connect your CPU and the memory slots directly.
    A glorified cable/connector.
    (In practice, there is a bit more, such as powering the RAM slots, but you get the general idea: not much smarts in the motherboard between RAM and CPU.
    The smarts are in the "Southbridge", between the CPU and peripherals.)

    On the AMD side of things, nearly every CPU has ECC capability in its built-in memory controller.
    For a motherboard to support ECC, it basically means just having a few instructions to activate it in the EFI/BIOS.

    On the Intel side of things, it's marketed as an enterprise feature, so it's only available on the more expensive business/workstation hardware.

    --
    "Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
  59. Re: @Intel: Why no ECC for consumer-grade processo by Entrope · · Score: 1

    SECDED codes can detect up to two errored bits per codeword, not per byte. In modern systems, a typical codeword is 64 bits of data plus 8 bits of parity (where multiple parity bits cover each data bit).
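
    The 64+8 layout described above is what an extended Hamming code gives you. Here is a minimal Python sketch, assuming the textbook construction (7 Hamming parity bits plus 1 overall parity bit); real memory controllers may use a different SECDED matrix, and all names are illustrative.

```python
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32, 64)
CODEWORD_BITS = 72  # positions 0..71; position 0 holds the overall parity


def encode(data: int) -> list:
    """Spread 64 data bits over a 72-bit codeword and fill in parity."""
    bits = [0] * CODEWORD_BITS
    d = 0
    for pos in range(1, CODEWORD_BITS):
        if pos not in PARITY_POSITIONS:
            bits[pos] = (data >> d) & 1
            d += 1
    # Each Hamming parity bit covers every position whose index has that bit set.
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, CODEWORD_BITS):
            if pos & p and pos != p:
                parity ^= bits[pos]
        bits[p] = parity
    bits[0] = sum(bits[1:]) % 2  # makes the total number of 1s even
    return bits


def decode(bits: list):
    """Return (data, status); data is None when the word is uncorrectable."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, CODEWORD_BITS):
        if bits[pos]:
            syndrome ^= pos
    overall_odd = sum(bits) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome < CODEWORD_BITS:
        bits[syndrome] ^= 1  # single error: the syndrome is its position
        status = "corrected"
    else:
        return None, "uncorrectable"  # e.g. two flipped bits
    data = 0
    d = 0
    for pos in range(1, CODEWORD_BITS):
        if pos not in PARITY_POSITIONS:
            data |= bits[pos] << d
            d += 1
    return data, status


word = 0xDEADBEEFCAFEF00D
codeword = encode(word)
codeword[17] ^= 1            # one upset: corrected
print(decode(codeword))
codeword[42] ^= 1            # a second upset in the same codeword: detected only
print(decode(codeword))
```

    Several parity bits cover each data bit here, which is what lets the syndrome point at the flipped position.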

  60. Re:@Intel: Why no ECC for consumer-grade processor by drinkypoo · · Score: 1

    Compare it to early cars, where every operator had to know a bunch of stuff about it just to keep it running, but it was simple enough that the average operator could learn that stuff.

    Are you really going back to early cars here? I mean, I think we can break it into basically three eras. The early age of cars was characterized by horseless carriages. The prior age of cars was ushered in around the 1930s or 1940s, where automatic transmissions appeared, the control layout became standard, and vehicles were pretty much all fully enclosed unless they were specifically designed to be a cabriolet. And the modern age of cars came with the O2 sensor, and self-tuning.

    For the earliest cars, it was common to hire a driver and mechanic, because keeping the car moving was a full-time job. Maybe halfway through the period it became reasonable for people to maintain their own vehicles, as the reliability came up to the level where you didn't have to be an engineer to keep it going.

    Obviously, the middle era was the time when any schmoe with a set of wrenches could fix a car. There was very limited availability of fluids, so vehicles were engineered to use what was ubiquitous, which was all the same. Vehicles were easy to maintain because they wasted a lot of space. On the other hand, reliability was nowhere.

    Most modern cars are staggeringly reliable, but maintenance is a mixed bag. Oil changes tend to remain trivial, but transmission oil changes may be a massive PITA. You have to get the car flat and level and add fluid from the bottom while running on a disturbing percentage of modern vehicles, and there is no dipstick. A radiator flush is exactly as hard as it ever was, and you install a flush tee the same as ever. The battery, on the other hand, might be in the wheel well behind the plastic inner fender. Even if it's someplace supposedly convenient like the trunk, it might be a PITA to get in and out as it is in my A8. And you have to jump start from the battery, too. There's no redundant terminal under the hood. That would have just added weight and crap so they skipped it. The starter takes power from beneath the frame rail on the right side, you can apply power there if you have to but again, what a PITA. On the other hand, even reasonable estimates of the service intervals are all much longer than cars from the prior era. And on the gripping hand, nobody is meant to own cars like that for more than half a decade or so. They are for rich fucks who can afford to turn them over :)

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  61. Re: ECC by Entrope · · Score: 1

    You can protect registers with ECC, but any CPU that does that is likely to want some (more complicated) scheme to protect data in the execution units as well, and doing that is more costly. You need something like majority voting of 3+ units, or you need logic that updates ECC bits coherently, which I think is not well-researched.

  62. Re: Cosmic radiation is often just an excuse. by Entrope · · Score: 1

    It's never lupus, and it's never a radiation-induced bit flip. (Until it is one, or both, of those.)

  63. Re:Serious Computer Glitches Can Be Caused By IDIO by omnichad · · Score: 1

    Why didn't that voting machine have ECC memory? Why didn't the software have bounds checking?

    Because if one bit-flip changes the totals by more than 1, then the software was designed wrong.

    Each vote should be a separate record - the totals should only be a summation. You can keep a running tally separately as a backup record, but that should not be your only count. If one bit flips, one vote changes - not one bit on the total.
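
    A rough Python sketch of that design, with hypothetical candidate names and a simulated upset. It assumes votes are stored one record per ballot and that the running tally is only a cross-check.

```python
from collections import Counter


def tally(records):
    """Authoritative count: a summation over the individual vote records."""
    return Counter(records)


def cross_check(records, running_totals, ballots_issued):
    """Compare the recount with the running tally; return a list of problems."""
    problems = []
    recount = tally(records)
    if sum(recount.values()) > ballots_issued:
        problems.append("more vote records than ballots issued")
    for candidate in sorted(set(recount) | set(running_totals)):
        if recount[candidate] != running_totals.get(candidate, 0):
            problems.append(f"running tally mismatch for {candidate}")
    return problems


records = ["alice", "bob", "alice", "carol", "alice"]  # one record per ballot
running = dict(tally(records))                         # backup running tally

running["alice"] ^= 1 << 12   # simulate a flip of bit 12 in the running total
print(running["alice"] - tally(records)["alice"])
print(cross_check(records, running, ballots_issued=5))
```

    A flip of bit 12 in a total changes it by 2**12 = 4096, while a flip inside one record can corrupt at most that one vote.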

  64. Alternative interpretation by volodymyrbiryuk · · Score: 1

    Only n00bs think it's a glitch...real programmers use butterflies. They open their hands and let their delicate wings flap once. The disturbance ripples outward, changing the flow of the eddy currents in the upper atmosphere. These cause momentary pockets of higher-pressure air to form, which act as lenses that deflect incoming cosmic rays, focusing them to strike the drive platter and flip the desired bit.

    --
    sudo rm -r -f --no-preserve-root /
  65. Re:ECC by DontBeAMoran · · Score: 1

    And your truck won't matter much if it gets run over by an asteroid. So what?

    --
    #DeleteFacebook
  66. Cosmic rays can cause problems? What are the odds? by AnotherBlackHat · · Score: 1

    No, seriously, what are the odds of a cosmic ray flipping a bit?

    0.1, 0.000000001, 1e-15, 1e-30?

    It's easy to blame cosmic rays, but a subtle bug is far more likely.

  67. Re:Voting system by hesiod · · Score: 1

    Except, you know, all the people who HAVE actually looked for voter fraud and have found nothing that would affect a result.

  68. Re:ECC by Bruce+Perens · · Score: 1

    LEO doesn't have as much of a radiation problem, they are below the Van Allen belts.

  69. Re:ECC by Bruce+Perens · · Score: 1

    No boot ROM means that a hardware device constructed from discrete logic and analog chips directly demodulates digital data from the radio, addresses the memory, and writes the data. Once this process is completed, it de-asserts the RESET line of the CPU and the CPU starts executing from an address in memory. Really no ROM!

  70. Re: Voting system by AmazingRuss · · Score: 1

    Doesn't matter when the electorate is flooded with morons.

  71. redundancy for critical systems by Khashishi · · Score: 1

    ECC, RAID and some sort of parallel computation should be in place for voting systems. It should be possible to run the same code on multiple processors and check that the results are the same in real-time.

  72. Re:We've always known this. This is why we have EC by thegarbz · · Score: 1

    Same on any industrial safety system. Often these are triplicated or quadruplicated. I actually prefer triplicated since you don't end up with an even vote on a situation.
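
    A small Python sketch of the voting step, as an illustration only: real safety systems vote in hardware or certified logic, and the function names here are made up.

```python
from collections import Counter


def vote(results):
    """Return the value reported by a strict majority of units, else None."""
    value, count = Counter(results).most_common(1)[0]
    return value if count * 2 > len(results) else None


def bitwise_majority(a: int, b: int, c: int) -> int:
    """Per-bit 2-of-3 vote, as used in triple modular redundancy."""
    return (a & b) | (a & c) | (b & c)


print(vote([7, 7, 9]))        # triplicated: one faulty unit is outvoted
print(vote([7, 7, 9, 9]))     # quadruplicated: a 2-2 split has no majority
print(bin(bitwise_majority(0b1010, 0b1010, 0b0010)))
```

    With three units any single disagreement still leaves a majority, while four units can split evenly.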

  73. Re:Voting system by hackwrench · · Score: 1

    Except that they found irregularities that were ignored. And that wasn't a very widespread effort. You can't trust a flawed system. Most of the time, that doesn't matter, but for the voting system to matter, we must have confidence in it.

  74. Data formats by Ocrad · · Score: 1

    Data formats are designed without taking bit flips into account even today.
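
    As one illustration of what taking bit flips into account can look like, here is a minimal Python sketch that appends a CRC-32 to a payload and checks it on read. The framing (4-byte little-endian trailer) is an assumption for the example, not any particular format.

```python
import struct
import zlib


def wrap(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload as a 4-byte little-endian trailer."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def unwrap(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    if len(blob) < 4:
        raise ValueError("blob too short to hold a checksum")
    payload, trailer = blob[:-4], blob[-4:]
    (stored,) = struct.unpack("<I", trailer)
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch: data was corrupted")
    return payload


blob = bytearray(wrap(b"candidate A: 514 votes"))
blob[5] ^= 0x10  # simulate a single flipped bit
try:
    unwrap(bytes(blob))
except ValueError as err:
    print(err)
```

    A checksum like this only detects the damage; recovering the data needs redundancy such as ECC or a second copy.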

  75. Re:Cosmic rays can cause problems? What are the od by cwsumner · · Score: 1

    No, seriously, what are the odds of a cosmic ray flipping a bit?

    Scientists do study this. The estimate is that a typical computer will have a hit about once a year.

    The circuits get smaller, the chips get bigger and more devices are used. It seems to all cancel out and the odds have been about the same for 40 years or more.

    But you are correct, software errors are far more likely.

  76. Re:ECC by Reziac · · Score: 1

    I vaguely recall a study way back when that looked at stray radiation vs the computer case: Metal is helpful as shielding; plastic is not. D'oh!

    --
    ~REZ~ #43301. Who'd fake being me anyway?
  77. Not exactly news by werepants · · Score: 1

    An entire field of study devoted to this exact problem (Radiation Effects) has been around since the 70's at least for aerospace purposes. That said, neutrons have only become a serious concern for terrestrial applications in recent years, as process geometries have gotten small enough and parts have gotten dense enough that neutron upset becomes less of an occasional annoyance and more of a constant problem.

    Generally you test electronics for Single Event Effects (SEE) at a cyclotron, and it used to be NASA, Lockheed, Boeing and the like who were doing all the testing. More recently, though, Cisco and Intel have begun doing a lot of testing of their own. Cisco is known to put an entire server rack at a time in a neutron beam to see what goes boom.

  78. Re: ECC by bmk67 · · Score: 1

    Nonsense. I have 16GB of ECC RAM in my NAS at home, and I can assure you I spent almost two orders of magnitude less than $500/GB. IIRC, it was closer to $10/GB.

  79. Re:Serious Computer Glitches Can Be Caused By IDIO by dbIII · · Score: 1

    Good point. Personally I think things like the Diebold voting machines (designed by a convicted fraudster!) fail on many levels. I'm a big fan of very simple paper ballots and big high speed scanners to collate everything. When something is contested (which seems to happen someplace in just about every election anywhere) paper ballots allow a fallback all the way to manual verification if necessary.

  80. Re:ECC by dakra137 · · Score: 1

    One full page newspaper ad from the week the IBM PC was announced, called it "The IBM of Personal Computers." The "B" in IBM is for "Business." The IBM PC was a business machine, not a toy. What distinguished it from Apple and Radio Shack computers? It was the first PC to provide parity memory.

    The IBM PC had parity memory, not ECC. On a parity error, BIOS displayed an error message on the top line of the display and stopped the computer (HALT or disabled interrupts then loop, such that the error message remained visible.) If I recall correctly, restart required the BRS (Big Red Switch) to power down and then up, rather than <alt><ctrl><del>. When the computer was powered up again, as always back then, BIOS did a Power On Self Test that among other things, tested memory.

    The parity memory meant that the IBM PC produced results as programmed, or not at all.

    Some customers did not really care. They preferred results, even if incorrect ones. Clone companies started providing a BIOS setup option to disable the parity checking. Next they started saving 11% on memory costs by not including parity memory bits at all. Customers did not care. The IT trade press and personal computing trade press did not raise a ruckus, and almost never included the presence or absence of parity memory in product reviews.
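
    A minimal Python sketch of byte parity, for illustration. It uses even parity for simplicity and makes no claim about which convention the IBM PC hardware used; the names are hypothetical.

```python
def parity_bit(byte: int) -> int:
    """Even parity: the stored bit makes the count of 1s across 9 bits even."""
    return bin(byte & 0xFF).count("1") % 2


def parity_ok(byte: int, stored_parity: int) -> bool:
    """True if the byte still agrees with the parity bit stored beside it."""
    return parity_bit(byte) == stored_parity


value = 0b01100001
stored = parity_bit(value)

print(parity_ok(value ^ 0b00000100, stored))  # one flipped bit
print(parity_ok(value ^ 0b00010100, stored))  # two flipped bits

# One parity bit per 8 data bits: 1 of every 9 stored bits is parity.
print(f"{1 / 9:.1%} of memory bits")
```

    Parity catches any single flipped bit in a byte but misses a double flip, and it cannot say which bit changed, so the only safe reaction is to stop. The 1-in-9 ratio is where the roughly 11% saving comes from.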

    Then IBM started to become "Market Driven." The market did not value parity memory, so eventually IBM dropped it from end-user PCs, but retained ECC memory for servers. At an internal IBM technical conference, I asked IBM executives whether anything had changed that made parity memory unnecessary in PCs. The heads of both the PC Division and IBM Microelectronics (IBM made its own memory back then) agreed that dropping parity memory was the wrong thing to do, from both technical and validity-of-results perspectives, but was something the market did not value.

    The follow-on was for the industry to create premium "workstation" class end-user computers with ECC. Unfortunately, as far as I know, no company offers laptops, notebooks, tablets, smartphones, or IoT devices with ECC.

  81. Re:ECC by Enter+the+Shoggoth · · Score: 1

    No boot ROM means that a hardware device constructed from discrete logic and analog chips directly demodulates digital data from the radio, addresses the memory, and writes the data. Once this process is completed, it de-asserts the RESET line of the CPU and the CPU starts executing from an address in memory. Really no ROM!

    Ok! (Very) remote pre-boot DMA, nice!

    Thanks for the expanded explanation.

    --
    Andy Warhol got it right / Everybody gets the limelight
    Andy Warhol got it wrong / Fifteen minutes is too long.
  82. ECC much? by MikeBabcock · · Score: 1

    ECC memory has been available for a long time and most servers use it, I have no idea why voting machines and other important devices wouldn't.

    --
    - Michael T. Babcock (Yes, I blog)