Serious Computer Glitches Can Be Caused By Cosmic Rays (computerworld.com)
The Los Alamos National Lab wrote in 2012 that "For over 20 years the military, the commercial aerospace industry, and the computer industry have known that high-energy neutrons streaming through our atmosphere can cause computer errors." Now an anonymous reader quotes Computerworld:
When your computer crashes or phone freezes, don't be so quick to blame the manufacturer. Cosmic rays -- or rather the electrically charged particles they generate -- may be your real foe. While harmless to living organisms, a small number of these particles have enough energy to interfere with the operation of the microelectronic circuitry in our personal devices... particles alter an individual bit of data stored in a chip's memory. Consequences can be as trivial as altering a single pixel in a photograph or as serious as bringing down a passenger jet.
A "single-event upset" was also blamed for an electronic voting error in Schaerbeek, Belgium, back in 2003. A bit flip in the electronic voting machine added 4,096 extra votes to one candidate. The issue was noticed only because the machine gave the candidate more votes than were possible. "This is a really big problem, but it is mostly invisible to the public," said Bharat Bhuva. Bhuva is a member of Vanderbilt University's Radiation Effects Research Group, established in 1987 to study the effects of radiation on electronic systems.
Cisco has been researching cosmic radiation since 2001, and in September briefly cited cosmic rays as a possible explanation for partial data losses that customers were experiencing with their ASR 9000 routers.
Whenever a user calls up to ask why his computer rebooted after I install an update, I say... drumroll, please... gamma radiation.
I was convinced that it was a lousy programming job by Microsoft, which pays more attention to fancy UX components than to stability. I am waiting for confirmation that Excel searching every known (network) drive for a license on every operation, whenever it can't connect to the online subscription service, must be due to dark matter. Unless it crashes when it tries to display that warning message; then it's just some cosmic ray again. So relieved!
When your computer crashes or phone freezes, don't be so quick to blame the manufacturer.
Why not? According to the article, it is a well-known phenomenon:
For over 20 years the military, the commercial aerospace industry, and the computer industry have known that high-energy neutrons streaming through our atmosphere can cause computer errors.
So if it is a well-known problem, and manufacturers are ignoring the problem and creating devices susceptible to such interference, why can I not blame the manufacturer for making hardware with known problems? I would blame the manufacturer if a hearing aid was picking up local radio stations, so why not here?
Cosmic rays ate the rest of this comment.
And this is why we have ECC RAM. It can detect and correct a single bit flip. Cosmic rays aren't likely to trigger multiple bit flips simultaneously in the same block of memory.
Sun blamed cosmic rays for causing CPU cache corruption and system crashes in their high-end enterprise systems. http://www.forbes.com/forbes/2...
This is why ECC is used to protect memory and data buses. At least on the good stuff :-) . One of the issues is die shrink. As the minimum detail size of the IC process gets smaller, the potential for radiation to flip a bit gets higher.
Silicon-on-sapphire is the main way to implement silicon-on-insulator, which is more resistant to radiation-induced bit flips and less likely to latch up. But since these have historically been required only for space satellites, they have been horribly expensive. Imagine running an entire IC fabrication just to make a few chips. As there are more applications for rad-hard chips, the price could fall.
Bruce Perens.
https://en.wikipedia.org/wiki/...
Oh THAT'S what happened to the emails. Global warming is bullshit, but those cosmic rays will getcha every time...yeah -_- Witness the birth of an oncoming onslaught of pathetic excuses. Not doubting the logic at all, especially given how mass power outages have happened because of this, but I've got a feeling someone will do research near "HAARP" and, "Oh no...Why god why!...oh well." ¯\_(ツ)_/¯
"Cosmic rays, man." -- Bethesda
Actually, wouldn't cosmic rays be capable of flipping bits even in ECC memory and processors, thereby making the whole ECC thing useless? Particularly in more recent process nodes, where the lithography scale is approaching atoms, and where cosmic rays would have a far greater effect?
Apple doesn't cover acts of God. It's actually in the warranty.
Peace, love, op amps and 33 1/3 rpm records man.
When your computer crashes or phone freezes, don't be so quick to blame the manufacturer.
If my computer crashes or phone freezes, it's almost certainly the fault of the person who released the software without properly debugging it. Cosmic rays are very low on the list of reasons why your device has malfunctioned.
Anons need not reply. Questions end with a question mark.
Follow through the links: a cosmic ray caused problems, the jets misbehaved for a bit but the duplicated systems protected them from a crash - as they are supposed to after a malfunction.
and sends it into orbit. They can just focus beams on anything and hope they'll hit and disrupt, and the target will never know what happened to their electronics. They will wreak havoc and cause conflict anywhere where it benefits them.
Shouldn't "News for Nerds" be news to nerds?
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Yeah, back in the 90's I definitely remember this being a big issue, but there's so much expected error already in the computations being done in modern CPU's, that gets fixed at the hardware level before it ever impacts software, I honestly haven't thought of it much more than in passing for the last 10 or 15 years. Technically it's always POSSIBLE, but the frequency of these incidents is almost completely negligible.
Actually, wouldn't cosmic rays be capable of flipping bits even in ECC memory and processors, thereby making the whole ECC thing useless?
No, this is what ECC is for. If a bit is flipped, you can detect it. If you have enough parity bits, you can even detect which bit is flipped, and correct it on the fly. Computation occurs as normal and an error shows up in the syslog.
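To make the "detect which bit is flipped, and correct it on the fly" part concrete, here is a minimal sketch using the textbook Hamming(7,4) code: 4 data bits plus 3 parity bits. Real ECC DIMMs use wider codewords and do this in hardware; the function names below are made up for illustration.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits as a 7-bit codeword (list indices 1..7, index 0 unused)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d   # data bits sit at non-power-of-two positions
    code[1] = code[3] ^ code[5] ^ code[7]    # parity over positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]    # parity over positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]    # parity over positions with bit 2 set
    return code


def hamming74_decode(code):
    """Return (nibble, syndrome). A nonzero syndrome is the position that was repaired."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                  # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1                  # a single flip shows up as its own position
    data = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(data)), syndrome


codeword = hamming74_encode(0b1011)
codeword[6] ^= 1                             # simulate one upset
value, fixed_at = hamming74_decode(codeword)
print(bin(value), "repaired position", fixed_at)
```

The syndrome is what would end up in the syslog entry: zero means clean, anything else names the bit that was put back.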
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
The odds of a cosmic ray hitting your memory at the exact right spot to flip a bit are one in hundreds of millions. There are just enough computers out there that it happens from time to time. The odds of FIVE rays hitting just the right locations to flip four bits and a parity bit are, pardon the pun, astronomical.
My Other Computer Is A Data General Nova III.
Even though market participants are warned about this by exchanges, you do have to wonder, if it makes it into the BOFH excuse calendar, can you really take it seriously?
Curiosity was framed; ignorance killed the cat. -- Author unknown
... during my IT career?
I could have used this as a dodge after I fucked something up in the system.
I did the sunspot thing back in 2012.
"Russia," seems to work well, though.
It little behooves the best of us to comment on the rest of us.
The cosmic radiation excuse has been used by manufacturers for 30 years, and in a lot of cases it boils down to the techs being too lazy, or not sufficiently skilled, to diagnose the actual issue.
In just about every case when customers have pushed and a more complete diagnosis made, the manufacturer has discovered an actual fault.
This is the case with the Cisco gear recently. From what I recall, Cisco quickly retracted the explanation in the face of industry ridicule.
If a particular model of equipment is suffering a high fault rate, then the simplest explanation is usually the most likely: there's a fault in the design.
Are people really less knowledgeable about computers now than they were in the 80's?
I'm certain it's on the list somewhere.
Have gnu, will travel.
This was proposed as an SDI weapon in the 1980s. And it wasn't just the US. Russia too, unless you don't believe they do stuff like that or have the capability.
Are people really less knowledgeable about computers now than they were in the 80's?
If you mean on average, I think the answer is probably yes. More people know how to operate them now, but then, operating them has become orders of magnitude simpler.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Yes, absolutely! Have you never sat down with an IT graduate from the 2000's to figure out what they actually know about computer hardware?
But much more frequently, problems are caused by somebody f**king something up. You shouldn't be looking to cosmic rays until you're pretty sure it's not just stupidity in action.
A thousand pounds of wood moving at 300 feet per minute. Don't get in the way.
Yeah, they kept making people after the 80s
it was a cosmic ray that did the act, not through anything from my intent.
---right after I installed Linux over Windows.
ECC only protects the things that ECC protects. Long lived data like the contents of RAM and the HDD/SSD should be backed up by some form of error check code simply because they're going to sit around under the assumption that everything is ok (and it might not be).
ECC checking the cache might be necessary but it's refreshed so frequently that it's unlikely to contain bad data. We also don't need correcting codes here because we can always re-fetch cache lines from RAM if they're bad, so we can use fewer bits.
ECC checking the registers might be necessary?
Also, radiation harden the processor die and the whole machine. Then radiation harden the building they're all in.
None of this guards against data handed off between components. If a bit is in flight on the bus or on a network cable and isn't protected by a check code then it might arrive flipped if a stray high-energy particle smashes into it on the way. Moving data needs to be verified as much as data at rest for complete correctness.
Nothing guarantees correctness better than resetting the machine to default state from verified storage. How many bits have been smashed in your RAM since you last rebooted? It might be none, or it might be hundreds. You hit a LSB and something is off by one. You hit a MSB and you're potentially off by trillions. You hit bit 12 and you're off by exactly 4096. You hit any bit in a machine instruction and crashes are pretty much inevitable or (worse) every subsequent piece of data moving through those instructions gets turned into garbage. A hard crash is actually the best outcome.
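The "off by one / off by 4096 / off by trillions" arithmetic is easy to check. A small sketch, assuming the value is a plain unsigned integer; the starting value and the helper name are arbitrary.

```python
def flip_bit(value, bit):
    """Return value with one bit inverted, as a single-event upset would leave it."""
    return value ^ (1 << bit)


count = 1000  # arbitrary stored value

for bit in (0, 12, 40):
    corrupted = flip_bit(count, bit)
    # The error is always exactly 2**bit, up or down depending on the old bit value.
    print(f"bit {bit:2d}: {count} -> {corrupted} (off by {abs(corrupted - count)})")
```

Bit 12 gives exactly the 4,096 from the Schaerbeek story, and bit 40 is already past a trillion.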
Your phone or computer crash is thousands of times (if not millions) more likely to have been caused by the manufacturer/coders error or fault than cosmic rays. Anyone that decides to consider cosmic rays as a more likely answer deserves to continue to experience their issues.
1) Many management cultures devolve to a point where slaves can only appear to be competent by having an ultra low MTTR, regardless of bug difficulty. This is necessary because the devolved management team cannot tell the difference between hard and easy bugs (or worse, they can, but they know their bosses are knuckle draggers, so they game the metric upwards to avoid getting bogged down on technical debt).
2) Often, a different engineer can pick up the cosmic ray bug and reproduce the problem, either because they have one of the few competent managers left (who has a career death wish), or they are pining for the annual layoff package.
3) Cosmic Ray Gnomes live amongst us. They are those magical creatures who can invoke celestial intervention every time they run their bug reproduction. There is no actual bug, but they fool peers and management with their dark magic. These Gnomes, when they reveal themselves, must be excised from the organization as soon as possible, before upper management falls under their magical spell.
Think of the children!
As a student intern working at a lab in 1992 my project was to build a cosmic hodoscope to record cosmic rays. It involved scintillators, fiber optics, HV photomultiplier tubes, a timing coincidence system, radiation sources for calibration and testing, etc. When two PMT's fired within the same timing gate window it was the result of a cosmic ray and we could determine the path of the cosmic ray. The interesting thing was that this showed that there are quite a few energetic cosmic rays reaching the surface of the earth and that they have no problem passing through the atmosphere, buildings, etc. It's very real.
prsdntl
Are people really less knowledgeable about computers now than they were in the 80's?
If you mean on average, I think the answer is probably yes.
If you mean on average out of the total number of computer users or programmers, then yes (they are less knowledgeable), because that pool has increased by lots and lots.
If you mean on average out of all people, then no. I suspect there are far more people that know what ECC does now than did in the 80's, and the total population count hasn't gone up as much as that number, so there are more people on average, and in total, that know about the inner workings of computers.
I think there are just far more people touching stuff they know very little about, and we assume they must know *something*, but they don't.
Compare it to early cars, where every operator had to know a bunch of stuff about it just to keep it running, but it was simple enough that the average operator could learn that stuff. Now, most cars make maintenance very difficult, and many drivers would be hard pressed to do simple things like changing the oil, flushing the radiator, replacing a brake light, replacing the battery, changing a tire, jump starting, etc. That said, there are WAAAAY more people that know WAAY more about cars now than there were in 1930. It's just shifted more to professional/hobbyist knowledge than something that every operator is required to know.
More people know how to operate them now, but then, operating them has become orders of magnitude simpler.
Is anyone surprised that if you store things once, and reference the one place alone, that you get screwed on occasion?
Is the word "corroboration" new? How about "validation", "authentication", "verification", and, oh, I don't know, "paper-trail"?
It's electronic information, not magic. The benefit of not carving into stone is that you can readily duplicate information into multiple places. Use it.
RAID.
Depends on the ECC algorithms.
You can design these algorithms to be able to detect an arbitrary number of mistakes in the data (on the bit level). You can also design them to be able to correct an arbitrary number of mistakes in the data.
Standard, everyday ECC RAM can detect up to two bit flips in every byte. It can also correct a single bit flip in every byte. One bit flip? Fix the problem, log a warning, move on. Two bit flips? Throw an unrecoverable, fatal hardware exception. Depending on where in memory it happened, the OS might kill a process, panic, or just carry on as if nothing happened (if the memory was unallocated, or could be rebuilt from source - eg, an in-memory hard disk cache.) If you see errors coming consistently from a given region of memory - mark it as bad, map it out, and log it so the stick can be replaced.
With more advanced techniques you can even recover from an entire memory chip failing completely. (Chipkill).
In any event: better to have at least some error detection, even if it doesn't have error recovery abilities, than none; it might not matter if a couple of frames of a game get corrupted, but it will matter if financial data has a couple of bits flipped...
Why didn't that voting machine have ECC memory? Why didn't the software have bounds checking?
Yes, I know it's common, I use some software (from a very large company that was run by a guy you don't go hunting with) that when it hits some input data with a negative integer IT ATTEMPTS TO ALLOCATE NEGATIVE MEMORY, and of course, crashes - but things that stupid should never happen (especially since it's supposed to deal with very noisy data). If it's out of range for a bit of code to work on then don't let it in! Don't just check in one place and hope that catches everything, check everywhere that out of bounds data is a problem.
LANL is located at 7000', higher than most other supercomputer installations. DOE labs often build the fastest supercomputers in collaboration with a vendor that has won such a bid. On one, SECDED memory was omitted to save money (there is a LOT more total memory in a supercomputer than anything an individual or any corporation would build). LANL experienced transient errors that could not be traced, and finally concluded that the altitude combined with the SECDED mistake was the root cause.
We've always known this. This is why we have ECC memory on servers.
Kriston
https://en.wikipedia.org/wiki/...
And, whaddaya know, you can buy them pretty much everywhere. For voting machines, medical applications, etc. they should obviously be used.
It's also why systems on spacecraft such as the Space Shuttle had what's called the Data Processing System. It consisted of four systems with identical software and an extra one with the same hardware but a different implementation with the same goals. They checked each others' decisions, and a majority "vote" would lock out the differing system.
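The "majority vote locks out the differing system" idea can be sketched in a few lines. This is only an illustration of redundant voting in general, not how the Shuttle's flight software was written; the function name and the sample outputs are invented.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (winning value, indices of the units that disagreed).

    If no value has a strict majority, return (None, every index) so the
    caller can fall back to a backup unit instead of trusting a tie.
    """
    if not outputs:
        raise ValueError("need at least one unit output")
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


# Four redundant units compute the same command; unit 2 took a bit flip.
command, locked_out = majority_vote([0x2A, 0x2A, 0x2A ^ (1 << 5), 0x2A])
print(hex(command), "lock out units:", locked_out)
```

A 2-2 split among four units returns None here, which is the case the independently written fifth system is there to break.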
Kriston
At ground level, SEUs are far more likely to come from terrestrial sources, or even radioactivity in the plastic the devices are packaged in. They are extremely rare. Cosmic ray generated events are even more so. In space, it is a different story. Even at airliner altitudes, radiation levels are 40 times what they are on the ground.
Isn't this why ZFS exists?
don't blame the human, often a ecstasy junkie. but never hire the pot smoking non-prostitute fucker who use to have programming as a hobby. besides looking for good porn.
We've known this since the 1980s...and the more dense/smaller the transistors get the greater the likelihood of it happening.
This is news, but it's literally from the previous century.
Just cruising through this digital world at 33 1/3 rpm...
All servers should have ECC memory at a minimum.
A manager, an engineer, a software developer and a manufacturer are returning from a convention. As they are driving down the peak of a mountain the brakes fail and the car goes careening down the road, bouncing off several guard rails before stopping at the bottom.
All four get out of the car, amazingly unhurt.
The manager says "I think we should hold a meeting to discuss the possible solutions to our problem."
The engineer says, "I think we should disassemble the car and do a structural analysis on each part to determine the cause."
The software developer says, "Let's push it up to the top and try it again."
The manufacturer says, "It must have been the cosmic rays."
Exactly nothing is done to examine whether the voting system is accurate, yet they expect us to believe in it.
You need to pass ISO 26262. The level of error detection and recovery details you need to go through to get your chips in a safety system is extreme.
Yep, it's good news. Very useful.
Dumb user error can be blamed on IT problems
IT problems can be blamed on computer glitches
Computer glitches can be blamed on cosmic rays
As a result, dumb user errors can and shall always be blamed on cosmic rays
aaaaaaa
It's just shifted more to professional/hobbyist knowledge than something that every operator is required to know.
Isn't that implied by the site we're on?
Well the marketing for high quality stuff is lousy. I buy phones around $100. There are cheaper phones, but they clearly do not meet my requirements. At this price, I can't tell if a given device will meet my needs or not. Perhaps I should go higher, but all devices fail to stand out at any price $100 or higher.
Still, there's much to be made by taking advantage of people who have more money than sense that sell equipment with minor issues on eBay because they don't know much about how to fix them. Just scored me what should be quite a deal. Still waiting for the machine to come though, to confirm my expectation.
Channel ID                     18            19            20            22
Total Unerrored Codewords      243285196329  243285196266  243285195305  243285196923
Total Correctable Codewords    1094          1439          1100          1342
Total Uncorrectable Codewords  16934         16642         17884         16943
Don't know what normal values are.
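One way to get a feel for counters like these is to turn them into rates. The sketch below only does the division; it assumes the three counters per channel are disjoint and add up to all codewords seen, which the post does not state, and it says nothing about what a healthy rate is.

```python
# Counters copied from the table above, keyed by channel ID.
channels = {
    18: {"unerrored": 243285196329, "correctable": 1094, "uncorrectable": 16934},
    19: {"unerrored": 243285196266, "correctable": 1439, "uncorrectable": 16642},
    20: {"unerrored": 243285195305, "correctable": 1100, "uncorrectable": 17884},
    22: {"unerrored": 243285196923, "correctable": 1342, "uncorrectable": 16943},
}


def codeword_rates(stats):
    """Fraction of all codewords that were correctable and uncorrectable."""
    total = sum(stats.values())  # assumption: the three counters do not overlap
    return {
        "correctable": stats["correctable"] / total,
        "uncorrectable": stats["uncorrectable"] / total,
    }


for channel, stats in channels.items():
    rates = codeword_rates(stats)
    print(f"channel {channel}: correctable {rates['correctable']:.2e}, "
          f"uncorrectable {rates['uncorrectable']:.2e}")
```

What stands out without any reference value is that uncorrectable codewords outnumber correctable ones on every channel.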
Because studies have been done to ascertain this information.
That's a good argument for Gray code.
I have to take issue with the assumption that nothing clears errors better than a hard reset. There are very many known strategies for dealing with errors on a running system, and a reset only clears persistent and cumulative errors, rather than transient ones. Since we can assume that your computer doesn't keep the same data in memory all of the time, most will be transient.
Bruce Perens.
Really earlier than that, Fermi expected it and had equipment shielded and double-shielded when testing the first nuclear bomb. But we should not confuse cosmic rays and EMP.
Bruce Perens.
This isn't either, but closer to a cosmic ray, just lower energy. Pointing a particle accelerator at warheads to fry their electronics. Which it would.
We did precisely this for NASA as part of a system we built, and I am very familiar...or was a long time ago...with radiation damage and failure modes of electronics in space. Sometimes the shielding can make things worse. Instead of going straight through a transistor, a collision can occur upstream sending a spray of other particles with the right energy to do damage. There are parts of the upper atmosphere that are more radioactive than the area above or below.
...and not a single Trump comment was made. So glad it's finally over.
If you're one of the lucky guys who run million dollar systems from IBM, you have probably been to their Austin facility, where they treat you to a couple days of presentations on how awesome their stuff is. We also giggled when they showed how they bombard their systems with various cosmic particles. But it's no joke.
ECC Memory isn't the only added cost, you also need a motherboard and processor that supports it.
For your information, ever since AMD's Athlon 64, most x86 compatible hardware has had its Northbridge *inside the processor package*.
That means that the memory controller is inside the package of your CPU.
The mother board is basically only traces that connect your CPU and the memory slots directly.
A glorified cable/connector.
(In practice, there is a bit more, regarding powering the RAM slots, etc. but you got the general idea : not much smarts in the motherboard between RAM and CPU.
Smarts is in the "Southbridge" : between the CPU and peripherals)
On the AMD side of things, nearly every CPU has ECC capability in its built-in memory controller.
For a motherboard to support ECC, it basically means just having a few instructions to activate it in the EFI/BIOS.
On the Intel side of things, it's marketed as an enterprise feature, so it's only available on the more expensive business/workstation hardware.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
SECDED codes can detect up to two errored bits per codeword, not per byte. In modern systems, a typical codeword is 64 bits of data plus 8 bits of parity (where multiple parity bits cover each data bit).
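For anyone who wants to see the 64+8 layout in action, here is a software sketch of one common textbook construction: a Hamming code with 7 check bits over positions 1..71, plus an overall parity bit at position 0, for 72 bits in total. Actual memory controllers use their own check matrices in hardware, so treat the bit layout and names here as assumptions.

```python
TOTAL_BITS = 72                                  # 64 data + 7 Hamming check + 1 overall parity
CHECK_POSITIONS = (1, 2, 4, 8, 16, 32, 64)
DATA_POSITIONS = [p for p in range(1, TOTAL_BITS) if p not in CHECK_POSITIONS]


def syndrome_of(bits):
    """XOR of the positions (1..71) of all set bits; zero for a valid codeword."""
    s = 0
    for pos in range(1, TOTAL_BITS):
        if bits[pos]:
            s ^= pos
    return s


def encode(word):
    """Spread a 64-bit word over the data positions and fill in the 8 check bits."""
    bits = [0] * TOTAL_BITS
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (word >> i) & 1
    s = syndrome_of(bits)
    for p in CHECK_POSITIONS:
        bits[p] = 1 if s & p else 0              # cancels the data bits' syndrome
    bits[0] = sum(bits) & 1                      # makes the total number of ones even
    return bits


def decode(bits):
    """Return (word, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    s = syndrome_of(bits)
    if sum(bits) & 1:                            # odd parity: treat as a single flip
        if s >= TOTAL_BITS:
            return None, "uncorrectable"         # syndrome points outside the codeword
        bits[s] ^= 1                             # s == 0 means the overall parity bit itself
        status = "corrected"
    elif s:
        return None, "uncorrectable"             # even parity, nonzero syndrome: two flips
    else:
        status = "ok"
    word = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return word, status


stored = encode(0xDEADBEEFCAFEF00D)
stored[37] ^= 1
print(decode(stored)[1])                         # one flip anywhere in the 72 bits
stored[50] ^= 1
print(decode(stored)[1])                         # a second flip in the same codeword
```

The unit of protection is the whole 72-bit codeword: two flips anywhere in it are detected but not fixed, no matter which bytes they land in.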
Compare it to early cars, where every operator had to know a bunch of stuff about it just to keep it running, but it was simple enough that the average operator could learn that stuff.
Are you really going back to early cars here? I mean, I think we can break it into basically three eras. The early age of cars was characterized by horseless carriages. The prior age of cars was ushered in around the 1930s or 1940s, where automatic transmissions appeared, the control layout became standard, and vehicles were pretty much all fully enclosed unless they were specifically designed to be a cabriolet. And the modern age of cars came with the O2 sensor, and self-tuning.
For the earliest cars, it was common to hire a driver and mechanic, because keeping the car moving was a full-time job. Maybe halfway through the period it became reasonable for people to maintain their own vehicles, as the reliability came up to the level where you didn't have to be an engineer to keep it going.
Obviously, the middle era was the time when any schmoe with a set of wrenches could fix a car. There was very limited availability of fluids, so vehicles were engineered to use what was ubiquitous, which was all the same. Vehicles were easy to maintain because they wasted a lot of space. On the other hand, reliability was nowhere.
Most modern cars are staggeringly reliable, but maintenance is a mixed bag. Oil changes tend to remain trivial, but transmission oil changes may be a massive PITA. You have to get the car flat and level and add fluid from the bottom while running on a disturbing percentage of modern vehicles, and there is no dipstick. A radiator flush is exactly as hard as it ever was, and you install a flush tee the same as ever. The battery, on the other hand, might be in the wheel well behind the plastic inner fender. Even if it's someplace supposedly convenient like the trunk, it might be a PITA to get in and out as it is in my A8. And you have to jump start from the battery, too. There's no redundant terminal under the hood. That would have just added weight and crap so they skipped it. The starter takes power from beneath the frame rail on the right side, you can apply power there if you have to but again, what a PITA. On the other hand, even reasonable estimates of the service intervals are all much longer than cars from the prior era. And on the gripping hand, nobody is meant to own cars like that for more than half a decade or so. They are for rich fucks who can afford to turn them over :)
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
I don't pay attention to the prices of extreme editions anymore (they have always looked like ripoffs), but back when Haswell was the thing and I was looking at prices of stuff for my home server (in the mistaken belief that I'd be transcoding video on it, which turned out to not really happen so the CPU ended up being totally overkill), the Xeon E3-1240v3 cost less than the common and similarly-specced non-extreme i7 model. (Maybe it was 100 MHz slower or didn't have the integrated graphics, or whatever. But I'm serious that it cost less.) If I paid an "extra" non-negative cost for ECC RAM compatibility, I think it had more to do with the relatively expensive SuperMicro motherboard that I used. And the RAM itself cost a little more, but it really wasn't much.
Once you get above low-end, ECC is nearly free: low enough that it's totally overwhelmed by all the other costs of your build. Even with Intel. (And as many people will point out, with AMD it's even cheaper.) Shit, I spent more on fans than ECC cost me. I spent more on SFF-8470 cables than ECC cost me. I spent more on the UPS than ECC cost me. ECC is one of those things where if you think the machine's job is important enough, it's trivial and costs less than a lot of other less-sexy, more-dubious things.
Why didn't that voting machine have ECC memory? Why didn't the software have bounds checking?
Because if one bit-flip changes the totals by more than 1, then the software was designed wrong.
Each vote should be a separate record - the totals should only be a summation. You can keep a running tally separately as a backup record, but that should not be your only count. If one bit flips, one vote changes - not one bit on the total.
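A rough sketch of the difference, assuming each ballot is stored as a small integer index into a candidate list. The candidate names and vote counts are invented; nothing here reflects how the Belgian machine actually stored its data.

```python
CANDIDATES = ("Alice", "Bob", "Carol")           # hypothetical names


def tally(ballots):
    """Sum per-ballot records, rejecting any record that is out of range."""
    totals = [0] * len(CANDIDATES)
    rejected = 0
    for choice in ballots:
        if 0 <= choice < len(CANDIDATES):
            totals[choice] += 1
        else:
            rejected += 1                        # a corrupted record cannot name a candidate
    return totals, rejected


ballots = [0] * 500 + [1] * 300 + [2] * 200
clean_totals, _ = tally(ballots)

# Design 1: only a running total is kept, and bit 12 of it flips.
running_total = clean_totals[0] ^ (1 << 12)

# Design 2: every ballot is kept, and bit 12 of one record flips.
damaged = list(ballots)
damaged[7] ^= 1 << 12
damaged_totals, rejected = tally(damaged)

print("running total is off by", running_total - clean_totals[0])
print("per-ballot totals", damaged_totals, "rejected records", rejected)
```

With separate records the same upset costs at most one vote and leaves a rejected record behind as evidence, instead of silently adding 4,096.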
Only n00bs think it's a glitch...real programmers use butterflies. They open their hands and let their delicate wings flap once. The disturbance ripples outward, changing the flow of the eddy currents in the upper atmosphere. These cause momentary pockets of higher-pressure air to form, which act as lenses that deflect incoming cosmic rays, focusing them to strike the drive platter and flip the desired bit.
sudo rm -r -f --no-preserve-root /
Oh yes, DO blame the manufacturer.
There is no acceptable reason to not have ECC on every layer (hint: even the L1 cache in your processor does ECC, and it is likely the fastest memory there is in your entire system, TCAM SRAM on switch forwarding engines included), and a simple ECC or FEC (heck, even parity will do) on the serial buses. Cosmic rays almost never manage to toggle more than a single bit.
No, seriously, what are the odds of a cosmic ray flipping a bit?
0.1, 0.000000001, 1e-15, 1e-30?
It's easy to blame cosmic rays, but a subtle bug is far more likely.
.
Except, you know, all the people who HAVE actually looked for voter fraud and have found nothing that would affect a result.
Doesn't matter when the electorate is flooded with morons.
ECC, RAID and some sort of parallel computation should be in place for voting systems. It should be possible to run the same code on multiple processors and check that the results are the same in real-time.
Worked for a modem manufacturer for years. 2 things we could guarantee would cause lots of dead modems or devices attached to them. Thunderstorms and solar flares. This seems to just be an extension of the second.
Same on any industrial safety system. Often these are triplicated or quadruplicated. I actually prefer triplicated since you don't end up with an even vote on a situation.
Except that they found irregularities that were ignored. And that wasn't a very widespread effort. You can't trust a flawed system. Most of the time, that doesn't matter, but for the voting system to matter, we must have confidence in it.
That's a good argument for Gray code.
No it isn't.
Gray code guarantees that numbers adjacent in value have encodings that differ by only one bit. It does not guarantee the converse, that numbers that differ by one bit are close in value.
Data formats are designed without taking bit flips into account even today.
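The one-way nature of that guarantee is quick to demonstrate with the standard reflected binary Gray code. A small sketch; the function names are arbitrary.

```python
def to_gray(n):
    """Reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def from_gray(g):
    """Inverse of to_gray: XOR together g and all of its right shifts."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


# Adjacent values always differ in exactly one bit of their encodings...
for n in range(15):
    assert bin(to_gray(n) ^ to_gray(n + 1)).count("1") == 1

# ...but flipping one bit of an encoding can move the value a long way.
stored = to_gray(0)
upset = stored ^ 0b1000
print(from_gray(stored), "->", from_gray(upset))
```

In a 4-bit code, flipping the top bit of the encoding for 0 decodes as 15, the farthest value available, so Gray code offers no protection against upsets.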
No, seriously, what are the odds of a cosmic ray flipping a bit?
Scientists do study this. The estimate is that a typical computer will have a hit about once a year.
The circuits get smaller, the chips get bigger and more devices are used. It seems to all cancel out and the odds have been about the same for 40 years or more.
But you are correct, software errors are far more likely.
We used to deal with this when electronics were bombarded with neutron radiation when they were installed in the containment building.
Voting algorithms and redundancy (as well as using less dense memory and processor dies) were the key to preventing these bit flips becoming an issue.
An entire field of study devoted to this exact problem (Radiation Effects) has been around since the 70's at least for aerospace purposes. That said, neutrons have only become a serious concern for terrestrial applications in recent years, as process geometries have gotten small enough and parts have gotten dense enough that neutron upset becomes less of an occasional annoyance and more of a constant problem.
Generally you test electronics for Single Event Effects (SEE) at a cyclotron, and it used to be NASA, Lockheed, Boeing and the like who were doing all the testing. More recently, though, Cisco and Intel have begun doing a lot of testing of their own. Cisco is known to put an entire server rack at a time in a neutron beam to see what goes boom.
It's called market segmentation, and Intel are WHORES for keeping ECC within the Xeon lineup only.
Good point. Personally I think things like the Diebold voting machines (designed by a convicted fraudster!) fail on many levels. I'm a big fan of very simple paper ballots and big high speed scanners to collate everything. When something is contested (which seems to happen someplace in just about every election anywhere) paper ballots allow a fallback all the way to manual verification if necessary.
ECC memory has been available for a long time and most servers use it, I have no idea why voting machines and other important devices wouldn't.
- Michael T. Babcock (Yes, I blog)
Excelsior!
Device manufacturers are aware of the source of device errors. Ordinary bugs and malware are responsible for something like 99.999999% of all device errors.
While cosmic rays are a thing, they aren't a very big thing, and there are very robust, well-developed, and widely available systems that can handle such environmentally induced errors. The biggest, cheapest, and most low-hanging fruit-wise of these has to be ECC technologies.
Seriously, if you are going to have a geek-induced panic attack about cosmic rays, then equip your devices with ECC RAM. Then move on, you've solved it, it's as simple as that.
The only device consumers with serious radiation issues are space agencies, the makers of nuclear power plants, and specialty radiation monitoring & cleanup companies.
This is not news. Back around the turn of the century, there was a considerable effort to repair CPU Cache chips on Sun gear because they could develop serious parity errors. One of the agents listed was 'cosmic rays'...
Sorry, no link, had to have been 1999 or so.