Cisco Blamed A Router Bug On 'Cosmic Radiation' (networkworld.com)
Network World's news editor contacted Slashdot with this report: A Cisco bug report addressing "partial data traffic loss" on the company's ASR 9000 Series routers contended that a "possible trigger is cosmic radiation causing SEU [single-event upset] soft errors." Not everyone is buying: "It IS possible for bits to be flipped in memory by stray background radiation. However it's mostly impossible to detect the reason as to WHERE or WHEN this happens," writes a Redditor identifying himself as a former [technical assistance center] engineer...
"While we can't speak to this particular case," Cisco wrote in a follow-up, "Cisco has conducted extensive research, dating back to 2001, on the effects cosmic radiation can have on our service provider networking hardware, system architectures and software designs. Despite being rare, as electronics operate at faster speeds and the density of silicon chips increases, it becomes more likely that a stray bit of energy could cause problems that affect the performance of a router or switch."
Friday a commenter claiming to be Xander Thuijs, Cisco's principal engineer on the ASR 9000 router, posted below the article, "apologies for the detail provided and the 'concept' of cosmic radiation. This is not the type of explanation I would like to see presented to the respected users of our products. We have made some updates to the DDTS [defect-tracking report] in question with a more substantial data and explanation. The issue is something that we can likely address with an FPD update on the 2x100 or 1x100G Typhoon-based linecard."
Cosmic
I'm not saying it was aliens, but...
It was aliens!
would be another explanation.
Slashdot, fix the reply notifications... You won't get away with it...
If it's cosmic radiation, wouldn't it affect more than the ASR 9000? Or is that the only model without a lead case?
Is that a roll of dimes in your pocket or are you happy to see me?
even if you have a strong support organization, one slacker responding with this to a customer, and the entire brand is tarnished.
Anybody can work under ideal circumstances. -- Jeff K. (January 4, 2001)
I work at a Fortune 500 and I had to explain this to management just a few years ago on a Cisco 6500. It was a tough sell but I recall having a similar issue in the late '90s/early 2000s with Sun hardware so it isn't new. That one was even better to explain. The Sun's cosmic rays were causing the Sun's hardware to break!
I'm guessing that they've read the BOFH, but realized that there's much more reporting on solar-induced radiation ... so just decided to go with 'galactic' instead. .... completely forgetting that if this were the case, it would happen more frequently at high latitudes, due to the magnetosphere. And we'd also see a higher incidence rate after solar x-ray flares and solar particle events.
(and the disclaimer: I work for the Solar Data Analysis Center, but I'm not a scientist, and don't speak for my place of work, etc, blah blah blah)
Build it, and they will come^Hplain.
Sun Microsystems already pulled this bullshit back about 15 years ago... I don't really recall if it was a bad batch of processors or if it was bad non-ecc cache memory or whatever... but I do remember plenty of folks giving them a ration of shit and generally refusing to buy hardware from them after that... though once they fessed up to the problem and replaced all of our defective systems(and gave us a couple of free systems) we never had any further issues.
Trying to claim acts of God to get out of being responsible?
Because working with geosynchronous satellites I learned that it mostly affects them, and satellites in LEO aren't affected very much, and the ground has quite a bit of atmosphere for additional shielding. I don't work with stuff on the ground anymore. Do you not see that I said things like "I wonder" and such? Just provide an answer and don't be a douche.
This makes me wish I was still working for them in IOS Engineering for the opportunity just to stir some shit.
I'd have gone into the office on Monday morning with my head covered in tinfoil.
I used to get some good laughs in the Cisco office, I seem to be getting more on the outside these days.
Oh how the mighty have fallen....
They will be losing market share... us included... we're a global company, with a size akin to a fortune 100. We're pulling the plug and moving to Juniper. Too much horse shit from Cisco. It's a fucking nightmare to get a quote, or even get an order filled CORRECTLY. We get the wrong shit sent all the time, and Cisco says the internal RMA process is so tedious, they tell us to throw the wrong equipment in the garbage or just keep it.
They don't stock shit. Everything takes weeks. RMA'd equipment takes weeks. Half the time we get these vague emails saying oh sorry your switch broke... we don't have shit in the warehouse... you know, because why would you ever stock something like a Nexus class switch... so we sent your RMA over to manufacturing... which is also backed up... therefore you *should* get your replacement Nexus 5k in 6-7 weeks. Good thing you purchased "SmartNet" with uber-fast advance replacement coverage. Have a nice day.
One of the VP level Engineers (title is "distinguished" or something exalted like that) told me over lunch a couple of years ago that Chambers had said to him he wasn't interested in R&D. If there was a technology he needed, he'd buy it.
The problem is that Cisco climbed to the top using IBM strategies and thinking which were focused on delivering "end to end" solutions to customers.
They had no interest in box shipping. Those were just lego bricks and logistics. You can imagine how soul destroying that was to be a Cisco engineer.
Bugs were a bonus to them as they sold annual maintenance contracts for roughly the same cost as the gear they sold.
Now that the router/switch market has matured and commoditised they care even less about the quality of those boxes they have to ship.
Their focus is entirely on the "service" level.
They will eventually become another IBM. I was trying to think of a real tangible product that IBM made and sold just the other day. Do they?
As discovered by IBM back in the 70s, if it is a radiation induced upset, you'd see higher rates in places like Colorado vs Sea Level, and on upper floors of building vs lower floors.
As FOLDOC explains, Intel tested this idea decades ago by putting one board in a 25 ton lead safe and another outside to see if there was a measurable difference in bit rot. There wasn't. "Further investigation demonstrated conclusively that the bit drops were due to alpha particle emissions from thorium (and to a much lesser degree uranium) in the encapsulation material." They ended up redesigning the memory to be more resistant to the effect.
Good, inexpensive web hosting
Too bad you resort to the logical fallacy of attacking the speaker instead of the argument. Plus this time it wasn't even an argument, so that just makes you a pure troll.
Santa Claus spread chemtrails in the sky with which the easter bunny got stoned and confused causing the routers to crash!
Hey, it's not impossible!
That was in his excuses rolodex.
We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
I used to use that all the time. Now I'll have to think of something else..
Flips of a single bit in a memory or register are common enough that few modern systems would run for long without error-correcting memory. Even ECC has its limitations, and most systems eventually crash/panic/blue-screen or whatever and require a reboot.
The costs to improve error resilience go up rapidly and don't have a meaningful upper bound. My working trade off was to design for a mtbf comparable to how long I wanted to keep that job.
I think they're still selling POWER-based systems.
I remember in the 70s some memory manufacturer used a ceramic package that had a lot of thorium. Bad trouble.
...a Cosmic Brownie?
http://cosmicbrownies.littlede...
You were mistaken. Which is odd, since memory shouldn't be a problem for you
It shouldn't be a huge expense to build in some form of error correction to catch that sort of thing.
Chas - The one, the only.
THANK GOD!!!
My wife was looking over my shoulder when the "Cisco Blamed A Router Bug on 'Cosmic Radiation'" headline went by, and asked:
"What's their next excuse? Global Warming?"
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
The Earth's magnetic field is great for deflecting most of the stuff that comes from the Sun, but cosmic rays, as in the stuff from outside the solar system, include a lot of high to very high energy particles that are not deflected much by the same field. It can actually be worse in the atmosphere than in orbit too, depending on your exact setup. A single high energy particle directly hitting something in space might deflect a single atom, but a similar particle hitting the upper atmosphere deposits enough energy into that atom to repeat the process, and you get a whole shower of particles. Still, the net effect tends to be worse in space.
But from a practical standpoint, there is plenty of work for various devices showing a failure rate that increases with elevation. It has been a while since I've seen such work on digital systems, but you get even analog devices like IGCTs where failure rate increases with altitude, since when under voltage a small, fast conducting channel can cause the current density to spike up faster than it can spread out.
I clearly have no argument because I was curious about something. Pointing that out doesn't mean anything.
Thank you. There seem to be a lot of parallels to space systems. I would imagine it is still as big of an issue on the ground because they do not put the same thought in protecting from it, as they do in orbit.
A White House health report addressing "partial data traffic loss" on Secretary of State Hillary Clinton contends that a "possible trigger is cosmic radiation causing SEU [single-event upset] soft errors." Not everyone is buying: "It IS possible for bits to be flipped in memory by stray background radiation. However it's mostly impossible to detect the reason as to WHERE or WHEN this happens," writes a Redditor identifying himself as a former [technical assistance center] engineer...
Because working with geosynchronous satellites I learned that it mostly affects them, and satellites in LEO aren't affected very much, and the ground has quite a bit of atmosphere for additional shielding.
While satellites and space vehicles may be hit with a lot more sub-atomic particles, some particles do reach the earth's surface, e.g., muons, which results in about 13 to 14 neutrons/cm^2/hr with sufficient energy to flip a bit on a chip (depending on latitude, longitude, altitude, etc.) All chip companies that care to be concerned about it know the estimates for errors/Mbit for their process technology. Some extra effort is needed to estimate the amount of error masking due to the micro-architecture and software. Many/most companies design in sufficient error detection or correction to bound the expected chip error rate to an acceptable level. However, that acceptable error rate depends on the customer. For supercomputing, nuclear control systems, and self-driving cars, the acceptable error rate is quite small, but for networking ASICs, the chip error rate may be allowed to be higher with software and network protocols providing additional error mitigation.
They will eventually become another IBM. I was trying to think of a real tangible product that IBM made and sold just the other day. Do they?
They sell more mainframes now than they did in 1970, believe it or not. One z13 can support 8,000 Linux VMs simultaneously. Cool looking box, too.
"First they came for the slanderers and i said nothing."
Thank you. I didn't realize it was so high on the ground. I know the technology for hardening is available, but clearly that costs money, and if patented that would add way too much for low threat situations. It makes a lot of sense for network technology, as you said, it has plentiful error correction and retransmission protocols as it is.
I can't claim to know the cause, but I have seen one proven case of a bit flip affecting processing on a machine that ran fine before and after the incident. Since it was doing batch processing I was able to re-run the job with identical inputs. It never made the error again.
Could have been a cosmic ray, alpha decay, power glitch that oddly didn't affect the other machines on the same circuit, who knows?
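The re-run check described above can be sketched in a few lines. This is only an illustration, assuming the job is deterministic for a given input; `run_twice_and_compare` and the `job` callable are hypothetical names, not anything from the poster's setup.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest of a job's output."""
    return hashlib.sha256(data).hexdigest()


def run_twice_and_compare(job, inputs: bytes) -> bool:
    """Run a deterministic job twice on identical inputs.

    Returns True when both outputs match. A mismatch points at a
    transient fault (bit flip, power glitch, ...) rather than a bug
    in the job itself, although it cannot say which kind of fault.
    """
    first = digest(job(inputs))
    second = digest(job(inputs))
    return first == second


if __name__ == "__main__":
    # Hypothetical stand-in for a batch job: reverse the input bytes.
    print(run_twice_and_compare(lambda b: b[::-1], b"batch input"))
```

A match does not prove the first run was clean, only that the two runs agree; comparing against a stored digest from a known-good run is the stronger check.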
I'm definitely going to read up on this surface SEU more. I have always attributed things you mention to non-ECC ram in the past.
Yet she still had time to write the Realm-Jumper Chronicles, 5 volumes and counting. Colour me impressed.
Il n'y a pas de Planet B.
99.99...% of SEUs have nothing to do with cosmic radiation.
Let me guess, you are a millennial, and too cool to use Google?
https://en.wikipedia.org/wiki/ECC_memory#Problem_background
Watch this Heartland Institute video
When I was a physics teacher I had an ongoing memory error problem with my Fujitsu Siemens laptop which led to frequent BSOD. I replaced the memory, and it still occurred. I then noticed the memory error happened frequently at work, but never at home. I wondered whether it could be a radiation issue, as I handled radioactive sources at my desk. I got my tech to do a leak check on my desk. It showed there was higher-than-background levels of radiation (can't recall whether alpha or beta) around my desk. This only showed up using a fairly decent G-M tube which had been given to us by the local hospital when they were having a clearout. Turns out the source of radiation was dust from a piece of fossilised wood I'd picked up some time previously. It had been sitting on my desk and zapping my laptop's memory. I sealed the fossil in a Ziplock bag and kept it in a Quality Street tin. The problem never recurred.
I have always attributed things you mention to non-ECC ram in the past.
Non ECC ram has exactly the same number of bit-flips as ECC ram. It's just that ECC lets you know it happened (as would parity) and can fix it for you (if only one bit flipped).
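To make the "detect and fix one flipped bit" part concrete, here is a minimal sketch using the textbook Hamming(7,4) code. It is an illustration only: real ECC DIMMs use a wider SECDED code (typically 64 data bits plus 8 check bits), and the function names below are made up for this example.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome is the 1-based bad position
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a single-event upset in one cell
    print(hamming74_decode(word))
```

Two flips in the same codeword defeat this plain code (it "corrects" the wrong bit), which is why memory ECC adds an extra overall parity bit to at least detect double errors.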
Watch this Heartland Institute video
There has been assloads of research on mitigating soft errors going back to the 1970’s. I’ve published some myself. There is no shortage of workable methods on masking transient errors in logic and bit flips in DRAMs. SEUs are a major problem for supercomputers, so their memory systems have sophisticated mechanisms for catching them.
If Cisco is blaming this on SEUs, that just proves their incompetence, since they obviously didn’t spend 5 minutes with Google Scholar looking at hundreds of GOOD papers (in the top conferences and journals) on this topic. Seriously.
PLUS, if something goes wrong, even if it IS a transient error, it’s FAR more likely to be a fixable bug than radiation. We had a weird bug in a DRAM controller whose state kept going invalid. We had to add another circuit to fix that. We *called* it a cosmic ray deflector, but the more likely causes, in order, were (a) another bug we couldn’t find, (b) a timing violation caused perhaps by voltage or temperature fluctuation, or (c) crosstalk in the circuit. We would have kept looking, but this deflector circuit made it robust to hundreds of hours of slamming the memory system, so we let it go. (Also, it was graphics memory, so even if it did ultimately suffer a glitch some day, it would go unnoticed.)
I seem to run across a fair number of places with AS/400 stuff. What's kind of interesting is the AS/400 stuff and people seem to run in this parallel universe IT department with their own staff somehow immune from the other pressures of the rest of the IT department.
Once in a blue moon I'll hear mention that some kind of AS/400 update or installation is happening, so it's not like they're strictly legacy systems. And at longer term clients with AS/400 I occasionally see something new/different in the "AS/400 rack".
I don't know what IBM's growth potential is or how at risk their active businesses like AS/400 are from being eaten by Wintel/Lintel systems, but they sure seem to have carved out a niche that seems nearly immovable.
I guess in this case it is "the same thing" ... the silicon from which they made some of the chips involved was not pure enough, or the material for doping was contaminated.
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
I have had Cisco tell me this many times any time a router reboots from a parity error for over 15 years now, so they have been using this for a long time now.
It could indeed be possible. Alpha particles are well known to be capable of causing bit-flips in capacitive memories (DRAM). This is exactly why we have things like ECC on memory pathways. That said, it's not the only explanation. There are ways of testing this. For example, observing the general abundance and frequency of such particles in a bubble chamber, and attempting to correlate to instances of error. Or placing equipment in a shielded environment and seeing if the frequency of errors changes. Long story short: it MAY be true, but if you want to draw a conclusion, you really have to offer more data to prove it.
Of course it depends where you are on the ground. I used to work in a data center in Colorado Springs, at about 6000 feet altitude. We saw quite a few correctable memory errors in the logs (and a few random crashes).
Might have been cosmic rays, might have been radiation from the mountain of granite (Pikes Peak) we were in the shadow of.
Either way, if the errors occurred several times in the same DIMM, it was probably bad memory and we replaced it. The odds of cosmic rays hitting the same DIMM every few days or so are pretty remote.
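A rough way to put a number on "pretty remote" is to treat random upsets on one DIMM as a Poisson process. The sketch below assumes a made-up rate of one soft error per DIMM per year; that figure is a placeholder for illustration, not a measurement from this data center.

```python
import math


def prob_at_least(n, rate_per_day, days):
    """P(at least n events in `days`) for a Poisson process."""
    lam = rate_per_day * days  # expected number of events in the window
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(n))
    return 1.0 - below


if __name__ == "__main__":
    # Hypothetical rate: one random upset per DIMM per year.
    rate = 1 / 365
    print(prob_at_least(3, rate, 7))  # three hits on one DIMM in a week
```

With any plausible random rate, several hits on the same module within days is far more likely to be a marginal part than coincidence, which matches the replace-it policy described above.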
-- Alastair
My reaction when I first heard the "cosmic radiation" excuse for misbehaving electronics.
With decades of experience in tech implementations in radiation fields I can personally attest to the fact that the radiation flux levels needed to cause reactions in electronics could only be high enough due to cosmic radiation at elevations higher than 20,000 feet. The levels need to be in Rad per hour rather than the microrad per hour that you get from cosmic radiation. (i.e. background at sea level is often 15-20 microrem/hr in the day and 3-5 microrem/hr at night with the difference due to cosmic radiation. In a 5 Rad/hr field, 5000000 microrem, the lifetime of electronics is weeks if not days before the semiconductors fail from ionization of the doping in the material.) This is for electronics other than radio transmissions as radio transmissions can experience interference in transmission due to ionization in the atmosphere. (thunderstorms do that too) Low power short range such as wifi is much less affected than long range skywave or aimed microwave. And radio interference is not an issue in the electronics but with interfering transmissions from mama nature. Cisco was so obviously full of a certain word that rhymes with their name.
NRRPT/RCT
You just need the right gadget.
Yup, that's obvious. I'm dumb not to have thought of that: ECC == more cells / usable bit == more events / usable bit.
We had a fault in a high altitude aircraft (>60K feet) that we are pretty confident occurred when a cosmic ray flipped a bit inside the error correction circuitry.
Hilarious. As you add protection you add places to be broken :-( Need error correction circuitry for the error correction circuitry.
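One common answer to "who protects the protection" is triple modular redundancy: keep three copies and take a bitwise majority. A minimal sketch (the function name is made up here), not a claim about what that aircraft system used:

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies of a word.

    Each output bit is 1 when at least two of the three inputs have a 1
    in that position, so a flip in any single copy is outvoted.
    """
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    word = 0b10110010
    upset = word ^ (1 << 3)  # one copy takes a single-bit hit
    print(bin(majority_vote(word, upset, word)))
```

The voter itself is still a single point of failure, which is exactly the regress being joked about; in practice it is kept small so its capture area is tiny compared with what it protects.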
Watch this Heartland Institute video
https://en.wikipedia.org/wiki/ECC_memory#Problem_background
Actually, there was a packaging material that caused this problem, back in the late '60s. They researched it and stopped using that, but were astonished to find it was still not completely cured. That's when the "Science-fiction" idea of cosmic ray impacts started to be believed.
(Strange how new people assume the "ancients" were wrong and ignorant, when they were usually smarter than the newbies.)
Of course it depends where you are on the ground. I used to work in a data center in Colorado Springs, at about 6000 feet altitude. We saw quite a few correctable memory errors in the logs (and a few random crashes).
The soft error rate due to cosmic rays is an order of magnitude higher at Colorado Springs than at sea level. For aircraft it is another order of magnitude higher but they have a very small capture area; I assume safety critical aircraft systems are designed with this in mind. Placing systems in a deep basement has the opposite result due to greater shielding.
A lot of memory*hours are needed for it to become a significant problem which is why systems with either lots of memory or systems which run continuously for long times (or both) are the most affected. Oddly enough, the error rate is also proportional to memory bandwidth because use of the logic controlling the memory array increases the capture area.
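The memory*hours scaling can be written down directly. Soft error rates are usually quoted in FIT (failures per 10^9 device-hours); the per-megabit FIT value below is a placeholder picked for illustration, and the 10x altitude factor simply reuses the "order of magnitude" figure from the comment above.

```python
def expected_soft_errors(fit_per_mbit, megabits, hours, altitude_factor=1.0):
    """Expected number of upsets for a given amount of memory and uptime.

    fit_per_mbit    -- failures per 1e9 hours per megabit (assumed value)
    altitude_factor -- multiplier relative to sea level (assumed value)
    """
    return fit_per_mbit * megabits * hours * altitude_factor / 1e9


if __name__ == "__main__":
    mbits = 64 * 1024 * 8  # 64 GiB expressed in megabits
    year = 24 * 365        # hours of continuous uptime
    fit = 1000             # hypothetical FIT per Mbit, not a vendor figure
    print(expected_soft_errors(fit, mbits, year))        # sea level
    print(expected_soft_errors(fit, mbits, year, 10.0))  # Colorado Springs, per the 10x claim
```

The product form is the point: doubling either the memory or the uptime doubles the expected count, so large, long-running systems see the problem first.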
More likely a bug in the code that the NSA has inserted into all of their routers.