Slashdot Mirror


Cisco Blamed A Router Bug On 'Cosmic Radiation' (networkworld.com)

Network World's news editor contacted Slashdot with this report: A Cisco bug report addressing "partial data traffic loss" on the company's ASR 9000 Series routers contended that a "possible trigger is cosmic radiation causing SEU [single-event upset] soft errors." Not everyone is buying: "It IS possible for bits to be flipped in memory by stray background radiation. However it's mostly impossible to detect the reason as to WHERE or WHEN this happens," writes a Redditor identifying himself as a former [technical assistance center] engineer...
"While we can't speak to this particular case," Cisco wrote in a follow-up, "Cisco has conducted extensive research, dating back to 2001, on the effects cosmic radiation can have on our service provider networking hardware, system architectures and software designs. Despite being rare, as electronics operate at faster speeds and the density of silicon chips increases, it becomes more likely that a stray bit of energy could cause problems that affect the performance of a router or switch."

Friday a commenter claiming to be Xander Thuijs, Cisco's principal engineer on the ASR 9000 router, posted below the article, "apologies for the detail provided and the 'concept' of cosmic radiation. This is not the type of explanation I would like to see presented to the respected users of our products. We have made some updates to the DDTS [defect-tracking report] in question with a more substantial data and explanation. The issue is something that we can likely address with an FPD update on the 2x100 or 1x100G Typhoon-based linecard."

17 of 145 comments

  1. Not buying it by lsllll · · Score: 2

    If it's cosmic radiation, wouldn't it affect more than the ASR 9000? Or is that the only model without a lead case?

    --
    Is that a roll of dimes in your pocket or are you happy to see me?
    1. Re:Not buying it by slowdeath · · Score: 5, Interesting

      Sorry, but cases such as this exist.

      Back around 1999/2000 I was with Cisco engineering on the GSR 12000 (the first Cisco service provider class router).

      We did send a system to a POP in Denver (altitude 5000+ ft) and saw on this system a statistically significant increase in recoverable memory ECC errors.

      When the affected board was returned to San Jose and retested (basically sea level) the errors could not be reproduced.

      So we returned the hardware back to the Denver POP, and the recoverable ECC errors returned. No amount of swapping memory DIMMs (various vendors) made a difference.

      Any satellite hardware designer will tell you that cosmic radiation is a big deal for satellite design. And lead shielding is not a cost effective option in space.
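
The Denver versus San Jose comparison above is a claim that two error rates differ by more than chance. A minimal sketch of one way to check such a claim, assuming ECC errors arrive as independent Poisson events: if both sites had the same true rate, then each of the observed errors would land at the high-altitude site with probability equal to that site's share of the total test hours, so the count there follows a binomial distribution. The function name and the counts below are hypothetical and are not taken from the comment.

```python
from math import comb

def one_sided_p_value(k_high, t_high, k_low, t_low):
    """Chance of seeing at least k_high of the errors at the high site
    if both sites actually had the same underlying error rate.

    k_high, k_low: error counts at the high- and low-altitude sites
    t_high, t_low: hours of observation at each site
    """
    n = k_high + k_low
    # Under equal rates, each error falls at the high site with this probability.
    p = t_high / (t_high + t_low)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_high, n + 1))

# Hypothetical counts for illustration only (not data from the thread):
# 30 correctable errors in 1000 hours at altitude, 10 in 1000 hours at sea level.
p_val = one_sided_p_value(k_high=30, t_high=1000.0, k_low=10, t_low=1000.0)
print(f"one-sided p-value: {p_val:.5f}")
```

With these made-up counts the p-value comes out well below 0.01, which is the kind of result that would justify calling the increase statistically significant. Equal counts over equal hours give a p-value above 0.5, as expected.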

    2. Re:Not buying it by Richard+Kirk · · Score: 2

      It is a reasonable explanation. Memory has parity bits. There are random faults from the various sorts of noise you can get in semiconducting circuits, but parity gives you a safety net that will catch the occasional flipped bit. Your computer will be catching these sorts of errors all the time. The problem with cosmic rays is that they are very energetic, so they can pass through a lot of matter, but when they collide they generate a tight cone of ionising particles that knocks out electronics in a small region of circuitry. This can flip a number of bits in the same region of memory, so it can become possible for the memory to get corrupted but the parity bits (or Reed-Solomon codec or whatever) to think that everything is ok. This is still unlikely, but it is much more probable than if the flips were happening independently at random. There is nothing sensible you can do about this other than run the same calculation a second time and see whether you come up with the same answer. This is what we used to do with large codes that ran on Cray Y-MPs back in the 80s, and cosmic rays set the limit on complex calculations.
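
To illustrate the parity point above, here is a minimal sketch assuming a single even-parity bit per 8-bit word. Real memory systems use stronger codes, and the helper names here are made up for the example. One flipped bit changes the parity and is caught; two flipped bits in the same word cancel out and go unnoticed.

```python
def parity(bits):
    """Return the even-parity bit for a sequence of 0/1 values."""
    return sum(bits) % 2

def flip(bits, positions):
    """Return a copy of bits with the given positions inverted."""
    out = list(bits)
    for p in positions:
        out[p] ^= 1
    return out

word = [1, 0, 1, 1, 0, 0, 1, 0]
stored_parity = parity(word)

one_flip = flip(word, [3])       # isolated single-bit upset
two_flips = flip(word, [3, 4])   # two neighbouring bits hit by the same event

single_detected = parity(one_flip) != stored_parity
double_detected = parity(two_flips) != stored_parity

print("single flip detected:", single_detected)   # True
print("double flip detected:", double_detected)   # False
```

The double flip leaves the word corrupted while the parity check still passes, which is the failure mode the comment describes for a tight cluster of upsets.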

      Putting it in a lead case? That makes it worse. If you have cosmic rays, you either have to go into a deep mine to shield them, which is what they do with neutrino experiments but is a bit impractical for a server; or to put it in a very light case and hope the cosmic ray goes straight through.

      If you have a server which seemed to be working, then did something mad, then went back to working again; and it was in a rack of other similar devices, so you can be sure nobody unplugged it to plug in a vacuum cleaner, or something like that, then the only explanation that remains for me is cosmic rays.

  2. happened to me by w0ss · · Score: 4, Funny

    I work at a Fortune 500 and I had to explain this to management just a few years ago on a Cisco 6500. It was a tough sell but I recall having a similar issue in the late 90's/early 2000's with Sun hardware so it isn't new. That one was even better to explain. The Sun's cosmic rays were causing the Sun's hardware to break!

  3. Solar Flares? by oneiros27 · · Score: 2

    I'm guessing that they've read the BOFH, but realized that there's much more reporting on solar-induced radiation ... so just decided to go with 'galactic' instead. .... completely forgetting that if this were the case, it would happen more frequently at high latitudes, due to the magnetosphere. And we'd also see a higher incidence rate after solar x-ray flares and solar particle events.

    (and the disclaimer: I work for the Solar Data Analysis Center, but I'm not a scientist, and don't speak for my place of work, etc, blah blah blah)

    --
    Build it, and they will come^Hplain.
  4. Re:Bad memory... by Dutch+Gun · · Score: 4, Insightful

    Or perhaps something like a design flaw in memory that's provable and repeatable, and has even been used for conceptual security attacks.

    Still, when you start looking at crash reports from millions of customers (I used to work on a fairly well-known MMO), you see stuff that simply shouldn't be possible, and you start wondering about things like cosmic radiation. We had to filter out what we figured were hardware-based errors due to overclocked CPUs, bad RAM, etc, or else you get flooded with impossible crash stacks.

    x = 3 * y; // Crash here! WTF?

    --
    Irony: Agile development has too much inertia to be abandoned now.
  5. Re:Van Allen radiation belts by ArtemaOne · · Score: 4, Interesting

    Because working with geosynchronous satellites I learned that it mostly affects them, and satellites in LEO aren't affected very much, and the ground has quite a bit of atmosphere for additional shielding. I don't work with stuff on the ground anymore. Do you not see that I said things like "I wonder" and such? Just provide an answer and don't be a douche.

  6. Re:Cisco is getting worse... by seoras · · Score: 5, Interesting

    One of the VP level Engineers (title is "distinguished" or something exalted like that) told me over lunch a couple of years ago that Chambers had said to him he wasn't interested in R&D. If there was a technology he needed, he'd buy it.
    The problem is that Cisco climbed to the top using IBM strategies and thinking which were focused on delivering "end to end" solutions to customers.
    They had no interest in box shipping. Those were just lego bricks and logistics. You can imagine how soul destroying that was to be a Cisco engineer.
    Bugs were a bonus to them as they sold annual maintenance contracts for roughly the same cost as the gear they sold.
    Now that the router/switch market has matured and commoditised they care even less about the quality of those boxes they have to ship.
    Their focus is entirely on the "service" level.
    They will eventually become another IBM. I was trying to think of a real tangible product that IBM made and sold just the other day. Do they?

  7. strong elevation effect too.. by Anonymous Coward · · Score: 3, Informative

    As discovered by IBM back in the 70s, if it is a radiation-induced upset, you'd see higher rates in places like Colorado vs sea level, and on upper floors of buildings vs lower floors.

  8. Not bloody likely by techno-vampire · · Score: 4, Informative

    As FOLDOC explains, Intel tested this idea decades ago by putting one board in a 25 ton lead safe and another outside to see if there was a measurable difference in bit rot. There wasn't. "Further investigation demonstrated conclusively that the bit drops were due to alpha particle emissions from thorium (and to a much lesser degree uranium) in the encapsulation material." They ended up redesigning the memory to be more resistant to the effect.

    --
    Good, inexpensive web hosting
  9. Re: Van Allen radiation belts by ArtemaOne · · Score: 2

    Too bad you resort to the logical fallacy of attacking the speaker instead of the argument. Plus this time it wasn't even an argument, so that just makes you a pure troll.

  10. Re: Van Allen radiation belts by ArtemaOne · · Score: 2

    I clearly have no argument because I was curious about something. Pointing that out doesn't mean anything.

  11. Re:Van Allen radiation belts by larryjoe · · Score: 5, Interesting

    Because working with geosynchronous satellites I learned that it mostly affects them, and satellites in LEO aren't affected very much, and the ground has quite a bit of atmosphere for additional shielding.

    While satellites and space vehicles may be hit with a lot more sub-atomic particles, some particles do reach the earth's surface, e.g., muons, which results in about 13 to 14 neutrons/cm^2/hr with sufficient energy to flip a bit on a chip (depending on latitude, longitude, altitude, etc.) All chip companies that care to be concerned about it know the estimates for errors/Mbit for their process technology. Some extra effort is needed to estimate the amount of error masking due to the micro-architecture and software. Many/most companies design in sufficient error detection or correction to bound the expected chip error rate to an acceptable level. However, that acceptable error rate depends on the customer. For supercomputing, nuclear control systems, and self-driving cars, the acceptable error rate is quite small, but for networking ASICs, the chip error rate may be allowed to be higher with software and network protocols providing additional error mitigation.
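
As a rough back-of-the-envelope sketch of how a flux figure like the one above turns into an expected upset rate: multiply flux by exposed area and time to get the number of particle hits, then apply a per-hit upset probability. The flux value (13 per cm^2 per hour) is taken from the comment; the 1 cm^2 die area and the one-in-a-million upset probability are pure assumptions chosen for illustration, not measured values for any real chip.

```python
from math import exp

def expected_hits(flux_per_cm2_hr, area_cm2, hours):
    """Expected number of particles crossing the die over the given period."""
    return flux_per_cm2_hr * area_cm2 * hours

def prob_at_least_one_upset(hits, upset_prob_per_hit):
    """Poisson probability of one or more upsets, given the expected hits
    and an assumed chance that any single hit flips a bit."""
    return 1.0 - exp(-hits * upset_prob_per_hit)

# Flux from the comment above; die area is an assumption.
hits_per_year = expected_hits(13, 1.0, 24 * 365)

# ASSUMPTION: one hit in a million causes an upset (illustrative only).
p_upset = prob_at_least_one_upset(hits_per_year, 1e-6)

print(f"expected hits per year: {hits_per_year:.0f}")
print(f"chance of at least one upset per year: {p_upset:.3f}")
```

The point of the sketch is the shape of the calculation. The per-hit upset probability is the number a chip vendor would characterise for its own process, and it dominates the result.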

  12. Re:Cisco is getting worse... by phantomfive · · Score: 3, Interesting

    They will eventually become another IBM. I was trying to think of a real tangible product that IBM made and sold just the other day. Do they?

    They sell more mainframes now than they did in 1970, believe it or not. One z13 can support 8,000 Linux VMs simultaneously. Cool looking box, too.

    --
    "First they came for the slanderers and i said nothing."
  13. Radiation can indeed cause memory errors by thermidor · · Score: 2

    When I was a physics teacher I had an ongoing memory error problem with my Fujitsu Siemens laptop which led to frequent BSODs. I replaced the memory, and it still occurred. I then noticed the memory error happened frequently at work, but never at home. I wondered whether it could be a radiation issue, as I handled radioactive sources at my desk. I got my tech to do a leak check on my desk. It showed there were higher-than-background levels of radiation (can't recall whether alpha or beta) around my desk. This only showed up using a fairly decent G-M tube which had been given to us by the local hospital when they were having a clearout. Turns out the source of radiation was dust from a piece of fossilised wood I'd picked up some time previously. It had been sitting on my desk and zapping my laptop's memory. I sealed the fossil in a Ziplock bag and kept it in a Quality Street tin. The problem never recurred.

  14. Total bullshit, SEUs are fixable by Theovon · · Score: 2

    There has been assloads of research on mitigating soft errors going back to the 1970’s. I’ve published some myself. There is no shortage of workable methods on masking transient errors in logic and bit flips in DRAMs. SEUs are a major problem for supercomputers, so their memory systems have sophisticated mechanisms for catching them.

    If Cisco is blaming this on SEUs, that just proves their incompetence, since they obviously didn’t spend 5 minutes with Google Scholar looking at hundreds of GOOD papers (in the top conferences and journals) on this topic. Seriously.

    PLUS, if something goes wrong, even if it IS a transient error, it’s FAR more likely to be a fixable bug than radiation. We had a weird bug in a DRAM controller whose state kept going invalid. We had to add another circuit to fix that. We *called* it a cosmic ray deflector, but the more likely causes, in order, were (a) another bug we couldn’t find, (b) a timing violation caused perhaps by voltage or temperature fluctuation, or (c) crosstalk in the circuit. We would have kept looking, but this deflector circuit made it robust to hundreds of hours of slamming the memory system, so we let it go. (Also, it was graphics memory, so even if it did ultimately suffer a glitch some day, it would go unnoticed.)

  15. Reasonable - but not enough data by bradgoodman · · Score: 2

    It could indeed be possible. Alpha particles are well-known to be capable of causing bit-flips in capacitive memories (DRAM). This is exactly why we have things like ECC on memory pathways. That said - it's not the only explanation. There are ways of testing this. For example, observing the general abundance and frequency of such particles in a bubble chamber, and attempting to correlate to instances of error. Or placing equipment in a shielded environment and seeing if the frequency of errors changes. Long story short - it MAY be true - but if you want to draw a conclusion - you really have to offer more data to prove it.