Slashdot Mirror


Cisco Blamed A Router Bug On 'Cosmic Radiation' (networkworld.com)

Network World's news editor contacted Slashdot with this report: A Cisco bug report addressing "partial data traffic loss" on the company's ASR 9000 Series routers contended that a "possible trigger is cosmic radiation causing SEU [single-event upset] soft errors." Not everyone is buying: "It IS possible for bits to be flipped in memory by stray background radiation. However it's mostly impossible to detect the reason as to WHERE or WHEN this happens," writes a Redditor identifying himself as a former [technical assistance center] engineer...
"While we can't speak to this particular case," Cisco wrote in a follow-up, "Cisco has conducted extensive research, dating back to 2001, on the effects cosmic radiation can have on our service provider networking hardware, system architectures and software designs. Despite being rare, as electronics operate at faster speeds and the density of silicon chips increases, it becomes more likely that a stray bit of energy could cause problems that affect the performance of a router or switch."

Friday a commenter claiming to be Xander Thuijs, Cisco's principal engineer on the ASR 9000 router, posted below the article, "apologies for the detail provided and the 'concept' of cosmic radiation. This is not the type of explanation I would like to see presented to the respected users of our products. We have made some updates to the DDTS [defect-tracking report] in question with a more substantial data and explanation. The issue is something that we can likely address with an FPD update on the 2x100 or 1x100G Typhoon-based linecard."

145 comments

  1. That's my go to excuse, too by marmot7 · · Score: 1

    Cosmic

    1. Re:That's my go to excuse, too by arglebargle_xiv · · Score: 1

      Hey, cosmic radiation is right here on page 27 of the BOFH excuse calendar. Today's its day.

    2. Re:That's my go to excuse, too by Anonymous Coward · · Score: 0

      Thank god as I pretty much did nothing but online all day. Cosmic excuse.

    3. Re:That's my go to excuse, too by Anonymous Coward · · Score: 0

      No surprise the BOFH works for Cisco now

  2. Alternate explanation by Anonymous Coward · · Score: 1

    I'm not saying it was aliens, but...

    It was aliens!

  3. SEUs are a real thing... by Anonymous Coward · · Score: 0

    And they're not all that rare. Ask your local super computer admin how often they register ECC errors...

    1. Re: SEUs are a real thing... by Anonymous Coward · · Score: 0

      And they're not all that rare. Ask your local super computer admin how often they register ECC errors...

      99.99% of SEUs have nothing to do with cosmic radiation.

      Let me guess, you are a millennial, and to cool to use Google?

    2. Re: SEUs are a real thing... by Anonymous Coward · · Score: 0

      He did a search, but didn't read past the first 3 links. Classic millennial move.

      Web searching != research, nor understanding.

      (You might be a millennial too, with your misuse of "to".)

    3. Re: SEUs are a real thing... by Eunuchswear · · Score: 1

      99.99% of SEUs have nothing to do with cosmic radiation.

      Let me guess, you are a millennial, and to cool to use Google?

      It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off soft errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries

      https://en.wikipedia.org/wiki/ECC_memory#Problem_background

      --
      Watch this Heartland Institute video
    4. Re: SEUs are a real thing... by cwsumner · · Score: 1

      It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off soft errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries

      https://en.wikipedia.org/wiki/ECC_memory#Problem_background

      Actually, there was a packaging material that caused this problem, back in the late '60s. They researched it and stopped using that, but were astonished to find it was still not completely cured. That's when the "science-fiction" idea of cosmic ray impacts started to be believed.

      (Strange how new people assume the "ancients" were wrong and ignorant, when they were usually smarter than the newbies.)

  4. Bad memory... by hcs_$reboot · · Score: 1

    would be another explanation.

    --
    Slashdot, fix the reply notifications... You won't get away with it...
    1. Re:Bad memory... by Dutch+Gun · · Score: 4, Insightful

      Or perhaps something like a design flaw in memory that's provable and repeatable, and has even been used for conceptual security attacks.

      Still, when you start looking at crash reports from millions of customers (I used to work on a fairly well-known MMO), you see stuff that simply shouldn't be possible, and you start wondering about things like cosmic radiation. We had to filter out what we figured were hardware-based errors due to overclocked CPUs, bad RAM, etc, or else you get flooded with impossible crash stacks.

      x = 3 * y; // Crash here! WTF?

      --
      Irony: Agile development has too much intertia to be abandoned now.
    2. Re:Bad memory... by hcs_$reboot · · Score: 1

      x = 3 * y; // Crash here! WTF?

      Maybe y ~ 2^31 and the CPU doesn't support overflow...

      --
      Slashdot, fix the reply notifications... You won't get away with it...
    3. Re:Bad memory... by arth1 · · Score: 0

      x = 3 * y; // Crash here! WTF?

      Quite a few opportunities that don't involve external causes:
      Overflows.
      y is a pointer to an address that can't be read.
      y is a macro which causes a divide by zero or other exceptions.
      Using a C compiler that doesn't take C++ style comments

    4. Re:Bad memory... by Anonymous Coward · · Score: 0

      Alien interference, residual færie magic, poltergeist possession, and bad feng shui would also be believable alternative hypotheses.

    5. Re:Bad memory... by Dutch+Gun · · Score: 1

      Lol, I knew I was going to get comments like that. Christ, it was just an example. Assume this is simple integer arithmetic in a well-defined range, ok? And I'm still trying to figure out what sort of C compiler that doesn't understand C++ comments would generate runtime code that crashes instead.

      Honestly, there are seriously whacked PCs in the world (especially badly overclocked gaming PCs) that try to argue that 1 + 1 == 3.

      --
      Irony: Agile development has too much intertia to be abandoned now.
    6. Re:Bad memory... by Anonymous Coward · · Score: 0

      That's not how compilers or modern CPUs work. GO away.

    7. Re:Bad memory... by Anonymous Coward · · Score: 0

      I think we need to create a new term "slashsplain" for these sorts of idiotic comments.

      Of course, this was meant as a silly example, but (given that) to enumerate:

      1) overflows in C don't cause immediate crashes
      2) didn't dereference the pointer, it's taking the value of the pointer
      3) stupid assumption unless it was implied
      4) see #3

      Slashsplain. Like mansplaining, but instead of a man condescendingly explaining something to a woman who already knows as much or more as the 'splainer, it's a /. user condescendingly explaining something to another /. user who already knows as much or more. See also, slashpedantism.

    8. Re:Bad memory... by Eunuchswear · · Score: 1

      overflows in C don't cause immediate crashes

      Where in any C standard does it say that?

      "overflows in C [ compiled by most compilers ] [ run on most common hardware ] don't cause immediate crashes" is probably true.

      But an overflow in signed integer calculations on a Harris 800 or a Tandem TXP, to cite two examples I have real experience with, would cause an immediate crash.
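      A minimal sketch of the two behaviours, in Python rather than C. Python integers never overflow, so the 32-bit arithmetic is emulated by hand, and `mul_wrap` and `mul_trap` are made-up names: the first mimics hardware that silently wraps, the second mimics a machine that traps. Neither is something the C standard promises.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def mul_wrap(a, b):
    """Multiply like hardware that silently wraps to 32-bit two's complement."""
    r = (a * b) & 0xFFFFFFFF          # keep only the low 32 bits
    return r - 2**32 if r > INT32_MAX else r

def mul_trap(a, b):
    """Multiply like hardware that traps when the result does not fit."""
    r = a * b
    if not INT32_MIN <= r <= INT32_MAX:
        raise OverflowError(f"{a} * {b} does not fit in 32 bits")
    return r

y = 2**30
print(mul_wrap(3, y))                 # -1073741824: wrong, but no crash
try:
    mul_trap(3, y)
except OverflowError as exc:
    print("trap:", exc)               # the "immediate crash" case
```

      Same source line, same inputs; whether `x = 3 * y` keeps running with garbage or dies on the spot depends on the machine.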

      --
      Watch this Heartland Institute video
    9. Re:Bad memory... by davyCrockett · · Score: 1

      Quite right. And by way of actual observation on Cisco TAC engagement with a top tier Service Provider... both bad memory and cosmic radiation have surfaced in the debug and diagnostic phase of the trouble ticket. It is not easy to then get the customer to ignore mention of 'cosmic radiation' and make progress on the real issue (timeline, early 2000's). And guess what surfaces in the next service contract negotiation.

  5. Van Allen radiation belts by ArtemaOne · · Score: 0

    I wonder if this is a real thing on the surface. The Earth's magnetic sphere has a tendency to grab and divert most of these things, which manned spacecraft have a hard time maneuvering. Do they actually ever screw up processors on the ground? That's pretty crazy.

    1. Re:Van Allen radiation belts by ArtemaOne · · Score: 4, Interesting

      Because working with geosynchronous satellites I learned that it mostly affects them, and satellites in LEO aren't affected very much, and the ground has quite a bit of atmosphere for additional shielding. I don't work with stuff on the ground anymore. Do you not see that I said things like "I wonder" and such? Just provide an answer and don't be a douche.

    2. Re: Van Allen radiation belts by ArtemaOne · · Score: 2

      Too bad you resort to the logical fallacy of attacking the speaker instead of the argument. Plus this time it wasn't even an argument, so that just makes you a pure troll.

    3. Re:Van Allen radiation belts by Anonymous Coward · · Score: 1

      The Earth's magnetic field is great for deflecting most of the stuff that comes from the Sun, but cosmic rays, as in the stuff from outside the solar system, include a lot of high to very high energy particles that are not much deflected by the same field. It can actually be worse in the atmosphere than in orbit too, depending on your exact setup. A single high energy particle directly hitting something in space might deflect a single atom, but a similar particle hitting the upper atmosphere deposits enough energy into that atom to repeat the process, and you get a whole shower of particles. Still, the net effect tends to be worse in space.

      But from a practical standpoint, there is plenty of work on various devices showing a failure rate that increases with elevation. It has been a while since I've seen such work on digital systems, but you get even analog devices like IGCTs where failure rate increases with altitude, since when under voltage a small, fast conducting channel can cause the current density to spike up faster than it can spread out.

    4. Re: Van Allen radiation belts by sexconker · · Score: 0

      You presented no argument because you have none. That's why you're being attacked as the fool you are. Hint: It is better to remain silent.

    5. Re: Van Allen radiation belts by ArtemaOne · · Score: 2

      I clearly have no argument because I was curious about something. Pointing that out doesn't mean anything.

    6. Re:Van Allen radiation belts by ArtemaOne · · Score: 1

      Thank you. There seem to be a lot of parallels to space systems. I would imagine it is still as big of an issue on the ground because they do not put the same thought in protecting from it, as they do in orbit.

    7. Re:Van Allen radiation belts by larryjoe · · Score: 5, Interesting

      Because working with geosynchronous satellites I learned that it mostly affects them, and satellites in LEO aren't affected very much, and the ground has quite a bit of atmosphere for additional shielding.

      While satellites and space vehicles may be hit with a lot more sub-atomic particles, some particles do reach the earth's surface, e.g., muons, which results in about 13 to 14 neutrons/cm^2/hr with sufficient energy to flip a bit on a chip (depending on latitude, longitude, altitude, etc.) All chip companies that care to be concerned about it know the estimates for errors/Mbit for their process technology. Some extra effort is needed to estimate the amount of error masking due to the micro-architecture and software. Many/most companies design in sufficient error detection or correction to bound the expected chip error rate to an acceptable level. However, that acceptable error rate depends on the customer. For supercomputing, nuclear control systems, and self-driving cars, the acceptable error rate is quite small, but for networking ASICs, the chip error rate may be allowed to be higher with software and network protocols providing additional error mitigation.
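      The arithmetic behind that kind of estimate can be sketched as follows. Only the flux figure (13 per cm^2 per hour) comes from the comment above; the 1 cm^2 die area and the per-neutron upset probability are placeholders made up for illustration, not measured values.

```python
HOURS_PER_YEAR = 24 * 365

def neutrons_per_year(flux_per_cm2_hr, area_cm2):
    """Energetic neutrons crossing a chip of the given area in one year."""
    return flux_per_cm2_hr * area_cm2 * HOURS_PER_YEAR

def expected_upsets_per_year(flux_per_cm2_hr, area_cm2, upset_probability):
    """Expected bit flips per year for a given per-neutron upset probability."""
    return neutrons_per_year(flux_per_cm2_hr, area_cm2) * upset_probability

# 13 neutrons/cm^2/hr is the figure quoted above; 1.0 cm^2 is an assumed die area.
print(neutrons_per_year(13, 1.0))     # 113880.0
```

      The per-chip error rate then hinges entirely on `upset_probability`, which is the process-specific number the chip companies are said to know for their own technology.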

    8. Re:Van Allen radiation belts by ArtemaOne · · Score: 1

      Thank you. I didn't realize it was so high on the ground. I know the technology for hardening is available, but clearly that costs money, and if patented that would add way too much for low threat situations. It makes a lot of sense for network technology, as you said, it has plentiful error correction and retransmission protocols as it is.

    9. Re:Van Allen radiation belts by sjames · · Score: 1

      I can't claim to know the cause, but I have seen one proven case of a bit flip affecting processing on a machine that ran fine before and after the incident. Since it was doing batch processing I was able to re-run the job with identical inputs. It never made the error again.

      Could have been a cosmic ray, alpha decay, power glitch that oddly didn't affect the other machines on the same circuit, who knows?
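      For batch work, the re-run trick can be automated. A rough sketch, assuming the job is deterministic and returns something hashable; `run_redundantly` is a made-up helper, not part of any batch system:

```python
from collections import Counter

def run_redundantly(job, inputs, max_runs=3):
    """Re-run a deterministic job until two runs return the same result."""
    seen = Counter()
    for _ in range(max_runs):
        result = job(inputs)
        seen[result] += 1
        if seen[result] == 2:         # two independent runs agree
            return result
    raise RuntimeError("no two runs agreed; suspect the hardware")

print(run_redundantly(sum, [1, 2, 3]))   # 6, after two matching runs
```

      A one-off flip costs one extra run; a machine that never agrees with itself points at something persistent rather than a stray particle.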

    10. Re:Van Allen radiation belts by ArtemaOne · · Score: 1

      I'm definitely going to read up on this surface SEU more. I have always attributed things you mention to non-ECC ram in the past.

    11. Re:Van Allen radiation belts by Eunuchswear · · Score: 1

      I have always attributed things you mention to non-ECC ram in the past.

      Non ECC ram has exactly the same number of bit-flips as ECC ram. It's just that ECC lets you know it happened (as would parity) and can fix it for you (if only one bit flipped).
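      To show what "fix it for you" means, here is a toy Hamming(7,4) code in Python. Real ECC DIMMs use a wider code (typically 64 data bits plus 8 check bits), so this is only a small-scale illustration, and `encode` and `decode` are names made up here.

```python
def encode(d):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                 # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4                 # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4                 # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def decode(word):
    """Return (data bits, position of the corrected bit, 0 if none)."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1          # flip it back
    return [c[2], c[4], c[5], c[6]], syndrome

stored = encode([1, 0, 1, 1])
stored[4] ^= 1                        # a single bit flips in memory
print(decode(stored))                 # ([1, 0, 1, 1], 5)
```

      The flip still happened; the syndrome just says where, so it can be undone and logged.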

      --
      Watch this Heartland Institute video
    12. Re: Van Allen radiation belts by Anonymous Coward · · Score: 0

      I have a gaming rig that would crash seconds before I received a heavy rain warning. This would lead to a repeating bootup sequence. The only explanation I have is that the local radar weather station (1 mile away) was receiving multiple return echoes boosting the strength of the signal.

    13. Re:Van Allen radiation belts by Anonymous Coward · · Score: 0

      minor detail; ECC ram has on the order of 18% more bit error events, due to more memory cells, but most of them are corrected transparently (unless you keep telemetry). We had a fault in a high altitude aircraft (>60K feet) that we are pretty confident occurred when a cosmic ray flipped a bit inside the error correction circuitry.

    14. Re:Van Allen radiation belts by AJWM · · Score: 1

      Of course it depends where you are on the ground. I used to work in a data center in Colorado Springs, at about 6000 feet altitude. We saw quite a few correctable memory errors in the logs (and a few random crashes).

      Might have been cosmic rays, might have been radiation from the mountain of granite (Pikes Peak) we were in the shadow of.

      Either way, if the errors occurred several times in the same DIMM, it was probably bad memory and we replaced it. The odds of cosmic rays hitting the same DIMM every few days or so are pretty remote.
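      A back-of-the-envelope version of that reasoning, assuming random upsets follow a Poisson distribution. The mean of 0.01 upsets per DIMM per observation window is a made-up number, picked only to show the shape of the argument.

```python
import math

def prob_at_least(k, mean):
    """P(X >= k) for a Poisson-distributed count X with the given mean."""
    below_k = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - below_k

mean = 0.01                           # hypothetical random upsets per DIMM per window
print(prob_at_least(1, mean))         # one error: plausible as a random hit
print(prob_at_least(3, mean))         # three in the same DIMM: vanishingly unlikely
```

      When repeats on one DIMM are that improbable under the random model, the faulty-part explanation wins, which is why swapping the DIMM is the sensible move.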

      --
      -- Alastair
    15. Re:Van Allen radiation belts by Eunuchswear · · Score: 1

      Yup, that's obvious. I'm dumb not to have thought of that: ECC == more cells / usable bit == more events / usable bit.

      We had a fault in a high altitude aircraft (>60K feet) that we are pretty confident occurred when a cosmic ray flipped a bit inside the error correction circuitry.

      Hilarious. As you add protection you add places to be broken :-( Need error correction circuitry for the error correction circuitry.

      --
      Watch this Heartland Institute video
    16. Re:Van Allen radiation belts by Agripa · · Score: 1

      Of course it depends where you are on the ground. I used to work in a data center in Colorado Springs, at about 6000 feet altitude. We saw quite a few correctable memory errors in the logs (and a few random crashes).

      The soft error rate due to cosmic rays is an order of magnitude higher at Colorado Springs than at ground level. For aircraft it is another order of magnitude higher but they have a very small capture area; I assume safety critical aircraft systems are designed with this in mind. Placing systems in a deep basement has the opposite result due to greater shielding.

      A lot of memory*hours are needed for it to become a significant problem which is why systems with either lots of memory or systems which run continuously for long times (or both) are the most affected. Oddly enough, the error rate is also proportional to memory bandwidth because use of the logic controlling the memory array increases the capture area.
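      The memory*hours point can be written as a one-line estimate. FIT (failures per 10^9 device-hours) is the usual unit for this kind of rate; the per-Mbit FIT value passed in below is a placeholder, and `altitude_factor` is a made-up knob for the order-of-magnitude scaling mentioned above.

```python
def expected_soft_errors(fit_per_mbit, memory_mbit, hours, altitude_factor=1.0):
    """Expected soft errors; FIT means failures per 10**9 device-hours."""
    return fit_per_mbit * memory_mbit * hours * altitude_factor / 1e9

# Placeholder rate of 1000 FIT/Mbit, 1000 Mbit of memory, 1000 hours of uptime.
print(expected_soft_errors(1000, 1000, 1000))          # 1.0
print(expected_soft_errors(1000, 1000, 1000, 10.0))    # 10.0 with a 10x altitude factor
```

      The expectation is linear in every term, so a big-memory box that never reboots racks up errors that a desktop switched off every night would rarely see.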

  6. Old News by Anonymous Coward · · Score: 0

    How is this new? They've been saying that for years. They've used that explanation for every router series back to the 7500s at least. Mind you people have been making fun of it for just as long. 'Ooo. Sunspots. Better check to see if any of the routers rebooted.'

    1. Re: Old News by Anonymous Coward · · Score: 0

      Indeed in 2002 we had a case closed from Cisco with this exact reason...it was either a 6509 or an old Arrowpoint LB. We always assumed it was an analog for "Fucked if we know..." . We made the same jokes for years...

  7. Not buying it by lsllll · · Score: 2

    If it's cosmic radiation, wouldn't it affect more than the ASR 9000? Or is that the only model without a lead case?

    --
    Is that a roll of dimes in your pocket or are you happy to see me?
    1. Re:Not buying it by slowdeath · · Score: 5, Interesting

      Sorry, but cases such as this exist.

      Back around 1999/2000 I was with Cisco engineering on the GSR 12000 (the first Cisco service provider class router).

      We did send a system to a POP in Denver (altitude 5000+ ft) and saw on this system a statistically significant increase in recoverable memory ECC errors.

      When the affected board was returned to San Jose and retested (basically sea level) the errors could not be reproduced.

      So we returned the hardware back to the Denver POP, and the recoverable ECC errors returned. No amount of swapping memory DIMMs (various vendors) made a difference.

      Any satellite hardware designer will tell you that cosmic radiation is a big deal for satellite design. And lead shielding is not a cost effective option in space.

    2. Re:Not buying it by CrazyCiscoDude · · Score: 1

      It's not the first time they've blamed an error on this... Cisco ACE modules had the same issue years ago, CSCsv52331.

    3. Re:Not buying it by hcs_$reboot · · Score: 1

      Maybe the router was routing in the ISS

      --
      Slashdot, fix the reply notifications... You won't get away with it...
    4. Re:Not buying it by dgatwood · · Score: 1

      Sure, it could be that. But it could also be:

      • A cleaning person plugging a vacuum cleaner into the power strip on the rack instead of into the wall outlet that's on an external circuit (combined with improper power filtering in the equipment).
      • Electrical noise caused by some other crappy piece of equipment in the rack (combined with improper power filtering in the equipment).
      • Errors caused by higher operating temperature.
      • Errors caused by emissions from natural Uranium or other radioactive elements in the soil.
      • A software bug.
      • A hardware bug.

      And if it happens disproportionately on one class of equipment, unless there are material differences in the amount of shielding, any one of those six is probably much more likely than cosmic rays, IMO. :-)

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    5. Re:Not buying it by thegarbz · · Score: 1

      A cleaning person plugging a vacuum cleaner into the power strip on the rack instead of into the wall outlet that's on an external circuit (combined with improper power filtering in the equipment).

      Even shitty chinese powersupplies filter this out to an acceptable level to make this a non issue.

      Electrical noise caused by some other crappy piece of equipment in the rack (combined with improper power filtering in the equipment).

      Even shitty chinese powersupplies filter this out to an acceptable level to make this a non issue.

      Errors caused by higher operating temperature.

      Unless the equipment has an appreciable difference in operating environment it would be insignificant. Also, one of the first things you do when checking failures is a quick check of the installation equipment, especially if you're in a data centre or other environmentally controlled situation.

      Errors caused by emissions from natural Uranium or other radioactive elements in the soil.

      You mean just like cosmic radiation? :-)

      A software bug.

      Would have been picked up in the GP's tests

      A hardware bug.

      Would have been picked up in the GP's tests

    6. Re:Not buying it by Richard+Kirk · · Score: 2

      It is a reasonable explanation. Memory has parity bits. There are random faults from the various sorts of noise you can get in semiconducting circuits, but you have a safety net that will catch the occasional flipped bit. Your computer will be catching these sorts of errors all the time. The problem with cosmic rays is that they are very energetic, so they can pass through a lot of matter, but when they collide they generate a tight cone of ionising particles that knocks out electronics in a small region of circuitry. This can flip a number of bits in the same region of memory, so it can become possible for the memory to get corrupted but the parity bits (or Reed-Solomon codec or whatever) to think that everything is ok. This is still unlikely, but it is much more probable than if events were happening at random. There is nothing sensible you can do about this other than run the same calculation a second time and see whether you come up with the same answer. This is what we used to do with large codes that ran on Cray YMP's back in the 80's, and cosmic rays set the limit on complex calculations.
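      The parity failure mode is easy to demonstrate. A minimal sketch with a single even-parity bit per word (the helper names are invented for this example): one flipped bit is caught, but two flips in the same word cancel out and the check passes.

```python
def store(bits):
    """Append an even-parity bit so the stored word has an even number of 1s."""
    return bits + [sum(bits) % 2]

def parity_ok(word):
    """True when the stored word still has an even number of 1s."""
    return sum(word) % 2 == 0

word = store([1, 0, 1, 1, 0, 0, 1, 0])

one_flip = list(word)
one_flip[2] ^= 1
print(parity_ok(one_flip))            # False: the single flip is detected

two_flips = list(word)
two_flips[2] ^= 1
two_flips[3] ^= 1                     # a neighbouring bit hit by the same event
print(parity_ok(two_flips))           # True: corrupted, yet the check passes
```

      Stronger codes push the threshold higher, but any code has some number of simultaneous flips that slips through.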

      Putting it in a lead case? That makes it worse. If you have cosmic rays, you either have to go into a deep mine to shield them, which is what they do with neutrino experiments but is a bit impractical for a server; or to put it in a very light case and hope the cosmic ray goes straight through.

      If you have a server which seemed to be working, then did something mad, then went back to working again; and it was in a rack of other similar devices, so you can be sure nobody unplugged it to plug in a vacuum cleaner, or something like that, then the only explanation that remains for me is cosmic rays.

    7. Re:Not buying it by angel'o'sphere · · Score: 1

      But that is not cosmic radiation.
      That would be occurring regardless of where you are.

      My guess is a batch of the chips in those routers has contaminated silicon. The radiation is likely coming from inside. It does not need to be contamination high enough to be a health risk, just a random increase in a phosphorus isotope or something.

      --
      Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
    8. Re:Not buying it by angel'o'sphere · · Score: 1

      Some Memory has parity bits.
      Fixing that for you.
      I bet your PC or Laptop has no memory with parity bits.

      --
      Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
    9. Re: Not buying it by Anonymous Coward · · Score: 0

      Vacuum cleaner definitely did cause a 'server' (a 586 NT workstation in the tiny startup I worked for in mid-nineties) to crash on my coworker (we were young back then, he had to clean up a mess under the desk, was in a hurry and wasn't thinking). So it could happen, at least back then, with commodity hardware at least.

    10. Re:Not buying it by slowdeath · · Score: 1

      Sorry, you are incorrect. The EXACT SAME hardware did NOT malfunction at sea level altitude, but when relocated to Denver at 5000+ ft above sea level it displayed a statistically significant increase in single bit ECC errors. This behavior has been studied by numerous organizations, including IBM, Sun Microsystems, and others. See this IEEE technical talk: http://www.ewh.ieee.org/r6/scv...

    11. Re:Not buying it by cwsumner · · Score: 1

      It is calculated that the average personal computer will be hit by this about once a year. It has been about the same since the 1960's. Cell size gets smaller, which reduces the chance, but number of cells gets greater, increasing the chance. The chips are about the same overall size, which is what makes the "target" area.

      But operating systems such as Windows crash much more often than that, so nobody notices unless they have high-reliability equipment and track faults.

      It's not an excuse or a fairy tale, it does happen. But not very often, so if your problem re-occurs it's something else.

    12. Re:Not buying it by angel'o'sphere · · Score: 1

      5000ft difference in height has no influence on cosmic radiation.
      Much more likely the buildings had.

      --
      Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
  8. 'Cosmic Radiation' can corrode credability by Camel+Racer · · Score: 1

    even if you have a strong support organization, one slacker responding with this to a customer, and the entire brand is tarnished.

    --
    Anybody can work under ideal circumstances. -- Jeff K. (January 4, 2001)
    1. Re: 'Cosmic Radiation' can corrode credability by Anonymous Coward · · Score: 0

      Okay...except if you had actually read the article, even if you had read the summary, you'd have realized that it wasn't just "one slacker responding with this to a customer," it's a bug report on the company's own website:

      https://quickview.cloudapps.ci...

      You know what else tarnishes a brand in the IT world? People like you. People who think they know better but don't, people who should have retired years ago but won't. Supposed "professionals" who can't even be bothered to read more than a sentence on the site they've been visiting for YEARS.

    2. Re: 'Cosmic Radiation' can corrode credability by Coren22 · · Score: 1

      Is there more detail in the logged in bug report, as the one you linked to without login says nowhere cosmic radiation, just SEU, which can be caused by many things and is just a generic term for a bit flip.

      --
      APK likes to ask for responses to the same things over and over. Maybe he just likes the responses?
  9. happened to me by w0ss · · Score: 4, Funny

    I work at a fortune 500 and I had to explain this to management just a few years ago on a Cisco 6500. It was a tough sell but I recall having a similar issue in the late 90's/ early 2000's with Sun hardware so it isn't new. That one was even better to explain. The Sun's cosmic rays were causing the Sun's hardware to break!

    1. Re:happened to me by Anonymous Coward · · Score: 0

      no they weren't - they just told you that to get you off the phone

    2. Re:happened to me by ebrandsberg · · Score: 1

      On a related note, there is the infamous (in narrow circles) issue of the serial consoles on old Unix systems. Many had an option to "press any key for boot menu" on the serial console. The problem was that the serial consoles would get enough static interference to occasionally detect a character while this option was available, and it would halt the boot process. On a datacenter reboot (usually due to power loss), a handful of servers would never come up because of this. It was far more reliable to require a particular character to be received to break the boot sequence, although there is a risk that even that could be triggered, but FAR less often.
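      The difference between the two policies, sketched in Python. The function names and the choice of ESC as the break character are assumptions made for illustration; real boot firmware each had its own convention.

```python
def interrupts_boot_any_key(received: bytes) -> bool:
    """'Press any key' policy: any byte at all, including line noise, halts boot."""
    return len(received) > 0

def interrupts_boot_on_char(received: bytes, break_char: bytes = b"\x1b") -> bool:
    """Stricter policy: only the agreed break character halts boot."""
    return break_char in received

noise = b"\x7f"                       # one spurious byte picked up on the line
print(interrupts_boot_any_key(noise))   # True: the server sits at the boot menu
print(interrupts_boot_on_char(noise))   # False: boot carries on
```

      If the noise bytes were uniformly random, a single stray byte would match one specific character only 1 time in 256, which fits the "FAR less often" observation.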

    3. Re:happened to me by Anonymous Coward · · Score: 0

      Or one could, you know, have put in a pullup or pulldown resistor?

      And it's not static- any wire left floating (at high impedance) becomes a radio antenna.

  10. Solar Flares? by oneiros27 · · Score: 2

    I'm guessing that they've read the BOFH, but realized that there's much more reporting on solar-induced radiation ... so just decided to go with 'galactic' instead. .... completely forgetting that if this were the case, it would happen more frequently at high latitudes, due to the magnetosphere. And we'd also see a higher incidence rate after solar x-ray flares and solar particle events.

    (and the disclaimer: I work for the Solar Data Analysis Center, but I'm not a scientist, and don't speak for my place of work, etc, blah blah blah)

    --
    Build it, and they will come^Hplain.
  11. Uh huh... by Anonymous Coward · · Score: 1

    Sun Microsystems already pulled this bullshit back about 15 years ago... I don't really recall if it was a bad batch of processors or if it was bad non-ECC cache memory or whatever... but I do remember plenty of folks giving them a ration of shit and generally refusing to buy hardware from them after that... though once they fessed up to the problem and replaced all of our defective systems (and gave us a couple of free systems) we never had any further issues.

    1. Re: Uh huh... by Anonymous Coward · · Score: 0

      If I remember correctly some of their process water was contaminated slightly with radioactive particles. Their decay caused the errors.

    2. Re:Uh huh... by Anonymous Coward · · Score: 0

      Hey, I realize the amount of energy is far exceeded, but don't forget the Raspberry pi's have been documented to reset if a strong enough pulse of visible light hits just the right spot on the CPU.

    3. Re:Uh huh... by Agripa · · Score: 1

      The SRAM structures used for integrated high performance processor cache are orders of magnitude more sensitive than discrete DRAM to radiation induced soft errors. Some of this is simply because the bandwidth is so high which exposes a greater capture area of logic. And so high performance processor cache has included parity and ECC protection for a long time.

  12. Trying to clam acts of god to get out of being by Joe_Dragon · · Score: 1

    Trying to clam acts of god to get out of being responsible?

    1. Re: Trying to clam acts of god to get out of being by Anonymous Coward · · Score: 0

      What do you have against clams? Oysters are just as good.

      Are you claiming that oysters are not as good as clams?

    2. Re:Trying to clam acts of god to get out of being by MtHuurne · · Score: 1

      It reminds me of people blaming compiler bugs for non-working code. While it does happen that a compiler generates incorrect code (I've encountered a few instances over the years), unless you either have reduced the problem to a minimal test case or examined the generated assembly and located the problem there, it's far more likely that it's a case of not digging deep enough to find a bug in your own code.

  13. Cisco is getting worse... by DMJC · · Score: 0

    I work at a fortune 100, we're being delayed at the moment by software bugs in Cisco's routers. Their QA has completely gone out the window in the last few years, probably related to all the staff cutbacks. I expect we will start seeing Cisco losing market share if this keeps up.

    1. Re:Cisco is getting worse... by Anonymous Coward · · Score: 0

      I would argue they have always been this way based on my experience.

    2. Re:Cisco is getting worse... by Anonymous Coward · · Score: 1

      They will be losing market share... us included... we're a global company, with a size akin to a fortune 100. We're pulling the plug and moving to Juniper. Too much horse shit from Cisco. It's a fucking nightmare to get a quote, or even get an order filled CORRECTLY. We get the wrong shit sent all the time, and Cisco says the internal RMA process is so tedious, they tell us to throw the wrong equipment in the garbage or just keep it.

      They don't stock shit. Everything takes weeks. RMA'd equipment takes weeks. Half the time we get these vague emails saying oh sorry your switch broke... we don't have shit in the warehouse... you know, because why would you ever stock something like a Nexus class switch... so we sent your RMA over to manufacturing... which is also backed up... therefore you *should* get your replacement Nexus 5k in 6-7 weeks. Good thing you purchased "SmartNet" with uber-fast advance replacement coverage. Have a nice day.

    3. Re:Cisco is getting worse... by seoras · · Score: 5, Interesting

      One of the VP level Engineers (title is "distinguished" or something exalted like that) told me over lunch a couple of years ago that Chambers had said to him he wasn't interested in R&D. If there was a technology he needed, he'd buy it.
      The problem is that Cisco climbed to the top using IBM strategies and thinking, which were focused on delivering "end to end" solutions to customers.
      They had no interest in box shipping. Those were just lego bricks and logistics. You can imagine how soul destroying that was to be a Cisco engineer.
      Bugs were a bonus to them as they sold annual maintenance contracts for roughly the same cost as the gear they sold.
      Now that the router/switch market has matured and commoditised they care even less about the quality of those boxes they have to ship.
      Their focus is entirely on the "service" level.
      They will eventually become another IBM. I was trying to think of a real tangible product that IBM made and sold just the other day. Do they?

    4. Re:Cisco is getting worse... by Anonymous Coward · · Score: 0

      They sell service. I've had the pleasure of dealing with various TAC's and I don't care if X company makes a better product, their support sucks compared to Cisco.

    5. Re:Cisco is getting worse... by R3d+M3rcury · · Score: 1

      I think they're still selling POWER-based systems.

    6. Re: Cisco is getting worse... by Anonymous Coward · · Score: 0

      Hahahahahahahahahahahahahahahahahahahahaha.

      Juniper.

      Ha ha ha.

      Aw, man.

    7. Re:Cisco is getting worse... by phantomfive · · Score: 3, Interesting

      They will eventually become another IBM. I was trying to think of a real tangible product that IBM made and sold just the other day. Do they?

      They sell more mainframes now than they did in 1970, believe it or not. One z13 can support 8,000 Linux VMs simultaneously. Cool looking box, too.

      --
      "First they came for the slanderers and i said nothing."
    8. Re:Cisco is getting worse... by swb · · Score: 1

      I seem to run across a fair number of places with AS/400 stuff. What's kind of interesting is the AS/400 stuff and people seem to run in this parallel universe IT department with their own staff somehow immune from the other pressures of the rest of the IT department.

      Once in a blue moon I'll hear mention that some kind of AS/400 update or installation is happening, so it's not like they're strictly legacy systems. And at longer term clients with AS/400 I occasionally see something new/different in the "AS/400 rack".

      I don't know what IBM's growth potential is or how at risk their active businesses like AS/400 are from being eaten by Wintel/Lintel systems, but they sure seem to have carved out a niche that seems nearly immovable.

    9. Re:Cisco is getting worse... by Anonymous Coward · · Score: 0

      AS/400s run business critical applications and are very prominent where money is concerned (banking, finance, investments etc). The code works from the OS up through to the green-screen (or http wrappings). There is no need to replace them with suspect off the shelf toys like wintel crap. No one senior is going to accept the same stack that's always down, needing reboots, umpteen patches, malware/virus costs, and an army of drones running around trying to keep things ticking over. What runs on the AS/400 will basically run forever bar drive failure. But as they're all mirrored and striped, you merely need to replace a dodgy drive when the system tells you to.

      Only a fool replaces something that works with a gamble. Unless you can guarantee wintel junk will be as robust as these old machines, and will underwrite trillions of dollars/pounds/euros with your claim, don't even think about it. That's not to say many other AS/400 based applications can't be moved over to generic intel+linux/bsd. Most of the office / groupware crap should have been migrated off long ago. Likewise with OP/AP/GL stuff.

  14. Head cover protest by seoras · · Score: 1

    This makes me wish I was still working for them in IOS Engineering for the opportunity just to stir some shit.
    I'd have gone into the office on Monday morning with my head covered in tinfoil.
    I used to get some good laughs in the Cisco office, I seem to be getting more on the outside these days.
    Oh how the mighty have fallen....

  15. This is a significant problem for spacecraft by Anonymous Coward · · Score: 0

    I worked on communications satellites in the 1980's and bit flips from cosmic rays were a serious problem we needed to address. Chips needed to be hardened to resist cosmic rays and electronics had to be substantially shielded.

    There are a lot of cosmic rays going through you right now. While the vast majority don't interact with your cells, once in a while one does and that can cause cancer or genetic changes. We owe much of evolution to cosmic rays!

  16. strong elevation effect too.. by Anonymous Coward · · Score: 3, Informative

    As discovered by IBM back in the 70s, if it is a radiation induced upset, you'd see higher rates in places like Colorado vs sea level, and on upper floors of buildings vs lower floors.

  17. Not bloody likely by techno-vampire · · Score: 4, Informative

    As FOLDOC explains, Intel tested this idea decades ago by putting one board in a 25 ton lead safe and another outside to see if there was a measurable difference in bit rot. There wasn't. "Further investigation demonstrated conclusively that the bit drops were due to alpha particle emissions from thorium (and to a much lesser degree uranium) in the encapsulation material." They ended up redesigning the memory to be more resistant to the effect.

    --
    Good, inexpensive web hosting
    1. Re:Not bloody likely by thegarbz · · Score: 1

      Given that alpha emissions are trivially blocked by something as thin as a sheet of paper I take that citation with a grain of salt.

    2. Re:Not bloody likely by rsmith-mac · · Score: 1

      Interesting. Do you know how long ago that study was done? I'm curious if smaller manufacturing geometries have made newer processors more vulnerable.

    3. Re:Not bloody likely by Anonymous Coward · · Score: 0

      thorium in the encapsulation on the chip, not in the lead walls of the safe.

    4. Re:Not bloody likely by angel'o'sphere · · Score: 1

      "Alpha radiation" is always from a nuclear decay. That is how the discoverers "named" it.

      An alpha particle is a helium nucleus: an atom without electrons, an ion.

      A cosmic alpha particle has energies that go far beyond your imagination. You don't shield them with a sheet of paper. Hence we gave them a different name, "cosmic ray".

      --
      Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
    5. Re:Not bloody likely by Agripa · · Score: 1

      Interesting. Do you know how long ago that study was done? I'm curious if smaller manufacturing geometries have made newer processors more vulnerable.

      The sensitivity of DRAM actually leveled off a few generations ago. I think what happened is that there is a minimum capacitance needed per DRAM cell, so as the cells became smaller and the dielectric constant was increased to make up for it, the charge stored in a given volume became *greater*. An ionizing radiation impact therefore spreads its charge over a greater number of DRAM capacitors without depositing enough charge to affect them individually.

      High performance SRAM used for integrated caches became more vulnerable and has been protected by parity and ECC for a long time now.

      I am not sure about high performance logic.

  18. Wake up sheeple! by XSportSeeker · · Score: 1

    Santa Claus spread chemtrails in the sky with which the easter bunny got stoned and confused causing the routers to crash!
    Hey, it's not impossible!

  19. Reddit is social media by Anonymous Coward · · Score: 0

    Why believe someone from Reddit?

    1. Re: Reddit is social media by Anonymous Coward · · Score: 0

      Plausibility is enough for internet facts.

  20. The BOFH now works for Cisco? by Opportunist · · Score: 1

    That was in his excuses rolodex.

    --
    We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
    1. Re: The BOFH now works for Cisco? by Anonymous Coward · · Score: 0

      Same thing I thought- hey, this was in my bofh excuse of the day file

  21. Stray Gamma Rays.. by bduncan · · Score: 1

    I used to use that all the time. Now I'll have to think of something else..

  22. Tinfoil by Anonymous Coward · · Score: 0

    Finally a reason to put a tonfoil hat on my cable modem. Is that why they keep failing over time, or is it the diodes in the splitters the cable keeps replacing?

    1. Re:Tinfoil by Anonymous Coward · · Score: 0

      Since a ton of foil weighs the same as a ton of anything else, I'd say look to that for the source of your modem troubles.

  23. Really Old News by Anonymous Coward · · Score: 0

    Cisco has said this before; for easily 15 years they have claimed this issue. Shit happens is a better explanation.

  24. Same old excuses by Anonymous Coward · · Score: 0

    Back in the mid-70s, a company I was working for as a software engineer rented a non-IBM 370 type computer. It began to drop several random bytes out of random length records. The company swore that the computer couldn't do that and the problem was not their fault.

  25. Re:Unh Hunh by BringsApples · · Score: 0

    Hahaha... for some reason, this video came to mind. Just another fine example of stupid assumptions based on, and exposing, a woman's desire to be a sex object.

    Maybe cosmic radiation is affecting them too?

    --
    Politics; n. : A religion whereby man is god.
  26. Of course this happens by Yoik · · Score: 1

    Flips of a single bit in a memory or register are common enough that few modern systems would run for long without error-correcting memory. Even ECC has its limitations, and most systems eventually crash/panic/blue-screen or whatever and require a reboot.

    The costs to improve error resilience go up rapidly and don't have a meaningful upper bound. My working trade off was to design for a mtbf comparable to how long I wanted to keep that job.

  27. ASR 9000 Agglutination Router by Anonymous Coward · · Score: 0

    From http://noelcomm.com/ethernet/: "We have a philosophy of using routers to route and switches to switch which ensures that our Ethernet devices move layer 2 frames as quickly as possible avoiding the “bumps on the wire” often encountered by our competitors who seek to agglutinate multiple services on a single, expensive platform."

    Of course, this business just happens to be located in the buttcrack of the universe, Yakima, WA; home of the Braindead.

  28. But are u for real tho? by Anonymous Coward · · Score: 0, Funny

    "does not take cpp comments" jesus christ

    -> STACK TRACE OF ALREADY COMPILED PROGRAM

    pls get help soon you aspergian shitfuc

  29. I remember in the 70s some memory manufacturer use by charliemerritt03 · · Score: 1

    I remember in the 70s some memory manufacturer used a ceramic package that had a lot of thorium. Bad trouble.

  30. Are sure they didn't mean... by tkrotchko · · Score: 1

    ...a Cosmic Brownie?
    http://cosmicbrownies.littlede...

    --
    You were mistaken. Which is odd, since memory shouldn't be a problem for you
  31. Re:Unh Hunh by Anonymous Coward · · Score: 0

    LA Prostitute (by her own admission) Raven Williams is an HIV positive self-loathing crack addict and alcoholic who has borderline personality disorder and is possibly bipolar as well. She spends her time making YouTube videos cursing people out and starting fights with people in public around Hauser Blvd. in LA as she walks around topless. She has been arrested numerous times by the LAPD and is well known by the cops who patrol that district. She also harasses the police as they are doing their jobs. She attempts to sell her ugly art work on YouTube and via her weblog as well. Currently, she is being investigated by the IRS as she has never filed taxes for the IRS, besides the fact that she makes money illegally via prostitution. Due to her serious personality and psychological issues, she has been fired from every job she has ever had and simply can not relate to anyone else socially or professionally.

  32. This is real by Anonymous Coward · · Score: 0

    I have heard a person from a Cisco competitor talk about how their switches are cosmic-ray safe, but Ciscos are not.

  33. Oopsie! Bit flip! Oh well! by Anonymous Coward · · Score: 0

    The correct response to rare spontaneous radiation-induced errors is not, "Oopsie! Oh well!" The correct response is to design the hardware to be more tolerant and robust in the presence of inevitable background radiation. E.g., use ECC memory for fuck sake. And at least parity checking on all buses.
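For readers wondering what "use ECC" buys you, here is a minimal sketch using the textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. This is only an illustration: real ECC memory typically uses wider SECDED codes, and nothing here is claimed to reflect Cisco's hardware. The function names are made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit.

    Returns (data_bits, error_position), where error_position is the
    1-based position that was repaired, or 0 if no error was seen.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3  # syndrome names the bad position
    if error_position:
        c[error_position - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], error_position


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # simulate a single-event upset at position 5
recovered, where = hamming74_decode(stored)
print(recovered == word, where)
```

Plain parity, by contrast, only detects an odd number of flips and cannot say which bit is wrong, which is why a parity-only design ends in an exception or reboot rather than a silent fix.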

  34. On a router that expensive by Chas · · Score: 1

    It shouldn't be a huge expense to build in some form of error correction to catch that sort of thing.

    --


    Chas - The one, the only.
    THANK GOD!!!
    1. Re:On a router that expensive by cwsumner · · Score: 1

      It shouldn't be a huge expense to build in some form of error correction to catch that sort of thing.

      Otherwise known as ECC memory?

  35. Global Warming next? by Ungrounded+Lightning · · Score: 1

    My wife was looking over my shoulder when the "Cisco Blamed A Router Bug on 'Cosmic Radiation'" headline went by, and asked:

    "What's their next excuse? Global Warming?"

    --
    Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
  36. Ive actually heard this before by Anonymous Coward · · Score: 0

    In the early 00s I worked on a lot of AS5200 series routers for dialup. I've had Cisco blame cosmic radiation and solar flares for a handful of unexplainable crashes. I really don't know how anyone could argue that with a straight face when the equipment is in a rack with 10 other working pieces of Cisco equipment.

  37. Old Standby Excuse by Anonymous Coward · · Score: 0

    It was rats, sir. They ate through the wires.

  38. Lie by Anonymous Coward · · Score: 0

    Except all professional computing hardware comes in metal cases which are racked in metal rack units

  39. Next week by Jesus+H+Rolle · · Score: 1

    A White House health report addressing "partial data traffic loss" on Secretary of State Hillary Clinton contends that a "possible trigger is cosmic radiation causing SEU [single-event upset] soft errors." Not everyone is buying: "It IS possible for bits to be flipped in memory by stray background radiation. However it's mostly impossible to detect the reason as to WHERE or WHEN this happens," writes a Redditor identifying himself as a former [technical assistance center] engineer...

  40. cosmic radiaton? by Anonymous Coward · · Score: 0

    What type of cosmic radiation? Does it occur more often near nuclear reactors? Fukushima?
    If neutrinos trigger it, then these routers are a really cheap neutrino detector.
    If any of the other neutral particles trigger it, ditto.
    If only charged particles trigger it, then no such luck.
    BUT ! If dark matter can trigger it, then physicists will keep them all.
    Then there's the rare suggestion of psychic phenomena......

    Just open source the software, check to see if deep-packet inspection triggers it when
    the CEO gets a bonus... or something...

  41. Re:Unh Hunh by Zontar+The+Mindless · · Score: 1

    Yet she still had time to write the Realm-Jumper Chronicles, 5 volumes and counting. Colour me impressed.

    --
    Il n'y a pas de Planet B.
  42. Radiation can indeed cause memory errors by thermidor · · Score: 2

    When I was a physics teacher I had an ongoing memory error problem with my Fujitsu Siemens laptop which led to frequent BSOD. I replaced the memory, and it still occurred. I then noticed the memory error happened frequently at work, but never at home. I wondered whether it could be a radiation issue, as I handled radioactive sources at my desk. I got my tech to do a leak check on my desk. It showed there were higher-than-background levels of radiation (can't recall whether alpha or beta) around my desk. This only showed up using a fairly decent G-M tube which had been given to us by the local hospital when they were having a clearout. Turns out the source of radiation was dust from a piece of fossilised wood I'd picked up some time previously. It had been sitting on my desk and zapping my laptop's memory. I sealed the fossil in a Ziplock bag and kept it in a Quality Street tin. The problem never recurred.

  43. SEU from cosmic radiation are real by Anonymous Coward · · Score: 0

    A project I'm working on expects one SEU per month. This is an issue in safety-critical applications where failures have to be of the order of once per decade. Mitigated by CPUs and memory being triplicated. "Voting" hardware detects differences on every cycle.
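The triplicate-and-vote scheme described above can be sketched in a few lines. This is a software illustration of a bitwise 2-of-3 majority voter, not the poster's actual hardware; the function names and the 8-bit example word are assumptions made up for the example.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three integer words."""
    return (a & b) | (a & c) | (b & c)


def vote_and_flag(a, b, c):
    """Return the voted word and the indices of replicas that disagree with it."""
    voted = majority_vote(a, b, c)
    disagreeing = [i for i, word in enumerate((a, b, c)) if word != voted]
    return voted, disagreeing


good = 0b10110010
upset = good ^ (1 << 5)  # replica 1 takes a single bit flip
voted, bad_replicas = vote_and_flag(good, upset, good)
print(voted == good, bad_replicas)
```

A single upset in one replica is outvoted by the other two, and the disagreement list tells the system which copy to scrub or resynchronise. Two replicas corrupted in the same bit would still defeat the vote, which is one reason the upset rate per replica has to stay low.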

  44. Total bullshit, SEUs are fixable by Theovon · · Score: 2

    There has been assloads of research on mitigating soft errors going back to the 1970’s. I’ve published some myself. There is no shortage of workable methods on masking transient errors in logic and bit flips in DRAMs. SEUs are a major problem for supercomputers, so their memory systems have sophisticated mechanisms for catching them.

    If Cisco is blaming this on SEUs, that just proves their incompetence, since they obviously didn’t spend 5 minutes with Google Scholar looking at hundreds of GOOD papers (in the top conferences and journals) on this topic. Seriously.

    PLUS, if something goes wrong, even if it IS a transient error, it’s FAR more likely to be a fixable bug than radiation. We had a weird bug in a DRAM controller whose state kept going invalid. We had to add another circuit to fix that. We *called* it a cosmic ray deflector, but the more likely causes, in order, were (a) another bug we couldn’t find, (b) a timing violation caused perhaps by voltage or temperature fluctuation, or (c) crosstalk in the circuit. We would have kept looking, but this deflector circuit made it robust to hundreds of hours of slamming the memory system, so we let it go. (Also, it was graphics memory, so even if it did ultimately suffer a glitch some day, it would go unnoticed.)

    1. Re:Total bullshit, SEUs are fixable by Anonymous Coward · · Score: 0

      Sure, with enough resources, most problems are solvable.

      The ASR9000 likely contains a fair number of pretty large FPGAs. Those are actually pretty good gamma-induced SEU detectors, due to their large configuration SRAMs. Yes, there are mitigation mechanisms in those parts (including ECC), but they don't completely protect you (e.g. control path effects). Your option at that point is to burn more resources to mitigate the remaining vulnerability through engineering.

      For space, you can afford to put down three of everything, plus voter circuits. The cost of that logic is nothing compared to the cost of getting the satellite into space or the cost of your satellite getting into an unrecoverable state. For terrestrial devices, cost of the product is king. Requesting more or larger parts is a great way to get your product cancelled.

  45. Re:I remember in the 70s some memory manufacturer by angel'o'sphere · · Score: 1

    I guess in this case it is "the same thing" ... the silicon from which they made some of the chips involved was not pure enough, or the material for doping was contaminated.

    --
    Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
  46. Not new by Comen · · Score: 1

    I have had Cisco tell me this many times any time a router reboots from a parity error for over 15 years now, so they have been using this for a long time now.

  47. Reasonable - but not enough data by bradgoodman · · Score: 2

    It could indeed be possible. Alpha particles are well known to be capable of causing bit-flips in capacitive memories (DRAM). This is exactly why we have things like ECC on memory pathways. That said - it's not the only explanation. There are ways of testing this. For example, observing the general abundance and frequency of such particles in a bubble chamber, and attempting to correlate to instances of error. Or placing equipment in a shielded environment and seeing if the frequency of errors changes. Long story short - it MAY be true - but if you want to draw a conclusion - you really have to offer more data to prove it.

  48. sunspots!! by Anonymous Coward · · Score: 0

    everything is because Sunspots!

  49. Bwahahahahah.... by MercTech · · Score: 1

    My reaction when I first heard the "cosmic radiation" excuse for misbehaving electronics.
    With decades of experience in tech implementations in radiation fields I can personally attest to the fact that the radiation flux levels needed to cause reactions in electronics could only be high enough due to cosmic radiation at elevations higher than 20,000 feet. The levels need to be in Rad per hour rather than the microrad per hour that you get from cosmic radiation. (For example, background at sea level is often 15-20 microrem/hr in the day and 3-5 microrem/hr at night, with the difference due to cosmic radiation. In a 5 Rad/hr field, 5,000,000 microrem/hr, the lifetime of electronics is weeks if not days before the semiconductors fail from ionization of the doping in the material.) This is for electronics other than radio transmissions, as radio transmissions can experience interference in transmission due to ionization in the atmosphere. (Thunderstorms do that too.) Low power short range such as wifi is much less affected than long range skywave or aimed microwave. And radio interference is not an issue in the electronics but with interfering transmissions from mama nature. Cisco was so obviously full of a certain word that rhymes with their name.

    --
    NRRPT/RCT
    1. Re:Bwahahahahah.... by cwsumner · · Score: 1

      It only takes a single "cosmic ray" particle to flip a bit in memory. Readings that are averages are no good for this.

      And all this has been known for about 50 years... ;-)

    2. Re:Bwahahahahah.... by Anonymous Coward · · Score: 0

      You're talking about actual hard failures, not soft bit flips. Soft bit flips can occur from a single strike, and at the newer process nodes (40nm on down) SRAM cells are quite vulnerable. I work in the semiconductor industry and can tell you that it is quite possible for products to suffer from this. Proof points for cosmic ray strikes being the cause of a failure (vs. other causes) in a given product are that the occurrences are (a) random across all memory bits on the device, (b) devices only suffer a single instance (usually) of the problem and have no further failures afterward, and (c) failures correlate to the altitude at which the device is being used; higher altitudes consistently have higher failure rates.

      That said, the occurrences of these failures are very rare; a single multi-core processor device with only single bit parity protection on internal SRAM can expect to have a failure about every 40 years of run time or so, roughly. However, if you have a huge data center with tens of thousands of nodes, 40 years of run time can pile up in a week or two. Some older ARM and MIPS (and probably other) embedded processor designs only have parity protection on some of the small internal caches used in the design, and occasionally glitch because of this issue. A good embedded OS can detect and correct some of these (instruction cache corruption, for example) but others will cause parity exceptions and crash the processor. A good software design that can handle a quick reboot and state recovery is the way around that problem.

      ECC on SRAM takes the failure rate down significantly, and some rad-hardened designs used in avionics and space applications use even more bits than ECC does.
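The fleet arithmetic in the comment above is easy to check. A rough sketch, assuming each device fails independently at a constant rate of one upset per 40 device-years (the poster's ballpark figure); the function name and the fleet sizes are illustrative only.

```python
DAYS_PER_YEAR = 365.25


def days_between_fleet_upsets(years_per_upset_per_device, n_devices):
    """Expected days between upsets anywhere in the fleet.

    Assumes independent devices with a constant upset rate, so the
    fleet-wide rate is simply n_devices times the per-device rate.
    """
    fleet_upsets_per_year = n_devices / years_per_upset_per_device
    return DAYS_PER_YEAR / fleet_upsets_per_year


for n in (1_000, 2_000, 10_000, 50_000):
    print(f"{n:>6} devices: one upset every {days_between_fleet_upsets(40, n):.2f} days")
```

Under these assumptions, 40 device-years accumulate in roughly two weeks with about 1,000 devices and in about a day and a half with 10,000, so the "week or two" figure corresponds to a fleet in the low thousands, and tens of thousands of nodes would see upsets more often than that.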

  50. haha by Anonymous Coward · · Score: 0

    thanks for that video, that chick was hilarious!
    I needed a good laugh... she was addicted to the penus!!!

  51. God Did It by Anonymous Coward · · Score: 0

    Case solved.

  52. Re:Unh Hunh by Anonymous Coward · · Score: 0

    Shut the fuck up, you plagiarising cut-n-paste junky! that has nothing to do with TFA!
    Try think for yourself instead of quoting random people from youtube and trying to pass it off as a legitimate original thought.

  53. This paves the way to new ways for DDOS attacks by ctrl-alt-canc · · Score: 1

    You just need the right gadget.

  54. "Cosmic Radiation Bug"? by NoSalt · · Score: 1

    More likely a bug in the code that the NSA has inserted into all of their routers.