Cisco Blamed A Router Bug On 'Cosmic Radiation' (networkworld.com)
Network World's news editor contacted Slashdot with this report: A Cisco bug report addressing "partial data traffic loss" on the company's ASR 9000 Series routers contended that a "possible trigger is cosmic radiation causing SEU [single-event upset] soft errors." Not everyone is buying: "It IS possible for bits to be flipped in memory by stray background radiation. However it's mostly impossible to detect the reason as to WHERE or WHEN this happens," writes a Redditor identifying himself as a former [technical assistance center] engineer...
"While we can't speak to this particular case," Cisco wrote in a follow-up, "Cisco has conducted extensive research, dating back to 2001, on the effects cosmic radiation can have on our service provider networking hardware, system architectures and software designs. Despite being rare, as electronics operate at faster speeds and the density of silicon chips increases, it becomes more likely that a stray bit of energy could cause problems that affect the performance of a router or switch."
Friday a commenter claiming to be Xander Thuijs, Cisco's principal engineer on the ASR 9000 router, posted below the article, "apologies for the detail provided and the 'concept' of cosmic radiation. This is not the type of explanation I would like to see presented to the respected users of our products. We have made some updates to the DDTS [defect-tracking report] in question with a more substantial data and explanation. The issue is something that we can likely address with an FPD update on the 2x100 or 1x100G Typhoon-based linecard."
"While we can't speak to this particular case," Cisco wrote in a follow-up, "Cisco has conducted extensive research, dating back to 2001, on the effects cosmic radiation can have on our service provider networking hardware, system architectures and software designs. Despite being rare, as electronics operate at faster speeds and the density of silicon chips increases, it becomes more likely that a stray bit of energy could cause problems that affect the performance of a router or switch."
Friday a commenter claiming to be Xander Thuijs, Cisco's principal engineer on the ASR 9000 router, posted below the article, "apologies for the detail provided and the 'concept' of cosmic radiation. This is not the type of explanation I would like to see presented to the respected users of our products. We have made some updates to the DDTS [defect-tracking report] in question with a more substantial data and explanation. The issue is something that we can likely address with an FPD update on the 2x100 or 1x100G Typhoon-based linecard."
Or perhaps something like a design flaw in memory that's provable and repeatable, and has even been used for conceptual security attacks.
Still, when you start looking at crash reports from millions of customers (I used to work on a fairly well-known MMO), you see stuff that simply shouldn't be possible, and you start wondering about things like cosmic radiation. We had to filter out what we figured were hardware-based errors due to overclocked CPUs, bad RAM, etc, or else you get flooded with impossible crash stacks.
x = 3 * y; // Crash here! WTF?
Irony: Agile development has too much intertia to be abandoned now.