Slashdot Mirror


That Time The Windows Kernel Fought Gamma Rays Corrupting Its Processor Cache (microsoft.com)

Long-time Microsoft programmer Raymond Chen recently shared a memory about an unusual single-line instruction that was once added into the Windows kernel code -- accompanied by an "incredulous" comment from the Microsoft programmer who added it:

;
; Invalidate the processor cache so that any stray gamma
; rays (I'm serious) that may have flipped cache bits
; while in S1 will be ignored.
;
; Honestly. The processor manufacturer asked for this.
; I'm serious.
invd


"Less than three weeks later, the INVD instruction was commented out," writes Chen. "But the comment block remains.

"In case we decide to resume trying to deal with gamma rays corrupting the processor cache, I guess."

166 comments

  1. So that's how the NSA by Anonymous Coward · · Score: 0

    phrase their 'requests' these days.

  2. Microsoft's never doing any military or space work by johnjones · · Score: 3, Informative

    preparing your software for failures in hardware due to common problems such as radiation might be a good idea...

    This is why some firms/states would not trust microsoft to critical functions....

  3. That's a great comment by NotSoHeavyD3 · · Score: 5, Insightful

    Since it explains the reasoning why that code is there. (Since another developer could come by and wonder why that code is there.) I've seen way too many people put in a comment like ";invalidate cache" and call it a day.

    --
    Did you know 80 to 90% of the moderators on slashdot wouldn't recognize a troll even if one dragged them under a bridge.
    1. Re:That's a great comment by Anonymous Coward · · Score: 1

      Would have been somewhat better if they left in which processor, and which manufacturer, they were talking about.

    2. Re:That's a great comment by NotSoHeavyD3 · · Score: 1

      You're very right there. Also I'm guessing there's probably some issue tracking so putting that in there would be nice as well. I'm just so surprised the original developer didn't put in some pointless comment.

      --
      Did you know 80 to 90% of the moderators on slashdot wouldn't recognize a troll even if one dragged them under a bridge.
    3. Re: That's a great comment by Balial · · Score: 2

      It needs a reference to the errata from the vendor. Future revisions may need to tweak code flow and understand exactly what this is trying to achieve.

    4. Re:That's a great comment by Anonymous Coward · · Score: 1

      +1

      Note from professional programmer: I can read the code to see WHAT is happening, and HOW it is happening. I need the comments to explain WHY it is happening, and WHY I should care. During code review, this comment would get an "awesome comment" comment.

    5. Re:That's a great comment by shabble · · Score: 2

      Since it explains the reasoning why that code is there.(Since another developer could come by and wonder why that code is there.).

      But... the code isn't there. The code itself was commented out shortly after.

      What's more concerning is why the commented stuff was actually left in there, since I'm presuming they had source control even back then.

      And "in case someone put it back in later" isn't really covered since the same sort of code could conceivably be put elsewhere in the code without the programmer seeing this bit of code.

    6. Re:That's a great comment by hcs_$reboot · · Score: 1

      The comment could have included "Use this instruction with care. Data cached internally and not written back to main memory will be lost", INVD man.

      --
      Slashdot, fix the reply notifications... You won't get away with it...
  4. Smoke and Mirrors by Anonymous Coward · · Score: 0

    Single Event Upsets are real and all semiconductors are susceptible. 90nm might be more "resilient" to it, but it can still occur.

    This sounds like a processor bug or a bug elsewhere and they bamboozled MS with smoke and mirrors...

    1. Re: Smoke and Mirrors by Anonymous Coward · · Score: 0

      High altitude installations actually have to deal with this fairly frequently. There's a reason why altitude is listed in hardware specifications alongside humidity and temperature operating ranges.

    2. Re: Smoke and Mirrors by Colourspace · · Score: 1

      Interesting. I used to work for an FPGA manufacturer and the move to 90nm back around 2003-ish (?) really spooked our mil/aero customers in particular.

    3. Re: Smoke and Mirrors by Anonymous Coward · · Score: 0

      Yes but aren't they due to alpha radiation, not gamma rays?

  5. This is not that crazy... by Anonymous Coward · · Score: 0, Informative

    I once had to debug a situation where an opto-coupler had been changed out from a part that had black plastic to a part that had white plastic. The difference in the opacity of the casing was enough to cause a larger drift when in the sunlight. This is not as crazy as it sounds...

  6. I'm not sure what's odd about that by vadim_t · · Score: 4, Interesting

    The need for error checking has been around for a very long time. Yes, cosmic particles are indeed a thing, and result in increased memory errors at high altitude, in airplanes, or especially in space.

    I remember parity RAM being around in the 90s, and I'm pretty sure it's older than that. Pretty much any server these days uses ECC for this reason.

    I run ECC and record the occasional bit flip in my logs. The counters can be found under /sys/devices/system/edac/mc/mc0/.

    What's odd is that ECC is not routinely used in all hardware. Depending on the conditions it can be of great help, as the rare bit flip can cause strange problems that can take ages to track down. And it works well for figuring out when you have a bad memory module -- the computer will figure it out on its own.
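For anyone who wants to look at those counters, here is a minimal sketch that reads the corrected and uncorrected error counts from the EDAC directory mentioned above. It assumes a Linux machine with an EDAC driver loaded, where the memory controller directory exposes `ce_count` and `ue_count` files. The helper name `read_edac_counts` is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(mc_dir="/sys/devices/system/edac/mc/mc0"):
    """Return corrected (ce) and uncorrected (ue) error counts for one memory controller.

    A counter that cannot be read (no ECC, no EDAC driver, wrong path)
    is reported as None rather than raising.
    """
    counts = {}
    for name in ("ce_count", "ue_count"):
        path = Path(mc_dir) / name
        try:
            counts[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            counts[name] = None
    return counts
```

Calling `read_edac_counts()` with no arguments checks the first memory controller. A non-zero `ce_count` means ECC has quietly corrected at least one bit flip since boot.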

    1. Re:I'm not sure what's odd about that by dargaud · · Score: 2

      I have a friend who had written his own accounting software in the 80s on a 6502 PC. Once there was a discrepancy of a few $ at the end of the month. He spent an entire month backtracking the error through software logic, then software debug, then finally assembly until he found the exact place where a single bit had flipped in memory. Took him a month.

      --
      Non-Linux Penguins ?
    2. Re:I'm not sure what's odd about that by Anonymous Coward · · Score: 0

      I've always thought book keepers and accountants were out of their tiny minds.

      They will spend tens of hours tracking down an error of a few cents in accounts that amount to hundreds, thousands, millions of dollars.

      An effort that costs orders of magnitude more than the error.

      Never mind, the client pays I guess.

    3. Re:I'm not sure what's odd about that by thegarbz · · Score: 1

      What's odd is that ECC is not routinely used in all hardware.

      Nothing odd about it. It costs more, it performs worse, and the vast majority of the incredibly rare errors that are caused end up being entirely non-critical due to the way people generally use computers.

      If you have a database server handling critical information all day then it makes sense. But hell, for the vast majority of workloads your computer is more likely to get "Aw. Snap! Something went wrong" along with a frowny face displayed in your browser. Any time a consumer is doing anything remotely important they either have a confirmation window (e.g. I don't care if just before I hit submit a magic bit changes the $2000 payment to a $20000 payment since the next thing that will happen will be an "Are you sure you wish to transfer $20000?" message), or they are performing an event so incredibly short that the odds of there being significant and lasting unrecoverable data corruption are low.

      Sidenote: That system processing my banking transaction better have ECC memory!

    4. Re:I'm not sure what's odd about that by larryjoe · · Score: 2

      What's odd is that ECC is not routinely used in all hardware.

      For a lot of systems and uses, the rate of error occurrence doesn't justify the area cost of ECC. For all fabrication processes in the last decade, error rates per SRAM bit have been decreasing faster than the increase in number of SRAM bits, meaning that the total error rates for most chip families have been decreasing. Furthermore, the vast majority of errors in SRAM never propagate to user-discernible outcomes. For these systems, the user is more interested in a lower initial price or better performance rather than a decrease in the failure rate from very infrequent to even more infrequent.

      However, ECC is ubiquitous in data centers, supercomputers, control systems, and aeronautics (where the expected error rate per SRAM bit is at least two orders of magnitude higher than for terrestrial systems). For those systems, the users are willing to pay a premium for data integrity, availability, and safety.

    5. Re:I'm not sure what's odd about that by Anonymous Coward · · Score: 0

      What's funny about your story is that that hardware (for the 6502) most likely used static memory.
      I really suspect his "flipped bit" was really a software bug.

      Doesn't make it any less troubling, but there's no way the same random bit flip would be repeated
      enough for the many iterations of debugging (unless a rom went bad).

      CAP === 'hogging'

    6. Re:I'm not sure what's odd about that by Anonymous Coward · · Score: 0

      Sometimes it can help you find a bigger mistake before it happens.

    7. Re: I'm not sure what's odd about that by Colourspace · · Score: 1

      A 6502 PC? Do you mean a Commodore 64?

    8. Re: I'm not sure what's odd about that by Colourspace · · Score: 1

      Also, I'd be fascinated how you find the smoking gun of a bit flip retrospectively? There's some Columbo shit right there.

    9. Re:I'm not sure what's odd about that by Solandri · · Score: 1

      What's odd is that ECC is not routinely used in all hardware. Depending on the conditions it can be of great help, as the rare bit flip can cause strange problems that can take ages to track down. And it works well for figuring out when you have a bad memory module -- the computer will figure it out on its own.

      Others have already covered the higher cost and performance hit of ECC RAM.

      The most visible symptom of a random bit flip is that your program crashes. The RAM a program occupies far exceeds the RAM your data occupies, so a random bit flip is more likely to affect a program than it is your data.

      Back in the 1980s and 1990s, this would cause your computer to hang or crash. You'd lose not just everything you were doing with that program, but also everything in all other programs running on the computer. This would cause a lot of screaming and cursing. Parity RAM to the rescue!

      Eventually, multitasking on OSes got good enough that a program crashing wouldn't freeze your entire computer. Just the lone program would freeze or crash. Your OS (and all your other programs) would continue as if nothing had happened. And you simply restart the crashed program. You might curse at a bit of lost time, but, oh hey, looks like Word auto-saved your work just a couple minutes before the crash. No harm done then. Occasionally the OS starts acting a little wonky, but a reboot usually takes care of that.

    10. Re: I'm not sure what's odd about that by squiggleslash · · Score: 1

      The 6502 was one of the most popular processors for personal computers in the late seventies and early eighties. The Commodore 64 would have been just one of them, and it was part of a family of personal computers starting with the Commodore PET that had a CPU in the 6502 family. Other major personal computing platforms with chips in that CPU family were the Apple II series, Acorn's BBC series, and Atari's pre-ST personal computers.

      --
      You are not alone. This is not normal. None of this is normal.
    11. Re:I'm not sure what's odd about that by Megol · · Score: 1

      Can you give an example of a personal computer using an 6502 and SRAM in the 80's? One is fine, I can't think of a single one using it as the main memory.

      The 6502 was used as it was inexpensive and (with the right software) reasonably powerful. Equipping a system with a lower cost CPU and then using enough very expensive SRAM to run real programs seems strange if not stupid.

    12. Re: I'm not sure what's odd about that by Megol · · Score: 1

      Apple II, BBC Micro, Commodore 64, Commodore PET, Atari 8bit series etc. There were many alternatives.

    13. Re: I'm not sure what's odd about that by Miamicanes · · Score: 1

      Not a 6502, but I'm pretty sure the TI99/4a used SRAM for its 256-byte "scratchpad" RAM (the only RAM its CPU could access directly).

      I know the Amiga used "Static Column" RAM, but I think SC ram was what we'd NOW consider to be "PSRAM" -- DRAM with extra onboard circuitry to do its own refreshing automatically so it "looks" (and behaves) like SRAM as far as the outside world is concerned.

    14. Re:I'm not sure what's odd about that by Anonymous Coward · · Score: 0

      Atari, I think the original Apple as well.

      CAP === 'distort'

    15. Re: I'm not sure what's odd about that by Anonymous Coward · · Score: 1

      Technically the Commodore 64 used a 6510 rather than a 6502, although in practice the only difference was that the 6510 had an extra 8-bit I/O port used for bank switching memory and talking to the tape drive.

      Commodore owned MOS Technology who made the 6502 so they made quite a few custom variants like this for various computers and devices.

    16. Re:I'm not sure what's odd about that by arth1 · · Score: 1

      Nothing odd about it. It costs more, It performs worse

      Not always. Modern ECC does the fetch and verification in parallel, negating most of the slowdown. And some registered ECC (which used to be slower) is now faster, as it does pre-fetch before the actual request.

    17. Re:I'm not sure what's odd about that by PhunkySchtuff · · Score: 3, Interesting

      The issue is not that the error is only a few dollars or even a few cents. The issue is that there is an error at all. If something doesn't balance, even if it's a few cents out, that means that there's likely an error in the logic that calculates everything.

      It's basic maths. You can't say when you're calculating 100 + 100 = 199 and call it a day because it's close enough. There is something fundamentally wrong if you're not getting the exact correct answer.

    18. Re:I'm not sure what's odd about that by complete+loony · · Score: 1

      There are so many machines out there with domain names in memory, that squatting domains that are a single bit-flip away can be quite interesting.

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
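To make the single-bit-flip squatting idea concrete, here is a rough sketch that lists every hostname label one flipped bit away from a given one. It assumes plain lowercase ASCII labels. The function name `bitflip_neighbors` is invented for this example, and flips that only change letter case are skipped because DNS treats those as the same name.

```python
import string

# Characters allowed in a hostname label (lowercase only, since DNS ignores case).
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


def bitflip_neighbors(label):
    """Return all valid labels that differ from `label` by exactly one flipped bit."""
    neighbors = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped not in ALLOWED:
                continue  # not a legal hostname character (or only a case change)
            candidate = label[:i] + flipped + label[i + 1:]
            # A label may not begin or end with a hyphen.
            if candidate.startswith("-") or candidate.endswith("-"):
                continue
            neighbors.add(candidate)
    return neighbors
```

For a two-letter label such as `ab` this yields eight candidates (`cb`, `eb`, `ib`, `qb`, `ac`, `af`, `aj`, `ar`). The count grows roughly linearly with label length.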
    19. Re:I'm not sure what's odd about that by TechyImmigrant · · Score: 1

      > What's odd is that ECC is not routinely used in all hardware

      I know why. It's a pain to implement on arbitrary logic - as opposed to memory.

      TMR is more appropriate, however the tool support for TMR is still abysmal. Synopsys should have a tmr command you can apply to a module and have it just happen. Instead you waste weeks fighting the optimizer to prevent it removing the TMR you put in manually.

      --
      I should use this sig to advertise my book ISBN-13 : 978-1501515132.
    20. Re:I'm not sure what's odd about that by arglebargle_xiv · · Score: 1

      For software, you still need to put in the countermeasures by hand, because you've also got things like control flow integrity and other aspects to deal with. Also you don't need the overhead of TMR for all values, just critical system variables and the like, so having a tool try and do it automatically doesn't work.
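As an illustration of hand-applied TMR (triple modular redundancy) on a single critical variable, here is a small sketch. The class `TmrValue` and the function `majority_vote` are hypothetical names, and it assumes the three copies are corrupted independently. Real countermeasures would also have to cover control flow, as the comment above notes.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority: each result bit matches at least two of the inputs."""
    return (a & b) | (a & c) | (b & c)


class TmrValue:
    """Keeps three copies of an integer and repairs them by majority vote on read."""

    def __init__(self, value):
        self.copies = [value, value, value]

    def write(self, value):
        self.copies = [value, value, value]

    def read(self):
        voted = majority_vote(*self.copies)
        # Scrub: rewrite all copies so a later flip starts from a clean state.
        self.copies = [voted, voted, voted]
        return voted
```

Because the vote is per bit, it survives flips in two different copies as long as they hit different bit positions. It fails only when the same bit is wrong in two copies at once.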

    21. Re:I'm not sure what's odd about that by Anonymous Coward · · Score: 0

      I bet early ones like Commodore PET with only 4K or 8K used SRAM but when RAM got "huge" like 48K or 64K it was DRAM already.
      My first "computers" were NES and Game Boy and I much later learned they had some ridiculously small RAM (a Commodore equivalent might be cartridge games on VIC 20 : only a few K in RAM but more precious K available from ROM). The Game Boy, while made in comparatively modern times (and oddly based on Z80 rather than 6502), has 8K of SRAM as its main RAM :).

    22. Re:I'm not sure what's odd about that by thegarbz · · Score: 1

      Always. Without fail.

      The process of check and verification itself was only a small part of the performance of memory. ECC memory is almost impossible to find at common desktop speeds with almost all of them being in the sub 3000MHz except for the truly ultra-expensive modules.

      Where someone wants to pay for equal speeds and chose something like a 2166 module the ECC memory invariably has far worse latency figures.

      ECC memory has a lower upper speed limit, lower than the actual standard speed capability of a modern processor, it has a higher latency for all equivalent speeds than non-ECC modules, and almost universally a higher price (the worst performing ECC modules in a speed class are often more expensive than some of the best performing non-ECC modules).

    23. Re:I'm not sure what's odd about that by Anonymous Coward · · Score: 0

      Your presence on earth are not welcome. Please leave. Preferably by slitting your own throat and drowning in your own blood.

      Be careful not to cut the dick off the guys penis that is down your throat though.

      gerald butler's impersonator

    24. Re:I'm not sure what's odd about that by Agripa · · Score: 1

      What's odd is that ECC is not routinely used in all hardware. Depending on the conditions it can be of great help, as the rare bit flip can cause strange problems that can take ages to track down.

      Whether ECC is used or not depends on the likelihood of an error and how serious the consequences will be. The number of errors depends on how much memory is used (not installed), how long it is used, and oddly enough some factor related to the access rate. Since servers tend to have much more memory and operate for longer times than desktops, ECC makes more sense for them.

      Who cares about errors while playing a game or media or doing consumer type tasks which do not tax the computer? But if my workstation is running for days at a time on important work, or I do not want to waste my time programming tracking down even one soft error, or a computer has any effect on human safety, ECC is very economical.

    25. Re:I'm not sure what's odd about that by Agripa · · Score: 1

      I think you are confusing SRAM and DRAM.

      DRAM soft error rates leveled off a couple generations ago. SRAM soft error rates are a couple orders of magnitude higher and have remained so for integrated SRAM caches. A discussion of the difference and why it exists would be interesting.

      Other than some odd exceptions, integrated SRAM caches have been protected by ECC or parity almost since they were first used.

    26. Re:I'm not sure what's odd about that by larryjoe · · Score: 1

      I think you are confusing SRAM and DRAM.

      DRAM soft error rates leveled off a couple generations ago. SRAM soft error rates are a couple orders of magnitude higher and have remained so for integrated SRAM caches. A discussion of the difference and why it exists would be interesting.

      DRAM per Mbit error rates have not dropped as precipitously as SRAM error rates. Over the last decade, SRAM error rates have dropped by a few orders of magnitude, faster than the increase in the total number of SRAM bits on a chip due to scaling and chip area increase. Ten years ago, the SRAM error rate was quite a bit higher than the DRAM error rate, by about an order of magnitude. Especially with the introduction of FinFET/tri-gate, SRAM error rates have plummeted and are now somewhat lower than that for DRAM. Furthermore, SRAM error rates continue to drop with each process node.

      Other than some odd exceptions, integrated SRAM caches have been protected by ECC or parity almost since they were first used.

      This is true for users that care and still not true for low-cost users. Xeons have ECC and consumer processors don't. GPUs in supercomputers have ECC, and GPUs in gaming systems don't.

  7. Why is this so strange? by Brett+Buck · · Score: 2

    It seems to make good sense to put in some protections against register or other bit flips, they do happen from time to time. He probably meant cosmic rays instead of gamma rays, but that definitely can happen and I have spent many, many hours of my life putting things in software that detect these and recover properly. I have one processor type that has something like this about once a month, very consistently, over several decades.

    1. Re:Why is this so strange? by Anonymous Coward · · Score: 0

      It's not clear how this is helpful. The corruption can happen outside of S1 as well. You might as well always invalidate the cache or just invalidate the cache periodically.

    2. Re:Why is this so strange? by Anonymous Coward · · Score: 0

      Maybe the chip was expected to be asleep for long periods.

    3. Re:Why is this so strange? by Anonymous Coward · · Score: 0

      Hulk-Smash-You says gamma rays -- fool!

      CAP === 'biplanes'

    4. Re: Why is this so strange? by Colourspace · · Score: 1

      Why are people still doing this captcha thing? Most of the time they are completely tenuous, if not utterly irrelevant.

  8. Gamma rays? Did it get big and turn green? by Anonymous Coward · · Score: 0

    Maybe they were afraid it would get angry.

    You wouldn't like it if it got angry.

  9. Sure they did by rsilvergun · · Score: 4, Insightful

    that's what their embedded OSes were for. AFAIK this was in their consumer code base.

    If I had to guess this was because of a real processor bug Intel didn't want to admit to. I remember when Win XP hit the shop I was at was flooded with dead computers from upgrades. Manufacturers had been selling bad ram in computers for years. By default Win98 would only make use of the first 64 MB of ram in most cases (there was a registry hack I've long since forgotten to force it to use your entire ram before going to the cache).

    Anyway, XP's installer would copy the CD into ram to make the (very slow) install run faster. So you got to find out your OEM stuck bad ram in your box the hard way when the installer blew up. The best part was the upgrade couldn't roll itself back gracefully. I don't remember all the steps to fix it but it was a pain. We just did software where I was at too so it was fun having to send them somewhere else to get new ram and have them yell at me that the ram was fine. Good times.

    --
    Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/
    1. Re: Sure they did by Anonymous Coward · · Score: 1

      Hey anon, I've seen enough of their posts to recognize the username. What the fuck are you famous for? Get over yourself

    2. Re:Sure they did by Anonymous Coward · · Score: 0

      Good times.

      While at the same time I, a total newb, downloaded and installed FreeBSD over anon http. From people I never even knew, no less. Never a hiccup and a solid OS with X. Well, I didn't have Office and I couldn't get iTunes or Apple Store, but the scars from this healed in time. About a day.

    3. Re:Sure they did by msauve · · Score: 5, Interesting

      "If I had to guess this was because of a real processor bug Intel didn't want to admit to."

      Alpha particles affecting memory is a known, but uncommon, issue. This code invalidated the cache when coming out of S1 (sleep) state. The deeper (S2+) sleep states already invalidate the cache. The longer the processor is in a static state (sleep), the more chance that an alpha particle hit will flip a bit. Invalidating the cache when coming out of a sleep state has no meaningful impact on performance. The time to re-fetch is nothing compared to the amount of time spent sleeping. Of course, there are many more bits in RAM which could be affected, so a problem is more likely to occur there, which this doesn't address.

      But it hurts nothing, avoids an (admittedly rare) issue, and is but a single instruction. I wonder why they removed it?

      --
      "National Security is the chief cause of national insecurity." - Celine's First Law
    4. Re:Sure they did by viperidaenz · · Score: 1

      S1 is supposed to keep the cache fully powered up. How's it going to make any difference if an alpha particle hits the cache memory cells while the core clock has stopped?

    5. Re: Sure they did by Anonymous Coward · · Score: 0

      The irony of a comment criticizing an AC coming from an AC is palpable.

    6. Re:Sure they did by Anonymous Coward · · Score: 0

      Bullshit. The parent post is talking about an experience from 20 years ago. At that time, FreeBSD was still a pain in the ass to configure even if you weren't trying to use the buggy USB support they just managed to squeeze into the 3.x release.

    7. Re:Sure they did by msauve · · Score: 2

      "How's it going to make any difference if an alpha particle hits the cache memory cells while the core clock has stopped?"

      It's not clear what you're asking. If a bit in the cache gets changed, it corrupts the instruction or data. That the cache is powered up makes no difference.

      --
      "National Security is the chief cause of national insecurity." - Celine's First Law
    8. Re:Sure they did by mlyle · · Score: 1

      It's just that the stuff will have sat for an indeterminate, long time while the clock is stopped-- providing an unusually long window for a bit to flip-- and resuming even from S1 is a relatively costly operation.

      I think overall it is silly, but if you have ECC RAM and non-ECC cache, and spend most of the time in S1, it's not completely crazy.

    9. Re:Sure they did by Anonymous Coward · · Score: 0

      Alpha particles affecting memory is a known, but uncommon, issue

      I really doubt it would be alpha particles- alpha particles are very short range, a couple cm in air. Alpha particles are stopped by a sheet of paper.

      https://en.wikipedia.org/wiki/...

    10. Re:Sure they did by Anonymous Coward · · Score: 0

      Well it can't be gamma. Everybody knows gamma doesn't affect anything unless you really piss it off. If that happens, bail out before it turns green.

    11. Re:Sure they did by viperidaenz · · Score: 1

      I'm saying the risk of cache corruption from gamma rays should be no different between S0 and S1.

    12. Re:Sure they did by svirre · · Score: 2

      The usual source of alpha emissions affecting memory in semiconductor devices is the capsule of the device itself.

    13. Re:Sure they did by Anonymous Coward · · Score: 0

      You don't have to go all the way to space for an explanation. At the time especially, the material used for solder balls could also release alpha particles; in fact, one of the options for one of the vendors we looked at sourcing ASICs from were special "low-alpha" solder balls that would supposedly reduce the chances of release of alpha particles.

      Source: was a processor physical design engineer way back in 1999.

    14. Re:Sure they did by TechyImmigrant · · Score: 1

      >Alpha particles affecting memory is a known, but uncommon, issue.

      A known issue for plastic packaging. The alpha emitters are in the plastic.

      --
      I should use this sig to advertise my book ISBN-13 : 978-1501515132.
    15. Re:Sure they did by msauve · · Score: 1

      "the risk of cache corruption from gamma rays should be no different between S0 and S1."

      But invalidating the cache when returning from S1 removes any (even remote) risk. And there's no downside. Better is better.

      --
      "National Security is the chief cause of national insecurity." - Celine's First Law
    16. Re:Sure they did by phantomfive · · Score: 1

      If I had to guess this was because of a real processor bug Intel didn't want to admit to.

      I was wondering that too. The article suggests it is true.

      --
      "First they came for the slanderers and i said nothing."
    17. Re: Sure they did by Anonymous Coward · · Score: 0

      Ok guys, alpha particles can't penetrate a piece of paper. Alpha particles that cause memory parity errors come from poor material selection inside the chip itself, not from cosmic rays. Cosmic rays that penetrate chips are high energy gamma rays.

    18. Re:Sure they did by Anonymous Coward · · Score: 0

      I was unaware that they were in the habit of using materials that suffer from alpha decay when manufacturing electronics (hint, they don't). This article is about gamma radiation, and lighter elements such as potassium and such which are known to be radioactive are beta emitters or use other mechanisms of radioactive decay. Alpha decay tends to be for heavier elements. The lightest element capable of it is tellurium, and even then, we're talking about it being extremely rare there. Radon is the lightest that is considered to be an alpha emitter. Polonium, the element used to poison a former KGB guy is a beta emitter, which famously was why it was so hard to detect originally.

    19. Re: Sure they did by Anonymous Coward · · Score: 0

      Some of it is high-energy gamma radiation yes. But the bulk is cosmic rays (charged particles), not gamma rays. Most of the cosmic rays come from the Sun.

    20. Re:Sure they did by spinozaq · · Score: 1

      Polonium, the element used to poison a former KGB guy is a beta emitter, which famously was why it was so hard to detect originally.

      That's not true. The KGB guy, Alexander Litvinenko, was poisoned with Polonium 210, which is a near pure alpha emitter with a small bit of gamma. He died from acute alpha radiation poisoning. The gamma is very detectable, you just have to know enough to look for it. After they figured that out they were able to accurately estimate the intake dosage from a gamma ray measurement.

    21. Re:Sure they did by epine · · Score: 1

      It's not clear what you're asking. If a bit in the cache gets changed, it corrupts the instruction or data. That the cache is powered up makes no difference.

      Wrong.

      Next contestant, please.

      You're assuming the cache has no parity or ECC mechanism where active use would eliminate single-bit errors before they accumulate into undetectable errors, whereas pickling the cache in quiescent warm brine would not.

    22. Re:Sure they did by viperidaenz · · Score: 1

      There's a performance and power consumption impact. Otherwise they wouldn't have any cache at all.

    23. Re:Sure they did by msauve · · Score: 1

      But see, that's just it. I'm not assuming anything, it's you who are making the ASSumption. The odds of a double hit (which might pass the parity check) are multiplied billions (GHz) of times when it's sitting there in static sleep for a second or two.

      --
      "National Security is the chief cause of national insecurity." - Celine's First Law
    24. Re:Sure they did by msauve · · Score: 1

      I understand how, not knowing how sleep states work, you would think that.

      --
      "National Security is the chief cause of national insecurity." - Celine's First Law
    25. Re: Sure they did by Anonymous Coward · · Score: 0

      you probably know the username because all of the stupid shit he/she has posted...

    26. Re:Sure they did by viperidaenz · · Score: 1

      So you don't understand that coming out of a sleep state and not having any data in the cache at all results in more stalls and main memory access and how that translates to a performance hit and more power consumption?

    27. Re:Sure they did by msauve · · Score: 1

      No, you simply don't understand that any delay needed to fetch from RAM when coming out of a sleep measured in seconds is absolutely meaningless in the real world.

      --
      "National Security is the chief cause of national insecurity." - Celine's First Law
    28. Re:Sure they did by Agripa · · Score: 1

      Of course, there are many more bits in RAM which could be affected, so a problem is more likely to occur there, which this doesn't address.

      High performance integrated SRAM is orders of magnitude more susceptible to radiation-induced soft errors than DRAM, which is why SRAM caches have included ECC or parity protection almost since they were first used.

      Oddly enough, DRAM has actually become more resistant to radiation-induced soft errors over the last couple of generations, but this is more than cancelled by the increasing amount of DRAM used.

    29. Re:Sure they did by Agripa · · Score: 1

      S1 is supposed to keep the cache fully powered up. How's it going to make any difference if an alpha particle hits the cache memory cells while the core clock has stopped?

      The cache is protected against data corruption by ECC or parity; however, if multiple bit errors accumulate within one word, this protection fails. During normal operation, the cache is continuously scrubbed of errors, so this is not a problem.
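To make the "multiple bit errors within one word" point concrete, here is a minimal sketch of a single even-parity bit over an 8-bit word. The word value and the flipped bit positions are arbitrary illustrations, not taken from any real cache design. One flipped bit is caught, but a second flip in the same word restores the parity and the corruption goes unnoticed.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """Check a word against the parity bit stored when it was written."""
    return parity_bit(word) == stored_parity


def flip(word: int, bit: int) -> int:
    """Return the word with one bit inverted, modelling a soft error."""
    return word ^ (1 << bit)


# Hypothetical cache word and its parity, computed at write time.
original = 0b10110010
stored = parity_bit(original)

one_flip = flip(original, 3)    # a single upset
two_flips = flip(one_flip, 6)   # a second upset in the same word

print(parity_ok(original, stored))    # clean word passes
print(parity_ok(one_flip, stored))    # single flip is detected
print(parity_ok(two_flips, stored))   # double flip passes the check
print(two_flips == original)          # ...yet the data is different
```

Scrubbing helps because it catches each word while it still has at most one bad bit.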

    30. Re:Sure they did by Agripa · · Score: 1

      I was unaware that they were in the habit of using materials that suffer from alpha decay when manufacturing electronics (hint, they don't).

      They try to avoid radioactive materials in packaging for that very reason; however, sometimes contamination occurs anyway.

      Back during about the 64kbit DRAM generation, this was a huge problem with ceramic packaged parts increasing the demand for plastic packaging despite doubts about its reliability.

  10. probably cosmic rays rather than gamma rays by starless · · Score: 2

    A real gamma ray wouldn't do much, and would just pass through, unless it pair converted to electron and positron.
    But cosmic rays (charged particles) would be more likely to interact.

    1. Re: probably cosmic rays rather than gamma rays by Anonymous Coward · · Score: 0

      If you take a picture of nothing (i.e. lens cap on) with a digital slr and look at the raw data, you might see a few short, curled lines of slightly brighter pixels, this is from cosmic rays interacting with the ccd chip. You would certainly see this in cooled ccds for astrophotography where you can process out many other forms of noise, like thermal noise on the chip.

    2. Re:probably cosmic rays rather than gamma rays by Anonymous Coward · · Score: 0

      Depends on the flux of the gamma rays. One is likely not to interact. Use a chip near the core of a reactor and at some point the gamma rays will cause electrons to appear where you didn't expect and it may have an effect. Under my desk isn't a problem, even in a banana warehouse. In a lab that deals with radioactive materials, it might just come into play. Not sure where the threshold is, as the metal casing offers some protection.

      But agreed... this is probably for applications in space where cosmic rays are more of an issue (alpha and beta). High energy x-rays also become a problem at some flux as well.

    3. Re:probably cosmic rays rather than gamma rays by Anonymous Coward · · Score: 0

      Not to flip a bit. Energy deposition from a charged muon of a micron-sized gate is not enough energy to flip the gate. If the gate were made of a boron doped silicon, where the incident particle is a neutron, then the alpha particle emitted from the neutron absorption in boron would have more than enough energy to flip the bit.

    4. Re:probably cosmic rays rather than gamma rays by Anonymous Coward · · Score: 1

      Gamma rays lose energy while passing through materials by knocking electrons around. This can involve many collisions and many displaced electrons depending on the energy of the gamma ray. Higher energy photons will go a ways without interacting much, but as they lose energy collisions can become more frequent and at some point they can quickly dump the rest of their energy in a smaller volume. Charged particles stop by practically the same process, just interact more strongly and so are more likely to dump all of their energy in a smaller volume. But silicon is still used to detect things like gamma rays and x-rays, with varying efficiency depending on energy and volume of the detecting element. Pair-conversion isn't necessary or involved at all usually (usually you need gamma rays > 20 MeV for pair production to ramp up), and you can get a mishmash as a cosmic ray can produce an atmospheric shower involving a bunch of gamma rays.

    5. Re:probably cosmic rays rather than gamma rays by Anonymous Coward · · Score: 0

      And it makes you wonder whether the programmer knew what he was about in making the comment.

    6. Re:probably cosmic rays rather than gamma rays by mnmn · · Score: 1

      I agree. I came here to comment on 'why is this strange' but looks like many slashdotters (at least ones with physics backgrounds) feel similarly.

      It seems ridiculous when you take a cpu rma to Intel for an rca on some OS crash, but their response is that the cpu is fine, it was a cosmic particle. But it's true, and statistically this can happen to any bit in any register. Especially with the lithography processes producing ever smaller gates with few atoms manning the gate/bit.

      --
      "Give orange me give eat orange me eat orange give me eat orange give me you." -Nim Chimpsky
  11. And what's the context? by drinkypoo · · Score: 1

    If it's being done rarely, and before exceptionally critical operations, then maybe it makes sense. Although, if someone bothered to take it out, then it was probably happening too often and thus affecting performance...

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    1. Re: And what's the context? by Anonymous Coward · · Score: 0

      Do you really like random bsod?

    2. Re: And what's the context? by drinkypoo · · Score: 1

      Do you really like random bsod?

      If I did, I'd disable the cache ECC that is generally highly successful at protecting users from that kind of problem. I don't know or care when Intel implemented it, but AMD did it at least since the K7.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  12. Re:Microsoft's never doing any military or space w by Anonymous Coward · · Score: 3, Informative

    Reading the full story, it's rather strongly implied that it was actually a workaround for a bug in the processor which the manufacturer hadn't found yet, and was blaming on cosmic rays.

  13. Gamma ray bit flips are for cows. by Anonymous Coward · · Score: 0

    You are all cows. Cows say moo. MOOOOO! MOOOOO! Moo cows MOOOOOOO! Moo say the cows. YOU CACHE COWS!!

  14. Actually makes good sense by Crashmarik · · Score: 4, Insightful

    Your cpu has been asleep for an apriori unknown amount of time. When you are powering back up, you'd absolutely want to clear the cache to purge any potential bit flips. It's a relatively cheap way of ensuring data integrity.
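As a rough illustration of why the sleep duration matters, the sketch below assumes upsets in a given cache word arrive as a Poisson process. The rate is a made-up placeholder, not a measured figure; only the scaling is of interest. The chance of any upset grows roughly linearly with sleep time, while the chance of two or more (the case that slips past single-bit protection) grows roughly with its square.

```python
import math


def p_any_upset(mu: float) -> float:
    """P(at least one upset) when mu upsets are expected during the sleep."""
    return -math.expm1(-mu)


def p_two_or_more(mu: float) -> float:
    """P(at least two upsets) when mu upsets are expected during the sleep."""
    if mu < 1e-3:
        # Series form of 1 - exp(-mu) * (1 + mu); avoids cancellation for tiny mu.
        return mu * mu / 2 - mu ** 3 / 3
    return 1.0 - math.exp(-mu) * (1.0 + mu)


# Hypothetical upset rate per word per second, chosen only to show the scaling.
rate_per_second = 1e-9

for sleep_seconds in (1, 60, 3600):
    mu = rate_per_second * sleep_seconds
    print(sleep_seconds, p_any_upset(mu), p_two_or_more(mu))
```

Under this model, doubling the time a word sits unchecked roughly quadruples the odds of an undetectable double hit.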

    1. Re:Actually makes good sense by Anonymous Coward · · Score: 0

      Yeah, right, better wipe all the RAM too, it's a much bigger target after all.

    2. Re:Actually makes good sense by Anonymous Coward · · Score: 0

      OK, so you refresh the cache. Where do you refresh it from? DRAM. What makes you think that the DRAM that has been sitting there idle for however long is any less likely to show cosmic ray damage than the much more robust SRAM in your CPU cache? Odds are the safest place for those bits is in the cache.

      dom

    3. Re:Actually makes good sense by Anonymous Coward · · Score: 0

      What does apriori here mean?

    4. Re:Actually makes good sense by Anonymous Coward · · Score: 0

      Personally I think a good dividing line is how often interrupts are being serviced. If there is some source of activity like that, leaving the cache intact is going to improve response speed, perhaps quite a bit. If it's shut down farther than that point, nuke the cache, if it isn't already.

  15. Key info missing from comment by Anonymous Coward · · Score: 0

    A reference to the *specific* communication from the chip vendor should be clearly visible to anyone auditing the code’s history.

    Since it’s not in-line in the code, I hope it’s somewhere else in the same “check-in” to the code repository. Presumably only MS and their partners have access to that.

    Likewise, when it was commented out there needs to be a corresponding justification, such as a reference to an additional communication from the vendor or an internal memo approving ignoring vendor advice.

    Without such justification, code-auditors can’t easily determine if this was a put in as a joke, because of a misunderstanding, or for some other reason.

  16. Laptop aboard the International Space Station ? by Laxator2 · · Score: 5, Informative

    I think they use laptops on the International Space Station and there you are not protected from cosmic rays by the blanket of the Earth's atmosphere. Just read up on the phosphenes experienced by the astronauts as they try to go to sleep.

    Not sure if "gamma rays" is the correct term here, as high-energy protons are most likely to create a local change in electric charge density. With modern processors being built on the 14-nanometre process this becomes a serious problem. All the processors that are used in spacecraft and control vital functions are radiation-hardened. That usually means older fabrication processes (wider paths reduce the probability of cross-talk) and amorphous silicon (a monocrystal can sustain permanent damage from a particle of high enough energy).

    Overall, it does make sense if it is meant to be used in space.

    1. Re:Laptop aboard the International Space Station ? by Anonymous Coward · · Score: 1

      I think they use laptops on the International Space Station and there you are not protected from cosmic rays by the blanket of the Earth's atmosphere. Just read up on the phosphenes experienced by the astronauts as they try to go to sleep.

      On ISS you are still protected by earths magnetic field. Leave orbit and your problems get much worse.

    2. Re:Laptop aboard the International Space Station ? by Anonymous Coward · · Score: 0

      Hi, I'm a software engineer working on embedded computers flying on satellites. Single upset events are a common occurrence up in space, even in low earth orbit. And modern COTS chips are more vulnerable than chips of yore. RADHARD chips are old, slow, and expensive. They have fat, thick lines that aren't as easily bumped by radiation. Satellites up in GEO might have it worse, but LEO still has problems.

    3. Re:Laptop aboard the International Space Station ? by Agripa · · Score: 1

      With modern processors being built on the 14-nanometre process this becomes a serious problem.

      Susceptibility is more complicated than just the minimum feature size. It was a serious problem generations ago.

      Denser processes use gate insulators with a higher dielectric constant to store more charge and also provide more drive for a given area. These things make a process more resistant to radiation-induced soft errors. The same things caused the susceptibility of DRAM processes to level off or even decrease slightly starting a couple of generations ago.
       

    4. Re:Laptop aboard the International Space Station ? by Agripa · · Score: 1

      Hi, I'm a software engineer working on embedded computers flying on satellites. Single upset events are a common occurrence up in space, even in low earth orbit. And modern COTS chips are more vulnerable than chips of yore. RADHARD chips are old, slow, and expensive. They have fat, thick lines that aren't as easily bumped by radiation. Satellites up in GEO might have it worse, but LEO still has problems.

      The last time I checked, the reason radiation hardened processes used much larger minimum feature sizes was simply because it was not economical to produce a denser radiation hardened process. More modern fabrication processes require an economy of scale which is not available for such a small market.

  17. Cyrix by Anonymous Coward · · Score: 0

    Sounds like something Cyrix must have asked for, wondering why machines with their CPUs kept locking up

  18. Re: ECC everywhere by davidwr · · Score: 2

    RAM is cheap enough that ECC or similar tech should be routine. I’ll pay 10-15% more per GB for this.

    --
    Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
  19. Radioactive Packaging Material by tiffanytimbric · · Score: 1, Insightful

    Yeah, I'll bet this was before they discovered that their processor packaging material was radioactive and that it was randomly flipping bits. Seriously, radioactive RAM was one culprit which ran Sun Microsystems, Inc. out of business. It took them years to find it. They even started adding ECC to their motherboard data paths and checking whether their data centers were near nuclear research facilities. By the time they found it, it was too late. ...that and they should have ditched Solaris for Linux, but...

    1. Re: Radioactive Packaging Material by Anonymous Coward · · Score: 0

      Sounds like industrial espionage by their competitors.

    2. Re:Radioactive Packaging Material by Anonymous Coward · · Score: 0

      I remember being given a Solaris kernel setting by Sun support to increase the CPU cache refresh rate to prevent bit flipping errors as a result of "solar rays"... It did seem to fix the problem :-)

  20. Self-reply, after reading TFA by davidwr · · Score: 2

    Should’ve read the article first, where the author explained that oddly-commented code similar to this was used TEMPORARILY on early processor revisions or on early microcode revisions.

    In these cases, the check-in logs or the context of the code - say, it’s in a block of code that only runs on processors that are in pre-production at the time - should make it clear that this is “work-around” code that we expect to be removed soon.

    --
    Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
    1. Re:Self-reply, after reading TFA by Anonymous Coward · · Score: 1

      I bet the requirement is still there from the manufacturer, but because INVD invalidates all levels of cache, the performance hit for some latency critical code that is supposed to run right after return from S1 is too much. So they chose to not follow the manufacturer recommendation and take the chance that the system does not crash with some instruction mutating into an illegal or operand reference pointing to a wrong register or, worst case, the cache line valid bit gets set and the line of trash bits gets executed as instructions instead of fetching from memory. This is the vector that those transient power attacks use to break into a system.

  21. Sorry for the weird characters by davidwr · · Score: 1

    Apparently, the apostrophe got turned into a curly-apostrophe. Bad computer.

    Still my fault for not previewing.

    --
    Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
    1. Re:Sorry for the weird characters by Anonymous Coward · · Score: 0

      Apparently, the apostrophe got turned into a curly-apostrophe. Bad computer.

      Still my fault for not previewing.

      No, it's not. Please don't ever think that.

      What *is* your fault is being a stupid 100% Pure African Nígger!

  22. Re: ECC everywhere by arth1 · · Score: 1

    The problem is that you need a CPU and north bridge that can handle it, which adds to the initial costs. For Intel, for example, a Xeon CPU costs (artificially) a good deal more than a comparable speed i3/5/7/9, which is an upfront cost that consumers aren't willing to eat, and they tend to choose either a cheaper CPU or a faster CPU for the same kind of money.

  23. Artificial problem. by Anonymous Coward · · Score: 0

    The i3 is a Xeon with cut down features. Hell the 8xxx series actually HAVE ECC enabled, but you need a C236 (or whatever the recent model is) server motherboard to support it. A completely artificial requirement, given that AMD has supported it since the Socket 939 chips, and Intel supported it on the 440BX and FX chipsets, which were used in consumer hardware, being replaced by the 810/815 in part to remove the ECC capability which was cutting into their server sales (or so they claimed.)

    Even most ARM processors if you read the spec sheet have ECC support included by default today, even when the majority of products decide not to include it.

    Lack of ECC is entirely artificial today. You can find AMD motherboards under 100 dollars with ECC capabilities and chip-wise everything has it in hardware, even if the support is disabled when sold to the consumer.

    1. Re:Artificial problem. by Anonymous Coward · · Score: 0

      It's artificial because you do not know the reasons. It is not artificial because the i-Core design's "cut down features" explicitly removes certain dependencies that ECC requires. It is not artificial because for Intel to be first to market, they decide to continue running with previous technical debt. To compare to Intel to AMD is the exact reason why market choice exists. AMD has support for it but you do not come close to the performance Intel provides. Then Intel wins on cost because you get same or slightly slower performance without expensive northbridges.

      To finalize my point, you harp on 810/815 as a conspiracy to remove ECC then you completely neglect the integrated graphics that 440BX/FX did not have to deal with. Food for thought, architecture grows and sometimes things are lost and the "absurd cost" is actually reality smacking you in the face.

    2. Re:Artificial problem. by dshk · · Score: 2

      I do believe missing ECC support is an artificial restriction at Intel. AMD has ECC. One of the reasons I always buy AMD is that I can be sure that all processor features of that generation are enabled in even their cheapest processor. No surprises. Btw. modern processors include most/all of the functionality of the north bridge. Regarding performance, for the same cost AMD usually provides more performance, specifically similar single-threaded performance and better multi-threaded performance.

  24. Silly idea, if true by dltaylor · · Score: 1

    Sounds like a smoke screen for something else.

    If the cache is susceptible to random gamma rays, or, more likely, cosmic rays, and has no ECC, it is NEVER trustworthy, and should be permanently disabled.

    It's like the Intel floating point bugs (yes, plural). Since the end user has no idea WHICH of the operations will produce an erroneous result, NONE of the operations' results are usable, ever.

    Could be worse. Intel once had a "genius" purchasing agent that got a "good deal" on clay for the ceramic package of EPROMs. Devices didn't hold their state for particularly long, however, since the clay was mildly radioactive.

    1. Re:Silly idea, if true by Anonymous Coward · · Score: 0

      Sounds like a smoke screen for something else.

      If the cache is susceptible to random gamma rays, or, more likely, cosmic rays, and has no ECC, it is NEVER trustworthy, and should be permanently disabled.

      It's like the Intel floating point bugs (yes, plural). Since the end user has no idea WHICH of the operations will produce an erroneous result, NONE of the operations' results are usable, ever.

      Could be worse. Intel once had a "genius" purchasing agent that got a "good deal" on clay for the ceramic package of EPROMs. Devices didn't hold their state for particularly long, however, since the clay was mildly radioactive.

      The reality is having error detection and/or correction on every cache, register, bus, lane, memory, storage, and everything between is hella expensive. So your desktop processor has relatively little and processors meant for servers or embedded systems have widely varying amounts.

      PS
      Listening to /. argue about Intel/AMD is like listening to kids arguing over Ford/Chevy.

    2. Re: Silly idea, if true by Anonymous Coward · · Score: 0

      Your PS makes it sound like there's an argument to be had. Intel shat the bed with Meltdown, they're done. AMD Ryzen forever.

    3. Re:Silly idea, if true by Agripa · · Score: 1

      If the cache is susceptible to random gamma rays, or, more likely, cosmic rays, and has no ECC, it is NEVER trustworthy, and should be permanently disabled.

      While it is shut down, the cache is not being continuously scrubbed by ECC or parity allowing bit errors to accumulate and defeat the ECC or parity after it is powered up. Invalidating and reloading the contents of the cache makes perfect sense in this situation.
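A minimal sketch of how accumulated errors can defeat a correcting code, using the textbook Hamming(7,4) single-error-correcting code. This is an assumption for illustration only: real caches typically use wider SECDED codes, which detect double errors and are defeated at three. One flipped bit is repaired, but two flips in the same word make the decoder "correct" a third, innocent bit and hand back wrong data.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    c = [0] * 8                      # index 0 unused, positions 1..7
    c[3], c[5], c[6], c[7] = data    # data bits sit at non-power-of-two positions
    c[1] = c[3] ^ c[5] ^ c[7]        # parity over positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]        # parity over positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]        # parity over positions with bit 2 set
    return c[1:]


def decode(codeword):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = [0] + list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos          # XOR of set positions is 0 for a valid word
    if syndrome:
        c[syndrome] ^= 1             # flip the position the syndrome points at
    return [c[3], c[5], c[6], c[7]]


def flip(codeword, pos):
    """Return a copy of the codeword with position pos (1..7) inverted."""
    out = list(codeword)
    out[pos - 1] ^= 1
    return out


data = [1, 0, 1, 1]
stored = encode(data)

single = flip(stored, 5)             # one upset: decoder repairs it
double = flip(flip(stored, 3), 5)    # two upsets: decoder miscorrects

print(decode(single) == data)
print(decode(double) == data)
```

A scrubber that rewrites each word while it still holds a single error keeps the code inside the range where correction actually works.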

    4. Re:Silly idea, if true by Agripa · · Score: 1

      The reality is having error detection and/or correction on every cache, register, bus, lane, memory, storage, and everything between is hella expensive. So your desktop processor has relatively little and processors meant for servers or embedded systems have widely varying amounts.

      With some exceptions, most Intel consumer and server processors use the exact same die so the ECC protection is there. The major difference is that with some exceptions, ECC for external memory is disabled on non-Xeon products.

      One of the exceptions is for lower performance processors like the i3 series which do not compete with Xeon products. Fancy that.

  25. Re:Microsoft's never doing any military or space w by mikael · · Score: 3, Interesting

    One component that many defence contracts required was a Nuclear Event Detector. This little component would set a pin when it detected the precursor of a nuclear detonation. What the system did next was up to the vendor, but usually it would involve a shutdown and disconnect of ports and power lines.

    --
    Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
  26. Re:Microsoft's never doing any military or space w by mikael · · Score: 1

    I know stray radar microwaves can take out a PC. There was a weather radar station close to where I lived. Whenever my smartphone app received a heavy rain warning, my gaming PC would crash seconds before.

    --
    Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
  27. Risk/Reward by QuietLagoon · · Score: 2

    Nowadays, it probably is far, far more likely that Microsoft's horrendous Windows QA will result in bad data than stray gamma rays flipping bits in a sleeping cache.

  28. Commented out code by DigressivePoser · · Score: 5, Insightful
    The comment block was descriptive and necessary, but it should also include processor errata info to trace back to published documentation. Perhaps this was something newly discovered and the processor and software engineers were in close communications.

    "Less than three weeks later, the INVD instruction was commented out," writes Chen. "But the comment block remains.

    I don't like seeing commented out code. If it's commented out then it has no business being in the source code file - even if there's an explanation in the comment block. The code's removal along with its comment block should be documented in whatever revision control system is in use. Maybe I'm biased because I worked in safety-critical environments where commented out code is a no-no.

    1. Re:Commented out code by Anonymous Coward · · Score: 0

      This however is Windows. A synonym for bad design, at least in versions until Windows 7.

    2. Re:Commented out code by Anonymous Coward · · Score: 0

      This however is Windows. A synonym for bad design, at least in versions until Windows 7.

      I see you have never seen Windows 10.

    3. Re: Commented out code by Anonymous Coward · · Score: 0

      No. Nor do I want to.

    4. Re: Commented out code by functor0 · · Score: 2

      On occasion, I've had to keep the commented out code with comment explanation why this code must not occur. Otherwise, people keep coming in trying to fix code that's not broken.

    5. Re: Commented out code by TechyImmigrant · · Score: 2

      On occasion, I've had to keep the commented out code with comment explanation why this code must not occur. Otherwise, people keep coming in trying to fix code that's not broken.

      This.

      I've left the wrong code in, commented with a detailed explanation as to why it's wrong, so someone doesn't come and 'fix' it again.

      --
      I should use this sig to advertise my book ISBN-13 : 978-1501515132.
    6. Re:Commented out code by thegarbz · · Score: 1

      Or maybe someone made a mistake. The specification seems to imply you need to flush the cache *BEFORE* entering the S1 state and the hardware is responsible for the rest:

      "15.1.1 S1 Sleeping State
      The S1 state is defined as a low wake-latency sleeping state. In this state, all system context is preserved with the exception of CPU caches. Before setting the SLP_EN bit, OSPM will flush the system caches. If the platform supports the WBINVD instruction (as indicated by the WBINVD and WBINVD_FLUSH flags in the FADT), OSPM will execute the WBINVD instruction. The hardware is responsible for maintaining all other system context, which includes the context of the CPU, memory, and chipset. "

    7. Re:Commented out code by Anonymous Coward · · Score: 0

      "If it's commented out then it has no business being in the source code file..."

      So, professional programmer here, 30 years, and this is wrong. It is so wrong that it is Not Even Wrong. I use commented out code all the time, routinely, wouldn't even think twice about it.

      Commented code shows that a previous programmer actually thought about a certain situation, planned for it, solved it one way (or not), then changed their mind. It shows learning, growth and progression. Commented code is routinely used as part of change control. Commented code is often used to show what direction code could develop in. Commented code can show that the customer's needs have changed.

      Another use-case is something that shouldn't happen, but does in the real world. You get a minimally documented or undocumented API, and you've spent some time reverse engineering the parameters. You can use commented code to document what you've learned about that API.

      Now I fully expect a reply to the effect that I "don't know what I'm talking about", "you're doing it wrong", and "that isn't your responsibility, or should not happen". OK, just know this. My experience tells me otherwise, and my customers think that I'm pretty good at my job.

  29. Re:Microsoft's never doing any military or space w by Anonymous Coward · · Score: 0

    If the radar was making your PC crash, it'd be crashing constantly, I'd imagine.

  30. Re: ECC everywhere by vadim_t · · Score: 2

    Or you could buy AMD, which seems to have excellent support for it.

  31. Re:Microsoft's never doing any military or space w by Mister+Transistor · · Score: 3, Interesting

    Some of the newer Doppler WX radars do a rapid narrow scan in some modes of operation for some fine examination of a particular front or phenomenon they want to image with more detail or using some more specialized mode like water vapor density, etc.

    So, the usual low(er) power scanning 'round and 'round, like radars usually do, probably isn't enough to trigger this poster's problem, but if the high-powered focused scans happen to be in his direction, well, bad news that day.

    Perhaps some Meteorologist can weigh in on this mode of operation with the radars, I don't know enough about them to be more specific.

    --
    -- You are in a maze of little, twisty passages, all different... --
  32. Bogus story, fake news. by 140Mandak262Jamuna · · Score: 1
    Microsoft code does not contain comments.

    To thwart lawyers finding out the true intentions of the strategies, Bill Gates decreed that the code should not have comments. Famously he said, "I am paying you to write code, not comment."

    --
    sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
    1. Re:Bogus story, fake news. by Bite+The+Pillow · · Score: 2

      http://atdt.freeshell.org/k5/s...

      I don't feel like html today for you.

  33. Re:Microsoft's never doing any military or space w by Anonymous Coward · · Score: 0

    If it was very marginally bad, then just changing the parking of trucks and cars near by would change what reflections you have overlapping and interfering. Multipath signals can be a real mess.

  34. cool story by Anonymous Coward · · Score: 0

    I used to work on Ninnle Linux. Intel, Cyrix, AMD, etc officially didn't acknowledge us. But I was friends with some key engineers, met through swinging, keyparties, hotwifing, etc. So they gave me inside info and weird tips like that (IIRC, cache could get corrupted but it wasn't gamma rays). Today it would be called catfishing, but I would target their top engineers and have them fuck my "wife" (actually an escort) while I watched.

    Afterwards, we'd hang out drinking beers and smoking weed. "Hey, you work at Intel? What a coincidence. I'm working on the Ninnle Linux hypervisor. Maybe you could tell me about these undocumented flags...".

    We did the math once and $5,000 worth of call girl pussy was worth $75,000+ in bribes. Plus sometimes it turned into a gang-bang :)

  35. Quantum computers at our doors by Burgergold · · Score: 0

    Similar to someone who told me to prefer an RSA4096 key to an ECDSA (512?) key for a 1-year certificate because quantum computers are at our doors and would break ECDSA faster than RSA...

  36. stuck in the flytrap by Anonymous Coward · · Score: 0

    stuck in the flytrap

  37. Sun Microsystems and gamma ray bursts by Anonymous Coward · · Score: 0

    Somewhere around 2005 the place I worked began to buy Sun's newest top of the line machines. I think the model number was F15000 or some such. We had two of these machines with a bunch of processor boards in each machine. About the time we were ready to go live with them, Sun informed us that they needed to replace about half of the processors because some didn't have ECC on one of the memory caches. They said cosmic rays impinging on non-ECC cache could cause the O.S. to crash!

    I was never a big Sun fan, so soon thereafter when we bought three smaller Sun machines, I named them "cosmic", "ray", and "burst". Management wasn't pleased with that decision. I didn't care, they didn't necessarily please me either.

  38. Re: Microsoft's never doing any military or space by nowwith25percentmore · · Score: 1

    Was your gaming PC's case solid metal, or did it have large windows / oversized vents?

  39. UltraSPARC, anyone? by nbvb · · Score: 2

    Anyone surprised by this must have not been around during the UltraSPARC days ....

    I must’ve replaced 1000+ of those damn chips when the “Sombra” modules came out. Mirrored SRAM to protect against the ecache bit-flips. Kernel panics due to “ecache parity errors” were so common ....

    Cache scrubbers in the Solaris kernel. Replacement CPUs. All of it helped.

    This stuff is real and painful if you had a data center full of gear susceptible to it.

  40. Re: ECC everywhere by Anonymous Coward · · Score: 0

    Or you could buy AMD, which seems to have excellent support for it.

    AMD doesn't make motherboards, so not only do no AMD motherboards support ECC, but there are no AMD motherboards in existence to buy.

    Of the companies that DO make motherboards that support ECC, all of the big five make motherboards for both AMD and Intel that support ECC.

    Switching from an ECC capable i7 to an ECC capable AMD CPU just because "I want ECC" is a pretty special and wasteful form of stupidity.

  41. Re:Microsoft's never doing any military or space w by Anonymous Coward · · Score: 0

    Two words. Ferrite clips.

  42. It happened like this... by toxygen01 · · Score: 5, Interesting

    A friend of mine, developer of the spreadsheet SW back in the days of DOS and Norton Commander, had one customer who would keep complaining about the SW crashing from time to time. These kinds of crashes would only happen to this customer and no other.

    He installed a debug build on the customer's site and waited... and sure enough, the SW would crash, and crash again and again... at completely random places in the code. In some cases there was literally no way those lines of code could make the program crash under any circumstances.

    Well, he spent days trying to debug it and came up empty handed. Until it struck him to look at the time when the SW was crashing. And sure enough, it was crashing on one particular day of the week, usually within a time-span of a few hours during that day. Now comes the interesting part -- the customer's site was actually a railway station on the Slovakia-Ukraine border (in a town called Uzghorod). So he called the customer to ask if there was a train in the station regularly on that day and hour every week and voila, there was one train coming from Ukraine to Slovakia with some goods. So he asked the customer to take a Geiger counter and see if there was anything going on in the air.

    They found out one of the train cars was radiating like hell. It was used for transferring spent nuclear fuel before. And Ukrainians thought they would save some money by using it for regular cargo after EOL. I wouldn't like to be a person living near those railway tracks...

    tl;dr
    Spreadsheet SW was crashing on the computers in the train station and thanks to customer complaints they found out the crashes were caused by a radioactive train coming regularly to the station.

  43. Re: ECC everywhere by mlyle · · Score: 1

    The real problem is that this is a way for motherboard and CPU vendors to segment the market, and prevent commodity PC hardware from being used for critical things. Home users "don't need" ECC, so it can be left off the cheap stuff.

  44. Re: ECC everywhere by Anonymous Coward · · Score: 0

    ECC is another good reason (on top of all the others) for buying Ryzen.

  45. Re: Microsoft's never doing any military or space by c6gunner · · Score: 2

    Aircraft have weather radar built in, so I've had my smartphone in front of a powered up radar emitter many times; didn't affect it in the slightest. The ground based ones are probably more powerful, but it seems unlikely that they would be affecting electronics. If they did there would be a lot more problems than just one random guy having his computer crash.

  46. Re:Microsoft's never doing any military or space w by Anonymous Coward · · Score: 1

    Maxwell HSN-1000. You can't buy them new from Maxwell but you can get them used from recycled military gear for around $150.

  47. Common in IBM mid-ranges in the 90s by coreyh · · Score: 2

    This is actually pretty common and has gone on for a long time, especially on systems that were striving to be low-to-zero downtime.

    Some of the idle processing on AS/400s would periodically re-write the microcode from disk. When I asked a core developer why, they cited gamma rays flipping a bit. I then asked if a lead umbrella wouldn't do the job better, and they said yes, but the umbrella would have to be about six feet thick.

    1. Re:Common in IBM mid-ranges in the 90s by Agripa · · Score: 1

      Some of the idle processing on AS/400s would periodically re-write the microcode from disk. When I asked a core developer why, they cited gamma rays flipping a bit. I then asked if a lead umbrella wouldn't do the job better, and they said yes, but the umbrella would have to be about six feet thick.

      Cache and memory scrubbing is a standard feature even on x86 consumer desktop processors whether the user has access to it or not. Motherboards which support ECC memory may make the settings which control scrubbing available in the BIOS. Scrubbing applies to every level of cache which is ECC or parity protected and to main memory if ECC protected.
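
      As a rough illustration of what a scrubber does, here is a minimal sketch using a textbook Hamming(7,4) code. Real ECC hardware typically uses wider codes (for example 64 data bits plus 8 check bits), and every name below is made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (index 0 holds bit position 1)."""
    d1, d2, d3, d4 = [(nibble >> i) & 1 for i in range(4)]
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def syndrome(word):
    """Return 0 if all parity checks pass, else the 1-based position of the bad bit."""
    s1 = word[0] ^ word[2] ^ word[4] ^ word[6]
    s2 = word[1] ^ word[2] ^ word[5] ^ word[6]
    s3 = word[3] ^ word[4] ^ word[5] ^ word[6]
    return s1 | (s2 << 1) | (s3 << 2)


def decode(word):
    """Extract the 4 data bits from a codeword (assumes it has no errors)."""
    return word[2] | (word[4] << 1) | (word[5] << 2) | (word[6] << 3)


def scrub(memory):
    """Walk every stored word, repair single-bit errors in place, return the repair count."""
    fixed = 0
    for word in memory:
        pos = syndrome(word)
        if pos:
            word[pos - 1] ^= 1  # flip the faulty bit back
            fixed += 1
    return fixed


memory = [encode(n) for n in (0x3, 0xA, 0xF)]
memory[1][4] ^= 1  # simulate a particle strike flipping one stored bit
corrected = scrub(memory)
values = [decode(w) for w in memory]
print(corrected, [hex(v) for v in values])
```

      The point of scrubbing periodically is to repair single-bit errors before a second flip lands in the same word, which a code like this could no longer correct.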

  48. Re: ECC everywhere by Anonymous Coward · · Score: 1

    Intel omits ECC from the desktop market as purposeful market segmentation. It's a fact.

  49. Re: Microsoft's never doing any military or space by Mister+Transistor · · Score: 2

    Aircraft radars are in the hundreds of watts of power output; WX radars are in the MILLIONS of watts. You're talking a difference of a factor of 10,000 or more, i.e. four orders of magnitude.

    Also, some circuits are more sensitive than others to particular frequencies due to the length of wires or runners on PCBs that act like little antennas, so not everything is going to be adversely affected, but stuff that's resonant at that frequency will be much more susceptible to external interference.

    RF engineer here, BTW. I just don't do radars...

    --
    -- You are in a maze of little, twisty passages, all different... --
  50. Once a year by cwsumner · · Score: 1

    Cosmic rays causing RAM errors is a thing. Scientists estimate it will happen to PCs, at ground level, about once a year. Surprisingly, which year does not matter much because as the tech gets smaller, the capacity gets larger, so the die size stays about the same.

    Once a year might not sound like much, but that is not "at the end of the year", it can happen right away. Chance is strange that way. 8-)

    MS should probably -not- have commented it out...
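
    To put numbers on the "it can happen right away" point, here is a small sketch that treats bit flips as a Poisson process. The rate of one flip per year is only the rough figure quoted above, and the function name is made up for the example.

```python
import math


def prob_at_least_one_flip(rate_per_year, years):
    """Chance of one or more events in `years` for a Poisson process with the given rate."""
    return 1.0 - math.exp(-rate_per_year * years)


RATE = 1.0  # assumption: one flip per year on average, per the estimate above

for label, years in [("1 day", 1 / 365), ("1 month", 1 / 12), ("1 year", 1.0)]:
    print(f"{label:>8}: {prob_at_least_one_flip(RATE, years):.1%}")
```

    Under that assumption there is about an 8% chance of a flip in the first month and about a 63% chance within the first year, so "once a year" is an average, not a schedule.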

    1. Re:Once a year by Agripa · · Score: 1

      Cosmic rays causing RAM errors is a thing. Scientists estimate it will happen to PCs, at ground level, about once a year. Surprisingly, which year does not matter much because as the tech gets smaller, the capacity gets larger, so the die size stays about the same.

      Moore's law is about economics and includes cost reduction per transistor from increasing die size so I wonder if the total die area of memory has actually increased at the high end of consumer hardware.

      Once a year might not sound like much, but that is not "at the end of the year", it can happen right away. Chance is strange that way. 8-)

      MS should probably -not- have commented it out...

      A couple of DRAM generations ago it was something like 1 bit per year per gigabyte but later DRAM generations actually improved slightly. My workstations went from 2GB to 8GB in my current one and my next will likely be 64GB but they all use 4 x dual sided DIMMs so the same number of chips but the silicon chips themselves have increased in area with better packaging.
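
      Taking the rough "1 bit per year per gigabyte" figure at face value (an assumption; as noted, later DRAM generations do somewhat better), the expected error count simply scales with capacity. The names below are hypothetical.

```python
FLIPS_PER_GB_YEAR = 1.0  # assumption: the older-generation figure quoted above


def expected_flips_per_year(capacity_gb, rate=FLIPS_PER_GB_YEAR):
    """Expected number of bit flips per year for a machine with `capacity_gb` of DRAM."""
    return capacity_gb * rate


def mean_days_between_flips(capacity_gb, rate=FLIPS_PER_GB_YEAR):
    """Average gap between flips, in days, at the same assumed rate."""
    return 365.0 / expected_flips_per_year(capacity_gb, rate)


for gb in (2, 8, 64):
    print(f"{gb:>3} GB: {expected_flips_per_year(gb):5.1f} flips/year, "
          f"one every {mean_days_between_flips(gb):6.1f} days on average")
```

      At that rate a 64GB machine would see a flip roughly every six days, which is why the per-gigabyte rate matters more than the per-machine one.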

    2. Re:Once a year by cwsumner · · Score: 1

      Magnetic media is not so prone to this. But this makes me wonder if the SSD drives we are all using now are having this problem?

      Maybe SSDs have better data check and correction functions, but maybe we should keep a hard drive in our computers to reload the SSD, if necessary.

    3. Re:Once a year by Agripa · · Score: 1

      Magnetic media is not so prone to this. But this makes me wonder if the SSD drives we are all using now are having this problem?

      Maybe SSDs have better data check and correction functions, but maybe we should keep a hard drive in our computers to reload the SSD, if necessary.

      Both hard disk drives and solid state drives use block based error correction. Several bad bits can be corrected in each sector and sectors may even be considered good with several bad bits below a specified threshold.

      Where SSDs compare poorly to HDDs is endurance and retention time but as long as they are not used for unpowered offline storage like a hard drive might be, retention time is not a problem and few users are going to reach endurance limits. There is a new standard for SSD retention time but I do not think any actually meet it.

  51. Single event upsets (SEU) by Anonymous Coward · · Score: 0

    Mitigating this problem has been the elephant in the room since the late 90s. At least one aircraft manufacturer would not allow the use of FPGAs in flight critical electronics designs of commercial airliners because of it. Xilinx for years had several FPGAs running at the top of one of the Hawaii volcanoes doing nothing but repeatedly measuring the number of times their bitstream was altered. At the current chip geometries you can be pretty sure that if you jump on a plane in California and fly west with a new laptop, you may have an SEU. It may simply flip a bit of unused memory.....or not. Google it.

  52. Don't laugh as this isn't as funny as ... by Anonymous Coward · · Score: 0

    You think this is funny? Then read why ECC memory was developed and get an education about interference from radiation.

  53. Mario 64 speedrunning proved this is an issue by Anonymous Coward · · Score: 0

    A long time ago, someone was going through a level (Tick Tock Clock) in Mario 64. They somehow managed to "warp" to the top of the level - something very valuable to the speedrunners obviously. Many people believe this was due to a bit flip, whether caused by a cosmic ray or not, because so far it has never been reproduced again. Bit flips happen more commonly than is realized but most of the time the impact is not noticeable. It is curious as to why this programmer felt the need to put that instruction in the code. Did something happen to him or a colleague that they could not explain? Bit flips may be the cause of more issues than we realize in computer hardware.

  54. ACPI specification requirements by thegarbz · · Score: 1

    A quick read through the ACPI specification implies that the caches should be flushed *before* entering the S1 state and letting the hardware deal with the rest.

    I'm not sure what to make of the comment. Part of the comment makes it appear as though this instruction comes after waking (making it pointless since the cache is already invalid). If this comment is about before going into the sleep state then it wasn't a manufacturer who asked for this, it was the ACPI specification itself, and not flushing the cache before entering would be in breach of the spec.

    "15.1.1 S1 Sleeping State
    The S1 state is defined as a low wake-latency sleeping state. In this state, all system context is preserved with the exception of CPU caches. Before setting the SLP_EN bit, OSPM will flush the system caches. If the platform supports the WBINVD instruction (as indicated by the WBINVD and WBINVD_FLUSH flags in the FADT), OSPM will execute the WBINVD instruction. The hardware is responsible for maintaining all other system context, which includes the context of the CPU, memory, and chipset. "

    A very big portion of the ACPI specification details exactly how to flush caches going into and out of the various sleep states and how hardware should respond to this. If implementing the specification as written it would appear as though flushing the cache when waking doesn't need to be done.

    Are there any experts on this topic here who can shed more light on this?
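
    For anyone unsure of the difference between the two instructions: WBINVD writes modified cache lines back to memory and then invalidates the cache, while INVD invalidates without writing anything back. Below is a toy model of that difference. It is a sketch of the concept, not of any real kernel or CPU; the class and method names are made up.

```python
class ToyCache:
    """Write-back cache: stores land in the cache and reach memory only on write-back."""

    def __init__(self, memory):
        self.memory = memory  # backing store: address -> value
        self.lines = {}       # cached copies: address -> value
        self.dirty = set()    # addresses modified but not yet written back

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def read(self, addr):
        if addr not in self.lines:  # miss: fill the line from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def invd(self):
        """Drop every line without writing back; pending stores are lost."""
        self.lines.clear()
        self.dirty.clear()

    def wbinvd(self):
        """Write dirty lines back to memory, then drop every line."""
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.invd()


memory = {0x10: 1}
cache = ToyCache(memory)

cache.write(0x10, 2)
cache.wbinvd()                   # flush before sleeping: memory now holds 2
after_flush = cache.read(0x10)

cache.lines[0x10] ^= 0x4         # simulate a bit flip in a clean cached line
cache.invd()                     # discard the cache; memory copy is still good
after_upset = cache.read(0x10)

cache.write(0x10, 3)
cache.invd()                     # discarding a dirty line loses the store of 3
after_discard = cache.read(0x10)

print(after_flush, after_upset, after_discard)
```

    In this model an INVD on wake only makes sense if everything was written back before sleep, which matches the ordering the spec excerpt describes: with no dirty lines left, throwing the cache away loses nothing and also discards any line that was corrupted in the meantime.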

  55. Re: Microsoft's never doing any military or space by Anonymous Coward · · Score: 0

    Aircraft have their weather radar turned off on the ground, for you know, interference reasons. And the cancer issue.

  56. Re: ECC everywhere by Agripa · · Score: 1

    The problem is that you need a CPU and north bridge that can handle it, which adds to the initial costs. For Intel, for example, a Xeon CPU costs (artificially) a good deal more than a comparable speed i3/5/7/9, which is an upfront cost that consumers aren't willing to eat, and they tend to choose either a cheaper CPU or a faster CPU for the same kind of money.

    In most cases Intel's Xeon and consumer CPUs are the same hardware so the only difference in production might be testing time. Intel's artificial market segmentation of ECC is more about price discrimination than costs, which can be seen by their tying ECC to use of the proper south bridge, which has nothing to do with it.