Slashdot Mirror


That Time The Windows Kernel Fought Gamma Rays Corrupting Its Processor Cache (microsoft.com)

Long-time Microsoft programmer Raymond Chen recently shared a memory about an unusual single-line instruction that was once added into the Windows kernel code -- accompanied by an "incredulous" comment from the Microsoft programmer who added it:

;
; Invalidate the processor cache so that any stray gamma
; rays (I'm serious) that may have flipped cache bits
; while in S1 will be ignored.
;
; Honestly. The processor manufacturer asked for this.
; I'm serious.
invd


"Less than three weeks later, the INVD instruction was commented out," writes Chen. "But the comment block remains.

"In case we decide to resume trying to deal with gamma rays corrupting the processor cache, I guess."

33 of 166 comments

  1. Microsoft's never doing any military or space work by johnjones · · Score: 3, Informative

    Preparing your software for failures in hardware due to common problems such as radiation might be a good idea...

    This is why some firms/states would not trust Microsoft with critical functions....

  2. That's a great comment by NotSoHeavyD3 · · Score: 5, Insightful

    Since it explains the reasoning why that code is there. (Since another developer could come by and wonder why that code is there.) I've seen way too many people put in a comment like ;invalidate cache and call it a day.

    --
    Did you know 80 to 90% of the moderators on slashdot wouldn't recognize a troll even if one dragged them under a bridge.
    1. Re: That's a great comment by Balial · · Score: 2

      It needs a reference to the errata from the vendor. Future revisions may need to tweak code flow and understand exactly what this is trying to achieve.

    2. Re:That's a great comment by shabble · · Score: 2

      Since it explains the reasoning why that code is there. (Since another developer could come by and wonder why that code is there.)

      But... the code isn't there. The code itself was commented out shortly after.

      What's more concerning is why the commented stuff was actually left in there, since I'm presuming they had source control even back then.

      And "in case someone put it back in later" isn't really covered since the same sort of code could conceivably be put elsewhere in the code without the programmer seeing this bit of code.

  3. I'm not sure what's odd about that by vadim_t · · Score: 4, Interesting

    The need for error checking has been around for a very long time. Yes, cosmic particles are indeed a thing, and result in increased memory errors at high altitude, in airplanes, or especially in space.

    I remember parity RAM being around in the 90s, and I'm pretty sure it's older than that. Pretty much any server these days uses ECC for this reason.

    I run ECC and see the occasional bit flip recorded in my logs. The counters can be found under /sys/devices/system/edac/mc/mc0/.

    What's odd is that ECC is not routinely used in all hardware. Depending on the conditions it can be of great help, as the rare bit flip can cause strange problems that can take ages to track down. And it works well for figuring out when you have a bad memory module -- the computer will figure it out on its own.
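For anyone who wants to check those counters, here is a minimal sketch. It assumes a Linux machine with an EDAC driver loaded, where each memory controller directory exposes `ce_count` (corrected errors) and `ue_count` (uncorrected errors). The function name `read_edac_counts` is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} for each mc* directory.

    Returns an empty dict when the EDAC tree is missing, for example on a
    machine without ECC memory or without an EDAC driver loaded.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are unreadable or malformed.
            continue
        counts[mc.name] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: {ce} corrected, {ue} uncorrected")
```

A steadily climbing corrected count on one controller is the kind of signal that points at a failing module rather than a stray particle.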

    1. Re:I'm not sure what's odd about that by dargaud · · Score: 2

      I have a friend who had written his own accounting software in the 80s on a 6502 PC. Once there was a discrepancy of a few $ at the end of the month. He spent an entire month backtracking the error through software logic, then software debug, then finally assembly until he found the exact place where a single bit had flipped in memory. Took him a month.
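To see how one flipped bit turns into "a few $", here is a small sketch. It assumes, purely for illustration, that the balance is stored as an integer number of cents; the actual storage format in that program is not known.

```python
def flip_bit(value, bit):
    """Return value with the given bit position inverted."""
    return value ^ (1 << bit)


balance_cents = 123_456            # hypothetical balance: $1,234.56
corrupted = flip_bit(balance_cents, 8)
difference = corrupted - balance_cents

# Bit 8 is worth 2**8 = 256, so the books are off by $2.56.
print(f"off by ${difference / 100:.2f}")
```

Flipping the same bit a second time restores the original value, which is why a corrupted total can look perfectly plausible: nothing about the number itself says it is wrong.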

      --
      Non-Linux Penguins ?
    2. Re:I'm not sure what's odd about that by larryjoe · · Score: 2

      What's odd is that ECC is not routinely used in all hardware.

      For a lot of systems and uses, the rate of error occurrence doesn't justify the area cost of ECC. For all fabrication processes in the last decade, error rates per SRAM bit have been decreasing faster than the increase in number of SRAM bits, meaning that the total error rates for most chip families have been decreasing. Furthermore, the vast majority of errors in SRAM never propagate to user-discernible outcomes. For these systems, the user is more interested in a lower initial price or better performance rather than a decrease in the failure rate from very infrequent to even more infrequent.

      However, ECC is ubiquitous in data centers, supercomputers, control systems, and aeronautics (where the expected error rate per SRAM bit is at least two orders of magnitude higher than for terrestrial systems). For those systems, the users are willing to pay a premium for data integrity, availability, and safety.
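As a rough illustration of what ECC buys, here is a toy Hamming(7,4) code that corrects any single flipped bit in a 7-bit codeword. This is a teaching sketch only: real ECC memory uses wider codes over whole memory words, and the function names here are made up.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the bad bit, 0 if clean
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = encode([1, 0, 1, 1])
word[4] ^= 1                       # simulate a single bit flip
print(decode(word))
```

The cost is visible in the encoding: three extra bits for every four bits of data in this toy code, which is the area trade-off described above, although real memory codes are far less wasteful.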

    3. Re:I'm not sure what's odd about that by PhunkySchtuff · · Score: 3, Interesting

      The issue is not that the error is only a few dollars or even a few cents. The issue is that there is an error at all. If something doesn't balance, even if it's a few cents out, that means that there's likely an error in the logic that calculates everything.

      It's basic maths. You can't calculate 100 + 100 = 199 and call it a day because it's close enough. There is something fundamentally wrong if you're not getting the exact correct answer.

  4. Why is this so strange? by Brett+Buck · · Score: 2

    It seems to make good sense to put in some protections against register or other bit flips; they do happen from time to time. He probably meant cosmic rays instead of gamma rays, but that definitely can happen, and I have spent many, many hours of my life putting things in software that detect these and recover properly. I have one processor type that has something like this about once a month, very consistently, over several decades.
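One common software technique for detecting and recovering from bit flips is to keep three copies of a value and take a bitwise majority vote on every read. The sketch below shows that idea; it is not a claim about what this poster's software does, and the class name `TripleStored` is invented for the example.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three integers: each bit takes the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


class TripleStored:
    """Holds three copies of an integer and repairs them on every read."""

    def __init__(self, value):
        self.copies = [value, value, value]

    def read(self):
        voted = majority_vote(*self.copies)
        # Scrub: rewrite all copies so a single flip does not linger
        # long enough to be joined by a second one.
        self.copies = [voted, voted, voted]
        return voted


cell = TripleStored(0b1010)
cell.copies[1] ^= 0b0100           # simulate a bit flip in one copy
print(bin(cell.read()))
```

This survives a flip in any one copy. Two copies corrupted in the same bit position would outvote the good one, which is why the read path also rewrites all three.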

  5. Sure they did by rsilvergun · · Score: 4, Insightful

    that's what their embedded OSes were for. AFAIK this was in their consumer code base.

    If I had to guess this was because of a real processor bug Intel didn't want to admit to. I remember that when Win XP hit, the shop I was at was flooded with dead computers from upgrades. Manufacturers had been selling bad RAM in computers for years. By default Win98 would only make use of the first 64 MB of RAM in most cases (there was a registry hack I've long since forgotten to force it to use your entire RAM before going to the cache).

    Anyway, XP's installer would copy the CD into RAM to make the (very slow) install run faster. So you got to find out your OEM stuck bad RAM in your box the hard way when the installer blew up. The best part was the upgrade couldn't roll itself back gracefully. I don't remember all the steps to fix it but it was a pain. We just did software where I was at too, so it was fun having to send them somewhere else to get new RAM and have them yell at me that the RAM was fine. Good times.

    --
    Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/
    1. Re:Sure they did by msauve · · Score: 5, Interesting

      "If I had to guess this was because of a real processor bug Intel didn't want to admit to."

      Alpha particles affecting memory is a known, but uncommon, issue. This code invalidated the cache when coming out of S1 (sleep) state. The deeper (S2+) sleep states already invalidate the cache. The longer the processor is in a static state (sleep), the more chance that an alpha particle hit will flip a bit. Invalidating the cache when coming out of a sleep state has no meaningful impact on performance. The time to re-fetch is nothing compared to the amount of time spent sleeping. Of course, there are many more bits in RAM which could be affected, so a problem is more likely to occur there, which this doesn't address.
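The "longer asleep, more chance of a flip" argument can be put into numbers with a simple model. The sketch below assumes flips arrive independently at a constant rate per bit (a Poisson process); the rate used in the example is a made-up placeholder, not a measured figure for any real cache.

```python
import math


def upset_probability(rate_per_bit_hour, bits, hours):
    """Probability of at least one bit flip, assuming independent flips
    at a constant rate per bit per hour: 1 - exp(-rate * bits * hours)."""
    expected_flips = rate_per_bit_hour * bits * hours
    return -math.expm1(-expected_flips)


CACHE_BITS = 8 * 2**20             # hypothetical 1 MiB cache
RATE = 1e-12                       # hypothetical flips per bit per hour

for hours in (1, 24, 24 * 30):
    p = upset_probability(RATE, CACHE_BITS, hours)
    print(f"{hours:4d} h asleep: P(at least one flip) = {p:.2e}")
```

With the same formula and a much larger bit count, RAM dominates the risk, which matches the point above that invalidating the cache does nothing for flips in main memory.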

      But it hurts nothing, avoids an (admittedly rare) issue, and is but a single instruction. I wonder why they removed it?

      --
      "National Security is the chief cause of national insecurity." - Celine's First Law
    2. Re:Sure they did by msauve · · Score: 2

      "How's it going to make any difference if an alpha particle hits the cache memory cells while the core clock has stopped?"

      It's not clear what you're asking. If a bit in the cache gets changed, it corrupts the instruction or data. That the cache is powered up makes no difference.

      --
      "National Security is the chief cause of national insecurity." - Celine's First Law
    3. Re:Sure they did by svirre · · Score: 2

      The usual source of alpha emissions affecting memory in semiconductor devices is the capsule of the device itself.

  6. probably cosmic rays rather than gamma rays by starless · · Score: 2

    A real gamma ray wouldn't do much, and would just pass through, unless it pair converted to electron and positron.
    But cosmic rays (charged particles) would be more likely to interact.

  7. Re:Microsoft's never doing any military or space w by Anonymous Coward · · Score: 3, Informative

    Reading the full story, it's rather strongly implied that it was actually a workaround for a bug in the processor which the manufacturer hadn't found yet, and was blaming on cosmic rays.

  8. Actually makes good sense by Crashmarik · · Score: 4, Insightful

    Your CPU has been asleep for an a priori unknown amount of time. When you are powering back up, you'd absolutely want to clear the cache to purge any potential bit flips. It's a relatively cheap way of ensuring data integrity.

  9. Laptop aboard the International Space Station ? by Laxator2 · · Score: 5, Informative

    I think they use laptops on the International Space Station and there you are not protected from cosmic rays by the blanket of the Earth's atmosphere. Just read up on the phosphenes experienced by the astronauts as they try to go to sleep.

    Not sure if "gamma rays" is the correct term here, as high-energy protons are most likely to create a local change in electric charge density. With modern processors being built on the 14-nanometre process this becomes a serious problem. All the processors that are used in spacecraft and control vital functions are radiation-hardened. That usually means older fabrication processes (wider paths reduce the probability of cross-talk) and amorphous silicon (a monocrystal can sustain permanent damage from a particle of high enough energy).

    Overall, it does make sense if it is meant to be used in space.

  10. Re: ECC everywhere by davidwr · · Score: 2

    RAM is cheap enough that ECC or similar tech should be routine. I'll pay 10-15% more per GB for this.

    --
    Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
  11. Self-reply, after reading TFA by davidwr · · Score: 2

    Should've read the article first, where the author explained that oddly-commented code similar to this was used TEMPORARILY on early processor revisions or on early microcode revisions.

    In these cases, the check-in logs or the context of the code - say, it's in a block of code that only runs on processors that are in pre-production at the time - should make it clear that this is "work-around" code that we expect to be removed soon.

    --
    Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
  12. Re:Microsoft's never doing any military or space w by mikael · · Score: 3, Interesting

    One component that many defence contracts required was a Nuclear Event Detector. This little component would set a pin when it detected the precursor of a nuclear detonation. What the system did next was up to the vendor, but usually it would involve a shutdown and disconnect of ports and power lines.

    --
    Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
  13. Risk/Reward by QuietLagoon · · Score: 2

    Nowadays, it probably is far, far more likely that Microsoft's horrendous Windows QA will result in bad data than stray gamma rays flipping bits in a sleeping cache.

  14. Commented out code by DigressivePoser · · Score: 5, Insightful

    The comment block was descriptive and necessary, but it should also include processor errata info to trace back to published documentation. Perhaps this was something newly discovered and the processor and software engineers were in close communications.

    "Less than three weeks later, the INVD instruction was commented out," writes Chen. "But the comment block remains.

    I don't like seeing commented-out code. If it's commented out then it has no business being in the source code file - even if there's an explanation in the comment block. The code's removal along with its comment block should be documented in whatever revision control system is in use. Maybe I'm biased because I worked in safety-critical environments where commented-out code is a no-no.

    1. Re: Commented out code by functor0 · · Score: 2

      On occasion, I've had to keep the commented out code with comment explanation why this code must not occur. Otherwise, people keep coming in trying to fix code that's not broken.

    2. Re: Commented out code by TechyImmigrant · · Score: 2

      On occasion, I've had to keep the commented out code with comment explanation why this code must not occur. Otherwise, people keep coming in trying to fix code that's not broken.

      This.

      I've left the wrong code in, commented with a detailed explanation as to why it's wrong, so someone doesn't come and 'fix' it again.

      --
      I should use this sig to advertise my book ISBN-13 : 978-1501515132.
  15. Re: ECC everywhere by vadim_t · · Score: 2

    Or you could buy AMD, which seems to have excellent support for it.

  16. Re:Microsoft's never doing any military or space w by Mister+Transistor · · Score: 3, Interesting

    Some of the newer Doppler WX radars do a rapid narrow scan in some modes of operation for some fine examination of a particular front or phenomenon they want to image with more detail or using some more specialized mode like water vapor density, etc.

    So, the usual low(er) power scanning 'round and 'round, like radars usually do, probably isn't enough to trigger this poster's problem, but if the high-powered focused scans happen to be in his direction, well, bad news that day.

    Perhaps some Meteorologist can weigh in on this mode of operation with the radars, I don't know enough about them to be more specific.

    --
    -- You are in a maze of little, twisty passages, all different... --
  17. Re:Artificial problem. by dshk · · Score: 2

    I do believe missing ECC support is an artificial restriction at Intel. AMD has ECC. One of the reasons I always buy AMD is that I can be sure that all processor features of that generation are enabled in even their cheapest processor. No surprises. Btw, modern processors include most/all of the functionality of the north bridge. Regarding performance, for the same cost AMD usually provides more performance, specifically similar single-threaded performance and better multi-threaded performance.

  18. UltraSPARC, anyone? by nbvb · · Score: 2

    Anyone surprised by this must have not been around during the UltraSPARC days ....

    I must’ve replaced 1000+ of those damn chips when the “Sombra” modules came out. Mirrored SRAM to protect against the ecache bit-flips. Kernel panics due to “ecache parity errors” were so common ....

    Cache scrubbers in the Solaris kernel. Replacement CPUs. All of it helped.

    This stuff is real and painful if you had a data center full of gear susceptible to it.

  19. It happened like this... by toxygen01 · · Score: 5, Interesting

    A friend of mine, developer of a spreadsheet SW back in the days of DOS and Norton Commander, had one customer who would keep complaining about the SW crashing from time to time. These kinds of crashes would only happen to this customer and no other.

    He installed a debug build on the customer's site and waited... and sure enough, the SW would crash, and crash again and again... at completely random places in the code. In some cases there was literally no way those lines of code could make the program crash under any circumstances.

    Well, he spent days trying to debug it and came up empty-handed. Until it struck him to look at the time when the SW was crashing. And sure enough, it was crashing on one particular day of the week, usually within a span of a few hours during that day. Now comes the interesting part -- the customer's site was actually a railway station on the Slovakia-Ukraine border (in a town called Uzghorod). So he called the customer to ask if there was a train in the station regularly on that day and hour every week and voila, there was one train coming from Ukraine to Slovakia with some goods. So he asked the customer to take a Geiger counter and see if there was anything going on in the air.

    They found out one of the train cars was radiating like hell. It was used for transferring spent nuclear fuel before. And Ukrainians thought they would save some money by using it for regular cargo after EOL. I wouldn't like to be a person living near those railway tracks...

    tl;dr
    Spreadsheet SW was crashing on the computers in the train station, and thanks to customer complaints they found out the crashes were caused by a radioactive train car coming regularly to the station.

  20. Re: Microsoft's never doing any military or space by c6gunner · · Score: 2

    Aircraft have weather radar built in, so I've had my smartphone in front of a powered up radar emitter many times; didn't affect it in the slightest. The ground based ones are probably more powerful, but it seems unlikely that they would be affecting electronics. If they did there would be a lot more problems than just one random guy having his computer crash.

  21. Re:Bogus story, fake news. by Bite+The+Pillow · · Score: 2

    http://atdt.freeshell.org/k5/s...

    I don't feel like html today for you.

  22. Common in IBM mid-ranges in the 90s by coreyh · · Score: 2

    This is actually pretty common and has gone on for a long time, especially on systems that were striving to be low-to-zero downtime.

    Some of the idle processing on AS/400s would periodically re-write the microcode from disk. When I asked a core developer why, they cited gamma rays flipping a bit. I then asked if a lead umbrella wouldn't do the job better, and they said yes, but the umbrella would have to be about six feet thick.

  23. Re: Microsoft's never doing any military or space by Mister+Transistor · · Score: 2

    Aircraft radars are in the hundreds of watts power output; WX radars are in the MILLIONS of watts. You're talking a difference of a factor of 10,000 or more, i.e. four orders of magnitude.
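The arithmetic behind that ratio, as a quick sketch. The two power figures are round numbers picked to stand in for "hundreds of watts" and "millions of watts"; they are not specifications of any particular radar.

```python
import math


def power_ratio_db(p_high_watts, p_low_watts):
    """Express a power ratio in decibels: 10 * log10(P_high / P_low)."""
    return 10 * math.log10(p_high_watts / p_low_watts)


aircraft_w = 100.0                 # assumed: low end of "hundreds of watts"
ground_w = 1_000_000.0             # assumed: low end of "millions of watts"

ratio = ground_w / aircraft_w
print(f"ratio: {ratio:,.0f}x, or {power_ratio_db(ground_w, aircraft_w):.0f} dB")
```

This compares transmitter output only. What actually reaches a device also depends on antenna gain and distance, which the comparison above leaves out.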

    Also, some circuits are more sensitive than others to particular frequencies due to the length of wires or runners on PCBs that act like little antennas, so not everything is going to be adversely affected, but stuff that's resonant at that frequency will be much more susceptible to external interference.

    RF engineer here, BTW. I just don't do radars...

    --
    -- You are in a maze of little, twisty passages, all different... --