That Time The Windows Kernel Fought Gamma Rays Corrupting Its Processor Cache (microsoft.com)
Long-time Microsoft programmer Raymond Chen recently shared a memory about an unusual single-line instruction that was once added into the Windows kernel code -- accompanied by an "incredulous" comment from the Microsoft programmer who added it:
;
; Invalidate the processor cache so that any stray gamma
; rays (I'm serious) that may have flipped cache bits
; while in S1 will be ignored.
;
; Honestly. The processor manufacturer asked for this.
; I'm serious.
invd
"Less than three weeks later, the INVD instruction was commented out," writes Chen. "But the comment block remains.
"In case we decide to resume trying to deal with gamma rays corrupting the processor cache, I guess."
phrase their 'requests' these days.
preparing your software for failures in hardware due to common problems such as radiation might be a good idea...
This is why some firms/states would not trust Microsoft with critical functions....
Since it explains the reasoning for why that code is there. (Another developer could come by and wonder why that code is there.) I've seen way too many people put in a comment like "; invalidate cache" and call it a day.
Did you know 80 to 90% of the moderators on slashdot wouldn't recognize a troll even if one dragged them under a bridge.
Single Event Upsets are real and all semiconductors are susceptible. 90nm might be more "resilient" to it, but it can still occur.
This sounds like a processor bug or a bug elsewhere and they bamboozled MS with smoke and mirrors...
I once had to debug a situation where an opto-coupler had been changed out from a part that had black plastic to a part that had white plastic. The difference in the opacity of the casing was enough to cause a larger drift when in the sunlight. This is not as crazy as it sounds...
The need for error checking has been around for a very long time. Yes, cosmic particles are indeed a thing, and result in increased memory errors at high altitude, in airplanes, or especially in space.
I remember parity RAM being around in the 90s, and I'm pretty sure it's older than that. Pretty much any server these days uses ECC for this reason.
I run ECC and record the occasional bit flip in my logs. The counters can be found under /sys/devices/system/edac/mc/mc0/.
What's odd is that ECC is not routinely used in all hardware. Depending on the conditions it can be of great help, as the rare bit flip can cause strange problems that can take ages to track down. And it works well for figuring out when you have a bad memory module -- the computer will figure it out on its own.
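For anyone who wants to check their own box, here is a minimal sketch of polling those counters. It assumes the Linux EDAC driver is loaded and exposes `ce_count` (corrected errors) and `ue_count` (uncorrected errors) under the directory mentioned above; the function name `read_edac_counts` is made up for illustration.

```python
from pathlib import Path

# Assumed location: the first memory controller, as in the path quoted above.
EDAC_MC0 = Path("/sys/devices/system/edac/mc/mc0")


def read_edac_counts(mc_dir=EDAC_MC0):
    """Return (corrected, uncorrected) error counts for one memory controller.

    A value is None when the file is missing or unreadable, for example on a
    machine without ECC or without the EDAC driver loaded.
    """
    def read_int(name):
        try:
            return int((Path(mc_dir) / name).read_text().strip())
        except (OSError, ValueError):
            return None

    return read_int("ce_count"), read_int("ue_count")


corrected, uncorrected = read_edac_counts()
print(f"corrected={corrected} uncorrected={uncorrected}")
```

A corrected count that creeps up on one module over weeks is the "bad memory module" case described above; an uncorrected count above zero is the one to worry about.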
It seems to make good sense to put in some protections against register or other bit flips; they do happen from time to time. He probably meant cosmic rays instead of gamma rays, but that definitely can happen, and I have spent many, many hours of my life putting things in software that detect these and recover properly. I have one processor type that has something like this about once a month, very consistently, over several decades.
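One common way to do this kind of detect-and-recover in software is to keep three copies of a critical value and take a bitwise majority vote on every read. The sketch below illustrates that general technique only; it is not a description of what the poster above actually built, and the names `majority_vote` and `TripleStored` are hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise majority: each result bit is the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


class TripleStored:
    """Keeps three copies of an integer and repairs a single corrupted copy on read."""

    def __init__(self, value):
        self.copies = [value, value, value]

    def read(self):
        good = majority_vote(*self.copies)
        repaired = any(copy != good for copy in self.copies)
        # Scrub: rewrite all three copies so a later second flip can still be outvoted.
        self.copies = [good, good, good]
        return good, repaired


stored = TripleStored(0b1010)
stored.copies[1] ^= 0b0100          # simulate one flipped bit in one copy
value, repaired = stored.read()
print(bin(value), repaired)
```

This only protects against a flip in one copy at a time, and it costs three times the storage, which is why it tends to be reserved for a handful of critical variables.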
Maybe they were afraid it would get angry.
You wouldn't like it if it got angry.
that's what their embedded OSes were for. AFAIK this was in their consumer code base.
If I had to guess, this was because of a real processor bug Intel didn't want to admit to. I remember when Win XP hit, the shop I was at was flooded with dead computers from upgrades. Manufacturers had been selling bad ram in computers for years. By default Win98 would only make use of the first 64 MB of ram in most cases (there was a registry hack I've long since forgotten to force it to use your entire ram before going to the cache).
Anyway, XP's installer would copy the CD into ram to make the (very slow) install run faster. So you got to find out your OEM stuck bad ram in your box the hard way when the installer blew up. The best part was the upgrade couldn't roll itself back gracefully. I don't remember all the steps to fix it but it was a pain. We just did software where I was at too so it was fun having to send them somewhere else to get new ram and have them yell at me that the ram was fine. Good times.
Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/
A real gamma ray wouldn't do much, and would just pass through, unless it pair converted to electron and positron.
But cosmic rays (charged particles) would be more likely to interact.
If it's being done rarely, and before exceptionally critical operations, then maybe it makes sense. Although, if someone bothered to take it out, then it was probably happening too often and thus affecting performance...
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Reading the full story, it's rather strongly implied that it was actually a workaround for a bug in the processor which the manufacturer hadn't found yet, and was blaming on cosmic rays.
You are all cows. Cows say moo. MOOOOO! MOOOOO! Moo cows MOOOOOOO! Moo say the cows. YOU CACHE COWS!!
Your CPU has been asleep for an a priori unknown amount of time. When you are powering back up, you'd absolutely want to clear the cache to purge any potential bit flips. It's a relatively cheap way of ensuring data integrity.
A reference to the *specific* communication from the chip vendor should be clearly visible to anyone auditing the code’s history.
Since it’s not in-line in the code, I hope it’s somewhere else in the same “check-in” to the code repository. Presumably only MS and their partners have access to that.
Likewise, when it was commented out there needs to be a corresponding justification, such as a reference to an additional communication from the vendor or an internal memo approving ignoring vendor advice.
Without such justification, code-auditors can’t easily determine if this was put in as a joke, because of a misunderstanding, or for some other reason.
I think they use laptops on the International Space Station and there you are not protected from cosmic rays by the blanket of the Earth's atmosphere. Just read up on the phosphenes experienced by the astronauts as they try to go to sleep.
Not sure if "gamma rays" is the correct term here, as high-energy protons are most likely to create a local change in electric charge density. With modern processors being built on the 14 nanometre process, this becomes a serious problem. All the processors that are used in spacecraft and control vital functions are radiation-hardened. That usually means older fabrication processes (wider paths reduce the probability of cross-talk) and amorphous silicon (a monocrystal can sustain permanent damage from a particle of high enough energy).
Overall, it does make sense if it is meant to be used in space.
Sounds like something Cyrix must have asked for, wondering why machines with their CPUs kept locking up
RAM is cheap enough that ECC or similar tech should be routine. I’ll pay 10-15% more per GB for this.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Yeah, I'll bet this was before they discovered that their processor packaging material was radioactive and that was randomly flipping bits. Seriously, radioactive RAM was one culprit that ran Sun Microsystems, Inc. out of business. It took them years to find it. They even started adding ECC to their motherboard data paths and looking to see if their data centers were near nuclear research facilities. By the time they found it, it was too late. ...that and they should have ditched Solaris for Linux, but...
Should’ve read the article first, where the author explained that oddly-commented code similar to this was used TEMPORARILY on early processor revisions or on early microcode revisions.
In these cases, the check-in logs or the context of the code - say, it’s in a block of code that only runs on processors that are in pre-production at the time - should make it clear that this is “work-around” code that we expect to be removed soon.
Apparently, the apostrophe got turned into a curly-apostrophe. Bad computer.
Still my fault for not previewing.
The problem is that you need a CPU and north bridge that can handle it, which adds to the initial costs. For Intel, for example, a Xeon CPU costs (artificially) a good deal more than a comparable speed i3/5/7/9, which is an upfront cost that consumers aren't willing to eat, and they tend to choose either a cheaper CPU or a faster CPU for the same kind of money.
The i3 is a Xeon with cut down features. Hell the 8xxx series actually HAVE ECC enabled, but you need a C236 (or whatever the recent model is) server motherboard to support it. A completely artificial requirement, given that AMD has supported it since the Socket 939 chips, and Intel supported it on the 440BX and FX chipsets, which were used in consumer hardware, being replaced by the 810/815 in part to remove the ECC capability which was cutting into their server sales (or so they claimed.)
Even most ARM processors if you read the spec sheet have ECC support included by default today, even when the majority of products decide not to include it.
Lack of ECC is entirely artificial today. You can find AMD motherboards under 100 dollars with ECC capabilities and chip-wise everything has it in hardware, even if the support is disabled when sold to the consumer.
Sounds like a smoke screen for something else.
If the cache is susceptible to random gamma rays, or, more likely, cosmic rays, and has no ECC, it is NEVER trustworthy, and should be permanently disabled.
It's like the Intel floating point bugs (yes, plural). Since the end user has no idea WHICH of the operations will produce an erroneous result, NONE of the operations' results are usable, ever.
Could be worse. Intel once had a "genius" purchasing agent that got a "good deal" on clay for the ceramic package of EPROMs. Devices didn't hold their state for particularly long, however, since the clay was mildly radioactive.
One component that many defence contracts required was a Nuclear Event Detector. This little component would set a pin when it detected the precursor of a nuclear detonation. What the system did next was up to the vendor, but usually it would involve a shutdown and disconnect of ports and power lines.
Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
I know stray radar microwaves can take out a PC. There was weather radar station close to where I lived. Whenever my smartphone app received a heavy rain warning, my gaming PC would crash seconds before.
Nowadays, it probably is far, far more likely that Microsoft's horrendous Windows QA will result in bad data than stray gamma rays flipping bits in a sleeping cache.
"Less than three weeks later, the INVD instruction was commented out," writes Chen. "But the comment block remains.
I don't like seeing commented-out code. If it's commented out then it has no business being in the source code file, even if there's an explanation in the comment block. The code's removal, along with its comment block, should be documented in whatever revision control system is in use. Maybe I'm biased because I worked in safety-critical environments where commented-out code is a no-no.
If the radar was making your PC crash, it'd be crashing constantly, I'd imagine.
Or you could buy AMD, which seems to have excellent support for it.
Some of the newer Doppler WX radars do a rapid narrow scan in some modes of operation for some fine examination of a particular front or phenomenon they want to image with more detail or using some more specialized mode like water vapor density, etc.
So, the usual low(er) power scanning 'round and 'round, like radars usually do, probably isn't enough to trigger this poster's problem, but if the high-powered focused scans happen to be in his direction, well, bad news that day.
Perhaps some Meteorologist can weigh in on this mode of operation with the radars, I don't know enough about them to be more specific.
-- You are in a maze of little, twisty passages, all different... --
To thwart lawyers finding out the true intentions of the strategies, Bill Gates decreed that the code should not have comments. Famously he said, "I am paying you to write code, not comment."
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
If it was very marginally bad, then just changing the parking of trucks and cars near by would change what reflections you have overlapping and interfering. Multipath signals can be a real mess.
I used to work on Ninnle Linux. Intel, Cyrix, AMD, etc officially didn't acknowledge us. But I was friends with some key engineers, met through swinging, keyparties, hotwifing, etc. So they gave me inside info and weird tips like that (IIRC, cache could get corrupted but it wasn't gamma rays). Today it would be called catfishing, but I would target their top engineers and have them fuck my "wife" (actually an escort) while I watched.
Afterwards, we'd hang out drinking beers and smoking weed. "Hey, you work at Intel? What a coincidence. I'm working on the Ninnle Linux hypervisor. Maybe you could tell me about these undocumented flags...".
We did the math once and $5,000 worth of call girl pussy was worth $75,000+ in bribes. Plus sometimes it turned into a gang-bang :)
Similar to someone who told me to prefer an RSA-4096 key to an ECDSA (512?) key for a 1-year certificate because quantum computers are at our doors and would break ECDSA faster than RSA...
stuck in the flytrap
Somewhere around 2005 the place I worked began to buy Sun's newest top-of-the-line machines. I think the model number was F15000 or some such. We had two of these machines with a bunch of processor boards in each machine. About the time we were ready to go live with them, Sun informed us that they needed to replace about half of the processors because some didn't have ECC on one of the memory caches. They said cosmic rays impinging on non-ECC cache could cause the O.S. to crash!
I was never a big Sun fan, so soon thereafter when we bought three smaller Sun machines, I named them "cosmic", "ray", and "burst". Management wasn't pleased with that decision. I didn't care, they didn't necessarily please me either.
Was your gaming PC's case solid metal, or did it have large windows / oversized vents?
Anyone surprised by this must have not been around during the UltraSPARC days ....
I must’ve replaced 1000+ of those damn chips when the “Sombra” modules came out. Mirrored SRAM to protect against the ecache bit-flips. Kernel panics due to “ecache parity errors” were so common ....
Cache scrubbers in the Solaris kernel. Replacement CPUs. All of it helped.
This stuff is real and painful if you had a data center full of gear susceptible to it.
Or you could buy AMD, which seems to have excellent support for it.
AMD doesn't make motherboards, so not only do no AMD motherboards support ECC, there are no AMD motherboards in existence to buy.
Of the companies that DO make motherboards that support ECC, all of the big five make motherboards for both AMD and Intel that support ECC.
Switching from an ECC capable i7 to an ECC capable AMD CPU just because "I want ECC" is a pretty special and wasteful form of stupidity.
Two words. Ferrite clips.
A friend of mine, developer of spreadsheet SW back in the days of DOS and Norton Commander, had one customer who would keep complaining about the SW crashing from time to time. These kinds of crashes would only happen to this customer and no other.
He installed a debug build on the customer's site and waited... and sure enough, the SW would crash, and crash again and again... at completely random places in the code. In some cases there was literally no way those lines of code could make the program crash under any circumstances.
Well, he spent days trying to debug it and came up empty-handed. Then it struck him to look at the times when the SW was crashing. And sure enough, it was crashing on one particular day of the week, usually within a span of a few hours on that day. Now comes the interesting part -- the customer's site was actually a railway station on the Slovakia-Ukraine border (in a town called Uzghorod). So he called the customer to ask if there was a train in the station regularly on that day and hour every week and voila, there was one train coming from Ukraine to Slovakia with some goods. So he asked the customer to take a Geiger counter and see if there was anything going on in the air.
They found out one of the train cars was radiating like hell. It was used for transferring spent nuclear fuel before. And Ukrainians thought they would save some money by using it for regular cargo after EOL. I wouldn't like to be a person living near those railway tracks...
tl;dr
Spreadsheet SW was crashing on the computers in the train station, and thanks to customer complaints they found out the crashes were caused by a radioactive train coming regularly to the station.
The most real problem is that this is a way for motherboard and CPU vendors to segment the market, and prevent commodity PC hardware from being used for critical things. Home users "don't need" ECC, so it can be left off the cheap stuff.
ECC is another good reason (on top of all the others) for buying Ryzen.
Aircraft have weather radar built in, so I've had my smartphone in front of a powered up radar emitter many times; didn't affect it in the slightest. The ground based ones are probably more powerful, but it seems unlikely that they would be affecting electronics. If they did there would be a lot more problems than just one random guy having his computer crash.
Maxwell HSN-1000. You can't buy them new from Maxwell but you can get them used from recycled military gear for around $150.
This is actually pretty common and has gone on for a long time, especially on systems that were striving to be low-to-zero downtime.
Some of the idle processing on AS/400s would periodically re-write the microcode from disk. When I asked a core developer why, they cited gamma rays flipping a bit. I then asked if a lead umbrella wouldn't do the job better, and they said yes, but the umbrella would have to be about six feet thick.
Intel omits ECC from the desktop market as purposeful market segmentation. It's a fact.
Aircraft radars are in the hundreds of watts of power output; WX radars are in the MILLIONS of watts. You're talking about a difference of a factor of 10,000 or more, i.e. four orders of magnitude.
Also, some circuits are more sensitive than others to particular frequencies due to the length of wires or runners on PCBs that act like little antennas, so not everything is going to be adversely affected, but stuff that's resonant at that frequency will be much more susceptible to external interference.
RF engineer here, BTW. I just don't do radars...
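To put numbers on that gap, here is a quick back-of-the-envelope sketch. The 100 W and 1,000,000 W figures are illustrative picks from the "hundreds" and "millions" above, not measured values for any particular radar, and `power_ratio_db` is just a helper name chosen for this example.

```python
import math


def power_ratio_db(p1_watts, p2_watts):
    """Ratio of two powers expressed in decibels: 10 * log10(P1 / P2)."""
    return 10 * math.log10(p1_watts / p2_watts)


aircraft_w = 100          # assumed: "hundreds of watts"
ground_wx_w = 1_000_000   # assumed: "millions of watts"

print(ground_wx_w / aircraft_w)                 # linear ratio
print(power_ratio_db(ground_wx_w, aircraft_w))  # same ratio in dB
```

Transmit power is only part of the picture; distance, antenna gain and where the beam is pointing matter just as much for what actually reaches a given PC.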
Cosmic rays causing RAM errors is a thing. Scientists estimate it will happen to PCs, at ground level, about once a year. Surprisingly, which year does not matter much because as the tech gets smaller, the capacity gets larger, so the die size stays about the same.
Once a year might not sound like much, but that is not "at the end of the year", it can happen right away. Chance is strange that way. 8-)
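The "it can happen right away" point can be made concrete with a small calculation. This sketch assumes flips arrive independently at a constant average rate (a Poisson process), which is a modelling assumption and not something claimed above; the once-a-year rate is the figure quoted above, and `prob_at_least_one_flip` is a name made up for illustration.

```python
import math


def prob_at_least_one_flip(rate_per_year, years):
    """P(at least one event) for a Poisson process: 1 - exp(-rate * time)."""
    return 1.0 - math.exp(-rate_per_year * years)


rate = 1.0  # about one flip per year, per the estimate quoted above

for label, years in [("one day", 1 / 365), ("one month", 1 / 12), ("one year", 1.0)]:
    print(f"{label}: {prob_at_least_one_flip(rate, years):.4f}")
```

Under that assumption, the chance of at least one flip within the first year is about 63%, not 100%, and there is a small but nonzero chance on any given day.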
MS should probably -not- have commented it out...
Mitigating this problem has been the elephant in the room since the late 90s. At least one aircraft manufacturer would not allow the use of FPGAs in flight-critical electronics designs of commercial airliners because of it. Xilinx for years had several FPGAs running at the top of one of the Hawaiian volcanoes doing nothing but repeatedly measuring the number of times their bitstream was altered. At the current chip geometries you can be pretty sure that if you jump on a plane in California and fly west with a new laptop, you may have an SEU. It may simply flip a bit of unused memory.....or not. Google it.
You think this is funny? Then read why ECC memory was developed and get an education about interference from radiation.
A long time ago, someone was going through a level (Tick Tock Clock) in Mario 64. They somehow managed to "warp" to the top of the level - something very valuable to the speedrunners obviously. Many people believe this was due to a bit flip, whether caused by a cosmic ray or not, because so far it has never been reproduced again. Bit flips happen more commonly than is realized but most of the time the impact is not noticeable. It is curious as to why this programmer felt the need to put that instruction in the code. Did something happen to him or a colleague that they could not explain? Bit flips may be the cause of more issues than we realize in computer hardware.
A quick read through the ACPI specification implies that the caches should be flushed *before* entering the S1 state and letting the hardware deal with the rest.
I'm not sure what to make of the comment. Part of the comment makes it appear as though this instruction comes after waking (making it pointless since the cache is already invalid). If this comment is about before going into the sleep state then it wasn't a manufacturer who asked for this, it was the ACPI specification itself, and not flushing the cache before entering would be in breach of the spec.
"15.1.1 S1 Sleeping State
The S1 state is defined as a low wake-latency sleeping state. In this state, all system context is preserved with the exception of CPU caches. Before setting the SLP_EN bit, OSPM will flush the system caches. If the platform supports the WBINVD instruction (as indicated by the WBINVD and WBINVD_FLUSH flags in the FADT), OSPM will execute the WBINVD instruction. The hardware is responsible for maintaining all other system context, which includes the context of the CPU, memory, and chipset. "
A very big portion of the ACPI specification details exactly how to flush caches going into and out of the various sleep states and how hardware should respond to this. If implementing the specification as written, it would appear as though flushing the cache when waking doesn't need to be done.
Are there any experts on this topic here who can shed more light on this?
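One detail worth spelling out is that the spec text quoted above talks about WBINVD, while the kernel snippet in the story used INVD. On x86, WBINVD writes modified cache lines back to memory before invalidating, whereas INVD invalidates without writing anything back. The toy model below only illustrates that difference; it is not how any real CPU or the Windows kernel is implemented, and the class `ToyWriteBackCache` is invented for this example.

```python
class ToyWriteBackCache:
    """A toy write-back cache in front of a dict that stands in for RAM."""

    def __init__(self, memory):
        self.memory = memory   # address -> value ("RAM")
        self.lines = {}        # address -> (value, dirty flag)

    def write(self, addr, value):
        # Write-back policy: the new value lives only in the cache until flushed.
        self.lines[addr] = (value, True)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = (self.memory[addr], False)
        return self.lines[addr][0]

    def wbinvd(self):
        """Write dirty lines back to memory, then drop every line."""
        for addr, (value, dirty) in self.lines.items():
            if dirty:
                self.memory[addr] = value
        self.lines.clear()

    def invd(self):
        """Drop every line WITHOUT writing back; dirty data is lost."""
        self.lines.clear()


ram = {0x10: 1}
cache = ToyWriteBackCache(ram)

cache.write(0x10, 2)
cache.wbinvd()
print(cache.read(0x10))   # the write survived the flush

cache.write(0x10, 3)
cache.invd()
print(cache.read(0x10))   # the write was discarded; RAM still holds the old value
```

In this model, an INVD after wake is harmless only if nothing dirty is sitting in the cache at that moment, which is consistent with the spec's requirement to flush before entering S1.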
Aircraft have their weather radar turned off on the ground, for you know, interference reasons. And the cancer issue.
The problem is that you need a CPU and north bridge that can handle it, which adds to the initial costs. For Intel, for example, a Xeon CPU costs (artificially) a good deal more than a comparable speed i3/5/7/9, which is an upfront cost that consumers aren't willing to eat, and they tend to choose either a cheaper CPU or a faster CPU for the same kind of money.
In most cases Intel's Xeon and consumer CPUs are the same hardware, so the only difference in production might be testing time. Intel's artificial market segmentation of ECC is more about price discrimination than costs, which can be seen by their tying ECC to use of the proper south bridge, which has nothing to do with it.