Elevation Plays a Role In Memory Error Rates
alphadogg writes "With memory, as with real estate, location matters. A group of researchers from AMD and the Department of Energy's Los Alamos National Laboratory have found that the altitude at which SRAM resides can influence how many random errors the memory produces. In a field study of two high-performance computers, the researchers found that L2 and L3 caches had more transient errors on the supercomputer located at a higher altitude, compared with the one closer to sea level. They attributed the disparity largely to lower air pressure and a higher rate of cosmic-ray-induced neutron strikes. Strangely, higher elevation even led to more errors within a rack of servers, the researchers found. Their tests showed that memory modules at the top of a server rack had 20 percent more transient errors than those closer to the bottom of the rack. However, it's not clear what causes this smaller-scale effect."
Top of the rack tends to get toasty, but is this too simple?
Someone tell Fusion.io. They're based at 5000+ feet here in the Salt Lake Valley! It would be interesting if their QC procedures are what make them more reliable, since the failure rate is higher where the testing is performed.
Another reason for nerds to stay in the basement
This isn't news. Companies that make supercomputers have known this for decades. The one I worked for 15 years ago used a high elevation test environment in Colorado to verify error correcting capabilities. Even the article says that the results were not a surprise.
This message is encrypted with Quad ROT-13 to protect the author's copyright under the DMCA.
If you get high you can lose your memory?
Another interesting idea would be to do the same experiment by latitude. Does the Arctic Region Supercomputing Center have a higher rate than the Maui Supercomputing Center?
They tried to do that test a few years back, but both research teams mysteriously disappeared. The leading hypothesis is that the Arctic team was eaten by polar bears, but nobody has any idea what happened to the Maui team. The only clue left at the scene was a nearly empty glass of piña colada.
Everything is better with chainsaws.
It seems to me that an unexploited structure for a low-radiation environment is the underside of a water tower. Steel has most radionuclides slagged off when it is produced, while drinking water standards ensure the water in the tower will have low radioactivity. A meter or two of water forms a nice shield against cosmic rays from above, while the air below the tower shields against lower-energy ground radiation. And you get a nice heat sink in the water for cooling the electronics.
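The water-versus-air comparison can be put on a common footing with areal density (mass per unit area overhead). A rough sketch, assuming the standard-atmosphere barometric formula for the troposphere and a water density of 1 g/cm³; areal density alone does not capture how well a given material stops neutrons or muons, so treat the numbers as orders of magnitude only.

```python
# Rough comparison of overhead shielding: the air column above a site
# versus a layer of water. Uses the standard-atmosphere (troposphere)
# barometric formula; valid only below about 11 km.
P0 = 101325.0      # sea-level standard pressure, Pa
T0 = 288.15        # sea-level standard temperature, K
LAPSE = 0.0065     # temperature lapse rate, K/m
G = 9.80665        # standard gravity, m/s^2
M_AIR = 0.0289644  # molar mass of dry air, kg/mol
R = 8.3144598      # universal gas constant, J/(mol*K)


def pressure_pa(elevation_m: float) -> float:
    """Standard-atmosphere pressure at a given elevation (troposphere only)."""
    exponent = G * M_AIR / (R * LAPSE)
    return P0 * (1.0 - LAPSE * elevation_m / T0) ** exponent


def air_column_g_cm2(elevation_m: float) -> float:
    """Mass of air overhead per unit area in g/cm^2 (pressure divided by g)."""
    kg_per_m2 = pressure_pa(elevation_m) / G
    return kg_per_m2 / 10.0  # 1 kg/m^2 = 0.1 g/cm^2


def water_layer_g_cm2(depth_m: float) -> float:
    """Areal density of a water layer, assuming 1 g/cm^3."""
    return depth_m * 100.0


if __name__ == "__main__":
    for label, h in [("sea level", 0.0), ("5000 ft", 5000 * 0.3048)]:
        print(f"{label:>9}: {air_column_g_cm2(h):7.1f} g/cm^2 of air overhead")
    print(f"2 m water: {water_layer_g_cm2(2.0):7.1f} g/cm^2")
```

Under these assumptions, sea level has about 1033 g/cm² of air overhead and 5000 ft about 860 g/cm², so moving up to the Salt Lake Valley removes roughly 170 g/cm², comparable to 1.7 m of water.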
Is this why when I'm in an airplane I can never remember if I turned all the lights out? ;-)
Lost at C:>. Found at C.
On Mount Everest, clocks run fast by 0.00261261 seconds (2.6 ms) compared to sea level.
Every foot higher you go is 90 billionths of a second difference, if you want to check the maths for me. The problem is, we're not talking about a sea-level / Mount Everest communication here. The RAM chips are about a foot long at absolute maximum.
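These figures can be checked with the weak-field approximation, in which a clock raised by height h runs fast relative to a lower one by a fractional rate of about gh/c². The comment does not say over what period the 2.6 ms accumulates, so this rough check (assuming standard gravity and an Everest height of 8848 m) computes how long that would take, what the same span gives per foot, and the fractional difference across a rack.

```python
# Weak-field gravitational time dilation near Earth's surface:
# a clock raised by height h runs fast relative to one below it by a
# fractional rate of approximately g*h/c^2.
G = 9.80665            # standard gravity, m/s^2
C = 299_792_458.0      # speed of light, m/s
FOOT_M = 0.3048
EVEREST_M = 8848.0     # approximate summit elevation, m (assumption)
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def fractional_rate(height_m: float) -> float:
    """Fractional clock-rate difference for a height difference (weak-field approx.)."""
    return G * height_m / C**2


def drift_seconds(height_m: float, elapsed_s: float) -> float:
    """Accumulated time offset after elapsed_s seconds."""
    return fractional_rate(height_m) * elapsed_s


if __name__ == "__main__":
    rate = fractional_rate(EVEREST_M)
    per_year_us = drift_seconds(EVEREST_M, SECONDS_PER_YEAR) * 1e6
    print(f"Everest vs sea level: {rate:.3e} (about {per_year_us:.0f} us per year)")
    # How long must elapse for a 2.61261 ms offset to accumulate?
    span_s = 0.00261261 / rate
    print(f"2.61261 ms accumulates in about {span_s / SECONDS_PER_YEAR:.0f} years")
    # Over that same span, one foot of height gives:
    print(f"one foot over that span: {drift_seconds(FOOT_M, span_s) * 1e9:.0f} ns")
    # Fractional difference across a roughly 2 m rack:
    print(f"2 m rack: {fractional_rate(2.0):.1e}")
```

With these assumptions, 2.61261 ms corresponds to roughly 86 years of accumulated offset, and over the same span one foot gives about 90 ns, consistent with the per-foot figure. Across a 2 m rack the fractional difference is about 2e-16, far too small to matter.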
And these sorts of effects then suddenly skitter into insignificance compared to solar radiation, different pressures, different air make-ups, heat, etc.
The fact is, we know that this effect exists. We know that time-slowing exists (GPS wouldn't work if we didn't compensate for such things). We know that solar radiation exists. But this single statistic barely bothers to eliminate memory manufacturer, operating voltage, or ambient temperature as a cause rather than these exotic causes.
Chances are, they might just have had a batch of dodgy RAM chips from a single manufacturer more than ANYTHING else combined.
And, even then, you'd need thousands of test sites / machines to even hint at the cause. But why bother? We know there would be an effect, we also know it wouldn't be this large or obvious and that - chances are - there's a much simpler explanation. The whole "top of the rack fails more often" hints at what complete and utter bullshit this is. That would be an effect we'd notice at sea level, and things like ventilation and heating most likely have orders of magnitude more to do with it.
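Whether a 20 percent difference is even resolvable depends on how many errors were counted. A minimal sketch, assuming errors in each group follow a Poisson distribution with equal exposure (same number of modules, same observation time) and using a normal approximation rather than an exact test:

```python
# How many errors must be observed before a 20% difference between two
# groups (e.g. top-of-rack vs bottom-of-rack modules) stands out from
# Poisson noise? Normal approximation to the difference of two Poisson
# counts, assuming equal exposure in both groups.
import math


def z_score(count_a: int, count_b: int) -> float:
    """Approximate z-score for the difference of two equal-exposure Poisson counts."""
    total = count_a + count_b
    if total == 0:
        return 0.0
    return (count_a - count_b) / math.sqrt(total)


def baseline_count_needed(ratio: float, z: float) -> float:
    """Baseline count n such that ratio*n vs n errors reach the target z.

    Solves (ratio - 1) * n / sqrt((ratio + 1) * n) = z for n.
    """
    return z**2 * (ratio + 1) / (ratio - 1) ** 2


if __name__ == "__main__":
    for z in (2.0, 3.0):
        n = baseline_count_needed(1.2, z)
        print(f"z = {z}: about {n:.0f} errors in the baseline group")
    # Small samples are indistinguishable from noise:
    print(f"12 vs 10 errors: z = {z_score(12, 10):.2f}")
```

Under that approximation, a 20 percent excess needs on the order of 220 errors in the baseline group to reach two standard deviations, and roughly 500 for three. This only addresses counting noise; it says nothing about confounders such as heat.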
. . . .recall that the new NSA "Supercenter" in Utah is at ~4300 feet. So they'll be making a lot MORE errors when monitoring us all. . .
OK. Where am I going to find an RoHS lead block?
Have gnu, will travel.
http://hyperphysics.phy-astr.gsu.edu/hbase/astro/cosmic.html
There are statistics that cover the expected frequency of events caused by radiation in the first couple of pages.
http://docs.oracle.com/cd/E19095-01/sf3800.srvr/816-5053-10/816-5053-10.pdf
Either give it away or get top dollar, but never sell yourself cheap.
But could it be simply gravity?
You mean because the 1 bits are lighter than the 0 bits? But you've got to remember about packing density. You can fit a lot more 1s than 0s because they are thinner. Vibrations in the chips will help the 1s settle to the bottom, despite being lighter.
When our name is on the back of your car, we're behind you all the way!
For IA CPUs the L1 cache has parity, and the server-grade chips have ECC on the L2 cache.
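The difference between parity and ECC is detection versus correction. A toy sketch on 4-bit words: a single parity bit reveals that an odd number of bits flipped but not which one, while a Hamming(7,4) code's syndrome points at the flipped bit so it can be repaired. This is an illustration only, not the scheme Intel's caches actually use; real ECC typically uses wider SEC-DED codes.

```python
# Even parity vs. a Hamming(7,4) single-error-correcting code on 4-bit data.
# Illustrative only; hardware ECC commonly uses wider SEC-DED codes
# (for example, 8 check bits per 64 data bits).


def parity(bits):
    """Even-parity bit: 1 if the number of set bits is odd."""
    return sum(bits) % 2


def hamming74_encode(d):
    """Encode 4 data bits as [p1, p2, d1, p3, d2, d3, d4] (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = (d1 + d2 + d4) % 2  # covers positions 1, 3, 5, 7
    p2 = (d1 + d3 + d4) % 2  # covers positions 2, 3, 6, 7
    p3 = (d2 + d3 + d4) % 2  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code):
    """Correct up to one flipped bit; return (data_bits, error_position or 0)."""
    c = list(code)
    s1 = (c[0] + c[2] + c[4] + c[6]) % 2
    s2 = (c[1] + c[2] + c[5] + c[6]) % 2
    s3 = (c[3] + c[4] + c[5] + c[6]) % 2
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    stored = parity(data)
    flipped = [1, 1, 1, 1]  # one bit flipped, e.g. by a particle strike
    print("parity detects error:", parity(flipped) != stored)  # but not where
    word = hamming74_encode(data)
    word[4] ^= 1  # flip the bit at position 5
    decoded, pos = hamming74_decode(word)
    print("Hamming corrected:", decoded == data, "at position", pos)
```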
I am becoming gerund, destroyer of verbs.
Then wouldn't you expect the failure rate to fall off linearly, from 20% above baseline at the top of the rack down to the baseline at the bottom?
The majority of cosmic rays that make it this far are muons. These are relatively penetrating, and I highly doubt that a few centimetres of metal and plastic will have anything like a 20% effect. 60m underground with the ATLAS detector at the LHC, we still get a reasonable rate of cosmic rays, and we use them for calibration when there is no beam. The rate is reduced, but 60m of rock is far, far more shielding than a few computers. Plus, many of the cosmics passing through you come in at an angle, so the stack above will do nothing to shield against them.
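The shielding comparison can be made concrete with areal density. A rough sketch, assuming "standard rock" at 2.65 g/cm³ for the overburden (the real ATLAS overburden is not uniform rock), steel at 7.87 g/cm³ for the rack hardware, and a muon energy loss of roughly 2 MeV per g/cm², which ignores radiative losses at high energy:

```python
# Rough comparison of shielding by areal density (density x thickness).
# Densities are typical textbook values, assumed here; muon energy loss is
# approximated as ~2 MeV per g/cm^2 (near minimum ionizing), ignoring the
# radiative losses that matter for very energetic muons.
DENSITY_G_CM3 = {
    "standard rock": 2.65,
    "steel": 7.87,
}
MEV_PER_G_CM2 = 2.0


def areal_density(material: str, thickness_m: float) -> float:
    """Areal density in g/cm^2 for a slab of the given material and thickness."""
    return DENSITY_G_CM3[material] * thickness_m * 100.0


def muon_energy_loss_gev(grammage: float) -> float:
    """Approximate muon energy lost crossing the given grammage, in GeV."""
    return grammage * MEV_PER_G_CM2 / 1000.0


if __name__ == "__main__":
    rock = areal_density("standard rock", 60.0)
    steel = areal_density("steel", 0.03)  # "a few centimetres" of metal
    print(f"60 m rock: {rock:.0f} g/cm^2, ~{muon_energy_loss_gev(rock):.0f} GeV lost")
    print(f"3 cm steel: {steel:.1f} g/cm^2, ~{muon_energy_loss_gev(steel) * 1000:.0f} MeV lost")
    print(f"ratio: {rock / steel:.0f}x")
```

With these numbers, 60 m of rock is several hundred times the grammage of a few centimetres of steel, removing tens of GeV from a traversing muon versus tens of MeV.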
I expect that heat and vibration will be the most likely causes.
Enough with all the mixing of terminology.
You use 'altitude' when referring to how high something is above the ground. You use 'elevation' when referring to how high the ground is from sea level.
What you don't see are signs for city limits on the road with 'altitude' on them. They say 'elevation' for a reason. Just like you don't find an elevatometer inside an airplane. You find an altimeter.
Mixing these terms as you've done (and so has TFS, so I don't blame you as you were simply restating the flawed summary) only causes confusion.
Someone flopped a steamer in the gene pool.
About five years ago, I was involved in the installation of a thousand-node cluster in Boulder. We knew *before we went in* that we needed to change our EDAC (memory error correction) code to account for the higher rate of bit-flips due to the altitude. Some of the people we were working with had been there when those same problems nearly caused a months-long delay in a larger installation at NCAR nearby. We ended up running into a more subtle problem involving lower air density, heat and voltage, but *this* problem was incredibly old news even then.
Slashdot - News for Herds. Stuff that Splatters.
Perhaps the researchers are too young to have read this 1979 paper http://www.ncbi.nlm.nih.gov/pubmed/17820742