Elevation Plays a Role In Memory Error Rates
alphadogg writes "With memory, as with real estate, location matters. A group of researchers from AMD and the Department of Energy's Los Alamos National Laboratory have found that the altitude at which SRAM resides can influence how many random errors the memory produces. In a field study of two high-performance computers, the researchers found that L2 and L3 caches had more transient errors on the supercomputer located at a higher altitude, compared with the one closer to sea level. They attributed the disparity largely to lower air pressure and higher cosmic ray-induced neutron strikes. Strangely, higher elevation even led to more errors within a rack of servers, the researchers found. Their tests showed that memory modules on the top of a server rack had 20 percent more transient errors than those closer to the bottom of the rack. However, it's not clear what causes this smaller-scale effect."
Top of the rack tends to get toasty, but is this too simple?
But could it be simply gravity? I know that G is negligible when talking about electrons, but a difference in height in a gravity well does affect time. Maybe it's something to do with time being at different speeds or a similar effect?
Someone tell Fusion.io. They're based at 5000+ feet here in the Salt Lake valley! It would be interesting if their QC procedures are what have made them more reliable as the failure rate is higher where the testing is performed.
Do computers have a fear of heights?
Another reason for nerds to stay in the basement
>Their tests showed that memory modules on the top of a server rack had 20 percent more transient errors than those closer to the bottom of the rack.
Maybe it has something to do with the top side being shielded by other servers.
Oxygen deprivation. Wait, are we talking about the same thing? Altitude does weird things to the brain.
[Posted from Everest base camp]
This has been known for at least 20 years.
This isn't news. Companies that make supercomputers have known this for decades. The one I worked for 15 years ago used a high elevation test environment in Colorado to verify error correcting capabilities. Even the article says that the results were not a surprise.
This message is encrypted with Quad ROT-13 to protect the author's copyright under the DMCA.
A couple of years back at one of the Supercomputing conferences (I think in Phoenix), Fermilab had a cloud chamber in their booth, and you simply *would* *not* believe the amount of ambient radiation passing you at all times. I can easily believe that altitude would have an effect.
Another interesting idea would be to do the same experiment by latitude. Does the Arctic Region Supercomputing Center have a higher rate than the Maui Supercomputing Center? What happens during an aurora?
My wife tells me I have problems with my memory because she supposedly told me some things yesterday. I tell her I have "selective" memory instead. I choose what I want to hear... :)
Pretty well known that radiation-induced soft errors increase with altitude - just ask your aviation and space industry brethren
If you get high you can lose your memory?
The space industry has been warning the commercial industry about this for decades. Ten or so years ago we started seeing upsets in aircraft and then on the ground. This makes perfect sense: the higher you are, the more upsets. Within a rack, building, etc., the deeper you go the more shielding you get, so the top gets more than the bottom.
Interesting that this is sort of the premise of Vernor Vinge's sci-fi novel A Fire Upon the Deep. Namely, that the further from the busy galactic core computers are, the more error-free and powerful they are, and thus the ability of civilizations to progress is limited by their location in the galaxy.
According to the article, the low elevation system was a Jaguar supercomputer whereas the high elevation one was a Cielo supercomputer. Based on available specs for each, the two are entirely different systems. How can they reach conclusions about altitude-relative bit error rates when they're not even comparing the same system? The article goes on to state:
"The group had found that, when all other possible confounding issues were factored out, Cielo's SRAM had a "significantly higher rate of SRAM faults," compared with Jaguar's SRAM, Sridharan said."
Huh? They factored out all confounding issues except that they were completely different systems.
It seems to me that an unexploited structure for a low radiation environment is the bottom side of a water tower. Steel has most radionuclides slagged off when it is produced while drinking water standards ensure the water in the tower will have low radioactivity. A meter or two of water forms a nice shield for cosmic rays from above while the air below the tower shields against lower energy ground radiation. And, you get a nice heat sink in the water for cooling electronics.
As one of the authors from the LANL-side I want to be clear that Sandia National Laboratories played a vital and at least equal role in this work - paper, analysis, as well as procurement and running of the Cielo supercomputer studied. The partnership with AMD, SNL, and LANL has been outstanding.
As this is not (mainly) about the system RAM, it's about the CPU caches, I wonder if any attempt is being made to correct the errors, and if it's worthwhile. One would just need to reset the node on any sign of an error, so the capacity penalty would be small compared to ECC. On the other hand, the errors could just as well happen in the actual logical units, and at some point it's impossible or very expensive to protect everything. Because the SRAM takes up a large fraction of the CPU area, it may be useful to add something to protect the caches.
For some workloads you can do consistency checks in software, but for many computations that would require you to run the computations twice -- which is very expensive. Maybe statistical methods can be used to include a term for gross numerical errors -- different from the small floating point errors. It would probably be close to impossible to model the effect of such errors on the results though. Another option is to shield the datacentres from cosmic rays, if those are indeed the culprit.
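To put a rough number on the "detect and reset" idea versus full ECC, here is a minimal Python sketch. It assumes one parity bit per 64-bit word for detection only, and compares that with the 8 check bits of the common (72,64) SECDED layout. Whether the caches in the study use either layout is not stated in this thread, and the function names are made up for illustration.

```python
def parity(word):
    """Even-parity bit for an integer word: 1 if the count of set bits is odd."""
    return bin(word).count("1") & 1


def is_corrupted(word, stored_parity):
    """Detect (but not correct) an odd number of flipped bits in word."""
    return parity(word) != stored_parity


def storage_overhead(check_bits, data_bits):
    """Fraction of extra storage spent on check bits."""
    return check_bits / data_bits


word = 0x0123456789ABCDEF
stored = parity(word)

single_flip = word ^ (1 << 17)               # one upset bit: caught
double_flip = word ^ (1 << 17) ^ (1 << 40)   # two upset bits: missed by parity

print(is_corrupted(single_flip, stored))
print(is_corrupted(double_flip, stored))
print(storage_overhead(1, 64), storage_overhead(8, 64))
```

Parity alone can only say "something flipped, reset the node", and it misses any even number of flips in the same word. That is the price of the smaller storage overhead.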
When you get high, memory suffers.
Is this why when I'm in an airplane I can never remember if I turned all the lights out? ;-)
Lost at C:>. Found at C.
. . . .recall that the new NSA "Supercenter" in Utah is at ~4300 feet. So they'll be making a lot MORE errors when monitoring us all. . .
The 1U lead block. Place at top of rack to protect the servers below.
Does it work? Who cares. If people will pay £150 for a wooden volume knob on their audio system, someone is going to pay whatever you ask for a lump-o-lead that may or may not improve the reliability of equipment below.
Why did anyone need to do this field survey? It simply confirms what we already know - cosmic rays create SRAM errors. Hot components fail more than cold components. Big whoop.
Damn, I jumped onto the aluminum case fad bandwagon around 2000, and never looked back. Just bought another one, a couple months ago. Now you're telling me the next big thing that all the cool kids will have, is a lead case?
Why not put a swimming pool on top of the building? A nice perk for the workers AND radiation protection for the chips.
According to the Jargon File, IBM tested lead as a method for shielding chips from cosmic rays, and found it to be ineffective.
I find it interesting that IBM's result conflicts with this DoE conclusion; however, I think it's consistent with lead being a . Of course, you said:
>Does it work? Who cares. If people will pay £150 for a wooden volume knob on their audio system, someone is going to pay whatever you ask for a lump-o-lead that may or may not improve the reliability of equipment below.
so I guess, in the spirit of P.T. Barnum, carry on.
http://hyperphysics.phy-astr.gsu.edu/hbase/astro/cosmic.html
The Earth acts as a shield that protects memory from radiation coming from the other side of the planet. In addition, the probability that a particle collides before reaching the ground grows with the distance it travels through the atmosphere, so at ground level you are most likely to be hit by particles arriving from the vertical. On a desktop computer the RAM is usually oriented vertically and exposes its shorter edge to the top, so the exposed area is very small for radiation coming from above. Note that, because of the motherboard's orientation, this is also true for many of the components mounted on it. On a server, the RAM might be vertical but expose its largest side to the top. RAM mounted at an angle is even worse. Note that on a server the motherboard is oriented horizontally, exposing most of the components' largest faces to the top. So it's no big surprise that the caches inside the CPU are hit more often on a server than in a desktop computer.
Write code that reserves several GB of non-ECC memory, sets every bit, and then checks in an infinite loop that all bits are still set. Now try bringing various types of arc lamps close to the memory and count the number of hours (or minutes if you have a strong source) before a bit is detected to be zero. Then repeat the experiment with different RAM orientations.
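The test described above could be sketched roughly like this in Python. The function names, the scan interval, and the buffer size are my own assumptions. Note also that the OS and the CPU caches sit between this code and the physical DRAM cells, so a clean scan does not prove that no cell was upset.

```python
import time

PATTERN = 0xFF  # every bit set, as the comment suggests


def make_buffer(n_bytes):
    """Allocate n_bytes of memory with every bit set to 1."""
    return bytearray([PATTERN]) * n_bytes


def count_cleared_bits(buf):
    """Return how many bits in buf have flipped from 1 to 0."""
    # Fast path: if every byte still equals 0xFF, nothing has flipped.
    if buf.count(bytes([PATTERN])) == len(buf):
        return 0
    return sum(8 - bin(b).count("1") for b in buf if b != PATTERN)


def watch(buf, interval_s=60.0, max_checks=None):
    """Re-scan buf every interval_s seconds.

    Returns the elapsed seconds at the first detected flip, or None if
    max_checks scans all come back clean. With max_checks=None it loops
    until a flip is seen.
    """
    start = time.monotonic()
    checks = 0
    while max_checks is None or checks < max_checks:
        if count_cleared_bits(buf):
            return time.monotonic() - start
        checks += 1
        time.sleep(interval_s)
    return None
```

Something like `watch(make_buffer(2 * 1024**3))` would hold 2 GiB and scan it once a minute until a bit drops. Run it once per RAM orientation and compare the times.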
There are statistics that cover the expected frequency of events caused by radiation in the first couple of pages.
http://docs.oracle.com/cd/E19095-01/sf3800.srvr/816-5053-10/816-5053-10.pdf
Either give it away or get top dollar, but never sell yourself cheap.
I'm also more prone to errors when I'm high
More music, fewer hits
Top of Rack Shields bottom of Rack from Cosmic Radiation - is it not this simple?
Doh,... thinner air will not absorb and remove as much heat per cubic meter of air compared to sea-level air. This plus radiation at higher altitudes...
From deep within the PDF (second link):
(Living in Colorado, I thought perhaps chips suffered from the same spurting newly opened toothpaste tube problem when manufactured at low altitude and installed into operation at high altitude, but it turned out the hypothesis was different, and, of course, left out of the Slashdot summary.)
What kind of materials (if any) are effective in blocking cosmic rays? Would it be possible to integrate cosmic radiation shielding into an average-sized PC case? If that's impractical, are there building materials that can be used in roofs and/or walls to block this stuff without breaking the bank?
Reminds me of the BOFH's excuse of the day...
Today's excuse, btw, is:
http://pages.cs.wisc.edu/~ballard/bofh/bofhserver.pl
Then wouldn't you expect a cascading rate of failures from 20% down to the baseline bottom rack in a linear fashion?
The majority of cosmic rays that make it this far are muons. These are relatively penetrating, and I highly doubt that a few centimetres of metal and plastic will have anything like a 20% effect. Even 60m underground, with the ATLAS detector at the LHC, we still get a reasonable rate of cosmic rays, and we use them for calibration when there is no beam. The rate there is reduced, but 60m of rock is far, far more shielding than a few computers. Also, many cosmics come in at an angle, so the stack above will have no effect on shielding against these.
I expect that heat and vibration will be the most likely causes.
My father was an astro-physicist studying mesons and other high-energy particles, and unless you were looking for neutrinos, altitude was important. His main "meson telescope" was on top of a 14,000 foot + mountain (Mount Evans) in Colorado. Anyway, I think if you want to mitigate RAM errors in server farms, the simplest thing is to place a thin sheet of lead on top of the rack... or over the roof of the building.:-)
You would need many metres to have a noticeable effect on the penetrating muons which make up the majority of cosmics at the surface. This should tell you that a few computer boxes are not really likely to have much of a shielding effect. This is reinforced by the fact that many cosmics come at shallow angles so the stack above provides no shielding. I doubt this is a cosmic ray effect.
"However, it's not clear what causes this smaller-scale effect."
Servers are made out of metal and have EM fields! This isn't hard.
who's with me?
In aggregate, the entire atmosphere down to sea level works out to something like the equivalent of 30 ft of water of shielding. A 20% reduction through an entire rack of servers sounds to be in about the right ballpark.
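The water-equivalent figure is easy to check from standard sea-level pressure. The short Python sketch below does that, and also shows how much shielding a 20% reduction would need under a simple exponential attenuation model. That model and the attenuation length passed in are assumptions for illustration, not numbers from the article, and the function names are hypothetical.

```python
import math

STANDARD_PRESSURE_PA = 101_325.0  # standard atmosphere at sea level
G = 9.80665                       # standard gravity, m/s^2
WATER_DENSITY = 1000.0            # kg/m^3
FEET_PER_METRE = 3.28084


def water_equivalent_m(pressure_pa=STANDARD_PRESSURE_PA):
    """Depth of water whose weight per unit area matches the air column."""
    return pressure_pa / (WATER_DENSITY * G)


def shielding_for_reduction(reduction, attenuation_length):
    """Shield thickness x such that exp(-x / L) = 1 - reduction.

    Assumes simple exponential attenuation. attenuation_length (L) is
    supplied by the caller and the result is in the same units.
    """
    return -attenuation_length * math.log(1.0 - reduction)


depth_m = water_equivalent_m()
print(round(depth_m, 2), "m of water")
print(round(depth_m * FEET_PER_METRE, 1), "ft of water")

# A 20% reduction needs about 0.22 attenuation lengths of material,
# whatever the attenuation length turns out to be for the particles involved.
print(round(shielding_for_reduction(0.20, 1.0), 3))
```

So the atmosphere is about 10.3 m (roughly 34 ft) of water equivalent. Whether a rack's worth of steel and silicon adds up to 0.22 attenuation lengths depends on which particles dominate, which is exactly what the muon and neutron posts here disagree about.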
People have been running the same experiments on international flights on laptops for years.
The DEC Alpha cluster (ASCI Q) had problems with memory errors - specifically with CPU cache errors - that they eventually found to be caused by increased levels of radiation from being at the higher (7800ft) elevation.
Living in Colorado, this just seems like long known general knowledge to me. Our bodies get a higher radiation dose at this altitude, and so does our equipment. HP was notorious for responding "Cosmic Rays!!!!11!!1" when we'd place service calls for single bit errors on PA-RISC CPUs. They wouldn't replace them until a repeating history was established for a particular CPU.
I have no idea if that's enough shielding to matter. However, if true, would we also see higher error rates in daytime when the body of the Earth isn't standing between the server and the Sun?
Cyrano de Maniac
Just curious. I've seen direct driven blowers in a number of various applications. Is there some special need to use belt driven blowers for the air in data centers?
How long before the cloud computing and storage services start charging a slight premium to have your stuff run or stored on lower spots in their server racks?
SLASHDOT: news for people who can't concentrate on work or have no life at all and got tired of yelling back at the TV.
I know I have trouble remembering when I'm high. Seems like electronics should have the same problem.
While I can guess why more ionizing events stemming from neutron impacts affect electronics, I don't get the blaming of "pressure". Perhaps they mean the reduced air cooling of electrical components from a less dense atmosphere? Someone else noted that components on the top of a rack might tend to be warmer. This might be more of the same sort of effect.
Field Programmable Gate Arrays (FPGAs) have the largest and densest SRAMs of any modern device. Google "FPGA SEU" and you'll see dozens of articles describing the issue and the mitigations the various vendors make. The researchers in question were either grossly ignorant or playing games.
Spacecraft and avionics designers have even more interesting issues . . .
Isn't it obvious?
About five years ago, I was involved in the installation of a thousand-node cluster in Boulder. We knew *before we went in* that we needed to change our EDAC (memory error correction) code to account for the higher rate of bit-flips due to the altitude. Some of the people we were working with had been there when those same problems nearly caused a months-long delay in a larger installation at NCAR nearby. We ended up running into a more subtle problem involving lower air density, heat and voltage, but *this* problem was incredibly old news even then.
Slashdot - News for Herds. Stuff that Splatters.
If cosmic rays were the cause (presumably from the sun since that's the closest source) then I would assume that latitude would be an important factor as well. Less sun means less cosmic radiation. My money is on the simplest explanation: Heat.
Time flies when you don't know what you're doing
Just like humans, computers have trouble remembering things when high.
Honest, Mr NSA Sir, I was just searching for happy kitty cats and not "How to load a nuclear bomb in a suitcase" -- it must have been those nasty cosmic rays changing up my searches!
(I WONDERED why I kept finding lead suitcases with "Hello, Kitty" emblazoned on them.)
If the universe is someone's simulation -- does that mean the stars are just stuck pixels?
Almost no cosmic rays come from our local sun, mostly just slower solar wind particles.
"Think about how stupid the average person is. Now, realise that half of them are dumber than that." - George Carlin
Perhaps the researchers are too young to have read this 1979 paper http://www.ncbi.nlm.nih.gov/pubmed/17820742
My memory has gotten worse as I've gotten taller. I blame the cosmic rays.
It's probably severely shielded to prevent TEMPEST attacks. Far more so than a typical corporate datacenter.
have a high-altitude (60+ thousand feet) experiment. Bit error rates are a few orders of magnitude higher up there.
In New Orleans you could build a 'ground level' datacenter and be 'below sea level'. ... But if a dike fails, the salt water might be a little 'rough' on electronic components.
In reality, some of the old 'salt mines' might be good place for 'high reliability' data centers.
The air at higher elevations tends to have significantly less humidity.
I'd like to see some distinction made between the relatively few evaporative-cooled servers.
I make new friends every day.
Check out this Actel whitepaper (PDF). It describes a similar phenomenon, with such errors taking place three times more often in mile-high Denver than in "Baghdad by the Bay" San Francisco.