Slashdot Mirror


Elevation Plays a Role In Memory Error Rates

alphadogg writes "With memory, as with real estate, location matters. A group of researchers from AMD and the Department of Energy's Los Alamos National Laboratory have found that the altitude at which SRAM resides can influence how many random errors the memory produces. In a field study of two high-performance computers, the researchers found that L2 and L3 caches had more transient errors on the supercomputer located at a higher altitude, compared with the one closer to sea level. They attributed the disparity largely to lower air pressure and higher cosmic ray-induced neutron strikes. Strangely, higher elevation even led to more errors within a rack of servers, the researchers found. Their tests showed that memory modules on the top of a server rack had 20 percent more transient errors than those closer to the bottom of the rack. However, it's not clear what causes this smaller-scale effect."

33 of 190 comments

  1. Heat related? by Anonymous Coward · · Score: 5, Insightful

    Top of the rack tends to get toasty, but is this too simple?

    1. Re:Heat related? by Thornburg · · Score: 2

      Top of the rack tends to get toasty, but is this too simple?

      I logged in to say that.

      It seems obvious -- heat rises, I would expect top of rack components to fail more often unless the cooling design is well done.

      Completely fabricated statistic: Only 10% of datacenters have proper cooling design.

    2. Re:Heat related? by spike hay · · Score: 3, Informative

      If it's cosmic rays causing a lot of the problem, the extra material of the racks above would make a difference.

      --
      If you don't understand any of my sayings, come to me in private and I shall take you in my German mouth.
    3. Re:Heat related? by dszd0g · · Score: 2

      Single event upsets (SEUs) are caused by cosmic particles which create alpha particles, so it makes sense that equipment higher in the rack would absorb more of the alpha particles and block them from reaching systems lower in the rack, but I am not a physicist. Alpha particles are relatively easy to block with shielding.

      http://www.statemaster.com/encyclopedia/Single_event-upset

      As the link said, this was first theorized in 1978 and supercomputer companies have been designing systems with this in mind for decades.

      --
      This message is encrypted with Quad ROT-13 to protect the author's copyright under the DMCA.
    4. Re:Heat related? by AmiMoJo · · Score: 3, Interesting

      Vibration as well. The top of the stack moves quite a bit more than the bottom of the stack, even though the overall magnitude of the movement is small.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    5. Re:Heat related? by spike hay · · Score: 4, Interesting

      Radiation blockage is mostly a function of mass the rays have to go through. The vast majority of cosmic rays are blocked by the 14 pounds per square inch/100 kilopascals of air above us. That means that a square inch of ground at sea level has 14 pounds of air above it. A square inch section of a rack above you would probably be in the pounds as well, and would block a good portion.
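
      The link between air pressure and shielding mass above can be checked by dividing pressure by gravitational acceleration. This is only a rough sketch: the standard sea-level pressure, the value of g, and the function name are assumptions for illustration, not figures from the comment.

```python
# Sketch: convert surface air pressure into the mass of air overhead per unit area.
STANDARD_PRESSURE_PA = 101_325.0   # assumed standard sea-level pressure
STANDARD_GRAVITY = 9.80665         # assumed gravitational acceleration, m/s^2
LB_PER_KG = 2.20462262
SQ_IN_PER_SQ_M = 1550.0031


def column_mass_kg_per_m2(pressure_pa, g=STANDARD_GRAVITY):
    """Mass of the air column above one square metre, treating g as constant."""
    return pressure_pa / g


air_kg_m2 = column_mass_kg_per_m2(STANDARD_PRESSURE_PA)
air_g_cm2 = air_kg_m2 * 1000 / 10_000                 # kg/m^2 -> g/cm^2
air_lb_in2 = air_kg_m2 * LB_PER_KG / SQ_IN_PER_SQ_M   # kg/m^2 -> lb/in^2

print(f"{air_kg_m2:.0f} kg/m^2, {air_g_cm2:.0f} g/cm^2, {air_lb_in2:.1f} lb/in^2")
```

      Under these assumptions the air column comes out at roughly 10 tonnes per square metre, which matches the 14 pounds per square inch quoted above.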

      --
      If you don't understand any of my sayings, come to me in private and I shall take you in my German mouth.
    6. Re:Heat related? by Anonymous Coward · · Score: 2, Informative

      Top of the rack tends to get toasty, but is this too simple?

      It is too simple.
      In a data center with downflow CRACs that push air through perforated tiles, sufficient underfloor plenum pressure is supposed to be maintained so that the upward air velocity carries cold air all the way up the front of the cabinet, affording sufficient cooling to everything. Not that it always works that way.

      But one thing to consider is dirt.
      Even with MERV 8 or better filtration, dust will still circulate in a data center cooled this way. With the filtration on the CRAC return, the lightest dust particles will float up to the return and get filtered, but the heavier particles will not make it that high. That is why a clean room has a downward airflow towards filters at the floor, unlike a data center.

      What happens is that the lowest systems in a cabinet will get the heaviest coating of dust, made up of the largest particles, with the finer dust more frequently making it into the upper systems.

      I have a good handle on dust introduced from outside air (filtration of makeup air, positive pressurization, policies against cardboard boxes, etc.), but one internal source of dust that is hard to eliminate is blower belts. Even when switching to cogged belts, black rubber dust particles will be created and get deposited on surfaces all over.

      This is only speculation, but perhaps the finer particles are more damaging than the coarser ones.

    7. Re:Heat related? by DeathToBill · · Score: 4, Interesting

      I was looking into RAM error rates a week or so ago. There's not a lot of research around, but I recall seeing suggestions that error rates were significantly smaller if the chips were mounted vertically rather than horizontally - because vertically mounted chips present a lower vertical cross-section and most error-inducing cosmic rays come at near-vertical inclination.

      --
      Slashdot - News for Nerds, Stuff that Matters, in ISO-8859-1 Has just realised that beta makes this signature redundant
    8. Re:Heat related? by barlevg · · Score: 3, Informative

      Back-of-the-envelope calculation using XCOM.

      Assume the server rack and its contents are made of aluminum (what is the predominant material in a server rack?). Let's say the server rack is 2m in height, but it's not fair to make the whole thing metal. Let's say 20% of it is metal (aluminum for this calculation), the rest is air (or, for the sake of calculation, vacuum). Take aluminum's density as 2g / cm^3 (the actual value is closer to 2.7g / cm^3, but this is a rough estimate), so a 1m x 1m x 0.4 m slab of aluminum would weigh 800 kg, which appears to be in the middling range for what a server rack can accommodate. Again, keep in mind, this is a really rough calculation.

      Okay, plugging in Aluminum into XCOM gives a total attenuation in the 100-1k MeV range of ~0.03 cm^2/g.

      e^[-(0.03 cm^2/g) * (2g / cm^3) * 40 cm] = 0.09

      In other words, that's roughly 90% attenuation. Keep in mind that this was a ridiculously sloppy calculation, with my material assumptions (and possibly energy ranges) being way off (also, neutron cross-sections could easily be different from photon cross-sections). The point is, it's certainly possible (nay, likely) that the material of the upper servers is providing shielding for the servers on the bottom of the rack.
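
      The arithmetic above is a plain exponential-attenuation estimate, and it can be reproduced in a few lines. This sketch uses the comment's own inputs (0.03 cm^2/g, 2 g/cm^3, 40 cm); the function and constant names are illustrative only, and since XCOM tabulates photon cross-sections, none of this says anything definite about neutrons.

```python
import math


def transmitted_fraction(mass_atten_cm2_per_g, density_g_per_cm3, thickness_cm):
    """Exponential attenuation: exp(-(mu/rho) * rho * x)."""
    return math.exp(-mass_atten_cm2_per_g * density_g_per_cm3 * thickness_cm)


# Inputs taken from the back-of-the-envelope estimate above.
MU_OVER_RHO = 0.03   # cm^2/g, total attenuation read off XCOM
DENSITY = 2.0        # g/cm^3, the rounded aluminum density used above
THICKNESS = 40.0     # cm, i.e. 20% of a 2 m rack

t = transmitted_fraction(MU_OVER_RHO, DENSITY, THICKNESS)
print(f"transmitted: {t:.3f}, attenuated: {1 - t:.1%}")
```

      Because the result is exponential in density times thickness, changing the density or the metal fill fraction moves the answer a lot, which is one reason to treat the 90% figure as an order-of-magnitude guess.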

    9. Re:Heat related? by Wintermute__ · · Score: 2

      Although true, I don't imagine vibration has any effect on SRAM error rates. Hard drive failure rates, I could imagine (though that's a big stretch).

      I wonder if it has to do with the upper servers shielding the lower ones on the rack from the cosmic rays. Time for a tinfoil hat for my servers!

    10. Re:Heat related? by Anonymous Coward · · Score: 3, Insightful

      Also stack turned-off servers above an active one on bottom to see if it's shielding.

    11. Re:Heat related? by fnj · · Score: 3, Informative

      Cosmic rays (they are actually particles, not electromagnetic radiation) cover a whole range of stuff, with individual particles varying extremely widely in energy content. Primary cosmic rays originate outside Earth's atmosphere. When they collide with the atmosphere, secondary cosmic rays are generated. Primary cosmic rays are mostly (99%) nuclei of various atoms. The remaining 1% are mostly free electrons (beta particles). In turn, 90% of the nuclei are free protons (hydrogen nuclei), just because most of the matter in space is hydrogen. 9% are alpha particles (helium nuclei), and 1% are the nuclei of other (heavier) elements. There is also a very small fraction of more exotic stuff, like antimatter.

      While the mean energy content of a cosmic ray particle is in the range of only about 10^-11 to 10^-10 J, extremely rare single particles with energy content up to 50 J exist. This energy is truly astounding, as it means a single submicroscopic particle has the same kinetic energy as a slowly pitched or fairly briskly thrown baseball!
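
      The baseball comparison can be sanity-checked from the kinetic energy formula E = (1/2) m v^2. A minimal sketch, assuming a baseball mass of 0.145 kg (that mass is not stated in the comment, and the names are illustrative):

```python
import math

BASEBALL_MASS_KG = 0.145   # assumed mass of a regulation baseball
MPS_TO_MPH = 2.23694


def speed_for_kinetic_energy(energy_j, mass_kg):
    """Speed at which a mass carries the given kinetic energy (non-relativistic)."""
    return math.sqrt(2.0 * energy_j / mass_kg)


v = speed_for_kinetic_energy(50.0, BASEBALL_MASS_KG)
v_mph = v * MPS_TO_MPH
print(f"{v:.1f} m/s, about {v_mph:.0f} mph")
```

      With that assumed mass, 50 J corresponds to a ball moving at a bit under 60 mph.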

      Cosmic rays are some of the most penetrating radiative phenomena known. Just compare their mean atmospheric penetrative power to that of other radiative phenomena. The following represent rough mean values of what are actually widely distributed ranges; in other words, some fraction of cosmic rays penetrate hugely in excess of the figure quoted below, just as some fraction falls far short.

      cosmic "rays" - 10,000 m (about the same for both primary and secondary)
      gamma rays - 1000 m
      x-rays - 100 m
      alpha particles - 0.1 m

      It should also be noted that significant sources of radiative phenomena are generally point sources, or at least localized sources. They are attenuated in concentration, not total amount, by distance, even in a perfect vacuum. This arises from spreading out according to the inverse square law. For example, if you want to escape the radiation from a nuclear explosion, even in outer space, you can just move away from it. Cosmic rays are completely different in that they are diffuse. They are not "radiating" from a single point at all. They are distributed in concentration and direction everywhere. There is no attenuation due purely to distance. The attenuation of cosmic rays by the atmosphere is a result of collisions of cosmic ray particles with the atoms in the atmosphere.

      Cosmic rays, or better stated, cosmic ray products (neutrinos) have been detected in deep mineshafts after penetrating kilometers of rock. Clearly the beta particles are not penetrating very much at all, and even the nuclei have limited penetration, but some of the subnucleic particles ain't stoppin' for nobody.

    12. Re:Heat related? by WWJohnBrowningDo · · Score: 3, Funny

      BRB, going to convince my boss to tip all our servers over.

    13. Re:Heat related? by gweihir · · Score: 2

      Then they do not know much about rack construction. Standard racks suck in cold air from the front (cold aisle) and blow it out the back (hot aisle). There is no difference whether the computer sits on the bottom or the top of the rack as the hot air from any of them never gets to another computer directly.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  2. Fusion IO? by shadowknot · · Score: 3, Interesting

    Someone tell Fusion.io. They're based at 5000+ feet here in the Salt Lake valley! It would be interesting if their QC procedures are what have made them more reliable as the failure rate is higher where the testing is performed.

  3. basements by Anonymous Coward · · Score: 5, Funny

    Another reason for nerds to stay in the basement

  4. This isn't news by dszd0g · · Score: 4, Informative

    This isn't news. Companies that make supercomputers have known this for decades. The one I worked for 15 years ago used a high elevation test environment in Colorado to verify error correcting capabilities. Even the article says that the results were not a surprise.

    --
    This message is encrypted with Quad ROT-13 to protect the author's copyright under the DMCA.
    1. Re:This isn't news by edibobb · · Score: 4, Informative

      From the article: "It is well known that the altitude at which a data center resides has consequences with regards to machine fault rates. The two primary causes of increased fault rates at higher altitude are reduced cooling due to lower air pressure and increased cosmic ray-induced neutron strikes."

  5. This is news? by Nkwe · · Score: 4, Funny

    If you get high you can lose your memory?

  6. Re:That's interesting! by Antipater · · Score: 5, Funny

    Another interesting idea would be to do the same experiment by latitude. Does the Arctic Region Supercomputing Center have a higher rate than the Maui Supercomputing Center?

    They tried to do that test a few years back, but both research teams mysteriously disappeared. The leading hypothesis is that the Arctic team was eaten by polar bears, but nobody has any idea what happened to the Maui team. The only clue left at the scene was a nearly-empty glass of pina colada.

    --
    Everything is better with chainsaws.
  7. Water towers by mdsolar · · Score: 2

    It seems to me that an unexploited structure for a low radiation environment is the bottom side of a water tower. Steel has most radionuclides slagged off when it is produced while drinking water standards ensure the water in the tower will have low radioactivity. A meter or two of water forms a nice shield for cosmic rays from above while the air below the tower shields against lower energy ground radiation. And, you get a nice heat sink in the water for cooling electronics.

  8. Hmmm .... by gstoddart · · Score: 3, Funny

    Is this why when I'm in an airplane I can never remember if I turned all the lights out? ;-)

    --
    Lost at C:>. Found at C.
  9. Re:This may be stupid... by ledow · · Score: 3, Insightful

    On Mount Everest, time slows by 0.00261261 seconds (2.6ms) compared to sea level.

    Every foot higher you go is 90 billionths of a second difference, if you want to check the maths for me. The problem is, we're not talking about a sea-level / Mount Everest communication here. The RAM chips are about a foot long at absolute maximum.

    And these sorts of effects then suddenly skitter into insignificance compared to solar radiation, different pressures, different air make-ups, heat, etc.

    The fact is, we know that this effect exists. We know that time-slowing exists (GPS wouldn't work if we didn't compensate for such things). We know that solar radiation exists. But this single statistic barely bothers to eliminate memory manufacturer, operating voltage, or ambient temperature as a cause rather than these exotic causes.

    Chances are, they might just have had a batch of dodgy RAM chips from a single manufacturer more than ANYTHING else combined.

    And, even then, you'd need thousands of test sites / machines to even hint at the cause. But, why bother? We know there would be an effect, we also know it wouldn't be this large or obvious and that - chances are - there's a much simpler explanation. The whole "top of the rack fails more often" hints at what complete and utter bullshit this is. That would be an effect we'd notice at sea-level and most likely things like ventilation and heating have orders of magnitude more to do with it.

  10. On the bright side of all this. . . . by Salgak1 · · Score: 2

    . . . .recall that the new NSA "Supercenter" in Utah is at ~4300 feet. So they'll be making a lot MORE errors when monitoring us all. . .

  11. Re:Product suggestion. by PPH · · Score: 3, Funny

    OK. Where am I going to find an RoHS lead block?

    --
    Have gnu, will travel.
  12. cosmic ray flux by volvox_voxel · · Score: 3, Informative

    Here is a plot of the cosmic ray flux (coincidence counting rate per second) vs altitude. It's also not hard to build a detector that can detect these. You can use something called coincidence detection where two scintillator plates are placed right on top of one another, and each plate is connected to a photomultiplier tube. If both photomultiplier tubes trigger, it's a cosmic ray event. If only the top one triggers it could still be a muon though.

    http://hyperphysics.phy-astr.gsu.edu/hbase/astro/cosmic.html
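
    The coincidence logic described above could be sketched in software roughly as follows. This is an illustration, not how any particular detector works: it assumes each photomultiplier yields a sorted list of pulse timestamps, and the function name, the sample timestamps, and the window width are all made up for the example.

```python
def coincidences(top_times, bottom_times, window):
    """Pair up pulses from two sorted timestamp lists that fall within `window`.

    Each bottom pulse is matched to at most one top pulse.
    """
    pairs = []
    j = 0
    for t in top_times:
        # Skip bottom pulses that are too early to match this top pulse.
        while j < len(bottom_times) and bottom_times[j] < t - window:
            j += 1
        if j < len(bottom_times) and abs(bottom_times[j] - t) <= window:
            pairs.append((t, bottom_times[j]))
            j += 1  # consume the matched bottom pulse
    return pairs


# Hypothetical pulse times in arbitrary units.
top = [1.00, 2.50, 4.00, 7.20]
bottom = [1.01, 3.30, 4.02, 9.00]
events = coincidences(top, bottom, window=0.05)
print(events)
```

    Pulses seen in only one plate (2.50 and 7.20 in the top list here) are simply not counted, which is how the scheme rejects noise in a single tube.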

  13. Oracle/Sun document that discusses this by lyapunov · · Score: 2

    There are statistics that cover the expected frequency of events caused by radiation in the first couple of pages.

    http://docs.oracle.com/cd/E19095-01/sf3800.srvr/816-5053-10/816-5053-10.pdf

    --
    Either give it away or get top dollar, but never sell yourself cheap.
  14. Re:This may be stupid... by camperdave · · Score: 2

    But could it be simply gravity?

    You mean because the 1 bits are lighter than the 0 bits? But you've got to remember about packing density. You can fit a lot more 1s than 0s because they are thinner. Vibrations in the chips will help the 1s settle to the bottom, despite being lighter.

    --
    When our name is on the back of your car, we're behind you all the way!
  15. Re:Caches, eh? by wiredlogic · · Score: 2

    For IA CPUs the L1 cache has parity and the server grade chips have ECC on the L2 cache.

    --
    I am becoming gerund, destroyer of verbs.
  16. Muons by Roger W Moore · · Score: 4, Informative

    Then wouldn't you expect a cascading rate of failures from 20% down to the baseline bottom rack in a linear fashion?

    The majority of cosmic rays that make it this far are muons. These are relatively penetrating and I highly doubt that a few centimetres of metal and plastic will have anything like a 20% effect. 60m underground with the ATLAS detector at the LHC we still get a reasonable rate of cosmic rays and we use them for calibration when there is no beam. While the rate is reduced, 60m of rock is far, far more shielding than a few computers. Plus, many cosmics passing through you come at an angle, so the stack above will have no effect on shielding these.

    I expect that heat and vibration will be the most likely causes.

  17. Re:Of course elevation affects memory by pspahn · · Score: 2

    Enough with all the mixing of terminology.

    You use 'altitude' when referring to how high something is above the ground. You use 'elevation' when referring to how high the ground is from sea level.

    What you don't see are signs for city limits on the road with 'altitude' on them. They say 'elevation' for a reason. Just like you don't find an elevatometer inside an airplane. You find an altimeter.

    Mixing these terms as you've done (and so has TFS, so I don't blame you as you were simply restating the flawed summary) only causes confusion.

    --
    Someone flopped a steamer in the gene pool.
  18. Ancient news by Salamander · · Score: 2

    About five years ago, I was involved in the installation of a thousand-node cluster in Boulder. We knew *before we went in* that we needed to change our EDAC (memory error correction) code to account for the higher rate of bit-flips due to the altitude. Some of the people we were working with had been there when those same problems nearly caused a months-long delay in a larger installation at NCAR nearby. We ended up running into a more subtle problem involving lower air density, heat and voltage, but *this* problem was incredibly old news even then.

    --
    Slashdot - News for Herds. Stuff that Splatters.
  19. known for decades by aegl · · Score: 2

    Perhaps the researchers are too young to have read this 1979 paper http://www.ncbi.nlm.nih.gov/pubmed/17820742