Slashdot Mirror


Elevation Plays a Role In Memory Error Rates

alphadogg writes "With memory, as with real estate, location matters. A group of researchers from AMD and the Department of Energy's Los Alamos National Laboratory have found that the altitude at which SRAM resides can influence how many random errors the memory produces. In a field study of two high-performance computers, the researchers found that L2 and L3 caches had more transient errors on the supercomputer located at a higher altitude, compared with the one closer to sea level. They attributed the disparity largely to lower air pressure and higher cosmic ray-induced neutron strikes. Strangely, higher elevation even led to more errors within a rack of servers, the researchers found. Their tests showed that memory modules on the top of a server rack had 20 percent more transient errors than those closer to the bottom of the rack. However, it's not clear what causes this smaller-scale effect."

190 comments

  1. Heat related? by Anonymous Coward · · Score: 5, Insightful

    Top of the rack tends to get toasty, but is this too simple?

    1. Re:Heat related? by Anonymous Coward · · Score: 1

      +1, beat me to the comment. Repeat the test with servers spread out one per rack, at different heights over multiple racks, to see if the effect persists (guess: nope). Stack turned-off (non-heat-generating) servers below an active one on top, just to control for any non-thermal effects.

    2. Re:Heat related? by Thornburg · · Score: 2

      Top of the rack tends to get toasty, but is this too simple?

      I logged in to say that.

      It seems obvious -- heat rises, I would expect top of rack components to fail more often unless the cooling design is well done.

      Completely fabricated statistic: Only 10% of datacenters have proper cooling design.

    3. Re:Heat related? by spike+hay · · Score: 3, Informative

      If it's cosmic rays causing a lot of the problem, the extra material of the racks above would make a difference.

      --
      If you don't understand any of my sayings, come to me in private and I shall take you in my German mouth.
    4. Re:Heat related? by edibobb · · Score: 1

      They took that into account.

    5. Re:Heat related? by Anonymous Coward · · Score: 0

      Could it also be that the cosmic rays have more material to go through, so they hit other things first? Like someone else pointed out, the same test at different heights, with stuff on, off, and missing, would be interesting too.

      So, given this, would it be possible to build a cosmic ray catcher to lower the rate? Basically a slice of silicon that is sacrificed to catch the rays. Maybe 2-10 layers high, with the switches staggered in them. It would not have to be 'ultra pure' perfect silicon, just the smallest node size and staggered. The point is just to redirect the energy somewhere else. You could even use it as a source of randomness. Or maybe it could be made from a better material that has some of the properties of silicon but better heat distribution?

    6. Re:Heat related? by dszd0g · · Score: 2

      Single event upsets (SEU) are caused by cosmic particles which create alpha particles, so it makes sense that equipment higher in the rack would absorb more of the alpha particles and block them from systems lower in the rack, but I am not a physicist. Alpha particles are relatively easy to block with shielding.

      http://www.statemaster.com/encyclopedia/Single_event-upset

      As the link said, this was first theorized in 1978 and supercomputer companies have been designing systems with this in mind for decades.

      --
      This message is encrypted with Quad ROT-13 to protect the author's copyright under the DMCA.
    7. Re:Heat related? by AmiMoJo · · Score: 3, Interesting

      Vibration as well. The top of the stack moves quite a bit more than the bottom of the stack, even though the overall magnitude of the movement is small.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    8. Re:Heat related? by Anonymous Coward · · Score: 0

      Wait, would or wouldn't? They said 20 percent more errors. You're saying that difference comes from cosmic strikes being stopped by a few inches to a few feet of material?

    9. Re:Heat related? by Anonymous Coward · · Score: 0

      Alpha particles, yes. But neutrons? ~10 feet of concrete over your data center works nicely. Now just need to talk to the structural engineers.

    10. Re:Heat related? by Anonymous Coward · · Score: 0

      AND lower air pressure at higher elevations is less effective at conducting heat away. Seems simple to me.

    11. Re:Heat related? by djchristensen · · Score: 1

      That assumes that the rays tend to come down vertically. I don't know what the distribution would be, but I'd be very surprised if it was mostly vertical at any particular point on earth. So then it would depend on what the rays had to travel through to get to the memory chips. I'd further assume the computers were not exposed to the sky, so I remain skeptical of the cosmic ray explanation.

      It would be easy to test though. Have a rack of servers with only the bottom one turned on. Then move that server to the top of the rack (again with the rest of the servers turned off) and compare error rates. That would eliminate heat effects (actually, it might reverse the heat effect if the server stays cooler when at the top of the rack) and allow for testing the shielding effect.

    12. Re:Heat related? by spike+hay · · Score: 4, Interesting

      Radiation blockage is mostly a function of mass the rays have to go through. The vast majority of cosmic rays are blocked by the 14 pounds per square inch/100 kilopascals of air above us. That means that a square inch of ground at sea level has 14 pounds of air above it. A square inch section of a rack above you would probably be in the pounds as well, and would block a good portion.

      --
      If you don't understand any of my sayings, come to me in private and I shall take you in my German mouth.
    13. Re:Heat related? by Anonymous Coward · · Score: 2, Informative

      Top of the rack tends to get toasty, but is this too simple?

      It is too simple.
      In a data center with downflow CRACs that push air through perforated tiles, sufficient underfloor plenum pressure is supposed to be maintained so that the upward air velocity carries cold air all the way up the front of the cabinet, affording sufficient cooling to everything. Not that it always works that way.

      But one thing to consider is dirt.
      Even with MERV 8 or better filtration, dust will still circulate in a data center cooled this way. With the filtration on the CRAC return, the lightest dust particles will float up to the return and get filtered, but the heavier particles will not make it that high. That is why a clean room has a downward airflow towards filters at the floor, unlike a data center.

      What happens is that the lowest systems in a cabinet will get the heaviest coating of dust, made up of the largest particles, with the finer dust more frequently making it into the upper systems.

      I have a good handle on dust introduced from outside air (filtration of makeup air, positive pressurization, policies against cardboard boxes, etc.), but one internal source of dust that is hard to eliminate is blower belts. Even when switching to cogged belts, black rubber dust particles will be created and get deposited on surfaces all over.

      This is only speculation, but perhaps the finer particles are more damaging than the coarser ones.

    14. Re:Heat related? by Anonymous Coward · · Score: 0

      The most intense ones, though, come vertically - less atmosphere to go through.

    15. Re:Heat related? by Anonymous Coward · · Score: 0

      Then wouldn't you expect a cascading rate of failures from 20% down to the baseline bottom rack in a linear fashion? Is that noted?

    16. Re:Heat related? by Saethan · · Score: 1

      Seems to me a Faraday cage around the server room would be easier.

    17. Re:Heat related? by DeathToBill · · Score: 4, Interesting

      I was looking into RAM error rates a week or so ago. There's not a lot of research around, but I recall seeing suggestions that error rates were significantly smaller if the chips were mounted vertically rather than horizontally - because vertically mounted chips present a lower vertical cross-section and most error-inducing cosmic rays come at near-vertical inclination.

      --
      Slashdot - News for Nerds, Stuff that Matters, in ISO-8859-1 Has just realised that beta makes this signature redundant
    18. Re:Heat related? by DeathToBill · · Score: 1

      Lead / concrete in the ceiling would seem to be an easier option.

      That said, the error rates we're talking about are not large.

      --
      Slashdot - News for Nerds, Stuff that Matters, in ISO-8859-1 Has just realised that beta makes this signature redundant
    19. Re:Heat related? by DeathToBill · · Score: 1

      And, while we're at it, ECC RAM is able to correct any single bit error in an access (whatever width the bus is) and detect any double bit error. The likelihood of more than two bits in the same word flipping is so minuscule I think it's pretty clear it's not worth it.
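
      For the curious, here is a minimal sketch of the "correct one, detect two" (SECDED) idea on a toy 4-bit word, using an extended Hamming code. The function names and the 8-bit layout are my own illustration; real ECC DIMMs use wider codes (typically 64 data bits plus 8 check bits), and this is not how any particular memory controller implements it.

```python
# Toy SECDED: Hamming(7,4) plus one overall parity bit (illustrative only).
# Layout: index 0 = overall parity, indices 1..7 = Hamming positions,
# with check bits at 1, 2, 4 and data bits at 3, 5, 6, 7.

def encode(nibble):
    """Encode a 4-bit value into an 8-bit codeword (list of 0/1)."""
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4,5,6,7
    code[0] = sum(code[1:]) % 2             # makes total parity even
    return code

def decode(received):
    """Return (nibble, status); nibble is None on an uncorrectable error."""
    code = list(received)                   # do not modify the caller's list
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of positions of set bits
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd overall parity: assume exactly one flipped bit, at index
        # `syndrome` (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even overall parity but nonzero syndrome: two bits flipped.
        return None, "double-bit error detected"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status

if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1                            # simulate a single upset
    print(decode(word))
    word[2] ^= 1                            # a second upset in the same word
    print(decode(word))
```

      Three or more flips in one word can slip through or be "corrected" into the wrong value, which is the case the comment above argues is too rare to worry about.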

      --
      Slashdot - News for Nerds, Stuff that Matters, in ISO-8859-1 Has just realised that beta makes this signature redundant
    20. Re:Heat related? by John.Banister · · Score: 1

      As the energy of the rays goes up, the higher frequency means a narrower gap between cage "bars" and the amount of current to generate the reflection increases. You might find sheet metal (copper) is cheaper than a cage at the sorts of energies of some cosmic rays.

    21. Re:Heat related? by barlevg · · Score: 3, Informative

      Back-of-the-envelope calculation using XCOM.

      Assume server rack and contents are made of aluminum (what is the predominant material in a server rack?). Let's say the server rack is 2m in height, but it's not fair to make the whole thing metal. Let's say 20% of it is metal (aluminum for this calculation), the rest is air (or, for the sake of calculation, vacuum). Aluminum has a density of 2g / cm^3 (so a 1m x 1m x 0.4 m slab of aluminum would weigh 800 kg, which appears to be in the middling range for what a server rack can accommodate--again, keep in mind, this is a really rough calculation).

      Okay, plugging in Aluminum into XCOM gives a total attenuation in the 100-1k MeV range of ~0.03 cm^2/g.

      e^[-(0.03 cm^2/g) * (2g / cm^3) * 40 cm] = 0.09

      In other words, that's 90% attenuation. Keep in mind that this was a ridiculously sloppy calculation, with my material assumptions (and possibly energetic ranges) being way off (also, neutron cross-sections could easily be different than photon cross-sections). The point is, it's certainly possible (nay, likely) that the material of the servers themselves is providing shielding for the servers on the bottom of the rack.
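
      The same back-of-the-envelope estimate as a small script, in case anyone wants to plug in other materials. This is only a sketch of exponential (Beer-Lambert) attenuation using the numbers quoted above; the function name is made up for illustration, and the caveats about photon versus neutron cross-sections still apply.

```python
import math

def transmitted_fraction(mass_atten_cm2_per_g, density_g_per_cm3, thickness_cm):
    """Fraction of the beam that survives: exp(-(mu/rho) * rho * x)."""
    return math.exp(-mass_atten_cm2_per_g * density_g_per_cm3 * thickness_cm)

# Values from the comment above: ~0.03 cm^2/g, 2 g/cm^3, 40 cm of metal.
frac = transmitted_fraction(0.03, 2.0, 40.0)

if __name__ == "__main__":
    print(f"transmitted: {frac:.3f}, attenuated: {1 - frac:.1%}")
```

      Swapping in a density of 2.7 g/cm^3 (closer to real aluminum) pushes the transmitted fraction down to roughly 0.04, so the conclusion does not hinge on the rounded density.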

    22. Re:Heat related? by Anonymous Coward · · Score: 0

      Actually, most cosmic rays that reach the Earth's surface are coming more or less straight down. The reason is that this is the path with the least amount of matter in the way. Rays coming "up" would have to make their way through the planet without impacting anything. Rays coming in from the sides would have to travel through significantly more atmosphere and would therefore be much more likely to impact a molecule before reaching the surface.

      So yes, most cosmic radiation that reaches the surface will be coming straight down.

    23. Re:Heat related? by gl4ss · · Score: 1

      just take them out of the rack to test..

      --
      world was created 5 seconds before this post as it is.
    24. Re:Heat related? by camperdave · · Score: 1

      Atmospheric pressure is not linear with altitude, so no, you wouldn't expect a linear rate of failures if this was due to the atmosphere.

      --
      When our name is on the back of your car, we're behind you all the way!
    25. Re:Heat related? by Wintermute__ · · Score: 2

      Although true, I don't imagine vibration has any effect on SRAM error rates. Hard drive failure rates, I could imagine (though that's a big stretch).

      I wonder if it has to do with the upper servers shielding the lower ones on the rack from the cosmic rays. Time for a tinfoil hat for my servers!

    26. Re:Heat related? by Richard_at_work · · Score: 1

      It doesn't necessarily assume that at all, but what it can assume is that rays coming down vertically have less atmosphere to travel through than rays at any angle, and thus have more energy when they hit the server. Same reason the midday sun has more heat than the morning or setting sun.

    27. Re:Heat related? by Roger+W+Moore · · Score: 1

      Cosmics are mainly muons - 10m of concrete will cause some to range out but even 60m underground you still get quite a few.

    28. Re:Heat related? by tibit · · Score: 1

      Stupid question: why do blowers used in a data center need belts? These days, they should all be direct-driven by brushless motors. At most you need a coupling, although blower-duty brushless motors should have bearings sufficient to support the blower, thus you need no couplings. That way only one bearing is anywhere near being exposed to air that is blown around. I've been to a clean room facility that had all ventilation systems completely direct-driven, and the facilities people loved it.

      --
      A successful API design takes a mixture of software design and pedagogy.
    29. Re:Heat related? by djchristensen · · Score: 1

      Right, cosmic rays have a hard time penetrating through too much matter, even air, so it makes sense. I've been reading articles about high energy neutrino detection and maybe confused the two just a little. I stand corrected.

    30. Re:Heat related? by Anonymous Coward · · Score: 0

      Cosmic rays are not a primary cause of single-event upset. Cosmic rays collide with atmospheric atoms, causing cascades of protons and neutrons; it is those particles that cause SEUs. Shielding, unless it is very dense and thick, is ineffective as a mitigation for SEU.

    31. Re:Heat related? by Anonymous Coward · · Score: 3, Insightful

      Also stack turned-off servers above an active one on bottom to see if it's shielding.

    32. Re:Heat related? by Anonymous Coward · · Score: 0

      what is the predominant material in a server rack?

      Steel. There are some aluminum racks in the world, but steel predominates, especially when mounting heavy, full length stuff like servers and storage. Rack mount devices (computers, switches, etc.) are usually enclosed in steel, although aluminum chassis do exist.

      Aluminum has higher cost, so steel tends to be the default.

    33. Re:Heat related? by Anonymous Coward · · Score: 0

      We're talking about shielding within each rack, and how much difference bottom vs top of rack makes.

    34. Re:Heat related? by barlevg · · Score: 1

      Interesting. What about the predominant metal (or material) of an actual server, since those are going to (probably?) provide quite a bit more shielding than the actual housing?

    35. Re:Heat related? by Anonymous Coward · · Score: 0

      Depends on where you are. Above the atmosphere, cosmic rays are by far mostly protons and alpha particles, followed by some heavier nuclei. At sea level, muons are most common, and below ground it is going to be almost all muons. Between sea level and upper atmosphere, including higher altitude populated areas, things are a real mess.

    36. Re:Heat related? by prelelat · · Score: 1

      Or grab the average temperature of the servers on the bottom of the rack and apply an equivalent heat source to the one server and see what the error rate is.

    37. Re:Heat related? by Anonymous Coward · · Score: 0

      Also stack turned-off servers above an active one on bottom to see if it's shielding.

      Don't forget the magnetic field generated by 1500+ watt 4U pizza box densely packed blades - that's got to add up at these 1 in a billion+ chances.

    38. Re:Heat related? by Anonymous Coward · · Score: 0

      "Radiation blockage is mostly a function of MASS..."
      Altitude is not a word or concept used by the poster.

    39. Re:Heat related? by fnj · · Score: 3, Informative

      Cosmic rays (they are actually particles, not electromagnetic radiation) cover a whole range of stuff, with individual particles varying extremely widely in energy content. Primary cosmic rays originate outside Earth's atmosphere. When they collide with the atmosphere, secondary cosmic rays are generated. Primary cosmic rays are mostly (99%) nuclei of various atoms. The remaining 1% are mostly free electrons (beta particles). In turn, 90% of the nuclei are free protons (hydrogen nuclei), just because most of the matter in space is hydrogen. 9% are alpha particles (helium nuclei), and 1% are the nuclei of other (heavier) elements. There is also a very small fraction of more exotic stuff, like antimatter.

      While the mean energy content of a cosmic ray particle is in the range of only about 10^-11 to 10^-10 J, extremely rare single particles with energy content up to 50 J exist. This energy is truly astounding, as it means a single submicroscopic particle has the same kinetic energy as a slowly pitched or fairly briskly thrown baseball!

      Cosmic rays are some of the most penetrating radiative phenomena known. Just compare their mean atmospheric penetrative power to that of other radiative phenomena. The following represent rough mean values of what are actually widely distributed ranges; in other words, some fraction of cosmic rays penetrate hugely in excess of the figure quoted below, just as some fraction falls far short.

      cosmic "rays" - 10,000 m (about the same for both primary and secondary)
      gamma rays - 1000 m
      x-rays - 100 m
      alpha particles - 0.1 m

      It should also be noted that significant sources of radiative phenomena are generally point sources, or at least localized sources. They are attenuated in concentration, not total amount, by distance, even in a perfect vacuum. This arises due to spreading out according to the inverse square law. For example, if you want to escape the radiation from a nuclear explosion, even in outer space, you can just move away from it. Cosmic rays are completely different in that they are diffuse. They are not "radiating" from a single point at all. They are distributed in concentration and direction everywhere. There is no attenuation due purely to distance. The attenuation of cosmic rays by the atmosphere is a result of collisions of cosmic ray particles with the atoms in the atmosphere.

      Cosmic rays, or better stated, cosmic ray products (neutrinos) have been detected in deep mineshafts after penetrating kilometers of rock. Clearly the beta particles are not penetrating very much at all, and even the nuclei have limited penetration, but some of the subnucleic particles ain't stoppin' for nobody.

    40. Re:Heat related? by Anonymous Coward · · Score: 0

      If a faraday cage could block ionizing radiation, then a nuclear reactor would not emit radiation during normal operation.

    41. Re:Heat related? by Anonymous Coward · · Score: 0

      By mass, the predominant component of a server is the steel enclosure. The electronics are pretty light except for heat sinks, which are aluminum or copper. Chips are predominantly ceramic or plastic, the actual circuits being negligibly small. The circuit boards are fiber and resin laminate layers with very thin copper layers. If you pull all the electronics, excluding the drives and power supply, out of a well populated 55 lbs 2U server the guts might weigh 3-4 lbs, I guess. Not much. You may attribute 75% or so of the mass to steel. 21 of those in a 42U rack is 1155 lbs of mostly steel.

    42. Re:Heat related? by TroubleMagnet · · Score: 1

      Except that a neutron hit throws off a bunch of alpha particles which do the actual bit flips. Think hand grenade not bullet. So if you have physically adjacent bits in the same ECC word there is a good chance you will have a multi-bit error. Usually they do try to spread the bits out within the DRAM for any one word but sometimes they fail to do this. There are some ECC codes that are able to tolerate adjacent double (or even more) bit failures so if you're willing to pay the price for the more complex algorithm and have large ECC words then you can tolerate multi-bit errors from a single device as well.

    43. Re:Heat related? by Anonymous Coward · · Score: 0

      3/32" or 1/16" sheet of steel top and bottom. Depending on the server, perhaps a third layer in the middle. Depending on the system, 3-8 layers of copper pour in the mainboard. Thickness will vary depending on server and manufacturer. Lots of other metal chunks too. Processor heat sinks will be big slugs of copper in most rack mount servers. DIMMs may or may not have aluminum heat spreaders on them.

    44. Re:Heat related? by Anonymous Coward · · Score: 0

      Top of the rack tends to get toasty, but is this too simple?

      TFA suggests exactly that as the probable cause:

      SRAM on the server on the top of the rack had 20 percent more transient errors than the SRAM on the servers on the lower levels. "This is not a huge effect, but it is a consistent one," Sridharan said.

      The difference probably could not be attributed solely to cosmic rays, Sridharan said. He briefly speculated on a number of possible causes. For example, because heat rises, the servers at the top of a rack are hotter than those on the bottom. Heat is a well-known culprit in equipment failure.

      A low-cost solution, such as installing heat shielding on server racks, may be worth investigating, Sridharan said.

    45. Re:Heat related? by WWJohnBrowningDo · · Score: 3, Funny

      BRB, going to convince my boss to tip all our servers over.

    46. Re:Heat related? by gweihir · · Score: 1

      No. Or at least not with competently designed racks, as they are cooler on the top.

      The reason is likely a lot simpler: Particles that cause this come from above. Traveling through a number of steel plates (2 per server) stops some of them and reduces the energy of others. Hence fewer reach the bottom of the rack. In addition, those that do not come straight from above have to travel through more air, hence they are fewer or have less energy. See? Simple.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    47. Re:Heat related? by __aaltlg1547 · · Score: 1

      The cascade particles each have lower energy than the particle that started the cascade. Most of them don't have enough energy to penetrate the next box.

    48. Re:Heat related? by icebike · · Score: 1

      Or grab the average temperature of the servers on the bottom of the rack and apply an equivalent heat source to the one server and see what the error rate is.

      If it were strictly temperature, it would indicate a less than effective rack cooling system.
      Are the top machines really that much hotter than the lower ones?

      (Biggest rack I've played with had thermometers at top and bottom, and there were at worst, only about 5 degrees difference.)

      --
      Sig Battery depleted. Reverting to safe mode.
    49. Re:Heat related? by manu0601 · · Score: 1

      Top of the rack tends to get toasty, but is this too simple?

      This is an explanation they suggest in TFA

    50. Re:Heat related? by icebike · · Score: 1

      Radiation blockage is mostly a function of mass the rays have to go through. The vast majority of cosmic rays are blocked by the 14 pounds per square inch/100 kilopascals of air above us. That means that a square inch of ground at sea level has 14 pounds of air above it. A square inch section of a rack above you would probably be in the pounds as well, and would block a good portion.

      So how many miles of air equivalent is the typical sheet steel of a cabinet?

      Just asking, cuz I can't seem to find it on the net.
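
      A rough way to get a number, as a sketch only: compare areal density (mass per unit area), since the quoted comment argues shielding is mostly a function of mass. The 1/16" plate thickness comes from elsewhere in this thread; the steel and sea-level air densities are standard textbook values, and the variable names are just for illustration. This ignores composition and particle type entirely.

```python
# Mass-equivalent thickness of sea-level air for one steel plate.
INCH_CM = 2.54
STEEL_DENSITY = 7.85          # g/cm^3, typical carbon steel
AIR_DENSITY = 1.225e-3        # g/cm^3, dry air at sea level, 15 C
ATMOSPHERE_COLUMN = 1033.0    # g/cm^2, i.e. ~14.7 psi of air overhead

plate_thickness_cm = (1 / 16) * INCH_CM
plate_areal_density = plate_thickness_cm * STEEL_DENSITY       # g/cm^2
air_equivalent_m = plate_areal_density / AIR_DENSITY / 100.0   # metres of air

# A full rack: 21 servers with a top and bottom plate each (assumed).
rack_areal_density = 42 * plate_areal_density
rack_fraction_of_atmosphere = rack_areal_density / ATMOSPHERE_COLUMN

if __name__ == "__main__":
    print(f"one plate: {plate_areal_density:.2f} g/cm^2, "
          f"about {air_equivalent_m:.1f} m of sea-level air")
    print(f"42 plates: {rack_areal_density:.1f} g/cm^2, "
          f"{rack_fraction_of_atmosphere:.1%} of the atmosphere's column")
```

      So by mass alone, one plate is worth on the order of ten metres of sea-level air, not miles, and a whole rack of plates adds a few percent to the atmosphere overhead.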

      --
      Sig Battery depleted. Reverting to safe mode.
    51. Re:Heat related? by gweihir · · Score: 2

      Then they do not know much about rack construction. Standard racks suck in cold air from the front (cold aisle) and blow it out the back (hot aisle). There is no difference whether the computer sits on the bottom or the top of the rack as the hot air from any of them never gets to another computer directly.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    52. Re:Heat related? by Anonymous Coward · · Score: 0

      And, while we're at it, ECC RAM is able to correct any single bit error in an access (whatever width the bus is) and detect any double bit error. The likelihood of more than two bits in the same word flipping is so minuscule I think it's pretty clear it's not worth it.

      They are talking specifically about L2 and L3 cache, not out of core RAM. Does the cache have ECC?

    53. Re:Heat related? by manu0601 · · Score: 1

      True if air flow was given enough thought at installation time. There are many SMBs with servers sitting in a rack without an appropriate airflow setup, where the upper servers are warmer.

      TFA deals with datacenters serious enough that I think this will not be the case, though.

    54. Re:Heat related? by Anonymous Coward · · Score: 0

      Still steel. The other major metals are aluminum for the HDD chassis and copper for the heatsink. Most servers are sufficiently tightly packed that copper is a better choice for the heatsink and at $5000+ per unit the cost difference is negligible.

    55. Re:Heat related? by nobodie · · Score: 1

      Dark energy effect. Prove me wrong.

      --
      Subversion of spatial scale luxury decoration ideas.
  2. This may be stupid... by Anonymous Coward · · Score: 0

    But could it be simply gravity? I know that G is negligible when talking about electrons, but a difference in height in a gravity well does affect time. Maybe it's something to do with time being at different speeds or a similar effect?

    1. Re:This may be stupid... by ledow · · Score: 3, Insightful

      On Mount Everest, time slows by 0.00261261 seconds (2.6ms) compared to sea level.

      Every foot higher you go is 90 billionths of a second difference, if you want to check the maths for me. The problem is, we're not talking about a sea-level / Mount Everest communication here. The RAM chips are about a foot long at absolute maximum.

      And these sorts of effects then suddenly skitter into insignificance compared to solar radiation, different pressures, different air make-ups, heat, etc.

      The fact is, we know that this effect exists. We know that time-slowing exists (GPS wouldn't work if we didn't compensate for such things). We know that solar radiation exists. But this single statistic barely bothers to eliminate memory manufacturer, operating voltage, or ambient temperature as a cause rather than these exotic causes.

      Chances are, they might just have had a batch of dodgy RAM chips from a single manufacturer more than ANYTHING else combined.

      And, even then, you'd need thousands of test sites / machines to even hint at the cause. But, why bother? We know there would be an effect, we also know it wouldn't be this large or obvious and that - chances are - there's a much simpler explanation. The whole "top of the rack fails more often" hints at what complete and utter bullshit this is. That would be an effect we'd notice at sea-level and most likely things like ventilation and heating have orders of magnitude more to do with it.

    2. Re:This may be stupid... by Anonymous Coward · · Score: 1

      The time dilation between two places separated by the height h is approximately gh/c^2. With g~10 m/s^2, h~2000m and c^2~10^17 m^2/s^2, you get a time dilation of ~2*10^-13. In other words, about seven microseconds per century. That certainly cannot explain the different error rate.

    3. Re:This may be stupid... by camperdave · · Score: 2

      But could it be simply gravity?

      You mean because the 1 bits are lighter than the 0 bits? But you've got to remember about packing density. You can fit a lot more 1s than 0s because they are thinner. Vibrations in the chips will help the 1s settle to the bottom, despite being lighter.

      --
      When our name is on the back of your car, we're behind you all the way!
    4. Re:This may be stupid... by terryk29 · · Score: 1

      On Mount Everest, time slows by 0.00261261 seconds (2.6ms) compared to sea level.

      Error (in cmp_phenom_numer), missing argument 'scale'

    5. Re:This may be stupid... by fnj · · Score: 1

      On Mount Everest, time slows by 0.00261261 seconds (2.6ms) compared to sea level.

      That statement is void of any sense. The units of time dilation are not units of time. More like unitless (percent). Do you mean that a clock on the top of Everest is slower by 0.00261261 seconds per second, 0.00261261 seconds per year, 0.00261261 seconds since the dawn of time? What exactly do you really mean?

      I doubt VERY much that the effect is 0.00261261 seconds per second (0.26%).

    6. Re:This may be stupid... by Anonymous Coward · · Score: 0

      perhaps he FUBARed his math.

      the difference is about 10^-12%

      http://www.quora.com/Is-there-a-place-on-earth-where-time-actually-slows-down

    7. Re:This may be stupid... by Anonymous Coward · · Score: 0

      Oops, should of course have been "milliseconds per century". But still, too small to have any measurable effect on the error rate.

      (And fuck Slashdot's posting limits that caused that correction to be that late. [yes, they would have allowed me to post it on weekend. But there I didn't have the opportunity. When I had the opportunity, I could not post.])
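
      A quick numeric check of the gh/c^2 estimate from upthread, as a rough sketch. The inputs are the rounded values used there (g ~ 10 m/s^2, h ~ 2000 m), with the exact speed of light instead of c^2 ~ 10^17; the names are mine.

```python
G_ACCEL = 10.0                 # m/s^2, rounded as in the estimate upthread
HEIGHT = 2000.0                # m, height difference between the two sites
C = 299_792_458.0              # m/s, speed of light
SECONDS_PER_CENTURY = 100 * 365.25 * 24 * 3600

def fractional_dilation(g, h, c=C):
    """Weak-field approximation: fractional clock rate difference g*h/c^2."""
    return g * h / c**2

frac = fractional_dilation(G_ACCEL, HEIGHT)
drift_ms_per_century = frac * SECONDS_PER_CENTURY * 1e3

if __name__ == "__main__":
    print(f"fractional rate difference: {frac:.2e}")
    print(f"accumulated drift: {drift_ms_per_century:.2f} ms per century")
```

      That works out to a fractional difference of about 2e-13, or a bit under a millisecond per century. Either way, far too small to matter for SRAM error rates.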

  3. Fusion IO? by shadowknot · · Score: 3, Interesting

    Someone tell Fusion.io. They're based at 5000+ feet here in the Salt Lake valley! It would be interesting if their QC procedures are what have made them more reliable as the failure rate is higher where the testing is performed.

  4. Heights? by Anonymous Coward · · Score: 1

    Do computers have a fear of heights?

  5. basements by Anonymous Coward · · Score: 5, Funny

    Another reason for nerds to stay in the basement

    1. Re:basements by Anonymous Coward · · Score: 0

      And equipment in the basement tends to gain weight somehow. It's always lighter when going downstairs but heavier when dragging it upstairs.

  6. Shielding? by Anonymous Coward · · Score: 0

    >Their tests showed that memory modules on the top of a server rack had 20 percent more transient errors than those closer to the bottom of the rack.

    Maybe it has something to do with the top side being shielded by other servers.

    1. Re:Shielding? by Gandoron · · Score: 1

      Yeah it's gotta be shielding. Put two racks side by side with a lead or some other kind of shield from "cosmic ray-induced neutron strikes" on one rack. I'm guessing that gravity probably has a negligible effect on this entire test. It's probably more about the amount of atmosphere above the racks at different altitudes. -G

  7. Of course elevation affects memory by Anonymous Coward · · Score: 0

    Oxygen deprivation. Wait, are we talking about the same thing? Altitude does weird things to the brain.

    [Posted from Everest base camp]

    1. Re:Of course elevation affects memory by pspahn · · Score: 2

      Enough with all the mixing of terminology.

      You use 'altitude' when referring to how high something is above the ground. You use 'elevation' when referring to how high the ground is from sea level.

      What you don't see are signs for city limits on the road with 'altitude' on them. They say 'elevation' for a reason. Just like you don't find an elevatometer inside an airplane. You find an altimeter.

      Mixing these terms as you've done (and so has TFS, so I don't blame you as you were simply restating the flawed summary) only causes confusion.

      --
      Someone flopped a steamer in the gene pool.
    2. Re:Of course elevation affects memory by Anonymous Coward · · Score: 0

      You're telling me. My time travel machine would have been impossible to conceive at sea level.

      [Posted from Mariana Trench]

    3. Re:Of course elevation affects memory by tricorn · · Score: 1

      An altimeter in an airplane is normally adjusted to show altitude above sea level (although above a certain height, it's set to assume a standard sea level pressure rather than what the current weather is producing). The two terms have very similar meanings, although I think elevation is more often used to refer to a fixed location.

  8. Nothing new here by Anonymous Coward · · Score: 0

    This has been known for at least 20 years.

  9. This isn't news by dszd0g · · Score: 4, Informative

    This isn't news. Companies that make supercomputers have known this for decades. The one I worked for 15 years ago used a high elevation test environment in Colorado to verify error correcting capabilities. Even the article says that the results were not a surprise.

    --
    This message is encrypted with Quad ROT-13 to protect the author's copyright under the DMCA.
    1. Re:This isn't news by edibobb · · Score: 4, Informative

      From the article: "It is well known that the altitude at which a data center resides has consequences with regards to machine fault rates. The two primary causes of increased fault rates at higher altitude are reduced cooling due to lower air pressure and increased cosmic ray-induced neutron strikes."

    2. Re:This isn't news by Mr+Z · · Score: 1

      And yet the Slashdot summary makes it sound like something new. I know at work we always quote our error rates with a location and elevation (eg. New York, sea level), and I understand that's the standard way to do it.

      This stuff comes up in deep embedded systems too. Think "ABS brake controller," etc. BTW, this is part of why Toyota got in so much trouble with its drive-by-wire system—it had no parity checking on critical control values. Granted, in an automobile, you have plenty of other sources of potential bit errors, such as extreme temperatures, power issues, exposure to strong fields, etc. But, you gotta protect against them all.

    3. Re:This isn't news by turp182 · · Score: 1

      It is interesting though, and not having found out about it when it was or would have been news makes it a good Slashdot topic (if for nothing else than making more people aware).

      --
      BlameBillCosby.com
    4. Re:This isn't news by Hatta · · Score: 1

      So we should build data centers in abandoned mines. Plenty of shielding from cosmic rays, a steady 55F ambient temperature, and all the heat exchange capacity you could want.

      --
      Give me Classic Slashdot or give me death!
    5. Re:This isn't news by pspahn · · Score: 1

      And yet the Slashdot summary makes it sound like something new.

      Seeing how GP referred to "data centers at altitude", I would say this is indeed something new. You certainly don't see a floating data center in the sky every day!

      --
      Someone flopped a steamer in the gene pool.
    6. Re:This isn't news by tibit · · Score: 1

      A transient bit error on a bus that constantly refreshes the value is of no consequence. You'll get a temporary upset, but so what. Suppose the pedal is sampled at 100Hz (a reasonable value). A single upset will at most "floor" it for 10ms. You may hear it, but it's not unsafe.
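
      The arithmetic behind the 10ms figure can be sketched as below. This is only an illustration, assuming the corrupted value is fully overwritten by the next good sample; the function name is hypothetical and not taken from any real controller.

```python
def upset_duration_ms(sample_rate_hz: float, corrupted_samples: int = 1) -> float:
    """Worst-case time a corrupted sample stays in effect.

    Assumes each new sample completely replaces the previous value,
    so a transient error lasts one sample period per corrupted sample.
    """
    period_ms = 1000.0 / sample_rate_hz
    return corrupted_samples * period_ms


# A pedal sampled at 100 Hz, with a single corrupted sample.
print(upset_duration_ms(100.0))
```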

      Toyota's drive-by-wire system's transient error behavior is not to blame for any of the problems. A "stuck" throttle effect is either a permanent and undetected failure of the pedal sensor assembly, a latch-up of the throttle plate actuator, or a software failure in one of the components in this chain (say throttle actuator computer, the ECU, whatever sits on the same bus, etc.).

      --
      A successful API design takes a mixture of software design and pedagogy.
    7. Re:This isn't news by Mr+Z · · Score: 1

      That may be true, but you do see cell towers and cellular basestations, which are similar in a lot of ways to data-centers. It's just that their data is phone calls and whatever data you're streaming over high-speed links.

      I recall seeing Sun Microsystems had a facility in Denver when I drove through there around a decade ago. You'd think they noticed.

    8. Re:This isn't news by Mr+Z · · Score: 1

      It wasn't a sensor bit error that it failed to guard against. The control values I referred to are those in RAM, used by the software. The RAM apparently wasn't parity protected, and a bit-flip in the right word could cause uncontrolled acceleration. It wasn't the only thing that could cause havoc; there were race conditions and stack overflows in the code, apparently, and those were more likely the sources of actual, observed UA.

      This lengthy article at EE Times digs into some of the details. The main quote, though, is on page 3:

      Memory corruption as little as one bit flip can cause a task to die. This can happen by hardware single-event upsets -- i.e., bit flip -- or via one of the many software bugs, such as buffer overflows and race conditions, we identified in the code.

      There are tens of millions of combinations of untested task death, any of which could happen in any possible vehicle/software state. Too many to test them all. But vehicle tests we have done in 2005 and 2008 Camrys show that even just the death of Task X by itself can cause loss of throttle control by the driver -- even as combustion continues to power the engine. In a nutshell, the fail safes Toyota did install have gaps in them and are inadequate to detect all of the ways UA can occur via software.

      I don't think that article pointed out this other detail:

      Although the investigation focused almost entirely on software, there is at least one HW factor: Toyota claimed the 2005 Camry's main CPU had error detecting and correcting (EDAC) RAM. It didn't. EDAC, or at least parity RAM, is relatively easy and low-cost insurance for safety-critical systems.

      This particular set of problems at Toyota was very interesting to us at work. I'm just now starting to work with our safety critical team that sells hardened controllers into the automotive market. They include all sorts of hardware failsafes, including ECC, lockstep execution between parallel cores, etc.
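
      For anyone wondering what "at least parity" buys you, here is a minimal software sketch of the idea. Real parity RAM and EDAC do this in hardware; the function names here are made up for illustration and have nothing to do with Toyota's code or our parts.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") & 1


def protect(word: int) -> tuple[int, int]:
    """Store a word together with its parity bit."""
    return (word, parity_bit(word))


def is_intact(stored: tuple[int, int]) -> bool:
    """True if the stored word still matches its parity bit."""
    word, parity = stored
    return parity_bit(word) == parity


value, parity = protect(0x1234)
flipped = (value ^ (1 << 5), parity)  # simulate a single-event upset
print(is_intact((value, parity)), is_intact(flipped))
```

      Parity only detects an odd number of flipped bits and cannot say which bit flipped, which is why ECC is the step up when you need correction as well as detection.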

    9. Re:This isn't news by tibit · · Score: 1

      Wait, so there was no state machine that actually drives task lifetimes, presumably by definition not ignoring any task death? Oh boy. I'd have thought that by now everyone has realized that if there's something that can have multiple states, especially something as important as a control task, there'd be a nice FSM or a HSM taking care of it. Sigh. I mean, come on, it's quite easy to do. It will work in spite of all those bugs that cause tasks to die.

      --
      A successful API design takes a mixture of software design and pedagogy.
    10. Re:This isn't news by Anonymous Coward · · Score: 0

      JEDEC standards specify memory scrubbing rates as a function of altitude. So there. Bad, naughty summary, again.

    11. Re:This isn't news by Anonymous Coward · · Score: 0

      This may surprise you, but not all of us have worked in supercomputer facilities. The experiences and knowledge of your life are not accessible to the rest of us except by methods of communication like, say, 'news stories,' for example. These are a convenient way for those of us who haven't had a particular experience to learn a little bit about that.

    12. Re:This isn't news by sootman · · Score: 1

      If the submitters actually took the time to read the articles, the quantity of stories posted here would drop substantially. NEXT PLEASE!

      --
      Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.
    13. Re:This isn't news by cthulhu11 · · Score: 1

      I worked for one 20-ish years ago, most likely not the same one as you. At one point a certain gate array in the latest generation CPU was generating errors at a much higher rate than any other component. AIUI the fix was adding a metal shield above that specific gate array, which solved the problem. This was attributed to cosmic rays.

  10. That's interesting! by MetricT · · Score: 1

    A couple of years back at one of the Supercomputing conferences (I think in Phoenix), Fermilab had a cloud chamber in their booth, and you simply *would* *not* believe the amount of ambient radiation passing you at all times. I can easily believe that altitude would have an effect.

    Another interesting idea would be to do the same experiment by latitude. Does the Arctic Region Supercomputing Center have a higher rate than the Maui Supercomputing Center? What happens during an aurora?

    1. Re:That's interesting! by Antipater · · Score: 5, Funny

      Another interesting idea would be to do the same experiment by latitude. Does the Arctic Region Supercomputing Center have a higher rate than the Maui Supercomputing Center?

      They tried to do that test a few years back, but both research teams mysteriously disappeared. The leading hypothesis is that the Arctic team was eaten by polar bears, but nobody has any idea what happened to the Maui team. The only clue left at the scene was a nearly-empty glass of pina colada.

      --
      Everything is better with chainsaws.
    2. Re:That's interesting! by Type44Q · · Score: 1

      but nobody has any idea what happened to the Maui team. The only clue left at the scene was a nearly-empty glass of pina colada

      Japanese tourists (have you seen how they get when a little alcohol's added to the mix??).

    3. Re:That's interesting! by PPH · · Score: 1

      but nobody has any idea what happened to the Maui team.

      Didn't you watch Lost? They were eaten by polar bears.

      --
      Have gnu, will travel.
    4. Re:That's interesting! by Anonymous Coward · · Score: 0

      Lies, according to Fox News it is likely a combination of terrorism and Obama that caused the disappearance!

    5. Re:That's interesting! by mjwalshe · · Score: 1

      They were taken out by shoggoths whats that knocking at the door oh apparently its a nice nice man from the laundry says i have to come with them :-)

    6. Re:That's interesting! by cusco · · Score: 1

      My wife grew up in Puno, Peru, at 3840 meters (12,600 feet) altitude. You will get sunburned so fast you won't believe it, even when you're dark complected like me. Black African tourists get sunburned. IIRC, most of the air molecules belonging to Earth are well below that altitude. If the effect on ultraviolet light shielding is that dramatic I can't help but think that other cosmic radiation is going to be stronger at that altitude as well.

      --
      "Think about how stupid the average person is. Now, realise that half of them are dumber than that." - George Carlin
  11. do I remember by fluffythedestroyer · · Score: 0

    My wife tells me I have problems with my memory because she supposedly told me some things yesterday. I tell her I have "selective" memory instead. I choose what I want to hear... :)

    1. Re:do I remember by Anonymous Coward · · Score: 0

      And does that get you off the hook, or deeper into quicksand??
      SB

    2. Re:do I remember by fluffythedestroyer · · Score: 1

      depends on her mood lol

  12. Soft errors and altitude - not news by Anonymous Coward · · Score: 0

    Pretty well known that radiation-induced soft errors increase with altitude - just ask your aviation and space industry brethren

  13. This is news? by Nkwe · · Score: 4, Funny

    If you get high you can lose your memory?

    1. Re:This is news? by VortexCortex · · Score: 1

      Explains why astronauts don't remember seeing aliens.

    2. Re:This is news? by Anonymous Coward · · Score: 0

      double whoosh

    3. Re:This is news? by Anonymous Coward · · Score: 0

      Bob marley didn't see any aliens either.

  14. single event upsets by Anonymous Coward · · Score: 0

    The space industry has been warning the commercial industry about this for decades. 10 or so years ago we started seeing upsets in aircraft and then on the ground. This makes perfect sense: the higher you are, the more upsets. Within a rack/building/etc the deeper you go the more shielding you get, so the top gets more than the bottom.

    1. Re:single event upsets by camperdave · · Score: 1

      It has more to do with the size of the circuits today compared to yesteryear. Smaller circuits are more vulnerable.

      --
      When our name is on the back of your car, we're behind you all the way!
  15. shades of Vernor Vinge by Anonymous Coward · · Score: 0

    Interesting that this is sort of the premise of Vernor Vinge's scifi novel A Fire Upon the Deep. Namely, that the further from the busy galactic core computers are, the more error-free and powerful they are, and thus the ability of civilizations to progress is limited by their location in the galaxy.

  16. They compared two completely different systems by JoeyRox · · Score: 1

    According to the article the low elevation system was a Jaguar supercomputer whereas the high elevation one was a Cielo supercomputer. Based on available specs for each, the two are entirely different systems. How can they reach conclusions about altitude-relative bit error rates when they're not even comparing the same system? The article goes on to state:

    "The group had found that, when all other possible confounding issues were factored out, Cielo's SRAM had a "significantly higher rate of SRAM faults," compared with Jaguar's SRAM, Sridharan said."

    Huh? They factored out all confounding issues except that they were completely different systems.

    1. Re:They compared two completely different systems by Anonymous Coward · · Score: 0

      Being different systems would be one of the 'other possible confounding issues' that they factored out.

      Seriously. You quoted the bloody line. Read it. Understand it. *THEN* comment on it.

    2. Re:They compared two completely different systems by JoeyRox · · Score: 1

      And how exactly do they normalize the effects of two completely different systems when they don't have a clear understanding of what is causing the higher bit-error rates in the first place? You can't simplify an equation with two unknown factors when the interrelation between those two factors is unknown.

    3. Re:They compared two completely different systems by Anonymous Coward · · Score: 0

      Seriously. You quoted the bloody line. Read it. Understand it. *THEN* comment on it.

      I think you're missing what's difficult to understand about that, namely, how does one "factor out" the effect of them being completely different systems, and should anyone care about such results when it's so trivial to remove this particular issue from the experiment simply by comparing identical systems, thereby removing an obvious potential source for error in the results which may not be correctly "factored out" otherwise?

    4. Re:They compared two completely different systems by Anonymous Coward · · Score: 0

      not to mention this study contained a sample size of exactly TWO computers. how useful. it could have been manufacturing, assembly, any number of issues. come back with a study of thousands of units of exactly the same SRAM, running the same loads, temps, etc.

  17. Water towers by mdsolar · · Score: 2

    It seems to me that an unexploited structure for a low radiation environment is the bottom side of a water tower. Steel has most radionuclides slagged off when it is produced while drinking water standards ensure the water in the tower will have low radioactivity. A meter or two of water forms a nice shield for cosmic rays from above while the air below the tower shields against lower energy ground radiation. And, you get a nice heat sink in the water for cooling electronics.

    1. Re:Water towers by Anonymous Coward · · Score: 0

      One of my favorite things in life is that someone can say something like that, and I honestly don't know: Are you being satirical? Or is it really a good idea? Because damn.. I sincerely think that's a good idea.

    2. Re:Water towers by mdsolar · · Score: 1

      I came up with that about five years ago. Thanks. http://mdsolar.blogspot.com/2008/03/lux-lucis-tepida.html

    3. Re:Water towers by Anonymous Coward · · Score: 0

      Steel isn't particularly non-radioactive, unless you use steel made before the end of WW2. That is kind of rare and reserved for things that need it instead of water towers. Although at least the main contaminants aren't alpha or neutron emitters.

    4. Re:Water towers by mdsolar · · Score: 1

      Ah yes, I was thinking about natural radiation. Well, you'd want to test the location in any case. So, some towers may be prewar or some may have been built from ship scrap. Find a few of those and there might be something to try.

    5. Re:Water towers by Anonymous Coward · · Score: 0

      The question is not whether you can mitigate SEU by doing X, but whether the decrease in SEU is worth the cost of doing X, versus other mitigation strategies, such as the most common, using RAS hardware, and repeating the work unit when an uncorrectable error occurs.

      I would guess that embedding a data center inside a water tower would not be economical, simply putting it underneath it would not provide considerable shielding as not all high energy particles are vertically incident. Also most supercomputers are quite a bit larger than your typical elevated water tank, and putting it underneath a ground-level water tank would be more expensive than superior shielding such as putting the datacenter in an abandoned cobalt or iron mine (the typical environment for low event rate physics experiments).

    6. Re:Water towers by mdsolar · · Score: 1

      Some experiments with off-the-shelf hardware and redundancy have shown promise for space applications so perhaps that is the direction to go in any case. I do like the idea of limiting cosmic ray induced defects in solar panels though.

  18. Sandia National Labs contributed to this work also by Anonymous Coward · · Score: 1

    As one of the authors from the LANL-side I want to be clear that Sandia National Laboratories played a vital and at least equal role in this work - paper, analysis, as well as procurement and running of the Cielo supercomputer studied. The partnership with AMD, SNL, and LANL has been outstanding.

  19. Caches, eh? by fa2k · · Score: 1

    As this is not (mainly) about the system RAM, it's about the CPU caches, I wonder if any attempt is being made to correct the errors, and if it's worthwhile. One would just need to reset the node on any sign of an error, so the capacity penalty would be small compared to ECC. On the other hand, the errors could just as well happen in the actual logical units, and at some point it's impossible or very expensive to protect everything. Because the SRAM takes up a large fraction of the CPU area, it may be useful to add something to protect the caches.

    For some workloads you can do consistency checks in software, but for many computations that would require you to run the computations twice -- which is very expensive. Maybe statistical methods can be used to include a term for gross numerical errors -- different from the small floating point errors. It would probably be close to impossible to model the effect of such errors on the results though. Another option is to shield the datacentres from cosmic rays, if those are indeed the culprit.
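
    To make the "run the computations twice" idea concrete, here is a rough sketch. It assumes the computation is deterministic, so any mismatch between two runs points to a transient fault; `run_checked` and `max_attempts` are names I made up for illustration.

```python
def run_checked(fn, *args, max_attempts: int = 3):
    """Run fn twice per attempt and accept the result only if both runs agree.

    Assumes fn is deterministic, so a mismatch indicates a transient error.
    Roughly doubles the compute cost, which is the expense mentioned above.
    """
    for _ in range(max_attempts):
        first = fn(*args)
        second = fn(*args)
        if first == second:
            return first
    raise RuntimeError("results never agreed; giving up")


print(run_checked(sum, [1, 2, 3]))
```

    Note this catches only errors that differ between the two runs; a fault that corrupts the input data itself would produce the same wrong answer twice.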

    1. Re:Caches, eh? by wiredlogic · · Score: 2

      For IA CPUs the L1 cache has parity and the server grade chips have ECC on the L2 cache.

      --
      I am becoming gerund, destroyer of verbs.
    2. Re:Caches, eh? by Anonymous Coward · · Score: 0

      A quite cool (actually hot) chip presentation (see page 15): http://www.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-9-Big-Iron/HC24.29.918-SPARC64X-Maruyama-Fujitsu-rev2.5.2.pdf and the presentation: http://www.youtube.com/watch?feature=player_embedded&v=ipirVUart88#t=1072
      Itanium has/had retry as well. I don't know if the E7 Xeons have it already. As the number of cores increases faster than memory systems can serve them, it might be increasingly practical to execute the same code in different parts of the chip, even in non-mainframe applications.

  20. Duh by Anonymous Coward · · Score: 0

    When you get high, memory suffers.

    1. Re:Duh by Anonymous Coward · · Score: 1

      Proves all my theories about tall people.

  21. Hmmm .... by gstoddart · · Score: 3, Funny

    Is this why when I'm in an airplane I can never remember if I turned all the lights out? ;-)

    --
    Lost at C:>. Found at C.
    1. Re:Hmmm .... by boristdog · · Score: 1

      Actually, my company makes flash memory and we ship most of it overseas for final packaging. We have to allow for a certain amount of die loss from cosmic rays striking the wafers while they are in flight.

  22. On the bright side of all this. . . . by Salgak1 · · Score: 2

    . . . .recall that the new NSA "Supercenter" in Utah is at ~4300 feet. So they'll be making a lot MORE errors when monitoring us all. . .

  23. Product suggestion. by SuricouRaven · · Score: 1

    The 1U lead block. Place at top of rack to protect the servers below.

    Does it work? Who cares. If people will pay £150 for a wooden volume knob on their audio system, someone is going to pay whatever you ask for a lump-o-lead that may or may not improve the reliability of equipment below.

    1. Re:Product suggestion. by boristdog · · Score: 1

      Easier: Just put the rackmount UPS at the top of your rack. It's mostly lead.

    2. Re:Product suggestion. by PPH · · Score: 3, Funny

      OK. Where am I going to find an RoHS lead block?

      --
      Have gnu, will travel.
    3. Re:Product suggestion. by SuricouRaven · · Score: 1

      Actually, lead alone is a very bad material to stop high-energy neutrons. What you really want is a 1U block of boron, with a 1U block of lead or UPS below it to catch the resulting gamma.

    4. Re:Product suggestion. by boristdog · · Score: 1

      And Boron is cheap, ~$400/ton.

      Now I want a ton of boron delivered to me for some reason.

    5. Re:Product suggestion. by Anonymous Coward · · Score: 0

      Boron is mainly best at catching low energy neutrons. You need something to slow down the neutrons, and lower Z atoms work better as more energy gets transferred to them in a collision with a neutron. For example, something with a lot of hydrogen, like water or some plastics. So a layer of plastic, then boron or boronized plastic, then lead.

  24. Waste of money by jtara · · Score: 1

    Why did anyone need to do this field survey? It simply confirms what we already know - cosmic rays create SRAM errors. Hot components fail more than cold components. Big whoop.

    1. Re:Waste of money by Daniel+Hoffmann · · Score: 1

      Well it is part of the scientific process to verify studies to see if their conclusions are indeed valid.

    2. Re:Waste of money by gstoddart · · Score: 1

      Why did anyone need to do this field survey?

      Well, there's two possible responses to this.

      1) We're slashdot and we think we know everything, they should have just asked us, how dare they

      2) Maybe we might trust that a "A group of researchers from AMD and the Department of Energy's Los Alamos National Laboratory" aren't idiots and wanted specific empirical evidence on the topic?

      --
      Lost at C:>. Found at C.
    3. Re:Waste of money by Anonymous Coward · · Score: 0

      Frequently the point of many studies is not just to provide a binary, yes/no confirmation of something that is already known, but to actually quantify the strength of said effect in a specific case. It is one thing to say that error rates increase with altitude, it is another to specify how much they increase in currently used equipment.

    4. Re:Waste of money by Tailhook · · Score: 1

      what we already know

      Conditions change. Every 18-24 months a new node appears — 22nm is the scale of contemporary shipping devices. As features shrink their behavior changes and new data is needed. There are applications that need to know error rates to compute how much mitigation is required.

      We're not all just making web pages out here.

      --
      Maw! Fire up the karma burner!
    5. Re:Waste of money by Anonymous Coward · · Score: 0

      There's also a world of difference between simply knowing "we get more SEUs at altitude", and having a nice fitted curve that predicts accurately the transfer function of altitude to SEU rates.

  25. so ends aluminum by Anonymous Coward · · Score: 0

    Damn, I jumped onto the aluminum case fad bandwagon around 2000, and never looked back. Just bought another one, a couple months ago. Now you're telling me the next big thing that all the cool kids will have, is a lead case?

  26. A brilliant idea by Anonymous Coward · · Score: 0

    Why not put a swimming pool on top of the building? A nice perk for the workers AND radiation protection for the chips.

  27. Lead maybe not the solution by Anonymous Coward · · Score: 0

    According to the Jargon File, IBM tested lead as a method for shielding chips from cosmic rays, and found it to be ineffective.

    I find it interesting that IBM's result conflicts with this DoE conclusion; however, I think it's consistent with lead being a . Of course, you said:

    Does it work? Who cares. If people will pay £150 for a wooden volume knob on their audio system, someone is going to pay whatever you ask for a lump-o-lead that may or may not improve the reliability of equipment below.

    so I guess, in the spirit of P.T. Barnum, carry on.

  28. cosmic ray flux by volvox_voxel · · Score: 3, Informative

    Here is a plot of the cosmic ray flux (coincidence counting rate per second) vs altitude. It's also not hard to build a detector that can detect these. You can use something called coincidence detection where two scintillator plates are placed right on top of one another, and each plate is connected to a photomultiplier tube. If both photomultiplier tubes trigger, it's a cosmic ray event. If only the top one triggers it could still be a muon though.

    http://hyperphysics.phy-astr.gsu.edu/hbase/astro/cosmic.html
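
    On the software side, coincidence counting comes down to matching trigger timestamps from the two tubes. A minimal sketch, assuming you already have sorted integer timestamps (say, in nanoseconds) from each photomultiplier; the function name and the window value are illustrative, not from any particular DAQ system.

```python
def count_coincidences(top: list[int], bottom: list[int], window: int) -> int:
    """Count top-plate triggers that have a bottom-plate trigger within `window`.

    Both lists must be sorted ascending and use the same time unit.
    """
    count = 0
    j = 0
    for t in top:
        # Skip bottom triggers that are too early to match this or any later top trigger.
        while j < len(bottom) and bottom[j] < t - window:
            j += 1
        if j < len(bottom) and abs(bottom[j] - t) <= window:
            count += 1
    return count


top_hits = [1000, 2000, 5000]
bottom_hits = [1050, 3000, 5200]
print(count_coincidences(top_hits, bottom_hits, window=100))
```

    A tighter window rejects more accidental coincidences from unrelated noise in the two tubes, at the cost of missing real events if the timing jitter is larger than the window.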

  29. Proportional to the exposed surface. by jcdr · · Score: 1

    Earth acts as a shield that protects memory from radiation coming from the other side of the planet. In addition, the probability that a particle collides is proportional to the distance it travels through the atmosphere, so on the ground you are more likely to be hit by particles arriving from the vertical. On a desktop computer the RAM is usually oriented vertically, with its shorter side facing up, so the area exposed to radiation from above is very small. Note that because of the motherboard orientation, this is also true for a lot of the components mounted on it. On a server, the RAM might be vertical but expose its largest side to the top. RAM mounted at an angle is even worse. Note that on a server the motherboard is oriented horizontally, exposing the largest area of most components to the top. So it's not a big surprise that the caches inside the CPU are hit more often on a server than in a desktop computer.

    Write code that reserves several GB of non-ECC memory, sets every bit, and in an infinite loop checks that all bits are still set. Now try bringing various types of arc lamps close to the memory and count the number of hours (or minutes if you have a strong source) before a bit is detected to be zero. Then repeat the experiment with different RAM orientations.
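
    A rough sketch of such a checker is below. The function names, buffer size and polling interval are all just my assumptions. Keep in mind that the OS and the CPU caches sit between this code and the physical DRAM, so treat it as an illustration of the loop rather than a calibrated measurement tool.

```python
import time


def fill_ones(size_bytes: int) -> bytearray:
    """Allocate a buffer with every bit set to 1."""
    return bytearray(b"\xff") * size_bytes


def find_cleared_bits(buf: bytearray) -> list[tuple[int, int]]:
    """Return (byte_offset, bit_index) for every bit that is no longer 1."""
    if buf.count(0xFF) == len(buf):
        return []  # fast path: nothing changed
    cleared = []
    for offset, byte in enumerate(buf):
        if byte != 0xFF:
            for bit in range(8):
                if not (byte >> bit) & 1:
                    cleared.append((offset, bit))
    return cleared


def watch(buf: bytearray, interval_s: float, max_checks=None):
    """Poll the buffer until a cleared bit appears or max_checks is reached.

    Returns (elapsed_seconds, cleared_bits); cleared_bits is empty on timeout.
    """
    start = time.monotonic()
    checks = 0
    while max_checks is None or checks < max_checks:
        cleared = find_cleared_bits(buf)
        if cleared:
            return time.monotonic() - start, cleared
        checks += 1
        time.sleep(interval_s)
    return time.monotonic() - start, []
```

    Something like `watch(fill_ones(2 * 1024**3), interval_s=60)` would then run until the first cleared bit shows up and report how long it took.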

  30. Oracle/Sun document that discusses this by lyapunov · · Score: 2

    There are statistics that cover the expected frequency of events caused by radiation in the first couple of pages.

    http://docs.oracle.com/cd/E19095-01/sf3800.srvr/816-5053-10/816-5053-10.pdf

    --

    Either give it away or get top dollar, but never sell yourself cheap.
  31. I also by OglinTatas · · Score: 1

    I'm also more prone to errors when I'm high

  32. Top of Rack Shields bottom of Rack from Radiation by Anonymous Coward · · Score: 0

    Top of Rack Shields bottom of Rack from Cosmic Radiation - is it not this simple?

  33. well, do'h... by Anonymous Coward · · Score: 0

    Doh,... thinner air will not absorb and remove as much heat per cubic meter of air compared to sea-level air. This plus radiation at higher altitudes...

  34. Air pressure by michaelmalak · · Score: 1

    From deep within the PDF (second link):

    The two primary causes of increased fault rates at higher altitude are reduced cooling due to lower air pressure and increased cosmic ray-induced neutron strikes.

    (Living in Colorado, I thought perhaps chips suffered from the same spurting newly opened toothpaste tube problem when manufactured at low altitude and installed into operation at high altitude, but it turned out the hypothesis was different, and, of course, left out of the Slashdot summary.)

  35. What materials block cosmic rays? by JDG1980 · · Score: 1

    What kind of materials (if any) are effective in blocking cosmic rays? Would it be possible to integrate cosmic radiation shielding into an average-sized PC case? If that's impractical, are there building materials that can be used in roofs and/or walls to block this stuff without breaking the bank?

    1. Re:What materials block cosmic rays? by Anonymous Coward · · Score: 0

      What kind of materials (if any) are effective in blocking cosmic rays? Would it be possible to integrate cosmic radiation shielding into an average-sized PC case? If that's impractical, are there building materials that can be used in roofs and/or walls to block this stuff without breaking the bank?

      About a half mile of solid lead per 0.1% reduction or thereabouts. Good luck with that.

    2. Re:What materials block cosmic rays? by Anonymous Coward · · Score: 0

      Considering cosmic muon penetration has been used to image Mayan pyramids and the Fukushima reactor containments, you just need to deal with it. There is no economical way to shield a house or room from cosmic radiation. You would want to construct a lead or other high-Z shield around your computer case. Why you think this would improve reliability over other uses of the money such as improved cooling is another question.

    3. Re:What materials block cosmic rays? by cusco · · Score: 1

      Depends on the type of cosmic rays you want to shield. For muons and the like, good luck. Huge amounts of mass are necessary, but then muons probably wouldn't interact much with your computer RAM anyway. The practical worry is alpha particles, which will flip a bit once in a while.

      --
      "Think about how stupid the average person is. Now, realise that half of them are dumber than that." - George Carlin
  36. Memory Failure Due to High Elevation by Anonymous Coward · · Score: 0

    Reminds me of the BOFH's excuse of the day...

    Today's excuse, btw, is:
    http://pages.cs.wisc.edu/~ballard/bofh/bofhserver.pl

  37. Muons by Roger+W+Moore · · Score: 4, Informative

    Then wouldn't you expect a cascading rate of failures from 20% down to the baseline bottom rack in a linear fashion?

    The majority of cosmic rays that make it this far are muons. These are relatively penetrating and I highly doubt that a few centimetres of metal and plastic will have anything like a 20% effect. 60m underground with the ATLAS detector at the LHC we still get a reasonable rate of cosmic rays, and we use them for calibration when there is no beam. While the rate is reduced, 60m of rock is far, far more shielding than a few computers. Plus, many cosmics passing through you come in at an angle, so the stack above will have no effect on shielding these.

    I expect that heat and vibration will be the most likely causes.

    1. Re:Muons by Anonymous Coward · · Score: 0

      Although the bit flips tend to be associated with impacts of heavier, easier to stop particles that transfer all of their energy in a small volume when stopping. The muons are unlikely to cause the bit flips, which instead are mostly associated with alpha particles, and protons to some degree, from a secondary shower. Those are much easier to attenuate in short distances.

    2. Re:Muons by Roger+W+Moore · · Score: 1

      Alpha particles are not cosmic rays this far into the atmosphere - they are too easy to stop and are not produced by showers. You can get heavy ionization from muons just before they range out. Also there is a small chance of catastrophic muon bremsstrahlung, where the muon essentially hits a nucleus, which can cause short range, highly ionizing nuclear debris. You may get some alphas from this but a far more likely source is natural radioactivity.

    3. Re:Muons by Anonymous Coward · · Score: 0

      Muons don't have that high of a cross-section to impact memory (hence why they are also not shielded well) and additionally are not very good at transferring energy to a memory cell. A serious impact will eject a nucleus out of the memory cell, without changing the state. The vast majority of cosmic ray related impacts are due to absorption of neutrons, particularly into boron in some semiconductors, which then decays by alpha emission part of the time. If you look at some of the literature, you'll see that protons and pions in secondary showers cause far more changes than muons, and those are dwarfed by the effects of neutrons.

  38. Not surprising by Anonymous Coward · · Score: 0

    My father was an astro-physicist studying mesons and other high-energy particles, and unless you were looking for neutrinos, altitude was important. His main "meson telescope" was on top of a 14,000 foot + mountain (Mount Evans) in Colorado. Anyway, I think if you want to mitigate RAM errors in server farms, the simplest thing is to place a thin sheet of lead on top of the rack... or over the roof of the building. :-)

  39. ...and metres thick by Roger+W+Moore · · Score: 1

    You would need many metres to have a noticeable effect on the penetrating muons which make up the majority of cosmics at the surface. This should tell you that a few computer boxes are not really likely to have much of a shielding effect. This is reinforced by the fact that many cosmics come at shallow angles so the stack above provides no shielding. I doubt this is a cosmic ray effect.

    1. Re:...and metres thick by Anonymous Coward · · Score: 0

      Depends on what energy cosmic rays you are talking about. Lower energy muons show a cos^2 distribution over the angle from vertical, so most of them do come roughly downward. Really high energy muons won't be likely to cause bit flip errors if they interact with only a single atom in the computer with a glancing blow. You need either a really solid hit causing a bunch of recoil or something coming to a stop. Secondary particles factor a lot into this, and their distribution is far from isotropic too.
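To put a number on the cos^2 claim, here is a minimal sketch. It assumes the intensity per unit solid angle is exactly I0 * cos^2(theta) over the upper hemisphere, which is only an approximation for low-energy muons, and the function name `fraction_within` is made up for illustration. Integrating cos^2(theta) * sin(theta) from 0 to theta_max and dividing by the full-hemisphere integral gives 1 - cos^3(theta_max).

```python
import math


def fraction_within(theta_max_deg: float) -> float:
    """Fraction of a cos^2(theta) flux arriving within theta_max of vertical.

    Assumption: intensity per unit solid angle is I0 * cos^2(theta) over the
    upper hemisphere. The integral of cos^2(t) * sin(t) dt from 0 to theta_max
    is (1 - cos^3(theta_max)) / 3, and the full hemisphere gives 1/3, so the
    ratio is 1 - cos^3(theta_max).
    """
    if not 0.0 <= theta_max_deg <= 90.0:
        raise ValueError("theta_max_deg must be between 0 and 90 degrees")
    c = math.cos(math.radians(theta_max_deg))
    return 1.0 - c ** 3


if __name__ == "__main__":
    for angle in (15, 30, 45, 60, 90):
        print(f"within {angle:2d} deg of vertical: {fraction_within(angle):.1%}")
```

Under that assumption roughly two thirds of the flux arrives within 45 degrees of vertical, so "roughly downward" holds, but a meaningful share still comes in at steeper angles where the stack above is not in the path.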

    2. Re:...and metres thick by Roger+W+Moore · · Score: 1

      You can still get delta rays and catastrophic muon bremsstrahlung which, while rare, is very likely to flip bits.

  40. not that big of a mystery by slashmydots · · Score: 1

    "However, it's not clear what causes this smaller-scale effect."
    Servers are made out of metal and have EM fields! This isn't hard.

  41. buying land for server farms in death valley by Polo · · Score: 1

    who's with me?

    1. Re:buying land for server farms in death valley by Anonymous Coward · · Score: 0

      Cooling?

  42. Common wisdom by WaffleMonster · · Score: 1

    In aggregate, the entire atmosphere down to sea level works out to something like the equivalent of 30ft of water of shielding. A 20% reduction through an entire rack of servers sounds to be in about the right ballpark.

    People have been running the same experiments on international flights on laptops for years.
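The "30ft of water" figure can be sanity-checked with a back-of-the-envelope sketch: surface pressure divided by g gives the mass of air above each square metre, and dividing that by the density of water gives the equivalent water depth. The function names below are hypothetical, and the altitude helper uses a simple isothermal atmosphere with an assumed scale height of about 8400 m, so treat the high-altitude numbers as rough.

```python
import math

STANDARD_PRESSURE_PA = 101_325.0   # standard sea-level pressure
STANDARD_GRAVITY = 9.80665         # m/s^2
WATER_DENSITY = 1000.0             # kg/m^3
METRES_PER_FOOT = 0.3048


def water_equivalent_m(pressure_pa: float) -> float:
    """Depth of water (in metres) with the same mass per area as the air column."""
    column_mass_kg_per_m2 = pressure_pa / STANDARD_GRAVITY
    return column_mass_kg_per_m2 / WATER_DENSITY


def pressure_at_altitude_pa(altitude_m: float, scale_height_m: float = 8400.0) -> float:
    """Rough pressure at altitude. Assumption: isothermal exponential atmosphere."""
    return STANDARD_PRESSURE_PA * math.exp(-altitude_m / scale_height_m)


if __name__ == "__main__":
    sea_level = water_equivalent_m(STANDARD_PRESSURE_PA)
    print(f"sea level: {sea_level:.2f} m of water ({sea_level / METRES_PER_FOOT:.1f} ft)")

    # 7800 ft is the elevation mentioned elsewhere in this thread.
    altitude_m = 7800 * METRES_PER_FOOT
    high = water_equivalent_m(pressure_at_altitude_pa(altitude_m))
    print(f"7800 ft:   {high:.2f} m of water ({high / METRES_PER_FOOT:.1f} ft)")
```

This only compares mass per unit area. How much a given layer actually attenuates depends on the particle type and energy, so it says nothing by itself about whether a rack's worth of servers could account for 20%.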

  43. They knew this over a decade ago... by Anonymous Coward · · Score: 0

    The DEC Alpha cluster (ASCI Q) had problems with memory errors - specifically with CPU cache errors - that they eventually found to be caused by increased levels of radiation from being at the higher (7800ft) elevation.

  44. Uhm. Duh?!?! by Anonymous Coward · · Score: 0

    Living in Colorado, this just seems like long known general knowledge to me. Our bodies get a higher radiation dose at this altitude, and so does our equipment. HP was notorious for responding "Cosmic Rays!!!!11!!1" when we'd place service calls for single bit errors on PA-RISC CPUs. They wouldn't replace them until a repeating history was established for a particular CPU.

  45. Re:Top of Rack Shields bottom of Rack from Radiati by Cyrano+de+Maniac · · Score: 1

    I have no idea if that's enough shielding to matter. However, if true, would we also see higher error rates in daytime when the body of the Earth isn't standing between the server and the Sun?

    --
    Cyrano de Maniac
  46. Re:Blower Belts by John.Banister · · Score: 1

    Just curious. I've seen direct driven blowers in a number of various applications. Is there some special need to use belt driven blowers for the air in data centers?

  47. how high is your cloud? by museumpeace · · Score: 1

    How long before the cloud computing and storage services start charging a slight premium to have your stuff run/store on lower spots in their server racks?

    --
    SLASHDOT: news for people who can't concentrate on work or have no life at all and got tired of yelling back at the TV.
  48. It only makes sense... by sartin · · Score: 1

    I know I have trouble remembering when I'm high. Seems like electronics should have the same problem.

  49. "Pressure"? by khallow · · Score: 1

    While I get why more ionizing events stemming from neutron impacts affect electronics, I don't get the blaming of "pressure". Perhaps they mean the reduced air cooling of electrical components from a less dense atmosphere? Someone else noted that components on the top of a rack might tend to be warmer. This might be more of the same sort of effect.

    1. Re:"Pressure"? by khallow · · Score: 1

      Never mind. I wasn't thinking it through. Pressure also is an indication of how many potentially heat absorbing particles are impacting a heat sink surface.

  50. FPGA's have had to deal with this effect for years by Anonymous Coward · · Score: 0

    Field Programmable Gate Arrays (FPGAs) have the largest and densest SRAMs of any modern device. Google "FPGA SEU" and you'll see dozens of articles describing the issue and the mitigations the various vendors make. The researchers in question were either grossly ignorant or playing games.

    Spacecraft and avionics designers have even more interesting issues . . .

  51. It's those gamma ray bursts. by Kazoo+the+Clown · · Score: 1

    Isn't it obvious?

  52. Ancient news by Salamander · · Score: 2

    About five years ago, I was involved in the installation of a thousand-node cluster in Boulder. We knew *before we went in* that we needed to change our EDAC (memory error correction) code to account for the higher rate of bit-flips due to the altitude. Some of the people we were working with had been there when those same problems nearly caused a months-long delay in a larger installation at NCAR nearby. We ended up running into a more subtle problem involving lower air density, heat and voltage, but *this* problem was incredibly old news even then.

    --
    Slashdot - News for Herds. Stuff that Splatters.
  53. Cosmic rays? What about latitude then? by Floyd-ATC · · Score: 1

    If cosmic rays were the cause (presumably from the sun since that's the closest source) then I would assume that latitude would be an important factor as well. Less sun means less cosmic radiation. My money is on the simplest explanation: Heat.

    --
    Time flies when you don't know what you're doing
  54. well... by buddyglass · · Score: 1

    Just like humans, computers have trouble remembering things when high.

  55. "L2 and L3 caches had more transient errors" by grep+-v+'.*'+* · · Score: 1

    Honest, Mr NSA Sir, I was just searching for happy kitty cats and not "How to load a nuclear bomb in a suitcase" -- it must have been those nasty cosmic rays changing up my searches!

    (I WONDERED why I kept finding lead suitcases with "Hello, Kitty" emblazoned on them.)

    --
    If the universe is someone's simulation -- does that mean the stars are just stuck pixels?
  56. Re:Top of Rack Shields bottom of Rack from Radiati by cusco · · Score: 1

    Almost no cosmic rays come from our local sun, mostly just slower solar wind particles.

    --
    "Think about how stupid the average person is. Now, realise that half of them are dumber than that." - George Carlin
  57. known for decades by aegl · · Score: 2

    Perhaps the researchers are too young to have read this 1979 paper http://www.ncbi.nlm.nih.gov/pubmed/17820742

  58. my memory by Anonymous Coward · · Score: 0

    My memory has gotten worse as I've gotten taller. I blame the cosmic rays.

  59. Um... by Anonymous Coward · · Score: 0

    It's probably severely shielded to prevent TEMPEST attacks. Far more so than a typical corporate datacenter.

  60. Clearly altitude by Anonymous Coward · · Score: 0

    have a high-altitude (60+ thousand feet) experiment. Bit error rates are a few orders of magnitude higher up there.

  61. Move all 'Important Servers' to New Orleans by Anonymous Coward · · Score: 0

    In New Orleans you could build a 'ground level' datacenter and be 'below sea level'. ... But if a dike fails, the salt water might be a little 'rough' on electronic components.

    In reality, some of the old 'salt mines' might be a good place for 'high reliability' data centers.

  62. The air at the top of the rack by Anonymous Coward · · Score: 0

    has less humidity.

    The air at higher elevations tends to have significantly less humidity.

    I'd like to see some distinction made between the relatively few evaporative-cooled servers.

  63. Without memory by terrywirth5 · · Score: 1

    I make new friends every day.

  64. We've Known About This for Some Time by forbin_meet_hal · · Score: 1

    Check out this Actel whitepaper (PDF). Describes a similar phenomenon, with such errors taking place three times more often in mile-high Denver than Baghdad by the Bay San Francisco.