Same Programs + Different Computers = Different Weather Forecasts


Posted by timothy on Sunday July 28, 2013 @01:26AM from the climate-change-without-leaving-the-room dept.

knorthern knight writes "Most major weather services (US NWS, Britain's Met Office, etc) have their own supercomputers, and their own weather models. But there are some models which are used globally. A new paper has been published, comparing outputs from one such program on different machines around the world. Apparently, the same code, running on different machines, can produce different outputs due to accumulation of differing round-off errors. The handling of floating-point numbers in computing is a field in its own right. The paper apparently deals with 10-day weather forecasts. Weather forecasts are generally done in steps of 1 hour. I.e. the output from hour 1 is used as the starting condition for the hour 2 forecast. The output from hour 2 is used as the starting condition for hour 3, etc. The paper is paywalled, but the abstract says: 'The global model program (GMP) of the Global/Regional Integrated Model system (GRIMs) is tested on 10 different computer systems having different central processing unit (CPU) architectures or compilers. There exist differences in the results for different compilers, parallel libraries, and optimization levels, primarily due to the treatment of rounding errors by the different software systems. The system dependency, which is the standard deviation of the 500-hPa geopotential height averaged over the globe, increases with time. However, its fractional tendency, which is the change of the standard deviation relative to the value itself, remains nearly zero with time. In a seasonal prediction framework, the ensemble spread due to the differences in software system is comparable to the ensemble spread due to the differences in initial conditions that is used for the traditional ensemble forecasting.'"

28 of 240 comments

It is the butterfly effect. by 140Mandak262Jamuna · 2013-07-28 01:34 · Score: 4, Interesting

Almost all CFD (computational fluid dynamics) simulations use time marching of the Navier-Stokes equations. Despite being very nonlinear and very hard, one great thing about them is that they parallelize very well. They partition the solution domain into many subdomains and distribute the finite volume mesh associated with each subdomain to a different node. Each mesh is also parallelized using GPUs. At the end of the day these threads complete execution at slightly different times and post updates asynchronously. So even if you use the same OS and the same basic cluster, if you run it twice you get two different results once you run it far enough, like 10 days. I would not be surprised at all if changing the OS, the architecture, the endianness, the math processor or the GPU brand made the solutions differ a lot in a 10-day forecast.

--
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
Re:Have these people never heard of IEEE754???? by cnettel · 2013-07-28 01:35 · Score: 5, Insightful

No, it isn't, when the system itself is not well-conditioned. And I bet you don't want your compiler to build a real codebase under a strict IEEE754 interpretation, as that will disallow almost any optimization. Even if you did allow it, "trivial" rearrangements that don't affect the theoretical analysis of stability, correctness or condition number will still introduce different rounding perturbations. Perturb weather or some other chaotic system, and you will get a completely different trajectory.
That said, many applied fields, including meteorology, could benefit from more well-disciplined computational science approaches. But don't expect all that much of a difference.
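A rough Python sketch of the "trivial rearrangement" point. The two step functions are algebraically identical forms of the logistic map, a standard toy chaotic system used here as a stand-in (it is not from the paper or from any weather model). The function names and the starting value 0.2 are illustrative assumptions.

```python
def step_factored(x):
    # Logistic map with r = 4, written as 4x(1 - x).
    return 4.0 * x * (1.0 - x)


def step_expanded(x):
    # The same polynomial, expanded to 4x - 4x^2.
    return 4.0 * x - 4.0 * x * x


def max_gap(x0, steps):
    """Largest gap seen between the two trajectories over `steps` iterations."""
    a = b = x0
    worst = 0.0
    for _ in range(steps):
        a = step_factored(a)
        b = step_expanded(b)
        worst = max(worst, abs(a - b))
    return worst


if __name__ == "__main__":
    for n in (10, 50, 200, 500):
        print(n, max_gap(0.2, n))
```

The two forms round differently in the last bits, and the map amplifies that gap on every step until the trajectories have nothing in common, which is the behaviour described above for a perturbed chaotic system.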
I've seen this before by slashgordo. · 2013-07-28 01:38 · Score: 5, Interesting

When doing spice simulations of a circuit many years ago, we ran across one interesting feature. When using the exact same inputs and the exact same executable, the sim would converge and run on one machine, but it would fail to converge on another. It just happened that one of the machines was an Intel server, and the other was an AMD, and we attributed it to ever so slightly different round off errors between the floating point implementation of the two. It didn't help that we were trying to simulate a bad circuit design that was on the hairy edge of convergence, but it was eye opening that you could not guarantee 100% identical results between different hardware platforms.
1. Re:I've seen this before by Livius · 2013-07-28 02:12 · Score: 4, Funny
  
  Well, Arrakis melange is a pretty strong drug, so consistency in spice simulations is probably a little too much to expect.
  (Yes, I know the parent really meant SPICE.)
2. Re:I've seen this before by rossdee · 2013-07-28 02:14 · Score: 4, Funny
  
  "When doing spice simulations "
  Weather forecasting on Arrakis is somewhat tricky: not only do you have the large storms, but also giant sandworms.
  (And sabotage by the Fremen)
3. Re:I've seen this before by Cassini2 · 2013-07-28 05:09 · Score: 4, Insightful
  
  This often happens when the simulation results are influenced by variations in the accuracy of the built-in functions. Every floating point unit (FPU) returns an approximation of the correct result to an arbitrary level of accuracy, and the accuracy level of these results varies considerably when built-in functions like sqrt(), sin(), cos(), ln(), and exp() are considered. Normally, the accuracy of these results is pretty high. However, the initial 8087 FPU hardware from Intel was pretty old, and it necessarily made approximations.
  At one point, Cyrix released an 80287 clone FPU that was faster and more accurate than Intel's 80287 equivalent. This broke many programs. Since then, Intel and AMD have been developing FPUs that are compatible with the 8087, ideally at least as accurate, and much faster. The GPU vendors have been doing something similar, however in video games, speed is more important than accuracy. For compatibility reasons (CPUs) and speed reasons (GPUs), vendors have focused on returning fast, compatible and reasonably accurate results.
  In terms of accuracy, the results of the key transcendental functions, exponential functions, logarithmic functions, and the sqrt function should be viewed with suspicion. At high-accuracy levels, the least-significant bits of the results may vary considerably between processor generations, and CPU/GPU vendors. Additionally, slight differences in the results of double-precision floating point to 64-bit integer conversion functions can be detected, especially when 80-bit intermediate values are considered. Given these approximations, getting repeatable results for accuracy-sensitive simulations is tough.
  It is likely that the article's weather simulations and the parent poster's simulations have differing results due to the approximations in built-in functions. Inaccuracies in the built-in functions are often much more significant than the differences due to round-off errors.
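One way to see how far a built-in function strays in its last bits is to compare it against a higher-precision reference. The sketch below is only an illustration under assumptions: `naive_exp` is a hypothetical stand-in for "some other vendor's implementation", the reference is computed with Python's `decimal` module at 50 digits, and `math.ulp` needs Python 3.9 or later.

```python
import math
from decimal import Decimal, localcontext


def reference_exp(x):
    """exp(x) computed with 50 significant digits, then rounded once to a double."""
    with localcontext() as ctx:
        ctx.prec = 50
        return float(Decimal(x).exp())


def naive_exp(x, terms=40):
    """Textbook Taylor series in plain double arithmetic (hypothetical alternative)."""
    total, term = 1.0, 1.0
    for n in range(1, terms):
        term *= x / n
        total += term
    return total


def ulp_error(value, reference):
    """Distance from the reference, in units in the last place of the reference."""
    return abs(value - reference) / math.ulp(reference)


if __name__ == "__main__":
    for x in (0.1, 0.5, 1.0, 2.5):
        ref = reference_exp(x)
        print(x, ulp_error(math.exp(x), ref), ulp_error(naive_exp(x), ref))
```

Both implementations agree to many digits, but nothing forces them to agree in the final bit, and an accuracy-sensitive simulation can amplify exactly that bit.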
4. Re:I've seen this before by AmiMoJo · 2013-07-28 06:18 · Score: 3, Insightful
  
  In theory both should have been the same, if they stuck rigidly to the IEEE specifications. There may be other explanations though.
  Sometimes compilers create multiple code paths optimized for different CPU architectures. One might use SSE4 and be optimized for Intel CPUs, another might use AMD extensions and be tuned for performance on their hardware. There was actually some controversy when it was discovered that Intel's compiler disabled code paths that would execute quickly on AMD CPUs just because they were not Intel CPUs. Anyway, the point is that perhaps one machine was using different code and different super-scalar instructions, which operate at different word lengths. Compilers sometimes extend a 64 bit double to 80 bit super-scalar registers, for example.
  Or one machine was a Pentium. Intel will never live that one down.
  
  --
  const int one = 65536; (Silvermoon, Texture.cs)
  SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
5. Re:I've seen this before by matfud · 2013-07-28 07:35 · Score: 3, Interesting
  
  Trig functions are nasty. CPUs (FPUs) tend to use lookup tables to get a starting point and then iteratively refine that to provide more accuracy. How they do this depends on the precision and rounding of the intermediate steps and how many iterations they will undertake. Very few FPUs produce IEEE compliant results for trig. Multiple simple math operations also tend to be rounded and kept at different precisions on different processors (let alone instruction reordering done by the CPU and compiler).
  GPUs are great performance-wise at float (sometimes double) math but tend to be poor at giving the result you expect. Now, IEEE-754 does not remove these issues; it just ensures that the issues are always the same.
  It is why Java has java.lang.Math and java.lang.StrictMath for trig, and the strictfp keyword for float and double primitives. (Math tends to just delegate to StrictMath but does not have to.) strictfp can kill performance, as a lot of fixups have to be done in software in the better cases (HotSpot compilation can also be hindered by it), and in the worst cases the entire simple operation (+, -, *, /) has to be performed in software.
Re:Have these people never heard of IEEE754???? by SlayerofGods · 2013-07-28 01:40 · Score: 3, Informative

Yes... because that never rounds off numbers.
https://en.wikipedia.org/wiki/IEEE_floating_point#Rounding_rules

--

Technology, the cause of and solution to all of life's problems.
Re:Have these people never heard of IEEE754???? by Goaway · 2013-07-28 01:42 · Score: 4, Insightful

When floating point roundoff errors grow big enough to affect the outcome of the simulation, you have long since reached the point where you are not predicting anything useful any longer. It is not exactly a problem if the results differ at that point.
Chaos by pcjunky · 2013-07-28 01:46 · Score: 5, Interesting

This very effect was noted in weather simulations back in the 1960s. Read Chaos: Making a New Science, by James Gleick.
Re:Have these people never heard of IEEE754???? by cnettel · 2013-07-28 01:47 · Score: 5, Informative

It doesn't help you that individual operations are rounded deterministically if the order of your operations is non-deterministic. You cannot expect bit-identical results if you parallelize or allow any level of operation reordering. Even very well-written code might implement a reduce operation in different hierarchies depending on memory layout. Forcing all these things to be done in exactly the same order, with full IEEE754 compliance, carries a significant performance cost. By taking numerical aspects into account, you can ensure that your result is not invalid or unreasonable. However, for a chaotic problem where a machine epsilon difference in input data might be enough for a macroscopically different end result, there is nothing you can do and still expect reasonable utilization of modern architectures.
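A small Python sketch of the reduce point: the same four numbers, combined in three different orders, give three different answers. The values are deliberately extreme so the effect is visible at a glance; real reductions usually differ only in the last bits. `sequential_sum` and `pairwise_sum` are illustrative names, not taken from any library.

```python
import math


def sequential_sum(values):
    """Left-to-right accumulation, as a single thread would do it."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Tree-shaped reduction: sum each half, then combine the halves."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


if __name__ == "__main__":
    data = [1e16, 1.0, -1e16, 1.0]
    reordered = [1e16, -1e16, 1.0, 1.0]
    print(sequential_sum(data))       # 1.0
    print(pairwise_sum(data))         # 0.0
    print(sequential_sum(reordered))  # 2.0
    print(math.fsum(data))            # 2.0, the exactly rounded sum
```

Every individual addition here is rounded exactly as IEEE754 prescribes; only the grouping changes. `math.fsum` shows that an order-independent answer is possible, at the cost of extra work per element.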
Yes, the Butterfly Effect, as others have said by Impy+the+Impiuos+Imp · 2013-07-28 02:05 · Score: 5, Interesting

This problem has been known since at least the 1970s, and it was weather simulation that discovered it. It led to the field of chaos theory.
With an early simulation, they ran their program and got a result. They saved their initial variables and then ran it the next day and got a completely different result.
Looking into it, they found out that when they saved their initial values, they only saved the first 5 digits or so of their numbers. It was the tiny bit at the end that made the results completely different.
This was terribly shocking. Everybody felt that tiny differences would melt away into some averaging process, and never be an influence. Instead, it multiplied up to dominate the entire result.
To give yourself a feel for what's going on, imagine knocking a billiard ball on a table that's miles wide. How accurate must your initial angle be to knock it into a pocket on the other side? Now imagine a normal table with balls bouncing around for half an hour. Each time a ball hits another, the angle deviation multiplies. In short order with two different (very minor differences) angles, some balls are completely missing other balls. There's your entire butterfly effect.
Now imagine the other famous realm of the butterfly effect -- "time travel". You go back and make the slightest deviation in one single particle, one single quantum of energy, and in short order atmospheric molecules are bouncing around differently, this multiplies up to different weather, people are having sex at different times, different eggs are being fertilized by different sperm, and in not very long an entirely different generation starts getting born. (I read once that even if you took a temperature, pressure, wind direction, humidity measurement every cubic foot, you could only predict the weather accurately to about a month. The tiniest molecular deviation would probably get you another few days on top of that if you were lucky.)
Even if the current people in these parallel worlds lived more or less the same, their kids would be completely different. That's why all these "parallel world" stories are such a joke. You would literally need a Q-like being tracking multiple worlds, forcing things to stay more or less along similar paths.
Here's the funnest part -- if quantum "wave collapse" is truly random, then even a god setting up identical initial conditions wouldn't produce identical results in parallel worlds. (Interestingly, the mechanism on the "other side" doing the "randomization" could be deterministic, but that would not save Einstein's concept of Reality vs. Locality. It was particles that were Real, not the meta-particles running the "simulation" of them.)

--
(-1: Post disagrees with my already-settled worldview) is not a valid mod option.
Re:Doesn't matter much by AchilleTalon · 2013-07-28 02:32 · Score: 4, Informative

Measurement errors enter once, at the boundary conditions. Precision errors propagate through the computations. So even if a single precision error is orders of magnitude smaller than the measurement errors, it can have an impact on the result, depending on the computations involved in solving the problem.

--
Achille Talon
Hop!
Hey, at least it ran all the way. by 140Mandak262Jamuna · 2013-07-28 02:58 · Score: 3, Interesting

These numerical simulation codes can sometimes do funny things when you port from one architecture to another. One of the most frustrating debugging sessions I had was when I ported my code to Linux. One of my tree class's comparison operators evaluates the key and compares the calculated key with the value stored in the instance. It was crapping out in Linux and not in Windows. I eventually discovered Linux was using 80 bit registers for floating point computation but the stored value in the instance was truncated to 64 bits.
Basically they should be happy their code ported to two different architectures and ran all the way. Expecting the same results for processes behaving chaotically is asking for too much.
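The register-versus-memory mismatch can be imitated in Python by analogy. Python has no 80-bit type, so this sketch assumes 64-bit doubles play the role of the wide register and 32-bit floats (via `struct`) play the role of the narrowed stored value. `Node` and `compute_key` are hypothetical names, not the poster's actual code.

```python
import struct


def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


def compute_key(a, b):
    # Hypothetical key function; any floating point expression would do.
    return a * b + 0.1


class Node:
    def __init__(self, a, b):
        self.a, self.b = a, b
        # The stored copy is narrowed, like a register value spilled to memory.
        self.stored_key = to_float32(compute_key(a, b))

    def key_matches_naive(self):
        # Compares a full-precision value against the narrowed copy.
        return compute_key(self.a, self.b) == self.stored_key

    def key_matches_fixed(self):
        # Narrow both sides to the same format before comparing.
        return to_float32(compute_key(self.a, self.b)) == self.stored_key


if __name__ == "__main__":
    node = Node(3.0, 7.0)
    print(node.key_matches_naive(), node.key_matches_fixed())  # False True
```

The fix shown is one option under these assumptions: force both sides of the comparison through the same precision, so that the comparison no longer depends on where the compiler happened to keep the intermediate value.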

--
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
Re:Have these people never heard of IEEE754???? by amorsen · 2013-07-28 02:58 · Score: 3, Insightful

When floating point roundoff errors grow big enough to affect the outcome of the simulation, you have long since reached the point where you are not predicting anything useful any longer.
This is not true. If the model predicts rain at 2 pm two days out and different rounding moves it to 3 pm, that is still a useful forecast in a lot of cases.

--
Finally! A year of moderation! Ready for 2019?
Re:Have these people never heard of IEEE754???? by Xtifr · 2013-07-28 03:08 · Score: 5, Informative

That would be a case of solving the wrong problem. Getting the exact same result every time doesn't much matter if that result is dominated by noise and rounding errors. In fact, the diverging results are a good thing, since, once they start to diverge, you know you've reached the point where you can no longer trust any of the results. If all the machines worked exactly the same, you could figure the same thing out, but it would require some very advanced mathematical analysis. With the build-the-machines-slightly-differently approach, the point where your results are becoming meaningless leaps out at you.
Remember, the desired result here is not a set of identical numbers everywhere. It is an accurate simulation. Getting the same results everywhere would not make the simulation one bit more accurate. So really, this is a good thing.
Re:Damn you people by Anonymous Coward · 2013-07-28 03:38 · Score: 5, Insightful

Precision is the point. Mathematical chaos diverges exponentially. This means that an input of 9.3440281 in one calculation can return 3.5, while an input of 9.344028147 in another returns something completely different (say 8.1). Now you say: well, let's just make it more precise then! So you put in 9.34402814672 and get yet another completely different result (1.7), and so on*. If you weren't dealing with mathematical chaos, each extra digit would just refine the previous answer (e.g. 3.5, 3.45, 3.467, etc.).
* Note: I should be careful with this layman's description to point out that more precise values technically shrink the window down. But since it is exponentially divergent in the first place, this might not ever do you any good in a realistic setting. Ref Lyapunov exponents and mathematical chaos
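A hedged Python sketch of the footnote: extra digits of input precision do buy time, but only a little. It uses the logistic map with r = 4 as a toy chaotic system (an assumption for illustration, not the poster's example). Its Lyapunov exponent is ln 2, so on average each extra decimal digit buys roughly log2(10), about 3.3, more iterations.

```python
def logistic(x):
    # Logistic map with r = 4, a standard toy chaotic system.
    return 4.0 * x * (1.0 - x)


def steps_until_divergence(x0, delta, threshold=0.1, max_steps=1000):
    """Count iterations until two runs started `delta` apart differ by `threshold`."""
    a, b = x0, x0 + delta
    for n in range(1, max_steps + 1):
        a, b = logistic(a), logistic(b)
        if abs(a - b) >= threshold:
            return n
    return max_steps


if __name__ == "__main__":
    for digits in (3, 6, 9, 12):
        print(digits, steps_until_divergence(0.2, 10.0 ** -digits))
```

The step count grows only linearly with the number of correct digits, because the error grows exponentially with the number of steps.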
Re:Have these people never heard of IEEE754???? by Rockoon · 2013-07-28 03:39 · Score: 5, Informative

So are you saying that enforcing predictable and correct answers has a significant performance cost?
He said nothing about "correct."

And yes, enforcing predictable answers across toolchains and architectures has significant performance cost. Even ignoring optimizations, with the x87 FPU (which uses 80-bit registers) it means the compiler needs to emit a rounding operation after every single intermediate operation because the x87 uses 80-bit internal floats but IEEE754 specifies that all operations, even intermediate ones, are always to be performed as if rounded like 32-bit or 64-bit floats.

When you get into the effects of order-of-operations type optimizations even on hardware that only uses 64-bit floats, you find that in most cases (x + y + z) != (z + y + x) even when the same floating point precision is present in each step of the calculation. Even things like common-divisor optimizations (if z is used as a divisor many times, compute 1/z a single time and multiply because multiplication is much faster than division) destroy the chance of equal outcome between compilers that will do it and compilers that will not.

The best way to get insight into the issues is to become familiar with the single-digit-of-precision estimation technique.
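The effects above can be checked directly. This Python sketch shows the reordering case, the reciprocal-multiply case, and one reading of the single-digit estimation idea: redoing the arithmetic with one significant digit via the `decimal` module so the rounding is visible by eye. The specific values are chosen for illustration.

```python
from decimal import Decimal, localcontext


def one_digit_sums():
    """Add 9 + 0.4 + 0.4 in two groupings, keeping one significant digit."""
    with localcontext() as ctx:
        ctx.prec = 1
        nine, small = Decimal(9), Decimal("0.4")
        left = (nine + small) + small   # 9.4 rounds to 9, twice
        right = nine + (small + small)  # 0.8 is exact, then 9.8 rounds to 10
        return left, right


if __name__ == "__main__":
    x, y, z = 0.1, 0.2, 0.3
    print((x + y) + z == (z + y) + x)        # False
    print((x + y) + z, (z + y) + x)          # 0.6000000000000001 0.6

    # Reciprocal-multiply "optimization" versus a true division.
    print(49.0 / 49.0, 49.0 * (1.0 / 49.0))  # 1.0 0.9999999999999999

    print(one_digit_sums())                  # (Decimal('9'), Decimal('1E+1'))
```

With one digit of precision the two groupings give 9 and 10; with 53 bits the same thing happens, just further to the right of the decimal point.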

--
"His name was James Damore."
Re:Have these people never heard of IEEE754???? by Anonymous Coward · 2013-07-28 03:54 · Score: 5, Insightful

Almost nothing you do with IEEE754 floating point numbers is correct in the strict mathematical sense. You can't even represent 0.1 (1/10) as an IEEE754 floating point number. There are entire series of lectures on the topic of scientific computing with floating point numbers. The errors are usually small enough that a few simple rules keep you safe (e.g., never compare floating point numbers for equality), but when you do many iterations, the errors can accumulate and mess with your results, and if in that case you do the calculations in a different order, the accumulated error will mess with your results in a different way. That's what's happening here.
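A short Python sketch of the two rules mentioned: 0.1 is not exactly representable, and equality tests on accumulated results are unreliable. The tolerance passed to `math.isclose` is an arbitrary illustrative choice, and `repeated_add` is a hypothetical helper.

```python
import math
from fractions import Fraction


def repeated_add(step, count):
    """Accumulate `step` into a running total `count` times."""
    total = 0.0
    for _ in range(count):
        total += step
    return total


if __name__ == "__main__":
    # The double nearest to 0.1 is not exactly one tenth.
    print(Fraction(0.1) == Fraction(1, 10))        # False

    total = repeated_add(0.1, 10)
    print(total == 1.0)                            # False
    print(math.isclose(total, 1.0, rel_tol=1e-9))  # True
```

Ten additions are enough to miss 1.0 by one unit in the last place; a simulation does many orders of magnitude more operations than that.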
Re:problem solved decades ago by HornWumpus · 2013-07-28 04:01 · Score: 3, Informative

A little knowledge is a dangerous thing.
Get back to us when you've recompiled the simulation using BCD and then realize that there is still rounding. .01 being a repeating fraction in binary floating point is a separate issue.
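Python's `decimal` module is not BCD hardware, but it is base-10 arithmetic, so it serves as a reasonable stand-in for a quick check. A minimal sketch, with `thirds_round_trip` as a hypothetical helper name:

```python
from decimal import Decimal, localcontext


def thirds_round_trip(digits=28):
    """Compute (1 / 3) * 3 in base-10 arithmetic with `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        third = Decimal(1) / Decimal(3)  # rounded to `digits` digits
        return third * 3


if __name__ == "__main__":
    print(Decimal("0.01") * 100 == 1)  # True: 0.01 is exact in base 10
    print(thirds_round_trip())         # 0.9999999999999999999999999999
    print(thirds_round_trip() == 1)    # False: base 10 still has to round
```

Switching the base changes which values are exact, not whether rounding happens.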

--
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
Lorenz, the Butterfly Effect and Chaos Theory by alanw · 2013-07-28 04:03 · Score: 3, Informative

Edward Lorenz discovered back in 1961 that floating point truncation causes weather simulations to diverge massively.
This was the foundation of Chaos Theory, and it was Lorenz who coined the term "Butterfly Effect".
http://www.ganssle.com/articles/achaos.htm
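A sketch of that kind of experiment in Python. Assumptions are flagged: it uses the three-variable Lorenz system from his later work as a stand-in for the weather model he was actually running, a hand-written RK4 integrator, and the commonly retold starting value 0.506127 truncated to 0.506. None of this is the original code.

```python
def lorenz_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz system with the classic parameters."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(state, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))

    k1 = lorenz_rhs(state)
    k2 = lorenz_rhs(shifted(state, k1, dt / 2))
    k3 = lorenz_rhs(shifted(state, k2, dt / 2))
    k4 = lorenz_rhs(shifted(state, k3, dt))
    return tuple(
        s + dt / 6.0 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def separation_history(full, truncated, steps, dt=0.01):
    """Largest per-component gap between the two runs after each step."""
    history = []
    for _ in range(steps):
        full, truncated = rk4_step(full, dt), rk4_step(truncated, dt)
        history.append(max(abs(a - b) for a, b in zip(full, truncated)))
    return history


if __name__ == "__main__":
    x0 = 0.506127
    gaps = separation_history((x0, 1.0, 1.0), (round(x0, 3), 1.0, 1.0), 4000)
    for step in (0, 500, 1000, 2000, 3999):
        print(step, gaps[step])
```

The two runs start about 0.0001 apart and end up as far apart as two unrelated points on the attractor.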
Re:Have these people never heard of IEEE754???? by kyrsjo · 2013-07-28 04:12 · Score: 5, Interesting

*SNIP*
BTW, this is one reason why I take all the global warming predictions with a big grain of salt - they are all based on computer simulations which are difficult if not impossible to validate, and given what I've seen, I don't trust the results from them at all.
In the case of climate simulations, different models (both physics-wise and code-wise) are run with different computers on the same input data, and yield basically the same results.
When simulating chaotic behaviour, very small differences can make a big difference in the outcome of your simulations. As an example, I'm currently working on simulations of sparks in vacuum, which is a "runaway" process. In this case, adding a single particle early in the simulations (before the spark actually happens) can change the time for the spark to appear by several tens of %. This also happens if we are running with different library versions (SuperLU, Lapack), different compilers, and different compiler flags. Once the spark happens, the behaviour is predictable and repeatable - but the time for it to happen, as the system is "balancing on the edge, before falling over", is quite random.
Re: Have these people never heard of IEEE754???? by statusbar · 2013-07-28 05:59 · Score: 3, Interesting

Good points. In fact, in this case one can say that ALL of the calculations done by the different computer architectures are wrong, to varying degrees. When floating point math is done without any rounding analysis, all bets are off. Measurements always have limited accuracy, and floating point math adds its own inaccuracies.
The Boost library can help: http://www.boost.org/doc/libs/1_54_0/libs/numeric/interval/doc/interval.htm
Of course all this extra interval management costs in terms of development and performance. But what is the cost of having supercomputers coming up with answers with unknown accuracy?
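To give a flavour of what interval arithmetic does, here is a toy Python sketch. It is not the Boost API: the `Interval` class and its methods are hypothetical, only addition is implemented, and outward rounding is approximated with `math.nextafter` (Python 3.9 or later).

```python
import math


class Interval:
    """Closed interval [lo, hi] with outward rounding on addition."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    @classmethod
    def around(cls, x):
        # Widen by one ulp each way so the interval also covers the decimal
        # value the float literal was meant to represent.
        return cls(math.nextafter(x, -math.inf), math.nextafter(x, math.inf))

    def __add__(self, other):
        # Push each bound one ulp outward so rounding can never shrink the range.
        return Interval(
            math.nextafter(self.lo + other.lo, -math.inf),
            math.nextafter(self.hi + other.hi, math.inf),
        )

    def __contains__(self, x):
        return self.lo <= x <= self.hi

    def width(self):
        return self.hi - self.lo


if __name__ == "__main__":
    total = Interval(0.0, 0.0)
    for _ in range(10):
        total = total + Interval.around(0.1)
    print(total.lo, total.hi, 1.0 in total, total.width())
```

Instead of a single number of unknown quality, the result is a pair of bounds guaranteed to bracket the true answer. The width of the interval is the accuracy estimate, and watching it grow over many steps is the extra cost mentioned above.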

--
ipv6 is my vpn
Re:Have these people never heard of IEEE754???? by dfghjk · 2013-07-28 06:20 · Score: 4, Insightful

"Remember, the desired result here is not a set of identical numbers everywhere. It is an accurate simulation."
*An* accurate simulation is not the desired result either, an accurate model is. Without reproducibility you don't have a model.
Reproducibility is important always.
Re:Damn you people by Anonymous Coward · 2013-07-28 06:50 · Score: 5, Funny

For being one of the many to use "should of" where the correct phrase is "should have" (often abbreviated "should've"), I just point at you and laugh.
Re:Have these people never heard of IEEE754???? by amorsen · 2013-07-28 07:01 · Score: 3, Funny

It is so unfortunate that academics do not have the wisdom of Slashdot available before they submit papers. Alas, that is the reality they have to live with.

--
Finally! A year of moderation! Ready for 2019?
Re: Have these people never heard of IEEE754???? by dylan_- · 2013-07-28 08:12 · Score: 3, Insightful

another one says the earth will absorb the heat. Which one do you trust?
I think I'd have to go with the one that doesn't redefine "absorb" to mean "magically disappear".

--
Igor Presnyakov stole my hat