Why Computers Suck At Math
antdude writes "This TechRadar article explains why computers suck at math, and how simple calculations can be a matter of life and death, like in the case of a Patriot defense system failing to take down a Scud missile attack: 'The calculation of where to look for confirmation of an incoming missile requires knowledge of the system time, which is stored as the number of 0.1-second ticks since the system was started up. Unfortunately, 0.1 seconds cannot be expressed accurately as a binary number, so when it's shoehorned into a 24-bit register — as used in the Patriot system — it's out by a tiny amount. But all these tiny amounts add up. At the time of the missile attack, the system had been running for about 100 hours, or 3,600,000 ticks to be more specific. Multiplying this count by the tiny error led to a total error of 0.3433 seconds, during which time the Scud missile would cover 687m. The radar looked in the wrong place to receive a confirmation and saw no target. Accordingly no missile was launched to intercept the incoming Scud — and 28 people paid with their lives.'"
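The 0.3433-second figure in the summary can be checked by hand. This is a minimal sketch, assuming 0.1 is chopped (not rounded) to 23 bits after the binary point. That bit count is an assumption that happens to reproduce the quoted number, and the names below are illustrative, not taken from any Patriot code.

```python
from fractions import Fraction

FRACTION_BITS = 23        # assumed number of bits after the binary point
TICKS = 3_600_000         # 100 hours of 0.1-second ticks, from the summary

exact = Fraction(1, 10)
# Chop 0.1 to FRACTION_BITS binary digits by discarding everything below the last bit.
stored = Fraction(2**FRACTION_BITS // 10, 2**FRACTION_BITS)

error_per_tick = exact - stored          # about 9.5e-8 seconds per tick
total_error = error_per_tick * TICKS     # drift accumulated over 100 hours

print(float(error_per_tick))
print(float(total_error))                # 0.34332275390625
```

Using exact rationals keeps the check itself free of the rounding it is trying to measure.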
What Every Computer Scientist Should Know About Floating-Point Arithmetic
1) This problem was covered in Risks Digest years ago.
2) Design and production phase was completed in 1980.
http://catless.ncl.ac.uk/Risks/10.82.html#subj1
is a good start for "Why the hell are we using this weapons system the way we are?"
As memory serves the fix is to restart the system periodically.
As memory also serves that's been part of the operating procedure for a very long time.
The problem seems to be right out of the textbook for "Practical Analysis" (not sure if this is the correct translation for the German "Praktische Analysis"). This was a mandatory course for every computer science degree during my university time (20 years ago). Don't know if this is still the case. It was an eye opener to see how correct formulas and a perfectly working computer could yield absurd results. Several times I was asked for help by people claiming their Excel was broken due to such mistakes.
CU, Martin
This particular story took place in 1991, and most of the code for Patriot was written in the 70s - needless to say, software QA was a little more lax back then. The fix for this problem was out a couple days after the incident.
I actually read about this specific incident once; I seem to remember (though honestly not sure) that the design flaw was known and the user manual indicated that the computer needed to be reset every 36 hours. However, in wartime, under attack (there were frequent Scud intercepts), the crew controlling the missile battery opted against shutting it down, even for a short time. Maybe even though the manual said it SHOULD be rebooted it did not explain WHY or what the consequences would be.
I want to know who programmed a system that allowed floating point errors to accumulate over time in a critical calculation. I hope they did not receive a degree in computer science, or that if they did, it was not from my alma mater.
Seriously, what programmer has not heard of floating point errors? That has to be one of the most common phrases I have ever heard in relation to programming; even the EEs and MEs I have met are familiar with the concept.
Palm trees and 8
>>>It's also pretty pathetic that the system designers implemented a broken design and did not foresee this problem. High-resolution timekeeping has been accomplished pretty successfully already...
I sorry.
j/k.
We had a similar problem with an Aegis design, and it was a major headache for us Hardware engineers to try to convince the Systems Engineers that counting in Binary time was more logical than counting in 0.1 second increments. The SEs kept insisting that their computers at home accurately count in seconds and we hardware engineers should be able to. The HE manager and the SE manager were butting heads for about a month over this issue, until finally an upper-level manager handed down a decision in favor of the HE manager and binary-based counting/requirements documentation.
I guess in the Patriot situation, the decision went in the opposite direction. Hence errors were introduced.
"I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
I remember this from a numerical methods class in the 1980s. To deal with situations like this, you can do one of three things:
a) Have a function that you sample as a function of t, so you don't get accumulated error.
b) Have enough bits so that error won't be an issue. This is actually hard to do because floating point errors do stack up pretty quick if you are not careful.
c) Or, you can have an error term which you can use to make adjustments along the way to account for a lack of precision. Bresenham's line algorithm does more or less exactly that when it draws lines. That's why you got "stair stepping" as the algorithm corrected itself along the way.
If the OP was correct, then PATRIOT failed because it did none of them. My bet is that in reality they simply underestimated the actual error term, but did everything else correctly. This could be because of discrepancies in flight control instrumentation or some sensor, or because they were simply trying to save money on bits and didn't really do the calculation as to how far off the missile could be after an error term's worth of seconds of flight at a particular phase in its flight profile.
Bottom line is, the engineering discipline exists to solve this problem, and it is really no different from error handling in any guidance system. Putting a man on the moon, launching an ICBM at a target, and shooting down a missile are all essentially the same computer science problem from an error management perspective. The PhDs already nailed this decades ago. There's no fundamental limitation of computing in this case, merely a failure or inability of the engineers on this project to apply the correct, known answer to this problem.
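Option (c) above, carrying an error term along with the running total, has a standard floating-point form known as Kahan (compensated) summation. It is swapped in here purely as an illustration of the idea; nothing in the thread says the Patriot software did or should have used it, and the function names are made up for this sketch.

```python
def naive_sum(step, count):
    """Add `step` to a running total `count` times with no correction."""
    total = 0.0
    for _ in range(count):
        total += step
    return total

def kahan_sum(step, count):
    """Same accumulation, but carry the rounding loss forward as an error term."""
    total = 0.0
    compensation = 0.0                      # estimate of low-order bits lost so far
    for _ in range(count):
        y = step - compensation             # feed the previous loss back in
        t = total + y                       # low-order bits of y may be lost here
        compensation = (t - total) - y      # recover what was just lost
        total = t
    return total

COUNT = 1_000_000
exact = COUNT / 10                          # 100000.0, exactly representable
naive_error = abs(naive_sum(0.1, COUNT) - exact)
kahan_error = abs(kahan_sum(0.1, COUNT) - exact)
print(naive_error, kahan_error)
```

The compensated version costs a few extra operations per step, which is the usual trade for keeping the drift bounded instead of growing with the number of additions.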
This is my sig.
Fixed point never rounds when operating in the range and precision for which it is designed. In this case they needed a precision of .1, using INT/10 would be 100% accurate and never give them any rounding errors for this use case.
So, in other words: You are wrong, and should probably consider using fixed point more.
With fixed point you can choose the base of the fraction part. A binary fixed point would not help them, but a decimal fixed point of /10 or /100 would. The algebra of fixed point is the same no matter what base you choose. This means it is the fastest way to get decimal-based fractions instead of binary fractions (decimal floating point is best with hardware support).
Well, in this specific instance a decimal system would have been ok, but it isn't a general answer. The general answer is "make sure your increments are divisible into your number base", if they had used 1/8th or 1/16ths of a second, or even 3/32 of a second, as their timer increment then they would not have had this problem. There's no reason why 1/10th of a second has any magic properties.
In general terms, all number bases have other number bases with which they are incompatible. The inability of binary to represent 1/10 accurately is just the same as the inability of decimal to represent 1/3 accurately. It's only because we use decimal all the time that we overlook decimal's shortcomings (or instinctively compensate for or avoid them) and then blame computers for binary's incompatibility with decimal.
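The point about increments that divide evenly into the number base is easy to check. A small sketch, using Python's double-precision floats as the binary format (an assumption for illustration; the helper name is hypothetical):

```python
from fractions import Fraction

def is_exact_in_binary(numerator, denominator):
    """True if numerator/denominator is stored in a binary float without rounding."""
    # Fraction(float) recovers the exact value the float actually holds.
    return Fraction(numerator / denominator) == Fraction(numerator, denominator)

for num, den in [(1, 8), (1, 16), (3, 32), (1, 10), (1, 3)]:
    print(f"{num}/{den}: exact = {is_exact_in_binary(num, den)}")
```

Only the fractions whose denominators are powers of two come back exact; 1/10 fails in binary for the same reason 1/3 fails in decimal.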
Everybody knows that they exist, fewer people know how to avoid them. Lots of early multimedia frameworks, for example, were written using floating point timestamps and developed this exact problem (add some fraction repeatedly for each audio and each video frame, and after an hour the two tracks are noticeably out of sync). Now, they use a numerator-and-denominator form which is simple to add without rounding errors and so you only get them when you convert to floating point for comparison.
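A rough sketch of the numerator-and-denominator idea, using Python's `fractions.Fraction` as the rational type. The frame duration (1001/30000 s) and frame count are illustrative values, not taken from any particular framework.

```python
from fractions import Fraction

FRAME_DURATION = Fraction(1001, 30000)   # seconds per video frame (illustrative)
FRAMES = 108_000                         # roughly one hour of video

rational_clock = Fraction(0)
float_clock = 0.0
for _ in range(FRAMES):
    rational_clock += FRAME_DURATION         # exact: no rounding at any step
    float_clock += float(FRAME_DURATION)     # rounds on every addition

# Convert to floating point only at the end, for comparison or display.
drift = float_clock - float(rational_clock)
print(rational_clock, drift)
```

How large the printed drift is depends on the float width: with double precision it is tiny, while with narrower floats it grows much faster. The rational clock has none either way.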
Even fewer people realise how compiler and hardware dependent they can be. For example, if you do a sequence of floating point operations on x86 then the values will stay in 80-bit registers until they are stored out to a variable. If you compile the same code for a newer machine with SSE or for another architecture then you will get 32-bit operations on your 32-bit floats and so you'll have less precision. A lot of compilers will even generate different precision between debug and release builds.
I am TheRaven on Soylent News
We had a similar problem with an Aegis design, and it was a major headache for us Hardware engineers to try to convince the Systems Engineers that counting in Binary time was more logical than counting in 0.1 second increments. The SEs kept insisting that their computers at home accurately count in seconds and we hardware engineers should be able to.
And the software engineers would have been right. The error was not about counting in 0.1 second increments versus 1 second increments or whatever, but it was in using floating point representation where fixed point (basically, scaled integer) would have been more appropriate.
And come to think of it, that is more or less what most desktop and server OSes do: they count number of milli, micro, or nanoseconds, and store that as an integer.
A similar issue arises in finance: you don't encode dollar amounts as floating point. Instead you store the number of cents (or mils) as an integer. Every programmer of financial software knows about this (... or should know about this...)
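A minimal illustration of the cents-as-integer approach; the variable names and the formatting helper are hypothetical.

```python
PRICE_CENTS = 10                     # $0.10 stored as an integer count of cents

total_cents = PRICE_CENTS + PRICE_CENTS + PRICE_CENTS   # integer math, exactly 30
total_float = 0.10 + 0.10 + 0.10                        # binary float math

def format_dollars(cents):
    """Render a non-negative cent count as dollars; conversion happens only for display."""
    return f"${cents // 100}.{cents % 100:02d}"

print(total_cents == 30)             # True
print(total_float == 0.30)           # False
print(format_dollars(total_cents))   # $0.30
```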
Floating point is really only appropriate to represent values which are not known precisely anyway (measurement results), where the little additional rounding error wouldn't matter. For all else, use fixed-point.
There was no evidence of them hitting a single target. None.
I know that I'm arguing with a trolling AC, but for the other readers of slashdot, you should know that the grandparent's post refers to the controversy regarding the analysis of the Patriot system during the first Gulf war. There was a huge propaganda machine behind the Patriot's "successes" which turned out to be very near zero indeed. This was covered in a series of hearings in the early 90's...
http://www.fas.org/spp/starwars/docops/pl920908.htm
You can also read up on this from transcripts from the hearings after the war.
In the interests of fairness, here is a rebuttal / review.
http://www.fas.org/spp/starwars/docops/zimmerman.htm
I remain unconvinced -- from reading this (almost 20 years ago) I concluded that at best, the military did not know for sure that these worked well.
Slashdotter, ID #101. UIDs are in binary, right?
Someone posted the actual GAO report on this, which makes a bit more sense than the gibberish TechRadar article.
http://www.fas.org/spp/starwars/gao/im92026.htm
The way the system is sure it's tracking the target it was given is by predicting where it should be seen next based on speed and direction, and then only looking for it in a window ("range gate") around that predicted position. The window is a point in space-time and therefore has time coordinates as well as space coordinates, and the problem was that the Patriot system apparently used absolute time since power on to specify the time coordinate, hence the error accumulation. The problem could have been avoided simply by using a time coordinate relative to the last tracked position rather than an absolute one.
The GAO report also blames the 24 bit registers of the 1970's era hardware as limiting accuracy which is just garbage. A good excuse to a politician perhaps, but there was nothing stopping them from using a 64 bit, or whatever, math library if that would have helped.
Of course the Patriot was being used outside of its original requirements spec when being used to target SCUDs, so it seems someone really screwed up in not reviewing the design beforehand and determining its limitations (and fixing them) rather than finding out after the fact when 28 people are dead as a result.
I think the guy has a point (although he's being a bit nationalistic about it)
I'm actually English and I live in Taiwan. I've got no plans to ever live in the US, so it's not really about nationalism.
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
Eh. Forgive me, but do you have any basis whatsoever for this claim, or are you just being arrogant?
In the UK people have been locked up for breaching the Official Secrets Act. Fair enough you may say, but many of them seem to have been guilty more of embarrassing the government than releasing information which hurt national security.
http://news.bbc.co.uk/2/hi/uk_news/216868.stm
Now the UK is not particularly bad at this sort of thing. In far less free societies like China people have been executed because they "might" have commented on the health of senior leaders and quoted information which was publicly available. In fact most of the charges against them are never even released -
http://fairuse.100webcustomers.com/itsonlyfair/latimes0243.html
The nonconfidential version of the verdict released to the family March 24 reveals only two of eight "top secret" charges, any of which could result in the death penalty. One relates to charges from a witness that Wo "might" have intentionally passed on information about the health of senior leaders to Taiwan, Chen and Michael Rolufs said.
A second alleges that Wo collected technical information on missiles for the Taiwanese. The other six such charges were not revealed. The verdict also claims Wo received $400,000 from Taiwan.
Chen seriously doubts that Wo had access to confidential information on senior leaders' health status and notes that the verdict's use of "might" suggests a lack of certainty. On the more serious charge of obtaining technical information on Chinese missiles, the verdict suggests Wo got information from magazines. But these were all from a publicly accessible library, Chen said.
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
Ok, now go and read the article. The Patriot bug was a problem with fixed point maths. The Ariane bug was integer overflow. The Intel FPU bug was caused by a production error with nothing to do with the arithmetic actually being performed.
Reality is the ultimate Rorschach.
Yes. The issue here sounds like they had a system clock counter that was an integer, that counted the number of 0.1 second clock ticks. Then they wanted to convert this to a floating point number in 24 bit IEEE format. They simply multiplied 0.1 by the integer in the register. Of course, that still sounds like too large an error to have occurred from just that, but let's pretend it did.
There are several issues here. For missiles travelling at such speeds, using a system clock counter based on 0.1 second ticks sounds terribly coarse to me. Second, since 0.1 seconds are the baseline resolution of the system, the system should have been using floating point numbers where '1' corresponds to a decisecond rather than a second. Then the time counter would be exactly expressible in the floating point format.
Lastly, if the floating point format really needed to be in units of seconds, rather than deciseconds, the time counter should have been loaded in, having an exact representation, then it should be divided by 10, which has an exact representation. This is all pretty basic to anybody who has even a limited understanding of floating point. If you understand the inherent precision of every operation even better than I do, even more improvements would be possible.
But to be honest, I'm not sure why floating point was used at all here. It sounds to me like fixed point may have worked just fine for most of these problems. (Of course, fixed point has its own set of rules ensuring maximal accuracy.)
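The difference between multiplying by an already-rounded 0.1 and dividing an exact integer by an exact 10 shows up even in ordinary double precision. A small sketch (Python floats stand in for whatever format the real system used, which is an assumption; function names are illustrative):

```python
def seconds_by_multiply(ticks):
    # 0.1 was rounded when the literal was stored, and the product rounds again.
    return ticks * 0.1

def seconds_by_divide(ticks):
    # Both operands are exact, so the result is rounded only once.
    return ticks / 10

mismatches = [n for n in range(1, 101)
              if seconds_by_multiply(n) != seconds_by_divide(n)]
print(seconds_by_multiply(3), seconds_by_divide(3))
print(len(mismatches), "of the first 100 tick counts disagree")
```

The divide form returns the closest representable value to the true time for every tick count, which is the best any fixed-width float can do.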
Stylish sheet to fix many problems in Slashdot's D3: https://gist.github.com/801524
You do realize that Israel hasn't been occupying a square inch of Lebanon and therefore Hizballah doesn't actually have any reason to fire rockets other than "hey let's kill some goddamn Jews" (which is their stated reason for it)?
Yup. This, combined with the parent's note about high-resolution timing shouldn't have even gotten past the first programmer to write the code. The instant they wrote a line of code that depended on timekeeping that precise, there should have been a review of the time system, or rather before, that should have been thought of in the design phase. And as for floating-point errors, any programmer that isn't aware of those issues needs to be writing ... fuck, I don't even know. Something that doesn't use floating-point numbers I guess. Why the Christ they were repeatedly adding floating-point numbers is beyond me, 99.1% of the time it's possible to either do a direct calculation or do a "resync" of some sort, and for the 0.899999998% of the rest of the time, you can use an arbitrary-precision number library, or a rational number library, or (very) carefully look at the rounding algorithm, or any number of other ways around the issue.
Anyway, this should have been caught at pretty much every layer, and whoever missed it shouldn't be in the business. Blech.
<xml><I><am><so><damn>Web 2.0</damn></so></am></I></xml>
One of the other results (the first one that comes up for me actually) claims that in testimony presented to Congress Postol's methodology was called out as flawed based on the fact that three or eight Patriots were launched at every incoming missile and his video analysis is done per interceptor fired, completely ignoring the massive odds against more than one interceptor making a hit. The Israelis' independent analysis puts the success rate at 50%.
LISP, Scheme, Haskell, Mathematica, Maple, and plenty of other languages support arbitrary precision rational numbers as built in types. This fixes all rounding errors involving rational numbers (including fractions). If irrational numbers like pi, e, or transcendental functions are necessary, then there will always be inherent error in the representation and the programmer has to know how to deal with that error and calculate the expected error of a sequence of operations. If you want to get fancy, you can use an algebraic language like Mathematica to symbolically solve your equations and maintain perfect accuracy with symbolic representations of irrational and transcendental numbers.
The problem is the base mathematical principles that computers are also using.
Any fraction that results in a repeating decimal or is an irrational number is going to have an inherent error due to rounding or truncating--the computer will never be using an exact value.
Any calculation using pi or e is going to have error due to rounding or truncation for the same reason--the computer will never be using an exact value.
Even respecting significant digits, there will still be error in any calculation that has to round or truncate a fractional result.
Note that this problem is also inherent to pencil and paper calculated mathematics--at the point the calculation has to round or truncate a number with values after the radix point (so that problem can occur in ANY whole number base, not just base 10 or base 2), an exact value is not being used at that point and so that and any subsequent calculation always has error associated with it.
While I agree that the design decisions which led to this were poorly made, this error was common knowledge.
The Patriot system _must_ be restarted every X days, exactly due to this bug. This is documented and everything.
While the initial error was with the people who created the Patriot system, the soldiers who were assigned to the system were the ones who made sure that a documented bug with a known-good work-around became a loss of life.
The problem described is not overflow, it is repeated rounding on the imprecise representation of 0.1. The system failed after 3.6M ticks. In a 24-bit register overflow is not a problem until at least 16.7M ticks. That figure is only a lower bound because this is not an integer register and the article does not describe the size of the exponent. If you check the figures in the article, when the system failed it was out by about three ticks in 3.6M. Overflow causes the representation to suddenly shift to a completely wrong value. Subtle shifts to nearby values are symptomatic of rounding error.
Rounding is an issue. This particular example is classic textbook stuff and has been used in many a software engineering course. Using a 24-bit floating point representation, store 0.1. The error will be roughly 1/(2^23). Now repeatedly accumulate this value. The problem occurs because each time the value goes past a power of two boundary in magnitude we need to round the accumulator, causing a slight loss of accuracy. Then for each subsequent addition we are adding a less precise representation of 0.1 until they are more likely to round up rather than down.
The system has three sources of inaccuracy that anybody with experience in floating arithmetic could have pointed out easily:
1. The initial representation of 0.1 is off by a tiny amount (very small impact on the final value)
2. As the accumulator increases in magnitude its exponent rises, in this case by about log_2(3.6M) ≈ 22 places.
3. The subsequent additions of the small value to the much larger value become increasingly imprecise.
The problem is not the clock itself, nor some integer accumulation of the time: it is a designer who chose to use a floating point accumulator. Multiplying the representation of 0.1 by the integer number of ticks at each stage would have eliminated the problem. Accumulating 1/8 ticks in floating point would have worked fine. Doing what they did was stupid.
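The contrast this comment draws, accumulating 0.1 in a narrow float versus multiplying it by the tick count, can be simulated. This sketch uses IEEE 754 single precision (24-bit significand) as a stand-in for the "24-bit floating point" format described; that choice is an assumption, and the code illustrates the commenter's scenario rather than reconstructing the actual Patriot software.

```python
import struct

_F32 = struct.Struct("f")

def to_float32(x):
    """Round a Python float to the nearest single-precision value."""
    return _F32.unpack(_F32.pack(x))[0]

TICKS = 3_600_000                  # tick count quoted in the article
TENTH = to_float32(0.1)            # 0.1 as stored in single precision
TRUE_SECONDS = TICKS / 10          # 360000.0

# Accumulator approach: one rounding to single precision on every tick.
accumulated = 0.0
for _ in range(TICKS):
    accumulated = to_float32(accumulated + TENTH)

# Multiplication approach: a single rounding at the end.
multiplied = to_float32(TICKS * TENTH)

print("accumulated error:", abs(accumulated - TRUE_SECONDS))
print("multiplied error: ", abs(multiplied - TRUE_SECONDS))
```

Once the accumulator is large, each added 0.1 is rounded to a multiple of the accumulator's spacing, so the per-tick error is far bigger than the error in 0.1 itself. The multiplied value avoids that because it is rounded only once.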
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php