Why Computers Suck At Math
antdude writes "This TechRadar article explains why computers suck at math, and how simple calculations can be a matter of life and death, like in the case of a Patriot defense system failing to take down a Scud missile attack: 'The calculation of where to look for confirmation of an incoming missile requires knowledge of the system time, which is stored as the number of 0.1-second ticks since the system was started up. Unfortunately, 0.1 seconds cannot be expressed accurately as a binary number, so when it's shoehorned into a 24-bit register — as used in the Patriot system — it's out by a tiny amount. But all these tiny amounts add up. At the time of the missile attack, the system had been running for about 100 hours, or 3,600,000 ticks to be more specific. Multiplying this count by the tiny error led to a total error of 0.3433 seconds, during which time the Scud missile would cover 687m. The radar looked in the wrong place to receive a confirmation and saw no target. Accordingly no missile was launched to intercept the incoming Scud — and 28 people paid with their lives.'"
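A minimal sketch of the arithmetic in the quote, in Python with the standard `fractions` module so that the calculation itself is exact. It assumes the register held 1/10 truncated to 23 bits after the binary point. That is an assumption, not something the article states, but it is the reading that reproduces the quoted 0.3433 s over 3,600,000 ticks.

```python
from fractions import Fraction

FRACTION_BITS = 23   # assumption: bits kept after the binary point
TICKS = 3_600_000    # about 100 hours of 0.1 s ticks

# Chop 1/10 to FRACTION_BITS binary places (truncation, not rounding).
scaled = Fraction(1, 10) * 2**FRACTION_BITS
stored_tenth = Fraction(scaled.numerator // scaled.denominator, 2**FRACTION_BITS)

# Each tick loses the chopped-off tail; the loss grows linearly with uptime.
per_tick_error = Fraction(1, 10) - stored_tenth
drift_seconds = per_tick_error * TICKS

print(f"error per tick: {float(per_tick_error):.3e} s")         # 9.537e-08 s
print(f"drift after {TICKS:,} ticks: {float(drift_seconds):.4f} s")  # 0.3433 s
```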
It's pretty pathetic and negligent that software that controls explosive missiles was not tested for over 100 hours of operation. That's a standard Quality Assurance procedure for even the simplest low-budget hardware...
It's also pretty pathetic that the system designers implemented a broken design and did not foresee this problem. High-resolution timekeeping has been accomplished pretty successfully already...
I wonder how much time and money was spent on research and development for this thing.
It doesn't seem like we're getting a quality product for the likely huge sum that was paid for it...
Use decimal floating point, or simply switch to fixed point. Fixed point is not used as often as it should be, and many developers don't know how difficult ordinary floating point really is.
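To make the decimal floating point suggestion concrete, here is a small sketch using Python's standard `decimal` module as a stand-in for "decimal floating point" (the post names no particular library). It only shows that tenths become exact; decimal arithmetic still has to round values such as 1/3.

```python
from decimal import Decimal

# Binary floating point: 0.1 has no exact representation, so the sum drifts.
binary_total = sum([0.1] * 10)

# Decimal floating point: Decimal("0.1") is exactly one tenth.
decimal_total = sum([Decimal("0.1")] * 10)

print(binary_total == 1.0)              # False
print(decimal_total == Decimal("1.0"))  # True
```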
Use fixed-point numbers? You know, in financial apps you never store things as floating point; use cents or thousandths of a dollar instead!
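A minimal sketch of the integer-cents approach in Python. `parse_cents` and `format_cents` are hypothetical helpers written for this example; they handle only non-negative amounts with at most two decimal places.

```python
def parse_cents(amount: str) -> int:
    """Convert a non-negative string like "19.99" to integer cents (hypothetical helper)."""
    dollars, _, frac = amount.partition(".")
    if len(frac) > 2:
        raise ValueError("more than two decimal places")
    return int(dollars) * 100 + int(frac.ljust(2, "0"))


def format_cents(cents: int) -> str:
    """Format non-negative integer cents back to a dollar string."""
    return f"{cents // 100}.{cents % 100:02d}"


prices = ["0.10", "0.20", "19.99"]
total = sum(parse_cents(p) for p in prices)  # pure integer addition, no rounding
print(format_cents(total))                   # 20.29

print(0.10 + 0.20 == 0.30)                                             # False
print(parse_cents("0.10") + parse_cents("0.20") == parse_cents("0.30"))  # True
```

The decimal point is handled only at input and output; everything in between is exact integer arithmetic.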
Computers don't suck at math; those programmers do. You can get arbitrary-precision arithmetic even on 8-bit processors, and most of the time compilers will figure out everything for you just fine. If you really have to use 24-bit counters with 0.1 s precision, you *know* that your timer will wrap around every 466 hours, so just issue a warning to reboot every 10 days, or auto-reboot when it overflows.
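The 466-hour figure follows directly from the counter width. A quick Python check, assuming an unsigned 24-bit counter that wraps to zero after 2^24 ticks:

```python
COUNTER_BITS = 24
TICKS_PER_SECOND = 10  # one tick every 0.1 s

max_ticks = 2**COUNTER_BITS            # the counter wraps back to 0 here
wrap_seconds = max_ticks / TICKS_PER_SECOND
wrap_hours = wrap_seconds / 3600

print(f"{wrap_hours:.1f} hours ({wrap_hours / 24:.1f} days)")  # 466.0 hours (19.4 days)
```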
Translation: computers are only as smart as the people programming them... and there are plenty of stupid people out there.
We knew this. This is no great revelation. So why is this news?
"All great wisdom is contained in .signature files"
All they had to do is use integers, where a value of 1 represents 0.1 s.
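A sketch of the difference in Python, assuming a 10 Hz tick. Python floats are 64-bit doubles, so the drift here is far smaller than in a 24-bit register; the point is only that repeatedly adding 0.1 drifts while an integer tick count does not.

```python
TICKS = 3_600_000  # about 100 hours at 10 Hz

# Accumulating a binary approximation of 0.1 on every tick lets error build up.
float_seconds = 0.0
for _ in range(TICKS):
    float_seconds += 0.1

# Counting whole ticks is exact; convert to seconds only when needed.
ticks = 0
for _ in range(TICKS):
    ticks += 1
seconds, tenths = divmod(ticks, 10)

print(float_seconds)    # close to, but not exactly, 360000.0
print(seconds, tenths)  # 360000 0
```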
You know it makes sense, a little reminder from jointm1k.
The problem seems to come right out of the textbook for "Practical Analysis" (not sure if this is the correct translation of the German "Praktische Analysis"). This was a mandatory course for every computer science degree during my university time (20 years ago). I don't know if this is still the case. It was an eye opener to see how correct formulas and a perfectly working computer could yield absurd results. Several times I was asked for help by people claiming their Excel was broken due to such mistakes.
CU, Martin
I actually read about this specific incident once; I seem to remember (though honestly I'm not sure) that the design flaw was known and the user manual indicated that the computer needed to be reset every 36 hours. However, in wartime, under attack (there were frequent Scud intercepts), the crew controlling the missile battery opted against shutting it down even for a short time. Maybe even though the manual said it SHOULD be rebooted, it did not explain WHY or what the consequences would be.
There certainly are cases of bad math in computers, particularly Intel computers. But this isn't such an example. This is just a lazy and stupid programmer who didn't understand what he was really doing who should take the blame for the failure that killed people, not the computer.
I'm an American. I love this country and the freedoms that we used to have.
I remember this from a numerical methods class in the 1980s. To deal with situations like this, you can do one of three things:
a) Compute the value you need directly as a function of t each time, so you don't get accumulated error.
b) Have enough bits so that error won't be an issue. This is actually hard to do because floating point errors do stack up pretty quick if you are not careful.
c) Or you can keep an error term that you use to make adjustments along the way to account for the lack of precision. Bresenham's line algorithm does more or less exactly that when it draws lines. That's why you got "stair stepping" as the algorithm corrected itself along the way.
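Option c) can be sketched in Python. Everything here is an illustrative assumption: time is kept in hypothetical binary units of 1/1024 s, so each 0.1 s tick is 102.4 units. Each tick adds the whole 102 units and accumulates the leftover 4/10 of a unit in an integer error term. Whenever the error term reaches a whole unit, one unit is carried, which is the same correction idea Bresenham's algorithm uses.

```python
UNITS_PER_SECOND = 1024  # hypothetical binary time unit: 1/1024 s
TICKS_PER_SECOND = 10    # one tick every 0.1 s

# Each tick is 1024/10 = 102.4 units: 102 whole units plus 4/10 of a unit.
WHOLE, REMAINDER = divmod(UNITS_PER_SECOND, TICKS_PER_SECOND)


def elapsed_units(n_ticks: int) -> tuple[int, int]:
    """Return (time in units, error term in tenths of a unit) after n_ticks."""
    units = 0
    error = 0
    for _ in range(n_ticks):
        units += WHOLE
        error += REMAINDER
        if error >= TICKS_PER_SECOND:  # a whole unit has accumulated: carry it
            error -= TICKS_PER_SECOND
            units += 1
    return units, error


naive = WHOLE * 3_600_000               # dropping the remainder every tick
corrected, err = elapsed_units(3_600_000)
print(naive, corrected, err)            # 367200000 368640000 0
```

Dropping the remainder (`naive`) loses 1,440,000 units, about 1406 s, over 100 hours; the error term keeps the total exact because `units * 10 + error` always equals `n_ticks * 1024`.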
If the OP was correct, then PATRIOT failed because it did none of them. My bet is that in reality they simply underestimated the actual error term but did everything else correctly. This could be because of discrepancies in flight-control instrumentation or some sensor, or they were simply trying to save money on bits and didn't really calculate how far off the missile could be, given the error term, over a given number of seconds of flight at a particular phase of its flight profile.
Bottom line is, the engineering discipline exists to solve this problem, and it is really no different from error handling in any guidance system. Putting a man on the moon, launching an ICBM at a target, shooting down a missile: these are all essentially the same computer science problem from an error-management perspective. The PhDs already nailed this decades ago. There's no fundamental limitation of computing here, merely a failure or inability of the engineers on this project to apply the known correct answer to this problem.
This is my sig.
This is not an example of computers sucking at math.
This is an example of engineers and developers failing to draw up valid requirements, failing to develop to specification, and failing to test against real-world use cases.
Management undoubtedly shares an equal if not greater portion of the blame here. This is typical military-industrial complex, lowest-bidder contractor mentality at work, just another form of corporate welfare if the government doesn't turn around and punish shortfalls like this.
Because military computers are 20 years out of date to start with. Heck, even the awesome modern Land Warrior hardware is 10 years out of date. Heck, they could probably shave 5 pounds off the hardware by using modern chips and displays.
Mil-spec is only good at rugged. When it comes to being up to date with the best, it's far behind.
i thought once I was found, but it was only a dream.
You're right. Just as the failure of Samuel Langley's aircraft demonstrated that man would never fly, the fact that an anti-aircraft missile destroyed only half of the ballistic missiles (targets moving at what, twice the speed of the targets it was designed to destroy?) demonstrates that ABMs will never work.
This next song is very sad. Please clap along. -- Robin Zander
The article contains some interesting examples, but all of them have been in programming texts and courses for years. I'm not really sure why it's on /.
So if this is the future...where's my jet pack?
Look, you guys can talk trash all you want, but when you say this:
>>Patriot defense system failing to take down a Scud missile attack
You're just lying to yourself. The Patriots defense is awesome this year. I mean, was there really ANY point for the Titans offense to show up a couple of weeks back?
And the Scuds? C'mon, man. They let go of their best man two seasons ago. The QB can't hit the broadside of a barn, and their entire wide-receiver corps has Jello hands anyway. The missile attack is a gadget play, pure and simple. Belichick sees right through that and you know it.
Haters need to stop all the hatin' and get on the Pats bus!!!!! GO PATRIOTS!
If Nalgene water bottles are outlawed, only outlaws will have Nalgene water bottles.
Each battery has overlapping coverage with its nearest neighbors. A proper deployment has overlapping fields of fire in both depth and breadth. Surface-to-air missile defense involves multiple layers of different systems, each specializing in different ranges: short range (things like Stinger), medium range (things like HAWK) and long range (things like Patriot). A proper tactical deployment never relies upon a single battery to provide the sole coverage. The problem here was primarily one of tactical deployment. The technical issues can be argued, but the real failure was a failure to deploy in a tactically correct fashion. They sent a battery or two as a "show of force", probably overriding the tactical expertise of the officers involved for political expediency. You have jackasses like Rumsfeld and Cheney (and their ilk) making military tactical decisions when they are not qualified to do so. The REAL failure here is one of politics.
Over-the-top Response Guy! Giving "Over-the-Top Responses" since 1970.
Say what? Citations please. Methinks one of those 2.0 values isn't really 2.0. Hint: printing a value isn't a good way to get its actual value, because the printing function most likely rounds it to fewer digits than it's actually stored with.
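A quick Python illustration of the hint: a formatted value can look exact while the stored value is not. The `0.1 * 3` example is chosen for illustration only; the 2.0 values referred to are not shown in the thread.

```python
from decimal import Decimal

x = 0.1 * 3

print(f"{x:.1f}")  # 0.3   (formatting rounds to one decimal place)
print(x == 0.3)    # False (the stored value is not 0.3)
print(repr(x))     # 0.30000000000000004
print(Decimal(x))  # exact decimal expansion of the stored binary value
```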
There's no way a real-time missile tracking system is going to be dealing with time at an accuracy of 0.1 sec.
A Patriot missile travels at about Mach 3 (~1000 m/sec), so a rounding error of 0.05 sec, even without any error accumulation, means you'd be off by 50 m in position.
Who knows what the real story is vs the garbage that was reported, but even if there was a cumulative error that's the fault of the programmer rather than a lack of a computers ability to do math. You do your error analysis and use whatever accuracy needed to keep the errors in a tolerable range.
The part about the system running for 100 hours was pure gibberish. Yes, we can all divide that by 0.1 sec, but what on earth does that have to do with a real-time system tracking a target that was acquired a few minutes ago?!
A better title for the story rather than "computers can't do math" would be "we can't do tech reporting".
It's the reporting that's garbage. It makes no sense at all. A system tracking missiles travelling at Mach 3 is keeping track of time to 0.1 sec accuracy?! Do you really believe that? Wanna buy a bridge?
0.1 sec at Mach 3 is 100 m, so you wouldn't have a hope in hell of ever hitting a 3 m long target.
The problem isn't the people working for the defence company, who are hard-core PhDs with some very serious domain knowledge. The problem is people like yourself who are so math illiterate as not to be able to fact check a piece-of-shit story!
We had a similar problem with an Aegis design, and it was a major headache for us hardware engineers to try to convince the Systems Engineers that counting in binary time was more logical than counting in 0.1-second increments. The SEs kept insisting that their computers at home accurately count in seconds, so we hardware engineers should be able to as well.
And the systems engineers would have been right. The error was not about counting in 0.1 second increments versus 1 second increments or whatever, but it was in using floating point representation where fixed point (basically, scaled integer) would have been more appropriate.
And come to think of it, that is more or less what most desktop and server OSes do: they count number of milli, micro, or nanoseconds, and store that as an integer.
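In Python, for example, `time.monotonic_ns()` (Python 3.7+) returns such an integer nanosecond count. A minimal sketch that keeps elapsed time as an integer and converts only for display; `elapsed_ns` is a hypothetical helper:

```python
import time


def elapsed_ns(start_ns: int) -> int:
    """Integer nanoseconds since start_ns on the monotonic clock (hypothetical helper)."""
    return time.monotonic_ns() - start_ns


start = time.monotonic_ns()  # an int, not a float
time.sleep(0.05)
dt = elapsed_ns(start)
print(dt // 1_000_000, "ms")  # integer arithmetic; convert only for display
```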
Similar issue arises in finance: you don't encode dollar amounts as floating point. Instead you store number of cents (or mils) as integer. Every programmer of financial software knows about this (... or should know about this...)
Floating point is really only appropriate to represent values which are not known precisely anyway (measurement results), where the little additional rounding error wouldn't matter. For all else, use fixed point.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
I know that I'm arguing with a trolling AC, but for the other readers of slashdot, you should know that the grandparent's post refers to the controversy regarding the analysis of the Patriot system during the first Gulf war. There was a huge propaganda machine behind the Patriot's "successes" which turned out to be very near zero indeed. This was covered in a series of hearings in the early 90's...
http://www.fas.org/spp/starwars/docops/pl920908.htm
You can also read up on this from transcripts from the hearings after the war.
In the interests of fairness, here is a rebuttal / review.
http://www.fas.org/spp/starwars/docops/zimmerman.htm
I remain unconvinced -- from reading this (almost 20 years ago) I concluded that at best, the military did not know for sure that these worked well.
Slashdotter, ID #101. UIDs are in binary, right?
Actually the main purpose is a cost plus fixed profits contract for the weapons manufacturer. Even if no one ever dies on either side of the gun, it's still a success to them.
I could see designing the system to synchronize both launch times and observations with a timer tick (it wouldn't be surprising if the whole system was driven by the timer interrupt), and then you're not going to have an error due to the spacing between ticks.
I am a bit more dubious about the 24-bit thing, though. Was it fixed point or floating point?
I don't think it was a float. What would that be? Maybe 16 bit mantissa, 1 bit sign and 7 bit exponent would seem to be the likeliest bet for a 24 bit float. If so, then after about two hours doing t += 0.1 would stop changing t, and the error would be much bigger.
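The stalling effect can be demonstrated in Python, with one substitution: the standard `struct` module can round to IEEE 754 half precision (format `'e'`, an 11-bit significand) but not to the hypothetical 24-bit format above, so half precision stands in for it. With fewer significand bits the stall comes much sooner, at t = 256 s instead of after about two hours.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


step = to_half(0.1)
t = 0.0
ticks = 0
while True:
    new_t = to_half(t + step)  # t += 0.1, rounded to half precision
    if new_t == t:             # the increment is below half an ulp of t
        break
    t = new_t
    ticks += 1

print(t, ticks)  # t stalls at 256.0
```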
So presumably it was fixed point. But if you're doing it in fixed point, instead of storing x, you store nx in an int, for some appropriate scaling factor n. And if you're going to do that, surely you'll choose n in a smart way, and in this case the obvious choice, as pointed out by many posters, is n=10. This is the obvious choice not only because it gets you more precision, but because the easiest, most obvious and most standard way of coding timers is to just increment a register with each tick. It would be silly, for instance, to let n=2^8 and then increment a register by 0.1*2^8 = 25.6, rounded to 26 (1Ah). It would be a very unlikely assembly language programmer who would have put an add reg,1Ah opcode in interrupt handler code when inc reg would have worked.
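A small Python comparison of the two scaling factors, assuming 100 hours of 10 Hz ticks and the hypothetical n = 2^8 register whose per-tick increment of 25.6 has to be rounded to 26:

```python
TICKS = 3_600_000  # about 100 hours at 10 Hz

# Scale factor n = 10: one count per tick, so the register holds tenths of a second exactly.
counter_n10 = TICKS
seconds_n10 = counter_n10 / 10

# Hypothetical scale factor n = 2**8: each tick should add 0.1 * 256 = 25.6,
# which an integer register has to round to 26.
counter_n256 = 26 * TICKS
seconds_n256 = counter_n256 / 2**8

print(seconds_n10, seconds_n256)  # 360000.0 365625.0
```

With n = 10 the count is exact; with n = 2^8 the rounded increment overstates elapsed time by 5625 s over 100 hours.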
Now maybe at some point the timer value would get converted to a float for computations. But that surely wouldn't be a 24-bit float.
So maybe the article has mangled things and it was not a 24-bit register but a 32-bit float with a 24-bit mantissa (IEEE single precision stores 23 bits plus an implicit leading 1), an 8-bit exponent and a sign bit, and the "24" in the article came from the mantissa. That's a much more realistic choice. Still, the standard way to handle timers is to just increment a timer variable. So what I could see happening is this. There is a timer system variable t at full 0.1-second precision, incremented on each interrupt. (That's how PCs used to work, and maybe still do, except the timer resolution was about 1/18.2 sec.) Then for their launch calculations, they do (float32)t / 10. And now they're going to get nasty roundoff errors as the mantissa fills up. At the 36-hour point, t is already 21 bits long. So when you do a float divide by 10, you'll certainly have roundoff problems. But you're still not going to be more than one tick (0.1 sec) off, because each tick still adjusts the mantissa, while the article says they were 0.34 seconds off.
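The bound above can be checked in Python, assuming IEEE 754 single precision (which the original hardware may well not have used). `struct`'s `'f'` format rounds a double to the nearest float32, and dividing two float32 values in double precision and then rounding once gives the correctly rounded float32 quotient. The tick counts are arbitrary values near the 36- and 100-hour marks.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest IEEE 754 float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


errors = {}
for ticks in (36 * 3600 * 10 + 3, 100 * 3600 * 10 + 3):
    exact = ticks / 10                                     # seconds, in double precision
    approx = to_f32(to_f32(float(ticks)) / to_f32(10.0))  # emulates (float32)t / 10
    errors[ticks] = abs(approx - exact)
    print(ticks, ticks.bit_length(), errors[ticks])

# Each conversion error stays well under one 0.1 s tick and does not accumulate.
```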
So I think something got mangled in the article. Or we had a really unlikely assembly language programmer who had floating-point code executed on every tick of a timer interrupt. But even if the interrupt is only at 10 Hz, that's completely contrary to the instincts of an assembly language programmer. And this would have been done back in the heyday of assembly language programming, when one would try to optimize every clock cycle one could. (And, yes, I've worked with timer interrupt handlers, both on the Z80 and the 8086.)
FTFA:
"So computers might suck at maths, but there's always a solution available to circumvent their inherent weaknesses. And in that case, it's probably more accurate to say that computer programmers suck at maths - or at least some of them do."
Thank you, come again.
So in a system that should have clocks synchronized to less than a microsecond, nobody bothered to run "ntpdate" even once in a hundred hours?
Yes, obviously they just needed to ssh into their patriot missile air defense system, edit a few lines in /etc/inet/ntp.conf and svcadm restart ntp.
The obvious problem in the article, if you read it, is computer's finite precision, and how it is dealt with. By 'computer', the author could have easily included the system libraries that are actually doing all the rounding and overflows instead of implementing arbitrary precision in software.
Everyone defending the way 'computers' is used in this article, and conflating it with 'processor' is a complete idiot.
We even have a modern analog for this - the shift-lock key.
Oh, and the Scud hunting in Gulf One was largely an air exercise, as I recall, and of course they went after the launchers. It's always preferable to destroy the enemy on the ground (or in harbor, or asleep in barracks) than when they're incoming. The Japanese didn't bomb Pearl Harbor because it's impractical to sink ships at sea; it's just easier to hit slow- or non-moving targets.
I'd just like to point out here that the 28 people were not killed by the failure of the intercept system. They were killed by the nice folks who launched the missile in the first place.
If at first you don't succeed, destroy all evidence that you tried.
Crap like this was alive and well when I was in uni, and it's still alive and well.
Witness: Limits to Growth written by Meadows et al: http://en.wikipedia.org/wiki/The_Limits_to_Growth
Consider that the book was written in 1972. I was programming computers in 1972. I actually did a course in numerical analysis in 1972 and just re-read the first 10 pages or so. I happen to have read a master's thesis that came out of the Colorado School of Mines where the author stated that Meadows' Runge-Kutta numerical integrations did not converge.
Yet that book is still often quoted. It's been flawed from the get-go. So consider something else! How fast were the machines that Meadows used? How big? What would be the MOST SOPHISTICATED model he could use at the time? How could _anyone_ take seriously predictions made by a primitive model run on such a machine?
Witness: the current discussion about Global Warming and Climate Change. The change in CO2 over the last 100 years is about 100 ppm, if you can believe the data. This is 100/1,000,000 = 0.0001. Now the thing is this. A 32-bit float holds about 7.2 decimal digits of precision. Let's call it 7 digits. If one were to add a whole number of some kind to the fractional change of the CO2 as measured relative to the total gases in the atmosphere, then one has 7-4 = 3 digits or less to work with.
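A sketch of the effect in Python, using `struct`'s `'f'` format to round to IEEE 754 single precision. The quantities are illustrative (a whole number plus a change of 100 parts per million), not real atmospheric data.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest IEEE 754 float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


base = 1.0     # a "whole number" quantity
change = 1e-4  # a change of 100 parts per million

stored = to_f32(base + change)     # the sum as a 32-bit float would hold it
recovered_change = stored - base   # what is left of the change
relative_error = abs(recovered_change - change) / change
print(recovered_change, relative_error)
```

The change survives with a relative error of roughly 1.7e-4, i.e. only three to four significant digits, in line with the 7 - 4 = 3 digit estimate.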
Of course one can use a double precision float. That isn't my point. One has to be an EXPERT in order to avoid huge problems with propagating rounding errors.
It's not just about pretending computers use base 10 when they don't; it's about knowing the actual properties of a number of type float and what the consequences are when we use it.
In the case of that rocket I suspect the rounding error can be solved by normalizing everything so the time line is not in seconds but is actually in clock ticks... as accurately as they can be determined of course.
But in my career I have seen so few programmers who can do this that I've never even needed to look at a finger or a toe for something to count on. Nada - never met one.
I'll give another example. More than one project team that I worked with had no idea how floats even work! To sit there and try to use floats for their Accounts Payable and Accounts Receivable and then say they can't understand why nothing will balance? Arrghh! IMHO it's downright incompetence. They needed to use COBOL's decimal types (e.g. packed-decimal COMP-3), which are base 10, or normalize all their money into pennies and handle the decimal point when the data was read in and printed.
LISP, Scheme, Haskell, Mathematica, Maple, and plenty of other languages support arbitrary-precision rational numbers as built-in types. This fixes all rounding errors involving rational numbers (i.e. fractions). If irrational numbers like pi, e, or transcendental functions are necessary, then there will always be inherent error in the representation, and the programmer has to know how to deal with that error and calculate the expected error of a sequence of operations. If you want to get fancy, you can use an algebraic language like Mathematica to symbolically solve your equations and maintain perfect accuracy with symbolic representations of irrational and transcendental numbers.
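Python is not in that list, but its standard-library `fractions` module provides the same idea as a library type rather than a built-in one. A short sketch, including the limit mentioned above for irrational numbers:

```python
from fractions import Fraction

tenth = Fraction(1, 10)
print(sum([tenth] * 10) == 1)  # True: rationals are exact
print(sum([0.1] * 10) == 1)    # False: binary floats are not

# Exactness only holds for rationals; an irrational like sqrt(2) must still be approximated.
root2 = Fraction(2 ** 0.5)     # exact value of the nearest double, not of sqrt(2)
print(root2 * root2 == 2)      # False
```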
While I agree that the design decisions which led to this were poorly made, this error was common knowledge.
The Patriot system _must_ be restarted every X days, exactly due to this bug. This is documented and everything.
While the initial error was with the people who created the Patriot system, the soldiers who were assigned to the system were the ones who made sure that a documented bug with a known-good work-around became a loss of life.