Origin of Quake3's Fast InvSqrt()
geo writes "Beyond3D.com's Ryszard Sommefeldt dons his seersucker hunting jacket and meerschaum pipe to take on his secret identity as graphics code sleuth extraordinaire. In today's thrilling installment, the origins of one of the more famous snippets of graphics code in recent years is under the microscope — Quake3's Fast InvSqrt(), which has been known to cause strong geeks to go wobbly in the knees while contemplating its simple beauty and power." From the article: ""
"English motherfucker, do you speak it?" Anyone care to explain what that function does?
Only the State obtains its revenue by coercion. - Murray Rothbard
From the words of John Carmack himself, if you discover and implement an algorithm by yourself, even though it may have already been discovered already, you deserve credit for finding it on your own.
[Insert rant about software patents]
I have a truly marvelous proof of who wrote this code which this comment box is too narrow to contain.
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
Anyone who has ever played a computer game should pay up.
The linked site seems to be down (gee, you think it might be slashdotted?), but this paper seems to be covering the same topic.
See what I've been reading.
Can someone enlighten me?
-dave
http://millionnumbers.com/ - own the number of your dreams
Naah, just kidding. They both deserve a spot in the Clever Hacks Hall of Fame
Suddenly I'm very scared that Ballmer will want to InvSqrt me some pictures of his kids.
The inverse of the square root is the square. The reciprocal of the square root is what is being calculated. In the case of multiplication I guess it's a reasonable name but it's pretty poor form in my view.
http://www.icarusindie.com/DoItYourSelf/rtsr/index .php?section=ffi&page=sqrt
That page compares the time it takes to calculate the sqrt various ways including Carmacks. Short version is that modern processors are significantly faster since it can be done in hardware. It may still be useful in cases where the processor doesn't have the sqrt function available.
His version took 428 cycles compared to 107 cycles doing it in hardware on the same system.
Work Safe Porn
Here's a more interesting question: "How does this function compute 1/x^(1/2)?" ;)
I'm asking, because I have no idea how it works.
Introduction
Note!
This article is a republishing of something I had up on my personal website a year or so ago before I joined Beyond3D, which is itself the culmination of an investigation started in April 2004. So if timeframes appear a little wonky, it's entirely on purpose! One for the geeks, enjoy.
Origin of Quake3's Fast InvSqrt()
To most folks the following bit of C code, found in a few places in the recently released Quake3 source code, won't mean much. To the Beyond3D crowd it might ring a bell or two. It might even make some sense.
InvSqrt()
Finding the inverse square root of a number has many applications in 3D graphics, not least of all the normalisation of 3D vectors. Without something like the nrm instruction in a modern fragment processor where you can get normalisation of an fp16 3-channel vector for free on certain NVIDIA hardware if you're (or the compiler is!) careful, or if you need to do it outside of a shader program for whatever reason, inverse square root is your friend. Most of you will know that you can calculate a square root using Newton-Raphson iteration and essentially that's what the code above does, but with a twist.
How the code works
The magic of the code, even if you can't follow it, stands out as the i = 0x5f3759df - (i>>1); line. Simplified, Newton-Raphson is an approximation that starts off with a guess and refines it with iteration. Taking advantage of the nature of 32-bit x86 processors, i, an integer, is initially set to the value of the floating point number you want to take the inverse square of, using an integer cast. i is then set to 0x5f3759df, minus itself shifted one bit to the right. The right shift drops the least significant bit of i, essentially halving it.
Using the integer cast of the seeded value, i is reused and the initial guess for Newton is calculated using the magic seed value minus a free divide by 2 courtesy of the CPU.
But why that constant to start the guessing game? Chris Lomont wrote a paper analysing it while at Purdue in 2003. He'd seen the code on the gamedev.net forums and that's probably also where DemoCoder saw it before commenting in the first NV40 Doom3 thread on B3D. Chris's analysis for his paper explains it for those interested in the base math behind the implementation. Suffice to say the constant used to start the Newton iteration is a very clever one. The paper's summary wonders who wrote it and whether they got there by guessing or derivation.
So who did write it? John Carmack?
While discussing NV40's render path in the Doom3 engine as mentioned previously, the code was brought up and attributed to John Carmack; and he's the obvious choice since it appears in the source for one of his engines. Michael Abrash was mooted as a possible author too. Michael stands up here as x86 assembly optimiser extraordinaire, author of the legendary Zen of Assembly Language and Zen of Graphics Programming tomes, and employee of id during Quake's development where he worked alongside Carmack on optimising Quake's software renderer for the CPUs around at the time.
Asking John whether it was him or Michael returned a "not quite".
-----Original Message-----
From: John Carmack
Sent: 26 April 2004 19:51
Subject: Re: Origin of fast approximated inverse square root
At 06:38 PM 4/26/2004 +0100, you wrote:
>Hi John,
>
>There's a discussion on Beyond3D.com's forums about who the author of
>the following is:
>
>float InvSqrt (float x){
> float xhalf = 0.5f*x;
> int i = *(int*)
> i = 0x5f3759df - (i>>1);
> x = *(float*)
> x = x*(1.5f - xhalf*x*x);
> return x;
>}
>
>Is that something we can attribute to you? Analysis shows it to be
>extremely clever in its method and supposedly from the Q3 source.
>Most people say it's your work, a few say it's Michael Abrash's. Do
>you know who's responsible, possibly with a history of sorts?
Not me,
...the work of John Romero. Apparently he was going to call it the MakeYouMyBitch() function. ;P
-"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o
Well, you're sort of right.
f^-1(x) = x^2 is the inverse of f(x) = sqrt(x) where x > 0
GAAH! MY PRINTER IS ON FIRE!!! PUT IT OUT! PUT IT OUT!
Seriously, try looking away from the genius who obviously wrote it.
- There is no single comment which would make reading and understanding what happens here much easier!
- Introduction of a magic number with no explanation whatsoever
- Magic pointer arithmetics without demystification
- Portability? Abuse of a single processor architecture, without warning that this would not work on non-x86
I know it is good code. But it is simply bad code!float InvSqrt(float x) { float xhalf = 0.5f*x; int i = *(int*) i = 0x5f3759df - (i >> 1); x = *(float*) x = x*(1.5f - xhalf*x*x); return x; } Asking John whether it was him or Michael returned a "not quite". But it's supposed to return a float!
I will forever be a student.
They're computing the inverse square root, usually written as something like x^(-2) (which is the same as 1/sqrt(x) of course) via, umm, err, magic :-)
... that's one of the crazy parts the PDF analyzes). Then it changes the int-float back into a plain float and does one step of Newton's method using the guess derived from the magic part.
:-) Read the PDF linked in the article if you want someone who still knows what they're talking about to explain it; I'm just trying my best to remember anything after all these years. I swear that analysis class gave me some Pavlovian instinct to shudder in dread whenever I see them pulling out the strange substitutions and Taylor expansions so they can calculate the error...
Seriously, they're using some evil numerical techniques that approximate the actual value of that function to withing a few percent very quickly. You could do more steps of Newton's root finding method or other complex things, but that would be slow and this is only meant to find things that are "good enough" for you to draw on someone's screen.
This function is used in graphics programming, specifically in things like the Quake 3 source code (as well as other places). First, it calculates x/2, then it gets the bits of the float x as an integer (without modifying them), then it does the "magic" step with the magic number they calculated via evil mathematics to give the best starting point they can for Newton's method using the x >> 1 to divide the float-as-integer in two (and make it positive, since your sign bit falls into the exponent of the float!
Don't mind me. I may have a degree in mathematics, and I even (somehow!) made it through at least one class on numerical analysis, but all I can tell you is that this stuff is evil
HAKMEM is classic bathroom reading for hackers. If you want to do it up old-school, print a copy from original scans, double-sided.
I wish there was an option to mod some of these posts "+1 Whoa".
I believe you meant to say x^(-1/2)
Too bad the people modding you up don't have math degrees. =P
:(){
Here's an old version of one of my webpages:
c ities.com/SiliconValley/9498/sqroot.html
http://web.archive.org/web/19990210111728/www.geo
And here's an updated version of the same page:
http://www.azillionmonkeys.com/qed/sqroot.html
It isn't an exact rendering of the code in question, but it explains enough for any skilled hacker to 1) understand what's going on and 2) modify the code to create the resulting code that's in the Quake 3 source. Furthermore this web page has existed since about 1997 (archive.org doesn't go back that far for some reason.)
Now *IF* in fact the code origin comes from someone who took ideas from my site, I should point out that *I* am not the originator of the idea either (though I did write the relevant code). Bruce Holloway (who I credit on the page) was the first person to point out this technique to me at around the 1997 timeframe (prior to this, I created my own method which is similar, but not really as fast). (Vesa Karvonen informed by of the technique (through a code snippet with no explanation) at roughly the same time as well.) It was a technique well known to hard core 3D accelerator and CPU implementors, and follows an intentional design idea from the IEEE-754 specification.
Prof. William Kahan, one of the key people who specified the IEEE-754 standard (the standard for floating point the many CPUs use, starting with Intel's 8087 coprocessor) apparently presented this idea, and is the source for where Bruce Holloway got the idea. The IEEE-754 standard came out around the 1982 time frame. Though, its very likely that these ideas originate from even earlier in computing history.
This is John Carmack we're talking about. I imagine he knows how to use a code profiler. The approximations it involve are likely to cause bugs. Floating point always involves approximations. If you don't like them, use a better approximation (double precision). But if you're just trying to, say, figure out which pixel to color, and you approximate the pixel to a few decimal places...I think you're good to go.
I mean, it's not like they're running scientific simulations with this. It's Quake3. Furthermore, converting floating-points to integers and then doing calculations on them is a trick that has been known for years, long before Quake3. They weren't converting floats to ints. They manipulated the bitwise representation of the float. I'm not sure if that's what you meant, though.
:(){
It starts by taking a guess at the right answer, and then improving the guess until it's accurate enough to use.
The first step depends heavily on the fact that a floating point number on a computer is represented as a significand (aka mantissa) and an exponent (a power of two). For the moment, consider taking just the square root of X instead of its inverse. You could separate out the exponent part of the floating point number, divide it by two, and then put the result back together with the original significand, and have a reasonable starting point.
From there, you could improve your guesses to get a better approximation. The simplest version of that would be like a high-low game -- you split the difference between the current guess and the previous guess, and then add or subtract that depending on whether your previous guess was high or low. Eventually, you'll get arbitrarily close to the correct answer.
This can take quite a few iterations to get to the right answer though. To improve that, Newton-Raphson looks at the curve of the function you're working with, and projects a line tangent to the curve at the point of the current guess. Where that line crosses the origin gives you the next guess. That's probably a lot easier to understand from picture.
In this case, we're looking for the inverse square root, which changes the curve, but not the basic idea. As a general rule, the closer your first guess, the fewer iterations you need to get some particular level of accuracy. That's the point of the:
While the originator of this constant is unknown, and some of it is rather obscure, the basic idea of most of it is fairly simple: we start by shifting the original number right a bit. This divides both the mantissa and the exponent part by two, with the possibility that IF the exponent was odd, it shifts a bit from the exponent into the mantissa. The subtraction from the magic number then does a couple of things. For one thing, if a bit from the exponent was shifted into the mantissa, it removes it. The rest of the subtraction is trickier. If memory serves, it's based on the harmonic mean of the difference between sqrt(x) and (x/2) for every possible floating point number of the size you're using.
This is where the fact that it's 1/sqrt(x) instead of sqrt(x) means a lot: 1/sqrt(x) is a curve, but it's a fairly flat curve -- much flatter than sqrt(x). The result is that we can approximate a point on the curve fairly accurately with a line. In this case, it's really two lines, which gets it a bit closer still.
From there, the number has had a bit of extra tweaking done -- it doesn't actually give the most accurate first guess, but its errors are often enough in the opposite direction from those you get in the Newton-Raphson iteration steps that it gives slightly more accurate final results.
The universe is a figment of its own imagination.
(I'm (sorry))
You don't (happen to (program (lisp))) do you?
The algorithm is simple Newton-Raphson -- make a good initial guess, then iterate making the guess better. I think Newton-Raphson on 1/sqrt picks up 5-6 bits each try in the line "x = x*(1.5f - xhalf*x*x)". It can be repeated to get a more accurate result each time it's repeated.
The problem with Newton-Raphson is making a good first guess--otherwise, you need an extra iteration or two. And that's what the magic number is doing, making a good first guess.
So let's work out what a good first guess would look like for 1/sqrt(f), to see where this code came from.
Floating Point numbers are stored with a mantissa and an exponent: f = mantissa * (2 ^ exponent), where exponent is 8-bits wide and the mantissa is 23-bits wide.
Let's take an example: 1/sqrt(16) would have f = 1.0 * (2 ^ 4). We want the result 0.25 which is f = 1.0 * ( 2 ^ -2).
So our first guess should take our exponent, negate it, and cut it in half. (Try more examples to see that this works--it's basically the definition of 1/sqrt(f)). We'll ignore the mantissa--if we can just get within a factor of 2 of the answer in one step, we're doing pretty well.
Unfortunately, the exponent is stored in FP numbers in an offset format. In memory, The mantissa is in the low 23 bits, and the most-significant bit is the sign (which will be 0 if we're taking roots). For now, let's just assume we have our exponents as 8-bit values, to work out what we need to do with the +127 offset.
We want new_actual_exp = -(actual_exp)/2. But in memory, exp = (actual_exp + 127). Or, actual_exp = exp - 127.
Substituting gives (new_exp - 127) = -(exp - 127)/2. Simplify this to: new_exp = 127 - (exp - 127)/2 => new_exp = 3*127/2 - (exp / 2).
Now the exponent is shifted 23 places in memory, so let's write out our code (and ignore the mantissa completely for now...): rewriting as hex: Well, first, it's arguable whether it should be 0x5f000000 or 0x5f400000 (The "4" is actually in the mantissa). I'm guessing resolving that dilemma led to the original author discovering that choosing a particular pattern of bits in the mantissa can help make the initial guess even more accurate, leading to the 0x5f3759df constant.
I haven't worked it out, but Chris Lomont http://www.lomont.org/Math/Papers/2003/InvSqrt.pd
I think I have found a similar solution for the reciprocal of the FOURTH root. The code looks almost the same, but it requires a new constant and you shift the bits 2 places instead of just 1. float Inv4thrt(float x) { float x0 = x; int i = *(int*) // get bits for floating value
i = 0x4F541EEF - (i>>2); // gives initial guess y0
x = *(float*) // convert bits back to float
x = (x/4.0f)*(5 - x0 * x*x*x*x); // Newton step, repeating increases accuracy
return x;
}
Note that the constant I have here is not very carefully tuned so the error may be slightly larger than with the sqrt.
Pretty cool, I think.
I imagine they ran some profiling tools and found that they were spending a substantial amount of time in calls to sqrt(), and figured that the division was also time-consuming and for an operation that is performed on so many pixels, it was worthwhile to optimize this particular routine.
I've written enough 3D graphics code -- including a textured polygon rasterizer that would probably cry and try to delete itself if it saw something like Quake 3 -- to know that they didn't have to run a profiler to know that they'd be spending too much time doing 1/sqrt(x) if they didn't have a highly optimized routine for it. It's an inherent operation in so many 3D calculations it isn't funny, and while by the time Quake 3 came out hardware floating point units were pretty fast, FP divides and FP square roots were very lengthy operations that more importantly couldn't be pipelined.
But if you're just trying to, say, figure out which pixel to color, and you approximate the pixel to a few decimal places...I think you're good to go.
Yeah, pretty much. Back when I wrote my code (pentium days) you had an FP unit but it wasn't very good, so I used fixed point math (using integer instructions) which had sufficient precision for a 320x200 display. Getting enough performance out of the core algorithms was still hard enough that I had to take a lot of shortcuts, like instead of doing the right thing by using a divide every pixel to calculate which texel to use, I used a divide every 8 pixels and linearly interpolated in between. I'm sure that Quake (the contemporary 3D engine of the day which would also make my code cry) contained many more clever optimizations and approximations, because it wouldn't have been possible on the hardware of that day without them.
In fact, approximating FP values for 3D code is so common that the 3DNow and SSE instruction sets contain instructions that approximate the square root and inverse square root to about half of single-precision floating point. The non-approximation instruction uses a lookup table to get an initial guess, then uses a couple iterations of Newton's Method to refine the result. The short cut instructions simply return the value in the lookup table.
So yeah, basically the AC OP has no idea what he is talking about, and from that basis of ignorance is denigrating what is in reality a very clever and extremely useful hack.
The enemies of Democracy are
Attribution.
That is why most, if not all, software patents are bogus. Just because you reinvent something published by a PhD working in a committee that disbanded 10 years before you knew 'C' came after 'B' in the alphabet does not make you reinvention patent worthy. The history of invsqrt() crosses disciplines of hardware and software design, spec development, graphics and math theory. With such a fascinating function having a hard to track history it is no wonder that things like James W. Hunt's 1976 diff concept can be patented in 2006 by Microsoft. (As mentioned by QuantumG on slashdot.)
The trouble had in tracking the history of InvSqrt() is really sad. Computing is an industry that hypes how digital storage of information and perfect copies eliminates the isse of decay. While the ever expanding secondary storage (just now getting to really usefull) means oblivating the scaling issues with retaining the meta-information that denotes this very history. I guess the moral of the story is: sign your contributions. It won't take up that much space in the comments.
--Alan Perlis
And will the real Fast InvSqrt() author please stand up?
"You cannot have a General Will unless you have shared experiences. You cannot be fair to people you don't know."
Answer to 2)
The f after 0.5 and the 1.5 simply means "use the floating point version of this number". Multiplying two numbers which have different types (e.g. multiplying an int and a float) or doing assignment between different types means that one of the numbers will have to be automatically converted to the other's type and which conversion happens depends on the types of both numbers. By default, I think type of the literal 0.5 (without any letters after it) is a double. By adding the f (assuming x is a float) the types of the operands are the same so there is no implicit type conversion thus eliminating the chance of the wrong conversion happening automatically. You care about what type of multiply is done because certain types can lead to unwanted loss of precision.
As for 1)
Well if you iterate by one step you don't need a loop. Presumably most looping iteration happens because your first guess wasn't accurate enough. If your first guess WAS accurate enough why would you needlessly calculate more accuracy? I'm only speculating though - I don't know if this is really the case.
-- Did you try Tao3D? http://tao3d.sourceforge.net
In modern processors, lookups from memory are far too slow. This is especially true if you are doing lots of random lookups that might go outside the cache.
Performance-wise, you are much better doing several calculations than a single lookup.
The point is that pointer-based type punning isn't really allowed by the language definition. The optimizer can take advantage of that to emit simpler code that, in this case, produces the wrong answer. Using a union short-circuits that kind of optimization.
This isn't to say any particular Gcc produces wrong answers with the code as written, but if you complained about one that did, they would just laugh at you and (if they liked you enough) tell you to rewrite it like the above instead.