Origin of Quake3's Fast InvSqrt()

Posted by Zonk on Friday December 1, 2006 @07:20AM from the i-know-you-were-dying-inside-without-this dept.

geo writes "Beyond3D.com's Ryszard Sommefeldt dons his seersucker hunting jacket and meerschaum pipe to take on his secret identity as graphics code sleuth extraordinaire. In today's thrilling installment, the origins of one of the more famous snippets of graphics code in recent years are under the microscope: Quake3's Fast InvSqrt(), which has been known to cause strong geeks to go wobbly in the knees while contemplating its simple beauty and power."

It might be damn smart.. by kan0r · 2006-12-01 08:01 · Score: 5, Insightful
But the first thing I thought when I saw this was: "Damn, that code is a mess!"
Seriously, try looking away from the genius who obviously wrote it.
- There is not a single comment, even though one would make reading and understanding what happens here much easier!
- Introduction of a magic number with no explanation whatsoever
- Magic pointer arithmetic without demystification
- Portability? Abuse of a single processor architecture, without any warning that this would not work on non-x86
I know it is good code. But it is simply bad code!
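
For anyone wondering what a version that answers those four complaints might look like, here is a rough sketch, not the original Quake3 source. The names fast_inv_sqrt and INV_SQRT_MAGIC are made up for illustration, the constant 0x5f3759df is the one widely quoted from the released Quake3 code, and the whole thing assumes a 32-bit IEEE 754 float. It swaps the pointer cast for memcpy, which is my substitution and not how the original does it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name. The value is the widely quoted Quake3 constant. */
#define INV_SQRT_MAGIC 0x5f3759dfu

/* Assumption made explicit: float is a 32-bit IEEE 754 single. */
static_assert(sizeof(float) == sizeof(uint32_t), "needs a 32-bit float");

/* Approximates 1.0f / sqrtf(x) for positive, finite, normal x. */
float fast_inv_sqrt(float x)
{
    const float half_x = 0.5f * x;
    uint32_t bits;
    float y;

    /* Copy the float's bit pattern into an integer. memcpy is used here
       instead of a pointer cast so the compiler's aliasing rules are
       respected. */
    memcpy(&bits, &x, sizeof bits);

    /* Read as an integer, the bits behave like a scaled, biased log2(x).
       Shifting right by one halves that logarithm (square root), and
       subtracting from the constant negates it (reciprocal) while
       restoring the exponent bias. */
    bits = INV_SQRT_MAGIC - (bits >> 1);

    /* Copy the adjusted bits back into a float: the initial guess. */
    memcpy(&y, &bits, sizeof y);

    /* One Newton-Raphson step on f(y) = 1/(y*y) - x refines the guess. */
    y = y * (1.5f - half_x * y * y);
    return y;
}
```

Calling fast_inv_sqrt(4.0f) should give something close to, but not exactly, 0.5. The static_assert only checks the size of float, so a platform with a 32-bit float that is not IEEE 754 would still compile and silently give wrong answers.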
Re:It was fast by systemeng · 2006-12-01 08:19 · Score: 5, Insightful

First off, this function calculates 1.0/sqrt(x), not sqrt(x). InvSqrt is a particularly nasty function because both the divide and the square root stall the floating-point pipeline on IA-32 processors. As a result, instead of shooting out the one result per cycle that pipelining normally allows, the processor will stall for 43 cycles on the square root and then for another 32 cycles on the divide (P4). This is a big hit to realtime performance, and it also prevents 76 multiplies from getting done while the pipeline is stalled.

Secondly, IA-32 processors are superscalar and have multiple integer units which can do portions of this calculation in parallel. This algorithm is brilliant because it uses the integer units for a portion of the most difficult part of the calculation, and the remaining floating-point multiplies only take about 6 clock cycles on the FPU. The difference in clock cycles you are counting is likely because the routine as written will be implemented as a function call, and the stack push overhead will eat you alive. If this is implemented inline, it's about 6 times as fast as simply calling the processor's assembly instructions for root and divide in sequence, with the penalty that it isn't as accurate. It is virtually impossible to beat sqrt on IA-32, but 1.0/sqrt can be computed faster with Newton-Raphson iteration in one fell swoop than by composition of the operations.

I've worked several years implementing similar optimizations in the reference implementation of ISO/IEC 18026, a standard for digital map conversion. Most of the routines that had optimizations like this added to them saw at least 30% speed improvements. This is a bit of a soft number because many things were reordered to make the pipeline fill better, but in general, a complicated function, especially one built from trig functions, that can be computed in one iteration of well-designed Newton-Raphson will be much faster than the composition of the CPU's implementations of the component functions. In short, don't write off careful numerics: they can provide great speed improvements. Just don't use them in code that people will want to understand later unless you document exactly what you did and why.