Red Hat Engineer Improves Math Performance of Glibc
jones_supa writes: Siddhesh Poyarekar from Red Hat has taken a professional look into mathematical functions found in Glibc (the GNU C library). He has been able to provide an 8-times performance improvement to slowest path of pow() function. Other transcendentals got similar improvements since the fixes were mostly in the generic multiple precision code. These improvements already went into glibc-2.18 upstream. Siddhesh believes that a lot of the low hanging fruit has now been picked, but that this is definitely not the end of the road for improvements in the multiple precision performance. There are other more complicated improvements, like the limitation of worst case precision for exp() and log() functions, based on the results of the paper Worst Cases for Correct Rounding of the Elementary Functions in Double Precision (PDF). One needs to prove that those results apply to the Glibc multiple precision bits.
I don't know ASM much, but my friend whose professional career is in programming told me that to really speed up the speed sometimes one may have to go the ASM way
Is that true?
Who needs glibc anymore ? we have systemd now.
I love reading stories like this. I'd love it if a concerted effort were made to further optimize various non-FP glibc methods as well. I assume they're already fairly well optimized, but I would be shocked if there weren't still some gains to be made. +10 internets to you, Mr./Ms. Poyarekar.
It looks like the slowest paths of the transcendental functions were improved by a lot. But how often do these paths get used? The article doesn't say so the performance benefits may be insignificant.
"When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
But having made an actual, real contribution to a piece of software in general use makes him more of an engineer than an unusually competent computer scientist.
Last time I checked, India was in Asia.
What is perhaps a bit of irony of history, even for humans a lookup table is faster and more precise than manually calculating it via formula. That is why they published books of logarithms. Using interpolation you can even stretch out the precision to several more digits. With a table of values in memory you can also narrow down the inputs to Newton's method and calculate any differentiable function very quickly to an arbitrary precision. With some functions the linear approximation is so close that you can reduce it in just a few cycles.
Even in most trigonometric functions there is a simple table upon which the angle addition formulas are used to get the other values[an old example].
Given the size of most operating systems, where 8k of ram is hardly noticed (most gifs are larger than this), I am actually quite surprised that the lookup table method is not more used. It would seem one of the first things to put in cache on your ALU.
The real problem is you need to be expert in the target processor(s) ...
Not really. Being an expert in assembly language in general may be required but not necessarily an expert on the target architecture. Transitioning from one target architecture to another is not like starting over from scratch. Part of becoming a good assembly language programmer is recognizing things in algorithms that can't quite be stated in a high level language, information or suggestions that can't be given to the compiler. Specific computer architectures provide the toolboxes to address such shortcomings. So a bit of the work is common regardless of the architecture and the rest is determining how to accomplish something with the architecture specific toolbox.
Circa 2000 I was beating PowerPC compilers on my first attempt. Now I had a lot of experience with x86 and some with 68K, 8051, 8048, 6502 and Z80. At the university I had taken undegrad and grad computer architecture classes, the later focused on Alpha. Before writing the "commercial" PowerPC code referred to earlier I spent a couple of weeks reading PowerPC manuals and writing little pieces of test code.
Being new to PowerPC I was concerned that I had missed something or done something wrong despite seriously beating the compiler. I was able to go to Apple and spend a couple days with their engineers. Their PowerPC people thought the assembly code was fine and couldn't really improve upon it, their compiler people couldn't improve the original C code.
That said, assembly language is unnecessary much of the time. Contrary to popular belief C code can be written in a manner that favors one architecture over another. By understanding the given architecture and its assembly language I've been able to rewrite C code and not have to go all the way to assembly. Having two C implementation of some code, one for x86 vs one for PowerPC, or more generally one for CISC vs one for RISC.
This is interesting. Do you have any examples?
Sorry, proprietary code of a past employer. It was in the domain of low level bitmapped graphics.
Its been quite a while but one method that I recall involved branch prediction. That if the condition to be used for a branch is known sufficiently far in advance then a branch on the PowerPC has no penalty. An inner loop had numerous branches whose conditions could be determined quite early. I recall doing so and storing things in extra condition registers. The result was that the inner loop no longer had any branch penalties. Myself and the Apple engineers just couldn't get the various compilers to do anything comparable.
This was just one of several things that I did but I don't really recall the others. Well, except that the rotate and mask instructions can be amazingly useful.
Newer hardware can make use of newer features which will change what should be considered the best optimisations. Addition used to be much faster than multiplication until they put barrel multipliers in chips. Once floating point cores were added, other things became faster but the early FPUs could do things like add and multiply and anything else could be very slow. I wrote a floating point library for OS9 for the radio shack color computer which had a 2 mhz 8 bit cpu with good 16 bit instructions and no floating point hardware and I could do trig and log functions faster than a 4.77 mhz 8087 floating point unit. I could use tricks like bit shifting and de-normalising floating point numbers for quick adds. There was one function that the typical Taylor series used a /3 + /5 + /7 type thing but there was an alternate that used /2 + /4 + /8 but took more steps but an integer CPU can divide by a power of 2 something like 50 times faster than dividing by an odd number so the doing the extra steps was faster. My library took advantage of precision shortcuts like simply not dealing with some of the low order bits when the precision was going to be lost in the next step or two which are things that you simply can't do efficiently with current floating point hardware.
A good example that I ran into not too long ago was trying to get GCC to autovectorize some heavy matrix multiplication operations without using vector intrinsics. No matter how hard I tried and now matter how explicitly I forced the memory alignment (on x86 double quad-word loads into XMM registers require 16 byte alignment) and ensured that all operations were 128 bits wide (SSE codepath) or 256 bits wide (AVX codepath) GCC just couldn't figure it out on its own. I poured through the compiler output and manage to clear up a few ambiguous data dependencies but I just couldn't get it to autovectorize the main loop.
I ended up digging around in the compiled ASM and noted where GCC was failing to unroll and reorder enough for SLP to work properly. I rewrote a small chunk of it by hand and got the results that I expected but doing so for a large portion of the project would have been unreasonable and also would have bound the source to x86. Instead, I switched from GCC to ICC and ICC picked up the optimization right away. For shiggles I tried Clang/LLVM as well but had no luck with it either.
With a table of values in memory you can also narrow down the inputs to Newton's method and calculate any differentiable function very quickly to an arbitrary precision. With some functions the linear approximation is so close that you can reduce it in just a few cycles.
No, you can't. I know this was done in Quake3 fastInvSqrt(), but that is the exception, not the rule in my experience. x = pow(a,b) is a differentiable function. How can you assemble a root function/Newton iteration to successively correct an initial guess for x to arbitrary precision--without actually calling pow() or other transcendental function? I have built Newton (and Halley and Householder) iterations to successively correct estimates for pow(a,b) when b is a particular rational number. You can re-arrange the root function to only have integer powers of the input a and of the solution value x, and those can be computed through successive multiplication. These can be fast, but they are certainly not useful when b is something other than a constant rational number. And even if the exponent value has only a few significant digits, the multiplication cascade starts to get expensive (that was the reason to use Halley/Householder because once you have f' calculated, f'' and f''' are almost free.)
If you know otherwise, please let me know. My current fast pow() function leverages IEEE floating point formats and Chebyshev polynomial expansions to get reasonable results. If there is way to polish an approximate pow() result with Newton (or higher order) iteration, I would be happy to learn it.
I once found an Bresenham line drawing algorithm written in Assembly (68k) on a Mac SE (black and white graphics).
That code was extremely complicated and the main mistake was that the author loaded each byte, manipulated it and rewrote it to memory each cycle.
Just to load in many cases the exact same byte again in the next cycle.
I immediately came to the idea to not load and store the bytes every cycle, and from that the jump to figure if you could also use words and long words depending on how steep the line was, was easy.
So I first wrote code in C that used chars, ints and longs depending on steepness and changed consecutive bits until it needed to access another "word". Each "word" was only loaded once into a register and was only written after all necessary bits where changed.
My C code rewrite was so fast (more than 100 times than the original assembly) that I never rewrote that in assembly itself.
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
For pow(a,b), [a,b real numbers], you are essentially calculating:
a^b = (e^log(a))^b) or pow(pow(e, log(a)), b) which is e^(b*log(a)) or pow(e, b*log(a)) where e is the base of the natural logarithm.
What you have in your table are the values for e^x and log(x), like any good book of logarithms of ancient times. Precision according to your needs. For quick lookup you can even index the mantissa in a b-tree if your table is huge.
Then it becomes very quick:
step 1: look up log (a) in the table, interpolate if needed.
step 2: calculate b * (value in last step).
step 3: lookup up e^x where x is the value at step 2 in your table, interpolate if needed.
step 4: profit! as you now have your result.
And as a bonus, you are sure the result is within the precision of your table immediately, within the error of your interpolation.
Note that interpolation for exp(x) is quite fast. There are some exotic methods out there as well for interpolating exp(x) and log(x), as per this abstract which are quite efficient if you need high precision. For 10 digit precision you could easily fit both your tables into 8k.
If you compile with -ffast-math, then they often will be. The problem is that the hardware implementations tend to fall into two categories. They're either very fast, and sometimes correct, or they're correct and very slow. The correct ones are normally microcoded, and if you've ever looked at CPU microcode then you'll know that it's an even worse thing to try to write complex algorithms in than assembly. You can generally do a better job with a C implementation that the compiler then optimises. The fast ones are very fast, but don't handle corner cases correctly (often the instructions predate the IEEE standard), so they're great for games and suchlike but not so good for scientific computing. GPUs are the worst offenders, often implementing things like trig functions as small lookup tables, leading to a lot of approximation.
I am TheRaven on Soylent News
Interesting. But just as a side note: I thought that people generally stopped rolling their own linear algebra code and started to use precanned one from Intel,amd,nVidia etc. I'm sure matlab is that way.
But that was not my question. I fully understand how to use lookup tables/Chebyshev expansions of exp(x) and ln(x) to implement pow(x,a)--I have implemented these many times. My question was specifically on your assertion that any differentiable function could be evaluated with as a Newton-style iterative correction and thus provide arbitrarily precise results. I asked specifically to see how that is accomplished for pow(). There is no corrective mechanism in the algorithm you have stated above. The precision you get is a function of the precision that you've baked into your lookup table--and then it become the space/accuracy trade-off. On desktop/server CPUs, that trade-off is more often than not won by Chebyshev expansions, especially in a world of SSE/AXV vector instructions.
So, if there is a technique that does allow me to start with an initial guess x[0] of pow(a,b) and then create corrections of the form: x[i+1] = x[i] - f(x[i]) where f() uses only intrinsic math operations (+,-,*,/,etc) but not transcendentals, then I am quite anxious to see it.