Red Hat Engineer Improves Math Performance of Glibc
jones_supa writes: Siddhesh Poyarekar from Red Hat has taken a professional look into mathematical functions found in Glibc (the GNU C library). He has been able to provide an 8-times performance improvement to slowest path of pow() function. Other transcendentals got similar improvements since the fixes were mostly in the generic multiple precision code. These improvements already went into glibc-2.18 upstream. Siddhesh believes that a lot of the low hanging fruit has now been picked, but that this is definitely not the end of the road for improvements in the multiple precision performance. There are other more complicated improvements, like the limitation of worst case precision for exp() and log() functions, based on the results of the paper Worst Cases for Correct Rounding of the Elementary Functions in Double Precision (PDF). One needs to prove that those results apply to the Glibc multiple precision bits.
I don't know ASM much, but my friend whose professional career is in programming told me that to really speed up the speed sometimes one may have to go the ASM way
Is that true?
Who needs glibc anymore ? we have systemd now.
I love reading stories like this. I'd love it if a concerted effort were made to further optimize various non-FP glibc methods as well. I assume they're already fairly well optimized, but I would be shocked if there weren't still some gains to be made. +10 internets to you, Mr./Ms. Poyarekar.
It looks like the slowest paths of the transcendental functions were improved by a lot. But how often do these paths get used? The article doesn't say so the performance benefits may be insignificant.
"When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
Fast inverse square root
Presumably I'm being clueless but I had tended to assume that these days most of the libc math functions were implemented in hardware - at least for desktop and server chip sets (I could imagine that certain simple embedded chips wouldn't have a full math instruction set). So unless your software is very inefficient at calling out to the hardware wouldn't most of the performance simply depend on the hardware?
I'm not surprised to hear that the floating point operations are slow.
I've always found that it is best to avoid floating point operations as much as possible. If you want speed, use a lookup table or a couple of tables. With a few kilobytes of memory, you can get about 7-8 bits of precision--and that's 7-8 bits of accurate precision. I'll take 7-8 bits of accurate precision for 12 bits of garbage any day. I can usually get 10-20x speed increase over glibc.
There is a trade-off though. It does take up a few kilobytes of more memory. It's hardly an issue on most systems today though...
Granted, I haven't bothered to implement all the floating point operations with lookup tables, but I've done a few over the years. It is possible that some operations will not lend themselves well to look up tables, but I haven't found any of those yet.
Red Hat Computer Scientist Improves Math Performance of Glibc
There, fixed that for you.
Last time I checked, India was in Asia.
The real problem is you need to be expert in the target processor(s) ...
Not really. Being an expert in assembly language in general may be required but not necessarily an expert on the target architecture. Transitioning from one target architecture to another is not like starting over from scratch. Part of becoming a good assembly language programmer is recognizing things in algorithms that can't quite be stated in a high level language, information or suggestions that can't be given to the compiler. Specific computer architectures provide the toolboxes to address such shortcomings. So a bit of the work is common regardless of the architecture and the rest is determining how to accomplish something with the architecture specific toolbox.
Circa 2000 I was beating PowerPC compilers on my first attempt. Now I had a lot of experience with x86 and some with 68K, 8051, 8048, 6502 and Z80. At the university I had taken undegrad and grad computer architecture classes, the later focused on Alpha. Before writing the "commercial" PowerPC code referred to earlier I spent a couple of weeks reading PowerPC manuals and writing little pieces of test code.
Being new to PowerPC I was concerned that I had missed something or done something wrong despite seriously beating the compiler. I was able to go to Apple and spend a couple days with their engineers. Their PowerPC people thought the assembly code was fine and couldn't really improve upon it, their compiler people couldn't improve the original C code.
That said, assembly language is unnecessary much of the time. Contrary to popular belief C code can be written in a manner that favors one architecture over another. By understanding the given architecture and its assembly language I've been able to rewrite C code and not have to go all the way to assembly. Having two C implementation of some code, one for x86 vs one for PowerPC, or more generally one for CISC vs one for RISC.
ASM for certain inner loops can be a good sixth step in the optimization process. It only matters after the algorithm is optimized, then optimized again, then the system stuff is right (ie look at how the storage structures used by the program translate to actual physical operations).
sure thing, if you want to rewrite your code for every cpu architecture ...
Nope. I only need to write assembly for the architectures I care about, all other architectures can have the reference C code implementation.
... (and preferably also every generation of said architecture)
Probably not. Performance tuning may only be necessary for the older architectures in order to lower required/recommended system requirements so the potential market is larger.
Also assembly is often more durable, more long lived, than you might imagine. I have some math code that I wrote in assembly targeting the Pentium Pro (686) back in the day. Every once and a while I compare it against the reference C code when I have a new architecture and/or new compiler. Its still faster. The percentage improvement is getting smaller but its still a win, that original investment made in Pentium Pro days has not become counterproductive. YMMV.
And if the day arrives where the assembly becomes counterproductive then newer generations of an architecture of interest can just use the reference C code like the architectures one never cared about.
Factorials tend to cancel out nicely with other factorials.
And, also, 69! factorial is too big for most calculators - even a pre-calculated table could probably do a ton to speed things alone.
This is interesting. Do you have any examples?
What kind of horror story is this??
I couldn't think of a sane person put in a routine that first checking a look-up-table and then check the approximation to see if the result can be 'accurate enough' and after failing all that goes through a series of multiplications to obtain the final result
Not a horror story at all, just basic numerical analysis. You just described a fast, reliable and arbitrarily-accurate way to calculate transcendental functions. What's insane about that?
This is interesting. Do you have any examples?
Sorry, proprietary code of a past employer. It was in the domain of low level bitmapped graphics.
Its been quite a while but one method that I recall involved branch prediction. That if the condition to be used for a branch is known sufficiently far in advance then a branch on the PowerPC has no penalty. An inner loop had numerous branches whose conditions could be determined quite early. I recall doing so and storing things in extra condition registers. The result was that the inner loop no longer had any branch penalties. Myself and the Apple engineers just couldn't get the various compilers to do anything comparable.
This was just one of several things that I did but I don't really recall the others. Well, except that the rotate and mask instructions can be amazingly useful.
Don't tell me that you've never written an algorithm which uses speculation before. It's quite a common scenario that you have a common fast path and an uncommon slow path, and the cost of deciding which path to use is a significant fraction of the cost of the fast path + checking the result.
In the case of libm, there are a lot of code paths which are there only to maintain strict compliance; no numeric analyst would ever call exp or pow (or even round) with arguments in that range. Not on purpose, anyway.
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
This is MPI code, not IEE754 code.
People using MPI are working with EXACTLY these kinds of numbers (extremely small, so small they'd denormalise even with quadmath/256bit IEEE754, recurring for so long you lose your tail - eg: calculating Pi, or so large your exponent bits aren't enough to even represent the number w/ any standard IEEE754 implementation).
The fact there's interest in optimising the MPI code paths for these functions and their slow paths is indicative of exactly this fact, people ARE indeed passing in many numbers outside of these speculative ranges, and ARE indeed interested in how fast it performs.
Newer hardware can make use of newer features which will change what should be considered the best optimisations. Addition used to be much faster than multiplication until they put barrel multipliers in chips. Once floating point cores were added, other things became faster but the early FPUs could do things like add and multiply and anything else could be very slow. I wrote a floating point library for OS9 for the radio shack color computer which had a 2 mhz 8 bit cpu with good 16 bit instructions and no floating point hardware and I could do trig and log functions faster than a 4.77 mhz 8087 floating point unit. I could use tricks like bit shifting and de-normalising floating point numbers for quick adds. There was one function that the typical Taylor series used a /3 + /5 + /7 type thing but there was an alternate that used /2 + /4 + /8 but took more steps but an integer CPU can divide by a power of 2 something like 50 times faster than dividing by an odd number so the doing the extra steps was faster. My library took advantage of precision shortcuts like simply not dealing with some of the low order bits when the precision was going to be lost in the next step or two which are things that you simply can't do efficiently with current floating point hardware.
A good example that I ran into not too long ago was trying to get GCC to autovectorize some heavy matrix multiplication operations without using vector intrinsics. No matter how hard I tried and now matter how explicitly I forced the memory alignment (on x86 double quad-word loads into XMM registers require 16 byte alignment) and ensured that all operations were 128 bits wide (SSE codepath) or 256 bits wide (AVX codepath) GCC just couldn't figure it out on its own. I poured through the compiler output and manage to clear up a few ambiguous data dependencies but I just couldn't get it to autovectorize the main loop.
I ended up digging around in the compiled ASM and noted where GCC was failing to unroll and reorder enough for SLP to work properly. I rewrote a small chunk of it by hand and got the results that I expected but doing so for a large portion of the project would have been unreasonable and also would have bound the source to x86. Instead, I switched from GCC to ICC and ICC picked up the optimization right away. For shiggles I tried Clang/LLVM as well but had no luck with it either.
How about compiler being able to leverage Cuda hardware if found during execution? Might make for fatter binaries, but..... if it found GPU, it might help in some cases?
Well, the system of execution is often different from the system of compilation. So, you'd want to have the compiler-user specify that a CUDA-specific binary should be compiled alongside (afaik cuda/x86 fat binaries aren't a thing right now). And that falls into the realm of build system, not compiler. From what I've seen the build system (make in this case) relies on nvidia's compiler to builld the cuda libraries subsequent linking into the cuda executable along with the c* and fortran code.
Yes!
I once found an Bresenham line drawing algorithm written in Assembly (68k) on a Mac SE (black and white graphics).
That code was extremely complicated and the main mistake was that the author loaded each byte, manipulated it and rewrote it to memory each cycle.
Just to load in many cases the exact same byte again in the next cycle.
I immediately came to the idea to not load and store the bytes every cycle, and from that the jump to figure if you could also use words and long words depending on how steep the line was, was easy.
So I first wrote code in C that used chars, ints and longs depending on steepness and changed consecutive bits until it needed to access another "word". Each "word" was only loaded once into a register and was only written after all necessary bits where changed.
My C code rewrite was so fast (more than 100 times than the original assembly) that I never rewrote that in assembly itself.
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
Depends on what you're doing. Sending stuff to the GPU incurs overhead. If you're doing a ton of matrix operations all in a row, it may be worth it. If you are doing a few every once and a while, probably not.
My Other Computer Is A Data General Nova III.
There is code and code. Often more than not, it is really the specifics of implementing that count (i.e. the code). I remember stumbling upon a integer circle drawing function for the Z80 on a magazine, and while building the code, I got like 4 or 5 sheets of A4 paper. I got so irritated with it I teared the sheets, and sure enough, in one hour, I had a substitute code that was no more than 10-15 lines, and much more efficient. I also remember at work having inherited code that had to read or write data to SQL, instead of using intermediary caches, gods know why. I shaved run times from 50 minutes to 5.
Having a C implementation as a reference is a good thing. Even if the assembly code was well implemented on the original architecture the hot spots needing attention might be different on the second architecture. The C code is pretty useful for profiling and testing.
Also the algorithm is always the place to start. There is little point in optimizing the C code or going to assemble language until you are sure the algorithm is right. Fewer and wider memory accesses is a pretty good direction to explore as you found out. I think last time I touched something like a bresenham I also worked on eliminating branches, maybe some sneaky code to do
x = (condition) ? a : b
in branchless PowerPC code but its been quite a while and maybe that was for something else.
But yeah if your C code is fast enough why redo it in assembly? I like programming in assembly but I'm not a zealot. Although I do recommend taking something like the algorithm you mentioned all the way to assembly as a way to learn a given architecture. I also suggest revisiting the code as one learns more. Truly understanding the architecture helps one write better C code.
Actually the trend is in the opposite direction -- fewer of the math functions are implemented in hardware than used to be. There are many reasons (optimized out-of-order CPUs and old/slow transcendental implementations) but one significant reason is that the new glibc math functions are generally correctly rounded -- exactly correct. Whereas the hardware versions are often not -- as I discussed in this recent blog post:
http://randomascii.wordpress.c...
If you compile with -ffast-math, then they often will be. The problem is that the hardware implementations tend to fall into two categories. They're either very fast, and sometimes correct, or they're correct and very slow. The correct ones are normally microcoded, and if you've ever looked at CPU microcode then you'll know that it's an even worse thing to try to write complex algorithms in than assembly. You can generally do a better job with a C implementation that the compiler then optimises. The fast ones are very fast, but don't handle corner cases correctly (often the instructions predate the IEEE standard), so they're great for games and suchlike but not so good for scientific computing. GPUs are the worst offenders, often implementing things like trig functions as small lookup tables, leading to a lot of approximation.
I am TheRaven on Soylent News
I like to program in assembly, too. :D ... on the last one only very briefly to get an idea and I forgot everything about the last four ;D - don't remember if I even truly did program on Alpha or only analyzed the compiler output.
But not on X86 architectures. On the other hand I did not do that since 20 years, so no idea if it is now less painful
I did assembly on ARM 3, 6502 - of course - 68k and Sparc, Mips, Dec Alpha, PowerPC
68k (68030/040) I liked the most due to its orthogonal architecture, it was like programming in a high level language. Or lets say other way around: it was very easy to map/compile high level languages to it.
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
Interesting. But just as a side note: I thought that people generally stopped rolling their own linear algebra code and started to use precanned one from Intel,amd,nVidia etc. I'm sure matlab is that way.
x86 assembly has been pretty clean ever since we have had the 32-bit 386 mode. x86-64 assembly is pretty much normal.
Seemd to me that it may be timr for CPU manufacturers to implement math functionality that we normally provide with libraries as part of the CPU itself. Does it make more sense to add more generic cores to a CPU or to add some new instructions and add math co-processors that have ROM on board with lookup tables for logs, powers, sin, cos, tan and a bunch of other useful functions?
You can't handle the truth.
What will be important when doing optimization is that no errors are introduced. Having errors in basic mathematical routines can break things at unpredictable places. The worst errors are the ones which deal with exceptional cases which are rarely hit upon. We worked once on integer factorization algorithms [We programmed Morrison-Brillard and quadratic Sieve methods, and everything in the group had been written from scratch, even the large integer libraries were written from the ground up in Pascal by a relatively large group of programmers, not using a single line of code from the outside as probably custom in any military setup.] Suddenly, our factorization algorithms did not work any more reliably. Debugging took some time but we could pinpoint it to parts where square roots were used. Indeed, the square root routine in rings of integers had been optimized by the group working on fundamental routines. It was quickly fixed, but I can only imagine a rare error in an important library of GCC. In the current case, it looks however as if the engineer knows what he is doing. Especially implementing the squaring function separately is smart and I'm surprised that this had not been done already. When dealing with integer arithmetic like taking powers a^b modulo some number, it is well known to be best first to compute powers a,a^2,(a^2)^2 by successively taking square roots, then write b in binary form and cobbling together the power a^b as a product of the precomputation which is of course the reason why taking powers is cheap in integer arithmetic and important in cryptology like RSA. So, it is important to be able to square quickly.
You're absolutely right about that. ICC contains a number of compiler optimizations for linear algebra functions provided by Intel. I rolled my own because for no other reason than to prove that I could, it was a neat day project.
This is MPI code, not IEE754 code.
As I understood it, the issue is that glibc's libm occasionally falls back to a multi-precision version of some transcendental functions if the "pure IEEE-754" version isn't good enough on some inputs.
I did look at the code, and the implementation of exp() on 64-bit floats does indeed fall back to a slow path on extreme inputs. I can't remember if it's the MP code or not.
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
I recall that many of the Linux library math functions I've seen say "copyright Sun Microsystems", and that they were pretty old.
Contribute to civilization: ari.aynrand.org/donate
Each step uses a factorial 1 or 2 higher than the previous step. The previous step's factorial is kept around, so a simple multiply or 2 is all that's required to get the new one. He isn't calculating a new factorial afresh at each step. That said, when I've done work like this I've always made a table of precalculated factorials, which may be a mistake if it takes a long time to fetch the table.
Contribute to civilization: ari.aynrand.org/donate