Red Hat Engineer Improves Math Performance of Glibc
jones_supa writes: Siddhesh Poyarekar from Red Hat has taken a professional look into mathematical functions found in Glibc (the GNU C library). He has been able to provide an 8-times performance improvement to slowest path of pow() function. Other transcendentals got similar improvements since the fixes were mostly in the generic multiple precision code. These improvements already went into glibc-2.18 upstream. Siddhesh believes that a lot of the low hanging fruit has now been picked, but that this is definitely not the end of the road for improvements in the multiple precision performance. There are other more complicated improvements, like the limitation of worst case precision for exp() and log() functions, based on the results of the paper Worst Cases for Correct Rounding of the Elementary Functions in Double Precision (PDF). One needs to prove that those results apply to the Glibc multiple precision bits.
I don't know ASM much, but my friend whose professional career is in programming told me that to really speed up the speed sometimes one may have to go the ASM way
Is that true?
Who needs glibc anymore ? we have systemd now.
I love reading stories like this. I'd love it if a concerted effort were made to further optimize various non-FP glibc methods as well. I assume they're already fairly well optimized, but I would be shocked if there weren't still some gains to be made. +10 internets to you, Mr./Ms. Poyarekar.
Tricky benchmarking of hardware and software, where yours gets the new glibc and Brand X gets the old one.
Bank on it.
It looks like the slowest paths of the transcendental functions were improved by a lot. But how often do these paths get used? The article doesn't say so the performance benefits may be insignificant.
"When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
Fast inverse square root
My bet would be 15-20 years. Originally, I'd guess it would be closer to 20 years, but being a RedHat employee, he might speed up the process.
I'm not surprised to hear that the floating point operations are slow.
I've always found that it is best to avoid floating point operations as much as possible. If you want speed, use a lookup table or a couple of tables. With a few kilobytes of memory, you can get about 7-8 bits of precision--and that's 7-8 bits of accurate precision. I'll take 7-8 bits of accurate precision for 12 bits of garbage any day. I can usually get 10-20x speed increase over glibc.
There is a trade-off though. It does take up a few kilobytes of more memory. It's hardly an issue on most systems today though...
Granted, I haven't bothered to implement all the floating point operations with lookup tables, but I've done a few over the years. It is possible that some operations will not lend themselves well to look up tables, but I haven't found any of those yet.
Red Hat Computer Scientist Improves Math Performance of Glibc
There, fixed that for you.
a noble concept, of infinite usefulness,
now, a dirty skeleton, unregarded, unmarked, unmourned.
Last time I checked, India was in Asia.
Professional merely means "paid".
Although I believe there's a mistaken presumption that a lot of work on FOSS is unpaid.
it sounds like the functions in question (pow() and exp()) work by first looking at a look-up table/polynomial approximation technique to see if the function can use that to get an accurate-enough value, then running through a series of multiplications to calculate the result if the table/approximation method wouldn't be accurate enough
What kind of horror story is this??
I couldn't think of a sane person put in a routine that first checking a look-up-table and then check the approximation to see if the result can be 'accurate enough' and after failing all that goes through a series of multiplications to obtain the final result
Oh man !
I have to admit that I did not read through TFA nor have I check the embedded functions inside the libraries, but if this particular function is of any indication of Glibc I can imagine exponential improvement could be have if all the non-essential and really stupid cruds be thrown out
Since this work improved the actual calculation part, my guess is that it will improve quite a few cases
Again, without looking at the code, my guess is that improving the calculation part would be among the last thing one should work on
If one could skip all those look-up-table / estimation of accuracy part, the speed improvement would be even more amazing
I don't understand how the use of factorials to calculate the taylor series would yield an improvement. Wouldn't the factorials be fairly expensive to compute?
I am also wondering how to test the performance improvement of something like this. Do they use a single benchmark across different platforms and report the average improvement that was measured? Does anyone know how that was done?
The real problem is you need to be expert in the target processor(s) ...
Not really. Being an expert in assembly language in general may be required but not necessarily an expert on the target architecture. Transitioning from one target architecture to another is not like starting over from scratch. Part of becoming a good assembly language programmer is recognizing things in algorithms that can't quite be stated in a high level language, information or suggestions that can't be given to the compiler. Specific computer architectures provide the toolboxes to address such shortcomings. So a bit of the work is common regardless of the architecture and the rest is determining how to accomplish something with the architecture specific toolbox.
Circa 2000 I was beating PowerPC compilers on my first attempt. Now I had a lot of experience with x86 and some with 68K, 8051, 8048, 6502 and Z80. At the university I had taken undegrad and grad computer architecture classes, the later focused on Alpha. Before writing the "commercial" PowerPC code referred to earlier I spent a couple of weeks reading PowerPC manuals and writing little pieces of test code.
Being new to PowerPC I was concerned that I had missed something or done something wrong despite seriously beating the compiler. I was able to go to Apple and spend a couple days with their engineers. Their PowerPC people thought the assembly code was fine and couldn't really improve upon it, their compiler people couldn't improve the original C code.
That said, assembly language is unnecessary much of the time. Contrary to popular belief C code can be written in a manner that favors one architecture over another. By understanding the given architecture and its assembly language I've been able to rewrite C code and not have to go all the way to assembly. Having two C implementation of some code, one for x86 vs one for PowerPC, or more generally one for CISC vs one for RISC.
ASM for certain inner loops can be a good sixth step in the optimization process. It only matters after the algorithm is optimized, then optimized again, then the system stuff is right (ie look at how the storage structures used by the program translate to actual physical operations).
Thank you.
sure thing, if you want to rewrite your code for every cpu architecture ...
Nope. I only need to write assembly for the architectures I care about, all other architectures can have the reference C code implementation.
... (and preferably also every generation of said architecture)
Probably not. Performance tuning may only be necessary for the older architectures in order to lower required/recommended system requirements so the potential market is larger.
Also assembly is often more durable, more long lived, than you might imagine. I have some math code that I wrote in assembly targeting the Pentium Pro (686) back in the day. Every once and a while I compare it against the reference C code when I have a new architecture and/or new compiler. Its still faster. The percentage improvement is getting smaller but its still a win, that original investment made in Pentium Pro days has not become counterproductive. YMMV.
And if the day arrives where the assembly becomes counterproductive then newer generations of an architecture of interest can just use the reference C code like the architectures one never cared about.
This is interesting. Do you have any examples?
Is Utah in Africa or Europe?
Do the new functions provide the exact same results as the old functions for any given input?
Ideally, we should not be using exact bit-equivalence for comparing floating-point numbers. However, we do not live in an ideal world.
Aww don't be mad, -1 is more than -2, after all, right? Also, don't talk about yourself in the third person. Scares away the girls.
I've _never_ had reason to care about glibc exp/pow/log performance _except_ when my code would hit a case that was _10,000_ @#$@# times slower than the average and cause massive failures. Thank god that they're finally doing something about this.
This is interesting. Do you have any examples?
Sorry, proprietary code of a past employer. It was in the domain of low level bitmapped graphics.
Its been quite a while but one method that I recall involved branch prediction. That if the condition to be used for a branch is known sufficiently far in advance then a branch on the PowerPC has no penalty. An inner loop had numerous branches whose conditions could be determined quite early. I recall doing so and storing things in extra condition registers. The result was that the inner loop no longer had any branch penalties. Myself and the Apple engineers just couldn't get the various compilers to do anything comparable.
This was just one of several things that I did but I don't really recall the others. Well, except that the rotate and mask instructions can be amazingly useful.
"Shut the fuck up!"
-Abraham Lincoln
Newer hardware can make use of newer features which will change what should be considered the best optimisations. Addition used to be much faster than multiplication until they put barrel multipliers in chips. Once floating point cores were added, other things became faster but the early FPUs could do things like add and multiply and anything else could be very slow. I wrote a floating point library for OS9 for the radio shack color computer which had a 2 mhz 8 bit cpu with good 16 bit instructions and no floating point hardware and I could do trig and log functions faster than a 4.77 mhz 8087 floating point unit. I could use tricks like bit shifting and de-normalising floating point numbers for quick adds. There was one function that the typical Taylor series used a /3 + /5 + /7 type thing but there was an alternate that used /2 + /4 + /8 but took more steps but an integer CPU can divide by a power of 2 something like 50 times faster than dividing by an odd number so the doing the extra steps was faster. My library took advantage of precision shortcuts like simply not dealing with some of the low order bits when the precision was going to be lost in the next step or two which are things that you simply can't do efficiently with current floating point hardware.
He moved it into the new kernel... Systemd!
How about compiler being able to leverage Cuda hardware if found during execution? Might make for fatter binaries, but..... if it found GPU, it might help in some cases?
A good example that I ran into not too long ago was trying to get GCC to autovectorize some heavy matrix multiplication operations without using vector intrinsics. No matter how hard I tried and now matter how explicitly I forced the memory alignment (on x86 double quad-word loads into XMM registers require 16 byte alignment) and ensured that all operations were 128 bits wide (SSE codepath) or 256 bits wide (AVX codepath) GCC just couldn't figure it out on its own. I poured through the compiler output and manage to clear up a few ambiguous data dependencies but I just couldn't get it to autovectorize the main loop.
I ended up digging around in the compiled ASM and noted where GCC was failing to unroll and reorder enough for SLP to work properly. I rewrote a small chunk of it by hand and got the results that I expected but doing so for a large portion of the project would have been unreasonable and also would have bound the source to x86. Instead, I switched from GCC to ICC and ICC picked up the optimization right away. For shiggles I tried Clang/LLVM as well but had no luck with it either.
Since people are concerned about speed, why not just use Java?
Everyone knows Java is faster than C/C++, and if you use Java, you don't care about IEEE floating point compliance anyway.
Who cares if your bridge static calculations are 'a bit off' , just issue a service pack and fix it.
Who cares if your financial calculations are a bit off, just claim it's "GAAP".
Who cares if your numbers are not perfect, it just has to be good eough, and complied "just in time".
I note most of the comments here address the efficiency of the code, ignoring the algorithm as the elephant in the fridge. The gains mentioned here are from improved algorithms, not improved code / implementation of the old algorithms.
I say that, since many of the sections here are overrun by useless trolls, unfortunately: They nigh constantly downmod my posts
That's most likely because almost all your posts wander off-topic. Most commonly to whine about trolls attacking you (persecution complex much?) or brag about something. It doesn't help that you almost never pass up a chance to prattle about hosts files and pimp your site.
You also have a disjointed and at times incoherent writing style; over-using caps, punctuation, and quotation marks. You frequently mis-use literary objects and rely heavily on run-on sentences. Frankly speaking, the end result are posts which read like someone who knows a lot of jargon, but doesn't really understand what he's talking about, and is just throwing as much babble together as possible to try to fake it.
I once found an Bresenham line drawing algorithm written in Assembly (68k) on a Mac SE (black and white graphics).
That code was extremely complicated and the main mistake was that the author loaded each byte, manipulated it and rewrote it to memory each cycle.
Just to load in many cases the exact same byte again in the next cycle.
I immediately came to the idea to not load and store the bytes every cycle, and from that the jump to figure if you could also use words and long words depending on how steep the line was, was easy.
So I first wrote code in C that used chars, ints and longs depending on steepness and changed consecutive bits until it needed to access another "word". Each "word" was only loaded once into a register and was only written after all necessary bits where changed.
My C code rewrite was so fast (more than 100 times than the original assembly) that I never rewrote that in assembly itself.
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
There is code and code. Often more than not, it is really the specifics of implementing that count (i.e. the code). I remember stumbling upon a integer circle drawing function for the Z80 on a magazine, and while building the code, I got like 4 or 5 sheets of A4 paper. I got so irritated with it I teared the sheets, and sure enough, in one hour, I had a substitute code that was no more than 10-15 lines, and much more efficient. I also remember at work having inherited code that had to read or write data to SQL, instead of using intermediary caches, gods know why. I shaved run times from 50 minutes to 5.
now if someone would fix huge bloating of static libraries, the execessive memory overhead, the huge size of the core system libraries, and the proliferation of "extensions" we can be happy.
Having a C implementation as a reference is a good thing. Even if the assembly code was well implemented on the original architecture the hot spots needing attention might be different on the second architecture. The C code is pretty useful for profiling and testing.
Also the algorithm is always the place to start. There is little point in optimizing the C code or going to assemble language until you are sure the algorithm is right. Fewer and wider memory accesses is a pretty good direction to explore as you found out. I think last time I touched something like a bresenham I also worked on eliminating branches, maybe some sneaky code to do
x = (condition) ? a : b
in branchless PowerPC code but its been quite a while and maybe that was for something else.
But yeah if your C code is fast enough why redo it in assembly? I like programming in assembly but I'm not a zealot. Although I do recommend taking something like the algorithm you mentioned all the way to assembly as a way to learn a given architecture. I also suggest revisiting the code as one learns more. Truly understanding the architecture helps one write better C code.
Its from red hat so has anyone checked this isn't somehow reducing randomness to allow the nsa in? Their other nsa sponsored backdoors like systemd exploit surface seem to be doing well.....
I like to program in assembly, too. :D ... on the last one only very briefly to get an idea and I forgot everything about the last four ;D - don't remember if I even truly did program on Alpha or only analyzed the compiler output.
But not on X86 architectures. On the other hand I did not do that since 20 years, so no idea if it is now less painful
I did assembly on ARM 3, 6502 - of course - 68k and Sparc, Mips, Dec Alpha, PowerPC
68k (68030/040) I liked the most due to its orthogonal architecture, it was like programming in a high level language. Or lets say other way around: it was very easy to map/compile high level languages to it.
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
Interesting. But just as a side note: I thought that people generally stopped rolling their own linear algebra code and started to use precanned one from Intel,amd,nVidia etc. I'm sure matlab is that way.
Did you try the -mavx compiler optimization flag?
x86 assembly has been pretty clean ever since we have had the 32-bit 386 mode. x86-64 assembly is pretty much normal.
Has Linus approved yet? (just kidding, sort of).
Seemd to me that it may be timr for CPU manufacturers to implement math functionality that we normally provide with libraries as part of the CPU itself. Does it make more sense to add more generic cores to a CPU or to add some new instructions and add math co-processors that have ROM on board with lookup tables for logs, powers, sin, cos, tan and a bunch of other useful functions?
You can't handle the truth.
What will be important when doing optimization is that no errors are introduced. Having errors in basic mathematical routines can break things at unpredictable places. The worst errors are the ones which deal with exceptional cases which are rarely hit upon. We worked once on integer factorization algorithms [We programmed Morrison-Brillard and quadratic Sieve methods, and everything in the group had been written from scratch, even the large integer libraries were written from the ground up in Pascal by a relatively large group of programmers, not using a single line of code from the outside as probably custom in any military setup.] Suddenly, our factorization algorithms did not work any more reliably. Debugging took some time but we could pinpoint it to parts where square roots were used. Indeed, the square root routine in rings of integers had been optimized by the group working on fundamental routines. It was quickly fixed, but I can only imagine a rare error in an important library of GCC. In the current case, it looks however as if the engineer knows what he is doing. Especially implementing the squaring function separately is smart and I'm surprised that this had not been done already. When dealing with integer arithmetic like taking powers a^b modulo some number, it is well known to be best first to compute powers a,a^2,(a^2)^2 by successively taking square roots, then write b in binary form and cobbling together the power a^b as a product of the precomputation which is of course the reason why taking powers is cheap in integer arithmetic and important in cryptology like RSA. So, it is important to be able to square quickly.
Indeed, I've not yet found a compiler which is good at using the PowerPC condition registers. For example if you have a variable in which ecah bit represents a flag, a simple mtcrf can transform every bit in an easy to test branch condition (this is also very useful forrrr alignment tests
at the beginning/end of memcpy/memmove loops, mtcrf 0x1,rs). But PPC is almost unique in this respect and history makes that, sadly, compilers are implicitly tuned for architectures with a single condition core register (x86, arm).
Another instruction that compiler fail to use properly is rlwimi, which is a arbitrary bit field move between two 32 bit registers (yes it can extract a bit field from a register and insert it into another independently of the position of the field in the registers). In assembly it helps a lot.
Disclaimer: I wrote an x86 emulator for PowerPC in assembly, 5000 lines in one week, but only basic instructions, it was to run the VGA BIOS of a board, back in 1999. Estimated speed was about a slow (16MHz) 486sx (no FPU) on a 200MHz 603e, not a problem for a BIOS who spend over 90% of the 3 seconds boot time polling the timer for elapsed time while displaying the manufacturer's banner.
You're absolutely right about that. ICC contains a number of compiler optimizations for linear algebra functions provided by Intel. I rolled my own because for no other reason than to prove that I could, it was a neat day project.
He can't do it apk. Hence his minusmod of your post.
Once again, he couldn't put up apk, so he shut up and all that troll's got is a bogus minusmod.
I recall that many of the Linux library math functions I've seen say "copyright Sun Microsystems", and that they were pretty old.
Contribute to civilization: ari.aynrand.org/donate