Red Hat Engineer Improves Math Performance of Glibc

← Back to Stories (view on slashdot.org)

Red Hat Engineer Improves Math Performance of Glibc

Posted by Soulskill on Friday January 2, 2015 @10:01AM from the performance-enhancing-devs dept.

jones_supa writes: Siddhesh Poyarekar from Red Hat has taken a professional look into mathematical functions found in Glibc (the GNU C library). He has been able to provide an 8-times performance improvement to slowest path of pow() function. Other transcendentals got similar improvements since the fixes were mostly in the generic multiple precision code. These improvements already went into glibc-2.18 upstream. Siddhesh believes that a lot of the low hanging fruit has now been picked, but that this is definitely not the end of the road for improvements in the multiple precision performance. There are other more complicated improvements, like the limitation of worst case precision for exp() and log() functions, based on the results of the paper Worst Cases for Correct Rounding of the Elementary Functions in Double Precision (PDF). One needs to prove that those results apply to the Glibc multiple precision bits.

13 of 226 comments (clear)

Min score:

Reason:

Sort:

Re:C versus Assembly Language by gangien · 2015-01-02 10:14 · Score: 3, Informative

It may, but it's pretty rare that it's worth it and it also increases the cost of maintaining. Though a function in glibc, might be an exception.
Re:How much benefit? by Baloroth · 2015-01-02 10:49 · Score: 4, Informative

It looks like the slowest paths of the transcendental functions were improved by a lot. But how often do these paths get used? The article doesn't say so the performance benefits may be insignificant.
From TFA, it sounds like the functions in question (pow() and exp()) work by first looking at a look-up table/polynomial approximation technique to see if the function can use that to get an accurate-enough value, then running through a series of multiplications to calculate the result if the table/approximation method wouldn't be accurate enough. Since this work improved the actual calculation part, my guess is that it will improve quite a few cases. TFA does say the lookup table method works in the "majority of cases", though it doesn't say exactly how big a majority, so it's hard to say exactly.

--
"None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
Title correction by Dishwasha · 2015-01-02 11:00 · Score: 1, Informative

Red Hat Computer Scientist Improves Math Performance of Glibc
There, fixed that for you.
Re: I'm surprised by Anonymous Coward · 2015-01-02 11:32 · Score: 2, Informative

Last time I checked, India was in Asia.
Re:C versus Assembly Language by hermitdev · 2015-01-02 12:39 · Score: 4, Informative

This is generally true, but when you've measured and proven the compiler is generating sub optimal code, it may be time to hand-optimize in assembly. Compilers don't always generate better code than a human. In general, if you have performance critical code, I find it better to write it in C/C++ first, then examine the compiler optimized assembly. IIF it is shown to be sub optimal, then you optimize by hand. Compilers are also finicky. One algorithm it may optimize it well, but make a slight change in a high level language and sometimes the compiler optimizer gets confused and generates horridly different and rather un-optimized code. Different compilers may also optimize differently, so it may be ideal to do the assembly by hand to get uniform performance across compilers. In general, if performance is *that* critical to you, you need to examine by hand those critical sections to verify the compiler is doing what you expect it to do. This isn't even to mention that compilers target the lowest common denominator of the architecture you specify. i.e. if you target x86, most compilers will not utilize any SSE or SIMD instructions unless separately and explicitly enabled.
Re:C versus Assembly Language by Pseudonym · 2015-01-02 12:49 · Score: 4, Informative

Indeed it is, however it's still rare that you have to go to ASM in those cases. In simple cases the compiler already generates SIMD code on code which can benefit from it, and for almost all other cases, there are C intrinsics.

--
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
Re:C versus Assembly Language by PhunkySchtuff · 2015-01-02 13:07 · Score: 3, Informative

It may, but it's pretty rare that it's worth it and it also increases the cost of maintaining. Though a function in glibc, might be an exception.
There's nothing rare about it. SIMD vectorization is useful in tons of applications.
Yes, and modern compilers are quite good at generating code that takes advantage of extended instruction sets.

--
Specialist Mac support for creative pros, Melbourne
Re: Don't need to be an expert to beat compilers . by perpenso · 2015-01-02 14:22 · Score: 3, Informative

This is interesting. Do you have any examples?
Sorry, proprietary code of a past employer. It was in the domain of low level bitmapped graphics.

Its been quite a while but one method that I recall involved branch prediction. That if the condition to be used for a branch is known sufficiently far in advance then a branch on the PowerPC has no penalty. An inner loop had numerous branches whose conditions could be determined quite early. I recall doing so and storing things in extra condition registers. The result was that the inner loop no longer had any branch penalties. Myself and the Apple engineers just couldn't get the various compilers to do anything comparable.

This was just one of several things that I did but I don't really recall the others. Well, except that the rotate and mask instructions can be amazingly useful.
Re:C versus Assembly Language by perpenso · 2015-01-02 15:02 · Score: 5, Informative

Not just every architecture. In general, you may need to write it for every major revision of every architecture. As CPU pipelines and instruction sets change, the hand-crafted assembler may no longer be optimal.
(Exercise: Write an optimal memcpy/memmove.)
I have some math code that I optimized in assembly for Pentium Pro (686) back in the day. The performance improvements vs C are getting smaller and smaller but they are still positive. At least through Core Duo, which was the last time I had access to that code. Whenever we upgraded compilers, or we upgraded our development systems, I would test my old assembly code against the reference C code.

Regarding a case like your memcpy example. An assembly version may still be warranted. A piece of software may need to optimize for the low end computers out there. So if the assembly is a win for the low end and neutral or a slight loss for the high end then it may still be the way to go. The low end is where the attention is needed to expand the pool of computers included in the minimum system requirement, think video games. You have to optimize for the three year old computer, its the one having performance problems, not the new computer. And if it does matter on the high end its simple enough to have earlier generations of an architecture use the assembly and later generations use the reference C code. Fear of future systems is no reason to leave current systems hobbled.
Re: C versus Assembly Language by debiansid · 2015-01-02 15:09 · Score: 3, Informative

FWIW, glibc string functions are hand-written in assembly for each CPU variant that has a new feature it can use. Like SSE, AVX, AVX512, etc. Likewise for non-x86 processors.
Re: excellent by debiansid · 2015-01-02 15:13 · Score: 3, Informative

There are patches in review on the glibc mailing list for this. Hopefully we will have support for vector math entry points by 2.22.
Re:excellent by tlhIngan · 2015-01-02 20:12 · Score: 4, Informative

WTF?! We're talking about general recursion, not some stupid, easily avoidable cases. Even in the case of qsort - you need to make 2 (not one) recursive calls one for left and one for right, and you need to keep the midpoint somewhere on the stack... Basically it's easier to roll your own, totally heap based accounting than try to be clever with stack frames. Plus GP explicitly said that there is recursion and your c compiler does not do tail recursion anyway (or you can't count on it)
No, you can do quicksort in one recursive call. Even in the two recursive call scenario, an optimizing compiler will turn it into one recursive call and a loop - it's such a basic operation it's called tail recursion.
Basically if you have a recursive function call at the end of the function, instead of making it a recursive call, you can reuse the current stack frame by readjusting the input parameters (on the stack) and jumping to the beginning of the function, saving yourself the headache of setting up and tearing down a stack frame because you're reusing the current stack frame. (or in other words, you're doing a loop).
In fact, this can be generalized into tall call optimization where if you call another function at the end of a function, the optimizing compiler will reuse the stack frame of the current function for the call instead.
Syntactically the code looks the same, but the output is vastly different because you get operations that rewrite the stack frame (a particularly smart compiler might actually put the parameters of the call right where they need to be by moving the arguments around so the final two instructions is a stack adjustment to compensate for different stack sizes and a direct jump, so when the tail function returns, it doesn't stop back at the calling function, but goes back to the original function.
So yes, you're right in that there's supposed to be two recursive calls in quicksort, but in practice, there's only one because the last one is always tail-recursive so compilers merely reuse the existing stack frame.
Re:C versus Assembly Language by TheRaven64 · 2015-01-02 21:52 · Score: 4, Informative

And he missed the second step:
File a bug report for the compiler with 'missed optimization opportunity' in the title and a reduced test case.
We like to see real-world examples of where we're generating bad code - if we don't see them, we can't fix them.

--
I am TheRaven on Soylent News