Red Hat Engineer Improves Math Performance of Glibc

← Back to Stories (view on slashdot.org)

Red Hat Engineer Improves Math Performance of Glibc

Posted by Soulskill on Friday January 2, 2015 @10:01AM from the performance-enhancing-devs dept.

jones_supa writes: Siddhesh Poyarekar from Red Hat has taken a professional look into mathematical functions found in Glibc (the GNU C library). He has been able to provide an 8-times performance improvement to slowest path of pow() function. Other transcendentals got similar improvements since the fixes were mostly in the generic multiple precision code. These improvements already went into glibc-2.18 upstream. Siddhesh believes that a lot of the low hanging fruit has now been picked, but that this is definitely not the end of the road for improvements in the multiple precision performance. There are other more complicated improvements, like the limitation of worst case precision for exp() and log() functions, based on the results of the paper Worst Cases for Correct Rounding of the Elementary Functions in Double Precision (PDF). One needs to prove that those results apply to the Glibc multiple precision bits.

150 of 226 comments (clear)

Min score:

Reason:

Sort:

C versus Assembly Language by Anonymous Coward · 2015-01-02 10:08 · Score: 2, Interesting

I don't know ASM much, but my friend whose professional career is in programming told me that to really speed up the speed sometimes one may have to go the ASM way
Is that true?
1. Re:C versus Assembly Language by gangien · 2015-01-02 10:14 · Score: 3, Informative
  
  It may, but it's pretty rare that it's worth it and it also increases the cost of maintaining. Though a function in glibc, might be an exception.
2. Re:C versus Assembly Language by Anonymous Coward · 2015-01-02 10:20 · Score: 1
  
  It's more-or-less true that the huge performance gains are seen by optimizing in assembly. Keep in mind: this is that the compiler tried to do; when you start down this path you are saying "that fancy compiler doesn't know what its doing, I'll do it all myself".
  However, when we are talking about libraries, and items such as exponentiation, it is almost certain that these operations are written in assembly and made available to the compiled library.
3. Re:C versus Assembly Language by Anonymous Coward · 2015-01-02 10:28 · Score: 2, Insightful
  
  Compilers can't really compete with experts on performance, given enough time. If you are unfortunate enough to look into GCC, you can find some assembly here and there on performance-critical portions of the code. The major downsides are: you have to write it for each architecture you want to support (GCC supports many archs, so this is a really big deal), fewer people can read/write/fix it, bugs are even easier to slip by, takes longer to code.
4. Re:C versus Assembly Language by RingDev · 2015-01-02 10:32 · Score: 4, Insightful
  
  99.9% of the time, no.
  The purpose of the compiler is to identify and optimize the code structures in higher level languages. There are many, many tools, and generations of compilers that have been dedicated to just that. For the vast majority of cases, the compiler will do a better job and leave you with the much easier task of maintaining a high level language codebase.
  That said, there are specific operations, most frequently mathematical in nature, that are so explicitly well defined and unchanging, that writing them in ASM may actually allow the author to take procedural liberties that the compiler is unknowledgeable of or exist in such a way that the compiler is unaware of.
  The end result of such code is typically virtually unreadable. The syntax masks the math, and the math obfuscates the syntax. But the outcome is a thing of pure beauty.
  -Rick
  
  --
  "Most people in the U.S. wouldn't know they live in a tyrannical state if it walked up and grabbed their junk." - MyFirs
5. Re:C versus Assembly Language by YoungManKlaus · 2015-01-02 10:58 · Score: 1
  
  sure thing, if you want to rewrite your code for every cpu architecture (and preferably also every generation of said architecture) ...
6. Re:C versus Assembly Language by Impy+the+Impiuos+Imp · 2015-01-02 11:06 · Score: 1
  
  The real problem is you need to be expert in the target processor(s) with special instruction slots and the break even points for unrolling a loop and dozens of other peculiarities. Processors are co-designed with compilers now for a reason. . It's not enough to just be clever in translating an algorithm to a little math virtual world of chars and floats and other atomic, hardware-level operations.
  
  --
  (-1: Post disagrees with my already-settled worldview) is not a valid mod option.
7. Re:C versus Assembly Language by smittyoneeach · 2015-01-02 11:14 · Score: 1
  
  As long as you can just guess the correct answer 100% of the time, why bother with calculations?
  Call it "The God Optimization".
  
  --
  Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
8. Re:C versus Assembly Language by Lunix+Nutcase · 2015-01-02 11:14 · Score: 1
  
  It may, but it's pretty rare that it's worth it and it also increases the cost of maintaining. Though a function in glibc, might be an exception.
  There's nothing rare about it. SIMD vectorization is useful in tons of applications.
9. Re:C versus Assembly Language by Lunix+Nutcase · 2015-01-02 11:21 · Score: 1
  
  99.9% of the time, no.
  Said by someone who doesn't do anything involving multimedia or DSP. Compilers are horrendous at vectorization. It's why things like ffmpeg, x264, libjpeg-turbo, and pretty much any video/audio/image codec that has any decent performance has SIMD assembly for all platforms that support it. Otherwise they would be well more than a magnitude and then some slower.
  If you don't believe me compile x264 for without ASM optimizations even at your compilers highest optimization level and watch how it slows to a crawl compared to the hand-written SIMD version. Same with ffmpeg, etc.
10. Re:C versus Assembly Language by ClickOnThis · 2015-01-02 11:40 · Score: 5, Insightful
  
  Keep in mind: this is [w]hat the compiler tried to do; when you start down this path you are saying "that fancy compiler doesn't know what its doing, I'll do it all myself".
  Trying to outsmart a compiler defeats much of the purpose of using one.
  -- Kernighan and Plauger, The Elements of Programming Style
  
  --
  If it weren't for deadlines, nothing would be late.
11. Re:C versus Assembly Language by TheGavster · 2015-01-02 11:52 · Score: 4, Insightful
  
  I think that saying "This piece of code is going to be called a lot, so I'll implement it in assembler" is inadvisable. The more reasoned approach is "after profiling, my program spends a lot of time in this routine, so I'll go over the assembler the compiler generated to make sure it optimized correctly". The upshot being, it is useful to be able to read and write assembler when optimizing, but it would be rare that you would produce new code in assembly from whole cloth.
  
  --
  "Because Science" is one step from "Because old book". Try "Because of my experiment testing my falsifiable assertion".
12. Re:C versus Assembly Language by gangien · 2015-01-02 12:32 · Score: 1
  
  really nothing rare about it? maybe in your line of work. But since we're just talking general programming here, it's quite rare.
13. Re:C versus Assembly Language by hermitdev · 2015-01-02 12:39 · Score: 4, Informative
  
  This is generally true, but when you've measured and proven the compiler is generating sub optimal code, it may be time to hand-optimize in assembly. Compilers don't always generate better code than a human. In general, if you have performance critical code, I find it better to write it in C/C++ first, then examine the compiler optimized assembly. IIF it is shown to be sub optimal, then you optimize by hand. Compilers are also finicky. One algorithm it may optimize it well, but make a slight change in a high level language and sometimes the compiler optimizer gets confused and generates horridly different and rather un-optimized code. Different compilers may also optimize differently, so it may be ideal to do the assembly by hand to get uniform performance across compilers. In general, if performance is *that* critical to you, you need to examine by hand those critical sections to verify the compiler is doing what you expect it to do. This isn't even to mention that compilers target the lowest common denominator of the architecture you specify. i.e. if you target x86, most compilers will not utilize any SSE or SIMD instructions unless separately and explicitly enabled.
14. Re:C versus Assembly Language by Pseudonym · 2015-01-02 12:49 · Score: 4, Informative
  
  Indeed it is, however it's still rare that you have to go to ASM in those cases. In simple cases the compiler already generates SIMD code on code which can benefit from it, and for almost all other cases, there are C intrinsics.
  
  --
  sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
15. Re:C versus Assembly Language by hermitdev · 2015-01-02 12:50 · Score: 1
  
  What do you think intrinsics are? They are nothing but very thinly wrapped C functions, often marked inline, over the underlying instructions. The major downside here is you have no choice over register allocation, if you're making a number of calls, this could become an issue.
16. Re:C versus Assembly Language by Pseudonym · 2015-01-02 12:55 · Score: 4, Insightful
  
  Not just every architecture. In general, you may need to write it for every major revision of every architecture. As CPU pipelines and instruction sets change, the hand-crafted assembler may no longer be optimal.
  (Exercise: Write an optimal memcpy/memmove.)
  
  --
  sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
17. Re:C versus Assembly Language by dkf · 2015-01-02 12:57 · Score: 1
  
  when you've measured and proven the compiler is generating sub optimal code
  That's the important part. Don't start mucking around with low-level assembly for things until you've proven that you've got a problem and that the fix you're proposing to work on is worthwhile. (Where a library gets very widely distributed, such as a basic math library, it may well become worthwhile very quickly. Most code doesn't get anything like that level of distribution.)
  
  --
  "Little does he know, but there is no 'I' in 'Idiot'!"
18. Re:C versus Assembly Language by PhunkySchtuff · 2015-01-02 13:07 · Score: 3, Informative
  
  It may, but it's pretty rare that it's worth it and it also increases the cost of maintaining. Though a function in glibc, might be an exception.
  There's nothing rare about it. SIMD vectorization is useful in tons of applications.
  Yes, and modern compilers are quite good at generating code that takes advantage of extended instruction sets.
  
  --
  Specialist Mac support for creative pros, Melbourne
19. Re:C versus Assembly Language by Anonymous Coward · 2015-01-02 13:10 · Score: 1
  
  99.9% of the time, no.
  Said by someone who doesn't do anything involving multimedia or DSP.
  So, like he said - 99.9% of the time, no.
20. Re:C versus Assembly Language by beelsebob · 2015-01-02 13:34 · Score: 1
  
  The thing is, + will be compiled into something useful on all platforms the compiler targets. Intrinsics will only become something useful on the platforms they are for. That's why they're closer to assembler than + is.
21. Re:C versus Assembly Language by perpenso · 2015-01-02 15:02 · Score: 5, Informative
  
  Not just every architecture. In general, you may need to write it for every major revision of every architecture. As CPU pipelines and instruction sets change, the hand-crafted assembler may no longer be optimal.
  (Exercise: Write an optimal memcpy/memmove.)
  I have some math code that I optimized in assembly for Pentium Pro (686) back in the day. The performance improvements vs C are getting smaller and smaller but they are still positive. At least through Core Duo, which was the last time I had access to that code. Whenever we upgraded compilers, or we upgraded our development systems, I would test my old assembly code against the reference C code.
  
  Regarding a case like your memcpy example. An assembly version may still be warranted. A piece of software may need to optimize for the low end computers out there. So if the assembly is a win for the low end and neutral or a slight loss for the high end then it may still be the way to go. The low end is where the attention is needed to expand the pool of computers included in the minimum system requirement, think video games. You have to optimize for the three year old computer, its the one having performance problems, not the new computer. And if it does matter on the high end its simple enough to have earlier generations of an architecture use the assembly and later generations use the reference C code. Fear of future systems is no reason to leave current systems hobbled.
22. Re: C versus Assembly Language by debiansid · 2015-01-02 15:09 · Score: 3, Informative
  
  FWIW, glibc string functions are hand-written in assembly for each CPU variant that has a new feature it can use. Like SSE, AVX, AVX512, etc. Likewise for non-x86 processors.
23. Re:C versus Assembly Language by JBMcB · 2015-01-02 18:13 · Score: 1
  
  Not to mention Intel Performance Primitives which is basically a library of ASM blobs tailored for various combinations of sse/avx - even cache and pipes, I believe.
  
  --
  My Other Computer Is A Data General Nova III.
24. Re:C versus Assembly Language by Tough+Love · 2015-01-02 18:20 · Score: 1
  
  We weren't talking "general programming" here.
  
  --
  When all you have is a hammer, every problem starts to look like a thumb.
25. Re:C versus Assembly Language by Tough+Love · 2015-01-02 18:43 · Score: 1
  
  If you are unsure and have the time then write it both ways and compare them, no need to guess. If the assembly version wins then make it a compile option, keeping the C code for regression testing and to allow for compiler improvements over time.
  
  --
  When all you have is a hammer, every problem starts to look like a thumb.
26. Re:C versus Assembly Language by Tough+Love · 2015-01-02 18:46 · Score: 2
  
  If you're doing it on your own time then muck around with anything you want. If you win you are a hero and if fail you got some experience.
  
  --
  When all you have is a hammer, every problem starts to look like a thumb.
27. Re:C versus Assembly Language by Tough+Love · 2015-01-02 19:03 · Score: 1
  
  On that note, I am pretty sure the optimizing was done in C in this case. The art is in the careful analysis of precision.
  
  --
  When all you have is a hammer, every problem starts to look like a thumb.
28. Re: C versus Assembly Language by electrosoccertux · 2015-01-02 19:23 · Score: 1
  
  shouldn't it be possible to automate the optimization based on architecture constants? number registers, execution width, matrix dimensions, and depth of ooperation? I feel llike if, someone with MLA experience profiled it they'd have no trouble dimensionality reducing and coming up with some tunables.
29. Re:C versus Assembly Language by Bruce+Dawson · 2015-01-02 20:15 · Score: 1
  
  High-precision math is an excellent time to use assembly language. Assembly languages generally have a way to express ideas like a 32x32->64-bit multiply (and 64x64->128-bit multiply), and add-with-carry. High-level languages generally support neither of those options directly. To tell the compiler that you want a 32x32->64-bit multiply you generally have to have two 32-bit inputs, then cast one of them to 64-bit, and hope that the compiler doesn't actually generate a 64x64 multiply.
  For 64x64->128-bit multiplies the problem is more difficult because many languages don't have a 128-bit type, and yet these multiplies are crucial for getting maximal multi-precision performance on x64.
  Without access to the carry flag a programmer in a high-level language has to do things like:
  a0 += b0;
  if (a0 https://randomascii.wordpress....
30. Re:C versus Assembly Language by mwvdlee · 2015-01-02 20:58 · Score: 2
  
  Compilers can't really compete with experts on performance, given enough time.
  This.
  The thing is that the vast majority of us aren't assembly experts.
  Compilers kick my ass except for the rarest of highly specific cases (all of them SIMD).
  
  --
  Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
31. Re: C versus Assembly Language by YoungManKlaus · 2015-01-02 21:11 · Score: 1
  
  ...which one might call insane in itself ;)
32. Re:C versus Assembly Language by serviscope_minor · 2015-01-02 21:39 · Score: 1
  
  There's nothing rare about it. SIMD vectorization is useful in tons of applications.
  That's got little to do with ASM: GCC and others offer compiler intrinsics so you can access the vector instructions from C or C++.
  
  --
  SJW n. One who posts facts.
33. Re:C versus Assembly Language by Eunuchswear · 2015-01-02 21:44 · Score: 1
  
  That's a very interesting example, if you look at the code (yay free software!) you'll see that what makes glibc memcpy & pals fast is being careful about the algorithms.
  For example strchr runs about 4-8 times faster in C than a naive assembly implimentation.
  
  --
  Watch this Heartland Institute video
34. Re: C versus Assembly Language by Eunuchswear · 2015-01-02 21:49 · Score: 1
  
  No they're not. They're written in C. There are assembly versions where that is useful.
  
  --
  Watch this Heartland Institute video
35. Re:C versus Assembly Language by TheRaven64 · 2015-01-02 21:52 · Score: 4, Informative
  
  And he missed the second step:
  File a bug report for the compiler with 'missed optimization opportunity' in the title and a reduced test case.
  We like to see real-world examples of where we're generating bad code - if we don't see them, we can't fix them.
  
  --
  I am TheRaven on Soylent News
36. Re:C versus Assembly Language by TheRaven64 · 2015-01-02 21:57 · Score: 1
  
  There are two other important caveats:
  The first is that you need to keep testing. The next version of the C compiler may generate better code than your assembly. Or the version after. Optimisers keep improving (hopefully - occasionally they get worse, but if you want to avoid that then contribute test cases to the performance regression suite), but your assembly won't.
  The second is that performance for this kind of code can be highly microarchitecture dependent. One of the FreeBSD devs did some analysis of a variety of different SSE / AVX techniques for speeding up things like memset / memcpy. It turns out that across the last two generations of Intel chips you get very different performance for the different instruction combinations and there's no clear winner that's better on all. For Centaur chips, the best thing apparently is to have a simple byte copy loop aligned to fit into a fetch granule and the microcode will recognise the idiom and do a cache-line-at-a-time copy...
  
  --
  I am TheRaven on Soylent News
37. Re:C versus Assembly Language by sjames · 2015-01-02 22:01 · Score: 1
  
  At one time it was absolutely true. Nothing could beat hand assembly. These days, the compiler can do a better job MOST of the time and asm is only used for specialized implementation of low level functions like locking that C wasn't designed to handle.
38. Re:C versus Assembly Language by gnasher719 · 2015-01-03 00:54 · Score: 1
  
  I have on several occassions taken assembler code and sped it up significantly by turning it into normal high level language code.
  
  To make an algorithm fast, you need to improve the algorithm (so the number of operations that the algorithm needs to perform are minimised), and improve the number of operations per time unit that the algorithm can perform. Assembler _may_ help with the second task. However, writing asssembler code is much much more time consuming, so in a fixed amount of development time, much less time is spent on improving the algorithm itself. And anything that I ever looked at that was really time critical, I managed to get huge improvements through improved algorithms which I would never have managed had I written assembler code.
39. Re:C versus Assembly Language by gnasher719 · 2015-01-03 00:55 · Score: 1
  
  There's nothing rare about it. SIMD vectorization is useful in tons of applications.
  Vectorization is not the same as assembler. You can do vectorization just fine in a high level language. Actually, SSE3 is an absolute pain in the ass in assembler and trying to use it in assembler is absolutely misguided.
40. Re:C versus Assembly Language by gnasher719 · 2015-01-03 01:00 · Score: 1
  
  High-precision math is an excellent time to use assembly language. Assembly languages generally have a way to express ideas like a 32x32->64-bit multiply (and 64x64->128-bit multiply), and add-with-carry. High-level languages generally support neither of those options directly. To tell the compiler that you want a 32x32->64-bit multiply you generally have to have two 32-bit inputs, then cast one of them to 64-bit, and hope that the compiler doesn't actually generate a 64x64 multiply.
  That has nothing to do with "hope". Clang and GCC do that without any problems. Clang and GCC also support 128 bit integers on 64 bit processors and do the same there. Worst case, you build some inline assembler functions and use them as building blocks. Then you can do anything in high level language, and leave all the boring details to the compiler.
41. Re:C versus Assembly Language by gnasher719 · 2015-01-03 01:04 · Score: 1
  
  Regarding a case like your memcpy example. An assembly version may still be warranted. A piece of software may need to optimize for the low end computers out there.
  
  MacOS X has memcpy built into the operating system. At boot time, the OS determines the processor and copies a memcpy (and memmove, and memset) function optimised for that processor to a fixed memory location. If you copy megabytes, you will see it doing _very_ interesting stuff with cache prefetches and so on. C++ doesn't use these, unfortunately.
42. Re:C versus Assembly Language by Tailhook · 2015-01-03 01:08 · Score: 1
  
  Yes. Sometimes to get maximum possible performance you need to use assembly. The subset of professional programmers that can actually achieve significant performance gains using assembly in a reasonable amount of time is very small. Employers willing and capable of funding that sort of work is also quite small.
  Most often programmers resort to assembly because some hardware device or processor mode has to be dealt with and the compiler can't output the necessary instructions. That sort of problem is typically easier and involves less knowledge than trying to outperform compiler optimized code.
  I've done it twice in 19 years as a programmer. Once in the late 90's to reach some special processor instructions for which the compiler had no support, and again a couple years ago to write an x86 real mode BIOS extension for an embedded system.
  
  --
  Maw! Fire up the karma burner!
43. Re:C versus Assembly Language by cheesybagel · 2015-01-03 01:50 · Score: 1
  
  You are better off programming those parts of the code in OpenCL or OpenACC. This way the code is instruction set independent.
44. Re:C versus Assembly Language by petermgreen · 2015-01-03 04:21 · Score: 1
  
  What arcitectures have you encountered the problems you describe on?
  
  --
  note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
45. Re:C versus Assembly Language by K.+S.+Kyosuke · 2015-01-03 04:26 · Score: 1
  
  HSA should remove most of the latency. Having said that, glibc functions might not be a good case for it anyway.
  
  --
  Ezekiel 23:20
46. Re:C versus Assembly Language by Bruce+Dawson · 2015-01-03 05:38 · Score: 1
  
  What about the add with carry? That's the particularly hairy bit. Even if clang/gcc/VC++ recognize the pattern and turn it into optimal code, add with carry is a case where assembly language is cleaner and more elegant than the equivalent high-level language code.
  I'm not a fan of inline assembler because it often gives you the worst of both worlds -- incomplete control over code generation, and worse syntactic messiness than pure assembly language. But yes, a mixture of C++ and assembly is definitely the right solution, either inline assembly or a single separate function to do the messy math.
47. Re:C versus Assembly Language by lsatenstein · 2015-01-03 07:25 · Score: 1
  
  And he missed the second step:
  File a bug report for the compiler with 'missed optimization opportunity' in the title and a reduced test case.
  We like to see real-world examples of where we're generating bad code - if we don't see them, we can't fix them.
  How do you define bad code? I define it as producing erroneous results. But if your in a compute bound application, and the cost to upgrade hardware looks formidable, then you look at optimization. In my world, I look at optimization first, and hardware upgrades as second choice.
  There are sure hours and hours of test cases for the work was done for optimization. The author looked at employing math factoring, not in designing new hardware. Great!!!
  Well done!
  I promise to not store and count with my integer fields in display format.
  
  --
  Leslie Satenstein Montreal Quebec Canada
48. Re:C versus Assembly Language by Bite+The+Pillow · 2015-01-04 16:15 · Score: 1
  
  You've gotten sidetracked by the replies. The point here is that some math guy has noticed that recent developments in maths were not included in open source code.
  Open source contributors are often not math majors, and that's okay. When you have someone who can understand the recent findings, and translate that into an algorithm, the speedups can be a lot faster than simply switching to a lower level language.
  That's what this story is about. If you want to start an ASM flamewar, well you did, but it's best to do so in its own thread rather than hijacking the first post. Do better next time.
  Meanwhile, it is true that sometimes you have to go the ASM way to get a speedup. Whether it is significant, or whether it is needed, or all of that junk will have been argued extensively by now. "Sometimes" is a very slippery word. No matter how intelligently someone may argue that computers do just fine without optimization, sometimes a person can go beyond what the computer, which was programmed by a person, can do.
49. Re:C versus Assembly Language by Anonymous Coward · 2015-01-04 17:15 · Score: 1
  
  I can typically beat gcc/icc/etc. by a factor of two. Sometimes not, but often enough to matter.
  It often involves things that you just can't express in C/C++. Like when it's safe to assume buffers can be read past the end, because all of the input data has padding. Another very common pattern is to use SIMD where compiler sees no such opportunity.
50. Re:C versus Assembly Language by cheesybagel · 2015-01-05 05:26 · Score: 1
  
  Have you ever considered that if it doesn't take milliseconds to do and isn't user perceptible you are wasting developer resources and time in something that is not a performance problem?
  As someone else said compilers optimize scalar code well enough as it is.
51. Re:C versus Assembly Language by gangien · 2015-01-05 11:28 · Score: 1
  
  glibc isn't general programming? ok then.
52. Re:C versus Assembly Language by hermitdev · 2015-01-05 14:18 · Score: 1
  
  But yes, a mixture of C++ and assembly is definitely the right solution, either inline assembly or a single separate function to do the messy math.
  I disagree with nothing you've said here, or in your initial reply to me. One word of caution: I'd discourage mixing inline assembly with "regular" C/C++ code. Most compilers (last I've looked) immediately bail on attempting to optimize any function that contains inline assembly. I honestly don't know if you have an inline asm function how that plays in - it's just not something I deal with regularly. Last time I wrote/modified any inline assembly was to correct a x86->x64 compatibility problem roughly 6 years ago.
53. Re:C versus Assembly Language by hermitdev · 2015-01-05 14:26 · Score: 1
  
  If you have the time to write the algorithm in assembly, then you have the time to write it in the high level language (I'd imagine for anything but the most trivial of operations, you're probably looking at a minimum of 5x-10x longer to write and verify the algorithm by hand in assembly vs C or C++. I'd also wager my estimate may be an order of magnitude or more low. I'd also expect a large variance in the extra time relative to the experience and familiarity of the architecture of the developer/engineer doing the inline assembly.). The additional benefit of this, as you mentioned, but I failed to: is that you have it there for regression/comparison (assuming, of course, the compiler is generating functionally correct assembly, to begin with). Additionally, having both the assembly & high level implementations affords you the ability to quickly, if sub-optimally, support new architectures.
54. Re:C versus Assembly Language by hermitdev · 2015-01-05 14:39 · Score: 1
  
  And he missed the second step:
  File a bug report for the compiler with 'missed optimization opportunity' in the title and a reduced test case.
  We like to see real-world examples of where we're generating bad code - if we don't see them, we can't fix them.
  Definitely correct, and an oversight on my part. To my part, I've not first-hand observed a compiler generating bad or erroneous code, but was more thinking back to Linus's recent tirade about GCC 4.8. When I have seen the compiler behaving badly, it's usually been in the form of internal compiler errors. When I do encounter these, I do as much research as I reasonably can, and try and distill a minimal reproduction. Last such ICE I saw was in GCC 4.1.2 a few years ago (yes, my employer at the time was that far behind). The Ops team wanted to push a "small" update to my dev server, which I think was running Rhel 5.4 at the time to fix a completely unrelated issue. After the update, I noticed it also updated GCC to a newer point release of 4.1.2 and an unmodified code base started to fail with an ICE. A bit of investigation narrowed it down to, in C++ code, defining an anonymous namespace at the global scope (perfectly legal, but caused an ICE in that particular GCC version). Turns out it had already been reported, fixed and already released in a later version, so my Ops team had very little work to do to fix it.
55. Re:C versus Assembly Language by hermitdev · 2015-01-05 14:47 · Score: 1
  
  I, personally, define bad code (in this context), from the perspective of code generated by the compiler, as code that does not perform in an expected manner as dictated by the higher-level language lacking any sort of unspecified or undefined behavior failing to produce the expected result. In a most ridiculous example, if I had int i = 3 * 7; printf("%d", i); and anything other than 21 was output, there's some bad code present. Suboptimal does not mean bad or erroneous, if it produces the correct result. Suboptimal but correct would be a target for manual inline assembly optimization, if it is sufficiently inefficient. Bad with incorrect result might also be a target for manual assembly if a resolution from the compiler vendor could not be completed, or would otherwise be unfeasible.
56. Re:C versus Assembly Language by Bruce+Dawson · 2015-01-05 15:45 · Score: 1
  
  I've only used inline assembly in VC++ and, as you observe, it usually disoptimizes the rest of the function. I don't know if gcc/clang handle it better.
57. Re:C versus Assembly Language by Tough+Love · 2015-01-05 15:59 · Score: 1
  
  Correct, developing glibc is the domain of specialists who devote years to gaining the necessary skills, not just any random walk-in-off-the-street general programmer. Try reading any part of libc and see if you can guess what your chance would be to get a patch accepted at your current skill level.
  
  --
  When all you have is a hammer, every problem starts to look like a thumb.
58. Re:C versus Assembly Language by TheRaven64 · 2015-01-05 21:36 · Score: 1
  
  Bad in this context means (as my post said) a missed optimisation opportunity. Compilers should be doing every optimisation that can be done mechanically, programmers should be doing the ones that require knowledge of the problem domain, data, and so on. If a particular bit of code generates redundant operations, or does not manage to pattern match to the optimal instruction sequence, then we want to know. This is part of the reason that superoptimisers exist. If we don't have a set of test cases then it's hard to measure improvement (and no, SPEC is not even remotely representative of real code, but we use it anyway).
  
  --
  I am TheRaven on Soylent News
59. Re:C versus Assembly Language by ChrisMaple · 2015-01-06 14:06 · Score: 1
  
  My experience has been that code that should be vectorized has to be written in a way that makes it easy for the compiler to recognize, otherwise it won't happen. That requires either experience or luck. Most authors will be better off using optimized libraries available for the most common functions.
  
  --
  Contribute to civilization: ari.aynrand.org/donate
60. Re:C versus Assembly Language by lsatenstein · 2015-01-14 10:37 · Score: 1
  
  I programmed in Assembly Language. Over small highly repetitive sections of code, I could beat every C compiler around. However, when my assembly code for the entire program was compiled and tested, the C program proved to be faster. The reason is that the C compiler could do global optimization, guided by the user's wishes. Example of optimization for speed was loop elimination, other optimisation -- common code fragment reuse, and more.
  I tended to look at the next bottleneck and tackle that delay in assembly, not always standardizing on using specific cpu registers, and I was sometimes guilty of not looking seriously to determine of there was a better way to solve the business problem. (Generous definition of business to mean any processing problem.)
  
  --
  Leslie Satenstein Montreal Quebec Canada
why ? by Anonymous Coward · 2015-01-02 10:09 · Score: 5, Funny

Who needs glibc anymore ? we have systemd now.
1. Re:why ? by smittyoneeach · 2015-01-02 11:15 · Score: 4, Funny
  
  Was that an African or European velociraptor?
  
  --
  Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
2. Re: why ? by electrosoccertux · 2015-01-02 19:26 · Score: 1
  
  if dungeons can explain why this funny. something about d being next letter after c?
3. Re: why ? by Eunuchswear · 2015-01-02 21:51 · Score: 1
  
  No, the next letter after C is P.
  
  --
  Watch this Heartland Institute video
excellent by buddyglass · 2015-01-02 10:11 · Score: 3, Interesting

I love reading stories like this. I'd love it if a concerted effort were made to further optimize various non-FP glibc methods as well. I assume they're already fairly well optimized, but I would be shocked if there weren't still some gains to be made. +10 internets to you, Mr./Ms. Poyarekar.
1. Re:excellent by geezer+nerd · 2015-01-02 11:10 · Score: 1
  
  Given how long these libraries have existed, I am surprised there are still opportunities for improvement such as that described in the TFA.
  
  Back in the mid-80s I was involved in the design of a "mini Cray" supercomputer. We did not yet have any hardware to run on, but we did have a software simulator, and we wanted to publish some "whetstone" numbers. We got some numbers, were not too happy with them, and really dug in to analyze what we could do to improve them. The Whetstone code was in C, and used a fair number of library functions to both accomplish the numerical results and the preparation of the Whetstone answer. It turned out that most of the time was being spent in string copy and string compare functions from the library. We concentrated our efforts on redoing those library routines in assembly to take advantage of as many register-to-register operations and multiple-byte operations as we could. Although the Whetstone benchmark was supposed to measure numerical performance, our results showed that the numerical calculations took up little of the time.
  
  Sadly, our "mini Cray" never saw the light of day. The mid-80s were a tough time to stand out in that arena as there were so many people trying to do the same thing.
2. Re:excellent by buddyglass · 2015-01-02 12:03 · Score: 1
  
  This is just one data point, but I recently compared the BSD and GNU qsort() code. Built both with LLVM (on a Mac), tried both -O2 and -O3, and did some crude benchmarking. IIRC the BSD code was slightly more efficient, not to mention drastically simpler. BSD basically uses the Bentley & McIlroy code from "Engineering a Sort Function" with a few small optimizations to make it more adaptive (and exploit the fact that the algorithm is tail-recursive). So the BSD version is still recursive, whereas GNU is non-recursive but manages its own stack.
3. Re:excellent by sshir · 2015-01-02 13:27 · Score: 1
  
  Recursion makes for elegant code, but in production code should be avoided. At least when you have no control of the input data. The reason is that there is a fairly tight restriction on stack size (you have to explicitly tell your OS that you need large stack). As a result you will coredump if your input is the right kind of nasty.
4. Re:excellent by hublan · 2015-01-02 13:48 · Score: 2
  
  I was shocked to find how poor the performance of expf() was compared to exp() in glibc. Turns out that in a handful of functions, they are changing the rounding mode of the FPU, which flushes the entire FPU state, obliterating performance. After switching to a different version -- from another library -- that didn't change rounding modes, performance was back on par.
  It's perfectly understandable why rounding mode changes are necessary, since the FPU can be in any rounding mode coming in, and some guarantees are required, but they should really provide variations that do not do this. I truly hope the new implementation avoids it altogether, otherwise we're back on square one.
  
  --
  My spoon is too big.
5. Re:excellent by buddyglass · 2015-01-02 14:41 · Score: 1
  
  Agreed. It would be interesting to know the smallest number of crafted sort elements needed to cause BSD's qsort() to exceed a "normal" stack size, whatever that happens to be. If an array with that number of elements and a reasonable element size (probably sizeof(void*) since elements are almost always pointers) will always be larger than the largest system memory size then, in theory, we wouldn't need to worry about exceeding the available call stack.
  
  It may be the case, though, that it is possible to fit a crafted sequence of keys into a reasonable amount of space that will cause BSD qsort() to exceed a "normal" stack size.
6. Re:excellent by sshir · 2015-01-02 14:57 · Score: 1
  
  WTF?! We're talking about general recursion, not some stupid, easily avoidable cases. Even in the case of qsort - you need to make 2 (not one) recursive calls one for left and one for right, and you need to keep the midpoint somewhere on the stack... Basically it's easier to roll your own, totally heap based accounting than try to be clever with stack frames. Plus GP explicitly said that there is recursion and your c compiler does not do tail recursion anyway (or you can't count on it)
7. Re:excellent by sshir · 2015-01-02 15:11 · Score: 1
  
  If the algorithm is made very unlucky with selection of the pivot, then it will need one stack frame per each 2 data elements. So yeah, I think it's doable (if pivot selection is stupid in some predictable way)
8. Re: excellent by debiansid · 2015-01-02 15:13 · Score: 3, Informative
  
  There are patches in review on the glibc mailing list for this. Hopefully we will have support for vector math entry points by 2.22.
9. Re:excellent by BitZtream · 2015-01-02 15:16 · Score: 1
  
  I assume they're already fairly well optimized
  You assume wrong. glibc is pretty poorly optimized in reality for most things. It works on god knows how many systems and is generally consistent, but optimization is not its strong point.
  If speed is your concern and you have a choice between Linux with glibc and commercial OS, you pick the commercial OS and its libraries almost every time. I can't actually think of a case where this doesn't apply.
  It could be a lot worse, but its certainly not as fast as it could be just about any where. I'm fairly certain that the point is not raw speed though.
  
  --
  Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
10. Re:excellent by buddyglass · 2015-01-02 15:36 · Score: 1
  
  They use the method described in Bentley & McIlroy's paper, which uses a "median of medians" for large arrays, a simple median of first/last/middle for mid-sized arrays and the mid-point for small arrays. See page 1255 here. No idea how effective that is. They also short-circuit to insertion sort when N 7 for a given sort window.
11. Re:excellent by buddyglass · 2015-01-02 15:37 · Score: 1
  
  Slasdot ate my less-than symbol. They short-circuit to insertion when N < 7.
12. Re:excellent by sshir · 2015-01-02 15:48 · Score: 1
  
  Very deterministic. Thus, easy to screw with. Just populate sample points with that worst case pivot value and voilà. Well... it needs to be done for all consecutive rounds so it's not completely trivial but doable.
13. Re:excellent by buddyglass · 2015-01-02 15:58 · Score: 1
  
  It should be noted that since both BSD and GNU versions rely on a stack (and neither chooses its pivot point randomly, so far as I know) they're both vulnerable to crafted data that's designed to cause the algorithm to exhaust its stack. It's possible GNU dynamically grows its custom stack when it nears the point of exhaustion, but I kind of doubt it.
14. Re:excellent by sshir · 2015-01-02 16:04 · Score: 1
  
  I would guess that that "custom stack" lives in the heap, so in that case the limitation on program's stack size does not apply.
15. Re:excellent by buddyglass · 2015-01-02 17:12 · Score: 1
  
  It looks like GNU actually allocates its custom stack *on the stack*. However, in the comments they claim their algorithm guarantees that the stack need only be as large as log(sort elements), so they allocate a stack that is (in theory) guaranteed to be big enough. You can check out the code here. I'm not sure what version of glibc that's from, but it looks similar to what I extracted from 2.10 a while back.
16. Re:excellent by angel'o'sphere · 2015-01-02 17:40 · Score: 1
  
  As the algorithm is "tail recursive" it is save to assume the compiler will apply "tail recursion optimization", which means it converts the recursive call into a loop.
  
  --
  Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
17. Re: excellent by electrosoccertux · 2015-01-02 19:28 · Score: 1
  
  so how much better was it?
18. Re:excellent by Tough+Love · 2015-01-02 19:49 · Score: 1
  
  Obnoxious GP happened to be right in this case, provided that the recursion is always applied to the smallest partition first, leaving the potentially much deeper recursion on the other partition for the tail, and provided that the compiler manages to detect the tail regression opportunity, which would need to be verified.
  
  --
  When all you have is a hammer, every problem starts to look like a thumb.
19. Re:excellent by tlhIngan · 2015-01-02 20:12 · Score: 4, Informative
  
  WTF?! We're talking about general recursion, not some stupid, easily avoidable cases. Even in the case of qsort - you need to make 2 (not one) recursive calls one for left and one for right, and you need to keep the midpoint somewhere on the stack... Basically it's easier to roll your own, totally heap based accounting than try to be clever with stack frames. Plus GP explicitly said that there is recursion and your c compiler does not do tail recursion anyway (or you can't count on it)
  No, you can do quicksort in one recursive call. Even in the two recursive call scenario, an optimizing compiler will turn it into one recursive call and a loop - it's such a basic operation it's called tail recursion.
  Basically if you have a recursive function call at the end of the function, instead of making it a recursive call, you can reuse the current stack frame by readjusting the input parameters (on the stack) and jumping to the beginning of the function, saving yourself the headache of setting up and tearing down a stack frame because you're reusing the current stack frame. (or in other words, you're doing a loop).
  In fact, this can be generalized into tall call optimization where if you call another function at the end of a function, the optimizing compiler will reuse the stack frame of the current function for the call instead.
  Syntactically the code looks the same, but the output is vastly different because you get operations that rewrite the stack frame (a particularly smart compiler might actually put the parameters of the call right where they need to be by moving the arguments around so the final two instructions is a stack adjustment to compensate for different stack sizes and a direct jump, so when the tail function returns, it doesn't stop back at the calling function, but goes back to the original function.
  So yes, you're right in that there's supposed to be two recursive calls in quicksort, but in practice, there's only one because the last one is always tail-recursive so compilers merely reuse the existing stack frame.
20. Re:excellent by TheRaven64 · 2015-01-02 22:10 · Score: 1
  
  Back when I was doing my PhD, the prevailing instruction to undergraduates was to favour iteration over recursion because it's faster. One of my colleagues rewrote some code to a single loop rather than recursive calls and found that on Linux it got slower, on Windows it stayed the same. Looking at the assembly, it turned out that managing his own stack on the stack generated significantly worse code with GCC than just using the call stack. On Windows, it was even more interesting: the Microsoft compiler was able to recognise the stack pattern and turn it into a recursive function, so his rewrite ended up generating the same code.
  On the stack size issue, it's true that this can be a problem in embedded contexts. For a randomised quicksort implementation, the stack depth needed should be O(log(n)), so unless you're sorting insanely large things it shouldn't be an issue. The default stack size for FreeBSD is 1MB * sizeof(void*), so there's a fair bit of space.
  
  --
  I am TheRaven on Soylent News
21. Re:excellent by TheRaven64 · 2015-01-02 22:14 · Score: 2
  I don't know what the situation is like in GNU land, but in FreeBSD libc, we basically have two guys who care about libm (the library that implements the math.h stuff). One is a retired mathematics professor, I'm not sure what the background of the other is. There are a few other people who occasionally contribute to it, but no one else really wants to change it because floating point code is really hard to understand. You need to find someone who:
  
  Is obsessive compulsive and likes optimising code.
  
  Has a really strong background in this kind of numerical algorithms.
  
  Has the patience to write and run correctness tests for corner cases as well as performance benchmarks.
  
  Has free time.
  
  This combination is pretty rare (if you think it describes you, the freebsd-numerics list would love you hear from you!).
  --
  I am TheRaven on Soylent News
22. Re:excellent by TheRaven64 · 2015-01-02 22:16 · Score: 1
  
  For those cases you probably want to be using the builtin, which supports vector operands of arbitrary sizes. On some CPUs, it will generate a single instruction, for others it will do the extract and call the scalar version multiple times. A lot of the functions in math.h have data-dependent flow control, so probably aren't that amenable to vectorisation.
  
  --
  I am TheRaven on Soylent News
23. Re:excellent by sshir · 2015-01-02 23:33 · Score: 1
  
  Sure, but that would also mean that debug compiled code, optimized code, ported code - all would behave too differently. Which, again, not a sound engineering practice.
24. Re:excellent by sshir · 2015-01-02 23:42 · Score: 1
  
  Yes, but they are clever about what and when they put on that stack. In their case it's indeed no more than log base 2 frames (non overlapping intervals with biggest one larger than the rest combined with the "rest" following the same pattern)
25. Re:excellent by sshir · 2015-01-02 23:47 · Score: 1
  
  On stack size issue - the point was about a case of malicious/pathological input.
26. Re:excellent by sshir · 2015-01-03 00:40 · Score: 1
  
  That was my point - this way you eat only one recursive call, not two. And in general case (or naive qsort implementation) one recursive call is enough to burn (on bad input) through all your allowed stack.
27. Re:excellent by gnasher719 · 2015-01-03 01:12 · Score: 2
  
  It's quite trivial in the quicksort algorithm to limit the stack size. Split the range you are sorting using a pivot element, then push the larger range on the stack and start again with the smaller range. So the new range is always half the size or less of the original range, and the stack depth is at most log2 (n).
28. Re:excellent by buddyglass · 2015-01-03 03:24 · Score: 1
  
  It currently doesn't, but BSD's code could make the same guarantee, right? At the point where it recurses left and right, compare the length of the two sub-arrays. Sort the shorter one first via a recursive call, then take advantage of the tail recursion and sort the larger one by overwriting the current stack frame and jumping back to the top of the function. They already go halfway (taking advantage of the tail recursion); they just don't use any heuristic to choose which sub-array gets sorted via the recursive call. It's always the "left" side.
29. Re: excellent by geezer+nerd · 2015-01-03 17:08 · Score: 1
  
  It was substantially better, but you really cannot expect me to remember details after 30 years! At the time, I was most struck by how many people would post results deriving from C-coded benchmark programs where the thing actually measured was different from what was purported to be measured, largely from ignorance.
How much benefit? by scheme · 2015-01-02 10:36 · Score: 3, Insightful

It looks like the slowest paths of the transcendental functions were improved by a lot. But how often do these paths get used? The article doesn't say so the performance benefits may be insignificant.

--
"When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
1. Re:How much benefit? by Baloroth · 2015-01-02 10:49 · Score: 4, Informative
  
  It looks like the slowest paths of the transcendental functions were improved by a lot. But how often do these paths get used? The article doesn't say so the performance benefits may be insignificant.
  From TFA, it sounds like the functions in question (pow() and exp()) work by first looking at a look-up table/polynomial approximation technique to see if the function can use that to get an accurate-enough value, then running through a series of multiplications to calculate the result if the table/approximation method wouldn't be accurate enough. Since this work improved the actual calculation part, my guess is that it will improve quite a few cases. TFA does say the lookup table method works in the "majority of cases", though it doesn't say exactly how big a majority, so it's hard to say exactly.
  
  --
  "None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
2. Re:How much benefit? by marcovje · 2015-01-02 11:28 · Score: 1
  
  It seems it's the MPI functions only that are subject to this optimization, not the normal FP ones. My guess is that these aren't even used for soft floating point on FPU limited systems (since adding the exact size of the a FP type allows writing more efficient routines).
  So the impact is very low I guess.
3. Re:How much benefit? by Bananenrepublik · 2015-01-02 11:48 · Score: 2
  
  The slow paths are "several thousand times" slower, according to the article. You only need to hit them rarely to see a significant degradation of performance.
  I was recently dealing with a code that spent most of its time in pow(). Some basic algebra significantly simplified the math, and I could do away with the call to the function, but this shows its performance is a real-life concern for some people.
4. Re:How much benefit? by thogard · 2015-01-02 14:51 · Score: 1
  
  When I was in high school we had a FORTRAN class and one of the assignments was print out as many Pythagorean triples as you can in the allowed 1 minute of run time. Most students would start with power and square root function which would provide about a page and a half of results of which one was wrong because of rounding errors. Going from A^2 to A*A would get you far more pages. The system had a multiply-accumulate function that worked very well so a few changes in a formula could double the number of results.
5. Re:How much benefit? by voights · 2015-01-02 17:28 · Score: 2
  
  Wrong. Normal floating point functions (exp/log/etc) need to fallback to using multiple precision calculations to ensure that the result is perfectly accurate. Much works is done to avoid this having to happen often, but sometimes it is necessary.
6. Re:How much benefit? by Tough+Love · 2015-01-02 19:55 · Score: 1
  
  Curiously, the Red hat dev did not comment on average case performance improvement, only on the slow path improvement. I initially missed that in a quick reading, as, I suspect, did many others.
  
  --
  When all you have is a hammer, every problem starts to look like a thumb.
7. Re:How much benefit? by petermgreen · 2015-01-03 04:36 · Score: 1
  
  The problem is to work out average performance you need to know how probable different inputs are. That probability will be very application dependent.
  
  --
  note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
8. Re:How much benefit? by Anonymous Coward · 2015-01-03 10:10 · Score: 1
  
  No calculations needed for that, just string concatenation:
  3, 4, 5
  30, 40, 50
  300, 400, 500
  and so on for a minute of runtime.
9. Re:How much benefit? by Tough+Love · 2015-01-03 13:26 · Score: 1
  
  You seem to argue for never doing any benchmarking of typical cases.
  
  --
  When all you have is a hammer, every problem starts to look like a thumb.
10. Re:How much benefit? by debiansid · 2015-01-04 05:04 · Score: 2
  
  Curiously, the Red hat dev did not comment on average case performance improvement, only on the slow path improvement. I initially missed that in a quick reading, as, I suspect, did many others.
  It is difficult to compute impact of this work on the average case because we don't know precisely how many of the inputs in the entire domain (i.e. all unique FP representations in an IEEE754 double) are serviced by the slow path. I wasn't able to get the libm code to go down the mp path for the log function after a week of throwing different inputs at it. pow on the other hand hit it fairly regularly in an hour. Even the 'fairly regular' is about 1 in a 1000 inputs, so that is not much. We know that it is low enough that not a lot of people have complained about it, just some (and not all) math intensive applications. Perhaps a study similar to the worst cases paper I cited (the link is wrong btw, I'm working on getting that fixed) will give a better idea.
  The other reason is that the slow path is literally thousands of times slower, so the impact it will have on the average case will depend greatly on the kinds of inputs an application uses. So in that context, talking about an average case with an arbitrary percentage of slow inputs (say 1%) would be cheating because it may show incorrectly large improvements.
  Finally, the point of my post was more about sharing the methods I used for improvements than the statistics. I am personally not satisfied with the statistics because there is a lot more that can be done.
  PS: I am the Red Hat dev who wrote the post (if it wasn't obvious) and the correct link to the paper is: http://perso.ens-lyon.fr/jean-...
11. Re:How much benefit? by debiansid · 2015-01-04 05:13 · Score: 2
  
  It looks like the slowest paths of the transcendental functions were improved by a lot. But how often do these paths get used? The article doesn't say so the performance benefits may be insignificant.
  They don't get used very often. I don't know how much though because that would require running the functions for the entire domain of inputs which would be years for the univariate functions and pretty much forever for pow. If you want anecdotal data then I have that: I didn't hit the slow path of log for the entire week I threw different inputs at it. pow hit quite a few in an hour of running but even that was about 1 in 1000 or less. However, since there is no pattern to determine which interval of inputs will have more slow path hitting inputs, you can never say that only 0.1% of the inputs use the slow paths. Consequently, I cannot really give an average case picture based on these improvements.
  That said (as I've mentioned elsewhere), my main motive of writing the blog post was to share the methods, which I believe are more interesting than the results themselves. The results are not satisfactory IMO since there is still a lot more work to be done.
12. Re:How much benefit? by Tough+Love · 2015-01-04 11:35 · Score: 1
  
  Indeed it was obvious. Your comments here are informative and would make a nice addition to the writeup.
  
  --
  When all you have is a hammer, every problem starts to look like a thumb.
Obligatory Carmack by Anonymous Coward · 2015-01-02 10:46 · Score: 1

Fast inverse square root
1. Re:Obligatory Carmack by Pseudonym · 2015-01-02 14:32 · Score: 1
  
  You mean obligatory Walsh. Of course, it's obsolete now that RSQRTSS is ubiquitous.
  
  --
  sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
Re:Let the games begin by Anonymous Coward · 2015-01-02 10:54 · Score: 1

Presumably I'm being clueless but I had tended to assume that these days most of the libc math functions were implemented in hardware - at least for desktop and server chip sets (I could imagine that certain simple embedded chips wouldn't have a full math instruction set). So unless your software is very inefficient at calling out to the hardware wouldn't most of the performance simply depend on the hardware?
Lookup tables are faster and more accurate by Anonymous Coward · 2015-01-02 10:57 · Score: 1

I'm not surprised to hear that the floating point operations are slow.
I've always found that it is best to avoid floating point operations as much as possible. If you want speed, use a lookup table or a couple of tables. With a few kilobytes of memory, you can get about 7-8 bits of precision--and that's 7-8 bits of accurate precision. I'll take 7-8 bits of accurate precision for 12 bits of garbage any day. I can usually get 10-20x speed increase over glibc.
There is a trade-off though. It does take up a few kilobytes of more memory. It's hardly an issue on most systems today though...
Granted, I haven't bothered to implement all the floating point operations with lookup tables, but I've done a few over the years. It is possible that some operations will not lend themselves well to look up tables, but I haven't found any of those yet.
1. Re:Lookup tables are faster and more accurate by Celarent+Darii · 2015-01-02 12:10 · Score: 4, Interesting
  
  What is perhaps a bit of irony of history, even for humans a lookup table is faster and more precise than manually calculating it via formula. That is why they published books of logarithms. Using interpolation you can even stretch out the precision to several more digits. With a table of values in memory you can also narrow down the inputs to Newton's method and calculate any differentiable function very quickly to an arbitrary precision. With some functions the linear approximation is so close that you can reduce it in just a few cycles.
  Even in most trigonometric functions there is a simple table upon which the angle addition formulas are used to get the other values[an old example].
  Given the size of most operating systems, where 8k of ram is hardly noticed (most gifs are larger than this), I am actually quite surprised that the lookup table method is not more used. It would seem one of the first things to put in cache on your ALU.
2. Re:Lookup tables are faster and more accurate by mprinkey · 2015-01-02 17:16 · Score: 2
  
  With a table of values in memory you can also narrow down the inputs to Newton's method and calculate any differentiable function very quickly to an arbitrary precision. With some functions the linear approximation is so close that you can reduce it in just a few cycles.
  No, you can't. I know this was done in Quake3 fastInvSqrt(), but that is the exception, not the rule in my experience. x = pow(a,b) is a differentiable function. How can you assemble a root function/Newton iteration to successively correct an initial guess for x to arbitrary precision--without actually calling pow() or other transcendental function? I have built Newton (and Halley and Householder) iterations to successively correct estimates for pow(a,b) when b is a particular rational number. You can re-arrange the root function to only have integer powers of the input a and of the solution value x, and those can be computed through successive multiplication. These can be fast, but they are certainly not useful when b is something other than a constant rational number. And even if the exponent value has only a few significant digits, the multiplication cascade starts to get expensive (that was the reason to use Halley/Householder because once you have f' calculated, f'' and f''' are almost free.)
  If you know otherwise, please let me know. My current fast pow() function leverages IEEE floating point formats and Chebyshev polynomial expansions to get reasonable results. If there is way to polish an approximate pow() result with Newton (or higher order) iteration, I would be happy to learn it.
3. Re:Lookup tables are faster and more accurate by Celarent+Darii · 2015-01-02 20:11 · Score: 2
  
  For pow(a,b), [a,b real numbers], you are essentially calculating:
  a^b = (e^log(a))^b) or pow(pow(e, log(a)), b) which is e^(b*log(a)) or pow(e, b*log(a)) where e is the base of the natural logarithm.
  What you have in your table are the values for e^x and log(x), like any good book of logarithms of ancient times. Precision according to your needs. For quick lookup you can even index the mantissa in a b-tree if your table is huge.
  Then it becomes very quick:
  step 1: look up log (a) in the table, interpolate if needed.
  step 2: calculate b * (value in last step).
  step 3: lookup up e^x where x is the value at step 2 in your table, interpolate if needed.
  step 4: profit! as you now have your result.
  And as a bonus, you are sure the result is within the precision of your table immediately, within the error of your interpolation.
  Note that interpolation for exp(x) is quite fast. There are some exotic methods out there as well for interpolating exp(x) and log(x), as per this abstract which are quite efficient if you need high precision. For 10 digit precision you could easily fit both your tables into 8k.
4. Re:Lookup tables are faster and more accurate by Celarent+Darii · 2015-01-02 20:31 · Score: 1
  
  Here is a nice book that we had in our school days illustrating these sorts of techniques
5. Re:Lookup tables are faster and more accurate by Celarent+Darii · 2015-01-03 00:03 · Score: 1
  
  And a nice pdf paper illustrating this technique and its merits
6. Re:Lookup tables are faster and more accurate by mprinkey · 2015-01-03 01:33 · Score: 2
  
  But that was not my question. I fully understand how to use lookup tables/Chebyshev expansions of exp(x) and ln(x) to implement pow(x,a)--I have implemented these many times. My question was specifically on your assertion that any differentiable function could be evaluated with as a Newton-style iterative correction and thus provide arbitrarily precise results. I asked specifically to see how that is accomplished for pow(). There is no corrective mechanism in the algorithm you have stated above. The precision you get is a function of the precision that you've baked into your lookup table--and then it become the space/accuracy trade-off. On desktop/server CPUs, that trade-off is more often than not won by Chebyshev expansions, especially in a world of SSE/AXV vector instructions.
  So, if there is a technique that does allow me to start with an initial guess x[0] of pow(a,b) and then create corrections of the form: x[i+1] = x[i] - f(x[i]) where f() uses only intrinsic math operations (+,-,*,/,etc) but not transcendentals, then I am quite anxious to see it.
7. Re:Lookup tables are faster and more accurate by Celarent+Darii · 2015-01-03 05:37 · Score: 1
  
  True, the table is a question of space/accuracy trade-off.
  The process of correcting is in the interpolation, which is why I included the additional links in the same thread.
  For instance in the simple manner of simple linear interpolation one can interpolate an arbitrary \epsilon between two table values. Repeating this brings us closer to a fixed point. The book that I linked to gives also many other ways of interpolation, as well as the article. It is this interpolation that is the manner of finding increased accuracy.
  The article I linked to states that the method that he employs shows a noticeable gain over recalculating the value [though it is not the same function in discussion]. As in most algorithms it depends a lot on what is more important - space or time.
Title correction by Dishwasha · 2015-01-02 11:00 · Score: 1, Informative

Red Hat Computer Scientist Improves Math Performance of Glibc
There, fixed that for you.
1. Re:Title correction by Anonymous Coward · 2015-01-02 11:16 · Score: 3, Insightful
  
  But having made an actual, real contribution to a piece of software in general use makes him more of an engineer than an unusually competent computer scientist.
Re: I'm surprised by Anonymous Coward · 2015-01-02 11:32 · Score: 2, Informative

Last time I checked, India was in Asia.
Don't need to be an expert to beat compilers ... by perpenso · 2015-01-02 12:13 · Score: 3, Insightful

The real problem is you need to be expert in the target processor(s) ...
Not really. Being an expert in assembly language in general may be required but not necessarily an expert on the target architecture. Transitioning from one target architecture to another is not like starting over from scratch. Part of becoming a good assembly language programmer is recognizing things in algorithms that can't quite be stated in a high level language, information or suggestions that can't be given to the compiler. Specific computer architectures provide the toolboxes to address such shortcomings. So a bit of the work is common regardless of the architecture and the rest is determining how to accomplish something with the architecture specific toolbox.

Circa 2000 I was beating PowerPC compilers on my first attempt. Now I had a lot of experience with x86 and some with 68K, 8051, 8048, 6502 and Z80. At the university I had taken undegrad and grad computer architecture classes, the later focused on Alpha. Before writing the "commercial" PowerPC code referred to earlier I spent a couple of weeks reading PowerPC manuals and writing little pieces of test code.

Being new to PowerPC I was concerned that I had missed something or done something wrong despite seriously beating the compiler. I was able to go to Apple and spend a couple days with their engineers. Their PowerPC people thought the assembly code was fine and couldn't really improve upon it, their compiler people couldn't improve the original C code.

That said, assembly language is unnecessary much of the time. Contrary to popular belief C code can be written in a manner that favors one architecture over another. By understanding the given architecture and its assembly language I've been able to rewrite C code and not have to go all the way to assembly. Having two C implementation of some code, one for x86 vs one for PowerPC, or more generally one for CISC vs one for RISC.
first the algorithm, then system, asm by raymorris · 2015-01-02 12:21 · Score: 1

ASM for certain inner loops can be a good sixth step in the optimization process. It only matters after the algorithm is optimized, then optimized again, then the system stuff is right (ie look at how the storage structures used by the program translate to actual physical operations).
Assembly more durable than you might think ... by perpenso · 2015-01-02 12:42 · Score: 1

sure thing, if you want to rewrite your code for every cpu architecture ...
Nope. I only need to write assembly for the architectures I care about, all other architectures can have the reference C code implementation.

... (and preferably also every generation of said architecture)
Probably not. Performance tuning may only be necessary for the older architectures in order to lower required/recommended system requirements so the potential market is larger.

Also assembly is often more durable, more long lived, than you might imagine. I have some math code that I wrote in assembly targeting the Pentium Pro (686) back in the day. Every once and a while I compare it against the reference C code when I have a new architecture and/or new compiler. Its still faster. The percentage improvement is getting smaller but its still a win, that original investment made in Pentium Pro days has not become counterproductive. YMMV.

And if the day arrives where the assembly becomes counterproductive then newer generations of an architecture of interest can just use the reference C code like the architectures one never cared about.
Re:Factorials for the taylor series? by ledow · 2015-01-02 13:00 · Score: 1

Factorials tend to cancel out nicely with other factorials.
And, also, 69! factorial is too big for most calculators - even a pre-calculated table could probably do a ton to speed things alone.
Re: Don't need to be an expert to beat compilers . by RLaager · 2015-01-02 13:12 · Score: 1

This is interesting. Do you have any examples?
Re:Yikes ! by Anonymous Coward · 2015-01-02 13:23 · Score: 1

What kind of horror story is this??
I couldn't think of a sane person put in a routine that first checking a look-up-table and then check the approximation to see if the result can be 'accurate enough' and after failing all that goes through a series of multiplications to obtain the final result
Not a horror story at all, just basic numerical analysis. You just described a fast, reliable and arbitrarily-accurate way to calculate transcendental functions. What's insane about that?
Re: Don't need to be an expert to beat compilers . by perpenso · 2015-01-02 14:22 · Score: 3, Informative

This is interesting. Do you have any examples?
Sorry, proprietary code of a past employer. It was in the domain of low level bitmapped graphics.

Its been quite a while but one method that I recall involved branch prediction. That if the condition to be used for a branch is known sufficiently far in advance then a branch on the PowerPC has no penalty. An inner loop had numerous branches whose conditions could be determined quite early. I recall doing so and storing things in extra condition registers. The result was that the inner loop no longer had any branch penalties. Myself and the Apple engineers just couldn't get the various compilers to do anything comparable.

This was just one of several things that I did but I don't really recall the others. Well, except that the rotate and mask instructions can be amazingly useful.
Re:Yikes ! by Pseudonym · 2015-01-02 14:29 · Score: 1

What kind of horror story is this??
Don't tell me that you've never written an algorithm which uses speculation before. It's quite a common scenario that you have a common fast path and an uncommon slow path, and the cost of deciding which path to use is a significant fraction of the cost of the fast path + checking the result.
In the case of libm, there are a lot of code paths which are there only to maintain strict compliance; no numeric analyst would ever call exp or pow (or even round) with arguments in that range. Not on purpose, anyway.

--
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
Re:Yikes ! by Anonymous Coward · 2015-01-02 14:51 · Score: 1

This is MPI code, not IEE754 code.
People using MPI are working with EXACTLY these kinds of numbers (extremely small, so small they'd denormalise even with quadmath/256bit IEEE754, recurring for so long you lose your tail - eg: calculating Pi, or so large your exponent bits aren't enough to even represent the number w/ any standard IEEE754 implementation).
The fact there's interest in optimising the MPI code paths for these functions and their slow paths is indicative of exactly this fact, people ARE indeed passing in many numbers outside of these speculative ranges, and ARE indeed interested in how fast it performs.
Always room for improvement by thogard · 2015-01-02 15:10 · Score: 3, Interesting

Newer hardware can make use of newer features which will change what should be considered the best optimisations. Addition used to be much faster than multiplication until they put barrel multipliers in chips. Once floating point cores were added, other things became faster but the early FPUs could do things like add and multiply and anything else could be very slow. I wrote a floating point library for OS9 for the radio shack color computer which had a 2 mhz 8 bit cpu with good 16 bit instructions and no floating point hardware and I could do trig and log functions faster than a 4.77 mhz 8087 floating point unit. I could use tricks like bit shifting and de-normalising floating point numbers for quick adds. There was one function that the typical Taylor series used a /3 + /5 + /7 type thing but there was an alternate that used /2 + /4 + /8 but took more steps but an integer CPU can divide by a power of 2 something like 50 times faster than dividing by an odd number so the doing the extra steps was faster. My library took advantage of precision shortcuts like simply not dealing with some of the low order bits when the precision was going to be lost in the next step or two which are things that you simply can't do efficiently with current floating point hardware.
Re: Don't need to be an expert to beat compilers . by Pinhedd · 2015-01-02 15:39 · Score: 3, Interesting

A good example that I ran into not too long ago was trying to get GCC to autovectorize some heavy matrix multiplication operations without using vector intrinsics. No matter how hard I tried and now matter how explicitly I forced the memory alignment (on x86 double quad-word loads into XMM registers require 16 byte alignment) and ensured that all operations were 128 bits wide (SSE codepath) or 256 bits wide (AVX codepath) GCC just couldn't figure it out on its own. I poured through the compiler output and manage to clear up a few ambiguous data dependencies but I just couldn't get it to autovectorize the main loop.
I ended up digging around in the compiled ASM and noted where GCC was failing to unroll and reorder enough for SLP to work properly. I rewrote a small chunk of it by hand and got the results that I expected but doing so for a large portion of the project would have been unreasonable and also would have bound the source to x86. Instead, I switched from GCC to ICC and ICC picked up the optimization right away. For shiggles I tried Clang/LLVM as well but had no luck with it either.
Re:How about Jumping the Cuda? by inflamed · 2015-01-02 16:06 · Score: 1

How about compiler being able to leverage Cuda hardware if found during execution? Might make for fatter binaries, but..... if it found GPU, it might help in some cases?
Well, the system of execution is often different from the system of compilation. So, you'd want to have the compiler-user specify that a CUDA-specific binary should be compiled alongside (afaik cuda/x86 fat binaries aren't a thing right now). And that falls into the realm of build system, not compiler. From what I've seen the build system (make in this case) relies on nvidia's compiler to builld the cuda libraries subsequent linking into the cuda executable along with the c* and fortran code.
Re:Common Core Forever! by Anonymous Coward · 2015-01-02 16:49 · Score: 1

Yes!
Re:Don't need to be an expert to beat compilers .. by angel'o'sphere · 2015-01-02 18:06 · Score: 4, Insightful

I once found an Bresenham line drawing algorithm written in Assembly (68k) on a Mac SE (black and white graphics).
That code was extremely complicated and the main mistake was that the author loaded each byte, manipulated it and rewrote it to memory each cycle.
Just to load in many cases the exact same byte again in the next cycle.
I immediately came to the idea to not load and store the bytes every cycle, and from that the jump to figure if you could also use words and long words depending on how steep the line was, was easy.
So I first wrote code in C that used chars, ints and longs depending on steepness and changed consecutive bits until it needed to access another "word". Each "word" was only loaded once into a register and was only written after all necessary bits where changed.
My C code rewrite was so fast (more than 100 times than the original assembly) that I never rewrote that in assembly itself.

--
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
Re:How about Jumping the Cuda? by JBMcB · 2015-01-02 18:18 · Score: 1

Depends on what you're doing. Sending stuff to the GPU incurs overhead. If you're doing a ton of matrix operations all in a row, it may be worth it. If you are doing a few every once and a while, probably not.

--
My Other Computer Is A Data General Nova III.
Re:Don't need to be an expert to beat compilers .. by ruir · 2015-01-02 19:13 · Score: 1

There is code and code. Often more than not, it is really the specifics of implementing that count (i.e. the code). I remember stumbling upon a integer circle drawing function for the Z80 on a magazine, and while building the code, I got like 4 or 5 sheets of A4 paper. I got so irritated with it I teared the sheets, and sure enough, in one hour, I had a substitute code that was no more than 10-15 lines, and much more efficient. I also remember at work having inherited code that had to read or write data to SQL, instead of using intermediary caches, gods know why. I shaved run times from 50 minutes to 5.
Re:Don't need to be an expert to beat compilers .. by perpenso · 2015-01-02 20:08 · Score: 1

Having a C implementation as a reference is a good thing. Even if the assembly code was well implemented on the original architecture the hot spots needing attention might be different on the second architecture. The C code is pretty useful for profiling and testing.

Also the algorithm is always the place to start. There is little point in optimizing the C code or going to assemble language until you are sure the algorithm is right. Fewer and wider memory accesses is a pretty good direction to explore as you found out. I think last time I touched something like a bresenham I also worked on eliminating branches, maybe some sneaky code to do
x = (condition) ? a : b
in branchless PowerPC code but its been quite a while and maybe that was for something else.

But yeah if your C code is fast enough why redo it in assembly? I like programming in assembly but I'm not a zealot. Although I do recommend taking something like the algorithm you mentioned all the way to assembly as a way to learn a given architecture. I also suggest revisiting the code as one learns more. Truly understanding the architecture helps one write better C code.
Re:Let the games begin by Bruce+Dawson · 2015-01-02 20:26 · Score: 1

Actually the trend is in the opposite direction -- fewer of the math functions are implemented in hardware than used to be. There are many reasons (optimized out-of-order CPUs and old/slow transcendental implementations) but one significant reason is that the new glibc math functions are generally correctly rounded -- exactly correct. Whereas the hardware versions are often not -- as I discussed in this recent blog post:
http://randomascii.wordpress.c...
Re:Let the games begin by TheRaven64 · 2015-01-02 22:21 · Score: 2

If you compile with -ffast-math, then they often will be. The problem is that the hardware implementations tend to fall into two categories. They're either very fast, and sometimes correct, or they're correct and very slow. The correct ones are normally microcoded, and if you've ever looked at CPU microcode then you'll know that it's an even worse thing to try to write complex algorithms in than assembly. You can generally do a better job with a C implementation that the compiler then optimises. The fast ones are very fast, but don't handle corner cases correctly (often the instructions predate the IEEE standard), so they're great for games and suchlike but not so good for scientific computing. GPUs are the worst offenders, often implementing things like trig functions as small lookup tables, leading to a lot of approximation.

--
I am TheRaven on Soylent News
Re:Don't need to be an expert to beat compilers .. by angel'o'sphere · 2015-01-03 00:24 · Score: 1

I like to program in assembly, too.
But not on X86 architectures. On the other hand I did not do that since 20 years, so no idea if it is now less painful :D
I did assembly on ARM 3, 6502 - of course - 68k and Sparc, Mips, Dec Alpha, PowerPC ... on the last one only very briefly to get an idea and I forgot everything about the last four ;D - don't remember if I even truly did program on Alpha or only analyzed the compiler output.
68k (68030/040) I liked the most due to its orthogonal architecture, it was like programming in a high level language. Or lets say other way around: it was very easy to map/compile high level languages to it.

--
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
Re: Don't need to be an expert to beat compilers . by sshir · 2015-01-03 00:31 · Score: 2

Interesting. But just as a side note: I thought that people generally stopped rolling their own linear algebra code and started to use precanned one from Intel,amd,nVidia etc. I'm sure matlab is that way.
Re:Don't need to be an expert to beat compilers .. by cheesybagel · 2015-01-03 02:03 · Score: 1

x86 assembly has been pretty clean ever since we have had the 32-bit 386 mode. x86-64 assembly is pretty much normal.
CPU by roman_mir · 2015-01-03 03:53 · Score: 1

Seemd to me that it may be timr for CPU manufacturers to implement math functionality that we normally provide with libraries as part of the CPU itself. Does it make more sense to add more generic cores to a CPU or to add some new instructions and add math co-processors that have ROM on board with lookup tables for logs, powers, sin, cos, tan and a bunch of other useful functions?

--
You can't handle the truth.
checking for errors is crucial by e**(i+pi)-1 · 2015-01-03 06:20 · Score: 1

What will be important when doing optimization is that no errors are introduced. Having errors in basic mathematical routines can break things at unpredictable places. The worst errors are the ones which deal with exceptional cases which are rarely hit upon. We worked once on integer factorization algorithms [We programmed Morrison-Brillard and quadratic Sieve methods, and everything in the group had been written from scratch, even the large integer libraries were written from the ground up in Pascal by a relatively large group of programmers, not using a single line of code from the outside as probably custom in any military setup.] Suddenly, our factorization algorithms did not work any more reliably. Debugging took some time but we could pinpoint it to parts where square roots were used. Indeed, the square root routine in rings of integers had been optimized by the group working on fundamental routines. It was quickly fixed, but I can only imagine a rare error in an important library of GCC. In the current case, it looks however as if the engineer knows what he is doing. Especially implementing the squaring function separately is smart and I'm surprised that this had not been done already. When dealing with integer arithmetic like taking powers a^b modulo some number, it is well known to be best first to compute powers a,a^2,(a^2)^2 by successively taking square roots, then write b in binary form and cobbling together the power a^b as a product of the precomputation which is of course the reason why taking powers is cheap in integer arithmetic and important in cryptology like RSA. So, it is important to be able to square quickly.
Re: Don't need to be an expert to beat compilers . by Pinhedd · 2015-01-03 09:21 · Score: 1

You're absolutely right about that. ICC contains a number of compiler optimizations for linear algebra functions provided by Intel. I rolled my own because for no other reason than to prove that I could, it was a neat day project.
Re:Yikes ! by Pseudonym · 2015-01-03 14:27 · Score: 1

This is MPI code, not IEE754 code.
As I understood it, the issue is that glibc's libm occasionally falls back to a multi-precision version of some transcendental functions if the "pure IEEE-754" version isn't good enough on some inputs.
I did look at the code, and the implementation of exp() on 64-bit floats does indeed fall back to a slow path on extreme inputs. I can't remember if it's the MP code or not.

--
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
Re:"professional look"? by ChrisMaple · 2015-01-06 15:33 · Score: 1

I recall that many of the Linux library math functions I've seen say "copyright Sun Microsystems", and that they were pretty old.

--
Contribute to civilization: ari.aynrand.org/donate
Re:Factorials for the taylor series? by ChrisMaple · 2015-01-06 15:59 · Score: 1

Each step uses a factorial 1 or 2 higher than the previous step. The previous step's factorial is kept around, so a simple multiply or 2 is all that's required to get the new one. He isn't calculating a new factorial afresh at each step. That said, when I've done work like this I've always made a table of precalculated factorials, which may be a mistake if it takes a long time to fetch the table.

--
Contribute to civilization: ari.aynrand.org/donate