Red Hat Engineer Improves Math Performance of Glibc
jones_supa writes: Siddhesh Poyarekar from Red Hat has taken a professional look into mathematical functions found in Glibc (the GNU C library). He has been able to provide an 8-times performance improvement to slowest path of pow() function. Other transcendentals got similar improvements since the fixes were mostly in the generic multiple precision code. These improvements already went into glibc-2.18 upstream. Siddhesh believes that a lot of the low hanging fruit has now been picked, but that this is definitely not the end of the road for improvements in the multiple precision performance. There are other more complicated improvements, like the limitation of worst case precision for exp() and log() functions, based on the results of the paper Worst Cases for Correct Rounding of the Elementary Functions in Double Precision (PDF). One needs to prove that those results apply to the Glibc multiple precision bits.
Compilers can't really compete with experts on performance, given enough time. If you are unfortunate enough to look into GCC, you can find some assembly here and there on performance-critical portions of the code. The major downsides are: you have to write it for each architecture you want to support (GCC supports many archs, so this is a really big deal), fewer people can read/write/fix it, bugs are even easier to slip by, takes longer to code.
99.9% of the time, no.
The purpose of the compiler is to identify and optimize the code structures in higher level languages. There are many, many tools, and generations of compilers that have been dedicated to just that. For the vast majority of cases, the compiler will do a better job and leave you with the much easier task of maintaining a high level language codebase.
That said, there are specific operations, most frequently mathematical in nature, that are so explicitly well defined and unchanging, that writing them in ASM may actually allow the author to take procedural liberties that the compiler is unknowledgeable of or exist in such a way that the compiler is unaware of.
The end result of such code is typically virtually unreadable. The syntax masks the math, and the math obfuscates the syntax. But the outcome is a thing of pure beauty.
-Rick
"Most people in the U.S. wouldn't know they live in a tyrannical state if it walked up and grabbed their junk." - MyFirs
It looks like the slowest paths of the transcendental functions were improved by a lot. But how often do these paths get used? The article doesn't say so the performance benefits may be insignificant.
"When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
But having made an actual, real contribution to a piece of software in general use makes him more of an engineer than an unusually competent computer scientist.
Keep in mind: this is [w]hat the compiler tried to do; when you start down this path you are saying "that fancy compiler doesn't know what its doing, I'll do it all myself".
Trying to outsmart a compiler defeats much of the purpose of using one.
-- Kernighan and Plauger, The Elements of Programming Style
If it weren't for deadlines, nothing would be late.
I think that saying "This piece of code is going to be called a lot, so I'll implement it in assembler" is inadvisable. The more reasoned approach is "after profiling, my program spends a lot of time in this routine, so I'll go over the assembler the compiler generated to make sure it optimized correctly". The upshot being, it is useful to be able to read and write assembler when optimizing, but it would be rare that you would produce new code in assembly from whole cloth.
"Because Science" is one step from "Because old book". Try "Because of my experiment testing my falsifiable assertion".
The real problem is you need to be expert in the target processor(s) ...
Not really. Being an expert in assembly language in general may be required but not necessarily an expert on the target architecture. Transitioning from one target architecture to another is not like starting over from scratch. Part of becoming a good assembly language programmer is recognizing things in algorithms that can't quite be stated in a high level language, information or suggestions that can't be given to the compiler. Specific computer architectures provide the toolboxes to address such shortcomings. So a bit of the work is common regardless of the architecture and the rest is determining how to accomplish something with the architecture specific toolbox.
Circa 2000 I was beating PowerPC compilers on my first attempt. Now I had a lot of experience with x86 and some with 68K, 8051, 8048, 6502 and Z80. At the university I had taken undegrad and grad computer architecture classes, the later focused on Alpha. Before writing the "commercial" PowerPC code referred to earlier I spent a couple of weeks reading PowerPC manuals and writing little pieces of test code.
Being new to PowerPC I was concerned that I had missed something or done something wrong despite seriously beating the compiler. I was able to go to Apple and spend a couple days with their engineers. Their PowerPC people thought the assembly code was fine and couldn't really improve upon it, their compiler people couldn't improve the original C code.
That said, assembly language is unnecessary much of the time. Contrary to popular belief C code can be written in a manner that favors one architecture over another. By understanding the given architecture and its assembly language I've been able to rewrite C code and not have to go all the way to assembly. Having two C implementation of some code, one for x86 vs one for PowerPC, or more generally one for CISC vs one for RISC.
Not just every architecture. In general, you may need to write it for every major revision of every architecture. As CPU pipelines and instruction sets change, the hand-crafted assembler may no longer be optimal.
(Exercise: Write an optimal memcpy/memmove.)
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
I once found an Bresenham line drawing algorithm written in Assembly (68k) on a Mac SE (black and white graphics).
That code was extremely complicated and the main mistake was that the author loaded each byte, manipulated it and rewrote it to memory each cycle.
Just to load in many cases the exact same byte again in the next cycle.
I immediately came to the idea to not load and store the bytes every cycle, and from that the jump to figure if you could also use words and long words depending on how steep the line was, was easy.
So I first wrote code in C that used chars, ints and longs depending on steepness and changed consecutive bits until it needed to access another "word". Each "word" was only loaded once into a register and was only written after all necessary bits where changed.
My C code rewrite was so fast (more than 100 times than the original assembly) that I never rewrote that in assembly itself.
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.