Is Source-Code Optimization Worthwhile?
nwetters asks: "I'd like my programs to run as fast as possible, and there are some great books (e.g. High Performance Computing) that give examples of loop optimizations and unit-stride memory use. However, I've heard that modern compilers will automatically optimize even the worst of my cludge. Should I bother with manual tweaking, or leave it up to the compiler? "
I'm sure there are innumerable ways to screw the compiler. As far as I know, no optimizer yet written converts bubble sorts into quicksorts. That said, you shouldn't worry about optimizing your loops before you've designed you program and picked appropriate algorithms. My theory is to design and code primarily for correctness, reliability, flexibility and maintenance. If the result is fast enough/small enough, then I'm done. If it's too slow (or too big, or whatever), then put it under a profiler and start picking. But definitely don't waste hand optimizing code until you've profiled it to make sure you're optimizing the *right* code!
That doesn't mean you should ignore stuff like memory stride and loop invariants, etc while coding, just because the compiler can fix it. For one thing, who knows what compiler/machine you're going to be using tomorrow? Code optimizers are stupid and conservative, and it doesn't take much to make one "give up" and go with "slow but correct". Another thing is that a lot of code that does confuse the optimizer is likely to confuse a future maintainer as well.
As V.Mole said, you can't expect a compiler to convert bubble sorts to quick sorts for you, so you certainly want to use appropriate algorithms, but you should not worry about things like precomputing common expressions or specifying what variables should be placed in registers: the compiler is, invariably, going to do a better job of sorting out these minor optimizations than you will, and it will be more portable. (the kind of optimizations that work well on a PowerPC or Alpha won't be so good on an x86 and vice versa)
As for some of the more esoteric optimizations, like loop unrolling, you may or may not get a compiler that can do good loop unrolling, so it may pay to unroll loops manually. Still, for the first attempt, you should probably write the loops without unrolling them so that you can think about the rest of the program, rather than getting caught up in minutia.
I routinely perform some small optimizations that rely on very large scale knowledge of what the code does and how it will be used. For example, I have an AVL tree library that caches the result of the last search and checks the cache variable before each lookup. That kind of optimization couldn't easily be done by the compiler (since it optimizes accross calls to different functions and it relies on what the code is being used for by the caller) but could have a big impact on code that might perform multiple lookups on the same key in sequence. (of course I would be better served by a splay tree in this circumstance, but that's beside the point here)
The kind of optimizations that it really pays to do are more along the lines of smart coding and good design than they are like traditional code optimization. In general, if you make your code readable and don't use stupid algorithms you can rely on the compiler to fix the small stuff.
For must stuff, I've really got to say it just doesn't matter. For instance, anyone who tries to optimize GUI code for speed should be repeatedly hit on the head. OTOH, some areas still require good optimiztation: graphics (I mean the low-level libs here), real time stuff, crypto, numerical work, things like that. However, most of that is not getting fast code, but fast algorithms, and finding "tricks" for doing those algorithms even faster. That is to say, worry about high level, first order effeciency, and let the compiler handle the hard stuff (like pipelining).
:)
But for the most part, it's best to just write clean, readable code. It doesn't matter that the code is fast if it can't be read, since someone will just have to redo it when it needs changes. Also it usually helps the compiler optimize more if it can figure out what's going on. For instance, the KAI C++ compiler has a very very good optimizer (my code runs 2-3 times as fast over gcc 2.95.2 on the same machine), and the docs recommend you write plain, readable code so the optimizer can see what's going on. One time I replaced some code with inline asm, and it got significantly slower. I took it back out.
Just my 2% of a $.
Sure, go ahead! But before you do, you should do a few other things:
- First, tidy up your code. Fix the indentations, elabourate the variable names, make it generally more readable.
- Then, comment it well.
- That done, write out your maintainer-level documentation, which includes all of your high-level algorithms.
- Verify that these algorithms are correct.
- Check the complexity of these algorithms. The real place to shave off time is if you can lower your running time by an O(n) or two.
- Once you've fixed your algorithms, update your documentation.
- And now, after you've done every other improvement you can think of... save a copy of your code.
- Run a profiler on your code.
- If you really still feel like it's too slow, then go ahead and optimise it in the places the profiler picks out as taking a lot of time.
The upshot is, optimising is never a bad thing. It's just an extremely low priority thing. I can't tell you how many of my students I've seen performing low-level optimisations on O(n^3) loops that could have been reduced to O(n lg n) or less. Don't waste your time optimising in the wrong places, and really, make sure you've done everything else first....``This, too, shall pass.'' ---Eastern proverb
I can think of at least one optimisation that an optimising compiler isn't likely to be able to make - namely, processing a large data set split into-self contained chunks that are small enough to fit into the cache memory of whichever system you're using. ie, if you have 512kb of L2, if you can stick to working on a total code + data size of less than 512k in one go, you should get a performance win, especially with SMP.
I would have to agree with most people here that optimization should be a secondary concern after correctness, readability, etc. The thing I wanted to stress is how important it is to profile before you optimize. I've read in numerous places that 80% of exectution time is spent in 20% of code, so it REALLY doesn't make sense to spent lots of time optimizing all of your code. Find out what the important 20% is, and concentrate on that. Also, something that compilers can't do is decide on container sizes. For example, if you have a hash table that you know is going to contain 6000 items, don't make it have only hash values (I had a real problem with this once. Increasing the hash table size to 2000 made the algorithm run 200% faster, stupid RogueWave with default hash table size of 64). Similarly if you are using vectors, strings, etc, that you know will have many items, don't construct them to have a size of one, becuase you will spend all of your time allocating memory, and copying the contents of your container.
>~~~~~~~~~~~~~~~~
>~~~~~~~~~~~~~~~~
Pilchie
5 years ago the standard prodecure was to recompile your program for the new Pentium architecture and instantly get a 10x speed boost. Today the standard prodecure is assembly language so it depends on where we are in the punctuated evolution. After the next recession they'll work their assess off to speed up processors and we'll once again benefit from recompiling more than optimizing. If you're doing database and web applications don't bother optimizing but write a game instead.
Besides the general rules (use better algorithms, use profilers to find slow code and test-Test-TEST!), something I've found very useful is caching reused values in low-level functions.
Two experiences with a client this spring show just how much this can help. The client needed to do some basic GIS functions, so there were a *lot* of trig functions off of the same values related to the lat-lon to display mapping. The original code did not cache values which were constant for each mapping; this was costly since it involved about half of all trig calculations in the code! Caching those values cut the running time in half, and no compiler would know to extend a data structure to include those cached values.
The second optimization involved a modest change to the code so that the data was organized in a rough hierarchy - the 360*180 1-degree square cells were clustered in 3-degree supercells, 15-degree super(2)-cells, and 60-degree super(3)-cells. If the area of interest wasn't in the super(3) cell, I didn't bother to check any of the 1-degree cells within it. Again, no compiler would have caught this, and few algorithm books discuss n-ary trees of spatial data. If you squint, this is also a "cache" approach, although the cache is hardcoded via the nesting factors.
Put the two modifications together and the program ran about 2 orders of magnitude faster! That was so much faster that the client started to compute the values on the fly - it was "cheaper" than precomputing the information and maintaining *that* cache!
For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
No competent grad student or professor will be "dumbfounded" that an O(n lg(n)) algorithm that hits the swap will be slower than an O(n^2) algorithm that does not. They *will* ask you why the first algorithm is so much more memory expensive, though, since most such algorithms only require a modest amount of additional memory. A fast algorithm, poorly implemented, will still run slowly.
However, when all things are equal there's no doubt that many of us will recommend the more efficient algorithm *even when it is modestly slower for the task at hand*, but that's because experience shows that problems inevitably get larger and the the more efficient algorithm *will* be faster in the near future. That's why a new 500 MHz P-III running Windows 2000 is often *slower* than a 75 MHz 486 running Windows 3.1, your system may be approx. 10x faster, but the problems you're trying to solve are also 10x larger so the O(n^2) algorithms are much slower. (Even O(n lg (n)) is slightly slower, but by a smaller amount.)
Finally, never overlook the raw stupidity of most people who blindly dismiss the more efficient algorithms. Just last month I saw a bubble sort that had a comment stating that a "bubble sort was fast enough since the cost was in the swap, not the comparison." The author obviously never considered the fact that you would never do a comparison and *not* do the required swap. I replaced that code with a call to qsort() and cut the number of comparisons from 1,000,000 to 30k. The performance improvement came to something like 20 seconds.
I'm sure that there are some clueless profs who don't know how costly it is to hit the swap, but in my experience there are *far* more programmers who do a half-assed implementation of the faster algorithm, generally bitching the entire time about how their clueless boss doesn't understand the "real world," and they then complain that the faster algorithm is actually slower. Obviously I don't know if that happened to you, but I've seen it happen enough times that I always take such criticism with a grain of salt.
For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken