Programming As If Performance Mattered

The question I always ask is by Anonymous Coward · 2004-05-05 16:41 · Score: 5, Insightful

Is the time it takes me to do the performance optimization worth it in time or money.

Funny thing about performance by ObviousGuy · 2004-05-05 16:44 · Score: 5, Interesting

You can spend all your time optimizing for performance and when you finally release your product, your competition whose main objective was to get the product out the door faster, who uses a slower algorithm, is already first in mindshare with your customers. Not only that, the processors that you thought you would be targetting are already a generation behind and that algorithm that was going to hold back your competition runs perfectly fast on new processors.

Performance gains occur at the hardware level. Any tendency to optimize prematurely ought to be avoided, at least until after v1.0 ships.

--
I have been pwned because my /. password was too easy to guess.

Re:Funny thing about performance by corngrower · 2004-05-05 16:52 · Score: 5, Insightful

Any tendency to optimize prematurely ought to be avoided, at least until after v1.0 ships.

Assuming there is a second version, which there may not be because potential customers found that the performance of v1.0 sucked.
Re:Funny thing about performance by kubrick · 2004-05-05 18:26 · Score: 5, Insightful

All Microsoft have to do is pre-announce features that won't be in their products until v3, well before the release of v1 (vide Go Corporation & Pen Windows), and that's enough to kill off the competition. Microsoft's success has become a self-fulfilling prophect for most of the market these days...

--
deus does not exist but if he does
Re:Funny thing about performance by Moraelin · 2004-05-05 22:48 · Score: 5, Insightful

Well, yes and no.

I still don't think you should start doing every single silly trick in your code, like unrolling loops by hand, unless there's a provable need to do so. Write clearly, use comments, and use a profiler to see what needs to be optimized.

That is coming from someone who used to write assembly, btw.

But here's the other side of the coin: I don't think he included better algorithms in the "premature optimization". And the same goes for having some clue of your underlying machine and architecture. And there's where most of the problem lies nowadays.

E.g., there is no way in heck that an O(n * n) algorithm can beat an O(log(n)) algorithm for large data sets, and data sets _are_ getting larger. No matter how much loop unrolling you do, no matter how you cleverly replaced the loops to count downwards, it just won't. At best you'll manage to fool yourself that it runs fast enough on those 100 record test cases. Then it goes productive with a database with 350,000 records. (And that's a small one nowadays.) Poof, it needs two days to complete now.

And no hardware in the world will save you from that kind of a performance problem.

E.g., if most of the program's time is spent waiting for a database, there's no point in unrolling loops and such. You'll save... what? 100 CPU cycles, when you wait 100,000,000 cycles or more for a single SQL query? On the other hand, you'd be surprised how much of a difference can it make if you retrieve the data in a single SQL query, instead of causing a flurry of 1000 individual connect-query-close sequences.

(And you'd also be surprised how many clueless monkeys design their architecture without ever thinking of the database. They end up with a beautiful class architecture on paper, but a catastrophic flurry of querries when they actually have to read and write it.)

E.g., if you're using EJB, it's a pointless exercise to optimize 100 CPU cycles away, when the RMI/IIOP remote call's overhead is at least somewhere between 1,000,000 and 2,000,000 CPU cycles by itself. That is, assuming that you don't also have network latency adding to that RPC time. On the other hand, optimizing the very design of your application, so it only uses 1 or 2 RPC calls, instead of a flurry of 1000 remote calls to individual getters and setters... well, that might just make or break the performance.

(And again, you'd be surprised how many people don't even know that those overheads exist. Much less actually design with them in mind.)

So in a nutshell, what I'm saying is: Optimize the algorithm and design, before you jump in to do silly micro-level tricks. That's where the real money is.

--
A polar bear is a cartesian bear after a coordinate transform.

Re:What annoys me by Anonymous Coward · 2004-05-05 17:01 · Score: 5, Funny

> a sad inditement

Well, it does have a spell checker now...

Code tweaking by Frequency+Domain · 2004-05-05 17:20 · Score: 5, Insightful

You get way more mileage out of choosing an appropriate algorithm, e.g., an O(n log n) sort instead of O(n^2), than out of tweaking the code. Hmmm, kind of reminds me of the discussion about math in CS programs.

Every time I'm tempted to start micro-optimizing, I remind myself of the following three simple rules:

1) Don't.
2) If you feel tempted to violate rule 1, at least wait until you've finished writing the program.
3) Non-trivial programs are never finished.

Re:What annoys me by DrEasy · 2004-05-05 17:21 · Score: 5, Insightful

And you aren't still using it why? (hint--your answer is the reason why MS 4 doesn't do all you need.)

Or maybe because you are forced to upgrade to read files that were created with a more recent version?

--
"In our tactical decisions, we are operating contrary to our strategic interest."

Re:me too... by Lord+Kano · 2004-05-05 17:49 · Score: 5, Funny

One of my C++ instructors told us a story about one of his former coworkers who used to go through his programs and delete ALL whitespace from the code because he thought it would make the programs smaller and faster.

LK

--
"Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano

Re:speed/easy coding by Lord+Kano · 2004-05-05 17:56 · Score: 5, Insightful

Really? How about the server side of things?

On the server side, I'd say that correctness and clarity are even more important. I guess it's all a matter of opinion as to where the "sweet spot" is, but most programming involves finding the right balance between speed and clarity.

If you're in a situation where you need the servers to process large amounts of data, you're most likely in a position to be able to justify the expense of throwing better hardware at the problem.

LK

--
"Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano

Re:Don't agree by Frobnicator · 2004-05-05 18:33 · Score: 5, Insightful

Just now I'm working on an econometric model ... in real life applications, playing with a Targa file is not the same as service critical, 300 users, number crunching, data handling systems, where a small performance improvement must be multiplied by the number of users/uses, by many many hours of operation and by years in service to understand its true impact. ... I'm playing with over 300 variables and 100 parameters to simulate dozens of different scenarios can make any server beg for more cycles, and any user beg for a crystal ball.

I don't think that fits into the description the article was talking about.

The point of this article is not targeted to you. I've seen interns as recent as last year complain about the same things mentioned in the article: division is slow, floating point is slow, missed branch prediction is slow, use MMX whenever more than one float is used, etc.

The point I get out of the article is not to bother with what is wasteful at a low level, but be concerned about the high levels. A common one I've seen lately is young programmers trying to pack all their floats into SSE2. Since that computation was not slow to begin with, they wonder why all their 'improvements' didn't speed up the code. Even the fact that they are doing a few hundred unneccessary matrix ops (each taking a few hundred CPU cycles) didn't show up on profiling. Their basic algorithm in a few cases I'm thinking about are either very wasteful, or could have been improved by a few minor adjustments.

The article mentions some basic techniques: choosing a different algorithm, pruning data, caching a few previously computed results, finding commonalities in data to improve the altorithm. Those are timeless techniques, which you probably have already learned since you work on such a big system. Writing your code so that you can find and easily implement high-level changes; that's generally more important than rewriting some specific block of code to run in the fewest CPU cycles.

A very specific example. At the last place I worked, there was one eager asm coder who write template specializations on most of the classes in the STL for intrinsic types in pure asm. His code was high quality, and had very few bugs. He re-wrote memory management so there were almost no calls to the OS for memory. When we used his libraries, it DID result in some speed gains, and it was enough to notice on wall-clock time.

However... Unlike his spending hundreds of hours on this low-return fruit, I could spend a day with a profiler, find one of the slower-running functions or pieces of functionality, figure out what made it slow, and make some small improvements. Usually, a little work on 'low-hanging fruit', stuff that gives a lot of result for a little bit of work, is the best place to look. For repeatedly computed values, I would sometimes cache a few results. Other times, I might see if there is some few system functions that can be made to do the same work. On math-heavy functions, there were times when I'd look for a better solution or 'accurate enough but much faster' solution using calculus. I'd never spend more than two days optimizing a bit of functionality, and I'd get better results than our 'optimize it in asm' guru.

Yes, I would spend a little time thinking about data locality (stuff in the CPU cache vs. ram) but typically that doesn't give me the biggest bang for the buck. But I'm not inherently wasteful, either. I still invert and multiply rather than divide (it's a habit), but I know that we have automatic vectorizers and both high-level and low-level optimizers in our compilers, and an out-of-order core with AT LEAST two floating point, two integer, and one memory interface unit.

And did I mention, I write software with realtime constraints; I'm constantly fighting my co-workers over CPU quotas. I read and refer to the intel and AMD processor documentation, but usually only to see which high-level functionality best lends itself to the hardware. I am tempted to 'go straight to the metal' occasionally, or to count the CPU cycles of everything, but I know that I can get bigger gains elsewhere. That's what the point of the article is, I believe.

--
//TODO: Think of witty sig statement

Postmature optimization by nimblebrain · 2004-05-05 19:35 · Score: 5, Informative

After years of developing, I really take to heart two things:

Premature optimization often makes better optimizations down the line much more difficult
It's 90% guaranteed that the slowdown isn't where or what you thought it was

Profilers are the best thing to happen to performance since compilers - really. I encounter a number of truths, but many myths about what degrades performance. A few examples of each:

Performance degraders

Mass object construction
Searching sequentially through large arrays
Repeated string concatenation (there are techniques to mitigate this)
Staying inside critical sections for too long

Not performance degraders

Lots of object indirection
Lots of critical sections

The "lots of object indirection" myth is one I encounter frequently. Object A calls Object B calls Object C, and it "intuitively" looks like it must be slow (Computer A calling Computer B, etc. would be slow), but even with stack frame generation, these are lightning fast compared with even the likes of "date to string" functions, never mind line-drawing commands or notification-sending.

The reason that particular myth is dangerous is that it's the single most pervasive myth (IMHO) that leads to premature optimization. People take out layers of object indirection and make it harder to put in better solutions later. I had an object that recorded object IDs in a list and let you look them up later. If I had "flattened" that into the routine that needed it, I might have effected a 0.1% speed increase (typical range for many premature optimizations). As it stood, because it hid behind an interface (equivalent to an ABC for C++ folks), when I had finally implemented a unit-tested red/black tree, it was trivial (~5 minutes) to drop in the new functionality. That's not an isolated case, either.

Mind you, I profiled the program to determine the slowdown first. Searching on the list, because so many were misses (therefore full scans), the search was taking up 98.6% of the entire operation. Switching to the red/black tree dropped the search down to 2.1%.

All in all, if you have a slow program, profile it. There is no substitute for a well-written profiler. Stepping through and "feeling" how long it takes in a debugger, while it can point you in rough directions, will miss those things that take 50 ms out of the middle of each call to the operation you're checking. Manually inserting timing calls can be frustrating enough to maintain or slow down your program enough that you can't narrow down the performance hit.

gprof works well with gcc and its relatives (make sure to add -pg to your flags), but I'm not sure if there's a good open source option out there for people using other tools that doesn't require you to alter your source.

In the Windows world, we recently got in the professional version of AQTime 3. It's an astounding package, allowing you numerous reports, pie charts and call graphs, saving the last few runs, calculating differences in performance between runs, allowing attachment to running processes, on top of a pretty nice way to define areas of the program to profile. The single nicest thing about it, though, is the performance. We turned on full profiling (that is, profiling all methods in all modules, including all class library and third party components) on the largest project we had, and it ran with perhaps a 30% slowdown. If you've used profilers before, you know how astounding that is ;)

Profiling applications always surprises me. In one case, a space-making algorithm I was running on controls seemed a little pokey; I found out more than 50% of the time spent was on constantly verifying that the lists were sorted. Today, I was investigating a dialog that looked like it must hav

--
Binary geeks can count to 1,023 on their fingers :)

Re:Followed your link by BlackHawk-666 · 2004-05-05 20:16 · Score: 5, Insightful

When coding anything, msot of the code should be done to be fast

This may be true when you are producing libraries of math routines and similar stuff like you are doing. It doesn't hold an ounce of water when you do the sort of work I do. My projects are generally medium sized, mixed languages, developers of all different skill levels. Code clarity is far more important for 98% of the stuff we do. I need my juniors to be able to follow the code the seniors write, even if they can't write it themselves. The other 2% of the time it's fine to sacrifice clarity for speed to get the performance to an acceptable level on the target platform.

I have generally found that clear code is usually good code, so long as you are aware of the cost implications of your design decisions. For instance, I seem to recall the bubble sort (mentioned earlier) was actually faster than a qsort under some circumstances. Deep data knowledge would help you to make the decision as to which would need to be used...don't just reach for that qsort, it may be the fastest under most cases, but not all.

--
All those moments will be lost in time, like tears in rain.

Re:Performance tuning. by Glonoinha · 2004-05-06 02:07 · Score: 5, Insightful

Problem is that your client is still running on the prototype of their project, not the real release. They just don't know it, and I'm guessing the original programmers don't know it.

The most effective, well used (if unintentionally used) development methodology is the prototype methodology. The first pass is simply a reality check, can we even accomplish what needs to be accomplished on the hardware and development tool we have available? The prototype is then shown to management as a proof of concept, show them that their ideas are possible, and then a second generation is re-engineered from the ground up using the lessons learned in the first generation as a foundation for a solid, well engineered deliverable product. This breaks down in one of two ways : management says screw the rewrite, lets just run what we have - or the developers are not smart enough to understand that their first pass at it wasn't production quality code, only a prototype.

What your client has right now is a prototype, a proof of concept. It 'works' inasmuch as a kite flys - as a demonstration that the concept is viable, but not meant for real work. You could probably push a big kite hard enough to 'fly' two people, but that doesn't make it a good idea. You could continue to 'tweak' a kite in order to even double the performance, get 4 people off the ground - but I wouldn't recommend using it for commercial applications.

Odds are the app needs to be understood from top to bottom so a set of software engineers know the concepts, what the package is intended to do, how it currently does it, what the expectations are for performance and growth - and then the SE's that understand it need to rewrite it from the ground up developing performance engineered code that is production quality.

--
Glonoinha the MebiByte Slayer

Slashdot Mirror

Programming As If Performance Mattered

14 of 615 comments (clear)