Is Profiling Useless in Today's World?
rngadam writes "gprof doesn't work in Linux multithreaded programs without a workaround that doesn't work that well. It seems that if you want to use profiling, you have to look for alternatives or agree with RedHat's Ulrich Drepper that "gprof is useless in today's world"... Is profiling useless? How do you profile your programs? Is the lack of good profiling tools under Linux leading us in a world of bloated applications and killing Linux adoption by the embedded developers? Or will the adoption of a LinuxThreads replacement solve our problems?"
Take a look at OProfile. It's quite a nice tool, although it's not a direct replacement for gprof. From their 'About' page:
OProfile is a system-wide profiler for Linux x86 systems, capable of profiling all running code at low overhead. OProfile is released under the GNU GPL.
It consists of a kernel module and a daemon for collecting sample data, and several post-profiling tools for turning data into information.
OProfile leverages the hardware performance counters of the CPU to enable profiling of a wide variety of interesting statistics, which can also be used for basic time-spent profiling. All code is profiled: hardware and software interrupt handlers, kernel modules, the kernel, shared libraries, and applications (the only exception being the oprofile interrupt handler itself).
While it doesn't give the exact time spent in a
given function, running 'pstack' against a
processID under Solaris will give the execution
stack trace of any threads present.
If you find that 80% of your threads are in
slow_function( someParam ) then ya better get to
work fixing it. This also has the added advantage
of not slowing down your program with profiling
code and other hooks.
Obviously this isn't great for fine-grained
profiling, or with applications with few threads,
but I've found it helpful on my larger projects.
First, a little background on gprof, for those new to the *n*x world. Gprof is what's known as a profiler. Basically, it inserts code into the beginnings and ends of functions. When you run your program through gprof, it then records how much time is spent in each part of your program. The idea is that the programmer sees where most of the time is being spent, and optimizes that part of the program.
Now, as for the charges of gprof being useless, I can say that that is far from the case. True, it falls flat when dealing with multithreaded programs. But in practice, multithreaded programs are almost always interactive, and thus are primarily limited by user response times, which are many orders of magnitude longer than even the worst algorithm. In these cases, reducing the amount of input required from the user will always pay off better than any optimizations.
As an example, in our enterprise database frontend, we had a dialog that would prompt users for an administrator password when they attempted the "delete" command. We did analysis (with a commercial profiler, but it may as well have been gprof) and found that, lo and behold, the bulk of execution time was being spent waiting for the user to type in the password. So what we did was change the delete command to "eteled" ("delete" backwards), and only told the administrators the new command name. This way, we could be certain that only administrators would even attempt a deletion, and no password prompt was necessary. We have since applied the same design philosophy throughout our software, and productivity is at an all-time high.
As is usually the case, profiling can be the most important part of a project or next to useless. It all depends on how you use it. Gprof is a great tool for what it does; you just have to know how to use it properly.
Karma: Good (despite my invention of the Karma: sig)
But even if you aren't doing something that is speed intensive like games, you always have tradeoffs when you choose your data structures and algorithms. Generally you first code up the easiest algorithm that you think will use an acceptable amount of memory and CPU time. Then, later, if something is too slow, you have to identify where the problem is. If could be that you chose an O(N^2) algorithm not realizing that N might be 1,000 instead of the max of 100 you were counting on, forcing you to switch to an O(NlogN) algorithm that is more complex.
Now, if it is a small application, you might have enough familiarity with the code to be able to guess where the problem is -- then you fix it and see if it is still slow. If that works, then you're set and profiling isn't necessary. But if the fix doesn't speed it up enough, then you're stuck. You have to profile it somehow.
You might try simple tricks like changing the code to loop on a suspected bit of code 100 times and see how much longer it takes. Or maybe throw in some printf's that spit out the current time at different points. Or maybe create your own profiling code that you manually call in functions you want to time. Or, you might use an actual profiler without modifications to the code. But lacking a profiler doesn't mean you can't or won't profile your code.
And even with CPU speed doubling every couple of years or so, that doesn't mean speed is no longer an issue. You can easily choose the wrong algorithm and have something take 1000s of times longer to run than the proper algorithm.
I used gprof quite much during my Master Thesis work this spring. gprof tells what functions consumes most cputime, and those functions could be optimised. Usually very small parts of the code consumes most of the cpu-time.
This program was parallellised on network level - all clients were singlethreaded. If someone has multithreaded for performance (to utilize more than one cpu) I suppose gprof will still work well on a single cpu machine with just one thread.
For programs that consumes lots of cpu time for well-defined computations it should not be hard to profile a single threaded version (a single threaded version is needed for debugging anyway).
More complex applications (for example a web browser) I imagine are more dependant on multi-threading, and should pose a larger problem.
gprof, is probably not dead - if you need it you can adapt the program...
You could argue that with good up front design, you'll know in advance what 10% of the code to focus on, but I don't think that works that well in practice. At best, you're making educated guesses about where bottlenecks will appear, and you'll be wrong some of the time -- requiring profiling at that point.
And lacking tools doesn't mean you can't or won't profile -- it just means you'll have to do more work to profile the code.
If you want tree profiling (i.e. information about function and child performence) then Rational Quantify is a reasonable alternative to the crap profiler that comes with MSDev.
If you want a flat profiler or need to analyze the cost of specific low level operations then you MUST get Intel VTune.
I am not a number! I am a man! And don't you
And remember, in the immortal words of Michael Abrash, "Assume Nothing. Measure the improvements. If you don't measure, you're just guessing."
How's my programming? Call 1-800-DEV-NULL
Of course, for many applications, multi-threading achieves the vast majority of the speed increase, and profiling will only be of marginal utility. The profiler is just one tool of many, and is not a silver bullet.
you will find that a profiller is quite useful.
For instance, in the last year I've been developing a automatic nesting server using Linux and gprof was very important to spot the functions that were consuming more cpu time.
With gprof it was easy to notice two small functions that were responsible for 95% of the cpu usage.
As a result, I replaced that two small functions with 180 lines of optimized assembly code and I got a very good performance increase, since I was using a lot of inter-word bit shifts that the C compiler didn't handle well.
Regarding multi-threads, I come to the conclusion that 9 out of 10 times you don't really need to use threads, even in interactive programs, since there are alternative ways of acheiving the same efects.
For instance, all the X11 toolkits like Xt/Motif, Gtk+ and Qt, have the concept of work-procedures and timeout-funcions.
If you put all of your time-consuming operations inside work-procedures, you can get the same results as you would get with multi-threads, because you have an efective way of executing several taks at the same time without blocking the user interface.
Fernando Pereira
For Java we have a really nice choice of profilers. There are basically three great products available, all of them have proved to be absolutely useful. There is JProbe, OptimizeIt and JProfiler (the 2.0 beta of JProfiler looks cool). I don't know what the problems on Linux are, but when programming Java, profiling is quite an enjoyable task.
Signature deleted by lameness filter.
minimize the use of threads whenever possible. write your code in an event driven fashion as your friendly AC suggested. the poll() system call [superior to select(), though select() works well within its fixed size filedescriptor array limits] makes this possible.
the basic mentality to switch from threads to event programming is this: anytime you're using a thread solely so that it can sit around and block on high latency events (network or disk I/O) most of its lifetime, it should not be a thread.
its acceptable to have worker threads/processes that you hand computational tasks to and they trigger an event in your event loop when they hand a result back, but don't use threads of execution to manage your state. you'll pull your hair out and still have a nonfunctional program.
Also, recently merged in 2.4.x is "next-generation threads". I'm not sure if glibc linuxthreads will utilize them, or if they're any better, but just letting everyone know the implementation might change.
If Ulrich had written that profiling was useless, your response might make some sense. He didn't. He wrote that gprof was useless. See the difference?
And for getting even more useful information out, try Prospect. It works with OProfile - there was a talk about it at this year's Ottawa Linux Symposium, which you can find in the conference proceedings (gzipped PDF).
Do you even know anything about perl? -- AC Replying to Tom Christiansen post.
Five bucks says that this server is slashdot'ed within the hour, so you may have more success with the less descriptive SourceForge project page, indicates that the project is not dead, as the homepage says.
I discovered this program when I was optimizing some code I wrote to multiply sparse matrices. By the time I had gotten it 100x faster than the initial code, gprof had lost all semblance of granularity and was giving me obviously bogus results. The problem is that such things as cache performance (i.e. optimizing for cache hits) were now heavily affecting the profile and gprof could not figure such things out. FunctionCheck works much better than gprof and actually generates accurate profile information under high-stress situations.
From the homepage (all grammatical errors theirs):
"I created FunctionCheck because the well known profiler gprof have some limitations:
- it is not possible to change the profile data file name
- multi-threads / multi-processes is not supported
- time spend in non-profiled functions is discarded
- you can't control the way profile is made
- memory profile is not managed
For all these limitations, and by the fact that I discovered a new gcc feature called -finstrument-functions, I decided to write my own profiler.My approach is simple: I add (small) treatments at each enter and exit of all the functions of the profiled program. It allows me to compute many information:
- the current call-stack
- the time at each action, to compute elapsed times in functions
- process PID / thread ID, to manage multi-threads / multi-processes
- number of calls to functions
...
With these information, I can generate profile data files (for each thread / process), which describes all the statistics (at function level) for the program execution."Try it out and please contribute some source code.
See http://www.BitWagon.com/tsprof/tsprof.html for info on a process profiler that uses hardware performance counters (with no recompile and no relink) and gives both interactive and text output in tree and flat modes.