Is Profiling Useless in Today's World?
rngadam writes "gprof doesn't work in Linux multithreaded programs without a workaround that doesn't work that well. It seems that if you want to use profiling, you have to look for alternatives or agree with RedHat's Ulrich Drepper that "gprof is useless in today's world"... Is profiling useless? How do you profile your programs? Is the lack of good profiling tools under Linux leading us into a world of bloated applications and killing Linux adoption by the embedded developers? Or will the adoption of a LinuxThreads replacement solve our problems?"
Maybe gprof, as an implementation, might not be useful. But profiling, especially under Java, can make a world of difference to an application.
Saying "profiling isn't useful" is similar to saying "having information isn't useful".
That's just dumb.
Ulrich Drepper is a fool, he made glibc crappy, and messed up most things he had anything to do with. He simply should shut up and let other people do the work and the thinking.
Yeah, mod me down, but I have insight into the things Ulrich does, and he mostly does sh*t. Just my 2 cents (USD or EUR, you decide).
A monkey is doing the real work for me.
What could be more useful is if compiler implementors would spend as much time on the profiler as on the compiler: you would then be able to easily see faulty parts in your software and be able to determine what needs to be optimized.
Good profilers would mean efficient code. Don't think profilers are useless because most implementations of them suck.
A message from the system administrator: 'I've upped my priority. Now up yours.'
Those of us who started programming in 1k and sub megahurts can really feel the time taken by badly coded applications. We know that forgetting what is happening on the silicon can kill how well our code will run.
However, those who started coding after ~1987 don't really have a gut feeling for it. To them the latest processor will make up for their bad coding. To a certain extent they are right. Today's advances STILL keep up with Moore's law, still make up for their lack of skill. However, when one looks at what is actually performed with all that power, one tends to question why we are paying so much, for so little.
Can you actually say that MS WordXP is much better than the non-WYSIWYG wordprocessor of yesteryear (itself a blast from the past)?
We don't need profilers, we need coders who have that tacit knowledge of what really counts, where they should put real effort.
Unfortunately that doesn't come in a software box.
Profiling in general certainly isn't useless. I'll usually write new code primarily in a high-level, high-productivity language (e.g. Python), and if it's too slow I'll profile it and rewrite applicable parts in C. Some projects require a lower level (C) approach from the start, though those are pretty rare. Without profiling you'll spend a lot of time optimizing code that isn't a bottleneck.
Remember the words of Knuth: "Premature optimization is the root of all evil." Without profiling, you don't know what optimization is really needed and what isn't.
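The profile-first workflow described above can be sketched with Python's standard cProfile and pstats modules. This is only a minimal sketch: slow_sum_of_squares, cheap_setup, main and profile_report are hypothetical names made up for illustration, not anything from the poster's code.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    # Deliberately naive hot spot: a pure-Python loop (hypothetical workload).
    total = 0
    for i in range(n):
        total += i * i
    return total


def cheap_setup():
    # Work that might look suspicious but costs almost nothing.
    return list(range(100))


def main():
    cheap_setup()
    return slow_sum_of_squares(200_000)


def profile_report(func, top=10):
    """Run func under cProfile; return (result, text report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


if __name__ == "__main__":
    result, report = profile_report(main)
    print(report)
```

The report lists each function with its call count and time, so the candidate for a C rewrite is whatever dominates the cumulative column, not whatever looked slow while reading the source.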
That said...
BEGIN RANT
I've used gprof successfully with plenty of recent code. It works perfectly fine in non-threaded code, which _should_ be the majority (99%+) of code out there. Yes, that includes big network servers (the last one I wrote just recently passed the 6 billion requests served mark without blinking). Threads are a really nasty programming rathole that should be applied in a limited way; they take much of the time and effort spent developing protected memory OSes and toss it out the window. They also tend to encourage highly synchronized executions instead of decoupled execution, which often makes things both slower and more bug-prone (locking issues are _tough_ to get right when they become more than 1-level) and slower to implement than a well-designed multiprocess solution with an appropriate I/O paradigm. Just because two popular platforms (Windows and Java) make good non-threaded programming difficult doesn't mean you should cave in.
END RANT
rage, rage against the dying of the light
You could argue that with good up front design, you'll know in advance what 10% of the code to focus on, but I don't think that works that well in practice. At best, you're making educated guesses about where bottlenecks will appear.
And a lot of smart people, from Knuth and Kernighan to Linus and Guido, will freely admit that predicting what to optimize is nearly impossible. Even people at that level of programming prowess are often surprised by where the bottlenecks appear (and where they don't appear). You certainly want to design for flexible optimization from the start, but you'll often discover that the stupid O(n) scan you put in is good enough for now and that you had better optimize the I/O system before you think about replacing it with a tree or hash table or whatever.
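One hedged way to check whether the O(n) scan is "good enough for now" is to time it against a hash lookup at the sizes you actually expect. The sketch below uses the standard timeit module; linear_contains and measure are hypothetical names, and the input size is arbitrary.

```python
import timeit


def linear_contains(items, key):
    # The "stupid O(n) scan": walk the list until the key turns up.
    for item in items:
        if item == key:
            return True
    return False


def measure(n, repeats=200):
    """Return (scan_seconds, hash_seconds) for `repeats` lookups of the last element."""
    items = list(range(n))
    lookup = set(items)   # hash-based alternative
    key = n - 1           # worst case for the scan
    scan = timeit.timeit(lambda: linear_contains(items, key), number=repeats)
    hashed = timeit.timeit(lambda: key in lookup, number=repeats)
    return scan, hashed


if __name__ == "__main__":
    scan, hashed = measure(10_000)
    print(f"scan: {scan:.6f}s  hash: {hashed:.6f}s")
```

If both numbers are tiny next to the time the program spends in I/O, the scan can stay and the effort goes to the I/O path instead.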
Sumner
But, the bottom line is that if you don't profile your code (and unit test it, and integration test it, and...), you are not writing good code.
That's hardly true. Certainly you shouldn't waste time optimizing code until you know where the bottlenecks are. But in a lot of cases--I'd even venture to say most cases--code gets written and is fast enough. In such cases, profiling is a waste of time. Profiling is only indicated if there's a legitimate performance problem.
To a lesser extent, the same is true of unit testing and integration testing. If you're writing some code to convert one image to a GIF and you run it successfully to get the GIF, there's no reason to unit test. Even if the code has horrible bugs on some inputs, the job is done. One-off code isn't (unfortunately) uncommon. Prototype code is also very common and often you don't need to do extensive testing on it, either. Any code where the total cost of code failure is lower than the cost of QA probably doesn't need to be QA'd (which is not to say that you should spend an amount on QA equal to the failure cost; if spending $1000 on QA reduces the chance of failure by 99.999% and spending $1000000 reduces the chance of failure by 99.9999%, the $1000 expenditure suffices in all but the most demanding applications).
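The QA arithmetic above can be worked through as an expected-cost comparison. This is a sketch under loud assumptions: the baseline failure probability is taken to be 1 (failure is certain with no QA at all) and the failure cost is a free parameter; neither figure comes from the post, and all function names are hypothetical.

```python
from fractions import Fraction

# (QA spend in dollars, fraction by which the failure chance is reduced)
CHEAP = (1_000, Fraction(99_999, 100_000))            # 99.999%
EXPENSIVE = (1_000_000, Fraction(999_999, 1_000_000)) # 99.9999%


def expected_total_cost(qa_spend, reduction, failure_cost,
                        baseline_failure_prob=Fraction(1)):
    # QA spend plus the expected loss from the failures QA did not prevent.
    residual = baseline_failure_prob * (1 - reduction)
    return qa_spend + residual * failure_cost


def break_even_failure_cost(cheap, expensive,
                            baseline_failure_prob=Fraction(1)):
    # Failure cost at which both QA budgets have the same expected total cost.
    (spend_a, red_a), (spend_b, red_b) = cheap, expensive
    return Fraction(spend_b - spend_a) / (baseline_failure_prob * (red_b - red_a))


if __name__ == "__main__":
    cost = 10_000_000  # hypothetical cost of one failure
    print(float(expected_total_cost(*CHEAP, cost)))
    print(float(expected_total_cost(*EXPENSIVE, cost)))
    print(float(break_even_failure_cost(CHEAP, EXPENSIVE)))
```

Under these assumptions the larger budget only pays for itself once a single failure costs more than about $111 billion, which is one way to read "all but the most demanding applications". A lower baseline failure probability pushes the break-even point even higher.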
Sumner
There are very few applications that don't reach out across a network for information. The bottleneck is usually this network communication. Check out Performant for tools that work on the network level.
.NET, C++, C all can theoretically produce software that is just as speedy as assembly but it rarely is. People still write assembly where performance really counts (games, realtime, etc.)
There's also a continuing trend of software developers spending users' computing power to make their jobs easier. Java, J2EE, C#,
Some people think that the wasted processing power is a crime. Me, I think it's just economics. It's much cheaper to pay for processing power than it is to pay for the developers to squeeze every last bit of performance out of an app.
However, there are some applications where profiling is absolutely required. Database engines, games, simulations, anything that is CPU-bound has the potential of benefiting from profiling.
You are not a beautiful or unique snowflake -- but you could be if you got off your ass.
I've solved some important real-world problems using Quantify and Purify, especially when dealing with a huge system with a lot of developers' fingers in the pie. One of the programs was handling 100,000+ transactions a day, and Quantify helped shave enough off so we didn't have to force all of our customers to upgrade their hardware.
Faced with a similar problem in Linux, I'd probably port the program to Solaris, Quantify it there, and hope the results are similar under Linux.
The next Cmdr Taco duplicate will be ready soon, but subscribers can beat the rush and see it early!
But processes as provided by current operating systems are too expensive to use. If I have a network server (e.g. an httpd) that has to create a process for each network request, it will never scale. In theory all that has to happen is inetd (or equivalent) fork/execs and does the necessary plumbing so that the ends of the socket are STDIN and STDOUT. Then the process just reads and writes as necessary to fulfil the request. In practice, this just doesn't work.
That's why you can't use CGI for high-volume transactions. So let's make the server a single multithreaded daemon process instead, where each request is handled by a thread. Now you can handle each request much faster, but you lose the protected address space the OS gives you in a process.
Obviously, the OS needs to change, and give us something (maybe a hybrid between processes and threads) that more closely meets applications' needs. I don't see anybody making suggestions as to ways to move forward. Anybody know of research in this area?
Ok, you got me. Now, let's apply a common sense filter to my original post.
Of course "one off, disposable code" doesn't need the same degree of "analness" applied to it as does mission critical code.
However, "fast enough" is a really bad metric to use. Yes, utility "X" is fast enough. But oh, I didn't realize it was going to be used in conjunction with utility "Y" and "Z". Now, everything is really slow. Hey, can you say Microsoft?
Fortune telling is not part of any programming job description I've ever seen.
But in practice, multithreaded programs are almost always interactive, and thus are primarily limited by user response times,
I would disagree with this wholeheartedly. What about databases like Oracle, MS SQL Server, and so on? They're internally multithreaded, and most definitely not "interactive" after you initiate a SQL query.
I believe Apache 2.0 is threaded. HTTP by nature is not interactive. And so on. There are many other examples, left as an exercise to the reader.
While it is true that threads are very useful for interactive programs, in fact critical, their use does not stop there by a long shot. Any program which needs to do two things at once without fear of blocking on a system call is a candidate for threads. Threads are also useful for distributing compute cycles over multiple processors within a single process, allowing it to gain the benefit of concurrency.
The project I'm currently working on is a custom database application, and without threads it would be useless. And there are no users talking to it directly, that's for sure.
reducing the amount of input required from the user will always pay off better than any optimizations.
I find this perplexing. Nobody cares about optimizing a user dialog. Reducing user input or optimization of user input code would serve little purpose in most multithreaded applications I'm aware of. Generally, interactive multithreaded programs use threads so they can interact with users while simultaneously performing some other task that shouldn't be stalled by waiting for user input. For example, a network monitor might have three threads: one for watching network traffic, one for resolving IP addresses to hostnames, and one for taking user input. It doesn't matter how long the user input thread sits around waiting for the user to type/click something. There are two other threads working away in the meantime, watching traffic and displaying it for the user, oblivious to whether or not the user is doing anything. In such a case as this, profiling the watcher/resolver threads might be very useful indeed, since they need to be more or less realtime.
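The three-thread network monitor described above might be laid out roughly like this. It is a toy sketch: the packet list and the hostname table stand in for real capture and DNS calls, and every name in it (watcher, resolver, ui, run_monitor) is hypothetical.

```python
import queue
import threading


def watcher(packets, to_resolve):
    # Stand-in for a capture loop: forward each source address as it is "seen".
    for addr in packets:
        to_resolve.put(addr)
    to_resolve.put(None)  # sentinel: no more traffic


def resolver(to_resolve, results, table):
    # Resolve addresses to hostnames; a dict lookup replaces real DNS here.
    while True:
        addr = to_resolve.get()
        if addr is None:
            break
        results.append((addr, table.get(addr, "unknown")))


def ui(commands, stop):
    # Waits on user input; however long it waits, the other threads keep running.
    while not stop.is_set():
        try:
            cmd = commands.get(timeout=0.05)
        except queue.Empty:
            continue
        if cmd == "quit":
            stop.set()


def run_monitor(packets, table):
    to_resolve, commands = queue.Queue(), queue.Queue()
    stop = threading.Event()
    results = []
    threads = [
        threading.Thread(target=watcher, args=(packets, to_resolve)),
        threading.Thread(target=resolver, args=(to_resolve, results, table)),
        threading.Thread(target=ui, args=(commands, stop)),
    ]
    for t in threads:
        t.start()
    threads[0].join()
    threads[1].join()
    commands.put("quit")  # simulate the user finally typing something
    threads[2].join()
    return results


if __name__ == "__main__":
    print(run_monitor(["10.0.0.1", "10.0.0.2"], {"10.0.0.1": "alpha"}))
```

The watcher and resolver finish their work without ever waiting on the ui thread, which is why those two are the ones worth profiling.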
This gprof problem is a serious issue, and minimizing it by saying that threaded programs generally wouldn't benefit from profiling is naive.
Wrong. You design your code as a compromise between factors such as speed, maintainability, reusability, readability, and, most importantly, the resources you are allowed to expend.
If speed is a critical factor, then you might try to do some predictive profiling using existing principles to make sure the code is fast. Otherwise, you write the best damn code you can, which generally means using good practices to ensure that you don't waste time, and then profile it. Profiling will work best if the code is written in such a way (read: a lot of reusable functions) that allows simple optimization.
BTW, the biggest wrinkle in this is that programmers' time has become more valuable than clock cycles. We will now waste some clock cycles to save programmers' time, which is why profiling is not nearly as important as it used to be.
If the code is not written well, and has to be rewritten when the profiler says it sucks, then you wasted your time.
"She's a scientist and a lesbian. She's not going to let it slide." Orphan Black
How *NIX grognards always complain about multi-threading, but don't find signals (and their nasty interrupt-driven nature) to be the least bit unsettling!
A deep unwavering belief is a sure sign you're missing something...
Don't use threads.
The problem you are complaining about profiling having is that it can't profile threaded programs. Don't write threaded programs, and the problem is solved.
Frankly, I've always considered threading useful for only a few situations:
o When you have an SMP system, and you need to scale your application to multiple CPUs so that you can throw hardware at the problem instead of solving it the right way
o When you have programmers who can't write finite state automata, because they don't understand computer science, and should really be asking "Would you like fries with that?" somewhere, instead of cranking out code
o When your OS doesn't support async I/O, and you need to interleave your I/O in order to achieve better virtual concurrency
Other than those situations, threads don't make a lot of sense: you have all this extra context switching overhead, and you have all sorts of other problems -- like an inability to reasonably profile the code with a statistical profiler.
OK... Whew! Boy do I feel better! 8-).
Statistically examining the PC, unless it's done on a per thread basis, is just a waste of time in threaded programs.
If you want to solve the profiling problem for threaded programs, then you need to go to non-statistical profiling. This requires compiler support. The compiler needs to call a profile_enter and profile_exit for each function, with the thread ID as one of the arguments. This lets you create an arc-list per thread ID, and separately deal with the profiling, as if you had written the threads as separate programs. It also catches out inter-thread stalls.
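The instrumented scheme described above can be imitated in Python, with a decorator standing in for the compiler support. This is a sketch only: the profile_enter and profile_exit names come from the post, but their signatures, the profiled decorator, the use of the thread name as the thread ID, and the demo functions are all assumptions made for illustration.

```python
import functools
import threading
import time
from collections import defaultdict

_lock = threading.Lock()
_stacks = defaultdict(list)                         # thread -> [(function, start time)]
arc_counts = defaultdict(lambda: defaultdict(int))  # thread -> {(caller, callee): calls}
elapsed = defaultdict(lambda: defaultdict(float))   # thread -> {function: wall seconds}


def reset():
    # Drop all collected data so a run starts from a clean slate.
    with _lock:
        _stacks.clear()
        arc_counts.clear()
        elapsed.clear()


def profile_enter(thread_key, func_name):
    # Record the caller -> callee arc on this thread's own arc list.
    with _lock:
        stack = _stacks[thread_key]
        caller = stack[-1][0] if stack else "<thread root>"
        arc_counts[thread_key][(caller, func_name)] += 1
        stack.append((func_name, time.perf_counter()))


def profile_exit(thread_key, func_name):
    # Wall-clock (inclusive) time, so time spent stalled on other threads shows up.
    with _lock:
        name, started = _stacks[thread_key].pop()
        elapsed[thread_key][name] += time.perf_counter() - started


def profiled(func):
    # Stand-in for compiler-inserted hooks: wrap every call in enter/exit.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        key = threading.current_thread().name
        profile_enter(key, func.__name__)
        try:
            return func(*args, **kwargs)
        finally:
            profile_exit(key, func.__name__)
    return wrapper


@profiled
def parse(n):
    return sum(range(n))


@profiled
def handle_request(n):
    return parse(n) + parse(n)


def run_demo():
    reset()
    workers = [threading.Thread(target=handle_request, args=(1000,), name=f"worker-{i}")
               for i in range(2)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return arc_counts


if __name__ == "__main__":
    for thread, arcs in run_demo().items():
        print(thread, dict(arcs))
```

Each thread ends up with its own arc list, so the data can be read as if the threads were separate programs. The inclusive timing here double-counts recursive calls; a real tool would need to handle that.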
-- Terry
Why are these mutually exclusive? There's efficient and there's optimised, and one is a much easier subset of the other.
He's not claiming that everyone should hand-optimise from the word go. He's saying programmers should have a basic knowledge of their craft. It doesn't take much extra effort to use an efficient sorting algorithm or store data in a fast look-up structure, rather than writing a naff, hand-crafted shuffle sort and using arrays for everything whether they're appropriate or not. And yet, through ignorance or plain laziness, most programmers in most languages take the latter approach. (If you've never seen any of the source code for big name applications/OSes, trust me, it's scary.)
Similarly, it is just careless to pass large structures by value unnecessarily in a language that has reference semantics. You have to know the basics of what is efficient use of your tools of choice if you want to write good code, and the old Moore's Law excuse is just a cover for laziness and failure to do the requisite amount of homework.
Note that, very importantly, none of these things requires more than a small effort. They certainly don't compromise maintainability, bug count or any other relevant metrics, and a competent programmer (if you can find one) will take these things in his stride, and still be faster than the others.
Interesting... We have just acquired a new P4/2.2GHz with 512MB RAM and running WinXP as a development machine at work. You know what? It's way, way slower than the 1.4GHz P4 running 2000 we already had. And that in turn is way slower than the 1GHz P3 running NT4. This is not subjective, it is based on obvious, objective measures. For example, my new machine (the fastest of the above) sometimes takes 3-4 minutes to act on an OK'd dialog in Control Panel. The NT4 box reacts instantly when you configure the equivalent options. Something is wrong at this point, and I'm betting it's a combination of code bloat and feature creep.
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.