Is Profiling Useless in Today's World?
rngadam writes "gprof doesn't work in Linux multithreaded programs without a workaround that doesn't work that well. It seems that if you want to use profiling, you have to look for alternatives or agree with RedHat's Ulrich Drepper that "gprof is useless in today's world"... Is profiling useless? How do you profile your programs? Is the lack of good profiling tools under Linux leading us in a world of bloated applications and killing Linux adoption by the embedded developers? Or will the adoption of a LinuxThreads replacement solve our problems?"
Why can't my code be judged by the content of its characters, and not by the color of its extension?
:)
Down with profiling!
a yro article.
My first instinct was to agree with the headline, since Jose Padilla is hispanic. It would have at least made for a more interesting discussion.
Do not taunt Happy Fun Ball(TM)
Maybe gprof, as an implementation, might not be useful. But profiling, especially under Java, can make a world of difference to an application.
Saying "profiling isn't useful" is similar to saying "having information isn't useful".
That's just dumb.
i racially profile my programs.
this sig limit is too small to put anything good h
Ulrich Drepper is a fool, he made glibc crappy, and messed up most things he had to do with. He simply should shut up and let other people do the work and the thinking.
Yeah, mod me down, but I have insight into the things Ulrich does, and he mostly does sh*t. Just my 2 cents (USD or EUR, you decide).
A monkey is doing the real work for me.
I am very small, utmostly microscopic.
If your goals are speed/efficiency, then profiling is a necessary step. If your goal is just to get the program done and working, then you may be able to survive without it. People shouldn't just give up on the speed/efficiency just because they figure the newest technology will handle it fine... unless you're Microsoft.
Take a look at OProfile. It's quite a nice tool, although it's not a direct replacement for gprof. From their 'About' page:
OProfile is a system-wide profiler for Linux x86 systems, capable of profiling all running code at low overhead. OProfile is released under the GNU GPL.
It consists of a kernel module and a daemon for collecting sample data, and several post-profiling tools for turning data into information.
OProfile leverages the hardware performance counters of the CPU to enable profiling of a wide variety of interesting statistics, which can also be used for basic time-spent profiling. All code is profiled: hardware and software interrupt handlers, kernel modules, the kernel, shared libraries, and applications (the only exception being the oprofile interrupt handler itself).
but I suppose other people want to profile more than just Java. Bother.
While it doesn't give the exact time spent in a
given function, running 'pstack' against a
processID under Solaris will give the execution
stack trace of any threads present.
If you find that 80% of your threads are in
slow_function( someParam ) then ya better get to
work fixing it. This also has the added advantage
of not slowing down your program with profiling
code and other hooks.
Obviously this isn't great for fine-grained
profiling, or with applications with few threads,
but I've found it helpful on my larger projects.
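For anyone not on Solaris, the same "peek at every thread's stack and count" idea can be sketched in a few lines. This is only an illustration in Python, not pstack itself: it relies on the CPython-specific sys._current_frames(), and slow_function, sample_stacks and demo are made-up names for the example.

```python
import collections
import sys
import threading
import time


def slow_function(stop_event):
    # Hypothetical hot spot: burns CPU until asked to stop.
    while not stop_event.is_set():
        sum(i * i for i in range(1000))


def sample_stacks(samples=20, interval=0.01):
    """Periodically look at every other thread's stack and count the functions seen."""
    counts = collections.Counter()
    me = threading.get_ident()
    for _ in range(samples):
        # sys._current_frames() is CPython-specific: thread id -> topmost frame.
        for thread_id, frame in sys._current_frames().items():
            if thread_id == me:
                continue
            names = set()
            while frame is not None:  # walk from the innermost frame outwards
                names.add(frame.f_code.co_name)
                frame = frame.f_back
            counts.update(names)  # count each function once per thread per sample
        time.sleep(interval)
    return counts


def demo():
    stop_event = threading.Event()
    workers = [threading.Thread(target=slow_function, args=(stop_event,))
               for _ in range(3)]
    for worker in workers:
        worker.start()
    try:
        return sample_stacks()
    finally:
        stop_event.set()
        for worker in workers:
            worker.join()
```

Calling demo().most_common(5) should put slow_function near the top, which is the "80% of your threads are in slow_function" signal, obtained without instrumenting the program.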
What would be more useful is if compiler implementors spent as much time on the profiler as on the compiler: you would then be able to easily see the faulty parts of your software and determine what needs to be optimized.
Good profilers would mean efficient code. Don't think profilers are useless because most implementations of them suck.
A message from the system administrator: 'I've upped my priority. Now up yours.'
First, a little background on gprof, for those new to the *n*x world. Gprof is what's known as a profiler. Basically, it inserts code into the beginnings and ends of functions. When you run your program through gprof, it then records how much time is spent in each part of your program. The idea is that the programmer sees where most of the time is being spent, and optimizes that part of the program.
Now, as for the charges of gprof being useless, I can say that that is far from the case. True, it falls flat when dealing with multithreaded programs. But in practice, multithreaded programs are almost always interactive, and thus are primarily limited by user response times, which are many orders of magnitude longer than even the worst algorithm. In these cases, reducing the amount of input required from the user will always pay off better than any optimizations.
As an example, in our enterprise database frontend, we had a dialog that would prompt users for an administrator password when they attempted the "delete" command. We did analysis (with a commercial profiler, but it may as well have been gprof) and found that, lo and behold, the bulk of execution time was being spent waiting for the user to type in the password. So what we did was change the delete command to "eteled" ("delete" backwards), and only told the administrators the new command name. This way, we could be certain that only administrators would even attempt a deletion, and no password prompt was necessary. We have since applied the same design philosophy throughout our software, and productivity is at an all-time high.
As is usually the case, profiling can be the most important part of a project or next to useless. It all depends on how you use it. Gprof is a great tool for what it does; you just have to know how to use it properly.
Karma: Good (despite my invention of the Karma: sig)
I can't get any useful profiling information out of Microsoft Visual C++. When I compile in profiling mode, my program runs at less than 1% of normal speed, producing completely useless data. Am I doing something wrong? Should I be using 3rd party tools?
But even if you aren't doing something that is speed intensive like games, you always have tradeoffs when you choose your data structures and algorithms. Generally you first code up the easiest algorithm that you think will use an acceptable amount of memory and CPU time. Then, later, if something is too slow, you have to identify where the problem is. It could be that you chose an O(N^2) algorithm not realizing that N might be 1,000 instead of the max of 100 you were counting on, forcing you to switch to an O(NlogN) algorithm that is more complex.
Now, if it is a small application, you might have enough familiarity with the code to be able to guess where the problem is -- then you fix it and see if it is still slow. If that works, then you're set and profiling isn't necessary. But if the fix doesn't speed it up enough, then you're stuck. You have to profile it somehow.
You might try simple tricks like changing the code to loop on a suspected bit of code 100 times and see how much longer it takes. Or maybe throw in some printf's that spit out the current time at different points. Or maybe create your own profiling code that you manually call in functions you want to time. Or, you might use an actual profiler without modifications to the code. But lacking a profiler doesn't mean you can't or won't profile your code.
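A rough sketch of the "roll your own" approach, in Python rather than C with printf's. The names timed, timings, suspected_hot_spot and run are invented for this example; it combines the loop-it-100-times trick with a hand-placed timer.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> accumulated seconds


@contextmanager
def timed(label):
    """Hand-rolled timer: wrap any region you suspect and accumulate its wall time."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + (time.perf_counter() - start)


def suspected_hot_spot(n):
    # Hypothetical stand-in for the code you think is slow.
    return sum(i * i for i in range(n))


def run():
    # Trick 1: loop the suspect 100 times so its cost stands out.
    with timed("hot_spot x100"):
        for _ in range(100):
            suspected_hot_spot(1000)
    # Trick 2: time another region for comparison.
    with timed("setup"):
        data = list(range(1000))
    return data
```

After run(), printing timings shows how the two regions compare. It is crude, but it needs no profiler support from the toolchain.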
And even with CPU speed doubling every couple of years or so, that doesn't mean speed is no longer an issue. You can easily choose the wrong algorithm and have something take 1000s of times longer to run than the proper algorithm.
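To put rough numbers on the algorithm point, here is a small sketch using duplicate detection as a stand-in problem (my choice of example, not something from the post above). At N = 1,000 the pairwise version can make 499,500 comparisons, while the sort-based version needs on the order of 10,000.

```python
def has_duplicates_quadratic(items):
    """O(N^2): compare every pair. Easiest to write, fine for small N."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_sorted(items):
    """O(N log N): sort first, then only neighbouring elements can be equal."""
    ordered = sorted(items)
    return any(a == b for a, b in zip(ordered, ordered[1:]))
```

Both give the same answers. Only the growth rate differs, and that is what hurts when N turns out to be ten times what you planned for.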
I used gprof quite a lot during my Master's thesis work this spring. gprof tells you which functions consume the most CPU time, and those functions can then be optimised. Usually very small parts of the code consume most of the CPU time.
This program was parallelised at the network level - all clients were single-threaded. If someone has gone multithreaded for performance (to utilize more than one CPU), I suppose gprof will still work well on a single-CPU machine with just one thread.
For programs that consume lots of CPU time on well-defined computations, it should not be hard to profile a single-threaded version (a single-threaded version is needed for debugging anyway).
More complex applications (for example a web browser) are, I imagine, more dependent on multithreading, and should pose a larger problem.
gprof is probably not dead - if you need it, you can adapt the program...
Those of us who started programming in 1k and at sub-megahertz speeds can really feel the time taken by badly coded applications. We know that forgetting what is happening on the silicon can kill how well our code will run.
However, those who started coding after ~1987 don't really have a gut feeling for it. To them the latest processor will make up for their bad coding. To a certain extent they are right. Today's advances STILL keep up with Moore's law, still make up for their lack of skill. However, when one looks at what is actually performed with all that power, one tends to question why we are paying so much, for so little.
Can you actually say that MS WordXP is much better than the non-WYSIWYG wordprocessor of yesteryear (itself a blast from the past) ?
We don't need profilers, we need coders who have that tacit knowledge of what really counts, where they should put real effort.
Unfortunately that doesn't come in a software box.
NuMega DevPartner.
Rational also used to produce something. I can't remember what it's called though. It was a companion to Purify.
Use VTune (http://www.intel.com/software/products/vtune/vtune60/), Intel's profiler.
It does a pretty good job, and it uses performance counters.
Me: Really? Which part?
User: When I click the "report" icon
Me: Oh (tinkers with report code). Try it now.
User: It's still slow
Me: (shakes BOFH excuse 8-ball) Hrmm, must be interference from sunspots, try it again tomorrow
Today I didn't even have to use my AK; I got to say it was a good day -- Icecube
Profiling in general certainly isn't useless. I'll usually write new code primarily in a high-level, high-productivity language (e.g. Python), and if it's too slow I'll profile it and rewrite applicable parts in C. Some projects require a lower level (C) approach from the start, though those are pretty rare. Without profiling you'll spend a lot of time optimizing code that isn't a bottleneck.
Remember the words of Knuth: "Premature optimization is the root of all evil." Without profiling, you don't know what optimization is really needed and what isn't.
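For the Python half of that workflow, the standard library's cProfile and pstats modules are enough to find the part worth rewriting. A minimal sketch follows; cheap_step, expensive_step, workload and profile_report are invented names for the example.

```python
import cProfile
import io
import pstats


def cheap_step(n):
    return n + 1


def expensive_step(n):
    # Hypothetical bottleneck: the candidate for a rewrite in C.
    return sum(i * i for i in range(n))


def workload():
    total = 0
    for _ in range(200):
        total += cheap_step(10)
        total += expensive_step(2000)
    return total


def profile_report(func, top=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(top)
    return out.getvalue()
```

print(profile_report(workload)) lists expensive_step well above cheap_step, so you know which one deserves the effort before touching any C.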
That said...
BEGIN RANT
I've used gprof successfully with plenty of recent code. It works perfectly fine in non-threaded code, which _should_ be the majority (99%+) of code out there. Yes, that includes big network servers (the last one I wrote just recently passed the 6 billion requests served mark without blinking). Threads are a really nasty programming rathole that should be applied in a limited way; they take much of the time and effort spent developing protected memory OSes and toss it out the window. They also tend to encourage highly synchronized executions instead of decoupled execution, which often makes things both slower and more bug-prone (locking issues are _tough_ to get right when they become more than 1-level) and slower to implement than a well-designed multiprocess solution with an appropriate I/O paradigm. Just because two popular platforms (Windows and Java) make good non-threaded programming difficult doesn't mean you should cave in.
END RANT
rage, rage against the dying of the light
Profiling, in one form or another, is ABSOLUTELY necessary. There is no other way to find out why (and where!) your code is running slowly.
Does gprof do everything we need? No. Are there better tools? Yes.
But, the bottom line is that if you don't profile your code (and unit test it, and integration test it, and...), you are not writing good code.
It's like debating if "breathing" is necessary or not.
If you want tree profiling (i.e. information about function and child performance) then Rational Quantify is a reasonable alternative to the crap profiler that comes with MSDev.
If you want a flat profiler or need to analyze the cost of specific low level operations then you MUST get Intel VTune.
I am not a number! I am a man! And don't you
It isn't just gprof that's broken by pthreads, other Linux tools fall victim as well. Core dump? Almost useless with pthreads running. Gdb? Getting better, but still a little wonky. Certain aspects of signal handling don't work as expected with pthreads.
And remember, in the immortal words of Michael Abrash, "Assume Nothing. Measure the improvements. If you don't measure, you're just guessing."
How's my programming? Call 1-800-DEV-NULL
Drepper is smoking some strange shit. Every serious programmer uses profiling.
Just because kernel and glibc wackos don't find it useful doesn't mean it isn't useful.
I regularly use profiling for any code that demands performance.
That was a very unfortunate remark of Ulrich.
--exa--
The problems that threading solves (multiple outstanding I/O's, multiple CPU utilization) can be solved using other methods. Those other methods have their evils, too, but trading off for the lesser net evil is what design and analysis is all about.
Lack of profiling tools is pretty far down on the list of tradeoffs, in my opinion; much higher up are issues of maintainability and portability, areas where threading does badly anyway.
Out in the console games development market, there's one real serious tool: a hardware profiler. Basically, it's a heavily modified PSX with bus analyzers tacked on so that it can snoop and tell *exactly* where the slowdowns are. Is it a cache miss? Is it the GPU hammering on things? There's none of the "this function is slow" -- it points out *why*.
The most recent x86 profiler I've used was Intel's VTune (AMD's free tool at http://www.amd.com/us-en/Processors/DevelopWithAMD/0,,30_2252_3604,00.html was so-so at best for my use). Those apps don't do any of the fancy bus analysis, etc. Still, I'd suspect they're better than nothing.
You should not rely on profilers from the beginning of writing code, and they're no cure-all either. A profiler can't tell you to use quicksort over a bubblesort. It just says what is slow, and it's up to the programmers to find a faster way to do things.
I know this is going to sound like flamebait, but C++ *does* make it very easy to shoot yourself in the foot with regards to performance. If you don't set up all your operators to properly take consts, if you forget to set things up, it can kill performance. If you rely on a *lot* of small functions, you can either (1) blow out the cache with a larger executable (more likely on consoles), or (2) forget to inline a few, and kill your performance with lots of *tiny* calls that probably won't show up under VTune. The slowness of various compilers makes people afraid of putting a lot of small functions in headers where they belong, as any change would force a slow, full rebuild.
I've seen C++ compilers decide to inline a 4x4 matrix copy by unrolling a loop to read/write the first 12 elements, then call the Vector4 copy constructor. Worst of all worlds. Replacing that with a memcpy was a huge win. But, the only way one would know *how* to fix that is to be able to look at the disassembly.
Nathan Mates
These days instruction-level efficiency simply isn't important outside of a few niche areas (embedded systems, games, multimedia, certain kinds of low-level systems work). To imply that knowing what's happening on the silicon is "what really counts" is nonsense. Using appropriate data structures and algorithms counts, and making correct software counts even more, but worrying about how many cycles one instruction takes versus another is a serious misdirection of effort on modern machines!
It's folks like you who are the reason people still write their SSH daemons in C, and why we live in a mixed up world where we have neither stability NOR speed!
you will find that a profiler is quite useful.
For instance, over the last year I've been developing an automatic nesting server on Linux, and gprof was very important for spotting the functions that were consuming the most CPU time.
With gprof it was easy to notice two small functions that were responsible for 95% of the CPU usage.
As a result, I replaced those two small functions with 180 lines of optimized assembly code and got a very good performance increase, since I was using a lot of inter-word bit shifts that the C compiler didn't handle well.
Regarding multithreading, I've come to the conclusion that 9 times out of 10 you don't really need to use threads, even in interactive programs, since there are alternative ways of achieving the same effects.
For instance, all the X11 toolkits like Xt/Motif, Gtk+ and Qt have the concept of work-procedures and timeout-functions.
If you put all of your time-consuming operations inside work-procedures, you can get the same results as you would get with multiple threads, because you have an effective way of executing several tasks at the same time without blocking the user interface.
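A toolkit-agnostic sketch of the work-procedure idea, in Python. This is not the Xt, Gtk+ or Qt API: run_event_loop, handle_ui_events and count_primes are invented for the example, with a generator standing in for a work procedure that does one small chunk per call.

```python
from collections import deque


def count_primes(limit, chunk=1000):
    """Work procedure: do one small chunk of work, then hand control back."""
    found = 0
    for start in range(2, limit, chunk):
        for n in range(start, min(start + chunk, limit)):
            if all(n % d for d in range(2, int(n ** 0.5) + 1)):
                found += 1
        yield  # return to the event loop so the UI stays responsive
    return found


def run_event_loop(work_procs, handle_ui_events):
    """Single-threaded loop: service the UI, then run one chunk of one task."""
    pending = deque(work_procs.items())
    results = {}
    while pending:
        handle_ui_events()            # stand-in for the toolkit's event dispatch
        name, proc = pending.popleft()
        try:
            next(proc)                # run exactly one chunk
            pending.append((name, proc))
        except StopIteration as done:
            results[name] = done.value
    return results
```

Several tasks make progress in turn and the UI handler runs between every chunk, all without a second thread. The catch is that each chunk has to be kept short by hand.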
Fernando Pereira
But, the bottom line is that if you don't profile your code (and unit test it, and integration test it, and...), you are not writing good code.
That's hardly true. Certainly you shouldn't waste time optimizing code until you know where the bottlenecks are. But in a lot of cases--I'd even venture to say most cases--code gets written and is fast enough. In such cases, profiling is a waste of time. Profiling is only indicated if there's a legitimate performance problem.
To a lesser extent, the same is true of unit testing and integration testing. If you're writing some code to convert one image to a GIF and you run it successfully to get the GIF, there's no reason to unit test. Even if the code has horrible bugs on some inputs, the job is done. One-off code isn't (unfortunately) uncommon. Prototype code is also very common and often you don't need to do extensive testing on it, either. Any code where the total cost of code failure is lower than the cost of QA probably doesn't need to be QA'd (which is not to say that you should spend an amount on QA equal to the failure cost; if spending $1000 on QA reduces the chance of failure by 99.999% and spending $1000000 reduces the chance of failure by 99.9999%, the $1000 expenditure suffices in all but the most demanding applications)
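The QA arithmetic is easier to see with a failure cost plugged in. The sketch below assumes, purely for illustration, a $10,000,000 cost of failure and that failure is certain without QA; neither figure comes from the post. Under those assumptions the $1,000 spend leaves about $100 of expected loss and the $1,000,000 spend leaves about $10, so the extra $999,000 buys roughly $90.

```python
def total_expected_cost(qa_spend, failure_reduction, failure_cost):
    """QA spend plus the expected loss that remains after QA.

    Assumes, for illustration only, that failure is certain without QA.
    """
    residual_probability = 1.0 - failure_reduction
    return qa_spend + residual_probability * failure_cost


FAILURE_COST = 10_000_000  # hypothetical cost of one failure, in dollars

cheap_qa = total_expected_cost(1_000, 0.99999, FAILURE_COST)
thorough_qa = total_expected_cost(1_000_000, 0.999999, FAILURE_COST)
print(f"$1,000 of QA:     ${cheap_qa:,.0f} total expected cost")
print(f"$1,000,000 of QA: ${thorough_qa:,.0f} total expected cost")
```

The thorough option only wins when the failure cost is several orders of magnitude higher still, which is the "most demanding applications" case.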
Sumner
There are very few applications that don't reach out across a network for information. The bottleneck is usually this network communication. Check out Performant for tools that work on the network level.
.NET, C++, C all can theoretically produce software that is just as speedy as assembly but it rarely is. People still write assembly where performance really counts (games, realtime, etc.)
There's also a continuing trend of software developers spending users' computing power to make their jobs easier. Java, J2EE, C#,
Some people think that the wasted processing power is a crime. Me, I think it's just economics. It's much cheaper to pay for processing power than it is to pay for the developers to squeeze every last bit of performance out of an app.
However, there are some applications where profiling is absolutely required. Database engines, games, simulations, anything that is CPU-bound has the potential of benefiting from profiling.
You are not a beautiful or unique snowflake -- but you could be if you got off your ass.
I've solved some important real-world problems using Quantify and Purify, especially when dealing with a huge system with a lot of developers' fingers in the pie. One of the programs was handling 100,000+ transactions a day, and Quantify helped shave enough off so we didn't have to force all of our customers to upgrade their hardware.
Faced with a similar problem in Linux, I'd probably port the program to Solaris, Quantify it there, and hope the results are similar under Linux.
The next Cmdr Taco duplicate will be ready soon, but subscribers can beat the rush and see it early!
What if I run my software and it's plenty fast? Is it still bad if I don't profile?
But processes as provided by current operating systems are too expensive to use. If I have a network server (e.g. a httpd) that has to create a process for each network request, it will never scale. In theory all that has to happen is inetd (or equivalent) fork/execs and does the necessary plumbing so that the ends of the socket are STDIN and STDOUT. Then the process just reads and writes as necessary to fulfil the request. In practice, this just doesn't work.
That's why you can't use cgi for high-volume transactions. So let's make the server a single multithreaded daemon process instead, where each request is handled by a thread. Now you can handle each request much faster, but you lose the protected address space the OS gives you in a process.
Obviously, the OS needs to change, and give us something (maybe a hybrid between processes and threads) that more closely meets applications' needs. I don't see anybody making suggestions as to ways to move forward. Anybody know of research in this area?
For Java we have a really nice choice of profilers. There are basically three great products available, all of them have proved to be absolutely useful. There is JProbe, OptimizeIt and JProfiler (the 2.0 beta of JProfiler looks cool). I don't know what the problems on Linux are, but when programming Java, profiling is quite an enjoyable task.
Signature deleted by lameness filter.
mind you I have my own threads package - you need to if you want 1,000,000+ really small threads running together, with totally minimal stack space (4 bytes not the 1Mb that pthreads gives you). The only hard part was making gprof use SIGALTSTACK (which was broken in the kernel when I started). :-)
Of course this worked because from gprof's point of view I was running in one kernel thread - apart from that oprofile rocks
. . but not every profiler is a terrorist?
Something is very confusing here - if it's already multithreaded, then the whole program isn't blocked waiting for input. And the thread that's waiting for input should be using a blocking interrupt-driven call to get the password, rather than polling for whether the user typed it or not. There's no reason for a process or thread that's blocked waiting for input to use any CPU time, so there should be no "execution time" impact. Reducing time waiting for user input would improve the overall time to complete the job, but it's not really applicable to questions of processor loading for a computationally intensive task, which is what profiling's really for.
I think this is a clever troll, and some moderator fell for it. Looking at your diary pretty much proves this point. But you had me going there for a minute, so good job in that respect.
Oh yeah, while I'm thinking about it:
3000th post! Yay me!
Your right to not believe: Americans United for Separation of Church and
minimize the use of threads whenever possible. write your code in an event driven fashion as your friendly AC suggested. the poll() system call [superior to select(), though select() works well within its fixed size filedescriptor array limits] makes this possible.
the basic mentality to switch from threads to event programming is this: anytime you're using a thread solely so that it can sit around and block on high latency events (network or disk I/O) most of its lifetime, it should not be a thread.
it's acceptable to have worker threads/processes that you hand computational tasks to and they trigger an event in your event loop when they hand a result back, but don't use threads of execution to manage your state. you'll pull your hair out and still have a nonfunctional program.
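a minimal sketch of that event-driven style, using Python's standard selectors module in place of a raw poll() call (DefaultSelector picks epoll, kqueue, poll or select depending on the platform). run_event_loop and demo are invented names, and socketpair() just stands in for real network peers.

```python
import selectors
import socket


def run_event_loop(connections, expected):
    """One thread waits on all sockets and handles whichever becomes readable."""
    selector = selectors.DefaultSelector()
    received = {}
    for name, sock in connections.items():
        sock.setblocking(False)
        selector.register(sock, selectors.EVENT_READ, data=name)
    while len(received) < expected:
        ready = selector.select(timeout=1.0)
        if not ready:
            break  # nothing arrived in time; this sketch simply gives up
        for key, _events in ready:
            received[key.data] = key.fileobj.recv(4096)
            selector.unregister(key.fileobj)
    selector.close()
    return received


def demo():
    a_local, a_remote = socket.socketpair()
    b_local, b_remote = socket.socketpair()
    a_remote.sendall(b"hello from a")
    b_remote.sendall(b"hello from b")
    try:
        return run_event_loop({"a": a_local, "b": b_local}, expected=2)
    finally:
        for sock in (a_local, a_remote, b_local, b_remote):
            sock.close()
```

no thread ever sits blocked on a single connection here. the loop sleeps in select() until any of them has data, which is the switch in mentality described above.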
i'll always choose a program that exists and works with a good user interface over one that is never released because the author(s) thought it could be faster.
listen to your profiler. everything else lies.
Ok, you got me. Now, let's apply a common sense filter to my original post.
Of course "one off, disposable code" doesn't need the same degree of "analness" applied to it as does mission critical code.
However, "fast enough" is a really bad metric to use. Yes, utility "X" is fast enough. But oh, I didn't realize it was going to be used in conjunction with utility "Y" and "Z". Now, everything is really slow. Hey, can you say Microsoft?
Fortune telling is not part of any programming job description I've ever seen.
However, "fast enough" is a really bad metric to use. Yes, utility "X" is fast enough. But oh, I didn't realize it was going to be used in conjunction with utility "Y" and "Z". Now, everything is really slow. Hey, can you say Microsoft?
Hey, I need this report on my desk every morning. It takes 3 hours to run. Let's kick it off every night at midnight.
Fast enough, even though a well-coded, well-designed implementation might take seconds to run. And mission critical. No point wasting programmer time speeding it up when we can do another project with big upside instead.
This sort of thing is not uncommon at all.
Sumner
Um, but...I think there's a confusion of context occurring. The situation you describe happens when you're writing little chunks of one-off code to perform one task and be done with. Usually it'll be used once, or is part of a stopgap "until there's a real solution." If you're producing a product - if an entity external to your workplace is paying money for what you're producing, then your code isn't good without testing; and if you've got some spare cycles going on, profiling isn't too bad either. Something for a Malicious Coder to do when he's bored of adding bugs.
I'd even argue you have the same moral obligation to produce the same level of quality (in terms of well tested and possibly profiled code) if entities outside your workplace will use your software. Just because it was free doesn't mean it should suck.
IP is just rude.
Is there any torture so subl
First, the idea was to write in ASM to squeeze every drop of performance from the hardware.
Then, the idea was to write in a high-level language, but always be careful about performance.
Then, the idea was to develop apps quickly, then profile to optimize the important parts.
Now, screw optimization, let the user buy more hardware!
I think this attitude sucks. Even my 1.5Ghz Athlon-XP is slower running KDE 3.x (or any version of gnome for that matter) than my old 300Mhz PII was running Win98. And it doesn't do a hell of a lot of stuff that my old machine couldn't. I switched to Linux and took the performance hit because I hated Microsoft. I keep upgrading KDE (and my hardware) because the latest apps only work on the latest version. I don't expect more complex software to get faster, but I'd expect that as I upgrade my hardware, software should stay relatively the same speed. Yet, it seems as if software is getting slower more quickly than system bottlenecks (specifically RAM and hard-drive speed) can keep up. That means that the end-user experience is deteriorating, even as users pump more money into their hardware to get usable performance.
A deep unwavering belief is a sure sign you're missing something...
Isn't that what Open Source is all about?
The simple truth is that interstellar distances will not fit into the human imagination
- Douglas Adams
Um, but...I think there's a confusion of context occurring. The situation you describe happens when you're writing little chunks of one-off code to perform one task and be done with. Usually it'll be used once, or is part of a stopgap "until there's a real solution."
With testing, that's generally right. If something's going to run often, it can potentially fail a lot of times and so even a small cost of failure will be compounded to the point where QA is worthwhile.
With performance, that's often not true. There are a lot of jobs that don't need anything approaching "good" performance. Batch reports are one extremely common example (as is other batch processing): I need a web usage report on my desk or in my inbox every morning, so the quick-and-dirty multipass solution that takes 3 hours to run can be scheduled at midnight, and the programmer can then do another project with big ROI instead of spending time writing a faster solution that takes only seconds to run. Many applications fall into that domain, many of them absolutely mission critical and responsible for millions in revenue but also not worth spending time optimizing when that time could be better spent testing, adding features, or working on another project entirely.
And many (I'd say most) interactive applications are fast enough from the get-go and never need optimization. Sure, there are some apps that either do a lot of computation (mp3 players, games, compilers, etc), or are run many times at once (web servers), or are too slow when first run for unknown reasons. But a lot of programs are fine from the start and profiling them is a waste.
Sumner
And for getting even more useful information out, try Prospect. It works with OProfile - there was a talk about it at this year's Ottawa Linux Symposium, which you can find in the conference proceedings (gzipped PDF).
Do you even know anything about perl? -- AC Replying to Tom Christiansen post.
But what happens when the program fails overnight, and the poor user comes in in the morning to find that he doesn't have enough time to run the program again before the deadline? I bet at that point, he'd appreciate the well-coded, well-designed version...
But what happens when the program fails overnight, and the poor user comes in in the morning to find that he doesn't have enough time to run the program again before the deadline?
Then you profile and optimize, because it's not "fast enough" any more.
Is that hard to understand?
Sumner
I agree that profiling isn't always necessary, and that sometimes profiling and optimization won't reap any advantage, but I think the range between not necessary and useless is wide, and the advantage from profiling in that range is subtle but existent.
Additionally, profiling can serve other purposes. It's been suggested that, under a unit testing regime, a coder new to a project can serve as a "Malicious Coder," whose job it is to add bugs to code to catch out situations the unit tests miss. The advantage is that this can improve the testing as well as bringing up a new team member quickly. Profiling/optimization tasks can serve a similar purpose. By giving a direction to code investigation, it speeds the acquisition of familiarity with the code.
If you don't take a cursory run with a profiler on it, you'll never know the real cost of speeding it up.
It's worth a quick overview of the profile, to determine how long it would take to optimize said report.
I talk from painful experience - a job I once worked at ran overnight DB jobs on their Oracle database. Nobody bothered checking for efficacy of their SQL until the jobs that had accrued grew to take more than 8 hours in total, and were still running when users came in for the morning.
Then, with a scant four days of programmer time, the jobs got pared back to three hours, AND some bugs got fixed. If they'd done that a few months earlier, we would have avoided 4 months of pain and anguish from users coming in, trying to use the system, and screaming bloody murder because it still wasn't available for them at 7:30 AM.
How *NIX grognards always complain about multi-threading, but don't find signals (and their nasty interrupt-driven nature) to be the least bit unsettling!
Now, compute intensive code tends not to spend a lot of time in system calls, so it isn't clear that it matters whether a profiler counts time spent in system calls. I kind of prefer if it doesn't because it doesn't clutter up the profile with I/O delays (which are usually unavoidable).
If you want to find out where your code is spending time in system calls, you can use "strace -c".
There are also gcov-like tools that can be used for profiling via code insertion (as opposed to statistical profiling like gprof), although I'm not sure whether PC hardware has the necessary timer support.
Overall, the answer is: yes, profiling still matters for programs that push the limits of the machine. But fewer programs do. I think most people would be a lot better off not programming in C or C++ at all and not worrying about performance. Too much worry about "efficiency" often results in code that is not only buggy but also quite inefficient: tricks that are fine for optimizing a few inner loops wreak havoc with performance when applied throughout a program. Too much tuning of low-level stuff also causes people to miss opportunities for better data structure and program logic. This is actually an endemic problem in the industry that affects almost all big C/C++ software systems. Desktop software, major servers, and even major parts of the kernel should simply not be written in C/C++ anymore.
The thing with profiling and optimization is to know when to stop, and few people know that. So, maybe the best thing to say is: "no, profiling doesn't matter anymore". That will keep most people out of trouble, and the few that still need to profile will figure it out themselves.
> If you don't take a cursory run with a profiler on it, you'll never know the real cost of speeding it up.
Right. It's obviously a cost/benefit tradeoff. If you start the report at midnight and need it at 8:00 in the morning, then if it takes 15 minutes to run you probably don't even want to think about profiling. If it takes 7 hours, it's still fast enough for now but you may want to concern yourself with whether it'll always be fast enough. What's the cutoff? 1 hour? 4 hours? Depends on how crucial the report is and what other projects are on your plate at the moment.
Obviously "performance problem" is tough to quantify in general, but I still contend that you should normally only profile if there is a potential performance problem (or if you have idle resources, etc). Otherwise, go do some QA. Work on a new project. Clean up the nasty hack you wrote late at night to get it going. Write some documentation. Whatever.
Sumner
rage, rage against the dying of the light
_Native_ _x86_ multithreading is useless and harmful.
1) It heavily reduces the number of available processes, a very tight resource in Linux
2) It makes programs cumbersome and hard to debug
3) On x86 Linux it was not a good idea to implement thread switches as context switches; on x86 that is a very costly operation.
But this is only a rant. Like MIME, this flawed technology is already widely used, and there is no way to turn back.
(Why MIME? MIME is a stupid thing: a dirty hack created because nobody wanted to rewrite the old 7-bit protocol from scratch.)
What would be more useful is if compiler implementors spent as much time on the profiler as on the compiler: you would then be able to easily see the faulty parts of your software and determine what needs to be optimized.
Better yet, if an architecture has a static branch predictor that encodes "mostly taken" or "mostly not taken", the compiler could emit profile code that measures how fast a particular variant runs and then take that into account for the next optimization pass.
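For what it's worth, current GCC can already do something along these lines: build with -fprofile-generate, run the program on typical input, then rebuild with -fprofile-use so the measured branch behaviour feeds the next optimization pass. The by-hand version of the same idea is a branch hint. A minimal sketch, assuming GCC or Clang (`__builtin_expect` is a compiler extension, not standard C); the LIKELY/UNLIKELY macros and sum_non_negative() are made-up names for illustration:

```c
/* Compiler extension (GCC/Clang): say which way a branch usually goes. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Sum the non-negative entries of v. Negative entries are assumed to be
   rare in this example, so that branch is marked "mostly not taken". */
long sum_non_negative(const int *v, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++) {
        if (UNLIKELY(v[i] < 0))
            continue;
        total += v[i];
    }
    return total;
}
```

The hint changes code layout only, never the result, so a wrong guess costs speed but not correctness. Measured profiles are generally a safer source for such hints than programmer intuition.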
Will I retire or break 10K?
I program a lot in C++, and I particularly like to use the STL. Thus, my programs often have a lot of inlined functions in them. I have found gprof to be much less useful when profiling such programs.
When a function is inlined, gprof does not account for that function's time. Nor should it be expected to, since optimizations may reorder the code so much that it is not feasible to attribute a particular assembly instruction to a particular function. I have tried recompiling my programs with -fno-inline to expose the names of the inlined functions, but this changes the program performance so much in some cases that I am hesitant to draw any conclusions about a program from such a profile. Short of abandoning inlining (and interprocedural optimizations, which pose the same sort of problem), does anyone have suggestions on how to profile such programs?
From a QA perspective, multi-threading has the same problems as global variables -- too much coupling, too much exposed data. If you are running a multiprocessor system and the program can be parallelized, then multithreading might be worth the trouble. But for most applications it causes more trouble than it's worth.
Five bucks says that this server is slashdot'ed within the hour, so you may have more success with the less descriptive SourceForge project page, which indicates that the project is not dead, as the homepage says.
I discovered this program when I was optimizing some code I wrote to multiply sparse matrices. By the time I had gotten it 100x faster than the initial code, gprof had lost all semblance of granularity and was giving me obviously bogus results. The problem is that such things as cache performance (i.e. optimizing for cache hits) were now heavily affecting the profile and gprof could not figure such things out. FunctionCheck works much better than gprof and actually generates accurate profile information under high-stress situations.
From the homepage (all grammatical errors theirs):
"I created FunctionCheck because the well known profiler gprof have some limitations:
- it is not possible to change the profile data file name
- multi-threads / multi-processes is not supported
- time spend in non-profiled functions is discarded
- you can't control the way profile is made
- memory profile is not managed
For all these limitations, and by the fact that I discovered a new gcc feature called -finstrument-functions, I decided to write my own profiler. My approach is simple: I add (small) treatments at each enter and exit of all the functions of the profiled program. It allows me to compute many information:
- the current call-stack
- the time at each action, to compute elapsed times in functions
- process PID / thread ID, to manage multi-threads / multi-processes
- number of calls to functions
...
With these information, I can generate profile data files (for each thread / process), which describes all the statistics (at function level) for the program execution."
Try it out and please contribute some source code.
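For anyone who hasn't met that gcc feature: when code is compiled with -finstrument-functions, GCC inserts calls to __cyg_profile_func_enter() and __cyg_profile_func_exit() around every instrumented function, and you supply the bodies. Here is a rough sketch of such hooks. It is not FunctionCheck's code; the per-thread counters and the profile_depth()/profile_calls() accessors are hypothetical names, and a real profiler would also record timestamps and a thread ID at each event.

```c
/* Per-thread state (C11 _Thread_local), so each thread gets its own
   call depth and call count. All names here are illustrative. */
static _Thread_local int call_depth = 0;
static _Thread_local unsigned long call_count = 0;

/* GCC calls this on entry to every function built with
   -finstrument-functions. The hook itself must not be instrumented,
   or it would call itself recursively. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)this_fn;      /* address of the function being entered */
    (void)call_site;    /* address it was called from */
    call_depth++;
    call_count++;       /* a real profiler would take a timestamp here */
}

/* GCC calls this just before every instrumented function returns. */
__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
    call_depth--;
}

/* Hypothetical accessors for a report routine running in the same thread. */
__attribute__((no_instrument_function))
int profile_depth(void) { return call_depth; }

__attribute__((no_instrument_function))
unsigned long profile_calls(void) { return call_count; }
```

Because the state is per thread, this sidesteps the multi-thread limitation listed above, at the price of a function call on every entry and exit.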
> and thus are primarily limited by user response times, which are many orders of magnitude longer than even the worst algorithm.
User response time is many times faster than the time it takes for a function that's stuck in an infinite loop to return.
Here's a call-graph profiler I wrote a while ago. It's rough and ready, but it does the trick, and works with both multiple threads and shared libraries.
x86 Linux only. README.txt included, if it breaks keep all the pieces.
http://homepages.ihug.co.nz/~suckfish/scgprofile/
The output is a series of records vaguely like the gprof call graph:
calling_functions
function_name
called_functions
The numbers are numbers of samples.
Ok, I agree that non-threaded code can be easier to understand than threaded code, and can usually run faster.
However, I'd like to hear what you'd do in a situation like this: You have a network server that has to respond to incoming packets within something like 50 ms (it's for a multiplayer action game). The server also needs to keep track of player information in a database, so it uses something like the mysql C client library. But then it has to block on a mysql_query or mysql_insert, because that library doesn't provide any way to do things asynchronously (this may have changed with newer versions; I looked at mysql a few years ago). Or what about DNS resolution? Or any other blocking event other than plain IO through an fd?
One solution might be to fork off a process doing the mysql, and have it communicate with the parent through pipes/sockets. But you have to invent a small, one-time protocol for each of these you do. And did I mention this has to run on Windows too, whose IPC sucks?
Well, if gcj's JVMPI becomes fully usable, maybe we could use a tool like JProfiler for natively compiled Java code. That would be great.
You rock.
But he didn't say that... He said that programmers should know where to invest the effort, and take an interest in creating efficient code. That means, first and foremost, exactly what you just said: you have to be smart about your DS&As, aware of what you're writing and not pointlessly lazy when coding. It doesn't mean, and wasn't claimed to mean, that you have to micro-optimise everything at the assembler level.
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
I couldn't agree more. Sadly, the fact that almost everyone replying to your post thinks it is advocating premature optimisation at the level of assembly-level tweaks makes your point all too well.
Don't use threads.
The problem you are complaining about profiling having is that it can't profile threaded programs. Don't write threaded programs, and the problem is solved.
Frankly, I've always considered threading useful for only a few situations:
o When you have an SMP system, and you need to scale your application to multiple CPUs so that you can throw hardware at the problem instead of solving it the right way
o When you have programmers who can't write finite state automata, because they don't understand computer science, and should really be asking "Would you like fries with that?" somewhere, instead of cranking out code
o When your OS doesn't support async I/O, and you need to interleave your I/O in order to achieve better virtual concurrency
Other than those situations, threads don't make a lot of sense: you have all this extra context switching overhead, and you have all sorts of other problems -- like an inability to reasonably profile the code with a statistical profiler.
OK... Whew! Boy do I feel better! 8-).
Statistically examining the PC, unless it's done on a per thread basis, is just a waste of time in threaded programs.
If you want to solve the profiling problem for threaded programs, then you need to go to non-statistical profiling. This requires compiler support. The compiler needs to call a profile_enter and profile_exit for each function, with the thread ID as one of the arguments. This lets you create an arc-list per thread ID, and deal with the profiling separately, as if you had written the threads as separate programs. It also catches out inter-thread stalls.
-- Terry
Bah yourself. Who's this Knuth guy, and what the hell does he know about efficient programming, anyway?
Read again what he said: _minimize_ thread use doesn't mean _don't_ use threads. Obviously there are cases where you can't avoid using threads (like the one you described), but most of the time it's possible to poll on resources (poll() on UNIX, select() on BSD, WaitForMultipleObjects() on Win32), and that should be done to save on system overhead.
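As a small illustration of the polling approach, here is a sketch using POSIX poll() to watch two descriptors from a single thread. wait_readable() is a made-up helper name, and two descriptors is an arbitrary choice to keep it short.

```c
#include <poll.h>

/* Wait up to timeout_ms for either descriptor to become readable.
   Returns the index (0 or 1) of the first readable descriptor, or -1
   on timeout or error. */
int wait_readable(int fd_a, int fd_b, int timeout_ms)
{
    struct pollfd fds[2] = {
        { .fd = fd_a, .events = POLLIN },
        { .fd = fd_b, .events = POLLIN },
    };

    int ready = poll(fds, 2, timeout_ms);
    if (ready <= 0)
        return -1;                  /* 0 means timeout, < 0 means error */

    for (int i = 0; i < 2; i++)
        if (fds[i].revents & POLLIN)
            return i;
    return -1;                      /* only error or hangup flags were set */
}
```

A server's main loop would call something like this with its sockets and then service whichever one is ready, with no threads and no mutexes involved.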
Why are these mutually exclusive? There's efficient and there's optimised, and one is a much easier subset of the other.
He's not claiming that everyone should hand-optimise from the word go. He's saying programmers should have a basic knowledge of their craft. It doesn't take much extra effort to use an efficient sorting algorithm or store data in a fast look-up structure, rather than writing a naff, hand-crafted shuffle sort and using arrays for everything whether they're appropriate or not. And yet, through ignorance or plain laziness, most programmers in most languages take the latter approach. (If you've never seen any of the source code for big name applications/OSes, trust me, it's scary.)
Similarly, it is just careless to pass large structures by value unnecessarily in a language that has reference semantics. You have to know the basics of what is efficient use of your tools of choice if you want to write good code, and the old Moore's Law excuse is just a cover for laziness and failure to do the requisite amount of homework.
Note that, very importantly, none of these things requires more than a small effort. They certainly don't compromise maintainability, bug count or any other relevant metrics, and a competent programmer (if you can find one) will take these things in his stride, and still be faster than the others.
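To put the pass-by-value point in concrete terms, here is a sketch in C, where a pointer to const plays the role that a reference would play in C++. The struct and function names are invented for the example.

```c
#include <stddef.h>

#define SAMPLES 4096

/* A deliberately large structure (16 KiB if int is 4 bytes). */
struct big_record {
    int samples[SAMPLES];
};

/* Pass by value: the caller hands over a copy of the whole structure. */
long sum_by_value(struct big_record r)
{
    long total = 0;
    for (size_t i = 0; i < SAMPLES; i++)
        total += r.samples[i];
    return total;
}

/* Pass by pointer to const: only an address is passed, nothing is
   copied, and the callee still cannot modify the caller's data. */
long sum_by_pointer(const struct big_record *r)
{
    long total = 0;
    for (size_t i = 0; i < SAMPLES; i++)
        total += r->samples[i];
    return total;
}
```

Both functions return the same answer; the second just skips the copy. An optimizer may remove the copy when it can inline the call, but that is not something to count on across translation units.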
Interesting... We have just acquired a new P4/2.2GHz with 512MB RAM and running WinXP as a development machine at work. You know what? It's way, way slower than the 1.4GHz P4 running 2000 we already had. And that in turn is way slower than the 1GHz P3 running NT4. This is not subjective, it is based on obvious, objective measures. For example, my new machine (the fastest of the above) sometimes takes 3-4 minutes to act on an OK'd dialog in Control Panel. The NT4 box reacts instantly when you configure the equivalent options. Something is wrong at this point, and I'm betting it's a combination of code bloat and feature creep.
gprof is maybe not the most impressive tool to use, but it's quite useful. At an IA64 course at university [German, sorry] we used gprof to identify the bottlenecks in the C code of the xvid codec. Then we assembler-optimized like mad and got quite a nice speed-up.
Results can be found in our wiki: Pre-Optimization, Post-Optimization
Without gprof we would have been lost... (our IA64 wiki)
comments about the poor implementation of
threads in Linux. Other writers suggest
avoiding threads, if possible. Note
that Java is nothing but threads. Any
Java program is running 4-6 threads (depending
on the JRE) right out of the box.
Where I work, we have had
severe problems getting Java programs to
work correctly on Linux. The IBM Java
support team has shared our frustration.
Maybe IBM's new thread implementation is
needed, just to get Linux, Java (and
thread users in general) working correctly
in an enterprise environment. After that
is working, then we can see about improving
other areas like performance.
See http://www.BitWagon.com/tsprof/tsprof.html for info on a process profiler that uses hardware performance counters (with no recompile and no relink) and gives both interactive and text output in tree and flat modes.
I suppose I'm already following that advice: I'm using a combined thread/main-loop design in the current server. There are a fixed, small number of threads for things that are time-critical or might block, and also a loop running in the main thread that calls a bunch of registered functions every once in a while, for low-priority tasks.
The problem is this: once you start using threads, even a few, you have to start protecting your data structures with mutexes and probably use other synchronization methods as well. If your code already has mutexes around shared structures, then there's little harm in adding a few extra threads. The big benefit I can see from minimizing threading is to get it down to a single thread, so that you can eliminate all of the synchronization overhead.
Huh? You'd better hope your boss ain't reading this.
And your attempt at keeping folks from running commands you think they shouldn't is pretty damn feeble. How do you decide someone has typed in your 'eteled' command? Maybe a strncasecmp() or similar? So if I do a "strings -a" on your executable I'd see a weird string pop out like "eteled"? And maybe I'd wonder why it was stuck in with all the strings called "select" and "update"?
OOOH, you so smart.
Has anyone ever tried to run xxgdb and gdb against a threaded program?
Has anyone tried calling fork1() from within one thread of a multithreaded process? One third of your threads shouldn't disappear, one third shouldn't wind up hung, and one third shouldn't get brain-damaging memory corruption.
I've actually reached the point of telling folks that come to me with MT requirements for Linux processes to pay the money and run Solaris. It's cheaper than paying me to get Linux threads working correctly.
Here you go: fma()
Saying profiling is useless is equivalent to saying that algorithmic complexity doesn't have to be studied. This is absurd. ;)
If for example your profiler says your function foo() is executed 100 times with data of size 10 and 10000 times with data of size 20, you have a serious algorithmic complexity problem. Some of these problems can (and should) be handled beforehand, but a profiler is very useful for catching those that weren't.
Algorithmic complexity is independent of computer language and of CPU speed, therefore profilers will remain useful as long as programs use algorithms, so for quite some time still.
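A toy illustration of how call counts expose complexity: the naive duplicate check below compares every pair of elements, so on input without duplicates it performs n*(n-1)/2 comparisons. has_duplicate() and its counter parameter are invented for this sketch; a profiler would give you the same numbers without modifying the code.

```c
#include <stddef.h>

/* Naive duplicate check that compares every pair of elements.
   *comparisons receives the number of comparisons performed.
   Returns 1 if a duplicate was found, 0 otherwise. */
int has_duplicate(const int *v, size_t n, unsigned long *comparisons)
{
    *comparisons = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            (*comparisons)++;
            if (v[i] == v[j])
                return 1;
        }
    }
    return 0;
}
```

On distinct input, 10 elements cost 45 comparisons and 20 elements cost 190: doubling the data roughly quadruples the work. That pattern in a profile is the cue to look for a better algorithm (sorting first, or a hash table) before touching any low-level tuning.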
I wouldn't say that gprof is useless... threading, however, comes very close to it.
Threading is useful in the instance where you have an application that needs to scale with SMP and which you cannot, for whatever reason, fork. But the accompanying pain of being forced to pay extremely close attention and mutex lock the code all over makes it not worth it for most situations.
Use fork. Use other IPC methods if necessary. But don't thread or you'll spend an order of magnitude more time debugging.
That depends on your point of view. Personally, I write lots of technical documents, where every other word (ish) isn't in the dictionary. That "better interface" makes my screen unreadable, since it's littered with red. On top of that, I usually spell correctly in the first place, and look words up in a dictionary as I go along if I'm not sure. Spell checkers rarely have to correct genuine mistakes in my documents. So personally, I'd much rather see that feature done away with and have the performance back, rather than waiting for Word to catch up as I type, as I had to ten years ago. If it's useful for others, by all means have it as an option, but don't call it "better" in a blanket statement.
Thanks for that information. I'm about to upgrade my trusty PII/350 running Win98 to a nice, new top-of-the-range custom-built beastie. Well, it's been four years, and it was my birthday last week. :-)
I'd been considering installing Linux as an alternative to MS stuff, since I now object enough to the nature of Microsoft's attitudes to make the effort to switch. In the light of your information, I think I'll just install Win2K instead.
I agree with the rant, though. I have developed and maintained some significant-sized threaded applications, and I am loath to do it again unless necessary.
> What about databases like Oracle, MS SQL Server,
> and so on? They're internally multithreaded,
Oracle, at least the last time I played with it
was NOT multithreaded by default. There was a
multithreaded config option, but Oracle recommended
that you did not use it.
I remember reading somewhere recently that computers use about 17 percent of power from the electricity supply grid. The latest PCs need 300W supplies, which helps neither the environment nor availability and cost during times of peak load. It's even worse because of the extra demand for AC, not to mention fan noise. Code that lots of people run should avoid excess bloat and inefficiency, and not just to improve end-user experience as measured by things such as response times.
Hmm, it depends on how you pay for hardware (cpu-cycles for ex.), now doesn't it?
Take IBM's scheme for mainframes, where you pay for what you use... Here I can certainly see long-term gain / reduction of costs / benefit for a company optimizing a batch job from taking 3 hours to running in a few seconds. Often these bottlenecks don't cost astronomical sums, but "letting a ksh script loop in OMVS" certainly will.
In a society that believes in nothing, fear becomes the only agenda ~ Bill Durodié
C is rather common, and it has the "floating multiply-add" functions fma, fmaf and fmal.