A Review of GCC 4.0
ChaoticCoyote writes "
I've just posted a short review of GCC 4.0, which compares it against GCC 3.4.3 on Opteron and Pentium 4 systems, using LAME, POV-Ray, the Linux kernel, and SciMark2 as benchmarks. My conclusion:
Is GCC 4.0 better than its predecessors? In terms of raw numbers, the answer is a definite "no". I've tried GCC 4.0 on other programs, with similar results to the tests above, and I won't be recompiling my Gentoo systems with GCC 4.0 in the near future. The GCC 3.4 series still has life in it, and the GCC folk have committed to maintaining it. A 3.4.4 update is pending as I write this.
That said, no one should expect a "point-oh-point-oh" release to deliver the full potential of a product, particularly when it comes to a software system with the complexity of GCC. Version 4.0.0 is laying a foundation for the future, and should be seen as a technological step forward with new internal architectures and the addition of Fortran 95. If you compile a great deal of C++, you'll want to investigate GCC 4.0.
Keep an eye on 4.0. Like a baby, we won't really appreciate its value until it's matured a bit.
"
If you really, positively need an extra 5% performance, you might as well just buy a computer that's 5% faster.
You should do both. Choosing the right algorithms is crucial, no doubt about it. But if you've got a massive database application, that 5% can represent a huge amount of work and be worth the trouble. A little bit of extra performance can, in many cases, go a long, long way towards adding to the value of the software. Both endeavors are intelligent in many (if not most) cases. Performance is important in software, and any little bit you can squeeze out will likely be a big deal.
I love the open source movement, but I wonder why the following comment is OK for open source projects and not closed source?
quote "That said, no one should expect a "point-oh-point-oh" release to deliver the full potential of a product, particularly when it comes to a software system with the complexity of GCC."
I bet no one would dare say that about a certain product from Redmond.
Unfortunately, including a faster computer with every copy of the code you distribute may be prohibitively expensive.
And in both groups you will find people who believe that execution speed is the measurement of code quality.
KFG
Considering the glacial pace of change in the Fortran world, using a "standard" created as recently as 1995 is considered to be a radical and reckless act.
Like a baby, we won't really appreciate its value until it's matured a bit
I'll just have to make sure you never babysit for me, if babies are that value-less to you.
Best death? What, die from a naked lady avalanche?
I think the problem is that, if I'm not mistaken, he's testing all C code except Povray. The biggest reported improvements in 4.0 were for g++, so using such a small C++ sample base (Povray - one purpose, one set of design principles, few authors) seems bound to produce inaccurate benchmarking.
;)
Further, on his most reasonable C benchmark (the Linux kernel), he only records compile time and binary size, but no performance. I call it the most reasonable benchmark because it has thousands of contributors and covers a wide range of code purposes and individual coding habits - and yet, performance is omitted.
In short, I wouldn't trust this benchmark. Probably the best benchmark would be to build a whole Gentoo system with both, with identical configurations, and compare build times and runtime performance.
Dear Lord: One of your creatures may be hurt tonight. Please let it be the other creature.
It is because it is easier to delve into needlessly technical aspects afforded by compiler settings and 'optimizations' than it is to admit that one's algorithm is not sound. Kids running Gentoo delude themselves into thinking that omitting the frame pointer on compiles is going to make a massive difference in terms of performance, and fail to remember it makes bug hunting far more difficult when applications crash. Additionally, the 5% gain mentioned can be a severe overstatement. I frequent a game programming board, and the widespread use of C++ has led to an abundance of nano-optimization threads, the most amusing of which was an attempt to optimize strlen().
Optimizing every single line of code is a complete waste of time, since the 80/20 rule generally applies. Use a profiler to determine where that 20% is.
The point of this article is compiler optimizations, not algorithm selection. By the time I look at compiler performance, I've already done all of the algorithm tuning, so your point is moot. This is a very interesting benchmark for those of us who already write good code and want the compiler to make the best of it.
Because it guesses what should be vectorizable. If it guesses wrong, your program will blow up.
My conjecture is that they require it to be enabled by hand so that people who know what they're doing enable it, watch what code blows up, then produce intelligent bug reports that can be directly linked to the vectorization, so that fixes can be produced for 4.0.1.
I'd say "thank god" after reading that text. And mod your comment "funny".
a=min(a,b)
that is one of the most self-descriptive statements i have ever seen.
I run Gentoo (not for performance, but mainly because I am familiar with it, and it is easy), and you know what... I don't bug hunt. And adding -fomit-frame-pointer or whatever the hell the option is (it's in my flags somewhere) doesn't cost me anything, makes my system, say, (made-up stat) 5% faster, and I am happy. It makes no sense why you should deride me (read: Gentooers) as an idiot. We're just end users, and if we can get a little bit of performance for free, well, why not?
Only on /. could this get modded up...
He's not a dumbass because he uses Gentoo. It's pretty obvious that he doesn't know what he's talking about. Straight from TFA:
Some folk may object to my use of -ffast-math -- however, in numerous accuracy tests, -ffast-math produces code that is both faster and more accurate than code generated without it.
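For readers wondering why -ffast-math draws objections: among other things, it allows the compiler to reassociate floating-point arithmetic, and floating-point addition is not associative. A minimal C sketch of the effect follows. The function names and the sample values are illustrative assumptions, not taken from the article, and the volatile qualifiers are only there to keep the intermediate sum rounded to double.

```c
/* Sum left to right: (a + b) + c. */
double sum_left(double a, double b, double c) {
    volatile double t = a + b;   /* intermediate rounded to double */
    return t + c;
}

/* Sum right to left: a + (b + c). */
double sum_right(double a, double b, double c) {
    volatile double t = b + c;   /* intermediate rounded to double */
    return a + t;
}
```

With strict IEEE evaluation, sum_left(1e20, -1e20, 1.0) gives 1.0 while sum_right with the same arguments gives 0.0, because 1.0 is lost when added to -1e20. Whether a reordering helps or hurts accuracy depends on the data, so a claim that fast-math was "more accurate" in some tests and the objection to it can both hold in particular cases.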
"I don't know about you, but if I want my math done fast and wrong I'll ask my cat" - Anonymous
Unless the GCC documentation is very wrong, the only tree-ssa optimizations in 4.0 which don't get turned on by default at -O3 are -ftree-loop-linear, -ftree-loop-im, -ftree-loop-ivcanon, -fivopts, and -ftree-vectorize. It's true that some of these may be good optimization wins (probably increasing compile time in the process, but that's what the higher optimization levels are all about), but there are plenty of tree-ssa optimization passes being used in these tests.
Auto-vectorization, by the way, does not fall into an "obvious optimization wins which perhaps should be enabled at -O3 by default" category. It can bring very big performance benefits in some situations, but it should be used with caution.
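To make this concrete, here is a sketch of the kind of loop a vectorizer looks for: a counted loop over arrays with no dependence between iterations. The function name is hypothetical, and restrict (C99) is the programmer's promise that the two arrays do not overlap; if that promise is false, the behavior is undefined regardless of optimization level. Whether GCC 4.0 actually vectorizes this particular loop has not been checked here.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for every i in [0, n).
   Each iteration is independent of the others, so a vectorizer
   may process several elements per SIMD instruction. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Something like gcc -std=c99 -O2 -ftree-vectorize would be the way to ask for it, since per the parent comment the flag is not implied by -O3 in 4.0.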
I think that companies should re-evaluate their "need" for an extra 5% performance.
If you're talking about a program written by one person to be run by one person, or written by five people to be run by five people, or a program that will be run a limited number of times or while people are getting coffee, then absolutely you are correct.
But if you're talking about a small group of programmers making an interactive program (including simulations which people wait for the answer to before starting another run) to be run by millions of people, or to be run iteratively millions of times or over an enormous dataset of comparable size, then 5% is absolutely worth it. If you spend 10 man-hours tweaking out 5%, and you've gained a mere 100 milliseconds per run, then as a whole you've made out quite well once the collective time saved by those millions of people, or by the millions of runs, is accumulated. And often 5% can result in much more time savings than that.
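The arithmetic behind that claim is easy to check. The sketch below uses the figures from the comment (10 man-hours of tuning, 100 milliseconds saved per run). The function name is made up for illustration, and it assumes an hour of programmer time and an hour of user time count equally.

```c
/* Number of runs needed before the time saved equals the time spent tuning.
   Integer arithmetic in milliseconds avoids floating-point rounding. */
long break_even_runs(long tuning_hours, long saved_ms_per_run) {
    long tuning_ms = tuning_hours * 3600L * 1000L;
    return tuning_ms / saved_ms_per_run;
}
```

break_even_runs(10, 100) comes to 360,000 runs, so at a million runs the tuning has paid for itself nearly three times over.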
If you really, positively need an extra 5% performance, you might as well just buy a computer that's 5% faster.
If you can afford all the computers that are 5% faster, then do both! Then you get 10%, and double the benefit. If the first 5% makes a significant difference for a certain application, then the second one probably will as well.
Uh, and this review is helping us... how?
lame uses assembler code for vectorization. One of the new features of gcc 4 is the beginnings of a vectorization model. A good test for gcc 4 would have been to compile some C-only bignum libraries, and Ogg Vorbis! povray is also a good example, but then you need to test more than one specific test-run. Maybe gcc 4 makes radiosity in pov-ray 400% faster at a 2% cost in the rest of the code?
This guy is the Tom's Hardware of Linux reviews, except he doesn't have the annoying ads, and he does not split his lack of content over 30 HTML pages.
The new warnings of gcc 4 have helped me find a bug in my code. That saved me a week. Consider how much faster gcc 4 needs to make pov-ray or lame to save you a week of work!
gcc 4 can now reorder functions according to profile feedback. That should make large C++ projects faster. Also, the ELF visibility should make KDE start much faster. This should have been tested!
Please note that I'm not saying gcc 4 produces faster code. I don't rightly know. I do know it produces smaller code for my project dietlibc, where size matters more than speed.
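For anyone who has not met the ELF visibility feature mentioned above: GCC 4.0 lets a library mark symbols as hidden so they are not exported from a shared object, which leaves the dynamic linker with fewer symbols to resolve at load time. A minimal sketch follows; the function names are hypothetical and the startup benefit for any given project is not measured here.

```c
/* Internal helper: callable inside the library, not exported from it. */
__attribute__((visibility("hidden")))
int helper_square(int x) {
    return x * x;
}

/* Public entry point: exported as usual. */
__attribute__((visibility("default")))
int api_sum_of_squares(int a, int b) {
    return helper_square(a) + helper_square(b);
}
```

Building with -fvisibility=hidden makes hidden the default, so only the symbols explicitly marked "default" end up exported.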
Be careful though. If you want to use SSE only, make sure to supply -fsingle-precision-constant. Otherwise something terrible happens. If you write
float x, y, z;
x=y*13.2*z;
gcc will treat 13.2 as a double, so it will do half of the expression with SSE and the other half with the FPU.
That is in fact worse.
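The type promotion behind this can be seen without reading the generated code. In C, an unsuffixed constant such as 13.2 has type double, so multiplying a float by it yields a double expression; the f suffix keeps the expression in float. A small sketch, with function names made up for illustration:

```c
#include <stddef.h>

/* Size of the type of (float * 13.2): the constant is a double,
   so the whole product is a double. sizeof does not evaluate it. */
size_t size_with_double_constant(void) {
    float y = 1.0f;
    return sizeof(y * 13.2);
}

/* Size of the type of (float * 13.2f): the constant is a float,
   so the product stays a float. */
size_t size_with_float_constant(void) {
    float y = 1.0f;
    return sizeof(y * 13.2f);
}
```

So, as far as the types go, writing x = y * 13.2f * z; keeps the expression in single precision without any compiler flag.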
I wish gcc had an option
--fbest-code-for-current-machine
where it will enable all the options to get the fastest code for the machine it is executing on. So no deps on the incompetent autoconf scripts, etc. The compiler detects the CPU and turns on the appropriate options.
gcc 4.0 just tries to follow standards. If something doesn't compile with gcc 4, don't blame the compiler. The source code was broken in the first place.
{{.sig}}
ACK! If 70% of your time is spent in a serialization function call, FORGET about optimizing the function call.... You are WAY too fine grained in your algorithm for effective parallelization. He'd have been better off running the whole damn thing serially on a single box methinks. His fancy grid algorithm spent more time doing "grid" stuff than working on his problem!
Auto-vectorization is not the reason why Intel's compiler is better. It certainly helps, but in my experience, not much. Intel's compiler just does better optimizations across the board. Which is no surprise: Intel is making a compiler for their chipsets. They have inside knowledge of what's best to do when for a particular chip. Further, Intel's compiler is marketed as the fastest compiler for x86, which as far as I know, is true. Hence, they spend a lot of time on the optimizations.
GCC, on the other hand, has a different goal: get a working compiler on as many platforms as possible.
If I ever see a developer do something as stupid as this on a job application, there's no way they will ever get a job working for me.
Having clean, readable code is far, far more important than saving a few minutes in total in a project. Using compiler-specific features is generally frowned upon, but acceptable in cases where there are significant performance or time gains. Using a compiler-specific alias to save yourself a few extra keystrokes at the extreme cost of readability is just being lazy, and not thinking about how that code will be maintained in a year.
I feel the same way about the ternary operator, actually. There are a few cases where it's clear enough to be used, and where it saves several lines of typing. However, 95% of the time that people use it, it only makes the code impossible to understand.
No comment.
If you have a choice of algorithms, then of course use the better algorithm. But for most of the day-to-day code we deal with, we don't have that choice, because we're not dealing with code that has any grand algorithms to it. For example, if I'm writing a GUI frontend to a command line app, what are my choices of algorithms? Not much.
In my real life coding work, the places where algorithm efficiency makes a difference are far outweighed by those places that don't. And of those places that do make a difference, the performance is rarely a critical need. For example, I just coded up some RAMDAC lookup tables, and a difference of algorithm would make a huge difference in efficiency. But this particular routine was triggered by a user event (clicking a button in a config dialog), so that my dogslow but highly readable/understandable algorithm wasn't a bottleneck for anything. In this case tweaking the compiler settings would have given a 5% boost to everything, but a change in algorithm would only have given a 1/10 second boost for an event that would happen approximately once a week or less.
Don't blame me, I didn't vote for either of them!
You know, I remember when someone did this to GCC 3, comparing against 2.95.
4.0.0 is a brand new compiler. Lots of techniques in it are brand new. Lots of tweaks and polish can be applied. If you actually take the time to compare 3.4 to 3.0, you'll find that the gap is bigger than 4.0 to 3.4. Furthermore, if you compare 2.95 to 3.0, you'll find 2.95 is better than 3.0 by a much wider margin than 3.4 is to 4.0.
This is a misunderstanding of the nature of progress. 4.0 is a brand new compiler with brand new internal behaviors. Lots of things are at the It Works stage, instead of the It's Efficient stage. You can't compare a 3-year polished compiler to a 3-week polished compiler; it's utter nonsense.
If you want to compare 4.0 to something, compare it to 3.0, or sit down.
StoneCypher is Full of BS
I thought forced upgrades were something Microsoft did, not the open source community. I guess I was proven wrong.
You know, I'm sick and tired of idiots who think some standard on a piece of paper is more important than real-world code. The fact of the matter is that there is a lot of real-world C code out there that has compiled just fine for years until the GCC developers got all prissy and started deliberately breaking code.
I'm sick and tired of patching just about every C program written more than two years ago just because the GCC developers decided to break code that compiled just fine.
To hell with the standards. I want something that compiles all of the open source code which compiled fine five years ago without forcing me to make huge patches in the name of standards.
If the standards do not reflect real-world code, they need to be rewritten.
Exactly. You couldn't distinguish an app compiled with frame pointers disabled from one with frame pointers enabled. The reason I point this particular optimization out is that it makes life hell for developers if you end up sending a core dump file in. This is a net loss in my book, since open source software evolves rapidly and being able to assist in reporting bugs is vital.
How is pointing out that one optimization people crow about is largely ineffective being an asshole?
Is this a troll? In most high-performance computing environments (national and local supercomputing centers) -- at least when they are being well-utilized -- they are in non-stop use. You don't start ten minutes sooner because you or another user are hammering the machine with other jobs.