Speed Test: Comparing Intel C++, GNU C++, and LLVM Clang Compilers
Nerval's Lobster writes "Benchmarking is a tricky business: a valid benchmark tries to remove all extraneous variables in order to get an accurate measurement, a process that's often problematic: sometimes it's nearly impossible to remove all outside influences, and often the process of taking the measurement can skew the results. In deciding to compare three compilers (the Intel C++ compiler, the GNU C++ compiler (g++), and the LLVM clang compiler), developer and editor Jeff Cogswell takes a number of 'real world' factors into account, such as how each compiler deals with templates, and comes to certain conclusions. 'It's interesting that the code built with the g++ compiler performed the best in most cases, although the clang compiler proved to be the fastest in terms of compilation time,' he writes. 'But I wasn't able to test much regarding the parallel processing with clang, since its Cilk Plus extensions aren't quite ready, and the Threading Building Blocks team hasn't ported it yet.' Follow his work and see if you agree, and suggest where he can go from here."
compiled with clang
The benchmarks in TFA are a little funny. Why is system time so large while user time is so small? The only time I've seen this in real applications is when there is major contention between cores for resources.
Which one produced the fastest code?
What on earth does compiler benchmarking have to do with the BI section of slashdot?
Furthermore, why on earth are you idiots creating a blurb on the main screen that just links to a different slashdot article? It's such terrible self promotion. Just freaking write the main article as the main article. No need to make it seem as if the Business Intelligence section is actually worth reading; it's not.
Well.. maybe. Or Maybe not. But Definitely not sort of.
first post++
man, it took a long time to read it.
Interesting info, but I have a couple of issues:
First off, why wasn't Microsoft's C++ compiler included in this? That's the one we use at work, so that's the one I'd really like compared to all those others. Are we the only ones still using it or something?
More importantly, why on earth was compilation speed the only thing compared? I mean, I suppose it's nice for g++ users to know that their 10 minute compiles would have been 2 minutes longer if they used the Intel compiler, but Intel users might not really care if they believe their resulting code is going to run faster. Speed of compilation of optimized code is a particularly useless metric, because different compilers have different definitions of "unoptimized", so it's guaranteed you aren't comparing apples to apples.
I suppose compilation speed is a nice metric to brag about between compiler writers. But for compiler users, the most important things are roughly these, in order: toolchain support, language feature support (e.g. C++11/14 features), clarity of error/warning messages, speed of generated code (optimization), and lastly speed of compilation. I'm not really sure why you took it upon yourself to measure the least important factor, and only that one.
The main claim for g++ for a very long time was "while it does not optimize much or support all of the language, it is FREE". With clang on the scene and offering a comparable feature set and speed of compiled code, it will be interesting to see how g++ and the gnu compiler collection in general will fare over time. Especially as a part of the canonical GNU core.
You obviously don't work on large projects where build times can be 30 minutes and link times can be 5-10 minutes on top of that. In the past we have tried just about everything possible to make our compiles faster because it allows more iteration and less time waiting on code building. This includes minimizing include dependencies and looking at dependency graphs, benchmarking distributed build systems (incredibuild), working with pre-compiled headers, examining unity builds / unified builds (think one CPP that includes many other CPPs in the same system), etc. We also buy fast hardware (8-core CPUs with 16 threads, 32 GB of memory, and fast SSDs). All because minimizing build time means more productive time for developers.
The code in the benchmark runs a parallel for over a 10 billion element array but in steps of 100 elements.
It's going to be limited by the creation and destruction of threads.
Also, by not initializing the input array, the floating point arithmetic is vulnerable to eventual denormal values.
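For anyone who wants to check the denormal point on their own input data, here is a minimal sketch. It assumes IEEE 754 floats and the default floating-point mode (no flush-to-zero); count_subnormals is a hypothetical helper, not something from TFA.

```c
#include <float.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: count how many elements of a float array are
   subnormal (denormal). A benchmark input with a nonzero count may be
   measuring slow-path floating point rather than the compiler. */
size_t count_subnormals(const float *data, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (fpclassify(data[i]) == FP_SUBNORMAL)
            count++;
    }
    return count;
}
```

Filling the array with known normal values before timing sidesteps the question entirely.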
Speedups from algorithmic changes will far outperform anything an optimizing compiler can produce. A compiler that compiles faster lets you focus on algorithmic optimizations more than a slow one does.
Besides if it's spending 80% of the time idle, then the program is waiting for the user not the other way around.
Why, so we can have more first posters?
I read TFA and all I got was this lousy cookie
Besides if it's spending 80% of the time idle, then the program is waiting for the user not the other way around.
Bingo. When the software is waiting for something to do 80% of the time, and nothing else of any importance is running on that machine, optimization is pretty much irrelevant; at best it would save a tiny amount of power by slightly reducing CPU usage.
first ++pre
+1 IDisagreeSoHeMustBeATrollOrAnAstroturferOrAShill
I am a scholar and study parallel computing. These benchmarks are pretty much pointless. You cannot draw any conclusions from these results. Here the author times the whole execution, from the creation of the process to its destruction. That means lots of overhead is included that would count as startup time in a real application.
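A rough sketch of timing only the region of interest, assuming a POSIX system with clock_gettime. The kernel sum_of_squares is a hypothetical stand-in for the benchmarked loop, not the code from TFA.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel standing in for the benchmarked loop. */
static double sum_of_squares(const double *data, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += data[i] * data[i];
    return total;
}

/* Returns the wall-clock seconds spent inside the kernel only. Process
   startup, allocation and teardown are outside the timed region. The
   result is stored through *result so the work cannot be optimized away. */
double time_kernel_seconds(const double *data, size_t n, double *result)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    *result = sum_of_squares(data, n);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

In practice you would also repeat the measurement several times and discard a warm-up run.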
There is also apparently no thread pinning to computational cores. This is known to make a HUGE difference.
Then the author compares Cilk results. Cilk is known to be slow for simple codes that do not need work stealing and do not have complex dependencies. For the record, I know he is also comparing TBB. But TBB is implemented on top of the Cilk engine in the Intel compiler (I don't know about gcc).
In these results hyperthreading is enabled. The proper use of hyperthreading is complicated. There are some problems where it helps, others where it harms, and I would not be surprised if this behavior were compiler dependent.
Finally, it is almost impossible to compare compilers. On different platforms, with the same compilers you will get different results. Some functions are better compiled by one compiler and some functions are better compiled by the other compiler. This has been reported over and over and over again.
If you care about performance, you should not rely on what your compiler is doing behind your back. You need to know what it is doing. Depending on memory alignment (and what the compiler knows about it), on how the vectorization happens, and on potential memory aliasing, you will get different results.
If you care about performance, you need to benchmark and you need to optimize and you need to know what the compiler does.
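As one small illustration of "what the compiler knows" about aliasing, here is a sketch using the C99 restrict qualifier. The function name scale is hypothetical, and whether a given compiler actually vectorizes the loop depends on the compiler and flags, so inspect the generated code rather than assume.

```c
#include <stddef.h>

/* out[i] = in[i] * k for i in [0, n).
   The restrict qualifiers are a promise by the caller that out and in do
   not overlap; calling this with overlapping arrays is undefined behavior.
   Without that promise the compiler has to allow for a store to out[i]
   changing a later in[j]. */
void scale(float *restrict out, const float *restrict in, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * k;
}
```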
#include <stdio.h>

void FirstPost(int a, int b)
{
    if (a < b)
        printf("I got first post!");
    else
        printf("No, I got first post!");
}

int main(int argc, const char** argv)
{
    int i = 0;
    // What prints out here?
    FirstPost(i++, i++);
    return 0;
}
The best thing about UDP jokes is I don't care if you get them or not
Assuming typical C calling convention.... "No, I got first post" will be printed, where a will be 1 and b will be 0 in the call to FirstPost. This is because generally, final arguments are evaluated and pushed onto the stack before earlier ones.
Although typically, the standard may say this behavior is undefined, in practice, almost all modern C compilers will produce the output I've described here.
File under 'M' for 'Manic ranting'
The result may be nasal demons.
This information is perhaps 2 years out of date, but on one of my projects, when we switched from g++ to Intel C++, our software got about twice as fast with no other changes. It got even faster when we took advantage of SSE3 instructions.
If you reply, do so only to what I explicitly wrote. If I didn't write it, don't assume or infer it.
Actually, it's not undefined behavior. It's unspecified behavior.
Compiling SOAP on Windows or Linux takes about 20min on a well managed VM with respectable grunt. The couple of dozen other binaries that go with our application take about half that time to build in total. SOAP is not even the largest of the component source trees we have, but from the compiler's POV it certainly takes the most effort.
And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
If it were just up to the order of evaluation of the function arguments, then it would be unspecified. However, the program also modifies the same object twice without an intervening sequence point, and that puts it into undefined behavior territory (6.5/2, C99 draft standard).
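If you actually want one particular outcome, the way out is to put each increment in its own full expression so there is a sequence point between them. A minimal sketch; the helper names are made up for illustration, and the function returns the message instead of printing it so the result is easy to check.

```c
/* Returns the message the original FirstPost would print for (a, b). */
const char *first_post_message(int a, int b)
{
    return (a < b) ? "I got first post!" : "No, I got first post!";
}

/* First argument evaluated first. Each initializer is a full expression,
   so the two modifications of i are sequenced. */
const char *call_left_to_right(void)
{
    int i = 0;
    int a = i++;   /* a == 0, i == 1 */
    int b = i++;   /* b == 1, i == 2 */
    return first_post_message(a, b);
}

/* Last argument evaluated first. */
const char *call_right_to_left(void)
{
    int i = 0;
    int b = i++;   /* b == 0, i == 1 */
    int a = i++;   /* a == 1, i == 2 */
    return first_post_message(a, b);
}
```

Both versions are well defined; they just give different answers, which is exactly what the single-expression form leaves open.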
hello world
Clang will just issue a warning that you are making multiple unsequenced modifications. This is undefined in the C spec and the compiler just increments i sequentially, printing "I got first post!" Sequence points like this are hard to clarify for all cases, which is why the C99 spec leaves it undefined. In C11 a detailed memory model has been created which should define most cases. http://en.wikipedia.org/wiki/C11_(C_standard_revision)
Confirmed with:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
Target: x86_64-apple-darwin13.0.0
Thread model: posix
It is speed that is important which is why a lot of HPC people still prefer the intel compilers.
Why, so we can have more first posters?
Based on my analysis of 1,000 random /. stories I have found that the average number of first posts per story is exactly 1. There's no reason to spread this FUD about more first posters.
Four-hour compile times means a 1 day turnaround for any bugfix for production.
A one-hour compile time means four to six bugfix/test cycles per day.
What's the turnaround time if you change, say, one tiny part of a C function having no ramifications to other modules? Do you have a 1-second recompile time (just for that module) followed by 5-10 minutes of link time before you can re-test? Is there no incremental linking? No dynamic libraries? I'm curious what type of program you have. That seems excessively slow to me.
More than likely your main gcc use was for Mac or iOS applications and you changed compiler because you can't even figure out how to change the defaults in XCode. Having tried both with my applications I can perfectly tell that clang is not up to snuff. Sure it compiles quickly and the syntax errors have color highlighting but the quality of the code, in terms of execution speed or size, it produces is vastly inferior.
I thought that was something people used back when MS-DOS was a popular OS. I was not even aware the product still existed.
I am talking about Watcom C++ of course.
Don't use templates. Period.
six release cycles a day is probably why you have bugs in the first place...
Moderate template use is fine. FWIW, Unity/Unified Builds seem to help a lot with compile time on template heavy code.
So, if it fails to be best in one case, it is therefore suboptimal in all other cases? Guess we should un-launch all the satellites since a few of them were damaged on the ground, and tell the Mars rover to power down, since other Mars missions have had problems.
I thought that was something people used back when MS-DOS was a popular OS. I was not even aware the product still existed.
I am talking about Watcom C++ of course.
It was open sourced some time ago. Now it supports Linux (to some extent) and some other CPU architectures.
It can still make DOS/4GW exes, though. Ahh, nostalgia.
You are an idiot.
NO U
No, _I_ am Idiotus!
Pretty sure there is no intention whatsoever of turning that into defined behavior.
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
I am not at all convinced about this "almost all modern C compilers", given how many will do fairly awesome things once they determine that the behavior is undefined.
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
Why in the hell are you testing Clang with either Cilk or OpenMP when neither has moved into the mainline trunk of LLVM/Clang? This test is as worthless as Phoronix's test suite benchmarking apps that require OpenMP, where they note that Clang takes it in the shorts because it presently doesn't have OpenMP implemented. Complete waste of time.
Turn warnings off...
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
he said bugfix/test, not release.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Although you have a good point, his point is still valid.
Just because the U.S. is a republic does not mean it is not a democracy. Democracy/republic are not mutually exclusive.
I am no expert at C; I would have guessed the i++ occurs twice at the location being the semicolon after FirstPost(i++, i++).
Isn't the ++ operator supposed to occur "after" the statement has completed?
I'm skeptical about that. Wasn't Intel's compiler supposed to produce the fastest binaries - on Intel machines, at least?
Who's to say you have to recompile everything? Surely if you're making a small bugfix you just recompile the files which have changed...
Plus you have tools like distcc, ccache etc
http://spamdecoy.net - free throwaway anonymous email - avoid spam!
And how often will nothing else be running on the machine?
These days a large proportion of servers run under hypervisors, and that 80% of idle time will be used by other virtual machines running on the same physical hardware. If you make your code more efficient, then you can consolidate more functions onto the same hardware which could result in significant cost savings.
And even on single standalone machines, modern powersaving functions will mean that far less power is used during the 80% idle periods, and more efficient code could potentially result in 90% idle or more... This translates to lower power usage, and subsequently longer battery life.
http://spamdecoy.net - free throwaway anonymous email - avoid spam!
I could have beaten him with my highly optimized GCC-devel compile of "FirstPost.cxx",
but I didn't quite understand the error message regarding the templates.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Yah, besides missing compiler flags, how does it perform on different intel processors, how about different AMDs?
Plus, the huge system times seem to indicate this is more a kernel test than a compiler one.
Sorry, AC, I will have to let go my positive mod point to you so I can reinforce what you've said. Next time, please consider making an account so you can escape the Score: 0 limbo when you post on Slashdot :(
Since Intel has been caught red-handed crippling AMD processors on code produced by Intel C++ Compiler, I think that testing on Intel and AMD processors should be the duty of every single compiler benchmark -- that is posted on Slashdot, at least.
In the times we live in - and the knowledge Ed S. has given us - do you really still trust a black-box compiler from a huge US corporation with intimate government ties?
No, the ++ operation will take place before the next sequence point (super important concept! If you do not fully grok sequence points, you are not really programming C). The end of a statement is one sequence point, a function call is another sequence point.
Here you have two modifications to i before that, and that is what is invoking undefined behaviour (in the same way i = array[i++]; is also undefined behaviour since i is modified twice before the end of the statement).
When you are sure of something, you probably are wrong (search for "Unskilled and Unaware of It").
Check Annex C: there is no sequence point between evaluation of arguments, only after evaluation of all of the arguments is complete. (Note that the commas separating arguments aren't comma operators.)
In addition, the standard explicitly states that it is not necessary to completely evaluate one argument before moving onto the next: "The order of evaluation of the function designator, the actual arguments, and subexpressions within the actual arguments is unspecified, but there is a sequence point before the actual call" (6.5.2.2/10 of C99 draft). This means, for example, that f(g1() + g2(), g3() + g4()) could be evaluated by calling the g# functions in any order (as each is a subexpression within the actual arguments), and if those functions produced side effects then that would be a counterexample to your claim that there is a sequence point between arguments.
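You can watch the order your own compiler picks with a little instrumentation. A sketch, with hypothetical logging helpers; the value returned is the same for every legal order, only the contents of call_log differ.

```c
/* Hypothetical instrumentation: each g# appends its id to call_log. */
int call_log[4];
int call_count = 0;

int record(int id)
{
    call_log[call_count++] = id;
    return id;
}

int g1(void) { return record(1); }
int g2(void) { return record(2); }
int g3(void) { return record(3); }
int g4(void) { return record(4); }

/* Combines the two arguments so both are visible in the result. */
int f(int a, int b)
{
    return a * 100 + b;
}

/* The g# calls may happen in any order. Each call is itself sequenced,
   so this is unspecified behavior, not undefined behavior. */
int run_once(void)
{
    call_count = 0;
    return f(g1() + g2(), g3() + g4());
}
```

After run_once(), print call_log to see the order chosen. I am deliberately not listing an expected order, because the standard does not give one.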
Actually, you don't strictly need sequence points to determine the order of events, sometimes they can be guaranteed simply by the defined order of operations, even if they result in side effects.
Consider the statement x = a[i++] + b[i++]. This should be equivalent to temp = a[i++], temp = temp + b[i++], x = temp, because the order in which the + operator evaluates its operands is determined by the standard, even without sequence points.
But the initial example, where one passes in an argument to a function that is also being modified more than once technically isn't defined by the standard at all since the standard makes no guarantee about the order of evaluation of function arguments.
Nonetheless, one will still find that in practice, most C compilers will always evaluate the last argument to a function first (including its side effects), as long as the compiler is configured to utilize the C calling convention for construction of stack frames (cdecl, in many compilers). If configured differently, it can, of course, have different behaviour.
File under 'M' for 'Manic ranting'
Actually I don't think my "counterexample" argument holds, because there would actually still be a sequence point at each call. You could change it to:
and the order in which the writes to the xs occur could be any, including, for example, x1, x3, x2, x4.
We are targeting custom / closed-wall systems. Single binary EXE for user systems - no user level DLLs or external binary loading allowed (for security purposes). Incremental linking is not allowed on retail or profile builds (it's a compiler level hack that potentially adds a thunk per function, at the very least adds one per function moved, and also significantly changes the code memory layout from build to build for the functions you are modifying). Additionally, on certain platforms we use link-pass optimizations (i.e. link-time inlining) on retail and profile builds, which require incremental linking to be disabled.
In the best case in the above mentioned builds, we do have about a 15 second compile time followed by the full link time.
But even without these limits, it is very easy to find programs where a single change in a common header causes a full recompile, and non-incremental link times on large projects (especially ones with a lot of redundant template functions per translation unit) can grow significantly.
Besides if it's spending 80% of the time idle, then the program is waiting for the user not the other way around.
Bingo. When the software is waiting for something to do 80% of the time, and nothing else of any importance is running on that machine, optimization is pretty much irrelevant; at best it would save a tiny amount of power by slightly reducing CPU usage.
Yeah, and my work laptop backup software is idle 80% of the time...except for those 4 hours every Friday when the disk utilization pegs to 100% and it starts taking several minutes just to switch to a different folder in Outlook, with no other programs open...and if I need to use SQL Developer or Access or something, it's gonna have to wait for Monday!
Nobody cares how much time your program sits there waiting for someone to push the button. They care about how quickly it reacts once you push that button. Just because it's sitting idle for most of the time doesn't mean your customers wouldn't greatly appreciate some optimization.
Just tried this with TCC (Fabrice Bellard's Tiny C Compiler) and GCC:
tcc: I got first post!
gcc: No, I got first post!
I am anarch of all I survey.
I thought that was something people used back when MS-DOS was a popular OS. I was not even aware the product still existed.
I am talking about Watcom C++ of course.
It was open sourced some time ago. Now it supports Linux (to some extent) and some other CPU architectures. It can still make DOS/4GW exes, though. Ahh, nostalgia.
As someone that has maintained Watcom C/C++ code, the Watcom and OpenWatcom are slightly different and code needs porting from Watcom to OpenWatcom. How much I don't know...I just know that our code needed quite a bit of work to do that. Would have been nice if we did...but no one wanted to.
Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)