Linux Number Crunching: Languages and Tools
ChaoticCoyote writes " You've covered some of my past forays into benchmarking, so I thought Slashdot might be interested in Linux Number Crunching: Benchmarking Compilers and Languages for ia32. I wrote the article while trying to decide between competing technologies. No one benchmark (or set of benchmarks) provides an absolute answer -- but information helps make reasonable decisions. Among the topics covered: C++, Java, Fortran 95, gcc, gcj, Intel compilers, SMP, double-precision math, and hyperthreading."
Coyote Gulch Mirror
Be gentle! I'm sure my server is meltable, too! ;)
IN SOVIET RUSSIA, sig changes you!
The benchmark programs aren't free - this is a non-profit industry association that charges money to cover its costs, but there are a number of universities that are members or associates which may be able to do testing that could explore some of the compiler differences; poke around their website to see who's reported what kinds of results.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
Intel have released a separate version of their C++ compiler for Linux, which they claim has good GNU C/C++ compatibility http://www.intel.com/software/products/compilers/c lin/. They say it can be used to build the Linux kernel with few modifications.
I have been using native MATLAB binaries for Linux for a while now...and unfortunately they are still much slower than the windows versions (both in number crunching and interface).
WARNING: This sig does not contain a joke
Java is marketed as having several strengths, among which:
1 write once, run everywhere
2 short time to implementation, and easy to maintain, low cost
Your proposal is to optimize the app. While this works, it goes against 1: the optimisation works for one platform, but in general will just be overhead another. When you write to run on a specific platform, that is not a problem.
It also goes against 2. Taking more time costs money. Implementing more clever takes more education of the programmer: costs money. More educated programmers have higher salaries.
So, basically what you are saying is that either java is slow to run, or it is just as slow to develop as any other language.
What do you mean by "using an interpreter"? He used Sun's run time, even with the -server switch, which does some for java quite serious optimizations. The real disadvantage java has (in terms of performance, it is an advantage otherwise) isn't the IEEE requirements, but rather the extensive use of runtime binding of classes. In C code, the compiler can inline pretty much anything for you, and get long runs of instructions to schedule as it likes. The other big performance disadvantage compared to C++ is that in C++ you can often have complete control over how arrays are laid out in memory, and test different ways to see which ones work best with your cache hierarchy. In Java you just 'new' it and then you have no control over it. I often hear that Java is about half the speed of C++, but my own tests put it more in the 10% to 20% range. Maybe half is true for poorly optimized C++ code vs poorly optimized Java code, but for well optimized C++ vs well optimized java the difference seems closer to five times. Which is fine to me most of the time, so I generally use Java.
The article claims that he *is* using a Just-In-Time compiler. What makes you think otherwise?
Here are some results from an AMD system and the IBM java runtime:
Athlon XP 1600
-mcpu=athlon-xp
-O3
-mmmx
-m3dnow
IBM java - build 1.4.0, J2RE 1.4.0 IBM build cxia32140-20020917a
58.8 seconds
58.9
Sun java - build 1.4.1_01-b01 Client VM
94.2
94.1
Sun java - build 1.4.1_01-b01 Server VM
82.1
81.6
gcj - version 3.2
82
82.1
gcc - version 3.2 20020903 (Red Hat Linux 8.0 3.2-7)
31.4
31.3
26.6 -ffast-math
26.6 -ffast-math
26.4 -ffast-math -funroll-loops
26.4 -ffast-math -funroll-loops
gcj really is within 10% of g++ on this benchmark, unfortunately he built the gcj program without the all important -ffast-math option (and -funroll-loops). This is a huge penalty for gcj - more than 2x slower without.
I sent him a note and hopefully he's update his page.
It doesn't work. Linux uses gcc extensions. Plus, the number of compiler bugs Linux exposes means that running it under icc would probably involve fixing a bunch of icc bugs.
And you'd probably have to fix about a zillion Linux bugs...
May we never see th
Using reference counting is one big help. And knowing how to write exception safe code is another. Yes, it's a difficult subject, but it is something you can learn.
It may be easier to write leak-free Java than C++, but I suspect that because many Java programmers blithely assume their code can't leak, there may well be more leaky Java programs than C++ ones.
I used to work at a web shop where we used the Enhydra Java application server. Enhydra is pretty well written, but the applications that the company originally developed for it were pretty sad. As a result they had to restart their servlet process every few hours because the JVM ran out of memory!
Request your free CD of my piano music.
When I run my QR decomposition "benchmark", IBM's virtual machine always comes out about 20% faster than Sun's. It would still be a lot slower than C++ or Fortran, but the gap should be smaller. On top of that, IBM's license does not require you to accept a license which says the VM may install any software on your machine and that you automatically accept the license of that new software. See http://hal.trinhall.cam.ac.uk/~nrs27/java_eula.htm l for all the fun.
Almost *ALL* of my email is related to Java. I'll be adding the IBM JDK and older versions of the Sun JDK later today, as per reader request.
I'm making minor updates to the article as the day passes. I appreciate comments from everyone; once I'm through with my e-mail, I'll respond to these Slashdot comments.
Additional benchmarks will be added to the article with time; I'm putting together a single-precision ("float") benchamrk, for example.
All about me
Okay, first of all, for embedded systems this is actually very important. For any embedded system that has to run for years, one of the key issues is that it absolutely must use a heap manager that avoids long-term fragmentation (or have a static heap... but that's another story). Java happens to have a prepackaged solution for this. This, plus small executable size due to the use of byte codes is one of the reasons Java is used for embedded projects.
Actually, the cost of hitting the disk is typically the most important impact on system peformance once you have an app who's working set can't fit into available RAM. That's because access for swapped virtual memory is orders of magnitude slower than virtual memory that in RAM. As you said, most systems today are not CPU limited, but rather memory limited, so switching to another app which also needs to access virtual memory is actually going to make things worse (just try running two programs who's working sets are larger than available RAM and you'll get the idea).
As for memory defragging eating bandwidth like there's no tomorrow, that's only half of the picture. Good GC systems aren't constantly defragging memory at top speed, so the penality is frequently invisible. Smart systems GC when there is available CPU time and memory bandwidth (check out BEA's JVM for example). More importantly, by defragging intelligently you actually increase the chances that sequential memory accesses will hit the various memory cache's on the system (it's really nice when the cache lines are all filled with actual related data instead of unrelated fragments with gaps).
The thing to keep in mind is that the people who write GC systems (or at least the good ones) aren't stupid about how they use memory. They spend a lot of time very carefully analysing memory management issues and put their findings into their system. As a result, most GC systems are better at doing memory management than your average programmer (although there are always special problems where a particular GC will perform particularly poorly).
sigs are a waste of space
...the tools I have at hand. I have nothing against TowerJ, but can't test it if I don't own it. As for Java, I made note of a lack of flags, which is different from complaining. The -O flag is no longer supported by Sun's JDK according to the documentation.
All about me
Back when I wrote reviews for print magazines (back when there were print magazines, that is), it was standard policy to limit reviews to the actually, shipping, commercial product. Demos often lack critical features (like optimizers) or are tuned for benchmark tests, so I've kept to that policy now that I write reviews for my own web site.
I own licenses for the Intel compilers, for example -- and, of course, gcc and Sun Java don't cost anything in the first place. I'm considering my options in this case; long gone are the days when a dozen Fortran or C compilers would arrive at my door for a magazine review. Heck, there aren't a dozen compiler vendors left... ;)
All about me
For the specific case of algorithms that can be expressed strictly in terms of bounded loops where the loop bounds can all be determined at compile time, so that there are no run-time tests needed to determine if some computation must be performed or not, functional programming styles can be near-optimal. Analysis techniques can radically restructure such programs, completely reorganizing the loop nesting.
There have been a variety of stream-oriented or single-assignment languages to make such things possible: Silage and DFL, Lucid, Lustre, Sisal, and others. You can get very good code from such languages, but they aren't very general.
K is directly translatable to English (and there is a program that does it). It does this because each symbol is given a short English name (sometimes two). All K is parsed form right to left, so there is no baroque precedence rules (3*2+1 is 9 not 7).
Taking from the code sample:
This is read as: Break down the right side, what x is, a list from 1 to one less the length of the list p (i.e., if p was 5 elements long, x would be a list form 1 to 4). Now the part in the parenthesis is a list, where each list element is another list, but of numbers from 0 to 2*(x[i]+1) by 2s. So the first element of this list is a list with just 2 in it. The next list element is a list of 2,4, etc... Well, then you multiple all these values by v.This follows in Dr. Iverson's tradition of executable notation (see the J website for more information). What you are writing is a formula to manipulate data. It only took me a couple weeks to get used to the notation. After that my roommate and I were to speak across the room to each other in K.
It really is all just a matter of getting used to things.
I just ported the java version of almabench to c#.
I also compiled the cpp version , with MSVC 7.0, pentium pro optimizations. I obtained identical results from pentium and pentium pro optimizations.
I ran the two programs on my 3 machines and the results are:
1: AMD AthlonXP 1900+, 512 MB, Windows XP Pro:
cpp: 25 seconds, c-sharp: 27 seconds
2: Intel Pentium3 1300, 1024 MB, Windows XP Pro:
cpp: 33 seconds, c-sharp: 36 seconds
3: AMD AthlonXP 1500+, 512 MB, Windows 2000 Workstation:
cpp: 39 seconds, c-sharp: 43 seconds
Looks like c-sharp compares pretty well with c++, my guess for the slowdown is the array boundry checking.
If anyone wants a copy of the cs version, send me an e-mail.
Corman Lisp and MLton -- Other fast functional languages.
GNAT Ada -- Runs pretty close to C/C++ speed. Sometimes faster if the compiler can do expeditious compile-time optimizations.
The original article generated an exceptional collection of interesting and helpful of responses. In this section, I'll run down the important points people made.
Note: This is a duplicate of a new section I added to the review; I'm posting it here for posterity.
An overwhelming number of people suggested that I include results for IBM JDKs -- and I have. In fact, I've added results for Sun JDK 1.3.1_06, IBM JDK 1.4.0, and IBM 1.3.1 RC3. Adding these JVMs made a significant difference in the results, cutting runtimes in half. On the other hand, almabench runs as an infinite loop with Sun's 1.3.1 JVM (it starts, but never finishes). Note that I recompiled almabench with the corresponding javac for each JDK, so the JVMs were executing code generated by their corresponding compiler.
The problem with Java's performance is not my code or my lack of Java skills -- what real problem is that Java 1.4 is slow. Both the Sun and IBM JVMs lost significant performance in the move from 1.3.1 to 1.4, whether due to a new language requirement or other factors. My faith in Java is severely shaken when applications lose significant performance by upgrading to the current release of Java.
Java 1.4 added many new features to the language and packages; however, changing from version 1.3 to 1.4 should not double run-times! Nor am I comforted by the problems of Sun's 1.3.1 server JVM.
Given the nature of the problem, my conclusions about Java stand (albeit slightly softened). By Sun's own definition, JDK 1.3.1 is obsolete; the fact that it performs better than the most current JDK is indicative of a serious problem with Sun's improvements to their language. Since Java 1.4.1 is what Sun is promoting, so that is what I base my conclusion on. I can say that IBM's product is superior, and have already set it is my default JDK. It's no wonder Sun is upset about IBM usurping Java -- IBM is producing better tools.
Some people asked if my Java results were biased by the amount of time it takes to load the JVM. I've tested several empty and near-empty applications; a Hello, world program, for example loads and runs in less than 2/10ths of a second -- hardly significant. The start-up time increases with the number of imports -- but almabench.java imports only one small class package, java.math.*, which (in my tests) does not impose measurable overhead.
I did not include any commercial Java compilers. Most, like the oft-cited Excelsior JET, are Windows-only; this article is about Linux. I don't benchmark Visual C++ or C# for the same reason (although I will look at Mono and C# some time in the future). The free version of Borland C++ does not include a complete optimizer, so I don't think it fits in this review.
How do I know that the programs are producing the correct output? Each program includes code to display results; I run the programs with I/O to ensure that all calculations are being performed, then I comment out any header inclusions, imports, and print statements for actual testing.
How am I timing the results? With the Linux time command. Table 3 reports the real value reported by time (the elapsed time of execution.) Embedding timers in the actual code is fraught with problems; for example, each language implements different time scales and abilities. I'm sure someone will tell me that time is full of problems too, but it works for me and is consistent across all programming languages.
Amid the barrage of Java-related comments, a few people actually noticed the Fortran code. I am looking at other Fortran compilers for future updates. As for GNU g77 -- I wrote the code in Fortran 95 because I find Fortran 77 to be annoying. I wrote piles of Fortran 77 back in my CDC and VAX days, but these days I'm writing for environments where Fortran 95 is more appropriate. Believe it or not, Fortran 95 is a very clean, orderly language that eliminates many Fortran 77 idiosyncrasies while adding features important for high-performance coding.
All about me