Goto Leads to Faster Code
pdoubleya writes "There's an article over at the NY Times (registration required) about Kazushige Goto, the author of the Goto Basic Linear Algebra Subroutines (BLAS, see the wiki); his BLAS implementation is used by 4 of the current 11 fastest computers in the world. Goto is known for painstaking effort in hand-optimizing his routines; in one case, "when computer scientists at the University at Buffalo added Goto BLAS to their Pentium-based supercomputer, the calculating power of the system jumped from 1.5 trillion to 2 trillion mathematical operations per second out of a theoretical limit of 3 trillion." To quote Jack Dongarra, from the University of Tennessee, "I tell them that if they want the fastest they should still turn to Mr. Goto."" Ever get the feeling someone wrote an article merely for the pun?
It's just two syllables... the first, "go", is pronounced like the "go" in gordon, not like the English word "go". The second, "to", is not like the English word "to", it is pronounced like the "to" in "tornado". Try saying "gohr-tohr".
Goto seems like a completely reasonable Japanese name as it fits into the phonetic framework of the language. They have the sounds for "g" followed by a vowel, like o, and the sounds for "t" followed by a vowel like o. Most Japanese syllables are formed by the equivalent of one consonant and one vowel in English, with the exceptions being n, shi, chi, tsu, and the combination of ya, yu or yo with another syllable, like rya, ryu, ryo. etc.
Anybody who criticizes Goto Kazushige's Free Software credentials - he created a Linux/Alpha distribution called Stataboware, which among other things included an early version of his hand-tuned math library back in 1999 (it's now defunct, unfortunately).
Yeah, that is BS! Especially when you have Donald Knuth, Brian Kernighan and Ken Thompson on record in defense of goto when it's warranted. References: Knuth's "Structured Programming with GOTO" (link appears dead), Kernighan, and Ken Thompson: "If you want to go somewhere, goto is the best way to get there."
And then there was the famous "'GOTO Considered Harmful' Considered Harmful" by Frank Rubin, and a a decent section in Steve McConnell's Code Complete that takes an even-handed view.
Anyway, swinging carefully back on topic, it sounds like Mr. Goto's work is at the instruction level, not the C or Fortran or even really at the algorithmic level (except maybe tricks like tiling). I program in the embedded space, where getting as close to 100% efficiency is a continual challenge. I wonder if any of his techniques scale down...
--Joe
Program Intellivision!
is the guy's name actually "Goto"?
Yes
And if so... does he actually use goto statements
The article doesn't speak of goto statements whatsoever.
Yes, functions impose overhead. however:
- if your function is small enough and your compiler smart enough, it can inline the routine, removing overhead preserving readability
- nobody can say where the time-critical code is without profiling. Most of the (fortran) code I handle spends 80% of the time zeroing arrays. There's not so much to optimize in this procedure, and optimizing the remaining 20% by filling the code with gotos is only a waste of time
- if the algorithm is slow, optimize. If it's still slow, change algorithm.
- last but not least. If you optimize the code to save some hour of computational cost but you obtain code which needs an additional month to debug, you are doomed to have a very bad time.
-- "If A equals success, then the formula is A=X+Y+Z. X is work. Y is play. Z is keep your mouth shut." - Einstein
I was involved with the design and the benchmarking effort of #5, #59, #67 and a few others in Top500. The performance of a supercomputer is determined by the number of real FLOPs acheived versus theoretically claimed.
Theoretical FLOPs per processor = Core(s) * Speed_Per_Core (in Ghz) * 2. So for a Dual 3.6 Ghz Xeon, the theoretrical FLOPS is 2 * 3.6 * 2 = 14.4
An easy way to find out actual number of FLOPS a computer can acheive is to ask it to solve a number of Linear Algebra problems and then look at the time it takes to finish solving these problems. The faster the time, better FLOPs obviously.
Now, the reason we chose gotolib was:
1) It works with GCC
2) It is optimized to use the processor cache
3) And therefore fewer cache misses which translates to superior performance
4) And its free (though the source is not exactly open).
Because it uses the processor cache so effectively, it results in a better number than a regular BLAS (which does not use processor cache).
Alternatively, I've also used Intel's MKL which offers comparable performance but then it works best with ICC and its not free. Btw, #59 was benchmarked using gotolib and MKL -- but if I remember correctly, the final result was derived using MKL.
In essence, if you want to use GCC and work with lots of number crunching ie BLAS, gotolib is your best option.
Ever get the feeling someone wrote an article merely for the pun?
a per
There's of course the famous Alpher-Bethe-Gamow paper: http://en.wikipedia.org/wiki/Alpher-Bethe-Gamow_p
Apologies if I implied critcism of his work, that wasn't the intention.
My point is simply that in the field of optimisation, all your gains come from succeeding where the last guy failed. This 30% improvement in performance is not the difference between Goto's approach and his nearest competitor, it is the difference between his work and the previous solution used on that particular machine.
It's the statistic I'm criticising - it simply isn't very meaningful.
(I used to do research into hardware optimisation. Improvements of more than about 5% in any field to which smart people devote a lot of thought always make me suspicious !)
http://en.wikipedia.org/wiki/Nobukazu_Takemura
http://news.com.com/Writing+the+fastest+code%2C+by +hand%2C+for+fun/2100-1022_3-5972844.html?tag=nefd .top
"Not recommended" and "don't ever use it" are two differenet things. If it was never meant to be used, it wouldn't be in the language. The goal of the test mentioned was to write the shortest possible piece of code, not the most maintainable or most elegant.
If you don't know who Stroustrap and K&R are, you have no right commenting on C/C++ issues. I have to side with the original poster, the prof was a dogmatic asshole (and so are you).
Support Right To Repair Legislation.
Thankfully for my sanity, that one is a typo. It's Van de Geijn (the ij is originally a y with an umlaut - Dutch).
Justin.
You're only jealous cos the little penguins are talking to me.
It's just two syllables... the first, "go", is pronounced like the "go" in gordon, not like the English word "go". The second, "to", is not like the English word "to", it is pronounced like the "to" in "tornado". Try saying "gohr-tohr". *
* Note: Pronunciation instruction may only apply if you live in the city of Boston. People living in other localities may need to contact the appropriate authorities for further instruction.
If it's for-profit but free, you're not the customer -- you're the product (e.g., the Slashdot Beta's "audience").
Not entirely true. modern processors take much more clocks to calculate than one clock cycle (say 3 GHz) because they are designed for, so called, pipelining. P4 has 20-clock pipeline (or 30 for prescott), meaning that you don't really need data before, say, 5-10 clocks, so you can afford that kind of latency.
The Japanese word that means foreigner is gaijin not geijin.
SYS 49152
From the article: Some programmers have suggested that Mr. Goto has not joined the open-source movement because he wants to protect his secrets and strategies from competitors. That is not so, he said recently, noting that the Goto BLAS software is freely available for noncommercial use. And he said he was preparing an open-source version.
Except that if you want to get code running at 2/3 of the P4 theoretical limit, you better do something else during the 10 clocks... hence better have prefectched data... knowing that memory latency is 300+ cycles (not 30).
for more technical info, see his site at the Texas Advanced Computing Center. pretty pictures and software tool downloads even.
You do realise that GOTO command translated to a JMP isntruction, which any compiler produces in abundance anyway, right?
This simply isn't true, and it's a shame that it got modded up. Modern compilers do a lot of optimization to your code, and most of those optimizations are disabled as soon as the compiler sees an unconditional jump. Things like placing variables into registers, unrolling loops, etc.
But what do I know, I'm just a "dogmatic asshole"..
I can attest to the efficiency of these routines. When I benchmarked a 22 processor Opteron cluster w/ Myrinet, the use of Goto BLAS resulted in a near 20% drop in CPU utilization but yielded a ~2 GFlop gain in performance using HPL (performance was roughly 60 GFlops total. Given more time, I could have probably coaxed more out of Linpack). This compared to ATLAS, the self-tuning BLAS and LAPACK routines that I painstakingly recompiled at least a few dozen times. Generally, ATLAS yields very decent results even compared to some of the "drop-in" Lin-Alg. routines found with most high-end compilers like PGI (ACML, PGI-optimized BLAS/LAPACK/SCALAPACK) but so far, nothing I have tried rivals the performance, in the case of HPL, of Goto's implementation. Great work, man!