Grand Unified Theory of SIMD
Glen Low writes "All of a sudden, there's going to be an Altivec unit in every pot: the Mac Mini, the Cell processor, the Xbox2. Yet programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians. The macstl project tries to unify the architectures in a simple C++ template library. It just reached its 0.2 milestone and claims a 3.6x to 16.2x speed-up over hand-coded scalar loops. And of course it's all OSI-approved RPL goodness."
"...the black art of assembly language magicians."
The nice thing about Altivec is that it has a C interface. You don't have to use assembly!
Take a look at this Apple tutorial to see how easy it is.
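To give a feel for that C interface, here is a minimal sketch of adding two float arrays. The function name add_floats is made up for illustration, and the sketch assumes all three pointers are 16-byte aligned. vec_ld, vec_add and vec_st are the intrinsics from altivec.h; that path is only compiled when the compiler defines __ALTIVEC__ (for example with -maltivec on GCC), and everywhere else the plain scalar loop does all the work.

```cpp
#include <cstddef>
#if defined(__ALTIVEC__)
#include <altivec.h>
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
// Assumption: a, b and out are 16-byte aligned when the AltiVec path is used,
// because vec_ld and vec_st ignore the low four bits of the address.
void add_floats(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__ALTIVEC__)
    // Process four floats per iteration with AltiVec intrinsics.
    for (; i + 4 <= n; i += 4) {
        __vector float va = vec_ld(0, a + i);   // load 4 floats from a
        __vector float vb = vec_ld(0, b + i);   // load 4 floats from b
        vec_st(vec_add(va, vb), 0, out + i);    // add and store 4 results
    }
#endif
    // Scalar loop: handles the tail, or everything on non-AltiVec builds.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

No assembly anywhere: the intrinsics look like ordinary function calls, and the compiler takes care of register allocation and scheduling.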
For those who don't already know, autovectorization is being worked on for GCC by folks from IBM and others.
GCC vectorization project (site seems offline atm), but the abstract from a recent GCC summit is up.
Autovectorization Talk (Google HTML view of PDF)
As two of my professors have stated in class, SIMD, and even more so parallel processing, will require programmers to think in a fundamentally different way in order for multi-core/multi-processor systems to really take off.
This project may be a step in the right direction. Benchmarks show that SIMD extensions such as SSE/SSE2/SSE3 provide only a marginal speed increase. Meanwhile, the massively parallel computations done on graphics cards dwarf anything SIMD claims to produce.
Perhaps we will see GFX manufacturers selling their technology to the CPU makers.
I forget the specifics, but a new GFX card can perform somewhere around 35 GFLOPS, while a 3.4 GHz P4 (executing SIMD code) can only produce around 5-6 GFLOPS at best.
With projects like BrookGPU emerging, the divide between CPU and GFX processor may narrow significantly.
You really don't need macstl unless you have a strong desire to use valarray in C++. For example, the ATLAS project http://math-atlas.sourceforge.net/ already uses Altivec (and SSE/SSE2, etc.) wherever it results in a speedup. So, if your code does linear algebra, use ATLAS and you'll see an automatic speedup in many cases. Other projects such as FFTW http://fftw.org/ include Altivec/SSE/SSE2 optimizations as well. ATLAS includes lots of other optimizations such as cache blocking, loop unrolling, etc. I don't know if macstl includes such optimizations, but I do know that ATLAS performance approaches the theoretical peak performance on G4/G5 for things like matrix-matrix multiplication.
Not only that, but Apple's vecLib http://developer.apple.com/ReleaseNotes/MacOSX/vecLib.html includes ATLAS so you don't even have to download or install anything - it comes with OS X.
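For anyone who hasn't used valarray, here is a rough sketch of the two styles being compared, using only the standard std::valarray and nothing from macstl itself. The function names saxpy_loop and saxpy_valarray are made up for illustration. The point is only that the whole-array expression leaves the library free to pick an implementation, which is the hook a SIMD-backed valarray would presumably use; whether any given library actually vectorizes it is not something this snippet shows.

```cpp
#include <cstddef>
#include <valarray>

// Hand-coded scalar loop: r[i] = a * x[i] + y[i], one element at a time.
// Assumes x and y have the same size.
std::valarray<float> saxpy_loop(float a,
                                const std::valarray<float>& x,
                                const std::valarray<float>& y) {
    std::valarray<float> r(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        r[i] = a * x[i] + y[i];
    }
    return r;
}

// The same computation written as a whole-array valarray expression.
// How the element-wise work is carried out is left to the library.
std::valarray<float> saxpy_valarray(float a,
                                    const std::valarray<float>& x,
                                    const std::valarray<float>& y) {
    std::valarray<float> r(a * x + y);
    return r;
}
```

Both functions return the same values; the second just states the computation without spelling out the loop.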
All is Number -Pythagoras.