Grand Unified Theory of SIMD

Posted by Hemos on Monday February 7, 2005 @04:30AM from the the-string-theory-of-SIMD dept.

Glen Low writes "All of a sudden, there's going to be an Altivec unit in every pot: the Mac Mini, the Cell processor, the Xbox2. Yet programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians. The macstl project tries to unify the architectures in a simple C++ template library. It just reached its 0.2 milestone and claims a 3.6x to 16.2x speed-up over hand-coded scalar loops. And of course it's all OSI-approved RPL goodness."

Black Art? Uh... by arekusu · 2005-02-07 04:47 · Score: 3, Interesting

"...the black art of assembly language magicians."

The nice thing about altivec is that it has a C interface. You don't have to use assembly!
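
As a rough illustration of what that C interface looks like, here is a minimal sketch of an element-wise float add written with intrinsics instead of assembly. The function name `add_floats` and the helper `is_aligned16` are made up for this example. The AltiVec branch uses `vec_ld`, `vec_add` and `vec_st` from `<altivec.h>`, the SSE branch uses the matching `_mm_*` calls from `<xmmintrin.h>`, and a plain scalar loop handles the tail or any other target. It assumes all three pointers are 16-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__ALTIVEC__)
#include <altivec.h>
// GCC's altivec.h may define these as macros; undefine them for C++ code.
#undef vector
#undef pixel
#undef bool
#elif defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical helper: true if p sits on a 16-byte boundary.
static bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical example function: out[i] = a[i] + b[i] for i in [0, n).
// Assumption: a, b and out are all 16-byte aligned.
void add_floats(const float* a, const float* b, float* out, std::size_t n) {
    assert(is_aligned16(a) && is_aligned16(b) && is_aligned16(out));
    std::size_t i = 0;
#if defined(__ALTIVEC__)
    // Four floats per iteration using the AltiVec C interface.
    for (; i + 4 <= n; i += 4) {
        __vector float va = vec_ld(0, a + i);
        __vector float vb = vec_ld(0, b + i);
        vec_st(vec_add(va, vb), 0, out + i);
    }
#elif defined(__SSE__)
    // Four floats per iteration using the SSE intrinsics.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar loop: handles the leftover elements, or everything on other targets.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Declaring the arrays with `alignas(16)` is one way to meet the alignment assumption. The two vector branches differ only in the spelling of the load, add and store calls, which is roughly the gap a library like macstl sets out to hide.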

Take a look at this Apple tutorial to see how easy it is.

Autovectorization being added in GCC 4.0 by shawnce · 2005-02-07 04:50 · Score: 5, Interesting

For those who don't already know, autovectorization is being worked on for GCC by folks from IBM and others.
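
To make the idea concrete, here is a minimal sketch of the kind of loop an autovectorizer is meant to handle: a plain scalar loop with no intrinsics, where each iteration is independent of the others. The function name `saxpy` is just an illustrative choice. GCC exposes its tree vectorizer through `-ftree-vectorize`; whether this particular loop actually gets turned into SIMD instructions depends on the compiler version, the optimization level and the target flags, so treat it as a candidate rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative example: y[i] = alpha * x[i] + y[i] for every element.
// Nothing here is SIMD-specific; the compiler is free to process several
// elements per instruction if it can prove that doing so is safe.
void saxpy(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();
    float* yp = y.data();
    const std::size_t n = y.size();
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = alpha * xp[i] + yp[i];
    }
}
```

The appeal is that the source stays portable: the same loop can be compiled for AltiVec, SSE or a target with no vector unit at all.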

GCC vectorization project (site seems offline atm), but the abstract from a recent GCC summit is up.

Autovectorization Talk (google html view of pdf)

From the limewire... by WilyCoder · 2005-02-07 05:07 · Score: 3, Interesting

As two of my professors have stated in class, SIMD and, even more so, parallel processing will require programmers to think in a fundamentally different way for multi-core/multi-processor systems to really take off.

This project may be a step in the right direction. Benchmarks show that SIMD extensions such as SSE/SSE2/SSE3 provide only a marginal speed increase. Meanwhile, the massively parallel computations done on graphics cards dwarf anything SIMD claims to produce.

Perhaps we will see GFX manufacturers selling their technology to the CPU makers.

I forget the specifics, but a new GFX card can perform somewhere around 35 GFLOPS, while a 3.4GHz P4 (executing SIMD code) can only produce around 5-6 GFLOPS at best.

With projects like Brook GPU emerging, the division of CPU and GFX processor may be narrowed significantly.

Why? Altivec-optimized libraries supplied by Apple by coult · 2005-02-07 05:38 · Score: 3, Interesting

You really don't need macstl unless you have a strong desire to use valarray in C++...for example, the ATLAS project http://math-atlas.sourceforge.net/ already uses Altivec (and SSE/SSE2, etc.) wherever it results in a speedup. So, if your code does linear algebra, use ATLAS and you'll see an automatic speedup in many cases. Other projects such as fftw http://fftw.org/ include Altivec/SSE/SSE2 optimizations as well. ATLAS includes lots of other optimizations such as cache-blocking, loop-unrolling, etc. I don't know if macstl includes such optimizations, but I do know that ATLAS performance approaches the theoretical peak performance on G4/G5 for things like matrix-matrix multiplication.
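
For readers who haven't met valarray, here is a minimal sketch of the programming style in question, using only the standard `std::valarray` and none of macstl's own headers or API. The function name `axpy` is an illustrative choice. The point is that whole-array arithmetic is written as a single expression, which leaves a library implementation free to map it onto SIMD instructions behind the scenes.

```cpp
#include <cassert>
#include <valarray>

// Illustrative example: returns a * x + y computed element-wise.
// Uses only the standard library valarray, not any macstl-specific API.
std::valarray<float> axpy(float a,
                          const std::valarray<float>& x,
                          const std::valarray<float>& y) {
    assert(x.size() == y.size());
    std::valarray<float> result = a * x + y;  // no explicit loop
    return result;
}
```

Calling `axpy(2.0f, x, y)` replaces the hand-written loop over indices; reductions such as `result.sum()` follow the same whole-array style.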

Not only that, but Apple's vecLib http://developer.apple.com/ReleaseNotes/MacOSX/vecLib.html includes ATLAS so you don't even have to download or install anything - it comes with OS X.

--
All is Number -Pythagoras.