Grand Unified Theory of SIMD
Glen Low writes " All of a sudden, there's going to be an Altivec unit in every pot: the Mac Mini, the Cell processor, the Xbox2. Yet programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians. The macstl project tries to unify the architectures in a simple C++ template library. It just reached its 0.2 milestone and claims a 3.6x to 16.2x speed-up over hand-coded scalar loops. And of course it's all OSI-approved RPL goodness. "
For those who want a little background on Altivec, of course Wiki has a description here. Apple, who now ships Altivec in every system they make, has a pretty good page here, and Motorola nee Freescale has one here.
The benefits of Altivec can be truly astounding for those processes that can be "vectorized". After all, putting these kinds of calculations in hardware has got it all over software computation. It kind of reminds me of when I got one of those Photoshop accelerator hardware cards (Radius Photoengine with 4 DSPs on a daughter card linked to the Thunder series video card) for my IIci. Photoshop filter functions ran faster on that IIci than they did on much later PowerPC systems, simply because you now had four hardware DSPs running your image math.
Visit Jonesblog and say hello.
Apple has had AltiVec optimized libraries for DSP and such since the early releases of OS X.
Doesn't XCode have a feature that lets you "vectorize" certain parts of your code already?
The Mac forum at Ars Technica has a long, continuing post about Altivec optimizations and how they should be used. The thread started more than two years ago and still gets relevant points and questions added to it. It's an amazing resource if you're interested in starting.
The Reciprocal Public License requires you to release all of your source code if you link to this library, even if your project is personal or used in-house only.
How am I supposed to fit a pithy, relevant quote into 120 characters?
Typo...
Propellerheads.SE
The principle behind SIMD (Single Instruction, Multiple Data) is that you can process wide arrays of values in a single instruction. With the PowerPC version of SIMD, also known as AltiVec, you can issue an instruction and have it work on a 128-bit wide register. These registers may contain up to 4 32-bit numbers, 8 16-bit numbers or 16 8-bit numbers. For example, I can load two AltiVec registers with 16 unsigned chars each, add them together using vec_add() and have it return its results in an AltiVec register. So this in essence is adding 16 values at once, and in theory it's good enough for marketing to claim a 16X speedup, but this is rarely the case.
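To make the "16 values at once" idea concrete, here is a minimal sketch. It assumes a GCC or Clang compiler and uses their `vector_size` extension instead of AltiVec's own vec_add(), so the same source builds on PowerPC and x86 alike; the function name `add16` is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// 16 unsigned 8-bit lanes packed into one 128-bit value.
// Assumption: GCC/Clang vector extension; MSVC does not support this syntax.
typedef std::uint8_t u8x16 __attribute__((vector_size(16)));

// Hypothetical helper: adds 16 byte pairs in one vector operation.
// Each lane wraps modulo 256 on overflow.
void add16(const std::uint8_t* a, const std::uint8_t* b, std::uint8_t* out) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    u8x16 va, vb;
    std::memcpy(&va, a, sizeof va);      // load without alignment requirements
    std::memcpy(&vb, b, sizeof vb);
    u8x16 sum = va + vb;                 // one add covers all 16 lanes
    std::memcpy(out, &sum, sizeof sum);  // store the 16 results
}
```

The loads and stores around the add are part of why the real-world gain usually falls short of 16X: the arithmetic is one instruction, but the data still has to get in and out of the vector register.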
The RPL (Reciprocal Public License) is an odd choice for this project. It is an even stronger viral copy-left than the GPL, to the point where the FSF takes issue with it. If you create a derivative work you are required to 1) notify the original author, and 2) publish your changes even if you only use the program in house. Furthermore, their definition of derivative work is much, much broader than the "linking" definition that the GPL uses.
The fact that it puts these additional requirements / restrictions on the user makes it incompatible with the GPL. In fact, considering the requirements placed on you by the license, I would expect that you will have difficulty incorporating this RPL library into any existing FLOSS project without running into license conflicts. The only thing I can see this being useful for is a new project that you don't mind releasing under the RPL, or with existing BSD style licensed code which you dual license as BSD/RPL (since BSD can be included in anything).
So this library does not appear to be very usable for the FLOSS world, although if you want to license it for proprietary software you may.
A bit OT, but nevertheless quite interesting to read and it contains information about SIMD instruction sets other than just MMX/SSE: http://www.fefe.de/ccccamp2003-simd.pdf
A monkey is doing the real work for me.
Vectorization (SIMD) is built into the Intel compiler. There is no need to hack in assembly as the compiler will do it for you. This is the case with most vendor supplied compilers, as they want to fully exploit their hardware functionality.
The problem is bringing this functionality to open-source compilers, which, as far as I know, do not even have an OpenMP (threading) implementation, let alone internal vectorization.
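For illustration, this is the kind of loop an auto-vectorizing compiler is typically able to handle without any intrinsics or assembly. The function name `saxpy` and the choice of operation are assumptions made for the example, and whether a given compiler actually vectorizes it depends on its version and flags (with GCC, for instance, `-O3` or `-ftree-vectorize`).

```cpp
#include <cassert>
#include <cstddef>

// Element-wise multiply-add: y[i] = a * x[i] + y[i].
// The loop body has no dependency between iterations, which is the
// property a vectorizer looks for before processing several elements
// per instruction.
void saxpy(float a, const float* x, float* y, std::size_t n) {
    assert(n == 0 || (x != nullptr && y != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The source stays plain scalar C++; any SIMD instructions come from the compiler, so the same code still runs correctly on a target with no vector unit at all.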
UBU
SIMD support already exists, in the form of C, C++, and Fortran libraries (usually, as a small part of larger numerical libraries), as well as in language constructs in languages like Fortran.
A better resource for Altivec and SIMD in general is the SIMDtech.org website and Altivec mailing list. There are tutorials and technical manuals available and the email list is indispensable. While the mailing list is mostly geared towards Altivec optimizations and discussions all SIMD discussion is welcome, including MMX/SSE. There are Apple engineers that read and contribute to the list as well as Motorola/Freescale engineers. It's probably the single best resource available to Altivec programmers and you get to talk directly to the Wizards that created it.
I'm a relative newcomer to the list and it's been an invaluable resource as I've optimized with Altivec.
--
Join the Pyramid - Free Mini Mac
infested with jello like fishes no melotron wishes
Or born, like the French word it is: née.
No need for anyone to whip out the online dictionary and tell me "formerly known as" is an acceptable alternative.
Actually, Apple's Tiger will get an auto-vectorizing compiler courtesy of the public GCC 4.0 release. The auto-vectorizer wasn't developed in Apple's version of GCC. IBM's GCC team at the Haifa Research Lab developed the vectorizer in the public LNO (loop nest optimization) branch of GCC 4.0. I'm not trying to minimize Apple's contribution here; one of their developers did work on the team. But let's give credit where credit is due.
A deep unwavering belief is a sure sign you're missing something...
Yes it does.
SIMD programming becomes as easy as this:
He claims that the above code is 17.4x faster than Codewarrior MSL C++, 11.6x faster than gcc libstdc++ and 9.5x faster than Visual C++.
Macstl also provides a cross-platform syntax for using vector registers that is similar to using the native C intrinsics on each platform. So while not all of the native operations are available, his cross-platform "vec" API allows you to write cross-platform code without having to learn both the Altivec and MMX/SSE intrinsics (which is a good solution for someone who knows one platform but not the other).
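The idea of one syntax over several vector units can be sketched roughly as follows. This is not macstl's actual "vec" API: the type `float4` and the function `add_arrays` are hypothetical names, only an SSE backend and a plain scalar fallback are shown, and an AltiVec backend would slot in as a further `#elif` branch.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical wrapper: four floats with the same interface on every target.
struct float4 {
#if defined(__SSE__)
    __m128 v;  // one 128-bit SSE register
    static float4 load(const float* p) { return float4{_mm_loadu_ps(p)}; }
    void store(float* p) const { _mm_storeu_ps(p, v); }
    friend float4 operator+(float4 a, float4 b) {
        return float4{_mm_add_ps(a.v, b.v)};
    }
#else
    float v[4];  // scalar fallback for targets without SSE
    static float4 load(const float* p) {
        float4 r;
        for (int i = 0; i < 4; ++i) r.v[i] = p[i];
        return r;
    }
    void store(float* p) const {
        for (int i = 0; i < 4; ++i) p[i] = v[i];
    }
    friend float4 operator+(float4 a, float4 b) {
        float4 r;
        for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
        return r;
    }
#endif
};

// Caller code is identical whichever backend was compiled in.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(n % 4 == 0);  // sketch only: no handling of leftover elements
    for (std::size_t i = 0; i < n; i += 4) {
        (float4::load(a + i) + float4::load(b + i)).store(out + i);
    }
}
```

Code written against `float4` never mentions an intrinsic, which is the appeal for someone who knows one platform but not the other.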
A good example is what happens when you let the compiler decide how to do arithmetic with vectors and matrices.
Matrix a,b,c,x;
x = a + b + c;
The naked compiler, in combination with your custom Matrix class, will probably unwind the operator overloads to do something like this: build a temporary Matrix for a + b in one loop, build a second temporary for that result plus c in another loop, and then copy the second temporary into x. All those temporary copies and inlined loops really kill performance.
Now, with an expression library, it handles each arithmetic expression discretely by type. By treating the expressions, as well as the types involved, you can do more sophisticated things. In this case, the Expression Template Library solves the problem thusly: a single loop that computes each element of x directly from the matching elements of a, b and c, with no temporaries. Here the library has carnal knowledge of the data structures involved as well as order of operations to come to such a succinct solution.
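A minimal sketch of the expression-template technique, to show where the temporaries go. All names here (`Matrix` as a flat array of doubles, `Sum`) are hypothetical and chosen for illustration; this is not code from macstl or Blitz++.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Node that represents "l + r" without computing anything yet.
template <typename L, typename R>
struct Sum {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

// Hypothetical Matrix, stored as a flat array for simplicity.
struct Matrix {
    std::vector<double> data;
    explicit Matrix(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning from an expression runs ONE loop and builds no temporary Matrix.
    template <typename E>
    Matrix& operator=(const E& e) {
        assert(e.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

// a + b yields a lightweight node, not a new Matrix.
inline Sum<Matrix, Matrix> operator+(const Matrix& a, const Matrix& b) {
    return {a, b};
}

// (a + b) + c nests the earlier node inside a new one.
template <typename L, typename R>
Sum<Sum<L, R>, Matrix> operator+(const Sum<L, R>& a, const Matrix& b) {
    return {a, b};
}
```

With these definitions, `x = a + b + c;` builds a small tree of references at compile time and evaluates it in the single loop inside `operator=`. The nodes hold references, so the expression has to be consumed in the same statement rather than stored for later.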
In the case of MACSTL, it's still using these principles of "vectorizing" the expressions as well as unrolling and other traditional optimization techniques. It's also going the extra mile and using processor-specific code and/or C code that targets *extremely* well to PPC. For example, the above example would optimize well using Altivec, due to the platform's built-in vector type; you wouldn't even need a loop for adding several 'vec' instances.
I wish I knew enough about MACSTL and Altivec to give a hard example of a 16X speedup. I hope this gets you closer to seeing at least *where* the reducible overhead is coming from.
Check out Blitz++'s papers listing for more info:
http://www.oonumerics.org/blitz/papers/