Introducing the PowerPC SIMD unit
An anonymous reader writes "AltiVec? Velocity Engine? VMX? If you've only been casually following PowerPC development, you might be confused by the various guises of this vector processing SIMD technology. This article covers the basics on what AltiVec is, what it does -- and how it stacks up against its competition."
This highlights one of the real advantages that AltiVec has over the various SIMD instruction sets available for x86 processors: its comparative stability. Every AltiVec processor since the original G4 has had the same essential functionality, the same large register pool that isn't shared with anything, and a reasonably complete set of likely operations. This has made it easier for support to become widespread: a program designed to take advantage of the original G4 will still get a noticeable performance improvement on today's G5. x86 SIMD was frankly botched - MMX was a very odd idea, and, though SSE & SSE2 have partially fixed the problem, the fact that SSE optimised code usually runs slower on an Athlon than 'unoptimised' code has severely limited its applications.
Its vector processing...
I'd like to know if Mac OS X uses the AltiVec instructions to their full potential. For example, the article mentions that a heavily loaded server can benefit greatly from AltiVec if the TCP checksum algorithm uses it. Does the OS X TCP stack do this?
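For reference, the checksum in question is the 16-bit one's-complement sum described in RFC 1071. Below is a minimal scalar sketch in portable C, not taken from any real TCP stack; the function name `inet_checksum` is made up for illustration. Because the additions can be done in any order, this is the kind of loop that a SIMD unit could process several 16-bit words at a time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scalar Internet checksum over a byte buffer.
 * Words are read as big-endian 16-bit values; an odd trailing byte is
 * padded with a zero low byte, as RFC 1071 describes. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;  /* wide accumulator, so carries are not lost */

    while (len > 1) {
        sum += (uint64_t)((data[0] << 8) | data[1]);
        data += 2;
        len -= 2;
    }
    if (len == 1) {
        sum += (uint64_t)data[0] << 8;
    }

    /* Fold the carries back into the low 16 bits (end-around carry). */
    while (sum >> 16) {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    return (uint16_t)~sum;
}
```

With the sample bytes from RFC 1071 (00 01 f2 03 f4 f5 f6 f7) the folded sum is 0xddf2, so this function returns 0x220d.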
I've done some AltiVec programming in the past, and discovered it was a very effective use of my time. Since there's no mode-switching penalty for using the vector instructions, you can use them for some very trivial-but-common tasks, like replacing strlen(), vector operations on small tables, etc. I knocked a lot of computation time (25%) from one of my projects just by vectorizing three functions. Of course there's a hitch: vector processing only works for certain kinds of algorithms and requires a change in mindset. In spite of that, it's a great tool to have in your box.
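To give a feel for the strlen() idea without needing a PowerPC, here is a rough sketch that swaps in a different technique: "SIMD within a register", testing eight bytes per step with ordinary 64-bit integer arithmetic instead of AltiVec instructions. The name `bounded_strlen_swar` and the explicit length bound are assumptions made for this example; the bound keeps every read inside the caller's buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: length of s, looking at no more than maxlen bytes.
 * Eight bytes are tested per iteration; the final position is then found
 * with a plain byte loop. */
size_t bounded_strlen_swar(const char *s, size_t maxlen)
{
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    size_t i = 0;

    while (i + 8 <= maxlen) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);      /* safe unaligned load */
        if ((w - ones) & ~w & highs) {    /* nonzero iff some byte of w is 0 */
            break;
        }
        i += 8;
    }
    /* Finish byte by byte: locates the terminator inside the flagged
     * chunk, or scans the tail that is shorter than eight bytes. */
    while (i < maxlen && s[i] != '\0') {
        i++;
    }
    return i;
}
```

A real AltiVec version would compare 16 bytes at once against zero; the structure (wide test first, exact position afterwards) would be much the same.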
is here. They talk about AltiVec on Page 3. IIRC, it's the best-designed mass-market SIMD implementation out there.
If anyone is interested, simdtech.org is probably the best resource you can find for AltiVec (or any other SIMD) programming. They have a number of tutorials and technical resources and the mailing list is the best there is. Motorola, Apple, and IBM engineers frequent the list so you can get help and information directly from the guys that created AltiVec as well as from those who program for it.
Anyway, what we need is not an autovectorizing compiler, but a library with the most CPU-hungry algorithms well implemented with SIMD extensions.
What about an open, cross-platform, multimedia-oriented library, along the lines of Sun's mediaLib? Would Sun allow the re-use of their API?
I'm looking for such a library, with a GPL/LGPL-compatible license. The API has to be in C, to maximise the audience. For many projects, C++ is not an option.
The primary use will be DSP work in the GNU Radio project, but multimedia extensions could prove useful anywhere from GUIs to audio/video apps, etc.
I would take any pointers to an existing API/project like this, or be ready to start a new one if other people are interested.
See also this previous story for cheap recycled comments.
Choosing something like AltiVec involves a bunch of trade-offs:
-- How much work do I need to do in order to take advantage of it? Some BLAS implementations may support it and some Fortran 95 compilers may generate code for it for some primitives, but other than that, it's a lot of manual work to tune code for it. (My own experience with using the AltiVec instructions can only be described as "painful", among other things because the C interface to them is poorly defined and causes name conflicts.)
-- What range of hardware can I choose from? Well, there is mainly one Apple rack-mount that runs OS X, a bunch of big Apple desktops in fancy cases, and a bunch of expensive IBM workstations. That's pretty limited.
-- What's the bang for the buck? There are actually two parts to this: what's the bang for the buck for code not specifically hacked to take advantage of AltiVec, and what's the bang for the buck for code specifically hacked for AltiVec. For code not specifically tuned for AltiVec, the bang for the buck is not so great with either Apple or IBM. For the rest, it may be reasonable.
Considering these issues, I continue to find AltiVec pretty unpersuasive. I think AltiVec won't take off until Intel and AMD's SIMD instructions are equally good; until then, there is simply not enough incentive for software writers to incorporate support into their software for it consistently. And then, frankly, we first need a market in commodity Linux PowerPC boxes until that really gets interesting. I wouldn't hold my breath.
I don't know of anyone who makes an open-standards-based system using the PowerPC architecture. IBM did release a reference design for a PPC-based motherboard, but as far as I know no one ever produced it.
Unless and until I can go down to Fry's and buy a motherboard based off of this chip and put it into a standard case, it really doesn't matter if the CPU is better or not. It is the system as a whole that matters, not the relative performance of one of its components. I'm not going to paint myself into a corner with a proprietary system from anyone, let alone Apple.
Lee
So, the G4 and the G5.
What you're really talking about here is that with a greater variety of chip models, it's harder to do all-out optimization.
Never mind that there are fewer G4s and G5s deployed combined than any one class of Intel or AMD chips which requires different SIMD optimizations.
When your market is 20 times bigger, you can afford to optimize for Athlon XP, Athlon64, Opteron, and Pentium 4, and you're still putting in only a fifth of the development effort relative to installed base. If you like, you can still go and optimize for all of the older and niche models without being in a worse situation than PowerPC developers.
If the Athlon line weren't x86 compatible, I suppose they'd be praising the consistency of 3DNow!. Lack of x86 compatibility is not an advantage! x86 is the de facto standard. Following standards always costs a certain amount of design freedom and elegance, but the advantages are considerable.
You really can't take someone too seriously who preaches the advantages of uniformity within a completely non-standard-compliant line of chips, while denigrating the popular standard for the relatively minor incompatibilities between its various implementations. As if small incompatibilities are bad and big ones are good!
There's a book, "Vector Game Math Processors" by James Leiterman (ISBN 1-55622-921-6), that discusses programming PowerPC AltiVec, MIPS, and 80x86 SIMD instructions. I found it pretty useful when doing vector programming with AltiVec! Some instructions that other processors have and AltiVec doesn't are simulated with what he calls PseudoVec!
On the D programming newsgroup we have been talking
about implementing a vectorization syntax, so
we can have portable vector code which
approaches the speed of hand-coded vectorization.
Here is something from the list.
What is a vectorized expression? Basically, loops that do not specify any
order of execution. If there is no order specified, of course the compiler
can choose any one that is efficient or maybe even distribute the code and
execute it in parallel.
Here are some examples.
Adding a scalar to a vector.
[i in 0..l](a[i]+=0.5)
Finding size of a vector.
size=sqrt(sum([i in 0..l](a[i]*a[i])));
Finding the dot product.
dot=sum([i in 0..l](a[i]*b[i]));
Matrix vector multiplication.
[i in 0..l](r[i]=sum([j in 0..m](a[i,j]*v[j])));
Calculating the trace of a matrix
res=sum([i in 0..l](a[i,i]));
Taylor expansion on every element in a vector
[i in 0..l](r[i]=sum([j in 0..m](a[j]*pow(v[i],j))));
Calculating Fourier series.
f=sum([j in 0..m](a[j]*cos(j*pi*x/2)+b[j]*sin(j*pi*x/2)))+c;
Calculating (A+I)*v using the Kronecker delta-tensor : delta(i,j)={i=j ? 1 : 0}
[i in 0..l](r[i]=sum([j in 0..m]((a[i,j]+delta(i,j))*v[j])));
Calculating cross product of two 3d vectors using the
antisymmetric tensor/Permutation Tensor/Levi-Civita tensor
[i in 0..3](r[i]=sum([j in 0..3,k in 0..3](anti(i,j,k)*a[j]*b[k])));
Calculating determinant of a 4x4 matrix using the antisymmetric tensor
det=sum([i in 0..4,j in 0..4,k in 0..4,l in 0..4]
(anti(i,j,k,l)*a[0,i]*a[1,j]*a[2,k]*a[3,l]));
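For readers unfamiliar with the notation, here is a rough C rendering of two of the expressions above: the dot product and the cross product via the antisymmetric tensor. It assumes that `0..l` means the half-open range 0 to l-1, which the proposal does not state explicitly; the function names are invented for this sketch. The loops fix one particular order, whereas the point of the proposed syntax is that the compiler would be free to pick any order or to run the iterations in parallel.

```c
#include <stddef.h>

/* dot = sum([i in 0..l](a[i]*b[i])) */
double dot(const double *a, const double *b, size_t l)
{
    double s = 0.0;
    for (size_t i = 0; i < l; i++) {
        s += a[i] * b[i];
    }
    return s;
}

/* Levi-Civita symbol for indices in {0,1,2}: +1 for even permutations,
 * -1 for odd permutations, 0 when any index repeats. */
static int anti3(int i, int j, int k)
{
    return (i - j) * (j - k) * (k - i) / 2;
}

/* [i in 0..3](r[i]=sum([j in 0..3,k in 0..3](anti(i,j,k)*a[j]*b[k]))) */
void cross(const double a[3], const double b[3], double r[3])
{
    for (int i = 0; i < 3; i++) {
        r[i] = 0.0;
        for (int j = 0; j < 3; j++) {
            for (int k = 0; k < 3; k++) {
                r[i] += anti3(i, j, k) * a[j] * b[k];
            }
        }
    }
}
```

Most of the 27 terms in the cross product are multiplied by zero, so a compiler that understood the tensor could reduce it to the usual six products.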
We get on average one of these per month posted here to slashdot as news.
Nothing to see here. Move along please.
The problems you're talking about are not AltiVec's fault, and the AltiVec instruction set is still stable. Code will still run very quickly even if you don't optimize for the G5. But let me bring in a quote from one of those linked papers:
See, the problem you're complaining about is a problem with any port to the G5, or really any port from a slow-thin-memory-access system to a fast-wide-memory-access system. It has nothing to do with your AltiVec code. It just has to do with tuning for a larger L2 cache and a faster FSB rather than a slow FSB and a huge L3 cache. So let's not blame AltiVec for this. Except for a brief change in policy in the 745X G4, it seems like the AltiVec invocation has been stable for quite a while.
In my work, we use PPC extensively because the bang per Watt is so very high compared to x86 and relatives. We make good use of very domain-specific operations that have been hand-tuned for Altivec and are very happy with the results. Even without using AV, we see similar performance for our code running on 500MHz PPC and 2.5GHz P4.
AV instruction set kicks ass over SSE. And having 32 registers in each of the integer, FP, and SIMD register files also kicks ass. We can perform a lot of register-based operations on a lot of operands without dropping operands out of registers.
My complaint about vector libraries is that they tend to be basic building blocks. The overhead of having to chain together multiple basic functions can add up; we write specialized versions that do more per operand. For example, instead of vector-multiply-add, we might have vector-multiply-multiply-add-sqrt-atan (the magnitude of (I,Q) is sqrt(I*I + Q*Q), the phase is atan2(Q, I)). More operations per read/write of memory == better performance.
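A small sketch of that trade-off in plain C, under some loud assumptions: the function names are invented, the code is scalar rather than AltiVec, and it stops at the squared magnitude I*I + Q*Q, leaving out the sqrt and atan2 steps to stay short. The chained version makes three passes over memory and needs two scratch arrays; the fused version reads each operand once and writes each result once.

```c
#include <stddef.h>

/* Generic building blocks, as a vector library might provide them. */
static void vmul(const float *x, const float *y, float *out, size_t n)
{
    for (size_t k = 0; k < n; k++) out[k] = x[k] * y[k];
}

static void vadd(const float *x, const float *y, float *out, size_t n)
{
    for (size_t k = 0; k < n; k++) out[k] = x[k] + y[k];
}

/* Chained: three passes, two temporaries written and read back. */
void power_chained(const float *i_in, const float *q_in,
                   float *tmp1, float *tmp2, float *out, size_t n)
{
    vmul(i_in, i_in, tmp1, n);
    vmul(q_in, q_in, tmp2, n);
    vadd(tmp1, tmp2, out, n);
}

/* Fused: one pass, intermediates stay in registers. */
void power_fused(const float *i_in, const float *q_in, float *out, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        float i = i_in[k];
        float q = q_in[k];
        out[k] = i * i + q * q;
    }
}
```

Both produce the same values; the difference is only in memory traffic, which is the point being made above.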
I wrote AltiVec code in 1999, and I even have a faded AltiVec t-shirt from the same year. AltiVec is just not new.
"Re-Introducing" would be a better title.