Introducing the PowerPC SIMD unit

Posted by Hemos on Monday March 7, 2005 @04:46AM from the one-name-to-rule-them dept.

An anonymous reader writes "AltiVec? Velocity Engine? VMX? If you've only been casually following PowerPC development, you might be confused by the various guises of this vector processing SIMD technology. This article covers the basics on what AltiVec is, what it does -- and how it stacks up against its competition."

The article makes a very good point... by tabkey12 · 2005-03-07 04:50 · Score: 4, Interesting

This highlights one of the real advantages that AltiVec has over the various SIMD instruction sets available for x86 processors: its comparative stability. Every AltiVec processor since the original G4 has had the same essential functionality, the same large register pool that isn't shared with anything, and a reasonably complete set of likely operations. This has made it easier for support to become widespread: a program designed to take advantage of the original G4 will still get a noticeable performance improvement on today's G5. x86 SIMD was frankly botched - MMX was a very odd idea, and, though SSE & SSE2 have partially fixed the problem, the fact that SSE optimised code usually runs slower on an Athlon than 'unoptimised' code has severely limited its applications.

1. Re:The article makes a very good point... by Anonymous Coward · 2005-03-07 04:58 · Score: 1, Interesting
  
  the fact that SSE optimised code usually runs slower on an Athlon than 'unoptimised' code has severely limited its applications.
  
  Oh? What's your source for that? Is that a problem with the processor or with the optimiser?
2. Re:The article makes a very good point... by Anonymous Coward · 2005-03-07 05:45 · Score: 3, Interesting
  
  Have you done any PPC programming? There are some big gotchas when optimizing for the G5, and I usually have to program things twice, as the typical G4 AltiVec optimizations often run considerably slower on the G5, particularly the ones where we use the stream load/save hint instructions to help the processor schedule its memory accesses. The Apple site has a page dedicated to the differences between the G4 and G5 when it comes to optimizing. By contrast, I never found such gotchas when programming SSE on the PIII and moving to other processors. And I do a LOT of SIMD programming, in PPC or x86 asm, for pro audio products.
3. Re:The article makes a very good point... by Bert64 · 2005-03-07 21:29 · Score: 2, Interesting
  
  Which also means that well-optimized floating-point code (which is more likely, since compilers have had many years to improve their floating-point code generation) would run faster on an Athlon than poorly written SSE code, whereas the P4 would run the SSE code faster simply because it runs FPU code so poorly.
  I'm sure well-written SSE code would run faster on both platforms, at least in cases where vectorization makes sense. Intel implemented the weak FPU in the P4 to try to steer users onto SSE code even when it's not really appropriate.
  
Re:Altivec and OS X by crow · 2005-03-07 05:14 · Score: 3, Interesting

But if you only need one or two of the registers for TCP, then perhaps it would be a win if you are only doing a save/restore on one or two registers (and presumably a status register). And if you only need it in the TCP stack, then only do the save when the kernel calls those functions (which admittedly gets complicated if the kernel is preemptable).
Re:Does it matter? by Anonymous Coward · 2005-03-07 05:38 · Score: 1, Interesting

You can just buy a $499 mini and get done with it.
Re:Altivec and OS X by Matthias+Wiesmann · 2005-03-07 05:45 · Score: 2, Interesting

Actually, I read about an optimisation on some Darwin mailing list where the AltiVec registers are not immediately saved on a context switch, but configured in such a way that a write to them causes an interrupt, and the service routine does the register saving then. The idea was that you are often preempted by a task that does not use the AltiVec registers, so saving and restoring those registers would be a waste.
I can't remember if this was implemented in the end...
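
The lazy-save idea described above can be sketched as a small C simulation. This is only an illustration of the general technique, not Darwin's actual code: the names `vector_unit_t`, `context_switch`, `vector_unavailable_trap`, `vector_write` and `vector_read` are all made up for the example, and the "trap" is an ordinary function call standing in for the interrupt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_VREGS 32

typedef struct { unsigned char bytes[16]; } vreg_t;   /* one 128-bit register */

typedef struct task {
    vreg_t saved[NUM_VREGS];   /* this task's vector state while it is not the owner */
} task_t;

typedef struct {
    vreg_t   regs[NUM_VREGS];  /* simulated hardware register file */
    task_t  *owner;            /* task whose values currently live in regs */
    int      enabled;          /* 0 means the next vector access must trap */
    unsigned saves;            /* number of full register-file saves performed */
} vector_unit_t;

/* Context switch: nothing is saved here. The unit stays enabled only if
 * the incoming task already owns the live register contents. */
void context_switch(vector_unit_t *vu, task_t *next) {
    vu->enabled = (vu->owner == next);
}

/* Hypothetical trap handler: runs on the first vector access after a
 * switch, and only then pays for the save and restore. */
static void vector_unavailable_trap(vector_unit_t *vu, task_t *current) {
    if (vu->owner != current) {
        if (vu->owner != NULL) {
            memcpy(vu->owner->saved, vu->regs, sizeof vu->regs);
            vu->saves++;
        }
        memcpy(vu->regs, current->saved, sizeof vu->regs);
        vu->owner = current;
    }
    vu->enabled = 1;
}

/* Fill register r with a byte value on behalf of `current`. */
void vector_write(vector_unit_t *vu, task_t *current, int r, unsigned char value) {
    assert(r >= 0 && r < NUM_VREGS);
    if (!vu->enabled) vector_unavailable_trap(vu, current);
    memset(vu->regs[r].bytes, value, sizeof vu->regs[r].bytes);
}

/* Read the first byte of register r on behalf of `current`. */
unsigned char vector_read(vector_unit_t *vu, task_t *current, int r) {
    assert(r >= 0 && r < NUM_VREGS);
    if (!vu->enabled) vector_unavailable_trap(vu, current);
    return vu->regs[r].bytes[0];
}
```

In this sketch, a task that is preempted by something that never touches the vector unit gets its registers back untouched, and `saves` stays at zero; the copy only happens once a different task actually issues a vector access.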
Re:Altivec and OS X by SewersOfRivendell · 2005-03-07 06:12 · Score: 2, Interesting

Remember that we're talking about context switch overhead, not function call overhead. It doesn't matter if you use three or thirty vector registers, the kernel still has to take the performance hit to figure out which registers are in use. The compiler will generate instructions to set the appropriate bits in the VRSAVE register, but the OS still needs to compare each bit in VRSAVE so that it knows which registers to save/restore.
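
To make the VRSAVE point concrete, here is a rough C sketch of what a mask-driven save looks like. It is a simulation, not kernel code: `vrsave_mark` and `save_live_vregs` are hypothetical helpers, and the bit layout (most significant bit for v0, following PowerPC bit numbering) is an assumption of the example.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_VREGS 32

typedef struct { uint8_t bytes[16]; } vreg_t;   /* one 128-bit register */

/* Mask bit for register `reg`; assumes v0 maps to the most significant bit. */
static uint32_t vrsave_bit(int reg) {
    assert(reg >= 0 && reg < NUM_VREGS);
    return UINT32_C(0x80000000) >> reg;
}

/* What compiler-generated code would do: flag a register as in use. */
uint32_t vrsave_mark(uint32_t vrsave, int reg) {
    return vrsave | vrsave_bit(reg);
}

/* What the OS side would do: copy only the flagged registers.
 * Note that the loop tests all 32 bits no matter how few are set.
 * Returns the number of registers actually copied. */
int save_live_vregs(uint32_t vrsave,
                    const vreg_t regs[NUM_VREGS],
                    vreg_t save_area[NUM_VREGS]) {
    int saved = 0;
    for (int reg = 0; reg < NUM_VREGS; reg++) {
        if (vrsave & vrsave_bit(reg)) {
            save_area[reg] = regs[reg];
            saved++;
        }
    }
    return saved;
}
```

The copy cost scales with the number of live registers, but the 32 bit tests are paid on every call, which is the overhead the comment above is describing.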
portable vectorization by AeiwiMaster · 2005-03-07 06:17 · Score: 4, Interesting

On the D programming newsgroup we have been talking
about implementing a vectorization syntax, so
we can have portable vector code which
approaches the speed of hand-coded vectorization.

Here is something from the list.

What is a vectorized expression? Basically, a loop that does not specify any
order of execution. If there is no order specified, of course the compiler
can choose any one that is efficient or maybe even distribute the code and
execute it in parallel.

Here are some examples.

Adding a scalar to a vector.
[i in 0..l](a[i]+=0.5)

Finding size of a vector.
size=sqrt(sum([i in 0..l](a[i]*a[i])));

Finding dot-product.
dot=sum([i in 0..l](a[i]*b[i]));

Matrix vector multiplication.
[i in 0..l](r[i]=sum([j in 0..m](a[i,j]*v[j])));

Calculating the trace of a matrix
res=sum([i in 0..l](a[i,i]));

Taylor expansion on every element in a vector
[i in 0..l](r[i]=sum([j in 0..m](a[j]*pow(v[i],j))));

Calculating Fourier series.
f=sum([j in 0..m](a[j]*cos(j*pi*x/2)+b[j]*sin(j*pi*x/2)))+c;

Calculating (A+I)*v using the Kronecker delta-tensor : delta(i,j)={i=j ? 1 : 0}
[i in 0..l](r[i]=sum([j in 0..m]((a[i,j]+delta(i,j))*v[j])));

Calculating cross product of two 3d vectors using the
antisymmetric tensor/Permutation Tensor/Levi-Civita tensor
[i in 0..3](r[i]=sum([j in 0..3,k in 0..3](anti(i,j,k)*a[j]*b[k])));

Calculating determinant of a 4x4 matrix using the antisymmetric tensor
det=sum([i in 0..4,j in 0..4,k in 0..4,l in 0..4]
(anti(i,j,k,l)*a[0,i]*a[1,j]*a[2,k]*a[3,l]) );
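
For comparison, here is roughly what two of the expressions above expand to as ordinary C loops. This is a sketch under assumptions: the proposed ranges are read as half-open (`0..3` meaning 0, 1, 2), and `levi_civita3` is a hypothetical stand-in for the `anti(i,j,k)` helper. In C the loop order is fixed by the source; the point of the proposed syntax is that the compiler would be free to choose it.

```c
#include <assert.h>
#include <stddef.h>

/* dot = sum([i in 0..n](a[i]*b[i])) */
double dot(const double *a, const double *b, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Antisymmetric (Levi-Civita) symbol for indices in 0..2:
 * +1 for even permutations of (0,1,2), -1 for odd ones, 0 if any repeat. */
int levi_civita3(int i, int j, int k) {
    assert(i >= 0 && i < 3 && j >= 0 && j < 3 && k >= 0 && k < 3);
    return (i - j) * (j - k) * (k - i) / 2;
}

/* r[i] = sum([j in 0..3, k in 0..3](anti(i,j,k)*a[j]*b[k])) */
void cross3(const double a[3], const double b[3], double r[3]) {
    for (int i = 0; i < 3; i++) {
        r[i] = 0.0;
        for (int j = 0; j < 3; j++) {
            for (int k = 0; k < 3; k++) {
                r[i] += levi_civita3(i, j, k) * a[j] * b[k];
            }
        }
    }
}
```

The tensor form of the cross product does 27 multiply-adds where a hand-written version needs six products, so a compiler for such a syntax would presumably have to fold away the zero terms to get close to hand-coded speed.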
Re:One of my favorite ArsTechnica articles by adam31 · 2005-03-07 06:48 · Score: 3, Interesting

Why does no one ever talk about Sony's VU assembly in their SIMD comparisons? The parent's linked article even cites the PS2 in its very first sentence, but then ignores it completely!
The VUs have the sweetest SIMD instruction set I've seen. 32 registers (like altivec), but you can do component swizzling within an instruction, it has MADD and also a sweet Accumulate register that can be re-written to on successive cycles (throughput is worse if you accumulate results in a normal vector register, like you have to on all other SIMDs). So you can do a 4x4 matrix/vector multiply in just 4 instructions!
The big problem was that you didn't get any of the nice instruction scheduling/re-ordering that you get on PPC or x86 platforms, so the onus was on the programmer to NOP through latency issues (huge pain!)... They finally came out with the VCL that would process chunks of VU assembly and reschedule everything at compile time.
The really sad thing is that Sony/IBM/Toshiba opted for AltiVec in the Cell. I guess it probably has better tools and IBM is highly leveraged into VMX, but VU was very, very clever considering that it pre-dates all these other SIMD instruction sets.
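
The four-instruction matrix/vector multiply mentioned above can be emulated in scalar C to show the shape of the computation. This is not VU assembly and makes no claim about actual instruction names or timings: `vec4`, `mul_broadcast` and `madd_broadcast` are invented for the sketch, and the matrix is assumed to be stored as four column vectors.

```c
#include <assert.h>

typedef struct { float f[4]; } vec4;   /* one 4-wide SIMD register */

/* acc = col * v[lane], with v[lane] broadcast across all four components. */
static vec4 mul_broadcast(vec4 col, vec4 v, int lane) {
    assert(lane >= 0 && lane < 4);
    vec4 acc;
    for (int i = 0; i < 4; i++) {
        acc.f[i] = col.f[i] * v.f[lane];
    }
    return acc;
}

/* acc += col * v[lane]: a multiply-add into the running accumulator. */
static vec4 madd_broadcast(vec4 acc, vec4 col, vec4 v, int lane) {
    assert(lane >= 0 && lane < 4);
    for (int i = 0; i < 4; i++) {
        acc.f[i] += col.f[i] * v.f[lane];
    }
    return acc;
}

/* result = M * v, where cols[j] is column j of M.
 * Each of the four steps stands in for one SIMD instruction. */
vec4 mat4_mul_vec4(const vec4 cols[4], vec4 v) {
    vec4 acc = mul_broadcast(cols[0], v, 0);
    acc = madd_broadcast(acc, cols[1], v, 1);
    acc = madd_broadcast(acc, cols[2], v, 2);
    acc = madd_broadcast(acc, cols[3], v, 3);
    return acc;
}
```

Each step depends on the previous accumulator value, which is why being able to rewrite the accumulator on successive cycles matters for throughput.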
Re:Altivec and OS X by dbrutus · 2005-03-07 07:23 · Score: 2, Interesting

Since the OS X TCP/IP stack is likely fully available in Darwin, why don't you go look and let us administrator types know?