Slashdot Mirror


Introducing the PowerPC SIMD unit

An anonymous reader writes "AltiVec? Velocity Engine? VMX? If you've only been casually following PowerPC development, you might be confused by the various guises of this vector processing SIMD technology. This article covers the basics on what AltiVec is, what it does -- and how it stacks up against its competition."

18 of 83 comments

  1. The article makes a very good point... by tabkey12 · · Score: 4, Interesting

    This highlights one of the real advantages that AltiVec has over the various SIMD instruction sets available for x86 processors: its comparative stability. Every AltiVec processor since the original G4 has had the same essential functionality, the same large register pool that isn't shared with anything, and a reasonably complete set of likely operations. This has made it easier for support to become widespread: a program designed to take advantage of the original G4 will still get a noticeable performance improvement on today's G5. x86 SIMD was frankly botched - MMX was a very odd idea, and, though SSE & SSE2 have partially fixed the problem, the fact that SSE optimised code usually runs slower on an Athlon than 'unoptimised' code has severely limited its applications.

    1. Re:The article makes a very good point... by adam31 · · Score: 4, Informative
      the fact that SSE optimised code usually runs slower on an Athlon than 'unoptimised' code has severely limited its applications.

      What does this even mean? I've written a great deal of optimized SSE code, and I can promise you that it works just as well on AMD. In fact, if you look at Athlon's pipeline, it does some really amazing things rescheduling and executing operations out-of-order. Fiddling around with ordering individual instructions is basically pointless because the scheduler has gotten so good at doing it on-the-fly.

      Can you cite a specific example? I've never run into this.

    2. Re:The article makes a very good point... by Anonymous Coward · · Score: 3, Interesting

      Have you done any PPC programming? There are some big gotchas when optimizing for the G5, and I usually have to program things twice, as the typical G4 AltiVec optimizations often run quite a bit slower on the G5, particularly the ones where we use the stream load/save hint instructions to help the processor schedule its memory accesses. The Apple site has a page dedicated to the differences between the G4 and G5 when it comes to optimizing. By contrast, I never found such gotchas when programming SSE on the PIII and moving to the other processors. And I do a LOT of SIMD, PPC or x86 asm programming for pro audio products.

  2. Altivec and OS X by siliconwafer · · Score: 4, Insightful

    I'd like to know if Mac OS X uses the Altivec instructions to their full potential. For example, the article mentions that a heavily loaded server can benefit greatly from Altivec if the TCP checksum algorithm uses it. Does OS X TCP stack do this?

    1. Re:Altivec and OS X by crow · · Score: 3, Interesting

      But if you only need one or two of the registers for TCP, then perhaps it would be a win if you are only doing a save/restore on one or two registers (and presumably a status register). And if you only need it in the TCP stack, then only do the save when the kernel calls those functions (which admittedly gets complicated if the kernel is preemptable).

    2. Re:Altivec and OS X by ip_fired · · Score: 5, Informative
      I googled around and found this article on Macworld:
      According to several developers Macworld talked to who are currently working on OS X applications, anytime the OS can take advantage of the AltiVec engine, it does. This ensures that the parts of the OS that can utilize AltiVec, such as working in the new user interface, experience a significant increase in performance.

      I don't know how much of OS X has AltiVec code, but there are many other Apple apps that use it. iTunes uses it for encoding music. I'm sure the video codecs in QuickTime use it as well.

      The Mac has a really nice optimization tool called Shark, which will help you find things that can be moved onto the AltiVec unit (it also helps with general optimization).
      --
      Don't count your messages before they ACK.
    3. Re:Altivec and OS X by Chuckstar · · Score: 4, Informative

      Most (all?) Apple hardware does the checksum in hardware (built into the NIC). Add to that the inefficiency of using Altivec in the kernel, especially for small data sets, and it did not make sense for Apple to develop an Altivec version of the TCP checksum code.

      The reason the article mentions the checksum case is not because Apple is missing the boat, but because there was a nice research article written about writing optimized TCP checksum code for Altivec, providing a good set of example code for aspiring Altivec coders.
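For readers who have not seen the checksum in question, here is a minimal scalar sketch of the Internet checksum as specified in RFC 1071. It is not Apple's kernel code and not the code from the research article; the function name `inet_checksum` is made up for illustration. It only shows the baseline a vectorized version would have to reproduce.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar RFC 1071 Internet checksum: the one's-complement of the
 * one's-complement sum of the data taken as 16-bit big-endian words.
 * (inet_checksum is a hypothetical name used for this sketch.) */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;   /* wide accumulator so carries cannot be lost */

    /* Add up whole 16-bit words. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    /* A trailing odd byte is treated as the high byte of a final word. */
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

The additions are independent of each other until the final fold, which is why the loop is a natural SIMD candidate: a vector version can keep several partial sums in separate lanes and combine them at the end.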

    4. Re:Altivec and OS X by bill_mcgonigle · · Score: 4, Informative

      I'd like to know if Mac OS X uses the Altivec instructions to their full potential.

      No, at this point too much needs hand tuning for everything to fully utilize the potential of Altivec. Most serious DSP-class apps spend the effort to do this in critical code, but there's plenty of compiled code running in OSX that doesn't benefit from the parallel vectorization that the Altivec unit can offer.

      This is all about to change with GCC 4 which offers an SSA tree optimizer. The SSA form is particularly useful for doing automatic vectorization of code. I'm not sure what the efficiency will be like in the first release but it looks like good things are coming.
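As a rough illustration of what an auto-vectorizer looks for, here is a loop of the kind that is usually considered easy to vectorize. This is a sketch, not a claim about what any particular GCC release will do with it; whether it is actually vectorized depends on the compiler version and on flags (with GCC, something like `-O2 -ftree-vectorize -maltivec` on PowerPC is the assumption here).

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]. Each iteration is independent of the others, and
 * `restrict` promises the two arrays do not overlap, so a vectorizer is
 * free to process several elements per instruction. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Loops with pointer aliasing, data-dependent exits, or a dependency from one iteration to the next are much harder for the compiler, which is where hand tuning still comes in.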

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
  3. AltiVec is nice... by Grand+V'izer · · Score: 5, Informative

    I've done some altivec programming in the past, and discovered it was a very effective use of my time. Since there's no mode-switching penalty for using the vector instructions you can use it for some very trivial-but-common tasks, like replacing strlen(), vector operations on small tables, etc. I knocked a lot of computation time (25%) from one of my projects just by vectorizing three functions. Of course there's a hitch: vector processing only works for certain kinds of algorithms and requires a change in mindset. In spite of that it's a great tool to have in your box.
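To give a feel for the "vector operations on small tables" case, here is a small sketch using the GCC/Clang `vector_size` extension rather than AltiVec intrinsics, so that it compiles on non-PowerPC machines too. The type name `v4f` and the function `add_tables` are made up for this example, and the extension is compiler-specific, not standard C.

```c
#include <stddef.h>
#include <string.h>

/* GCC/Clang vector extension: four floats in one 128-bit value, the same
 * width as an AltiVec register. (v4f is a name chosen for this sketch.) */
typedef float v4f __attribute__((vector_size(16)));

/* dst[i] = a[i] + b[i], four elements per step with a scalar tail. */
void add_tables(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        v4f va, vb, vr;
        memcpy(&va, a + i, sizeof va);   /* load that is safe for any alignment */
        memcpy(&vb, b + i, sizeof vb);
        vr = va + vb;                    /* one vector add covers four floats */
        memcpy(dst + i, &vr, sizeof vr);
    }
    for (; i < n; i++)                   /* leftover 0 to 3 elements */
        dst[i] = a[i] + b[i];
}
```

The change in mindset shows up even here: the data has to be laid out so that four neighbouring elements want the same operation, and the odd elements at the end need their own path.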

    --
    Not all random numbers are created equally.
    1. Re:AltiVec is nice... by evn · · Score: 5, Informative

      The other nice thing about Altivec on OS X is that Apple has done a fairly good job of making it accessible without forcing the programmer to learn and use assembly language. They have included a number of optimized libraries that use Altivec and are ready to go "out of the box" with Xcode. These libraries automatically fall back to a scalar code path if they're running on a G3, so they save you a fair bit of work there too. They include:

      • vImage: for image processing
      • vDSP: for signal processing
      • BLAS: the name says it all: "Basic Linear Algebra Subprograms"
      • LAPACK: for solving systems of equations and matrix factorization
      • Vector Math Library: offloads common operations like square root, transcendental functions, division, etc. to VMX
      • vBasicOps: for simple algebra operations like integer addition, subtraction, etc.
      • VBN: for dealing with 256- to 1024-bit numbers easily

      Apple has documentation and source code for the libraries on their Developer Connection website. What good are vector units if nobody can make use of them? I can't wait for Apple to put the GPU's image processing abilities into my hands with CoreImage/Video.
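The "fall back to a scalar code path" behaviour can be sketched roughly as below. This is not how Apple's libraries are implemented, just one common pattern: pick an implementation once and call it through a function pointer. Every name here is hypothetical, and `cpu_has_vector_unit` is a compile-time stand-in for what would be a real runtime query to the OS.

```c
#include <stddef.h>

typedef float (*sum_fn)(const float *, size_t);

/* Plain scalar path: works on any CPU. */
static float sum_scalar(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Stand-in for a vector path: four independent partial sums, which is the
 * shape a real SIMD version would take with one accumulator register. */
static float sum_wide(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}

/* Hypothetical probe. A real library would ask the OS at run time; this
 * sketch only checks what the compiler was told to target. */
static int cpu_has_vector_unit(void)
{
#if defined(__ALTIVEC__) || defined(__SSE__)
    return 1;
#else
    return 0;
#endif
}

/* Public entry point: chooses a code path on first use, then reuses it. */
float sum(const float *x, size_t n)
{
    static sum_fn impl = NULL;
    if (!impl)
        impl = cpu_has_vector_unit() ? sum_wide : sum_scalar;
    return impl(x, n);
}
```

The caller only ever sees `sum`, so the same binary behaves sensibly on a machine with or without the vector unit. Note that the two paths add in a different order, so floating-point results can differ in the last bits.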

  4. simdtech.org by kuwan · · Score: 4, Informative

    If anyone is interested, simdtech.org is probably the best resource you can find for AltiVec (or any other SIMD) programming. They have a number of tutorials and technical resources and the mailing list is the best there is. Motorola, Apple, and IBM engineers frequent the list so you can get help and information directly from the guys that created AltiVec as well as from those who program for it.

    --
    Join the Pyramid - Free Mini Mac

  5. API matters by dugenou · · Score: 4, Insightful
    /. caught again on buzz words with a shallow article.

    Anyway, what we need is not an autovec compiler, but instead a library with most CPU hungry algorithms well implemented with SIMD extensions.

    What about an open library, cross-platform, multimedia oriented, along the line of SUN's mediaLib ? Would SUN allow the re-use of their API ?

    I'm looking for such a library, with GPL/LGPL compatible license. The API has to be in C, to maximise audience. For many projects, C++ is not an option.

    Primary use will be DSP work in GNU Radio project, but multimedia extensions could prove useful anywhere in GUI's to audio/video app, etc.

    I would take any pointers to such an already existing API/project, or be ready to start a new one if other people are interested.

    See also this previous story for cheap recycled comments.

    --
    Love salty crackers? catchy electronica? Try !
    1. Re:API matters by Kluge66 · · Score: 3, Informative
      I think you're looking for liboil: "Liboil is a library of simple functions that are optimized for various CPUs. These functions are generally loops implementing simple algorithms, such as converting an array of N integers to floating-point numbers or multiplying and summing an array of N numbers. Such functions are candidates for significant optimization using various techniques, especially by using extended instructions provided by modern CPUs (Altivec, MMX, SSE, etc.)." http://www.schleef.org/liboil/. The site seems to be down at the moment, but it's also listed on Freshmeat.

      I believe the GStreamer people are looking into using liboil. The license is two-clause BSD.

  6. Re:Does it matter? by podperson · · Score: 4, Insightful

    I don't know of anyone who makes an open standards based system using the PowerPC architecture. IBM did release a reference design for a PPC based motherboard, but as far as I know no one ever produced it.

    CHRP, the PowerPC Common Hardware Reference Platform, is what you're looking for, and it's been around since before there were Apple PowerPCs. AFAIK most, if not all, of the PowerPC-based workstations shipped by IBM, the BeBox, various third-party PowerPCs such as those from PowerComputing, and many of Apple's machines (even today) are either compliant or as-close-to-compliant-as-makes-sense with this or evolutions of this standard (such that some Rhapsody/OS X fanatics were able to get it running on AIX PowerPC workstations).

    CHRP Links

    I'm not going to paint myself into a corner with a proprietary system from anyone, let alone Apple.

    Until I can make the computer from sand, copper ore, and crude oil using recipes downloaded from the internet (i.e. "The Diamond Age"), I don't see the useful distinction between being able to build a computer out of proprietary chips from one of, count them, two CPU manufacturers, a video card from one of, count them, two graphics card manufacturers, etc. and simply buying a computer that works.

  7. portable vectorization by AeiwiMaster · · Score: 4, Interesting

    On the D programming newsgroup we have been talking
    about implementing a vectorization syntax, so
    we can have portable vector code which
    approaches the speed of hand coded vectorization.

    Here is something from the list.

    What is a vectorized expression? Basically, loops that do not specify any
    order of execution. If there is no order specified, of course the compiler
    can choose any one that is efficient or maybe even distribute the code and
    execute it in parallel.

    Here are some examples.

    Adding a scalar to a vector.
    [i in 0..l](a[i]+=0.5)

    Finding size of a vector.
    size=sqrt(sum([i in 0..l](a[i]*a[i])));

    Finding dot-product;
    dot=sum([i in 0..l](a[i]*b[i]));

    Matrix vector multiplication.
    [i in 0..l](r[i]=sum([j in 0..m](a[i,j]*v[j])));

    Calculating the trace of a matrix
    res=sum([i in 0..l](a[i,i]));

    Taylor expansion on every element in a vector
    [i in 0..l](r[i]=sum([j in 0..m](a[j]*pow(v[i],j))));

    Calculating Fourier series.
    f=sum([j in 0..m](a[j]*cos(j*pi*x/2)+b[j]*sin(j*pi*x/2)))+c;

    Calculating (A+I)*v using the Kronecker delta-tensor : delta(i,j)={i=j ? 1 : 0}
    [i in 0..l](r[i]=sum([j in 0..m]((a[i,j]+delta(i,j))*v[j])));

    Calculating cross product of two 3d vectors using the
    antisymmetric tensor/Permutation Tensor/Levi-Civita tensor
    [i in 0..3](r[i]=sum([j in 0..3,k in 0..3](anti(i,j,k)*a[j]*b[k])));

    Calculating determinant of a 4x4 matrix using the antisymmetric tensor
    det=sum([i in 0..4,j in 0..4,k in 0..4,l in 0..4]
    (anti(i,j,k,l)*a[0,i]*a[1,j]*a[2,k]*a[3,l]) );
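To make the proposed notation concrete, here is a sketch of what three of the expressions above would mean if written out as ordinary C loops. It assumes the ranges are half-open (`0..l` covers 0 to l-1, which is what the `0..3` and `0..4` examples suggest) and that matrices are stored row-major; neither is stated in the proposal. The loops fix one particular order, whereas the whole point of the syntax is that the compiler would be free to pick any order. The cross product is written in the standard Levi-Civita form, summing `anti(i,j,k)*a[j]*b[k]` over j and k.

```c
#include <stddef.h>

/* dot = sum([i in 0..l](a[i]*b[i])) */
double dot(const double *a, const double *b, size_t l)
{
    double s = 0.0;
    for (size_t i = 0; i < l; i++)
        s += a[i] * b[i];
    return s;
}

/* [i in 0..l](r[i] = sum([j in 0..m](a[i,j]*v[j]))), a stored row-major */
void matvec(double *r, const double *a, const double *v, size_t l, size_t m)
{
    for (size_t i = 0; i < l; i++) {
        double s = 0.0;
        for (size_t j = 0; j < m; j++)
            s += a[i * m + j] * v[j];
        r[i] = s;
    }
}

/* Levi-Civita symbol for indices in 0..2: +1 for even permutations of
 * (0,1,2), -1 for odd ones, 0 when any index repeats. */
int anti(int i, int j, int k)
{
    return (i - j) * (j - k) * (k - i) / 2;
}

/* r[i] = sum over j and k of anti(i,j,k) * a[j] * b[k] */
void cross(double r[3], const double a[3], const double b[3])
{
    for (int i = 0; i < 3; i++) {
        double s = 0.0;
        for (int j = 0; j < 3; j++)
            for (int k = 0; k < 3; k++)
                s += anti(i, j, k) * a[j] * b[k];
        r[i] = s;
    }
}
```

Written this way the cross product does 27 multiply-adds, most of them by zero, so a compiler for the vector syntax would need to know `anti` well enough to drop the zero terms before it could compete with the hand-written three-line version.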

  8. Hold on there, partner. This isn't AltiVec stuff! by Paradox · · Score: 4, Informative
    Ahh Slashdot. First, let's mention this link which you were evidently too busy to provide. It links to two papers on how to tune for the G5. That way, someone can verify what I'm saying.

    The problems you're talking about are not the AltiVec's fault, and the AltiVec instruction set is still stable. Code will still run very quickly even if you don't optimize for the G5. But, let me bring a quote from one of those linked papers:

    Of course, your code may still need to be restructured to handle the increased latencies of the G5 Velocity Engine pipeline. Avoid small data accesses. Due to the increased latency to memory, the longer cache lines, and the nature of the CPU-to-memory bus, small data accesses should be avoided if possible. The entire system architecture has been designed to optimize the transfer of large amounts of data (i. e. maximize system memory throughput). As a side effect, the cost to handle small accesses can be very high and is quite inefficient.
    See, the problem you're complaining about is a problem with any port to the G5, or really any port from a slow-thin-memory-access system to a fast-wide-memory-access system. It has nothing to do with your AltiVec code. It just has to do with tuning for a larger L2 cache and a faster FSB rather than a slow FSB and a huge L3 cache.

    So let's not blame AltiVec for this. Except for a brief change in policy in the 745X G4, it seems like the AltiVec invocation has been stable for quite a while.

    --
    Slashdot. It's Not For Common Sense
  9. Re:One of my favorite ArsTechnica articles by adam31 · · Score: 3, Interesting
    Why does no one ever talk about Sony's VU assembly in their SIMD comparisons? The parent's linked article even cites the PS2 in its very first sentence, but then ignores it completely!

    The VUs have the sweetest SIMD instruction set I've seen. 32 registers (like altivec), but you can do component swizzling within an instruction, it has MADD and also a sweet Accumulate register that can be re-written to on successive cycles (throughput is worse if you accumulate results in a normal vector register, like you have to on all other SIMDs). So you can do a 4x4 matrix/vector multiply in just 4 instructions!

    The big problem was that you didn't get any of the nice instruction scheduling/re-ordering that you get on PPC or x86 platforms, so the onus was on the programmer to NOP through latency issues (huge pain!)... They finally came out with the VCL that would process chunks of VU assembly and reschedule everything at compile time.

    The really sad thing is that Sony/IBM/Toshiba opted for AltiVec in the Cell. I guess it probably has better tools and IBM is highly leveraged into VMX, but VU was very, very clever considering that it pre-dates all these other SIMD instruction sets.
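The four-instruction matrix/vector multiply mentioned above can be modelled in plain C as follows. This is not VU assembly and makes no claim about actual VU mnemonics or timing; `vec4`, `madd` and `mat4_mul_vec4` are names invented for the sketch. It only shows the arithmetic shape: with the matrix held as four column vectors, the product is each column scaled by one component of the input vector and accumulated.

```c
/* Four floats standing in for one 128-bit SIMD register. */
typedef struct { float f[4]; } vec4;

/* acc + col * s in all four lanes: a scalar model of one multiply-add
 * instruction that broadcasts a single component of another register. */
static vec4 madd(vec4 acc, vec4 col, float s)
{
    for (int i = 0; i < 4; i++)
        acc.f[i] += col.f[i] * s;
    return acc;
}

/* M * v with M given as its four columns: four vector operations in all.
 * The first step adds to zero, so hardware could use a plain multiply. */
vec4 mat4_mul_vec4(const vec4 cols[4], vec4 v)
{
    vec4 acc = {{0.0f, 0.0f, 0.0f, 0.0f}};
    acc = madd(acc, cols[0], v.f[0]);
    acc = madd(acc, cols[1], v.f[1]);
    acc = madd(acc, cols[2], v.f[2]);
    acc = madd(acc, cols[3], v.f[3]);
    return acc;
}
```

Each step depends on the previous value of `acc`, which is why a dedicated accumulator that can be rewritten on successive cycles matters: without it, the chain stalls on the latency of every multiply-add.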

  10. Re:One of my favorite ArsTechnica articles by be-fan · · Score: 3, Informative

    AltiVec will only be used in the PowerPC core of the Cell. The vector coprocessors (the SPEs) will use some other instruction set.

    --
    A deep unwavering belief is a sure sign you're missing something...