Slashdot Mirror


Grand Unified Theory of SIMD

Glen Low writes " All of a sudden, there's going to be an Altivec unit in every pot: the Mac Mini, the Cell processor, the Xbox2. Yet programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians. The macstl project tries to unify the architectures in a simple C++ template library. It just reached its 0.2 milestone and claims a 3.6x to 16.2x speed-up over hand-coded scalar loops. And of course it's all OSI-approved RPL goodness. "

25 of 223 comments (clear)

  1. Altivec by BWJones · · Score: 5, Informative


    For those who want a little background on Altivec, of course Wiki has a description here. Apple, who now ships Altivec in every system they make has a pretty good page here and Motorola nee Freescale has one here.

    The benefits of Altivec can be truly astounding for those processes that can be "vectorized". After all putting these kinds of calculations in hardware has got it all over software computation. It kind of reminds me of when I got one of those Photoshop accelerator hardware cards (Radius Photoengine with 4 DSPs on a daughter card linked to the Thunder series video card) for my IIci. Photoshop filter functions ran faster on that IIci than they did on much later PowerPC systems simply because you now had four hardware DSPs running your image math.

    --
    Visit Jonesblog and say hello.
    1. Re:Altivec by shawnce · · Score: 4, Informative

      Just pick a few items out ...

      Apple provides source code for some of their vector libraries

    2. Re:Altivec by mod_critical · · Score: 3, Informative

      Altivec == Velocity Engine

      And is part of every G4

  2. More AltiVec Goodness by LordRPI · · Score: 4, Informative

    Apple has had AltiVec optimized libraries for DSP and such since the early releases of OS X.

    1. Re:More AltiVec Goodness by bryanzak · · Score: 3, Insightful

      One of the problems of using libraries though is that the overhead of a function call usually negates any gain in vectorization. The lib call messes all kinds of things up, including instruction flow and caching, etc.

  3. A little background by xXunderdogXx · · Score: 4, Informative
    From the Wikipedia article on SIMD:
    An example of an application that can take advantage of SIMD is one where the same value is being added to a large number of data points, a common operation in many multimedia applications. One example would be changing the brightness of an image. Each pixel of an image consists of three 8-bit values for the brightness of the red, green and blue portions of the color. To change the brightness, the R G and B values are read from memory, a value is added (or subtracted) from it, and the resulting value is written back out to memory.

    With a SIMD processor there are two improvements to this process. For one the data is understood to be in blocks, and a number of values can be loaded all at once. Instead of a series of instructions saying "get this pixel, now get this pixel", a SIMD processor will have a single instruction that effectively says "get all of these pixels" ("all" is a number that varies from design to design). For a variety of reasons, this can take much less time than it would to load each one by one as in a traditional CPU design.
    But of course I'm sure everyone here knew that..
    1. Re:A little background by DLWormwood · · Score: 3, Informative
      How is this different for MMX?

      Based on personal recollections reenforced by a quick Wiki'ing, MMX's problem wasn't the concept itself, but Intel's braindead constraints placed on x86 support for vectors. MMX recycled the same registers as used for floating point math, causing expensive context switches between each mode and only allowing integer math to be vectorized. Intel eventually developed SSE to work around some of the bottlenecks, but the eventual dominance of GPUs on the PC platform reduced the development priority for vector math in the CPU.

      --
      Those who complain about affect & effect on /. should be disemvoweled
    2. Re:A little background by Dominic_Mazzoni · · Score: 4, Informative

      Quick summary:

      MMX (x86): 8-byte registers, only integer operations
      SSE (x86): 16-byte registers, single-precision float ops
      AltiVec (PPC): 16-byte registers, integer and single-precision float ops
      SSE2 (x86): 16-byte registers, double-precision float ops

      In order to implement many complex algorithms on x86, you need to use a motley combination of MMX and SSE. There are many flaws in both; lots of very useful instructions are missing, and MMX can't be used in conjunction with non-SIMD floating-point operations without a huge expensive context switch. One of the biggest flaws in MMX/SSE that I found was the lack of instructions to shuffle data around within a (8-byte or 16-byte) register. The only advantage on a modern x86 CPU is SSE2, which is the only SIMD unit with double-precision floats. But you can only work with two doubles at a time, so the speedup is not that great.

      AltiVec, on the other hand, included both floats and integers right from the start, with no penalty for switching between them, and it includes a very detailed and useful set of instructions, including an awesome shuffle instruction. My personal experience, coding for both, is that AltiVec is about twice as useful as MMX/SSE/SSE2 combined.

      Also, note that in Mac OS X, many of the standard libraries and system calls are already AltiVec-optimized for you, and Apple also provides a great Vector library with lots of common DSP operations.

  4. Long thread about using Altivec by ThousandStars · · Score: 4, Informative

    The Mac forum at Ars Technica has a long, continuing post about Altivec optimizations and how they should be used. The thread started more than two years ago and still gets relevent points and questions added to it. It's an amazing resource if you're interested in starting.

  5. License issues by IO+ERROR · · Score: 5, Informative
    Be careful; the "open source" license (PDF) is not GPL-compatible. I don't even think it's BSD-compatible on first reading.

    The Reciprocal Public License requires you to release all of your source code if you link to this library, even if your project is personal or used in-house only.

    --
    How am I supposed to fit a pithy, relevant quote into 120 characters?
    1. Re:License issues by IO+ERROR · · Score: 3, Informative
      Simple to understand; if you use it for free, you're expected to release your source code (i.e. the 'reciprocal' part of RPL). If you pay to use it, you don't have to release your source code.

      True enough, but using the proprietary license makes it impossible to use this in existing projects without changing the license. Suddenly your open source project is either no longer open source, or doesn't look so attractive.

      One of the nicest features of the GPL (and, to be fair, of the BSD license) is that you do not have to release source code if you don't distribute your software. This RPL requires you to release your source code even if you don't distribute your software. And the proprietary license simply isn't appropriate for any type of open source project.

      The guy wants to get paid, and that's fine, I want to get paid, too. But he's got no business telling me I have to distribute my source code for an internal project that will never be distributed. He could easily have used a method similar to Trolltech's dual-licensing, but he chose instead to do something a whole lot more obnoxious.

      --
      How am I supposed to fit a pithy, relevant quote into 120 characters?
  6. Re:16X increase? by LordRPI · · Score: 5, Informative

    The principle behind SIMD, or, rather, Single Instruction Multiple Data, is that you can process wide arrays of values in a single instruction. With the PowerPC version of SIMD, also known as AltiVec, you can issue an instruction and have it work with a 128-bit wide register. These registers may contain up to 4 32-bit numbers, 8 16-bit numbers or 16 8-bit numbers. For example, I can load two AltiVec registers with 16 unsigned chars, add them together using Vec_Add() and have it return its results to an AltiVec register. So this in essense is adding 16 values at once and in theory it's good enough for markeing to claim a 16X speedup, but this is rarely the case.

  7. About the RPL by pavon · · Score: 4, Informative

    The RPL ( Reciprocal Public License) is an odd choice for this project. It is an even stronger viral copy-left than the GPL, to the point where the FSF takes issue with it. If create a derivative work you are required required to 1) Notify the original author, and 2) Publish your changes even if you only use the program in house. Furthermore, their definition of derivative work is much, much broader than the "linking" definition that the GPL uses.

    The fact that it puts these additional requirements / restrictions on the user makes it incompatible with the GPL. In fact, considering the requirements placed on you by the license, I would expect that you will have difficulty incorporating this RPL library into any existing FLOSS project without running into license conflicts. The only thing I can see this being useful for is a new project that you don't mind releasing under the RPL, or with existing BSD style licensed code which you dual license as BSD/RPL (since BSD can be included in anything).

    So this library does not appear to very useable for the FLOSS world, although if you want to license it for proprietary software you may.

  8. Black Art? Uh... by arekusu · · Score: 3, Interesting

    "...the black art of assembly language magicians."

    The nice thing about altivec is that it has a C interface. You don't have to use assembly!

    Take a look at this Apple tutorial to see how easy it is.

    1. Re:Black Art? Uh... by Leo+McGarry · · Score: 3, Funny

      Yes, I think the person who wrote the summary revealed a little more of his own ignorance than he meant to. I don't consider calling "vec_add" inside a loop to be a black art.

  9. Autovectorization being add in GCC 4.0 by shawnce · · Score: 5, Interesting

    For those that don't already know is that autovectorization is being worked on for GCC by folks from IBM and others.

    GCC vectorizatoin project (site seem offline atm) but the abstract from a recent GCC summit is up.

    Autovectorization Talk (google html view of pdf)

  10. The future by johnhennessy · · Score: 3, Insightful

    Surely people can now start to see where the future lies - from a performance viewpoint. We've reached the end of the clocking "free lunch" (see http://www.gotw.ca/publications/concurrency-ddj.ht m/).

    The way forward is turning the CPU (of a traditional) architecture into a Nanny for a range of various dedicated processing units. IBM saw this years ago, and thus began the whole Cell architecture - but I suspect that their job was much easier. The software that would run on the platform they are designing is fairly specific - games & multimedia which usually lend themselves well to vectorization.

    The real challenge for architects (in my humble opinion) is translating will be applying the same technique to other system bottlenecks.

    AMD's (and now Intel's) approach of crambing more and more processing cores onto an IC might pay off in the short term, but like the "free lunch" of clock speed, will hit a roadblock when issues like memory bandwidth and caching schemes just have too much work to do with 4 or 8 processing cores hacking at it all the time.

    --
    [ Monday is a terrible way to spend one seventh of your life. ]
  11. Read the Altivec mailing list by kuwan · · Score: 4, Informative

    A better resource for Altivec and SIMD in general is the SIMDtech.org website and Altivec mailing list. There are tutorials and technical manuals available and the email list is indispensable. While the mailing list is mostly geared towards Altivec optimizations and discussions all SIMD discussion is welcome, including MMX/SSE. There are Apple engineers that read and contribute to the list as well as Motorola/Freescale engineers. It's probably the single best resource available to Altivec programmers and you get to talk directly to the Wizards that created it.

    I'm a relative newcomer to the list and it's been an invaluable resource as I've optimized with Altivec.

    --
    Join the Pyramid - Free Mini Mac

  12. From the limewire... by WilyCoder · · Score: 3, Interesting

    As two of my professors have stated in class, SIMD and moreso parallel processing will require programmers to think in a fundamentally different way in order for multi-core/multi-processor to really take off.

    This project may be a step in the right direction. Benchmarks show that SIMD such as SSE/2/3 only provide a marginal speed increase. And meanwhile, the massively parallel computations done on graphics cards dwarfs anything SIMD claims to produce.

    Perhaps we will see GFX manufacturers selling their technology to the CPU makers.

    I forget the specifics, but a new GFX card can perform somewhere around 35 GFLOPS, while a 3.4Ghz P4(executing SIMD code) can only produce around 5-6GFLOPS at best.

    With projects like Brook GPU emerging, the division of CPU and GFX processor may be narrowed significantly.

  13. Depends on what you are doing by dsci · · Score: 5, Insightful

    We write code for hardcore chemical simulations. The limits on what can be studied, ie number of atoms/molecules or timescales of the simulations depends on one thing: speed.

    Faster computers means better simulations. BUT, if the code is not as fast as it can be on a particular architecture, your simulations are not going to be as complete as they can be. At least within a given time allotment.

    I've recently applied some code optimizations to a Monte Carlo simulation and saw speed ups of over 1000x. That's significant.

    It's naive to think that faster computers means we should live with sloppy or unoptimized code. SIMD is a useful technique, and if it means the difference between me getting work done in a week or two or three weeks, I think I'll take the one-week sim.

    --
    Computational Chemistry products and services.
  14. Re:OS X Tiger will do it for you by be-fan · · Score: 4, Informative

    Actually, Apple's Tiger will get an auto-vectorizing compiler courtesy of the public GCC 4.0 release. The auto-vectorizer wasn't developed in Apple's version of GCC. IBM's GCC team at the Haifa Research Lab developed the vectorizer in the public LNO (loop nest optimization) branch of GCC 4.0. I'm not trying to minimize Apple's contribution here, one of their developers did work on the team, but let's give credit where credit is due.

    --
    A deep unwavering belief is a sure sign you're missing something...
  15. Why? Altivec-optimized libraries supplied by Apple by coult · · Score: 3, Interesting

    You really don't need macstl unless you have a strong desire to use valarray in C++...for example, the ATLAS project http://math-atlas.sourceforge.net/ already uses Altivec (and SSE/SSE2, etc) wherever it results in a speedup. So, if your code does linear algebra, use ATLAS and you'll see an automatic speedup in many cases. Other projects such as fftw http://fftw.org/ include Altivec/SSE/SSE2 optimizations as well. ATLAS includes lots of other optimizations such as cache-blocking, loop-unrolling, etc. I don't know of macstl includes such optimizations, but I do know that ATLAS performance approaches the theoretical peak performance on G4/G5 for things like matrix-matrix multiplication.

    Not only that, but Apple's vecLib http://developer.apple.com/ReleaseNotes/MacOSX/vec Lib.html includes ATLAS so you don't even have to download or install anything - it comes with OS X.

    --

    All is Number -Pythagoras.

  16. Why limit yourself to Altivec when you have NVidia by kompiluj · · Score: 3, Insightful

    Well the processing power of Altivec or MMX/SSE/3DNow or whatever is nowhere near the power of you newest NVidia/ATI card you have surely bought for playing Doom III. Why not use it then? Get the brook compiler! Furthemore, I see they introduce classes like vec, etc. Such classes have been already designed successfuly for C++. Why not try porting Blitz to the Altivec and/or to the GPU?

    --
    You can defy gravity... for a short time
  17. Maybe it's just Ignorant criticism... by kuwan · · Score: 3, Informative
    If you'd actually read what this is all about then you'd have find out that this is a cross-platform library for SIMD programming. You program in standard C++ using std::valarray and you get code optimized for Altivec and MMX/SSE/SSE2/SSE3 without having to do anything else. You don't need to worry about coding to two different libraries on two different platforms nor do you have to worry about learning the platform-specific C intrinsics, alignment issues, head/tail cases, etc.

    SIMD programming becomes as easy as this:
    float af1 [] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    stdext::valarray <float> v1 (af1, 10); // construct from first 10 elements of af1
    stdext::valarray <float> v2 (10, 3.0f); // construct with 10 repeats of 3.0f
    stdext::valarray <float> v3 (10); // construct with 10 repeats of 0.0f

    v3 = sin (v1) * cos (v2) + sin (v2) * cos (v1);
    He claims that the above code is 17.4x faster than Codewarrior MSL C++, 11.6x faster than gcc libstdc++ and 9.5x faster than Visual C++.

    Macstl also provides a cross-platform syntax for using vector registers that is similar to using the native C intrinsics on each platform. So while not all of the native operations are available, his cross-platform "vec" API allows you to write cross-platform code without having to learn both the Altivec and MMX/SSE intrinsics (which is a good solution for someone who knows one platform but not the other).

    --
    Join the Pyramid - Free Mini Mac
  18. Re:Moore's Law has eroded the need for assembly by groomed · · Score: 3, Insightful

    Sorry, but yours is an utterly kneejerk boilerplate response which has nothing to do with the topic at hand and only serves to establish your credentials as a hard nosed realist who has been there and done it.

    Moore's Law has eroded the need for such knowledge

    Moore's "law" (which is just an off-the-cuff observation, really) has nothing to do with this. If anything, Moore's law has enabled transistor and space devouring SIMD technology.

    It would be like concerning myself on how to design circuits...

    No, it's nothing like that at all. Just because you own and know how to use money doesn't mean there is no point to the complex financial reckonings that are made every day at institutions all over the world. You may not need, but you is not under discussion.

    Yes some people who write games are still concerne with assembly as are people in embedded markets. But those jobs, situations and skills are niche

    By this definition, everything is niche. The whole computing industry becomes "niche". Farming is "niche". The paper industry is "niche". What you're describing is just non-descript white collar administrative work which just happens to involve a computer; bit shuffling, rather than paper shuffling.

    Those situations are about the last place you will find anyone caring about something called "assembly language."

    Again, completely irrelevant.

    The point is that with a few dozen lines of SIMD code (whether in assembly or some high level language) any reasonably competent programmer can achieve four-fold, ten-fold, even twenty-fold speedups on critical path code, from scratch, in as little as a week.

    These are amazing results, and people should be encouraged to investigate the possibilities, not be dragged down into this drab netherworld of yours.