Slashdot Mirror


Grand Unified Theory of SIMD

Glen Low writes " All of a sudden, there's going to be an Altivec unit in every pot: the Mac Mini, the Cell processor, the Xbox2. Yet programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians. The macstl project tries to unify the architectures in a simple C++ template library. It just reached its 0.2 milestone and claims a 3.6x to 16.2x speed-up over hand-coded scalar loops. And of course it's all OSI-approved RPL goodness. "

223 comments

  1. Altivec by BWJones · · Score: 5, Informative


    For those who want a little background on Altivec, of course Wiki has a description here. Apple, which now ships Altivec in every system it makes, has a pretty good page here, and Freescale (née Motorola) has one here.

    The benefits of Altivec can be truly astounding for those processes that can be "vectorized". After all putting these kinds of calculations in hardware has got it all over software computation. It kind of reminds me of when I got one of those Photoshop accelerator hardware cards (Radius Photoengine with 4 DSPs on a daughter card linked to the Thunder series video card) for my IIci. Photoshop filter functions ran faster on that IIci than they did on much later PowerPC systems simply because you now had four hardware DSPs running your image math.

    --
    Visit Jonesblog and say hello.
    1. Re:Altivec by shawnce · · Score: 4, Informative

      Just pick a few items out ...

      Apple provides source code for some of their vector libraries

    2. Re:Altivec by Anonymous Coward · · Score: 0

      The altivec isn't simply a unit on the G4?

    3. Re:Altivec by wulfhound · · Score: 1

      Yes it does.. it's a G4, all G4s have Altivec.

    4. Re:Altivec by mod_critical · · Score: 3, Informative

      Altivec == Velocity Engine

      And is part of every G4

    5. Re:Altivec by baryon351 · · Score: 2, Interesting

      It kind of reminds me of when I got one of those Photoshop accelerator hardware cards (Radius Photoengine with 4 DSPs on a daughter card linked to the Thunder series video card) for my IIci. Photoshop filter functions ran faster on that IIci than they did on much later PowerPC systems simply because you now had four hardware DSPs running your image math.


      I managed to pick up a ThunderIV last year with the DSP card, and had a run around with photoshop on it. It's impressive stuff. I have an iMac 350 here I also ran photoshop on, and while the 350 kicked the Thunder in a Quadra for many unaccelerated things, on those operations where the DSPs kicked in (and the card has those cool little LEDs to show just when it's happening) it could keep up with the iMac nearly neck & neck.

      That's a 25MHz 68040 from 1992 and Thunder IVGX vs a 350MHz G3 from 2000. Very cool.

    6. Re:Altivec by skraps · · Score: 1

      "Wiki" != "WikiPedia".
      For more, read http://en.wikipedia.org/wiki/Wiki.

      --
      Karma: -2147483648 (Mostly affected by integer overflow)
    7. Re:Altivec by Anonymous Coward · · Score: 0

      Wrong. Go back to the kitchen where you belong. This 'computer' thing isn't for you.

    8. Re:Altivec by Lord Kano · · Score: 0, Troll

      Altivec is good stuff but, I think it's funny how for years Mac Zealots were talking about how RISC was more efficient, but Altivec proves that speed can be gained by having more instructions in the chip.

      LK

      --
      "Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano
    9. Re:Altivec by Anonymous Coward · · Score: 0

      Apple used to provide source for vMathLib, but it's now horribly out of date with respect to the latest incarnation vecLib on OS X 10.3 Panther.

      Cheers,
      Glen Low, Pixelglow Software
      www.pixelglow.com

  2. More AltiVec Goodness by LordRPI · · Score: 4, Informative

    Apple has had AltiVec optimized libraries for DSP and such since the early releases of OS X.

    1. Re:More AltiVec Goodness by goMac2500 · · Score: 1

      How is parent flamebait? It's a fact, and it's not flamebait considering Apple is one of the only companies currently shipping Altivec systems.

    2. Re:More AltiVec Goodness by Anonymous Coward · · Score: 0

      What do you think SSE is? How about MMX? Yep, those are vector instructions, too. So in fact, Apple is just one company in a whole bucket of companies shipping products with vectorized instructions. It's almost factual that Apple is one of only a few companies shipping Altivec code, but that's because Altivec is the instruction set that came with their hardware. It's like saying that Apple is one of the only companies shipping operating systems that will run on the new iMac; it's technically true, but misleading.

      I believe it's intentionally misleading. There've been far, far too many Applefan posts on Slashdot lately. Any topic that provides a possible segue to Apple is used. Big Endian? Apple uses big endian! Graphics? Apple has good ones! Low power machines? Apple machines are nearly as quiet as some x86 low power machines! Give it a rest already. The astroturfing campaign is backfiring.

    3. Re:More AltiVec Goodness by bryanzak · · Score: 3, Insightful

      One of the problems of using libraries though is that the overhead of a function call usually negates any gain in vectorization. The lib call messes all kinds of things up, including instruction flow and caching, etc.

    4. Re:More AltiVec Goodness by Woody77 · · Score: 1

      inline is your friend.

  3. Reads the article again... by Wandering Wombat · · Score: 0, Offtopic

    *nods* Yes.

    --
    I like to place meaningful quotes in my sig, so people will know that I know what meaningful quotes are.
  4. Umm by TheKidWho · · Score: 2, Informative

    Doesn't XCode have a feature that lets you "vectorize" certain parts of your code already?

    1. Re:Umm by Richard_at_work · · Score: 2, Informative

      The next version of Xcode will support autovectorisation, but I don't think it does it atm.

    2. Re:Umm by HeghmoH · · Score: 1

      No.

      --
      Mod down posts with a "Free Mac Mini/iPod" sig, they're spam!
  5. A little background by xXunderdogXx · · Score: 4, Informative
    From the Wikipedia article on SIMD:
    An example of an application that can take advantage of SIMD is one where the same value is being added to a large number of data points, a common operation in many multimedia applications. One example would be changing the brightness of an image. Each pixel of an image consists of three 8-bit values for the brightness of the red, green and blue portions of the color. To change the brightness, the R G and B values are read from memory, a value is added (or subtracted) from it, and the resulting value is written back out to memory.

    With a SIMD processor there are two improvements to this process. For one the data is understood to be in blocks, and a number of values can be loaded all at once. Instead of a series of instructions saying "get this pixel, now get this pixel", a SIMD processor will have a single instruction that effectively says "get all of these pixels" ("all" is a number that varies from design to design). For a variety of reasons, this can take much less time than it would to load each one by one as in a traditional CPU design.
    But of course I'm sure everyone here knew that..
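A minimal scalar sketch of the brightness example quoted above, assuming 8-bit channel values stored back to back; `brighten` and its signature are hypothetical names chosen for illustration. A SIMD unit would apply the same saturating add to a whole register of bytes per instruction (AltiVec's `vec_adds` is one such operation), and a loop of this shape is the kind an auto-vectorizing compiler may be able to convert.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Add `delta` (may be negative) to every 8-bit channel, clamping to [0, 255].
// Scalar reference: one byte per iteration. A SIMD version would perform the
// same saturating add on a full register of bytes at once.
inline void brighten(std::vector<std::uint8_t>& channels, int delta) {
    for (std::size_t i = 0; i < channels.size(); ++i) {
        int v = static_cast<int>(channels[i]) + delta;
        v = std::min(255, std::max(0, v));  // saturate instead of wrapping
        channels[i] = static_cast<std::uint8_t>(v);
    }
}
```

The clamp matters: without it, brightening a value of 250 by 10 would wrap around to 4 and turn a near-white channel almost black.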
    1. Re:A little background by Bisqwit · · Score: 1

      An example of an application that can take advantage of SIMD is one where the same value is being added to a large number of data points, a common operation in many multimedia applications.

      How is this different for MMX?
      Because I thought MMX does exactly what you described.

    2. Re:A little background by Gr8Apes · · Score: 1

      Evidently not ;)

      --
      The cesspool just got a check and balance.
    3. Re:A little background by xXunderdogXx · · Score: 1

      If I'm not mistaken, wouldn't MMX be an implementation of SIMD?

    4. Re:A little background by DLWormwood · · Score: 3, Informative
      How is this different for MMX?

      Based on personal recollections reinforced by a quick Wiki'ing, MMX's problem wasn't the concept itself, but Intel's braindead constraints placed on x86 support for vectors. MMX recycled the same registers as used for floating point math, causing expensive context switches between each mode and only allowing integer math to be vectorized. Intel eventually developed SSE to work around some of the bottlenecks, but the eventual dominance of GPUs on the PC platform reduced the development priority for vector math in the CPU.

      --
      Those who complain about affect & effect on /. should be disemvoweled
    5. Re:A little background by Anonymous Coward · · Score: 0

      Yeah, you're right. MMX is a SIMD extension.

    6. Re:A little background by at_18 · · Score: 1

      MMX is an integer-only implementation of SIMD. It was also problematic because it didn't have its own registers, but re-used the floating point ones of the CPU. SSE is a floating-point implementation of SIMD with its own registers.

    7. Re:A little background by Dominic_Mazzoni · · Score: 4, Informative

      Quick summary:

      MMX (x86): 8-byte registers, only integer operations
      SSE (x86): 16-byte registers, single-precision float ops
      AltiVec (PPC): 16-byte registers, integer and single-precision float ops
      SSE2 (x86): 16-byte registers, double-precision float ops

      In order to implement many complex algorithms on x86, you need to use a motley combination of MMX and SSE. There are many flaws in both; lots of very useful instructions are missing, and MMX can't be used in conjunction with non-SIMD floating-point operations without a huge expensive context switch. One of the biggest flaws in MMX/SSE that I found was the lack of instructions to shuffle data around within a (8-byte or 16-byte) register. The only advantage on a modern x86 CPU is SSE2, which is the only SIMD unit with double-precision floats. But you can only work with two doubles at a time, so the speedup is not that great.

      AltiVec, on the other hand, included both floats and integers right from the start, with no penalty for switching between them, and it includes a very detailed and useful set of instructions, including an awesome shuffle instruction. My personal experience, coding for both, is that AltiVec is about twice as useful as MMX/SSE/SSE2 combined.

      Also, note that in Mac OS X, many of the standard libraries and system calls are already AltiVec-optimized for you, and Apple also provides a great Vector library with lots of common DSP operations.
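The register sizes in the summary above translate directly into how many values one instruction can touch. A small sketch of that arithmetic follows; `lanes` is a hypothetical helper name, and the figures are only the ones listed in the comment.

```cpp
#include <cstddef>

// Number of elements ("lanes") a SIMD register holds:
// register size in bytes divided by element size in bytes.
constexpr std::size_t lanes(std::size_t register_bytes, std::size_t element_bytes) {
    return register_bytes / element_bytes;
}

static_assert(lanes(8, 1) == 8, "MMX: eight 8-bit integers per register");
static_assert(lanes(16, 4) == 4, "SSE / AltiVec: four single-precision floats");
static_assert(lanes(16, 2) == 8, "AltiVec: eight 16-bit integers");
static_assert(lanes(16, 8) == 2, "SSE2: two doubles");
```

This is why the comment calls the SSE2 double-precision speedup "not that great": two lanes bound the ideal gain at 2x before any overhead.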

    8. Re:A little background by TheRaven64 · · Score: 2, Informative

      As well as the vDSP libraries, Apple also provide a set of wrapper functions around the vector instructions. These expose the instructions directly, but let the compiler handle register allocation, making using AltiVec directly very easy.

      --
      I am TheRaven on Soylent News
    9. Re:A little background by julesh · · Score: 1

      Your list misses:

      3DNow! (AMD-x86): 8 byte registers, single precision floats

      Possibly a footnote of history now, but worth mentioning as a fairly significant proportion of processors support it.

      One of the biggest flaws in MMX/SSE that I found was the lack of instructions to shuffle data around within a (8-byte or 16-byte) register.

      You mean like the PSHUF* family of instructions? Or something else?

    10. Re:A little background by excessive · · Score: 1
      "getting all of these pixels" is going to take the same length of time (the processor has to have the data locally to act on it, and there is still the same number of buses from RAM into the chip); it's just the arithmetic on the values that is done in parallel.

      Also, IIRC, brightness is a scaling function... (But that's just me being picky)

  6. 16X increase? by Sensible Clod · · Score: 1

    Okay, I'm willing to believe it, but only if someone shows how that's possible.

    --

    The difference between spam and poop is that you don't have to dig through septic tanks looking for real food. -- Me
    1. Re:16X increase? by mirko · · Score: 2, Interesting

      When using Reason 3, some virtual synths have the option to produce an enhanced sound.
      What is curious is that if you are using a pre-Altivec proc (G3), it'll burn more CPU time while the same enhancement will be totally and natively supported by Altivec-enabled units : a 400MHz G4 Powerbook is enhancing these synths more efficiently than an 800MHz G3.
      I guess this was like the simultaneous operations that the ARM assembly language supports (e.g. both storing and rotating values in an operation)...

      --
      Trolling using another account since 2005.
    2. Re:16X increase? by Anonymous Coward · · Score: 0

      The SSE registers in x86 processors are 128 bits long.
      You can load two registers with 16 8-bit values each, and add the two registers in one operation.
      In theory this gives a 16x increase, but there is additional overhead to be considered.

    3. Re:16X increase? by LordRPI · · Score: 5, Informative

      The principle behind SIMD, or, rather, Single Instruction Multiple Data, is that you can process wide arrays of values in a single instruction. With the PowerPC version of SIMD, also known as AltiVec, you can issue an instruction and have it work with a 128-bit wide register. These registers may contain up to four 32-bit numbers, eight 16-bit numbers or sixteen 8-bit numbers. For example, I can load two AltiVec registers with 16 unsigned chars each, add them together using vec_add() and have it return its results to an AltiVec register. So this in essence is adding 16 values at once and in theory it's good enough for marketing to claim a 16X speedup, but this is rarely the case.
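A hedged sketch of the sixteen-byte add described above. It does not use AltiVec's own interface; instead it assumes a GCC or Clang toolchain and relies on their generic vector extension, which the compiler can lower to a single vector add on AltiVec or SSE2 targets. The names `u8x16` and `add16` are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// GCC/Clang extension (an assumption about the toolchain): a 16-byte vector
// made of sixteen unsigned 8-bit lanes.
typedef std::uint8_t u8x16 __attribute__((vector_size(16)));

// Add sixteen bytes from `a` to sixteen bytes from `b`, lane by lane.
// Each lane wraps modulo 256; no carry moves between lanes.
inline void add16(const std::uint8_t* a, const std::uint8_t* b, std::uint8_t* out) {
    u8x16 va, vb;
    std::memcpy(&va, a, sizeof va);   // load 16 bytes into one "register"
    std::memcpy(&vb, b, sizeof vb);
    u8x16 sum = va + vb;              // one expression, sixteen additions
    std::memcpy(out, &sum, sizeof sum);
}
```

The loads and the store around the add are part of the overhead the comment alludes to: the arithmetic is sixteen-wide, but getting data in and out of the register is not free.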

    4. Re:16X increase? by Anonymous Coward · · Score: 0

      Yea, but how does this compiler/library/whatever add a 3.6X-16X speedup over handcoded simd code?

    5. Re:16X increase? by Anonymous Coward · · Score: 0

      I guess this was like the simultaneous operations that the ARM assembly language supports (e.g. both storing and rotating values in an operation)...

      No, they are not similar at all. First of all, the issue of storing and rotation is an instruction encoding issue more than an execution issue, they simply pack a couple operations in one instruction word. Internally it may be executed with both the ALU and the load/store, but this isn't a guarantee of course (like the Xscale), it depends on the core.

      SIMD is where you encode multiple data units and there are instructions that deal with computing the individual data units "independently". For ARM, there is the Enhanced DSP extensions, but I forget how SIMDish it really is. The new xscale based ARM cores from Intel (PXA27x) have a version of MMX called IWMMX (Intel wireless MMX extensions).

    6. Re:16X increase? by Anonymous Coward · · Score: 2, Informative
      The concept, and radical performance boost, is in line (pardon the pun) with Expression Templates for C++.

      A good example is what happens when you let the compiler decide how to do arithmetic with vectors and matrices.

      Matrix a,b,c,x;
      x = a + b + c;

      The naked compiler, in combination with your custom Matrix class, will probably unwind the operator overloads to do something like this:
      // assuming a reasonable STL w/function inlining
      Matrix __t1;
      for(int i=0; i<a.width; i++){
        for(int j=0; j<a.width; j++){
          __t1[i][j] = a[i][j] + b[i][j];
        }
      }
      Matrix __t2;
      for(int i=0; i<a.width; i++){
        for(int j=0; j<a.width; j++){
          __t2[i][j] = __t1[i][j] + c[i][j];
        }
      }
      x = __t2;
      All those temporary copies and inlined loops really kill performance.

      Now, with an expression library, it handles each arithmetic expression discretely by type. By treating the expressions, as well as the types involved, you can do more sophisticated things. In this case, the Expression Template Library solves the problem thusly:
      // using ETL
      for(int i=0; i<a.length; i++){
        x[i] = a[i] + b[i] + c[i];
      }
      Here the library has carnal knowledge of the data structures involved as well as order of operations to come to such a succinct solution.

      In the case of MACSTL, it's still using these principles of "vectorizing" the expressions as well as unrolling and other traditional optimization techniques. It's also going the extra mile and using processor specific code and/or C code that targets *extremely* well to PPC. For example, the above example would optimize well using Altivec, due to the platform's built-in vector type; you wouldn't even need a loop for adding several 'vec' instances.

      I wish I knew enough about MACSTL and altivec to give a hard example of a 16X speedup. I hope this gets you closer to seeing at least *where* the reducible overhead is coming from. :)

      Check out Blitz++'s papers listing for more info:
      http://www.oonumerics.org/blitz/papers/
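To make the expression-template idea above concrete, here is a minimal sketch. It is not code from macstl or Blitz++; `Vec` and `Sum` are hypothetical types, and only addition is handled. The point is that `a + b + c` builds a lightweight tree of references, and the single loop runs only when the result is assigned.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Node representing "l + r" without computing anything yet.
template <typename L, typename R>
struct Sum {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i) { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning from an expression runs ONE loop over the whole tree,
    // so no temporary Vec is created for the intermediate a + b.
    template <typename L, typename R>
    Vec& operator=(const Sum<L, R>& e) {
        assert(e.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Vec + Vec yields an unevaluated Sum node.
inline Sum<Vec, Vec> operator+(const Vec& a, const Vec& b) {
    return Sum<Vec, Vec>{a, b};
}

// (expression) + Vec nests the existing node inside a new one.
template <typename L, typename R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R>& a, const Vec& b) {
    return Sum<Sum<L, R>, Vec>{a, b};
}
```

With these definitions, `x = a + b + c;` compiles down to the one-loop form shown in the comment. The nodes hold references to temporaries, so the expression must be consumed within the same statement rather than stored for later.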
    7. Re:16X increase? by sribe · · Score: 2, Interesting

      So this in essence is adding 16 values at once and in theory it's good enough for marketing to claim a 16X speedup, but this is rarely the case.

      There are 32 of these registers (independent, not shared with the FPU) which means you can chain together a pretty complex series of calculations without intermediate load/store sequences. The unit has multiple independent computation units with their own dispatch queues (details vary between specific processor models). Some AltiVec opcodes are designed to replace common series of multiple scalar instructions.

      The result is that speed ups of more than 16x are not at all rare. 30x is not uncommon in graphics manipulations; I would venture to say that 100x is "rarely the case." ;-)

    8. Re:16X increase? by Anonymous Coward · · Score: 0

      Yes, and don't forget vec_madd counts as two operations! But seriously, those particular segments will get an impressive speedup, but not all code for a particular application will benefit, somewhat pertaining to Amdahl's Law, so the run time at the end of the day won't be as much.
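The Amdahl's Law point above can be put into numbers. A small sketch, with `amdahl_speedup` as a hypothetical helper name; the fractions used in the note below are illustrative, not measurements of any real application.

```cpp
#include <cassert>

// Amdahl's Law: overall speedup when a fraction `p` of the run time
// (0 <= p <= 1) is accelerated by a factor `s` and the rest is unchanged.
inline double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

For instance, if half of a program's run time sat in code that vectorizes at 16x, `amdahl_speedup(0.5, 16.0)` gives roughly 1.88x overall, well short of 16x.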

    9. Re:16X increase? by Anonymous Coward · · Score: 0

      Do you know which architectures (Altivec, SSE, MMX, etc.) can be used with what data types? For scientific models, double precision floating point types are probably most relevant to consider (most real world quantities are floating point, and single precision is often just not enough). However, I know that many of the x86 extensions only cover integer math (MMX) or perhaps single precision. They also don't help accelerate trigonometry much, I think. Will Altivec speed up double precision floating point math?

    10. Re:16X increase? by Anonymous Coward · · Score: 0

      Will Altivec speed up double precision floating point math?

      Unfortunately not. Altivec cannot make use of double precision floating point numbers. It seems that Motorola was going for multimedia applications where that precision is not needed very much.

    11. Re:16X increase? by Anonymous Coward · · Score: 0

      However, SSE2 does support double precision floating point. AFAIK the only way in which Altivec is NOT inferior to SSE2, is some weird shuffling instruction.

    12. Re:16X increase? by Anonymous Coward · · Score: 0

      This is of course assuming that you're not already using an old optimization trick of using a word to store four bytes. (yes it requires more instructions that way, but usually not 4 times as much..)

    13. Re:16X increase? by Anonymous Coward · · Score: 0

      macstl does use the Expression Template technique; in fact I could only get that kind of speed-up with Expression Templates in combination with vectorizing code. Without expression templates, an expression like

      valarray v1 = sin (v2) + cos (v3)

      would be forced to make a temporary valarray for sin (v2) and cos (v3), so only the individual terms would be vectorized, not the entire expression.

      The valarray was built on top of the vec classes, which amazingly have almost zero overhead in supported compilers over the native vector types. This allows me to swap in a new SIMD architecture simply by adding 3000+ SLOC of code to interface the peculiar opcodes of the new architecture to what my valarray implementation would expect.

      Cheers,
      Glen Low, Pixelglow Software
      www.pixelglow.com
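For readers who have not used valarray, the expression in this comment can be tried with the standard library's own `std::valarray`. This sketch assumes macstl's valarray mirrors the standard interface, which the comment implies but this thread does not show; `sin_plus_cos` is a hypothetical wrapper name.

```cpp
#include <cmath>
#include <cstddef>
#include <valarray>

// Standard-library version of the expression from the comment.
// An implementation is allowed to return expression-template proxies from
// sin(), cos() and operator+, so whether temporaries appear depends on the
// library; nothing here is vectorized by the language itself.
inline std::valarray<float> sin_plus_cos(const std::valarray<float>& v2,
                                         const std::valarray<float>& v3) {
    std::valarray<float> v1 = std::sin(v2) + std::cos(v3);
    return v1;
}
```

The source line is the same whichever valarray implementation sits underneath, which is what makes a drop-in SIMD-backed replacement attractive.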

    14. Re:16X increase? by tepples · · Score: 1

      This is of course assuming that you're not already using an old optimization trick of using a word to store four bytes.

      I've written a GBA audio mixer that uses a technique similar to this, but the carry from byte to byte within a machine word often gets in the way of hardcore vectorization.
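The "four bytes in a word" trick and the carry problem mentioned above can be sketched as follows. This is a generic SIMD-within-a-register add, not the GBA mixer's code; `add_bytes_swar` is a hypothetical name.

```cpp
#include <cstdint>

// Add four packed 8-bit lanes held in one 32-bit word, each lane wrapping
// modulo 256. The top bit of every lane is masked off before the add so a
// carry can never cross into the neighbouring lane, then restored with XOR.
inline std::uint32_t add_bytes_swar(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t high = 0x80808080u;  // top bit of each lane
    const std::uint32_t low  = 0x7F7F7F7Fu;  // lower seven bits of each lane
    std::uint32_t sum = (a & low) + (b & low);  // carries stop at bit 7 of each lane
    return sum ^ ((a ^ b) & high);              // fold the top bits back in
}
```

A plain `a + b` on the same words would let an overflowing lane spill into the one above it. The masking costs a few extra operations per add, which is the "more instructions, but usually not 4 times as much" trade-off described earlier in the thread.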

  7. Long thread about using Altivec by ThousandStars · · Score: 4, Informative

    The Mac forum at Ars Technica has a long, continuing post about Altivec optimizations and how they should be used. The thread started more than two years ago and still gets relevant points and questions added to it. It's an amazing resource if you're interested in starting.

    1. Re:Long thread about using Altivec by Anonymous Coward · · Score: 0

      As useful as Altivec can be for some people, it doesn't support "double"s, as in double precision floating point values.

  8. So what? by Anonymous Coward · · Score: 0, Funny

    Big freaking deal. Hardware and software are irrelevant. It's all about content now.

    1. Re:So what? by Anonymous Coward · · Score: 0

      Let's see how you deliver that content without hardware or software.

    2. Re:So what? by Anonymous Coward · · Score: 0

      That's my point. The hardware and software are there to support content development and delivery.

      Name one area of computing that is the next "big thing". Networking: done. OSs: done. Chip design: done. Everything going on now is micro-iterations.

  9. Moore's Law has eroded the need for assembly by betelgeuse68 · · Score: 1, Interesting

    Moore's Law has eroded the need for such knowledge. It would be like concerning myself on how to design circuits to convert a DC current to AC current because I happen to use devices that use electricity, e.g., my toaster (as in bread).

    I learned assembly long ago, still retaining a fair amount of it (80x86). There have been a few occasions where I've called upon its use, yeah twice in the last eight years... and that's about it.

    Yes some people who write games are still concerned with assembly as are people in embedded markets. But those jobs, situations and skills are niche, much like the Win32 programming I used to do in the early 90's.

    90% of IT jobs are with non-tech companies. Those situations are about the last place you will find anyone caring about something called "assembly language."

    -M

    1. Re:Moore's Law has eroded the need for assembly by geoffspear · · Score: 2, Funny
      99% of all jobs in the world require no programming at all. Therefore, there is no need for anyone anywhere to learn C.

      90% of the world's people do not own cars. Therefore, there is no need for gas stations. If you pick a living human completely at random from the earth, chances are they don't drive one of these "car" things.

      --
      Don't blame me; I'm never given mod points.
    2. Re:Moore's Law has eroded the need for assembly by bonch · · Score: 1

      Yes some people who write games are still concerned with assembly as are people in embedded markets. But those jobs, situations and skills are niche, much like the Win32 programming I used to do in the early 90's.

      I don't consider Doom 3 to be a niche.

    3. Re:Moore's Law has eroded the need for assembly by Anonymous Coward · · Score: 0

      People who write compilers, JIT interpreters and emulators still need assembly. Very popular open source projects which the writers needed knowledge of assembly include GCC, Mono and Bochs.

    4. Re:Moore's Law has eroded the need for assembly by lowe0 · · Score: 2, Insightful

      Which is exactly why this sort of thing is so important.

      Sure, you could probably get it to work even faster with hand-tuned assembly than simply using this library. But programmer time is expensive, and customizing code adds complexity. By reusing optimized code, you can enjoy some of the benefits of SIMD without having to devote the same amount of resources.

      Let's be honest, this isn't a silver bullet - this isn't going to speed up code that doesn't use lots of floating-point vectors anyway. But if it does... (nearly) free performance is always a good thing.

    5. Re:Moore's Law has eroded the need for assembly by betelgeuse68 · · Score: 1

      Sure, and you and everyone you know is working on Doom3 or a competitor?

      Just because you use it, doesn't mean you engineer it.

      You use a TV... when was the last time you even thought of any of the electronics inside of it?

      -M

    6. Re:Moore's Law has eroded the need for assembly by Anonymous Coward · · Score: 0

      Don't underestimate the game industry or amount of Win32 programming. Although dotnet is coming very few have yet migrated to it. And MFC isn't a "do-it-all" solution either, Win32 programs still are very much necessary. Yes, there are a lot of applications created over webservers but that is just the most visible part of it. A huge amount of programming happens "in-house" or for embedded markets or somesuch.

    7. Re:Moore's Law has eroded the need for assembly by groomed · · Score: 3, Insightful

      Sorry, but yours is an utterly kneejerk boilerplate response which has nothing to do with the topic at hand and only serves to establish your credentials as a hard nosed realist who has been there and done it.

      Moore's Law has eroded the need for such knowledge

      Moore's "law" (which is just an off-the-cuff observation, really) has nothing to do with this. If anything, Moore's law has enabled transistor and space devouring SIMD technology.

      It would be like concerning myself on how to design circuits...

      No, it's nothing like that at all. Just because you own and know how to use money doesn't mean there is no point to the complex financial reckonings that are made every day at institutions all over the world. You may not need it, but you are not under discussion.

      Yes some people who write games are still concerned with assembly as are people in embedded markets. But those jobs, situations and skills are niche

      By this definition, everything is niche. The whole computing industry becomes "niche". Farming is "niche". The paper industry is "niche". What you're describing is just non-descript white collar administrative work which just happens to involve a computer; bit shuffling, rather than paper shuffling.

      Those situations are about the last place you will find anyone caring about something called "assembly language."

      Again, completely irrelevant.

      The point is that with a few dozen lines of SIMD code (whether in assembly or some high level language) any reasonably competent programmer can achieve four-fold, ten-fold, even twenty-fold speedups on critical path code, from scratch, in as little as a week.

      These are amazing results, and people should be encouraged to investigate the possibilities, not be dragged down into this drab netherworld of yours.

    8. Re:Moore's Law has eroded the need for assembly by afidel · · Score: 1

      Doing things like transcoding/encoding of multimedia content is one of those "niche" areas where assembly is still needed. Whether it takes 1.5 hours or 3 hours to transcode a movie is a BIG deal, especially if you have to do it many times to archive a library of old content. Sure most programmers won't need it, ever, but that's been pretty much true since we got high level languages and computers got more than a couple hundred K of RAM.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    9. Re:Moore's Law has eroded the need for assembly by Anonymous Coward · · Score: 0

      I hate this mindset. What people don't factor into the exponential speedup of Moore's law is that software complexity also increases at an exponential rate, and the slowness of one component is often magnified by other components that rely on it.

      For a given application, it is usually composed of a lot of higher level functions calling working functions which in turn call more functions, sometimes in other libraries, which eventually make system calls, which eventually interface with drivers.

      Assume that a given operating system and application is rewritten in the latest bloaty paradigm of code design. Every linear section of code takes 1.1 times longer to execute. This includes an increase in function calls to lower layers. For an application that only has to interface through 8 levels of functions before reaching the hardware, the speed of execution is already over 2x slower (1.1^8). You can see how even more complicated software can become increasingly slow as lower level routines are rewritten bloaty. A slowdown in a midlevel library can have tremendous effects on the performance of even well-coded applications. It is easy to accumulate dozens of levels of indirection with object oriented software, with greater than O(1) behavior in some of the function calls, and with real world problems, ignoring the constants in Big-O notation can have bad consequences as well. Just because all the operations are O(1) doesn't mean they're all twice as slow as they need to be.
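The compounding in the paragraph above is easy to check. A small sketch, with `compounded_slowdown` as a hypothetical helper name and the 1.1x-per-layer figure taken from the comment's own assumption:

```cpp
#include <cassert>

// Total slowdown when each of `layers` nested layers multiplies
// run time by the factor `per_layer`.
inline double compounded_slowdown(double per_layer, int layers) {
    assert(per_layer > 0.0 && layers >= 0);
    double total = 1.0;
    for (int i = 0; i < layers; ++i) total *= per_layer;
    return total;
}
```

`compounded_slowdown(1.1, 8)` comes out at about 2.14, matching the "over 2x slower" claim for eight layers.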

      In general, sloppy programming will eventually catch up with us. As programs get more complex we need better ways of writing them, not just faster ways of writing them.

    10. Re:Moore's Law has eroded the need for assembly by Crazy Eight · · Score: 1

      IIRC, D3 was written in C++.

  10. License issues by IO ERROR · · Score: 5, Informative
    Be careful; the "open source" license (PDF) is not GPL-compatible. I don't even think it's BSD-compatible on first reading.

    The Reciprocal Public License requires you to release all of your source code if you link to this library, even if your project is personal or used in-house only.

    --
    How am I supposed to fit a pithy, relevant quote into 120 characters?
    1. Re:License issues by voxlator · · Score: 2, Interesting

      True, but only if you don't purchase a license.

      Simple to understand; if you use it for free, you're expected to release your source code (i.e. the 'reciprocal' part of RPL). If you pay to use it, you don't have to release your source code.

      --#voxlator

    2. Re:License issues by IO ERROR · · Score: 3, Informative
      Simple to understand; if you use it for free, you're expected to release your source code (i.e. the 'reciprocal' part of RPL). If you pay to use it, you don't have to release your source code.

      True enough, but using the proprietary license makes it impossible to use this in existing projects without changing the license. Suddenly your open source project is either no longer open source, or doesn't look so attractive.

      One of the nicest features of the GPL (and, to be fair, of the BSD license) is that you do not have to release source code if you don't distribute your software. This RPL requires you to release your source code even if you don't distribute your software. And the proprietary license simply isn't appropriate for any type of open source project.

      The guy wants to get paid, and that's fine, I want to get paid, too. But he's got no business telling me I have to distribute my source code for an internal project that will never be distributed. He could easily have used a method similar to Trolltech's dual-licensing, but he chose instead to do something a whole lot more obnoxious.

      --
      How am I supposed to fit a pithy, relevant quote into 120 characters?
    3. Re:License issues by Anonymous Coward · · Score: 0
      requires you to release all of your source code
      It sounds like the GPL virus to me.
    4. Re:License issues by Anonymous Coward · · Score: 0

      I would suggest getting an IP lawyer to look at the license in your application. The FSF definition of "derivative work" appears to drive their negative outlook on this license.

      If you have an existing work that you can optionally combine with the RPL licensed software, it is unlikely that a court would consider your existing work a derivative of the RPL software.

      Of course, the ethical thing is to not use the software because such use would be against the wishes of the author.

    5. Re:License issues by IO+ERROR · · Score: 2, Informative
      It sounds like the GPL virus to me.

      Look, a troll! The GPL doesn't require you to release your code, unless you distribute it. This RPL thing requires you to release your code, even if you don't distribute it. I've discussed the linking issue elsewhere.

      --
      How am I supposed to fit a pithy, relevant quote into 120 characters?
    6. Re:License issues by RupW · · Score: 1

      The Reciprocal Public License requires you to release all of your source code if you link to this library, even if your project is personal or used in-house only.

      IANAL, but I read the intent as "if you improve macstl you have to publish your changes to macstl" not "if you link macstl you have to publish source to the entire project".

      Obviously I can't say which one matches the legalese.

    7. Re:License issues by IO+ERROR · · Score: 1
      If you have an existing work that you can optionally combine with the RPL licensed software, it is unlikely that a court would consider your existing work a derivative of the RPL software.

      With C++ templates this is a very thorny issue. When your code instantiates the template, the library code is very inextricably an integral part of your code, and not easily (if at all) separable. This might be a different issue if it were a C library you could just call through an API.

      Currently under the GPL/LGPL this situation requires a special exception in the template library's license.
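
      To make the separability point concrete, here is a minimal sketch in plain C++. The names `vlib::scaled_sum` and `my_app_total` are hypothetical stand-ins and have nothing to do with macstl's actual headers. It only illustrates that a template has no compiled form of its own until the calling code instantiates it.

```cpp
#include <cassert>
#include <cstddef>

// --- Stand-in for a header-only template library (hypothetical, not macstl) ---
namespace vlib {
    // No object code exists for this until somebody instantiates it.
    template <typename T, std::size_t N>
    T scaled_sum(const T (&data)[N], T factor) {
        T total = T();
        for (std::size_t i = 0; i < N; ++i)
            total += data[i] * factor;
        return total;
    }
}

// --- Stand-in for application code ---
// This call makes the compiler generate vlib::scaled_sum<int, 3> inside the
// application's own translation unit; there is no separate library binary
// to link against or swap out.
int my_app_total() {
    const int samples[3] = {1, 2, 3};
    return vlib::scaled_sum(samples, 10);
}

// Quick check of the sketch: 10 * (1 + 2 + 3).
void check_my_app() {
    assert(my_app_total() == 60);
}
```

      Whether that amounts to a derivative work is still a legal question rather than a technical one; the sketch only shows why the "just call it through an API" argument is harder to make for templates.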

      --
      How am I supposed to fit a pithy, relevant quote into 120 characters?
    8. Re:License issues by Anonymous Coward · · Score: 0

      Very thorny. Templates through a preexisting runtime linkage mechanism, though...

      My research resulted in an understanding of derivative in line with what Linus said about the NVIDIA graphics driver.

      I'd like to see a court case take this family of issues on directly. Settled law in this area would save lots of lawyer fees.

    9. Re:License issues by jemfinch · · Score: 1
      I don't even think it's BSD-compatible on first reading.

      What, does it require you to remove other copyright notices on the file? Does it require that the name of the author be used in advertisement of software that uses it? Does it require that the author be liable for damages?

      If not, it's BSD compatible. You'll be hard-pressed to find a license that's not BSD compatible.

      Jeremy
  11. oops by Anonymous Coward · · Score: 1, Informative
  12. Moore's Law has nothing to do with assembly by Anonymous Coward · · Score: 2, Insightful

    Moore's Law has eroded the need for assembly

    Moore's Law has nothing to do with assembly language and optimizations. From Wikipedia:

    Moore's law is an empirical observation stating, in effect, that at our rate of technological development and advances in the semiconductor industry, the complexity of integrated circuits doubles every 18 months.

    I wish people would stop saying "But Moore's Law..." for every hardware-related story on Slashdot. Do a bit of reading, please.

    1. Re:Moore's Law has nothing to do with assembly by asliarun · · Score: 1

      You misunderstood.

      >> Moore's Law has eroded the need for assembly

      > Moore's Law has nothing to do with assembly language and optimizations. From Wikipedia:...

      The grandparent was saying that because processor speeds have increased to such an extent (Moore's Law), it doesn't make sense to use assembly to write modern code; even if the assembly code is faster.

    2. Re:Moore's Law has nothing to do with assembly by tzanger · · Score: 1

      Moore's Law has absolutely nothing to do with the speed of integrated circuits; it is about the complexity of the designs doubling roughly every 18 months. Complexity doesn't necessarily mean speed.

  13. Talk about incoherent postings by Dikeman · · Score: 0, Redundant

    I take pride in the fact that I didn't understand a word of this post.

  14. About the RPL by pavon · · Score: 4, Informative

    The RPL (Reciprocal Public License) is an odd choice for this project. It is an even stronger viral copy-left than the GPL, to the point where the FSF takes issue with it. If you create a derivative work you are required to 1) Notify the original author, and 2) Publish your changes even if you only use the program in house. Furthermore, their definition of derivative work is much, much broader than the "linking" definition that the GPL uses.

    The fact that it puts these additional requirements / restrictions on the user makes it incompatible with the GPL. In fact, considering the requirements placed on you by the license, I would expect that you will have difficulty incorporating this RPL library into any existing FLOSS project without running into license conflicts. The only thing I can see this being useful for is a new project that you don't mind releasing under the RPL, or with existing BSD style licensed code which you dual license as BSD/RPL (since BSD can be included in anything).

    So this library does not appear to be very usable for the FLOSS world, although if you want to license it for proprietary software you may.

    1. Re:About the RPL by geoffspear · · Score: 2, Informative
      Clearly, we need to get everyone in the world to download the source, make one superficial change, and email the entire thing back to the original developer.

      And what happens if the original developer dies? Is everyone prohibited from using his code until the copyright runs out in 95 years, as they can't notify him of changes?

      --
      Don't blame me; I'm never given mod points.
    2. Re:About the RPL by MenTaLguY · · Score: 1

      And what happens if the original developer dies? Is everyone prohibited from using his code until the copyright runs out in 95 years, as they can't notify him of changes?

      Yes, unless he has an identifiable successor-in-interest.

      --

      DNA just wants to be free...
    3. Re:About the RPL by Anonymous Coward · · Score: 0

      > So this library does not appear to be very usable for the FLOSS world, although if you want to license it for proprietary software you may.

      Yes, clearly the world of dental hygiene is not ready for such a radical license! I wonder if he means Free / Open Source Software, though what the additional "L" stands for is anyone's guess...

    4. Re:About the RPL by Baldrson · · Score: 1
      The fact that it puts these additional requirements / restrictions on the user makes it incompatible with the GPL.

      It's no more incompatible than is a class that overrides a method of a superclass "incompatible" with that superclass. In this instance, the release "method" is more strict.

    5. Re:About the RPL by HeghmoH · · Score: 1

      #1 is understandable, if odd, but #2 is just ridiculous. In-house use doesn't fall under copyright protection to begin with, so how can the RPL regulate it?

      --
      Mod down posts with a "Free Mac Mini/iPod" sig, they're spam!
    6. Re:About the RPL by pavon · · Score: 1

      Yes, clearly the world of dental hygiene is not ready for such a radical license! I wonder if he means Free / Open Source Software, though what the additional "L" stands for is anyone's guess...

      Libre . Although it is understandable why many people have decided to drop the L from that acronym :)

    7. Re:About the RPL by CableModemSniper · · Score: 1

      You might not realize it, but that example is actually agreeing with your parent. A subclass that does that is breaking the substitution principle.
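
      For anyone who hasn't run into the substitution principle, a rough C++ sketch of the analogy follows. `License` and `StricterLicense` are hypothetical classes invented for illustration, not a model of the actual license texts. The derived class strengthens a precondition, so code written against the base contract stops working when handed the subtype.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical classes, named after the analogy in the thread.
struct License {
    virtual ~License() {}
    // Base contract: keeping a modified copy in-house is always allowed.
    virtual void use_in_house(bool source_published) const {
        (void)source_published;  // the base contract ignores this
    }
};

struct StricterLicense : License {
    // Override with a stronger precondition: callers must publish first.
    void use_in_house(bool source_published) const {
        if (!source_published)
            throw std::logic_error("must publish source before in-house use");
    }
};

// Code written against the base contract only.
bool works_under(const License& lic) {
    try {
        lic.use_in_house(false);  // legal according to License
        return true;
    } catch (const std::logic_error&) {
        return false;             // the subtype rejected a valid base call
    }
}

// Quick check of the sketch.
void check_substitution() {
    assert(works_under(License()));
    assert(!works_under(StricterLicense()));
}
```

      A caller that was fine with the base class breaks with the stricter subclass, which is the sense in which the stricter thing is not a drop-in replacement.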

      --
      Why not fork?
    8. Re:About the RPL by phliar · · Score: 1
      In-house use doesn't fall under copyright protection to begin with
      False. You may be confusing in-house use with the doctrine of fair use.
      --
      Unlimited growth == Cancer.
    9. Re:About the RPL by HeghmoH · · Score: 1

      You're right, I wasn't thinking. Wide-scale internal use would in fact be governed by the RPL. Small-scale use that fell under fair use would not.

      --
      Mod down posts with a "Free Mac Mini/iPod" sig, they're spam!
    10. Re:About the RPL by Abcd1234 · · Score: 1

      Actually, I believe it's the other way around. i.e., the FOSS acronym existed first, and someone thought "But that's not, like, a word, and not *nearly* redundant enough. It needs something... another letter, I think. Hey, what is this 'french' dictionary, I see here? Don't they make fries or something? Ahh well, let's check it out. Hey, here we go: 'libre'! Walla... FLOSS!" And so another stupid acronym was born...

    11. Re:About the RPL by Anonymous Coward · · Score: 0

      But what if you honestly believe that

      1) There is a heaven
      2) They must have email in heaven, for it to qualify as heaven

      Then, you could still notify the deceased developer, and use the RPL library, right?

  15. Black Art? Uh... by arekusu · · Score: 3, Interesting

    "...the black art of assembly language magicians."

    The nice thing about altivec is that it has a C interface. You don't have to use assembly!

    Take a look at this Apple tutorial to see how easy it is.

    1. Re:Black Art? Uh... by Leo+McGarry · · Score: 3, Funny

      Yes, I think the person who wrote the summary revealed a little more of his own ignorance than he meant to. I don't consider calling "vec_add" inside a loop to be a black art.

    2. Re:Black Art? Uh... by dsci · · Score: 1

      Also, the VectorC compiler by CodePlay is a C compiler that can generate SIMD code for MMX, SSE and 3DNow!.
      But really, at the end of the day, what's so bad about assembly? I mean, if you inline only those (relatively small parts) you need to optimize, and let the C compiler handle all the symbol table stuff, it's not that bad. We're not talking about developing a full app, including GUI, in straight Assembly from scratch.

      --
      Computational Chemistry products and services.
    3. Re:Black Art? Uh... by Paradox · · Score: 1

      Yeah, the C library is out there, and it's not too hard to use. :)

      But one could counter that even in the C library, unless you know what you're doing, you may not get as dramatic a speedup as you wanted. Until I looked at several of Apple's examples, I couldn't write altivec-aware code properly (i.e. maximum performance benefit).

      Once I knew what I was doing I went back and redid the code, and it ran much faster. So it is still tricky to maximize your bang-for-buck.

      --
      Slashdot. It's Not For Common Sense
    4. Re:Black Art? Uh... by julesh · · Score: 1

      The nice thing about altivec is that it has a C interface. You don't have to use assembly!

      MS's compilers have a similar interface to MMX/SSE. I think the advantage of this project is that it's an abstraction layer that can use either.
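
      As a rough idea of what such an abstraction has to hide, here is a sketch of one function with an SSE path and a scalar fallback chosen at compile time. `add_floats` and `SKETCH_USE_SSE` are made-up names for illustration and are not macstl's API; the `_mm_*` calls are the standard SSE intrinsics from `<xmmintrin.h>`. An AltiVec branch built on the `vec_add` mentioned earlier in the thread could presumably be added the same way, but it is left out here.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_add_ps, ...
  #define SKETCH_USE_SSE 1
#else
  #define SKETCH_USE_SSE 0
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
void add_floats(const float* a, const float* b, float* out, std::size_t n) {
    assert(a && b && out);
    std::size_t i = 0;
#if SKETCH_USE_SSE
    // Four floats per iteration; the unaligned load/store forms avoid
    // any 16-byte alignment requirement on the caller's buffers.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar loop: handles the tail, or everything when SSE is unavailable.
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

      Callers just write `add_floats(a, b, out, n)` and never see which path was compiled in. A template library can go further and pick the path per element type, but the basic trick is the same.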

  16. More source-distro goodness to follow? by Progman3K · · Score: 1

    Does this mean we can expect source Linux distros to start taking advantage of this?

    I know I'll sound like a wannabe leet for saying this, but I already really like my Gentoo workstation because it is a stage1 install (all from source), and I expect this will only make it even faster!

    Yay!

    --
    I don't know the meaning of the word 'don't' - J
    1. Re:More source-distro goodness to follow? by ykardia · · Score: 1

      If you are using Gentoo, there is an "icc" useflag that allows using the Intel Compiler for code that supports this. This compiler already automatically vectorizes your code to work with the Pentium SIMD units (SSE, SSE2 etc).

      The speedup is probably not as large as the one you would get from hand-coded libraries, but it can be quite significant (certain things can run up to twice as fast from my experience)

    2. Re:More source-distro goodness to follow? by Lussarn · · Score: 1

      Any particular ebuilds I can test this on?

    3. Re:More source-distro goodness to follow? by Anonymous Coward · · Score: 0

      GCC 4.0 (now in beta) can vectorize loops, etc.

    4. Re:More source-distro goodness to follow? by Anonymous Coward · · Score: 0

      Wow I bow to your l33tness.

      Going from -O2 to -O99 really sped things up, didn't it?!

    5. Re:More source-distro goodness to follow? by ykardia · · Score: 1
      Try

      cd /usr/portage
      find . -name '*.ebuild' -exec egrep -H "IUSE.*\Wicc\W.*" {} \;

      Go make yourself a coffee - when you come back you should have an idea of which ebuilds take the flag. There might be a more sophisticated way of finding out which ebuilds take that flag, but I can't think of it.
    6. Re:More source-distro goodness to follow? by Lussarn · · Score: 1

      thx..

  17. Too expensive? by saddino · · Score: 1

    Sounds great, but $2499 for a redistributable binary? Ouch.

    1. Re:Too expensive? by voxlator · · Score: 2, Insightful

      In the corporate world, is it more expensive than paying a developer to design, code, test, and maintain a home-grown version?

      Once you've paid a $30/hour developer for 10 days' work, you've forked out ~ $2,500...

      --#voxlator

    2. Re:Too expensive? by saddino · · Score: 1

      If the question was "Do I hire my own programmer or buy this technology?" then you would be correct.

      But, given this is an optimization and replacement for STL then the question is "Do I just live with STL, or buy this technology?"

      In other words, it isn't an essential development cost, it's an extra (I imagine most interested parties already have shipping apps that use STL).

      And at this price point, IMHO, I think the answer may be "if it ain't broke, don't fix it."

  18. Slides about SIMD by quigonn · · Score: 2, Informative

    A bit OT, but nevertheless quite interesting to read and it contains information about SIMD instruction sets other than just MMX/SSE: http://www.fefe.de/ccccamp2003-simd.pdf

    --
    A monkey is doing the real work for me.
  19. Assembly or C++? by nagora · · Score: 1
    I'll take the Assembly Language, thanks. Especially on such a nice processor.

    TWW

    --
    "Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
    1. Re:Assembly or C++? by nagora · · Score: 1
      Especially on such a nice processor as the PowerPC, that is. Sheesh.

      TWW

      --
      "Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
  20. Autovectorization being add in GCC 4.0 by shawnce · · Score: 5, Interesting

    For those who don't already know, autovectorization is being worked on for GCC by folks from IBM and others.

    GCC vectorization project (site seems offline atm), but the abstract from a recent GCC summit is up.

    Autovectorization Talk (google html view of pdf)

    1. Re:Autovectorization being add in GCC 4.0 by Anonymous Coward · · Score: 0

      Thanks for the information

      A gentoo user that is going to unmask gcc 4.0 and test it :)

    2. Re:Autovectorization being add in GCC 4.0 by TedCheshireAcad · · Score: 1

      If you're serious about performance, use XLC. GCC is great if you're cheap, but it's kind of like putting monster truck tires on a Ferrari.

    3. Re:Autovectorization being add in GCC 4.0 by joib · · Score: 1

      Yes, the new SSA architecture in GCC 4.0 allows for autovectorization, but at the moment the focus is on getting GCC 4.0 sufficiently stable for release in a few months. Because of this, IIRC, some of the fancier vectorization passes were deferred until GCC 4.1.

      So yes, you might see some performance improvements due to vectorization in 4.0, but you'll have to wait until 4.1 or maybe even 4.2 before you'll see the full potential of it.

      -joib, occasional GCC contributor (although I have absolutely zilch to do with the mid- and backend stuff, where most of the optimization passes are done)

    4. Re:Autovectorization being add in GCC 4.0 by Anonymous Coward · · Score: 0

      Which implies that it oughta be pretty simple to compile Linux for the Cell and make good use of it...we could end up with the first Cell desktop being a Linux system, and way faster than competing desktops. Or at least, Linux running on Playstation 3 before Cell desktops arrive...

    5. Re:Autovectorization being add in GCC 4.0 by bani · · Score: 1

      even without vectorization, the performance improvements in gcc4 are impressive.

      unfortunately some of the regressions are impressive as well :-/

    6. Re:Autovectorization being add in GCC 4.0 by Anonymous Coward · · Score: 0

      Thanks. There's no way I would have comprehended such an abstract statement as "it produces slow code" without the painfully retarded simile.

    7. Re:Autovectorization being add in GCC 4.0 by jd · · Score: 1
      That would be interesting. Unfortunately, as far as I can tell, XLC only runs on AIX and therefore only on IBM's medium to big iron. This parallelization seems to be available for small machines and games boxes.


      Personally, I think GCC is running into design limitations, to judge from the problems they're having getting GCC 4 to compile as well as GCC 3. In the builds that come with Fedora of both GCC 3 and GCC 4, I'm not seeing Fortran 90, although the ADA extension seems to be coming along nicely.


      What I would like to see (naive and optimistic that I am) is for commercial vendors such as the Portland Group, Intel, IBM, etc., to look at GCC as a way to outsource the more mundane work, and to re-fashion their own compilers as GCC extensions.


      That way, they'd have more manpower to work on the new stuff, the stuff that makes their product different and therefore marketable. GCC could then focus on the generic stuff, the stuff they are good at, instead of working on optimization code that they are going round in circles on.


      The vendors would end up with a greater range of compilers (anything you can stuff a front-end on for GCC) and a greater range of target machines, therefore potentially having a larger audience.


      The GCC group would end up with a lot of refinements that they'd never have thought of otherwise and probably wouldn't be up to coding if they did.


      By making GCC "middleware", and having proper pluggable language front-ends, and proper pluggable targets, I think a lot of people could gain substantially. The only ones who'd lose are the ones who don't have a product in the first place, and they'll lose in time anyway.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    8. Re:Autovectorization being add in GCC 4.0 by HuguesT · · Score: 1

      XLC is actually available for MacOS/X.

      It works relatively well but I've found the XLC code to be less reliable than the FSF GCC one (i.e more compiler-related bugs). The code it produces is not very much better than the one produced by Apple's GCC. Apple's GCC also has reliability problems.

  21. It's in the compiler by Mad+Hughagi · · Score: 2, Informative

    Vectorization (SIMD) is built into the Intel compiler. There is no need to hack in assembly as the compiler will do it for you. This is the case with most vendor supplied compilers, as they want to fully exploit their hardware functionality.

    The problem is bringing this functionality to open source compilers, which, as far as I know, do not even have an OpenMP (threading) implementation, let alone internal vectorization.

    --
    UBU
    1. Re:It's in the compiler by nonmaskable · · Score: 1

      It is built in but you don't automagically get full benefit unless you design your data structures and algorithms appropriately. In my case, I got no measurable benefit until I did a fairly extensive redesign.

      Intel has a great book on performance tuning that has been extremely helpful, as has Intel's VTune.

    2. Re:It's in the compiler by spitzak · · Score: 1

      With no changes to our code, but turning on most of the switches to the Linux Intel compiler, I got a huge number of "loop was vectorized" messages, and the resulting code was sped up almost 20% (versus only 5% for the Intel compiler with no switches other than -O5). Now it is quite likely that more speedup is possible, but it appears the Intel compiler was quite able to recognize and vectorize code that was not designed for it. (ps the code is floating-point image processing, with repetitive operations done to huge arrays of 32-bit floats, so it may be well-suited to their vectorization)
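
      For a feel of what "well-suited" tends to mean, here is a small sketch with two hypothetical kernels (`scale_offset` and `running_sum` are made-up names, not anything from the code described above). The first has the shape auto-vectorizers generally like; the second carries a dependence from one iteration to the next and usually stays scalar as written. Whether a given compiler actually vectorizes either one depends on the compiler, version and flags, so treat the comments as expectations rather than guarantees.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical image-style kernel: scale and offset every sample.
// Unit stride, no branches, no dependence between iterations, and a trip
// count known on loop entry: the shape vectorizers tend to handle.
void scale_offset(float* pixels, std::size_t count, float gain, float bias) {
    assert(pixels || count == 0);
    for (std::size_t i = 0; i < count; ++i)
        pixels[i] = pixels[i] * gain + bias;
}

// Counter-example: each iteration reads the previous result, so the loop
// carries a dependence and generally cannot be vectorized as written.
void running_sum(float* pixels, std::size_t count) {
    assert(pixels || count == 0);
    for (std::size_t i = 1; i < count; ++i)
        pixels[i] += pixels[i - 1];
}
```

      With GCC 4.0 and later the relevant switch is `-ftree-vectorize`; the Intel compiler reports its decisions with the "loop was vectorized" messages mentioned above.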

    3. Re:It's in the compiler by Sebastopol · · Score: 1


      Actually, you DO get automagical compiler speedup. In some cases it can identify vector-izable (is that a word?) loops and promote them to SIMD operations.

      But yes, otherwise, you need to re-code if the compiler doesn't take the hint, especially in structures/classes. The only objection I have to the Intel intrinsics is they don't look pretty! ;-)

      I haven't used VTune since circa 1998, and it had this awesome feature that would point out boneheaded things in your code. One interesting suggestion it made: it noticed I wasn't using return values despite returning them; when I removed them, I got a 25% performance boost in some critical code. Made me feel like an eeeeediot.

      --
      https://www.accountkiller.com/removal-requested
    4. Re:It's in the compiler by nonmaskable · · Score: 1

      Automagical only if it can make the identification; there are several things that can prevent it from doing so, and I managed to do several of them. VTune helps a lot with code like this - I've spent many happy hours tracking down hotspots with it.

    5. Re:It's in the compiler by Anonymous Coward · · Score: 0

      There is a limit on what a compiler can do. For the vast majority, there is no need to go to assembly language. For a select few, assembly language provides the extra boost of performance that they must squeeze from the hardware. Assuming you know what you are doing, of course.

  22. Licensing scheme acceptable? by Anonymous Coward · · Score: 0

    I had a look at their website and I am a bit sceptical about the licensing scheme. Why can't they just be upfront and GPL it?

  23. Stupid license by Anonymous Coward · · Score: 0

    It's a stupid license because it's impossible to enforce. It's like trying to enforce a peculiar moral code with a EULA.

    Note that I'm a fan of the GPL, and I think the aim of it is entirely in concert with the type of rules programmers have followed for over 40 years.

  24. already exists by jeif1k · · Score: 2, Informative

    SIMD support already exists, in the form of C, C++, and Fortran libraries (usually, as a small part of larger numerical libraries), as well as in language constructs in languages like Fortran.

    1. Re:already exists by jkujawa · · Score: 1

      The point of MacSTL is it's portable to both PPC and Intel. You can make a portable SIMD-optimized program.

    2. Re:already exists by jeif1k · · Score: 1

      That's not much of a point. E.g., Atlas libraries support many different CPUs, and so do Fortran compilers. Many of those tools are already mature, widely used, conform to open standards, and are free even for commercial use.

  25. Assembly by bsd4me · · Score: 2, Insightful

    Even in embedded systems, assembly isn't used as much as it used to be. It still gets used in bootloaders, and sometimes in device drivers. However, most devices are memory mapped, and most of the driver is written in C, and asm() calls are made when appropriate (eg, asm("eieio");), especially when you get to use gcc and asm() syntax for accessing variables.

    --

    (S(SKK)(SKK))(S(SKK)(SKK))

  26. The future by johnhennessy · · Score: 3, Insightful

    Surely people can now start to see where the future lies - from a performance viewpoint. We've reached the end of the clocking "free lunch" (see http://www.gotw.ca/publications/concurrency-ddj.htm).

    The way forward is turning the CPU (of a traditional architecture) into a Nanny for a range of various dedicated processing units. IBM saw this years ago, and thus began the whole Cell architecture - but I suspect that their job was much easier. The software that would run on the platform they are designing is fairly specific - games & multimedia which usually lend themselves well to vectorization.

    The real challenge for architects (in my humble opinion) will be applying the same technique to other system bottlenecks.

    AMD's (and now Intel's) approach of cramming more and more processing cores onto an IC might pay off in the short term, but like the "free lunch" of clock speed, will hit a roadblock when issues like memory bandwidth and caching schemes just have too much work to do with 4 or 8 processing cores hacking at it all the time.

    --
    [ Monday is a terrible way to spend one seventh of your life. ]
    1. Re:The future by Rinikusu · · Score: 1

      Isn't that pretty much what the Amiga was doing a couple decades ago? The CPU was merely a traffic cop, directing other specialized units to actually do the real work? If so, they're a bit late to the party, eh?

      --
      If you were me, you'd be good lookin'. - six string samurai
    2. Re:The future by Duncan3 · · Score: 1

      Even Amiga was a couple of decades late to the party. Vector processors have been around for a looooooooooooong time.

      Still, it makes good headlines even today ;)

      --
      - Adam L. Beberg - The Cosm Project - http://www.mithral.com/
    3. Re:The future by jeif1k · · Score: 1

      IBM saw this years ago, and thus began the whole Cell architecture

      This was completely obvious to everybody in the 1980's. What was surprising to most people was that Intel managed to succeed with the x86 architecture for so long without innovation.

      AMD's (and now Intel's) approach of cramming more and more processing cores onto an IC might pay off in the short term, but like the "free lunch" of clock speed, will hit a roadblock when issues like memory bandwidth and caching schemes

      And how do you think Cell addresses that? Right now, it's still just a lot of CPUs on a chip, with the same memory bottleneck as everybody else.

      In fact, AMD and Intel probably are at a sweet spot with their processors where CPU power and memory bandwidth are fairly evenly matched for most applications. Cell, if anything, probably makes a bad tradeoff for real-world apps.

    4. Re:The future by johnhennessy · · Score: 1

      Ahhh Amiga,

      I spent most of the 80's watching day-time TV.

      I'm still trying to figure out which was the better option (TV or Computers).

      --
      [ Monday is a terrible way to spend one seventh of your life. ]
  27. Isn't it what std::valarray is for? by 21mhz · · Score: 1

    Reading this reminded me about that portion of the standard C++ library which is all about operations on vector data. So, my question is: could an std::valarray specialization for processor-supported types serve as a basis for portable SIMD support in C++?

    --
    My exception safety is -fno-exceptions.
    1. Re:Isn't it what std::valarray is for? by kuwan · · Score: 2, Insightful

      So, my question is: could an std::valarray specialization for processor-supported types serve as a basis for portable SIMD support in C++?

      That's exactly what this is. If you read the part on his website about valarray then you'll see that it does extensive SIMD optimizations for valarray for both Altivec and MMX/SSE/SSE2/SSE3 platforms. He's even added "parallelized algorithms such as integer division, trigonometric functions and complex number arithmetic" which you'd have to code yourself in either assembly or using the C-based intrinsics if you wanted to do the SIMD programming by hand.

      So basically, this allows you to code using std::valarray using normal C++ and then plug this in under the hood to get a nice speed boost.
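
      For anyone who hasn't touched valarray, this is the kind of code in question. It is a small sketch in plain standard C++ and nothing in it is macstl-specific; `saxpy` and `dot` are just illustrative names. Whether macstl's valarray is a literal drop-in for code like this is an assumption based on the description above, not something verified here.

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// y = a*x + y over whole arrays, with no explicit loop.
std::valarray<float> saxpy(float a,
                           const std::valarray<float>& x,
                           const std::valarray<float>& y) {
    assert(x.size() == y.size());  // element-wise ops need equal lengths
    return a * x + y;
}

// Dot product via element-wise multiply and the built-in sum().
float dot(const std::valarray<float>& x, const std::valarray<float>& y) {
    assert(x.size() == y.size());
    std::valarray<float> products = x * y;
    return products.sum();
}
```

      The whole-array expressions are what give a SIMD-aware implementation room to work: there is no hand-written loop for it to second-guess.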

      --
      Join the Pyramid - Free Mini Mac

    2. Re:Isn't it what std::valarray is for? by emarkp · · Score: 1

      I guess you didn't notice: http://www.pixelglow.com/macstl/valarray/.

  28. Other way around by Kiryat+Malachi · · Score: 1

    Freescale, nee Motorola. (Nee roughly translates to "formerly known as").

    --

    ---
    Mod me down, you fucking twits. Go ahead. I dare you.
    (I read with sigs off.)
    1. Re:Other way around by Anonymous Coward · · Score: 1, Informative

      Or born, like the french word it is: née.

      No need for anyone to whip out the online dictionary and tell me "formerly known as" is an acceptable alternative.

    2. Re:Other way around by Anonymous Coward · · Score: 0

      ... or 'maiden name' as it applies to married women a lot

    3. Re:Other way around by Anonymous Coward · · Score: 0

      Maiden name=="Eddie"
      u r teh st00p3d

    4. Re:Other way around by Anonymous Coward · · Score: 0

      Actually, 'nee' is Dutch for 'no'. No rough translation, but an actual and fully accurate translation.

      Remember "What is your favorite color. Blue, no Red!"

      Now, which was his actual and current favorite color? Blue or Red?

    5. Re:Other way around by Kiryat+Malachi · · Score: 1

      'Nee' may be Dutch for 'no', but that has nothing to do with the usage I was correcting. In that usage, it's based on French, where it is the feminine form of the past participle of the verb "to be born". Thus, it is literally translated as "born as". However, the meaning it has acquired in English usage is better served by the definition "formerly known as".

      --

      ---
      Mod me down, you fucking twits. Go ahead. I dare you.
      (I read with sigs off.)
    6. Re:Other way around by Kiryat+Malachi · · Score: 1

      Well, considering it is applied (by English speakers) to objects being renamed, the use of the word "born" is a bit suspect.

      Products aren't born, nor are companies. And yet, we see usages like Freescale, née Motorola, or "the Acura SLX (nee Isuzu Trooper)." That last was from the New York Times. As such, while the *French* word means "born", the English usage is far better represented by "formerly known as".

      We've co-opted parts of your language, Frenchie. Get over it.

      --

      ---
      Mod me down, you fucking twits. Go ahead. I dare you.
      (I read with sigs off.)
  29. Read the Altivec mailing list by kuwan · · Score: 4, Informative

    A better resource for Altivec and SIMD in general is the SIMDtech.org website and Altivec mailing list. There are tutorials and technical manuals available and the email list is indispensable. While the mailing list is mostly geared towards Altivec optimizations and discussions, all SIMD discussion is welcome, including MMX/SSE. There are Apple engineers that read and contribute to the list as well as Motorola/Freescale engineers. It's probably the single best resource available to Altivec programmers and you get to talk directly to the Wizards that created it.

    I'm a relative newcomer to the list and it's been an invaluable resource as I've optimized with Altivec.

    --
    Join the Pyramid - Free Mini Mac

  30. Assembly-DSPs by Anonymous Coward · · Score: 0

    "Even in embedded systems, assembly isn't used as much as it used to. "

    It is when programming DSP's (and related devices). Don't forget that microcontrollers outnumber microprocessors by a large margin. And performance is important there (especially automotive and aeronautics.)

    1. Re:Assembly-DSPs by bsd4me · · Score: 1

      It is when programming DSP's (and related devices).

      From my experience, yes and no. Fixed-point DSP tends to be done in assembly, mainly because fixed-point techniques don't translate well to C. The compilers also tend to suck. A fair to large amount of floating-point DSP is done with C when the compiler support is good. I have done a lot of floating-point DSP, and we found that the write in C, refine in ASM workflow was best.

      Don't forget that microcontrollers outnumber microprocessors by a large margin.

      That is true. I usually refer to this as "high" embedded versus "low" embedded systems. Along with DSP, I have spent my career mainly working on large, embedded systems running on microprocessors (Mot 68k, PowerPC, and some MIPS) under control of an RTOS. In this application, assembly doesn't get used as much as you think even when you are dealing with hard realtime requirements.
      --

      (S(SKK)(SKK))(S(SKK)(SKK))

  31. OS X Tiger will do it for you by jilbert · · Score: 2, Interesting

    Tiger, the next OS release from Apple, will take care of vector optimization automatically in their version of gcc 4.0. I guess this will make it into the public gcc too.

    1. Re:OS X Tiger will do it for you by Junks+Jerzey · · Score: 1

      Tiger, the next OS release from Apple, will take care of vector optimization automatically [apple.com] in their version of gcc 4.0. I guess this will make it into the public gcc too.

      For the record, this has been in Intel's C compiler for years now. It's also in the current release of the Microsoft Visual C++ compiler, including the free download version.

    2. Re:OS X Tiger will do it for you by be-fan · · Score: 4, Informative

      Actually, Apple's Tiger will get an auto-vectorizing compiler courtesy of the public GCC 4.0 release. The auto-vectorizer wasn't developed in Apple's version of GCC. IBM's GCC team at the Haifa Research Lab developed the vectorizer in the public LNO (loop nest optimization) branch of GCC 4.0. I'm not trying to minimize Apple's contribution here, one of their developers did work on the team, but let's give credit where credit is due.
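
      For anyone wondering what kind of code an auto-vectorizer is aimed at, here is a minimal sketch. The function name `saxpy` and the use of `__restrict` are my own illustration, not anything from the GCC or Apple announcements. GCC 4.0 exposes the vectorizer through `-ftree-vectorize` (with `-maltivec` on PowerPC); whether a particular loop actually gets vectorized depends on the compiler version and target.

```cpp
#include <cstddef>

// y[i] = a * x[i] + y[i].
// Unit stride, a simple trip count and no dependence between iterations:
// the loop shape an auto-vectorizer is designed to recognise.
// __restrict is a common compiler extension (GCC, Clang, MSVC), not
// standard C++; it promises the compiler that x and y do not overlap.
void saxpy(std::size_t n, float a,
           const float* __restrict x, float* __restrict y) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

      The source stays plain scalar C++; any SIMD instructions come from the compiler, so the same file still builds and runs correctly where no vector unit is available.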

      --
      A deep unwavering belief is a sure sign you're missing something...
    3. Re:OS X Tiger will do it for you by johnnyb · · Score: 1

      Watch out, it's the Loop Nest Monster!

  32. Q for VMX/3D/OpenGL software developers: by tubbtubb · · Score: 1

    This is public now, so I can talk about it--
    I worked on extending the accuracy and continuity of the VMX instruction vexptefp, see the patent application here
    My understanding is that this instruction is used to compute Phong/specular hilights, and that previous implementations of this instruction were unusable because the lack of accuracy and continuity made it visually undesirable. We were able to improve the algorithm enough to be visually indistinguishable from a fully accurate non-estimate.
    Can any software developers that use this instruction comment on this?
    Is Phong hilighting mostly done on GPUs now?

  33. From the limewire... by WilyCoder · · Score: 3, Interesting

    As two of my professors have stated in class, SIMD and moreso parallel processing will require programmers to think in a fundamentally different way in order for multi-core/multi-processor to really take off.

    This project may be a step in the right direction. Benchmarks show that SIMD such as SSE/2/3 only provide a marginal speed increase. And meanwhile, the massively parallel computations done on graphics cards dwarfs anything SIMD claims to produce.

    Perhaps we will see GFX manufacturers selling their technology to the CPU makers.

    I forget the specifics, but a new GFX card can perform somewhere around 35 GFLOPS, while a 3.4 GHz P4 (executing SIMD code) can only produce around 5-6 GFLOPS at best.

    With projects like Brook GPU emerging, the division of CPU and GFX processor may be narrowed significantly.

    1. Re:From the limewire... by Anonymous Coward · · Score: 0

      Vector-optimising compilers for C/C++ (as common example languages) are not able to extract all possible candidates for vector processing. It's a very complicated optimisation. Almost all optimisations in gcc, such as global common subexpression elimination, strength reduction, induction variables and jump optimisation, occur at the back-end RTL phase... However, vector optimising is very complicated and is done in the front end of the new GCC - it requires completely refactoring loop code into multiple atomic-like loops.
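
      A rough sketch of what splitting a loop into simpler loops can look like (often called loop distribution or loop fission; that naming and the function names `fused` and `distributed` are mine, and this is hand-written illustration, not GCC output). One loop mixes an element-wise statement with a running sum; after the split, the first loop has no dependence between iterations and is a candidate for SIMD, while the second keeps the serial recurrence.

```cpp
#include <cstddef>
#include <vector>

// Original form: one loop doing two unrelated jobs.
void fused(const std::vector<float>& a,
           std::vector<float>& scaled, std::vector<float>& prefix) {
    float running = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        scaled[i] = 2.0f * a[i];  // independent per element
        running += a[i];          // depends on the previous iteration
        prefix[i] = running;
    }
}

// Distributed form: same results, but each loop has a single job.
void distributed(const std::vector<float>& a,
                 std::vector<float>& scaled, std::vector<float>& prefix) {
    // Loop 1: purely element-wise, so several elements can be done at once.
    for (std::size_t i = 0; i < a.size(); ++i) {
        scaled[i] = 2.0f * a[i];
    }
    // Loop 2: the running sum stays serial.
    float running = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        running += a[i];
        prefix[i] = running;
    }
}
```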

      FPGA is the way of the future - FPGAs get on the order of 100x the processing power of DSPs. The difficulty is structuring the algorithm to take advantage of it. For signal processing, for example, you can do broadband demodulation of 30 MHz bandwidth with a single FPGA.

    2. Re:From the limewire... by Anonymous Coward · · Score: 0

      It's not really about "selling" the technology, more with the fact that GPU instructions aren't really general purpose like instructions in a CPU. And to add those instructions would increase the cost, power consumption, heat generation, difficulty of manufacturing etc. It just doesn't make sense. GPUs are rather specific to graphics even though their programmability has increased significantly. Some operations are directly performed in hardware (remember the hype about hardware transform&lighting around the time Geforce 256 came out?).

    3. Re:From the limewire... by Anonymous Coward · · Score: 0

      I've made the core of macstl SIMD support as general and abstract as possible, so I'd be interested in someone adapting it for a GPU or FPGA. I developed the guts of it in Altivec, then abstracted it, and the basic MMX/SSE support was only 3,000+ extra SLOC.

      Cheers,
      Glen Low, Pixelglow Software
      www.pixelglow.com

    4. Re:From the limewire... by Anonymous Coward · · Score: 0

      As two of my professors have stated in class, SIMD and moreso parallel processing will require programmers to think in a fundamentally different way in order for multi-core/multi-processor to really take off.

      Ahhh, so THIS is what I missed out on by not going to college, the computing issues of the late 80s summed up into little aphorisms.


      This project may be a step in the right direction. Benchmarks show that SIMD such as SSE/2/3 only provide a marginal speed increase. And meanwhile, the massively parallel computations done on graphics cards dwarfs anything SIMD claims to produce.

      Perhaps we will see GFX manufacturers selling their technology to the CPU makers.


      You obviously don't have a clue. SIMD is an abstract computational concept, and has nothing to do with where or how it is implemented. The very reason that GFX processors can produce such amazing results is because of the many (hundreds of) SIMD and MIMD vectorization units. GFX data is specialized and is massively parallelizable with the right hardware. A graphics processor is not a general purpose processor; if it were that easy you could slap one in place of your Pentium IV, and get amazing performance advantages. It isn't this simple. The specialized nature of the computation involved for shading, lighting, and geometry for 3d graphics makes them quite useful FOR THESE APPLICATIONS. A lot of the best minds worked on massive parallelization issues with DoD and other funding during the cold war (remember those Thinking Machines Super Computers (connection machines) in Jurassic Park with the big LED bars? I used to work with those things 12 years ago. The company went defunct and a lot of other parallel research crapped out.). Sorry but there isn't some magic Learn Parallelization in 21 Days to turn your Java coder into producing 10 fold increases in general purpose programming tasks with a GPU.

      GFX card manufacturers haven't really pioneered much over the basic parallel computation research hammered out during the 80s-90s. The big breakthroughs for them have actually been driving advances in semiconductor process issues. Putting massively interconnected SIMD and ALUs on silicon causes a whole host of problems.

    5. Re:From the limewire... by Hast · · Score: 1

      FPGAs are loads of fun. But they are not particularly suited for general cases. I was part of a group that did an image processor for FPGA, even its theoretical capacity was thoroughly whipped by a GPU running the same algorithms implemented as shaders. (And the GPU version wasn't optimised at all, eg we wasted 3 of 4 channels available for processing.)

      And for basic signal processing a DSP is really fast, typically way faster than an FPGA. It's all about how the data is structured.

      AFAIK DSP is good with "narrow" data, ie audio. FPGAs are good with wider data (eg images) with high demands on flexibility on the processing elements. Vector processors (such as GPUs, Altivec, SSE etc) are good when you have semi-wide data and general processing requirements. As it's implemented as ASIC you can get extremely high frequencies which makes for high throughput and low latency. Compared to FPGA the calculations aren't as flexible though. And finally the typical CPU which can do just about anything but comparatively slowly.

      All that said, I remember when I studied good ol' Introduction to Algorithms that there was an entire chapter on parallel processing; however, it typically required a processor for each element in the data. Naturally that is not very useful for a normal CPU but it's good for a GPU and even better for an FPGA.

  34. Ignorant submitter, or smart marketing? by javaxman · · Score: 2, Interesting
    Sorry, I can't read a story submitted by someone who doesn't even know about C libraries that have been around for years.

    Or is this just another advertisement pretending to be a story, with the submitter trying to play ignorant about alternative Altivec and MMX libraries ?

  35. Depends on what you are doing by dsci · · Score: 5, Insightful

    We write code for hardcore chemical simulations. The limits on what can be studied, ie number of atoms/molecules or timescales of the simulations depends on one thing: speed.

    Faster computers means better simulations. BUT, if the code is not as fast as it can be on a particular architecture, your simulations are not going to be as complete as they can be. At least within a given time allotment.

    I've recently applied some code optimizations to a Monte Carlo simulation and saw speed ups of over 1000x. That's significant.

    It's naive to think that faster computers means we should live with sloppy or unoptimized code. SIMD is a useful technique, and if it means the difference between me getting work done in a week or two or three weeks, I think I'll take the one-week sim.

    --
    Computational Chemistry products and services.
    1. Re:Depends on what you are doing by Anonymous Coward · · Score: 0

      Have you considered using Java or improving your algorithm rather than optimizing? That usually gives much better results.

    2. Re:Depends on what you are doing by imsabbel · · Score: 1

      Speedups like a factor of 1000 can only come from high level optimisations (like replacing an O(n^2) with an O(n log n) algo).

      Honestly: To be able to get a 1000 times boost, your original code must have been beyond bullshit.

      And of course using simd is better than not using it, but i would rather stay on a "let the compiler vectorize it" level. I mean, doing your inner loop in leet assembler only to NOT know after a long simulation if the results are real or you just botched some line isn't worth it.

      --
      HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
    3. Re:Depends on what you are doing by groomed · · Score: 1

      And of course using simd is better than not using it, but i would rather stay on a "let the compiler vectorize it" level. I mean, doing your inner loop in leet assembler only to NOT know after a long simulation if the results are real or you just botched some line isn't worth it.

      Baseless FUD. Why would a few dozen lines of hand coded assembly suddenly invalidate the results?

    4. Re:Depends on what you are doing by Dasein · · Score: 2, Insightful

      Speedups like a factor of 1000 can only come from high level optimisations (like replacing an O(n^2) with an O(n log n) algo).

      Nope. Technically, there are two constants buried in here. The definition is g(x) = O(f(x)) => g(x) <= k*f(x) where x > a for some arbitrary a. If you don't change algorithms, all you can do is manipulate the k. For a given k and a given level of improvement, I can give you a new k that hits that level of improvement.
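
      Spelled out a little more formally (my own write-up of the argument above; the symbols c and g' are introduced here for illustration):

```latex
g(x) = O\bigl(f(x)\bigr)
\iff
\exists\, k > 0,\ \exists\, a :\quad g(x) \le k\, f(x) \quad \text{for all } x > a .

% Suppose low-level optimisation makes the same algorithm c times faster,
% so its running time becomes g'(x) = g(x)/c with c > 1. Then
g'(x) = \frac{g(x)}{c} \le \frac{k}{c}\, f(x) \quad \text{for all } x > a ,

% which is still O(f(x)), now with constant k' = k/c.
% The definition puts no limit on c, so c = 1000 is not ruled out.
```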

      Honestly: To be able to get a 1000 times boost, your original code must have been beyond bullshit.

      Also, his original code may have been "bullshit" but it may not have. It depends a lot on the algorithm in question. The higher the exponent on an exponential algorithm, the more sensitive its running time is to some optimization in an inner loop.

      And of course using simd is better than not using it, but i would rather stay on a "let the compiler vectorize it" level. I mean, doing your inner loop in leet assembler only to NOT know after a long simulation if the results are real or you just botched some line isn't worth it.

      This is a simple matter of economics. There's a cost/benefit to expending the effort to optimize in assembly. If the compiler generates good code, then obviously, the cost/benefit of recoding in assembly is pretty high. However, without specific knowledge of *HIS* economics, I would suggest that you not spout off.

      --
      You are not a beautiful or unique snowflake -- but you could be if you got off your ass.
    5. Re:Depends on what you are doing by bigox · · Score: 1

      Some algorithms with the same complexity may have vastly different constant coefficients. I've seen some algorithms in O(n^2) and O(nlog(n)) that have constants as high as 1000. And some of the constants are actually parameters of the algorithm. Why do people look at these algorithms? Some are easier to explain (and easier to illustrate certain concepts) than others. Computational Geometry has a lot of these beasties.

    6. Re:Depends on what you are doing by aminorex · · Score: 2

      The difference between running mostly in L1 cache and regularly going to RAM (particularly when load/store patterns are pessimal), multiplied by the parallelism of exploiting SIMD can quite feasibly give a 1000x performance difference.

      --
      -I like my women like I like my tea: green-
    7. Re:Depends on what you are doing by imsabbel · · Score: 1

      er, no.
      not a factor of 1000.
      even if it before was 0% cache hitrate and only scalar and after 100% l1 cache hitrate and 4 way simd, it would only have a speedup of about 100. (l1 latency is 2-4 clocks on different cpus, main memory around 80-150).
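
      As a back-of-envelope check on those numbers, here is a tiny sketch. The function name `speedup_bound` and the model itself are assumptions for illustration: it pretends run time is simply memory latency per element divided by SIMD width, which ignores bandwidth, prefetching and overlapping of misses.

```cpp
// Upper bound on the speedup from "every access misses, scalar code"
// to "every access hits L1, SIMD code", under the simplified model
// described in the lead-in.
constexpr double speedup_bound(double mem_clocks, double l1_clocks,
                               double simd_width) {
    return (mem_clocks / l1_clocks) * simd_width;
}

// With the latencies quoted above:
//   speedup_bound(80.0, 4.0, 4.0)  is (80 / 4) * 4  = 80
//   speedup_bound(150.0, 2.0, 4.0) is (150 / 2) * 4 = 300
```

      So the quoted figures give a range of roughly 80x to 300x: the same order of magnitude as the "about 100" above, and short of 1000x under this model.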

      Believe me, i know how a nifty algorithm can perform like crap because what was before nice block streaming in O(n^2) became pointer chasing in O(NlgN), but not a factor of 1000, not that way.

      I know it's a bit of nitpicking, but throwing around claims like "factor 1000 improvement" need a bit of backup...

      --
      HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
    8. Re:Depends on what you are doing by aminorex · · Score: 1

      Consider a 2GHz CPU with 2-cycle L1 latency (1ns), loading one byte from each 16-byte cache line, at a stride of 16 bytes. Ignoring TLB thrash, etc., to force a factor of 1000 in latency, each line fill must take 1000ns; for a memory path that takes N ns per bit, 1000ns = 128b * N ns/b. That's about 7.8ns/b. An ECC EDO RAM 9b wide feeding an off chip MMU would do about that, for example.

      Now let's start swapping, and see if we can push a factor of 10e6, eh?

      --
      -I like my women like I like my tea: green-
  36. faster? Bogus.... by Anonymous Coward · · Score: 0, Troll

    Excuse my ignorance, but how can a C++ template library be faster than hand coded assembler? ever.... no really - with a straight face. Given of course that "hand coded" implies it's hand coded for the task at hand and not something "like" it. If this was an article about a SIMD library why does it go all koolaid? Is this today's "mac-mini" astroturf?

    1. Re:faster? Bogus.... by Anonymous Coward · · Score: 1, Insightful

      the idea that assembler programmers can write better code than a compiler can generate is one of those urban myths that refuses to die. compilers can and do undertake code analysis that no assembler programmer could ever do - like trace back the control flow through every single branch point to find instances where data has already been precalculated. code hoisting of temporaries outside loops in a way that maximises register use over memory hits. undertaking such analysis before coding in assembler would be extremely high risk for an assembler programmer. also would you as an assembler programmer go about inlining all your assembler functions - the code would be unmanageable? how many assembler programmers would know how to reorder their instructions to avoid pipeline stalls? all the knowledge about optimising assembly programs has been incorporated into compiler backends over the years - why wouldn't it have been?

      it's been tested - get a program that converts assembler to c and then recompile with optimisation - it *will* run faster.

      the only exceptions are where the compiler lacks an algebraic or RTL awareness of an instruction on a specific architecture.

      jxxx

    2. Re:faster? Bogus.... by Anonymous Coward · · Score: 0

      Why is this insightful? Of course the assembler coder wins because s/he can use the damn compiler and then tweak the output, but this is a minor detail. The real issue is that the programmer knows way more about the code than the compiler.

      Now, this is a bit C-centric.. but e.g.

      - is the code going to be executed 10 times or 10 billion times? (ok, you may have a profiler..)
      - is it certain that these arrays don't overlap? (restrict helps you only so far..)
      - is this set of chars guaranteed to be word-aligned?
      - is the function 'pure'? can it be eliminated as a common subexpression? (well, OF COURSE it's in a different compilation unit, doh! stupid question..)
      - are we using the whole set of possible values in this variable or just some subset? can we take advantage of it?
      - can we use some dirty tricks? s/m? counted branches? precompiled bitmaps? ignoring rare situations and using exceptions to handle them? giving up precision for speed? or even taking advantage of non-standard semantics, such as rounding?
      - can the language even express certain operations? carry-arithmetic comes to mind..

      Of course these can usually be done with some non-portable code but then you're essentially using your precious compiler more or less as a macro-assembler..
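
      To make the carry-arithmetic point concrete, here is a minimal sketch of a 128-bit add in portable C++ (the `U128` struct and `add128` are names made up for this example). Assembly typically has an add-with-carry instruction for this (e.g. `adc` on x86, `adde` on PowerPC); standard C++ has no carry flag, so the carry has to be reconstructed with a comparison and the compiler may or may not turn that back into the single instruction.

```cpp
#include <cstdint>

// Two 64-bit halves of an unsigned 128-bit value (illustrative type).
struct U128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// Unsigned addition wraps around, and the wrapped low sum is smaller
// than an operand exactly when a carry out of the low word occurred.
U128 add128(U128 a, U128 b) {
    U128 r;
    r.lo = a.lo + b.lo;
    const std::uint64_t carry = (r.lo < a.lo) ? 1u : 0u;
    r.hi = a.hi + b.hi + carry;
    return r;
}
```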

    3. Re:faster? Bogus.... by Creepy · · Score: 1

      I have to agree with the "insightful" poster - he/she/it is describing optimizing using assembly, not C, and trying to optimize code in assembler is a nightmare these days. You've got deep pipelines, multiple execution units, parallel processing - it's just ugly and requires a deep understanding of how to reorder the instructions to avoid pipeline stalls as well as keeping all the parallel units working at the same time. Optimizing your C code for better assembly is a different thing entirely (unrolling loops, aligning bitmaps on power-of-2 boundaries for faster copies, putting most common loop choice first, etc.).

      About the best you can hope for is to tweak C code with assembly blocks (usually after extensive profiling). Even then, you really don't see the huge performance improvements these days (nothing like 5 years ago).

      I used to be excited about the future of this tech, but to me, GPU tech has become far more important (heck, the latest stuff I've been programming barely sucks up 800MHz of my CPU but throttles the GPU).

    4. Re:faster? Bogus.... by Anonymous Coward · · Score: 0

      I have to agree with the "insightful" poster - he/she/it is describing optimizing using assembly, not C, and trying to optimize code in assembler is a nightmare these days. You've got deep pipelines, multiple execution units, parallel processing - it's just ugly and requires a deep understanding of how to reorder the instructions to avoid pipeline stalls as well as keeping all the parallel units working at the same time.

      You can always shortcut and simply time the outcome and then tweak things and start over until satisfied. Think of it as a genetic algorithm or something. But sure, knowledge helps too.

      About the best you can hope for is to tweak C code with assembly blocks (usually after extensive profiling). Even then, you really don't see the huge performance improvements these days (nothing like 5 years ago).

      Show me a C compiler that can come even near to 3 flops per cycle on an Athlon in something easily vectorizable (say, mem/mem vectored dot product), then I'll believe you. (But yea, of course it was even better in the old days when the dirty tricks didn't cause pipeline/cache flushes and branch mispredictions..)

      They don't. Why? Because you're right. Optimization is these days indeed hard and there is endless amount of stuff that may screw it. In fact it's so hard that even the compiler doesn't get it usually right. That's why there's the occasional assembler coder that actually puts effort in tweaking some small hotspot areas.

      The point is that while no-one's claiming (at least I hope so) that whole applications ought to be written from scratch in assembler, it's absolute bollocks to claim that a compiler can always beat a competent asm coder even in isolated places. In fact that's precisely what I'd call an urban myth instead of opposite..

  37. License issues-Smells funny. by Anonymous Coward · · Score: 0

    "The guy wants to get paid, and that's fine, I want to get paid, too. But he's got no business telling me I have to distribute my source code for an internal project that will never be distributed."

    He does if you use his license willingly.

    "He could easily have used a method similar to Trolltech's dual-licensing, but he chose instead to do something a whole lot more obnoxious."

    It would be obnoxious if somehow he took away your free will to choose what license to use. He didn't and you can pick a multitude of OSI licenses.

    1. Re:License issues-Smells funny. by IO+ERROR · · Score: 1

      Of course he hasn't taken away my choice, AC. I can't reconcile either of his licenses with my existing projects, so I choose not to use his code. I suspect many existing projects will find themselves in a similar situation when they actually read the licenses, and will also choose not to use his code.

      --
      How am I supposed to fit a pithy, relevant quote into 120 characters?
  38. Oh come on by Anonymous Coward · · Score: 0

    It's not like there aren't C compiler intrinsics for MMX and SSE/SSE2/SSE3(PNI). Hell, far as I know, they're supported on both intel's and FSF's compilers.

    I have to wonder: who the hell expects a library to turn out decent SIMD code for them? I mean, what the fuck's the matter?

  39. liboil by labratuk · · Score: 2, Interesting

    Another project trying to do something similar is liboil, the Library of Optimised Inner Loops.

    However in the future I can see things changing for the structure of the standard PC.

    At the moment in a high end machine you have the CPU, which is a scalar processor, a GPU, which is in essence a glorified vector processor (not just useful for graphics, as projects like GpGPU are showing us), and SIMD extensions to the CPU to allow it to do small amounts of vector processing.

    Scalar processors are good for some things (branchy code) and vector processors are good for other things (very predictable parallel code). Having both is very useful.

    I would say in the next 5-10 years we will see the GPU join together with the SIMD extensions to provide a separate general purpose vector processor.

    PCs will ship with two processors - one scalar, one vector. And everyone will be happy.

    Now, whether this will be transparent to the programmer depends on how automatic code optimisation progresses over the next few years. Is Intel's icc auto vectorisation already good enough? Don't know.

    --
    Malike Bamiyi wanted my assistance.
  40. Anyone else misread the title by Anonymous Coward · · Score: 0

    Something along the lines of Grand Theft Auto: SIMS?

  41. Moore's Law is OVER by emarkp · · Score: 1
    Haven't you been paying attention? Processor speed increases stopped 2 years ago. We can put more transistors on silicon, but the free performance ride is over.

    See Herb Sutter's article in the Feb C/C++ Users Journal or the (expanded) one in the March Dr. Dobb's Journal.

  42. But times are changing, this is becoming valuable by Paradox · · Score: 1
    Recently Herb Sutter (famous software engineering guru and C++ wizard) posted this essay in which he reminds us, among other things, that the generalization of Moore's law to processor speed is already failing! While computers are continuing to get faster, it's not just in their clockspeed anymore.

    While memory speeds will continue for awhile, already processor speeds are falling off. Check out this graph from the article where he clearly shows what's happening.

    This brings an interesting dilemma to modern programmers. Programs won't magically get faster anymore. We need to start coding to take advantage of concurrency.

    The same is true of using SIMD units. They can speed up your code dramatically, but they must be taken into account in your code. That's why this macstl project is such a good idea. It is a standard set of common primitives that let you harness the SIMD functions of your processor. By putting a library over the specifics, your vector-aware code will grow with modern SIMD systems.

    Few people will ask you to write in assembly these days, but if you could easily give your math-intensive program a 10x-30x speedup by using one library (that seems very easy to use, by my standards), why wouldn't you?

    --
    Slashdot. It's Not For Common Sense
  43. Black Art? Uh...Vectorizing the known. by Anonymous Coward · · Score: 0

    "I don't consider calling "vec_add" inside a loop to be a black art."

    That's not where the "black art" part comes into play. It comes into translating your algorithm into something that will satisfy all the parameters.

    We've come far in parallelizing algorithms, but there's still a bit of "hand-holding" that automated means need. As far as known "vectorized" algorithms. Well now that's easy, isn't it?
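
    A small sketch of the "easy" case, to show where the fiddly parts sit even then. It uses the SSE intrinsic `_mm_add_ps` as the x86 counterpart of the AltiVec `vec_add` quoted above, so it can be tried on a PC; the function name `add_arrays` is made up. The intrinsic itself is one line; the choice of unaligned loads and the scalar loop for leftover elements are the parts a library or vectorizer has to get right for you.

```cpp
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE__)
    // Main body: four floats per iteration. The unaligned load/store
    // forms are used so the pointers need not be 16-byte aligned.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Tail: the 0-3 elements left over (or everything, without SSE).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```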

  44. Why? Altivec-optimized libraries supplied by Apple by coult · · Score: 3, Interesting

    You really don't need macstl unless you have a strong desire to use valarray in C++...for example, the ATLAS project http://math-atlas.sourceforge.net/ already uses Altivec (and SSE/SSE2, etc) wherever it results in a speedup. So, if your code does linear algebra, use ATLAS and you'll see an automatic speedup in many cases. Other projects such as fftw http://fftw.org/ include Altivec/SSE/SSE2 optimizations as well. ATLAS includes lots of other optimizations such as cache-blocking, loop-unrolling, etc. I don't know if macstl includes such optimizations, but I do know that ATLAS performance approaches the theoretical peak performance on G4/G5 for things like matrix-matrix multiplication.

    Not only that, but Apple's vecLib http://developer.apple.com/ReleaseNotes/MacOSX/vecLib.html includes ATLAS so you don't even have to download or install anything - it comes with OS X.

    --

    All is Number -Pythagoras.

  45. Algorithms by Detritus · · Score: 1
    You often need radically different algorithms to get the full benefit of SIMD. The processing power is there; figuring out how to exploit it can be very difficult.

    You can do a limited version of SIMD with an ordinary CPU. A 32-bit CPU can execute 32 "bit logic" operations with a single instruction. With a properly structured problem, 32 instances can be computed in parallel.
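
    A minimal sketch of that idea (the function name `majority32` and the choice of a majority vote as the problem are mine, picked only for illustration). Bit j of every word belongs to instance j, so ordinary 32-bit AND/OR instructions evaluate 32 independent instances at once.

```cpp
#include <cstdint>

// For each of the 32 bit positions independently, the result bit is 1
// when at least two of the three corresponding input bits are 1.
std::uint32_t majority32(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    return (a & b) | (a & c) | (b & c);
}
```

    The "properly structured problem" part is the data layout: the inputs have to be transposed so that each word holds one input bit from every instance, rather than each word holding all the bits of one instance.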

    --
    Mea navis aericumbens anguillis abundat
  46. Yes. by Trillan · · Score: 2, Informative
    1. Re:Yes. by homb · · Score: 2, Informative

      No, the current version of XCode uses GCC 3.3 and does NOT support autovectorization.
      The page you link to is a page that shows how to code vector-based programs. What the parent is asking is if the standard "Hello World" program can be auto-vectorized with one command-line argument, and that won't work currently.
      The next version of XCode (2.0) with GCC 3.4 will support partial auto-vectorization, as another comment said as well.

    2. Re:Yes. by Trillan · · Score: 1

      Actually, the parent says: Doesn't XCode have a feature that lets you "vectorize" certain parts of your code already?

      The answer is yes.

      "Automatic" was introduced by one of the child posts. You're correct that the answer to that is no. However, I didn't think that's what was being asked since it was a top-level question in response to TFA's "you must use assembler."

      Anyway, sounds like we agree on reality, we just disagree on what's being asked. So between the two of us we've provided a very complete answer. :)

    3. Re:Yes. by homb · · Score: 1

      Except that I was wrong on the fact that GCC 3.4 does auto-vectorization, while it's in fact GCC 4.0 (which will ship with XCode 2.0).

      Other than that, I agree we agree on agreeing to agree.

  47. Why limit yourself to Altivec when you have NVidia by kompiluj · · Score: 3, Insightful

    Well the processing power of Altivec or MMX/SSE/3DNow or whatever is nowhere near the power of your newest NVidia/ATI card you have surely bought for playing Doom III. Why not use it then? Get the brook compiler! Furthermore, I see they introduce classes like vec, etc. Such classes have been already designed successfully for C++. Why not try porting Blitz to the Altivec and/or to the GPU?

    --
    You can defy gravity... for a short time
  48. Maybe it's just Ignorant criticism... by kuwan · · Score: 3, Informative
    If you'd actually read what this is all about then you'd have found out that this is a cross-platform library for SIMD programming. You program in standard C++ using std::valarray and you get code optimized for Altivec and MMX/SSE/SSE2/SSE3 without having to do anything else. You don't need to worry about coding to two different libraries on two different platforms nor do you have to worry about learning the platform-specific C intrinsics, alignment issues, head/tail cases, etc.

    SIMD programming becomes as easy as this:
    float af1 [] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    stdext::valarray <float> v1 (af1, 10); // construct from first 10 elements of af1
    stdext::valarray <float> v2 (10, 3.0f); // construct with 10 repeats of 3.0f
    stdext::valarray <float> v3 (10); // construct with 10 repeats of 0.0f

    v3 = sin (v1) * cos (v2) + sin (v2) * cos (v1);
    He claims that the above code is 17.4x faster than Codewarrior MSL C++, 11.6x faster than gcc libstdc++ and 9.5x faster than Visual C++.
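
    For comparison, a sketch of the same computation against the standard library's `std::valarray`, which any C++ compiler can build without macstl (the wrapper function `sine_of_sum` is a name made up for this example). One thing to watch: `std::valarray`'s fill constructor takes the value first and the count second, so it is written `(3.0f, 10)` here. This makes no claim about speed, only about what the expression computes.

```cpp
#include <cmath>
#include <valarray>

std::valarray<float> sine_of_sum() {
    const float af1[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    std::valarray<float> v1(af1, 10);   // copy the 10 elements of af1
    std::valarray<float> v2(3.0f, 10);  // 10 copies of 3.0f: (value, count)
    std::valarray<float> v3(10);        // 10 zero-initialised elements

    // Element-wise: sin(a)cos(b) + sin(b)cos(a), which equals sin(a + b).
    v3 = std::sin(v1) * std::cos(v2) + std::sin(v2) * std::cos(v1);
    return v3;
}
```

    Because of the angle-addition identity, each element of the result should match sin(i + 3) up to float rounding, which makes a handy sanity check for any vectorized version.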

    Macstl also provides a cross-platform syntax for using vector registers that is similar to using the native C intrinsics on each platform. So while not all of the native operations are available, his cross-platform "vec" API allows you to write cross-platform code without having to learn both the Altivec and MMX/SSE intrinsics (which is a good solution for someone who knows one platform but not the other).

    --
    Join the Pyramid - Free Mini Mac
    1. Re:Maybe it's just Ignorant criticism... by javaxman · · Score: 1
      If you'd actually read what this is all about then you'd have found out that this is a cross-platform library for SIMD programming.

      My point exactly. Does the story say cross-platform anywhere? No, it says :
      programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians

      er... so, instead of saying something like "here's a product which allows you to use the same API for both PPC and Intel SIMD", the submitter puts in the above sentence, which is clearly not factually correct unless you infer the part about cross-platform ?

      I'm not saying that the library isn't cool and potentially useful, I'm saying I'm very turned off by the claim that programming either of these architectures requires assembly language. Because it's not true. Why do you have a problem with my pointing that out? Why do you assume I didn't know what macstl is already?

      Oh, and I'm also turned off by the post essentially being an ad... like your sig.

    2. Re:Maybe it's just Ignorant criticism... by Anonymous Coward · · Score: 0

      You call that easy?
      How about:

      f(a,b) = sin(a)*cos(b)+sin(b)*cos(a);
      input1 = {0,1,2,3,4,5,6,7,8,9}
      input2 = repeat({3}, 10)
      output(i) = f(input1(i), input2(i))

      Just add vectorized lazy evaluation and voilà, instant efficiency. Oh, why are we still stuck with lame imperative paradigms?
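
      For what it's worth, standard C++ gets fairly close to that notation already. A minimal sketch with std::transform follows; the names f and apply_f are illustrative only, the evaluation is eager rather than lazy, and whether the loop gets vectorized is left entirely to the compiler.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// f(a,b) from the comment above, written as an ordinary function.
inline float f(float a, float b)
{
    return std::sin(a) * std::cos(b) + std::sin(b) * std::cos(a);
}

// output(i) = f(input1(i), input2(i)).
// Assumes input2 is at least as long as input1.
std::vector<float> apply_f(const std::vector<float>& input1,
                           const std::vector<float>& input2)
{
    std::vector<float> output(input1.size());
    // Binary form of std::transform: walks both inputs in step.
    std::transform(input1.begin(), input1.end(), input2.begin(),
                   output.begin(), f);
    return output;
}
```

      Calling apply_f with {0,...,9} and ten copies of 3 reproduces the input1/input2 example from the comment.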

  49. OSI-approved RPL goodness. Admit it.... by Pyrosophy · · Score: 2, Funny

    This story doesn't really mean anything and people are just making up comments.

  50. Content Addressable Parallel Processors by Baldrson · · Score: 2, Interesting
    The real "grand unified theory" of SIMD is CAPP or content addressable parallel processors. I read a book on this topic back in the 1970s and it was pretty clear to me that it:
    1. Was a great way of dealing with relational data
    2. Would have to await much larger scales of integration before becoming practical.
    Since then the computer world has become much more relational due to relational databases, and the level of integration has skyrocketed, but no major manufacturer of silicon has bothered to revisit this very simple and powerful route to high power computing.

    Fortunately there is at least a little ongoing research.

    The beauty of these processors is they integrate memory with computation so that the massive economies of scale we witness in memory fabrication apply to computation speeds as well so long as we can move toward relational rather than function computing as a paradigm. Fortunately this appears to be supported by the study of quantum computers, however those computers may never see the light of day for more fundamental reasons.

    1. Re:Content Addressable Parallel Processors by TheRaven64 · · Score: 1

      I think you missed the articles about the Cell processor. Almost every other post in those was someone re-inventing the Transputer...

      --
      I am TheRaven on Soylent News
  51. Re:OSI-approved RPL goodness. Admit it.... by yarichg** · · Score: 1

    It's an interesting discussion, only it's jumping all over the place and touching a whole bunch of interesting points... I've only written (.model small) programs in Assembly (I don't think I'd want to write many larger programs in it), and my recollection is that, given the nature of Assembly, properly written it will just about always be faster than any compiled code... so why not use it in small doses where applicable?

  52. Binary drivers for CPUs by Anonymous Coward · · Score: 0

    It's going to be really fun. ;)

  53. macstl vs. Blitz++ by ljubom · · Score: 1, Interesting

    It will be interesting to compare performance of the macstl library to other "high speed" template libraries like Blitz++ (see http://www.oonumerics.org/blitz/)

  54. Obviously, you arent a PS2 graphics programmer.. by LordZardoz · · Score: 1

    And one step further, I am betting you do not perform any sort of graphics programming.

    On win32 / mac platforms, the need to know how to do this is pretty low. DirectX wraps most of it, as well as the processes needed for GPU programming. I am sure the Mac libs that do the same job as DirectX accomplish much the same.

    But low level graphics programming is alive and well for game programming. I do what I can to stay well clear of that, since I don't like graphics programming much (just personal preference). But the need for this type of programming continues to exist. And it will continue to exist for a while yet.

    END COMMUNICATION

  55. improving the algorithm by Anonymous Coward · · Score: 0

    It is often possible to improve solution speed with better (not butter :) algorithms, but the grandparent refers to Monte Carlo methods, which really are limited by number crunching ability. Their only algorithmic improvement in many years has been the choice of not-so-random sample points.

  56. Pedantic Pissing Contests Aside by Baldrson · · Score: 1

    The point is that the GPL doesn't specify release behavior for code that isn't distributed so any "program" P developed with regard to the GPL should not reference such release behavior -- hence the substitution principle works.

    1. Re:Pedantic Pissing Contests Aside by CableModemSniper · · Score: 1

      But pedantic pissing contests are my favorite! ;)

      --
      Why not fork?
  57. Re:Why limit yourself to Altivec when you have NVi by TheRaven64 · · Score: 2, Insightful

    The main reason is that the AGP bus is designed to move data very quickly to the card, but is not so hot at moving it back again. This should change with PCI Express.

    --
    I am TheRaven on Soylent News
  58. Now they're thinking of it by El+Cabri · · Score: 1

    Funny thing: it was PRECISELY the topic of an engineering degree internship that I did in the summer of 1997: making a universal C++ template lib for SIMD programming, with application to the IA-32 MMX system. At the time there was already similar work all around, with the introduction of MMX and the popular Alpha architecture that had a similar system. All that to say that it does not sound really new to me.

  59. Assembly lives! by Omigod · · Score: 2, Interesting

    The more complex the architecture, the greater the need to keep low level coding around. Compilers just can't keep up. During the early days of the PS2 we commonly got 300x performance improvements when switching from high level code to carefully architected and coded assembly. Programmers have gotten lazy and have lost the skills required to maximize performance on current architectures. If you code carefully you can make sure that you are executing the maximum number of instructions per cycle. A compiler abstracts you from seeing that if you change your instruction pairing or split off some of the instructions into another pipeline you might get better performance. In school they teach you that the algorithm is the most important thing to look at and that implementation doesn't matter that much, but with today's complex bus architectures, and with everything fighting for control of the bus, if you aren't careful you can end up wasting most of your time waiting for access to data or stalling the instruction pipeline waiting for the results of calculations.

  60. CAPP != Cell by Baldrson · · Score: 1

    In CAPP operations are generally carried out in a bit-serial, word-parallel manner. This is radically different from Cell processor architecture.
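
    To make "bit-serial, word-parallel" concrete, here is a software simulation of an exact-match search done that way. It is only a sketch: the function name capp_match and the 32-bit word width are assumptions for illustration, and the inner loop over words is what the hardware would do for all words in a single step.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Software simulation of a bit-serial, word-parallel exact-match search.
// Returns one flag per stored word: true if the word equals key.
std::vector<bool> capp_match(const std::vector<std::uint32_t>& words,
                             std::uint32_t key)
{
    std::vector<bool> match(words.size(), true);  // every word starts as a candidate
    for (int bit = 31; bit >= 0; --bit) {         // bit-serial: one bit position per step
        const std::uint32_t key_bit = (key >> bit) & 1u;
        // Word-parallel: in hardware, every word would compare this bit
        // against the key at the same time; here it is an ordinary loop.
        for (std::size_t i = 0; i < words.size(); ++i) {
            if (((words[i] >> bit) & 1u) != key_bit) {
                match[i] = false;                 // word drops out of the candidate set
            }
        }
    }
    return match;
}
```

    The search time depends on the word width (32 steps here), not on how many words are stored, which is the property that makes the approach attractive for relational lookups.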

  61. Cell is not an SIMD but a MIMD by John+Sokol · · Score: 1

    SIMD is Single Instruction, Multiple Data.
    MMX, SSE, SSE2, SSE3 and AltiVec are all this.

    Cell is MIMD, Multiple instruction and multiple data.

    Cell is an array of small independent CPUs. (think Beowulf cluster on a chip)

    Computation is done by systolic arrays or similar parallel processing techniques.

    Think of cells in a spreadsheet, where each rectangle performs its own computation. A Cell processor allows you to change data at the top of the spreadsheet and compute results at the bottom at GHz speeds!

    Granted this isn't good for running an OS, but for video processing, Finite element simulations, Ray Tracing, code breaking, and AI, it's great.

    I was working with Chuck Moore on Project Enumera; we laid out a chip with 49 (7x7) asynchronous CPUs. (this is important).
    When doing Cell processors with 50 cores you don't want them to run in lock step. This is akin to why soldiers march out of step when crossing a bridge. It distributes the loading on the power supply lines, rather than creating one big spike when they all switch.

    --
    I am always doing that which I can not do, in order that I may learn how to do it. - Pablo Picasso
    1. Re:Cell is not an SIMD but a MIMD by Anonymous Coward · · Score: 0

      I think I've read there are nine cores on a cell chip, one of those being a 'normal' cpu to drive the other 8 cores simd-style.

      Thing is, you can then put multiple cells together, getting you what you were talking about.

  62. Yes, it does. by Millennium · · Score: 1

    AltiVec was introduced with the G4 line. The Mac Mini has such a chip. If it were to use a G3 chip, then it wouldn't have AltiVec, but that is not the case.

  63. Autovectorizing by dghomefry · · Score: 1

    Autovectorizing does exist: Absoft makes a product called VAST, http://www.absoft.com/Products/Libraries/vast.html
    It works on C, Fortran, and C++. I've seen some reasonable performance gains from just a recompile.

  64. grand unified theory? by convolvatron · · Score: 1

    a grand unified theory would use SIMD for distribution, not just exploiting a shallow local vector unit. like ZPL, or the connection machine languages. in a manner that allows you to exploit scale.

    no one calls a cray X1 SIMD, but it's a lot closer than altivec.

    1. Re:grand unified theory? by Anonymous Coward · · Score: 0

      macstl has support for memory-mapped valarrays so computational load could be shared among different processes, although you still have to do this manually. If I have the time, I'd explore ways of making this automatic, perhaps using some form of standard MPI-type technology.

      Cheers,
      Glen Low, Pixelglow Software
      www.pixelglow.com

  65. Andy-C A P P by Anonymous Coward · · Score: 0

    "The beauty of these processors is they integrate memory with computation so that the massive economies of scale we witness in memory fabrication apply to computation speeds as well so long as we can move toward relational rather than function computing as a paradigm."

    Well, one could say something similar about cache memory. The thing is that, regardless of where you stick the computation, you still can't ignore the fact that computational units are always going to be bigger than memory cells. The wisest usage of silicon real estate is to put the very simplest computations near their associated memory cells, e.g. zeroing contents or filling a range with a particular bit pattern.* Then put the more complex operations on the CPU. Or even split the duties with the main chipset.

    *Extension of the refresh logic.

  66. Re:Why? Altivec-optimized libraries supplied by Ap by jd · · Score: 1
    The maintainer for ATLAS has been unemployed and/or abducted by aliens for some considerable time. Nor does ATLAS implement all of BLAS and LAPACK. It is probably no longer optimal for many modern systems.


    That's no fault of the maintainer - rough times are, well, rough. Nonetheless, it will take a long time before ATLAS is again a serious option for the kinds of problems it is good at solving. It's a pity, but it's too big a project for one person, and other projects (albeit non-free, in any sense) are putting in far more resources than that.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  67. Re:Why? Altivec-optimized libraries supplied by Ap by Anonymous Coward · · Score: 0

    SIMD works best when all the SIMD code is within a tight inner loop, with little or no branching or conditional code except the loop itself. Function calls in particular should be kept out of the loop. This helps with loop unrolling and instruction scheduling, and minimizes pipeline bubbles.

    The problem with any separately compiled library, including Apple's fine vecLib implementation, is that it puts the function call boundary in the wrong place.

    The compiled library will have functions like vec_sin and vec_cos that work on a large set of vectors like IBM MASS, so your call to calculate sin(x)+cos(x) would look something like:

    allocate 1000 of v1
    allocate 1000 of v2
    allocate 1000 of v3
    allocate 1000 of v4
    for each v1, v2: v2 = sin (v1) -- call to lib vec_sin
    for each v1, v3: v3 = cos (v1) -- call to lib vec_cos
    for each v4, v2, v3: v4 = v2 + v3 -- call to lib add

    Note 4 memory allocations, 2 of which are for temporaries that won't be used again; note also 3 branches away to library functions.

    Compare with macstl, which is massively inlined yet works on an element-by-element basis:

    allocate 1000 of v1
    allocate 1000 of v4
    for each v4, v1: v4 = sin(v1) + cos(v1)

    Saving 2 expensive allocations and inlining function calls to within the loop, so no conditionals or branching there.

    The only way around this with separately compiled code is to put more and more functionality into a single call e.g. FFT, linear algebra, but you lose the flexibility of creating your own equations -- what if you don't want fast fourier transforms or linear algebra, but some funky trig function?
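
    The difference in loop structure can be sketched in plain scalar C++. This is an illustration only, not macstl or vecLib code: the function names are made up, std::sin and std::cos stand in for the library's vec_sin and vec_cos, and nothing here is actually SIMD.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Library style: each operation is a separate pass over memory,
// with temporaries v2 and v3 that exist only to feed the final add.
std::vector<float> sin_plus_cos_passes(const std::vector<float>& v1)
{
    const std::size_t n = v1.size();
    std::vector<float> v2(n), v3(n), v4(n);
    for (std::size_t i = 0; i < n; ++i) v2[i] = std::sin(v1[i]);  // stands in for a call to lib vec_sin
    for (std::size_t i = 0; i < n; ++i) v3[i] = std::cos(v1[i]);  // stands in for a call to lib vec_cos
    for (std::size_t i = 0; i < n; ++i) v4[i] = v2[i] + v3[i];    // stands in for a call to lib add
    return v4;
}

// Fused style: one pass, no temporaries, the whole expression
// evaluated per element inside a single loop.
std::vector<float> sin_plus_cos_fused(const std::vector<float>& v1)
{
    const std::size_t n = v1.size();
    std::vector<float> v4(n);
    for (std::size_t i = 0; i < n; ++i) v4[i] = std::sin(v1[i]) + std::cos(v1[i]);
    return v4;
}
```

    Both functions compute the same values; the second touches each input element once and allocates only the result, which is the shape an inlined expression like sin(v1) + cos(v1) is meant to reduce to.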

    Check out http://www.pixelglow.com/stories/altivec-valarray-1/.

    Cheers,
    Glen Low, Pixelglow Software
    www.pixelglow.com

  68. Like a Blitter? by gidds · · Score: 1
    It's just struck me that the Altivec unit in this 'ere PowerMac is actually a little similar to the Blitter chip that my old Atari STE used to have.

    The exact function may be slightly different -- a vector processor is far more flexible -- but it's still a special-purpose unit that drastically speeds up a few simple operations on reasonably large amounts of data, often used for graphical operations.

    Interesting how so many ideas in computing are just developments of previous ones...

    --

    Ceterum censeo subscriptionem esse delendam.

  69. Battery consumption by tepples · · Score: 1

    because processor speeds have increased to such an extent (Moore's Law), it doesn't make sense to use assembly to write modern code; even if the assembly code is faster.

    More efficient code will run in less time, letting the CPU stay in an idle state more often. This can reduce power consumption, especially on battery-constrained devices. How many watts does your l33t-0-fast processor draw again, and how long would it run on a pair of AAs?

  70. Re:Why limit yourself to Altivec when you have NVi by MasterVidBoi · · Score: 1

    For the same reason that general purpose computation isn't done on your GPU. A GPU gets its performance from being able to do the same small task to a whole lot of data. A CPU needs to do a bunch of tasks to a small bit of data.

    So, you need to multiply two vectors of a thousand floats each. Can the GPU do it faster? Yes. But not really, because there is an astounding minimum latency before the results of that computation can be recovered from the GPU. It's a *deep* pipeline. Even though the CPU will spend longer calculating, the results will be available immediately, and you have much better turnaround time. If you were doing a million multiplies, the answer would be different. But outside of image processing/DSP work, you rarely find such operations.
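
    The trade-off can be put as a toy cost model: total time is a fixed latency plus a per-element cost. The sketch below is an assumption for illustration only; the function names are made up and any numbers passed in are hypothetical, not measurements of a real CPU or GPU.

```cpp
// Toy cost model: time to process n elements on a device with a fixed
// start-up/readback latency and a constant cost per element.
// Units are arbitrary but must be consistent across all arguments.
double total_time(double latency, double per_element, double n)
{
    return latency + per_element * n;
}

// Element count at which the high-latency, high-throughput device (GPU)
// catches up with the low-latency device (CPU).
// Assumes per_cpu > per_gpu; otherwise the GPU never catches up.
double break_even(double lat_cpu, double per_cpu,
                  double lat_gpu, double per_gpu)
{
    return (lat_gpu - lat_cpu) / (per_cpu - per_gpu);
}
```

    With made-up figures of zero CPU latency, 2 units per element on the CPU, 1000 units of GPU latency and 1 unit per element on the GPU, break_even gives 1000 elements; below that the CPU finishes first, above it the GPU does.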

    Altivec (and MMX/SSE2/SSE3, although less useful) sits nicely in the middle, allowing you to operate on larger pieces of data in parallel without incurring the latency of a GPU operation, which gives excellent performance gains in quite a few common situations.

  71. API matters by dugenou · · Score: 1
    What about an open, cross-platform, multimedia-oriented library along the lines of SUN's mediaLib? Would SUN freely allow the re-use of their API?

    I'm looking for such a library, with a GPL/LGPL compatible license. The API has to be in C, to maximise audience. For many projects, C++ is not an option.

    Primary use will be DSP work in the GNU Radio project, but multimedia extensions could prove useful anywhere from GUIs to audio/video apps, etc.

    I would take any pointers to an already existing API/project, or be ready to start a new one if other people are interested.

    --
    Love salty crackers? catchy electronica? Try !