Grand Unified Theory of SIMD
Glen Low writes " All of a sudden, there's going to be an Altivec unit in every pot: the Mac Mini, the Cell processor, the Xbox2. Yet programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians. The macstl project tries to unify the architectures in a simple C++ template library. It just reached its 0.2 milestone and claims a 3.6x to 16.2x speed-up over hand-coded scalar loops. And of course it's all OSI-approved RPL goodness. "
For those who want a little background on Altivec, of course Wiki has a description here. Apple, who now ships Altivec in every system they make, has a pretty good page here and Motorola nee Freescale has one here.
The benefits of Altivec can be truly astounding for those processes that can be "vectorized". After all, putting these kinds of calculations in hardware has got it all over software computation. It kind of reminds me of when I got one of those Photoshop accelerator hardware cards (Radius Photoengine with 4 DSPs on a daughter card linked to the Thunder series video card) for my IIci. Photoshop filter functions ran faster on that IIci than they did on much later PowerPC systems simply because you now had four hardware DSPs running your image math.
Visit Jonesblog and say hello.
Apple has had AltiVec optimized libraries for DSP and such since the early releases of OS X.
*nods* Yes.
I like to place meaningful quotes in my sig, so people will know that I know what meaningful quotes are.
Doesn't XCode have a feature that lets you "vectorize" certain parts of your code already?
Okay, I'm willing to believe it, but only if someone shows how that's possible.
The difference between spam and poop is that you don't have to dig through septic tanks looking for real food. -- Me
The Mac forum at Ars Technica has a long, continuing post about Altivec optimizations and how they should be used. The thread started more than two years ago and still gets relevant points and questions added to it. It's an amazing resource if you're interested in starting.
Big freaking deal. Hardware and software are irrelevant. It's all about content now.
Moore's Law has eroded the need for such knowledge. It would be like concerning myself on how to design circuits to convert a DC current to AC current because I happen to use devices that use electricity, e.g., my toaster (as in bread).
I learned assembly long ago, still retaining a fair amount of it (80x86). There have been a few occasions where I've called upon its use, yeah twice in the last eight years... and that's about it.
Yes, some people who write games are still concerned with assembly, as are people in embedded markets. But those jobs, situations and skills are niche, much like the Win32 programming I used to do in the early 90's.
90% of IT jobs are with non-tech companies. Those situations are about the last place you will find anyone caring about something called "assembly language."
-M
The Reciprocal Public License requires you to release all of your source code if you link to this library, even if your project is personal or used in-house only.
How am I supposed to fit a pithy, relevant quote into 120 characters?
Typo...
Propellerheads.SE
Moore's Law has eroded the need for assembly
Moore's Law has nothing to do with assembly language and optimizations. From Wikipedia:
Moore's law is an empirical observation stating, in effect, that at our rate of technological development and advances in the semiconductor industry, the complexity of integrated circuits doubles every 18 months.
I wish people would stop saying "But Moore's Law..." for every hardware-related story on Slashdot. Do a bit of reading, please.
I take pride in the fact that i didn't understand a word of this post.
The RPL (Reciprocal Public License) is an odd choice for this project. It is an even stronger viral copy-left than the GPL, to the point where the FSF takes issue with it. If you create a derivative work you are required to 1) Notify the original author, and 2) Publish your changes even if you only use the program in house. Furthermore, their definition of derivative work is much, much broader than the "linking" definition that the GPL uses.
The fact that it puts these additional requirements / restrictions on the user makes it incompatible with the GPL. In fact, considering the requirements placed on you by the license, I would expect that you will have difficulty incorporating this RPL library into any existing FLOSS project without running into license conflicts. The only thing I can see this being useful for is a new project that you don't mind releasing under the RPL, or with existing BSD style licensed code which you dual license as BSD/RPL (since BSD can be included in anything).
So this library does not appear to be very usable for the FLOSS world, although if you want to license it for proprietary software you may.
"...the black art of assembly language magicians."
The nice thing about altivec is that it has a C interface. You don't have to use assembly!
Take a look at this Apple tutorial to see how easy it is.
Does this mean we can expect source Linux distros to start taking advantage of this?
I know I'll sound like a wannabe leet for saying this, but I already really like my Gentoo workstation because it is a stage1 install (all from source), and I expect this will only make it even faster!
Yay!
I don't know the meaning of the word 'don't' - J
Sounds great, but $2499 for a redistributable binary? Ouch.
A bit OT, but nevertheless quite interesting to read and it contains information about SIMD instruction sets other than just MMX/SSE: http://www.fefe.de/ccccamp2003-simd.pdf
A monkey is doing the real work for me.
TWW
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
For those who don't already know, autovectorization is being worked on for GCC by folks from IBM and others.
GCC vectorization project (site seems offline atm), but the abstract from a recent GCC summit is up.
Autovectorization Talk (google html view of pdf)
Vectorization (SIMD) is built into the Intel compiler. There is no need to hack in assembly as the compiler will do it for you. This is the case with most vendor supplied compilers, as they want to fully exploit their hardware functionality.
The problem is bringing this functionality to open-source compilers, which, as far as I know, lack even an OpenMP (threading) implementation, let alone internal vectorization.
UBU
I had a look at their website and I am a bit sceptical about the licensing scheme. Why can't they just be upfront and GPL it?
It's a stupid license because it's impossible to enforce. It's like trying to enforce a peculiar moral code with a EULA.
Note that I'm a fan of the GPL, and I think the aim of it is entirely in concert with the type of rules programmers have followed for over 40 years.
SIMD support already exists, in the form of C, C++, and Fortran libraries (usually, as a small part of larger numerical libraries), as well as in language constructs in languages like Fortran.
Even in embedded systems, assembly isn't used as much as it used to. It still gets used in bootloaders, and sometimes in device drivers. However, most devices are memory mapped, and most of the driver is written in C, and asm() calls are made when appropriate (eg, asm("eieio");), especially when you get to use gcc and asm() syntax for accessing variables.
(S(SKK)(SKK))(S(SKK)(SKK))
Surely people can now start to see where the future lies - from a performance viewpoint. We've reached the end of the clocking "free lunch" (see http://www.gotw.ca/publications/concurrency-ddj.htm).
The way forward is turning the CPU (of a traditional) architecture into a Nanny for a range of various dedicated processing units. IBM saw this years ago, and thus began the whole Cell architecture - but I suspect that their job was much easier. The software that would run on the platform they are designing is fairly specific - games & multimedia which usually lend themselves well to vectorization.
The real challenge for architects (in my humble opinion) will be applying the same technique to other system bottlenecks.
AMD's (and now Intel's) approach of cramming more and more processing cores onto an IC might pay off in the short term, but like the "free lunch" of clock speed, will hit a roadblock when issues like memory bandwidth and caching schemes just have too much work to do with 4 or 8 processing cores hacking at it all the time.
[ Monday is a terrible way to spend one seventh of your life. ]
Reading this reminded me about that portion of the standard C++ library which is all about operations on vector data. So, my question is: could an std::valarray specialization for processor-supported types serve as a basis for portable SIMD support in C++?
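For anyone who hasn't used that corner of the library, here is a minimal sketch of what std::valarray code looks like in plain standard C++. The function name saxpy is only an illustrative choice, and nothing here is taken from macstl. Whether the expression turns into SIMD instructions depends on the library implementation and compiler; the standard permits such optimization but does not require it.

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Element-wise a*x + y written as one whole-array expression.
// The standard leaves the implementation free to evaluate this with
// SIMD instructions, but does not promise that it will.
std::valarray<float> saxpy(float a,
                           const std::valarray<float>& x,
                           const std::valarray<float>& y)
{
    assert(x.size() == y.size());  // valarray operators expect equal lengths
    return a * x + y;
}
```

The point of the question above is that this interface hides the loop, so a specialization for float or int could, in principle, pick vector instructions behind the same syntax.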
My exception safety is -fno-exceptions.
Freescale, nee Motorola. (Nee roughly translates to "formerly known as").
---
Mod me down, you fucking twits. Go ahead. I dare you.
(I read with sigs off.)
A better resource for Altivec and SIMD in general is the SIMDtech.org website and Altivec mailing list. There are tutorials and technical manuals available and the email list is indispensable. While the mailing list is mostly geared towards Altivec optimizations and discussions all SIMD discussion is welcome, including MMX/SSE. There are Apple engineers that read and contribute to the list as well as Motorola/Freescale engineers. It's probably the single best resource available to Altivec programmers and you get to talk directly to the Wizards that created it.
I'm a relative newcomer to the list and it's been an invaluable resource as I've optimized with Altivec.
--
Join the Pyramid - Free Mini Mac
infested with jello like fishes no melotron wishes
"Even in embedded systems, assembly isn't used as much as it used to. "
It is when programming DSPs (and related devices). Don't forget that microcontrollers outnumber microprocessors by a large margin. And performance is important there (especially automotive and aeronautics.)
Tiger, the next OS release from Apple, will take care of vector optimization automatically in their version of gcc 4.0. I guess this will make it into the public gcc too.
This is public now, so I can talk about it--
I worked on extending the accuracy and continuity of the VMX instruction vexptefp, see the patent application here
My understanding is that this instruction is used to compute Phong/specular hilights, and that previous implementations of this instruction were unusable because the lack of accuracy and continuity made it visually undesirable. We were able to improve the algorithm enough to be visually indistinguishable from a fully accurate non-estimate.
Can any software developers that use this instruction comment on this?
Is Phong hilighting mostly done on GPUs now?
As two of my professors have stated in class, SIMD and moreso parallel processing will require programmers to think in a fundamentally different way in order for multi-core/multi-processor to really take off.
This project may be a step in the right direction. Benchmarks show that SIMD such as SSE/2/3 only provides a marginal speed increase. And meanwhile, the massively parallel computations done on graphics cards dwarf anything SIMD claims to produce.
Perhaps we will see GFX manufacturers selling their technology to the CPU makers.
I forget the specifics, but a new GFX card can perform somewhere around 35 GFLOPS, while a 3.4GHz P4 (executing SIMD code) can only produce around 5-6 GFLOPS at best.
With projects like Brook GPU emerging, the division of CPU and GFX processor may be narrowed significantly.
Or is this just another advertisement pretending to be a story, with the submitter trying to play ignorant about alternative Altivec and MMX libraries ?
We write code for hardcore chemical simulations. The limits on what can be studied, ie number of atoms/molecules or timescales of the simulations depends on one thing: speed.
Faster computers means better simulations. BUT, if the code is not as fast as it can be on a particular architecture, your simulations are not going to be as complete as they can be. At least within a given time allotment.
I've recently applied some code optimizations to a Monte Carlo simulation and saw speed ups of over 1000x. That's significant.
It's naive to think that faster computers means we should live with sloppy or unoptimized code. SIMD is a useful technique, and if it means the difference between me getting work done in a week or two or three weeks, I think I'll take the one-week sim.
Computational Chemistry products and services.
Excuse my ignorance, but how can a C++ template library be faster than hand coded assembler? ever.... no really - with a straight face. Given of course that "hand coded" implies it's hand coded for the task at hand and not something "like" it. If this was an article about a SIMD library why does it go all koolaid? Is this today's "mac-mini" astroturf?
"The guy wants to get paid, and that's fine, I want to get paid, too. But he's got no business telling me I have to distribute my source code for an internal project that will never be distributed."
He does if you use his license willingly.
"He could easily have used a method similar to Trolltech's dual-licensing, but he chose instead to do something a whole lot more obnoxious."
It would be obnoxious if somehow he took away your free will to choose what license to use. He didn't and you can pick a multitude of OSI licenses.
It's not like there aren't C compiler intrinsics for MMX and SSE/SSE2/SSE3(PNI). Hell, far as I know, they're supported on both Intel's and FSF's compilers.
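For readers who haven't seen them, a minimal sketch of what those intrinsics look like in use. The function name add_arrays is hypothetical; the intrinsics themselves (_mm_loadu_ps, _mm_add_ps, _mm_storeu_ps) come from the SSE header xmmintrin.h. The sketch assumes a compiler that defines __SSE__ when SSE is available (GCC and Clang do) and falls back to a plain loop otherwise.

```cpp
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, ...
#endif

// out[i] = a[i] + b[i], four floats at a time where SSE is available.
void add_arrays(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // 4 additions in one instruction
    }
#endif
    for (; i < n; ++i)                               // scalar tail (or whole array without SSE)
        out[i] = a[i] + b[i];
}
```

Note that this only covers x86; the AltiVec equivalent uses different types and function names, which is the portability gap a wrapper library tries to close.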
I have to wonder: who the hell expects a library to turn out decent SIMD code for them? I mean, what the fuck's the matter?
Another project trying to do something similar is liboil, the Library of Optimised Inner Loops.
However in the future I can see things changing for the structure of the standard PC.
At the moment in a high end machine you have the CPU, which is a scalar processor, a GPU, which is in essence a glorified vector processor (not just useful for graphics, as projects like GpGPU are showing us), and SIMD extensions to the CPU to allow it to do small amounts of vector processing.
Scalar processors are good for some things (branchy code) and vector processors are good for other things (very predictable parallel code). Having both is very useful.
I would say in the next 5-10 years we will see the GPU join together with the SIMD extensions to provide a separate general purpose vector processor.
PCs will ship with two processors - one scalar, one vector. And everyone will be happy.
Now, whether this will be transparent to the programmer depends on how automatic code optimisation progresses over the next few years. Is Intel's icc auto vectorisation already good enough? Don't know.
Malike Bamiyi wanted my assistance.
Something along the lines of Grand Theft Auto: SIMS?
See Herb Sutter's article in the Feb C/C++ Users Journal or the (expanded) one in the March Dr. Dobb's Journal.
While memory speeds will continue for awhile, already processor speeds are falling off. Check out this graph from the article where he clearly shows what's happening.
This brings an interesting dilemma to modern programmers. Programs won't magically get faster anymore. We need to start coding to take advantage of concurrency.
The same is true of using SIMD units. They can speed up your code dramatically, but they must be taken into account in your code. That's why this macstl project is such a good idea. It is a standard set of common primitives that let you harness the SIMD functions of your processor. By putting a library over the specifics, your vector-aware code will grow with modern SIMD systems.
Few people will ask you to write in assembly these days, but if you could easily give your math-intensive program a 10x-30x speedup by using one library (that seems very easy to use, by my standards), why wouldn't you?
Slashdot. It's Not For Common Sense
"I don't consider calling "vec_add" inside a loop to be a black art."
That's not where the "black art" part comes into play. It comes into translating your algorithm into something that will satisfy all the parameters.
We've come far in parallelizing algorithms, but there's still a bit of "hand-holding" that automated means need. As far as known "vectorized" algorithms. Well now that's easy, isn't it?
You really don't need macstl unless you have a strong desire to use valarray in C++... for example, the ATLAS project http://math-atlas.sourceforge.net/ already uses Altivec (and SSE/SSE2, etc) wherever it results in a speedup. So, if your code does linear algebra, use ATLAS and you'll see an automatic speedup in many cases. Other projects such as fftw http://fftw.org/ include Altivec/SSE/SSE2 optimizations as well. ATLAS includes lots of other optimizations such as cache-blocking, loop-unrolling, etc. I don't know if macstl includes such optimizations, but I do know that ATLAS performance approaches the theoretical peak performance on G4/G5 for things like matrix-matrix multiplication.
Not only that, but Apple's vecLib http://developer.apple.com/ReleaseNotes/MacOSX/vecLib.html includes ATLAS so you don't even have to download or install anything - it comes with OS X.
All is Number -Pythagoras.
You can do a limited version of SIMD with an ordinary CPU. A 32-bit CPU can execute 32 "bit logic" operations with a single instruction. With a properly structured problem, 32 instances can be computed in parallel.
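A minimal sketch of that idea, assuming each bit position of a 32-bit word holds one independent problem instance. The names AdderLanes and full_adder_32 are made up for illustration. Here 32 one-bit full adders are evaluated with a handful of ordinary logic instructions:

```cpp
#include <cstdint>

// Result of 32 independent one-bit full adders; bit i belongs to instance i.
struct AdderLanes {
    std::uint32_t sum;
    std::uint32_t carry;
};

// Each bit position is its own adder: no bit ever interacts with a neighbour,
// so plain AND/OR/XOR on the whole word computes all 32 instances at once.
AdderLanes full_adder_32(std::uint32_t a, std::uint32_t b, std::uint32_t carry_in)
{
    const std::uint32_t partial = a ^ b;
    AdderLanes r;
    r.sum   = partial ^ carry_in;
    r.carry = (a & b) | (partial & carry_in);
    return r;
}
```

The "properly structured problem" part is the hard bit: the data has to be laid out so that bit i of every operand belongs to the same instance.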
Mea navis aericumbens anguillis abundat
Yes it does.
Well the processing power of Altivec or MMX/SSE/3DNow or whatever is nowhere near the power of the newest NVidia/ATI card you have surely bought for playing Doom III. Why not use it then? Get the brook compiler! Furthermore, I see they introduce classes like vec, etc. Such classes have been already designed successfully for C++. Why not try porting Blitz to the Altivec and/or to the GPU?
You can defy gravity... for a short time
SIMD programming becomes as easy as this:
He claims that the above code is 17.4x faster than Codewarrior MSL C++, 11.6x faster than gcc libstdc++ and 9.5x faster than Visual C++.
Macstl also provides a cross-platform syntax for using vector registers that is similar to using the native C intrinsics on each platform. So while not all of the native operations are available, his cross-platform "vec" API allows you to write cross-platform code without having to learn both the Altivec and MMX/SSE intrinsics (which is a good solution for someone who knows one platform but not the other).
This story doesn't really mean anything and people are just making up comments.
- Was a great way of dealing with relational data
- Would have to await much larger scales of integration before becoming practical.
Since then the computer world has become much more relational due to relational databases, and the levels of integration have skyrocketed, but no one major manufacturer of silicon has bothered to revisit this very simple and powerful route to high power computing. Fortunately there is at least a little ongoing research.
The beauty of these processors is they integrate memory with computation so that the massive economies of scale we witness in memory fabrication apply to computation speeds as well so long as we can move toward relational rather than function computing as a paradigm. Fortunately this appears to be supported by the study of quantum computers, however those computers may never see the light of day for more fundamental reasons.
Seastead this.
It's an interesting discussion, only it's jumping all over the place and touching a whole bunch of interesting points... I've only written (.model small) programs in Assembly (don't think I'd want to write many larger programs in Assembly), and my recollection is that, given the nature of Assembly (written properly), it will just about always be faster than any compiled code... so why not use it in small doses where applicable...
It's going to be really fun. ;)
It will be interesting to compare performance of the macstl library to other "high speed" template libraries like Blitz++ (see http://www.oonumerics.org/blitz/)
And one step further, I am betting you do not perform any sort of graphics programming.
On win32 / mac platforms, the need to know how to do this is pretty low. DirectX wraps most of it, as well as the processes needed for GPU programming. I am sure the Mac libs that do the same job as DirectX accomplish much the same.
But low level graphics programming is alive and well for game programming. I do what I can to stay well clear of that, since I don't like graphics programming much (just personal preference). But the need for this type of programming continues to exist. And it will continue to exist for a while yet.
END COMMUNICATION
It is often possible to improve solution speed with better (not butter :) algorithms, but the grandparent refers to Monte Carlo methods, which really are limited by number crunching ability. Their only algorithmic improvement in many years has been the choice of not-so-random sample points.
The point is that the GPL doesn't specify release behavior for code that isn't distributed so any "program" P developed with regard to the GPL should not reference such release behavior -- hence the substitution principle works.
Seastead this.
The main reason is that the AGP bus is designed to move data very quickly to the card, but is not so hot at moving it back again. This should change with PCI Express.
I am TheRaven on Soylent News
Funny thing : it was PRECISELY the topic of an engineer degree internship that I've made in the summer of 1997. Making a universal C++ template lib for SIMD programming, with application to the IA-32 MMX system. At the time there was already similar work all around, with the introduction of the MMX and the popular Alpha architecture that had a similar system. All that to say that it does not sound really new to me.
The more complex the architecture the greater need to keep around low level coding. Compilers just can't keep up. During the early days of the PS2 we commonly got 300x performance improvements when switching from high level code to carefully architected and coded assembly. Programmers have gotten lazy and have lost the skills required to maximize the performance on current architectures. If you code carefully you can make sure that you are executing the maximum number of instructions per cycle. When you use a compiler it abstracts you from seeing that if you change your instruction pairing or split off some of the instructions into another pipeline you might get better performance. In school they teach you that algorithm is the most important thing to look at and that implementation doesn't matter that much, but with today's complex bus architectures, and with everything fighting for control of the bus, if you aren't careful you can end up wasting most of your time waiting for access to data or stalling the instruction pipeline waiting for results to calculations.
In CAPP operations are generally carried out in a bit-serial, word-parallel manner. This is radically different from Cell processor architecture.
Seastead this.
SIMD is Single Instruction Multiple data,
MMX, SSE, SSE2, SSE3 and AltiVec are all this.
Cell is MIMD, Multiple instruction and multiple data.
Cell is an array of small independent CPU's. (think Beowulf cluster on a chip)
Computation is done by systolic arrays or similar parallel processing techniques.
Think of cells in a spreadsheet, where each rectangle performs its computation. A Cell processor allows you to change data at the top of the spreadsheet and compute results at the bottom at GHz speeds!
Granted this isn't good for running an OS, but for video processing, Finite element simulations, Ray Tracing, code breaking, and AI, it's great.
I was working with Chuck Moore on Project Enumera, where we laid out a chip with 49 (7x7) asynchronous CPU's. (this is important).
When doing Cell processors with 50 cores you don't want them to run in lockstep. This is akin to why soldiers march out of step when crossing a bridge. It distributes the load on the power supply lines, rather than creating one big spike when they all switch.
I am always doing that which I can not do, in order that I may learn how to do it. - Pablo Picasso
AltiVec was introduced with the G4 line. The Mac Mini has such a chip. If it were to use a G3 chip, then it wouldn't have AltiVec, but that is not the case.
Autovectorizing does exist: Absoft makes a product called VAST http://www.absoft.com/Products/Libraries/vast.html
It works on C, Fortran, and C++. I've seen some reasonable performance gains from just a recompile.
a grand unified theory would use SIMD for distribution, not just exploiting a shallow local vector unit. like ZPL, or the connection machine languages. in a manner that allows you to exploit scale.
no one calls a cray X1 SIMD, but it's a lot closer than altivec.
"The beauty of these processors is they integrate memory with computation so that the massive economies of scale we witness in memory fabrication apply to computation speeds as well so long as we can move toward relational rather than function computing as a paradigm."
Well one could say something similar about cache memory. The thing is that regardless of where you stick the computation, you still can't ignore the fact that computational units are always going to be bigger than memory cells. The wisest usage of silicon real-estate is to put the very simplest computations near their associated memory cell, e.g. zeroing contents, filling a range with a particular bit pattern.* Then put the more complex operations on the CPU. Or even split the duties with the main chipset.
*Extension of the refresh logic.
That's no fault of the maintainer - rough times are, well, rough. Nonetheless, it will take a long time before ATLAS is again a serious option for the kinds of problems it is good at solving. It's a pity, but it's too big a project for one person, and other projects (albeit non-free, in any sense) are putting in far more resources than that.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
SIMD works best when all the SIMD code is within a tight inner loop, with little or no branching or conditional code (and especially no function calls) other than the loop itself. This helps with loop unrolling, instruction scheduling and minimizes pipeline bubbles.
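A minimal sketch of the loop shape being described, under the assumption that the compiler is allowed to auto-vectorize (that depends on compiler and flags). The function name scale_and_clamp is hypothetical. Every iteration does the same work, and the clamp is written as a min rather than an if, which compilers typically turn into a select or vector-min instead of a branch:

```cpp
#include <algorithm>
#include <cstddef>

// out[i] = min(in[i] * gain, limit) for every element.
// No early exits, no data-dependent branches, no calls the compiler
// cannot inline: the loop body is identical for every iteration.
void scale_and_clamp(const float* in, float* out, std::size_t n,
                     float gain, float limit)
{
    for (std::size_t i = 0; i < n; ++i) {
        const float scaled = in[i] * gain;
        out[i] = std::min(scaled, limit);  // a select, not an if/else
    }
}
```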
- 1/.
The problem with any separately compiled library, including Apple's fine vecLib implementation, is that it puts the function call boundary in the wrong place.
The compiled library will have functions like vec_sin and vec_cos that work on a large set of vectors like IBM MASS, so your call to calculate sin(x)+cos(x) would look something like:
allocate 1000 of v1
allocate 1000 of v2
allocate 1000 of v3
allocate 1000 of v4
for each v1, v2: v2 = sin (v1) -- call to lib vec_sin
for each v1, v3: v3 = cos (v1) -- call to lib vec_cos
for each v4, v2, v3: v4 = v2 + v3 -- call to lib add
Note 4 memory allocations, 2 of which are for temporaries that won't be used again; note also 3 branches away to library functions.
Compare with macstl, which is massively inlined yet works on an element-by-element basis:
allocate 1000 of v1
allocate 1000 of v4
for each v4, v1: v4 = sin(v1) + cos(v1)
Saving 2 expensive allocations and inlining function calls to within the loop, so no conditionals or branching there.
The only way around this with separately compiled code is to put more and more functionality into a single call e.g. FFT, linear algebra, but you lose the flexibility of creating your own equations -- what if you don't want fast fourier transforms or linear algebra, but some funky trig function?
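The two loop structures above can be sketched in plain scalar C++. This is not macstl's API or Apple's vecLib, and the function names are hypothetical; it only illustrates where the temporaries and the extra passes over memory come from.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Library-style: one pass per operation, each writing a full-size array.
std::vector<float> sin_plus_cos_passes(const std::vector<float>& v1)
{
    std::vector<float> v2(v1.size());  // temporary holding sin(v1)
    std::vector<float> v3(v1.size());  // temporary holding cos(v1)
    std::vector<float> v4(v1.size());  // result
    for (std::size_t i = 0; i < v1.size(); ++i) v2[i] = std::sin(v1[i]);
    for (std::size_t i = 0; i < v1.size(); ++i) v3[i] = std::cos(v1[i]);
    for (std::size_t i = 0; i < v1.size(); ++i) v4[i] = v2[i] + v3[i];
    return v4;
}

// Fused: a single pass, no temporaries, each input element read once.
std::vector<float> sin_plus_cos_fused(const std::vector<float>& v1)
{
    std::vector<float> v4(v1.size());
    for (std::size_t i = 0; i < v1.size(); ++i)
        v4[i] = std::sin(v1[i]) + std::cos(v1[i]);
    return v4;
}
```

Both functions compute the same values; the fused version simply touches memory less, which is the effect an expression-template library aims to get from ordinary-looking array expressions.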
Check out http://www.pixelglow.com/stories/altivec-valarray
Cheers,
Glen Low, Pixelglow Software
www.pixelglow.com
The exact function may be slightly different -- a vector processor is far more flexible -- but it's still a special-purpose unit that drastically speeds up a few simple operations on reasonably large amounts of data, often used for graphical operations.
Interesting how so many ideas in computing are just developments of previous ones...
Ceterum censeo subscriptionem esse delendam.
because processor speeds have increased to such an extent (Moore's Law), it doesn't make sense to use assembly to write modern code; even if the assembly code is faster.
More efficient code will run in less time, letting the CPU stay in an idle state more often. This can reduce power consumption, especially on battery-constrained devices. How many watts does your l33t-0-fast processor draw again, and how long would it run on a pair of AAs?
For the same reason that general purpose computation isn't done on your GPU. A GPU gets its performance from being able to do the same small task to a whole lot of data. A CPU needs to do a bunch of tasks to a small bit of data.
So, you need to multiply two vectors of a thousand floats each. Can the GPU do it faster? Yes. But not really, because there is an astounding minimum latency before the results of that computation can be recovered from the GPU. It's a *deep* pipeline. Even though the CPU will spend longer calculating, the results will be available immediately, and you have much better turnaround time. If you were doing a million multiplies, the answer would be different. But outside of image processing/DSP work, you rarely find such operations.
Altivec (and MMX/SSE2/SSE3, although less useful), sit nicely in the middle, allowing you to operate on larger pieces of data in parallel without incurring the latency of a GPU operation, allowing excellent performance gains in quite a few common situations.
I'm looking for such a library, with GPL/LGPL compatible license. The API has to be in C, to maximise audience. For many projects, C++ is not an option.
Primary use will be DSP work in GNU Radio project, but multimedia extensions could prove useful anywhere in GUI's to audio/video app, etc.
I would take any pointers to such an already existing API/project, or be ready to start a new one, if other people are interested.
Love salty crackers? catchy electronica? Try !