Grand Unified Theory of SIMD
Glen Low writes " All of a sudden, there's going to be an Altivec unit in every pot: the Mac Mini, the Cell processor, the Xbox2. Yet programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians. The macstl project tries to unify the architectures in a simple C++ template library. It just reached its 0.2 milestone and claims a 3.6x to 16.2x speed-up over hand-coded scalar loops. And of course it's all OSI-approved RPL goodness. "
For those who want a little background on Altivec, of course Wiki has a description here. Apple, who now ships Altivec in every system they make has a pretty good page here and Motorola nee Freescale has one here.
The benefits of Altivec can be truly astounding for those processes that can be "vectorized". After all putting these kinds of calculations in hardware has got it all over software computation. It kind of reminds me of when I got one of those Photoshop accelerator hardware cards (Radius Photoengine with 4 DSPs on a daughter card linked to the Thunder series video card) for my IIci. Photoshop filter functions ran faster on that IIci than they did on much later PowerPC systems simply because you now had four hardware DSPs running your image math.
Visit Jonesblog and say hello.
Apple has had AltiVec optimized libraries for DSP and such since the early releases of OS X.
Doesn't XCode have a feature that lets you "vectorize" certain parts of your code already?
The Mac forum at Ars Technica has a long, continuing post about Altivec optimizations and how they should be used. The thread started more than two years ago and still gets relevent points and questions added to it. It's an amazing resource if you're interested in starting.
The Reciprocal Public License requires you to release all of your source code if you link to this library, even if your project is personal or used in-house only.
How am I supposed to fit a pithy, relevant quote into 120 characters?
When using Reason 3, some virtual synths have the option to produce an enhanced sound.
What is curious is that if you are using a pre-Altivec proc (G3), it'll burn more CPU time while the same enhancement will be totally and natively supported by Altivec-enabled units : a 400MHz G4 Powerbook is enhancing these sytnhs more efficiently than an 800MHz G3.
I guess this was like the simultaneous operations that the ARM assembly language supports (e.g. both storing and rotating values in an operation)...
Trolling using another account since 2005.
Moore's Law has eroded the need for assembly
Moore's Law has nothing to do with assembly language and optimizations. From Wikipedia:
Moore's law is an empirical observation stating, in effect, that at our rate of technological development and advances in the semiconductor industry, the complexity of integrated circuits doubles every 18 months.
I wish people would stop saying "But Moore's Law..." for every hardware-related story on Slashdot. Do a bit of reading, please.
The principle behind SIMD, or, rather, Single Instruction Multiple Data, is that you can process wide arrays of values in a single instruction. With the PowerPC version of SIMD, also known as AltiVec, you can issue an instruction and have it work with a 128-bit wide register. These registers may contain up to 4 32-bit numbers, 8 16-bit numbers or 16 8-bit numbers. For example, I can load two AltiVec registers with 16 unsigned chars, add them together using Vec_Add() and have it return its results to an AltiVec register. So this in essense is adding 16 values at once and in theory it's good enough for markeing to claim a 16X speedup, but this is rarely the case.
The RPL ( Reciprocal Public License) is an odd choice for this project. It is an even stronger viral copy-left than the GPL, to the point where the FSF takes issue with it. If create a derivative work you are required required to 1) Notify the original author, and 2) Publish your changes even if you only use the program in house. Furthermore, their definition of derivative work is much, much broader than the "linking" definition that the GPL uses.
The fact that it puts these additional requirements / restrictions on the user makes it incompatible with the GPL. In fact, considering the requirements placed on you by the license, I would expect that you will have difficulty incorporating this RPL library into any existing FLOSS project without running into license conflicts. The only thing I can see this being useful for is a new project that you don't mind releasing under the RPL, or with existing BSD style licensed code which you dual license as BSD/RPL (since BSD can be included in anything).
So this library does not appear to very useable for the FLOSS world, although if you want to license it for proprietary software you may.
"...the black art of assembly language magicians."
The nice thing about altivec is that it has a C interface. You don't have to use assembly!
Take a look at this Apple tutorial to see how easy it is.
90% of the worlds' people do not own cars. Therefore, there is no need for gas stations. If you pick a living human completely at random from the earth, chances are they don't drive one of these "car" things.
Don't blame me; I'm never given mod points.
A bit OT, but nevertheless quite interesting to read and it contains information about SIMD instruction sets other than just MMX/SSE: http://www.fefe.de/ccccamp2003-simd.pdf
A monkey is doing the real work for me.
For those that don't already know is that autovectorization is being worked on for GCC by folks from IBM and others.
GCC vectorizatoin project (site seem offline atm) but the abstract from a recent GCC summit is up.
Autovectorization Talk (google html view of pdf)
Vectorization (SIMD) is built into the Intel compiler. There is no need to hack in assembly as the compiler will do it for you. This is the case with most vendor supplied compilers, as they want to fully exploit their hardware functionality.
The problem is bringing this functionality to OS compilers, which as far as I know, there is not even an OpenMP (threading) implementation, let alone internal vectorization.
UBU
SIMD support already exists, in the form of C, C++, and Fortran libraries (usually, as a small part of larger numerical libraries), as well as in language constructs in languages like Fortran.
Even in embedded systems, assembly isn't used as much as it used to. It still get used in bootloaders, and sometimes in device drivers. However, most devices are memory mapped, and most of the driver is written in C, and asm() calls are made when appropriate (eg, asm("eieio");), especially when you get to use gcc and asm() syntax for accessing variables.
(S(SKK)(SKK))(S(SKK)(SKK))
Surely people can now start to see where the future lies - from a performance viewpoint. We've reached the end of the clocking "free lunch" (see http://www.gotw.ca/publications/concurrency-ddj.ht m/).
The way forward is turning the CPU (of a traditional) architecture into a Nanny for a range of various dedicated processing units. IBM saw this years ago, and thus began the whole Cell architecture - but I suspect that their job was much easier. The software that would run on the platform they are designing is fairly specific - games & multimedia which usually lend themselves well to vectorization.
The real challenge for architects (in my humble opinion) is translating will be applying the same technique to other system bottlenecks.
AMD's (and now Intel's) approach of crambing more and more processing cores onto an IC might pay off in the short term, but like the "free lunch" of clock speed, will hit a roadblock when issues like memory bandwidth and caching schemes just have too much work to do with 4 or 8 processing cores hacking at it all the time.
[ Monday is a terrible way to spend one seventh of your life. ]
In the corporate world, is it more expensive than paying a developer to design, code, test, and maintain a home-grown version?
Once you've payed a $30 dollar/hour developer for 10 days work, you've forked out ~ $2,500...
--#voxlator
Which is exactly why this sort of thing is so important.
Sure, you could probably get it to work even faster with hand-tuned assembly than simply using this library. But programmer time is expensive, and customizing code adds complexity. By reusing optimized code, you can enjoy some of the benefits of SIMD without having to devote the same amount of resources.
Let's be honest, this isn't a silver bullet - this isn't going to speed up code that doesn't use lots of floating-point vectors anyway. But if it does... (nearly) free performance is always a good thing.
A better resource for Altivec and SIMD in general is the SIMDtech.org website and Altivec mailing list. There are tutorials and technical manuals available and the email list is indispensable. While the mailing list is mostly geared towards Altivec optimizations and discussions all SIMD discussion is welcome, including MMX/SSE. There are Apple engineers that read and contribute to the list as well as Motorola/Freescale engineers. It's probably the single best resource available to Altivec programmers and you get to talk directly to the Wizards that created it.
I'm a relative newcomer to the list and it's been an invaluable resource as I've optimized with Altivec.
--
Join the Pyramid - Free Mini Mac
infested with jello like fishes no melotron wishes
Tiger, the next OS release from Apple, will take care of vector optimization automatically in their version of gcc 4.0. I guess this will make it into the public gcc too.
As two of my professors have stated in class, SIMD and moreso parallel processing will require programmers to think in a fundamentally different way in order for multi-core/multi-processor to really take off.
This project may be a step in the right direction. Benchmarks show that SIMD such as SSE/2/3 only provide a marginal speed increase. And meanwhile, the massively parallel computations done on graphics cards dwarfs anything SIMD claims to produce.
Perhaps we will see GFX manufacturers selling their technology to the CPU makers.
I forget the specifics, but a new GFX card can perform somewhere around 35 GFLOPS, while a 3.4Ghz P4(executing SIMD code) can only produce around 5-6GFLOPS at best.
With projects like Brook GPU emerging, the division of CPU and GFX processor may be narrowed significantly.
Or is this just another advertisement pretending to be a story, with the submitter trying to play ignorant about alternative Altivec and MMX libraries ?
We write code for hardcore chemical simulations. The limits on what can be studied, ie number of atoms/molecules or timescales of the simulations depends on one thing: speed.
Faster computers means better simulations. BUT, if the code is not as fast as it can be on a particular architecture, your simulations are not going to be as complete as they can be. At least within a given time allotment.
I've recently applied some code optimizations to a Monte Carlo simulation and saw speed ups of over 1000x. That's significant.
It's naive to think that faster computers means we should live with sloppy or unoptimized code. SIMD is a useful technique, and if it means the difference between me getting work done in a week or two or three weeks, I think I'll take the one-week sim.
Computational Chemistry products and services.
So, my question is: could an std::valarray specialization for processor-supported types serve as a basis for portable SIMD support in C++?
That's exactly what this is. If you read the part on his website about valarray then you'll see that it does extensive SIMD optimizations for valarray for both Altivec and MMX/SSE/SSE2/SSE3 platforms. He's even added "parallelized algorithms such as integer division, trigonometric functions and complex number arithmetic" which you'd have to code yourself in either assembly or using the C-based intrinsics if you wanted do the SIMD programming by hand.
So basically, this allows you to code using std::valarray using normal C++ and then plug this in under the hood to get a nice speed boost.
--
Join the Pyramid - Free Mini Mac
infested with jello like fishes no melotron wishes
Another project trying to do something similar is liboil, the Library of Optimised Inner Loops.
However in the future I can see things changing for the structure of the stardard PC.
At the moment in a high end machine you have the CPU, which is a scalar processor, a GPU, which is in essence a glorified vector processor (not just useful for graphics, as projects like GpGPU are showing us), and SIMD extensions to the CPU to allow it to do small amounts of vector processing.
Scalar processors are good for some things (branchy code) and vector processors are good for other things (very predictable parallel code). Having both is very useful.
I would say in the next 5-10 years we will see the GPU join together with the SIMD extensions to provide a seperate general purpose vector processor.
PCs will ship with two processors - one scalar, one vector. And everyone will be happy.
Now, whether this will be transparent to the programmer depends on how automatic code optimisation progresses over the next few years. Is Intel's icc auto vectorisation already good enough? Don't know.
Malike Bamiyi wanted my assistance.
You really don't need macstl unless you have a strong desire to use valarray in C++...for example, the ATLAS project http://math-atlas.sourceforge.net/ already uses Altivec (and SSE/SSE2, etc) wherever it results in a speedup. So, if your code does linear algebra, use ATLAS and you'll see an automatic speedup in many cases. Other projects such as fftw http://fftw.org/ include Altivec/SSE/SSE2 optimizations as well. ATLAS includes lots of other optimizations such as cache-blocking, loop-unrolling, etc. I don't know of macstl includes such optimizations, but I do know that ATLAS performance approaches the theoretical peak performance on G4/G5 for things like matrix-matrix multiplication.
c Lib.html includes ATLAS so you don't even have to download or install anything - it comes with OS X.
Not only that, but Apple's vecLib http://developer.apple.com/ReleaseNotes/MacOSX/ve
All is Number -Pythagoras.
Yes it does.
Well the processing power of Altivec or MMX/SSE/3DNow or whatever is nowhere near the power of you newest NVidia/ATI card you have surely bought for playing Doom III. Why not use it then? Get the brook compiler! Furthemore, I see they introduce classes like vec, etc. Such classes have been already designed successfuly for C++. Why not try porting Blitz to the Altivec and/or to the GPU?
You can defy gravity... for a short time
SIMD programming becomes as easy as this:He claims that the above code is 17.4x faster than Codewarrior MSL C++, 11.6x faster than gcc libstdc++ and 9.5x faster than Visual C++.
Macstl also provides a cross-platform syntax for using vector registers that is similar to using the native C intrinsics on each platform. So while not all of the native operations are available, his cross-platform "vec" API allows you to write cross-platform code without having to learn both the Altivec and MMX/SSE intrinsics (which is a good solution for someone who knows one platform but not the other).
--
Join the Pyramid - Free Mini Mac
infested with jello like fishes no melotron wishes
A good example is what happens when you let the compiler decide how to do aritmetic with vectors and matrixes.
Matrix a,b,c,x;
x = a + b + c;
The naked compiler, in combination with your custom Matrix class, will probably unwind the operator overloads to do something like this:All those temporary copies and inlined loops really kill performance.
Now, with an expression library, it handles each arithmetic expression discretely by type. By treating the expressions, as well as the types involved, you can do more sophisticated things. In this case, the Expression Template Library solves the problem thusly:Here the library has carnal knowledge of the data structures involved as well as order of operations to come to such a succint solution.
In the case of MACSTL, its still using these principals of "vectorizing" the expressions as well as unrolling and other traditional optimization techniques. Its also going the extra mile and using processor specific code and/or C code that targets *extremely* well to PPC. For example, the above example would opitmize well using Altivec, due to the platform's built-in vector type; you wouldn't even need a loop for adding several 'vec' instances.
I wish I knew enough about MACSTL and altivec to give a hard example of a 16X speedup. I hope this gets you closer to seeing at least *where* the reducable overhead is coming from.
Check out Blitz++'s papers listing for more info:
http://www.oonumerics.org/blitz/papers/
Sorry, but yours is an utterly kneejerk boilerplate response which has nothing to do with the topic at hand and only serves to establish your credentials as a hard nosed realist who has been there and done it.
Moore's Law has eroded the need for such knowledge
Moore's "law" (which is just an off-the-cuff observation, really) has nothing to do with this. If anything, Moore's law has enabled transistor and space devouring SIMD technology.
It would be like concerning myself on how to design circuits...
No, it's nothing like that at all. Just because you own and know how to use money doesn't mean there is no point to the complex financial reckonings that are made every day at institutions all over the world. You may not need, but you is not under discussion.
Yes some people who write games are still concerne with assembly as are people in embedded markets. But those jobs, situations and skills are niche
By this definition, everything is niche. The whole computing industry becomes "niche". Farming is "niche". The paper industry is "niche". What you're describing is just non-descript white collar administrative work which just happens to involve a computer; bit shuffling, rather than paper shuffling.
Those situations are about the last place you will find anyone caring about something called "assembly language."
Again, completely irrelevant.
The point is that with a few dozen lines of SIMD code (whether in assembly or some high level language) any reasonably competent programmer can achieve four-fold, ten-fold, even twenty-fold speedups on critical path code, from scratch, in as little as a week.
These are amazing results, and people should be encouraged to investigate the possibilities, not be dragged down into this drab netherworld of yours.
This story doesn't really mean anything and people are just making up comments.
- Was a great way of dealing with relational data
- Would have to await much larger scales of integration before becoming practical.
Since then the computer world has become much more relational due to relational databases, and the levels of integration of skyrocketed, but no one major manufacturer of silicon has bothered to revisit this very simple and powerful route to high power computing.Fortunately there is at least a little ongoing research.
The beauty of these processors is they integrate memory with computation so that the massive economies of scale we witness in memory fabrication apply to computation speeds as well so long as we can move toward relational rather than function computing as a paradigm. Fortunately this appears to be supported by the study of quantum computers, however those computers may never see the light of day for more fundamental reasons.
Seastead this.
So this in essense is adding 16 values at once and in theory it's good enough for markeing to claim a 16X speedup, but this is rarely the case.
;-)
There are 32 of these registers (independent, not shared with the FPU) which means you can chain together a pretty complex series of calculations without intermediate load/store sequences. The unit has multiple independent computation units with their own dispatch queues (details vary between specific processor models). Some AltiVec opcodes are designed to common series of multiple scalar instructions.
The result is that speed ups of more than 16x are not at all rare. 30x is not uncommon in graphics manipulations; I would venture to say that 100x is "rarely the case."
The main reason is that the AGP bus is designed to move data very quickly to the card, but is not so hot at moving it back again. This should change with PCI Express.
I am TheRaven on Soylent News
The more complex the architecture the greater need to keep around low level coding. Compilers just can't keep up. During the early days of the PS2 we commonly got 300x performance improvements when switching from high level code to carefully architected and coded assembly. Programmers have gotten lazy and have lost the skills required to maximize the performance on current architectures. If you code carefully you can make sure that you are executing the maximum number of instructions per cycle. When you use a compiler it abstracts you from seeing that if you change your instruction pairing or split off some of the instructions into another pipeline you might get better performance. In school they teach you that algorythm is the most important thing to look at and that implementation doesn't matter that much, but with todays complex bus architectures, and with everything fighting for control of the bus, if you aren't careful you can end up wasting most of your time waiting for access to data or stalling the instruction pipeline waiting for results to calculations.