Grand Unified Theory of SIMD
Glen Low writes " All of a sudden, there's going to be an Altivec unit in every pot: the Mac Mini, the Cell processor, the Xbox2. Yet programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians. The macstl project tries to unify the architectures in a simple C++ template library. It just reached its 0.2 milestone and claims a 3.6x to 16.2x speed-up over hand-coded scalar loops. And of course it's all OSI-approved RPL goodness. "
For those who want a little background on Altivec, of course Wiki has a description here. Apple, which now ships Altivec in every system it makes, has a pretty good page here, and Motorola nee Freescale has one here.
The benefits of Altivec can be truly astounding for those processes that can be "vectorized". After all, putting these kinds of calculations in hardware has got it all over software computation. It kind of reminds me of when I got one of those Photoshop accelerator hardware cards (Radius Photoengine with 4 DSPs on a daughter card linked to the Thunder series video card) for my IIci. Photoshop filter functions ran faster on that IIci than they did on much later PowerPC systems simply because you now had four hardware DSPs running your image math.
Visit Jonesblog and say hello.
Apple has had AltiVec optimized libraries for DSP and such since the early releases of OS X.
Doesn't Xcode have a feature that lets you "vectorize" certain parts of your code already?
Okay, I'm willing to believe it, but only if someone shows how that's possible.
The difference between spam and poop is that you don't have to dig through septic tanks looking for real food. -- Me
The Mac forum at Ars Technica has a long, continuing post about Altivec optimizations and how they should be used. The thread started more than two years ago and still gets relevant points and questions added to it. It's an amazing resource if you're interested in starting.
Moore's Law has eroded the need for such knowledge. It would be like concerning myself on how to design circuits to convert a DC current to AC current because I happen to use devices that use electricity, e.g., my toaster (as in bread).
I learned assembly long ago, still retaining a fair amount of it (80x86). There have been a few occasions where I've called upon its use, yeah twice in the last eight years... and that's about it.
Yes, some people who write games are still concerned with assembly, as are people in embedded markets. But those jobs, situations and skills are niche, much like the Win32 programming I used to do in the early 90's.
90% of IT jobs are with non-tech companies. Those situations are about the last place you will find anyone caring about something called "assembly language."
-M
The Reciprocal Public License requires you to release all of your source code if you link to this library, even if your project is personal or used in-house only.
How am I supposed to fit a pithy, relevant quote into 120 characters?
Typo...
Propellerheads.SE
Moore's Law has eroded the need for assembly
Moore's Law has nothing to do with assembly language and optimizations. From Wikipedia:
Moore's law is an empirical observation stating, in effect, that at our rate of technological development and advances in the semiconductor industry, the complexity of integrated circuits doubles every 18 months.
I wish people would stop saying "But Moore's Law..." for every hardware-related story on Slashdot. Do a bit of reading, please.
The RPL (Reciprocal Public License) is an odd choice for this project. It is an even stronger viral copyleft than the GPL, to the point where the FSF takes issue with it. If you create a derivative work, you are required to 1) notify the original author, and 2) publish your changes even if you only use the program in-house. Furthermore, its definition of derivative work is much, much broader than the "linking" definition that the GPL uses.
The fact that it puts these additional requirements / restrictions on the user makes it incompatible with the GPL. In fact, considering the requirements placed on you by the license, I would expect that you will have difficulty incorporating this RPL library into any existing FLOSS project without running into license conflicts. The only thing I can see this being useful for is a new project that you don't mind releasing under the RPL, or with existing BSD style licensed code which you dual license as BSD/RPL (since BSD can be included in anything).
So this library does not appear to be very usable for the FLOSS world, although if you want to license it for proprietary software you may.
"...the black art of assembly language magicians."
The nice thing about altivec is that it has a C interface. You don't have to use assembly!
Take a look at this Apple tutorial to see how easy it is.
Does this mean we can expect source Linux distros to start taking advantage of this?
I know I'll sound like a wannabe leet for saying this, but I already really like my Gentoo workstation because it is a stage1 install (all from source), and I expect this will only make it even faster!
Yay!
I don't know the meaning of the word 'don't' - J
Sounds great, but $2499 for a redistributable binary? Ouch.
A bit OT, but nevertheless quite interesting to read and it contains information about SIMD instruction sets other than just MMX/SSE: http://www.fefe.de/ccccamp2003-simd.pdf
A monkey is doing the real work for me.
TWW
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
For those who don't already know, autovectorization is being worked on for GCC by folks from IBM and others.
GCC vectorization project (site seems offline atm), but the abstract from a recent GCC summit is up.
Autovectorization Talk (google html view of pdf)
Vectorization (SIMD) is built into the Intel compiler. There is no need to hack in assembly as the compiler will do it for you. This is the case with most vendor supplied compilers, as they want to fully exploit their hardware functionality.
The problem is bringing this functionality to OS compilers: as far as I know, they do not even have an OpenMP (threading) implementation, let alone internal vectorization.
UBU
SIMD support already exists, in the form of C, C++, and Fortran libraries (usually, as a small part of larger numerical libraries), as well as in language constructs in languages like Fortran.
Even in embedded systems, assembly isn't used as much as it used to be. It still gets used in bootloaders, and sometimes in device drivers. However, most devices are memory mapped, and most of the driver is written in C, and asm() calls are made when appropriate (eg, asm("eieio");), especially when you get to use gcc and asm() syntax for accessing variables.
(S(SKK)(SKK))(S(SKK)(SKK))
Surely people can now start to see where the future lies - from a performance viewpoint. We've reached the end of the clocking "free lunch" (see http://www.gotw.ca/publications/concurrency-ddj.htm).
The way forward is turning the CPU (of a traditional architecture) into a Nanny for a range of various dedicated processing units. IBM saw this years ago, and thus began the whole Cell architecture - but I suspect that their job was much easier. The software that would run on the platform they are designing is fairly specific - games & multimedia which usually lend themselves well to vectorization.
The real challenge for architects (in my humble opinion) will be applying the same technique to other system bottlenecks.
AMD's (and now Intel's) approach of cramming more and more processing cores onto an IC might pay off in the short term, but like the "free lunch" of clock speed, will hit a roadblock when issues like memory bandwidth and caching schemes just have too much work to do with 4 or 8 processing cores hacking at it all the time.
[ Monday is a terrible way to spend one seventh of your life. ]
Reading this reminded me about that portion of the standard C++ library which is all about operations on vector data. So, my question is: could an std::valarray specialization for processor-supported types serve as a basis for portable SIMD support in C++?
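As a rough illustration of the question above, here is a minimal sketch of whole-array arithmetic with std::valarray. The function name saxpy is made up for this example, and nothing here is taken from macstl. Whether the expression actually compiles down to SIMD instructions depends on the compiler and standard library; the C++ standard only guarantees the element-wise result.

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Hypothetical helper: computes a*x + y element by element, written
// without an explicit loop. A valarray implementation is free to map
// this onto vector instructions, but is not required to.
std::valarray<float> saxpy(float a,
                           const std::valarray<float>& x,
                           const std::valarray<float>& y)
{
    // Binary valarray operations are only defined for equal sizes.
    assert(x.size() == y.size());
    return a * x + y;
}
```

Calling saxpy(2.0f, x, y) on two four-element arrays returns a new four-element valarray. The appeal of the interface is that the loop is hidden, which leaves room for a specialization to use vector registers underneath.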
My exception safety is -fno-exceptions.
Freescale, nee Motorola. (Nee roughly translates to "formerly known as").
---
Mod me down, you fucking twits. Go ahead. I dare you.
(I read with sigs off.)
A better resource for Altivec and SIMD in general is the SIMDtech.org website and Altivec mailing list. There are tutorials and technical manuals available and the email list is indispensable. While the mailing list is mostly geared towards Altivec optimizations and discussions all SIMD discussion is welcome, including MMX/SSE. There are Apple engineers that read and contribute to the list as well as Motorola/Freescale engineers. It's probably the single best resource available to Altivec programmers and you get to talk directly to the Wizards that created it.
I'm a relative newcomer to the list and it's been an invaluable resource as I've optimized with Altivec.
--
Join the Pyramid - Free Mini Mac
infested with jello like fishes no melotron wishes
Tiger, the next OS release from Apple, will take care of vector optimization automatically in their version of gcc 4.0. I guess this will make it into the public gcc too.
This is public now, so I can talk about it--
I worked on extending the accuracy and continuity of the VMX instruction vexptefp, see the patent application here
My understanding is that this instruction is used to compute Phong/specular hilights, and that previous implementations of this instruction were unusable because the lack of accuracy and continuity made it visually undesirable. We were able to improve the algorithm enough to be visually indistinguishable from a fully accurate non-estimate.
Can any software developers that use this instruction comment on this?
Is Phong hilighting mostly done on GPUs now?
As two of my professors have stated in class, SIMD and, more so, parallel processing will require programmers to think in a fundamentally different way in order for multi-core/multi-processor to really take off.
This project may be a step in the right direction. Benchmarks show that SIMD such as SSE/2/3 only provides a marginal speed increase. And meanwhile, the massively parallel computations done on graphics cards dwarf anything SIMD claims to produce.
Perhaps we will see GFX manufacturers selling their technology to the CPU makers.
I forget the specifics, but a new GFX card can perform somewhere around 35 GFLOPS, while a 3.4GHz P4 (executing SIMD code) can only produce around 5-6 GFLOPS at best.
With projects like Brook GPU emerging, the division of CPU and GFX processor may be narrowed significantly.
Or is this just another advertisement pretending to be a story, with the submitter trying to play ignorant about alternative Altivec and MMX libraries ?
We write code for hardcore chemical simulations. The limits on what can be studied, ie number of atoms/molecules or timescales of the simulations depends on one thing: speed.
Faster computers means better simulations. BUT, if the code is not as fast as it can be on a particular architecture, your simulations are not going to be as complete as they can be. At least within a given time allotment.
I've recently applied some code optimizations to a Monte Carlo simulation and saw speed ups of over 1000x. That's significant.
It's naive to think that faster computers means we should live with sloppy or unoptimized code. SIMD is a useful technique, and if it means the difference between me getting work done in a week or two or three weeks, I think I'll take the one-week sim.
Computational Chemistry products and services.
Of course he hasn't taken away my choice, AC. I can't reconcile either of his licenses with my existing projects, so I choose not to use his code. I suspect many existing projects will find themselves in a similar situation when they actually read the licenses, and will also choose not to use his code.
Another project trying to do something similar is liboil, the Library of Optimised Inner Loops.
However, in the future I can see things changing for the structure of the standard PC.
At the moment in a high end machine you have the CPU, which is a scalar processor, a GPU, which is in essence a glorified vector processor (not just useful for graphics, as projects like GPGPU are showing us), and SIMD extensions to the CPU to allow it to do small amounts of vector processing.
Scalar processors are good for some things (branchy code) and vector processors are good for other things (very predictable parallel code). Having both is very useful.
I would say in the next 5-10 years we will see the GPU join together with the SIMD extensions to provide a separate general purpose vector processor.
PCs will ship with two processors - one scalar, one vector. And everyone will be happy.
Now, whether this will be transparent to the programmer depends on how automatic code optimisation progresses over the next few years. Is Intel's icc auto vectorisation already good enough? Don't know.
Malike Bamiyi wanted my assistance.
See Herb Sutter's article in the Feb C/C++ Users Journal or the (expanded) one in the March Dr. Dobb's Journal.
While memory speeds will continue for awhile, already processor speeds are falling off. Check out this graph from the article where he clearly shows what's happening.
This brings an interesting dilemma to modern programmers. Programs won't magically get faster anymore. We need to start coding to take advantage of concurrency.
The same is true of using SIMD units. They can speed up your code dramatically, but they must be taken into account in your code. That's why this macstl project is such a good idea. It is a standard set of common primitives that let you harness the SIMD functions of your processor. By putting a library over the specifics, your vector-aware code will grow with modern SIMD systems.
Few people will ask you to write in assembly these days, but if you could easily give your math-intensive program a 10x-30x speedup by using one library (that seems very easy to use, by my standards), why wouldn't you?
Slashdot. It's Not For Common Sense
You really don't need macstl unless you have a strong desire to use valarray in C++...for example, the ATLAS project http://math-atlas.sourceforge.net/ already uses Altivec (and SSE/SSE2, etc) wherever it results in a speedup. So, if your code does linear algebra, use ATLAS and you'll see an automatic speedup in many cases. Other projects such as fftw http://fftw.org/ include Altivec/SSE/SSE2 optimizations as well. ATLAS includes lots of other optimizations such as cache-blocking, loop-unrolling, etc. I don't know if macstl includes such optimizations, but I do know that ATLAS performance approaches the theoretical peak performance on G4/G5 for things like matrix-matrix multiplication.
Not only that, but Apple's vecLib http://developer.apple.com/ReleaseNotes/MacOSX/vecLib.html includes ATLAS so you don't even have to download or install anything - it comes with OS X.
All is Number -Pythagoras.
You can do a limited version of SIMD with an ordinary CPU. A 32-bit CPU can execute 32 "bit logic" operations with a single instruction. With a properly structured problem, 32 instances can be computed in parallel.
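A minimal sketch of the idea above, assuming the "properly structured problem" is a one-bit full adder: each of the 32 bit positions in a word is treated as an independent lane, so a handful of ordinary bitwise instructions evaluates 32 adders at once. The names BitSlices and full_add32 are invented for this illustration.

```cpp
#include <cstdint>

// Result of 32 independent one-bit full adders, one per bit position.
struct BitSlices {
    std::uint32_t sum;    // bit i = sum output of lane i
    std::uint32_t carry;  // bit i = carry output of lane i
};

// Bit i of a, b and c are the three inputs of lane i. No information
// crosses between bit positions, so every lane is computed in parallel
// by the same bitwise operations.
BitSlices full_add32(std::uint32_t a, std::uint32_t b, std::uint32_t c)
{
    BitSlices r;
    r.sum   = a ^ b ^ c;                // odd number of inputs set
    r.carry = (a & b) | (c & (a ^ b));  // at least two inputs set
    return r;
}
```

The same layout works for any boolean function of a few inputs; the cost is that the data has to be arranged one bit per lane beforehand.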
Mea navis aericumbens anguillis abundat
Yes it does.
Well, the processing power of Altivec or MMX/SSE/3DNow or whatever is nowhere near the power of the newest NVidia/ATI card you have surely bought for playing Doom III. Why not use it then? Get the brook compiler! Furthermore, I see they introduce classes like vec, etc. Such classes have already been designed successfully for C++. Why not try porting Blitz to the Altivec and/or to the GPU?
You can defy gravity... for a short time
SIMD programming becomes as easy as this:
He claims that the above code is 17.4x faster than Codewarrior MSL C++, 11.6x faster than gcc libstdc++ and 9.5x faster than Visual C++.
Macstl also provides a cross-platform syntax for using vector registers that is similar to using the native C intrinsics on each platform. So while not all of the native operations are available, his cross-platform "vec" API allows you to write cross-platform code without having to learn both the Altivec and MMX/SSE intrinsics (which is a good solution for someone who knows one platform but not the other).
This story doesn't really mean anything and people are just making up comments.
- Was a great way of dealing with relational data
- Would have to await much larger scales of integration before becoming practical.
Since then the computer world has become much more relational due to relational databases, and the levels of integration have skyrocketed, but no major manufacturer of silicon has bothered to revisit this very simple and powerful route to high power computing. Fortunately there is at least a little ongoing research.
The beauty of these processors is they integrate memory with computation so that the massive economies of scale we witness in memory fabrication apply to computation speeds as well so long as we can move toward relational rather than function computing as a paradigm. Fortunately this appears to be supported by the study of quantum computers, however those computers may never see the light of day for more fundamental reasons.
Seastead this.
It is when programming DSP's (and related devices).
From my experience, yes and no. Fixed-point DSP tends to be done in assembly, mainly because FP techniques don't translate well to C. The compilers also tend to suck. A fair to large amount of floating-point DSP is done with C when the compiler support is good. I have done a lot of floating-point DSP, and we found that the write in C, refine in ASM workflow was best.
Don't forget that microcontrollers outnumber microprocessors by a large margin.
That is true. I usually refer to this as "high" embedded versus "low" embedded systems. Along with DSP, I have spent my career mainly working on large, embedded systems running on microprocessors (Mot 68k, PowerPC, and some MIPS) under control of an RTOS. In this application, assembly doesn't get used as much as you think even when you are dealing with hard realtime requirements.
(S(SKK)(SKK))(S(SKK)(SKK))
the idea that assembler programmers can write better code than a compiler can generate is one of those urban myths that refuses to die. compilers can and do undertake code analysis that no assembler programmer could ever do - like tracing back the control flow through every single branch point to find instances where data has already been precalculated, or hoisting temporaries outside loops in a way that maximises register use over memory hits. undertaking such analysis before coding in assembler would be extremely high risk for an assembler programmer. also would you as an assembler programmer go about inlining all your assembler functions - the code would be unmanageable? how many assembler programmers would know how to reorder their instructions to avoid pipeline stalls? all the knowledge about optimising assembly programs has been incorporated into compiler backends over the years - why wouldn't it have been?
it's been tested - get a program that converts assembler to c and then recompile with optimisation - it *will* run faster.
the only exceptions are where the compiler lacks an algebraic or RTL awareness of an instruction on a specific architecture.
jxxx
It's an interesting discussion, only it's jumping all over the place and touching a whole bunch of interesting points. I've only written (.model small) programs in Assembly (I don't think I'd want to write many larger programs in Assembly), and my recollection is that, given the nature of Assembly (written properly), it will just about always be faster than any compiled code. So why not use it in small doses where applicable?
It will be interesting to compare performance of the macstl library to other "high speed" template libraries like Blitz++ (see http://www.oonumerics.org/blitz/)
And one step further, I am betting you do not perform any sort of graphics programming.
On win32 / mac platforms, the need to know how to do this is pretty low. DirectX wraps most of it, as well as the processes needed for GPU programming. I am sure the Mac libs that do the same job as DirectX accomplish much the same.
But low level graphics programming is alive and well for game programming. I do what I can to stay well clear of that, since I don't like graphics programming much (just personal preference). But the need for this type of programming continues to exist. And it will continue to exist for a while yet.
END COMMUNICATION
The point is that the GPL doesn't specify release behavior for code that isn't distributed so any "program" P developed with regard to the GPL should not reference such release behavior -- hence the substitution principle works.
The main reason is that the AGP bus is designed to move data very quickly to the card, but is not so hot at moving it back again. This should change with PCI Express.
I am TheRaven on Soylent News
Funny thing: it was PRECISELY the topic of an engineering degree internship that I did in the summer of 1997: making a universal C++ template lib for SIMD programming, with application to the IA-32 MMX system. At the time there was already similar work all around, with the introduction of MMX and the popular Alpha architecture that had a similar system. All that to say that it does not sound really new to me.
The more complex the architecture, the greater the need to keep around low level coding. Compilers just can't keep up. During the early days of the PS2 we commonly got 300x performance improvements when switching from high level code to carefully architected and coded assembly. Programmers have gotten lazy and have lost the skills required to maximize the performance on current architectures. If you code carefully you can make sure that you are executing the maximum number of instructions per cycle. When you use a compiler it abstracts you from seeing that if you change your instruction pairing or split off some of the instructions into another pipeline you might get better performance. In school they teach you that the algorithm is the most important thing to look at and that implementation doesn't matter that much, but with today's complex bus architectures, and with everything fighting for control of the bus, if you aren't careful you can end up wasting most of your time waiting for access to data or stalling the instruction pipeline waiting for results of calculations.
In CAPP operations are generally carried out in a bit-serial, word-parallel manner. This is radically different from Cell processor architecture.
SIMD is Single Instruction, Multiple Data.
MMX, SSE, SSE2 and SSE3, and Altivec are this.
Cell is MIMD, Multiple Instruction, Multiple Data.
Cell is an array of small independent CPUs. (think Beowulf cluster on a chip)
Computation is done by systolic arrays or similar parallel processing techniques.
Think of cells in a spreadsheet, where each rectangle performs its computation. A Cell processor allows you to change data at the top of the spreadsheet and compute results at the bottom at GHz speeds!
Granted this isn't good for running an OS, but for video processing, Finite element simulations, Ray Tracing, code breaking, and AI, it's great.
I was working with Chuck Moore on Project Enumera; we laid out a chip with 49 (7x7) asynchronous CPUs. (this is important).
When doing Cell processors with 50 cores you don't want them to run in lockstep. This is akin to why soldiers march out of step when crossing a bridge. It distributes the loading on the power supply lines, rather than creating one big spike when they all switch.
I am always doing that which I can not do, in order that I may learn how to do it. - Pablo Picasso
AltiVec was introduced with the G4 line. The Mac Mini has such a chip. If it were to use a G3 chip, then it wouldn't have AltiVec, but that is not the case.
Autovectorizing does exist. Absoft makes a product called VAST http://www.absoft.com/Products/Libraries/vast.html
It works on C, Fortran, and C++. I've seen some reasonable performance gains from just a recompile.
a grand unified theory would use SIMD for distribution, not just exploiting a shallow local vector unit. like ZPL, or the connection machine languages. in a manner that allows you to exploit scale.
no one calls a cray X1 SIMD, but it's a lot closer than altivec.
I have to agree with the "insightful" poster - he/she/it is describing optimizing using assembly, not C, and trying to optimize code in assembler is a nightmare these days. You've got deep pipelines, multiple execution units, parallel processing - it's just ugly and requires a deep understanding of how to reorder the instructions to avoid pipeline stalls as well as keeping all the parallel units working at the same time. Optimizing your C code for better assembly is a different thing entirely (unrolling loops, aligning bitmaps on power-of-2 boundaries for faster copies, putting most common loop choice first, etc.).
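To make the "unrolling loops" item above concrete, here is a small sketch of a manually unrolled reduction in C++. The function name sum_unrolled and the choice of four accumulators are assumptions made for illustration; an optimizing compiler may well perform this transformation itself, and for floating-point data the regrouped additions can round differently from a plain left-to-right loop.

```cpp
#include <cstddef>

// Sums n floats using four independent accumulators. Keeping the partial
// sums separate removes the dependency of each addition on the previous
// one, which gives a pipelined or vector unit more work to overlap.
float sum_unrolled(const float* data, std::size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;

    // Main loop: four elements per iteration.
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }

    float total = (s0 + s1) + (s2 + s3);

    // Tail loop: the remaining 0 to 3 elements.
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

This is the kind of change that is only worth keeping if profiling shows a gain on the target compiler and processor.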
About the best you can hope for is to tweak C code with assembly blocks (usually after extensive profiling). Even then, you really don't see the huge performance improvements these days (nothing like 5 years ago).
I used to be excited about the future of this tech, but to me, GPU tech has become far more important (heck, the latest stuff I've been programming barely sucks up 800MHz of my CPU but throttles the GPU).
That's no fault of the maintainer - rough times are, well, rough. Nonetheless, it will take a long time before ATLAS is again a serious option for the kinds of problems it is good at solving. It's a pity, but it's too big a project for one person, and other projects (albeit non-free, in any sense) are putting in far more resources than that.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
The exact function may be slightly different -- a vector processor is far more flexible -- but it's still a special-purpose unit that drastically speeds up a few simple operations on reasonably large amounts of data, often used for graphical operations.
Interesting how so many ideas in computing are just developments of previous ones...
Ceterum censeo subscriptionem esse delendam.
because processor speeds have increased to such an extent (Moore's Law), it doesn't make sense to use assembly to write modern code; even if the assembly code is faster.
More efficient code will run in less time, letting the CPU stay in an idle state more often. This can reduce power consumption, especially on battery-constrained devices. How many watts does your l33t-0-fast processor draw again, and how long would it run on a pair of AAs?
For the same reason that general purpose computation isn't done on your GPU. A GPU gets its performance from being able to do the same small task to a whole lot of data. A CPU needs to do a bunch of tasks to a small bit of data.
So, you need to multiply two vectors of a thousand floats each. Can the GPU do it faster? Yes. But not really, because there is an astounding minimum latency before the results of that computation can be recovered from the GPU. It's a *deep* pipeline. Even though the CPU will spend longer calculating, the results will be available immediately, and you have much better turnaround time. If you were doing a million multiplies, the answer would be different. But outside of image processing/DSP work, you rarely find such operations.
Altivec (and MMX/SSE2/SSE3, although less useful) sits nicely in the middle, allowing you to operate on larger pieces of data in parallel without incurring the latency of a GPU operation, allowing excellent performance gains in quite a few common situations.
I'm looking for such a library, with GPL/LGPL compatible license. The API has to be in C, to maximise audience. For many projects, C++ is not an option.
Primary use will be DSP work in the GNU Radio project, but multimedia extensions could prove useful anywhere from GUIs to audio/video apps, etc.
I would take any pointers to an already existing API/project, or be ready to start a new one if other people are interested.