Grand Unified Theory of SIMD

Altivec by BWJones · 2005-02-07 04:31 · Score: 5, Informative

For those who want a little background on Altivec, of course Wiki has a description here. Apple, who now ships Altivec in every system they make, has a pretty good page here, and Freescale (nee Motorola) has one here.

The benefits of Altivec can be truly astounding for those processes that can be "vectorized". After all putting these kinds of calculations in hardware has got it all over software computation. It kind of reminds me of when I got one of those Photoshop accelerator hardware cards (Radius Photoengine with 4 DSPs on a daughter card linked to the Thunder series video card) for my IIci. Photoshop filter functions ran faster on that IIci than they did on much later PowerPC systems simply because you now had four hardware DSPs running your image math.

--
Visit Jonesblog and say hello.

Re:Altivec by shawnce · 2005-02-07 04:39 · Score: 4, Informative

Just pick a few items out ...

Apple provides source code for some of their vector libraries
Re:Altivec by wulfhound · 2005-02-07 04:40 · Score: 1

Yes it does.. it's a G4, all G4s have Altivec.
Re:Altivec by mod_critical · 2005-02-07 04:42 · Score: 3, Informative

Altivec == Velocity Engine

And is part of every G4
Re:Altivec by baryon351 · 2005-02-07 04:44 · Score: 2, Interesting

It kind of reminds me of when I got one of those Photoshop accelerator hardware cards (Radius Photoengine with 4 DSPs on a daughter card linked to the Thunder series video card) for my IIci. Photoshop filter functions ran faster on that IIci than they did on much later PowerPC systems simply because you now had four hardware DSPs running your image math.

I managed to pick up a ThunderIV last year with the DSP card, and had a run around with photoshop on it. It's impressive stuff. I have an iMac 350 here I also ran photoshop on, and while the 350 kicked the Thunder in a Quadra for many unaccelerated things, on those operations where the DSPs kicked in (and the card has those cool little LEDs to show just when it's happening) it could keep up with the iMac nearly neck & neck.

That's a 25MHz 68040 from 1992 and Thunder IVGX vs a 350MHz G3 from 2000. Very cool.
Re:Altivec by skraps · 2005-02-07 05:12 · Score: 1

"Wiki" != "WikiPedia".
For more, read http://en.wikipedia.org/wiki/Wiki.

--
Karma: -2147483648 (Mostly affected by integer overflow)

More AltiVec Goodness by LordRPI · 2005-02-07 04:33 · Score: 4, Informative

Apple has had AltiVec optimized libraries for DSP and such since the early releases of OS X.

Re:More AltiVec Goodness by goMac2500 · 2005-02-07 04:51 · Score: 1

How is parent flamebait? It's a fact, and it's not flamebait considering Apple is one of the only companies currently shipping Altivec systems.
Re:More AltiVec Goodness by bryanzak · 2005-02-07 06:41 · Score: 3, Insightful

One of the problems of using libraries though is that the overhead of a function call usually negates any gain in vectorization. The lib call messes all kinds of things up, including instruction flow and caching, etc.
Re:More AltiVec Goodness by Woody77 · 2005-02-07 11:20 · Score: 1

inline is your friend.

Umm by TheKidWho · 2005-02-07 04:35 · Score: 2, Informative

Doesn't XCode have a feature that lets you "vectorize" certain parts of your code already?

Re:Umm by Richard_at_work · 2005-02-07 05:18 · Score: 2, Informative

The next version of Xcode will support autovectorisation, but I don't think it does it atm.
Re:Umm by HeghmoH · 2005-02-07 05:37 · Score: 1

No.

--
Mod down posts with a "Free Mac Mini/iPod" sig, they're spam!

A little background by xXunderdogXx · 2005-02-07 04:35 · Score: 4, Informative

From the Wikipedia article on SIMD:

An example of an application that can take advantage of SIMD is one where the same value is being added to a large number of data points, a common operation in many multimedia applications. One example would be changing the brightness of an image. Each pixel of an image consists of three 8-bit values for the brightness of the red, green and blue portions of the color. To change the brightness, the R G and B values are read from memory, a value is added (or subtracted) from it, and the resulting value is written back out to memory.

With a SIMD processor there are two improvements to this process. For one the data is understood to be in blocks, and a number of values can be loaded all at once. Instead of a series of instructions saying "get this pixel, now get this pixel", a SIMD processor will have a single instruction that effectively says "get all of these pixels" ("all" is a number that varies from design to design). For a variety of reasons, this can take much less time than it would to load each one by one as in a traditional CPU design.
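
As a rough illustration of the brightness example quoted above, here is a minimal C++ sketch that does the same work one value at a time and then in blocks. It is plain portable C++, not real SIMD code: the inner loop over kLanes only stands in for what a vector unit would do with a single instruction. The names clamp_channel, brighten_scalar, brighten_blocks and kLanes are made up for illustration, and clamping to 0..255 is an assumption (the quoted text does not say what happens on overflow).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamp to the 0..255 range of an 8-bit channel (an assumption: the quoted
// text does not say how overflow should be handled).
inline std::uint8_t clamp_channel(int v) {
    return static_cast<std::uint8_t>(std::min(255, std::max(0, v)));
}

// "Get this pixel, now get this pixel": one channel value per iteration.
inline void brighten_scalar(std::vector<std::uint8_t>& channels, int delta) {
    assert(delta >= -255 && delta <= 255);
    for (std::uint8_t& c : channels) c = clamp_channel(c + delta);
}

// "Get all of these pixels": the buffer is walked in fixed-size blocks.
constexpr std::size_t kLanes = 16;  // hypothetical block size, one 128-bit register of bytes

inline void brighten_blocks(std::vector<std::uint8_t>& channels, int delta) {
    assert(delta >= -255 && delta <= 255);
    const std::size_t n = channels.size();
    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes) {
        // A SIMD unit would perform these kLanes additions with one instruction.
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            channels[i + lane] = clamp_channel(channels[i + lane] + delta);
        }
    }
    // Leftover values that do not fill a whole block.
    for (; i < n; ++i) channels[i] = clamp_channel(channels[i] + delta);
}
```

Both functions should produce identical output; the block version just makes the "load a group, operate on the group" structure explicit, including the leftover tail that real SIMD code also has to deal with.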

But of course I'm sure everyone here knew that..

Re:A little background by Bisqwit · 2005-02-07 05:04 · Score: 1

An example of an application that can take advantage of SIMD is one where the same value is being added to a large number of data points, a common operation in many multimedia applications.

How is this different for MMX?
Because I thought MMX does exactly what you described.
Re:A little background by Gr8Apes · 2005-02-07 05:08 · Score: 1

Evidently not ;)

--
The cesspool just got a check and balance.
Re:A little background by xXunderdogXx · 2005-02-07 05:10 · Score: 1

If I'm not mistaken, wouldn't MMX be an implementation of SIMD?
Re:A little background by DLWormwood · 2005-02-07 05:23 · Score: 3, Informative

How is this different for MMX?
Based on personal recollections reinforced by a quick Wiki'ing, MMX's problem wasn't the concept itself, but Intel's braindead constraints placed on x86 support for vectors. MMX recycled the same registers as used for floating point math, causing expensive context switches between each mode and only allowing integer math to be vectorized. Intel eventually developed SSE to work around some of the bottlenecks, but the eventual dominance of GPUs on the PC platform reduced the development priority for vector math in the CPU.

--
Those who complain about affect & effect on /. should be disemvoweled
Re:A little background by at_18 · 2005-02-07 06:27 · Score: 1

MMX is an integer-only implementation of SIMD. It was also problematic because it didn't have its own registers, but re-used the floating point ones of the CPU. SSE is a floating-point implementation of SIMD with its own registers.
Re:A little background by Dominic_Mazzoni · 2005-02-07 06:54 · Score: 4, Informative

Quick summary:

MMX (x86): 8-byte registers, only integer operations
SSE (x86): 16-byte registers, single-precision float ops
AltiVec (PPC): 16-byte registers, integer and single-precision float ops
SSE2 (x86): 16-byte registers, double-precision float ops

In order to implement many complex algorithms on x86, you need to use a motley combination of MMX and SSE. There are many flaws in both; lots of very useful instructions are missing, and MMX can't be used in conjunction with non-SIMD floating-point operations without a huge expensive context switch. One of the biggest flaws in MMX/SSE that I found was the lack of instructions to shuffle data around within a (8-byte or 16-byte) register. The only advantage on a modern x86 CPU is SSE2, which is the only SIMD unit with double-precision floats. But you can only work with two doubles at a time, so the speedup is not that great.

AltiVec, on the other hand, included both floats and integers right from the start, with no penalty for switching between them, and it includes a very detailed and useful set of instructions, including an awesome shuffle instruction. My personal experience, coding for both, is that AltiVec is about twice as useful as MMX/SSE/SSE2 combined.
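
For anyone wondering what a shuffle instruction buys you, here is a small portable C++ emulation of a two-source byte permute, which is how I understand AltiVec's vec_perm to behave: each control byte's low 5 bits pick one byte out of the 32 bytes formed by the two inputs. This is only a sketch; Bytes16, permute_bytes and permute_example are names invented for illustration, and real code would use the intrinsic rather than a loop.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;  // stand-in for one 16-byte register

// Emulated two-source byte permute: indices 0..15 select from a,
// indices 16..31 select from b. Only the low 5 bits of each control byte count.
inline Bytes16 permute_bytes(const Bytes16& a, const Bytes16& b, const Bytes16& control) {
    Bytes16 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        const std::size_t idx = control[i] & 0x1F;
        out[i] = idx < 16 ? a[idx] : b[idx - 16];
    }
    return out;
}

// Usage: reversing the bytes of a register is just a matter of the control vector.
inline void permute_example() {
    Bytes16 a{}, b{}, reverse{};
    for (std::size_t i = 0; i < 16; ++i) {
        a[i] = static_cast<std::uint8_t>(i);            // 0..15
        b[i] = static_cast<std::uint8_t>(100 + i);      // 100..115
        reverse[i] = static_cast<std::uint8_t>(15 - i); // pick a[15], a[14], ...
    }
    const Bytes16 r = permute_bytes(a, b, reverse);
    assert(r[0] == 15 && r[15] == 0);
}
```

The same call with a different control vector interleaves, broadcasts or realigns data, which is why a general permute covers so many cases that otherwise need several special-purpose instructions.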

Also, note that in Mac OS X, many of the standard libraries and system calls are already AltiVec-optimized for you, and Apple also provides a great Vector library with lots of common DSP operations.
Re:A little background by TheRaven64 · 2005-02-07 08:17 · Score: 2, Informative

As well as the vDSP libraries, Apple also provide a set of wrapper functions around the vector instructions. These expose the instructions directly, but let the compiler handle register allocation, making using AltiVec directly very easy.

--
I am TheRaven on Soylent News
Re:A little background by julesh · 2005-02-08 01:37 · Score: 1

Your list misses:

3DNow! (AMD-x86): 8 byte registers, single precision floats

Possibly a footnote of history now, but worth mentioning as a fairly significant proportion of processors support it.

One of the biggest flaws in MMX/SSE that I found was the lack of instructions to shuffle data around within a (8-byte or 16-byte) register.

You mean like the PSHUF* family of instructions? Or something else?
Re:A little background by excessive · 2005-02-08 23:51 · Score: 1

"Getting all of these pixels" is going to take the same length of time (the processor has to have the data locally to act on it, and there is still the same number of buses from RAM into the chip); it's just the arithmetic on the values that is done in parallel.
Also, IIRC, brightness is a scaling function... (But that's just me being picky)

16X increase? by Sensible+Clod · 2005-02-07 04:36 · Score: 1

Okay, I'm willing to believe it, but only if someone shows how that's possible.

--

The difference between spam and poop is that you don't have to dig through septic tanks looking for real food. -- Me

Re:16X increase? by mirko · 2005-02-07 04:39 · Score: 2, Interesting

When using Reason 3, some virtual synths have the option to produce an enhanced sound.
What is curious is that if you are using a pre-Altivec proc (G3), it'll burn more CPU time while the same enhancement will be totally and natively supported by Altivec-enabled units: a 400MHz G4 Powerbook is enhancing these synths more efficiently than an 800MHz G3.
I guess this was like the simultaneous operations that the ARM assembly language supports (e.g. both storing and rotating values in an operation)...

--
Trolling using another account since 2005.
Re:16X increase? by LordRPI · 2005-02-07 04:43 · Score: 5, Informative

The principle behind SIMD, or, rather, Single Instruction Multiple Data, is that you can process wide arrays of values in a single instruction. With the PowerPC version of SIMD, also known as AltiVec, you can issue an instruction and have it work with a 128-bit wide register. These registers may contain up to 4 32-bit numbers, 8 16-bit numbers or 16 8-bit numbers. For example, I can load two AltiVec registers with 16 unsigned chars, add them together using vec_add() and have it return its results to an AltiVec register. So this in essence is adding 16 values at once and in theory it's good enough for marketing to claim a 16X speedup, but this is rarely the case.
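
To make the lane counts concrete, here is a portable C++ sketch that models a 128-bit register as 16 / sizeof(T) lanes and adds two of them lane by lane. It does not use the AltiVec intrinsics, so it should compile anywhere; Vec128, lane_add and add_example are hypothetical names for illustration. The wrap-around on overflow mirrors what I understand plain vec_add to do for unsigned chars (the saturating variant being a separate operation).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// A 128-bit "register" modeled as 16 / sizeof(T) lanes of type T.
template <typename T>
struct Vec128 {
    static_assert(std::is_unsigned<T>::value, "sketch covers unsigned lanes only");
    static constexpr std::size_t lanes = 16 / sizeof(T);
    std::array<T, lanes> v{};
};

static_assert(Vec128<std::uint32_t>::lanes == 4, "four 32-bit numbers");
static_assert(Vec128<std::uint16_t>::lanes == 8, "eight 16-bit numbers");
static_assert(Vec128<std::uint8_t>::lanes == 16, "sixteen 8-bit numbers");

// Lane-wise add with wrap-around; hardware would do all lanes in one instruction.
template <typename T>
Vec128<T> lane_add(const Vec128<T>& a, const Vec128<T>& b) {
    Vec128<T> r;
    for (std::size_t i = 0; i < Vec128<T>::lanes; ++i) {
        r.v[i] = static_cast<T>(a.v[i] + b.v[i]);
    }
    return r;
}

// Usage: sixteen unsigned chars added "at once".
inline void add_example() {
    Vec128<std::uint8_t> a, b;
    for (std::size_t i = 0; i < 16; ++i) {
        a.v[i] = 250;
        b.v[i] = static_cast<std::uint8_t>(i);
    }
    const Vec128<std::uint8_t> sum = lane_add(a, b);
    assert(sum.v[5] == 255);  // 250 + 5
    assert(sum.v[6] == 0);    // 250 + 6 wraps past 255
}
```

The 16X figure is just the lane count for 8-bit data; with 32-bit values the same register only holds four lanes, so the theoretical ceiling depends on the element size.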
Re:16X increase? by Anonymous Coward · 2005-02-07 05:51 · Score: 2, Informative

The concept, and radical performance boost, is in line (pardon the pun) with Expression Templates for C++.

A good example is what happens when you let the compiler decide how to do arithmetic with vectors and matrices.

Matrix a,b,c,x;
x = a + b + c;

The naked compiler, in combination with your custom Matrix class, will probably unwind the operator overloads to do something like this:
// assuming a reasonable STL w/function inlining
Matrix __t1;
for(int i=0; i<a.width; i++){
    for(int j=0; j<a.width; j++){
        __t1[i][j] = a[i][j] + b[i][j];
    }
}
Matrix __t2;
for(int i=0; i<b.width; i++){
    for(int j=0; j<c.width; j++){
        __t2[i][j] = __t1[i][j] + c[i][j];
    }
}
x = __t2;
All those temporary copies and inlined loops really kill performance.

Now, with an expression library, it handles each arithmetic expression discretely by type. By treating the expressions, as well as the types involved, you can do more sophisticated things. In this case, the Expression Template Library solves the problem thusly:
// using ETL
for(i=0; i<a.length; i++){
    x[i] = a[i] + b[i] + c[i];
}
Here the library has carnal knowledge of the data structures involved as well as order of operations to come to such a succinct solution.
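
For the curious, here is a minimal sketch of how an expression template library can arrive at that single fused loop. It is not taken from MACSTL or Blitz++; the names Expr, Sum and Vec are invented for illustration, and it uses a 1-D vector of doubles rather than a full Matrix to keep it short. The idea is that a + b + c builds a lightweight tree of Sum nodes that compute nothing, and the assignment operator then walks that tree once per element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base so that operator+ below only matches our own expression types.
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Lazy node for lhs + rhs; nothing is computed until an element is requested.
template <typename L, typename R>
struct Sum : Expr<Sum<L, R>> {
    const L& lhs;
    const R& rhs;
    Sum(const L& l, const R& r) : lhs(l), rhs(r) {}
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec : Expr<Vec> {
    std::vector<double> data;
    explicit Vec(std::size_t n, double fill = 0.0) : data(n, fill) {}
    double operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i) { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning from any expression runs one loop and creates no temporaries.
    template <typename E>
    Vec& operator=(const Expr<E>& e) {
        const E& expr = e.self();
        assert(expr.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data[i] = expr[i];
        return *this;
    }
};

// Builds the expression tree instead of computing a result.
template <typename L, typename R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Sum<L, R>(l.self(), r.self());
}
```

With this in place, x = a + b + c; on four Vec objects of equal size evaluates a[i] + b[i] + c[i] inside a single loop. One caveat of holding operands by reference: the expression should be consumed in the same statement, since storing it with auto would leave references to temporaries dangling.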

In the case of MACSTL, it's still using these principles of "vectorizing" the expressions as well as unrolling and other traditional optimization techniques. It's also going the extra mile and using processor-specific code and/or C code that targets PPC *extremely* well. For example, the above example would optimize well using Altivec, due to the platform's built-in vector type; you wouldn't even need a loop for adding several 'vec' instances.

I wish I knew enough about MACSTL and altivec to give a hard example of a 16X speedup. I hope this gets you closer to seeing at least *where* the reducible overhead is coming from. :)

Check out Blitz++'s papers listing for more info:
http://www.oonumerics.org/blitz/papers/
Re:16X increase? by sribe · 2005-02-07 07:02 · Score: 2, Interesting

So this in essence is adding 16 values at once and in theory it's good enough for marketing to claim a 16X speedup, but this is rarely the case.

There are 32 of these registers (independent, not shared with the FPU) which means you can chain together a pretty complex series of calculations without intermediate load/store sequences. The unit has multiple independent computation units with their own dispatch queues (details vary between specific processor models). Some AltiVec opcodes are designed to replace common series of multiple scalar instructions.

The result is that speed ups of more than 16x are not at all rare. 30x is not uncommon in graphics manipulations; I would venture to say that 100x is "rarely the case." ;-)
Re:16X increase? by tepples · 2005-02-07 15:31 · Score: 1

This is of course assuming that you're not already using an old optimization trick of using a word to store four bytes.

I've written a GBA audio mixer that uses a technique similar to this, but the carry from byte to byte within a machine word often gets in the way of hardcore vectorization.
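
For reference, the usual way around the carry problem when packing four bytes into a word looks something like the sketch below. This is the generic masking trick, not the code from my mixer; add_packed_u8 and add_packed_example are names made up for illustration. The low 7 bits of every byte are added normally (their sum cannot spill into the next byte), and the top bit of each byte is patched in afterwards with XOR.

```cpp
#include <cassert>
#include <cstdint>

// Adds four 8-bit lanes packed into one 32-bit word. Each lane wraps on its
// own; no carry is allowed to cross a byte boundary.
inline std::uint32_t add_packed_u8(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t low7 = 0x7F7F7F7Fu;  // bits 0..6 of every byte
    const std::uint32_t high = 0x80808080u;  // bit 7 of every byte
    // Sum of the low 7 bits; bit 7 of each byte now holds the carry into it.
    const std::uint32_t sum_low = (a & low7) + (b & low7);
    // Bit 7 of the result is a7 XOR b7 XOR carry-in.
    return sum_low ^ ((a ^ b) & high);
}

// Usage: lanes from high to low are 0x01+0x01, 0xFF+0x01, 0x7F+0x01, 0x80+0x80.
inline void add_packed_example() {
    assert(add_packed_u8(0x01FF7F80u, 0x01010180u) == 0x02008000u);
}
```

The extra masking and XOR are the cost of keeping lanes independent in an ordinary register, which a real SIMD unit does for free.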

Long thread about using Altivec by ThousandStars · 2005-02-07 04:37 · Score: 4, Informative

The Mac forum at Ars Technica has a long, continuing post about Altivec optimizations and how they should be used. The thread started more than two years ago and still gets relevant points and questions added to it. It's an amazing resource if you're interested in starting.

Moore's Law has eroded the need for assembly by betelgeuse68 · 2005-02-07 04:38 · Score: 1, Interesting

Moore's Law has eroded the need for such knowledge. It would be like concerning myself on how to design circuits to convert a DC current to AC current because I happen to use devices that use electricity, e.g., my toaster (as in bread).

I learned assembly long ago, still retaining a fair amount of it (80x86). There have been a few occasions where I've called upon its use, yeah twice in the last eight years... and that's about it.

Yes some people who write games are still concerned with assembly as are people in embedded markets. But those jobs, situations and skills are niche, much like the Win32 programming I used to do in the early 90's.

90% of IT jobs are with non-tech companies. Those situations are about the last place you will find anyone caring about something called "assembly language."

-M

Re:Moore's Law has eroded the need for assembly by geoffspear · 2005-02-07 04:48 · Score: 2, Funny

99% of all jobs in the world require no programming at all. Therefore, there is no need for anyone anywhere to learn C.
90% of the world's people do not own cars. Therefore, there is no need for gas stations. If you pick a living human completely at random from the earth, chances are they don't drive one of these "car" things.

--
Don't blame me; I'm never given mod points.
Re:Moore's Law has eroded the need for assembly by bonch · 2005-02-07 04:55 · Score: 1

Yes some people who write games are still concerned with assembly as are people in embedded markets. But those jobs, situations and skills are niche, much like the Win32 programming I used to do in the early 90's.

I don't consider Doom 3 to be a niche.
Re:Moore's Law has eroded the need for assembly by lowe0 · 2005-02-07 05:00 · Score: 2, Insightful

Which is exactly why this sort of thing is so important.

Sure, you could probably get it to work even faster with hand-tuned assembly than simply using this library. But programmer time is expensive, and customizing code adds complexity. By reusing optimized code, you can enjoy some of the benefits of SIMD without having to devote the same amount of resources.

Let's be honest, this isn't a silver bullet - this isn't going to speed up code that doesn't use lots of floating-point vectors anyway. But if it does... (nearly) free performance is always a good thing.
Re:Moore's Law has eroded the need for assembly by betelgeuse68 · 2005-02-07 05:14 · Score: 1

Sure, and you and everyone you know is working on Doom3 or a competitor?

Just because you use it, doesn't mean you engineer it.

You use a TV... when was the last time you even thought of any of the electronics inside of it?

-M
Re:Moore's Law has eroded the need for assembly by groomed · 2005-02-07 06:01 · Score: 3, Insightful

Sorry, but yours is an utterly kneejerk boilerplate response which has nothing to do with the topic at hand and only serves to establish your credentials as a hard nosed realist who has been there and done it.

Moore's Law has eroded the need for such knowledge

Moore's "law" (which is just an off-the-cuff observation, really) has nothing to do with this. If anything, Moore's law has enabled transistor and space devouring SIMD technology.

It would be like concerning myself on how to design circuits...

No, it's nothing like that at all. Just because you own and know how to use money doesn't mean there is no point to the complex financial reckonings that are made every day at institutions all over the world. You may not need it, but "you" is not what's under discussion.

Yes some people who write games are still concerned with assembly as are people in embedded markets. But those jobs, situations and skills are niche

By this definition, everything is niche. The whole computing industry becomes "niche". Farming is "niche". The paper industry is "niche". What you're describing is just non-descript white collar administrative work which just happens to involve a computer; bit shuffling, rather than paper shuffling.

Those situations are about the last place you will find anyone caring about something called "assembly language."

Again, completely irrelevant.

The point is that with a few dozen lines of SIMD code (whether in assembly or some high level language) any reasonably competent programmer can achieve four-fold, ten-fold, even twenty-fold speedups on critical path code, from scratch, in as little as a week.

These are amazing results, and people should be encouraged to investigate the possibilities, not be dragged down into this drab netherworld of yours.
Re:Moore's Law has eroded the need for assembly by afidel · 2005-02-07 06:07 · Score: 1

Doing things like transcoding/encoding of multimedia content is one of those "niche" areas where assembly is still needed. Whether it takes 1.5 hours or 3 hours to transcode a movie is a BIG deal, especially if you have to do it many times to archive a library of old content. Sure most programmers won't need it, ever, but that's been pretty much true since we got high level languages and computers got more than a couple hundred K of RAM.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:Moore's Law has eroded the need for assembly by Crazy+Eight · 2005-02-08 00:52 · Score: 1

IIRC, D3 was written in C++.

License issues by IO+ERROR · 2005-02-07 04:39 · Score: 5, Informative

Be careful; the "open source" license (PDF) is not GPL-compatible. I don't even think it's BSD-compatible on first reading.

The Reciprocal Public License requires you to release all of your source code if you link to this library, even if your project is personal or used in-house only.

--
How am I supposed to fit a pithy, relevant quote into 120 characters?

Re:License issues by voxlator · 2005-02-07 04:51 · Score: 2, Interesting

True, but only if you don't purchase a license.

Simple to understand; if you use it for free, you're expected to release your source code (i.e. the 'reciprocal' part of RPL). If you pay to use it, you don't have to release your source code.

--#voxlator
Re:License issues by IO+ERROR · 2005-02-07 05:04 · Score: 3, Informative

Simple to understand; if you use it for free, you're expected to release your source code (i.e. the 'reciprocal' part of RPL). If you pay to use it, you don't have to release your source code.

True enough, but using the proprietary license makes it impossible to use this in existing projects without changing the license. Suddenly your open source project is either no longer open source, or doesn't look so attractive.
One of the nicest features of the GPL (and, to be fair, of the BSD license) is that you do not have to release source code if you don't distribute your software. This RPL requires you to release your source code even if you don't distribute your software. And the proprietary license simply isn't appropriate for any type of open source project.
The guy wants to get paid, and that's fine, I want to get paid, too. But he's got no business telling me I have to distribute my source code for an internal project that will never be distributed. He could easily have used a method similar to Trolltech's dual-licensing, but he chose instead to do something a whole lot more obnoxious.

--
How am I supposed to fit a pithy, relevant quote into 120 characters?
Re:License issues by IO+ERROR · 2005-02-07 05:21 · Score: 2, Informative

It sounds like the GPL virus to me.

Look, a troll! The GPL doesn't require you to release your code, unless you distribute it. This RPL thing requires you to release your code, even if you don't distribute it. I've discussed the linking issue elsewhere.

--
How am I supposed to fit a pithy, relevant quote into 120 characters?
Re:License issues by RupW · 2005-02-07 05:24 · Score: 1

The Reciprocal Public License requires you to release all of your source code if you link to this library, even if your project is personal or used in-house only.

IANAL, but I read the intent as "if you improve macstl you have to publish your changes to macstl" not "if you link macstl you have to publish source to the entire project".

Obviously I can't say which one matches the legalese.
Re:License issues by IO+ERROR · 2005-02-07 05:26 · Score: 1

If you have an existing work that you can optionally combine with the RPL licensed software, it is unlikely that a court would consider your existing work a derivative of the RPL software.

With C++ templates this is a very thorny issue. When your code instantiates the template, the library code is very inextricably an integral part of your code, and not easily (if at all) separable. This might be a different issue if it were a C library you could just call through an API.
Currently under the GPL/LGPL this situation requires a special exception in the template library's license.

--
How am I supposed to fit a pithy, relevant quote into 120 characters?
Re:License issues by jemfinch · 2005-02-07 08:35 · Score: 1

I don't even think it's BSD-compatible on first reading.

What, does it require you to remove other copyright notices on the file? Does it require that the name of the author be used in advertisement of software that uses it? Does it require that the author be liable for damages?

If not, it's BSD compatible. You'll be hard-pressed to find a license that's not BSD compatible.

Jeremy

--
Looking for a Python IRC bot?

oops by Anonymous Coward · 2005-02-07 04:42 · Score: 1, Informative

Typo...
Propellerheads.SE

Moore's Law has nothing to do with assembly by Anonymous Coward · 2005-02-07 04:42 · Score: 2, Insightful

Moore's Law has eroded the need for assembly

Moore's Law has nothing to do with assembly language and optimizations. From Wikipedia:

Moore's law is an empirical observation stating, in effect, that at our rate of technological development and advances in the semiconductor industry, the complexity of integrated circuits doubles every 18 months.

I wish people would stop saying "But Moore's Law..." for every hardware-related story on Slashdot. Do a bit of reading, please.

Re:Moore's Law has nothing to do with assembly by asliarun · 2005-02-07 07:43 · Score: 1

You misunderstood.

>> Moore's Law has eroded the need for assembly

> Moore's Law has nothing to do with assembly language and optimizations. From Wikipedia:...

The grandparent was saying that because processor speeds have increased to such an extent (Moore's Law), it doesn't make sense to use assembly to write modern code; even if the assembly code is faster.
Re:Moore's Law has nothing to do with assembly by tzanger · 2005-02-07 13:05 · Score: 1

Moore's Law has absolutely nothing to do with the speed of integrated circuits, it is talking about the complexity of the designs doubling roughly every 18 months. Complexity doesn't necessarily mean speed.

About the RPL by pavon · 2005-02-07 04:47 · Score: 4, Informative

The RPL (Reciprocal Public License) is an odd choice for this project. It is an even stronger viral copy-left than the GPL, to the point where the FSF takes issue with it. If you create a derivative work you are required to 1) notify the original author, and 2) publish your changes even if you only use the program in house. Furthermore, their definition of derivative work is much, much broader than the "linking" definition that the GPL uses.

The fact that it puts these additional requirements / restrictions on the user makes it incompatible with the GPL. In fact, considering the requirements placed on you by the license, I would expect that you will have difficulty incorporating this RPL library into any existing FLOSS project without running into license conflicts. The only thing I can see this being useful for is a new project that you don't mind releasing under the RPL, or with existing BSD style licensed code which you dual license as BSD/RPL (since BSD can be included in anything).

So this library does not appear to be very usable for the FLOSS world, although if you want to license it for proprietary software you may.

Re:About the RPL by geoffspear · 2005-02-07 04:53 · Score: 2, Informative

Clearly, we need to get everyone in the world to download the source, make one superficial change, and email the entire thing back to the original developer.
And what happens if the original developer dies? Is everyone prohibited from using his code until the copyright runs out in 95 years, as they can't notify him of changes?

--
Don't blame me; I'm never given mod points.
Re:About the RPL by MenTaLguY · 2005-02-07 05:22 · Score: 1

And what happens if the original developer dies? Is everyone prohibited from using his code until the copyright runs out in 95 years, as they can't notify him of changes?

Yes, unless he has an identifiable successor-in-interest.

--

DNA just wants to be free...
Re:About the RPL by Baldrson · 2005-02-07 05:24 · Score: 1

The fact that it puts these additional requirements / restrictions on the user makes it incompatible with the GPL.
It's no more incompatible than is a class that overrides a method of a superclass "incompatible" with that superclass. In this instance, the release "method" is more strict.

--
Seastead this.
Re:About the RPL by HeghmoH · 2005-02-07 05:30 · Score: 1

#1 is understandable, if odd, but #2 is just ridiculous. In-house use doesn't fall under copyright protection to begin with, so how can the RPL regulate it?

--
Mod down posts with a "Free Mac Mini/iPod" sig, they're spam!
Re:About the RPL by pavon · 2005-02-07 06:00 · Score: 1

Yes, clearly the world of dental hygiene is not ready for such a radical license! I wonder if he means Free / Open Source Software, though what the additional "L" stands for is anyone's guess...

Libre. Although it is understandable why many people have decided to drop the L from that acronym :)
Re:About the RPL by CableModemSniper · 2005-02-07 06:34 · Score: 1

You might not realize it, but that example is actually agreeing with your parent. A subclass that does that is breaking the substitution principle.
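
A minimal C++ sketch of what I mean, with made-up class names (Logger, PositiveOnlyLogger) that have nothing to do with either license: code written against the base class's contract stops working when it is handed a subclass whose override is stricter about what it accepts.

```cpp
#include <cassert>
#include <stdexcept>

struct Logger {
    virtual ~Logger() = default;
    // Base contract: any value is accepted.
    virtual void record(int value) { last = value; }
    int last = 0;
};

struct PositiveOnlyLogger : Logger {
    // Stricter override: rejects input the base class promised to accept.
    void record(int value) override {
        if (value < 0) throw std::invalid_argument("negative value");
        Logger::record(value);
    }
};

// Caller that relies only on the base contract.
inline bool record_negative(Logger& log) {
    try {
        log.record(-1);
        return true;
    } catch (const std::invalid_argument&) {
        return false;
    }
}

inline void substitution_example() {
    Logger base;
    PositiveOnlyLogger strict;
    assert(record_negative(base));     // the base honours its contract
    assert(!record_negative(strict));  // the subclass cannot stand in for it
}
```

That is the sense in which "more strict" means "not substitutable": anything that was fine under the looser terms is not automatically fine under the tighter ones.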

--
Why not fork?
Re:About the RPL by phliar · 2005-02-07 07:09 · Score: 1

In-house use doesn't fall under copyright protection to begin with
False. You may be confusing in-house use with the doctrine of fair use.

--
Unlimited growth == Cancer.
Re:About the RPL by HeghmoH · 2005-02-07 07:29 · Score: 1

You're right, I wasn't thinking. Wide-scale internal use would in fact be governed by the RPL. Small-scale use that fell under fair use would not.

--
Mod down posts with a "Free Mac Mini/iPod" sig, they're spam!
Re:About the RPL by Abcd1234 · 2005-02-07 09:53 · Score: 1

Actually, I believe it's the other way around. i.e., the FOSS acronym existed first, and someone thought "But that's not, like, a word, and not *nearly* redundant enough. It needs something... another letter, I think. Hey, what is this 'french' dictionary I see here? Don't they make fries or something? Ahh well, let's check it out. Hey, here we go: 'libre'! Walla... FLOSS!" And so another stupid acronym was born...

Black Art? Uh... by arekusu · 2005-02-07 04:47 · Score: 3, Interesting

"...the black art of assembly language magicians."

The nice thing about altivec is that it has a C interface. You don't have to use assembly!

Take a look at this Apple tutorial to see how easy it is.

Re:Black Art? Uh... by Leo+McGarry · 2005-02-07 05:03 · Score: 3, Funny

Yes, I think the person who wrote the summary revealed a little more of his own ignorance than he meant to. I don't consider calling "vec_add" inside a loop to be a black art.
Re:Black Art? Uh... by dsci · 2005-02-07 05:17 · Score: 1

Also, the VectorC compiler by CodePlay is useful for using a C compiler that can generate SIMD for MMX, SSE and 3DNow!.
But really, at the end of the day, what's so bad about assembly? I mean, if you inline only those (relatively small parts) you need to optimize, and let the C compiler handle all the symbol table stuff, it's not that bad. We're not talking about developing a full app, including GUI, in straight Assembly from scratch.

--
Computational Chemistry products and services.
Re:Black Art? Uh... by Paradox · 2005-02-07 05:32 · Score: 1

Yeah, the C library is out there, and it's not too hard to use. :)

But one could counter that even in the C library, unless you know what you're doing, you may not get as dramatic a speedup as you wanted. Until I looked at several of Apple's examples, I couldn't write altivec-aware code properly (i.e. maximum performance benefit).

Once I knew what I was doing I went back and redid the code, and it ran much faster. So it is still tricky to maximize your bang-for-buck.

--
Slashdot. It's Not For Common Sense
Re:Black Art? Uh... by julesh · 2005-02-08 01:43 · Score: 1

The nice thing about altivec is that it has a C interface. You don't have to use assembly!

MS's compilers have a similar interface to MMX/SSE. I think the advantage of this project is that it's an abstraction layer that can use either.

More source-distro goodness to follow? by Progman3K · 2005-02-07 04:48 · Score: 1

Does this mean we can expect source Linux distros to start taking advantage of this?

I know I'll sound like a wannabe leet for saying this, but I already really like my Gentoo workstation because it is a stage1 install (all from source), and I expect this will only make it even faster!

Yay!

--
I don't know the meaning of the word 'don't' - J

Re:More source-distro goodness to follow? by ykardia · 2005-02-07 04:53 · Score: 1

If you are using Gentoo, there is an "icc" useflag that allows using the Intel Compiler for code that supports this. This compiler already automatically vectorizes your code to work with the Pentium SIMD units (SSE, SSE2 etc).

The speedup is probably not as large as the one you would get from hand-coded libraries, but it can be quite significant (certain things can run up to twice as fast, in my experience).
Re:More source-distro goodness to follow? by Lussarn · 2005-02-07 05:21 · Score: 1

Any particular ebuilds I can test this on?
Re:More source-distro goodness to follow? by ykardia · 2005-02-07 22:04 · Score: 1

Try
cd /usr/portage && find . -name '*.ebuild' -exec egrep -H "IUSE.*\Wicc\W.*" {} \;
Go make yourself a coffee - when you come back you should have an idea of which ebuilds take the flag. There might be a more sophisticated way of finding out which ebuilds take that flag, but I can't think of it.
Re:More source-distro goodness to follow? by Lussarn · 2005-02-11 00:56 · Score: 1

thx..

Too expensive? by saddino · 2005-02-07 04:49 · Score: 1

Sounds great, but $2499 for a redistributable binary? Ouch.

Re:Too expensive? by voxlator · 2005-02-07 04:57 · Score: 2, Insightful

In the corporate world, is it more expensive than paying a developer to design, code, test, and maintain a home-grown version?

Once you've paid a $30/hour developer for 10 days' work, you've forked out ~ $2,500...

--#voxlator
Re:Too expensive? by saddino · 2005-02-07 05:21 · Score: 1

If the question was "Do I hire my own programmer or buy this technology?" then you would be correct.

But, given this is an optimization and replacement for STL then the question is "Do I just live with STL, or buy this technology?"

In other words, it isn't an essential development cost, it's an extra (I imagine most interested parties already have shipping apps that use STL).

And at this price point, IMHO, I think the answer may be "if it ain't broke, don't fix it."

Slides about SIMD by quigonn · 2005-02-07 04:49 · Score: 2, Informative

A bit OT, but nevertheless quite interesting to read and it contains information about SIMD instruction sets other than just MMX/SSE: http://www.fefe.de/ccccamp2003-simd.pdf

--
A monkey is doing the real work for me.

Assembly or C++? by nagora · 2005-02-07 04:49 · Score: 1

I'll take the Assembly Language, thanks. Especially on such a nice processor.

TWW

--
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"

Re:Assembly or C++? by nagora · 2005-02-07 04:56 · Score: 1

Especially on such a nice processor as the PowerPC, that is. Sheesh.
TWW

--
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"

Autovectorization being add in GCC 4.0 by shawnce · 2005-02-07 04:50 · Score: 5, Interesting

For those who don't already know, autovectorization is being worked on for GCC by folks from IBM and others.

GCC vectorization project (site seems offline atm), but the abstract from a recent GCC summit is up.

Autovectorization Talk (google html view of pdf)

Re:Autovectorization being add in GCC 4.0 by TedCheshireAcad · 2005-02-07 05:23 · Score: 1

If you're serious about performance, use XLC. GCC is great if you're cheap, but it's kind of like putting monster truck tires on a Ferrari.
Re:Autovectorization being add in GCC 4.0 by joib · 2005-02-07 05:47 · Score: 1

Yes, the new ssa architecture in GCC 4.0 allows for autovectorization, but at the moment the focus is on getting GCC 4.0 sufficiently stable for release in a few months. Because of this, IIRC, some of the fancier vectorization passes were deferred until GCC 4.1.

So yes, you might see some performance improvements due to vectorization in 4.0, but you'll have to wait until 4.1 or maybe even 4.2 before you'll see the full potential of it.

-joib, occasional GCC contributor (although I have absolutely zilch to do with the mid- and backend stuff, where most of the optimization passes are done)
Re:Autovectorization being add in GCC 4.0 by bani · 2005-02-07 08:18 · Score: 1

even without vectorization, the performance improvements in gcc4 are impressive.

unfortunately some of the regressions are impressive as well :-/
Re:Autovectorization being add in GCC 4.0 by jd · 2005-02-07 11:01 · Score: 1

That would be interesting. Unfortunately, as far as I can tell, XLC only runs on AIX and therefore only on IBM's medium to big iron. This parallelization seems to be available for small machines and games boxes.

Personally, I think GCC is running into design limitations, to judge from the problems they're having getting GCC 4 to compile as well as GCC 3. In the builds that come with Fedora of both GCC 3 and GCC 4, I'm not seeing Fortran 90, although the ADA extension seems to be coming along nicely.

What I would like to see (naive and optimistic that I am) is for commercial vendors such as the Portland Compiler Group, Intel, IBM, etc, look at GCC as a way to outsource the more mundane work, and to re-fashion their own compilers as GCC extensions.

That way, they'd have more manpower to work on the new stuff, the stuff that makes their product different and therefore marketable. GCC could then focus on the generic stuff, the stuff they are good at, instead of working on optimization code that they are going round in circles on.

The vendors would end up with a greater range of compilers (anything you can stuff a front-end on for GCC) and a greater range of target machines, therefore potentially having a larger audience.

The GCC group would end up with a lot of refinements that they'd never have thought of otherwise and probably wouldn't be up to coding if they did.

By making GCC "middleware", and having proper pluggable language front-ends, and proper pluggable targets, I think a lot of people could gain substantially. The only ones who'd lose are the ones who don't have a product in the first place, and they'll lose in time anyway.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:Autovectorization being add in GCC 4.0 by HuguesT · 2005-02-08 01:23 · Score: 1

XLC is actually available for MacOS/X.

It works relatively well but I've found the XLC code to be less reliable than the FSF GCC one (i.e more compiler-related bugs). The code it produces is not very much better than the one produced by Apple's GCC. Apple's GCC also has reliability problems.

It's in the compiler by Mad+Hughagi · 2005-02-07 04:51 · Score: 2, Informative

Vectorization (SIMD) is built into the Intel compiler. There is no need to hack in assembly as the compiler will do it for you. This is the case with most vendor supplied compilers, as they want to fully exploit their hardware functionality.

The problem is bringing this functionality to OS compilers, which, as far as I know, lack even an OpenMP (threading) implementation, let alone internal vectorization.

--
UBU

Re:It's in the compiler by nonmaskable · 2005-02-07 05:24 · Score: 1

It is built in but you don't automagically get full benefit unless you design your data structures and algorithms appropriately. In my case, I got no measurable benefit until I did a fairly extensive redesign.

Intel has a great book on performance tuning that has been extremely helpful, as has Intel's VTune.
Re:It's in the compiler by spitzak · 2005-02-07 05:45 · Score: 1

With no changes to our code, but turning on most of the switches to the Linux Intel compiler, I got a huge number of "loop was vectorized" messages, and the resulting code was sped up almost 20% (versus only 5% for the Intel compiler with no switches other than -O5). Now it is quite likely that more speedup is possible, but it appears the Intel compiler was quite able to recognize and vectorize code that was not designed for it. (ps the code is floating-point image processing, with repetitive operations done to huge arrays of 32-bit floats, so it may be well-suited to their vectorization)
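For readers wondering what such a loop looks like, here is a minimal sketch of the kind of code described above: the same arithmetic applied independently to every element of a large float array. The function name `scale_and_offset` and its parameters are hypothetical, not taken from the poster's code, and whether any particular compiler actually vectorizes it depends on the compiler version and switches used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: apply a gain and a bias to every sample of a
// floating-point image. Each iteration is independent of the others and
// both arrays are walked with unit stride, which is the loop shape that
// auto-vectorizers generally look for.
void scale_and_offset(const std::vector<float>& in, std::vector<float>& out,
                      float gain, float bias)
{
    assert(in.size() == out.size());
    const float* src = in.data();
    float* dst = out.data();
    const std::size_t n = in.size();
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i] * gain + bias;
    }
}
```

Calling `scale_and_offset(in, out, 2.0f, 1.0f)` on two equally sized vectors fills `out` with `2 * in[i] + 1`. Loops that carry a dependency from one iteration to the next, or that chase pointers, are typically much harder for a compiler to vectorize.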
Re:It's in the compiler by Sebastopol · 2005-02-07 06:09 · Score: 1

Actually, you DO get automagical compiler speedup. In some cases it can identify vector-izable (is that a word?) loops and promote them to SIMD operations.

But yes, otherwise, you need to re-code if the compiler doesn't take the hint, especially in structures/classes. The only objection I have to the Intel intrinsics is they don't look pretty! ;-)

I haven't used VTune since circa 1998, and it had this awesome feature that would point out boneheaded things in your code. One interesting suggestion it made: it noticed I wasn't using return values despite returning them, when I removed them, I got a 25% performance boost in some critical code. Made me feel like an eeeeediot.

--
https://www.accountkiller.com/removal-requested
Re:It's in the compiler by nonmaskable · 2005-02-07 06:39 · Score: 1

Automagical only if it can make the identification; there are several things that can prevent it from doing so, and I managed to do several of them. VTune helps a lot with code like this - I've spent many happy hours tracking down hotspots with it.

already exists by jeif1k · 2005-02-07 04:55 · Score: 2, Informative

SIMD support already exists, in the form of C, C++, and Fortran libraries (usually, as a small part of larger numerical libraries), as well as in language constructs in languages like Fortran.

Re:already exists by jkujawa · 2005-02-07 05:17 · Score: 1

The point of MacSTL is it's portable to both PPC and Intel. You can make a portable SIMD-optimized program.
Re:already exists by jeif1k · 2005-02-07 08:13 · Score: 1

That's not much of a point. E.g., Atlas libraries support many different CPUs, and so do Fortran compilers. Many of those tools are already mature, widely used, conform to open standards, and are free even for commercial use.

Assembly by bsd4me · 2005-02-07 04:55 · Score: 2, Insightful

Even in embedded systems, assembly isn't used as much as it used to be. It still gets used in bootloaders, and sometimes in device drivers. However, most devices are memory mapped, and most of the driver is written in C, and asm() calls are made when appropriate (eg, asm("eieio");), especially when you get to use gcc and asm() syntax for accessing variables.

--

(S(SKK)(SKK))(S(SKK)(SKK))

The future by johnhennessy · 2005-02-07 04:55 · Score: 3, Insightful

Surely people can now start to see where the future lies - from a performance viewpoint. We've reached the end of the clocking "free lunch" (see http://www.gotw.ca/publications/concurrency-ddj.htm).

The way forward is turning the CPU (of a traditional) architecture into a Nanny for a range of various dedicated processing units. IBM saw this years ago, and thus began the whole Cell architecture - but I suspect that their job was much easier. The software that would run on the platform they are designing is fairly specific - games & multimedia which usually lend themselves well to vectorization.

The real challenge for architects (in my humble opinion) will be applying the same technique to other system bottlenecks.

AMD's (and now Intel's) approach of cramming more and more processing cores onto an IC might pay off in the short term, but like the "free lunch" of clock speed, will hit a roadblock when issues like memory bandwidth and caching schemes just have too much work to do with 4 or 8 processing cores hacking at it all the time.

--
[ Monday is a terrible way to spend one seventh of your life. ]

Re:The future by Rinikusu · 2005-02-07 07:39 · Score: 1

Isn't that pretty much what the Amiga was doing a couple decades ago? The CPU was merely a traffic cop, directing other specialized units to actually do the real work? If so, they're a bit late to the party, eh?

--
If you were me, you'd be good lookin'. - six string samurai
Re:The future by Duncan3 · 2005-02-07 07:54 · Score: 1

Even Amiga was a couple of decades late to the party. Vector processors have been around for a looooooooooooong time.

Still, it makes good headlines even today ;)

--
- Adam L. Beberg - The Cosm Project - http://www.mithral.com/
Re:The future by jeif1k · 2005-02-07 08:22 · Score: 1

IBM saw this years ago, and thus began the whole Cell architecture

This was completely obvious to everybody in the 1980's. What was surprising to most people was that Intel managed to succeed with the x86 architecture for so long without innovation.

AMD's (and now Intel's) approach of cramming more and more processing cores onto an IC might pay off in the short term, but like the "free lunch" of clock speed, will hit a roadblock when issues like memory bandwidth and caching schemes

And how do you think Cell addresses that? Right now, it's still just a lot of CPUs on a chip, with the same memory bottleneck as everybody else.

In fact, AMD and Intel probably are at a sweet spot with their processors where CPU power and memory bandwidth are fairly evenly matched for most applications. Cell, if anything, probably makes a bad tradeoff for real-world apps.
Re:The future by johnhennessy · 2005-02-11 04:15 · Score: 1

Ahhh Amiga,

I spent most of the 80's watching day-time TV.

I'm still trying to figure out which was the better option (TV or Computers).

--
[ Monday is a terrible way to spend one seventh of your life. ]

Isn't it what std::valarray is for? by 21mhz · 2005-02-07 04:57 · Score: 1

Reading this reminded me about that portion of the standard C++ library which is all about operations on vector data. So, my question is: could an std::valarray specialization for processor-supported types serve as a basis for portable SIMD support in C++?

--
My exception safety is -fno-exceptions.

Re:Isn't it what std::valarray is for? by kuwan · 2005-02-07 05:21 · Score: 2, Insightful

So, my question is: could an std::valarray specialization for processor-supported types serve as a basis for portable SIMD support in C++?

That's exactly what this is. If you read the part on his website about valarray then you'll see that it does extensive SIMD optimizations for valarray for both Altivec and MMX/SSE/SSE2/SSE3 platforms. He's even added "parallelized algorithms such as integer division, trigonometric functions and complex number arithmetic" which you'd have to code yourself in either assembly or using the C-based intrinsics if you wanted to do the SIMD programming by hand.

So basically, this allows you to code using std::valarray using normal C++ and then plug this in under the hood to get a nice speed boost.
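To make "normal C++" concrete, here is a minimal sketch that uses only the standard `std::valarray` interface. The helper name `lengths` is hypothetical, and nothing here is macstl's own API: the point is just that whole-array expressions like these are what a SIMD-aware valarray implementation would be able to accelerate under the hood.

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Hypothetical helper: Euclidean length of each (x, y) pair, written as
// whole-array arithmetic with no explicit loop. Only standard-library
// valarray operations are used.
std::valarray<float> lengths(const std::valarray<float>& x,
                             const std::valarray<float>& y)
{
    assert(x.size() == y.size());
    // Element-wise multiply and add across the whole array.
    std::valarray<float> sum_of_squares = x * x + y * y;
    // std::sqrt has an overload that applies to every element.
    return std::sqrt(sum_of_squares);
}
```

With `x = {3, 6}` and `y = {4, 8}`, `lengths(x, y)` yields the element-wise results 5 and 10. Because the calling code never spells out the loop, the library is free to choose how the elements are actually processed.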

--
Join the Pyramid - Free Mini Mac

--
infested with jello like fishes no melotron wishes
Re:Isn't it what std::valarray is for? by emarkp · 2005-02-07 07:43 · Score: 1

I guess you didn't notice: http://www.pixelglow.com/macstl/valarray/.

Other way around by Kiryat+Malachi · 2005-02-07 04:59 · Score: 1

Freescale, nee Motorola. (Nee roughly translates to "formerly known as").

--

---
Mod me down, you fucking twits. Go ahead. I dare you.
(I read with sigs off.)

Re:Other way around by Anonymous Coward · 2005-02-07 05:05 · Score: 1, Informative

Or born, like the french word it is: née.

No need for anyone to whip out the online dictionary and tell me "formerly known as" is an acceptable alternative.
Re:Other way around by Kiryat+Malachi · 2005-02-07 18:55 · Score: 1

'Nee' may be Dutch for 'no', but that has nothing to do with the usage I was correcting. In that usage, it's based on French, where it is the feminine form of the past participle of the verb "to be born". Thus, it is literally translated as "born as". However, the meaning it has acquired in English usage is better served by the definition "formerly known as".

--

---
Mod me down, you fucking twits. Go ahead. I dare you.
(I read with sigs off.)
Re:Other way around by Kiryat+Malachi · 2005-02-07 19:00 · Score: 1

Well, considering it is applied (by English speakers) to objects being renamed, the use of the word "born" is a bit suspect.

Products aren't born, nor are companies. And yet, we see usages like Freescale, née Motorola, or "the Acura SLX (nee Isuzu Trooper)." That last was from the New York Times. As such, while the *French* word means "born", the English usage is far better represented by "formerly known as".

We've co-opted parts of your language, Frenchie. Get over it.

--

---
Mod me down, you fucking twits. Go ahead. I dare you.
(I read with sigs off.)

Read the Altivec mailing list by kuwan · 2005-02-07 05:05 · Score: 4, Informative

A better resource for Altivec and SIMD in general is the SIMDtech.org website and Altivec mailing list. There are tutorials and technical manuals available and the email list is indispensable. While the mailing list is mostly geared towards Altivec optimizations and discussions all SIMD discussion is welcome, including MMX/SSE. There are Apple engineers that read and contribute to the list as well as Motorola/Freescale engineers. It's probably the single best resource available to Altivec programmers and you get to talk directly to the Wizards that created it.

I'm a relative newcomer to the list and it's been an invaluable resource as I've optimized with Altivec.

--
Join the Pyramid - Free Mini Mac

--
infested with jello like fishes no melotron wishes

OS X Tiger will do it for you by jilbert · 2005-02-07 05:06 · Score: 2, Interesting

Tiger, the next OS release from Apple, will take care of vector optimization automatically in their version of gcc 4.0. I guess this will make it into the public gcc too.

Re:OS X Tiger will do it for you by Junks+Jerzey · 2005-02-07 05:12 · Score: 1

Tiger, the next OS release from Apple, will take care of vector optimization automatically [apple.com] in their version of gcc 4.0. I guess this will make it into the public gcc too.

For the record, this has been in Intel's C compiler for years now. It's also in the current release of the Microsoft Visual C++ compiler, including the free download version.
Re:OS X Tiger will do it for you by be-fan · 2005-02-07 05:33 · Score: 4, Informative

Actually, Apple's Tiger will get an auto-vectorizing compiler courtesy of the public GCC 4.0 release. The auto-vectorizer wasn't developed in Apple's version of GCC. IBM's GCC team at the Haifa Research Lab developed the vectorizer in the public LNO (loop nest optimization) branch of GCC 4.0. I'm not trying to minimize Apple's contribution here, one of their developers did work on the team, but let's give credit where credit is due.

--
A deep unwavering belief is a sure sign you're missing something...
Re:OS X Tiger will do it for you by johnnyb · 2005-02-07 06:41 · Score: 1

Watch out, it's the Loop Nest Monster!

--
Engineering and the Ultimate

Q for VMX/3D/OpenGL software developers: by tubbtubb · 2005-02-07 05:06 · Score: 1

This is public now, so I can talk about it--
I worked on extending the accuracy and continuity of the VMX instruction vexptefp, see the patent application here
My understanding is that this instruction is used to compute Phong/specular hilights, and that previous implementations of this instruction were unusable because the lack of accuracy and continuity made it visually undesirable. We were able to improve the algorithm enough to be visually indistinguishable from a fully accurate non-estimate.
Can any software developers that use this instruction comment on this?
Is Phong hilighting mostly done on GPUs now?

From the limewire... by WilyCoder · 2005-02-07 05:07 · Score: 3, Interesting

As two of my professors have stated in class, SIMD and moreso parallel processing will require programmers to think in a fundamentally different way in order for multi-core/multi-processor to really take off.

This project may be a step in the right direction. Benchmarks show that SIMD such as SSE/2/3 only provides a marginal speed increase. And meanwhile, the massively parallel computations done on graphics cards dwarf anything SIMD claims to produce.

Perhaps we will see GFX manufacturers selling their technology to the CPU makers.

I forget the specifics, but a new GFX card can perform somewhere around 35 GFLOPS, while a 3.4GHz P4 (executing SIMD code) can only produce around 5-6 GFLOPS at best.

With projects like Brook GPU emerging, the division of CPU and GFX processor may be narrowed significantly.

Re:From the limewire... by Hast · 2005-02-08 03:23 · Score: 1

FPGAs are loads of fun. But they are not particularly suited for general cases. I was part of a group that did an image processor for FPGA; even its theoretical capacity was thoroughly whipped by a GPU running the same algorithms implemented as shaders. (And the GPU version wasn't optimised at all, eg we wasted 3 of 4 channels available for processing.)

And for basic signal processing a DSP is really fast, typically way faster than a FPGA. It's all about how the data is structured.

AFAIK DSP is good with "narrow" data, ie audio. FPGAs are good with wider data (eg images) with high demands on flexibility on the processing elements. Vector processors (such as GPUs, Altivec, SSE etc) are good when you have semi-wide data and general processing requirements. As it's implemented as ASIC you can get extremely high frequencies which makes for high throughput and low latency. Compared to FPGA the calculations aren't as flexible though. And finally the typical CPU which can do just about anything but comparatively slowly.

All that said, I remember when I studied good ol' An introduction to Algorithms that there was an entire chapter on parallel processing; however, it typically required multiple processors for each element in the data. Naturally that is not very useful for a normal CPU but it's good for a GPU and even better for FPGA.

Ignorant submitter, or smart marketing? by javaxman · 2005-02-07 05:08 · Score: 2, Interesting

Sorry, I can't read a story submitted by someone who doesn't even know about C libraries that have been around for years.

Or is this just another advertisement pretending to be a story, with the submitter trying to play ignorant about alternative Altivec and MMX libraries ?

Depends on what you are doing by dsci · 2005-02-07 05:09 · Score: 5, Insightful

We write code for hardcore chemical simulations. The limits on what can be studied, ie number of atoms/molecules or timescales of the simulations depends on one thing: speed.

Faster computers means better simulations. BUT, if the code is not as fast as it can be on a particular architecture, your simulations are not going to be as complete as they can be. At least within a given time allotment.

I've recently applied some code optimizations to a Monte Carlo simulation and saw speed ups of over 1000x. That's significant.

It's naive to think that faster computers means we should live with sloppy or unoptimized code. SIMD is a useful technique, and if it means the difference between me getting work done in a week or two or three weeks, I think I'll take the one-week sim.

--
Computational Chemistry products and services.

Re:Depends on what you are doing by imsabbel · 2005-02-07 07:32 · Score: 1

Speedups like a factor of 1000 can only come from high level optimisations (like replacing an O(n^2) with an O(n log n) algo).

Honestly: TO be able to get a 1000 times boost, your original code must have been beyond bullshit.

And of course using simd is better than not using it, but i would rather stay on a "let the compiler vectorize it" level. I mean, doing your inner loop in leet assembler only to NOT know after a long simulation if the results are real or you just botched some line isn't worth it.

--
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
Re:Depends on what you are doing by groomed · 2005-02-07 07:44 · Score: 1

And of course using simd is better than not using it, but i would rather stay on a "let the compiler vectorize it" level. I mean, doing your inner loop in leet assembler only to NOT know after a long simulation if the results are real or you just botched some line isn't worth it.

Baseless FUD. Why would a few dozen lines of hand coded assembly suddenly invalidate the results?
Re:Depends on what you are doing by Dasein · 2005-02-07 09:17 · Score: 2, Insightful

Speedups like a factor of 1000 can only come from high level optimisations (like replacing an O(n^2) with an O(n log n) algo).

Nope. Technically, there are two constants buried in here. The definition is g(x) = O(f(x)) => g(x) <= k*f(x) where x > a for some arbitrary a. If you don't change algorithms, all you can do is manipulate the k. For a given k and a given level of improvement, I can give you a new k that hits that level of improvement.
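Spelled out, the argument runs roughly as follows. The constants k and a are the ones named in the post; T_old, T_new and the tuning factor c are illustrative symbols that do not appear in the original.

```latex
% Definition paraphrased in the post: k and a are the two constants.
g(x) = O\bigl(f(x)\bigr)
  \iff \exists\, k > 0,\ \exists\, a \ \text{such that}\
  g(x) \le k\, f(x) \ \text{for all } x > a

% Illustrative assumption: the old running time satisfies the bound above,
% and low-level tuning makes the same algorithm c times faster.
T_{\text{old}}(x) \le k\, f(x), \qquad
T_{\text{new}}(x) = \frac{T_{\text{old}}(x)}{c}
  \quad\Longrightarrow\quad
  T_{\text{new}}(x) \le \frac{k}{c}\, f(x) \ \text{for all } x > a
```

So the tuned code stays in the same O(f(x)) class with the smaller constant k/c, and the notation itself places no limit on how large c can be.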

Honestly: TO be able to get a 1000 times boost, your original code must have been beyond bullshit.

Also, his original code may have been "bullshit" but it may not have. It depends a lot on the algorithm in question. The higher the exponent on an exponential algorithm, the more sensitive its running time is to some optimization in an inner loop.

And of course using simd is better than not using it, but i would rather stay on a "let the compiler vectorize it" level. I mean, doing your inner loop in leet assembler only to NOT know after a long simulation if the results are real or you just botched some line isn't worth it.

This is a simple matter of economics. There's a cost/benefit to expending the effort to optimize in assembly. If the compiler generates good code, then obviously, the cost/benefit of recoding in assembly is pretty high. However, without specific knowledge of *HIS* economics, I would suggest that you not spout off.

--
You are not a beautiful or unique snowflake -- but you could be if you got off your ass.
Re:Depends on what you are doing by bigox · 2005-02-07 09:42 · Score: 1

Some algorithms with the same complexity may have vastly different constant coefficients. I've seen some algorithms in O(n^2) and O(nlog(n)) that have constants as high as 1000. And some of the constants are actually parameters of the algorithm. Why do people look at these algorithms? Some are easier to explain (and easier to illustrate certain concepts) than others. Computational Geometry has a lot of these beasties.
Re:Depends on what you are doing by aminorex · 2005-02-07 11:48 · Score: 2

The difference between running mostly in L1 cache and regularly going to RAM (particularly when load/store patterns are pessimal), multiplied by the parallelism of exploiting SIMD can quite feasibly give a 1000x performance difference.

--
-I like my women like I like my tea: green-
Re:Depends on what you are doing by imsabbel · 2005-02-08 01:58 · Score: 1

er, no.
not a factor of 1000.
even if it before was 0% cache hitrate and only scalar and after 100% l1 cache hitrate and 4 way simd, it would only have a speedup of about 100. (l1 latency is 2-4 clocks on different cpus, main memory around 80-150).

Believe me, i know how a nifty algorithm can perform like crap because what was before nice block streaming in O(n^2) became pointer chasing in O(NlgN), but not a factor of 1000, not that way.

I know its a bit of nitpicking, but throwing around claims like "factor 1000 improvement" need a bit of backup...

--
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
Re:Depends on what you are doing by aminorex · 2005-02-09 04:12 · Score: 1

Consider a 2GHz CPU, with 2 cycle L1 latency loading one-byte from a 16 byte cache line, at stride 16 bytes. Ignoring TLB thrash, etc., to force a factor of 1000 in latency use a bit rate N, 1000ns = 128b * N ns/b. That's 7ns/b. An ECC EDO RAM 9b wide feeding an off chip MMU would do about that, for example.

Now let's start swapping, and see if we can push a factor of 10e6, eh?

--
-I like my women like I like my tea: green-

Re:License issues-Smells funny. by IO+ERROR · 2005-02-07 05:17 · Score: 1

Of course he hasn't taken away my choice, AC. I can't reconcile either of his licenses with my existing projects, so I choose not to use his code. I suspect many existing projects will find themselves in a similar situation when they actually read the licenses, and will also choose not to use his code.

--
How am I supposed to fit a pithy, relevant quote into 120 characters?

liboil by labratuk · 2005-02-07 05:23 · Score: 2, Interesting

Another project trying to do something similar is liboil, the Library of Optimised Inner Loops.

However, in the future I can see things changing for the structure of the standard PC.

At the moment in a high end machine you have the CPU, which is a scalar processor, a GPU, which is in essence a glorified vector processor (not just useful for graphics, as projects like GpGPU are showing us), and SIMD extensions to the CPU to allow it to do small amounts of vector processing.

Scalar processors are good for some things (branchy code) and vector processors are good for other things (very predictable parallel code). Having both is very useful.

I would say in the next 5-10 years we will see the GPU join together with the SIMD extensions to provide a separate general purpose vector processor.

PCs will ship with two processors - one scalar, one vector. And everyone will be happy.

Now, whether this will be transparent to the programmer depends on how automatic code optimisation progresses over the next few years. Is Intel's icc auto vectorisation already good enough? Don't know.

--
Malike Bamiyi wanted my assistance.

Moore's Law is OVER by emarkp · 2005-02-07 05:29 · Score: 1

Haven't you been paying attention? Processor speed increases stopped 2 years ago. We can put more transistors on silicon, but the free performance ride is over.

See Herb Sutter's article in the Feb C/C++ Users Journal or the (expanded) one in the March Dr. Dobb's Journal.

But times are changing, this is becoming valuable by Paradox · 2005-02-07 05:29 · Score: 1

Recently Herb Sutter (famous software engineering guru and C++ wizard) posted this essay in which he reminds us, among other things, that the generalization of Moore's law to processor speed is already failing! While computers are continuing to get faster, it's not just in their clockspeed anymore.

While memory speeds will continue for awhile, already processor speeds are falling off. Check out this graph from the article where he clearly shows what's happening.

This brings an interesting dilemma to modern programmers. Programs won't magically get faster anymore. We need to start coding to take advantage of concurrency.

The same is true of using SIMD units. They can speed up your code dramatically, but they must be taken into account in your code. That's why this macstl project is such a good idea. It is a standard set of common primitives that let you harness the SIMD functions of your processor. By putting a library over the specifics, your vector-aware code will grow with modern SIMD systems.

Few people will ask you to write in assembly these days, but if you could easily give your math-intensive program a 10x-30x speedup by using one library (that seems very easy to use, by my standards), why wouldn't you?

--
Slashdot. It's Not For Common Sense

Why? Altivec-optimized libraries supplied by Apple by coult · 2005-02-07 05:38 · Score: 3, Interesting

You really don't need macstl unless you have a strong desire to use valarray in C++...for example, the ATLAS project http://math-atlas.sourceforge.net/ already uses Altivec (and SSE/SSE2, etc) wherever it results in a speedup. So, if your code does linear algebra, use ATLAS and you'll see an automatic speedup in many cases. Other projects such as fftw http://fftw.org/ include Altivec/SSE/SSE2 optimizations as well. ATLAS includes lots of other optimizations such as cache-blocking, loop-unrolling, etc. I don't know if macstl includes such optimizations, but I do know that ATLAS performance approaches the theoretical peak performance on G4/G5 for things like matrix-matrix multiplication.

Not only that, but Apple's vecLib http://developer.apple.com/ReleaseNotes/MacOSX/vecLib.html includes ATLAS so you don't even have to download or install anything - it comes with OS X.

--

All is Number -Pythagoras.

Algorithms by Detritus · 2005-02-07 05:38 · Score: 1

You often need radically different algorithms to get the full benefit of SIMD. The processing power is there, but figuring out how to exploit it can be very difficult.

You can do a limited version of SIMD with an ordinary CPU. A 32-bit CPU can execute 32 "bit logic" operations with a single instruction. With a properly structured problem, 32 instances can be computed in parallel.
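
A minimal sketch of that trick, assuming the problem is a one-bit full adder: bit i of every word belongs to instance i, so ordinary bitwise instructions evaluate 32 adders at once. The names AdderOut and full_adder_32_lanes are invented for the illustration.

```cpp
#include <cstdint>

// One "lane" per bit: bit i of each word belongs to problem instance i.
struct AdderOut {
    std::uint32_t sum;    // bit i = sum bit of instance i
    std::uint32_t carry;  // bit i = carry-out of instance i
};

// Evaluates 32 independent one-bit full adders with a handful of
// bitwise operations and no loop over the instances.
inline AdderOut full_adder_32_lanes(std::uint32_t a, std::uint32_t b,
                                    std::uint32_t cin) {
    const std::uint32_t axb = a ^ b;
    return AdderOut{axb ^ cin, (a & b) | (cin & axb)};
}
```

The "properly structured problem" part is the hard bit: the data has to be laid out so that instance i really does live in bit i of every operand.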

--
Mea navis aericumbens anguillis abundat

Yes. by Trillan · 2005-02-07 05:44 · Score: 2, Informative

Yes it does.

Re:Yes. by homb · 2005-02-07 05:59 · Score: 2, Informative

No, the current version of XCode uses GCC 3.3 and does NOT support autovectorization.
The page you link to is a page that shows how to code vector-based programs. What the parent is asking is if the standard "Hello World" program can be auto-vectorized with one command-line argument, and that won't work currently.
The next version of XCode (2.0) with GCC 3.4 will support partial auto-vectorization, as another comment said as well.
Re:Yes. by Trillan · 2005-02-07 08:16 · Score: 1

Actually, the parent says: Doesn't XCode have a feature that lets you "vectorize" certain parts of your code already?

The answer is yes.

"Automatic" was introduced by one of the child posts. You're correct that the answer to that is no. However, I didn't think that's what was being asked since it was a top-level question in response to TFA's "you must use assembler."

Anyway, sounds like we agree on reality, we just disagree on what's being asked. So between the two of us we've provided a very complete answer. :)
Re:Yes. by homb · 2005-02-07 08:30 · Score: 1

Except that I was wrong on the fact that GCC 3.4 does auto-vectorization, while it's in fact GCC 4.0 (which will ship with XCode 2.0).

Other than that, I agree we agree on agreeing to agree.

Why limit yourself to Altivec when you have NVidia by kompiluj · 2005-02-07 05:45 · Score: 3, Insightful

Well the processing power of Altivec or MMX/SSE/3DNow or whatever is nowhere near the power of the newest NVidia/ATI card you have surely bought for playing Doom III. Why not use it then? Get the Brook compiler! Furthermore, I see they introduce classes like vec, etc. Such classes have already been designed successfully for C++. Why not try porting Blitz to Altivec and/or to the GPU?

--
You can defy gravity... for a short time

Maybe it's just Ignorant criticism... by kuwan · 2005-02-07 05:48 · Score: 3, Informative

If you'd actually read what this is all about then you'd have found out that this is a cross-platform library for SIMD programming. You program in standard C++ using std::valarray and you get code optimized for Altivec and MMX/SSE/SSE2/SSE3 without having to do anything else. You don't need to worry about coding to two different libraries on two different platforms nor do you have to worry about learning the platform-specific C intrinsics, alignment issues, head/tail cases, etc.

SIMD programming becomes as easy as this:

    float af1 [] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    stdext::valarray <float> v1 (af1, 10); // construct from first 10 elements of af1
    stdext::valarray <float> v2 (10, 3.0f); // construct with 10 repeats of 3.0f
    stdext::valarray <float> v3 (10); // construct with 10 repeats of 0.0f
    v3 = sin (v1) * cos (v2) + sin (v2) * cos (v1);

He claims that the above code is 17.4x faster than Codewarrior MSL C++, 11.6x faster than gcc libstdc++ and 9.5x faster than Visual C++.
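
For comparison, here is a sketch of the same expression written against the plain standard library, so it can be compiled without macstl. The wrapper function sin_sum_demo is hypothetical. One detail worth checking when moving between the two: std::valarray's fill constructor takes the value first and the count second.

```cpp
#include <cmath>
#include <valarray>

// Same expression as the macstl example, using only standard std::valarray.
inline std::valarray<float> sin_sum_demo() {
    const float af1[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    std::valarray<float> v1(af1, 10);   // first 10 elements of af1
    std::valarray<float> v2(3.0f, 10);  // std::valarray order is (value, count)
    // Element-wise: sin(a)cos(b) + sin(b)cos(a), which equals sin(a + b).
    std::valarray<float> v3 = std::sin(v1) * std::cos(v2) + std::sin(v2) * std::cos(v1);
    return v3;
}
```

Whether a given standard library turns this into SIMD instructions is up to the implementation; the quoted speedups are the macstl author's claims and are not reproduced here.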

Macstl also provides a cross-platform syntax for using vector registers that is similar to using the native C intrinsics on each platform. So while not all of the native operations are available, his cross-platform "vec" API allows you to write cross-platform code without having to learn both the Altivec and MMX/SSE intrinsics (which is a good solution for someone who knows one platform but not the other).

--
Join the Pyramid - Free Mini Mac

--
infested with jello like fishes no melotron wishes

Re:Maybe it's just Ignorant criticism... by javaxman · 2005-02-07 06:19 · Score: 1

If you'd actually read what this is all about then you'd have found out that this is a cross-platform library for SIMD programming.
My point exactly. Does the story say cross-platform anywhere? No, it says:
programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians
er... so, instead of saying something like "here's a product which allows you to use the same API for both PPC and Intel SIMD", the submitter puts in the above sentence, which is clearly not factually correct unless you infer the part about cross-platform?
I'm not saying that the library isn't cool and potentially useful, I'm saying I'm very turned off by the claim that programming either of these architectures requires assembly language. Because it's not true. Why do you have a problem with my pointing that out? Why do you assume I didn't know what macstl is already?
Oh, and I'm also turned off by the post essentially being an ad... like your sig.

OSI-approved RPL goodness. Admit it.... by Pyrosophy · 2005-02-07 06:03 · Score: 2, Funny

This story doesn't really mean anything and people are just making up comments.

Content Addressable Parallel Processors by Baldrson · 2005-02-07 06:10 · Score: 2, Interesting

The real "grand unified theory" of SIMD is CAPP or content addressable parallel processors. I read a book on this topic back in the 1970s and it was pretty clear to me that it:

Was a great way of dealing with relational data
Would have to await much larger scales of integration before becoming practical.

Since then the computer world has become much more relational due to relational databases, and the levels of integration have skyrocketed, but no major manufacturer of silicon has bothered to revisit this very simple and powerful route to high power computing.

Fortunately there is at least a little ongoing research.

The beauty of these processors is that they integrate memory with computation, so the massive economies of scale we witness in memory fabrication apply to computation speeds as well, so long as we can move toward relational rather than functional computing as a paradigm. Fortunately this appears to be supported by the study of quantum computers; however, those computers may never see the light of day for more fundamental reasons.

--
Seastead this.

Re:Content Addressable Parallel Processors by TheRaven64 · 2005-02-07 08:30 · Score: 1

I think you missed the articles about the Cell processor. Almost every other post in those was someone re-inventing the Transputer...

--
I am TheRaven on Soylent News

Re:Assembly-DSPs by bsd4me · 2005-02-07 06:23 · Score: 1

It is when programming DSP's (and related devices).

From my experience, yes and no. Fixed-point DSP tends to be done in assembly, mainly because FP techniques don't translate well to C. The compilers also tend to suck. A fair to large amount of floating-point DSP is done with C when the compiler support is good. I have done a lot of floating-point DSP, and we found that the write-in-C, refine-in-ASM workflow was best.
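
Reading "FP" above as fixed-point, a small sketch shows the kind of friction involved. Many DSPs do a rounded, saturating fractional multiply in one instruction, while the C and C++ integer operators neither saturate nor track a binary point, so it has to be spelled out by hand. The Q15 format and the name q15_mul_sat are assumptions chosen for the illustration.

```cpp
#include <cstdint>

// Q15: a 16-bit signed integer x represents the real value x / 32768.
inline std::int16_t q15_mul_sat(std::int16_t a, std::int16_t b) {
    std::int32_t p = static_cast<std::int32_t>(a) * b;  // Q30 product, cannot overflow 32 bits
    // Round to nearest and drop back to Q15. Right-shifting a negative value
    // is implementation-defined before C++20; mainstream compilers shift arithmetically.
    p = (p + (1 << 14)) >> 15;
    if (p > INT16_MAX) p = INT16_MAX;  // only reachable for (-1.0) * (-1.0)
    return static_cast<std::int16_t>(p);
}
```

For example, 0.5 * 0.5 is q15_mul_sat(16384, 16384), which gives 8192, i.e. 0.25. A compiler has to recognise this whole pattern to map it back onto the single hardware instruction.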

Don't forget that microcontrollers outnumber microprocessors by a large margin.

That is true. I usually refer to this as "high" embedded versus "low" embedded systems. Along with DSP, I have spent my career mainly working on large, embedded systems running on microprocessors (Mot 68k, PowerPC, and some MIPS) under control of an RTOS. In this application, assembly doesn't get used as much as you think even when you are dealing with hard realtime requirements.

--

(S(SKK)(SKK))(S(SKK)(SKK))

Re:faster? Bogus.... by Anonymous Coward · 2005-02-07 06:27 · Score: 1, Insightful

the idea that assembler programmers can write better code than a compiler can generate is one of those urban myths that refuses to die. compilers can and do undertake code analysis that no assembler programmer could ever do - like trace back the control flow through every single branch point to find instances where data has already been precalculated, or hoist temporaries outside loops in a way that maximises register use over memory hits. undertaking such analysis before coding in assembler would be extremely high risk for an assembler programmer. also, how would you as an assembler programmer go about inlining all your assembler functions? the code would be unmanageable. how many assembler programmers would know how to reorder their instructions to avoid pipeline stalls? all the knowledge about optimising assembly programs has been incorporated into compiler backends over the years - why wouldn't it have been?

it's been tested - get a program that converts assembler to c and then recompile with optimisation - it *will* run faster.

the only exceptions are where the compiler lacks an algebraic or RTL awareness of an instruction on a specific architecture.

jxxx

Re:OSI-approved RPL goodness. Admit it.... by yarichg** · 2005-02-07 06:27 · Score: 1

It's an interesting discussion, only it's jumping all over the place and touching a whole bunch of interesting points... I've only written (.model small) programs in Assembly (don't think I'd want to write many larger programs in Assembly), and my recollection is that, given the nature of Assembly, if written properly it will just about always be faster than any compiled code... so why not use it in small doses where applicable?

macstl vs. Blitz++ by ljubom · 2005-02-07 06:48 · Score: 1, Interesting

It will be interesting to compare performance of the macstl library to other "high speed" template libraries like Blitz++ (see http://www.oonumerics.org/blitz/)

Obviously, you arent a PS2 graphics programmer.. by LordZardoz · 2005-02-07 07:24 · Score: 1

And one step further, I am betting you do not perform any sort of graphics programming.

On win32 / mac platforms, the need to know how to do this is pretty low. DirectX wraps most of it, as well as the processes needed for GPU programming. I am sure the Mac libs that do the same job as DirectX accomplish much the same.

But low level graphics programming is alive and well for game programming. I do what I can to stay well clear of that, since I don't like graphics programming much (just personal preference). But the need for this type of programming continues to exist. And it will continue to exist for a while yet.

END COMMUNICATION

Pedantic Pissing Contests Aside by Baldrson · 2005-02-07 07:37 · Score: 1

The point is that the GPL doesn't specify release behavior for code that isn't distributed so any "program" P developed with regard to the GPL should not reference such release behavior -- hence the substitution principle works.

--
Seastead this.

Re:Pedantic Pissing Contests Aside by CableModemSniper · 2005-02-07 10:42 · Score: 1

But pedantic pissing contests are my favorite! ;)

--
Why not fork?

Re:Why limit yourself to Altivec when you have NVi by TheRaven64 · 2005-02-07 08:27 · Score: 2, Insightful

The main reason is that the AGP bus is designed to move data very quickly to the card, but is not so hot at moving it back again. This should change with PCI Express.

--
I am TheRaven on Soylent News

Now they're thinking of it by El+Cabri · 2005-02-07 08:40 · Score: 1

Funny thing: it was PRECISELY the topic of an engineering degree internship that I did in the summer of 1997: making a universal C++ template lib for SIMD programming, with application to the IA-32 MMX system. At the time there was already similar work all around, with the introduction of MMX and the popular Alpha architecture, which had a similar system. All that to say that it does not sound really new to me.

Assembly lives! by Omigod · 2005-02-07 08:45 · Score: 2, Interesting

The more complex the architecture, the greater the need to keep around low level coding. Compilers just can't keep up. During the early days of the PS2 we commonly got 300x performance improvements when switching from high level code to carefully architected and coded assembly. Programmers have gotten lazy and have lost the skills required to maximize the performance on current architectures. If you code carefully you can make sure that you are executing the maximum number of instructions per cycle. When you use a compiler it abstracts you from seeing that if you change your instruction pairing or split off some of the instructions into another pipeline you might get better performance. In school they teach you that the algorithm is the most important thing to look at and that implementation doesn't matter that much, but with today's complex bus architectures, and with everything fighting for control of the bus, if you aren't careful you can end up wasting most of your time waiting for access to data or stalling the instruction pipeline waiting for results of calculations.

CAPP != Cell by Baldrson · 2005-02-07 08:58 · Score: 1

In CAPP operations are generally carried out in a bit-serial, word-parallel manner. This is radically different from Cell processor architecture.
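
A toy software model may make "bit-serial, word-parallel" concrete. This is only a sketch under invented assumptions (at most 64 stored words of 16 bits, a struct called ToyCapp); real content-addressable hardware does the per-word step in memory cells, not in a 64-bit register.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy model: up to 64 stored 16-bit words, held as bit planes.
// planes[b] has bit w set when bit b of stored word w is 1.
struct ToyCapp {
    static constexpr int kBits = 16;
    std::uint64_t planes[kBits] = {};
    int count = 0;

    void store(std::uint16_t value) {
        assert(count < 64);
        for (int b = 0; b < kBits; ++b)
            if ((value >> b) & 1u) planes[b] |= (std::uint64_t{1} << count);
        ++count;
    }

    // Returns a tag mask: bit w is set when stored word w equals key.
    std::uint64_t match(std::uint16_t key) const {
        std::uint64_t tags = (count == 64) ? ~std::uint64_t{0}
                                           : ((std::uint64_t{1} << count) - 1);
        for (int b = 0; b < kBits; ++b)  // serial over bit positions
            tags &= ((key >> b) & 1u) ? planes[b] : ~planes[b];  // parallel over words
        return tags;
    }
};
```

The search cost depends on the word width (16 steps here), not on how many words are stored, which is the property that makes the approach attractive for relational lookups.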

--
Seastead this.

Cell is not an SIMD but a MIMD by John+Sokol · 2005-02-07 09:32 · Score: 1

SIMD is Single Instruction, Multiple Data.
MMX, SSE, SSE2, SSE3 and Altivec are all SIMD.

Cell is MIMD, Multiple instruction and multiple data.

Cell is an array of small independent CPU's. (think Beowulf cluster on a chip)

Computation is done by systolic arrays or similar parallel processing techniques.

Think of cells in a spreadsheet, where each rectangle performs its own computation. A Cell processor allows you to change data at the top of the spreadsheet and compute results at the bottom at GHz speeds!

Granted this isn't good for running an OS, but for video processing, Finite element simulations, Ray Tracing, code breaking, and AI, it's great.

I was working with Chuck Moore on Project Enumera; we laid out a chip with 49 (7x7) asynchronous CPUs (this is important).
When doing Cell processors with 50 cores you don't want them to run in lockstep. This is akin to why soldiers march out of step when crossing a bridge. It distributes the load on the power supply lines, rather than creating one big spike when they all switch.

--
I am always doing that which I can not do, in order that I may learn how to do it. - Pablo Picasso

Yes, it does. by Millennium · 2005-02-07 09:33 · Score: 1

AltiVec was introduced with the G4 line. The Mac Mini has such a chip. If it were to use a G3 chip, then it wouldn't have AltiVec, but that is not the case.

Autovectorizing by dghomefry · 2005-02-07 09:44 · Score: 1

Autovectorizing does exist. Absoft makes a product called VAST: http://www.absoft.com/Products/Libraries/vast.html
It works on C, Fortran, and C++. I've seen some reasonable performance gains from just a recompile.

grand unified theory? by convolvatron · 2005-02-07 10:06 · Score: 1

a grand unified theory would use SIMD for distribution, not just exploiting a shallow local vector unit. like ZPL, or the connection machine languages. in a manner that allows you to exploit scale.

no one calls a cray X1 SIMD, but it's a lot closer than altivec.

Re:faster? Bogus.... by Creepy · 2005-02-07 10:20 · Score: 1

I have to agree with the "insightful" poster - he/she/it is describing optimizing using assembly, not C, and trying to optimize code in assembler is a nightmare these days. You've got deep pipelines, multiple execution units, parallel processing - it's just ugly and requires a deep understanding of how to reorder the instructions to avoid pipeline stalls as well as keeping all the parallel units working at the same time. Optimizing your C code for better assembly is a different thing entirely (unrolling loops, aligning bitmaps on power-of-2 boundaries for faster copies, putting most common loop choice first, etc.).
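
As a sketch of the "optimize your C for better assembly" side, here is a dot product unrolled by four with independent accumulators, so consecutive multiply-adds do not wait on each other's results. The function name dot_unrolled4 is made up for the example, and whether this actually helps on a given CPU and compiler has to be measured.

```cpp
#include <cstddef>

// Dot product unrolled by four. The four partial sums form independent
// dependency chains, which gives a pipelined FPU more work to overlap.
inline float dot_unrolled4(const float* a, const float* b, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) s += a[i] * b[i];  // leftover tail elements
    return s;
}
```

Because floating-point addition is not associative, the result can differ in the last bits from a straight loop, which is one reason compilers do not make this transformation on their own without relaxed floating-point flags.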

About the best you can hope for is to tweak C code with assembly blocks (usually after extensive profiling). Even then, you really don't see the huge performance improvements these days (nothing like 5 years ago).

I used to be excited about the future of this tech, but to me, GPU tech has become far more important (heck, the latest stuff I've been programming barely sucks up 800MHz of my CPU but throttles the GPU).

Re:Why? Altivec-optimized libraries supplied by Ap by jd · 2005-02-07 11:17 · Score: 1

The maintainer for ATLAS has been unemployed and/or abducted by aliens for some considerable time. Nor does ATLAS implement all of BLAS and LAPACK. It is probably no longer optimal for many modern systems.

That's no fault of the maintainer - rough times are, well, rough. Nonetheless, it will take a long time before ATLAS is again a serious option for the kinds of problems it is good at solving. It's a pity, but it's too big a project for one person, and other projects (albeit non-free, in any sense) are putting in far more resources than that.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)

Like a Blitter? by gidds · 2005-02-07 14:18 · Score: 1

It's just struck me that the Altivec unit in this 'ere PowerMac is actually a little similar to the Blitter chip that my old Atari STE used to have.

The exact function may be slightly different -- a vector processor is far more flexible -- but it's still a special-purpose unit that drastically speeds up a few simple operations on reasonably large amounts of data, often used for graphical operations.

Interesting how so many ideas in computing are just developments of previous ones...

--

Ceterum censeo subscriptionem esse delendam.

Battery consumption by tepples · 2005-02-07 15:36 · Score: 1

because processor speeds have increased to such an extent (Moore's Law), it doesn't make sense to use assembly to write modern code; even if the assembly code is faster.

More efficient code will run in less time, letting the CPU stay in an idle state more often. This can reduce power consumption, especially on battery-constrained devices. How many watts does your l33t-0-fast processor draw again, and how long would it run on a pair of AAs?

Re:Why limit yourself to Altivec when you have NVi by MasterVidBoi · 2005-02-07 16:41 · Score: 1

For the same reason that general purpose computation isn't done on your GPU. A GPU gets its performance from being able to do the same small task to a whole lot of data. A CPU needs to do a bunch of tasks to a small bit of data.

So, you need to multiply two vectors of a thousand floats each. Can the GPU do it faster? Yes. But not really, because there is an astounding minimum latency before the results of that computation can be recovered from the GPU. It's a *deep* pipeline. Even though the CPU will spend longer calculating, the results will be available immediately, and you have much better turnaround time. If you were doing a million multiplies, the answer would be different. But outside of image processing/DSP work, you rarely find such operations.
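
The trade-off can be put as a back-of-the-envelope model: CPU time grows with the element count from zero, while GPU time starts at a fixed round-trip cost and then grows more slowly. The struct OffloadModel below is a hypothetical sketch, and any throughput or latency figures fed into it are the caller's guesses, not measurements.

```cpp
#include <cassert>

// Simple linear cost model; all rates and latencies are caller-supplied guesses.
struct OffloadModel {
    double cpu_elems_per_sec;   // sustained CPU (SIMD) throughput
    double gpu_elems_per_sec;   // sustained GPU throughput
    double gpu_round_trip_sec;  // fixed cost: upload, pipeline, readback

    double cpu_time(double n) const { return n / cpu_elems_per_sec; }
    double gpu_time(double n) const { return gpu_round_trip_sec + n / gpu_elems_per_sec; }

    // Element count at which both paths take equal time.
    double break_even() const {
        assert(gpu_elems_per_sec > cpu_elems_per_sec);
        return gpu_round_trip_sec / (1.0 / cpu_elems_per_sec - 1.0 / gpu_elems_per_sec);
    }
};
```

Below the break-even count the fixed round trip dominates and the CPU wins; above it the GPU's throughput takes over. Where that point falls depends entirely on the numbers plugged in.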

Altivec (and MMX/SSE2/SSE3, although less useful) sits nicely in the middle, allowing you to operate on larger pieces of data in parallel without incurring the latency of a GPU operation, allowing excellent performance gains in quite a few common situations.

API matters by dugenou · 2005-02-08 00:54 · Score: 1

What about an open library, cross-platform, multimedia oriented, along the line of SUN's mediaLib ? Would SUN allow freely the re-use of their API ?

I'm looking for such a library, with GPL/LGPL compatible license. The API has to be in C, to maximise audience. For many projects, C++ is not an option.

Primary use will be DSP work in GNU Radio project, but multimedia extensions could prove useful anywhere in GUI's to audio/video app, etc.

I would take any pointers to such an already existing API/project, or be ready to start a new one if other people are interested.

--
Love salty crackers? catchy electronica? Try !
