High performance FFT on GPUs

Final Fantasy Tactics? by Musteval · 2006-05-29 04:32 · Score: 4, Funny

Why use a GPU for Final Fantasy Tactics? Couldn't you just use the GBA?

--
Note to mods: I'm probably being sarcastic.

Re:Final Fantasy Tactics? by numbski · 2006-05-29 05:47 · Score: 4, Funny

Seriously! I mean, with only 1-D, it's really going to suck. I know people have accused Final Fantasy of being too linear in the past, but this is getting a bit silly...now, we're stuck at a single point!

--
Karma: Chameleon (mostly due to the fact that you come and go).

It's nice... by non0score · 2006-05-29 04:34 · Score: 4, Informative

if you're only considering 32-bit floating point numbers and don't need full IEEE-754 compliance.

Re:It's nice... by john.r.strohm · 2006-05-29 04:51 · Score: 4, Interesting

Depending on what you're doing, for an FFT, you probably don't need 64-bit floating point, and you DON'T need full IEEE-754 compliance.

If you are taking data off of some kind of sensor, there are damned few sensors with 24 good bits of data out of the noise floor. Radars work just fine with 16-bit A/D converters.

IEEE-754 compliance helps you in the ill-defined corners of the number space. FFTs inherently work in the well-behaved arena of simple trig functions and three-function (add/subtract/multiply) math.

I'm currently doing FFTs with 16-bit fractional arithmetic in Blackfin DSP. For what I'm doing with the results, it is good enough.

Not to mention you could use a "GPU farm" to do a fast search, and take any "interesting" data regions and feed those to a 64-bit, fully-IEEE-754 compliant, slow-as-molasses-in-January x86 FFT.

Eventually, with some more articles like this one and yesterday's Cell piece, people will start to figure out that the x86 architecture is brain-dead and needs to be put out of its misery.
Re:It's nice... by stephentyrone · 2006-05-29 05:37 · Score: 4, Informative

FFTs inherently work in the well-behaved arena of simple trig functions and three-function (add/subtract/multiply) math.
add/subtract/multiply math is the area that 754 has had the biggest effect on - in fact, the spec has very little to say about transcendental functions, but is almost entirely concerned with the basic arithmetic ops. prior to 754, floating point was, in general, not algebraicly closed under +-*/, nor were the results correctly rounded.
most highly parallel GPU-type chips lack support for gradual underflow, for example, one of those "ill-defined corners of the number space" where 754 has been a tremendous boon. flush-to-zero is fine if you're decoding MP3s or unpacking texture maps, but it causes a lot of problems when you start trying to do more general scientific computations. sometimes those low order bits matter a whole lot; sometimes they're the difference between getting an answer accurate to 4 digits and an answer with *no* correct digits.
"simple trig functions" have their own problems on these architectures; try writing an acceptable range-reduction algorithm for sin or cos without having correctly rounded arithmetic ops. sin and cos are, in fact, two the hardest operations in the math lib on which to get acceptable accuracy.
admittedly, none of these objections are an issue with FFTs. but the reason that FFTs will perform acceptably on such an architecture is that the operations are (usually) constrained to the domain in which you don't encounter the problems i mention, not because the operations themselves are inherently safe. the lack of support for gradual underflow will cause you to lose many, many bits in frequency components that have nearly zero magnitude, but you usually don't care about those components when you're doing FFTs, anyway.
Re:It's nice... by edp · 2006-05-29 06:35 · Score: 3, Insightful

"If you are taking data off of some kind of sensor, there are damned few sensors with 24 good bits of data out of the noise floor. Radars work just fine with 16-bit A/D converters."
Take a look at their benchmarks. The chart goes up to eight million elements. The accumulated rounding error in FFT outputs may be around n * log2(n) ULP, where n is the number of elements, and ULP (units in last place) is relative to the largest input element. (Caveats: That is the maximum; the distribution of the logs of the errors resembles a normal distribution. Input was numbers selected from a uniform distribution over [0, 1). The error varies slightly depending on whether you have fused multiply-add and other factors.)
So with eight million elements, the error may be 184 million ULP, or over 27 bits. With only 24 bits in your floating-point format, that is a problem. Whether you had 24-bit or 1-bit data to start with, it is essentially gone in some output elements. Most errors are less than the maximum, but it seems there is a lot of noise and not so much signal.
It may be that the most interesting output elements are the ones with the highest magnitude. (The FFT is used to find the dominant frequencies in a signal.) If so, those output elements may be large relative to the error, so there could be useful results. However, anybody using such a large FFT with single-precision floating-point should analyze the error in their application.
Re:It's nice... by Waffle+Iron · 2006-05-29 07:38 · Score: 5, Informative

Eventually, with some more articles like this one and yesterday's Cell piece, people will start to figure out that the x86 architecture is brain-dead and needs to be put out of its misery.
Why? Because the x86 isn't a DSP?
The x86 is a general-purpose CPU. It isn't brain dead; historically it's almost always been at least half as fast as the latest expensive processor fad du jour, and sometimes it has actually been the fastest available general purpose processor. As these fads have come and gone, the x86 has quietly kept improving by incorporating many of their best ideas.
The cell processor is basically a POWER processor core packaged with a few DSPs tacked onto the die. That sounds like a kludge to me, but if it turns out to be a success, there's nothing stopping people from tacking DSPs onto an x86 die.
All a DSP is good at is fast number crunching. It usually has little in the way of an MMU, along with a memory architecture tuned mainly for vector-like operations, branch prediction tuned only for matrix math, etc. DSPs would make a bad choice for running general purpose programs, especially with cache and branch issues becoming the dominant performance bottleneck in recent times. DSPs would a horrible choice for running an OS with any kind of security enforcement. Using a GPU as a poor-man's DSP is interesting, but it suffers even more from these same limitations. If DSPs really offered a better solution for general-purpose problems, they would have replaced other CPU architectures decades ago.

Re:Uhh.. by Anonymous Coward · 2006-05-29 04:34 · Score: 5, Informative

Fast Fourier Transform

Rush hour math. by Anonymous Coward · 2006-05-29 04:35 · Score: 3, Insightful

""The UNC GAMMA group has recently released a high performance FFT library which can handle large 1-D FFTs. According to their webpage, the FFT library is able to achieve 4x higher computational performance on a $500 NVIDIA 7900 GPU than optimized Intel Math Kernel FFT routines running on high-end Intel and AMD CPUs costing $1500-$2000. "

GPUs are nice, but there's the little matter of getting data and results on and off the chip.

Re:Rush hour math. by corvair2k1 · 2006-05-29 08:35 · Score: 4, Informative

Typically, when doing these measurements, the GAMMA group counts the upload/download time as part of the computation time. So, the 4x-5x speedup you're seeing is end to end, with results starting and ending in main memory.

FFTs in GPUs, eh? by Anonymous Coward · 2006-05-29 04:35 · Score: 3, Funny

Well, seeing as how the V.P. is such a V.I.P., shouldn't we keep the P.C. on the Q.T.? 'Cause if it leaks to the V.C. he could end up M.I.A., and then we'd all be put out in K.P.

FFT=Fast Fourier Transforms by amliebsch · 2006-05-29 04:41 · Score: 4, Interesting

Isn't that what SETI@home uses for the bulk of its signal analysis? Would be kind of neat to leverage the millions of idle GPU's out there.

--
If you don't know where you are going, you will wind up somewhere else.

$1500-$2000? by Wesley+Felter · 2006-05-29 04:45 · Score: 3, Insightful

I sense a little bias here; the fastest Intel and AMD processors are actually $1,000.

Re:$1500-$2000? by veg_all · 2006-05-29 04:56 · Score: 4, Informative

RTFA? Or just look at the pretty diagrams.
They're running against dual-processor systems (Opteron and Xeon).

--
grammar-lesson free since 1999. (rescinded - 2005)

Cray-1 comparison by Mostly+a+lurker · 2006-05-29 04:56 · Score: 5, Interesting

The Cray-1A supercomputer, weighing in at 5.5 tons, had an absolute maximum peak performance of 250 megaflops. It, of course, cost millions and the power requirements (including for cooling) were in excess of 200 kW. I remember marveling at the advanced nature of this technological achievement.

Thirty years later, a $500 GPU, weighing less than 1 pound, can produce 6 gigaflops. People complain about its power and cooling needs, but they are rather below 200 kW! We sometimes forget just how amazing the developments in computing have been over the last three decades.

Re:This is news? by Fex303 · 2006-05-29 05:02 · Score: 4, Funny

I have an uncle who's a professor who's been using GPUs for scientific computation for years.

I'm sorry, but playing Quake at really high framerates does not count as research. He's not fooling anyone.

The business cards which list him as 'Profess0r of Pwnage' probably aren't helping either.

It's also bad when he refers to the undergrads as 'n00bs' during his lectures to them.

Re:Uhh.. by Fordiman · 2006-05-29 05:05 · Score: 5, Informative

an FFT is a transform that turns a signal (like an audio file) into its frequency components (like a spectrograph). It's used for MP3 compression, sound EQs, jpeg compression, mpeg4 compression, and a number of other things (I use FFTW for tuning my guitar).

FFTW is the 'Fastest Fourier Transform in the West', a cute name for the work of a number of graduate students who use several techniques to turn the FFT from 'Numerical Recipes in C' into a freaking speed daemon.

GPUFFTW is much the same thing, but ported to your video card's GPU - which is generally more optimized for doing the 'apply a floating point matrix to an array' thing - thus speedin the FFTW up even more while relieving the main processor from doing the work.

If you don't have a high-powered video card, this means nothing for you. If you do, it means the above operations (compression, spectrum analysis, etc) can be done faster and without eating up processes.

--
110100 1101000 1101000 1100110 0 1101111 1101000 1100011 1

That's where PCIe is useful by WoTG · 2006-05-29 05:07 · Score: 3, Informative

AGP was not very useful for bidirectional data flow, but PCIe is. GPU's are pretty sophisticated these days, so they've got the logic to handle moving stuff in and out of it's memory and over the bus to the CPU and the rest of the system.

Re:Any 64 bit GPU's? by Surt · 2006-05-29 05:12 · Score: 3, Interesting

Not yet. But in the next or second generation out your wish will be fulfilled (more and more game developers are pushing for 64 bit color accuracy, which will necessitate a transition to fully 64bit GPUs in the not distant future).

--
"Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking

Re:Uhh.. by kabz · 2006-05-29 05:16 · Score: 3, Interesting

Or in the form of a concrete example ... The little spectrum analysers in iTunes are a good example of taking some time domain data, analysing it, and displaying the low through high frequencies.

As an example of how far we've come, I implemented the Cooley-Tukey FFT in assembler on an Amiga, and it was just barely out of real-time. You had to capture some audio data, then wait while it was analysed. Nowadays, you can write the same thing in Objective-C on a G4, using the standard audio capture library, and have the FFT's computed between redraw events.

--
-- "It's not stalking if you're married!" My Wife.

Re:Uhh.. by exp(pi*sqrt(163)) · 2006-05-29 05:18 · Score: 4, Funny

FFT is a data compression and encryption standard used by a wide variety of extraterrestrial civilizations. Seti@home spends most of its time running FFT code to look for signals. If we managed to communicate with any of these aliens we could ask them what it stands for.

--
Doesn't it make you feel good to know that our freedoms are protected by politicans, lawyers and journalists.

Re:Any 64 bit GPU's? by TheRaven64 · 2006-05-29 05:25 · Score: 4, Interesting

more and more game developers are pushing for 64 bit color accuracy, which will necessitate a transition to fully 64bit GPUs in the not distant future

Current generation GPUs handle 64bit and 128bit colours already. A 64-bit colour value is just four channels of 16-bit floats (halfs in Cg parlance). A 128-bit colour value is a vector of four 32-bit colour values.

If game developers wanted 256-bit colour, then GPUs would need to support 64-bit floating point arithmetic. This is unlikely to happen, however, since 64-bit colour (which is really 48-bit colour with a 16-bit alpha channel) gives more colours than the human eye can distinguish. In fact, even with 64- or 128-bit colour for the intermediate results, current cards only have a 10-bit DAC for converting the colour value to an analogue quantity that can be displayed on an analogue screen.

It is worth noting that Pixar's RenderMan software doesn't use more than 128-bit colour, and films like Toy Story were rendered using 64-bit mode.

--
I am TheRaven on Soylent News

What's an FFT by Geoffreyerffoeg · 2006-05-29 05:31 · Score: 5, Informative

Apparently nobody knows what an FFT is. Here's the best description I can give without descending into math too much.

The Fast Fourier Transform is an algorithm to turn a set of data (as amplitude vs. time) into a set of waves (as amplitude vs. frequency). Say that I have a recording of a piano playing an A at 440 Hz. If I plot the actual data that the sound card records, it'll come out something like this picture. There's a large fading-out, then the 440 Hz wave, then a couple of overtones at multiples of 440 Hz. The Fourier series will have a strong spike at 440 Hz, then smaller spikes at higher frequencies: something like this plot. (Of course, that's not at 440, but you get the idea.)

The reason we like Fourier transforms is that once you have that second plot, it's extremely easy to tell what the frequency of the wave is, for example - just look for the biggest spike. It's a much more efficient way to store musical data, and it allows for, e.g., pitch transformations (compute the FFT, add your pitch change to the result, and compute the inverse FFT which uses almost the same formula). It's good for data compression because it can tell us which frequencies are important and which are imperceptible - and it's much smaller to say "Play 440 Hz, plus half an 880 Hz, plus..." than to specify each height at each sampling interval.

The FFT is a very mathematics-heavy algorithm, which makes it well suited for a GPU (a math-oriented device, because it performs a lot of vector and floating-point calculations for graphics rendering) as opposed to a general-purpose CPU (which is more suited for data transfer and processing, memory access, logic structures, integer calculations, etc.) We're starting to see a lot of use of the GPU as the modern equivalent of the old math coprocessor.

If you're looking for more information, Wikipedia's FFT article is a good technical description of the algorithm itself. This article has some good diagrams and examples, but his explanation is a little non-traditional.

Re:Any 64 bit GPU's? by Surt · 2006-05-29 05:36 · Score: 3, Informative

It's at the pixel shader level that you run into low color rendition on current GPUs, and also where the people doing math on GPU are doing their work. That's where the move to 64 bit will likely happen soon, and will conveniently help the math people as a side effect.

--
"Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking

Finally.... by Comboman · 2006-05-29 06:11 · Score: 4, Funny

Finally I have a good excuse to give the IT department why I need to upgrade my video card. I need to do FFTs faster (it has nothing at all to do with Doom3).

--
Support Right To Repair Legislation.

Re:As you should very well know... by Jesus_666 · 2006-05-29 08:16 · Score: 3, Funny

Knowing how to hook up a composite video input is irrelevant. The civilised world uses SCART. You can't screw up hooking up SCART.

And just you wait, France will develop a European alternative to that Fourier nonsense as well!

--
USE HOT GRITS WITH STATUE OF NATALIE PORTMAN (NAKED AND PETRIFIED)

Re:As you should very well know... by jawtheshark · 2006-05-29 08:33 · Score: 3, Funny

Just in case some people don't get the following statement:

And just you wait, France will develop a European alternative to that Fourier nonsense as well!

Read ... And notice the country....

Mod the guy up.. He might be a troll, but he is funny :-D

--
Ahhh...the great dumpster continuum. Many a free computer will be found there. -- sowth (748135)

Nope! by woolio · 2006-05-29 11:51 · Score: 3, Informative

Sorry, the FFT of a time-domain signal does **NOT** indicate how the power (or energy) of the signal is distributed.

For the latter, you need a PSD (power spectral density) plot, which is obtained by finding the square of the magnitude of the freq-domain FFT (complex) outputs.

And the term "FFT" usually describes a specific class of algorithms that finds a Discrete Fourier Transform of a signal in much less than O(N^2) time, where N is the number of elements/samples considered.

However, the FFT is also useful to perform fast polynomial multiplication (and even fast multiplication of very very very long numbers). This application has nothing to do with power or frequencies in a signal.

Re:Interesting question... by SETIGuy · 2006-05-29 13:14 · Score: 4, Informative

Yes, 32 bits is quite enough for our FFTs. Our requirements are fairly low. 16bit floats may even do the job (although I've never tried 16bit floats in SETI@home). What has concerned us in the past is that bandwidth to GPUs was fairly assymmetric (on AGP cards), the seti@home working sets (A few buffers of 1M complex samples == 16MB each) were larger than the usable memory on many video cards and the length of the maximum shader routine was fairly small. SETI@home does quite a bit more than FFTs, so moves into and out of main memory were required. At the time we couldn't put more into the shader language. That may have changed, but right now we lack anyone who both has the time to do the job and is capable of doing it.

Our tests on nVidia 5600 series AGP cards (this was several years ago) showed that the net SETI@home throughput using the GPU was at best 1/5 of what we could obtain with the CPU. This was primarily due to transfers out of graphics memory and into main memory.

PCI Express allows for symmetric bandwidth to graphics memory and graphics memories are now typically larger than the size of our working set. The difficulty will be in benchmarking to see which is faster for a specific GPU/CPU combination.

At any rate it's a fairly simple job to swap FFT routines in SETI@home. The source is available. Someone may have done it by now...

--
Support SETI@home

Your error analysis is totally wrong by stevenj · 2006-05-29 13:14 · Score: 4, Informative

The Cooley-Tukey FFT algorithm and its variants, which is what they are using, has much better error characteristics than you think.

In floating-point arithmetic, the algorithm was proved in 1966 to have an upper bound for the error that grows only as O(log N), and the mean (rms) error grows only as O(log N). (See this page for more info.) (Errors in fixed-point arithmetic are worse, going as N.)

Even in single precision, the errors for their FFT sizes are probably quite reasonable, assuming they haven't done something silly like use an unstable trigonometric recurrence.

--
If a thing is not diminished by being shared, it is not rightly owned if it is only owned & not shared. S. Augustine

erratum by stevenj · 2006-05-29 13:17 · Score: 3, Informative

That should be O(sqrt(log N)) for the rms error and O(sqrt(N)) for fixed point. (My sqrt symbols didn't post properly somehow)

--
If a thing is not diminished by being shared, it is not rightly owned if it is only owned & not shared. S. Augustine

31 of 274 comments (clear)