High performance FFT on GPUs

It's nice... by non0score · 2006-05-29 04:34 · Score: 4, Informative

if you're only considering 32-bit floating point numbers and don't need full IEEE-754 compliance.

Re:It's nice... by stephentyrone · 2006-05-29 05:37 · Score: 4, Informative

FFTs inherently work in the well-behaved arena of simple trig functions and three-function (add/subtract/multiply) math.
add/subtract/multiply math is the area that 754 has had the biggest effect on - in fact, the spec has very little to say about transcendental functions, but is almost entirely concerned with the basic arithmetic ops. prior to 754, floating point was, in general, not algebraicly closed under +-*/, nor were the results correctly rounded.
most highly parallel GPU-type chips lack support for gradual underflow, for example, one of those "ill-defined corners of the number space" where 754 has been a tremendous boon. flush-to-zero is fine if you're decoding MP3s or unpacking texture maps, but it causes a lot of problems when you start trying to do more general scientific computations. sometimes those low order bits matter a whole lot; sometimes they're the difference between getting an answer accurate to 4 digits and an answer with *no* correct digits.
"simple trig functions" have their own problems on these architectures; try writing an acceptable range-reduction algorithm for sin or cos without having correctly rounded arithmetic ops. sin and cos are, in fact, two the hardest operations in the math lib on which to get acceptable accuracy.
admittedly, none of these objections are an issue with FFTs. but the reason that FFTs will perform acceptably on such an architecture is that the operations are (usually) constrained to the domain in which you don't encounter the problems i mention, not because the operations themselves are inherently safe. the lack of support for gradual underflow will cause you to lose many, many bits in frequency components that have nearly zero magnitude, but you usually don't care about those components when you're doing FFTs, anyway.
Re:It's nice... by Waffle+Iron · 2006-05-29 07:38 · Score: 5, Informative

Eventually, with some more articles like this one and yesterday's Cell piece, people will start to figure out that the x86 architecture is brain-dead and needs to be put out of its misery.
Why? Because the x86 isn't a DSP?
The x86 is a general-purpose CPU. It isn't brain dead; historically it's almost always been at least half as fast as the latest expensive processor fad du jour, and sometimes it has actually been the fastest available general purpose processor. As these fads have come and gone, the x86 has quietly kept improving by incorporating many of their best ideas.
The cell processor is basically a POWER processor core packaged with a few DSPs tacked onto the die. That sounds like a kludge to me, but if it turns out to be a success, there's nothing stopping people from tacking DSPs onto an x86 die.
All a DSP is good at is fast number crunching. It usually has little in the way of an MMU, along with a memory architecture tuned mainly for vector-like operations, branch prediction tuned only for matrix math, etc. DSPs would make a bad choice for running general purpose programs, especially with cache and branch issues becoming the dominant performance bottleneck in recent times. DSPs would a horrible choice for running an OS with any kind of security enforcement. Using a GPU as a poor-man's DSP is interesting, but it suffers even more from these same limitations. If DSPs really offered a better solution for general-purpose problems, they would have replaced other CPU architectures decades ago.

Re:Uhh.. by Anonymous Coward · 2006-05-29 04:34 · Score: 5, Informative

FFT; by MrShaggy · 2006-05-29 04:37 · Score: 2, Informative

FFT; Fast Fourier Transform - a specific algorithm, but used to indicate any algorithm attempting to determine the power versus frequency graph for a signal Dag-nadit

--
I have mod points and I am not afraid to use them.

Re:Uhh.. by SpinyNorman · 2006-05-29 04:47 · Score: 2, Informative

There's a somewhat non-obvious mathematical result that any continuous periodic function can be decomposed as the sum of a series of sine functions of different frequencies. This series of sine waves is referred to as the fourier series of the function. The FFT (fast fourier transform) is an efficient numeric algorithm to derive the coefficients of the fourier series for any function.

One useful way to think of the FFT is as transform of signal data from the time domain (raw samples) to the frequency domain (the constituent sine waves). This is useful for all sorts of purposes such as being the first step in speech recognition, the basis of JPEG/MPEG compression, ...

Re:Math library for sale? by Kortec · 2006-05-29 04:51 · Score: 2, Informative

This is probably making a lot of developers, myself included, very very happy people. FFT's are where the proverbial magic happens for a lot of signals and systems analysis, as well as for the multiplication of very large integers. So anyone involved in gaming that includes digital signal processing (voice chat in UT, karaoke-karaoke-revolution type games, analog user input, etc.) is going to be happy, and anyone who's involved in multiplying huge integers (crypto anyone?) may very well have wet themselves.

If this is true, to be able to make the same computations in a fourth of the time is a pretty nice thing, and using a little more of a GPU is likely to be acceptable in a decent number of applications.

--
"My heart is in the work." - Andrew Carnegie

Re:$1500-$2000? by veg_all · 2006-05-29 04:56 · Score: 4, Informative

RTFA? Or just look at the pretty diagrams.
They're running against dual-processor systems (Opteron and Xeon).

--
grammar-lesson free since 1999. (rescinded - 2005)

Re:$1500-$2000? by MP3Chuck · 2006-05-29 04:58 · Score: 2, Informative

Pricegrabber lists some up in the range of several thousand dollars ... like this Itanium, for just over $5k and this Dual Core Xeon for $3,700.

--
Who doesn't like free music?

Re:Uhh.. by Fordiman · 2006-05-29 05:05 · Score: 5, Informative

an FFT is a transform that turns a signal (like an audio file) into its frequency components (like a spectrograph). It's used for MP3 compression, sound EQs, jpeg compression, mpeg4 compression, and a number of other things (I use FFTW for tuning my guitar).

FFTW is the 'Fastest Fourier Transform in the West', a cute name for the work of a number of graduate students who use several techniques to turn the FFT from 'Numerical Recipes in C' into a freaking speed daemon.

GPUFFTW is much the same thing, but ported to your video card's GPU - which is generally more optimized for doing the 'apply a floating point matrix to an array' thing - thus speedin the FFTW up even more while relieving the main processor from doing the work.

If you don't have a high-powered video card, this means nothing for you. If you do, it means the above operations (compression, spectrum analysis, etc) can be done faster and without eating up processes.

--
110100 1101000 1101000 1100110 0 1101111 1101000 1100011 1

That's where PCIe is useful by WoTG · 2006-05-29 05:07 · Score: 3, Informative

AGP was not very useful for bidirectional data flow, but PCIe is. GPU's are pretty sophisticated these days, so they've got the logic to handle moving stuff in and out of it's memory and over the bus to the CPU and the rest of the system.

Re:FFT=Fast Fourier Transforms by fredrikj · 2006-05-29 05:09 · Score: 2, Informative

It's also used by distributed mathematics projects such as GIMPS to multiply large numbers. Unfortunately, if this implementation only operates in 32-bit precision, it will probably be less useful for this purpose since you'd have to do subproducts with fewer digits at a time, to avoid rounding error. I'm not familiar with the details, though.

Re:Any 64 bit GPU's? by Fordiman · 2006-05-29 05:15 · Score: 2, Informative

No.

Implementing 'big numbers', or numbers larger than the proccessor's spec, is actually quite computationally heavy when compared to the operations you're replacing. As such, a 4x increase in the speed of computation can translate to a (to pull a number from my arse) 0.25x loss of performance when dealing with larger floats.

However...

With CPU/GPU cooperation, the floating gap can be handled by using the CPU to generate a lookup table of high-precision trig as, say, a texture, and treating the numbers as mere pointers to an array. Addition is a relatively light bignum math, and with the FFT, you could implement the addition and lookup math quite speedily on the GPU side.

Of course, from reading in, I'm pretty sure that's what's in-process for higher precision.

--
110100 1101000 1101000 1100110 0 1101111 1101000 1100011 1

Precision limit. by DrYak · 2006-05-29 05:16 · Score: 2, Informative

The limit is the floating-point precision of the GPU.

Most GPU can do max up to 32-bit floating point operations (depending on the brand and the model), where as most scientific applications use 64-bit and higher (the old FP unit could do 80bit FP math, SSE registers in recent processors can do 128-bits FP math).

So some user will be happy, like for sound processing (GPU have already been used to process reverberation to create realistic sound environnement - too lazy to do the search for the slashdot reference)
But other application (cryptography maybe) will probably need more FP precision.

Not to mention that most scientific applications run mostly under *nix like Linux or BSD, for which GFX driver support isn't always incredible, specially for recent models, (website mentions performance hit).
(And also remember that soon Vista will have an interface that'll completly clog the GPU and leave less free cycles to do general purpose calculation).

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]

Bytes/bits? by 4D6963 · 2006-05-29 05:23 · Score: 2, Informative

From the site :

The Video RAM will determine the maximum array length that can be sorted on the GPU. A rough guideline for performing FFT on 32-bit floats is: Maximum array length in millions = Video RAM in MB / 32

Max array length equals video RAM in megabytes divided by 32... bits? Correct me if i'm dumb but shouldn't it rather be "Video RAM in MB / 4"?

--
You just got troll'd!

What's an FFT by Geoffreyerffoeg · 2006-05-29 05:31 · Score: 5, Informative

Apparently nobody knows what an FFT is. Here's the best description I can give without descending into math too much.

The Fast Fourier Transform is an algorithm to turn a set of data (as amplitude vs. time) into a set of waves (as amplitude vs. frequency). Say that I have a recording of a piano playing an A at 440 Hz. If I plot the actual data that the sound card records, it'll come out something like this picture. There's a large fading-out, then the 440 Hz wave, then a couple of overtones at multiples of 440 Hz. The Fourier series will have a strong spike at 440 Hz, then smaller spikes at higher frequencies: something like this plot. (Of course, that's not at 440, but you get the idea.)

The reason we like Fourier transforms is that once you have that second plot, it's extremely easy to tell what the frequency of the wave is, for example - just look for the biggest spike. It's a much more efficient way to store musical data, and it allows for, e.g., pitch transformations (compute the FFT, add your pitch change to the result, and compute the inverse FFT which uses almost the same formula). It's good for data compression because it can tell us which frequencies are important and which are imperceptible - and it's much smaller to say "Play 440 Hz, plus half an 880 Hz, plus..." than to specify each height at each sampling interval.

The FFT is a very mathematics-heavy algorithm, which makes it well suited for a GPU (a math-oriented device, because it performs a lot of vector and floating-point calculations for graphics rendering) as opposed to a general-purpose CPU (which is more suited for data transfer and processing, memory access, logic structures, integer calculations, etc.) We're starting to see a lot of use of the GPU as the modern equivalent of the old math coprocessor.

If you're looking for more information, Wikipedia's FFT article is a good technical description of the algorithm itself. This article has some good diagrams and examples, but his explanation is a little non-traditional.

Re:What's an FFT by Yenshee · 2006-05-29 09:46 · Score: 2, Informative

Great summary, but you are describing a general Fourier Transform, not necessarily a *fast* FT. FFT refers to a specific divide and conquer algorithm that has O(n*lg(n)) time complexity, rather than the O(n^2) time complexity of the naive FT algorithm. Here n is the number of discrete samples, i.e. the size of the problem.
FFT is to naive FT as quicksort is to insertion sort. Both quicksort and the FFT are considered to be among the top ten algorithms of the 20th century.
Re:What's an FFT by fredrikj · 2006-05-29 11:13 · Score: 2, Informative

FFT is to naive FT as quicksort is to insertion sort.

Be careful with the terminology; you correctly referred to "naive FT algorithm" above, but this sentence might give the impression that the Fourier transform itself is an algorithm. FT is a function whereas the FFT is an algorithm that computes the function. It would be more appropriate to say that FFT is to the Fourier transform what quicksort is to sorting.

Re:Any 64 bit GPU's? by Surt · 2006-05-29 05:36 · Score: 3, Informative

It's at the pixel shader level that you run into low color rendition on current GPUs, and also where the people doing math on GPU are doing their work. That's where the move to 64 bit will likely happen soon, and will conveniently help the math people as a side effect.

--
"Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking

Re:Any 64 bit GPU's? by jthill · 2006-05-29 05:41 · Score: 2, Informative

Yes. Some guys at LBNL took good look. /. had the story yesterday. When they were trying, they repeatedly toasted Cray's best. With a "naive" FFT implementation -- not half trying -- they got 80%.

--
As always, all IMO. Insert "I think" everywhere grammatically possible.

Re:If you need gigaflops... by zeldor · 2006-05-29 06:14 · Score: 2, Informative

not so custom really if you go with the right integrator.
SRC computers puts out FPGA based systems that has a nice
32bit floating point fft library already in their development
environment. Most customers using the fft are for radar image
processing where the best PC solution is 50 times slower
then the fpga based solution. Think UAVs with smart
tracking off their radar.
http://www.srccomputers.com/

--
If I could walk that way I wouldnt need cologne.

8xx verses 2xx by Somegeek · 2006-05-29 07:03 · Score: 2, Informative

Note that you can only use the 2xx Opterons in 1 or 2 CPU (2 or 4 core) motherboards. If you want to have 4 CPUs with 8 cores you need to use the 8 series Opterons. The Opteron 880 dual core currently starts at over USD 1,600.00 each, which makes your configuration start at about 7,500 just for the MB and CPUs. Then add the registered RAM, server case, big juicy power supply, drives, video, monitor, a UPS to protect all of that... It sounds sexy but realistically its going to be close to $10K if you build it yourself.

--
And as you tread the halls of sanity, You feel so glad to be, Unable to go beyond. I have a message, From another time..

Re:Correction on usages by Anonymous Coward · 2006-05-29 07:41 · Score: 1, Informative

DCT and FFT are pretty much the same thing, except the DCT is limited to the real part of the signal. You could use an FFT in the place of a DCT if you discarded the imaginary data.

Re:Rush hour math. by corvair2k1 · 2006-05-29 08:35 · Score: 4, Informative

Typically, when doing these measurements, the GAMMA group counts the upload/download time as part of the computation time. So, the 4x-5x speedup you're seeing is end to end, with results starting and ending in main memory.

Re:Cryptography? by Anonymous Coward · 2006-05-29 08:58 · Score: 1, Informative

Public key cryptography needs much bigger integers than fit a machine register. Remember those 1024-bit or even 4096-bit keys? Those represent integers. Multiplying big integers can be done more efficiently by FFT or FFT-like techniques, and if you've got fast floating point then yes, it is more efficient to use floating point to do big integer multiplication.

The basic trick there is that integer multiplication is essentially convolution on the pieces of the numbers, and convolution is something that is trivial to compute in frequency space, which is where the FFT comes in. Check out the "Fourier Transform Methods" section of the wikipedia article on multiplication algorithms.

Re:Great for audio! by CalcWiz · 2006-05-29 11:35 · Score: 2, Informative

Generally you will only notice a difference if you are running another GPU intensive program at the same time. Most GPGPU programs (including ones made by Gamma) use extensions such as the frame buffer object so that you don't need to see the mathematical garbage in a graphical sense. Thus the only side effect is that it needs a significant amount of video memory and GPU processing time, both of which will be shared with other programs trying to use them. So if you try to compute the FFT of something that barely fits in video memory while playing UT 2004 at 1600x1200 resolution, you might have a problem. Otherwise, you probably won't notice.

--
If PRO- is the opposite of CON-, What's the opposite of Progress?

Nope! by woolio · 2006-05-29 11:51 · Score: 3, Informative

Sorry, the FFT of a time-domain signal does **NOT** indicate how the power (or energy) of the signal is distributed.

For the latter, you need a PSD (power spectral density) plot, which is obtained by finding the square of the magnitude of the freq-domain FFT (complex) outputs.

And the term "FFT" usually describes a specific class of algorithms that finds a Discrete Fourier Transform of a signal in much less than O(N^2) time, where N is the number of elements/samples considered.

However, the FFT is also useful to perform fast polynomial multiplication (and even fast multiplication of very very very long numbers). This application has nothing to do with power or frequencies in a signal.

Re:stating the obviou... by CalcWiz · 2006-05-29 11:53 · Score: 2, Informative

It just doesn't work that way. GPU's get their advantage because they are highly parallel and heavily pipelined. This makes it easier and cheaper to add more computational power to the card. The problem is that there are a lot of things you just can't do efficiently. The time it takes for one piece of data to get through the pipeline prohibits many general purpose operations. The big advantage for a lot of applications is that it takes just about the same amount of time for thousdands to millions of pieces of data to get through the pipeline, they just can't depend on each other.

Go read a tutorial on CPU architecture and where it talks about pipelining replace the pipeline that takes 5 cycles end to end with one that takes thousands of cycles and imagine how you could write basic algorithms. Worse yet, imagine how you would handle IO or interrupts.

--
If PRO- is the opposite of CON-, What's the opposite of Progress?

Re:Errr, I don't want to sound skeptical... by eh2o · 2006-05-29 12:45 · Score: 2, Informative

djbfft apparently had an edge at some point, but now has not been updated for more than 5 years. meanwhile FFTW has incremented the major version number to 3, undergone a complete rewrite, added simd, multiprocessor, 64bit and a slew of other things (its obviously not a stagnant project). not to mention its the basis of the 'fft' function in matlab and thereby probably the most used fft implementation in the world. assuming their benchmarks (which now include accuracy as well as speed) are valid, Intel Performance Suite is probably their closest competitor.

Re:Interesting question... by SETIGuy · 2006-05-29 13:14 · Score: 4, Informative

Yes, 32 bits is quite enough for our FFTs. Our requirements are fairly low. 16bit floats may even do the job (although I've never tried 16bit floats in SETI@home). What has concerned us in the past is that bandwidth to GPUs was fairly assymmetric (on AGP cards), the seti@home working sets (A few buffers of 1M complex samples == 16MB each) were larger than the usable memory on many video cards and the length of the maximum shader routine was fairly small. SETI@home does quite a bit more than FFTs, so moves into and out of main memory were required. At the time we couldn't put more into the shader language. That may have changed, but right now we lack anyone who both has the time to do the job and is capable of doing it.

Our tests on nVidia 5600 series AGP cards (this was several years ago) showed that the net SETI@home throughput using the GPU was at best 1/5 of what we could obtain with the CPU. This was primarily due to transfers out of graphics memory and into main memory.

PCI Express allows for symmetric bandwidth to graphics memory and graphics memories are now typically larger than the size of our working set. The difficulty will be in benchmarking to see which is faster for a specific GPU/CPU combination.

At any rate it's a fairly simple job to swap FFT routines in SETI@home. The source is available. Someone may have done it by now...

--
Support SETI@home

Your error analysis is totally wrong by stevenj · 2006-05-29 13:14 · Score: 4, Informative

The Cooley-Tukey FFT algorithm and its variants, which is what they are using, has much better error characteristics than you think.

In floating-point arithmetic, the algorithm was proved in 1966 to have an upper bound for the error that grows only as O(log N), and the mean (rms) error grows only as O(log N). (See this page for more info.) (Errors in fixed-point arithmetic are worse, going as N.)

Even in single precision, the errors for their FFT sizes are probably quite reasonable, assuming they haven't done something silly like use an unstable trigonometric recurrence.

--
If a thing is not diminished by being shared, it is not rightly owned if it is only owned & not shared. S. Augustine

erratum by stevenj · 2006-05-29 13:17 · Score: 3, Informative

That should be O(sqrt(log N)) for the rms error and O(sqrt(N)) for fixed point. (My sqrt symbols didn't post properly somehow)

--
If a thing is not diminished by being shared, it is not rightly owned if it is only owned & not shared. S. Augustine

Re:Based on FFTW? by Anonymous Coward · 2006-05-29 13:35 · Score: 2, Informative

You're right, they give them no credit. From TFA:

We are not porting FFTW to GPUs and our project is not related to FFTW. FFTW is a more general library designed mainly for CPUs. GPUFFTW uses some cache optimizations to obtain maximum memory performance on GPUs.

33 of 274 comments (clear)