High performance FFT on GPUs
A reader writes: "The UNC GAMMA group has recently released a high performance FFT library which can handle large 1-D FFTs. According to their webpage, the FFT library is able to achieve 4x higher computational performance on a $500 NVIDIA 7900 GPU than optimized Intel Math Kernel FFT routines running on high-end Intel and AMD CPUs costing $1500-$2000. The library is supported for both Linux and Windows platforms and is tested to work on many programmable GPUs. There is also a link to download the library freely for non-commercial use."
Isn't that what SETI@home uses for the bulk of its signal analysis? Would be kind of neat to leverage the millions of idle GPUs out there.
If you don't know where you are going, you will wind up somewhere else.
I have an uncle who's a professor who's been using GPUs for scientific computation for years. Apparently he has systems with four GPUs running simulations.
Depending on what you're doing, for an FFT, you probably don't need 64-bit floating point, and you DON'T need full IEEE-754 compliance.
If you are taking data off of some kind of sensor, there are damned few sensors with 24 good bits of data out of the noise floor. Radars work just fine with 16-bit A/D converters.
IEEE-754 compliance helps you in the ill-defined corners of the number space. FFTs inherently work in the well-behaved arena of simple trig functions and three-function (add/subtract/multiply) math.
I'm currently doing FFTs with 16-bit fractional arithmetic in Blackfin DSP. For what I'm doing with the results, it is good enough.
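For anyone who hasn't worked in 16-bit fractional arithmetic, here is a minimal sketch of what a radix-2 butterfly looks like in the common Q15 format. This is generic Python, not Blackfin code or any vendor's intrinsics; the function names are made up for illustration, and the halve-then-saturate scaling is just one common convention.

```python
def sat16(v):
    """Clamp an integer to the signed 16-bit range."""
    return max(-32768, min(32767, v))

def to_q15(x):
    """Convert a float in [-1, 1) to a Q15 integer (value * 2**15), saturating."""
    return sat16(int(round(x * 32768)))

def from_q15(q):
    """Convert a Q15 integer back to a float."""
    return q / 32768.0

def q15_mul(a, b):
    """Multiply two Q15 values, rounding the 30-bit product back to Q15."""
    return sat16((a * b + (1 << 14)) >> 15)

def q15_butterfly(a_re, a_im, b_re, b_im, w_re, w_im):
    """Radix-2 butterfly on Q15 complex values: (a + w*b, a - w*b).

    Only adds, subtracts and multiplies are needed. The outputs are halved
    to limit growth from stage to stage, then saturated in case they still
    do not fit in 16 bits.
    """
    t_re = q15_mul(b_re, w_re) - q15_mul(b_im, w_im)
    t_im = q15_mul(b_re, w_im) + q15_mul(b_im, w_re)
    return (sat16((a_re + t_re) >> 1), sat16((a_im + t_im) >> 1),
            sat16((a_re - t_re) >> 1), sat16((a_im - t_im) >> 1))

# Example: a = b = 0.25, twiddle factor of (almost) 1.0.
result = q15_butterfly(to_q15(0.25), 0, to_q15(0.25), 0, to_q15(1.0), 0)
print([from_q15(v) for v in result])
```

Halving at every stage costs you one bit of headroom per stage, which is part of why 16-bit fixed point is "good enough" for some uses and not others.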
Not to mention you could use a "GPU farm" to do a fast search, and take any "interesting" data regions and feed those to a 64-bit, fully-IEEE-754 compliant, slow-as-molasses-in-January x86 FFT.
Eventually, with some more articles like this one and yesterday's Cell piece, people will start to figure out that the x86 architecture is brain-dead and needs to be put out of its misery.
Thirty years later, a $500 GPU, weighing less than 1 pound, can produce 6 gigaflops. People complain about its power and cooling needs, but they are rather below 200 kW! We sometimes forget just how amazing the developments in computing have been over the last three decades.
Not yet. But within the next generation or two your wish will be fulfilled (more and more game developers are pushing for 64-bit color accuracy, which will necessitate a transition to fully 64-bit GPUs in the not too distant future).
"Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
Or in the form of a concrete example ... The little spectrum analysers in iTunes are a good example of taking some time domain data, analysing it, and displaying the low through high frequencies.
As an example of how far we've come, I implemented the Cooley-Tukey FFT in assembler on an Amiga, and it was just barely out of real-time. You had to capture some audio data, then wait while it was analysed. Nowadays, you can write the same thing in Objective-C on a G4, using the standard audio capture library, and have the FFTs computed between redraw events.
-- "It's not stalking if you're married!" My Wife.
On their page they almost compare against a program called libgpufft (an open source, BSD-licensed library that does the same job). I wonder how they do compared to the BSD-licensed version.
The interesting question will be:
Is 32-bit precision enough for the SETI@home application?
Or does the project need the higher precision (64-bit to 128-bit) that can (for now) only be provided by the CPU?
IMHO, maybe this could be useful. They're trying to find which chunks contain candidate data. If there's some fast low-precision algorithm that can quickly sort chunks into interesting / recheck with higher precision / uninteresting, it'll help to quickly tell the interesting chunks apart, even if the data needs to be post-processed later at higher precision.
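A rough sketch of that screen-then-recheck idea, in plain Python. Everything here is an assumption made for illustration: the peak-to-mean test, the threshold of 15, the chunk size, and the trick of emulating single precision by rounding through `struct`. It is not how SETI@home or the GPU library actually works.

```python
import math
import random
import struct

def to_f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def power_spectrum(samples, single_precision=False):
    """Naive DFT power for bins 1 .. n/2 - 1, optionally rounding every
    intermediate result to 32 bits to mimic single-precision hardware."""
    rnd = to_f32 if single_precision else (lambda v: v)
    n = len(samples)
    powers = []
    for k in range(1, n // 2):
        re = im = 0.0
        for t, x in enumerate(samples):
            angle = -2.0 * math.pi * k * t / n
            re = rnd(re + rnd(rnd(x) * rnd(math.cos(angle))))
            im = rnd(im + rnd(rnd(x) * rnd(math.sin(angle))))
        powers.append(rnd(rnd(re * re) + rnd(im * im)))
    return powers

def peak_to_mean(powers):
    """Ratio of the strongest bin to the average bin power."""
    return max(powers) / (sum(powers) / len(powers))

def two_stage_search(chunks, threshold=15.0):
    """Stage 1: cheap 32-bit screen. Stage 2: 64-bit recheck of candidates."""
    candidates = [i for i, c in enumerate(chunks)
                  if peak_to_mean(power_spectrum(c, True)) > threshold]
    return [i for i in candidates
            if peak_to_mean(power_spectrum(chunks[i])) > threshold]

def make_chunk(rng, n, tone_bin=None):
    """Uniform noise, optionally with a unit-amplitude sine in one bin."""
    chunk = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    if tone_bin is not None:
        chunk = [x + math.sin(2.0 * math.pi * tone_bin * t / n)
                 for t, x in enumerate(chunk)]
    return chunk

rng = random.Random(0)
chunks = [make_chunk(rng, 64), make_chunk(rng, 64, tone_bin=5), make_chunk(rng, 64)]
print(two_stage_search(chunks))
```

The point is only that the expensive high-precision pass runs on the few chunks the cheap pass flags, so the slow path stops being the bottleneck.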
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Current generation GPUs handle 64-bit and 128-bit colours already. A 64-bit colour value is just four channels of 16-bit floats (halfs in Cg parlance). A 128-bit colour value is a vector of four 32-bit floats.
If game developers wanted 256-bit colour, then GPUs would need to support 64-bit floating point arithmetic. This is unlikely to happen, however, since 64-bit colour (which is really 48-bit colour with a 16-bit alpha channel) gives more colours than the human eye can distinguish. In fact, even with 64- or 128-bit colour for the intermediate results, current cards only have a 10-bit DAC for converting the colour value to an analogue quantity that can be displayed on an analogue screen.
It is worth noting that Pixar's RenderMan software doesn't use more than 128-bit colour, and films like Toy Story were rendered using 64-bit mode.
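The bit counts above are easy to check for yourself. A small sketch using Python's standard `struct` module, where the `e` format code is an IEEE half-precision float and `f` is single precision; the RGBA values are arbitrary examples:

```python
import struct

# One RGBA pixel: red, green, blue, alpha.
rgba = (1.0, 0.5, 0.25, 1.0)

# "64-bit colour": four 16-bit half floats.
half_pixel = struct.pack("<4e", *rgba)

# "128-bit colour": four 32-bit floats.
float_pixel = struct.pack("<4f", *rgba)

print(len(half_pixel) * 8, "bits per pixel with halfs")
print(len(float_pixel) * 8, "bits per pixel with floats")
```

Neither format involves a 64-bit floating point number anywhere, which is why wide colour formats by themselves say nothing about double-precision support.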
I am TheRaven on Soylent News
So you are too lazy to go look it up on something like wikipedia, which will take a few minutes, but not too lazy to check back here 100 times today for someone to make a post?
This is somewhat off topic, but keeps to the fact that GPUs are having profound effects in the cinematography world. Check out this paper detailing impressive gains using a GPU versus Pixar's standard engine and renderer. Some gains are upwards of 20,000X more efficient, with almost identical visual results!
http://www.vidimce.org/publications/lpics/
Interesting? Cheers!
The one thing that I haven't seen mentioned is that the benchmarks only show "compute timings" and not actual setup and retrieval times. If the benchmarks showed the amount of time to get the data to the GPU, and especially the time to get the result back to a place where a program could actually use it, then it would be blown out of the water by the CPU. Future cards/drivers could speed up the process of retrieving the data, but for now there will always be lame benchmarks like this that are unfairly biased toward the GPU and only tell half the story. I mean, what's the point of doing an FFT so quickly if it takes forever to actually be able to get to the data?
Right then. So how long before they just include some weak general-purpose instructions in the GPU, add SATA and ethernet to the cards, and call it a budget PC?
Kid-proof tablet..
The sizes of transforms they are using for comparison here are on the order of 1 million points. This is huge for an FFT, and truncation error will definitely come into play here using only 32-bit precision. Whether this will be adequate depends entirely on what you are doing. Also, it's not at all clear what they did on the other platforms. There are some tricks to doing very long sequences; essentially using a 2D transform to perform a long 1D transform. It's not trivial, and requires some extra work, but it is generally a lot more efficient than taking a 1D transform and shoving a 4 million element sequence into it. The inner loops of a 1D transform will eventually thrash the cache for such a large transform, so using a blocked 2D transform avoids this, with some overhead of course. It's hard to tell what they are doing from the performance curves, since they report seconds, and it needs to be scaled by n log n to really see what's going on. It's cool they tried this, though. I was looking at using a GPU to do FFTs and linear algebra kernels a couple of years ago, but decided not to go there as I didn't think it would pay off, mostly because of the 32-bit precision.
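For those wondering what "using a 2D transform to perform a long 1D transform" means, here is a minimal sketch of the textbook decomposition for a length n = n1 * n2. It uses a naive DFT for the short sub-transforms to keep the code readable, so it shows the index bookkeeping, not the speed. The function names are made up for illustration, and nothing here is claimed to be what the GPU library or MKL actually does.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, used here as the short transform and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_via_2d(x, n1, n2):
    """Length n1*n2 DFT computed as n1 transforms of length n2, a twiddle
    multiply, then n2 transforms of length n1."""
    n = n1 * n2
    assert len(x) == n
    # Step 1: transform each strided subsequence x[j1], x[j1 + n1], ...
    rows = [dft(x[j1::n1]) for j1 in range(n1)]
    # Step 2: multiply by the twiddle factors exp(-2*pi*i * j1 * k2 / n).
    for j1 in range(n1):
        for k2 in range(n2):
            rows[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)
    # Step 3: transform down each column and scatter to X[n2 * k1 + k2].
    out = [0j] * n
    for k2 in range(n2):
        column = dft([rows[j1][k2] for j1 in range(n1)])
        for k1 in range(n1):
            out[n2 * k1 + k2] = column[k1]
    return out

# Check against the direct transform on a small example (12 = 3 * 4).
x = [complex(j % 5, -j) for j in range(12)]
error = max(abs(a - b) for a, b in zip(fft_via_2d(x, 3, 4), dft(x)))
print("max difference from direct DFT:", error)
```

In a real implementation each short transform fits in cache (or in on-chip memory), which is where the win over one giant 1D pass comes from; the strided access in step 1 is the "extra work" that usually gets handled with a blocked transpose.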
I'm wondering whether or not the DSP latency of these libraries is sufficient to use with real-time audio processing...if folks were to write RTAS/AU/VST plugins using the library, how they would compare to other hardware-assisted DSP solutions such as the PowerCore and Pro Tools farm cards. Then again, if you have to spend $500 on a card to get this goodness, it's hardly a bargain (albeit cheaper than the above products...)
I think you are confusing the precision of the input data with the precision of the power spectrum. An FFT does a scaled add of a large number of samples, so the precision in the output is dependent on the number of input samples.
For example, SETI@home uses 1 bit complex sampling. (Yes, the SETI@home ADCs are a pair of high speed comparators.) That doesn't limit the results to 1 bit. It does reduce the sensitivity by 1.5dB (which, unfortunately, isn't worth doubling our bandwidth or the number of tapes we have to buy). The maximum precision you can get out of that in a 128K FFT (in the low SNR limit) is about 18 bits/channel or 54dB of dynamic range in the power spectrum. That's more than enough for SETI@home where our thresholds are at about 14dB. Changing that to a true 16bit ADC would give you 34 bits or 102dB at the expense of creating 16X as much data.
Of course, depending upon the sampling rate and the duration of the signal you are looking for, you may need the 16 bits. If you are looking over 8 samples you only get 4 bits or 12dB for SETI@home, versus 20 bits or 60dB for a 16bit ADC.
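The bits-to-dB pairs quoted above all follow from one conversion: each bit of precision in the power spectrum is a factor of two in power, or roughly 3dB. A small sketch that takes the bit counts straight from the post and only computes the conversion (the function name is made up for illustration):

```python
import math

def power_db(bits):
    """Dynamic range in dB for a power ratio of 2**bits."""
    return 10.0 * math.log10(2.0 ** bits)

cases = [
    ("1-bit sampling, 128K FFT", 18),
    ("16-bit ADC, 128K FFT", 34),
    ("1-bit sampling, 8 samples", 4),
    ("16-bit ADC, 8 samples", 20),
]
for label, bits in cases:
    print(f"{label}: {bits} bits is about {power_db(bits):.0f} dB")
```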
Support SETI@home