High performance FFT on GPUs
A reader writes: "The UNC GAMMA group has recently released a high performance FFT library which can handle large 1-D FFTs. According to their webpage, the FFT library is able to achieve 4x higher computational performance on a $500 NVIDIA 7900 GPU than optimized Intel Math Kernel FFT routines running on high-end Intel and AMD CPUs costing $1500-$2000. The library is supported for both Linux and Windows platforms and is tested to work on many programmable GPUs. There is also a link to download the library freely for non-commercial use."
""The UNC GAMMA group has recently released a high performance FFT library which can handle large 1-D FFTs. According to their webpage, the FFT library is able to achieve 4x higher computational performance on a $500 NVIDIA 7900 GPU than optimized Intel Math Kernel FFT routines running on high-end Intel and AMD CPUs costing $1500-$2000. "
GPUs are nice, but there's the little matter of getting data and results on and off the chip.
While interesting, I need IEEE 64-bit double precision for my scientific applications. Are there any 64-bit floating-point GPUs out there?
FFT:
Some calculation which can be heavily optimized into simple but fast processing. Hence a [relatively] cheap part that does a few simple tasks very fast can outperform a more expensive part when performing that optimized calculation, even though the expensive part handles a vastly greater range of tasks and is more efficient across that general range.
By capitalizing on this incredibly basic rule of computer science (that an optimized simple thing going fast is faster than a more powerful general thing that is only being used for one of its many potentials), attention-grabbing headlines can be garnered.
I don't know if it's exactly newsworthy, but it's kind of cute (and interesting) that the amount of specialisation that's going on in graphics cards leads to situations where one can persuade the graphics card to do one's (not graphics-related) work faster than one would be using the general purpose CPU for the same task. It's more amusement than anything else (although for those who want to do the computation, it's also useful).
For every problem, there is at least one solution that is simple, neat, and wrong.
I sense a little bias here; the fastest Intel and AMD processors are actually $1,000.
Graphics Processing Units have always been better for FFTs and signal processing than general CPUs. I've read a journal article where machine vision was implemented on a GeForce 5200 at a 3x speedup over an AMD Athlon 3200+. The reason? This is what a GPU is made for; the small dedicated instruction set makes a GPU much more adept at signal processing than the 686's have ever been.
Karma: Good, or bust!
Awesome, this is really good news for audio people.
I want to see how I can take advantage of this... I hope the license isn't too restrictive.
It might be a good example of how to use the GPU for general purpose (vector-based) computation, something I've been wanting to explore.
Just curious, how does the use of the GPU for this kind of thing affect the graphics display?
Are you unable to draw on the screen while it's running, or something?
Crypto = Number theory = integer math.
No need for floating point.
People complain because they compare the power consumption to their old "home computer". Just look at the Apple II, the C64 or similar 8-bit computers, almost all of which had such low power demands that they could run without fans. Even the "IBM compatible" PCs up to and including 386s almost always had exactly one fan, the one in the power supply. My most recently purchased (second-hand) computer has 6 fans, and draws enough power to justify most of them.
You can probably make up your own flawed car analogy and compare top speed and fuel consumption of today's compact cars with the racing cars of 60 years ago.
Take a look at their benchmarks. The chart goes up to eight million elements. The accumulated rounding error in FFT outputs may be around n * log2(n) ULP, where n is the number of elements, and ULP (units in last place) is relative to the largest input element. (Caveats: That is the maximum; the distribution of the logs of the errors resembles a normal distribution. Input was numbers selected from a uniform distribution over [0, 1). The error varies slightly depending on whether you have fused multiply-add and other factors.)
So with eight million elements, the error may be 184 million ULP, or over 27 bits. With only 24 bits in your floating-point format, that is a problem. Whether you had 24-bit or 1-bit data to start with, it is essentially gone in some output elements. Most errors are less than the maximum, but it seems there is a lot of noise and not so much signal.
It may be that the most interesting output elements are the ones with the highest magnitude. (The FFT is used to find the dominant frequencies in a signal.) If so, those output elements may be large relative to the error, so there could be useful results. However, anybody using such a large FFT with single-precision floating-point should analyze the error in their application.
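The arithmetic in the estimate above can be reproduced with a few lines of Python. This is only a sketch of that back-of-the-envelope figure: the n * log2(n) ULP formula is taken from the comment as stated, not derived here, and the function names are made up for illustration.

```python
import math

def fft_error_bound_ulp(n):
    """Rough worst-case error in ULP, using the n * log2(n) figure
    quoted in the comment above (an assumption, not a proven bound)."""
    return n * math.log2(n)

def bits_lost(n):
    """Number of low-order bits swamped by an error of that size."""
    return math.log2(fft_error_bound_ulp(n))

SINGLE_PRECISION_BITS = 24   # IEEE 754 single: 24-bit significand
DOUBLE_PRECISION_BITS = 53   # IEEE 754 double: 53-bit significand

if __name__ == "__main__":
    for n in (1_000, 1_000_000, 8_000_000):
        lost = bits_lost(n)
        print(f"n = {n:>9,}: ~{fft_error_bound_ulp(n):.3g} ULP, "
              f"~{lost:.1f} bits lost; "
              f"single has {SINGLE_PRECISION_BITS - lost:.1f} bits left, "
              f"double has {DOUBLE_PRECISION_BITS - lost:.1f}")
```

For eight million elements the worst-case figure exceeds the 24 bits of a single-precision significand, while a double-precision significand would still have plenty of headroom, which is the point being made about 32-bit GPU arithmetic.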
For MP3, JPEG, and MPEG-4 alike, the transform used is not a Fourier transform (not even a DFT/FFT) but the DCT/IDCT ([inverse] discrete cosine transform); MP3 uses the modified variant, the MDCT. The reason for using the DCT instead of the FFT (which computes the discrete Fourier transform) is that the DCT is computationally cheaper (roughly half the work; at its core it is really a cut-down DFT/FFT), and it still provides enough information for the band-discarding approaches used in lossy data compression.
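To make the "DCT is really a trimmed DFT" remark concrete, here is a small sketch using only the Python standard library. It assumes the DCT-II variant (the one JPEG uses; the comment does not say which variant it means) and shows that an N-point DCT-II equals, up to a phase factor, the first N outputs of a 2N-point DFT of the mirrored input. The function names are illustrative, and the DFT is the slow O(n^2) definition rather than an FFT.

```python
import cmath
import math

def dct2_direct(x):
    """Unnormalized DCT-II straight from the definition:
    X[k] = sum_j x[j] * cos(pi/N * (j + 1/2) * k)."""
    n = len(x)
    return [sum(x[j] * math.cos(math.pi / n * (j + 0.5) * k) for j in range(n))
            for k in range(n)]

def dft(y):
    """Naive discrete Fourier transform, O(m^2)."""
    m = len(y)
    return [sum(y[j] * cmath.exp(-2j * cmath.pi * k * j / m) for j in range(m))
            for k in range(m)]

def dct2_via_dft(x):
    """Same DCT-II, obtained from a DFT of the even-symmetric extension.
    With y = x followed by reversed x, Y[k] = 2 * exp(i*pi*k/(2N)) * X[k]."""
    n = len(x)
    y = list(x) + list(reversed(x))
    spectrum = dft(y)
    return [0.5 * (cmath.exp(-1j * cmath.pi * k / (2 * n)) * spectrum[k]).real
            for k in range(n)]

if __name__ == "__main__":
    samples = [1.0, 2.0, 3.0, 4.0]
    for a, b in zip(dct2_direct(samples), dct2_via_dft(samples)):
        print(f"{a:+.6f}  {b:+.6f}")
```

Because the mirrored input is real and symmetric, half of the DFT's work is redundant, which is one way to read the "about one half" claim; real codecs use dedicated fast DCT algorithms rather than this construction.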