High performance FFT on GPUs
A reader writes: "The UNC GAMMA group has recently released a high performance FFT library which can handle large 1-D FFTs. According to their webpage, the FFT library is able to achieve 4x higher computational performance on a $500 NVIDIA 7900 GPU than optimized Intel Math Kernel FFT routines running on high-end Intel and AMD CPUs costing $1500-$2000. The library is supported for both Linux and Windows platforms and is tested to work on many programmable GPUs. There is also a link to download the library freely for non-commerical use."
Why use a GPU for Final Fantasy Tactics? Couldn't you just use the GBA?
Note to mods: I'm probably being sarcastic.
I'm too lazy to go and look for a definition to FFT, and the article (or webpage describing the software) was short on details.
Enlighten me, please!
Could someone translate the summary ?
if you're only considering 32-bit floating point numbers and don't need full IEEE-754 compliance.
""The UNC GAMMA group has recently released a high performance FFT library which can handle large 1-D FFTs. According to their webpage, the FFT library is able to achieve 4x higher computational performance on a $500 NVIDIA 7900 GPU than optimized Intel Math Kernel FFT routines running on high-end Intel and AMD CPUs costing $1500-$2000. "
GPUs are nice, but there's the little matter of getting data and results on and off the chip.
Well, seeing as how the V.P. is such a V.I.P., shouldn't we keep the P.C. on the Q.T.? 'Cause if it leaks to the V.C. he could end up M.I.A., and then we'd all be put out in K.P.
FFT stands for "Final Fantasy Tactics", a brilliant tactics/role playing game for the original Sony Playstation. You should really try it.
Stay away from the GBA sequel though, that one just sucks.
Alternatively it could be some weird mathy jibber-jabber about furrier Transformers or something.
USE HOT GRITS WITH STATUE OF NATALIE PORTMAN (NAKED AND PETRIFIED)
While interesting, I need IEEE 64 bit double precision for my scientific applications. Are there any 64-bit floating point GPU's out there?
Limitations raised by parent prevents us from doing serious scientific computing on GPUs...
Don't Panic.
Explanation here.
FFT; Fast Fourier Transform - a specific algorithm, but used to indicate any algorithm attempting to determine the power versus frequency graph for a signal Dag-nadit
I have mod points and I am not afraid to use them.
FFT:
Some calculation which can be heavily optimized to simple but fast processing. Hence a [relatively] cheap part that does a few simple tasks very fast can out perform a more expensive part that can do a vastly greater range of tasks with more efficiency across that general range but less in a specific area when performing that optimized calculation.
By capitalizing on this incredibly basic rule of computer science (the an optimized simple thing going fast is faster than a more powerful general thing that is only being used for one of its many potentials), attention grabbing headlines can be garnered.
From some of the other pages of the site:
"Our current code only handles 1D power-of-two single-precision FFTs"
Ah. Single precision. Useful, but not as interesting as double would be. Still impressive, though.
So correct me if I'm wrong but, it's just a math library for sale, right?
Don't get me wrong, new tools are cool, but can someone explain to me why this is newsworthy?
crazy dynamite monkey
Isn't that what SETI@home uses for the bulk of its signal analysis? Would be kind of neat to leverage the millions of idle GPU's out there.
If you don't know where you are going, you will wind up somewhere else.
I have an uncle who's a professor who's been using GPUs for scientific computation for years. Apparently he has systems with four GPUs running simulations.
I sense a little bias here; the fastest Intel and AMD processors are actually $1,000.
Thirty years later, a $500 GPU, weighing less than 1 pound, can produce 6 gigaflops. People complain about its power and cooling needs, but they are rather below 200 kW! We sometimes forget just how amazing the developments in computing have been over the last three decades.
AGP was not very useful for bidirectional data flow, but PCIe is. GPU's are pretty sophisticated these days, so they've got the logic to handle moving stuff in and out of it's memory and over the bus to the CPU and the rest of the system.
makes you wonder how much raw computing power is locked away in the various chipsets in your PC. A 500mhz GPU designed for floating point and complex mathematical instructions could come in especially handy for encoding videos rather than just being used to apply filters during playback. That said, if it's used in things like Seti@home I'd imagine the additional power use would have a notable effect on your electricity bill
Note how many comments are about explaining what FFT is as opposed to how many comments consist in asking what FFT means. Quite a fucked up demand/offer ratio.
You just got troll'd!
The limit is the floating-point precision of the GPU.
Most GPU can do max up to 32-bit floating point operations (depending on the brand and the model), where as most scientific applications use 64-bit and higher (the old FP unit could do 80bit FP math, SSE registers in recent processors can do 128-bits FP math).
So some user will be happy, like for sound processing (GPU have already been used to process reverberation to create realistic sound environnement - too lazy to do the search for the slashdot reference)
But other application (cryptography maybe) will probably need more FP precision.
Not to mention that most scientific applications run mostly under *nix like Linux or BSD, for which GFX driver support isn't always incredible, specially for recent models, (website mentions performance hit).
(And also remember that soon Vista will have an interface that'll completly clog the GPU and leave less free cycles to do general purpose calculation).
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Are the two linked or is the W at the end of the name just a mere coincidence?
You just got troll'd!
on this page here they almost compare to a program called libgpufft (which is an open source BSD version of the same library here ) I wonder how they do compared to the BSD licensed version---
The interesting question will be :
Is the 32-bit precision enough for SETI@Home application ?
Or does the project needs the higher precisions (64bit to 128bit) that can (for now) only be provided by the CPU ?
IMHO, maybe this could be useful. They're trying to find which chunk contains candidate data. If there's some fast low-precision algorithm that can quickly mark chunks as interesting / recheck with higher precision / un-interesting, it'll be helpful to quickly tell appart interseting chunk, even if data need to be post-processed later at higher precision.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
The Video RAM will determine the maximum array length that can be sorted on the GPU. A rough guideline for performing FFT on 32-bit floats is: Maximum array length in millions = Video RAM in MB / 32
Max array length equals video RAM in megabytes divided by 32... bits? Correct me if i'm dumb but shouldn't it rather be "Video RAM in MB / 4"?
You just got troll'd!
Graphics Processing Units have always been better for FFTs and signal processing than general CPUs. I've read a journal article where machine vision was implemented on a GeForce 5200 at a 3x speedup over an AMD Athlon 3200+. The reason? This is what a GPU is made for; the small dedicated instruction set makes a GPU much more adept at signal processing than the 686's have ever been.
Karma: Good, or bust!
Apparently nobody knows what an FFT is. Here's the best description I can give without descending into math too much.
The Fast Fourier Transform is an algorithm to turn a set of data (as amplitude vs. time) into a set of waves (as amplitude vs. frequency). Say that I have a recording of a piano playing an A at 440 Hz. If I plot the actual data that the sound card records, it'll come out something like this picture. There's a large fading-out, then the 440 Hz wave, then a couple of overtones at multiples of 440 Hz. The Fourier series will have a strong spike at 440 Hz, then smaller spikes at higher frequencies: something like this plot. (Of course, that's not at 440, but you get the idea.)
The reason we like Fourier transforms is that once you have that second plot, it's extremely easy to tell what the frequency of the wave is, for example - just look for the biggest spike. It's a much more efficient way to store musical data, and it allows for, e.g., pitch transformations (compute the FFT, add your pitch change to the result, and compute the inverse FFT which uses almost the same formula). It's good for data compression because it can tell us which frequencies are important and which are imperceptible - and it's much smaller to say "Play 440 Hz, plus half an 880 Hz, plus..." than to specify each height at each sampling interval.
The FFT is a very mathematics-heavy algorithm, which makes it well suited for a GPU (a math-oriented device, because it performs a lot of vector and floating-point calculations for graphics rendering) as opposed to a general-purpose CPU (which is more suited for data transfer and processing, memory access, logic structures, integer calculations, etc.) We're starting to see a lot of use of the GPU as the modern equivalent of the old math coprocessor.
If you're looking for more information, Wikipedia's FFT article is a good technical description of the algorithm itself. This article has some good diagrams and examples, but his explanation is a little non-traditional.
Limited to a single process - not multi-process or multi-thread safe.
Single-precision floating point is more than enough if the final product is an image with 8 bits/pixel.
Awesome, this is really good news for audio people.
I want to see how I can take advantage of this... I hope the license isn't too restrictive.
It might be a good example of how to use the GPU for general purpose (vector-based) computation, something I've been wanting to explore.
Just curious, how does the use of the GPU for this kind of thing affect the graphics display?
Are you unable to draw on the screen while it's running, or something?
...for a custom application, I'd rather look at FPGA.
Crypto = Number theory = integer math.
No need for floating point.
This is somewhat off topic, but keeps to the fact that GPUs
are having profound effects in the cinematography world.
Check out this paper detailing impresive gains using a GPU versus
Pixar's standard engine and render -
some gains are upwards of 20,000X more efficient
with almost identical visual results!
http://www.vidimce.org/publications/lpics/
Interesting? Cheers!
Finally I have a good excuse to give the IT department why I need to upgrade my video card. I need to do FFTs faster (it has nothing at all to do with Doom3).
Support Right To Repair Legislation.
Not to mention that their dollar figures are somewhat misleading, as they didn't include the cost of the host PC...and AMD Opeteron 280 processors don't cost ANYWHERE NEAR $2,000. They also didn't show us how cheaper processors do.
You can buy a Tyan quad-CPU motherboard for about $1k, and the dual-core version of the 280 for $1k tops. So...that's $5-5.5K for a box with EIGHT processors that will do 6 million complex values in almost half the time as the $500 video card sitting in a workstation that most likely costs well over $1,500. So it's not really quite as simple as "OMG 4X PERFORMANCE, ONE QUARTER THE COST!!!"
Also, did you know that you can buy a Sempron motherboard, 1GB of ram, and sempron processor for about $150 these days? For the same cost as that one GPU, you can buy 3 complete Sempron systems and have change left over for your networking gear.
Please help metamoderate.
The one thing that I haven't seen mentioned is that the benchmarks only show "compute timings" and not actual setup and retreval times. If the benchmarks showed the amount of time to get the data to the GPU and especially the time to get the result back to a place where a program could actually use it then it would be blown out of the water by the CPU. Future cards/drivers could speed up the process of retrieving the data, but for now there will always be lame benchmarks like this that are unfairly biased toward the GPU and only tell half the story. I mean what's the point of doing an FFT so quickly if it takes forever to actually be able to get to the data.
With respect to CPUs, I understand that Intel and AMD processors tend to smoke PowerPCs for integer computations, and it's the other way around for floating point calculations, particularly for things that make effective use of AltiVec.
So I'm wondering how this GPU-based (presumably intensively floating-point) approach compares to a PowerPC/AltiVec processor. Anyone got any numbers/analysis?
(This is more than just an academic proposition. Some have challenged a program's transition from PowerPC to Intel for precisely this reason, claiming that their PPC based algorthims wouldn't achieve necessary performance on Intel processors.)
dave
And those acronyms are "Joint Photographic Experts Group", "Motion Picture Experts Group", "Discrete Cosine Transform", and "Fast Fourier Transform", for those who do not know and are too busy/lazy to look it up.
Note that you can only use the 2xx Opterons in 1 or 2 CPU (2 or 4 core) motherboards. If you want to have 4 CPUs with 8 cores you need to use the 8 series Opterons. The Opteron 880 dual core currently starts at over USD 1,600.00 each, which makes your configuration start at about 7,500 just for the MB and CPUs. Then add the registered RAM, server case, big juicy power supply, drives, video, monitor, a UPS to protect all of that... It sounds sexy but realistically its going to be close to $10K if you build it yourself.
And as you tread the halls of sanity, You feel so glad to be, Unable to go beyond. I have a message, From another time..
There are plenty of high-performance co-processor boards you could use. Often, they don't help at all because getting the data in and out is slower than just doing the operations on the CPU. Furthermore, the $500 is in addition to the CPU you already have. Third, the effort you invest in putting your code on the coprocessor is likely going to be a short-term investment, as these boards (whether GPU or otherwise) rapidly change, and as few other people are going to have exactly the same setup as you.
Overall, a 4x speedup for $500 for FFTs, impressive as it may sound at first sight, is probably not enough to help this sort of thing succeed widely. But it may find some uses in a few special niches.
Right then. So how long before they just include some weak general-purpose instructions in the GPU, add SATA and ethernet to the cards, and call it a budget PC?
Kid-proof tablet..
And virtually all of this computation is single precision.
There are enough inaccuracies in transforming to the frequency domain due to the "windowing problem" (which says that the finite length of the input series results in smearing in the frequency domain) that double precision is silly except for very long series.
I'm interested in knowing where and why double precision FFT's are considered necessary.
For both MP3, JPEG, and MPEG4 the transformation used is not a Fourier transform (not even TDFT/FFT), but DCT/IDCT ([inverse]discrete cosine transform). The reason for using DCT instead of the FFT (equivalent to the time Discrete Fourier Transform) is because the DCT is computationally cheaper than the FFT (about one half, in the fundamentals is really a mutilated DFT/FFT), and it provides enough information for the band discarding approaches used in lossy data compression.
FFTs are also useful for squaring large numbers with a modulo. That's what http://www.mersenne.org/ uses them for.
Melissa
"Screw Sun, cross-platform will never work. Let's move on and steal the Java language." - Visual J++ Product Manager
The test is a 7900 GTX versus AMD & Intel systems both with Two dual-core CPUs; that's why the CPUs pricing is so high, as two Opteron or Xeon dual-core chips don't come cheap, although their pricing is out of date. I'm not clear on whether the test is comparing the FP units or the SSE units against the 7900! It's gotta be the SSE performance right? Someone was wondering why all this raw GPU power couldn't be used for video encoding etc; well ATI has released software that allows their recent GPUs to be used to accelerate video transcoding. About time there was a mainstream use for all this silicon.
The sizes of transforms they are using for comparison here are of lengths of the order of 1 million points. This is huge for an FFT, and truncation error will definitely come into play here using only 32-bit precision. It all depends on what you are doing whether this will be adequate or not. Also, it's not at all clear what they did on the other platforms. There are some tricks to doing very long sequences; essentially using a 2D transform to perform a long 1D transform. It's not trivial, and requires some extra work, but generally a lot more efficient than taking a 1D transform and shoving a 4 million element transform into it. The inner loops of a 1D transform will eventually trash the cache for such a large transform, so using a blocked 2D transform avoids this, with some overhead of course. It's hard to tell what they are doing from the performance curves, since they report seconds, and it needs to be scaled by n log n to really see what's going on. It's cool they tried this, though. I was looking at using a GPU to do FFTs and linear algebra kernels a couple of years ago, but decided not to go there as I didn't think it would pay off; mostly because of the 32-bit precision.
Someone in a parallel computing class I was taking did a parallel application using FFT to do processing. I wonder if they could incorporate this into MPI for parallel computations pretty easily?
In undeveloped countries, the consumer controls the market. In capitalist America, the market controls you.
the code doesnt compile on my system, ubuntu breezy amd64. anyone got a patch for it?
This is my sig.
I'm wondering whether or not the DSP latency of these libraries is sufficient to use with real-time audio processing...if folks were to write RTAS/AU/VST plugins using the library, how they would compare to other hardware-assisted DSP solutions such as the PowerCore and Pro Tools farm cards. Then again, if you have to spend $500 on a card to get this goodness, it's hardly a bargain (albeit cheaper than the above products...)
.. for the last 2 hours I was in class listening about FFT. I swear to god they are out to get me.
-- TRUST ME! I KNOW WHAT I'M DOING!
Ah, but the modern GPU doesn't have its own built in couch! How many people can sit on a modern GPU?
(Professor Bernstein, in an e-mail reply to a question, did mention he was going to update DJBFFT, but this does not appear to have happened. In all probability he lacks the time, but that doesn't solve the problem of needing high-power FFT software. FFTW, although slightly better on updates, isn't massively better. What is needed is for someone to start a project with the goal of writing code as fast as DJBFFT, as generic as FFTW, and as active as LinuxBIOS or the Linux Kernel in terms of getting new code out there.)
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Thats nice 'init, coz de FFT's hav 'ad an 'ell ov a time ininday?
Does that mean, if you didn't mind that the algorithm being more computationally expensive (say, if you were running it on your GPU, for example), that you could get even better compression than MPEG4 by using a FFT instead of a DCT?
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Sorry, the FFT of a time-domain signal does **NOT** indicate how the power (or energy) of the signal is distributed.
For the latter, you need a PSD (power spectral density) plot, which is obtained by finding the square of the magnitude of the freq-domain FFT (complex) outputs.
And the term "FFT" usually describes a specific class of algorithms that finds a Discrete Fourier Transform of a signal in much less than O(N^2) time, where N is the number of elements/samples considered.
However, the FFT is also useful to perform fast polynomial multiplication (and even fast multiplication of very very very long numbers). This application has nothing to do with power or frequencies in a signal.
So... if I can get this running on my Gentoo system, I can compile with my CPU and GPU using a sort of DistCC?
In floating-point arithmetic, the algorithm was proved in 1966 to have an upper bound for the error that grows only as O(log N), and the mean (rms) error grows only as O(log N). (See this page for more info.) (Errors in fixed-point arithmetic are worse, going as N.)
Even in single precision, the errors for their FFT sizes are probably quite reasonable, assuming they haven't done something silly like use an unstable trigonometric recurrence.
If a thing is not diminished by being shared, it is not rightly owned if it is only owned & not shared. S. Augustine
Who's got the GPGPU SW that does MP3 or VoIP codec compression on the GPU HW?
--
make install -not war
That should be O(sqrt(log N)) for the rms error and O(sqrt(N)) for fixed point. (My sqrt symbols didn't post properly somehow)
If a thing is not diminished by being shared, it is not rightly owned if it is only owned & not shared. S. Augustine
We're putting High Performance FFT's on all of our TPS reports now before they go out, Did you get that memo?
So if you could go ahead and do that for me from now on, that'd be great
Can't we all just get along
Accelerating Double Precision FEM Simulations with GPUs by Dominik Goddeke Proceedings of the 18th Symposium of Simulation Technique, Erlangen, Germany, September 2005:
"Is life so dear, or peace so sweet, as to be purchased at the price of chains and slavery?" - Patrick Henry
First, we should probably clarify something. The output of the FFT is the output of a DFT (Discrete Fourier Transform) for all intents and purposes. The FFT is a fast way of computing those values by using the fact that when you actually compute a DFT, you're recalculating a lot of the same stuff over and over due to the fact that the Fourier matrix is a circulant matrix. Now that that's out of the way, it should be said that the DCT is a specific case of the DFT when the signal is real and even. See my explanation for that here. No, you probably wouldn't have any advantage using a generic DFT instead of a DCT. First of all, you don't care about complex signals for audio and images and second of all, I forget my second point ;-)
"Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman
What is newsworthy is that this is a shameless attempt to secularize mathematics. It's right in the name -- Fast Fourier Transformation. That's idolatry. What can a man know about signals that God hasn't already made clear in the Word? Come to our website, and you can learn all about Intelligent Factoring, which is on much sounder mathematical grounds because it develops entirely from biblical principles.
I bet you'd see the same thing if there were an article about mergesort (mergesort for disk-drives and tape-drives is a "fascinating" subject), or about ... whatever it is that chemists study.
You are correct that FFTW has undergone some massive reworking and is currently at 3.1.1 (although Fedora Core 5 is only on 3.1). What would be wonderful would be for FFTW to add in hooks for PAPI (a powerful profilling library) to establish how fast the code actually is and where improvements could be made. It would both help the developers figure out what needs to be accelerated, and help convince those of us who have become skeptical of performance claims that there is something to cheer about.
(I hate being cynical; it's not in my nature. It is, however, the natural byproduct of seeing too many bogus performance claims and outright fraudulant benchmarks. I don't much care WHO produces genuinely useful data or WHO has the ultimate in FFT technology; what I do care about is having a way to know WHICH data and WHICH technology is top of the heap. Hell, if independent reviewers actually got paid, I'd add the damn hooks myself, run the tests and publish the results on an advert-free site.)
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
But I think that GFX under linux suck because :
- Companies usually release proprietary binary-only drivers.
- They only release drivers for a couple of architecture (mostly x86 and AMD64).
- some company don't put as much effort in their Linux prorietary drivers as in the Windows versions (ATI is most notable)
- they mostly don't colaborate with DRI.
- e.g.: ATI Radeon r300 driver (9500 to X850) had to be reverse engeneered.
- Intel i8xx and i9xx seem to be the only recent chipset with continued DRI support and that's a low-end embeded chipset (ATI is only available up to r200 / 9250 later, Matrox only up to G550, nVidia is proprietary-only, 3DFX company is dead)
For most GFX card company, supporting Linux is just ticking a box on a list and putting minimal effort.
So, what should you do if you want to use the Radeon 9600 on your PowerPC (mac) laptop ?
Or plug a GFX card on a Sparc based station ?
Sit down and cry ?
I have no intention to put money into pockets of companies that refuse to support opensource.
But I do run a (refurbished) Radeon9600XT using the open-source DRI reverse-engeneered driver at work.
At home I keep an old 3DFX. It's still enough for me and has opensource drivers.
Having to wait for months before getting a AMD64 version of fxgl is something I experienced personnally.
The local squad of tech gods use command line interface.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
where can i find the CVS for this project?
This is my sig.
Figures that it takes a graphics company to do some audio company's hardware job better. Just can't expect a dedicated and useful FFT chip on available APUs nowadays especially with the lack of real competition and creativity.
An interview with Jim Gray, in which he talks about this subject (he apparently helped out the North Carolina group with some of this work): "Deconstructing databases with Jim Gray".