Slashdot Mirror


High performance FFT on GPUs

A reader writes: "The UNC GAMMA group has recently released a high performance FFT library which can handle large 1-D FFTs. According to their webpage, the FFT library is able to achieve 4x higher computational performance on a $500 NVIDIA 7900 GPU than optimized Intel Math Kernel FFT routines running on high-end Intel and AMD CPUs costing $1500-$2000. The library is supported for both Linux and Windows platforms and is tested to work on many programmable GPUs. There is also a link to download the library freely for non-commerical use."

274 comments

  1. Final Fantasy Tactics? by Musteval · · Score: 4, Funny

    Why use a GPU for Final Fantasy Tactics? Couldn't you just use the GBA?

    --
    Note to mods: I'm probably being sarcastic.
    1. Re:Final Fantasy Tactics? by numbski · · Score: 4, Funny

      Seriously! I mean, with only 1-D, it's really going to suck. I know people have accused Final Fantasy of being too linear in the past, but this is getting a bit silly...now, we're stuck at a single point!

      --

      Karma: Chameleon (mostly due to the fact that you come and go).

    2. Re:Final Fantasy Tactics? by Anonymous Coward · · Score: 0

      Err, isn't 1D a line?

    3. Re:Final Fantasy Tactics? by john83 · · Score: 1
      Err, isn't 1D a line?
      Yes. The GP was confusing 0D (a scalar) with 1D (a vector). Also, since the multi-dimensional Fourier Transform kernal is seperable in each dimension, you can calculate higher dimensional transforms with the 1D version. Of course, the GP was kidding anyway, so I wouldn't get hung up on it. ;)
      --
      Strange women lying in ponds distributing swords is no basis for a system of government.
    4. Re:Final Fantasy Tactics? by colinrichardday · · Score: 0

      Yes.

    5. Re:Final Fantasy Tactics? by swelke · · Score: 0

      Just to completely ruin the joke: 1 dimension is a line; zero dimensions is a point.

      --
      Have you ever wondered How to Take Over
    6. Re:Final Fantasy Tactics? by fireman+sam · · Score: 2, Funny

      Ah, but great framerate.

      --
      it is only after a long journey that you know the strength of the horse.
    7. Re:Final Fantasy Tactics? by Ohreally_factor · · Score: 1

      So you rendered the joke pointless. How many dimensions is that, Mr. Smart Guy? =)

      Oh, and pun retrospectively intended.

      --
      It's not offtopic, dumbass. It's orthogonal.
    8. Re:Final Fantasy Tactics? by kalirion · · Score: 1

      Yes. The GP was confusing 0D (a scalar) with 1D (a vector). Also, since the multi-dimensional Fourier Transform kernal is seperable in each dimension, you can calculate higher dimensional transforms with the 1D version. Of course, the GP was kidding anyway, so I wouldn't get hung up on it. ;) I was with you until the second sentence.

  2. Uhh.. by PDXNerd · · Score: 0, Offtopic

    I'm too lazy to go and look for a definition to FFT, and the article (or webpage describing the software) was short on details.

    Enlighten me, please!

    1. Re:Uhh.. by Anonymous Coward · · Score: 5, Informative
    2. Re:Uhh.. by Anonymous Coward · · Score: 1, Funny

      No idea, but a high performance FFT has got to be better than the lower performance FFTs we've been using so far. I'm setting a goal of fast FFT compliance by the third week in June.

    3. Re:Uhh.. by SpinyNorman · · Score: 2, Informative

      There's a somewhat non-obvious mathematical result that any continuous periodic function can be decomposed as the sum of a series of sine functions of different frequencies. This series of sine waves is referred to as the fourier series of the function. The FFT (fast fourier transform) is an efficient numeric algorithm to derive the coefficients of the fourier series for any function.

      One useful way to think of the FFT is as transform of signal data from the time domain (raw samples) to the frequency domain (the constituent sine waves). This is useful for all sorts of purposes such as being the first step in speech recognition, the basis of JPEG/MPEG compression, ...

    4. Re:Uhh.. by Fordiman · · Score: 5, Informative

      an FFT is a transform that turns a signal (like an audio file) into its frequency components (like a spectrograph). It's used for MP3 compression, sound EQs, jpeg compression, mpeg4 compression, and a number of other things (I use FFTW for tuning my guitar).

      FFTW is the 'Fastest Fourier Transform in the West', a cute name for the work of a number of graduate students who use several techniques to turn the FFT from 'Numerical Recipes in C' into a freaking speed daemon.

      GPUFFTW is much the same thing, but ported to your video card's GPU - which is generally more optimized for doing the 'apply a floating point matrix to an array' thing - thus speedin the FFTW up even more while relieving the main processor from doing the work.

      If you don't have a high-powered video card, this means nothing for you. If you do, it means the above operations (compression, spectrum analysis, etc) can be done faster and without eating up processes.

      --
      110100 1101000 1101000 1100110 0 1101111 1101000 1100011 1
    5. Re:Uhh.. by kabz · · Score: 3, Interesting

      Or in the form of a concrete example ... The little spectrum analysers in iTunes are a good example of taking some time domain data, analysing it, and displaying the low through high frequencies.

      As an example of how far we've come, I implemented the Cooley-Tukey FFT in assembler on an Amiga, and it was just barely out of real-time. You had to capture some audio data, then wait while it was analysed. Nowadays, you can write the same thing in Objective-C on a G4, using the standard audio capture library, and have the FFT's computed between redraw events.

      --
      -- "It's not stalking if you're married!" My Wife.
    6. Re:Uhh.. by exp(pi*sqrt(163)) · · Score: 4, Funny

      FFT is a data compression and encryption standard used by a wide variety of extraterrestrial civilizations. Seti@home spends most of its time running FFT code to look for signals. If we managed to communicate with any of these aliens we could ask them what it stands for.

      --
      Doesn't it make you feel good to know that our freedoms are protected by politicans, lawyers and journalists.
    7. Re:Uhh.. by Anonymous Coward · · Score: 1, Interesting

      So you are too lazy to go look it up on something like wikipedia, which will take a few minutes, but not too lazy to check back here 100 times today for someone to make a post?

    8. Re:Uhh.. by JourneyExpertApe · · Score: 1, Redundant

      No, no, they're talking about playing Final Fantasy Tactics. That's why they need a GPU.

      --
      If you can read this sig, you're too close.
    9. Re:Uhh.. by DarthChris · · Score: 1

      I feel I should point out here that the basic concept of Fourier series has both sine and cosine in, but that it is possible to rewrite them as what are referred to as sine/cosine half-series.

      Fourier series, as a concept, is analogous to the polynomial-based power series of a function - both allow you to construct an infinite set of coefficients. Just as Taylor's theorem can be used to construct a power series (and determine the coefficients) for any function, the FFT can be used to obtain the Fourier coefficients.

      --
      Don't you just hate it when people reply to your signature?
    10. Re:Uhh.. by yoz · · Score: 1

      FFTW is the 'Fastest Fourier Transform in the West', a cute name for the work of a number of graduate students who use several techniques to turn the FFT from 'Numerical Recipes in C' into a freaking speed daemon. ... but not as fast as djbfft, right? (I'm asking rather than telling, as I have very little FFT experience and am intrigued to know)

    11. Re:Uhh.. by strstrep · · Score: 1

      There are some important difference between Fourier series and Fourier transforms. A Fourier series can only be used for a periodic signal (one that repeats itself over and over). A Fourier transform can be used on any finite-energy signal (if you integrate over the whole signal from -infinity to infinity in time, the value is not infinite). Also, a Fourier series only analyzes the frequencies that are integer multiples of a natural frequency. Fourier transforms analyze all frequencies.

    12. Re:Uhh.. by odourpreventer · · Score: 1

      For my Master Thesis, I did a radio wave analysis program. On a 2.8GHz hyperthreading Intel CPU, I could do six transformations on three 70KHz signals (yes, 18 transformations in total!), then convert them with matrix algebra and finally present the result in six separate dialog boxes through DirectX (yes, on a WinXP box with Visual C++, no less), all in real time with clock cycles to spare.

      It was sweet, I tell you.

    13. Re:Uhh.. by poopdeville · · Score: 1

      I prefer to think of both Taylor's (and Weierstrass's) and Fourier's theorem as special cases of the Stone-Weierstrass theorem. Easy, formal, unifying. http://en.wikipedia.org/wiki/Stone-Weierstrass_the orem

      --
      After all, I am strangely colored.
    14. Re:Uhh.. by Anonymous Coward · · Score: 0

      Wow, that does sound pretty sweet.

      Incidentally I took a look at the GPUFFTW code and most of the loops and stuff still look present, with some heavy multiplication handled by the gpu. It doesn't really look like it should be that much more efficient. Some hand-coded assembly with loop unrolling and tweaks to keep the FPU under full load might give it a run.

      Sad to say, I've never been able to get into Intel style assembly. (Z80, 8080, 8088, etc).

      What's the theoretical limit on what an AMD64 could do, compared to this GPU?

    15. Re:Uhh.. by bedessen · · Score: 1

      The music and video compression codecs (like mp3, Xvid, Divx, h.264, mpeg-4) use a specialization of the FFT, the discrete cosine transformation (DCT) because in these applications you don't care about the imaginary part of the input data points (i.e. you're dealing with all real numbers.) You may have seen settings in your video playback options for iDCT - the inverse DCT used by the decoder.

    16. Re:Uhh.. by pyite · · Score: 1

      A Fourier transform can be used on any finite-energy signal (if you integrate over the whole signal from -infinity to infinity in time, the value is not infinite).

      which basically means all real-life signals because you can make any real-life signal have compact support, I suppose.

      --

      "Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman

    17. Re:Uhh.. by LilGuy · · Score: 1

      Actually... I read the wikipedia article someone linked to previously and I'd have to say you explained it much more clearly in quite a few less words. Bravo!

      --

      You're nothing; like me.
    18. Re:Uhh.. by odourpreventer · · Score: 1

      I forgot to mention that the FFT routine I used was written in C. No assembly code.

    19. Re:Uhh.. by mikael · · Score: 1

      The FFT is used to analyze a continuous signal and break it up into an equivalent set of sine waves, each with its own amplitude and phase. The signal can be anything; undersea microphones, height of sea level, wind direction, or even just your favourite MP3 track.

      Once you have a set of frequencies and amplitudes, you can do all sorts of useful things like determine the most dominant frequency (highest amplitude), filter out pops and hisses (set selected amplitudes to zero), or just run a funky jumping bar animation when you play your favourite tune.

      --
      Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
    20. Re:Uhh.. by mikael · · Score: 1

      The FFT is used to analyze a continuous signal and break it up into an equivalent set of sine waves, each with its own amplitude and phase. The signal can be anything; audio, height of sea level, wind direction, or even just your favourite MP3 track.

      Once you have a set of frequencies and amplitudes, you can do all sorts of useful things like determine the most dominant frequency (highest amplitude), filter out pops and hisses (set selected amplitudes to zero), identification (find nearest match between two sets of data) or just run a funky jumping bar animation when you play your favourite tune.

      The FFT can be extended to 2D (and even 3D), then you can do things like image processing, where problems like scratches, speckles and banding can be fixed.

      The only problem is that doing all of this requires a lot of floating-point performance coupled with high bandwidth. In the past, this used to be done with Cray supercomputers (vector processors), then dedicated accelerator boards (TMS320's) before being done by desktop CPU's, and now GPU's.

      --
      Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
    21. Re:Uhh.. by Fordiman · · Score: 1

      1) You're confusing the rubes. The generic was applicable for this explanation.
      2) Does not the port of FFT to a GPU mean that it's possible to do a similar thing to the DCT, and thus obtain the gains I mentioned?

      --
      110100 1101000 1101000 1100110 0 1101111 1101000 1100011 1
  3. Please by Anonymous Coward · · Score: 0

    Could someone translate the summary ?

    1. Re:Please by __aaclcg7560 · · Score: 2, Funny

      One more reason to buy an overpriced video card instead of an overpriced CPU?

    2. Re:Please by kimvette · · Score: 1

      (based on the sig)

      But, can an Nvidia run FFT calculations faster than Chuck Norris? I doubt it!

      --
      The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50
    3. Re:Please by @madeus · · Score: 1

      One more reason to buy an overpriced video card instead of an overpriced CPU?

      What do you mean an overpriced video card, and why would you want one instead of an overpriced CPU?!

    4. Re:Please by __aaclcg7560 · · Score: 1

      I wasn't aware that the FFT calculations could be performed across two GPUs. Of course, I still haven't upgraded my old 32-bit CPU to a new dual core 64-bit CPU yet. I'm still paying off the overpriced budget video card that I got recently for $50. :P

    5. Re:Please by LiquidCoooled · · Score: 1

      Someone did once claim to have made an FFT formula which operated faster than chuck norris doing trigonometry, but that guy hasn't been seen since.

      --
      liqbase :: faster than paper
  4. It's nice... by non0score · · Score: 4, Informative

    if you're only considering 32-bit floating point numbers and don't need full IEEE-754 compliance.

    1. Re:It's nice... by john.r.strohm · · Score: 4, Interesting

      Depending on what you're doing, for an FFT, you probably don't need 64-bit floating point, and you DON'T need full IEEE-754 compliance.

      If you are taking data off of some kind of sensor, there are damned few sensors with 24 good bits of data out of the noise floor. Radars work just fine with 16-bit A/D converters.

      IEEE-754 compliance helps you in the ill-defined corners of the number space. FFTs inherently work in the well-behaved arena of simple trig functions and three-function (add/subtract/multiply) math.

      I'm currently doing FFTs with 16-bit fractional arithmetic in Blackfin DSP. For what I'm doing with the results, it is good enough.

      Not to mention you could use a "GPU farm" to do a fast search, and take any "interesting" data regions and feed those to a 64-bit, fully-IEEE-754 compliant, slow-as-molasses-in-January x86 FFT.

      Eventually, with some more articles like this one and yesterday's Cell piece, people will start to figure out that the x86 architecture is brain-dead and needs to be put out of its misery.

    2. Re:It's nice... by Andrzej+Sawicki · · Score: 0

      Wow, a high-end GPU is good at parallel computing. No kidding!
      [/sarcasm]

    3. Re:It's nice... by stephentyrone · · Score: 4, Informative
      FFTs inherently work in the well-behaved arena of simple trig functions and three-function (add/subtract/multiply) math.
      add/subtract/multiply math is the area that 754 has had the biggest effect on - in fact, the spec has very little to say about transcendental functions, but is almost entirely concerned with the basic arithmetic ops. prior to 754, floating point was, in general, not algebraicly closed under +-*/, nor were the results correctly rounded.

      most highly parallel GPU-type chips lack support for gradual underflow, for example, one of those "ill-defined corners of the number space" where 754 has been a tremendous boon. flush-to-zero is fine if you're decoding MP3s or unpacking texture maps, but it causes a lot of problems when you start trying to do more general scientific computations. sometimes those low order bits matter a whole lot; sometimes they're the difference between getting an answer accurate to 4 digits and an answer with *no* correct digits.

      "simple trig functions" have their own problems on these architectures; try writing an acceptable range-reduction algorithm for sin or cos without having correctly rounded arithmetic ops. sin and cos are, in fact, two the hardest operations in the math lib on which to get acceptable accuracy.

      admittedly, none of these objections are an issue with FFTs. but the reason that FFTs will perform acceptably on such an architecture is that the operations are (usually) constrained to the domain in which you don't encounter the problems i mention, not because the operations themselves are inherently safe. the lack of support for gradual underflow will cause you to lose many, many bits in frequency components that have nearly zero magnitude, but you usually don't care about those components when you're doing FFTs, anyway.

    4. Re:It's nice... by SuperBanana · · Score: 0
      if you're only considering 32-bit floating point numbers and don't need full IEEE-754 compliance.

      Nevermind that you need a decent host to feed the card data, twice the ram compared to what is on the card, and you can't put more than one card in a PC. So one "system" costs a lot more than just the $500 for the card.

      I'm not aware of many rackmount servers that have AGP or PCI-E slots except the Apple XServe, so you're possibly looking at using minitower type machines or building rackmount boxes yourself, and we all know how well both of those work for clusters.

      Most video cards also use a considerable amount of power...some several times what the various low-power processors use. There aren't a whole lot of SMP motherboards (especially for power-efficient processors), but is this really the most efficient (cost, power, space) solution? Especially considering that for every video card, you'll have a processor anyway?

    5. Re:It's nice... by edp · · Score: 3, Insightful
      "If you are taking data off of some kind of sensor, there are damned few sensors with 24 good bits of data out of the noise floor. Radars work just fine with 16-bit A/D converters."

      Take a look at their benchmarks. The chart goes up to eight million elements. The accumulated rounding error in FFT outputs may be around n * log2(n) ULP, where n is the number of elements, and ULP (units in last place) is relative to the largest input element. (Caveats: That is the maximum; the distribution of the logs of the errors resembles a normal distribution. Input was numbers selected from a uniform distribution over [0, 1). The error varies slightly depending on whether you have fused multiply-add and other factors.)

      So with eight million elements, the error may be 184 million ULP, or over 27 bits. With only 24 bits in your floating-point format, that is a problem. Whether you had 24-bit or 1-bit data to start with, it is essentially gone in some output elements. Most errors are less than the maximum, but it seems there is a lot of noise and not so much signal.

      It may be that the most interesting output elements are the ones with the highest magnitude. (The FFT is used to find the dominant frequencies in a signal.) If so, those output elements may be large relative to the error, so there could be useful results. However, anybody using such a large FFT with single-precision floating-point should analyze the error in their application.

    6. Re:It's nice... by Waffle+Iron · · Score: 5, Informative
      Eventually, with some more articles like this one and yesterday's Cell piece, people will start to figure out that the x86 architecture is brain-dead and needs to be put out of its misery.

      Why? Because the x86 isn't a DSP?

      The x86 is a general-purpose CPU. It isn't brain dead; historically it's almost always been at least half as fast as the latest expensive processor fad du jour, and sometimes it has actually been the fastest available general purpose processor. As these fads have come and gone, the x86 has quietly kept improving by incorporating many of their best ideas.

      The cell processor is basically a POWER processor core packaged with a few DSPs tacked onto the die. That sounds like a kludge to me, but if it turns out to be a success, there's nothing stopping people from tacking DSPs onto an x86 die.

      All a DSP is good at is fast number crunching. It usually has little in the way of an MMU, along with a memory architecture tuned mainly for vector-like operations, branch prediction tuned only for matrix math, etc. DSPs would make a bad choice for running general purpose programs, especially with cache and branch issues becoming the dominant performance bottleneck in recent times. DSPs would a horrible choice for running an OS with any kind of security enforcement. Using a GPU as a poor-man's DSP is interesting, but it suffers even more from these same limitations. If DSPs really offered a better solution for general-purpose problems, they would have replaced other CPU architectures decades ago.

    7. Re:It's nice... by Anonymous Coward · · Score: 0

      Commander Data, your presence is urgently requested on the bridge.

    8. Re:It's nice... by chriso11 · · Score: 1

      You hinted at this, but didn't explicitly state: as you increase the number of samples, you lower the SNR of the output, due to increasing numbers of rounding errors in the calculation. So, while you may be ok with 16bit fracints for a small sample set, as you increase the number of samples, you will eventually find that your noise floor obscures your results.

      --
      No, I don't trust in god. He'll have to pay up front, like everybody else.
    9. Re:It's nice... by fatphil · · Score: 1

      Statistically errors are a random walk. Distance=sqrt(#steps).
      So you have 13.5 bits of error on average rather than 27.

      I've just worked out that I do 1 billion 8192-limb double-precision complex FFTs per day!

      --
      Also FatPhil on SoylentNews, id 863
    10. Re:It's nice... by SETIGuy · · Score: 2, Interesting
      If you are taking data off of some kind of sensor, there are damned few sensors with 24 good bits of data out of the noise floor. Radars work just fine with 16-bit A/D converters.

      I think you are confusing the precision of the input data with the precision of the power spectrum. An FFTs do a scaled add of a large number of samples, so the precision in the output is dependent on the number of input samples.

      For example SETI@home uses 1 bit complex sampling. (Yes, the SETI@home ADCs are a pair of high speed comparators.) That doesn't limit the results to 1 bit. It does reduce the sensitivity by 1.5dB (which, unfortunately isn't worth doubling our bandwidth or the number of tape we have to buy). The maximum precision you can get out of that in a 128K FFT (in the low SNR limit) is about 18 bits/channel or 54dB of dynamic range in the power spectrum. That's more than enough for SETI@home where our thresholds are at about 14dB. Changing that to a true 16bit ADC would give you 34 bits or 102dB at the expense of creating 16X as much data.

      Of course depending upon the sampling rate and the duration of the signal you are looking for you may need the 16 bits. If you are looking over 8 samples you only get 4bits or 12dB for SETI@home, 20bits or 60dB for a 16bit ADC.

    11. Re:It's nice... by Anonymous Coward · · Score: 0

      All of the HP/Compaq DL380 line since the G3 have had PCI-E slots.

      Just an FYI.

    12. Re:It's nice... by paeanblack · · Score: 1

      Ok, stupid question time:

      What's the point of sending FFT loads to the GPU in the first place?

      This strikes me as the latest incarnation of co-processor fetish that has reared it's ugly head every few years for the past three decades. Except for device I/O, all the offloaded capabilites get folded back into the CPU for the next iteration. Crap like this has always been more trouble than it's worth.

      Is there a good reason the GPU should be handling major FFT work?

    13. Re:It's nice... by Henry+V+.009 · · Score: 2, Funny

      Yeah! If you offload your FFTs to your GPU, the system resources on your monster work box are now freed enough up to let you play Quake while you're supposed to be calculat...oh, damn.

    14. Re:It's nice... by Anonymous Coward · · Score: 0

      Cell is just a in-order dual issue PPE core (like PPC 601) with 8 integrated SPUs. With ~250 million transistor budget, one could build an ~8 core K7s (with 128KB L1 cache and 64KB L2 cache per core).

    15. Re:It's nice... by jmv · · Score: 1

      32-bit floating could be a valid issue (if you need more accuracy), but I don't see how IEEE-754 comes into that considering that even on current CPUs, it's not possible to have two FFTs (or even the same FFT compiled with different options) give exactly the same result. In any case, you've got rounding that's decided based on what goes in registers vs. memory. So, as long as the precision in the GPU is about the same as the 32-bit IEEE spec, I don't think anyone can really see the difference, especially considering that an FFT cannot "create" NaNs/Infs if there weren't any in the input.

    16. Re:It's nice... by Anonymous Coward · · Score: 0

      Quake ?

      Man, you're so 20th centurish ....

    17. Re:It's nice... by zootm · · Score: 1

      Eventually, with some more articles like this one and yesterday's Cell piece, people will start to figure out that the x86 architecture is brain-dead and needs to be put out of its misery.

      We know it's braindead. This has been known for years. But it's not technical features that drive x86, it's ubiquity. The sheer fact that so many system use x86, and so many demand compatibility, means that enough development has been put into x86 to bring it to its extreme limits.

      The people who care about processor architectures already know the problems with x86. The people who control the bulk of demand for processors will probably never understand. A smooth transition is required, if it's ever to be replaced with something better, rather than a simple rise in technical prowess.

    18. Re:It's nice... by Anonymous Coward · · Score: 0
      try writing an acceptable range-reduction algorithm for sin or cos

      #define MAX_ERR 1e-9
      #define m2PI 6.28318530717958
      //...is it java?...c?...whatever...tough.stuff...//
      double sin(double theta) {
        double x,y,n,err;
        x=reduceTheta(theta);
        err=0;
        y=x;
        n=1;
        while(err>MAX_ERR) {
            n+=2;
            y+=(err*=(x/n));
        }
        return(y);
      }
      double cos(double x) {
        double x,y,n,err;
        x=reduceTheta(theta);
        err=0;
        y=1;
        n=0;
        while(err>MAX_ERR) {
            n+=2;
            y+=(err*=(x/n));
        }
        return(y);
      }
      double reduceTheta(double x) {
        double tmp=x;
        while((tmp>m2PI)||(tmp<-m2PI))
            tmp/=m2PI;//I assume that *this* is what ye're b*tchin'bout
        return(tmp);
      }
    19. Re:It's nice... by stephentyrone · · Score: 1

      you're right that that's the step i'm conceptually thinking about, but if you expect that that algorithm will compute any semblance of sin or cos, i'm afraid you're rather wrong - it returns 1 for any input. even if it weren't for that bug, it wouldn't give you a numerically good sin or cos. the error in the taylor series polynomials is concentrated at the ends of the interval - you can use a much lower dimensional polynomial if you choose one that approximates with roughly equal error throughout the interval. secondly, your range reduction step is totally useless - first, it computes the result of dividing by 2pi repeatedly, not of subtracting off some multiples of 2pi. second, roundoff error is going to obliterate any accuracy you may have once had as soon as x gets "big". check out k_rem_pio2.c in fdlibm for a good free (beer & speech) implementation of range reduction for these functions, and see how hard it really is. this is not, by any stretch of the imagination, easy coding.

  5. Rush hour math. by Anonymous Coward · · Score: 3, Insightful

    ""The UNC GAMMA group has recently released a high performance FFT library which can handle large 1-D FFTs. According to their webpage, the FFT library is able to achieve 4x higher computational performance on a $500 NVIDIA 7900 GPU than optimized Intel Math Kernel FFT routines running on high-end Intel and AMD CPUs costing $1500-$2000. "

    GPUs are nice, but there's the little matter of getting data and results on and off the chip.

    1. Re:Rush hour math. by Anonymous Coward · · Score: 0

      you're trolling right ?

      i don't know the exact details of this, but you render the data to "a texture" on the gpu ram and you request it back (think pixel/vertex shaders) to "cpu ram"

    2. Re:Rush hour math. by Anonymous Coward · · Score: 0

      Right, and the question is how long does that take compared with how long it takes for the CPU to operate on data that's in RAM. Chances are, wherever the data is coming from, it's going to end up getting loaded into RAM first. From there you can either go straight to the CPU via the FSB or you can take a round trip over the PCI-X bus to use the GPU. The cost of the round trip to the GPU is better than it was with AGP, but it's still not free.

    3. Re:Rush hour math. by corvair2k1 · · Score: 4, Informative

      Typically, when doing these measurements, the GAMMA group counts the upload/download time as part of the computation time. So, the 4x-5x speedup you're seeing is end to end, with results starting and ending in main memory.

    4. Re:Rush hour math. by pep11 · · Score: 1
      GPUs are nice, but there's the little matter of getting data and results on and off the chip

      This is not a drawback when you have a lot of things to compute on a limited set of data.
      As a General Purpose GPU programmer all I see is that I won't have to program GPU FFT myself nor send the data to the CPU then back again to the GPU to do further computation if I have a FFT to perform. Each time a powerfull algorithm like this is vectorised and ported to GPU,these chips becomes more general purpose and less they are like a 3D accelerator.
  6. FFTs in GPUs, eh? by Anonymous Coward · · Score: 3, Funny

    Well, seeing as how the V.P. is such a V.I.P., shouldn't we keep the P.C. on the Q.T.? 'Cause if it leaks to the V.C. he could end up M.I.A., and then we'd all be put out in K.P.

    1. Re:FFTs in GPUs, eh? by Anonymous Coward · · Score: 0

      ...leaving us all S.O.L.

    2. Re:FFTs in GPUs, eh? by Anonymous Coward · · Score: 0

      F.Y.I. thats B.S. you S.O.B.

    3. Re:FFTs in GPUs, eh? by Anonymous Coward · · Score: 0

      I fucking hate Robin Williams.

    4. Re:FFTs in GPUs, eh? by mgblst · · Score: 1

      Hey, just be assured if you don't know the acronyms, you will probably not be interested in the story. You don't have to click on every story, show some constraint.

    5. Re:FFTs in GPUs, eh? by Anonymous Coward · · Score: 0

      Taste me, fist master

  7. As you should very well know... by Jesus_666 · · Score: 1, Redundant

    FFT stands for "Final Fantasy Tactics", a brilliant tactics/role playing game for the original Sony Playstation. You should really try it.
    Stay away from the GBA sequel though, that one just sucks.


    Alternatively it could be some weird mathy jibber-jabber about furrier Transformers or something.

    --
    USE HOT GRITS WITH STATUE OF NATALIE PORTMAN (NAKED AND PETRIFIED)
    1. Re:As you should very well know... by chriso11 · · Score: 1

      I'm afraid we have to revoke your Geek membership... knowing how to hook up a composite video input for a TV isn't quite enough.

      --
      No, I don't trust in god. He'll have to pay up front, like everybody else.
    2. Re:As you should very well know... by Jesus_666 · · Score: 3, Funny

      Knowing how to hook up a composite video input is irrelevant. The civilised world uses SCART. You can't screw up hooking up SCART.


      And just you wait, France will develop a European alternative to that Fourier nonsense as well!

      --
      USE HOT GRITS WITH STATUE OF NATALIE PORTMAN (NAKED AND PETRIFIED)
    3. Re:As you should very well know... by jawtheshark · · Score: 3, Funny
      Just in case some people don't get the following statement:

      And just you wait, France will develop a European alternative to that Fourier nonsense as well!

      Read ... And notice the country....

      Mod the guy up.. He might be a troll, but he is funny :-D

      --
      Ahhh...the great dumpster continuum. Many a free computer will be found there. -- sowth (748135)
    4. Re:As you should very well know... by Anonymous Coward · · Score: 0

      Furrier transformers? You mean these guys? Or this MUSH?

      Eww. I have plumbed the depths of the Internet.

    5. Re:As you should very well know... by Jesus_666 · · Score: 1

      He might be a troll, but he is funny :-D

      That was the idea. ;)

      --
      USE HOT GRITS WITH STATUE OF NATALIE PORTMAN (NAKED AND PETRIFIED)
    6. Re:As you should very well know... by Ohreally_factor · · Score: 1

      I think not knowing the difference between composite video and component ought to make one hand in ones card.

      --
      It's not offtopic, dumbass. It's orthogonal.
    7. Re:As you should very well know... by Geoffreyerffoeg · · Score: 1

      And just you wait, France will develop a European alternative to that Fourier nonsense as well!

      You do know that it's pronounced fooriyay, not foorier, right? :-)

  8. Any 64 bit GPU's? by ufnoise · · Score: 2, Insightful

    While interesting, I need IEEE 64 bit double precision for my scientific applications. Are there any 64-bit floating point GPU's out there?

    1. Re:Any 64 bit GPU's? by TubeSteak · · Score: 1
      Well, if their library gives you a 4x increase for 32 bit stuff, would that mean a 2x increase for 64 bit math?

      /no really, would it?

      --
      [Fuck Beta]
      o0t!
    2. Re:Any 64 bit GPU's? by Anonymous Coward · · Score: 0
    3. Re:Any 64 bit GPU's? by Surt · · Score: 3, Interesting

      Not yet. But in the next or second generation out your wish will be fulfilled (more and more game developers are pushing for 64 bit color accuracy, which will necessitate a transition to fully 64bit GPUs in the not distant future).

      --
      "Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
    4. Re:Any 64 bit GPU's? by Fordiman · · Score: 2, Informative

      No.

      Implementing 'big numbers', or numbers larger than the proccessor's spec, is actually quite computationally heavy when compared to the operations you're replacing. As such, a 4x increase in the speed of computation can translate to a (to pull a number from my arse) 0.25x loss of performance when dealing with larger floats.

      However...

      With CPU/GPU cooperation, the floating gap can be handled by using the CPU to generate a lookup table of high-precision trig as, say, a texture, and treating the numbers as mere pointers to an array. Addition is a relatively light bignum math, and with the FFT, you could implement the addition and lookup math quite speedily on the GPU side.

      Of course, from reading in, I'm pretty sure that's what's in-process for higher precision.

      --
      110100 1101000 1101000 1100110 0 1101111 1101000 1100011 1
    5. Re:Any 64 bit GPU's? by TheRaven64 · · Score: 4, Interesting
      more and more game developers are pushing for 64 bit color accuracy, which will necessitate a transition to fully 64bit GPUs in the not distant future

      Current generation GPUs handle 64bit and 128bit colours already. A 64-bit colour value is just four channels of 16-bit floats (halfs in Cg parlance). A 128-bit colour value is a vector of four 32-bit colour values.

      If game developers wanted 256-bit colour, then GPUs would need to support 64-bit floating point arithmetic. This is unlikely to happen, however, since 64-bit colour (which is really 48-bit colour with a 16-bit alpha channel) gives more colours than the human eye can distinguish. In fact, even with 64- or 128-bit colour for the intermediate results, current cards only have a 10-bit DAC for converting the colour value to an analogue quantity that can be displayed on an analogue screen.

      It is worth noting that Pixar's RenderMan software doesn't use more than 128-bit colour, and films like Toy Story were rendered using 64-bit mode.

      --
      I am TheRaven on Soylent News
    6. Re:Any 64 bit GPU's? by Surt · · Score: 3, Informative

      It's at the pixel shader level that you run into low color rendition on current GPUs, and also where the people doing math on GPU are doing their work. That's where the move to 64 bit will likely happen soon, and will conveniently help the math people as a side effect.

      --
      "Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
    7. Re:Any 64 bit GPU's? by jthill · · Score: 2, Informative

      Yes. Some guys at LBNL took good look. /. had the story yesterday. When they were trying, they repeatedly toasted Cray's best. With a "naive" FFT implementation -- not half trying -- they got 80%.

      --
      As always, all IMO. Insert "I think" everywhere grammatically possible.
    8. Re:Any 64 bit GPU's? by stephentyrone · · Score: 1

      Addition is relatively light bignum math if the precision you're extending from is correctly rounded. GPU floating point usually isn't. Even if it is, you're usually talking about increasing your op count by a factor of 5 or more, which kind of blows the "4x performace speedup" out of consideration.

    9. Re:Any 64 bit GPU's? by Fordiman · · Score: 1

      oops. I was thinking my times from the bigint lib.

      Though, the floating point trig table can be converted to fractional and - oh, wait.. fractional is pretty process heavy too.

      Ok, GP. The answer is quite a definitive 'No'.

      --
      110100 1101000 1101000 1100110 0 1101111 1101000 1100011 1
    10. Re:Any 64 bit GPU's? by UncleFluffy · · Score: 2, Insightful

      While interesting, I need IEEE 64 bit double precision for my scientific applications.

      Depends on what you need 64 bit for - is it for the precision (i.e. mantissa size) or the range (i.e. exponent size)?

      If you can live with a double-precision mantissa but a single-precision exponent, it's possible to get that using single-precision building blocks with less than a 2x slowdown. Sorry, don't have the references to hand right now, but a dig around on Citeseer/Google should get you there.

      --

      What would Lemmy do?

    11. Re:Any 64 bit GPU's? by ufnoise · · Score: 1

      It's the precision. I am solving a set of fully coupled partial differential equations. With only single precision, the matrices being solved may not be accurate enough and it may be impossible to get a solution.

    12. Re:Any 64 bit GPU's? by Anonymous Coward · · Score: 0

      64-bit mode for Toy Story? All renderers use in film work use 32-bit floats per channel, so in your terms "128-bit color". (In those communities no one uses the comined total of the 4 channels as the name. When they say 32-bit color they mean 32-bits per channel.)

      Also, to the poster of the lpics paper from Pixar, that's not used for the final renders you seen the theatre, just for preview renders. As to how much the same it looks, some artist say good enough for previews through 50 or 75% of the shot. Some say it's good for very little.

      GPUs are used in film rendering, just not for the final picture making part. They're used as math co-processors for some operations. See Nvidias Gelato renderer for more. While few are using that renderer, it's similar to what others are doing.

    13. Re:Any 64 bit GPU's? by renoX · · Score: 1

      No, there's no 64-bit GPU and as games don't really need them I doubt that there will be any for a long time..
      But in some case depending of your application the GPU could still be useful, see http://www.gpgpu.org/cgi-bin/blosxom.cgi/2005/08/2 2 ,unfortunatly the linked article is not available anynmore but it used some iterative algorithm to gain a 2 times speedup with a GPU at double precision.

    14. Re:Any 64 bit GPU's? by UncleFluffy · · Score: 1
      --

      What would Lemmy do?

    15. Re:Any 64 bit GPU's? by Anonymous Coward · · Score: 0

      What are you on about? There is absolutely no need or push for 64bit floating point in the pixel shader at all. Single precision floats have been used for film rendering for ages and will be good enough for real time graphics for a very long time. People are certainly using 64bit *colour* which refers to 16-bit *per channel* floating point but that has been available for a long time.

      Aliasing and precision issues in graphics are common, but they are very rarely related to precision if you're using 32bit floating point numbers.

      GPUs will certainly gain the ability for double precision arithmetic sometime in the future, but as long as the main reason customers buy GPUs is to produce graphics, double precision will be an afterthought since it isn't needed for that. If the GPGPU community becomes more important, this could change.

    16. Re:Any 64 bit GPU's? by Anonymous Coward · · Score: 0

      This is unlikely to happen, however, since 64-bit colour (which is really 48-bit colour with a 16-bit alpha channel) gives more colours than the human eye can distinguish.

      This is missing the point. 64-bit colour is important for accurate blending, not so much for display colorspace. A good example might be blending a hundred low-opacity frames together to make a glow: with low-accuracy blending, you'll end up with lots of color banding.

  9. MOD PARENT UP!!! by Anonymous Coward · · Score: 0

    Limitations raised by parent prevents us from doing serious scientific computing on GPUs...

  10. Wikipedia is your friend. by SynapseLapse · · Score: 1, Redundant

    Don't Panic.

    Explanation here.

    1. Re:Wikipedia is your friend. by he-sk · · Score: 0

      Redundant?

      The witty quote by itself warrants a +1, Funny!

      --
      Free Manning, jail Obama.
  11. FFT; by MrShaggy · · Score: 2, Informative

    FFT; Fast Fourier Transform - a specific algorithm, but used to indicate any algorithm attempting to determine the power versus frequency graph for a signal Dag-nadit

    --
    I have mod points and I am not afraid to use them.
    1. Re:FFT; by Anonymous Coward · · Score: 0

      Yeah, like we didn't already know that :)

    2. Re:FFT; by fatphil · · Score: 1

      FFT a recursive implementation of the DFT.
      Used for performing convolutions of two length n signals in Theta(n log(n) loglog(n)) time.

      Bignum multiplication is convolution with carry propagation - a lot of numerical distributed computing tasks (such as GIMPS) are based around fast large FFTs.

      FatPhil

      --
      Also FatPhil on SoylentNews, id 863
  12. You don't need a translation... by nick_davison · · Score: 2, Insightful

    FFT:

    Some calculation which can be heavily optimized to simple but fast processing. Hence a [relatively] cheap part that does a few simple tasks very fast can out perform a more expensive part that can do a vastly greater range of tasks with more efficiency across that general range but less in a specific area when performing that optimized calculation.

    By capitalizing on this incredibly basic rule of computer science (the an optimized simple thing going fast is faster than a more powerful general thing that is only being used for one of its many potentials), attention grabbing headlines can be garnered.

  13. One word: Precision? by Anonymous Coward · · Score: 0

    From some of the other pages of the site:

    "Our current code only handles 1D power-of-two single-precision FFTs"

    Ah. Single precision. Useful, but not as interesting as double would be. Still impressive, though.

  14. Math library for sale? by decipher_saint · · Score: 1

    So correct me if I'm wrong but, it's just a math library for sale, right?

    Don't get me wrong, new tools are cool, but can someone explain to me why this is newsworthy?

    --
    crazy dynamite monkey
    1. Re:Math library for sale? by Improv · · Score: 2, Insightful

      I don't know if it's exactly newsworthy, but it's kind of cute (and interesting) that the amount of specialisation that's going on in graphics cards leads to situations where one can persuade the graphics card to do one's (not graphics-related) work faster than one would be using the general purpose CPU for the same task. It's more amusement than anything else (although for those who want to do the computation, it's also useful).

      --
      For every problem, there is at least one solution that is simple, neat, and wrong.
    2. Re:Math library for sale? by Anonymous Coward · · Score: 0

      how many math libraries out there are capable of using the processing powrer of GPU rather than CPU ?

    3. Re:Math library for sale? by __aaclcg7560 · · Score: 0

      If you can get a math library to run on a GPU, it won't be long before Windows Vista is running on the video card. That mean we can be free of the old AMD/Intel overlords while welcoming the new ATI/Nvidia overlords!

    4. Re:Math library for sale? by Kortec · · Score: 2, Informative

      This is probably making a lot of developers, myself included, very very happy people. FFT's are where the proverbial magic happens for a lot of signals and systems analysis, as well as for the multiplication of very large integers. So anyone involved in gaming that includes digital signal processing (voice chat in UT, karaoke-karaoke-revolution type games, analog user input, etc.) is going to be happy, and anyone who's involved in multiplying huge integers (crypto anyone?) may very well have wet themselves.

      If this is true, to be able to make the same computations in a fourth of the time is a pretty nice thing, and using a little more of a GPU is likely to be acceptable in a decent number of applications.

      --
      "My heart is in the work." - Andrew Carnegie
    5. Re:Math library for sale? by dattaway · · Score: 1

      I'm sure hackers are already taking the Vista open source code and making competing distributions...

    6. Re:Math library for sale? by alphamugwump · · Score: 1

      Correct me if I'm wrong, but I always thought that graphics cards were exclusively floating point, and had nothing to do with huge integers.

    7. Re:Math library for sale? by ptr2void · · Score: 1

      Actually, floating-point pixel formats are relatively new on the nVidia/ATI beach.

    8. Re:Math library for sale? by Anonymous Coward · · Score: 0

      First it's a math library for free. Second it's using GPUs instead of more expensive CPUs. If you still cannot figure out why it's newsworthy, you need to get a f**king clue.

  15. FFT=Fast Fourier Transforms by amliebsch · · Score: 4, Interesting

    Isn't that what SETI@home uses for the bulk of its signal analysis? Would be kind of neat to leverage the millions of idle GPU's out there.

    --
    If you don't know where you are going, you will wind up somewhere else.
    1. Re:FFT=Fast Fourier Transforms by fredrikj · · Score: 2, Informative

      It's also used by distributed mathematics projects such as GIMPS to multiply large numbers. Unfortunately, if this implementation only operates in 32-bit precision, it will probably be less useful for this purpose since you'd have to do subproducts with fewer digits at a time, to avoid rounding error. I'm not familiar with the details, though.

  16. This is news? by Anonymous Coward · · Score: 2, Interesting

    I have an uncle who's a professor who's been using GPUs for scientific computation for years. Apparently he has systems with four GPUs running simulations.

    1. Re:This is news? by Duncan3 · · Score: 1

      It's old news in academic, computational, and HPC circles, yes.

      New or the 2nd or 3rd dup here on /. tho

      --
      - Adam L. Beberg - The Cosm Project - http://www.mithral.com/
    2. Re:This is news? by Fex303 · · Score: 4, Funny
      I have an uncle who's a professor who's been using GPUs for scientific computation for years.

      I'm sorry, but playing Quake at really high framerates does not count as research. He's not fooling anyone.

      The business cards which list him as 'Profess0r of Pwnage' probably aren't helping either.

      It's also bad when he refers to the undergrads as 'n00bs' during his lectures to them.

    3. Re:This is news? by Anonymous Coward · · Score: 0

      This professor Uncle of yours, he wouldn't happen to live in Chapel Hill by any chance, would he? ;^)

      Seriously though, was your Uncle doing FFTs on his GPU? And if so was he doing them at 4x CPU speed?

      If not, then maybe it is news. "News for nerds". Perhaps not "stuff that matters", though, seeing as how it's only for power-of-2 1D transforms of single precision float data, and there's probably a conveniently missing pci-x bus transfer overhead that's not in the numbers.

      Still it's a cool trick. However, making GPU's do backflips like this is kind of a silly research fad. What all these people making GPUs do FFTs should be doing is figuring out either
      A) what's the right way to build chips that are actually GOOD for this stuff.
      Trying to get FFT to run on a GPU is like trying to get a Linux/Apache webserver running on an iPod. Maybe it can be done, and maybe it'll even work pretty well -- but why would you want to? The iPod was designed from the ground up to play music, not handle http requests, just like the GPU is designed from the ground up to be good at graphics, not computing FFTs. If you want a great FFT engine, then go design yourself a chip that's great at computing FFTs. Then try to convince NVIDIA to put that tech into their next generation.

      or maybe B) folks should be trying to make a compiler that takes regular sequential C code as input and generates crazy GPU code as output.

      I guess there are folks out there working on both of those. Just sad they don't get as much press as the XYZ algorithm implemented on GPU! announcements.

    4. Re:This is news? by Anonymous Coward · · Score: 0

      The reason he uses GPUs is because they offer the best price/performance ratio. I'm sure he'd love to have some custom hardware, but GPUs benefit from economies of scale, so even though they suffer from a performance penalty over the PCI bus the price makes up for it. Besides, any altenative will likely have to run over PCI as well so that's not really a loss.

      I noticed that slashdot recently did an article about a company that made FPGAs that could plug into a processor slot in a multi-CPU machine. If that company takes off I'll bet that they'll take the market back from the GPUs.

    5. Re:This is news? by Anonymous Coward · · Score: 0

      Some time ago, we used to do prime number crunching on the Commodore 64's floppy drive (that contained a complete computer). It was called 'transputing'...

    6. Re:This is news? by Vo0k · · Score: 1

      Prime number crunching, blah. You use C64 disk drive to decompress data from the floppies to RAM!

      --
      Anagram("United States of America") == "Dine out, taste a Mac, fries"
  17. $1500-$2000? by Wesley+Felter · · Score: 3, Insightful

    I sense a little bias here; the fastest Intel and AMD processors are actually $1,000.

    1. Re:$1500-$2000? by veg_all · · Score: 4, Informative

      RTFA? Or just look at the pretty diagrams.
      They're running against dual-processor systems (Opteron and Xeon).

      --
      grammar-lesson free since 1999. (rescinded - 2005)
    2. Re:$1500-$2000? by MP3Chuck · · Score: 2, Informative

      Pricegrabber lists some up in the range of several thousand dollars ... like this Itanium, for just over $5k and this Dual Core Xeon for $3,700.

    3. Re:$1500-$2000? by iamplupp · · Score: 1

      Actually, the top of the line Opteron costs more than 2000 according to amd.com

      Dual-Core Model 875 HE $2,149

      and i'm sure the fastest Xeon is similarily priced.

    4. Re:$1500-$2000? by Anonymous Coward · · Score: 0

      True except not, AMD's Opteron 885 (8 ways 2.6GHz dual-core, the top of the AMD Opteron line) is priced around $2300 each retail.

    5. Re:$1500-$2000? by Wesley+Felter · · Score: 1

      Of course, the latest Athlon 64 FX is both faster and cheaper than an Opteron 875. Same for Intel.

    6. Re:$1500-$2000? by Anonymous Coward · · Score: 0

      AMD's Opteron 8xy are 8-ways processors, they're not intended to be used in a single-chip setup (you want 1xy for that), the point of 8-ways processors is to be able to put up to 8 processors on a single motherboard...

    7. Re:$1500-$2000? by Anonymous Coward · · Score: 0

      The price may be different if you requisition them with a government Purchase Order. On complete systems, sometimes delivery alone may take 18 months depending on how much profit your supplier can get away with.

    8. Re:$1500-$2000? by HuguesT · · Score: 1

      Single CPU, yes, but the latest opteron 8xx series supports at least dual-core, 8-way SMP. Not quite the same thing.

    9. Re:$1500-$2000? by Jozer99 · · Score: 1

      Fastest consumer processors are $1000. That doesn't count the Opterons or Xeons, which are priced much higher.

  18. Cray-1 comparison by Mostly+a+lurker · · Score: 5, Interesting
    The Cray-1A supercomputer, weighing in at 5.5 tons, had an absolute maximum peak performance of 250 megaflops. It, of course, cost millions and the power requirements (including for cooling) were in excess of 200 kW. I remember marveling at the advanced nature of this technological achievement.

    Thirty years later, a $500 GPU, weighing less than 1 pound, can produce 6 gigaflops. People complain about its power and cooling needs, but they are rather below 200 kW! We sometimes forget just how amazing the developments in computing have been over the last three decades.

    1. Re:Cray-1 comparison by Anonymous Coward · · Score: 2, Insightful

      People complain because they compare the power consumption to their old "home computer". Just look at the Apple II, the C64 or similar 8 bit computers, almost all of which had such low power demands that they could run without fans. Even the "IBM compatible" PCs up to and including 386s almost always had exacly one fan, the one in the power supply. My most recently purchase (second hand) computer has 6 fans, and draws enough power to justify most of them.

      You can probably make up your own flawed car analogy and compare top speed and fuel consumption of today's compact cars with the racing cars of 60 years ago.

    2. Re:Cray-1 comparison by suv4x4 · · Score: 2, Funny

      People complain about its power and cooling needs, but they are rather below 200 kW! We sometimes forget just how amazing the developments in computing have been over the last three decades.

      Why compare it with Cray-1, compare it with the steam-powered calculators of the past that take minutes to multiply two simple numbers and the results are sometimes kinda off.

      People always demand more, this is why they develop more, so to get more. If people become suddenly satisfied with whatever state they're in, they'll never move on.

    3. Re:Cray-1 comparison by grammar+fascist · · Score: 1

      You can probably make up your own flawed car analogy and compare top speed and fuel consumption of today's compact cars with the racing cars of 60 years ago.

      And then marvel at how automobile technology has advanced, which I think was the point.

      --
      I got my Linux laptop at System76.
    4. Re:Cray-1 comparison by cdn-programmer · · Score: 0, Offtopic

      Back then computer programmers were paid better as well. Seems people want the cost of the programmer to favorably compare to the cost of the hardware!

    5. Re:Cray-1 comparison by MythMoth · · Score: 1

      We sometimes forget just how amazing the developments in computing have been over the last three decades.

      Slightly off topic, but this was brought home to me a while ago when I was working with a commercial linear programming tool to solve some tricky resource allocation problems.

      The tool in question has been around for quite a while, allowing for significant hardware developments - and the algorithms used to solve LP problems have advanced substantially over the same period.

      The end result being that this tool is now able to solve problems six million times faster than was the case twenty years ago. Our problem took about five minutes to spit out a satisfactory answer. If I'd started its predecessor on the same problem twenty years ago it would have... still been calculating it!

      --
      --- These are not words: wierd, genious, rediculous
    6. Re:Cray-1 comparison by incabulos · · Score: 1

      We sometimes forget just how amazing the developments in computing have been over the last three decades.

      Not really, its just that said computing developments are overshadowed by the capacity of commercial software to consume all available system resources and deliver performance that is orders of magnitude slower with each successive generation.

      Put a Windows Vista install on the latest cray, and you too can marvel at how quickly a multi-million dollar piece of computing quickly becomes a slow, unstable pile of junk.

    7. Re:Cray-1 comparison by Anonymous Coward · · Score: 0
      If I'd started its predecessor on the same problem twenty years ago it would have... still been calculating it!

      You could always fire up your LP tool to figure out the optimal time to start the calculation :-)

    8. Re:Cray-1 comparison by laffer1 · · Score: 1

      That is true, but the definition of an operating system is not as clear cut as it used to be. People demand more functionality from modern operating systems. Its not just a kernel and a few tools anymore. End users want their computers to play music and games. They want eye candy, instant messaging, word processing, internet capability, etc.

      Then there's algorithm changes. Sometimes faster algorithms require a larger working set and therefore use more memory. Look at the changelog for the linux kernel or check out some of the discussions on the BSD mailing lists. I saw a thread debating if FreeBSD should increase the stack size to accomodate Gnome a year or so back. Someone pointed out that linux and solaris both have larger stacks. Caches used are much larger now as we try to feed data to our 3ghz+ cpus.

      The real problem is that programmers entering college in the past 10 years have been told that memory is cheap and new cpus will be around to fix any of their problems in 18 months. No one seems to care about a small memory footprint or speed anymore. Programmers are also taught that optimizing code isn't worth it because it introduces bugs and increases debugging time. Most of us hate debugging and rely on IDEs to do it for us. Then you add virtual machines like the java vm and .NET runtime and you see the movement away from performance over simplicity. Microsoft and VMWare are pusing virtualization like crazy. That movement may bring back performance concerns since people will want WIndows and Linux to run fast in virtual machines. It will only be a short term effect though.

      Reading this thread I saw several posts about using assembly to speed up computations. Did you realize that we are told never to use assembly because gcc can do it better than we can? That was reinforced in every class at my university, even the sparc assembly course. I compared the output from a c program and something i wrote and my code was concise and faster in assembly. Most students didn't even try it though as the professor told them "the answer". Sometimes complexity is worth it.

  19. That's where PCIe is useful by WoTG · · Score: 3, Informative

    AGP was not very useful for bidirectional data flow, but PCIe is. GPU's are pretty sophisticated these days, so they've got the logic to handle moving stuff in and out of it's memory and over the bus to the CPU and the rest of the system.

  20. wasted processing power... by abigsmurf · · Score: 1

    makes you wonder how much raw computing power is locked away in the various chipsets in your PC. A 500mhz GPU designed for floating point and complex mathematical instructions could come in especially handy for encoding videos rather than just being used to apply filters during playback. That said, if it's used in things like Seti@home I'd imagine the additional power use would have a notable effect on your electricity bill

    1. Re:wasted processing power... by Jozer99 · · Score: 1

      What I'm worried about is heat. My actively cooled ATI Radeon X800 XL runs at about 49C on the Windows Desktop, but when I start a game or graphics benchmark, it shoots up to 75C. If I was always using the GPU, I would think that being that hot would shorten its life considerably.

  21. Good occasion by 4D6963 · · Score: 1
    At last a good occasion to explain what the FFT is! But huh, I'll rather link to Wikipedia, not that I couldn't explain by myself eh, pshhh!

    Note how many comments are about explaining what FFT is as opposed to how many comments consist in asking what FFT means. Quite a fucked up demand/offer ratio.

    --
    You just got troll'd!
    1. Re:Good occasion by Anonymous Coward · · Score: 0

      Not really, low demand, high offer, this is why you get free access to the comments explaining what FFT is.

  22. Precision limit. by DrYak · · Score: 2, Informative

    The limit is the floating-point precision of the GPU.

    Most GPU can do max up to 32-bit floating point operations (depending on the brand and the model), where as most scientific applications use 64-bit and higher (the old FP unit could do 80bit FP math, SSE registers in recent processors can do 128-bits FP math).

    So some user will be happy, like for sound processing (GPU have already been used to process reverberation to create realistic sound environnement - too lazy to do the search for the slashdot reference)
    But other application (cryptography maybe) will probably need more FP precision.

    Not to mention that most scientific applications run mostly under *nix like Linux or BSD, for which GFX driver support isn't always incredible, specially for recent models, (website mentions performance hit).
    (And also remember that soon Vista will have an interface that'll completly clog the GPU and leave less free cycles to do general purpose calculation).

    --
    "Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
    1. Re:Precision limit. by NVP_Radical_Dreamer · · Score: 1

      Not to mention that most scientific applications run mostly under *nix like Linux or BSD, for which GFX driver support isn't always incredible, specially for recent models,

      This is nothing but FUD, nVidia has always had great support for *nix OS's http://www.nvidia.com/object/unix.html even ATI has some support for *nix OS's https://support.ati.com/ics/support/default.asp?de ptID=894&task=knowledge&folderID=27 try using them sometime rather than listening to your tech gods at the local geek squad....

      --
      The best argument against democracy is a five-minute conversation with the average voter.

      - Winston Churchill
    2. Re:Precision limit. by Rakishi · · Score: 1

      The ATI drivers are by consensus seen as shit, and a great source of laughter whenever someone buys an ATI card and tries to get usable performance in linux.

      There is a difference between token support and good performance, and apparently you need to go back to school for reading comprehension as you missed the difference in the GP's post.

    3. Re:Precision limit. by mczak · · Score: 1

      No, you can't do "128-bits FP math" with SSE. Not if you're talking about precision. The registers are indeed 128-bits wide (*), but all math operations operate on either 2x64 bits or 4x32 bits (that's why it's called SIMD after all, "single instruction multiple data". (SSE could only work with simple precision floats (4x32) whereas SSE2 introduced double precision.)

      (*) This is actually an oversimplification. At the hardware level, the registers may actually not be 128-bit wide. In fact, both intel p4 and amd a64 only have internal 64-bit datapaths and will decompose a 128bit instruction into 2 64bit instructions. But this is of course hidden from the programmer.

  23. Based on FFTW? by 4D6963 · · Score: 1
    That thing is called GPUFFTW, one could assume that it is based on the FFTW library (which is after all the best performing FFT library around) but after looking at every page on this site, I couldn't find a single credit to FFTW.

    Are the two linked or is the W at the end of the name just a mere coincidence?

    --
    You just got troll'd!
    1. Re:Based on FFTW? by Anonymous Coward · · Score: 0

      GPU Fourier For The Win?

    2. Re:Based on FFTW? by Anonymous Coward · · Score: 2, Informative

      You're right, they give them no credit. From TFA:

      We are not porting FFTW to GPUs and our project is not related to FFTW. FFTW is a more general library designed mainly for CPUs. GPUFFTW uses some cache optimizations to obtain maximum memory performance on GPUs.

    3. Re:Based on FFTW? by Anonymous Coward · · Score: 0

      The "W" stands for "West". As in "Fastest Fourier Transform in the West". Seriously.

    4. Re:Based on FFTW? by 4D6963 · · Score: 1

      My bad, couldn't find that anywhere on their site. Thanks for answering the question anyways

      --
      You just got troll'd!
  24. (Almost) Comparison to BSD Licensed version by achooo · · Score: 2, Interesting

    on this page here they almost compare to a program called libgpufft (which is an open source BSD version of the same library here ) I wonder how they do compared to the BSD licensed version---

  25. Interesting question... by DrYak · · Score: 2, Interesting

    The interesting question will be :
    Is the 32-bit precision enough for SETI@Home application ?
    Or does the project needs the higher precisions (64bit to 128bit) that can (for now) only be provided by the CPU ?

    IMHO, maybe this could be useful. They're trying to find which chunk contains candidate data. If there's some fast low-precision algorithm that can quickly mark chunks as interesting / recheck with higher precision / un-interesting, it'll be helpful to quickly tell appart interseting chunk, even if data need to be post-processed later at higher precision.

    --
    "Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
    1. Re:Interesting question... by SETIGuy · · Score: 4, Informative
      Yes, 32 bits is quite enough for our FFTs. Our requirements are fairly low. 16bit floats may even do the job (although I've never tried 16bit floats in SETI@home). What has concerned us in the past is that bandwidth to GPUs was fairly assymmetric (on AGP cards), the seti@home working sets (A few buffers of 1M complex samples == 16MB each) were larger than the usable memory on many video cards and the length of the maximum shader routine was fairly small. SETI@home does quite a bit more than FFTs, so moves into and out of main memory were required. At the time we couldn't put more into the shader language. That may have changed, but right now we lack anyone who both has the time to do the job and is capable of doing it.

      Our tests on nVidia 5600 series AGP cards (this was several years ago) showed that the net SETI@home throughput using the GPU was at best 1/5 of what we could obtain with the CPU. This was primarily due to transfers out of graphics memory and into main memory.

      PCI Express allows for symmetric bandwidth to graphics memory and graphics memories are now typically larger than the size of our working set. The difficulty will be in benchmarking to see which is faster for a specific GPU/CPU combination.

      At any rate it's a fairly simple job to swap FFT routines in SETI@home. The source is available. Someone may have done it by now...

    2. Re:Interesting question... by jthill · · Score: 1

      You might be able to get Sony to subsidize a PS3 port — protogeek coolness, your game console's hunting aliens while you sleep.

      --
      As always, all IMO. Insert "I think" everywhere grammatically possible.
    3. Re:Interesting question... by hawkbug · · Score: 1

      Well, even if running it on a GPU is slower than CPU - couldn't you still get a benefit of running both at once? Say you used both the GPU and CPU at the same time, I would think you could search more data that way. I'm sure there's a penalty of some kind going to and from memory for both devices, but have you atleast considered the possibility?

    4. Re:Interesting question... by SETIGuy · · Score: 1
      Say you used both the GPU and CPU at the same time, I would think you could search more data that way. I'm sure there's a penalty of some kind going to and from memory for both devices, but have you atleast considered the possibility?

      We have considered the possibility. It seemed to us that the synchronization between the processing on the GPU and the CPU was the difficult part. The easiest method (and therefore the one that we would most likely have the time to implement) would be to have two copies of SETI@home running, one which uses the CPU only and one that uses the CPU plus the GPU for FFTs. You could come fairly close to keeping both the CPU and the GPU fully occupied.

      We've got someone looking into using this FFT library. Technically it should be easy. Legally is another problem. This library is "free for non-commercial use" which is incompatible with the SETI@home license (GPL). We do include an exception to allow linking proprietary FFT libraries, but I'm not sure that the exception is fully compatible with this license since we can't control whether someone uses SETI@home in a commercial manner (although I don't see what commercial use they could find for it). I may have to contact the authors to discuss this.

  26. Bytes/bits? by 4D6963 · · Score: 2, Informative
    From the site :

    The Video RAM will determine the maximum array length that can be sorted on the GPU. A rough guideline for performing FFT on 32-bit floats is: Maximum array length in millions = Video RAM in MB / 32

    Max array length equals video RAM in megabytes divided by 32... bits? Correct me if i'm dumb but shouldn't it rather be "Video RAM in MB / 4"?

    --
    You just got troll'd!
    1. Re:Bytes/bits? by cnettel · · Score: 1

      You would naturally need some kind of workspace to store the results. I think that a normal shader is not supposed to write data back to from where it was read, so there can be some quite significant "read, multiply, write to new location" going on.

    2. Re:Bytes/bits? by 4D6963 · · Score: 1
      You would naturally need some kind of workspace to store the results. I think that a normal shader is not supposed to write data back to from where it was read, so there can be some quite significant "read, multiply, write to new location" going on.

      I'm afraid you either replied to the wrong comment, or didn't understand at all the comment you're replying to.

      --
      You just got troll'd!
    3. Re:Bytes/bits? by Anonymous Coward · · Score: 0

      No, he's saying it probably cannot be executed in place: requiring 8 copies of the data (or equivalent) may be realistic.

      I have no clue if this is correct or not, but I think he did reply to the correct comment!

    4. Re:Bytes/bits? by 4D6963 · · Score: 1
      Oh, well, we have no reason to assume that it takes 8 copies of the data, it's mentionned nowhere, they just, from what I quoted, directly take the "32" figure from "32-bits" and divide megabytes with it. Sounds like nothing else but a mistake.

      Thanks for helping clearing the grand-parent post anyways :-).

      --
      You just got troll'd!
  27. No surprise here... by nitrocloud · · Score: 2, Insightful

    Graphics Processing Units have always been better for FFTs and signal processing than general CPUs. I've read a journal article where machine vision was implemented on a GeForce 5200 at a 3x speedup over an AMD Athlon 3200+. The reason? This is what a GPU is made for; the small dedicated instruction set makes a GPU much more adept at signal processing than the 686's have ever been.

    --
    Karma: Good, or bust!
  28. What's an FFT by Geoffreyerffoeg · · Score: 5, Informative

    Apparently nobody knows what an FFT is. Here's the best description I can give without descending into math too much.

    The Fast Fourier Transform is an algorithm to turn a set of data (as amplitude vs. time) into a set of waves (as amplitude vs. frequency). Say that I have a recording of a piano playing an A at 440 Hz. If I plot the actual data that the sound card records, it'll come out something like this picture. There's a large fading-out, then the 440 Hz wave, then a couple of overtones at multiples of 440 Hz. The Fourier series will have a strong spike at 440 Hz, then smaller spikes at higher frequencies: something like this plot. (Of course, that's not at 440, but you get the idea.)

    The reason we like Fourier transforms is that once you have that second plot, it's extremely easy to tell what the frequency of the wave is, for example - just look for the biggest spike. It's a much more efficient way to store musical data, and it allows for, e.g., pitch transformations (compute the FFT, add your pitch change to the result, and compute the inverse FFT which uses almost the same formula). It's good for data compression because it can tell us which frequencies are important and which are imperceptible - and it's much smaller to say "Play 440 Hz, plus half an 880 Hz, plus..." than to specify each height at each sampling interval.

    The FFT is a very mathematics-heavy algorithm, which makes it well suited for a GPU (a math-oriented device, because it performs a lot of vector and floating-point calculations for graphics rendering) as opposed to a general-purpose CPU (which is more suited for data transfer and processing, memory access, logic structures, integer calculations, etc.) We're starting to see a lot of use of the GPU as the modern equivalent of the old math coprocessor.

    If you're looking for more information, Wikipedia's FFT article is a good technical description of the algorithm itself. This article has some good diagrams and examples, but his explanation is a little non-traditional.

    1. Re:What's an FFT by grammar+fascist · · Score: 1

      Bravo. Just... bravo.

      I was going to karma-whore this myself, but you beat me to it, and probably did a better job. :)

      --
      I got my Linux laptop at System76.
    2. Re:What's an FFT by Anonymous Coward · · Score: 0

      Could this accelerate mp3 encoding?

    3. Re:What's an FFT by odourpreventer · · Score: 1

      mp3 uses wavelets which is a different beast, although some similarities exist (I think).

    4. Re:What's an FFT by Yenshee · · Score: 2, Informative
      Great summary, but you are describing a general Fourier Transform, not necessarily a *fast* FT. FFT refers to a specific divide and conquer algorithm that has O(n*lg(n)) time complexity, rather than the O(n^2) time complexity of the naive FT algorithm. Here n is the number of discrete samples, i.e. the size of the problem.

      FFT is to naive FT as quicksort is to insertion sort. Both quicksort and the FFT are considered to be among the top ten algorithms of the 20th century.

    5. Re:What's an FFT by Anonymous Coward · · Score: 0

      Nope, MP3 uses a MDCT.

    6. Re:What's an FFT by njh · · Score: 1

      Though it may not be directly useful (I haven't seen which FFT variants the library provides), the MDCT used by mp3 is a fairly similar algorithm and could probably be sped up the same way.

    7. Re:What's an FFT by SpinyNorman · · Score: 1

      I think you mean bubblesort rather than insertion sort. An insertion sort is O(n*log(n)) if you use binary search to find the insert point.

    8. Re:What's an FFT by M3shuggah · · Score: 1

      eh, not really.

      Wavelets provide time AND frequency resolution; FFT is ambiguous to time. The closest Fourier variant that I'd say remotely compares to a wavelet transform, would probably be the STFT. ...though the STFT's harsh windowing has the ability to introduce quite a bit of high-frequency noise. While wavelets are often used to de-noise a given signal.

      But on the topic of compression, the JPEG2000 standard uses wavelets.

    9. Re:What's an FFT by fredrikj · · Score: 2, Informative

      FFT is to naive FT as quicksort is to insertion sort.

      Be careful with the terminology; you correctly referred to "naive FT algorithm" above, but this sentence might give the impression that the Fourier transform itself is an algorithm. FT is a function whereas the FFT is an algorithm that computes the function. It would be more appropriate to say that FFT is to the Fourier transform what quicksort is to sorting.

    10. Re:What's an FFT by Yenshee · · Score: 1

      Good point. By 'naive FT', I mean the discrete FT algorithm obtained by a dense matrix-vector multiplication, where the matrix entries are the appropriate complex exponential coefficients, and the vector is the list of time domain samples.

    11. Re:What's an FFT by Lord+Crc · · Score: 1

      mp3 uses wavelets

      Mp3 is, like JPEG, based on the discreet cosine transform (DCT). The reason for this is "because it has a strong "energy compaction" property: most of the signal information tends to be concentrated in a few low-frequency components of the DCT".

    12. Re:What's an FFT by Geoffreyerffoeg · · Score: 1

      What are you inserting into? No matter how you find the point, if you're in the original array, it takes average O(n) to shift each remaining element over by one to make room for your new element, so it's still O(n^2) so long as finding an element is at most O(n). If you're inserting into a doubly-linked list, then you've got O(n) space overhead. If you're inserting into a tree, then it's not an insertion sort.

    13. Re:What's an FFT by Geoffreyerffoeg · · Score: 1

      Ah thanks :-D. So I didn't make a big mathematical error. You can infer by my graduation year in my signature that I haven't actually studied Fourier series in school yet... I wasn't sure if that pitch transform would work, e.g.

    14. Re:What's an FFT by gardyloo · · Score: 1

      It's a much more efficient way to store musical data [...]

            Just a bit of nitpickiness here (just so people aren't thrown off by that quoted statement above). The Fourier Transform (whether general or Fast) is NOT more efficient at storing musical data (or any kind of signal data) than the original signal. The original signal and the transformed signal are totally equivalent in the amount of data they contain, and, generally, in the space they take up. You have to do some creative and careful filtering to cut out frequency information which you don't want to store (and this isn't part of the Fourier Transforming algorithm) to be "more efficient", and you're throwing away parts of that original signal if you do that.
              To draw an analogy-which-isn't-really, compare MIDI music to a corresponding CD or .mp3 or .ogg or whatever recording. The MIDI will be much, much smaller, in general, because only very certain frequencies have been kept. On the other hand, MIDI generally sounds like crap, and the original recording can not be reconstructed from the MIDI file.

    15. Re:What's an FFT by pyite · · Score: 1

      Yes, a DCT is a DFT. It's the equivalent in a Fourier series to when you take a function that's periodic on [0, Pi], and then mirror the function across the y-axis to make it even and periodic on [-Pi, Pi] and then take the Fourier series of it. Since the function you just constructed is even, all the sin(nx) coefficients in the Fourier series drop out and you're left with cosines.

      --

      "Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman

    16. Re:What's an FFT by njh · · Score: 1

      Yep, but mp3 uses MDCT (http://en.wikipedia.org/wiki/MDCT) which has some additional properties which may make the original algorithm not very helpful (for example, it may not be able to use all of the symmetries, leading to it having to do a full FFT over the larger set, requiring 4 times as much space (and thus losing its original advantage).

      This is unlikely though.

    17. Re:What's an FFT by fatphil · · Score: 1

      Yeah but no but yeah but no.

      The (D)FT is a function. A vector of sums of products of elements of another vector. That, in itself, defines an algorithm. In some languages, the expression of the function itself is exactly the same as the expression of the algorithm. (Mathematica, GP/Pari, and presumably Maple, et al.)

      FatPhil

      --
      Also FatPhil on SoylentNews, id 863
    18. Re:What's an FFT by Geoffreyerffoeg · · Score: 1

      The original signal and the transformed signal are totally equivalent in the amount of data they contain, and, generally, in the space they take up.

      Actually, they're only mathematically equivalent. Take a square wave, for example. I can represent that as simply 0, 0, 1, 1, 0, 0 in the time domain, but you need an infinite series of sinusoids to represent it as a Fourier sum. Meanwhile, for sin(440 * 2 pi t), that's a single term in Fourier format, but requires infinitely detailed x-y pairs to represent it against time.

      So if you have a pure sine wave, or a sine wave with a couple of overtones, it is easier to say as in my example "full-amplitude 440 Hz plus half-amplitude 880 Hz" than to sample these waves. But you are right; with real-world data, you're going to have a lot of differing frequencies. Good storage algorithms will eliminate the unimportant frequencies from the Fourier sum - a trick that is hard to do with the original sampled data.

      As far as MIDI sounding like crap...have you ever heard those grand pianos with a MIDI floppy drive?

    19. Re:What's an FFT by Geoffreyerffoeg · · Score: 1

      Great summary, but you are describing a general Fourier Transform, not necessarily a *fast* FT.

      Sorry this reply's a little late. Yes, I am describing the process of Fourier transforms. From the comments here, most people had no idea what you'd use the algorithm for. It's like if they announced a way to run quicksort on a hard disk controller or something, and people asked what quicksort can do. You'd need to explain sorting itself, not pivots and partitions.

      Fast Fourier Transform, for the record, is a particular algorithm that can calculate the (discrete) Fourier series for given (discrete, sampled) data, reasonably efficiently. Monsieur Fourier's original formuation was actually an integral (to allow for a continuous summation across all frequencies and all times). Actually calculating this integral out, every time you need it, is hella slow, as you might assume. The FFT can give the same result as the integral much more quickly.

    20. Re:What's an FFT by SpinyNorman · · Score: 1

      Inserting into a doubly-linked list. You could of course copy back into an array if that was your source. This type of sort has it's uses - the worst case performace is O(n*lg(n)) vs the O(n^2) presorted case for quicksort, and best case (presorted data) is O(n). It's also obviously a natural fit if your data is in a list vs array to begin with.

    21. Re:What's an FFT by Anonymous Coward · · Score: 0

      How do you binary search a linked list?

    22. Re:What's an FFT by SpinyNorman · · Score: 1

      By maintaining a temporary hierachical index pointing into it... Or just use some sort of balanced tree instead - direct O(log(n)) insert.

    23. Re:What's an FFT by Geoffreyerffoeg · · Score: 1

      How do you update said "temporary hierarchical index" in O(log n) time? It looks very much like an array. And if you're using a balanced tree, then it's one of the tree sorts, not an insertion sort.

    24. Re:What's an FFT by SpinyNorman · · Score: 1

      Well this is quibbling over semantics... if you use a B-tree like structure as "random access" index into a linked list, then is it a list or a tree?

    25. Re:What's an FFT by Geoffreyerffoeg · · Score: 1

      Well then you have the overhead of updating the tree as well as the overhead of updating the list. So you'll have to go through the inefficient parts of both structures. If one is always more efficient, use that alone. If both are at times inefficient, then you've got an inefficient algorithm and the point is moot.

  29. GPU FFT major shortcoming by Anonymous Coward · · Score: 0

    Limited to a single process - not multi-process or multi-thread safe.

  30. Enough if you're doing images at 8 bits/pixel by Anonymous Coward · · Score: 0

    Single-precision floating point is more than enough if the final product is an image with 8 bits/pixel.

    1. Re:Enough if you're doing images at 8 bits/pixel by thejam · · Score: 1

      Beware: Numerics can be tricky! It can be the case that intermediate results need far more precision than your final result. It's not difficult to construct mathematical problems whose results are 8-bit, but where double precision is required for the computations. Especially in science, where you're asking new questions, you don't know the answer and prudence suggests that using more intermediate precision will save you headaches. Similar reasong justifies not flushing denormal numbers (floating point numbers smaller than the "machine epsilon", or smallest "normally" representable number greater than zero) to zero, but keeping as many digits as possible. This is provided by full IEEE 754 compliance. So go for full IEEE 754 double precision; it may prevent another Ariane 5 catastophe!

  31. Great for audio! by radarsat1 · · Score: 2, Insightful

    Awesome, this is really good news for audio people.
    I want to see how I can take advantage of this... I hope the license isn't too restrictive.
    It might be a good example of how to use the GPU for general purpose (vector-based) computation, something I've been wanting to explore.

    Just curious, how does the use of the GPU for this kind of thing affect the graphics display?
    Are you unable to draw on the screen while it's running, or something?

    1. Re:Great for audio! by CalcWiz · · Score: 2, Informative

      Generally you will only notice a difference if you are running another GPU intensive program at the same time. Most GPGPU programs (including ones made by Gamma) use extensions such as the frame buffer object so that you don't need to see the mathematical garbage in a graphical sense. Thus the only side effect is that it needs a significant amount of video memory and GPU processing time, both of which will be shared with other programs trying to use them. So if you try to compute the FFT of something that barely fits in video memory while playing UT 2004 at 1600x1200 resolution, you might have a problem. Otherwise, you probably won't notice.

      --
      If PRO- is the opposite of CON-, What's the opposite of Progress?
  32. If you need gigaflops... by mikelang · · Score: 1

    ...for a custom application, I'd rather look at FPGA.

    1. Re:If you need gigaflops... by zeldor · · Score: 2, Informative

      not so custom really if you go with the right integrator.
      SRC computers puts out FPGA based systems that has a nice
      32bit floating point fft library already in their development
      environment. Most customers using the fft are for radar image
      processing where the best PC solution is 50 times slower
      then the fpga based solution. Think UAVs with smart
      tracking off their radar.
      http://www.srccomputers.com/

      --
      If I could walk that way I wouldnt need cologne.
    2. Re:If you need gigaflops... by Hast · · Score: 1

      In my experience (I've dabbled with both GP-GPU and custom FPGA chips) FPGAs don't always come out on top. There certainly are things for which FPGAs are better, but there are also things for which GPUs are better.

      If for no other reason than that you can get a working GP-GPU program running in a day or two. Designing hardware (even on FPGA) typically takes months.

  33. Cryptography? by Nicolas+MONNET · · Score: 2, Insightful

    Crypto = Number theory = integer math.

    No need for floating point.

    1. Re:Cryptography? by Anonymous Coward · · Score: 1, Informative

      Public key cryptography needs much bigger integers than fit a machine register. Remember those 1024-bit or even 4096-bit keys? Those represent integers. Multiplying big integers can be done more efficiently by FFT or FFT-like techniques, and if you've got fast floating point then yes, it is more efficient to use floating point to do big integer multiplication.

      The basic trick there is that integer multiplication is essentially convolution on the pieces of the numbers, and convolution is something that is trivial to compute in frequency space, which is where the FFT comes in. Check out the "Fourier Transform Methods" section of the wikipedia article on multiplication algorithms.

    2. Re:Cryptography? by locofungus · · Score: 1

      And if you'd read Knuth you would know that the FFT can be done entirely in integers. It's in the section "how fast can we multiply". IIRC the "all integer convolution algorithm" is set as one of the problems.

      Knuth derives an upper bound on the required number of bits for each integer which (IMO) comes out surprisingly low.

      --
      God said, "div D = rho, div B = 0, curl E = -@B/@t, curl H = J + @D/@t," and there was light.
  34. Hybrid Hardware-Accelerated Relighting Engine by Anonymous Coward · · Score: 1, Interesting

    This is somewhat off topic, but keeps to the fact that GPUs
    are having profound effects in the cinematography world.

    Check out this paper detailing impresive gains using a GPU versus
    Pixar's standard engine and render -
    some gains are upwards of 20,000X more efficient
    with almost identical visual results!



    http://www.vidimce.org/publications/lpics/



    Interesting? Cheers!

    1. Re:Hybrid Hardware-Accelerated Relighting Engine by Fujisawa+Sensei · · Score: 1

      Nice on the small screen, but even there I can see a big honking difference. Look at the smoothing on the front fender, its comprable the the (IIRC) fong shading from a 10 year old Amiga. It would look really nasty on the big screen. Ready for the PC games, sure. Ready for the big screen only if you're a low/no budget studio.

      --
      If someone is passing you on the right, you are an asshole for driving in the wrong lane.
  35. Finally.... by Comboman · · Score: 4, Funny

    Finally I have a good excuse to give the IT department why I need to upgrade my video card. I need to do FFTs faster (it has nothing at all to do with Doom3).

    --
    Support Right To Repair Legislation.
    1. Re:Finally.... by Anonymous Coward · · Score: 0

      I agree
      These guys are doing a service to the engineering community.
      Imagine ordering 2 of the fastest GPUs in SLI and having it perfectly justifiable!
      My hats off to you gentlemen, you are kings among men

  36. Rush hour dollars, too by SuperBanana · · Score: 1
    GPUs are nice, but there's the little matter of getting data and results on and off the chip.

    Not to mention that their dollar figures are somewhat misleading, as they didn't include the cost of the host PC...and AMD Opeteron 280 processors don't cost ANYWHERE NEAR $2,000. They also didn't show us how cheaper processors do.

    You can buy a Tyan quad-CPU motherboard for about $1k, and the dual-core version of the 280 for $1k tops. So...that's $5-5.5K for a box with EIGHT processors that will do 6 million complex values in almost half the time as the $500 video card sitting in a workstation that most likely costs well over $1,500. So it's not really quite as simple as "OMG 4X PERFORMANCE, ONE QUARTER THE COST!!!"

    Also, did you know that you can buy a Sempron motherboard, 1GB of ram, and sempron processor for about $150 these days? For the same cost as that one GPU, you can buy 3 complete Sempron systems and have change left over for your networking gear.

    1. Re:Rush hour dollars, too by misleb · · Score: 1

      Not to mention that their dollar figures are somewhat misleading, as they didn't include the cost of the host PC...and AMD Opeteron 280 processors don't cost ANYWHERE NEAR $2,000. They also didn't show us how cheaper processors do.

      They also didn't say how cheaper GPU's do.

      You can buy a Tyan quad-CPU motherboard for about $1k, and the dual-core version of the 280 for $1k tops. So...that's $5-5.5K for a box with EIGHT processors that will do 6 million complex values in almost half the time as the $500 video card sitting in a workstation that most likely costs well over $1,500. So it's not really quite as simple as "OMG 4X PERFORMANCE, ONE QUARTER THE COST!!!"

      Why does the workstation have to cost well over $1,500? Any PCIe system should do. YOu coudl build a system capable of holding such a GPU for well under $1,000 AND you could put muliple GPUs in it. Sure sounds pretty close to "OMG 4X PERFORMANCE, ONE QUARTER THE COST!!!" to me.

      Also, did you know that you can buy a Sempron motherboard, 1GB of ram, and sempron processor for about $150 these days? For the same cost as that one GPU, you can buy 3 complete Sempron systems and have change left over for your networking gear.

      But then you're hardly talking about "high performance" FFTs. You're just being a cheapass.

      -matthew

      --
      "THERE IS NO JUSTICE, THERE IS ONLY ME." -Death
    2. Re:Rush hour dollars, too by Aardpig · · Score: 1

      Good luck in getting 2xx series CPUs to work in a 4-socket motherboard. Unless you know something I don't?

      --
      Tubal-Cain smokes the white owl.
    3. Re:Rush hour dollars, too by Anonymous Coward · · Score: 0

      the $500 video card sitting in a workstation that most likely costs well over $1,500

      Also, did you know that you can buy a Sempron motherboard, 1GB of ram, and sempron processor for about $150 these days?

      $500 + $150 = $1500?

  37. Unfair comparison by daveisfera · · Score: 2, Interesting

    The one thing that I haven't seen mentioned is that the benchmarks only show "compute timings" and not actual setup and retreval times. If the benchmarks showed the amount of time to get the data to the GPU and especially the time to get the result back to a place where a program could actually use it then it would be blown out of the water by the CPU. Future cards/drivers could speed up the process of retrieving the data, but for now there will always be lame benchmarks like this that are unfairly biased toward the GPU and only tell half the story. I mean what's the point of doing an FFT so quickly if it takes forever to actually be able to get to the data.

    1. Re:Unfair comparison by Vo0k · · Score: 1

      If the benchmarks showed the amount of time to get the data to the GPU and especially the time to get the result back to a place where a program could actually use it

      Why the hell do you assume AGP? We live in PCI-X era and it doesn't suffer from thin back-pipe anymore. Add to that DMA, meaning almost total bypassing the CPU at retrieving and writing the data and you get really lightning-fast communication.

      And of course nothing stops you from using the CPU to do the FFT just together with the GPU, this kind of task is usually nicely paralellizable. Install 8 PCI-X gfx cards in 8 slots, add a task to perform it in the CPU in spare time between managing data and you have 33x the power of just the CPU alone, in one PC :)

      --
      Anagram("United States of America") == "Dine out, taste a Mac, fries"
    2. Re:Unfair comparison by CalcWiz · · Score: 1

      Actually, transfer time is usually somewhat insignificant. I worked on a GPU-based SVD algorithm in a class taught by a Gamma member at UNC, and transfer between the GPU and CPU took less than 1% of the running time of my algorithm, even on AGP.

      In addition, who said that we necessarily need the results on the CPU? If we really just want to compute the FFT and then check a couple of things or perform more computation with it, it may be more efficient to leave the data on the GPU and run it through another GPU program. In that case, it is unfair to include transfer time in the computation time of the FFT, because we don't need to transfer it.

      --
      If PRO- is the opposite of CON-, What's the opposite of Progress?
    3. Re:Unfair comparison by dogod · · Score: 1

      so this would work really well with all the @home distributed computing projects and their ilk.

      sounds like a lot of number cruching; nice.

  38. GPUs vs PowerPC AltiVecs? by david.emery · · Score: 1

    With respect to CPUs, I understand that Intel and AMD processors tend to smoke PowerPCs for integer computations, and it's the other way around for floating point calculations, particularly for things that make effective use of AltiVec.

    So I'm wondering how this GPU-based (presumably intensively floating-point) approach compares to a PowerPC/AltiVec processor. Anyone got any numbers/analysis?

    (This is more than just an academic proposition. Some have challenged a program's transition from PowerPC to Intel for precisely this reason, claiming that their PPC based algorthims wouldn't achieve necessary performance on Intel processors.)

              dave

    1. Re:GPUs vs PowerPC AltiVecs? by Anonymous Coward · · Score: 0

      The GPU would be faster, but less accurate.

    2. Re:GPUs vs PowerPC AltiVecs? by Anonymous Coward · · Score: 0

      AltiVecs/VMX is only 32bit FP.

    3. Re:GPUs vs PowerPC AltiVecs? by david.emery · · Score: 1
      AltiVEC is 32bit

      Thanks, good point. But other posters here have observed that for most of what you're trying to do, 32 bits is enough.

      So if we postulate only 32bit FFTs, I still reassert my question.

      dave

  39. Correction on usages by okmnji · · Score: 1
    While I am not as familiar with sound compression techniques, JPEG and MPEG-4 video compression (as in the H-264 component only, no sound) uses a DCT, not a FFT. But FFT is used for signal processing, compressing wireless signals, partial diff-eq solving, and similar, I remember that much from classes.

    And those acronyms are "Joint Photographic Experts Group", "Motion Picture Experts Group", "Discrete Cosine Transform", and "Fast Fourier Transform", for those who do not know and are too busy/lazy to look it up.

    1. Re:Correction on usages by Anonymous Coward · · Score: 1, Informative

      DCT and FFT are pretty much the same thing, except the DCT is limited to the real part of the signal. You could use an FFT in the place of a DCT if you discarded the imaginary data.

  40. 8xx verses 2xx by Somegeek · · Score: 2, Informative

    Note that you can only use the 2xx Opterons in 1 or 2 CPU (2 or 4 core) motherboards. If you want to have 4 CPUs with 8 cores you need to use the 8 series Opterons. The Opteron 880 dual core currently starts at over USD 1,600.00 each, which makes your configuration start at about 7,500 just for the MB and CPUs. Then add the registered RAM, server case, big juicy power supply, drives, video, monitor, a UPS to protect all of that... It sounds sexy but realistically its going to be close to $10K if you build it yourself.

    --
    And as you tread the halls of sanity, You feel so glad to be, Unable to go beyond. I have a message, From another time..
  41. not worth it by penguin-collective · · Score: 1

    There are plenty of high-performance co-processor boards you could use. Often, they don't help at all because getting the data in and out is slower than just doing the operations on the CPU. Furthermore, the $500 is in addition to the CPU you already have. Third, the effort you invest in putting your code on the coprocessor is likely going to be a short-term investment, as these boards (whether GPU or otherwise) rapidly change, and as few other people are going to have exactly the same setup as you.

    Overall, a 4x speedup for $500 for FFTs, impressive as it may sound at first sight, is probably not enough to help this sort of thing succeed widely. But it may find some uses in a few special niches.

    1. Re:not worth it by Slashcrap · · Score: 1

      Often, they don't help at all because getting the data in and out is slower than just doing the operations on the CPU.

      That was mainly an AGP limitation. Here in the year two thousand and six, we have something called PCI Express which is fast and bi-directional.

      We're still working on the hover cars though.

  42. stating the obviou... by adolf · · Score: 2, Interesting

    Right then. So how long before they just include some weak general-purpose instructions in the GPU, add SATA and ethernet to the cards, and call it a budget PC?

    1. Re:stating the obviou... by CalcWiz · · Score: 2, Informative

      It just doesn't work that way. GPU's get their advantage because they are highly parallel and heavily pipelined. This makes it easier and cheaper to add more computational power to the card. The problem is that there are a lot of things you just can't do efficiently. The time it takes for one piece of data to get through the pipeline prohibits many general purpose operations. The big advantage for a lot of applications is that it takes just about the same amount of time for thousdands to millions of pieces of data to get through the pipeline, they just can't depend on each other.

      Go read a tutorial on CPU architecture and where it talks about pipelining replace the pipeline that takes 5 cycles end to end with one that takes thousands of cycles and imagine how you could write basic algorithms. Worse yet, imagine how you would handle IO or interrupts.

      --
      If PRO- is the opposite of CON-, What's the opposite of Progress?
  43. The Windowing Problem by rumblin'rabbit · · Score: 1
    The company I work for, a seismic processing company, has a few terraflops of processing power provided by clusters of Intel/AMD chips. We do a lot of FFT's, although it's not our major computational roadblock. We're actually small potatoes. Some large seismic processing centres are on the order of 100 terraflops.

    And virtually all of this computation is single precision.

    There are enough inaccuracies in transforming to the frequency domain due to the "windowing problem" (which says that the finite length of the input series results in smearing in the frequency domain) that double precision is silly except for very long series.

    I'm interested in knowing where and why double precision FFT's are considered necessary.

    1. Re:The Windowing Problem by stephentyrone · · Score: 1

      medical imaging. the inverse integral transforms that are used in processing are often computed using ffts. do *you* want a PET/MRI/whatever scan to miss the piece of shrapnel in your chest because it was small enough to be obscured by the single-precision noise floor?

      there's a bazillion other applications that need double (and higher) precision, but i'm pretty sure that's the most compelling one for the average person.

    2. Re:The Windowing Problem by obizgnodnahs · · Score: 1

      From the performance page, it includes millions of complex points. So the window can be large enough. The project sponsors are:

              * Army Modeling and Simulation Office
              * Army Research Office
              * Defense Advanced Research Projects Agency
              * RDECOM
              * NVIDIA Corporation

      They must demand more than playing doom.

    3. Re:The Windowing Problem by Born2bwire · · Score: 1

      Since Fourier Transform is a convolution problem, we can also use FFT to solve problems other than timefrequency domain problems. In computational electromagnetics, method of moments involves integrating currents over Green's Function. This is actually a convolution problem and so we can use FFT to calculate the integration and accelerate our iterative solution. FFT allows us to go from order N^2 to order N log N. Depending upon our problem, we need one, two, or three dimensional convolution. So I could probably use their code to solve the case of the scattering of an EM wave from an infinite strip. And for our purposes it is preferable to have double precision as opposed to single

    4. Re:The Windowing Problem by rumblin'rabbit · · Score: 1

      Might I suggest that the measurement errors of an MRI greatly outweigh the errors caused by numerical inaccuracies during the calculation of an FFT, even in single precision.

    5. Re:The Windowing Problem by Anonymous Coward · · Score: 0
      I'm interested in knowing where and why double precision FFT's are considered necessary.

      One context where they can be necessary is if the FFT is going to be followed by other computations that are sensitive to error. For example, one way the FFT has come up in my work is that I needed to take the FFT of every column of a large matrix (to turn convolution into pointwise multiplication), take the log of the absolute values (to ignore phase information and convert pointwise multiplication into pointwise addition), and then take the pseudoinverse of the resulting matrix. Often a few of the singular values would be very small, making the pseudoinverse very sensitive to error. Would a single-precision FFT have been adequate? *Maybe*; the errors would have been noticeable, but there's still the question of whether they would be problematic. If you've got the time to use doubles, though, why worry?

    6. Re:The Windowing Problem by CrimsonScythe · · Score: 0
      Well, in some cases you really need double precision, and even higher. I just finished my Master's thesis, where I ran into problems due to double precision not being good enough.

      When you have high-precision spatial data, such as 16-bit MRI images, the Fourier domain data really will need a lot of precision. Remember that each element in Fourier domain is a weighted sum (using phase-coefficients) of all the pixels in spatial domain. Most of the time, 64 bits is plenty enough to adequately describe the image data, but in cases where you have very small angles in the phase-component you may end up losing information. Since the FFT and IFFT (inverse FFT) use cosines and sines to perform the transformations, a very small angle will not be calculated correctly. Remember from calculus that for very small angles theta,
      sin(theta) ~ theta
      and
      cos(theta) ~ 1
      which can lead to round-off errors. The round-off errors can then accumulate and cause problems.

      I'm not sure how interesting my answer was, but that's at least an example of what can happen.
      --
      The view was horrible and the smell was even worse; Julie severely regretted becoming a proctologist.
    7. Re:The Windowing Problem by stephentyrone · · Score: 1

      you might, but you'd be wrong. the inverse problem in question is ill-posed, and subject to a fairly high amount of numerical instability.

    8. Re:The Windowing Problem by Anonymous Coward · · Score: 0
      Here's some abstracts about other GPGPU techniques that could be relevant:
      GPGPU Image And Volume Processing

      ...Fourier volume rendering directly on the GPU. The paper presents a novel implementation of the Fast Fourier Transform: This Split-Stream-FFT maps the recursive structure of the FFT to the GPU in an efficient way. Additionally, high-quality resampling within the frequency domain is discussed.
      *** ...an energy functional is successively minimized in a variational setting. The gradient flow formulation makes use of a robust multi-scale regularization, an efficient multi-grid solver and an adaptive time-step control.
      *** ...fast GPU algorithm to perform the discrete wavelet transform featuring flexible boundary extension schemes, flexible wavelet kernels, Cg shader implementation, and high precision....The beauty of the method is that both forward and inverse wavelet transforms are unified using position-dependent filtering and convolution and an indirect addressing technique.
      *** ...details of implementing wavelet decomposition and reconstruction using graphics hardware, and develop a scaled version of wavelet analysis that constrains data to the [0,1] range of fixed-point frame buffers.
      ***
      Accelerating 3D Convolution using Graphics Hardware


      GPGPU scientific computing
      ...details and microbenchmarks the use of pairs of native precision values to obtain higher accuracy results using DSP, SWAR, and GPU hardware. It also dicusses a way to speculatively use lower precision, recomputing with higher precisions only when accuracy constraints are not met.
      *** ...describes a preliminary algorithm to achieve double precision results by adding a CPU-based defect correction to iterative linear system solvers on the GPU. We demonstrate that identical accuracy as compared to a full CPU double precision solver is possible while still gaining a factor of 2 in speedup compared to a highly tuned cache-aware CPU reference implementation in double precision.
      *** ...developed a library generator for graphics hardware, that can automatically generate high performance matrix multiplication with comparable performance to expert manually tuned version on various graphics hardware platforms.

        much more on numerical methods is also here.
  44. Re:Uhh.. [little correction] by faragon · · Score: 2, Insightful

    For both MP3, JPEG, and MPEG4 the transformation used is not a Fourier transform (not even TDFT/FFT), but DCT/IDCT ([inverse]discrete cosine transform). The reason for using DCT instead of the FFT (equivalent to the time Discrete Fourier Transform) is because the DCT is computationally cheaper than the FFT (about one half, in the fundamentals is really a mutilated DFT/FFT), and it provides enough information for the band discarding approaches used in lossy data compression.

  45. Modular squaring by Myria · · Score: 1

    FFTs are also useful for squaring large numbers with a modulo. That's what http://www.mersenne.org/ uses them for.

    Melissa

    --
    "Screw Sun, cross-platform will never work. Let's move on and steal the Java language." - Visual J++ Product Manager
  46. FP or SSE! by smilingcrow · · Score: 1

    The test is a 7900 GTX versus AMD & Intel systems both with Two dual-core CPUs; that's why the CPUs pricing is so high, as two Opteron or Xeon dual-core chips don't come cheap, although their pricing is out of date. I'm not clear on whether the test is comparing the FP units or the SSE units against the 7900! It's gotta be the SSE performance right? Someone was wondering why all this raw GPU power couldn't be used for video encoding etc; well ATI has released software that allows their recent GPUs to be used to accelerate video transcoding. About time there was a mainstream use for all this silicon.

  47. 32 bit not good enough by MonaLisa · · Score: 2, Interesting

    The sizes of transforms they are using for comparison here are of lengths of the order of 1 million points. This is huge for an FFT, and truncation error will definitely come into play here using only 32-bit precision. It all depends on what you are doing whether this will be adequate or not. Also, it's not at all clear what they did on the other platforms. There are some tricks to doing very long sequences; essentially using a 2D transform to perform a long 1D transform. It's not trivial, and requires some extra work, but generally a lot more efficient than taking a 1D transform and shoving a 4 million element transform into it. The inner loops of a 1D transform will eventually trash the cache for such a large transform, so using a blocked 2D transform avoids this, with some overhead of course. It's hard to tell what they are doing from the performance curves, since they report seconds, and it needs to be scaled by n log n to really see what's going on. It's cool they tried this, though. I was looking at using a GPU to do FFTs and linear algebra kernels a couple of years ago, but decided not to go there as I didn't think it would pay off; mostly because of the 32-bit precision.

  48. I wonder if this could work using MPI? by ShyGuy91284 · · Score: 1

    Someone in a parallel computing class I was taking did a parallel application using FFT to do processing. I wonder if they could incorporate this into MPI for parallel computations pretty easily?

    --
    In undeveloped countries, the consumer controls the market. In capitalist America, the market controls you.
  49. Compiling by drac0n1z · · Score: 1

    the code doesnt compile on my system, ubuntu breezy amd64. anyone got a patch for it?

    --
    This is my sig.
    1. Re:Compiling by Anonymous Coward · · Score: 0

      It's fixed in CVS.

  50. Pro Audio applications? by rekoil · · Score: 2, Interesting

    I'm wondering whether or not the DSP latency of these libraries is sufficient to use with real-time audio processing...if folks were to write RTAS/AU/VST plugins using the library, how they would compare to other hardware-assisted DSP solutions such as the PowerCore and Pro Tools farm cards. Then again, if you have to spend $500 on a card to get this goodness, it's hardly a bargain (albeit cheaper than the above products...)

    1. Re:Pro Audio applications? by bombshelter13 · · Score: 1

      It's being worked on, though I don't think it's using the specific library mentioned by the article. I'm not associated with those guys, but am certainly looking forwards to seeing the results of their work.

  51. Believe it or not.. by FunctionalMethod · · Score: 1

    .. for the last 2 hours I was in class listening about FFT. I swear to god they are out to get me.

    --
    -- TRUST ME! I KNOW WHAT I'M DOING!
    1. Re:Believe it or not.. by Vo0k · · Score: 1

      Oh, but FFT is cool!

      I really regret there's no easy tool to transform pics to FT version and back in gfx images. Transfer, cut out vertical compound of some low frequency and get rid of these stripes of cardboard the original was drawn on. I had a proggy to play with that would apply given mask to FFT version of image before transferring it back and it could do some nice miracles.

      And last but not least, (yes, that's not FFT but cosine transform, but not far...) I really want a hand-tuning JPEG tool, where you can pick each 8x8px block and set/clear its bits, to fill empty field with uniform color, clearing out noises, improve important edges, while using very high compression just for some interesting background effect, and get best quality with really extreme compression out of the ol'good Jpeg supported by everything.

      (JPEG creates the compression loss by zero'ing certain "insignificant" values-frequencies in the 8x8px fields, thus making further lossless compression more efficient. It should be perfectly possible to hand-pick which values are cleared instead of leaving the decision to the program by just setting amount of fields to clear as the "quality" slider.)

      --
      Anagram("United States of America") == "Dine out, taste a Mac, fries"
  52. What about the couch? by mangu · · Score: 1
    a $500 GPU, weighing less than 1 pound, can produce 6 gigaflops.


    Ah, but the modern GPU doesn't have its own built in couch! How many people can sit on a modern GPU?

    1. Re:What about the couch? by WhiteWolf666 · · Score: 1

      How many people can sit on a modern GPU?

      Possibly one, if you aren't concerned about a) cutting your butt, or b) sterility.

      Them things get hot.

      --
      WhiteWolf666 an exBush supporter. All you new-school,compassionate,save the children Republicans can rot in hell
  53. Errr, I don't want to sound skeptical... by jd · · Score: 1
    ...but FFTW may well be the fastest "all-round generic" FFT program (I don't know if that's true or not, though), but there are FFT packages which are faster - DJBFFT is one such program. Of course, there will undoubtably be even faster methods, as I'm certain there will be generalizations in places that could be divided into faster, more specific algorithms, and I don't believe DJBFFT has any hand-optimizations for processors past the Pentium Pro, suggesting it won't take advantage of any capabilities of multi-core CPUs, 64-bit CPUs or additional instructions.


    (Professor Bernstein, in an e-mail reply to a question, did mention he was going to update DJBFFT, but this does not appear to have happened. In all probability he lacks the time, but that doesn't solve the problem of needing high-power FFT software. FFTW, although slightly better on updates, isn't massively better. What is needed is for someone to start a project with the goal of writing code as fast as DJBFFT, as generic as FFTW, and as active as LinuxBIOS or the Linux Kernel in terms of getting new code out there.)

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    1. Re:Errr, I don't want to sound skeptical... by eh2o · · Score: 2, Informative

      djbfft apparently had an edge at some point, but now has not been updated for more than 5 years. meanwhile FFTW has incremented the major version number to 3, undergone a complete rewrite, added simd, multiprocessor, 64bit and a slew of other things (its obviously not a stagnant project). not to mention its the basis of the 'fft' function in matlab and thereby probably the most used fft implementation in the world. assuming their benchmarks (which now include accuracy as well as speed) are valid, Intel Performance Suite is probably their closest competitor.

  54. Assume monty pythin voice... by amavida · · Score: 1

    Thats nice 'init, coz de FFT's hav 'ad an 'ell ov a time ininday?

  55. Re:Uhh.. [little correction] by mrchaotica · · Score: 1

    Does that mean, if you didn't mind that the algorithm being more computationally expensive (say, if you were running it on your GPU, for example), that you could get even better compression than MPEG4 by using a FFT instead of a DCT?

    --

    "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

  56. Nope! by woolio · · Score: 3, Informative

    Sorry, the FFT of a time-domain signal does **NOT** indicate how the power (or energy) of the signal is distributed.

    For the latter, you need a PSD (power spectral density) plot, which is obtained by finding the square of the magnitude of the freq-domain FFT (complex) outputs.

    And the term "FFT" usually describes a specific class of algorithms that finds a Discrete Fourier Transform of a signal in much less than O(N^2) time, where N is the number of elements/samples considered.

    However, the FFT is also useful to perform fast polynomial multiplication (and even fast multiplication of very very very long numbers). This application has nothing to do with power or frequencies in a signal.

    1. Re:Nope! by MrShaggy · · Score: 1

      See, I learn something new all the time. See I had simply cut and paste the quote from google. Therefore not really knowing what fft stood for..I simply copied what was there.

      --
      I have mod points and I am not afraid to use them.
  57. So...? by Clete2 · · Score: 0

    So... if I can get this running on my Gentoo system, I can compile with my CPU and GPU using a sort of DistCC?

  58. Your error analysis is totally wrong by stevenj · · Score: 4, Informative
    The Cooley-Tukey FFT algorithm and its variants, which is what they are using, has much better error characteristics than you think.

    In floating-point arithmetic, the algorithm was proved in 1966 to have an upper bound for the error that grows only as O(log N), and the mean (rms) error grows only as O(log N). (See this page for more info.) (Errors in fixed-point arithmetic are worse, going as N.)

    Even in single precision, the errors for their FFT sizes are probably quite reasonable, assuming they haven't done something silly like use an unstable trigonometric recurrence.

    --
    If a thing is not diminished by being shared, it is not rightly owned if it is only owned & not shared. S. Augustine
    1. Re:Your error analysis is totally wrong by Anonymous Coward · · Score: 0

      An excellent point. What makes the error estimates even more interesting is that, depending on your application, there are steps that can be taken to reduce the overall error introduced during the FFT. The steps taken vary depending on the purpose of the transformation, and they aren't available for every application, but they're still out there.

      For instance, FFTs are used in large-integer multiplication (especially when the numbers grow to the million-digit range). The idea is to break the numbers down into series of digits (where a "digit" can be any value between, say, 0 and 65536), compute FFT of each series of digits, perform an element-wise multiplication, compute the inverse Fourier transform of the result, and do some post-processing to reassemble the number. It sounds complex (and the devil is always in the details), but it turns out to be just about the fastest known method for multiplying (very) large numbers.

      Obviously, floating-point errors in an application like this are just not acceptable. One incorrect bit, and the entire result is completely invalidated. So a lot of different steps can be taken to reduce error.

      First of all, the digit size is, thankfully, somewhat adjustable. There's a speed/accuracy trade-off that can be made with the digit size. If you're looking to make sure that the multiplication won't have any induced errors, reduce the digit size-- you'll have a larger transform (and mulitplication will accordingly take longer), but it is sometimes possible to be assured that the error is never enough to influence the results.

      Another step can be taken is the use of a "balanced representation". Instead of representing the numbers to be multiplied as a series of digits in the range [0, 65536), try representing them in the range [-32768, 32768). It sounds weird, and it's kind of counterintuitive, but by centering the values around 0, the error level is reduced. And when your data sets grow by powers of 2 (as many, but certainly not all, FFT implementations require), that can make a lot of difference.

      One final step that can be taken is the selection of the transform method used. There are variants of the FFT that work exclusively on real numbers-- some of which are suitable for use in multiplication algorithms-- that might have favorable error limits. Or, it is possible to "fold" the series of real-value digits into a series of complex values of half the length. A shorter signal means that fewer steps are needed to compute the FFT, which means less error.

      So, yeah. Depending on your application, it's quite possible to avoid those worst-case scenario error bounds and actually wind up with a much more manageable level of error.

    2. Re:Your error analysis is totally wrong by fatphil · · Score: 1

      Of course, you could ditch FP, and use a NTT (Number-Theoretic Transform), where the limbs are integers modulo some suitably-chosen prime and not atomic (don't quote me on this, but a number like 2^96-2^32+1 is sometimes used). The FPU unit can still be used to compute integer maths, of course, by restricting the range of the numbers used. Modular reduction with such a prime is simply a task of moving numbers between columns, effectively free.

      I have invented a hybrid technique which permits unusually large bases for the limbs (I think I reached 7*10^12), at very little extra cost. I've yet to publish a paper describing it, but ought to. My prime-hunting projects, PIES and Les GeneFermiers, use this algorithm extensively, and are quite successful.

      However, at the end of the day good old-fashioned complex FP FFTs seem to be fastest implementations, but I think that's mostly due to CPU manufacturers dedicating more transistors to such operations. 20 years ago, with only an external 8087 FPU, I'm sure it wouldn't have been the case.

      FatPhil

      --
      Also FatPhil on SoylentNews, id 863
    3. Re:Your error analysis is totally wrong by edp · · Score: 1
      "In floating-point arithmetic, the algorithm was proved in 1966..."

      "Proof" is nice, but my comments are based on actual measurement. I have code that performs single- and double-precision FFTs on eight million numbers randomly selected from a uniform distribution over [0, 1). (The test program uses simple radix-2 code just for study, not anything you'd use in high performance.)

      In a typical run, I observe an average error of 2^-12.5 (using double as a reference for error in single). In three runs, I observe a maximum error of 2^-2.28, 2^-2.90, and 2^-2.74. Compare that to, say, 1024 elements, where I observed an average of 2^-19.5 and maximums around 2^-15.5. If the maximum error were proportional to log N and the 1024-element case were representative, we'd extrapolate around 23/10*2^-15.5 for the eight-million-element case, which is 2^-14.3.

      If you cite an easily accessible paper, I'll take a look at it.

      "(See this page for more info.)"

      That page uses an L[2] norm to determine one error for the entire vector, and it is a relative error. I discussed absolute errors in individual elements (real and imaginary separately). It also indicates the bounds on growth apply to the L[2] norm. I do not see a comment on growth with the L[1] or L[infinity] norms, which would correspond to the average error in the elements and the maximum error in the elements.

      Note also that their error is a relative error. They take the norm of the error vector and divide it by the norm of the correct vector. The norm of the correct vector is proportional to N. If the relative error is proportional to log(N) and you multiply it by N, you get log(N)*N, just what I stated for the absolute errors.

      That page also used a distribution over [-.5, .5). I changed my code to use that and saw a change in absolute scale of the error but not the order of growth.

    4. Re:Your error analysis is totally wrong by stevenj · · Score: 1
      The error analysis was published by Gentleman and Sande; the reference can be found on the link I gave previously (click on the "commentary" link).

      Although it is not plotted, the same page has links to the complete datasets which include L-infinity errors. For example, on a POWER5 with gcc and FFTW 3.1 in single precision, the L-inf relative forward error for size 16 is 6.68e-8, whereas for size 262144=2^18 it is 3.6e-7. This is an increase by a factor of 5.4, which is close enough to the prediction of 4.5 from the O(log N) error bound.

      Perhaps you have something wrong with your code. For example, if you use an trigonometric recurrence, your error will likely have worse scaling (depending on which recurrence you use), as explained in the commentary page mentioned above.

      --
      If a thing is not diminished by being shared, it is not rightly owned if it is only owned & not shared. S. Augustine
    5. Re:Your error analysis is totally wrong by stevenj · · Score: 1

      Sorry, I missed your comment about the absolute error. The L2 norm of the data for an unnormalized DFT scales like sqrt(N), not like N, so your N log N is still not correct. In any case, the absolute error is totally meaningless: if you multiply all of the data points by 16, the error doesn't get worse by 16 (in floating-point, anyway)

      --
      If a thing is not diminished by being shared, it is not rightly owned if it is only owned & not shared. S. Augustine
  59. GPGPU Compression? by Doc+Ruby · · Score: 1

    Who's got the GPGPU SW that does MP3 or VoIP codec compression on the GPU HW?

    --

    --
    make install -not war

  60. erratum by stevenj · · Score: 3, Informative

    That should be O(sqrt(log N)) for the rms error and O(sqrt(N)) for fixed point. (My sqrt symbols didn't post properly somehow)

    --
    If a thing is not diminished by being shared, it is not rightly owned if it is only owned & not shared. S. Augustine
    1. Re:erratum by Anonymous Coward · · Score: 0

      please, don't post accurate analysis on slashdot.. you're not wanted here.

  61. Uhh.. Yeah... by Sentri · · Score: 1

    We're putting High Performance FFT's on all of our TPS reports now before they go out, Did you get that memo?

    So if you could go ahead and do that for me from now on, that'd be great

    --
    Can't we all just get along
  62. There are ways to get precision and speed by Savantissimo · · Score: 1
    While the GP poster and many other comments have said that the 32 bits of precision on GPUs are not enough, there are ways around this limit for those applications that need higher precision while still maintaining increased speed.
    Accelerating Double Precision FEM Simulations with GPUs by Dominik Goddeke Proceedings of the 18th Symposium of Simulation Technique, Erlangen, Germany, September 2005:

    Double arithmetics can be emulated with single floats, also on GPUs. But such emulations increase more than tenfold the operation count. This is only acceptable in otherwise bandwidth bound operations.

    In this paper we, therefore, propose the revitalization of mixed precision defect correction approaches that have been known for almost 100 years: By iteratively computing residuals of a single precision approximate solution to a linear system in double precision, only few correction steps suffice to reduce approximation errors close to machine accuracy. In the context of scientific computing using GPUs, this approach translates to correcting a GPU result with just a few CPU-based iterations. In this way we obtain the full accuracy of CPUs with the high speed of GPUs.


    --
    "Is life so dear, or peace so sweet, as to be purchased at the price of chains and slavery?" - Patrick Henry
  63. Re:Uhh.. [little correction] by pyite · · Score: 1

    First, we should probably clarify something. The output of the FFT is the output of a DFT (Discrete Fourier Transform) for all intents and purposes. The FFT is a fast way of computing those values by using the fact that when you actually compute a DFT, you're recalculating a lot of the same stuff over and over due to the fact that the Fourier matrix is a circulant matrix. Now that that's out of the way, it should be said that the DCT is a specific case of the DFT when the signal is real and even. See my explanation for that here. No, you probably wouldn't have any advantage using a generic DFT instead of a DCT. First of all, you don't care about complex signals for audio and images and second of all, I forget my second point ;-)

    --

    "Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman

  64. IF by Mark_MF-WN · · Score: 2, Funny

    What is newsworthy is that this is a shameless attempt to secularize mathematics. It's right in the name -- Fast Fourier Transformation. That's idolatry. What can a man know about signals that God hasn't already made clear in the Word? Come to our website, and you can learn all about Intelligent Factoring, which is on much sounder mathematical grounds because it develops entirely from biblical principles.

  65. Ratio by Mark_MF-WN · · Score: 1
    Well, if you had done a BSc in math, physics, Compsci (at some schools), or Engineering, and had worked your way through the nitty gritty of the theory behind the FFT, wouldn't you be hungry for the chance to finally put it to some use? Not a particularly important use, but still. For every person who actually works with the FFT, there are a hundred more who studied it and shortened their life by missing sleep so that they could study for the quiz on it.

    I bet you'd see the same thing if there were an article about mergesort (mergesort for disk-drives and tape-drives is a "fascinating" subject), or about ... whatever it is that chemists study.

  66. I'll agree by jd · · Score: 1
    ...that FFTW is the most widely used FFT package out there. I do seem to recall an @Home project (Messiene Primes? Can't remember) that uses their own FFT code, because of problems with FFTW on specific applications.


    You are correct that FFTW has undergone some massive reworking and is currently at 3.1.1 (although Fedora Core 5 is only on 3.1). What would be wonderful would be for FFTW to add in hooks for PAPI (a powerful profilling library) to establish how fast the code actually is and where improvements could be made. It would both help the developers figure out what needs to be accelerated, and help convince those of us who have become skeptical of performance claims that there is something to cheer about.


    (I hate being cynical; it's not in my nature. It is, however, the natural byproduct of seeing too many bogus performance claims and outright fraudulant benchmarks. I don't much care WHO produces genuinely useful data or WHO has the ultimate in FFT technology; what I do care about is having a way to know WHICH data and WHICH technology is top of the heap. Hell, if independent reviewers actually got paid, I'd add the damn hooks myself, run the tests and publish the results on an advert-free site.)

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  67. Bad drivers by DrYak · · Score: 1
    The quote about bad drivers wasn't mine, but was from TFA : they *do indeed* notice a performance hit between GPFFTW under Windows and under Linux with nVidia's BLOB.

    But I think that GFX under linux suck because :
    - Companies usually release proprietary binary-only drivers.
    - They only release drivers for a couple of architecture (mostly x86 and AMD64).
    - some company don't put as much effort in their Linux prorietary drivers as in the Windows versions (ATI is most notable)
    - they mostly don't colaborate with DRI.
    - e.g.: ATI Radeon r300 driver (9500 to X850) had to be reverse engeneered.
    - Intel i8xx and i9xx seem to be the only recent chipset with continued DRI support and that's a low-end embeded chipset (ATI is only available up to r200 / 9250 later, Matrox only up to G550, nVidia is proprietary-only, 3DFX company is dead)

    For most GFX card company, supporting Linux is just ticking a box on a list and putting minimal effort.

    So, what should you do if you want to use the Radeon 9600 on your PowerPC (mac) laptop ?
    Or plug a GFX card on a Sparc based station ?
    Sit down and cry ?

    try using them sometime rather than listening to your tech gods at the local geek squad....

    I have no intention to put money into pockets of companies that refuse to support opensource.
    But I do run a (refurbished) Radeon9600XT using the open-source DRI reverse-engeneered driver at work.
    At home I keep an old 3DFX. It's still enough for me and has opensource drivers.

    Having to wait for months before getting a AMD64 version of fxgl is something I experienced personnally.
    The local squad of tech gods use command line interface.
    --
    "Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
  68. CVS by drac0n1z · · Score: 1

    where can i find the CVS for this project?

    --
    This is my sig.
  69. GPU instead of APU? by NuShrike · · Score: 1

    Figures that it takes a graphics company to do some audio company's hardware job better. Just can't expect a dedicated and useful FFT chip on available APUs nowadays especially with the lack of real competition and creativity.

  70. Interview with Jim Gray by doom · · Score: 1

    An interview with Jim Gray, in which he talks about this subject (he apparently helped out the North Carolina group with some of this work): "Deconstructing databases with Jim Gray".