Slashdot Mirror


Impressive Benchmarks: Sorting with a GPU

An anonymous reader writes "The Graphics research group at the University of North Carolina at Chapel Hill has posted some interesting benchmarks for a sorting implementation which is done entirely on a GPU. There have been efforts on doing general purpose computation on GPUs before (previous Slashdot article). However, most of them had generally utilized the fragment processing pipeline of the GPUs which is slower then the default high speed rendering pipeline. Apparently, the above implementation is done using "simple texture mapping operations" and "cache efficient memory accesses" only. There also seems to an option to download the distribution for non-commercial use, though the requirements seem pretty hefty (a very decent nVidia graphics card and the latest nVidia drivers)."

7 of 222 comments (clear)

  1. Not accurate by Have+Blue · · Score: 4, Informative

    the fragment processing pipeline of the GPUs which is slower then the default high speed rendering pipeline

    For the past two generations or so (starting with the Radeon 9800), there has been no such thing as "the default high speed rendering pipeline". The only circuitry present in the chip has been for evaluating shaders, and the fixed-function pipeline has been implemented as "shaders" that the driver runs on the chip automatically.

    At least, I know for a fact this is true of ATI chips, and would not be at all surprised if nVidia is doing something similar.

    1. Re:Not accurate by daVinci1980 · · Score: 3, Informative

      Your comment doesn't really make sense, and seems to convey a lack of understanding of how graphics accelerators work. (This isn't a dig, they're incredibly complicated.)

      First, as the GP said, there is no fixed function pipeline on modern GPUs. When you submit a primitive (or a batch of primitives) with fixed-function functionality, the driver converts the current fixed function operations into appropriate shader programs and sends it on down the pipe.

      Secondly, the "fixed function pipeline" for vertex processing is ludicrously simple. There's actually really nothing that's done, save multiplying the vertices by the appropriate matrix (world * view * proj, in the usual case). The interesting work, and the work that's really done by the GPU is in the fragment processor. That's where the overwhelming majority of fixed function operations are actually performed.

      However, it concerns me that you talk about writing programs that try to be as general as the fixed function pipeline. Due to the nature of fragment processors, it's phenomenally expensive to branch. And even if you were to write a general-case implementation of the fixed function pipeline that didn't branch, it would contain so many instructions as to totally hammer your performance.

      A rule of thumb in the fragment processor is that you should have no more than ~40 instructions per fragment for a full screen fill. (For pre-7800 hardware).

      --
      I currently have no clever signature witicism to add here.
  2. True GPU Genius: J. Ruby by Anonymous Coward · · Score: 3, Informative

    First, congratulations to J. Ruby who pioneered this work and is mentioned throughout the article. I work with him, and on his desk are _already_ four machines with dual GeForce 7800 in SLI mode. Talk about lust factor.

    Ruby's first published work was the SETI@HOME modified client which uses Nvidia (or) ATI GPU for the waveform FFT calculations. I have watched him steadily upgrade his Nvidia GPU up to this wicked 7800 arrangement he is using today.

    Go Jim! You owe me a beer.

  3. Re:What about other sorts? by top_down · · Score: 3, Informative

    Indeed, qsort is known to be slow. See:

    http://theory.stanford.edu/~amitp/rants/c++-vs-c/

    A comparison with the much faster STL sort should be interesting.

    --
    Anyone who generalizes about slashdotters is a typical slashdotter.
  4. Re:Is GPU memory swapped? by SlightOverdose · · Score: 3, Informative

    As far as I know, the Linux kernel allows applications to have their pages marked as non-swappable. I presume most mainstream OS's also do this.

  5. Re:What about other sorts? by Shisha · · Score: 4, Informative

    Presumably though the algorithm they used in GPUsort can be made to work on a Pentium IV.
    Yes, but the algorithm won't be anything special. It won't be a better algorithm than qsort and definitely not more efficient than O(n log n) comparisons. What is special is that it runs on a GPU.

    they should have compared GPUsort on the CPU as well

    And how exactly were they supposed to do that?!? GPUsort has been programmed to run on a GPU, and even if you don't know the first thing about computers, the G should suggest that GPU is very different from a CPU.

    One can prove that no sorting algorithm using binary comparisons can do better than use O(n log n) comparisons. Hence GPUsort couldn't have been asymptotically more efficient that qsort.

    If you think about it, the standart qsort implementation is definitely more optimised than most algorithms out there; it has been around for very long. But none of this matters; GPUsort can't run on a CPU.

  6. Re:GPUs, and Floating Point Numbers General Questi by graphicsguy · · Score: 3, Informative

    It is true. The GPU floating point implementations are not IEEE compliant. There is an interesting study on the floating point behavoir of the GPU here (and the associated pdf)