Impressive Benchmarks: Sorting with a GPU

← Back to Stories (view on slashdot.org)

Impressive Benchmarks: Sorting with a GPU

Posted by timothy on Wednesday June 29, 2005 @12:20AM from the but-that's-for-graphics dept.

An anonymous reader writes "The Graphics research group at the University of North Carolina at Chapel Hill has posted some interesting benchmarks for a sorting implementation which is done entirely on a GPU. There have been efforts on doing general purpose computation on GPUs before (previous Slashdot article). However, most of them had generally utilized the fragment processing pipeline of the GPUs which is slower then the default high speed rendering pipeline. Apparently, the above implementation is done using "simple texture mapping operations" and "cache efficient memory accesses" only. There also seems to an option to download the distribution for non-commercial use, though the requirements seem pretty hefty (a very decent nVidia graphics card and the latest nVidia drivers)."

18 of 222 comments (clear)

Min score:

Reason:

Sort:

Re:What about other sorts? by Defiler · 2005-06-29 00:31 · Score: 2, Informative

qsort() is a very well-understood algorithm that has been highly optimized.
Not including it in the benchmarks would have been a sign that some smoke and mirrors were involved. If the MSVC++ library had anything faster, people would be using it.
Not accurate by Have+Blue · 2005-06-29 00:31 · Score: 4, Informative

the fragment processing pipeline of the GPUs which is slower then the default high speed rendering pipeline

For the past two generations or so (starting with the Radeon 9800), there has been no such thing as "the default high speed rendering pipeline". The only circuitry present in the chip has been for evaluating shaders, and the fixed-function pipeline has been implemented as "shaders" that the driver runs on the chip automatically.

At least, I know for a fact this is true of ATI chips, and would not be at all surprised if nVidia is doing something similar.
1. Re:Not accurate by daVinci1980 · 2005-06-29 04:01 · Score: 3, Informative
  
  Your comment doesn't really make sense, and seems to convey a lack of understanding of how graphics accelerators work. (This isn't a dig, they're incredibly complicated.)
  
  First, as the GP said, there is no fixed function pipeline on modern GPUs. When you submit a primitive (or a batch of primitives) with fixed-function functionality, the driver converts the current fixed function operations into appropriate shader programs and sends it on down the pipe.
  
  Secondly, the "fixed function pipeline" for vertex processing is ludicrously simple. There's actually really nothing that's done, save multiplying the vertices by the appropriate matrix (world * view * proj, in the usual case). The interesting work, and the work that's really done by the GPU is in the fragment processor. That's where the overwhelming majority of fixed function operations are actually performed.
  
  However, it concerns me that you talk about writing programs that try to be as general as the fixed function pipeline. Due to the nature of fragment processors, it's phenomenally expensive to branch. And even if you were to write a general-case implementation of the fixed function pipeline that didn't branch, it would contain so many instructions as to totally hammer your performance.
  
  A rule of thumb in the fragment processor is that you should have no more than ~40 instructions per fragment for a full screen fill. (For pre-7800 hardware).
  
  --
  I currently have no clever signature witicism to add here.
True GPU Genius: J. Ruby by Anonymous Coward · 2005-06-29 00:34 · Score: 3, Informative

First, congratulations to J. Ruby who pioneered this work and is mentioned throughout the article. I work with him, and on his desk are _already_ four machines with dual GeForce 7800 in SLI mode. Talk about lust factor.

Ruby's first published work was the SETI@HOME modified client which uses Nvidia (or) ATI GPU for the waveform FFT calculations. I have watched him steadily upgrade his Nvidia GPU up to this wicked 7800 arrangement he is using today.

Go Jim! You owe me a beer.
Re:What about other sorts? by top_down · 2005-06-29 00:51 · Score: 3, Informative

Indeed, qsort is known to be slow. See:

http://theory.stanford.edu/~amitp/rants/c++-vs-c/

A comparison with the much faster STL sort should be interesting.

--
Anyone who generalizes about slashdotters is a typical slashdotter.
Re:Is GPU memory swapped? by SlightOverdose · 2005-06-29 01:05 · Score: 3, Informative

As far as I know, the Linux kernel allows applications to have their pages marked as non-swappable. I presume most mainstream OS's also do this.
Re:What about other sorts? by Shisha · 2005-06-29 01:05 · Score: 4, Informative

Presumably though the algorithm they used in GPUsort can be made to work on a Pentium IV.
Yes, but the algorithm won't be anything special. It won't be a better algorithm than qsort and definitely not more efficient than O(n log n) comparisons. What is special is that it runs on a GPU.

they should have compared GPUsort on the CPU as well

And how exactly were they supposed to do that?!? GPUsort has been programmed to run on a GPU, and even if you don't know the first thing about computers, the G should suggest that GPU is very different from a CPU.

One can prove that no sorting algorithm using binary comparisons can do better than use O(n log n) comparisons. Hence GPUsort couldn't have been asymptotically more efficient that qsort.

If you think about it, the standart qsort implementation is definitely more optimised than most algorithms out there; it has been around for very long. But none of this matters; GPUsort can't run on a CPU.
Re:GPUs, and Floating Point Numbers General Questi by graphicsguy · 2005-06-29 01:43 · Score: 3, Informative

It is true. The GPU floating point implementations are not IEEE compliant. There is an interesting study on the floating point behavoir of the GPU here (and the associated pdf)
The algorithm is called "Bitonic sort" by lonedroid · 2005-06-29 01:46 · Score: 2, Informative

This is an implementation of the Bitonic sort.

From the article, when comparing their sort to previous GPU sort: "These algorithms also implement Bitonic Sort on the GPU for 16/32-bit floats using the programmable pipeline of GPUs."

So as I understand it they made a very fast implementation (using the GPU) of an old algorithm suited to parallel processing: bitonic sort was published in 1968 (hey, where were the fast parallel processors in the late sixties ;)
Why qsort? by ringm000 · 2005-06-29 01:59 · Score: 2, Informative

Comparing with C qsort is strange, qsort will never work fast due to being unable to inline the comparison function. Hand-written qsort of floats should work much faster.
Quicksort versus HeapSort by tjwhaynes · 2005-06-29 03:35 · Score: 2, Informative

in the worst case, quicksort has Q*O(n^2) behaviour
Quicksort exhibits n-squared behaviour when the data set is sorted but reversed. Most decent quicksort routines do a random shuffle of the data at the start to avoid this issue.

Choosing a sort for a particular situation is very much a matter of choosing. A shell sort is very fast for small data sets - it's quick to implement and follows C*O(n^1.27), where C is small compared to P or Q in your example. With mostly sorted data sets, the choice to sort algorithm is even more tuning sensitive - consider depth-sorting vertices in a rotating model - most of the data isn't moving around very fast and is exchanging a few places up and down the list. Sometimes the more trivial and naive sort works better than quicksort even for large mostly-sorted sets.

Cheers,
Toby Haynes

--
Anything I post is strictly my own thoughts and doesn't necessarily have anything to do with the opinions of IBM.
Re:Why nVidia? by Anonymous Coward · 2005-06-29 03:51 · Score: 1, Informative

Check out the data precision. ATI systems are limited to 24 bits but nVidia implements 32 bits (per the IEEE spec).

It's no surprise that the ATI GPU is faster when it processes less data.
Re:Radix is O(n)? Nice try by cain · 2005-06-29 03:51 · Score: 2, Informative

The number of passes in radix sort is not dependent on the number of items in the list, n. Therefore the number of passes is not log n. The number of passes in radix sort in based on the length of the key being sorted. And for most data is a finite number, k. Therefore the complexity of radix sort is O(kn), where k is finite and not related to n at all, and thus the complexity is O(n). I think your rationale assumes that the length of the key you are sorting can be infinite.
Re:very nice by graphicsguy · 2005-06-29 04:35 · Score: 2, Informative

Sorting is not the problem to be solved. It is a piece of the solution to MANY problems. So for lots of problems you would like to solve on the GPU, you need to do sorting, and doing it on the GPU avoids constantly moving the data back and forth between the CPU and GPU.
Saskatchewan by Anonymous Coward · 2005-06-29 05:15 · Score: 1, Informative

GPUsorting by Callele Neufeld & DeLathouwer at U of Saskatchewan - paper, videos & 2003 PPT.
Re:What about other sorts? by pooly7 · 2005-06-29 05:18 · Score: 2, Informative

I don't know why the parent comment is mod to "Funny" because this is actually _totally_ true. http://www.rt.com/man/qsort.3.html
Re:Let's get an Nvidia fanboy response. by default+luser · 2005-06-29 06:25 · Score: 1, Informative

But when you look at it, most of the time they are talking about +20% on framerates that are already twice the refresh rate. It's completely artificial - in the FPS benchmark they turn off the Vsync to be a more informative test, but not in the game. In the game, the drawing code will wait for a Vsync before drawing a new frame, so both cards will end up rendering the same frame rate when you actually play the game.

Not true. Both Nvidia and ATIs' driver sets default profiles have force v-sync disabled. Most games don't have it enabled by default, and few offer an option to enable it.

What's the difference? With v-sync enabled, anytime the video card needs longer than one sync period to draw a frame, the video card displays the same frame as the last one. This, to the end user, looks like your framerate literally dropped in half, and feels jerky because it usually happens intermittently.

With v-sync disabled, the video card always draws the next frame, even if the drawing is still in progress. This means if you suddenly have to take a little longer to render a frame, you will get a torn frame, but at least the update rate to the screen will be consistent.

So, why do people want higher framerates than their monitor's refresh rate can handle?

* If you're using v-sync, you want to make sure the minimum framerate never falls much below you monitor's refresh rate, or risk the display feeling randomly sluggish during frames with significantly higher processing requirements.

* If you have v-sync disabled, the less often the minimum framerate drops below the refresh rate of your monitor, the less tearing you encounter. Of course, tearing is not as noticeable as you would think. You tend to notice tearing only on vertical polygon edges, and then only when the polygon has moved significantly in your view between frames. That's why this is typically the default in most games...it SOUNDS like it should look bad, but it rarely does.

It's basically a user preference: inconsistent sluggishness, or inconsistent bouts of tearing, but they both benefit from having an AVERAGE framerate much higher than the 85Hz refresh rate of your monitor.

--
Man is the animal that laughs.
And occasionally whores for Karma.
Re:What about other sorts? by AuMatar · 2005-06-29 06:31 · Score: 2, Informative

Of course, they require a lot of memory- buckt requires an array of integers of size maxvalue-minvalue, where max and min value are the biggest numbers being sorted. To sort all uint32 will take 2^32 indexes into the array, or at least 16 GB with 32 bit array indixes. Not practicle for random data, although if your data is limited to a range (say 1-1000) its a nice sort method, and fairly simple to write.

--
I still have more fans than freaks. WTF is wrong with you people?