Impressive Benchmarks: Sorting with a GPU
An anonymous reader writes "The Graphics research group at the University of North Carolina at Chapel Hill has posted some interesting benchmarks for a sorting implementation which is done entirely on a GPU. There have been efforts on doing general purpose computation on GPUs before (previous Slashdot article). However, most of them had generally utilized the fragment processing pipeline of the GPUs which is slower then the default high speed rendering pipeline. Apparently, the above implementation is done using "simple texture mapping operations" and "cache efficient memory accesses" only. There also seems to an option to download the distribution for non-commercial use, though the requirements seem pretty hefty (a very decent nVidia graphics card and the latest nVidia drivers)."
~~~
Who the fuck has enough data to sort on a regular basis that they'd need hardware-assisted sorting?
Perhaps you've heard of, I dunno.. Google, or Oracle?
I'd love to see Judy-style thinking applied to GPU problems..."
..
especially since i use Judy arrays for tons of things on two different architectures, and its just a darn efficient hash library for pretty much all of my needs
; -- the corruption of government starts with its secrets. a truly free people keep no secrets. --
Sorts are very common in applications and _can_ be slow. In a future where everyone has a GPU this can effectivly serve as a dual processor setup. Just as the FPU helped us 10 years ago with floating point operations.
the fragment processing pipeline of the GPUs which is slower then the default high speed rendering pipeline
For the past two generations or so (starting with the Radeon 9800), there has been no such thing as "the default high speed rendering pipeline". The only circuitry present in the chip has been for evaluating shaders, and the fixed-function pipeline has been implemented as "shaders" that the driver runs on the chip automatically.
At least, I know for a fact this is true of ATI chips, and would not be at all surprised if nVidia is doing something similar.
Apparantly, the above implementaion is done using "simple texture mapping operations" and "cache efficient memory accesses" only.
Fry: Magic. Got it.
First, congratulations to J. Ruby who pioneered this work and is mentioned throughout the article. I work with him, and on his desk are _already_ four machines with dual GeForce 7800 in SLI mode. Talk about lust factor.
Ruby's first published work was the SETI@HOME modified client which uses Nvidia (or) ATI GPU for the waveform FFT calculations. I have watched him steadily upgrade his Nvidia GPU up to this wicked 7800 arrangement he is using today.
Go Jim! You owe me a beer.
I probably don't know what I'm talking about, but I'm wondering....
What is the performance if the GPU is busy rendering when you play a game?
When the GPU is busy doing what it is supposed to do... a program should resort to qsort right?
...that the greatest threat to Intel's market domination in the future is not going to come from AMD but from a company such as Nvidia.
Take a look at what they are doing with their GPUs right now and you can understand why someone would suggest this.
- Toby
I really hope they are not using the C-library implementation of qsort in those timing comparisons...
It:
a) Makes a call to a function, via a function pointer, for each comparison.
b) Uses a variable element size
Both of these things will slow down the sort a lot, compared to a specialized implementation that only sorts 32-bit integers.
Indeed, qsort is known to be slow. See:
http://theory.stanford.edu/~amitp/rants/c++-vs-c/
A comparison with the much faster STL sort should be interesting.
Anyone who generalizes about slashdotters is a typical slashdotter.
Presumably though the algorithm they used in GPUsort can be made to work on a Pentium IV
Not necessarily...
Their use of the GPU to sort might very well run something along the lines of assigning Z-coordinates based on the key values, and colors based on a simple index , then asking the GPU to "show" the "pixels" in Z-order, then just read the "real" data of any arbitrary size and type in the order specified by the returned colors/indices. That would perform a sort using the GPU, very very rapidly, but you can't really translate it to run on a CPU - Sure, you could write code to fake it, but at the lowest level, you'd end up using something like a quicksort, rather than dedicated hardware, to emulate the desired behavior.
Now, admittedly, I don't know that the method under consideration used such an approach. But it appears they at least took the approach of using the GPU for its strong points, rather than trying to force it to act as a general-purpose CPU.
As for the choice of Quicksort - Most likely, they chose it because just about every C library out there has an implementation of quicksort. And while personally I prefer heapsort (in the worst case, quicksort has Q*O(n^2) behavior, while heapsort always takes only P*O(n log n), But P >> Q), I'll admit that for almost all unstructured input sets, quicksort finishes quite a lot faster than anything else.
I've heard about doing GPU calculations (as per the implementation being non-IEEE)? Doubly, would someone in the know help explain what the aforementioned "weirdness" means?
There are just some functions missing, like different roundings and the missing double precision. GPUs are simply not optimized for scientific calculations, but that doesn't mean that they can't be programmed to be useful for those workloads.
Presumably though the algorithm they used in GPUsort can be made to work on a Pentium IV.
Yes, but the algorithm won't be anything special. It won't be a better algorithm than qsort and definitely not more efficient than O(n log n) comparisons. What is special is that it runs on a GPU.
they should have compared GPUsort on the CPU as well
And how exactly were they supposed to do that?!? GPUsort has been programmed to run on a GPU, and even if you don't know the first thing about computers, the G should suggest that GPU is very different from a CPU.
One can prove that no sorting algorithm using binary comparisons can do better than use O(n log n) comparisons. Hence GPUsort couldn't have been asymptotically more efficient that qsort.
If you think about it, the standart qsort implementation is definitely more optimised than most algorithms out there; it has been around for very long. But none of this matters; GPUsort can't run on a CPU.
Well, to be fair GPUs are only "so much more powerful than CPUs" if your task is suitable for running on a GPU. If not, then you're better off using the CPU.
Kind of like how a bulldozer is much more powerful than a hammer, but totally unsuitable to banging a nail in a bit of wood. If you want something torn down or moved about, though...
It's official. Most of you are morons.
It is true. The GPU floating point implementations are not IEEE compliant. There is an interesting study on the floating point behavoir of the GPU here (and the associated pdf)