Sorting Algorithm Breaks Giga-Sort Barrier, With GPUs
An anonymous reader writes "Researchers at the University of Virginia have recently open sourced an algorithm capable of sorting at a rate of one billion (integer) keys per second using a GPU. Although GPUs are often assumed to be poorly suited for algorithms like sorting, their results are several times faster than the best known CPU-based sorting implementations."
I find it very disappointing that a group of programmer/computer science types who even supply BibTeX to make it easier to reference their work, resort to screen-capturing an Excel chart to display their data.
This isn't a "barrier" like the "sound barrier". There are no special difficulties that start around 1G/sec! It's just a threshold.
Don't get me wrong -- I'm not saying this isn't impressive, but no "barrier" was broken here!
Algorithms aren't measured in "x per second"... only implementations are measured that way. The speed of an algorithm is described in big-O notation, such as O(n log n). The metric of "sorted keys per second" is largely useless, because it depends on the particular hardware setup.
The problem with big-oh notation is that the constant isn't explicit, so for any given n (pick as large as desired), it is possible that O(nlogn) < O(n) for some choice of constants. That's why ops per second is still a useful metric when comparing implementations on standardized hardware.
As always, in theory there's no difference between theory and practice, but in practice there is...
The typical sorting problem I've encountered in my career (various types of scientific, telecommunications and business software, though not games) involves an array of pointers to fixed length records that have a sort key (let's say an integer) at a predefined offset. Not an array of integers, nor an array of directly embedded small fixed length records which I'm guessing was used in TFA. The former situation requires random as well as stream access to memory, which would likely favor processing by the CPU in the motherboard of a typical $1000-$2000 PC.
Unfortunately I can't mod having already posted in this thread, but please allow me to /bow. This is the best explanation I've ever read anywhere for the differences. Even I knew the differences but couldn't have expressed it so finely. Bravo.
"The true measure of a person is how they act when they know they won't get caught." - DSRilk
Well, yeah, they're not claiming they invented radix sort. They're claiming that their GPU implementation of radix sort runs about 4x as fast as the CPU implementation you describe.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
It's generally not size of RAM that breaks radix sort; it's the size of cache. Modern processors are highly reliant on cache, which means they're highly reliant on things in memory being in small tight chunks that fit in cache - because cache misses are expensive enough that if you thrash cache badly enough, you may end up running slower than if you hadn't had any cache at all.
Good comparison sorts may start fragmented, but by their very nature each pass of the algorithm makes them less so. Radix sort is the other way around; it follows pointers (so more precious scarce cache in use already) that point in more and more fragmented patterns with every pass. That's why even though radix sort's average speed is theoretically faster than quicksort, quicksort still wins on real life hardware. And that's probably why radix sort wins on GPUs - the data fits in the card's dedicated memory, which is already optimized to be accessed in a much more parallel way than main memory.