Slashdot Mirror


Impressive Benchmarks: Sorting with a GPU

An anonymous reader writes "The Graphics research group at the University of North Carolina at Chapel Hill has posted some interesting benchmarks for a sorting implementation which is done entirely on a GPU. There have been efforts on doing general purpose computation on GPUs before (previous Slashdot article). However, most of them had generally utilized the fragment processing pipeline of the GPUs which is slower then the default high speed rendering pipeline. Apparently, the above implementation is done using "simple texture mapping operations" and "cache efficient memory accesses" only. There also seems to an option to download the distribution for non-commercial use, though the requirements seem pretty hefty (a very decent nVidia graphics card and the latest nVidia drivers)."

40 of 222 comments (clear)

  1. Is GPU memory swapped? by Anonymous Coward · · Score: 3, Interesting
    If not, it might be a good place to stash crypto keys, passwords, etc. (At least until someone writes a utility to dump it and adds it to something like Cain).

    ~~~

    1. Re:Is GPU memory swapped? by SlightOverdose · · Score: 3, Informative

      As far as I know, the Linux kernel allows applications to have their pages marked as non-swappable. I presume most mainstream OS's also do this.

  2. Re:Just what I need! by Defiler · · Score: 4, Insightful

    Who the fuck has enough data to sort on a regular basis that they'd need hardware-assisted sorting?

    Perhaps you've heard of, I dunno.. Google, or Oracle?

  3. Like Judy for GPU? by torpor · · Score: 3, Interesting

    I'd love to see Judy-style thinking applied to GPU problems..."

    especially since i use Judy arrays for tons of things on two different architectures, and its just a darn efficient hash library for pretty much all of my needs ..

    --
    ; -- the corruption of government starts with its secrets. a truly free people keep no secrets. --
  4. Re:What about other sorts? by Defiler · · Score: 2, Informative

    qsort() is a very well-understood algorithm that has been highly optimized.
    Not including it in the benchmarks would have been a sign that some smoke and mirrors were involved. If the MSVC++ library had anything faster, people would be using it.

  5. Re:Just what I need! by grazzy · · Score: 4, Interesting

    Sorts are very common in applications and _can_ be slow. In a future where everyone has a GPU this can effectivly serve as a dual processor setup. Just as the FPU helped us 10 years ago with floating point operations.

  6. Not accurate by Have+Blue · · Score: 4, Informative

    the fragment processing pipeline of the GPUs which is slower then the default high speed rendering pipeline

    For the past two generations or so (starting with the Radeon 9800), there has been no such thing as "the default high speed rendering pipeline". The only circuitry present in the chip has been for evaluating shaders, and the fixed-function pipeline has been implemented as "shaders" that the driver runs on the chip automatically.

    At least, I know for a fact this is true of ATI chips, and would not be at all surprised if nVidia is doing something similar.

    1. Re:Not accurate by daVinci1980 · · Score: 3, Informative

      Your comment doesn't really make sense, and seems to convey a lack of understanding of how graphics accelerators work. (This isn't a dig, they're incredibly complicated.)

      First, as the GP said, there is no fixed function pipeline on modern GPUs. When you submit a primitive (or a batch of primitives) with fixed-function functionality, the driver converts the current fixed function operations into appropriate shader programs and sends it on down the pipe.

      Secondly, the "fixed function pipeline" for vertex processing is ludicrously simple. There's actually really nothing that's done, save multiplying the vertices by the appropriate matrix (world * view * proj, in the usual case). The interesting work, and the work that's really done by the GPU is in the fragment processor. That's where the overwhelming majority of fixed function operations are actually performed.

      However, it concerns me that you talk about writing programs that try to be as general as the fixed function pipeline. Due to the nature of fragment processors, it's phenomenally expensive to branch. And even if you were to write a general-case implementation of the fixed function pipeline that didn't branch, it would contain so many instructions as to totally hammer your performance.

      A rule of thumb in the fragment processor is that you should have no more than ~40 instructions per fragment for a full screen fill. (For pre-7800 hardware).

      --
      I currently have no clever signature witicism to add here.
    2. Re:Not accurate by graphicsguy · · Score: 2, Interesting

      Yes, my post was a bit haphazard...

      First, as the GP said, there is no fixed function pipeline on modern GPUs. When you submit a primitive (or a batch of primitives) with fixed-function functionality, the driver converts the current fixed function operations into appropriate shader programs and sends it on down the pipe.

      Agreed. However, because the guys at NVIDIA don't have to write their code in CG or VP or HLSL, they may have access to some additional means of optimizations that take advantage of some lower-level assembly actually implemented by the hardware.

      Secondly, the "fixed function pipeline" for vertex processing is ludicrously simple. There's actually really nothing that's done, save multiplying the vertices by the appropriate matrix (world * view * proj, in the usual case).

      Funny. I thought it usually also does illumination, which can be complicated by two-sided lighting, for example.

      Anyhow, I have had tests where loading the most trivial vertex program may cause a slowdown over using the standard OpenGL functionality. But I can't recall all the details of the test, so it may be that my conclusions are flawed. My conclusion was that there may be overheads associated with the use of user-written vertex programs that could stem either from lack of access to the most optimized programming level or from some other part of the driver related to having user-loaded programs running. I do understand that the standard programs run on the same hardware units and that there are likely different "fixed-function" programs for different settings of global state to avoid branching.

  7. Obl. Futurama quote by Anonymous Coward · · Score: 3, Funny

    Apparantly, the above implementaion is done using "simple texture mapping operations" and "cache efficient memory accesses" only.

    Fry: Magic. Got it.

  8. True GPU Genius: J. Ruby by Anonymous Coward · · Score: 3, Informative

    First, congratulations to J. Ruby who pioneered this work and is mentioned throughout the article. I work with him, and on his desk are _already_ four machines with dual GeForce 7800 in SLI mode. Talk about lust factor.

    Ruby's first published work was the SETI@HOME modified client which uses Nvidia (or) ATI GPU for the waveform FFT calculations. I have watched him steadily upgrade his Nvidia GPU up to this wicked 7800 arrangement he is using today.

    Go Jim! You owe me a beer.

  9. very nice by __aahlyu4518 · · Score: 3, Interesting

    I probably don't know what I'm talking about, but I'm wondering....

    What is the performance if the GPU is busy rendering when you play a game?
    When the GPU is busy doing what it is supposed to do... a program should resort to qsort right?

    1. Re:very nice by graphicsguy · · Score: 2, Informative

      Sorting is not the problem to be solved. It is a piece of the solution to MANY problems. So for lots of problems you would like to solve on the GPU, you need to do sorting, and doing it on the GPU avoids constantly moving the data back and forth between the CPU and GPU.

  10. I knew somebody who did math on a "blitter" by crovira · · Score: 2, Funny

    (Smalltalk bit block transfer) He was always making assmptions about the underlying virtual engine.

    He was always getting it wrong too. I have logged thousands of hours and thousands of miles, from Montreal, Canada to Lisbon, Portugal cleaning up after this yobbo. What a fuckup he was.

    The opinion was shared too. It got to the point to where we could write code that would detect his code and, as soon as we came across it and confirmed it we would remove it and read the original spec to know what to code.

    "Ghoul" was a geek's geek. He stayed married about a week to a co-worker's daughter. Sad in a funy sort'o way.

    --
    MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
  11. It's been said... by TobyWong · · Score: 4, Interesting

    ...that the greatest threat to Intel's market domination in the future is not going to come from AMD but from a company such as Nvidia.

    Take a look at what they are doing with their GPUs right now and you can understand why someone would suggest this.

    --
    - Toby
  12. It is really time... by ratta · · Score: 2, Interesting

    that, since graphic hardware is now completely programmable, ATI and Nvidia realese their specs, so that everyone can use the GPU as a specialized vectorial coprocessor. A CPU that cannot be programmed freely just makes no sense, the same should be for GPU.

    --
    Wondering why i am doing so strange posts? I am trying to get a "+5,Flamebait" or "-1,Insightful" rating.
  13. Re:What about other sorts? by Anonymous Coward · · Score: 4, Interesting

    I really hope they are not using the C-library implementation of qsort in those timing comparisons...

    It:
    a) Makes a call to a function, via a function pointer, for each comparison.
    b) Uses a variable element size

    Both of these things will slow down the sort a lot, compared to a specialized implementation that only sorts 32-bit integers.

  14. GPUs, and Floating Point Numbers General Question by BioCS.Nerd · · Score: 2, Interesting

    This may be slightly offtopic insofar that it doesn't directly deal with the subject (sorting with a GPU) at hand, but I was wondering how these sorts of research projects overcome "floating point number weirdness" I've heard about doing GPU calculations (as per the implementation being non-IEEE)? Doubly, would someone in the know help explain what the aforementioned "weirdness" means?

  15. Probably not for game applications by benhocking · · Score: 2, Interesting

    Most GPGPU (general purpose GPU) researchers are envisioning scientific purposes. It really galls many of us in the scientific community that GPUs are so much more powerful than CPUs (if one can efficiently use the parallel processing capabilities of GPUs), and yet mostly we have to let these powerful processors go unused because it is typically very difficult to use GPUs for non-graphical computations (hence the G in GPU, of course).

    --
    Ben Hocking
    Need a professional organizer?
    1. Re:Probably not for game applications by Tim+C · · Score: 4, Insightful

      Well, to be fair GPUs are only "so much more powerful than CPUs" if your task is suitable for running on a GPU. If not, then you're better off using the CPU.

      Kind of like how a bulldozer is much more powerful than a hammer, but totally unsuitable to banging a nail in a bit of wood. If you want something torn down or moved about, though...

  16. Re:What about other sorts? by top_down · · Score: 3, Informative

    Indeed, qsort is known to be slow. See:

    http://theory.stanford.edu/~amitp/rants/c++-vs-c/

    A comparison with the much faster STL sort should be interesting.

    --
    Anyone who generalizes about slashdotters is a typical slashdotter.
  17. precision? by Barbarian · · Score: 2, Interesting

    Are not most GPUs limited to 24 or 32 bit precision? The x86 processors can go to 80 bits.

  18. very decent? by mnemonic_ · · Score: 2, Funny

    a very decent nVidia graphics card

    So is that like "average", except "very average"?

  19. Re:GPU has a different architecture by slashflood · · Score: 2, Interesting

    The GPU has lots of separate processing elements that operate independently. It also supports floating-point vector operations. Quite a different animal from yer-average-x86.

    Just like the Cell processor...

  20. Re:What about other sorts? by pla · · Score: 4, Interesting

    Presumably though the algorithm they used in GPUsort can be made to work on a Pentium IV

    Not necessarily...

    Their use of the GPU to sort might very well run something along the lines of assigning Z-coordinates based on the key values, and colors based on a simple index , then asking the GPU to "show" the "pixels" in Z-order, then just read the "real" data of any arbitrary size and type in the order specified by the returned colors/indices. That would perform a sort using the GPU, very very rapidly, but you can't really translate it to run on a CPU - Sure, you could write code to fake it, but at the lowest level, you'd end up using something like a quicksort, rather than dedicated hardware, to emulate the desired behavior.

    Now, admittedly, I don't know that the method under consideration used such an approach. But it appears they at least took the approach of using the GPU for its strong points, rather than trying to force it to act as a general-purpose CPU.


    As for the choice of Quicksort - Most likely, they chose it because just about every C library out there has an implementation of quicksort. And while personally I prefer heapsort (in the worst case, quicksort has Q*O(n^2) behavior, while heapsort always takes only P*O(n log n), But P >> Q), I'll admit that for almost all unstructured input sets, quicksort finishes quite a lot faster than anything else.

  21. Re:GPUs, and Floating Point Numbers General Questi by slashflood · · Score: 3, Insightful


    I've heard about doing GPU calculations (as per the implementation being non-IEEE)? Doubly, would someone in the know help explain what the aforementioned "weirdness" means?

    There are just some functions missing, like different roundings and the missing double precision. GPUs are simply not optimized for scientific calculations, but that doesn't mean that they can't be programmed to be useful for those workloads.

  22. Re:What about other sorts? by Shisha · · Score: 4, Informative

    Presumably though the algorithm they used in GPUsort can be made to work on a Pentium IV.
    Yes, but the algorithm won't be anything special. It won't be a better algorithm than qsort and definitely not more efficient than O(n log n) comparisons. What is special is that it runs on a GPU.

    they should have compared GPUsort on the CPU as well

    And how exactly were they supposed to do that?!? GPUsort has been programmed to run on a GPU, and even if you don't know the first thing about computers, the G should suggest that GPU is very different from a CPU.

    One can prove that no sorting algorithm using binary comparisons can do better than use O(n log n) comparisons. Hence GPUsort couldn't have been asymptotically more efficient that qsort.

    If you think about it, the standart qsort implementation is definitely more optimised than most algorithms out there; it has been around for very long. But none of this matters; GPUsort can't run on a CPU.

  23. Re:GPUs, and Floating Point Numbers General Questi by graphicsguy · · Score: 3, Informative

    It is true. The GPU floating point implementations are not IEEE compliant. There is an interesting study on the floating point behavoir of the GPU here (and the associated pdf)

  24. The algorithm is called "Bitonic sort" by lonedroid · · Score: 2, Informative

    This is an implementation of the Bitonic sort.

    From the article, when comparing their sort to previous GPU sort: "These algorithms also implement Bitonic Sort on the GPU for 16/32-bit floats using the programmable pipeline of GPUs."

    So as I understand it they made a very fast implementation (using the GPU) of an old algorithm suited to parallel processing: bitonic sort was published in 1968 (hey, where were the fast parallel processors in the late sixties ;)

  25. Why qsort? by ringm000 · · Score: 2, Informative

    Comparing with C qsort is strange, qsort will never work fast due to being unable to inline the comparison function. Hand-written qsort of floats should work much faster.

  26. My professor was doing this by yellowcord · · Score: 2, Interesting

    a couple of years ago. Nvidia gave him a FX5900 (I think. It was the one that sounded like a hair drier and got pulled from the market) to do his research with. Anyways, check out his papers on the subject.

  27. The point? by Pedrito · · Score: 2, Interesting

    I don't really see the point. I mean, yes, it's kind of neat, but that's about it. It serves no practical purpose, really.

    Not every machine has a GPU. I don't know, but I suspect GPUs aren't terribly compatible with each other, so for any sort of market, you'd have to code for multiple GPU types.

    The fact is, co-processors have been around since the early x86 CPUs. Not just math co-processors, but Intel used to cell i860 and i960 co-processor boards (some with multiple CPUs) for just this sort of thing.

    I'd also suspect that if your GPU is being used for sorting or some other calculation intensive operation, it's less useful as a GPU. If you don't need the GPU as a GPU, but you need the computing power, I suspect spending the additional money on more CPU power instead, is going to have a bigger overall payoff since it's going to speed up every operation, not just the ones a GPU can be adapted to.

    So, again, I don't really see the point. Really, if you need specialized co-processing, getting a FPGA board will probably be a much better use of your money since it can be customized for many more tasks than a GPU.

  28. Re:Forgive the stupid question, but....er.....why? by graphicsguy · · Score: 2, Insightful

    It is important because sorting is an important algorithmic building block for solving many problems. If you want to solve important problems entirely on the GPU (which has many more GFLOPS available than the CPU, if you can manage to use them efficiently), then you need to be able to do the sorting component of those larger algorithms.

  29. However are they reliable enough? by TheLink · · Score: 2, Interesting

    Are GPUs as reliable as CPUs for these tasks? I mean if there's an error in a GPU you might get a brief rendering glitch and lots of people may not notice.

    In fact, some people are actually talking about ways of testing and allowing slightly broken graphics hardware to be used in situations where the _visual_ results won't be that off.

    --
  30. Quicksort versus HeapSort by tjwhaynes · · Score: 2, Informative
    in the worst case, quicksort has Q*O(n^2) behaviour

    Quicksort exhibits n-squared behaviour when the data set is sorted but reversed. Most decent quicksort routines do a random shuffle of the data at the start to avoid this issue.

    Choosing a sort for a particular situation is very much a matter of choosing. A shell sort is very fast for small data sets - it's quick to implement and follows C*O(n^1.27), where C is small compared to P or Q in your example. With mostly sorted data sets, the choice to sort algorithm is even more tuning sensitive - consider depth-sorting vertices in a rotating model - most of the data isn't moving around very fast and is exchanging a few places up and down the list. Sometimes the more trivial and naive sort works better than quicksort even for large mostly-sorted sets.

    Cheers,
    Toby Haynes

    --
    Anything I post is strictly my own thoughts and doesn't necessarily have anything to do with the opinions of IBM.
  31. Re:Radix is O(n)? Nice try by cain · · Score: 2, Informative
    The number of passes in radix sort is not dependent on the number of items in the list, n. Therefore the number of passes is not log n. The number of passes in radix sort in based on the length of the key being sorted. And for most data is a finite number, k. Therefore the complexity of radix sort is O(kn), where k is finite and not related to n at all, and thus the complexity is O(n). I think your rationale assumes that the length of the key you are sorting can be infinite.

  32. Re:What about other sorts? by philovivero · · Score: 2, Interesting
    As for the choice of Quicksort - Most likely, they chose it because just about every C library out there has an implementation of quicksort. And while personally I prefer heapsort (in the worst case, quicksort has Q*O(n^2) behavior, while heapsort always takes only P*O(n log n), But P >> Q), I'll admit that for almost all unstructured input sets, quicksort finishes quite a lot faster than anything else.
    I humbly suggest you do not understand quicksort vs. heap sort. Layman's terms: Quicksort is better. It takes less RAM. The O(xyz) notation only takes into account computational complexity, and sorta ignores that whole messy "RAM vs. disk" issue that reality imposes on us.

    Quicksort finishes just fine in nlogn time for all input sets EXCEPT those specially-crafted to work badly on quicksort. For any data set of more than about 20 elements, the odds of getting a data set specially ordered to take n^2 time in quicksort are infitesimally small (unless someone has specially set the data up in that "takes a long time" order).

    The advantage of the quicksort is that it can do the sort with Q memory. Why don't you go back and review how much memory a heap sort does with P elements? Here. I'll help you get started. a little page with comparisons of sorts
  33. Re:What about other sorts? by pooly7 · · Score: 2, Informative

    I don't know why the parent comment is mod to "Funny" because this is actually _totally_ true. http://www.rt.com/man/qsort.3.html

  34. Re:What about other sorts? by AuMatar · · Score: 2, Informative

    Of course, they require a lot of memory- buckt requires an array of integers of size maxvalue-minvalue, where max and min value are the biggest numbers being sorted. To sort all uint32 will take 2^32 indexes into the array, or at least 16 GB with 32 bit array indixes. Not practicle for random data, although if your data is limited to a range (say 1-1000) its a nice sort method, and fairly simple to write.

    --
    I still have more fans than freaks. WTF is wrong with you people?
  35. Re:What about other sorts? by joto · · Score: 2, Interesting
    qsort() is a very well-understood algorithm that has been highly optimized.

    Bzzt, wrong. qsort() is NOT, I repeat, NOT, the same as quicksort.

    According to unix manpages: "The qsort subroutine sorts a table of data in place. It uses the quicker-sort algorithm.

    Note that "quicker-sort" is not the same as "quicksort". In fact, I doubt anyone really knows what the hell "quicker sort" is, anyway. My guess is that it might be a name just invented to explain away the q, once some standardization committee decided to not have implementation details in the spec (assuming qsort() really meant quicksort some time in the past).

    So, qsort() is not a very well understood algorithm. And I seriously doubt it has been highly optimized, as there is no point in it.

    You see, for every comparison qsort() does, it has to call a function through a function pointer. That is pretty brain-damaged for the typical case, where the comparison is typically done in just one or two CPU opcodes. Spending lots of effort optimizing something that is a lost race anyway, is not a wise use of time for compiler vendors. For this reason, qsort() is more likely to be a slightly naïve implementation of quicksort.

    Not including it in the benchmarks would have been a sign that some smoke and mirrors were involved.

    Actually, I would say the opposite. Anyone caring even a little bit about performance would run away screaming if someone suggested they'd use qsort(). Benchmarking against something as sluggish as that IS a sign that some smoke and mirrors were involved.

    If the MSVC++ library had anything faster, people would be using it.

    Doesn't MSVC++ have std::sort() ?