Slashdot Mirror


Impressive Benchmarks: Sorting with a GPU

An anonymous reader writes "The Graphics research group at the University of North Carolina at Chapel Hill has posted some interesting benchmarks for a sorting implementation which is done entirely on a GPU. There have been efforts on doing general purpose computation on GPUs before (previous Slashdot article). However, most of them had generally utilized the fragment processing pipeline of the GPUs which is slower then the default high speed rendering pipeline. Apparently, the above implementation is done using "simple texture mapping operations" and "cache efficient memory accesses" only. There also seems to an option to download the distribution for non-commercial use, though the requirements seem pretty hefty (a very decent nVidia graphics card and the latest nVidia drivers)."

222 comments

  1. Is GPU memory swapped? by Anonymous Coward · · Score: 3, Interesting
    If not, it might be a good place to stash crypto keys, passwords, etc. (At least until someone writes a utility to dump it and adds it to something like Cain).

    ~~~

    1. Re:Is GPU memory swapped? by SlightOverdose · · Score: 3, Informative

      As far as I know, the Linux kernel allows applications to have their pages marked as non-swappable. I presume most mainstream OS's also do this.

    2. Re:Is GPU memory swapped? by cbreaker · · Score: 1

      I'd agree with your presumtion; Windows can do the same thing. Take VMWare GSX server, for instance; you can instruct it to never swap VM memory, or swap some of it, or swap all of it.

      --
      - It's not the Macs I hate. It's Digg users. -
  2. What's the point ? by Arthur+B. · · Score: 0

    So the GPU is an additional processor whose power can be used. Any card with an additional processor could do the job. Considering that, much like 3D, most algorithms/operations are often the same, much processing power would be gained with such cards and a standardized library alla opengl.

    --
    \u262D = \u5350
  3. Re:Just what I need! by Anonymous Coward · · Score: 1, Insightful

    "Who the fuck has enough data to sort on a regular basis that they'd need hardware-assisted sorting?"

    Database admins?

  4. Re:Just what I need! by Defiler · · Score: 4, Insightful

    Who the fuck has enough data to sort on a regular basis that they'd need hardware-assisted sorting?

    Perhaps you've heard of, I dunno.. Google, or Oracle?

  5. What about other sorts? by Archibald+Buttle · · Score: 1

    This isn't exactly a fair test as I see it. As far as I can make out they've put a custom sorting algorithm up against the standard C library qsort. How about some comparisons of this GPUsort against other sorting algorithms run on the CPU?

    1. Re:What about other sorts? by Defiler · · Score: 2, Informative

      qsort() is a very well-understood algorithm that has been highly optimized.
      Not including it in the benchmarks would have been a sign that some smoke and mirrors were involved. If the MSVC++ library had anything faster, people would be using it.

    2. Re:What about other sorts? by Archibald+Buttle · · Score: 1

      Presumably though the algorithm they used in GPUsort can be made to work on a Pentium IV. Comparing only against qsort looks suspicious to me - they should have compared GPUsort on the CPU as well as with it on the GPU and qsort.

    3. Re:What about other sorts? by Anonymous Coward · · Score: 4, Interesting

      I really hope they are not using the C-library implementation of qsort in those timing comparisons...

      It:
      a) Makes a call to a function, via a function pointer, for each comparison.
      b) Uses a variable element size

      Both of these things will slow down the sort a lot, compared to a specialized implementation that only sorts 32-bit integers.

    4. Re:What about other sorts? by top_down · · Score: 3, Informative

      Indeed, qsort is known to be slow. See:

      http://theory.stanford.edu/~amitp/rants/c++-vs-c/

      A comparison with the much faster STL sort should be interesting.

      --
      Anyone who generalizes about slashdotters is a typical slashdotter.
    5. Re:What about other sorts? by pla · · Score: 4, Interesting

      Presumably though the algorithm they used in GPUsort can be made to work on a Pentium IV

      Not necessarily...

      Their use of the GPU to sort might very well run something along the lines of assigning Z-coordinates based on the key values, and colors based on a simple index , then asking the GPU to "show" the "pixels" in Z-order, then just read the "real" data of any arbitrary size and type in the order specified by the returned colors/indices. That would perform a sort using the GPU, very very rapidly, but you can't really translate it to run on a CPU - Sure, you could write code to fake it, but at the lowest level, you'd end up using something like a quicksort, rather than dedicated hardware, to emulate the desired behavior.

      Now, admittedly, I don't know that the method under consideration used such an approach. But it appears they at least took the approach of using the GPU for its strong points, rather than trying to force it to act as a general-purpose CPU.


      As for the choice of Quicksort - Most likely, they chose it because just about every C library out there has an implementation of quicksort. And while personally I prefer heapsort (in the worst case, quicksort has Q*O(n^2) behavior, while heapsort always takes only P*O(n log n), But P >> Q), I'll admit that for almost all unstructured input sets, quicksort finishes quite a lot faster than anything else.

    6. Re:What about other sorts? by p3d0 · · Score: 1
      No, qsort() does an indirect function call in hottest innermost loop. A hand-coded data-specific sort routine will always be faster unless the comparison operation itself is very slow.

      Unless your compiler knows about qsort and can (1) inline it, and (2) inline the comparison function into it, then you'll always have that indirect call in the innermost loop. (I just checked; gcc can't do it. I haven't tried the Intel or MS compilers.)

      --
      Patrick Doyle
      I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
    7. Re:What about other sorts? by Shisha · · Score: 4, Informative

      Presumably though the algorithm they used in GPUsort can be made to work on a Pentium IV.
      Yes, but the algorithm won't be anything special. It won't be a better algorithm than qsort and definitely not more efficient than O(n log n) comparisons. What is special is that it runs on a GPU.

      they should have compared GPUsort on the CPU as well

      And how exactly were they supposed to do that?!? GPUsort has been programmed to run on a GPU, and even if you don't know the first thing about computers, the G should suggest that GPU is very different from a CPU.

      One can prove that no sorting algorithm using binary comparisons can do better than use O(n log n) comparisons. Hence GPUsort couldn't have been asymptotically more efficient that qsort.

      If you think about it, the standart qsort implementation is definitely more optimised than most algorithms out there; it has been around for very long. But none of this matters; GPUsort can't run on a CPU.

    8. Re:What about other sorts? by Anonymous Coward · · Score: 0

      I wanna see a distributed bubble sort on SLI-paired GForce 7800s, baby!

    9. Re:What about other sorts? by Calyth · · Score: 1

      qsort(), assuming that it's randomized quick sort, it plenty fast. Assuming the complaint about function pointers are valid, it shouldn't be too hard to code up one that doesn't use function pointers, and then a comparision could be made pretty quickly. And even with that I'm still not surprised that a GPU can do this faster. Given the insane amounts of sorting and stuff involved in 3D rendering, a lot of these stuff are accelerated in hardware.
      If you find out any binary sorting that runs faster than O(nlogn), let me know.

    10. Re:What about other sorts? by rodac · · Score: 1

      If the sorting algorithm is O(n log n) then it should (for many kinds of data) always fail compared to a normal CPU running a proper sorting algorithm. IF the sort keys allow a non-compare sort algorithm.

      Devicing an O(n) algorithm that does not do an expensive branch to a compare function is pretty trivial (e.g. radix sort) and will always win by a huge margin to general purpose algorithms such as qsort.

      qsort is a good O(n log n) algorithm that can sort any kind of data.
      The only drawback with O(n) algorithms is that they can only sort sertain kinds of data, but they are much faster.

    11. Re:What about other sorts? by exp(pi*sqrt(163)) · · Score: 1

      Quicksort is a well understoord algorithm. qsort() in the standard C library is a notoriously poor implementation. std::sort() in the C++ library might be a better choice.

      --
      Doesn't it make you feel good to know that our freedoms are protected by politicans, lawyers and journalists.
    12. Re:What about other sorts? by Paul+Rose · · Score: 1
      std::sort() in the C++ library might be a better choice

      Yes, but not because qsort() is such a poor implementation, but becuase the comparison functor for std::sort will nearly always be inlined (for sure by g++ 3+ and msvc6+) while the function pointer compare callback for standard C qsort is never inlined (at least not by any implementation I've seen).
    13. Re:What about other sorts? by Poltras · · Score: 1

      Radix sort is nlogn also. The only known O(n) algorithm is pigeon hole... which is really flawed for sorting floats (or even just 16bits integer).

    14. Re:What about other sorts? by Anonymous Coward · · Score: 0

      No compiler can do it unless both functions are defined in the same translation unit, which is unlikely since qsort() is a library function.

      Even link-time optimization cant do it, since it breaks C aliasing rules.

    15. Re:What about other sorts? by agurkan · · Score: 1, Offtopic

      Actually P~2*Q. Also, there are tweaks to Quicksort which guarantees O(nlogn) behaviour. If you need an efficient but somewhat general purpose quicksort I strongly recommend Jörg Schön's collection of C programs which include a simple include file that creates a very efficient sortinf routine, and coding takes seconds!!

      --
      ato
    16. Re:What about other sorts? by Anonymous Coward · · Score: 0

      The quicksort algorithm, however, is not terribly slow. (except in pathological cases) That's probably what's throwing off some of the other posters, they may be thinking qsort == quicksort. It's just one implementation of it.

      In terms of algorithms, I've found merge sort to actually be significantly faster than quicksort in the presense of large data sets and caches due to the better locality, but it has complications for general-purpose sorting. IIRC, STL typically uses something similar to introspective sort.

    17. Re:What about other sorts? by p3d0 · · Score: 1

      Yeah, I thought so. I was trying to keep an open mind, and imagine that the compiler might have qsort as a great big intrinsic function that it knows intimately. Then it's just the call site and the compare function would need to be in the same translation unit.

      --
      Patrick Doyle
      I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
    18. Re:What about other sorts? by mepperpint · · Score: 1

      One can prove that no sorting algorithm using binary comparisons can do better than use O(n log n) comparisons. Hence GPUsort couldn't have been asymptotically more efficient that qsort.

      Actually, qsort is O(n^2), so there are many known asymptotically more efficient algorithms. If we talk about average case instead of worse case, then qsort is THETA(n log n) and your argument holds.

    19. Re:What about other sorts? by BlowChunx · · Score: 1

      The obvious answer is to make a qsort kernel module and get it out of user space!

    20. Re:What about other sorts? by philovivero · · Score: 2, Interesting
      As for the choice of Quicksort - Most likely, they chose it because just about every C library out there has an implementation of quicksort. And while personally I prefer heapsort (in the worst case, quicksort has Q*O(n^2) behavior, while heapsort always takes only P*O(n log n), But P >> Q), I'll admit that for almost all unstructured input sets, quicksort finishes quite a lot faster than anything else.
      I humbly suggest you do not understand quicksort vs. heap sort. Layman's terms: Quicksort is better. It takes less RAM. The O(xyz) notation only takes into account computational complexity, and sorta ignores that whole messy "RAM vs. disk" issue that reality imposes on us.

      Quicksort finishes just fine in nlogn time for all input sets EXCEPT those specially-crafted to work badly on quicksort. For any data set of more than about 20 elements, the odds of getting a data set specially ordered to take n^2 time in quicksort are infitesimally small (unless someone has specially set the data up in that "takes a long time" order).

      The advantage of the quicksort is that it can do the sort with Q memory. Why don't you go back and review how much memory a heap sort does with P elements? Here. I'll help you get started. a little page with comparisons of sorts
    21. Re:What about other sorts? by philovivero · · Score: 1

      Lordy. I notice that the page claims quicksort has n^2 complexity when input is ordered. That is only true of naive quicksort implementations. Textbook quicksort implementations from when I was in college (about 10 years ago) already took care of cases where input was ordered.

    22. Re:What about other sorts? by pooly7 · · Score: 2, Informative

      I don't know why the parent comment is mod to "Funny" because this is actually _totally_ true. http://www.rt.com/man/qsort.3.html

    23. Re:What about other sorts? by Archibald+Buttle · · Score: 1

      Chill dude!

      GPUs are of course specialist processors and not the same as CPUs as you state - I am well aware of the one letter difference. However my understanding is that the processing units in GPUs are rather similar to the likes of the SSE units within Pentium's and Altivec units in PowerPC chips. They are essentially linear vector processors, from what I understand.

      We're talking about algorithms here - if you can express an algorithm in code for one type of processor chip then you can express it in code for another type of chip. That's a basic of programming that everyone here should understand. Your statement that GPUsort can't run on a CPU is false.

      Now it may be impractical to implement GPUsort on a CPU, but it's not impossible.

    24. Re:What about other sorts? by Have+Blue · · Score: 1

      If all you want to sort is integers, you can go beyond quicksort. Google "bucket sort" or "radix sort"- these are O(n) sorts while even quicksort is O(n log n).

    25. Re:What about other sorts? by InfiniteWisdom · · Score: 1

      The whole point of GPUsort is to exploit the parallelism of the GPU... you certainly could implement it... hell you could implement it on a postscript printer... anything that is Turing-complete. But it won't be a valid comparison. What they have compared is their best-known algorithms for sorting on a GPU to what is probably the best general-purpose sorting algorithm on a CPU.

    26. Re:What about other sorts? by exp(pi*sqrt(163)) · · Score: 1
      but becuase the comparison functor for std::sort will nearly always be inlined
      Knowing how to ensure the right things get inlined is one of the aspects of what I call "good implementation". But opinions may vary I suppose.
      --
      Doesn't it make you feel good to know that our freedoms are protected by politicans, lawyers and journalists.
    27. Re:What about other sorts? by AuMatar · · Score: 1

      n*log(n) is only a limit for sorting in-place. You can go as low as O(n) if you have sufficient memory.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    28. Re:What about other sorts? by AuMatar · · Score: 2, Informative

      Of course, they require a lot of memory- buckt requires an array of integers of size maxvalue-minvalue, where max and min value are the biggest numbers being sorted. To sort all uint32 will take 2^32 indexes into the array, or at least 16 GB with 32 bit array indixes. Not practicle for random data, although if your data is limited to a range (say 1-1000) its a nice sort method, and fairly simple to write.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    29. Re:What about other sorts? by doktor-hladnjak · · Score: 1

      That website seems to imply that you need an additional order n amount of space to do heapsort. However, it can be done in place without too much effort:

      1. Take your input array containing n elements and perform a build maxheap operation on it. This takes O(n) time and can be done in place.

      2. Swap the root (the maximum element in the heap) with the last value in the part of the array still occupied by the heap, effectively removing the root from the heap and placing it in the sorted portion at the end of the array. O(1)

      3. Percolate the new root (that was just swapped) down the tree to guarantee the heap property holds for it. O(log n)

      4. Go back to step 2 so long as there are still elements left in the heap.

      Still, quicksort usually tends to be faster in most cases because of overhead. Heapsort really pays off though when you don't want to sort the entire sequence. For example, maybe you just want the 10 largest or smallest elements. In that case, you only have to do the full O(n) build heap operation instead of a full O(n log n) sort.

    30. Re:What about other sorts? by Politburo · · Score: 1

      even if you don't know the first thing about computers, the G should suggest that GPU is very different from a CPU.

      Based on that logic, you would say that a VCR and a VTR are very different machines, correct?

    31. Re:What about other sorts? by Kelerain · · Score: 1

      A comparison with the much faster STL sort should be interesting.

      It would be interesting, but please bear in mind the STL algorithms value *predictable* speed, vs raw average speed. They would ditch an algorithm that was 1/2 the average run time of another , but had ocasional quadrupal time runs. So even someone useing the raw C++ evironment could write a sort that was faster than STL on average without too much work.

    32. Re:What about other sorts? by joto · · Score: 1
      The obvious answer is to make a qsort kernel module and get it out of user space!

      Yeah, that would be really neat. Not only do you still have to call a function for each comparison. But *each* and *every one* of those function calls will need to go across the kernel-userspace boundary. My oh my, you are clever (not!)

    33. Re:What about other sorts? by Anonymous Coward · · Score: 0

      Exactly. Especially the function calls make qsort look really bad. I implemented a simple MergeSort and ran it on different array sizes (32-bit floating point numbers). Performance numbers are for an AMD Ahtlon64 3500+.

      1x10^6 elements: 336 ms
      2x10^6 elements: 713 ms
      4x10^6 elements: 1498 ms
      8x10^6 elements: 3156 ms
      16x10^6 elements: 6608 ms

      As far as I can tell, this is pretty close to the numbers they get on the nVidia card. Can I publish these results somewhere? ;-)

    34. Re:What about other sorts? by joto · · Score: 2, Interesting
      qsort() is a very well-understood algorithm that has been highly optimized.

      Bzzt, wrong. qsort() is NOT, I repeat, NOT, the same as quicksort.

      According to unix manpages: "The qsort subroutine sorts a table of data in place. It uses the quicker-sort algorithm.

      Note that "quicker-sort" is not the same as "quicksort". In fact, I doubt anyone really knows what the hell "quicker sort" is, anyway. My guess is that it might be a name just invented to explain away the q, once some standardization committee decided to not have implementation details in the spec (assuming qsort() really meant quicksort some time in the past).

      So, qsort() is not a very well understood algorithm. And I seriously doubt it has been highly optimized, as there is no point in it.

      You see, for every comparison qsort() does, it has to call a function through a function pointer. That is pretty brain-damaged for the typical case, where the comparison is typically done in just one or two CPU opcodes. Spending lots of effort optimizing something that is a lost race anyway, is not a wise use of time for compiler vendors. For this reason, qsort() is more likely to be a slightly naïve implementation of quicksort.

      Not including it in the benchmarks would have been a sign that some smoke and mirrors were involved.

      Actually, I would say the opposite. Anyone caring even a little bit about performance would run away screaming if someone suggested they'd use qsort(). Benchmarking against something as sluggish as that IS a sign that some smoke and mirrors were involved.

      If the MSVC++ library had anything faster, people would be using it.

      Doesn't MSVC++ have std::sort() ?

    35. Re:What about other sorts? by top_down · · Score: 1
      STL algorithms value *predictable* speed, vs raw average speed


      No this is specified per function. partial_sort() prefers predictability. sort() prefers raw average speed. As another poster mentioned sort() is usually implemented with quicksort or introsort.

      --
      Anyone who generalizes about slashdotters is a typical slashdotter.
    36. Re:What about other sorts? by KingEomer · · Score: 1

      Actually, you can find the first x greatest or smallest elements of an unsorted list in O(n) time. This can be done using a modified quick-sort, in which you use the algorithm itself to choose the pivot.

    37. Re:What about other sorts? by Anonymous Coward · · Score: 0

      C++ and the STL actually overcome both of these limitations. The more flexible of the two C++ sort functions looks like:

      template <class Ran, class Cmp>
      void sort(Ran first, Ran last, Cmp cmp)

      By allowing for template iterators and a templated comparison operator, the compiler can optimize the comparisons and swaps away completely (assuming cmp is either a function object or an inline function).

      Here is a detailed reference for std::sort.

    38. Re:What about other sorts? by Anonymous Coward · · Score: 0

      I should say that the comparisons and swaps aren't optimized away completely; rather, the overhead of function calls to compare and swap are (as is the overhead of dealing with the swap arguments as binary blobs of arbitrary size instead of, say, 32-bit integers).

    39. Re:What about other sorts? by rbarreira · · Score: 1

      Quicksort finishes just fine in nlogn time for all input sets EXCEPT those specially-crafted to work badly on quicksort. For any data set of more than about 20 elements, the odds of getting a data set specially ordered to take n^2 time in quicksort are infitesimally small (unless someone has specially set the data up in that "takes a long time" order).

      Yeah, but I imagine that the chance of the algorithm taking between O(n log n) and O(n^2) time is much greater... But since I'm not a quicksort expert, I did some quick search and perhaps I'm wrong

      --

      The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
    40. Re:What about other sorts? by rbarreira · · Score: 1

      Maybe he was joking? Or maybe not...

      --

      The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
    41. Re:What about other sorts? by Anonymous Coward · · Score: 0

      radix sort of 32 bit integers and using 256 buckets require O(4n) operations.

      as always with ordo you never write out the constant so the complexity of radix sort is O(n).

      ergo radix sort is O(n) WORST-CASE (==average-case==best-case)

      How is qsort() going to be faster since it is O(n log2 n) AVERAGE-CASE but O(n^2) worst-case?

      Have you even tried and compared them?

    42. Re:What about other sorts? by 280Z28 · · Score: 1

      I ran a test using the sort slasrt_() from the CLAPACK library as a comparison. It was last updated September 30, 1994, and I didn't tweak it for this test. On a Pentium 4 3.0GHz, I achieved the following results: Input Sequence Size: 1024 Qsort() Time: 0.000 sec Input Sequence Size: 2048 Qsort() Time: 0.000 sec Input Sequence Size: 4096 Qsort() Time: 0.000 sec Input Sequence Size: 8192 Qsort() Time: 0.001 sec Input Sequence Size: 16384 Qsort() Time: 0.002 sec Input Sequence Size: 32768 Qsort() Time: 0.005 sec Input Sequence Size: 65536 Qsort() Time: 0.011 sec Input Sequence Size: 131072 Qsort() Time: 0.028 sec Input Sequence Size: 262144 Qsort() Time: 0.049 sec Input Sequence Size: 524288 Qsort() Time: 0.093 sec Input Sequence Size: 1048576 Qsort() Time: 0.193 sec Input Sequence Size: 2097152 Qsort() Time: 0.386 sec Input Sequence Size: 4194304 Qsort() Time: 0.798 sec Input Sequence Size: 8388608 Qsort() Time: 1.625 sec I feel this is a legitimate comparison, even though the algorithm varies from the algorithm used on the graphics cards, because the original test already compared different sorting algorithms. So if someone is looking for the result of "how fast can the P4 3._2_ sort a list of numbers?" as a comparison, the answer sure isn't "~6-7 seconds for a list of 8.3million 32-bit floats."

      --
      Turning coffee into code.
    43. Re:What about other sorts? by Have+Blue · · Score: 1

      I don't know what bucket sort you're talking about, but you can save most of that memory by sorting by one digit at a time. A single pass through the bucket algorithm is stable, so you can repeatedly sort on the individual digits starting with the most significant- this uses 10 buckets, not UINT_MAX. It's still O(n), and the constant is now the expected order of magnitude of the largest value.

    44. Re:What about other sorts? by Have+Blue · · Score: 1

      Er, that should be "number of digits" not "order of magnitude".

    45. Re:What about other sorts? by 2short · · Score: 1

      Of course, the other way around these issues is to use C++, where sort is a template so the comparison function can be inlined and the element size fixed at compile time. You always get a specialized implementation that only sorts whatever type you are sorting; but you never have to write it. The cannonical counterexample for people who think plain C should be used for performance.

  6. Like Judy for GPU? by torpor · · Score: 3, Interesting

    I'd love to see Judy-style thinking applied to GPU problems..."

    especially since i use Judy arrays for tons of things on two different architectures, and its just a darn efficient hash library for pretty much all of my needs ..

    --
    ; -- the corruption of government starts with its secrets. a truly free people keep no secrets. --
  7. Re:Just what I need! by grazzy · · Score: 4, Interesting

    Sorts are very common in applications and _can_ be slow. In a future where everyone has a GPU this can effectivly serve as a dual processor setup. Just as the FPU helped us 10 years ago with floating point operations.

  8. Great.. but. by Despised · · Score: 0

    Great.. but. does it run linux?

    1. Re:Great.. but. by graphicsguy · · Score: 1

      It probably does run under linux. The main problem with GPGPU programs under linux (on NVIDIA) until recently was the lack of "render to texture" functionality. This meant that you had to render to a pbuffer and then do a copy operation to make that data a readable texture for the next pass. However, the latest NVIDIA drivers for linux support frame buffer objects, which provide the same functionality (and then some).

    2. Re:Great.. but. by graphicsguy · · Score: 1

      p.s. - if the grandparent was a joke that meant "can I run linux on the GPU", then I guess either my joke meter is broken or it just wasn't funny :-)

  9. Not accurate by Have+Blue · · Score: 4, Informative

    the fragment processing pipeline of the GPUs which is slower then the default high speed rendering pipeline

    For the past two generations or so (starting with the Radeon 9800), there has been no such thing as "the default high speed rendering pipeline". The only circuitry present in the chip has been for evaluating shaders, and the fixed-function pipeline has been implemented as "shaders" that the driver runs on the chip automatically.

    At least, I know for a fact this is true of ATI chips, and would not be at all surprised if nVidia is doing something similar.

    1. Re:Not accurate by Anonymous Coward · · Score: 0

      Yes, from what anyone outside the companies can tell, ATI's R3xx and nVidia's NV2x series chips do not provide a fixed function pipeline at all.

      It's also true that this can be slower, the specialised hardware was very, very fast. However it was obvious that the specialised hardware was a dead-end, so people set off in a new direction even before it reached its limits.

      This software only runs on the programmable hardware. It's not going to do any sorting with your non-programmable ATI Radeon 7000, Voodoo 5 or Intel 8xx series graphics.

      The submitter screwed up because they don't know anything about graphics cards, and of course the Slashdot editors don't check submissions for correctness, so we all end up reading whatever the first idiot put in their submission.

    2. Re:Not accurate by graphicsguy · · Score: 1

      On the other hand, at least for the vertex processors, it does seem that running the "fixed function pipeline" is generally faster than loading your own program (especially if your own program tries to be as general as the fixed function program. I assume this is more due to specialized microcode rather than separate transistors.

    3. Re:Not accurate by daVinci1980 · · Score: 3, Informative

      Your comment doesn't really make sense, and seems to convey a lack of understanding of how graphics accelerators work. (This isn't a dig, they're incredibly complicated.)

      First, as the GP said, there is no fixed function pipeline on modern GPUs. When you submit a primitive (or a batch of primitives) with fixed-function functionality, the driver converts the current fixed function operations into appropriate shader programs and sends it on down the pipe.

      Secondly, the "fixed function pipeline" for vertex processing is ludicrously simple. There's actually really nothing that's done, save multiplying the vertices by the appropriate matrix (world * view * proj, in the usual case). The interesting work, and the work that's really done by the GPU is in the fragment processor. That's where the overwhelming majority of fixed function operations are actually performed.

      However, it concerns me that you talk about writing programs that try to be as general as the fixed function pipeline. Due to the nature of fragment processors, it's phenomenally expensive to branch. And even if you were to write a general-case implementation of the fixed function pipeline that didn't branch, it would contain so many instructions as to totally hammer your performance.

      A rule of thumb in the fragment processor is that you should have no more than ~40 instructions per fragment for a full screen fill. (For pre-7800 hardware).

      --
      I currently have no clever signature witicism to add here.
    4. Re:Not accurate by graphicsguy · · Score: 2, Interesting

      Yes, my post was a bit haphazard...

      First, as the GP said, there is no fixed function pipeline on modern GPUs. When you submit a primitive (or a batch of primitives) with fixed-function functionality, the driver converts the current fixed function operations into appropriate shader programs and sends it on down the pipe.

      Agreed. However, because the guys at NVIDIA don't have to write their code in CG or VP or HLSL, they may have access to some additional means of optimizations that take advantage of some lower-level assembly actually implemented by the hardware.

      Secondly, the "fixed function pipeline" for vertex processing is ludicrously simple. There's actually really nothing that's done, save multiplying the vertices by the appropriate matrix (world * view * proj, in the usual case).

      Funny. I thought it usually also does illumination, which can be complicated by two-sided lighting, for example.

      Anyhow, I have had tests where loading the most trivial vertex program may cause a slowdown over using the standard OpenGL functionality. But I can't recall all the details of the test, so it may be that my conclusions are flawed. My conclusion was that there may be overheads associated with the use of user-written vertex programs that could stem either from lack of access to the most optimized programming level or from some other part of the driver related to having user-loaded programs running. I do understand that the standard programs run on the same hardware units and that there are likely different "fixed-function" programs for different settings of global state to avoid branching.

  10. Obl. Futurama quote by Anonymous Coward · · Score: 3, Funny

    Apparantly, the above implementaion is done using "simple texture mapping operations" and "cache efficient memory accesses" only.

    Fry: Magic. Got it.

  11. True GPU Genius: J. Ruby by Anonymous Coward · · Score: 3, Informative

    First, congratulations to J. Ruby who pioneered this work and is mentioned throughout the article. I work with him, and on his desk are _already_ four machines with dual GeForce 7800 in SLI mode. Talk about lust factor.

    Ruby's first published work was the SETI@HOME modified client which uses Nvidia (or) ATI GPU for the waveform FFT calculations. I have watched him steadily upgrade his Nvidia GPU up to this wicked 7800 arrangement he is using today.

    Go Jim! You owe me a beer.

    1. Re:True GPU Genius: J. Ruby by NatasRevol · · Score: 1, Funny

      I don't think an AC will get a beer...

      --
      There are two types of people in the world: Those who crave closure
    2. Re:True GPU Genius: J. Ruby by ratta · · Score: 1

      Dual Geforce 7800 in SLI?????????? With the insane amount of money required to buy 2 Geforce 7800 a whole opteron cluster could be built (or little less) :-)

      --
      Wondering why i am doing so strange posts? I am trying to get a "+5,Flamebait" or "-1,Insightful" rating.
    3. Re:True GPU Genius: J. Ruby by Lozzer · · Score: 1

      Does he wear ear defenders?

      --
      Special Relativity: The person in the other queue thinks yours is moving faster.
    4. Re:True GPU Genius: J. Ruby by Anonymous Coward · · Score: 0

      What did you say?

    5. Re:True GPU Genius: J. Ruby by speculatrix · · Score: 1

      I am wondering whether an RSA/DES/AES key cracker could be rewritten like this... e.g. the dnetc/rc5-72 cracker.

  12. very nice by __aahlyu4518 · · Score: 3, Interesting

    I probably don't know what I'm talking about, but I'm wondering....

    What is the performance if the GPU is busy rendering when you play a game?
    When the GPU is busy doing what it is supposed to do... a program should resort to qsort right?

    1. Re:very nice by tgv · · Score: 1

      It's not that you don't know what you're talking about, it's that you're missing a (subtle) point.

      These people see the GPU as an extra processor that you can use to do arbitrary tasks. Ok, we know that you can use it rather effectively to render pictures from procedural calls, but, since rendering is basically number crunching, can you also use it to do general computation, e.g. qsort? Well, the answer is yes, and it seems you can get quite some mileage out of it.

      And yes, of course, it would be rather inefficient to share the GPU for both sorting and rendering in the same application, unless of course your sorting duties are much heavier than rendering...

    2. Re:very nice by FriedTurkey · · Score: 1

      Yes, I am sure the researchers are finding faster ways to map the human genome while playing DOOM3 at the same time. It is very important because it is annoying to have your frames per second drop because of some "cure for cancer" crap.

    3. Re:very nice by __aahlyu4518 · · Score: 1

      LOL... you made your point :-)

      But often things that are constructed for scientific purposes are one day going to be used for general purpose. I was obviously thinking about that...

    4. Re:very nice by ImaLamer · · Score: 1

      I'm thinking that this type of thing should be good for screensavers and other hacks. I'd like to see Electric Sheep (http://electricsheep.org/)run entirely on the GPU.

    5. Re:very nice by agbinfo · · Score: 1
      To be fair though, shouldn't the benchmark be made against an optimized 32-bit float sorter.

      The way I see it is that they wrote a sorting algorithm for 16 and 32-bit floats on a GPU and they are comparing this with a generic sorting algorithm. How would the test fair against a radix sort for example? How much memory are they using? How does it compare with qsort when sorting 20 byte keys?

    6. Re:very nice by graphicsguy · · Score: 1

      The most fair comparison in the study is really the comparison against previous GPU sorting implementations. This new work is clearly a big improvement in the state of the art in GPU-based sorting.

    7. Re:very nice by agbinfo · · Score: 1
      Still, I'm not sure I see the point of this except maybe as a pet project.

      I'm not an expert in this sort of thing, no pun intended, but if I needed to accelerate sorting, I would prefer to use a platform that I can rely on instead of some specific hardware. For example, I would use a merge sort and distribute portions of the dataset to other CPUs [using a heap sort instead of a quick sort to ensure that I always have O(n ln n)].

      This way, I have a scalable solution which is easy to implement, is hardware independent, and it's probably cheaper too.

    8. Re:very nice by graphicsguy · · Score: 2, Informative

      Sorting is not the problem to be solved. It is a piece of the solution to MANY problems. So for lots of problems you would like to solve on the GPU, you need to do sorting, and doing it on the GPU avoids constantly moving the data back and forth between the CPU and GPU.

  13. I knew somebody who did math on a "blitter" by crovira · · Score: 2, Funny

    (Smalltalk bit block transfer) He was always making assmptions about the underlying virtual engine.

    He was always getting it wrong too. I have logged thousands of hours and thousands of miles, from Montreal, Canada to Lisbon, Portugal cleaning up after this yobbo. What a fuckup he was.

    The opinion was shared too. It got to the point to where we could write code that would detect his code and, as soon as we came across it and confirmed it we would remove it and read the original spec to know what to code.

    "Ghoul" was a geek's geek. He stayed married about a week to a co-worker's daughter. Sad in a funy sort'o way.

    --
    MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
    1. Re:I knew somebody who did math on a "blitter" by LiquidCoooled · · Score: 0, Flamebait

      Without this yobbo, you wouldn't have a job.

      --
      liqbase :: faster than paper
  14. Come on! by Anonymous Coward · · Score: 0

    However, most of them had generally utilized the fragment processing pipeline of the GPUs which is slower then the default high ...

    What is so difficult about knowing when to use then and when to use than?

    1. Re:Come on! by HyperChicken · · Score: 1

      What is so difficult about knowing when to use then and when to use than?

      Well, they sound similar for starters.

      --
      Free of Flash! Free of Flash!
    2. Re:Come on! by fnj · · Score: 1

      What is so difficult about knowing when to use then and when to use than?

      "Then" and "than" was grammar school stuff back in my day. Maybe that's why they called it grammar school.

      Here's one that might have taken until junior high school to master:

      "I need a discrete source of discreet components." ... or is it ...
      "I need a discreet source of discrete components."

    3. Re:Come on! by Anonymous Coward · · Score: 0
    4. Re:Come on! by Anonymous Coward · · Score: 0

      Well, they sound similar for starters.

      I suspect a large part of the difference is that they sound similar when some people pronounce them, but not for other people.

      For example, I pronounce "your" and "you're" differently, so when I see them misused it looks horribly wrong to me. ("That doesn't even sound right. How can they screw that up?") But some people don't pronounce them differently, and I suspect it's these people who make the mistake.

      I'm much less annoyed when I see "principle"/"principal" misused, because I pronounce them the same. It's still annoying, but not fingernails-on-a-blackboard bad.

      I've got $5 says whoever misused "then"/"than" pronounces them the same.

    5. Re:Come on! by SeventyBang · · Score: 1

      Of course it's the latter.

      If we could only educate the morons who can't get its and it's correct.

      Rule of thumb: "explode" your contractions. It's is a contraction, not a designation of ownership. The same goes for you're, they're, etc. If it doesn't make sense exploded, it won't make sense as a contraction.

      I have yet to figure out why computer people think they are so smart but cannot master simple grammar and punctuation.

    6. Re:Come on! by Anonymous Coward · · Score: 0

      omg wtf lol rofl u r 2 funny.

      need i say more?

    7. Re:Come on! by bdcrazy · · Score: 1

      When someone is reading a sentence and they use the wrong form of its/it's there/they're/their, most people will figure it out. On the other hand, giving a compiler theirfunction is not the same as therefunction, it generates a problem. I'd think that most people want to just get information (or lack thereof) across and if it gets the message across, even if its slightly wrong, thats all that matters.

      --
      Tonights forecast: Dark. Continued dark throughout most of the evening, with some widely-scattered light towards morning
    8. Re:Come on! by Anonymous Coward · · Score: 0
      As a new user of this site, you may not know that many of the people whose grammar and spelling you are criticising are sensitive souls who take enormous offence at such criticisms and try to compensate for their weakness by spending their hard-earned moderation points down-moderating the criticisms that are so hurtful to them, all the while ignoring the most effective remedy of reading the international bestseller about the infamous panda who .

      Panda

    9. Re:Come on! by m50d · · Score: 1

      But as a general rule, it's as a designation of ownership would make sense. John's computer, the computer's mouse, all make sense. Why not it's buttons? Using its in sentences like that is a special case that has to be memorised, because it does not make sense.

      --
      I am trolling
    10. Re:Come on! by ejasons · · Score: 1

      But as a general rule, it's as a designation of ownership would make sense. John's computer, the computer's mouse, all make sense. Why not it's buttons? Using its in sentences like that is a special case that has to be memorised, because it does not make sense.


      Yes, just like hi's, her's, your's, and their's. Oops...
    11. Re:Come on! by m50d · · Score: 1

      Those are still special cases compared to the enormous number of things for which you do use the apostrophe to show possession.

      --
      I am trolling
  15. Cell by should_be_linear · · Score: 0

    This kind of amazing stuff you can expect to see on Cell CPU. And you will not have to hack GPU, it is supported by Linux kernel.

    --
    839*929
  16. Re:Just what I need! by udderly · · Score: 1

    I'd imagine that there are probably more entities than the OP realizes that would need this sort of sorting. I worked in a Fortune 100 retailer, in one of their 600 stores, which had 5000 individual transactions per day, with an average of 18-20 items per transaction. They routinely do data analysis on item movement, during what hours, bought in conjunction with what other items...blah, blah, blah.

    You can do the math, but that's a lot of data at the corporate level.

  17. It's been said... by TobyWong · · Score: 4, Interesting

    ...that the greatest threat to Intel's market domination in the future is not going to come from AMD but from a company such as Nvidia.

    Take a look at what they are doing with their GPUs right now and you can understand why someone would suggest this.

    --
    - Toby
    1. Re:It's been said... by fizze · · Score: 1

      Well, for highly specialised tasks, take a look at Analog Devices http://www.analog.com/ or Texas Instruments http://www.ti.com/.

      They have been producing highly sophisticated cores that left a P4 bite the dust in a lot of cases.

      I have worked on test-bed equipment that used a DSP PCI card that produced more test-data than a dual Xeon system could handle. JFYI.

      GPUs like those from nVidia or ATI are still a lot less sophisticated than those DSPs, or hybrid DSP/uCs.

      Still, in a few years FPGAs or CPLDs will surely be a so called "bigger threat" to (your favorite CPU company's) domination.... ;)

      --
      Powerful is he who overpowers his temptations.
    2. Re:It's been said... by mmp · · Score: 1

      They have been producing highly sophisticated cores that left a P4 bite the dust in a lot of cases.

      I may be wrong, but I'm guessing they don't cost a few hundred dollars each, though, like GPUs do. (The economies of scale that GPU manufacturers see thanks to the fact that they make millions of these things are really quite nice for keeping prices down.)

      GPUs like those from nVidia or ATI are still a lot less sophisticated than those DSPs, or hybrid DSP/uCs.

      I think I disagree. How familiar are you with the architecture of modern GPUs? For example, you do know that modern GPUs have upwards of 300M transistors (which is roughly 2x as many as modern CPUs), that most of them are devoted to computational units, etc.?

      Still, in a few years FPGAs or CPLDs will surely be a so called "bigger threat" to (your favorite CPU company's) domination.... ;)

      I think the FPGA crowd has been predicting that their day will come "real soon now" for quite a long time now. :-)

      -matt

    3. Re:It's been said... by megalomang · · Score: 1

      Wasn't that said by... Jen-Hsun Huang, the Nvidia CEO? I think he was interviewed in Wired, and on the cover it said Nvidia would be the "Intel killer". It's no surprise that the Nvidia CEO would say things that inspire such headlines.

      But wasn't that several years ago, just before Intel increased their market share to well over half the market by selling Intel integrated graphics?

    4. Re:It's been said... by savuporo · · Score: 1

      I may be wrong, but I'm guessing they don't cost a few hundred dollars each, though, like GPUs do Absolutely, they routinely cost from $1 to $100 at most. For instance, take a look at AC motor control hybrid DSP/uC chips by Freescale. Those allow vector control of AC motors by implementing very sophisticated and high-performance algorithms on specialized chip, this kind of stuff was simply impossible just ten years ago.

      --
      http://validator.w3.org/check?uri=http%3A%2F%2Fwww.slashdot.org Errors found while checking this document as HTML5!
    5. Re:It's been said... by mmp · · Score: 1

      And that chip can do 50+ GFLOPS of computation and has 256MB of local memory with 20-30GB/s of bandwidth to that local memory??

      -matt

    6. Re:It's been said... by fizze · · Score: 1
      I may be wrong, but I'm guessing they don't cost a few hundred dollars each, though, like GPUs do. (The economies of scale that GPU manufacturers see thanks to the fact that they make millions of these things are really quite nice for keeping prices down.)
      While GPUs are "only" being used in common graphics cards, which are "only" being used in the PC market, DSPs are used in a much bigger field of electronic appliances.
      Think of measurement equipment, think of the automotive sector, think or aviation, just to name a few.
      They have far bigger production numbers.
      I think I disagree. How familiar are you with the architecture of modern GPUs? For example, you do know that modern GPUs have upwards of 300M transistors (which is roughly 2x as many as modern CPUs), that most of them are devoted to computational units, etc.?
      You are right, but you miss my point. I do not compare GPUs to CPUs.
      And, FYI, transistor count is definetley not a good performance indicator. GPUs are quite similiar to DSPs in many aspects, but they are simply "too new". Those programmable shaders are a step in the right direction, but again, I dont compare them to CPUs.
      Take a 4096 32-bit float discrete fourier transformation (DFT), and tell me how many cycles it'll take on your general purpose CPU, and I'll tell you that it'll take about 8192 on a cheap DSP (30$ish).
      I think the FPGA crowd has been predicting that their day will come "real soon now" for quite a long time now. :-)
      Yeah, right. But still, FPGAs are so much more elegantly when it comes to architecture :)
      --
      Powerful is he who overpowers his temptations.
    7. Re:It's been said... by aurilieus · · Score: 1

      Very rightly said....nvidia is bang on target in its global expansion spree in cost effective countries.They have huge plans of growing from the present 3bn US$ to close to 30-40bn US$ in a very short time frame.. Although most of their work will be concentrated on GPUs and MCP(media and communication processors.They plan on becoming a complete multimedia and graphical solutions provider organisation with a very wide range of products ranging from gaming cards,gaming consoles,multimedia solutions for mobile devices,HDTVs and lots lots more, having GPUs,MCPs ..etc at the heart of it all.I wont be surprised if nvidia gets into a stiff competition with AMD in the very near future..

  18. It is really time... by ratta · · Score: 2, Interesting

    that, since graphic hardware is now completely programmable, ATI and Nvidia realese their specs, so that everyone can use the GPU as a specialized vectorial coprocessor. A CPU that cannot be programmed freely just makes no sense, the same should be for GPU.

    --
    Wondering why i am doing so strange posts? I am trying to get a "+5,Flamebait" or "-1,Insightful" rating.
    1. Re:It is really time... by Xrikcus · · Score: 1

      It's not really a vectorial coprocessor, more suited for large scale stream processing. I'm not sure what extra opening of specs would help in this area though, the programming specifications seem to be quite easily available (people know how to write shaders, afterall) and work has been done on using the GPUs as generalised streaming processors http://graphics.stanford.edu/projects/brookgpu/ind ex.html being a prime example. Within the hardware limitations (of which there are many) they are freely programmable.

    2. Re:It is really time... by mikael · · Score: 1

      The current generations of CPU's are designed to optimise the processing of conditional statements within a pipeline. Since instructions usually take four stages to perform (fetch, read, execute, write) and the location of the current instruction can depend on the condition result (branch on whatever) of the previous instruction. To prevent this from degrading performace, the CPU has to anticipate the effect that both results of a conditional instruction will have on future instructions. Also, there is no dominant data type for a CPU - it may be working with bytes (for web servers, text data), short ints (for text data), float or doubles (for spreadsheets).

      A GPU on the other hand is always processing large batches of vector data (3-float vectors or 4-float colors) using mathematical expressions (vertex and pixel shaders), with conditional statements being used a small percentage of the time. Because of this, there's no real advantage of investing transistor real-estate in anticipating future states.

      --
      Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
  19. GPU has a different architecture by gvc · · Score: 1

    The GPU has lots of separate processing elements that operate independently. It also supports floating-point vector operations. Quite a different animal from yer-average-x86.

    1. Re:GPU has a different architecture by slashflood · · Score: 2, Interesting

      The GPU has lots of separate processing elements that operate independently. It also supports floating-point vector operations. Quite a different animal from yer-average-x86.

      Just like the Cell processor...

    2. Re:GPU has a different architecture by Arthur+B. · · Score: 0

      RMFP please. I didn't claim GPU had nothing superior to x86, I said it's no wonder an additional processor can provide additional computing power. What I said is that this is only a hack but is showed the urge to develop specialized cpu for common tasks other than 3D. I also pointed out the need to have a library use those processors because compilers will not have the ability in the future to optimize code for CPU performing high-level operations such as sorting.

      --
      \u262D = \u5350
  20. GPUs, and Floating Point Numbers General Question by BioCS.Nerd · · Score: 2, Interesting

    This may be slightly offtopic insofar that it doesn't directly deal with the subject (sorting with a GPU) at hand, but I was wondering how these sorts of research projects overcome "floating point number weirdness" I've heard about doing GPU calculations (as per the implementation being non-IEEE)? Doubly, would someone in the know help explain what the aforementioned "weirdness" means?

  21. Re:Just what I need! by SimilarityEngine · · Score: 1

    I'm quite probably talking rubbish but... is hacking the GPU the best possible solution to the problem (i.e. fastest sort possible for the amount of money spent on the hardware)? Or is this just someone having some fun (which is fair enough)?

    Someone please enlighten me!

    --
    Those who can make you believe absurdities can make you commit atrocities. - Voltaire
  22. Probably not for game applications by benhocking · · Score: 2, Interesting

    Most GPGPU (general purpose GPU) researchers are envisioning scientific purposes. It really galls many of us in the scientific community that GPUs are so much more powerful than CPUs (if one can efficiently use the parallel processing capabilities of GPUs), and yet mostly we have to let these powerful processors go unused because it is typically very difficult to use GPUs for non-graphical computations (hence the G in GPU, of course).

    --
    Ben Hocking
    Need a professional organizer?
    1. Re:Probably not for game applications by RootsLINUX · · Score: 1

      By the way, for those interested in GPGPU research/ideas, there's a pretty nice site here: http://www.gpgpu.org. It has some sample code, slides from conferences presentations, a forum, etc. It's a pretty nice site for information. I was interested in GPGPUs a few months ago and read through the material on that site heavily, but in the end I didn't have the time to try anything cool out because you'll need to learn how to use a language like Cg or steam to program your GPGPU.

      --
      Hero of Allacrost, a FOSS RPG for *NIX/*BSD/OS X/Win
    2. Re:Probably not for game applications by Tim+C · · Score: 4, Insightful

      Well, to be fair GPUs are only "so much more powerful than CPUs" if your task is suitable for running on a GPU. If not, then you're better off using the CPU.

      Kind of like how a bulldozer is much more powerful than a hammer, but totally unsuitable to banging a nail in a bit of wood. If you want something torn down or moved about, though...

    3. Re:Probably not for game applications by Threni · · Score: 1

      > http://www.gpgpu.org/

      When did this site last work?

    4. Re:Probably not for game applications by gardyloo · · Score: 1

      Kind of like how a bulldozer is much more powerful than a hammer, but totally unsuitable to banging a nail in a bit of wood. If you want something torn down or moved about, though...

      Is that you, Mr. Stephenson?

  23. precision? by Barbarian · · Score: 2, Interesting

    Are not most GPUs limited to 24 or 32 bit precision? The x86 processors can go to 80 bits.

    1. Re:precision? by Anonymous Coward · · Score: 0

      Currently they are limitted to 24 bits and are not scheduled to go any higher. This is sufficient for game graphics. Researchers are asking for double precision capabilities but i doubt they will appear.

  24. very decent? by mnemonic_ · · Score: 2, Funny

    a very decent nVidia graphics card

    So is that like "average", except "very average"?

    1. Re:very decent? by cide1 · · Score: 1

      it is so very, very OK

      --
      -- the computer doesn't want any beer, no matter how much you think it does. NEVER, EVER feed your computer beer.
  25. Re:Just what I need! by grazzy · · Score: 1

    Probably not, however if a GPU is available and you cna do your sort operation on it, and the sort operation would take long enough to hog the main CPU + that it wont occupy busses etc to much, you might end up gaining from it i suppose.

    (Given that the sort procedure is something that CAN be run in the background whilst the user is doing something else).

    I highly doubt that companies like Oracle or Google would benefit from using GPUS in large scale instead of CPUS...

  26. Re:GPUs, and Floating Point Numbers General Questi by slashflood · · Score: 3, Insightful


    I've heard about doing GPU calculations (as per the implementation being non-IEEE)? Doubly, would someone in the know help explain what the aforementioned "weirdness" means?

    There are just some functions missing, like different roundings and the missing double precision. GPUs are simply not optimized for scientific calculations, but that doesn't mean that they can't be programmed to be useful for those workloads.

  27. Let's get an Nvidia fanboy response. by Hal_Porter · · Score: 0

    Why I am really, really tempted to post this on an NVidia fansite with the caption

    LOL NVidia suxxorz ATI ownz j00!

    --
    echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
    1. Re:Let's get an Nvidia fanboy response. by Slashcrap · · Score: 1

      Why I am really, really tempted to post this on an NVidia fansite with the caption

      LOL NVidia suxxorz ATI ownz j00!


      Yes, it's absolutely shocking to see that a processor with 24bit precision can run an algorithm faster than one with 32bit precision.

      If they were running an algorithm that actually required 32bit precision the results would be rather different.

      I realise that this is something that will go over the heads of fanboys everywhere. And judging by some of the comments to this story, 99% of Slashdot readers.

    2. Re:Let's get an Nvidia fanboy response. by Hal_Porter · · Score: 1

      Actually I was joking about fanboys and meaningless benchmarks.

      E.g. when game x comes out which favours Nvidia, all the Nvidia crow about it. When game y comes out that favours ATI, all the ATI fan boys crow about it.

      But when you look at it, most of the time they are talking about +20% on framerates that are already twice the refresh rate. It's completely artificial - in the FPS benchmark they turn off the Vsync to be a more informative test, but not in the game. In the game, the drawing code will wait for a Vsync before drawing a new frame, so both cards will end up rendering the same frame rate when you actually play the game.

      The only time it made any difference was with Doom 3, which would drop beneath my subjective acceptable frame rate on ATI, and run above it on Nvidia. Most of the time, once things got hectic NVidia cards would be jerky too. But who actually plays Doom 3 anyway - it was just a tech demo.

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
    3. Re:Let's get an Nvidia fanboy response. by Anonymous Coward · · Score: 0

      You must be 16 years old.

    4. Re:Let's get an Nvidia fanboy response. by default+luser · · Score: 1, Informative

      But when you look at it, most of the time they are talking about +20% on framerates that are already twice the refresh rate. It's completely artificial - in the FPS benchmark they turn off the Vsync to be a more informative test, but not in the game. In the game, the drawing code will wait for a Vsync before drawing a new frame, so both cards will end up rendering the same frame rate when you actually play the game.

      Not true. Both Nvidia and ATIs' driver sets default profiles have force v-sync disabled. Most games don't have it enabled by default, and few offer an option to enable it.

      What's the difference? With v-sync enabled, anytime the video card needs longer than one sync period to draw a frame, the video card displays the same frame as the last one. This, to the end user, looks like your framerate literally dropped in half, and feels jerky because it usually happens intermittently.

      With v-sync disabled, the video card always draws the next frame, even if the drawing is still in progress. This means if you suddenly have to take a little longer to render a frame, you will get a torn frame, but at least the update rate to the screen will be consistent.

      So, why do people want higher framerates than their monitor's refresh rate can handle?

      * If you're using v-sync, you want to make sure the minimum framerate never falls much below you monitor's refresh rate, or risk the display feeling randomly sluggish during frames with significantly higher processing requirements.

      * If you have v-sync disabled, the less often the minimum framerate drops below the refresh rate of your monitor, the less tearing you encounter. Of course, tearing is not as noticeable as you would think. You tend to notice tearing only on vertical polygon edges, and then only when the polygon has moved significantly in your view between frames. That's why this is typically the default in most games...it SOUNDS like it should look bad, but it rarely does.

      It's basically a user preference: inconsistent sluggishness, or inconsistent bouts of tearing, but they both benefit from having an AVERAGE framerate much higher than the 85Hz refresh rate of your monitor.

      --

      Man is the animal that laughs.
      And occasionally whores for Karma.

  28. Re:Check out Intel and ATI's misdeeds & lawsui by Anonymous Coward · · Score: 1

    That is a sly page advertising http://www.lawyersandsettlements.com/ , ignore this asshole.

  29. Re:Just what I need! by Hal_Porter · · Score: 1

    I think it's just fun.

    Probably the reasons why GPU's are fast for the task they are designed for - a small amount of very fast (assuming you access in the right order) non paged memory and a very simple (no Out of order execution) but highly parallel processor make them bad at general purpose stuff.

    On the other hand, I can imagine that you could build a sort coprocesseor in an FPGA - the fact that it was optimised for the algorithm might be able to outweigh the fact that FPGA implementations of a given piece of hardware will have a lower clock rate than customer silicon ones.

    But no one seems to be doing it, and there's probably a good reason for that. Maybe in SSE5 or something. You have to wonder if the cost in chip real estate for a big chunk of programmable logic is going to bring more benefits than more cache, or ALUs and so on.

    --
    echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
  30. Re:Just what I need! by daikokatana · · Score: 1
    I'm working on some software to do image comparisons the same way as human beings would do it. In other words, you feed the computer two pictures, one of your girlfriend walking around in Paris, another of your girlfriend sitting on a chair in your home and it will recognize both pictures as containing your girlfriend.

    Anyway, the application does A LOT of sorting, I'd give an arm and a leg for a solution like this. If there was some way to ship all this sorting to the graphics card, I'd do it! So now there is, I'm more than wide awake :)

    --
    http://jcsnippets.atspace.com/ - a collection of Java & C# snippets
  31. Wow by CmdrGravy · · Score: 1

    Seriously, like who here hasn't already done this themselves anyway. At least it's not a dupe I suppose !

    1. Re:Wow by SeventyBang · · Score: 1


      {perk} Dupe?

      Look at the editor for this story - Timothy - the dupe from last night. Both stories on the front page

      On top of that, the story has then instead of than.

      'spose he's on day 6 of a 15-day meth binge?

  32. Forgive the stupid question, but....er.....why? by Anonymous Coward · · Score: 0

    While I'm sure it's an interesting and difficult programming challenge, it seems like the notion here is taking a specialized piece of hardware that's developed for one specific purpose (high-speed rendering of graphics) and using it for some other purpose (calculation)

    While it's laudible that this has been tried and succeeded, is this anything more than the equivalent of someone porting linux to a calculator watch? Sure, it's hard, and impressive you were able to do it, but is it something that anyone will actually use?

    If someone wants more sorting power on a box, is this really a viable/useful/most efficient way to get those additional cycles, over options like adding/upgrading existing processors, etc.?

    1. Re:Forgive the stupid question, but....er.....why? by graphicsguy · · Score: 2, Insightful

      It is important because sorting is an important algorithmic building block for solving many problems. If you want to solve important problems entirely on the GPU (which has many more GFLOPS available than the CPU, if you can manage to use them efficiently), then you need to be able to do the sorting component of those larger algorithms.

  33. Re:GPUs, and Floating Point Numbers General Questi by graphicsguy · · Score: 3, Informative

    It is true. The GPU floating point implementations are not IEEE compliant. There is an interesting study on the floating point behavoir of the GPU here (and the associated pdf)

  34. I like that analogy... by benhocking · · Score: 1

    Next time I need to hammer a bunch of nails in parallel, perhaps I'll consider using a bulldozer! :D

    Your point is well made, of course. Nevertheless, with work like this (the GPU sorter) even off-loading a little work to the GPU can allow your CPU to do other work thereby shortening the wall-clock time required to do your computations (in theory - in practice, it can hurt you if you're not careful!). My research area (neural networks) is inherently parallelizable, but I am not yet aware of work to efficiently use GPUs for this purpose.

    --
    Ben Hocking
    Need a professional organizer?
    1. Re:I like that analogy... by Anonymous Coward · · Score: 0

      I ported the fann fixed point library to run on my 6600GT. Its only about 3x faster than my CPU (3200+ Athlon).

      Unfortunately, moving data around is too slow for my uses, but others might be able to do something with it.

  35. The algorithm is called "Bitonic sort" by lonedroid · · Score: 2, Informative

    This is an implementation of the Bitonic sort.

    From the article, when comparing their sort to previous GPU sort: "These algorithms also implement Bitonic Sort on the GPU for 16/32-bit floats using the programmable pipeline of GPUs."

    So as I understand it they made a very fast implementation (using the GPU) of an old algorithm suited to parallel processing: bitonic sort was published in 1968 (hey, where were the fast parallel processors in the late sixties ;)

  36. Radix is O(n)? Nice try by hypnagogue · · Score: 1

    For general purpose sorting, the number of passes in radix sort is log n (e.g. the number of digits of a decimal number is ceil(log10(number) ), so it is also O(n log n). It's only an optimization in sparse sets.

    --
    Liberty you never use is liberty you lose.
  37. Separate the GPU from the video output? by ebcdic · · Score: 1

    Perhaps the time has come for separate GPU and video output cards. The video output card could be fairly dumb, perhaps on the motherboard, and the GPU card optional and usable even in servers without a video card.

    1. Re:Separate the GPU from the video output? by Wesley+Felter · · Score: 1

      And then you need a multi-GB/s connection between the GPU and the DAC, which isn't free.

    2. Re:Separate the GPU from the video output? by Jack+Johnson · · Score: 1

      With digital displays becoming the norm. Won't the DAC be a thing of the past before long anyway?

    3. Re:Separate the GPU from the video output? by Wesley+Felter · · Score: 1

      OK, so you need a multi-GB/s connection between the GPU and the DVI transmitter, which still isn't free.

    4. Re:Separate the GPU from the video output? by Jack+Johnson · · Score: 1
      OK, so you need a multi-GB/s connection between the GPU and the DVI transmitter, which still isn't free.

      Not free but negligible. Budget boards pull of integrated, accelerated video while keeping to low price points already. Sapphire will be releasing a Radeon Xpress integrated video + integrated DVI solution shortly.

  38. More specifically by gr8_phk · · Score: 1
    "The implementation can handle both 16 and 32-bit floats."

    So it's "hard coded" for a couple types. The standard qsort has you pass a function pointer to a comparison routine so it can sort anything. Standard qsort also sorts a list of pointers to the items - I bet this GPU sort works directly on the data. Implementing a less generic version on the CPU is likely to result in it being faster than the GPU sort.

    The blurb at the end about increasing GPU speed with each generation is crap too. Both CPU and GPU performance are now limited by power dissipation issues.

    It remains more effective to write stuff like this on the CPU rather than code for 2 different devices. Imposing the same limitations on generic C code will usually result in the same or better performance on the CPU. These "general purpose GPU" programs always seem to illustrate why the GPU is not general purpose.

    1. Re:More specifically by cbreaker · · Score: 1

      "Both CPU and GPU performance are now limited by power dissipation issues."

      Somewhat.

      There's a lot more room for growth in the GPU arena then there currently is in the CPU. They're only running at about 500Mhz right now, and they can add more processing pipelines and other enhancements currently found in the CPU, as well as shrinking the manufacturing process that will allow higher Mhz and more processing power.

      The present-day GPU is very advanced, but they're still behind modern CPU's in terms of manufacturing processes by a generation.

      I guess the whole idea around doing something like this is the old "why not" and also the ability to perform these functions on large amounts of data without tying up the CPU.

      --
      - It's not the Macs I hate. It's Digg users. -
    2. Re:More specifically by Mr.+Mikey · · Score: 1

      "Implementing a less generic version on the CPU is likely to result in it being faster than the GPU sort."

      How do you make a meaningful statement concerning the speed of a "less generic" algorithm running on a general-purpose CPU vs. a "generic" algorithm running on a piece of special-purpose hardware (GPU)?

      Methinks your statement carries with it a load of unstated assumptions... care to spell them out for us?

      --
      wants to be the first monkey to touch the monolith
    3. Re:More specifically by gr8_phk · · Score: 1
      "Methinks your statement carries with it a load of unstated assumptions... care to spell them out for us?"

      What I meant is that the GPU version is NOT generic like the qsort() CPU version. By a "less generic" version on the CPU I meant one that is optimized for a specific data type like "float". The standard quicksort does not sort a bunch of numbers, it sorts a list of pointers to things (like structs for example) when you provide it a pointer to a compare function that compares the things.

      One can easily increase the performance of qsort by removing the generic function call to a compare funciton, and replacing it with a straight compare. That alone should make it run quite a bit faster (probably more than the 2x it's trailing in this comparison).

      If the GPU sort is sorting actual data rather than pointers to data, it is even less generic than I assumed and the CPU version could be sped up again.

      So my big assumption is that the GPU sort they implemented does not work on generic structures.

      If it does sort pointers to structures (making an ass of me), and is only limited in the actual compare operation (i.e. no strings in the structures) then the CPU version can still be sped up by hard coding the offset (or passing it in) to a numeric element in a structure and removing the generic function call. The significance of this in terms of performance should not be underestimated, and would not be unfair since the GPU version has a similar restriction.

      Optimizing for a particular case on a piece of hardware and then comparing it to unoptimized code on another piece of hardware is not a fair comparison.

      To me, all they've done is show that the GPU really isn't as general purpose as the fanboys keep telling us. They didn't implement a generic sort because it was too hard. Even doing what they did is fairly tricky and required writing new code rather than running the same code on both platforms.

    4. Re:More specifically by Woody77 · · Score: 1

      I guess the whole idea around doing something like this is the old "why not"

      I think this is the group that my wife's cousin was working with when at school (physics graduate student). Now he's working a summer internship at nVidia.

      But they were doing a LOT of highpower physics computations that were designed to be vector-based algorithms that could be highly parallelizable, and then using nVidia or ATI graphics cards to do the heavy lifting, as even an $800-1000 card is a lot cheaper than a supercomputer.

      Not as fast as a supercomputer, either, but a lot faster than CPU for vector math, especially highly parallelizable operations.

    5. Re:More specifically by Anonymous Coward · · Score: 0

      These "general purpose GPU" programs always seem to illustrate why the GPU is not general purpose.

      Keep in mind that many of the GPU algorithms that have been created are meant to show how a streaming processor could be more efficient for tasks than a CPU. As someone who has written a 3d fluid simulator on the GPU I know that it is not easy to write alogorithms, or debug them on the GPU, but that does not mean they shouldn't be explored.
      As far as the speed issue goes. GPU matrix multiplication has succeeded in surpassing the speed of the Atlas library's implementation. Which is one of the fastest, if not fastest, CPU implementation. So there are general purpose apps where the GPU can be more efficient.

  39. Why qsort? by ringm000 · · Score: 2, Informative

    Comparing with C qsort is strange, qsort will never work fast due to being unable to inline the comparison function. Hand-written qsort of floats should work much faster.

    1. Re:Why qsort? by fred+fleenblat · · Score: 1

      The inlining issue is critical. Last time I profiled a sort function there was a solid 20-30% gain.

    2. Re:Why qsort? by Anonymous Coward · · Score: 0

      Nope. You'll get more function-call & other overhead from setting up the sort in the GPU, so this is completely irrelevant.

    3. Re:Why qsort? by yeremein · · Score: 1

      Nope. You'll get more function-call & other overhead from setting up the sort in the GPU, so this is completely irrelevant.

      Is there the same overhead per comparison? std::sort() performs much faster than qsort because it avoids the overhead of a function call on every comparison. With a deep-pipelined processor like a P4 Prescott, there's real savings to be found there.

    4. Re:Why qsort? by Anonymous Coward · · Score: 0

      You're right, I misread the original comment, and jumped the gun.

  40. Anyone is going to ..... by samxiao · · Score: 0

    put Linux on it? man, i want to run Debian Sarge on it :D

  41. CS at liberal arts universities? by sailor420 · · Score: 1

    Computer science research being done at a liberal arts university? The universe is going to implode, SCO will run IBM out of business, Google will start selling our life history to terrorists, and George Lucas will make a movie that doesn't suck!

    Disclaimer: I'm a tarheel born and bred. And a student at Chapel Hill.

    1. Re:CS at liberal arts universities? by doktor-hladnjak · · Score: 1
      UNC is most definitely NOT a liberal arts university. Although there is no real engineering program, there is a lot of natural science and medical research going on. Plus, there's the business, law, journalism and a myriad of other more professional schools.

      The CS department here is a top 20 research-based program with a very strong focus in the computer graphics and imaging areas.

    2. Re:CS at liberal arts universities? by sailor420 · · Score: 1

      Oh, I know the CS program at UNC is good. Great, even. I was admitted as a CS major there (here?) before I changed, realizing that I didn't want my job to be exported to India.

    3. Re:CS at liberal arts universities? by cant_get_a_good_nick · · Score: 1

      UNC Chapel Hill (besides being home of the Dean Dome) has one of the nations, possibly the world's best visualization labs. The fact they were doing GPU research is not a surprise to me.

      I have some pride that my otherwise crappy school (UIC) is also at the relative forefront of visualization as well.

  42. Damn calculus, GPUs need to get into 3rd dimension by Anonymous Coward · · Score: 1, Funny

    I don't care about any general purpose computing with GPU.

    What I want is a CPGA card, that is a Color Porno-Graphic Adapter, which accelerates drawing spherical triangles (aka Bolyai-Lobatschevskiy geometry).

    That way we can get decent-looking boobs and asses in the upcoming Lara Croft game, composed from a few dozen spherical triangles instead of thousands of flat ones. That will result in 1000+ fps at 1600x1200 resolution.

    I am truly fed up with those homonculus-looking bitches in GTA San Andreas. They are as desirable as Frankeinstein or an ork.

  43. Numerical sorting by RealProgrammer · · Score: 1

    I was working on a sorting algorithm based on curve fitting. It's not a comparison sort, but more like a radix sort, but not one.

    You start with a bunch of numbers, or things that can be given a number. Loop through the numbers, keeping track of High, Low, total, mean, std dev, etc. Use that information to develop an interpolating curve, which tells you for a given value where it ought to end up.

    Put the Low at the front and the High at the back. On the next pass throught the numbers, put the number you swapped with Low about where it goes. It replaces something; put that where it goes, and so on.

    I'm kind of stuck about what to do with collisions. It's led me into different directions, such as whether to segment and recur with a linear interpolant or to use several thresholds and group the numbers, making a higher order polynomial.

    I decided to waste time on Slashdot, instead. I'm such a loser.

    --
    sigs, as if you care.
  44. LAME GPU by Doc+Ruby · · Score: 1

    I can't wait for someone to port an MP3 encoder to GPGPU. A $300 P3, stuffed with 5 $200 videocards, could host a pool of MP3 encoders, faster than 10 Xeons which would cost $30k. And a lot easier to administer in a streaming farm.

    --

    --
    make install -not war

    1. Re:LAME GPU by Neuroelectronic · · Score: 0

      how would you feed it uncompressed data fast enough?

    2. Re:LAME GPU by Doc+Ruby · · Score: 1

      AGP (1x) does 254.3MBps, PCI (v1.0) does 127.2MBps. CD data is 172.4KBps. So PCI does 738 CD streams, while AGP does 1475. At 128Kbps, each stream IO is 192.4KBps duplex, so AGP: 1322 encoder streams, PCI 661 encoder streams. A P4/3GHz does 6GFLOPS, including host apps (OS, etc), while a $115 GeForce FX5900 does 20GFLOPS; a $470 GeForce 6800 Ultra does 40GFLOPS. Even at the slowest AGP and PCI speeds, and the fastest theoretical GPGPU speeds, the GPU is still slower than the bus, and 5 of those cards in PCI would have to do over 100 streams each to fill the bus, even leaving 25% PCI capacity for the host P4. That's over 500 streams for about $1000, which can't be done any other way I know of.

      --

      --
      make install -not war

    3. Re:LAME GPU by BiggerIsBetter · · Score: 1

      Heh. You still going on about this? ;-) I've had a think and I'm pretty sure it's doable...

      --
      Forget thrust, drag, lift and weight. Airplanes fly because of money.
    4. Re:LAME GPU by Neuroelectronic · · Score: 0

      OK "the bus will feed it" was not an answer. what feeds the bus? I don't think you can connect 1475 CD-ROMS to a PS3. besides there isn't 5 gpu's in a PS3.

    5. Re:LAME GPU by Doc+Ruby · · Score: 1

      I'm not going to go down the signal chain, with you constantly complaining that there's a bottleneck somewhere upstream. You didn't even read my post well enough to know that I never mentioned a PS3 - you made that up. So, where's the bottleneck, smart guy?

      Before you get all heated up, just consider that right now, a 6GFLOPS P4/3GHz PC (without GPGPU) bottlenecks the P4 in the intensive MP3 compression code. Make the P4 just feed the bus, from memory, flash or disk, and the bottleneck might still be the P4. But that bottleneck will be much faster than the P4 compressing by itself. So "the bus is not the bottleneck" isn't good enough. Where is the bottleneck, that makes multiple GPUs less effective than the P4?

      --

      --
      make install -not war

  45. Re:Just what I need! by chemistry · · Score: 0

    I do.

  46. Re:Radix is O(n)? Nice try by agbinfo · · Score: 1
    But the benchmark is not against any set. It's against 16 and 32 bit floats.

    We don't know the memory requirements for the GPUSort (this should be provided when comparing sort algorithms). We don't know if there is a limit to the amount of keys it can sort (what if the data doesn't fit in the GPU memory). Does the algorithm keep the original order when the key is the same?

    In the case we are interested in, with enough memory radix sort is O(n) in speed so it would beat qsort and GPUSort for a large enough set of data. If we had been talking about larger keys, we don't know how GPUSort would fair either.

  47. Re:Just what I need! by Anonymous Coward · · Score: 0

    Will it detect this "mythical girlfriend" if she's naked? Or was this an illusion drawn up by your fevered mind?

    this is /. always rememeber that.

    - moomin

  48. My professor was doing this by yellowcord · · Score: 2, Interesting

    a couple of years ago. Nvidia gave him a FX5900 (I think. It was the one that sounded like a hair drier and got pulled from the market) to do his research with. Anyways, check out his papers on the subject.

  49. The point? by Pedrito · · Score: 2, Interesting

    I don't really see the point. I mean, yes, it's kind of neat, but that's about it. It serves no practical purpose, really.

    Not every machine has a GPU. I don't know, but I suspect GPUs aren't terribly compatible with each other, so for any sort of market, you'd have to code for multiple GPU types.

    The fact is, co-processors have been around since the early x86 CPUs. Not just math co-processors, but Intel used to cell i860 and i960 co-processor boards (some with multiple CPUs) for just this sort of thing.

    I'd also suspect that if your GPU is being used for sorting or some other calculation intensive operation, it's less useful as a GPU. If you don't need the GPU as a GPU, but you need the computing power, I suspect spending the additional money on more CPU power instead, is going to have a bigger overall payoff since it's going to speed up every operation, not just the ones a GPU can be adapted to.

    So, again, I don't really see the point. Really, if you need specialized co-processing, getting a FPGA board will probably be a much better use of your money since it can be customized for many more tasks than a GPU.

    1. Re:The point? by Anonymous Coward · · Score: 0

      >>Not every machine has a GPU. I don't know, but I suspect GPUs aren't terribly compatible with each other, so for any sort of market, you'd have to code for multiple GPU types.

      Wrong. They are for the most part compatible with each other and are becoming more and more compatible with each driver release from the two vendors. For example: ARB_texture_float (vs. NV_float_buffer and ATI_texture_float) ARB_texture_rectangle (vs. NV_float_buffer), NPOT's being part of the OpenGL 2.0 standard, and EXT_framebuffer_object.

      >>but you but you need the computing powerI'd also suspect that if your GPU is being used for sorting or some other calculation intensive operation, it's less useful as a GPU. If you don't need the GPU as a GPU, but you need the computing power, I suspect spending the additional money on more CPU power instead, is going to have a bigger overall payoff since it's going to speed up every operation, not just the ones a GPU can be adapted to.

      1. GPUs are cheap. The equivalent GFLOPS on CPUs would be more expensive.
      2. CPUs aren't good for all algorithms. Any algorithm which naturally fits into the stream programming model will benefit greatly from being implemented on a GPU (for example image processing filters).

      >>So, again, I don't really see the point. Really, if you need specialized co-processing, getting a FPGA board will probably be a much better use of your money since it can be customized for many more tasks than a GPU.

      How many GFLOPs per $$$?

    2. Re:The point? by Chirs · · Score: 1

      The point is that many machines come with a GPU as default hardware, even if they're never going to do 3D graphics work. If you can make use of something that's *already* on your machine, it's a net gain.

      Also, FPGA boards are not necessarily cheap, while GPUs can be obtained for reasonable amounts of money and have very high performance if your problem can be coded to run on them.

    3. Re:The point? by cant_get_a_good_nick · · Score: 1

      as far as generally available co-processors, GPUs are by far the most common. therefore you get general availability on a number of machines, and economies of scale to lower the price somewhat.

      Though this particular implementation (sorting) probably won't revolutionize the world, it can be seen as a step, a way of learning how to use the GPU in a more general fashion. Maybe we should see this more as "Hello World" in a new programming language on a new system rather than an end in itself.

    4. Re:The point? by Jack+Johnson · · Score: 1
      Not every machine has a GPU.

      However, many, if not most do, probably more than you expect. Run of the mill Dells have been incorporating low end Radeon and Geforce chips for several years.

      I don't know, but I suspect GPUs aren't terribly compatible with each other, so for any sort of market, you'd have to code for multiple GPU types.

      I would imagine that they are not really that different. After all, they conform to a particular feature set or directx/opengl revision. Games don't have to be recompiled to run on different cards as long as a driver is available.

      Why wouldn't the same be true other programs?

  50. Why nVidia? by Mind+Mage · · Score: 0

    "requirements seem pretty hefty (a very decent nVidia graphics card and the latest nVidia drivers)."

    But in the first graph, it clearly shows the less expensive ATI card beating nVidia. I also don't understand why the latest nVidia drivers are a hefty requirement.

    1. Re:Why nVidia? by Anonymous Coward · · Score: 1, Informative

      Check out the data precision. ATI systems are limited to 24 bits but nVidia implements 32 bits (per the IEEE spec).

      It's no surprise that the ATI GPU is faster when it processes less data.

  51. However are they reliable enough? by TheLink · · Score: 2, Interesting

    Are GPUs as reliable as CPUs for these tasks? I mean if there's an error in a GPU you might get a brief rendering glitch and lots of people may not notice.

    In fact, some people are actually talking about ways of testing and allowing slightly broken graphics hardware to be used in situations where the _visual_ results won't be that off.

    --
  52. Beowolf cluster of Smalltalk processes by crovira · · Score: 1

    You could run a large simulation as a neural net in distributed Smalltalk, taking advantage of the inherent parallelizability (if such a word exists?) of the actions of a neuron firing when inputs hit certain critical tresholds.

    Basically, it the same structure I would use when doing a terrain simulation using finite state automata. The wider you can spread the computational net, the more computational fish you can catch.

    Its not the object that is so complicated, (a neuron is, uh, stupidly simple [despite its exquisitely complex bio-chemical processes], and can be likened to a wet on-off swich,) its the Relationship and Connections to other objects that become extremely rich.

    Luckily connections are existentatial in nature and either, to wax Shakespearian, __be__ (and can request the objects objects connected to acknowledge each other,) or __not be__ (in which case the only thing that can happen is that a connection be created between the objects.)

    One of the objects can itself be a connection which means that Relationships are recursive.

    --
    MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
  53. Quicksort versus HeapSort by tjwhaynes · · Score: 2, Informative
    in the worst case, quicksort has Q*O(n^2) behaviour

    Quicksort exhibits n-squared behaviour when the data set is sorted but reversed. Most decent quicksort routines do a random shuffle of the data at the start to avoid this issue.

    Choosing a sort for a particular situation is very much a matter of choosing. A shell sort is very fast for small data sets - it's quick to implement and follows C*O(n^1.27), where C is small compared to P or Q in your example. With mostly sorted data sets, the choice to sort algorithm is even more tuning sensitive - consider depth-sorting vertices in a rotating model - most of the data isn't moving around very fast and is exchanging a few places up and down the list. Sometimes the more trivial and naive sort works better than quicksort even for large mostly-sorted sets.

    Cheers,
    Toby Haynes

    --
    Anything I post is strictly my own thoughts and doesn't necessarily have anything to do with the opinions of IBM.
    1. Re:Quicksort versus HeapSort by Anonymous Coward · · Score: 0

      Quicksort exhibits n-squared behaviour when the data set is sorted but reversed.

      Actually, standard Quicksort also exhibits O(n^2) behavior when the data set is already sorted in the correct order, not just sorted in reverse order.

      And with that, that nit is picked.

      (Oh, and in addition to random shuffling, there's also a deterministic O(n) algorithm to find the median element of a set that can be used to avoid these conditions as well.)

  54. Re:Just what I need! by Kontinuum · · Score: 1

    In pattern recognition circles, there's a very simple alogorithm known as the k-nearest neighbors classifier. First you start with a big database of information. Let's say you have a database of a million bank loan applicants in the past, and for each one of those applicants, you have several variables: say, income, age, amount of loan, length of loan, total household income, etc. You also have final information as to whether that loan applicant ended up defaulting on their loan. Now, you have a new applicant come in, and you want to predict, are they a safe loan? The k-nearest neighbor algorithm says: just compute the distance (you have a lot of freedom as to how you define distance) between this new applicant and the million people in your database. For example, you can just take the sum of squares difference between all the variables. Now, take those million distances you've computed, sort them from nearest to farthest, and take the k closest instances, and basically have them vote on whether or not this new person will default. If k=500, and the 453 nearest neighbors all defaulted, then you suspect the new person will also default.

    Sorting a million entries is not hard, but doing it quickly gets a bit difficult. If you want to get even fancier, you can scale the units along each dimension before you compute the distance, to give more weight to certain parameters (feature weighting, or wrapper bias, in the pattern recognition parlance). To do that really really well, you have to do an optimization. Because this is fraught with local minima, this requires either a lot of restarts or stochastic algorithms. Either way, you end up evaluating, and doing the million-entry sort, many hundreds or thousands of times.

    Now, there are really fast ways of sorting out the k-nearest neighbors (KD-trees, for instance), and there are ways of pruning those million training points down, but this is just an example of a problem that involves colossal sorting. This algorithm is considered basic in that its very easy to understand and implement, and often performs very well, regardless of the structure of the data. However, just because its "basic" doesn't mean that it isn't used very heavily in real-life data mining situations.

    Hope that sheds some light on the subject! Anyone else have another giant sorting application to share?

  55. Re:Radix is O(n)? Nice try by cain · · Score: 2, Informative
    The number of passes in radix sort is not dependent on the number of items in the list, n. Therefore the number of passes is not log n. The number of passes in radix sort in based on the length of the key being sorted. And for most data is a finite number, k. Therefore the complexity of radix sort is O(kn), where k is finite and not related to n at all, and thus the complexity is O(n). I think your rationale assumes that the length of the key you are sorting can be infinite.

  56. FUI: Why do I need to sort? by Albert+Sandberg · · Score: 1

    Except if just want to turn off zbuffer, when you draw transperant objects, you want to draw them back to front, because the end result will be errorous if you don't.

    Perhaps there are other tasks, but that's the most common I suppose, and the one effect I've been using while coding 3d..

    Albert

  57. FBOs in Linux? by Anonymous Coward · · Score: 0

    This might be off topic, but about a month ago
    (after i was sure Frame Buffer Objects were implemented in my windows drivers) I tried
    FBOs in Linux and they did not seem to be there.

    Have you actually used FBOs in Linux, if so I would be really interested to know with which cards?

    As far as I know ATI still does not have them for windows (for Linux? ha), and OSX does not have support for them for any card. I am hoping something has changed since I last tried.

    I currently have a cheap GF 6 series, which I bought pretty much just for using FBOs in Linux.

  58. And the cycle cycles round again... by Anonymous Coward · · Score: 0

    when PCs start having the "GPU" on a separate card from the graphics card. The next step would be to have a GPU socket on the motherboard, just like in the old days of FPU co-processors. Soon after that, the CPU makers will start incorporating the GPU into the CPU again (just like the FPU was integrated in the 486 generation and later). And we have another complete cycle of functionality being first taken out of the main CPU and done in specialised hardware, and then being integrated back into the CPU...

  59. Fucktarded Moderators by Anonymous Coward · · Score: 0

    moderators are afraid of criticism.. instead of doing the right thing, they censor facts

  60. So in other words... by Anonymous Coward · · Score: 1, Insightful

    In high flow operations like sorting, memory latency and bandwidth matters.
    Woh.

    What seems unfair to me is comparing 24bit accuracy with 32bit. Yea, you can do it faster, but you're sacrificing a lot of the usefulness of doing it!

  61. Grammar nazi by Anonymous Coward · · Score: 0

    Does Slashdot have editors?

    "slower then..."
    "Apparantly..."

    This is an epidemic. You may not think it's important, but it is. Please try to use whichever language you choose properly. That way your mistakes don't distract from the message, and you sound like you're actually smart. Notice how I used both "your" and "you're" correctly there? Amazing!

    And editors, please actually EDIT submissions.

  62. Saskatchewan by Anonymous Coward · · Score: 1, Informative

    GPUsorting by Callele Neufeld & DeLathouwer at U of Saskatchewan - paper, videos & 2003 PPT.

  63. hmmmm... $250 aint so bad... by iamhassi · · Score: 1
    "Our results on the ATI X800 XT and NVIDIA GeForce 6800 GPUs indicate better performance in comparison to the qsort routine on a 3.4 GHz Pentium IV PC....Note that in both cases, GPUSort performs considerably faster."

    suddenly paying $250 for a video card doesnt seem like such a bad deal...

    Looks like we have a new videocard benchmark too.

    --
    my karma will be here long after I'm gone
  64. Re:What about other sorts? HP2000C by Anonymous Coward · · Score: 1, Interesting

    Back in the day -- 1975 -- my first hack in college was to write a program to sort a collection of data by creating a file which had "line numbers" created in the same numerical as the data to be sorted. I then told the HP2000C Basic that this was a "program" and it dutifully sorted the "program" by its "line numbers."

    This was at least 10 times faster than a Quick or Heapsort in Basic.

    An unfortunate side effect was that the entire time sharing system went down to focus on completing this privileged task before swapping out for anyone else. Heh.

    Evergreen State College System Admin Chas. Douglass was not happy with this sorting algorithm... but then we discovered what happened when we said Mat X[20000]=0...

  65. Re:Just what I need! by daikokatana · · Score: 1
    One of the tests I ran was a model shoot of a friend photographer. Clothed, nude, close-ups, what-have-you: my software was able to find all 100 images amidst 10.000s of others.

    A shorter answer to your question would be "yes" :)

    --
    http://jcsnippets.atspace.com/ - a collection of Java & C# snippets
  66. its vs. it's by bodrell · · Score: 1
    Rule of thumb: "explode" your contractions. It's is a contraction, not a designation of ownership. The same goes for you're, they're, etc. If it doesn't make sense exploded, it won't make sense as a contraction.

    I do know and understand the difference between its and it's, but your "rule of thumb" is retarded in this case. For all other nouns (not pronouns), an apostrophe-s signifies possession. It's the only remaining case-ending (well, two if you count plural), which is the genitive case. Latin and Greek nerds correct me if I'm mistaken.

    So I think it's perfectly reasonable for someone to assume that "it's" is both a contraction for "it is" and a possessive of "it" at the same time. "His" is a more obvious exceptions to the apostrophe-s ending, although I can imagine someone incorrectly writing "her's" instead of "hers." If I was talking about Sally's cat, falsely expanding "Sally's" into "Sally is" would just cause confusion. Bottom line, if you meant to write a possessive using the standard English apostrophe-s, then there is no contraction to "explode" as you put it. The same would apply to "your's."

    My personal pet peeve (although mistakes with you're, they're etc. are indeed annoying) is when people use an apostrophe for plurals! I once saw a sign in a dinner advertising a special on "ham and egg's." Blech.

    --
    Si la vida me da palo, yo la voy a soportar Si la vida me da palo, yo la voy a espabilar
  67. On further investigation by gr8_phk · · Score: 1
    I just read more of their documentation. They sort an array of (key, pointer) pairs, where the pointer is to the rest of a record. Kind of like takeing structures and pulling the key value out into the tuple. This makes their sort more useful than I originally thought, but it is still not the same as qsort even when just using it for floats. The graphs don't specify which GPUsort they used (they do have one that isn't even a tuple version).

    Implementing a similar "tuple sort" on a CPU with the same restrictions would be much faster than generic qsort. They claim the Intel C++ qsort is optimized using hyper threading and SSE instructions (they don't specify so I assume it's just a compiler setting), but I don't see that offering the same advantages of restricting it to a "tuple sort".

    Theirs is more useful than I first thought, but I still think the CPU should be able to match it by imposing similar restrictions.

    BTW, I'm also a strong believer in heapsort vs quicksort. Most comparisons I've seen show quicksort about 1.5x to 2x faster, but they also omit some optimizations of heapsort that I use. Heapsort is guaranteed to be O(n*log(n)) where quicksort is probably that fast, but degrades to O(n^2) if you're unlucky.

  68. A comparison with the STL-sort by Anonymous Coward · · Score: 0

    I made a quick benchmark with the c++ STL-sort by sorting an array of 16 million 32-bit floats.

    With a fairly old AthlonXP 2000+ the sort-command took around 4.5 seconds to run, which seems to be _almost twice as fast_ as the NVIDIA 6800Ultra in the linked benchmark, so in this light the sorting-performance of the GPUs is indeed not yet too impressive.

    If the testers were using the fastest known software for the GPUs, it would probably also have been reasonable not to use terribly clumsy sorting implementations with the CPUs either.

  69. Re:Just what I need! by cpeterso · · Score: 1


    and I thought DOOM 3 had serious graphics card requirements! Now Oracle database servers will require NVIDIA graphics cards?! :)

  70. kids in ten years might say... by MegaFur · · Score: 1

    Perhpas in ten years, we'll get to hear a kid say, "GPU? Oh that stands for General Processing Unit, right?" eh? Perhaps not.

    --
    Furry cows moo and decompress.
  71. PCI Express? by Kent+Recal · · Score: 1

    So, does this work with PCI Express cards?
    Or is there a dual-xeon board with an AGP slot?
    Supermicro isn't offering one.

    I think this little hack (if it's stable enough for production use) would be a nice addition to "anything database driven".

    Most RDBs (well, any kind of DB I suppose) spend a significant amount
    of time on sorting stuff. Moving the task to the GPU should result in a serious performance boost (offloading the CPU *and* sorting faster sounds like a pretty good deal to me)

    Well, just a pipe dream?

  72. Those who use a database as their filesystem by UnapprovedThought · · Score: 1

    (I'm finding it so hard to abstain from commenting on the wisdom of replacing marvelously stable and transparent file systems with databases... I guess I'll try this time.)

    As database operations become more commonplace, even the average user will "benefit" from this. I place it in quotes because it was a problem that didn't need solving prior to the "use databases to manage your buddy list" mentality.

    There are probably other, saner uses for it though, although people will have to weigh this benefit against the cost of creating another dependency on their system for a specific library tied to specific hardware, and the concerns of what happens when some of the parts are upgraded but not the others....

  73. Email to the authors by rbarreira · · Score: 1

    An email I just sent to the authors:

    Hi,

    I've seen your page and the benchmarks are very interesting :)

    But I have some doubts about the quality of the CPU benchmark - qsort has
    some performance problems, since it involves calling a function indirectly
    through a pointer for every comparison made, and it supports variable size
    elements, etc... Besides, for sorting integers maybe sorting methods not
    based on comparisons would be better, this would have to be tested though.
    Did you try other sorting methods or a custom quicksort routine?

    Regards,
    Ricardo Barreira

    --

    The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
  74. Data Structures Nightmares by Stopher2475 · · Score: 0

    I just had a bad Data Structures flashback from reading these posts. Next people are going to strat posting induction proofs.

  75. Re:Radix is O(n)? Nice try by rodac · · Score: 1

    well, radix sort can not be used for general purpose sorting. that is the drawback, the benefit is that it IS fast.

    As for sorting integers or floats, radix sort IS O(n).
    rs also has nice properties that it is best-case/average-case/worst-case O(n).

    qsort is only O(n log n) for average case IFF all the input data to sort is completely random or carefully selected to perform well.

    try qsort() and compare it with non-random real-world data and see how they differ.

  76. Very nice Ricardo by Anonymous Coward · · Score: 0

    You have shown your ability to read (other posts), rewrite the content and the need for more attention.

    Now go somewhere else and jack off or whatever.

    1. Re:Very nice Ricardo by rbarreira · · Score: 1

      Thanks :)

      --

      The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
  77. rel="nofollow" and URL for the previous comment by mosel-saar-ruwer · · Score: 1

    The previous comment [Mod parent up!] has got some funky HTML attributes [rel="nofollow"] and a weird local URL:
    http://a/
    Has someone discovered [and exploited] a bug in Slashcode?

  78. My bad - it was the sig. by mosel-saar-ruwer · · Score: 1


    Sorry - I was had.

    1. Re:My bad - it was the sig. by ZosX · · Score: 1

      Yeah, I was just had today and I was picking through this guys history and found your comment. If it makes you feel any better my response to his terrible sig caused his post to be modded to -1 and I got +4 informative so far for just writing the flame. =)

      Anyways, just wanted you to know that you weren't the only one and for the record, AFAIK, you cannot spoof the slashcode into displaying nothing on a badly formed URL. The least it will display is either a charachter (it ignores whitespaces) or just simply http or whatever you put in the href in quotes. I guess it is possible to try a buffer overflow or something, but I'm guessing that it has probably already been tried and that they likely do checks against that sort of thing, much in the way that they check for too many caps, etc.

      Cheers.