Sorting Algorithm Breaks Giga-Sort Barrier, With GPUs
An anonymous reader writes "Researchers at the University of Virginia have recently open sourced an algorithm capable of sorting at a rate of one billion (integer) keys per second using a GPU. Although GPUs are often assumed to be poorly suited for algorithms like sorting, their results are several times faster than the best known CPU-based sorting implementations."
Given that the particular hardware setup is detailed here (a GTX 480 achieves the 1 billion keys/sec figure), and the algorithm used (radix sort) has known asymptotic behavior (O(nk) for n keys of length k), 10^9 keys/sec is quite meaningful, particularly since it's a significant implementation challenge (possibly even an algorithmic challenge) to port this algorithm to a GPU.
Furthermore, I think sorting speed is appropriately measured in keys/sec. Big-O does not in fact describe the speed, but rather the upper bound of the growth of an algorithm's asymptotic running time, which needs to be paired with the implementation, architecture, and data set to determine a speed. It turns out the constant factors can actually be quite important in practice.
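For anyone who wants to see where the O(nk) comes from, here is a minimal CPU-side sketch of a least-significant-digit radix sort. It is not the researchers' GPU code; the function name `radix_sort` and the parameters `key_bits` and `digit_bits` are made up for illustration, and keys are assumed to be non-negative integers below 2**key_bits.

```python
def radix_sort(keys, key_bits=32, digit_bits=8):
    """LSD radix sort for non-negative integers below 2**key_bits.

    Makes key_bits / digit_bits passes over the n keys, which is
    where the O(n * k) running time comes from.
    """
    mask = (1 << digit_bits) - 1
    for shift in range(0, key_bits, digit_bits):
        # Counting pass: histogram of the current digit.
        counts = [0] * (mask + 1)
        for key in keys:
            counts[(key >> shift) & mask] += 1
        # Exclusive prefix sum turns counts into starting offsets.
        total = 0
        for digit in range(mask + 1):
            counts[digit], total = total, total + counts[digit]
        # Stable scatter: equal digits keep their previous order.
        out = [0] * len(keys)
        for key in keys:
            digit = (key >> shift) & mask
            out[counts[digit]] = key
            counts[digit] += 1
        keys = out
    return keys
```

The histogram, prefix sum, and scatter steps are each simple loops over the data, which is presumably what makes the algorithm a candidate for a parallel port at all; how fast it actually runs is still down to the constant factors.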
you've got lots and lots of RAM with room for the keys and lots of space to waste for unfilled pointers.
Pass 1, read the records, at the key radix store a record URI
Pass 2, sweep RAM and read the record URIs in the order you encounter them copying them onto a sequential write device.
I was doing this years ago.
If you are careful with the number and sizes of your read buffers, the re-read done for pass 2 doesn't have to be all that disk intensive.
You can even generate the keys using whatever hash function you find that is truly efficient and store collisions separately (leave a bit high and go into a linked list maintained by the hash generator for those few keys that hash redundantly).
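A rough in-memory sketch of the two passes described above, under some loud assumptions: the key itself is used as the table address (an arbitrary hash would not preserve order), a record's list index stands in for its "URI", and a dict of lists stands in for the collision linked list. The names `address_sort`, `key_of`, and `key_space` are hypothetical.

```python
def address_sort(records, key_of, key_space):
    """Sort records whose integer keys lie in range(key_space)."""
    slots = [None] * key_space   # mostly unfilled pointers, as described
    overflow = {}                # key -> later record indices with the same key

    # Pass 1: read the records and file each one's index under its key.
    for uri, record in enumerate(records):
        key = key_of(record)
        if slots[key] is None:
            slots[key] = uri
        else:
            overflow.setdefault(key, []).append(uri)

    # Pass 2: sweep the table in address order and copy records out.
    output = []
    for key, uri in enumerate(slots):
        if uri is None:
            continue
        output.append(records[uri])
        output.extend(records[i] for i in overflow.get(key, ()))
    return output
```

The memory cost is proportional to `key_space`, not to the number of records, which is the "lots of space to waste" part.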
MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
I wish they'd start putting the "P" into these Big-O notations, where the "P" is the number of processors. Some algorithms don't scale well, some do. Putting the P in illustrates this.
E.g., O(n/P) illustrates an algorithm that scales perfectly with more cores added. O(n / log(P)), not so much.
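To put numbers on the difference between those two, here is a small sketch that treats the expressions as idealized step counts, ignoring the constant factors Big-O hides. The function names are invented for this example, and the log is assumed to be base 2.

```python
import math


def steps_linear_scaling(n, p):
    """Idealized step count for an O(n / P) algorithm."""
    return n / p


def steps_log_scaling(n, p):
    """Idealized step count for an O(n / log P) algorithm (needs p >= 2)."""
    return n / math.log2(p)


if __name__ == "__main__":
    n = 1_000_000
    for p in (2, 16, 128, 1024):
        print(p, steps_linear_scaling(n, p), steps_log_scaling(n, p))
```

Going from 2 to 1024 processors cuts the first count by a factor of 512 but the second by only a factor of 10.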
The answer to hitting a wall with traditional CPUs is complicated. The number of transistors on new CPUs is actually keeping up with Moore's law. The size of the transistors and their power consumption are also steadily decreasing. However, clock speeds have been stagnant, and performance per clock cycle hasn't been increasing as fast as it has in the past.
When it comes to raw performance numbers GPUs destroy CPUs. The problem is trying to take advantage of the power GPUs offer. For starters the algorithm has to be parallel in nature. And not just part of it, the majority of the algorithm has to be parallel or the overhead will erase any performance gain. The application also has to run long enough to justify offloading it to the GPU or again, the overhead will get you.
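The point about the parallel fraction and the overhead is essentially Amdahl's law. A minimal sketch, with one addition that is my own assumption: an `overhead` term, expressed as a fraction of the original single-threaded runtime, standing in for things like copying data to and from the GPU.

```python
def amdahl_speedup(parallel_fraction, workers, overhead=0.0):
    """Speedup over the serial version under Amdahl's law.

    parallel_fraction: share of the original runtime that parallelizes.
    workers:           number of parallel execution units.
    overhead:          extra fixed cost as a fraction of the original
                       runtime (an assumption added for illustration).
    """
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers + overhead)


if __name__ == "__main__":
    print(amdahl_speedup(0.50, 1000))               # half parallel: stays below 2x
    print(amdahl_speedup(0.99, 1000))               # mostly parallel: large gain
    print(amdahl_speedup(0.99, 1000, overhead=1.0)) # overhead equal to the whole job
```

With only half the work parallel, even a thousand workers cannot reach 2x, and an overhead as large as the original runtime makes the "accelerated" version slower than doing nothing.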
Even if you have a parallel algorithm, implementing it isn't trivial. To use CUDA or OpenCL you have to have not only a good understanding of general parallel programming but also a good understanding of the architecture of the GPU hardware. These languages are not user friendly. They really put the burden on the programmer. On the other hand this does mean they can be very powerful in the right hands.
Now let's say your application meets these criteria and you implemented it in CUDA and got a 10x speedup. No one with an ATI card can run it. Sure, you could implement it in OpenCL instead to be cross-platform, but OpenCL seems to still be in its infancy and not as mature as CUDA.
I'm not trying to say GPGPU computing has no future, just that it has a long way to go. Parallel programming has actually had quite the revival lately and I'm truly interested to see where it goes. Some type of parallel compiler that relieves the programmer of having to deal with all the headaches associated with parallel programming would be groundbreaking and have awesome implications. Some people claim this isn't possible. If this topic interests you, I would recommend looking into reconfigurable computing. There's some really interesting stuff going on in that area, and it supports a much wider range of algorithms than GPGPU currently does.
Not sure how they are defining efficient in this experiment. I'd like to see how many clock cycles it took to sort each problem, and maybe how much memory the radix sort is using too.
Has anyone really been far even as decided to use even go want to do look more like?
CS people get way too caught up in Big O, forgetting that, as you note, it is the theory describing the upper bound on time, not actual practice, AND is only relevant for one factor, n. A great example is ray tracing. CS people love ray tracing because most create a ray tracer as part of their studies. They love to talk on about how it is great for rendering because "It is O(log n)!" They love to hype it over rasterization like current GPUs do. However, there are two really important things they forget:
1) What is n? In this case, polygons. It scales with the log of the number of polygons you have. This is why ray tracer demos love showing off spheres made of millions of polygons and so on. It is cheap to do. However, it turns out polygons aren't the only thing that matters for graphics. Pixels also matter, and ray tracing is O(n) with relation to pixels, and it gets worse for anti-aliasing. For FSAA you cast multiple rays per pixel. That means that 4x FSAA has 4x the computation requirements. It turns out rasterization scales much better with regard to resolution and AA. In fact, these days 4x FSAA is often O(1) in actual implementation, in that it doesn't hit frame rate to turn it on. That is also why ray tracing demos are low resolution, because THAT'S the hard part.
2) Real hardware limits. In the case of ray tracing, it is memory access. While those polygons scale in a perfectly logarithmic fashion in theory, real system RAM isn't so forgiving. As you start to have more and more, you run into RAM bandwidth limits and things slow down. All the memory access required becomes a limit that wasn't on paper. You can cry all you like that it wasn't a problem in theory; on actual hardware you run into other limits.
People need to better understand that it is a theoretical tool for comparing the speed factors of algorithms. That is useful, but you then have to consider the reality of the situation. CS people also need to understand that for users, there is only one benchmark that matters: the clock on the wall. Whatever is faster is better. It doesn't matter what the theory says; if the hardware can do X faster than Y, then X is better according to users.
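The "what is n?" argument can be put into a toy cost model. This is only a sketch of the scaling claimed above, assuming each ray costs about log2(polygons) steps (for example via some tree-like acceleration structure) and that cost is linear in rays cast; the function name and parameters are invented, and memory bandwidth is deliberately ignored.

```python
import math


def ray_tracing_cost(polygons, width, height, samples_per_pixel=1):
    """Toy cost model: rays cast times log2 of the polygon count."""
    rays = width * height * samples_per_pixel
    return rays * math.log2(polygons)


if __name__ == "__main__":
    base = ray_tracing_cost(1_000_000, 640, 480)
    more_polys = ray_tracing_cost(1_000_000_000, 640, 480)
    more_pixels = ray_tracing_cost(1_000_000, 1920, 1080, samples_per_pixel=4)
    print(more_polys / base)   # 1000x the polygons
    print(more_pixels / base)  # higher resolution plus 4x FSAA
```

In this model a thousandfold increase in polygons costs far less than going from 640x480 to 1920x1080 with 4x FSAA, which is the commenter's point about pixels being the hard part.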
While an interesting idea, many algorithms behave unpredictably on multiple processors (depending on how much communication would be required). Some will even be slower!
The effect that extra CPUs will have is too dependent on the hardware implementation to be formalized like this.