Sorting Algorithm Breaks Giga-Sort Barrier, With GPUs
An anonymous reader writes "Researchers at the University of Virginia have recently open sourced an algorithm capable of sorting at a rate of one billion (integer) keys per second using a GPU. Although GPUs are often assumed to be poorly suited for algorithms like sorting, their results are several times faster than the best known CPU-based sorting implementations."
Specifically, a GTX480 (just over 1 B keys/sec), followed up by a Tesla 2050 at around 75% of the speed of the GTX480. (745 M keys/sec)
"The true measure of a person is how they act when they know they won't get caught." - DSRilk
GPUs have always been better at sorting your money from your wallet.
This isn't a "barrier" like the "sound barrier". There are no special difficulties that start around 1G/sec! It's just a threshold.
Don't get me wrong -- I'm not saying this isn't impressive, but no "barrier" was broken here!
Algorithms aren't measured in "x per second"... only implementations are measured that way. The speed of an algorithm is described in big-O notation, such as O(n log n). The metric of "sorted keys per second" is largely useless, because it depends on the particular hardware setup.
GPUs are special kinds of processors, often called stream processors. They are very efficient at some kinds of operations, and not efficient at others. Some things, they run literally a thousand times faster than the CPU. Graphics rasterization would be one of these (no surprise, that's their prime job). However other things they run much slower. For something to run fast on a GPU it has to meet the following requirement, the more it matches them, the faster it is:
1) It needs to be parallel to a more or less infinite extent. GPUs are highly parallel devices. The GTX 480 in question has 448 shaders, meaning for max performance it needs to be working on 448 things in parallel. Things that are only somewhat parallel don't work well.
2) It needs to not have a whole lot of branching. Modern GPUs can branch, but they incur a larger penalty than CPUs do. So branching in the code needs to be minimal. It needs to mostly be working down a known path.
3) When a branch happens, things need to branch the same way. The shaders work in groups with regards to data and instructions. So if you have half a group branching one way, half the other, that'll slow things down as it'll have to be split out and done separately. So branches need to be uniform for the most part.
4) The problem set needs to fit in to the RAM of the GPU. This varies, 1GB is normal for high end GPUs and 4-6GB is possible for special processing versions of those. The memory on board is exceedingly fast, over a hundred gigabytes per second in most cases. However the penalty for hitting the main system RAM is heavy, the PCIe bus is but a fraction of that. So you need to be able to load data in to video RAM and work on it there, only occasionally going back and forth with the main system.
5) For very best performance, your problem needs to be single precision floating point (32-bit). That is what GPUs like the best. Very modern ones can do double precision as well, but at half the speed. I don't know how their integer performance fares over all, they can do it, but again not the same speed as single precision FP.
Now this is very useful. There are a lot of problems that fall in that domain. As I said, graphics would be one of the biggest, hence why they exist. However there are many problems that don't. When you get ones that are way outside of that, like, say, a relational database, they fall over flat. A normal CPU creams them performance wise.
That's why we have the separate components. CPUs can't do what GPUs do as well, but they are good at everything. GPUs do particular things well, but other things not so much.
In fact this is taken to the extreme in some electronics with ASICs. They do one and only one thing, but are fast as hell. Gigabit switches are an example. You find that tiny, low power, chips can switch amazing amounts of traffic. Try it on a computer with gigabit NICs and it'll fall over. Why? Because those ASICs do nothing but switch packets. They are designed just for that, with no extra hardware. Efficient, but inflexible.
CS people get way too caught up in Big O forgetting that, as you note, it is the theory describing the upper bound on time, not actual practice AND is only relevant for one factor, n. A great example is ray tracing. CS people love ray tracing because most create a ray tracer as part of their studies. They love to talk on about how it is great for rendering because "It is O(log n)!" They love to hype it over rasteraztion like current GPUs do. However there's two really important things they forget:
1) What is n? In the case, polygons. It scales with the log of the number of polygons you have. This is why ray tracer demos love showing off spheres made of millions of polygons and so on. It is cheap to do. However turns out polygons aren't the only thing that matters for graphics. Pixels also matter and ray tracing is O(n) with relation to pixels and it gets worse for anti aliasing. For FSAA you cast multiple rays per pixel. That means that 4x FSAA has 4x the computation requirements. Turns out rasterization scales much better with regards to resolution, and AA. In fact these days 4x FSAA is often O(1) in actual implementation in that it doesn't hit frame rate to turn it on. That is also why ray tracing demos are low resolution, because THAT'S the hard part.
2) Real hardware limits. In the case of ray tracing, it is memory access. While those polygons scale in a perfectly logarithmic fashion in theory, real system RAM isn't so forgiving. As you start to have more and more you run in to RAM bandwidth limits and things slow down. All the memory access required becomes a limit that wasn't on paper. You can cry all you like that it wasn't a problem in theory, on actual hardware you run in to other limits.
People need to better understand that it is a theoretical tool for comparing speed factors algorithms. That is useful, but you have to then consider the reality of the situation. CS people also need to understand for users, there is only one benchmark that matters: The clock on the wall. Whatever is faster is better. Doesn't matter what the theory says, if the hardware can do X faster than Y, then X is better according to users.