Sorting Algorithm Breaks Giga-Sort Barrier, With GPUs
An anonymous reader writes "Researchers at the University of Virginia have recently open sourced an algorithm capable of sorting at a rate of one billion (integer) keys per second using a GPU. Although GPUs are often assumed to be poorly suited for algorithms like sorting, their results are several times faster than the best known CPU-based sorting implementations."
Specifically, a GTX480 (just over 1 B keys/sec), followed up by a Tesla 2050 at around 75% of the speed of the GTX480. (745 M keys/sec)
"The true measure of a person is how they act when they know they won't get caught." - DSRilk
Revisiting Sorting for GPGPU Stream Architectures (PDF)
What one fool can do, another can. (Ancient Simian Proverb)
GPUs are highly parallel processors, but most of our computing algorithms were developed for fast single core processors. As we figure out how to implement new solutions to old problems to take advantage of these highly parallel processors, you'll continue to see stories like this one. But, there's a limit to how good they can be at certain types or problems. Read up on Amdahl's law.
Basically, traditional x86 processors are good at lots of stuff. Modern GPUs are great at a few things.
Do you even know anything about perl? -- AC Replying to Tom Christiansen post.
GPUs are special kinds of processors, often called stream processors. They are very efficient at some kinds of operations, and not efficient at others. Some things, they run literally a thousand times faster than the CPU. Graphics rasterization would be one of these (no surprise, that's their prime job). However other things they run much slower. For something to run fast on a GPU it has to meet the following requirement, the more it matches them, the faster it is:
1) It needs to be parallel to a more or less infinite extent. GPUs are highly parallel devices. The GTX 480 in question has 448 shaders, meaning for max performance it needs to be working on 448 things in parallel. Things that are only somewhat parallel don't work well.
2) It needs to not have a whole lot of branching. Modern GPUs can branch, but they incur a larger penalty than CPUs do. So branching in the code needs to be minimal. It needs to mostly be working down a known path.
3) When a branch happens, things need to branch the same way. The shaders work in groups with regards to data and instructions. So if you have half a group branching one way, half the other, that'll slow things down as it'll have to be split out and done separately. So branches need to be uniform for the most part.
4) The problem set needs to fit in to the RAM of the GPU. This varies, 1GB is normal for high end GPUs and 4-6GB is possible for special processing versions of those. The memory on board is exceedingly fast, over a hundred gigabytes per second in most cases. However the penalty for hitting the main system RAM is heavy, the PCIe bus is but a fraction of that. So you need to be able to load data in to video RAM and work on it there, only occasionally going back and forth with the main system.
5) For very best performance, your problem needs to be single precision floating point (32-bit). That is what GPUs like the best. Very modern ones can do double precision as well, but at half the speed. I don't know how their integer performance fares over all, they can do it, but again not the same speed as single precision FP.
Now this is very useful. There are a lot of problems that fall in that domain. As I said, graphics would be one of the biggest, hence why they exist. However there are many problems that don't. When you get ones that are way outside of that, like, say, a relational database, they fall over flat. A normal CPU creams them performance wise.
That's why we have the separate components. CPUs can't do what GPUs do as well, but they are good at everything. GPUs do particular things well, but other things not so much.
In fact this is taken to the extreme in some electronics with ASICs. They do one and only one thing, but are fast as hell. Gigabit switches are an example. You find that tiny, low power, chips can switch amazing amounts of traffic. Try it on a computer with gigabit NICs and it'll fall over. Why? Because those ASICs do nothing but switch packets. They are designed just for that, with no extra hardware. Efficient, but inflexible.
Dude, an algorith which is O(n*log(n)) is not faster than O(n) just because n*log(n) < n.
When an algorithm is O(n* log(n)), it means the actual time requirement is p*n*log(n)+q, where p and q are constants specific to the algorithm.
The O(n*log(n)) algorith is faster than the O(n) one when
p1*n*log(n)+q1 < p2*n+q2
... and for any n, it is possible to choose p1, p2, q1 and q2 so that the O(n) algorith becomes faster.
This means, for example, that an algorithm which is O(n*log(n)) isn't automatically faster than an algorithm which is O(n) on lists with three elements or more. The O(n*log(n)) algorithm may take a hundred times longer to sort a list of two elements than the O(n) one (due each step being more complex), and in that case the lists will need to grow some before the O(n*log(n)) algorithm becomes faster.
Typically, I hear researchers describe the parallelism of an algorithm separately from its computational complexity (big oh notation) using the idea of "speedup."
The perfect scaling in your first example has linear speedup, and the second example has logarithmic speedup (that is, the speedup is log(p)).
Here is the relevant Wikipedia article.
Amen. Some tools like that would be a godsend. It could be coming. http://en.wikipedia.org/wiki/Linked_Data http://linkeddata.org/ - Not what you are talking about, but what you describe may result from it.
After logging in slashdot still does not take you back to the page you were on. It's been that way for 20 years.