Sorting Algorithm Breaks Giga-Sort Barrier, With GPUs
An anonymous reader writes "Researchers at the University of Virginia have recently open sourced an algorithm capable of sorting at a rate of one billion (integer) keys per second using a GPU. Although GPUs are often assumed to be poorly suited for algorithms like sorting, their results are several times faster than the best known CPU-based sorting implementations."
I find it very disappointing that a group of programmer/computer science types who even supply BibTeX to make it easier to reference their work, resort to screen-capturing an Excel chart to display their data.
Specifically, a GTX480 (just over 1 B keys/sec), followed up by a Tesla 2050 at around 75% of the speed of the GTX480. (745 M keys/sec)
"The true measure of a person is how they act when they know they won't get caught." - DSRilk
... to sort my first post to the top.
So why is this GPU sorting ability important to me?
GPUs have always been better at sorting your money from your wallet.
This isn't a "barrier" like the "sound barrier". There are no special difficulties that start around 1G/sec! It's just a threshold.
Don't get me wrong -- I'm not saying this isn't impressive, but no "barrier" was broken here!
Algorithms aren't measured in "x per second"... only implementations are measured that way. The speed of an algorithm is described in big-O notation, such as O(n log n). The metric of "sorted keys per second" is largely useless, because it depends on the particular hardware setup.
Revisiting Sorting for GPGPU Stream Architectures (PDF)
What one fool can do, another can. (Ancient Simian Proverb)
I think the bubble sort would be the wrong way to go.
—Barack Obama
Hold on, so I can play Jezz Ball, Chips Challenge and all my favourite old school games with a greater level of speed.
Thank god, because my online Porn collection isn't going to sort itself.
you've got lots and lots of RAM with room for the keys and lots of space to waste for unfilled pointers.
Pass 1, read the records, at the key radix store a record URI
Pass 2, sweep RAM and read the record URIs in the order you encounter them copying them onto a sequential write device.
I was doing this years ago.
If you are careful with the number and sizes of your read buffers, the re-read done for pass 2 doesn't have to be all that disk intensive.
You can even generate the keys using what ever hash function you find that is truly efficient and store collisions separately (leave a bit high and go into the a link list maintained by the hash generator for those few keys that hash redundantly.)
MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
The problem with big-oh notation is that the constant isn't explicit, so for any given n (pick as large as desired), it is possible that O(nlogn) < O(n) for some choice of constants. That's why ops per second is still a useful metric when comparing implementations on standardized hardware.
As always, in theory there's no difference between theory and practice, but in practice there is...
I will not be impressed until it plays a video of sorting examples using humans at 120 FPS while doing its other sorting!
Tired of my customary (Score:1)
at the whim of sounding like a complete arrogant jerk, does this actually support the theory that we've hit a wall with the x86 architecture with general computing, or is it still the 'best' at that?
I'm a bit confused when they, again and again, come up with gpu architecture doing stuff miles faster that our old trusty x86. I mean, would it be meaningful to actually start researching the architectural way of computing like we have seen gaming consoles do for decades? (latest being the ps3's cell)
just a thought that came to mind.
The typical sorting problem I've encountered in my career (various types of scientific, telecommunications and business software, though not games) involves an array of pointers to fixed length records that have a sort key (let's say an integer) at a predefined offset. Not an array of integers, nor an array of directly embedded small fixed length records which I'm guessing was used in TFA. The former situation requires random as well as stream access to memory, which would likely favor processing by the CPU in the motherboard of a typical $1000-$2000 PC.
GPUs are special kinds of processors, often called stream processors. They are very efficient at some kinds of operations, and not efficient at others. Some things, they run literally a thousand times faster than the CPU. Graphics rasterization would be one of these (no surprise, that's their prime job). However other things they run much slower. For something to run fast on a GPU it has to meet the following requirement, the more it matches them, the faster it is:
1) It needs to be parallel to a more or less infinite extent. GPUs are highly parallel devices. The GTX 480 in question has 448 shaders, meaning for max performance it needs to be working on 448 things in parallel. Things that are only somewhat parallel don't work well.
2) It needs to not have a whole lot of branching. Modern GPUs can branch, but they incur a larger penalty than CPUs do. So branching in the code needs to be minimal. It needs to mostly be working down a known path.
3) When a branch happens, things need to branch the same way. The shaders work in groups with regards to data and instructions. So if you have half a group branching one way, half the other, that'll slow things down as it'll have to be split out and done separately. So branches need to be uniform for the most part.
4) The problem set needs to fit in to the RAM of the GPU. This varies, 1GB is normal for high end GPUs and 4-6GB is possible for special processing versions of those. The memory on board is exceedingly fast, over a hundred gigabytes per second in most cases. However the penalty for hitting the main system RAM is heavy, the PCIe bus is but a fraction of that. So you need to be able to load data in to video RAM and work on it there, only occasionally going back and forth with the main system.
5) For very best performance, your problem needs to be single precision floating point (32-bit). That is what GPUs like the best. Very modern ones can do double precision as well, but at half the speed. I don't know how their integer performance fares over all, they can do it, but again not the same speed as single precision FP.
Now this is very useful. There are a lot of problems that fall in that domain. As I said, graphics would be one of the biggest, hence why they exist. However there are many problems that don't. When you get ones that are way outside of that, like, say, a relational database, they fall over flat. A normal CPU creams them performance wise.
That's why we have the separate components. CPUs can't do what GPUs do as well, but they are good at everything. GPUs do particular things well, but other things not so much.
In fact this is taken to the extreme in some electronics with ASICs. They do one and only one thing, but are fast as hell. Gigabit switches are an example. You find that tiny, low power, chips can switch amazing amounts of traffic. Try it on a computer with gigabit NICs and it'll fall over. Why? Because those ASICs do nothing but switch packets. They are designed just for that, with no extra hardware. Efficient, but inflexible.
Not sure how they are defining efficient in this experiement. I'd like to see how many clock cycles it took to sort each problem, maybe how much memory the radix sort is using too.
Has anyone really been far even as decided to use even go want to do look more like?
[ST8Z6FR57ABE6A8RE9UF]
CS people get way too caught up in Big O forgetting that, as you note, it is the theory describing the upper bound on time, not actual practice AND is only relevant for one factor, n. A great example is ray tracing. CS people love ray tracing because most create a ray tracer as part of their studies. They love to talk on about how it is great for rendering because "It is O(log n)!" They love to hype it over rasteraztion like current GPUs do. However there's two really important things they forget:
1) What is n? In the case, polygons. It scales with the log of the number of polygons you have. This is why ray tracer demos love showing off spheres made of millions of polygons and so on. It is cheap to do. However turns out polygons aren't the only thing that matters for graphics. Pixels also matter and ray tracing is O(n) with relation to pixels and it gets worse for anti aliasing. For FSAA you cast multiple rays per pixel. That means that 4x FSAA has 4x the computation requirements. Turns out rasterization scales much better with regards to resolution, and AA. In fact these days 4x FSAA is often O(1) in actual implementation in that it doesn't hit frame rate to turn it on. That is also why ray tracing demos are low resolution, because THAT'S the hard part.
2) Real hardware limits. In the case of ray tracing, it is memory access. While those polygons scale in a perfectly logarithmic fashion in theory, real system RAM isn't so forgiving. As you start to have more and more you run in to RAM bandwidth limits and things slow down. All the memory access required becomes a limit that wasn't on paper. You can cry all you like that it wasn't a problem in theory, on actual hardware you run in to other limits.
People need to better understand that it is a theoretical tool for comparing speed factors algorithms. That is useful, but you have to then consider the reality of the situation. CS people also need to understand for users, there is only one benchmark that matters: The clock on the wall. Whatever is faster is better. Doesn't matter what the theory says, if the hardware can do X faster than Y, then X is better according to users.
Were the benchmarks performed on lists that already resided on the GPU cards or do the benchmarks include "I/O" to and from CPU memory?
How does the implementation perform when the entire list does not fit in memory on the GPU card?
That guy was pretty cool. Whatever happened to him?
No we can finally sort the digits of Pi...
If you don't like my sig then don't read it.
You know, if they up it to just a bit over 4 billion unique 32-bit keys, say around 4,294,967,296 or so, I think I could sort them rather efficiently, as long as they weren't attached to any payload. ;-)
Program Intellivision!
I suspect your problem is going to become getting the keys in and out of the GPU memory, not the time the GP takes to sort them.
Why can't they spend as much effort on learning the simple and effective tools that exist to work with text files as they spend learning Excel tricks?
I have seen people pay for "Advanced Excel" training in order to do what I can do with a few lines of Unix utilities. The cost/benefit ratio of Excel is very poor.
Plain text has another advantage in that it's the most universal standard for data that exists. I can read text files from the start of the computing age without too much problem. I can even scan punched cards or punched tape to recover the data. I can scan and OCR text from printed pages.
The only advantage Excel has is familiarity for people in the corporate environment. A familiarity that has come through a large spending in training and software purchases.
Notice that some of the most significant advances in computing lately are a result of technologies developed for gaming? Further proof that if you poor a little money into optimization of tech to better suit human needs (gaming), you can still reap benefits for the "hardcore" sciences. Science for science sake is generally not as profitable or beneficial as science for human sake. Sink less money into general "I wonder if", and more into "I wonder if I can solve THIS specific problem." Many of the greatest scientists did much of their greatest work when given a specific issue to tackle via a goverment, corporation, or family member.
True colors coming through. He lied before he was elected, why would he stop? He would even change his views mid stream.
Funny, but you're basically right. Even sorting a billion 32 bits keys can be done more efficiently with a "tally sort". Allocate and zero a billion-bit array (128 MB); for every input N you set the Nth bit. Loop over the array, and if the Nth bit is set, print N. Note that the amount of RAM needed (128 MB) is less than even the input size (4GB).
His naivety hit the real world and exploded.
It's more of a lossy compression.
Does it horrify anyone else that we talk this way now?