Sorting Algorithm Breaks Giga-Sort Barrier, With GPUs

Excel Charts by Anonymous Coward · 2010-08-29 14:30 · Score: 3, Insightful

I find it very disappointing that a group of programmer/computer science types who even supply BibTeX to make it easier to reference their work, resort to screen-capturing an Excel chart to display their data.

Re:Excel Charts by bramp · 2010-08-29 14:59 · Score: 3, Funny

I've also found this annoying when reading papers. Perhaps I just spend too much time learning how to use gnuplot so that my graphs look nice.
Re:Excel Charts by Anpheus · 2010-08-29 15:21 · Score: 3, Insightful

Maybe excel was just the right tool for the job? It's quick and easy to use, and to reformat the graphs.
I know the Linux tools tend to be a little longer between tweaking, rendering and displaying, so a fast WYSIWIG tool works just fine.
Re:Excel Charts by pspahn · 2010-08-29 18:29 · Score: 4, Insightful

Actually, I find even more disappointing that a decent way to display datasets on a web page isn't standard yet. Why can't a nice one be embeddable with column sorts and robust methods for retrieving data? There are solutions, sure, but I have yet to find one that isn't unnecessarily complex or just plain ugly and difficult to use. But I guess it's just a matter of time, right?

--
Someone flopped a steamer in the gene pool.
Re:Excel Charts by w0mprat · 2010-08-29 21:51 · Score: 3, Informative

Amen. Some tools like that would be a godsend. It could be coming. http://en.wikipedia.org/wiki/Linked_Data http://linkeddata.org/ - Not what you are talking about, but what you describe may result from it.

--
After logging in slashdot still does not take you back to the page you were on. It's been that way for 20 years.

The video card in question.. by black3d · 2010-08-29 14:33 · Score: 5, Informative

Specifically, a GTX480 (just over 1 B keys/sec), followed up by a Tesla 2050 at around 75% of the speed of the GTX480. (745 M keys/sec)

--
"The true measure of a person is how they act when they know they won't get caught." - DSRilk

Re:The video card in question.. by MarkRose · 2010-08-29 15:47 · Score: 3, Funny

I never would have suspected the GTX480 would have been good at this sorta thing.

--
Be relentless!

No Surprise... by Anonymous Coward · 2010-08-29 14:47 · Score: 5, Funny

GPUs have always been better at sorting your money from your wallet.

Not a barrier by Captain+Segfault · 2010-08-29 14:49 · Score: 5, Insightful

This isn't a "barrier" like the "sound barrier". There are no special difficulties that start around 1G/sec! It's just a threshold.

Don't get me wrong -- I'm not saying this isn't impressive, but no "barrier" was broken here!

Re:Not a barrier by XanC · 2010-08-29 14:51 · Score: 4, Insightful

It's not a threshold! It's just a milestone.
Re:Not a barrier by CoolGopher · 2010-08-29 14:59 · Score: 3, Funny

It's just a milestone.
Hang on, since when do you measure sorting performance using a distance indicator? And an imperial one at that!
No, this is not a serious comment.
Re:Not a barrier by PatPending · 2010-08-29 15:35 · Score: 4, Funny

Actually, if you look at shockwave dynamics during the moment an object crosses from subsonic to supersonic velocity, it can very easily be considered much more of a barrier than 1gkeys/sec can.
Actually in this case, your analogy should use ludicrous speed.

--
What one fool can do, another can. (Ancient Simian Proverb)

Um... by Anonymous Coward · 2010-08-29 14:50 · Score: 5, Insightful

Algorithms aren't measured in "x per second"... only implementations are measured that way. The speed of an algorithm is described in big-O notation, such as O(n log n). The metric of "sorted keys per second" is largely useless, because it depends on the particular hardware setup.

Re:Um... by PrecambrianRabbit · 2010-08-29 15:19 · Score: 5, Interesting

Given that the particular hardware setup is detailed here (a GTX 480 achieves the 1 billion keys/sec figure), and the algorithm used (radix sort) has known asymptotic behavior (O(nk) for n keys of length k), 10^9 keys/sec is quite meaningful, particularly since it's a significant implementation challenge (possibly even an algorithmic challenge) to port this algorithm to a GPU.
Furthermore, I think sorting speed is appropriately measured in keys/sec. Big-O does not in fact describe the speed, but rather the upper bound of the growth of an algorithm's asymptotic running time, which needs to be paired with the implementation, architecture, and data set to determine a speed. It turns out the constant factors can actually be quite important in practice.
Re:Um... by SpazmodeusG · 2010-08-29 15:36 · Score: 4, Interesting

I wish they'd start putting the "P" into these Big-O notations, where the "P" is the number of processors. Some algorithms don't scale well, some do. Putting the P in illustrates this.
eg. O( n/P ) illustrates an algorithm that scales perfectly with more cores added. O( n / log(P) ) not so much.
Re:Um... by Anonymous Coward · 2010-08-29 18:43 · Score: 3, Informative

Typically, I hear researchers describe the parallelism of an algorithm separately from its computational complexity (big oh notation) using the idea of "speedup."
The perfect scaling in your first example has linear speedup, and the second example has logarithmic speedup (that is, the speedup is log(p)).
Here is the relevant Wikipedia article.

Link to Technical Paper by PatPending · 2010-08-29 14:59 · Score: 4, Informative

Revisiting Sorting for GPGPU Stream Architectures (PDF)

--
What one fool can do, another can. (Ancient Simian Proverb)

I think the bubble sort would be the wrong way to by joeyadams · 2010-08-29 15:12 · Score: 4, Funny

I think the bubble sort would be the wrong way to go.

—Barack Obama

PRON! Marches on! by jewishbaconzombies · 2010-08-29 15:17 · Score: 3, Funny

Thank god, because my online Porn collection isn't going to sort itself.

Big deal. Radix sort works well IF ... by crovira · 2010-08-29 15:21 · Score: 3, Interesting

you've got lots and lots of RAM with room for the keys and lots of space to waste for unfilled pointers.

Pass 1, read the records, at the key radix store a record URI
Pass 2, sweep RAM and read the record URIs in the order you encounter them copying them onto a sequential write device.

I was doing this years ago.

If you are careful with the number and sizes of your read buffers, the re-read done for pass 2 doesn't have to be all that disk intensive.

You can even generate the keys using what ever hash function you find that is truly efficient and store collisions separately (leave a bit high and go into the a link list maintained by the hash generator for those few keys that hash redundantly.)

--
MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.

Re:Big deal. Radix sort works well IF ... by Trepidity · 2010-08-29 17:28 · Score: 4, Insightful

Well, yeah, they're not claiming they invented radix sort. They're claiming that their GPU implementation of radix sort runs about 4x as fast as the CPU implementation you describe.

--
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
Re:Big deal. Radix sort works well IF ... by kramulous · 2010-08-29 19:18 · Score: 3, Interesting

So, does that mean if they went out and got the fastest Xeon processor (they used the fastest gpu - excluding the C2070), wrote parallel and used the Intel Compiler (writing to it) the speedup over the cpu is less than zero?
After having just looked at their code, also remove the cpu stl shit (actually any template since they don't vectorise). If you are writing speedy code for the gpu, to perform an adequate test for the cpu you also have to write appropriately.
hahahahaha
This gets better and better...
They only timed the compute time. Cudamalloc was not part of the timing or cudamemcpy.
Sorry, I only count 'time to solution'. That is all i'm interested in. I thought is was strange that a less than O(n**2) was faster on the gpu.
It is like writing benchmarks that ignore disk IO, ignore reading from system memory, L3 cache, etc. Only time stuff that is in the registers.

--
.

Ugh. by martin-boundary · 2010-08-29 15:29 · Score: 3, Insightful

Stupid HTML ate my <...

The problem with big-oh notation is that the constant isn't explicit, so for any given n (pick as large as desired), it is possible that O(nlogn) < O(n) for some choice of constants. That's why ops per second is still a useful metric when comparing implementations on standardized hardware.

As always, in theory there's no difference between theory and practice, but in practice there is...

Re:x86 by emmons · 2010-08-29 16:14 · Score: 4, Informative

GPUs are highly parallel processors, but most of our computing algorithms were developed for fast single core processors. As we figure out how to implement new solutions to old problems to take advantage of these highly parallel processors, you'll continue to see stories like this one. But, there's a limit to how good they can be at certain types or problems. Read up on Amdahl's law.

Basically, traditional x86 processors are good at lots of stuff. Modern GPUs are great at a few things.

--
Do you even know anything about perl? -- AC Replying to Tom Christiansen post.

No by Sycraft-fu · 2010-08-29 16:29 · Score: 5, Informative

GPUs are special kinds of processors, often called stream processors. They are very efficient at some kinds of operations, and not efficient at others. Some things, they run literally a thousand times faster than the CPU. Graphics rasterization would be one of these (no surprise, that's their prime job). However other things they run much slower. For something to run fast on a GPU it has to meet the following requirement, the more it matches them, the faster it is:

1) It needs to be parallel to a more or less infinite extent. GPUs are highly parallel devices. The GTX 480 in question has 448 shaders, meaning for max performance it needs to be working on 448 things in parallel. Things that are only somewhat parallel don't work well.

2) It needs to not have a whole lot of branching. Modern GPUs can branch, but they incur a larger penalty than CPUs do. So branching in the code needs to be minimal. It needs to mostly be working down a known path.

3) When a branch happens, things need to branch the same way. The shaders work in groups with regards to data and instructions. So if you have half a group branching one way, half the other, that'll slow things down as it'll have to be split out and done separately. So branches need to be uniform for the most part.

4) The problem set needs to fit in to the RAM of the GPU. This varies, 1GB is normal for high end GPUs and 4-6GB is possible for special processing versions of those. The memory on board is exceedingly fast, over a hundred gigabytes per second in most cases. However the penalty for hitting the main system RAM is heavy, the PCIe bus is but a fraction of that. So you need to be able to load data in to video RAM and work on it there, only occasionally going back and forth with the main system.

5) For very best performance, your problem needs to be single precision floating point (32-bit). That is what GPUs like the best. Very modern ones can do double precision as well, but at half the speed. I don't know how their integer performance fares over all, they can do it, but again not the same speed as single precision FP.

Now this is very useful. There are a lot of problems that fall in that domain. As I said, graphics would be one of the biggest, hence why they exist. However there are many problems that don't. When you get ones that are way outside of that, like, say, a relational database, they fall over flat. A normal CPU creams them performance wise.

That's why we have the separate components. CPUs can't do what GPUs do as well, but they are good at everything. GPUs do particular things well, but other things not so much.

In fact this is taken to the extreme in some electronics with ASICs. They do one and only one thing, but are fast as hell. Gigabit switches are an example. You find that tiny, low power, chips can switch amazing amounts of traffic. Try it on a computer with gigabit NICs and it'll fall over. Why? Because those ASICs do nothing but switch packets. They are designed just for that, with no extra hardware. Efficient, but inflexible.

Re:No by black3d · 2010-08-29 16:52 · Score: 3, Insightful

Unfortunately I can't mod having already posted in this thread, but please allow me to /bow. This is the best explanation I've ever read anywhere for the differences. Even I knew the differences but couldn't have expressed it so finely. Bravo.

--
"The true measure of a person is how they act when they know they won't get caught." - DSRilk

Also by Sycraft-fu · 2010-08-29 17:30 · Score: 5, Interesting

CS people get way too caught up in Big O forgetting that, as you note, it is the theory describing the upper bound on time, not actual practice AND is only relevant for one factor, n. A great example is ray tracing. CS people love ray tracing because most create a ray tracer as part of their studies. They love to talk on about how it is great for rendering because "It is O(log n)!" They love to hype it over rasteraztion like current GPUs do. However there's two really important things they forget:

1) What is n? In the case, polygons. It scales with the log of the number of polygons you have. This is why ray tracer demos love showing off spheres made of millions of polygons and so on. It is cheap to do. However turns out polygons aren't the only thing that matters for graphics. Pixels also matter and ray tracing is O(n) with relation to pixels and it gets worse for anti aliasing. For FSAA you cast multiple rays per pixel. That means that 4x FSAA has 4x the computation requirements. Turns out rasterization scales much better with regards to resolution, and AA. In fact these days 4x FSAA is often O(1) in actual implementation in that it doesn't hit frame rate to turn it on. That is also why ray tracing demos are low resolution, because THAT'S the hard part.

2) Real hardware limits. In the case of ray tracing, it is memory access. While those polygons scale in a perfectly logarithmic fashion in theory, real system RAM isn't so forgiving. As you start to have more and more you run in to RAM bandwidth limits and things slow down. All the memory access required becomes a limit that wasn't on paper. You can cry all you like that it wasn't a problem in theory, on actual hardware you run in to other limits.

People need to better understand that it is a theoretical tool for comparing speed factors algorithms. That is useful, but you have to then consider the reality of the situation. CS people also need to understand for users, there is only one benchmark that matters: The clock on the wall. Whatever is faster is better. Doesn't matter what the theory says, if the hardware can do X faster than Y, then X is better according to users.

Re:Also by pz · 2010-08-30 01:33 · Score: 3, Interesting

People need to better understand that it is a theoretical tool for comparing speed factors algorithms. That is useful, but you have to then consider the reality of the situation.
Right. And any good programmer understands that *first* you pick the right algorithm, and *then* you optimize the implementation. Working the other way around is wasting time.
But, more importantly, that the parent seems to miss, is that the speed improvement from changing the order of growth of an algorithm can swamp nearly any possible improvements in hardware. Going from O(n^4) to O(n^2) when n is 10,000 is like replacing your desktop box with ten thousand supercomputers. No amount of implementation tweaking or upgrading processors and memory -- which only affects the constants in algorithm speed -- is going to give you the same level of speedup.
There is a very, very good reason to pay attention to O() analysis of algorithms, and, in particular, their implementation. You can implement bubble sort which is O(n^2) when done correctly, in a brain-dead way that makes it O(n^3) --- if you, for example, re-extend the output array for each new element --- and the difference is huge. Extra factors of n creep in easily, and understanding them can be highly useful.
So, the parent can review real-world constraints all he wants, but in the end, order of growth is more important.

--

Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.

1 billion? Up it to over 4 billion! by Mr+Z · 2010-08-29 20:52 · Score: 3, Funny

You know, if they up it to just a bit over 4 billion unique 32-bit keys, say around 4,294,967,296 or so, I think I could sort them rather efficiently, as long as they weren't attached to any payload. ;-)

--
Program Intellivision!

Slashdot Mirror

Sorting Algorithm Breaks Giga-Sort Barrier, With GPUs

29 of 187 comments (clear)