Slashdot Mirror


Sorting Algorithm Breaks Giga-Sort Barrier, With GPUs

An anonymous reader writes "Researchers at the University of Virginia have recently open sourced an algorithm capable of sorting at a rate of one billion (integer) keys per second using a GPU. Although GPUs are often assumed to be poorly suited for algorithms like sorting, their results are several times faster than the best known CPU-based sorting implementations."

187 comments

  1. Excel Charts by Anonymous Coward · · Score: 3, Insightful

    I find it very disappointing that a group of programmer/computer science types who even supply BibTeX to make it easier to reference their work, resort to screen-capturing an Excel chart to display their data.

    1. Re:Excel Charts by bramp · · Score: 3, Funny

      I've also found this annoying when reading papers. Perhaps I just spend too much time learning how to use gnuplot so that my graphs look nice.

    2. Re:Excel Charts by Anonymous Coward · · Score: 1, Insightful

      Who cares what tool they use to display data, if they're getting their point across in an effective manner? Good lord...

    3. Re:Excel Charts by Anpheus · · Score: 3, Insightful

      Maybe excel was just the right tool for the job? It's quick and easy to use, and to reformat the graphs.

      I know the Linux tools tend to be a little longer between tweaking, rendering and displaying, so a fast WYSIWIG tool works just fine.

    4. Re:Excel Charts by Anonymous Coward · · Score: 0, Funny

      Researchers at the University of Virginia have recently open sourced

      I stopped this shit about right there. You think I'm going to trust my sorting to some open sores buggy shit? I think I'll just keep using Microsoft for my algorithms thank you very much.

    5. Re:Excel Charts by ScrewMaster · · Score: 0

      Researchers at the University of Virginia have recently open sourced

      I stopped this shit about right there. You think I'm going to trust my sorting to some open sores buggy shit? I think I'll just keep using Microsoft for my algorithms thank you very much.

      I'm assuming that was an attempt at humor. Otherwise, you're going to get modded all to hell very soon.

      --
      The higher the technology, the sharper that two-edged sword.
    6. Re:Excel Charts by udippel · · Score: 1

      Nevermind what the poster attempted do it.
      I for one found it quite funny.
      (Simply have no mod points now, so a comment should be in order.)

    7. Re:Excel Charts by Anonymous Coward · · Score: 0

      Seems the issue is that, instead of providing the data, its a screenshot of the data (less able to verify the results?)

    8. Re:Excel Charts by Anonymous Coward · · Score: 0

      FUD

    9. Re:Excel Charts by Anonymous Coward · · Score: 0

      I'm more bothered by the ragged edges and too-wide text columns. Seriously. The paper looks unprofessional. With LaTeX, this is automatic.

    10. Re:Excel Charts by pspahn · · Score: 4, Insightful

      Actually, I find even more disappointing that a decent way to display datasets on a web page isn't standard yet. Why can't a nice one be embeddable with column sorts and robust methods for retrieving data? There are solutions, sure, but I have yet to find one that isn't unnecessarily complex or just plain ugly and difficult to use. But I guess it's just a matter of time, right?

      --
      Someone flopped a steamer in the gene pool.
    11. Re:Excel Charts by Anonymous Coward · · Score: 0

      Are you f*cking kidding me? gnuplot is all that was needed here.

    12. Re:Excel Charts by dave420 · · Score: 1

      Maybe the screenshot isn't the finished paper, and just a sneak-peek into the results for those interested?

    13. Re:Excel Charts by Anonymous Coward · · Score: 0

      Did you mean to say you spend too much time learning Excel? GNU plot, I learned in one afternoon.

    14. Re:Excel Charts by w0mprat · · Score: 3, Informative

      Amen. Some tools like that would be a godsend. It could be coming. http://en.wikipedia.org/wiki/Linked_Data http://linkeddata.org/ - Not what you are talking about, but what you describe may result from it.

      --
      After logging in slashdot still does not take you back to the page you were on. It's been that way for 20 years.
    15. Re:Excel Charts by Anonymous Coward · · Score: 0

      I know the feeling. Getting good at them though :p

    16. Re:Excel Charts by dominious · · Score: 2, Insightful

      what. the. fuck. +4 Insightful because they use Excel charts?

      Hey I just solved N=NP.
      Yeah, but you are using Excel charts...hmmm sorry kthnx later.

    17. Re:Excel Charts by Anonymous Coward · · Score: 0

      it's gnuplot. It has nothing to do with GNU.

    18. Re:Excel Charts by brm · · Score: 1
    19. Re:Excel Charts by pinkushun · · Score: 1

      Actually there is, I'm surprised that not more people use (or know) of it. See Google Chart API

    20. Re:Excel Charts by multipartmixed · · Score: 2, Insightful

      1. Why do you believe those are screen captures, rather than, say, exported images?

      2. How would the data look different it were displayed with BibTeX?

      3. How fast is using BibTeX? (I've never used it). I could create those same charts in Excel '97 from a CSV of input points easily; probably in under a minute.

      --

      Do daemons dream of electric sleep()?
    21. Re:Excel Charts by Surt · · Score: 1

      Researchers at the University of Virginia have recently open sourced

      I stopped this shit about right there. You think I'm going to trust my sorting to some open sores buggy shit? I think I'll just keep using Microsoft for my algorithms thank you very much.

      I'm assuming that was an attempt at humor. Otherwise, you're going to get modded all to hell very soon.

      Just ask yourself: could someone possibly actually hold this opinion?
      He was clearly joking.

      --
      "Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
    22. Re:Excel Charts by pspahn · · Score: 1

      Which is fine and dandy, but it doesn't appear to simply embed a spreadsheet, excel or otherwise.

      I would like to use something like [sheet src="someurl.com/data.csv"]

      --
      Someone flopped a steamer in the gene pool.
    23. Re:Excel Charts by ScrewMaster · · Score: 1

      Just ask yourself: could someone possibly actually hold this opinion? He was clearly joking.

      Yes, they can and do. I'm surprised you haven't experienced such opinions. I have, and it's unnerving, and frankly explains a lot of the problems we have trying to improve online security.

      For example, I know a couple of Ph.Ds that simply cannot understand how anyone could use an open source OS because "everyone can see how it works." I've pointed out that a truly secure software system is so by its very nature, regardless of whether you have the source code or not. In one ear and out the other: security by obscurity just seems to make more sense to them. They've had issues with Windows malware in the past, and their assumption is that it's just a necessary evil and would, of course, be worse in an open source environment because they're of course much easier to crack.

      The irony is that they don't make the connection between the openness of successful science, and the success of major open source projects like Linux. They prefer to trust closed source software because "nobody knows how it works and so they can't break in." Okey dokey ... keep on using Internet Explorer and when your bank account gets plundered I don't want to hear about it.

      Boggles the mind I know, but I deal with it all the time.

      --
      The higher the technology, the sharper that two-edged sword.
    24. Re:Excel Charts by TheRaven64 · · Score: 1

      Maybe excel was just the right tool for the job?

      Given prior experience, this seems highly improbably for any possible value of 'the job'.

      --
      I am TheRaven on Soylent News
    25. Re:Excel Charts by thePowerOfGrayskull · · Score: 1

      Indeed. When it comes to quickly making charts with minimal tweaking ... it's hard to beat Excel.

    26. Re:Excel Charts by capitaladot · · Score: 1

      A family member who is a Phd had to resubmit a journal article for that reason (one of his co-authors did something similar to the article). Binary blobs (in this case, screenshots and the ilk) are never your friend!

    27. Re:Excel Charts by lonecrow · · Score: 1

      Well doesn't IE support OLE?

  2. The video card in question.. by black3d · · Score: 5, Informative

    Specifically, a GTX480 (just over 1 B keys/sec), followed up by a Tesla 2050 at around 75% of the speed of the GTX480. (745 M keys/sec)

    --
    "The true measure of a person is how they act when they know they won't get caught." - DSRilk
    1. Re:The video card in question.. by MarkRose · · Score: 3, Funny

      I never would have suspected the GTX480 would have been good at this sorta thing.

      --
      Be relentless!
    2. Re:The video card in question.. by Sycraft-fu · · Score: 1

      That's strange, I wonder what the 480 is faster? The 2070 should be the Tesla equivalent to the GTX 480. nVidia doesn't make different processors, they just change the configuration for their cards. Teslas have no display output and generally more RAM.

    3. Re:The video card in question.. by black3d · · Score: 1

      It's quite possible a 2070 would perform just as well, except they tested it against a 2050. I thought the 2070 would actually perform better than the 480, due to screeds (still no idea if it's a word, but love it!) more RAM, but I presume it's simply what they had available. Maybe a dedicated 2070 wasn't on this year's budget? :)

      --
      "The true measure of a person is how they act when they know they won't get caught." - DSRilk
    4. Re:The video card in question.. by Sycraft-fu · · Score: 1

      Oh didn't notice that. I presume the 2050 is the GTX470 (or maybe 460) equivalent, which would explain things.

    5. Re:The video card in question.. by FilipeMaia · · Score: 2, Informative

      The reason for the GTX480 being faster is that it has 15 SM compared to 14 from the Tesla 2050. Also the GTX 480 runs at a higher clock speed (700 compared to 575). Put together this is 575/700*14/15 = 76.7% which comes pretty close to the 75%.

    6. Re:The video card in question.. by DavidD_CA · · Score: 1

      That's the sort of joke I'd expect from Slashdot.

      --
      -David
    7. Re:The video card in question.. by Anonymous Coward · · Score: 0

      Its scads.

    8. Re:The video card in question.. by ericcj · · Score: 2, Informative

      Chips on the GTX 480, C2050, and C2070 come from the exact same die and wafer. The C20XX GPUs actually run at a lower clock speed for 32-bit floating-point and integer operations than a GTX 480.

      C20XX series hardware is intended for physics/science/engineering calculations, where double-precision is preferred. The C20XX series is 4 times faster at double-precision calculations than the GTX 480. This is the sweet spot it is tuned for.

    9. Re:The video card in question.. by black3d · · Score: 1

      Interestingly, the study claims the C2050 result to be 77%, matching your calculations. It just doesn't match the graph, so I presume they were reporting on spikes as the top processing speed. Seems like you're spot on!

      --
      "The true measure of a person is how they act when they know they won't get caught." - DSRilk
  3. Slashdot dont need no GPU.... by Anonymous Coward · · Score: 0, Troll

    ... to sort my first post to the top.

    So why is this GPU sorting ability important to me?

  4. No Surprise... by Anonymous Coward · · Score: 5, Funny

    GPUs have always been better at sorting your money from your wallet.

  5. Not a barrier by Captain+Segfault · · Score: 5, Insightful

    This isn't a "barrier" like the "sound barrier". There are no special difficulties that start around 1G/sec! It's just a threshold.

    Don't get me wrong -- I'm not saying this isn't impressive, but no "barrier" was broken here!

    1. Re:Not a barrier by XanC · · Score: 4, Insightful

      It's not a threshold! It's just a milestone.

    2. Re:Not a barrier by CoolGopher · · Score: 3, Funny

      It's just a milestone.

      Hang on, since when do you measure sorting performance using a distance indicator? And an imperial one at that!

      No, this is not a serious comment.

    3. Re:Not a barrier by PatPending · · Score: 1

      It's just a milestone.

      I fail to see what a stone functioning as a milepost has to do with this.

      --
      What one fool can do, another can. (Ancient Simian Proverb)
    4. Re:Not a barrier by martin-boundary · · Score: 2, Funny

      It means the Queen of England herself endorses the calculations?

    5. Re:Not a barrier by Anonymous Coward · · Score: 0

      Technically sound barrier isnt a barrier either!

    6. Re:Not a barrier by caerwyn · · Score: 2, Insightful

      Actually, if you look at shockwave dynamics during the moment an object crosses from subsonic to supersonic velocity, it can very easily be considered much more of a barrier than 1gkeys/sec can.

      --
      The ringing of the division bell has begun... -PF
    7. Re:Not a barrier by PatPending · · Score: 4, Funny

      Actually, if you look at shockwave dynamics during the moment an object crosses from subsonic to supersonic velocity, it can very easily be considered much more of a barrier than 1gkeys/sec can.

      Actually in this case, your analogy should use ludicrous speed.

      --
      What one fool can do, another can. (Ancient Simian Proverb)
    8. Re:Not a barrier by Anonymous Coward · · Score: 0

      my brains.... are going into my feeeeeet

    9. Re:Not a barrier by ScrewMaster · · Score: 1

      Actually, if you look at shockwave dynamics during the moment an object crosses from subsonic to supersonic velocity, it can very easily be considered much more of a barrier than 1gkeys/sec can.

      Yes, it was

      --
      The higher the technology, the sharper that two-edged sword.
    10. Re:Not a barrier by Tumbleweed · · Score: 1

      Hang on, since when do you measure sorting performance using a distance indicator? And an imperial one at that!

      The GTX480 can make the Kessel Run in 18 Parsecs, too.

      And why should I have to add "Kessel" to my spelling dictionary? Geez.

    11. Re:Not a barrier by jamesh · · Score: 1

      Actually, if you look at shockwave dynamics during the moment an object crosses from subsonic to supersonic velocity, it can very easily be considered much more of a barrier than 1gkeys/sec can.

      I take it you didn't feel the disturbance in the force when the 1gkeys/sec barrier was broken then? It felt like a million GPU marking executives all took a deep breath and shouted "w00t!"

    12. Re:Not a barrier by jamesh · · Score: 1

      GPU marking executives

      and that should have been marketing executives obviously. I blame the migraine i am currently experiencing (presumably a result of said force disturbance :)

    13. Re:Not a barrier by __aaoyac5342 · · Score: 1

      "no "barrier" was broken here!" That's what she said.

    14. Re:Not a barrier by msormune · · Score: 0

      Thank you, Captain Obvious.

    15. Re:Not a barrier by Anonymous Coward · · Score: 0

      I don't know... What are the shockwave dynamic like at 1 billion keys/second ?

    16. Re:Not a barrier by sznupi · · Score: 1

      Though the effects are in full swing quite a bit before "the moment an object crosses from subsonic to supersonic velocity"; the "barrier" is marked by entry into transonic flight, around 0.2 Mach below the speed of sound.

      --
      One that hath name thou can not otter
    17. Re:Not a barrier by asliarun · · Score: 1

      It means the Queen of England herself endorses the calculations?

      Yeah, you can count on her.

    18. Re:Not a barrier by martin-boundary · · Score: 1

      The GTX480 can make the Kessel Run in 18 Parsecs, too.

      Pff, my Emacs can do the M-x kessel-run in O([No match]) time and space!

    19. Re:Not a barrier by ockegheim · · Score: 1

      Cool, I know what transonic means now... thanks Slashdot & Wikipedia!

      --
      I’m old enough to remember 16K of memory being described as “whopping”
    20. Re:Not a barrier by phagstrom · · Score: 1

      well, sort of.

    21. Re:Not a barrier by Laxori666 · · Score: 1

      This isn't a "barrier" like the "sound barrier". There are no special difficulties that start around 1G/sec! It's just a threshold.

      Don't get me wrong -- I'm not saying this isn't impressive, but no "barrier" was broken here!

      Yep. Actually, looking from the chart, it looks like they just thought 1G/s would be a pretty number, so they optimized their code until it was that fast, then called it a day.

    22. Re:Not a barrier by Lost+Race · · Score: 1

      Yes, that was the point: The sound "barrier" is actually sort of a barrier; the 10^n foos per bar "barrier" never is, for any value of n, foo, or bar.

    23. Re:Not a barrier by caerwyn · · Score: 1

      I think you missed the actual parent to my post: "Technically the sound barrier isn't a barrier either!" I was correcting that statement, not Captain Segfault's.

      --
      The ringing of the division bell has begun... -PF
  6. Um... by Anonymous Coward · · Score: 5, Insightful

    Algorithms aren't measured in "x per second"... only implementations are measured that way. The speed of an algorithm is described in big-O notation, such as O(n log n). The metric of "sorted keys per second" is largely useless, because it depends on the particular hardware setup.

    1. Re:Um... by martin-boundary · · Score: 1, Troll

      The problem with big-oh notation is that the constant isn't explicit, so for any given n (pick as large as desired), it is possible that O(nlogn) As always, in theory there's no difference between theory and practice, but in practice there is...

    2. Re:Um... by PrecambrianRabbit · · Score: 5, Interesting

      Given that the particular hardware setup is detailed here (a GTX 480 achieves the 1 billion keys/sec figure), and the algorithm used (radix sort) has known asymptotic behavior (O(nk) for n keys of length k), 10^9 keys/sec is quite meaningful, particularly since it's a significant implementation challenge (possibly even an algorithmic challenge) to port this algorithm to a GPU.

      Furthermore, I think sorting speed is appropriately measured in keys/sec. Big-O does not in fact describe the speed, but rather the upper bound of the growth of an algorithm's asymptotic running time, which needs to be paired with the implementation, architecture, and data set to determine a speed. It turns out the constant factors can actually be quite important in practice.

    3. Re:Um... by Anonymous Coward · · Score: 0

      Algorithms aren't measured in "x per second"... only implementations are measured that way. The speed of an algorithm is described in big-O notation, such as O(n log n). The metric of "sorted keys per second" is largely useless, because it depends on the particular hardware setup.

      Not quite. Big-O is a measure of asymptotic complexity, which is only part of how fast an algorithm is. The other part is the constant factors. An algorithm which runs in 2n time is twice as fast as an algorithm which runs in n time, even though they have the same Big-O complexity. An algorithm which a slower Big-O complexity but lower constant factors can be faster practically.

      There are many algorithms which have great Big-O complexities but which are rarely or never used due to big constant factors.

    4. Re:Um... by SpazmodeusG · · Score: 4, Interesting

      I wish they'd start putting the "P" into these Big-O notations, where the "P" is the number of processors. Some algorithms don't scale well, some do. Putting the P in illustrates this.

      eg. O( n/P ) illustrates an algorithm that scales perfectly with more cores added. O( n / log(P) ) not so much.

    5. Re:Um... by Concerned+Onlooker · · Score: 1

      Obviously you weren't close enough to hear the sonic boom.

      --
      http://www.rootstrikers.org/
    6. Re:Um... by Anonymous Coward · · Score: 0

      You think it is not useful to know how fast a certain algorithm runs on certain hardware?

    7. Re:Um... by Guido+del+Confuso · · Score: 2, Funny

      O( n / log(P) ) not so much.

      That algorithm does particularly poorly on just one processor. In fact, if it ran successfully the universe would implode.

    8. Re:Um... by frosty_tsm · · Score: 2, Interesting

      While an interesting idea, many algorithms behave unpredictably on multiple processors (depending on how much communication would be required). Some will even be slower!

      The affect that extra CPUs will have is too dependent on the hardware implementation to be able to formalize like this.

    9. Re:Um... by Anonymous Coward · · Score: 3, Informative

      Typically, I hear researchers describe the parallelism of an algorithm separately from its computational complexity (big oh notation) using the idea of "speedup."

      The perfect scaling in your first example has linear speedup, and the second example has logarithmic speedup (that is, the speedup is log(p)).

      Here is the relevant Wikipedia article.

    10. Re:Um... by Anonymous Coward · · Score: 1, Insightful

      The affect that extra CPUs will have is too dependent on the hardware implementation to be able to formalize like this.

      It's no more exotic than having several levels of cache between the CPU and RAM (or even treating RAM as a cache between CPU and network/spinning disk). Even O(N) algorithms curve when N overflows L1, L2, L3, and RAM.

    11. Re:Um... by Anonymous Coward · · Score: 0

      O(n/[log(P)......+1?])

    12. Re:Um... by Lobais · · Score: 1

      It doesn't only depend on hardware setup. It depends on the number of keys!
      Unless the algorithm is actually O(n) and not O(n log n), the speed drops width the increase of keys to sort.

    13. Re:Um... by Anonymous Coward · · Score: 0

      The problem with big-oh notation is that the constant isn't explicit

      That's because for most problems, it doesn't matter. (And algorithms that take less than linear time are exceedingly rare in practice: you already need linear time to completely read your input, after all.)

    14. Re:Um... by Anonymous Coward · · Score: 0

      I wish they'd start putting the "P" into these Big-O notations....

      When I put my "P" in, "Big-O's" follow within minutes. Sometimes seconds. : (

    15. Re:Um... by Kjella · · Score: 1

      Big-O notation does not describe the function on a physical processor with caches and pipelines, it's a pure mathemathical concept counting mathematical operations. Doubling n may mean things no longer fit in L1/L2/L3/RAM which will have huge performance implications, there is no simple notation for an actual implementation.

      Adding processors would be just one factor, what's the latency and bandwidth of the interconnects? What memory can they access and at what cost, NUMA plays a big role. There's a huge, huge difference between running something on a cluster (slow interconnects) and a supercomputer (fast interconnects) which is as last as big as the difference between CPU and GPU.

      I suppose you could create a theoretical parallelization with zero latency, infinite bandwidth counting only the mathmatical dependencies, but the results would be pretty much meaningless. Like this sorting algorithm you could parallelize with hundreds of thousands of simultanious compares - in theory. In practice the overhead of doing it would completely kill performance.

      --
      Live today, because you never know what tomorrow brings
    16. Re:Um... by badkarmadayaccount · · Score: 1

      Try cabergoline in combination with low dose SSRIs, though you need to be careful with your choice of the second.

      --
      I know tobacco is bad for you, so I smoke weed with crack.
    17. Re:Um... by Anonymous Coward · · Score: 0

      Um, it can be (and has) proven mathematically that that only works up to a point, then you get diminishing returns. You can not simply divide n by p. That won't work for ANY algorithm.

    18. Re:Um... by Anonymous Coward · · Score: 0

      But you need a new algorithm to take advantage of GPUs in sorting keys. This new algorithm does not need to have less complexity or be better in theory but to get the most work done with actual graphic cards.

  7. Link to Technical Paper by PatPending · · Score: 4, Informative
    --
    What one fool can do, another can. (Ancient Simian Proverb)
  8. I think the bubble sort would be the wrong way to by joeyadams · · Score: 4, Funny

    I think the bubble sort would be the wrong way to go.

    —Barack Obama

  9. New level of gaming. by Murdoch5 · · Score: 2, Funny

    Hold on, so I can play Jezz Ball, Chips Challenge and all my favourite old school games with a greater level of speed.

  10. PRON! Marches on! by jewishbaconzombies · · Score: 3, Funny

    Thank god, because my online Porn collection isn't going to sort itself.

    1. Re:PRON! Marches on! by Anonymous Coward · · Score: 1, Funny

      A porn collection is something that needs to be sorted manually.

    2. Re:PRON! Marches on! by Anonymous Coward · · Score: 1, Funny

      "I sorted that whole 1 giga porn collection single-handedly!".

    3. Re:PRON! Marches on! by Anonymous Coward · · Score: 0

      What are you going to sort it by? By dimensions of certain anatomical features, by ages, by perversity level, by number of participants or something else?

  11. Big deal. Radix sort works well IF ... by crovira · · Score: 3, Interesting

    you've got lots and lots of RAM with room for the keys and lots of space to waste for unfilled pointers.

    Pass 1, read the records, at the key radix store a record URI
    Pass 2, sweep RAM and read the record URIs in the order you encounter them copying them onto a sequential write device.

    I was doing this years ago.

    If you are careful with the number and sizes of your read buffers, the re-read done for pass 2 doesn't have to be all that disk intensive.

    You can even generate the keys using what ever hash function you find that is truly efficient and store collisions separately (leave a bit high and go into the a link list maintained by the hash generator for those few keys that hash redundantly.)

    --
    MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
    1. Re:Big deal. Radix sort works well IF ... by Trepidity · · Score: 4, Insightful

      Well, yeah, they're not claiming they invented radix sort. They're claiming that their GPU implementation of radix sort runs about 4x as fast as the CPU implementation you describe.

    2. Re:Big deal. Radix sort works well IF ... by Anonymous Coward · · Score: 0

      Big deal, I doubt you were getting anywhere near the performance they were quoting. Was yours coded for a GPU as well?
      Or was this just your attempt at showing how cool you are?

    3. Re:Big deal. Radix sort works well IF ... by Anonymous Coward · · Score: 0

      I was doing this years ago.

      You, Sir, are an idiot and have not understood the article.

      Who the hell modded this guy up?

    4. Re:Big deal. Radix sort works well IF ... by kramulous · · Score: 3, Interesting

      So, does that mean if they went out and got the fastest Xeon processor (they used the fastest gpu - excluding the C2070), wrote parallel and used the Intel Compiler (writing to it) the speedup over the cpu is less than zero?

      After having just looked at their code, also remove the cpu stl shit (actually any template since they don't vectorise). If you are writing speedy code for the gpu, to perform an adequate test for the cpu you also have to write appropriately.

      hahahahaha

      This gets better and better...
      They only timed the compute time. Cudamalloc was not part of the timing or cudamemcpy.

      Sorry, I only count 'time to solution'. That is all i'm interested in. I thought is was strange that a less than O(n**2) was faster on the gpu.

      It is like writing benchmarks that ignore disk IO, ignore reading from system memory, L3 cache, etc. Only time stuff that is in the registers.

      --
      .
    5. Re:Big deal. Radix sort works well IF ... by evilWurst · · Score: 2, Insightful

      It's generally not size of RAM that breaks radix sort; it's the size of cache. Modern processors are highly reliant on cache, which means they're highly reliant on things in memory being in small tight chunks that fit in cache - because cache misses are expensive enough that if you thrash cache badly enough, you may end up running slower than if you hadn't had any cache at all.

      Good comparison sorts may start fragmented, but by their very nature each pass of the algorithm makes them less so. Radix sort is the other way around; it follows pointers (so more precious scarce cache in use already) that point in more and more fragmented patterns with every pass. That's why even though radix sort's average speed is theoretically faster than quicksort, quicksort still wins on real life hardware. And that's probably why radix sort wins on GPUs - the data fits in the card's dedicated memory, which is already optimized to be accessed in a much more parallel way than main memory.

    6. Re:Big deal. Radix sort works well IF ... by Anonymous Coward · · Score: 0

      Sorry. Speedup less than one.

    7. Re:Big deal. Radix sort works well IF ... by julesh · · Score: 1

      This gets better and better...
      They only timed the compute time. Cudamalloc was not part of the timing or cudamemcpy.

      Sorry, I only count 'time to solution'. That is all i'm interested in.

      Total time including initialisation and reporting result is only useful if what you're looking at is a complete process -- but when was the last time you just needed to sort something and that sorting wasn't part of a larger algorithm? It's at least not totally out of the question that the data will already be in GPU memory, and that the next step of processing can occur there too, at which point these (O(n)) overheads become irrelevant.

    8. Re:Big deal. Radix sort works well IF ... by Anonymous Coward · · Score: 0

      1. They're already maxing out the memory. What else are you going to do? Sort again? 2. The bandwidth issue kills the speed gain. Hence the O(n**2) comment. This will always be the case with current architectures. Now, Intel have a chip coming soon with gpu and CPU on the same die (i5) which will make things much better. Can't wait for that.

    9. Re:Big deal. Radix sort works well IF ... by Anonymous Coward · · Score: 0

      They only timed the compute time. Cudamalloc was not part of the timing or cudamemcpy.

      You don't usually time mallocs, and because a CUDA GPU is an asynchronous coprocessor, measuring memcpy isn't meaningful. The GPU is able to do processing while memcpy'ing.

    10. Re:Big deal. Radix sort works well IF ... by Anonymous Coward · · Score: 0

      So, does that mean if they went out and got the fastest Xeon processor (they used the fastest gpu - excluding the C2070), wrote parallel and used the Intel Compiler (writing to it) the speedup over the cpu is less than zero?

      After having just looked at their code, also remove the cpu stl shit...

      They compare to the fastest published CPU sorting results for the fastest CPUs at the time and on Larrabee (results from published papers out of Intel in both cases). The STL CPU code in there is for correctness testing, not timing results.

      Yes you need to be aware of PCIe transfer time, malloc time, etc. But sorting is only a primitive anyway, not an end unto itself. In many algoirthms the thing you need to sort is already on the card, because the sort is an intermediate step in the middle of some bigger algorithm. And as you point out, a code that really is sort-heavy will be disk-limited anyway.

      To me the most impressive thing about these results is the 770 Mkey-value/pairs per second. Sorting usually needs to carry a payload of some sort...

    11. Re:Big deal. Radix sort works well IF ... by Anonymous Coward · · Score: 0

      What you describe is counting sort ( http://en.wikipedia.org/wiki/Counting_sort ), not Radix sort ( http://en.wikipedia.org/wiki/Radix_sort ). Radix sort makes a pass for each integer 'digit' in the numbers, so unless you're saying there is only one 'digit' in the integers they sorted, the algorithm you describe isn't accurate.

      Perhaps you were doing something else years ago, thus had different results?

  12. Ugh. by martin-boundary · · Score: 3, Insightful
    Stupid HTML ate my <...

    The problem with big-oh notation is that the constant isn't explicit, so for any given n (pick as large as desired), it is possible that O(nlogn) < O(n) for some choice of constants. That's why ops per second is still a useful metric when comparing implementations on standardized hardware.

    As always, in theory there's no difference between theory and practice, but in practice there is...

    1. Re:Ugh. by jmerlin · · Score: 1

      nlog(n) < n when log(n) < 1, log(n) < 1 when n < e (or 10, depending on whether you know math or you just do math), the next biggest integer is 3 (or 10). When you're talking about 3 (or 10) things, you usually don't care about the efficiency of the algorithm. When testing on standardized hardware, theoretical and actual results are close except in two cases: 1. When n is very small. 2. When optimization (be it in software or hardware) plays a large role. Both of these are largely negligible by using a very large test case and compiling in debug mode, so in reality, this metric is quite honestly useless.

    2. Re:Ugh. by martin-boundary · · Score: 1

      That's not the range where the constants matter, though. Take log(10^12) = 28 approx, then for n=10^9, an algo that's nlogn is equivalent to (or even better than) an algo that's 28n over the range [1,n]. Now a factor of 30 in practical implementations is not that rare. Penalties due to caching effects or disk I/O can amount to much more and amortized over the full runtime could give that. So I don't think it's that pathological to expect some pairs of algos to compare oppositely in theory and in implementation.

    3. Re:Ugh. by metacell · · Score: 2, Informative

      Dude, an algorith which is O(n*log(n)) is not faster than O(n) just because n*log(n) < n.

      When an algorithm is O(n* log(n)), it means the actual time requirement is p*n*log(n)+q, where p and q are constants specific to the algorithm.

      The O(n*log(n)) algorith is faster than the O(n) one when

      p1*n*log(n)+q1 < p2*n+q2

      ... and for any n, it is possible to choose p1, p2, q1 and q2 so that the O(n) algorith becomes faster.

      This means, for example, that an algorithm which is O(n*log(n)) isn't automatically faster than an algorithm which is O(n) on lists with three elements or more. The O(n*log(n)) algorithm may take a hundred times longer to sort a list of two elements than the O(n) one (due each step being more complex), and in that case the lists will need to grow some before the O(n*log(n)) algorithm becomes faster.

    4. Re:Ugh. by gmthor · · Score: 1

      I take it, that you don't get the big O notation correct. As the constants are different, it makes no sense to calculate when nlog(n) < n.

      Ask yourself, when is C*n*log(n) < K*n, for unknown C and K(these constants depend on the implementation) and you will see that the only way to determine this is measuring the result of your implementation.

      --
      How do I uncompress my MD5 archive?
    5. Re:Ugh. by Anonymous Coward · · Score: 0

      No, it is also possible to sit down and count the cycles each assembler instruction will take.
      I have plenty of source-codes where the comments are cycle counts from a specific point.

    6. Re:Ugh. by SashaM · · Score: 1

      Uhh, what?

      An algorithm that runs in n*log(n) is slower than n.

      Also, big-O is the wrong thing to use here - you want little-o (or little-omega). An algorithm being O(f(n)) only means that the algorithm runs in (asymptotically) less than or same number of of steps as f(n). An algorithm that runs in k*log(n)+p steps is in O(n^2) and in O(n) and in O(log(n)). To show that an algorithm A (that runs in f(n) steps) is faster than algorithm B (that runs in g(n) steps) you need to show that g(n) is in w(f(n)) ('w' is little-omega), or conversely that f(n) is in o(g(n)) ('o' is little-o).

      I hate the use of the equals sign with the asymptotical growth notations, because it is extremely confusing to people who forget that O(f(n)) is a set, as are o(f(n)), w(f(n) and Omega(f(n)). If you use '=', you get confusing things like n=O(n^2) and n^2=O(n^2), so one might deduce that n=n^2. So let's everyone agree to use \in instead of '=' from now on. OK?

      Big-O Notation

    7. Re:Ugh. by metacell · · Score: 1

      Um, you're right, the left and right hand of the inequality should switch places...

  13. Not impressed unless... by PmanAce · · Score: 0

    I will not be impressed until it plays a video of sorting examples using humans at 120 FPS while doing its other sorting!

    --
    Tired of my customary (Score:1)
    1. Re:Not impressed unless... by Anonymous Coward · · Score: 1, Funny

      Please update your sig to reflect your new status... It should be "Tired of my customary (Score:0)"

    2. Re:Not impressed unless... by chill · · Score: 1

      A Benny Hill skit is the first thing that pops to mind.

      Theme music here, for those too young to remember. (URL:http://www.youtube.com/watch?v=MK6TXMsvgQg)

      --
      Learning HOW to think is more important than learning WHAT to think.
    3. Re:Not impressed unless... by PmanAce · · Score: 0

      Done. :(

      --
      Tired of my customary (Score:1)
  14. x86 by wargod667 · · Score: 1

    at the whim of sounding like a complete arrogant jerk, does this actually support the theory that we've hit a wall with the x86 architecture with general computing, or is it still the 'best' at that?

    I'm a bit confused when they, again and again, come up with gpu architecture doing stuff miles faster that our old trusty x86. I mean, would it be meaningful to actually start researching the architectural way of computing like we have seen gaming consoles do for decades? (latest being the ps3's cell)

    just a thought that came to mind.

    1. Re:x86 by emmons · · Score: 4, Informative

      GPUs are highly parallel processors, but most of our computing algorithms were developed for fast single core processors. As we figure out how to implement new solutions to old problems to take advantage of these highly parallel processors, you'll continue to see stories like this one. But, there's a limit to how good they can be at certain types or problems. Read up on Amdahl's law.

      Basically, traditional x86 processors are good at lots of stuff. Modern GPUs are great at a few things.

      --
      Do you even know anything about perl? -- AC Replying to Tom Christiansen post.
    2. Re:x86 by Mr+Z · · Score: 1

      Ok, so the page lists a 240M keys/sec sorting implementation for a quad-core Core i7, versus their 1G keys/sec for the GPU. How many processors were on that GPU? It's a 4x speedup end-to-end, sure, but it sure didn't scale linearly in the number of processors.

      Computing tasks will always be a mix of truly serial vs. mostly serial but parallelizable with herculean effort vs. embarrassingly parallel (and all points in between). GPUs work on the embarrassingly parallel stuff with ease, and fall off as you get more serial. x86s, on the other hand, attack truly serial tasks with clock rate and mostly serial tasks with speculation and a bunch of extra hardware that hunts for every spec of available parallelism. Which type of processor does best depends on the problem you're trying to solve.

    3. Re:x86 by jhachen · · Score: 2, Interesting

      The answer to hitting a wall with traditional CPUs is complicated. The number of transistors on new CPUs is actually keeping up with Moore's law. The size of the transistors and power consumption is also steadily decreasing. However clock speeds have been stagnant and performance/clock cycle hasn't been increasing as fast as it has in the past.

      When it comes to raw performance numbers GPUs destroy CPUs. The problem is trying to take advantage of the power GPUs offer. For starters the algorithm has to be parallel in nature. And not just part of it, the majority of the algorithm has to be parallel or the overhead will erase any performance gain. The application also has to run long enough to justify offloading it to the GPU or again, the overhead will get you.

      Even if you have a parallel algorithm, implementing it isn't trivial. To use CUDA or OpenCL you have to have not only a good understanding of general parallel programming but also a good understanding of the architecture of the GPU hardware. These languages are not user friendly. They really put the burden on the programmer. On the other hand this does mean they can be very powerful in the right hands.

      Now lets say your application meets these criteria and you implemented it in CUDA and got a 10x speedup. No one with an ATI card can run it. Sure you could implement it in OpenCL instead to be cross platform but OpenCL seems to still be in it's infancy and not as mature as CUDA.

      I'm not trying to say GPGPU computing has no future, just that it has a long way to go. Parallel Programming has actually had quite the revival lately and I'm truly interested to see where it goes. Some type of parallel compiler that relieves the programmer of having to deal with all the headaches associated with parallel programming would be ground breaking and have awesome implications. Some people claim this isn't possible. If this topic interests you I would recommend looking into reconfigurable computing. Theres some real interesting stuff going on in that area and it supports a much wider range of algorithms than GPGPU currently does.

    4. Re:x86 by JohnnyBGod · · Score: 1

      While it is a nice theoretical result and provides food for thought, generally Amdahl's law isn't that meaningful, since the parallelizable portion of a given problem usually increases along with the problem's size. There are exceptions, of course, but normally people just throw bigger problems at the hardware.

    5. Re:x86 by Anonymous Coward · · Score: 0

      its not that the algorithms are developed for single core processors, so much as there really only seems to be one way to do certain tasks which lends itself to a nonparallel method

    6. Re:x86 by emmons · · Score: 1

      Very true. There are other limitations to GPUs as well. They don't handle branching well at all, some scatter/gather operations are slow, etc.

      But don't get me wrong; you'd have to pry my dual ATI (AMD) Radeon 5870's from my cold, dead fingers. And hopefully soon I'll get my hands on a GTX 485... :)

      --
      Do you even know anything about perl? -- AC Replying to Tom Christiansen post.
  15. most real life sorting involves indirection by Anonymous Coward · · Score: 2, Insightful

    The typical sorting problem I've encountered in my career (various types of scientific, telecommunications and business software, though not games) involves an array of pointers to fixed length records that have a sort key (let's say an integer) at a predefined offset. Not an array of integers, nor an array of directly embedded small fixed length records which I'm guessing was used in TFA. The former situation requires random as well as stream access to memory, which would likely favor processing by the CPU in the motherboard of a typical $1000-$2000 PC.

    1. Re:most real life sorting involves indirection by MadKeithV · · Score: 2

      Solution: preprocess the data into a map of keys to pointers, and sort that.

  16. No by Sycraft-fu · · Score: 5, Informative

    GPUs are special kinds of processors, often called stream processors. They are very efficient at some kinds of operations, and not efficient at others. Some things, they run literally a thousand times faster than the CPU. Graphics rasterization would be one of these (no surprise, that's their prime job). However other things they run much slower. For something to run fast on a GPU it has to meet the following requirement, the more it matches them, the faster it is:

    1) It needs to be parallel to a more or less infinite extent. GPUs are highly parallel devices. The GTX 480 in question has 448 shaders, meaning for max performance it needs to be working on 448 things in parallel. Things that are only somewhat parallel don't work well.

    2) It needs to not have a whole lot of branching. Modern GPUs can branch, but they incur a larger penalty than CPUs do. So branching in the code needs to be minimal. It needs to mostly be working down a known path.

    3) When a branch happens, things need to branch the same way. The shaders work in groups with regards to data and instructions. So if you have half a group branching one way, half the other, that'll slow things down as it'll have to be split out and done separately. So branches need to be uniform for the most part.

    4) The problem set needs to fit in to the RAM of the GPU. This varies, 1GB is normal for high end GPUs and 4-6GB is possible for special processing versions of those. The memory on board is exceedingly fast, over a hundred gigabytes per second in most cases. However the penalty for hitting the main system RAM is heavy, the PCIe bus is but a fraction of that. So you need to be able to load data in to video RAM and work on it there, only occasionally going back and forth with the main system.

    5) For very best performance, your problem needs to be single precision floating point (32-bit). That is what GPUs like the best. Very modern ones can do double precision as well, but at half the speed. I don't know how their integer performance fares over all, they can do it, but again not the same speed as single precision FP.

    Now this is very useful. There are a lot of problems that fall in that domain. As I said, graphics would be one of the biggest, hence why they exist. However there are many problems that don't. When you get ones that are way outside of that, like, say, a relational database, they fall over flat. A normal CPU creams them performance wise.

    That's why we have the separate components. CPUs can't do what GPUs do as well, but they are good at everything. GPUs do particular things well, but other things not so much.

    In fact this is taken to the extreme in some electronics with ASICs. They do one and only one thing, but are fast as hell. Gigabit switches are an example. You find that tiny, low power, chips can switch amazing amounts of traffic. Try it on a computer with gigabit NICs and it'll fall over. Why? Because those ASICs do nothing but switch packets. They are designed just for that, with no extra hardware. Efficient, but inflexible.

    1. Re:No by black3d · · Score: 3, Insightful

      Unfortunately I can't mod having already posted in this thread, but please allow me to /bow. This is the best explanation I've ever read anywhere for the differences. Even I knew the differences but couldn't have expressed it so finely. Bravo.

      --
      "The true measure of a person is how they act when they know they won't get caught." - DSRilk
    2. Re:No by Anonymous Coward · · Score: 0

      Thank you, my day on slashdot wasn't wasted, I learned a lot from your post.

    3. Re:No by Sycraft-fu · · Score: 1

      Thanks :). I've had to give that speech to professors a number of times so I've some practice. Some faculty don't understand the whole GPGPU thing well and think they are a panacea for speeding up things. We (computer support) need to explain to them the limits, so they can evaluate if their research is the kind that would benefit.

    4. Re:No by speedingant · · Score: 1

      Thanks mate - that's a great explanation.

    5. Re:No by FlawedLogic · · Score: 2, Informative

      The GTX480 can actually do a double precision op per clock cycle. Fermi was designed with DP supercomputing in mind which is why it's so bloody expensive. To get the price down for consumer cards they removed that ability since graphics doesn't generally need it. Consumer cards need four ticks to do the equivalent DP op.

    6. Re:No by Anonymous Coward · · Score: 0

      Can't we brute-force problems that are not well parallelized using ... artificial neural networks? ANNs are parallelized well and should be flexible enough to work on any problem.

    7. Re:No by Sycraft-fu · · Score: 1

      The GTX480 is the consumer grade part. Quadros are the pro graphics cards, Teslas are the pro GPGPU units, GTXs are consumer graphics cards.

    8. Re:No by FlawedLogic · · Score: 1

      My mistake, getting my NVidia parts confused, I blame senility. Does make you wonder what a Tesla would make of their sorting algorithm though doesn't it?

    9. Re:No by LoztInSpace · · Score: 1

      Great post. Thanks.
      I may be wrong/living in the past here but I was under the impression that another limitation of GPUs is that they "cheat". By that I mean that in that some of their operations produce perfectly acceptable results when translated to graphic/visible operations but at the expense of some mathematical correctness, meaning you have to watch out how you apply some of those algorithms to non-graphical applications.
      You seem to know what you are talking about - maybe you can tell me if this is the case?

    10. Re:No by Anonymous Coward · · Score: 0

      Hmm. Speaking as an astronomer, with a more-than-passing interest in computing - do Fast Fourier Transforms meet those conditions? As I understand it, the algorithm is highly parallel at the initial stages (when you're doing transforms of short blocks and combining them), then becomes less parallel as you start combining them into larger and larger products and approach the final result. I guess that sort of pattern's common to lots of divide-and-conquer algorithms.

      (Actually, my biggest current problem is embarrassingly parallel, with 10^5 s of chunks which can happily be FFTed separately - but the runtime isn't long enough to concern me yet. When I get more data, though...)

    11. Re:No by psilambda · · Score: 1

      Really modern GPUs (Fermi) can do OK on branches if the branches occur at a warp level or stay to a small number (less than 16) or stop--partially because of support for concurrent kernels. (Check out http://psilambda.com/ if you think that concurrent kernels can not be used.) You are incorrect or imprecise to state that there is a problem if it splits in half--if a warp splits in half, then there is less performance but if half of the warps split at a branch then there is not a performance loss (i.e. it splits at the warp level). There is no "penalty" in these cases.

      It is wrong to say: "a relational database, they fall over flat. A normal CPU creams them performance wise.". It depends on what you are talking about. As one example, for SQL window functions over partitions that involve any calculations (not just lag and lead--thank you), the GPU can cream the CPU. Check out: http://psilambda.com/2010/07/kappa-quick-start-guide-for-windows/ for some throughput numbers. The relational database is more the bottleneck--not the GPU. Other examples for relational databases that could benefit from the parallel processing of the GPU might be calculating the optimal query plan with large numbers of joins, sorting, merging keys (joins), etc. (if the relational database is written to process blocks of tuples instead of a tuple at a time--a relational database written to process blocks of tuples would also be more efficient on the CPU because of call overhead and cache).

    12. Re:No by Surt · · Score: 1

      About the same, because (take your pick):
      the fp op rate shouldn't matter to this algorithm as far as I can tell.
      the fp op rate is the same in tesla/gtx480 (ignoring minor clock speed differences)

      There are SOME slower consumer gtx cards, where the fp rate is 1/4 of the gtx480/tesla, but since those aren't in the picture ...

      --
      "Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
    13. Re:No by Surt · · Score: 1

      They cheat in the rasterization phase rather than in the setup phase, generally, so anything written in CUDA should be pretty safe.

      --
      "Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
    14. Re:No by TheRaven64 · · Score: 1

      FFT was one of the first algorithms to be implemented in shaders. It's used in a lot of image coding things, so GPUs often include dedicated instructions that do parts of it. nVidia actually includes FFT as one of the examples in the CUDA SDK, as I recall.

      --
      I am TheRaven on Soylent News
    15. Re:No by sulimma · · Score: 1

      Special purpose hardware is always faster than general purpose hardware.

      Except in the general case...

  17. Efficient? by scotty.m · · Score: 2, Interesting

    Not sure how they are defining efficient in this experiement. I'd like to see how many clock cycles it took to sort each problem, maybe how much memory the radix sort is using too.

    --
    Has anyone really been far even as decided to use even go want to do look more like?
    [ST8Z6FR57ABE6A8RE9UF]
    1. Re:Efficient? by Trepidity · · Score: 1

      Wallclock time: they're claiming that this is, in absolute numbers, the fastest sort in keys-per-second yet demonstrated on commodity hardware.

  18. Also by Sycraft-fu · · Score: 5, Interesting

    CS people get way too caught up in Big O forgetting that, as you note, it is the theory describing the upper bound on time, not actual practice AND is only relevant for one factor, n. A great example is ray tracing. CS people love ray tracing because most create a ray tracer as part of their studies. They love to talk on about how it is great for rendering because "It is O(log n)!" They love to hype it over rasteraztion like current GPUs do. However there's two really important things they forget:

    1) What is n? In the case, polygons. It scales with the log of the number of polygons you have. This is why ray tracer demos love showing off spheres made of millions of polygons and so on. It is cheap to do. However turns out polygons aren't the only thing that matters for graphics. Pixels also matter and ray tracing is O(n) with relation to pixels and it gets worse for anti aliasing. For FSAA you cast multiple rays per pixel. That means that 4x FSAA has 4x the computation requirements. Turns out rasterization scales much better with regards to resolution, and AA. In fact these days 4x FSAA is often O(1) in actual implementation in that it doesn't hit frame rate to turn it on. That is also why ray tracing demos are low resolution, because THAT'S the hard part.

    2) Real hardware limits. In the case of ray tracing, it is memory access. While those polygons scale in a perfectly logarithmic fashion in theory, real system RAM isn't so forgiving. As you start to have more and more you run in to RAM bandwidth limits and things slow down. All the memory access required becomes a limit that wasn't on paper. You can cry all you like that it wasn't a problem in theory, on actual hardware you run in to other limits.

    People need to better understand that it is a theoretical tool for comparing speed factors algorithms. That is useful, but you have to then consider the reality of the situation. CS people also need to understand for users, there is only one benchmark that matters: The clock on the wall. Whatever is faster is better. Doesn't matter what the theory says, if the hardware can do X faster than Y, then X is better according to users.

    1. Re:Also by chammy · · Score: 1

      This is why ray tracer demos love showing off spheres made of millions of polygons and so on. It is cheap to do. However turns out polygons aren't the only thing that matters for graphics.

      The spheres would most likely be represented as an equation, not a soup of polygons. It's much more efficient (not to mention a whole lot easier) to raytrace. It's also infinitely precise, which is actually why a lot of people are more interested in raytracing than approximating things with polygons.

      For instance, it's a heck of a lot easier to render isosurfaces in a raytracer than turning them into a big polygon soup and rasterizing that.

    2. Re:Also by Anonymous Coward · · Score: 1, Informative

      Doesn't matter what the theory says, if the hardware can do X faster than Y, then X is better according to users.

      Normally big-O notation is applied on a purely theoretical level where all operations are assumed to have the same base cost in terms of time to execute. This does not make the notation invalid for real world applications and implementations, however. But when doing so you have to adjust your formulas by adding the proper weighting to execution time. In practice this is usually a waste of effort as it's generally faster to just write an implementation and time it on various pieces of hardware.

      In the article we're talking about, they are comparing a single implementation of a single algorithm on several pieces of hardware. So first of all the summary shouldn't be shouting about breaking any kind of record- they weren't trying to hit any particular benchmark it's a relative test of the hardware, not the algorithm or implementation.

      The reason why using big-O and making comparisons is useful, is because if all we use is this type of test the answer to any speed problem is simply "Get faster hardware, or buy a piece of hardware which runs this code faster". In all likliehood, there are other methods which, when implemented on the same hardware, may yield much faster results. Heck, it's possible someone else's implementation of the same algorithm may yield faster results as well.

      In regards to your comments about raytracing vs. polygon rendering, all I'm going to say is that you don't have a very good concept of what raytracing really is if you think a sphere created from a million triangles will raytrace faster than one modeled as a mathematical sphere. It won't- those demonstrations are a pure head-to-head comparison operating on a scene which has actually been optimized for a non-raycasting technique.

    3. Re:Also by Thiez · · Score: 1

      I don't think it's a fair comparison, really. The only reason modern games look so pretty is because they have the GPU that happens to be designed to help them out. If companies had been designing GPUs to do raytracing for over a decade, it seems rather likely to me that we could do raytracing at modern resolutions.

      Wouldn't it be more fair to compare raytracing to rasterization when you're only allowed to use the CPU?

    4. Re:Also by sznupi · · Score: 1

      Not quite, you can just buy Larab...oh, wait, it seems there were some problems with its goals...

      --
      One that hath name thou can not otter
    5. Re:Also by smallfries · · Score: 1

      Pixels also matter and ray tracing is O(n) with relation to pixels and it gets worse for anti aliasing.

      Not sure what your point is there. Any algorithm for rendering that spits out n pixels at the end has an O(n) lower-bound.

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
    6. Re:Also by jonaskoelker · · Score: 1

      People need to better understand that it is a theoretical tool for comparing speed factors algorithms. That is useful, but you have to then consider the reality of the situation.

      Heck, why don't we computer scientists do the last step of this whole science thing and come up with a model that describes the actual reality more accurately?

      Oh, we do. Consider the I/O model, the cache-oblivious model or (for a fun combination) the parallel I/O model. (See for instance Aggarwal & Vitter: The Input/Output Complexity of Sorting and Related Problems.)

      if the hardware can do X faster than Y, then X is better according to users.

      Well, users as a collection run your algorithm with a distribution of inputs. You can't say that an algorithm is faster than another (either big-O or wall-clock) without talking about a particular case (worst, best, average) and distribution (diff works "better" on lines of code than paragraphs of latex/text; that's mostly a usability thing, but it illustrates different distributions of real-world data).

    7. Re:Also by Anonymous Coward · · Score: 0

      Sounds like an argument for bubble-sort. Today the data is sorted when we get it, so we'll just run bubble sort to ensure that, since it's the fastest sort algorithm on already-sorted data. Next year, when the data source switches to unsorted data, you're completely hosed since you chose the fastest algorithm by the clock on the wall to shave a few seconds, rather than picking a good algorithm that makes sense.

    8. Re:Also by Anonymous Coward · · Score: 2, Interesting

      You are missing the same part almost every other CS student out there misses

      K*O(N)+C

      There is usually some K involved that is done on every operation. This can turn a log n op into something huge. Or the set up (the C part) could take longer than to just brute force the issue. Many people make this mistake. However, hardware does not stand still either. People expect next years graphics to be more photo realistic than last years.

      You use raster vs ray (because you hate how long ray takes) as the reason you hate big O. Instead you should wonder isnt raster just a special low cost implementation of ray casting where the reflection is set to 1 or 2? Eventually ray will take over. It has to. There is only so much more pre cooking you can do (which is what raster is). The results do look good and people use it to good effect. However, eventually you have to ray trace it. In fact today you use many ray casting methods. Such as z buffer, per vertex shading, per pixel shaders, etc. These methods were invented to speed up ray casting. They work with raster as it is special case of ray tracing, one where many of the pixels are pre rendered. Eventually you can not get the correct colors unless you figure out all the light sources in the scene. Raster also has a scaling problem. In that you can stretch the pixels out but they turn into jaggy squares. Aliasing helps some but can only do so much, as the information just does not exist. As screens get larger and higher pixel depths, and higher geometry rates you eventually just can not move the raster images around the system fast enough to keep up.

      How do people know these things? They use big O and the other constants to see what it is doing. If you think nVidia, AMD, and Intel are going to keep making raster only systems from here til the end of time your dreaming. Intel cant make a graphics part we all know this. But they to date have made the most compelling ray tracing one. nVidia, and AMD both have stated eventually it will be ray casting. As they know raster only works for so long. As you will see more and more general highly parallel special purpose GPUs coming from these folks. Eventually it will be as easy to do ray casting in combination with raster to create very compelling graphics. Notice I didnt say raster goes away. Anything but, it will be around for a long time.

      Ray casting has a distinct advantage long term over raster. It is done 'as code' meaning you are not slugging huge bitmaps thru the system bus. But instead rendering directly into the video memory. The code has an advantage as it can be loaded and probably fit into the high speed memory system cache. Bitmaps will need to get ever larger to accommodate better looking graphics and higher polygon counts.

      Also you use FSAA as a 'this looks better' when in fact in quite a few cases it looks worse. You forget the FS part of that full screen. Meaning things that should not be aliased are. Ray casting can take that into account PER object, PER pixel, and depending on the z buffer and then decide only when its needed. With ray tracing you dont need FSAA all the time because you have already done the alias part at the ray cast level.

      Also raster image memory is not cheap either. It is why they built special bus's to handle it. You know AGP, PCIe, etc... It was a practical solution to what the industry is doing right now. However if you notice PCIe is not only graphics but about bandwidth in general. While AGP was a special bus for graphics. They know there are things coming up that need more bandwidth, ray casting being one of many.

      Also have you not wondered why the graphics chips guys have been in a huge hurry to make themselves into parallel processors? They do not want to be thrown to the side. The CPU eventually would be able to handle the raster images that they do today. With room to spare. They must make the graphics better and do it faster than a CPU could. It just so happens there are stacks of graphics algs t

    9. Re:Also by pz · · Score: 3, Interesting

      People need to better understand that it is a theoretical tool for comparing speed factors algorithms. That is useful, but you have to then consider the reality of the situation.

      Right. And any good programmer understands that *first* you pick the right algorithm, and *then* you optimize the implementation. Working the other way around is wasting time.

      But, more importantly, that the parent seems to miss, is that the speed improvement from changing the order of growth of an algorithm can swamp nearly any possible improvements in hardware. Going from O(n^4) to O(n^2) when n is 10,000 is like replacing your desktop box with ten thousand supercomputers. No amount of implementation tweaking or upgrading processors and memory -- which only affects the constants in algorithm speed -- is going to give you the same level of speedup.

      There is a very, very good reason to pay attention to O() analysis of algorithms, and, in particular, their implementation. You can implement bubble sort which is O(n^2) when done correctly, in a brain-dead way that makes it O(n^3) --- if you, for example, re-extend the output array for each new element --- and the difference is huge. Extra factors of n creep in easily, and understanding them can be highly useful.

      So, the parent can review real-world constraints all he wants, but in the end, order of growth is more important.

      --

      Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
    10. Re:Also by pitchpipe · · Score: 1

      You can cry all you like that it wasn't a problem in theory, on actual hardware you run in to other limits.

      Reminds me of the quote:

      In theory there is no difference between theory and practice. In practice there is. - Yogi Berra

      --
      Look where all this talking got us, baby.
    11. Re:Also by Anonymous Coward · · Score: 0

      First off, any CS person worth their salt recognizes that raytracing is generally O(n log m) where n is the number of pixels, and m is the number of polygons. (Also, anyone who uses "spheres made of millions of polygons" is doing it wrong. :) You use mathematically perfect spheres, and only use polygons for complex geometry.)

      Secondly, you're comparing apples and oranges. Like in the GGP, algorithms are nothing more than a recipe for work. If I said, here's how you build a fence: "dig a hole for each post, put a post in each hole, fill the remaining space in the hole with concrete, attach horizontal boards, attach fence boards to horizontal boards", would you then say "Oh, using this method I can build 200ft of fence per day!" No, you wouldn't, because that wouldn't make any sense. That would be *highly* dependent on how many people were working on the job, what their skill level was, experience, etc. At best, you could say the work only depends on the length of the fence and the number of boards, etc.

      On the other hand, if you were an expert fence builder with his own fence building company with a skilled team of employees, giving a speed measurement in x feet per day is meaningful and relevant. That is exactly why it's perfectly normal to say "My implementation of this algorithm can do 1 billion key sorts per second on hardware configuration X, but only reaches 750mil/sec on hardware configuration Y", to say the same about the algorithm makes no sense.

      I can write an implementation of a binary search tree query [O(log n)] that takes three days to run with 10 items in the tree. The algorithm is still highly efficient, but I would intentionally do stupid things to make it slow. Remember, O(f(n)) means = c*f(n)... c can be *really* large.

      My point is be careful what you're talking about... an algorithm does not take clock time in to account.. an implementation does. Also, big-O is very important to understand. A O(2^n) algorithm will never be fast, no matter how much you try. O(log n) means it *can* be fast.

      The problem is that some CS people really don't know what they're talking about, which is also true of some EECE folks, and well, any other field. You generally only need a C+ in algorithms class to get your undergrad degree.

    12. Re:Also by Surt · · Score: 1

      It's not that the operations aren't equal cost that is the problem in real implementations, it's the number of them. E.g., if my O(n) sort requires a million instructions setup, and a few hundred per bucket assignment, that may significantly impact how it compares in performance to an O(n log(n)) sort that requires only a hundred instructions of setup, and 3 operations per swap.

      --
      "Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
    13. Re:Also by Surt · · Score: 1

      Different ns.

      --
      "Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
    14. Re:Also by Rich0 · · Score: 1

      I dunno - sometimes practicality dictates the implementation, and then you have to pick the algorithm to match.

      Suppose hard drive storage is slow, but costs pennies per gigabyte, and RAM is lightning-fast but costs tens of dollars per gigabyte (don't need to use much imagination there).

      I need to sort a 1TB file. Algorithm 1 works faster than algorithm 2 on an O(n) basis. However, algorithm 1 requires random access across the file, and algorithm 2 only requires sequential access. In almost any scenario algorithm 2 is going to be much faster, unless you're willing to buy 1 TB of RAM (and then unless the data is already in RAM you still waste a full cycle through the data loading it into RAM).

      Now, suppose algorithm 3 is slower than both 1 and 2, but it is already implemented in a library you can link to and save yourself a week's worth of coding, testing, and follow-ups on bugs/etc. If the resulting software is only moderately used, or the sort happens infrequently, then it might be cheaper for the users of your software than paying for an extra week of your time in the license cost.

      Many algorithms involve tradeoffs like this. The important thing is for programmers to UNDERSTAND the nature of these tradeoffs and understand the requirements of the software (not just what it needs to do, but in terms of the overall environment). The choice of algorithm and implementation should be a conscious and understood one.

    15. Re:Also by pz · · Score: 1

      Those are all valid practical reasons, but theory still wins out. If the implementation has different constants for different algorithms, that's all well and good until the data set gets large enough.

      O(n) will, for large enough n, always be faster than O(n^2). The range for n might be beyond your current level of interest, which is certainly something to take into consideration when selecting an algorithm, but you would be very surprised at how quickly n^2 grows, even when the leading constant is very small. An O(n^2) algorithm that works reasonably quickly for n=100, say one second, is unlikely to be useful when n grows to 100,000 and suddenly the running time is a million times slower. If you have been so foolhardy to implement as O(n^3), it becomes highly questionable if there are enough computers at your disposal to have a reasonable running time. At O(n^4) there are not enough computers on the planet, even if you could cobble them all together. How easy is O(n^4)? Surprisingly so, if one part of a system has O(n^2) behavior, and it calls a second part of the system, perhaps implemented by a different programmer, that also has O(n^2) behavior.

      Order of growth wins, every time, once the problem gets big enough. If the programmer is lazy, and does not pay attention to global algorithmic optimization, it wins very very quickly. Developing a faster algorithm is far, far more powerful than spending more money on hardware.

      --

      Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
    16. Re:Also by Spykk · · Score: 1

      This is why ray tracer demos love showing off spheres made of millions of polygons and so on.

      Ray traced spheres are generally not made up of polygons. One of the benefits of ray tracing is that you can render a true sphere without having to try and build one out of triangles.

    17. Re:Also by jackbird · · Score: 1

      Show me an offline raytracer that beats offline rasterizers for speed on arbitrary datasets with usable antialiasing.

    18. Re:Also by oldhack · · Score: 1

      That's why EEs make the best software engineers. We also jeer at CS lackeys who dropped out of EE into CS. It was either that or accounting for these guys.

      That's right, you heard me, weenies.

      --
      Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
    19. Re:Also by Anonymous Coward · · Score: 0

      Typical. I know A LOT of EEs. They ALL think they can do software, and better then actual software engineers. They can't.

  19. Benchmark details desired by Anonymous Coward · · Score: 0

    Were the benchmarks performed on lists that already resided on the GPU cards or do the benchmarks include "I/O" to and from CPU memory?

    How does the implementation perform when the entire list does not fit in memory on the GPU card?

    1. Re:Benchmark details desired by kramulous · · Score: 1

      Goto the source; They only timed the compute time. Cudamalloc was not part of the timing or cudamemcpy.

      Sorry, I only count 'time to solution'. That is all i'm interested in. I thought it was strange that a less than O(n**2) was faster on the gpu.

      It is like writing benchmarks that ignore disk IO, ignore reading from system memory, L3 cache, etc. Only time stuff that is in the registers.

      --
      .
  20. Re:I think the bubble sort would be the wrong way by DirePickle · · Score: 1

    That guy was pretty cool. Whatever happened to him?

  21. Great by robinvanleeuwen · · Score: 1, Funny

    No we can finally sort the digits of Pi...

    --
    If you don't like my sig then don't read it.
    1. Re:Great by Anonymous Coward · · Score: 0

      I can sort the unique digits for you

      0,1,2,3,4,5,6,7,8,9
      phew! Took a while but did it without a gpu

      Non-unique?
      0000000000000000000000000000000000000000....

    2. Re:Great by kayumi · · Score: 0

      Do you mean that you have a simple proof that the number of '0' digits
      in the decimal expansion of PI is unbounded. Please post it if your
      margin is big enough.

    3. Re:Great by mangu · · Score: 1

      The proof that the number of at least two digits in the decimal expansion of pi is unbounded is trivial, because otherwise we would come to a point where there would be just one digit repeating endlessly and pi would be rational.

      My math skills aren't good enough to prove that all of the ten digits must be unbounded, but I suspect this must be the case.

  22. 1 billion? Up it to over 4 billion! by Mr+Z · · Score: 3, Funny

    You know, if they up it to just a bit over 4 billion unique 32-bit keys, say around 4,294,967,296 or so, I think I could sort them rather efficiently, as long as they weren't attached to any payload. ;-)

  23. Memory bandwidth by brunes69 · · Score: 1

    I suspect your problem is going to become getting the keys in and out of the GPU memory, not the time the GP takes to sort them.

  24. Plain text files by mangu · · Score: 1

    Why can't they spend as much effort on learning the simple and effective tools that exist to work with text files as they spend learning Excel tricks?

    I have seen people pay for "Advanced Excel" training in order to do what I can do with a few lines of Unix utilities. The cost/benefit ratio of Excel is very poor.

    Plain text has another advantage in that it's the most universal standard for data that exists. I can read text files from the start of the computing age without too much problem. I can even scan punched cards or punched tape to recover the data. I can scan and OCR text from printed pages.

    The only advantage Excel has is familiarity for people in the corporate environment. A familiarity that has come through a large spending in training and software purchases.

    1. Re:Plain text files by Anonymous Coward · · Score: 0

      If you believe that, I'm sorry to say you are a fucking idiot.

      Excel is nowhere near the same as a plaintext file.

  25. gaming by Anonymous Coward · · Score: 0

    Notice that some of the most significant advances in computing lately are a result of technologies developed for gaming? Further proof that if you poor a little money into optimization of tech to better suit human needs (gaming), you can still reap benefits for the "hardcore" sciences. Science for science sake is generally not as profitable or beneficial as science for human sake. Sink less money into general "I wonder if", and more into "I wonder if I can solve THIS specific problem." Many of the greatest scientists did much of their greatest work when given a specific issue to tackle via a goverment, corporation, or family member.

    1. Re:gaming by Chatterton · · Score: 1

      But a lot of problems would still have solutions if there was no "I wonder if" scientists. I know some orphan disease (some of them are less than 100 know cases in the world) who are only worked by "I wonder if" scientists and none of the big pharmaceutical companies.

  26. Re:I think the bubble sort would be the wrong way by Anonymous Coward · · Score: 0

    True colors coming through. He lied before he was elected, why would he stop? He would even change his views mid stream.

  27. Re:1 billion? Up it to over 4 billion! by Anonymous Coward · · Score: 0

    Funny, but you're basically right. Even sorting a billion 32 bits keys can be done more efficiently with a "tally sort". Allocate and zero a billion-bit array (128 MB); for every input N you set the Nth bit. Loop over the array, and if the Nth bit is set, print N. Note that the amount of RAM needed (128 MB) is less than even the input size (4GB).

  28. Re:I think the bubble sort would be the wrong way by Anonymous Coward · · Score: 1, Informative

    His naivety hit the real world and exploded.

  29. It's more of a lossy compression.

  30. Open Sourcing algorithms by General+Wesc · · Score: 1

    open sourced an algorithm

    Does it horrify anyone else that we talk this way now?