Slashdot Mirror


Five Nvidia CUDA-Enabled Apps Tested

crazipper writes "Much fuss has been made about Nvidia's CUDA technology and its general-purpose computing potential. Now, in 2009, a steady stream of launches from third-party software developers sees CUDA gaining traction in the mainstream. Tom's Hardware takes five of the most interesting desktop apps with CUDA support and compares the speed-up yielded by a pair of mainstream GPUs versus a CPU-only setup. Not surprisingly, depending on the workload you throw at your GPU, you'll see results ranging from average to downright impressive."

17 of 134 comments

  1. Re:Nice, but... by slummy · · Score: 4, Informative

    CUDA is a framework that will work on Windows and Linux.

  2. Re:Nice, but... by gustgr · · Score: 4, Informative

    I know you are trolling, but actually CUDA applications work better on Linux than on Windows. If you run a CUDA kernel on Windows that lasts longer than 5~6 seconds, your system will hang. The same will happen on Linux but then you can just disable the X server or have one card providing your graphical display and another one as your parallel co-processor.

  3. Re:Nice, but... by mikiN · · Score: 4, Funny

    Well, everywhere else in the world, Linux runs the CUDA Toolkit, so I can imagine that in Soviet Russia, a Beowulf cluster of Nvidia cards runs Linux.

    --
    The Hacker's Guide To The Kernel: Don't panic()!
  4. Tied to a card by ComputerDruid · · Score: 5, Insightful

    What I don't understand is why people hype a technology that is tied to a specific manufacturer of card. If Nvidia died tomorrow, we'd have a fair amount of code that's no longer relevant, unless there was some way to design cards that are CUDA-capable but not Nvidia.

    Also worth noting that I'd completely forgotten CUDA even ran on Windows, as I've only heard of it in the context of Linux recently.

    1. Re:Tied to a card by gustgr · · Score: 5, Insightful

      OpenCL will hopefully help to set a solid ground for GPU and CPU parallel computing, and since it is not technically very different from CUDA, porting existing applications to OpenCL will not be a challenge. Nowadays with current massively parallel technology the hardest part is making the algorithms parallel, not programming any specific device.

    2. Re:Tied to a card by Anonymous Coward · · Score: 4, Informative

      I hear this a lot in CUDA/GPGPU-related threads on Slashdot, primarily from people who simply have zero experience with GPU programming. The bottom line is that in the present and for the foreseeable future, if you are going to try to accelerate a program by offloading some of the computation to a GPU, you are going to be tying yourself to one vendor (or writing different versions for multiple vendors) anyways. You simply cannot get anything approaching worthwhile performance from a GPU kernel without having a good understanding of the hardware you are writing for. nVidia has a paper that illustrates this excellently, in which they start off with a seemingly good "generic" parallel reduction code and go through a series of 7 or 8 optimizations -- most of them based on knowledge of the hardware -- and improve its performance by more than a factor of 30 versus the generic implementation.
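
      For anyone who hasn't seen a parallel reduction, here is a rough CPU-side sketch of the access pattern in Python. It is not GPU code and not the code from the nVidia paper; the function name and the power-of-two padding are assumptions made purely for illustration. Each pass of the inner loop stands in for work that a block of GPU threads would do in a single step.

```python
# Sketch only: models the access pattern of a tree-style parallel reduction
# on the CPU. Hypothetical helper, not taken from any vendor sample.

def tree_reduce(values):
    """Sum values pairwise; return (total, number_of_passes)."""
    data = list(values)
    n = len(data)
    if n == 0:
        return 0, 0

    # Pad with zeros to a power of two so every pass halves cleanly.
    # (An assumption for simplicity; real kernels handle ragged sizes.)
    size = 1
    while size < n:
        size *= 2
    data.extend([0] * (size - n))

    stride = size // 2
    passes = 0
    while stride > 0:
        # On a GPU, each value of i would be one thread, and all of them
        # would run in the same step before the next pass begins.
        for i in range(stride):
            data[i] += data[i + stride]
        stride //= 2
        passes += 1
    return data[0], passes


if __name__ == "__main__":
    print(tree_reduce(range(1, 9)))
```

      Eight inputs take three passes instead of seven sequential additions. How the passes are laid out in memory is where the hardware knowledge comes in.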

      Another thing to keep in mind is that CUDA is very simple to learn as an API -- if you're familiar with C you can pick up CUDA in an afternoon easily. The difficulty, as I said in the previous paragraph, is optimization; and optimizations that work well for a particular GPU in CUDA will (or at least should) work well for the same GPU in OpenCL.

    3. Re:Tied to a card by jared9900 · · Score: 4, Informative

      But OpenCL is a specification, not an implementation. The only 3 implementations I'm currently aware of are Apple's (with Snow Leopard), the AMD implementation demoed back in March, and Nvidia's beta implementation. So far none of those are open source. If you're aware of an open source implementation, please let me know; I'm actually very interested in it, but have yet to locate one.

  5. SETI? by NiteMair · · Score: 4, Informative

    Waste your GPU cycles on something more interesting than SETI...

    http://www.gpugrid.net/
    http://distributed.net/download/prerelease.php (ok, maybe that's less interesting...)

    And why limit this discussion to CUDA? ATI/AMD's STREAM is usable as well...

    http://folding.stanford.edu/English/FAQ-ATI

  6. h.264 encoding by BikeHelmet · · Score: 5, Informative

    h.264 encoding didn't improve with more shaders for some of the results (like PowerDirector 7), because of the law of diminishing returns.

    I remember reading about x264 when quad-cores were becoming common. It mentioned that if quality is of the utmost importance, you should still encode on a single core. It splits squares of pixels between the cores; where those squares connect there can be very minor artifacts. It smooths these artifacts out with a small amount of extra data and post processing; the end result is a file hardly 1-2% bigger than if encoded on a single core, but encoded roughly 4x faster.

    Now, if we're talking about 32 cores, or 64, or 128, would the size difference be bigger than 1-2%? Probably. After a certain point, it would almost certainly not be worth it.

    This is supported by Badaboom's results, where the higher resolution videos (with more encoded squares) seem to make use of more shaders when encoding, while most of the lower resolution vids do not. (indicating that some shaders may be lying idle)

    What I'm curious about is whether the 9800GTX could encode two videos at once, while the 9600GT could only manage one. ;)

    I'm also curious why the 320x240 video encoded so quickly - but that could be from superior memory bandwidth, shader clockspeed, or some other important factor in h.264 encoding.

    Take it with a grain of salt; I'm not an encoder engineer; just regurgitating what I once read, hopefully accurately. ;)

    1. Re:h.264 encoding by SpazmodeusG · · Score: 4, Informative

      Encoding from multiple different keyframes works when you can seek to any part of the input video but it doesn't help with realtime encoding.

      If I'm encoding a signal in realtime from TV, I have to start encoding from 0% onwards. The only way to parallelize it is to split the individual frames up into boxes (as done by Badaboom).

  7. Re:Tom's Hardware by XPeter · · Score: 5, Funny

    Totally not a biased, money-hatted site. Totally. Trust us.

    Hi! You must be new to the internet as well as Slashdot, let me give you some tips.

            1. Always use the word "lunix" in place of "linux" in slashdot's discussion forums.
            2. You can steal mod points by copying someone else's insightful comment and pasting it as a reply to an earlier one.
            3. Mac users are a bunch of fucking queers.
            4. When there's something you need to do that can't be done with Windows but can be done with Lunix, keep in mind that you can do an even better job with Mac OS X. Some argue that BSD can do it better but no one makes software for BSD since no one gives a flying fuck.
            5. Adequacy.org was one of the best sites on the internet. Want to know if your son's a computer hacker? Click here! http://www.adequacy.org/stories/2001.12.2.42056.2147.html

    Good luck, friend!

    --
    "The difference between genius and stupidity is that genius has its limits" - Albert Einstein
  8. Re:Nice, but... by 3.1415926535 · · Score: 4, Informative

    Folding@Home runs its computations in short bursts. gustgr is talking about a single computation kernel that takes more than 5-6 seconds.
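
    To make the "short bursts" idea concrete, here is a minimal sketch in Python. The `kernel` argument is just a plain Python function standing in for a GPU kernel launch, and both function names are made up for illustration. The burst size would have to be tuned so each launch finishes well under the timeout gustgr describes.

```python
# Sketch only: split one long computation into many short launches so that
# control returns to the host (and the display) between bursts.

def split_into_bursts(total_items, items_per_burst):
    """Yield (start, end) index ranges, one per short 'kernel launch'."""
    if items_per_burst <= 0:
        raise ValueError("items_per_burst must be positive")
    for start in range(0, total_items, items_per_burst):
        yield start, min(start + items_per_burst, total_items)


def run_in_bursts(data, items_per_burst, kernel):
    """Apply kernel to each slice separately and concatenate the results."""
    results = []
    for start, end in split_into_bursts(len(data), items_per_burst):
        # Each call here models one launch that finishes quickly.
        results.extend(kernel(data[start:end]))
    return results


if __name__ == "__main__":
    squares = run_in_bursts(list(range(5)), 2, lambda xs: [x * x for x in xs])
    print(squares)
```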

  9. Re:Nice, but... by Jah-Wren+Ryel · · Score: 5, Insightful

    Does it matter? Linux is not anywhere close to the target market,

    Linux support for CUDA matters hugely; Linux boxes are head and shoulders above any other market for CUDA-based software. That's because Linux is the OS for supercomputing nowadays and CUDA's biggest niche is the exact same kind of number crunching that is typically associated with supercomputer workloads.

    In fact, these GPUs are yet another example of how there is nothing new under the sun. A GPU is very much like the vector processor of Cray-style supercomputing (when Cray was still alive that is) aka SIMD (single instruction, multiple data).

    --
    When information is power, privacy is freedom.
  10. Well, it works awesome if your problem is parallel by Muerte23 · · Score: 5, Interesting

    The Tesla C1060 is a video card with no video output (strictly for processing) that has something like 240 processor cores and 4 GB of GDDR3 RAM. Just doing math on large arrays (1k x 1k), I get a performance boost of about a factor of forty over a dual-core 3.0 GHz Xeon.

    The CUDA extension set has FFT functionality built in as well, so it's excellent for signal processing. The SDK and programming paradigm are super easy to learn. I only know C (and not C++) and I can't even make a proper GUI, but I can make my array functions run massively in parallel.

    The trick is to minimize memory movement between the CPU and the GPU because that kills performance. Only the newest cards support "simultaneous copy and execute", where one thread can be copying new data to the card, another can be processing, and a third can be moving the results off the card.
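
    A back-of-the-envelope way to see why overlapping copy and execute helps is sketched below in Python. It assumes an idealized three-stage pipeline where every chunk costs the same fixed time per stage. The function names and the numbers in the example are hypothetical, not measurements from any card.

```python
# Sketch only: idealized timing model, all stage times in the same unit.

def serial_time(chunks, copy_in, compute, copy_out):
    """Every chunk is copied in, processed and copied out before the next."""
    return chunks * (copy_in + compute + copy_out)


def pipelined_time(chunks, copy_in, compute, copy_out):
    """The three stages overlap, so the slowest stage sets the pace."""
    if chunks == 0:
        return 0
    bottleneck = max(copy_in, compute, copy_out)
    # The first chunk pays for all three stages; each later chunk
    # finishes one bottleneck interval after the previous one.
    return (copy_in + compute + copy_out) + (chunks - 1) * bottleneck


if __name__ == "__main__":
    # Made-up stage times: 2 units to copy in, 5 to compute, 2 to copy out.
    print(serial_time(10, 2, 5, 2), pipelined_time(10, 2, 5, 2))
```

    With those made-up numbers the serial version takes 90 units and the overlapped one 54, and the gap widens as the transfers get more expensive relative to the math.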

    One way that the video people can maybe speed up their processing (disclaimer: I don't know anything about this) is to do a quick sweep for keyframes, and then send the video streams between keyframes to individual processor cores. So instead of each core getting a piece of the frame, maybe each core gets a piece of the movie.
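
    The keyframe idea could be sketched roughly like this in Python. Everything here is hypothetical: the function names, the round-robin assignment, and the assumption that frame 0 always starts a segment. It only shows the bookkeeping, not any actual encoding.

```python
# Sketch only: cut a movie into keyframe-aligned segments and deal them
# out to workers. No real codec or GPU API is involved.

def split_at_keyframes(frame_count, keyframes):
    """Return (start, end) frame ranges, each beginning at a keyframe."""
    if frame_count <= 0:
        return []
    starts = sorted({k for k in keyframes if 0 <= k < frame_count})
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # assumption: frame 0 always opens a segment
    ends = starts[1:] + [frame_count]
    return list(zip(starts, ends))


def assign_round_robin(segments, workers):
    """Give segment i to worker i % workers."""
    return [segments[i::workers] for i in range(workers)]


if __name__ == "__main__":
    segments = split_at_keyframes(100, [0, 30, 60])
    print(assign_round_robin(segments, 2))
```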

    The days of the math coprocessor card have returned!

  11. OpenCL is an Open Standard Compute Language by Gary+W.+Longsine · · Score: 5, Informative

    It's not really clear what you're looking for, possibly because you're looking for the wrong thing. It might help if you first spend an hour or three learning a little more about OpenCL, and reading up at various sites to see who's doing what.

    OpenCL is an Open Standard compute language which comprises:
    • a language extended from C99,
    • a platform (hardware + OpenCL-aware device driver), and
    • a compiler and runtime (which may decide where to send a compute task at run time).

    If you're writing an OpenCL-aware device driver for a GPU, you'll probably need to wait a bit for some open source examples. It's reasonably likely that there will be some included in Darwin (once updated for Snow Leopard).

    Look to the LLVM project (sponsored heavily by Apple and others) for an open source compiler which will (if it doesn't already) know about OpenCL.

    It sounds like you might be looking for a higher level API which allows you to more easily use OpenCL, or possibly for language bindings to Java or Python? I suspect you'll see those coming along once Apple ships Snow Leopard and people have a chance to kick the tires, then integrate LLVM into their tool chains, extend various higher level APIs, bridge to Java and whatnot.

    The earliest high level API to take easy and broad advantage of OpenCL will probably be from Apple, of course. They'll likely provide some nicely automatic ways to take advantage of OpenCL without programming the OpenCL C API directly. As a Cocoa programmer, you'll be using various high level objects, maybe an indexer for example, which have been taught new OpenCL tricks. You'll just recompile your program and it will tap the GPU as appropriate and if available. The Cocoa implementation is closed source, but people will see what's possible and emulate it in various open source libraries, on other platforms, for Java and other languages.

    Here's a good place to start: OpenCL - Parallel Computing on the GPU and CPU. Follow up with a google search.

    --
    If you mod me down, I shall become more powerful than you could possibly imagine.
  12. Re:Nice, but... by Jah-Wren+Ryel · · Score: 4, Informative

    Uhh...Cray is still very much alive. And doing vectors. And threads. And multicore. All long before Intel/AMD.

    Seymour Cray was killed by a speeding redneck in a trans-am in 1996.

    The company currently known as Cray was formerly known as Tera, which bought the assets of Cray Research from SGI, which had acquired Cray Research after Seymour left to form Cray Computer, which is also defunct.

    Seymour was never significantly involved in multi-core or multi-threaded processors or NUMA. In fact, he specifically avoided designs even hinting of that sort of complexity because he felt that simplicity in design made it easier to fully utilize the maximum performance of the hardware.

    --
    When information is power, privacy is freedom.
  13. Re:Nice, but... by parlancex · · Score: 5, Interesting

    In fact, these GPUs are yet another example of how there is nothing new under the sun. A GPU is very much like the vector processor of Cray-style supercomputing (when Cray was still alive that is) aka SIMD (single instruction, multiple data).

    Actually, not quite. The execution architecture in Nvidia's G80 series GPUs and onwards is actually SIMT, single instruction multiple threads. The not so subtle difference here is that in a SIMD vector architecture the application explicitly manages instruction-level divergence, which will generally narrow the SIMD width of divergent paths to only 1 path, whereas in a SIMT architecture, when threads diverge within a warp, all divergent threads executing the same branch within that warp can be issued an instruction simultaneously, with the threads that are not on that branch within that warp inactive for that cycle. This is transparent to the application. Currently in Nvidia's latest architecture the warp size is still statically set at 32 threads, so you'll see performance penalties when threads within any warp diverge, proportional to the number of unique paths taken. Interestingly, the next iteration of the hardware is rumored to feature a thread scheduler capable of variable warp sizes, probably still with some lower bound, but this would bring the GPU much closer to the ideal "array of independently executing processing cores" that we have in modern CPUs, but with obviously far more cores.
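
    As a toy illustration of that divergence penalty, the Python sketch below counts how many times an instruction would have to be issued if every warp serializes its unique branch paths. It is a simplified model with made-up names, assuming one branch id per thread and a fixed warp size of 32. It is not a description of how the hardware scheduler really works.

```python
# Sketch only: toy cost model for branch divergence within warps.

WARP_SIZE = 32


def warp_issue_slots(branch_ids, warp_size=WARP_SIZE):
    """Count issue slots when each warp runs its unique paths one by one."""
    slots = 0
    for start in range(0, len(branch_ids), warp_size):
        warp = branch_ids[start:start + warp_size]
        # Threads on the same path are issued together; each distinct
        # path in the warp costs one extra slot.
        slots += len(set(warp))
    return slots


if __name__ == "__main__":
    uniform = [0] * 64                               # no divergence
    per_warp = [i // WARP_SIZE for i in range(64)]   # branch on warp index
    per_thread = [i % 2 for i in range(64)]          # odd/even threads split
    print(warp_issue_slots(uniform),
          warp_issue_slots(per_warp),
          warp_issue_slots(per_thread))
```

    In this model, branching on the warp index costs nothing extra, while an odd/even split inside each warp doubles the slot count. That is why branch conditions aligned to warp boundaries are usually preferred.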