Five Nvidia CUDA-Enabled Apps Tested
crazipper writes "Much fuss has been made about Nvidia's CUDA technology and its general-purpose computing potential. Now, in 2009, a steady stream of launches from third-party software developers sees CUDA gaining traction in the mainstream. Tom's Hardware takes five of the most interesting desktop apps with CUDA support and compares the speed-up yielded by a pair of mainstream GPUs versus a CPU-only setup. Not surprisingly, depending on the workload you throw at your GPU, you'll see results ranging from average to downright impressive."
post.push("First!");
All fine and dandy, but...does it run Linux?
The Hacker's Guide To The Kernel: Don't panic()!
With NVIDIA slowly pushing its way into the CPU market (CUDA is the first step; in a few years I wouldn't be surprised if Nvidia started developing processors) and Intel trying to cut into Nvidia's GPU market share with Larrabee http://en.wikipedia.org/wiki/Larrabee_(GPU), we'll see who can develop outside of their box faster. This is good news for AMD since Intel will be more focused on Nvidia instead of being neck and neck with them in the processor market. Hey, maybe AMD will regain its power in the server and netbook realms.
There's also going to be a battle of patents pretty soon too. Wish I was a tech lawyer.
"The difference between genius and stupidity is that genius has its limits" - Albert Einstein
What I don't understand is why people hype a technology that is tied to a specific card manufacturer. If nvidia died tomorrow, we'd have a fair amount of code that's no longer relevant, unless there was some way to design cards that are CUDA-capable but not nvidia.
Also worth noting that I'd completely forgotten CUDA even ran on windows, as I've only heard it in the context of linux recently.
Fold@home can use CUDA in linux, but you have to compile the CUDA driver first.
Absolute power corrupts absolutely. indymedia
Totally not a biased, money-hatted site. Totally. Trust us.
(Not saying they're biased in this case, but because of the bullshit they've pulled in the past I'll never visit their site again.)
Waste your GPU cycles on something more interesting than SETI...
http://www.gpugrid.net/
http://distributed.net/download/prerelease.php (ok, maybe that's less interesting...)
And why limit this discussion to CUDA? ATI/AMD's STREAM is usable as well...
http://folding.stanford.edu/English/FAQ-ATI
The same way the DoD paid for the Cray supercomputers, gamers are paying for the GPUs. Science dropped by and said thanks.
Can't wait for the APL support. Reorganising my keyboard keys in anticipation.
h.264 encoding didn't improve with more shaders for some of the results (like PowerDirector 7), because of the law of diminishing returns.
I remember reading about x264 when quad-cores were becoming common. It mentioned that if quality is of the utmost importance, you should still encode on a single core. It splits squares of pixels between the cores; where those squares connect there can be very minor artifacts. It smooths these artifacts out with a small amount of extra data and post processing; the end result is a file hardly 1-2% bigger than if encoded on a single core, but encoded roughly 4x faster.
Now, if we're talking about 32 cores, or 64, or 128, would the size difference be bigger than 1-2%? Probably. After a certain point, it would almost certainly not be worth it.
This is supported by Badaboom's results, where the higher resolution videos (with more encoded squares) seem to make use of more shaders when encoding, while most of the lower resolution vids do not. (indicating that some shaders may be lying idle)
What I'm curious about is whether the 9800GTX could encode two videos at once, while the 9600GT could only manage one. ;)
I'm also curious why the 320x240 video encoded so quickly - but that could be from superior memory bandwidth, shader clockspeed, or some other important factor in h.264 encoding.
Take it with a grain of salt; I'm not an encoder engineer; just regurgitating what I once read, hopefully accurately. ;)
The Tesla 1060 is a video card with no video output (strictly for processing) that has something like 240 processor cores and 4 GB of GDDR3 RAM. Just doing math on large arrays (1k x 1k) I get a performance boost of about a factor of forty over a dual core 3.0 GHz Xeon.
The CUDA extension set has FFT functionality built in as well, so it's excellent for signal processing. The SDK and programming paradigm are super easy to learn. I only know C (and not C++) and I can't even make a proper GUI, but I can make my array functions run massively in parallel.
The trick is to minimize memory moving between the CPU and the GPU because that kills performance. Only the brand newest cards support functionality for "simultaneous copy and execute" where one thread can be reading new data to the card, another can be processing, and the third can be moving the results off the card.
One way that the video people can maybe speed up their processing (disclaimer: I don't know anything about this) is to do a quick sweep for keyframes, and then send the video streams between keyframes to individual processor cores. So instead of each core getting a piece of the frame, maybe each core gets a piece of the movie.
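A rough sketch of that keyframe idea in Python, assuming the stream has already been scanned into a list of frame types where 'I' marks a keyframe. The frame list, `split_at_keyframes`, and `encode_segment` are all hypothetical stand-ins, and the "encoding" just counts frames; a real encoder would also have to deal with rate control across segments.

```python
from concurrent.futures import ThreadPoolExecutor

def split_at_keyframes(frame_types):
    """Return (start, end) index pairs, one per run of frames starting at a keyframe ('I')."""
    if not frame_types:
        return []
    starts = [i for i, t in enumerate(frame_types) if t == "I"]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # frames before the first keyframe form their own segment
    ends = starts[1:] + [len(frame_types)]
    return list(zip(starts, ends))

def encode_segment(segment):
    # Placeholder for real encoding work: report how many frames were handled.
    start, end = segment
    return end - start

frame_types = list("IPPPIPPIPPPP")  # hypothetical 12-frame stream
segments = split_at_keyframes(frame_types)

# Each worker gets a piece of the movie rather than a piece of each frame.
with ThreadPoolExecutor(max_workers=4) as pool:
    frames_done = list(pool.map(encode_segment, segments))
```

Because every segment starts at a keyframe, no segment needs pixels from another one, which is what makes the pieces independent.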
The days of the math coprocessor card have returned!
I thought Nvidia was indicating they were going to move to supporting OpenCL, or are they simply planning to support multiple technologies?
Jumpstart the tartan drive.
We've run some signal processing on a Tesla card, and get roughly 500x improvement over (somewhat poorly written) code for a Core 2 Duo.
~8 hr on a Core 2 Duo
~1.5 hr on Core i7
seconds on Tesla
Well I didn't say my code was *well* written. Apparently there's a lot of trickery with copying global memory to cached memory to speed up operations. Cached memory takes (IIRC) one clock cycle to read or write, and global GPU memory takes six hundred cycles. And there's all this whatnot and nonsense about aligning your threads with memory locations that I don't even bother with.
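To put rough numbers on why that staging trick matters, here is a back-of-the-envelope model in Python using the cycle counts recalled above (1 cycle for cached memory, 600 for global). It is a deliberately serial model: it ignores the way a GPU hides latency by switching between threads, and the function name and workload sizes are made up for illustration.

```python
CACHED_CYCLES = 1    # figure quoted above (from memory, so approximate)
GLOBAL_CYCLES = 600  # figure quoted above (from memory, so approximate)

def total_cycles(reads_per_element, elements, staged):
    """Cycles spent on memory reads when each element is read several times."""
    if staged:
        # One slow global read to copy the element into cached memory,
        # then every reuse hits the fast copy.
        return elements * (GLOBAL_CYCLES + reads_per_element * CACHED_CYCLES)
    # Every read goes all the way out to global memory.
    return elements * reads_per_element * GLOBAL_CYCLES

naive = total_cycles(reads_per_element=16, elements=1024, staged=False)
staged = total_cycles(reads_per_element=16, elements=1024, staged=True)
ratio = naive / staged
```

The more often each value is reused, the closer the saving gets to the reuse count itself; with a single read per element, staging only adds overhead.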
OpenCL is an Open Standard compute language which comprises:
If you're writing an OpenCL-aware device driver for a GPU, you'll probably need to wait a bit for some open source examples. It's reasonably likely that there will be some included in Darwin (once updated for Snow Leopard).
Look to the LLVM project (sponsored heavily by Apple and others) for an open source compiler which will (if it doesn't already) know about OpenCL.
It sounds like you might be looking for a higher level API which allows you to more easily use OpenCL, or possibly for language bindings to Java or Python perhaps? I suspect you'll see those coming along, once Apple ships Snow Leopard, and people have a chance to kick the tires, and then integrate LLVM into their tool chains, extend various higher level APIs, bridge to Java and whatnot.
The earliest high level API to take easy and broad advantage of OpenCL will probably be from Apple, of course. They'll likely provide some nicely automatic ways to take advantage of OpenCL without programming the OpenCL C API directly. As a Cocoa programmer, you'll be using various high level objects, maybe an indexer for example, which have been taught new OpenCL tricks. You'll just recompile your program and it will tap the GPU as appropriate and if available. The Cocoa implementation is closed source, but people will see what's possible and emulate it in various open source libraries, on other platforms, for Java and other languages.
Here's a good place to start: OpenCL - Parallel Computing on the GPU and CPU. Follow up with a google search.
If you mod me down, I shall become more powerful than you could possibly imagine.
That's the whole point of the OpenCL architecture, to let the compiler figure out the hardware specific optimizations. If you want a cross platform, GPU-independent mechanism to:
[ _Booming_ _Monster_ _Truck_ _Voice_]
Tap the hidden potential of your GPU! then you want OpenCL.
If you mod me down, I shall become more powerful than you could possibly imagine.
Apple and other OpenCL partners are undoubtedly looking forward, beyond SIMD, to the coming generation of MIMD capable GPUs such as the nVIDIA GT300.
If you mod me down, I shall become more powerful than you could possibly imagine.
That explains so much about me. Classic. Great link. ;-)
Quack, quack.
Those benchmarks show that even older ($120-140) nVidia GPU cards can really speed up some processing tasks, especially transcoding video. But what I think is even more exciting than just the acceleration from offloading CPU to GPU is using multiple GPU cards in a single host PC. Stuff a $1000 PC with $1120 in GPUs (like 8 $140 nVidia cards), and that's 1024 parallel cores, anywhere from 16x to 56x the performance at only just over double the price. PCI-e should make the data parallel fast enough to feed the cards. I bet that 8 $1000 cards stuffed into a $1000 PC would be something like 200x to 4000x for only 9x the price.
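The price side of that estimate is easy to check. A small Python sketch of the arithmetic, using the dollar figures above; the 128 cores per card is an assumption inferred from 1024 cores across 8 cards, and the performance multipliers are left out because they are guesses rather than measurements.

```python
def build_cost(base_pc, card_price, cards):
    """Total price of a host PC plus a number of identical GPU cards."""
    return base_pc + card_price * cards

BASE_PC = 1000
CARDS = 8
CORES_PER_CARD = 128  # assumption: 1024 cores / 8 cards

budget_build = build_cost(BASE_PC, 140, CARDS)     # eight $140 cards
high_end_build = build_cost(BASE_PC, 1000, CARDS)  # eight $1000 cards

budget_ratio = budget_build / BASE_PC      # "just over double the price"
high_end_ratio = high_end_build / BASE_PC  # "9x the price"
total_cores = CARDS * CORES_PER_CARD
```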
So what I want to see is benchmarks for whole render farms. I want to see HD video transcoded into H.264 and other formats simultaneously on the fly, in realtime, with true fast-forward, in multiple independent streams from the same master source. This stuff is possible now on a reasonable budget.
--
make install -not war
Actually, what you are referring to is simultaneous DMA and kernel execution, and this is available in every card that has compute 1.1 capability, which is every card except the very first G80 series cards (8800 GTX and 8800 GTS). The GPU itself executes the DMA, pulling memory that has been allocated as aligned and page-locked, and this can be overlapped with kernel execution; it has nothing to do with GPU or CPU threads. Transfers from non-page-locked memory are always synchronous and so can't be overlapped with kernel execution. But, generally, yes, host -> device memory bandwidth is the bottleneck for most CUDA applications. Applications that can perform a large amount of processing on the same data, provided that data fits in device memory all at once, are able to mitigate this, but this doesn't usually include supercomputing or general coprocessor-esque applications (transcoding).
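A toy timing model of that overlap, in Python rather than CUDA so it runs anywhere. It assumes the work is cut into equal chunks, that one copy and one kernel can be in flight at the same time, and that the per-chunk times are constant; the millisecond figures and function names are invented for illustration and are not measurements.

```python
def serial_time(chunks, copy_ms, kernel_ms):
    """Synchronous transfers: each chunk is copied, then processed, one after another."""
    return chunks * (copy_ms + kernel_ms)

def overlapped_time(chunks, copy_ms, kernel_ms):
    """Idealised two-stage pipeline: the copy of chunk i+1 overlaps the kernel of chunk i."""
    if chunks < 1:
        return 0
    # First copy cannot be hidden, last kernel cannot be hidden;
    # in between, the slower of the two stages sets the pace.
    return copy_ms + (chunks - 1) * max(copy_ms, kernel_ms) + kernel_ms

sync_total = serial_time(chunks=8, copy_ms=3, kernel_ms=5)
async_total = overlapped_time(chunks=8, copy_ms=3, kernel_ms=5)
```

When the copy takes longer than the kernel, the `max` term is the copy time, which is the model's way of saying the bus stays the bottleneck no matter how well the work is overlapped.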
I assume that's what the parent meant.
As an addendum, the newest CUDA 2.2 (with a chip of the newest generation, i.e. GT200) actually has support for reading directly from (page-locked) host memory inside of GPU kernels... something I believe ATI cards have allowed for a while.
I'm currently running 2x Geforce 9800GTX and dual-booting Ubuntu and XP.
What interesting and practical things am I, the average schmoe with a gaming pc, able to do with CUDA today? What resources have I been squandering?
see: Amdahl's law
"....is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors." http://en.wikipedia.org/wiki/Amdahl%27s_law
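For anyone who wants to plug in numbers, Amdahl's law is short enough to write out in Python. The function and parameter names are my own; `parallel_fraction` is the share of the original runtime that can be spread across processors.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Theoretical maximum speedup when only part of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)

# Even with 90% of the work parallelised, 128 shaders cannot reach a 10x speedup,
# because the remaining 10% still runs at the original speed.
speedup_128 = amdahl_speedup(0.9, 128)
```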
You get the big speedup only if you're doing single precision floating point computations.
On the NVIDIA GTX 280 & 260, a multiprocessor has eight single-precision floating point ALUs (one per core) but only one double-precision ALU (shared by the eight cores). Thus, for applications whose execution time is dominated by floating point computations, switching from single-precision to double-precision will increase runtime by a factor of approximately eight.
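The "approximately eight" figure is the limiting case where floating point work accounts for all of the runtime. A minimal Python sketch of the in-between cases, assuming the non-floating-point part of the runtime is unchanged and the floating point part slows down by the 8:1 ALU ratio; the function name and the `fp_fraction` parameter are hypothetical.

```python
def double_precision_slowdown(fp_fraction, alu_ratio=8):
    """Runtime multiplier when the floating point share of the work slows by alu_ratio."""
    if not 0.0 <= fp_fraction <= 1.0:
        raise ValueError("fp_fraction must be between 0 and 1")
    return (1.0 - fp_fraction) + fp_fraction * alu_ratio

fully_fp_bound = double_precision_slowdown(1.0)  # runtime is all floating point
half_fp_bound = double_precision_slowdown(0.5)   # half the runtime is floating point
```

A kernel that spends much of its time waiting on memory would see a smaller penalty than eight under this model.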
A lot of my HPC customers do CFD with (1) double precision in (2) Fortran. 1 and 2 are not easy or fast with CUDA.
There's no place like 127.0.0.1
Is that for single or double precision work? Which Xeon exactly? Which compiler? How was the code written for the compiler? Which compiler flags?
Although I don't dispute your claims, writing to get max performance out of the newer Xeons is *hard* and you need to be very careful. The 256 bit wide registers on the 54xx can be extremely handy for codes written the right way.
I currently have a client that needs to run a lot of this and so far, I have the single cpu version running 10x faster than the parallel version running on 8 cores (single node). Only simple changes thus far although there is a particularly nasty data structure in there that is next for the chopping block.
Just saying.
Yeah, zerocopy is what they're calling it. It's most interesting in Nvidia's latest integrated chipsets because the latency is much lower than across the PCI-E bus, which allows for some interesting applications (it wouldn't be that hard to write a sound driver that could process almost everything in hardware on your GPU, and you could probably use the SPDIF mixed out over HDMI to actually output the sound directly).