Yeah, zerocopy is what they're calling it. It's most interesting on Nvidia's latest integrated chipsets, because the latency is much lower than going across the PCI-E bus, which allows for some interesting applications (it wouldn't be that hard to write a sound driver that does almost all of its processing in hardware on your GPU, and you could probably use the SPDIF mixed out over HDMI to output the sound directly).
Actually, what you are referring to is simultaneous DMA and kernel execution, which is available on every card with compute capability 1.1 or higher; that is every card except the very first G80 series cards (8800 GTX and 8800 GTS).
The GPU itself executes the DMA, pulling from host memory that has been allocated as aligned and page-locked, and this can be overlapped with kernel execution. It doesn't have anything to do with GPU or CPU threads. Transfers from non-page-locked memory are always synchronous and as such can't be overlapped with kernel execution.
But, generally, yes, host -> device memory bandwidth is the bottleneck for most CUDA applications. Applications that can do a large amount of processing on the same data, provided that data fits in device memory all at once, are able to mitigate this, but that doesn't usually include supercomputing or general coprocessor-esque applications (transcoding).
In fact, these GPUs are yet another example of how there is nothing new under the sun. A GPU is very much like the vector processor of Cray-style supercomputing (when Cray was still alive that is) aka SIMD (single instruction, multiple data).
Actually, not quite. The execution architecture in Nvidia's G80 series GPUs and onwards is SIMT: single instruction, multiple threads.
The not so subtle difference is that in a SIMD vector architecture the application explicitly manages instruction-level divergence, which will generally narrow the SIMD width of divergent paths down to a single path. In a SIMT architecture, when threads diverge within a warp, all of the threads executing the same branch can be issued an instruction simultaneously, while the threads in that warp that are not on that branch sit inactive for that cycle. This is transparent to the application.
In Nvidia's latest architecture the warp size is still statically set at 32 threads, so when threads within a warp diverge you'll see a performance penalty proportional to the number of unique paths taken.
Interestingly, the next iteration of the hardware is rumored to feature a thread scheduler capable of variable warp sizes, probably still with some lower bound. This would bring the GPU much closer to the ideal "array of independently executing processing cores" that we have in modern CPUs, only with far more cores.
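To put a rough number on the divergence penalty, here is a toy model in plain C (not CUDA, and certainly not how the hardware is wired). It just counts how many serialized passes a 32-thread warp needs when each thread picks a branch. The name warp_passes and the idea of labelling each path with a small integer ID are my own assumptions for the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE 32

/* Hypothetical cost model: each thread in a warp reports which branch
 * (path ID, 0..31) it wants to execute. Threads on the same path are
 * issued together; each distinct path costs one serialized pass. */
static int warp_passes(const unsigned char path_id[WARP_SIZE])
{
    unsigned int seen = 0;  /* bit i set => some thread already took path i */
    int passes = 0;

    for (size_t t = 0; t < WARP_SIZE; ++t) {
        assert(path_id[t] < 32);
        unsigned int bit = 1u << path_id[t];
        if (!(seen & bit)) {
            seen |= bit;
            ++passes;
        }
    }
    return passes;
}
```

A warp where every thread takes the same path costs one pass, an if/else split costs two, and the worst case of 32 different paths costs 32, which is the "proportional to the number of unique paths" behaviour described above.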
memset would be basically the same, except that instead of xor eax, eax you would have a short series of instructions to copy the 8-bit fill value into each of the four bytes of the 32-bit register eax. That is technically slower, but the difference would be negligible for anything except a very large number of very small buffers.
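For what it's worth, that "series of instructions" can be as little as one multiply. Here is a sketch of the logic in C, only to show the idea and not to claim anything about how a real memset is written or how fast this is. The names broadcast_byte and fill_bytes are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replicate an 8-bit value into all four bytes of a 32-bit word,
 * e.g. 0xAB -> 0xABABABAB. */
static uint32_t broadcast_byte(uint8_t value)
{
    return (uint32_t)value * UINT32_C(0x01010101);
}

/* Hypothetical memset-style fill: whole dwords first (the rep stosd
 * part), then the 0..3 leftover bytes (the rep stosb part). */
static void fill_bytes(void *dst, uint8_t value, size_t byte_count)
{
    assert(dst != NULL || byte_count == 0);

    unsigned char *p = dst;
    const uint32_t pattern = broadcast_byte(value);

    for (size_t i = 0; i < byte_count / 4; ++i) {
        memcpy(p, &pattern, sizeof pattern);  /* alignment-safe dword store */
        p += sizeof pattern;
    }
    for (size_t i = 0; i < (byte_count & 3u); ++i) {
        *p++ = value;
    }
}
```

With a fill value of 0 the pattern is just 0, which is the xor eax, eax case.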
...I'd like to point out that the internal implementation of memcpy on many platforms will be much faster than the equivalent C loop for large copies, including on x86/64, due to the use of architecture-specific instructions designed to facilitate the operation that most compilers probably don't use even at the highest optimization levels.
Actually, when zeroing a large array it can be faster to use a loop because memcpy has to line up blocks of memory as part of its operation.
I don't see how memcpy should ever be used to zero blocks of memory. The best way to do this on Windows platforms is with the Win32 API function ZeroMemory, whose implementation would be something along the lines of:
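xor eax, eax            ; eax = 0, the fill value
mov edi, dst_address    ; edi = destination pointer
mov ecx, byte_count
shr ecx, 2              ; ecx = number of whole dwords (byte_count / 4)
rep stosd               ; store eax to [edi], ecx times, 4 bytes each
mov ecx, byte_count
and ecx, 0x03           ; ecx = leftover bytes (0 to 3)
rep stosb               ; store al for the remaining bytes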
Where the last 3 instructions could be omitted if byte_count were guaranteed to be a multiple of 4. This would be substantially faster than using a loop.
The implementation of the C standard library (memcpy lives there, not in the STL) is platform dependent, but the behavior is (at least it _should_ be) mostly identical, so you can be reasonably sure that memcpy, when it is available, will be the fastest way to copy a large number of bytes on any platform.
(On Unix-based systems the C standard library implementation is in libc, and on Windows it is in msvcrt.dll.)
I know you were kidding, but I'd like to point out that the internal implementation of memcpy on many platforms will be much faster than the equivalent C loop for large copies, including on x86/64, due to the use of architecture-specific instructions designed to facilitate the operation that most compilers probably don't use even at the highest optimization levels.
That _would_ work well using floating point data types, but if they had been able to use floating point, they could have used simpler, accurate methods anyway.
More than likely the embedded microprocessor only had limited integer fixed-point arithmetic capabilities. Using this kind of method with fixed-point integers would truncate the fractional contribution of each successive sample, which would make the calculation highly inaccurate after a large number of samples. Specifically, the result could be up to N less than the real average, where N is the number of samples.
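To make the truncation concrete, here is a small C sketch. I'm assuming the method in question is the usual incremental mean, avg_n = avg_(n-1) + (x_n - avg_(n-1)) / n, which may not be exactly what was proposed. The function names are mine, and plain int stands in for whatever fixed-point type the micro actually had.

```c
#include <assert.h>
#include <stddef.h>

/* Incremental mean with integer arithmetic only (assumed method):
 *   avg_n = avg_(n-1) + (x_n - avg_(n-1)) / n
 * The division truncates, so fractional contributions are discarded. */
static int incremental_mean_int(const int *samples, size_t count)
{
    assert(samples != NULL || count == 0);

    int avg = 0;
    for (size_t i = 0; i < count; ++i) {
        avg += (samples[i] - avg) / (int)(i + 1);
    }
    return avg;
}

/* Reference: sum everything, then divide once (truncates only once). */
static int summed_mean_int(const int *samples, size_t count)
{
    assert(samples != NULL && count > 0);

    long sum = 0;
    for (size_t i = 0; i < count; ++i) {
        sum += samples[i];
    }
    return (int)(sum / (long)count);
}
```

Feed both the ramp 1, 2, ..., 100: the incremental version gets stuck at 1, because after the first sample every update is k / (k + 1), which truncates to 0. The sum-then-divide version returns 50 against a true mean of 50.5.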
But this is totally different! Because an operating system isn't a piece of software, it's uh....um.....