AMD Details Next-Gen Kaveri APU's Shared Memory Architecture
crookedvulture writes "AMD has revealed more details about the unified memory architecture of its next-generation Kaveri APU. The chip's CPU and GPU components will have a shared address space and will also share both physical and virtual memory. GPU compute applications should be able to share data between the processor's CPU cores and graphics ALUs, and the caches on those components will be fully coherent. This so-called heterogeneous uniform memory access, or hUMA, supports configurations with either DDR3 or GDDR5 memory. It's also based entirely in hardware and should work with any operating system. Kaveri is due later this year and will also have updated Steamroller CPU cores and a GPU based on the current Graphics Core Next architecture."
bigwophh links to the Hot Hardware take on the story, and writes "AMD claims that programming for hUMA-enabled platforms should ease software development and potentially lower development costs as well. The technology is supported by mainstream programming languages like Python, C++, and Java, and should allow developers to more simply code for a particular compute resource with no need for special APIs."
Apparently not too many Finnish speakers here yet. Kaveri => partner/pal/mate, APU => help.
HTH,
ac
In OpenCL you need to copy data from system memory to the GPU's memory and then load the kernel on the GPU to start execution. After execution you must copy the results back from the GPU's memory. AMD is saying that you can pass a pointer to the data in main memory instead of making copies of it.
This should reduce some of the memory shifting on the system and speed up OpenCL execution. It will also eliminate some of the memory constraints on OpenCL regarding what you can do on the GPU. On a larger scale it will open up some opportunities for optimizing work.
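As a rough analogy in plain Python (not actual OpenCL), the difference between the two models described above can be sketched like this. The names kernel, run_with_copies and run_shared are made up for illustration, and the "device" is just another array in the same process.

```python
from array import array


def kernel(buf):
    """Stand-in for a GPU kernel: doubles every element in place."""
    for i in range(len(buf)):
        buf[i] *= 2


def run_with_copies(host):
    """Copy-in / copy-out model: the kernel only sees a private copy."""
    device = array(host.typecode, host)   # host -> "device" copy
    kernel(device)
    host[:] = device                      # "device" -> host copy
    return 2 * len(host) * host.itemsize  # bytes moved across the link


def run_shared(host):
    """Shared-address-space model: the kernel gets a reference to the
    same buffer, so nothing is copied."""
    kernel(host)
    return 0                              # bytes moved across the link


data = array("f", [1.0, 2.0, 3.0])
moved_copy = run_with_copies(data)   # data is now 2, 4, 6
moved_shared = run_shared(data)      # data is now 4, 8, 12
print(list(data), moved_copy, moved_shared)
```

Both paths produce the same result. The only difference in this sketch is how many bytes had to move to get it, which is the part hUMA is supposed to take away.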
The "slow interconnect" to main memory that you're talking about is PCI Express 3.0. An x16 link has a theoretical bandwidth of about 32GB/s with both directions combined (roughly 16GB/s each way), which on paper is in the same range as the main memory bandwidth you'd get out of an Ivy Bridge CPU with very fast memory. So no, that's not a bottleneck for bandwidth, though yes, there is some latency there.
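For reference, the theoretical PCI Express 3.0 number can be worked out from the link parameters: 8 GT/s per lane with 128b/130b encoding. The sketch below assumes an x16 slot, which the comment does not state, and the function name is made up for illustration. It gives the ceiling only; protocol overhead and real-world copy behavior are not modeled.

```python
def pcie_bandwidth_gb_s(gt_per_s=8.0, lanes=16, payload_bits=128, line_bits=130):
    """Theoretical bandwidth in GB/s for ONE direction of a PCIe link.

    Defaults describe PCIe 3.0 x16 (assumed here): 8 GT/s per lane,
    128 payload bits carried in every 130 bits on the wire.
    """
    payload_bits_per_s_per_lane = gt_per_s * 1e9 * payload_bits / line_bits
    return payload_bits_per_s_per_lane * lanes / 8 / 1e9


one_way = pcie_bandwidth_gb_s()
both_ways = 2 * one_way
print(f"{one_way:.2f} GB/s per direction, {both_ways:.2f} GB/s combined")
```

This comes out to a little under 16 GB/s per direction, so a figure around 32 GB/s only appears when both directions are added together.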
It's both. For my application, the GPU is roughly 3x-5x as fast as a high-end CPU. This is fairly common for GPGPU workloads: the GPU provides a decent but not huge performance advantage.
But we don't use the GPU! Why not? Because copying the data over the PCIe link, waiting for the GPU to complete the task, and then copying the results back over the PCIe link yields a net performance loss compared with just doing the work on the CPU.
In theory, a GPU sharing the memory subsystem with the CPU avoids this copy latency. Nor does it preclude still having a parallel memory subsystem dedicated to local accesses on the GPU. That is the "nice" thing about OpenCL/CUDA: the programmer can control the memory subsystems at a very fine level.
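The trade-off can be put into a back-of-the-envelope model. Everything numeric below is a made-up assumption for illustration (10 ms of CPU work, a 4x GPU speedup, 64 MB each way, 6 GB/s achieved copy rate); none of it is measured from our application. The model also ignores launch latency and any overlap of copies with compute.

```python
def offload_time_ms(cpu_ms, gpu_speedup, bytes_in, bytes_out, link_gb_s):
    """Estimated time to run a task on the GPU, copies included.

    link_gb_s is the achieved copy bandwidth in GB/s; pass None to
    model a shared memory subsystem where no copies are needed.
    """
    compute_ms = cpu_ms / gpu_speedup
    if link_gb_s is None:
        return compute_ms
    copy_ms = (bytes_in + bytes_out) / (link_gb_s * 1e9) * 1e3
    return compute_ms + copy_ms


# Hypothetical workload, chosen only to illustrate the effect.
CPU_MS = 10.0        # time to do the work on the CPU
SPEEDUP = 4.0        # GPU is 4x faster on the compute itself
DATA = 64 * 10**6    # bytes copied in each direction
LINK_GB_S = 6.0      # assumed achieved PCIe copy rate

discrete = offload_time_ms(CPU_MS, SPEEDUP, DATA, DATA, LINK_GB_S)
shared = offload_time_ms(CPU_MS, SPEEDUP, DATA, DATA, None)
print(f"CPU only:        {CPU_MS:.1f} ms")
print(f"GPU with copies: {discrete:.1f} ms")
print(f"GPU, no copies:  {shared:.1f} ms")
```

With these assumed numbers the copies cost far more than the compute saves, so the GPU path loses to the CPU even with a 4x faster kernel. Take the copies away and the same kernel wins comfortably.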
Whether or not AMD's solution helps our application remains to be seen. Even if it doesn't, it's possible that it helps some portion of the GPGPU community.
BTW:
In our situation it's a server system, so it has more memory bandwidth than your average desktop. On the other hand, I've never seen a GPU pull more than a small percentage of the memory bandwidth over the PCIe links when doing copies. Nvidia ships a raw copy benchmark with the CUDA SDK. Try it on your machines; the results (theoretical vs. reality) might surprise you.