AMD Details Next-Gen Kaveri APU's Shared Memory Architecture
crookedvulture writes "AMD has revealed more details about the unified memory architecture of its next-generation Kaveri APU. The chip's CPU and GPU components will have a shared address space and will also share both physical and virtual memory. GPU compute applications should be able to share data between the processor's CPU cores and graphics ALUs, and the caches on those components will be fully coherent. This so-called heterogeneous uniform memory access, or hUMA, supports configurations with either DDR3 or GDDR5 memory. It's also based entirely in hardware and should work with any operating system. Kaveri is due later this year and will also have updated Steamroller CPU cores and a GPU based on the current Graphics Core Next architecture."
bigwophh writes links to the Hot Hardware take on the story, and writes "AMD claims that programming for hUMA-enabled platforms should ease software development and potentially lower development costs as well. The technology is supported by mainstream programming languages like Python, C++, and Java, and should allow developers to more simply code for a particular compute resource with no need for special APIs."
will feature this technology. It will be interesting to see how it stacks up.
You can't beat an Ivy Bridge chip for performance for watt though.
Ehugh. Yes no kind of.
For "general" workloads IVB chips are the best in performance per Watt.
In some specific workloads, the high core count piledrivers beat IVB, but that's rare. For almost all x86 work IVB wins.
For highly parallel churny work that GPUs excel at, they beat all X86 processors by a very wide margin. This is not surprising. They replace all the expensive silicon that make general purpose processors go fast and put in MOAR ALUs. So much like the long line of accelerators, co processors, DSPs and so on, they make certain kinds of work go very fast and are useless at others.
But for quite a few classes of work, GPUs trounce IVB at performance per Watt.
The trouble is that GPUs suck. They have teeny amounts of local memory and a slow interconnect to main memory. They also suck at certain things and batting data between the fast (for some things) GPU and fast (for other things) CPU is a real drag becuase of the latency. This limits the applicability of GPUs.
Only with the new architecture, which I (and presumably many others) hoped was AMDs long term goal a number of these problems have disappeared since the link is very low latency and the memory fully shared.
This means the very superior performance per Watt (for some things) GPU can be used for a wider range of tasks.
So yes, this should do a lot for power consumption for a number of tasks.
SJW n. One who posts facts.
In OpenCL you need to copy items from the system memory to the GPU's memory and then load the kernel on the GPU to start execution. Then you must copy the data back from the GPU's memory at the end after execution. AMD is saying that you can instead pass a pointer to the data in the main memory instead of actually making copies of the data.
This should reduce some of the memory shifting on the system and speed up OpenCL execution. It will also eliminate some of the memory constraints on OpenCL regarding what you can do on the GPU. On a larger scale it will open up some opportunities for optimizing work.
Because when you are doing stuff like OpenCL, dispatching from CPU space to GPU space has a huge overhead. The GPU may be 100x better at doing a problem than the CPU, but it takes so long to transfer data over to the GPU and set things up that it may still be faster to do it on the CPU. It's basically the same argument that led to the FPU being moved onto the same chip as the CPU a generation ago. There was a time when the FPU was a completely separate chip,a nd there were valid reasons why it ought to be. But, moving it on chip was ultimately a huge performance win. The idea behind AMD's strategy is basically to move the GPU so close to the CPU that you use it as freely as we currently use the FPU.
The "slow interconnect" you're talking about to main memory, PCI Express v3.0 has an effective bandwidth of 32GB/s which actually exceeds the best main memory bandwidth you'd get out of an Ivy Bridge CPU with very fast memory, so no, that's not a bottleneck for bandwidth, though yes, there is some latency there.
Its both, for my application, the GPU is roughly 3x-5x as fast as a high end CPU. This is fairly common on a lot of GPGPU workloads. The GPU provides a decent but not huge performance advantage.
But, we don't use the GPU! Why not? Because copying the data over the PCIe link, waiting for the GPU to complete the task, and then copying the data back over the PCI bus yields a net performance loss over just doing it on the CPU.
In theory, a GPU sharing the memory subsystem with the CPU avoids this copy latency. Nor does it preclude still having a parallel memory subsystem dedicated for local accesses on the GPU. That is the "nice" thing about opencl/CUDA the programmer can control the memory subsystems at a very fine level.
Whether or not AMD's solution helps our application remains to be seen. Even if it doesn't its possible it helps some portion of the GPGPU community.
BTW:
In our situation its a server system so it has more memory bandwidth than your average desktop. On the other hand, i've never seen a GPU pull more than small percentage of the memory bandwidth over the PCIe links doing copies. Nvidia ships a raw copy benchmark with the CUDA SDK, try it on your machines the results (theoretical vs reality) might surprise you.