AMD Details Next-Gen Kaveri APU's Shared Memory Architecture
crookedvulture writes "AMD has revealed more details about the unified memory architecture of its next-generation Kaveri APU. The chip's CPU and GPU components will have a shared address space and will also share both physical and virtual memory. GPU compute applications should be able to share data between the processor's CPU cores and graphics ALUs, and the caches on those components will be fully coherent. This so-called heterogeneous uniform memory access, or hUMA, supports configurations with either DDR3 or GDDR5 memory. It's also based entirely in hardware and should work with any operating system. Kaveri is due later this year and will also have updated Steamroller CPU cores and a GPU based on the current Graphics Core Next architecture."
bigwophh links to the Hot Hardware take on the story, and writes "AMD claims that programming for hUMA-enabled platforms should ease software development and potentially lower development costs as well. The technology is supported by mainstream programming languages like Python, C++, and Java, and should allow developers to more simply code for a particular compute resource with no need for special APIs."
will feature this technology. It will be interesting to see how it stacks up.
I'm not so sure how I feel about this whole Linux advocacy thing you're trying to promote. But spam, now there's an idea I can get behind! Take my money!
I went to eat some animal crackers and the box said, "Do not eat if seal is broken." I opened the box and sure enough..
This should really help round-trip times through the GPU. With most existing setups, doing a render to texture and getting the results back CPU-side is quite expensive, but this should help a lot. It should also work great for procedurally editing, generating, or swapping geometry that you are rendering. Getting all those high-poly LODs onto the GPU will no longer be an issue with systems like this.
Interestingly enough, this is somewhat similar to what Intel has now for their integrated graphics, except it looks like the AMD GPU has access to the full address space and cache system, which Intel's does not. Also, it's not an Intel GPU, so it's likely better in other ways too, but I shouldn't need to point that out.
Intel's Haswell is moving in the opposite direction, working to get some dedicated memory for the GPU, which is closer to the traditional GPU approach. It's nice to see companies exploring new areas; hopefully we will get some great hardware out of it, ideally with no broken drivers.
You can't beat an Ivy Bridge chip for performance per watt though.
Ehugh. Yes no kind of.
For "general" workloads IVB chips are the best in performance per Watt.
In some specific workloads, the high-core-count Piledrivers beat IVB, but that's rare. For almost all x86 work IVB wins.
For the highly parallel, churny work that GPUs excel at, they beat all x86 processors by a very wide margin. This is not surprising. They replace all the expensive silicon that makes general-purpose processors go fast and put in MOAR ALUs. So, much like the long line of accelerators, coprocessors, DSPs and so on, they make certain kinds of work go very fast and are useless at others.
But for quite a few classes of work, GPUs trounce IVB at performance per Watt.
The trouble is that GPUs suck. They have teeny amounts of local memory and a slow interconnect to main memory. They also suck at certain things, and batting data between the fast (for some things) GPU and the fast (for other things) CPU is a real drag because of the latency. This limits the applicability of GPUs.
Only with the new architecture, which I (and presumably many others) hoped was AMD's long-term goal, a number of these problems have disappeared, since the link is very low latency and the memory is fully shared.
This means the very superior performance per Watt (for some things) GPU can be used for a wider range of tasks.
So yes, this should do a lot for power consumption for a number of tasks.
SJW n. One who posts facts.
Apparently not too many Finnish speakers here yet. Kaveri => partner/pal/mate, APU => help.
HTH,
ac
In OpenCL you need to copy items from system memory to the GPU's memory and then load the kernel on the GPU to start execution. Then you must copy the data back from the GPU's memory once execution finishes. AMD is saying that you can pass a pointer to the data in main memory instead of actually making copies of the data.
This should reduce some of the memory shifting on the system and speed up OpenCL execution. It will also eliminate some of the memory constraints on OpenCL regarding what you can do on the GPU. On a larger scale it will open up some opportunities for optimizing work.
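To make the difference concrete, here is a toy sketch in plain Python. It does not call OpenCL or touch a GPU; ordinary buffers stand in for host and device memory, and the names `scale_kernel`, `run_with_copies`, and `run_shared` are made up for illustration. It only contrasts the copy-in/copy-out pattern described above with handing the "kernel" a reference to the same buffer.

```python
from array import array

# Toy model only: plain Python buffers stand in for host and device memory.
# Nothing here calls OpenCL or runs on a real GPU.


def scale_kernel(buf, factor):
    """Stand-in for a GPU kernel: scales every element of buf in place."""
    for i in range(len(buf)):
        buf[i] = buf[i] * factor


def run_with_copies(host_data, factor):
    """Separate-memory style: copy in, run, copy back. Returns bytes moved."""
    device_data = array('d', host_data)   # host -> "device" copy
    scale_kernel(device_data, factor)
    host_data[:] = device_data            # "device" -> host copy
    return 2 * len(host_data) * host_data.itemsize


def run_shared(host_data, factor):
    """Shared-memory style: the kernel works on a view of the same buffer."""
    view = memoryview(host_data)          # a reference, no data is copied
    scale_kernel(view, factor)
    return 0                              # no bytes moved between memories
```

In this model both paths produce the same result, but the first moves every byte twice while the second moves none. Whether real hUMA hardware behaves this simply is not something the sketch can show.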
Because when you are doing stuff like OpenCL, dispatching from CPU space to GPU space has a huge overhead. The GPU may be 100x better at doing a problem than the CPU, but it takes so long to transfer data over to the GPU and set things up that it may still be faster to do it on the CPU. It's basically the same argument that led to the FPU being moved onto the same chip as the CPU a generation ago. There was a time when the FPU was a completely separate chip, and there were valid reasons why it ought to be. But moving it on chip was ultimately a huge performance win. The idea behind AMD's strategy is basically to move the GPU so close to the CPU that you use it as freely as we currently use the FPU.
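The trade-off above can be written as a back-of-the-envelope model. This is a rough sketch, not a benchmark: the function names and all of the numbers are hypothetical, and it assumes the transfer and the computation do not overlap.

```python
def offload_time(cpu_time, speedup, transfer_time, setup_time=0.0):
    """Estimated time to run a job on the GPU, in the same unit as cpu_time.

    Assumes (for illustration only) that setup, transfer, and compute
    happen one after another with no overlap.
    """
    return setup_time + transfer_time + cpu_time / speedup


def worth_offloading(cpu_time, speedup, transfer_time, setup_time=0.0):
    """True if the GPU path is estimated to beat staying on the CPU."""
    return offload_time(cpu_time, speedup, transfer_time, setup_time) < cpu_time
```

With made-up numbers, a job that takes 10 ms on the CPU, runs 100x faster on the GPU, and needs 12 ms of copying comes out at about 12.1 ms, so the CPU still wins. Set the transfer time to zero, which is roughly what a shared address space is aiming for, and the same job comes out at about 0.1 ms.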
When someone asks me about buying AMD or Intel, the general summarization I give them is that AMD's built-in GPU handily beats Intel's built-in GPU but Intel's CPU beats AMD's CPU. If graphics are a big concern, they should get a cheap discrete card as one under $100 will be good for most games. Thus AMD's advantage is negated. Also both companies offer more CPU processing power than most consumers can use anyway.
Well, there's spam egg sausage and spam, that's not got much spam in it.
The "slow interconnect" you're talking about to main memory, PCI Express v3.0 has an effective bandwidth of 32GB/s, which actually exceeds the best main memory bandwidth you'd get out of an Ivy Bridge CPU with very fast memory. So no, that's not a bottleneck for bandwidth, though yes, there is some latency there.
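For reference, the peak figures behind this comparison can be worked out from the published signalling rates. The sketch below assumes a x16 PCIe 3.0 link (8 GT/s per lane, 128b/130b encoding) and dual-channel DDR3 with a 64-bit bus per channel; the function names are made up, and these are theoretical peaks, not measured throughput.

```python
def pcie3_bandwidth_gbs(lanes=16):
    """Theoretical PCIe 3.0 bandwidth per direction, in GB/s.

    8 GT/s per lane with 128b/130b encoding; divide by 8 for bits -> bytes.
    """
    return lanes * 8e9 * (128 / 130) / 8 / 1e9


def ddr3_bandwidth_gbs(mega_transfers_per_s, channels=2, bus_bytes=8):
    """Theoretical DDR3 bandwidth in GB/s for a given transfer rate (MT/s)."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9
```

Under these assumptions a x16 link peaks at about 15.75 GB/s in each direction, so a 32GB/s figure appears to count both directions together. Dual-channel DDR3-1600 peaks at 25.6 GB/s.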
It's both. For my application, the GPU is roughly 3x-5x as fast as a high-end CPU. This is fairly common on a lot of GPGPU workloads. The GPU provides a decent but not huge performance advantage.
But we don't use the GPU! Why not? Because copying the data over the PCIe link, waiting for the GPU to complete the task, and then copying the data back over the PCIe link yields a net performance loss over just doing it on the CPU.
In theory, a GPU sharing the memory subsystem with the CPU avoids this copy latency. Nor does it preclude still having a parallel memory subsystem dedicated to local accesses on the GPU. That is the "nice" thing about OpenCL/CUDA: the programmer can control the memory subsystems at a very fine level.
Whether or not AMD's solution helps our application remains to be seen. Even if it doesn't, it's possible it helps some portion of the GPGPU community.
BTW:
In our situation it's a server system, so it has more memory bandwidth than your average desktop. On the other hand, I've never seen a GPU pull more than a small percentage of the memory bandwidth over the PCIe links doing copies. Nvidia ships a raw copy benchmark with the CUDA SDK. Try it on your machines; the results (theoretical vs. reality) might surprise you.
With GDDR5 memory this could be very interesting.
OK, so the SGI O2's UMA has now been reinvented for a new generation, just with more words tacked on....
AMD beats Intel on the price point however.
And that isn't even counting that with Intel you need to buy an extra $100 card.
If you *need* top notch performance, go Intel. Otherwise AMD will be lighter on your wallet and do the same job very well.
The "slow interconnect" you're talking about to main memory, PCI Express v3.0 has an effective bandwidth of 32GB/s
32GB/s doesn't sound like a lot when you divide it amongst the 400 stream processors that an upper-end AMD APU has, and that's as favorable a light as I can shine on your inane bullshit. There is a reason that discrete graphics cards have their own memory, and it isn't because they have more stream processors (these days they do, but they didn't always); it's because PCI Express isn't anywhere near fast enough to feed any modern GPU.
Llano APUs have been witnessed pulling 500 GFLOPS. Does 32GB/s still sound like a lot? No, it sounds like shit. Clearly memory bandwidth is a big issue in this scene.
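The arithmetic behind that complaint, as a quick sketch. It takes the figures quoted in this thread at face value (32 GB/s, 400 stream processors, 500 GFLOPS); the function names are made up for illustration.

```python
def bandwidth_per_unit_mbs(total_gbs, units):
    """Bandwidth share per processing unit, in MB/s (1 GB = 1000 MB)."""
    return total_gbs * 1000 / units


def flops_per_byte_needed(gflops, gbs):
    """Operations the code must perform per byte fetched to keep the chip busy."""
    return gflops / gbs
```

With the thread's numbers, each stream processor gets about 80 MB/s, and a kernel would need roughly 15.6 floating-point operations per byte transferred to stay compute-bound rather than starved for data.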
"His name was James Damore."