Nvidia Working on a CPU+GPU Combo
Max Romantschuk writes "Nvidia is apparently working on an x86 CPU with integrated graphics. The target market seems to be OEMs, but what other prospects could a solution like this have? Given recent development with projects like Folding@Home's GPU client you can't help but wonder about the possibilities of a CPU with an integrated GPU. Things like video encoding and decoding, audio processing and other applications could benefit a lot from a low latency CPU+GPU combo. What if you could put multiple chips like these in one machine? With AMD+ATI and Intel's own integrated graphics, will basic GPU functionality be integrated in all CPU's eventually? Will dedicated graphics cards become a niche product for enthusiasts and pros, like audio cards already largely have?" The article is from the Inquirer, so a dash of salt might make this more palatable.
- Memory Management Units. Even in microcomputers there are some (old m68k machines) that have an off-chip MMU (and some, like the 8086 that just don't have one).
- Floating Point Units. The 80486 was the first x86 chip to put one of these on-die.
- SIMD units. Formerly only found in high-end machines as dedicated chips, now on a lot of CPUs.
- DSPs. Again, formerly dedicated hardware, now found on-die in a few of TI's ARM-based cores.
A GPU these days is very programmable. It's basically a highly parallel stream processor. Integrating it onto the CPU makes a lot of sense.I am TheRaven on Soylent News
http://www.cap-lore.com/Hardware/Wheel.html
This is known as the wheel of reincarnation, and has come up several times in the last forty years of graphics hardware.
Moving it right up next to the CPU will allow the data to flow between the two alot faster than currently where it has to go over a bus... they can finally get rid of the bottlenecks that have been around since the two were seperated.
I'm not good at making signatures...
I am an old school programmer so I tend to use ints a lot. The sad truth if that float using SSE are as fast and sometimes faster than the old tricks we used to avoid floats!
Yes we live in an upside down world where floats are faster than ints some times.
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
That's sort of the point of building them on the same die. You can't just run a wire to it, as it would be quite slow. Wires tend to have parasitic inductances and capacitances, so the setup and hold times on the lines would be too large to provide a benefit.
System RAM is SLOW compared to GPU RAM. PCIe actually allows very high speed access to system RAM, but the RAM itself is too slow for GPUs. That's one of the reasons their RAM amounts are so small, they use higher speed and thus more expensive RAM. Also because of the speed you end up dealing with cooling and signal issues which makes it impractical (or perhaps impossible) to simply stick it in addon slots to allow for upgrades.
Even fast as it is, it's still slower than the GPU would really like.
What you've suggested is already done by low end accelerators like the Intel GMA 950. Works ok, but as I said, slow.
Unless you are willing to start dropping serious amounts of cash on system RAM, we'll be needing to stick with dedicated video RAM here for some time.
Typically, unimplemented instructions cause an exception. The operation can then be emulated in software.
Intron: the portion of DNA which expresses nothing useful.
1) Processors are wicked fast at floating point these days. Have a look at the benchmarks a modern chip using SSE2 can do some time. Integer doesn't inherently mean faster, and chips these days have badass FPUs.
2) For many things, it DOES make a difference. You might ask why do we need more than 24-bit (or 32-bit if you consider the alpha channel) integer colour? After all, it's enough to look highly realistic. Yes well that's fine for a final image, but you don't want to do the computation like that. Why? Rounding errors. You find that with iterative things like shaders doing them integer adds up to nasty errors which equals nasty colours and jaggies. There's a reason why pro software does it as 128-bit FP (32-bits per colour channel) and why cards are now going that way as well.
3) In modern games, everything is handled in the GPU anyhow. The CPU sends over the the data and the GPU does all the transform, lighting, texturing and rasterizing. The CPU really is responsible for very little. With vertex shaders the GPU even handles a good deal of the animation these days. The reason is that not only is it more functional but it's waaaaay faster. You can spend all the time you like trying to make a nice optimised integer T&L path in the CPU, the GPU will blow it away. You actually find that some older games run slower than new ones because they rely on the CPU to do the initial rendering phases like T&L before handing it off, whereas newer games let the GPU handle it and thus run faster even though having higher detail.
I believe the last option (option 7) is what x86/87 CPU/FPU combo actually used. That's why there is a coprocessor-prefix in front of the FPU instructions. They are not just unused opcodes.
Option 5 (and sometimes even 3) is commonly used for MMX/3dNOW/SSE/SSE2/SSE3/whatever instructions.
Unless they *really* need nonportable features, most programmers tend to go with option 2.
Yes the SYSTEMS Tom used to test have normal speed ram for systems. Duh. The graphics cards, however, have much faster RAM. For example my system at home has DDR2-667 RAM. That's spec'd to run at 333MHz which is 667MHz is DDR RAM speak. My graphics card, a 7800GT, on the other hand has RAM clocked at 600MHz, or 1200MHz in RAM speak.
Not a small difference, really. My system RAM is rated to somewhere around 10GB/second max bandwidth (it gets like 6 in actuality). The graphics card? 54GB/sec.
Video cards have fast RAM subsystems. They use fast, expensive chips and they have controllers designed for blazing fast (and exclusive) access. You can't just throw normal, slow, system RAM at it and expect it to perform the same.
Having the GPU on the same chip or die as the CPU would reduce the latency by several orders of magnitude and allow a much higher clock for the bus between the two. The memory access could also be improved dramatically, depending on how it was implemented.
I think the first example of this integration we see will use the HyperTransport bus and a single package with CPU and GPU on different dies, though fabbed on the same process. This could be done with an existing AMD socket and motherboard.
Before this happens, though, I think we will see graphics cards on HTX slots. For those who do not know, HTX slots were introduced in a recent revision of the HyperTransport standard. They allow an add-in card to communicate with the CPU with much lower latency and higher bandwidth than PCIe, and no controller in between. The add-in card could even have another CPU on it, and the performance would be comparable to current AMD SMP systems. A GPU on an HTX card could have its own RAM, and be able to access system RAM much faster than PCIe allows. The neat thing is that with HT, the CPU would probably be able to use the graphics RAM as though it were system RAM.
Note that Nvidia is a member of the HyperTransport Consortium due to their chipset business, and they could easily have HTX cards in their labs right now.