Nvidia Working on a CPU+GPU Combo


Posted by ryuzaki0 on Friday October 20, 2006 @04:24AM from the that-will-keep-them-out-of-trouble-for-a-while dept.

Max Romantschuk writes "Nvidia is apparently working on an x86 CPU with integrated graphics. The target market seems to be OEMs, but what other prospects could a solution like this have? Given recent developments with projects like Folding@Home's GPU client, you can't help but wonder about the possibilities of a CPU with an integrated GPU. Things like video encoding and decoding, audio processing and other applications could benefit a lot from a low-latency CPU+GPU combo. What if you could put multiple chips like these in one machine? With AMD+ATI and Intel's own integrated graphics, will basic GPU functionality be integrated in all CPUs eventually? Will dedicated graphics cards become a niche product for enthusiasts and pros, like audio cards already largely have?" The article is from the Inquirer, so a dash of salt might make this more palatable.

12 of 178 comments

Re:Heard This One Before by TheRaven64 · 2006-10-20 04:55 · Score: 5, Informative
It's not just floating point. Originally, CPUs did integer ops and comparisons/branches. Some of the things that were external chips and are now found on (some) CPU dies include:
1. Memory Management Units. Even among microcomputers there are some (old m68k machines) that have an off-chip MMU, and some, like the 8086, that just don't have one.
2. Floating Point Units. The 80486 was the first x86 chip to put one of these on-die.
3. SIMD units. Formerly only found in high-end machines as dedicated chips, now on a lot of CPUs.
4. DSPs. Again, formerly dedicated hardware, now found on-die in a few of TI's ARM-based cores.
A GPU these days is very programmable. It's basically a highly parallel stream processor. Integrating it onto the CPU makes a lot of sense.
--
I am TheRaven on Soylent News
Re:A cyclic process? by shizzle · 2006-10-20 04:56 · Score: 5, Informative

Yup, the idea is pushing 40 years old now, and came out of the earliest work on graphics processors. The term "wheel of reincarnation" came from "On the Design of Display Processors", T.H. Myer and I.E. Sutherland, Communications of the ACM, Vol 11, No. 6, June 1968.
http://www.cap-lore.com/Hardware/Wheel.html
Re:A cyclic process? by levork · 2006-10-20 04:58 · Score: 4, Informative

This is known as the wheel of reincarnation, and has come up several times in the last forty years of graphics hardware.
Re:Heard This One Before by Do+You+Smell+That · 2006-10-20 05:12 · Score: 3, Informative

What I don't understand is that I thought GPUs were made to offload a lot of graphics computations from the CPU. So why are we merging them again? Isn't a GPU supposed to be an auxiliary CPU only for graphics? I'm so confused.
You're partially right. GPUs were made to execute the algorithms developed for graphically-intensive programs directly in silicon... thus avoiding the need to run compiled code within an operating system, which entails LOTS of overhead. Being able to do this directly on dedicated hardware (with entirely different processor designs optimized for graphical computations) makes it possible to execute A LOT more calculations per second. You can really see the difference if you, for instance, use DirectX on two nearly identical video cards: one with hardware-based DirectX, the other with it running as software.

Moving it right up next to the CPU will allow the data to flow between the two a lot faster than it does currently, where it has to go over a bus... they can finally get rid of the bottlenecks that have been around since the two were separated.

--
I'm not good at making signatures...
Re:Heard This One Before by LWATCDR · 2006-10-20 05:45 · Score: 5, Informative

I am an old school programmer so I tend to use ints a lot. The sad truth is that floats using SSE are as fast as, and sometimes faster than, the old tricks we used to avoid floats!
Yes, we live in an upside-down world where floats are sometimes faster than ints.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
Re:Heard This One Before by stevenm86 · 2006-10-20 06:38 · Score: 4, Informative

That's sort of the point of building them on the same die. You can't just run a wire to it, as it would be quite slow. Wires tend to have parasitic inductances and capacitances, so the setup and hold times on the lines would be too large to provide a benefit.
Not so much by Sycraft-fu · 2006-10-20 06:56 · Score: 4, Informative

System RAM is SLOW compared to GPU RAM. PCIe actually allows very high speed access to system RAM, but the RAM itself is too slow for GPUs. That's one of the reasons their RAM amounts are so small: they use higher-speed and thus more expensive RAM. Also, because of the speed you end up dealing with cooling and signal issues, which makes it impractical (or perhaps impossible) to simply stick it in add-on slots to allow for upgrades.

Even as fast as it is, it's still slower than the GPU would really like.

What you've suggested is already done by low end accelerators like the Intel GMA 950. Works ok, but as I said, slow.

Unless you are willing to start dropping serious amounts of cash on system RAM, we'll be needing to stick with dedicated video RAM here for some time.
Re:Heard This One Before by Intron · 2006-10-20 06:58 · Score: 3, Informative

Typically, unimplemented instructions cause an exception. The operation can then be emulated in software.

--
Intron: the portion of DNA which expresses nothing useful.
Some things you forget by Sycraft-fu · 2006-10-20 07:05 · Score: 3, Informative

1) Processors are wicked fast at floating point these days. Have a look sometime at the benchmark results a modern chip using SSE2 can produce. Integer doesn't inherently mean faster, and chips these days have badass FPUs.

2) For many things, it DOES make a difference. You might ask why we need more than 24-bit (or 32-bit if you consider the alpha channel) integer colour. After all, it's enough to look highly realistic. Yes, well, that's fine for a final image, but you don't want to do the computation like that. Why? Rounding errors. You find that with iterative things like shaders, doing them in integer adds up to nasty errors, which equals nasty colours and jaggies. There's a reason why pro software does it as 128-bit FP (32 bits per colour channel) and why cards are now going that way as well.

3) In modern games, everything is handled in the GPU anyhow. The CPU sends over the data and the GPU does all the transform, lighting, texturing and rasterizing. The CPU really is responsible for very little. With vertex shaders the GPU even handles a good deal of the animation these days. The reason is that not only is it more functional but it's waaaaay faster. You can spend all the time you like trying to make a nice optimised integer T&L path in the CPU; the GPU will blow it away. You actually find that some older games run slower than new ones because they rely on the CPU to do the initial rendering phases like T&L before handing it off, whereas newer games let the GPU handle it and thus run faster despite having higher detail.
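To illustrate the rounding problem in point 2, here is a minimal sketch, assuming a toy "shader" that darkens a colour channel by 3/4 several times and then brightens it back by the same factor. The function names, the 3/4 gain and the pass count are illustrative assumptions, not taken from any real rendering pipeline.

```python
def integer_passes(value, passes):
    """Darken then brighten an integer channel value, truncating after every step."""
    v = value
    for _ in range(passes):
        v = v * 3 // 4   # darken by 3/4, fractional part is thrown away
    for _ in range(passes):
        v = v * 4 // 3   # brighten by 4/3, fractional part is thrown away again
    return v


def float_passes(value, passes):
    """Same sequence of steps, but the intermediate values are kept as floats."""
    v = float(value)
    for _ in range(passes):
        v = v * 0.75
    for _ in range(passes):
        v = v / 0.75
    return round(v)


start = 100
print("integer path:", integer_passes(start, 8))
print("float path:  ", float_passes(start, 8))
```

The integer path does not return to the starting value because every truncation discards a little information, while the float path does. The more passes a shader chain runs, the larger that drift gets.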
Re:Heard This One Before by joto · 2006-10-20 07:25 · Score: 3, Informative
There are a couple of strategies:
1. Write a specialized program that will only run on a single computer, the one the programmer owns, as everything is specialized and optimized for his/her hardware. If other people need to run the program, write a new one, or at the very least use some other compiler options.
2. Don't use non-portable features. Always go for the lowest common denominator.
3. Manually test for existence of a coprocessor at each FPU instruction, and branch to an emulator if the FPU doesn't exist.
4. Same as above, but tests are inserted automatically by the compiler.
5. Test for existence of a coprocessor at start of program execution. If the FPU doesn't exist, dynamically replace all FPU instructions with branches to emulator routines.
6. Same as above, but done automatically by the OS program loader.
7. Make it mandatory for CPUs to: either support the FPU instructions (with a coprocessor if necessary); or to issue some sort of trap/interrupt that can be used by software such as the OS kernel/libc to use an emulator routine instead.
I believe the last option (option 7) is what the x86/87 CPU/FPU combo actually used. That's why there is a coprocessor-prefix in front of the FPU instructions. They are not just unused opcodes.
Option 5 (and sometimes even 3) is commonly used for MMX/3dNOW/SSE/SSE2/SSE3/whatever instructions.
Unless they *really* need nonportable features, most programmers tend to go with option 2.
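As a rough illustration of two of the strategies above, here is a minimal sketch, assuming a toy CPU model in which a missing FPU raises an exception in place of a hardware trap. Everything here (`ToyCPU`, `soft_fmul`, `UnimplementedInstruction`) is hypothetical and only models the control flow, not real x86 behaviour.

```python
class UnimplementedInstruction(Exception):
    """Stands in for the hardware trap raised on a missing instruction."""


def soft_fmul(a, b):
    """Hypothetical software emulation routine (here it simply multiplies)."""
    return a * b


class ToyCPU:
    def __init__(self, has_fpu):
        self.has_fpu = has_fpu
        self.traps = 0  # number of times the emulator had to step in

    def fmul(self, a, b):
        """'Hardware' multiply: traps when no FPU is present."""
        if not self.has_fpu:
            raise UnimplementedInstruction("fmul")
        return a * b


def run_with_trap_handler(cpu, a, b):
    """Option 7: always issue the FPU instruction and emulate it on a trap."""
    try:
        return cpu.fmul(a, b)
    except UnimplementedInstruction:
        cpu.traps += 1
        return soft_fmul(a, b)


def select_fmul(cpu):
    """In the spirit of option 5: test once at startup, then bind one routine."""
    return cpu.fmul if cpu.has_fpu else soft_fmul


no_fpu = ToyCPU(has_fpu=False)
print(run_with_trap_handler(no_fpu, 1.5, 2.0), "traps:", no_fpu.traps)
print(select_fmul(no_fpu)(1.5, 2.0))
```

The trap approach pays a cost on every emulated instruction, while the test-once approach pays only at startup but requires the program (or loader) to know about every routine it may need to swap.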
What's that got to do with anything? by Sycraft-fu · 2006-10-20 07:34 · Score: 5, Informative

Yes, the SYSTEMS Tom used to test have normal-speed RAM for systems. Duh. The graphics cards, however, have much faster RAM. For example my system at home has DDR2-667 RAM. That's spec'd to run at 333MHz, which is 667MHz in DDR RAM speak. My graphics card, a 7800GT, on the other hand has RAM clocked at 600MHz, or 1200MHz in RAM speak.

Not a small difference, really. My system RAM is rated to somewhere around 10GB/second max bandwidth (it gets like 6 in actuality). The graphics card? 54GB/sec.
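For reference, theoretical peak bandwidth is just transfer rate times bus width. The sketch below works that out for the clocks mentioned above. The bus widths and channel count (64-bit dual-channel system RAM, a 256-bit video memory bus) are assumptions and are not stated in this comment.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, bus_width_bits, channels=1):
    """Theoretical peak = transfers/s * bytes per transfer * channels, in GB/s (10^9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


# DDR2-667, assuming two 64-bit channels
system_ram = peak_bandwidth_gb_s(667, 64, channels=2)

# 1200MHz effective video RAM, assuming a 256-bit bus
video_ram = peak_bandwidth_gb_s(1200, 256)

print(f"system RAM: {system_ram:.1f} GB/s")
print(f"video RAM:  {video_ram:.1f} GB/s")
```

Under these assumptions the system RAM figure lands close to the roughly 10GB/second quoted above. The video RAM figure comes out lower than the 54GB/sec quoted, which would correspond to a faster memory clock or a wider bus than assumed here. Either way the gap between the two is several-fold.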

Video cards have fast RAM subsystems. They use fast, expensive chips and they have controllers designed for blazing fast (and exclusive) access. You can't just throw normal, slow, system RAM at it and expect it to perform the same.
Re:Heard This One Before by 644bd346996 · 2006-10-20 08:13 · Score: 3, Informative

Having the GPU on the same chip or die as the CPU would reduce the latency by several orders of magnitude and allow a much higher clock for the bus between the two. The memory access could also be improved dramatically, depending on how it was implemented.

I think the first example of this integration we see will use the HyperTransport bus and a single package with CPU and GPU on different dies, though fabbed on the same process. This could be done with an existing AMD socket and motherboard.

Before this happens, though, I think we will see graphics cards on HTX slots. For those who do not know, HTX slots were introduced in a recent revision of the HyperTransport standard. They allow an add-in card to communicate with the CPU with much lower latency and higher bandwidth than PCIe, and no controller in between. The add-in card could even have another CPU on it, and the performance would be comparable to current AMD SMP systems. A GPU on an HTX card could have its own RAM, and be able to access system RAM much faster than PCIe allows. The neat thing is that with HT, the CPU would probably be able to use the graphics RAM as though it were system RAM.

Note that Nvidia is a member of the HyperTransport Consortium due to their chipset business, and they could easily have HTX cards in their labs right now.