Twilight of the GPU — an Interview With Tim Sweeney
cecom writes to share that Tim Sweeney, co-founder of Epic Games and the main brain behind the Unreal engine, recently sat down at NVIDIA's NVISION con to share his thoughts on the rise and (what he says is) the impending fall of the GPU: "...a fall that he maintains will also sound the death knell for graphics APIs like Microsoft's DirectX and the venerable, SGI-authored OpenGL. Game engine writers will, Sweeney explains, be faced with a C compiler, a blank text editor, and a stifling array of possibilities for bending a new generation of general-purpose, data-parallel hardware toward the task of putting pixels on a screen."
For once I'm reading an 'xyz is going to die' article that doesn't sound like utter rubbish. Could it be that, for once, the one stating this actually knows what he's talking about?
My last custom realtime GPU was a GeForce Ti4200. I'm now using a Mac Mini with GT950. Mind you, Blender *is* quite a bit slower on the 950, even though the machine runs at twice the system clock, but I'm not really missing the old GeForce. I too think it highly plausible that the GPU and the CPU will merge within the next few years.
We suffer more in our imagination than in reality. - Seneca
He talks about the impending fall of the fixed function GPU.
GPUs aren't really parallel in that sense; they are parallel in the SIMD sense.
In such a world you won't need APIs because you'll have libraries that you can include in the compile process.
A library you include in the compile process is an implementation of an API.
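To make that distinction concrete, here is a minimal sketch in Python. The names `Renderer`, `SoftwareRenderer` and `draw_diagonal` are made up for illustration and come from no real graphics library. The abstract class plays the role of the API (the contract), and the concrete class plays the role of a library that implements it.

```python
from abc import ABC, abstractmethod


class Renderer(ABC):
    """The API: a contract describing what callers may ask for (hypothetical)."""

    @abstractmethod
    def set_pixel(self, x, y, value):
        ...


class SoftwareRenderer(Renderer):
    """A library: one concrete implementation of that contract."""

    def __init__(self, width, height):
        # Framebuffer stored as a list of rows, all pixels initially 0.
        self.pixels = [[0] * width for _ in range(height)]

    def set_pixel(self, x, y, value):
        self.pixels[y][x] = value


def draw_diagonal(renderer, n):
    """Caller code: written against the API, not against any one library."""
    for i in range(n):
        renderer.set_pixel(i, i, 1)


if __name__ == "__main__":
    r = SoftwareRenderer(3, 3)
    draw_diagonal(r, 3)
    for row in r.pixels:
        print(row)
```

Swapping in a different implementation of `Renderer` would not change `draw_diagonal` at all, which is the sense in which dropping DirectX or OpenGL would still leave you with an API of some kind.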
APIs reduce code bulk at the cost of reduced code speed, don't they?
No.
GPUs aren't really parallel in that [traditional multithreaded] sense; they are parallel in the SIMD sense.
Actually, they're somewhere in between. Some current hardware can reallocate individual processors between fragment and vertex processing depending on the current workload profile. Even at the level of an individual processor lots of "threads" may be running simultaneously; this is to hide latency when a shader program blocks on memory (texture or framebuffer) access.
If you look at NV's descriptions of their 8xx-series drivers, they talk about *hundreds* of threads in flight at any given time. These aren't threads in the classical sense - there's no preemption, for a start - but they're much, much more advanced than SIMD-style "apply this instruction to all these values" parallelism.
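A toy model of that latency hiding, in Python. This is only a sketch under simplifying assumptions, not a description of any real GPU scheduler: each "thread" alternates one cycle of compute with a fixed-latency memory fetch, the processor issues at most one ready thread per cycle, and a thread only gives up the processor when it blocks (no preemption). The function name and parameters are invented for the example.

```python
def utilization(num_threads, items_per_thread=4, memory_latency=4):
    """Fraction of cycles in which the processor issued useful work."""
    remaining = [items_per_thread] * num_threads  # work left per thread
    ready_at = [0] * num_threads                  # cycle at which each thread unblocks
    cycle = 0
    busy = 0
    while any(remaining):
        # Pick the first thread that still has work and is not waiting on memory.
        for i in range(num_threads):
            if remaining[i] > 0 and ready_at[i] <= cycle:
                remaining[i] -= 1
                # One compute cycle, then the thread blocks on its fetch.
                ready_at[i] = cycle + 1 + memory_latency
                busy += 1
                break
        cycle += 1
    total_cycles = max(ready_at)  # include the final outstanding fetch
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 2, 5, 8):
        print(n, "threads in flight:", round(utilization(n), 2))
```

With a single thread the processor sits idle during every fetch; with enough threads in flight there is almost always one ready to issue, which is the effect described above.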
You are very, very wrong. The history of computer hardware has been one where extra functionality is moved off the CPU for speed, folded back in a few years later for efficiency, and farmed out to an add-on card for speed some time later...
See http://catb.org/jargon/html/W/wheel-of-reincarnation.html for details.
At the deep RISC level, they probably wouldn't be. In fact, they'd certainly not be, or you'd simply have an SMP cluster with some emulation on top. If you're going for the migrating code, you'd need binary compatibility at the emulated mode (think Transmeta or IBM's DAISY project) but the underlying specialization would give you the improvement over a homogeneous cluster. If you're going for the totally heterogeneous design - basically the Cell approach but on a far, far larger scale - you need endian compatibility and bus protocol compatibility but nothing more.
This Cell-like approach gives the greatest room for innovation but also imposes the greatest development costs and greatest purchase costs. It also makes ABI backwards compatibility extremely hard or impossible, so you'd end up with a proliferation of builds of the same code for any binary packages (including all closed-source) and a far more complex build and optimization process for all source packages (Gentoo users beware). It also makes bus design far more complex, as the more specialized the decentralized processor units (DPUs) get, the more synchronization headaches you will get.
A DPU cluster should logically give the best performance, for the same reason pure RISC outperforms pure CISC - fewer overheads, tighter logic and also (in consequence) more real-estate for optimizations, parallelizations, cache and other goodies. Distances would be greater between processing units, which will have an impact, but so long as the mean gain across the DPUs exceeded the mean loss due to extra distance and extra communication layers, you'd gain overall. This means a DPU computer cannot be flat beyond a certain scale. Just as SMP systems become hard to scale much beyond 16 or so processors because of lock contention on shared resources, a DPU computer would struggle with more than about 16 DPUs per shared resource and would probably avoid sharing resources if at all physically possible. This means a DPU computer must be heavy on duplicate resources. But for duplication to beat the contention problem, bus bandwidth needs to be extremely high and bus latency needs to be almost non-existent.
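The gain-versus-loss condition above can be put as a back-of-the-envelope check. The sketch below is a toy model only: the function name and the numbers are invented, and it assumes each DPU's specialization gain and communication loss can be expressed as fractions of some common baseline.

```python
from statistics import mean


def net_benefit(gains, losses):
    """Mean per-DPU gain minus mean per-DPU loss (fractions of a baseline).

    A positive result means specialization wins overall under this toy model.
    """
    return mean(gains) - mean(losses)


if __name__ == "__main__":
    # Made-up figures for three DPUs: speedup from specialization versus
    # slowdown from extra distance and communication layers.
    gains = [0.40, 0.25, 0.10]
    losses = [0.05, 0.15, 0.20]
    print("net benefit:", round(net_benefit(gains, losses), 3))
```

Note that one DPU can lose on its own (the third one here) while the cluster as a whole still comes out ahead, since only the means are compared.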
Cell processors are much too basic to run into these sorts of problems, but if you wanted to scale the concept up by, oh, an order of magnitude and beat the design limitations in the Cell processor, you'd need to be spending serious time and money. I expect further "specialist" *PUs to be developed for some time, but the truly RISC, truly distributed DPU is unlikely to exist outside of theory or maybe a research lab or two for at least a decade, and I don't expect DPU home machines for at least 30 years.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
How long has audio been around? Have you ever seen an audio chip integrated into the CPU? Most of them are done by onboard chips, not on the CPU.
They've moved from being standalone cards to being predominantly integrated into the mainboard and using the CPU for processing... rather like HSP modems, really.