Using GPUs For General-Purpose Computing
Paul Tinsley writes "After seeing the press releases from both Nvidia and ATI announcing their next generation video card offerings, it got me to thinking about what else could be done with that raw processing power. These new cards weigh in with transistor counts of 220 and 160 million (respectively) with the P4 EE core at a count of 29 million. What could my video card be doing for me while I am not playing the latest 3D games? A quick search brought me to some preliminary work done at the University of Washington with a GeForce4 Ti 4600 pitted against a 1.5GHz P4. My favorite excerpt from the paper:
'For a 1500x1500 matrix, the GPU outperforms the CPU by a factor of 3.2.' A PDF of the paper is available here."
GPUs are very fast ... at performing vector and matrix calculations. This is the whole point. If general computing CPUs were capable of doing vector or matrix calcs very efficiently, we would probably not have GPUs.
The Pentium 4 EE actually has 178 million transistors, which puts it in between ATI's and NVIDIA's latest.
In all of this, keep in mind that there's computing and there's computing... the kind of computing power in a GPU is excellent for doing the same numeric computation to every element of a large vector or matrix, not so much for branchy, decisiony type things like walking a binary tree. You wouldn't want to run a database on something structured like a GPU (or an old vector-processing Cray), but something like a simulation of weather or molecular modeling could be perfect for it.
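A rough sketch of that distinction, in plain Python rather than anything that runs on a GPU. The function names and the tuple-based tree layout are made up for illustration. The first function does the same arithmetic on every element, with no element depending on another, which is the shape of work a GPU likes. The second has to make a decision at every step before it knows where to go next.

```python
def saxpy(a, xs, ys):
    """Data-parallel: compute a*x + y independently for every element pair."""
    return [a * x + y for x, y in zip(xs, ys)]


def tree_contains(node, key):
    """Branchy: walk a binary search tree, choosing a direction at each node.

    A node is a (value, left, right) tuple, or None for an empty subtree.
    """
    while node is not None:
        value, left, right = node
        if key == value:
            return True
        # The next memory access depends on the outcome of this comparison.
        node = left if key < value else right
    return False


# Small usage example.
tree = (5, (3, None, None), (8, None, None))
scaled = saxpy(2.0, [1.0, 2.0], [3.0, 4.0])
found = tree_contains(tree, 8)
```

Every iteration of `saxpy` could run at the same time on separate hardware lanes; the loop in `tree_contains` cannot, because each step waits on the previous comparison.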
The similarities of a GPU to a vector processing system bring up an interesting possibility...could Fortran see a renaissance for writing shader programs?
The whole point of graphics cards is that they have a dedicated purpose. Using the cards for anything that is general purpose is like using a motorcycle to tow a pop-up camper.
What's relevant is that to the processor on a graphics card, its dedicated purpose is simply a bunch of logic. There's no dedicated "this must be used for pixels only, all else is waste" logic inherent in the system. There are MANY purposes for which the same/similar logic that applies in generating 3D imagery can be used, and that seems to be the purpose of this paper. Run THOSE types of operations on the GPU. Some things they won't be able to do well, no doubt, but those they can, they can do extremely well.
Creating a way to use the specialized GPUs for vector processing that is not graphics related is ingenious. Like a lot of great ideas, it is sooo obvious AFTER you see someone else do it.
Don't miss the point that this is not intended for general purpose computing. Don't port OoO to the graphics chip.
Where it is huge is in signal processing. FPGAs have begun replacing even the G4s in this area recently because of the huge gains in speed vs. power consumption an FPGA affords. However, FPGAs are not bought and used as is, and end up costing a significant amount (of development time/money) to become useful. Being able to use these commodity GPUs for vector processing creates a very desirable price/processing power/power consumption option. If I were nVIDIA or ATI, I would be shoveling these guys money to continue their work.
...will someone finally port john the ripper to a new video card's graphical pipeline? :)
There is, however, one thing to keep in mind. Presently our GPUs may have the headroom to play with, but with Apple's Quartz and Microsoft's Longhorn, let alone what's coming with X, that headroom may disappear, and our video cards will have to go back to being video cards.
Remember the co-processors? Well, actually I don't (I'm a tad too young). But I know about them.
Maybe it's time to start making co-processing add-on cards for advanced operations such as matrix mults and other operations that can be done in parallel on a low level. Add to that a couple of hundred megs of RAM and you have a neat little helper when raytracing etc. You could easily emulate the cards if you didn't have them (or needed them). The branchy nature of the program itself would not affect the performance of the co-processor since it should only be used for calculations.
I for one would like to see this.
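The "emulate the card if you don't have one" part could look something like the sketch below. This is only an illustration: `card` stands for a hypothetical driver object with a `matmul(a, b)` method, not any real API, and the software path is a plain triple loop.

```python
def matmul_software(a, b):
    """Software fallback: ordinary matrix multiply on lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def pick_matmul(card=None):
    """Return the card's matmul if a (hypothetical) card object is supplied,
    otherwise the software emulation above."""
    return card.matmul if card is not None else matmul_software


# With no card installed, callers transparently get the emulation.
matmul = pick_matmul()
product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The branchy program logic stays on the CPU either way; only the call that does the number crunching changes hands.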
Perhaps offloading the CPU to the GPU is the wrong way to look at things? With the apparently imminent arrival of commodity (low power) multi-CPU chips, maybe we should be considering what we need to add to perform graphics more efficiently (ala MMX et al)?
While it's true that general purpose hardware will never perform as well or as efficiently as a design specifically targeted to the task (or at least it better not), it is equally true that eventually general purpose/commodity hardware will achieve a price-performance point where it is more than "good enough" for the majority.
From a design standpoint, I can imagine a GPU that donates its power to the CPU would be a nightmare. It violates the fundamental tenet that everything should do one thing and do it well. OTOH, that tenet focuses on simplicity and maintainability over performance. Is such a tradeoff worth it?
You don't seem to understand that GPUs are very specific purpose computing devices. They aren't a general purpose processor like your CPU. They crunch matrices, and that's about it. Even all the programmable stuff is just putting parameters on the matrix churning.
My blog. Good stuff (when I remember to update it). Read it.
GCC is an inferior compiler for the x86, whether you like it or not. Intel's optimizing C/C++ compiler is much faster according to numerous benchmarks (I'm sorry, it's too late to find the links.) On the other hand, I understand that GCC is great on the Mac, since Apple optimized it properly. (Certainly I appreciate the hard work of the various GCC teams over the years; hopefully new optimizations will continue to improve the quality of the release until it is as fast as Intel's offerings.)
In any case, why do you believe all of Apple's conveniently high numbers, but you don't believe Spec numbers reported by Dell, AMD, etc.? These are not numbers pulled out of a hat; they are standard Spec results. Thus, the numbers should be comparable from company to company. But Apple retested other companies' products and released new numbers without properly optimizing for the x86. Why is it when Microsoft pays for benchmarks, people freak out, but when Apple PERFORMS benchmarks, people believe them instantly?
There are plenty of other links out there that provide similar information. It is patently false advertising for Apple to claim that they use the fastest chip of any PC.
Oh, and re: the Linux issue, you're right. But you'll find that the x86 is faster in Linux with a proper optimizing compiler.
My issue is basically that at best -- at best! -- the results are inconclusive. At worst, Apple blatantly lied. It's foolish to believe Apple blindly just because they're the underdogs and produce a pretty, Unix-based OS. And it's foolish to hold this strange hatred for all that is x86. I don't understand this mentality.
These applications are not likely to generate or process data at such a rate that the slow AGP read speed will matter that much, if at all.
Lemme try to help:
a) Not equal. Apples and oranges. A GPU will do repeated calculations very, very fast, like matrix transforms and the like. A CPU, on the other hand, will make decisions based on input, rather than just crunching numbers.
b) The main display (the GUI) already uses many tricks on the graphics card. The hard part is making sure that all graphics cards support the features. Things like the XRender extension and such are becoming more common as graphics cards and drivers get "standard" capabilities.
c) Your imagination is the limit as to what it could be used for. Just realize that it's a good data processing unit, not a good program execution unit. Use each for their strengths.
d) Modified? With new cards/drivers, all it takes is OpenGL calls to start taking advantage of this power. All it really takes is someone who knows what they're doing and has a bit of inspiration.
When I say oh shut the fuck up.
Sorry for the flames, but seriously, I get so damn sick of all the "all new games suck" whiners. Look, there are legit reasons to want new technology. It is nice to have better graphics, more realistic sound, etc. It is NICE to have a game that looks and sounds more like reality. Yes, that doesn't make the game great, but that doesn't mean it's worthless.
What's more, don't pretend like all modern games suck while old games ruled. That's a bunch of bullshit. Sure, there are plenty of modern games that suck, but guess what? There are tons of old games that suck too. Thing is, you just tend to forget about them. You remember the greats that you enjoyed or heard about, the ones that helped shape gaming today. You forget all the utter shit that was released, just as is released today.
So get off it. If you don't like nice graphics, fine. Stick with old games, no one is forcing you to upgrade. But don't pretend like there is no reason to want better graphics in games.
Yes, one thing shocked me in their paper: they don't talk much about the precision they use.
Strange, because it is a big problem for using GPUs as coprocessors: scientific computation usually uses 64-bit floats, or on Intel, 80-bit floats!
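To see why the precision matters, here is a small sketch that emulates 32-bit accumulation on the CPU and compares it with ordinary 64-bit accumulation. It assumes the card would work in 32-bit floats, which is my assumption and not something the paper states; the helper names are made up, and nothing here touches a GPU.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_float32(values):
    """Sum values while rounding the running total to 32 bits after each add."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
    return total


# Add 0.1 one hundred thousand times; the exact answer would be 10000.
values = [0.1] * 100_000
exact = 10000.0
err32 = abs(sum_float32(values) - exact)   # error with 32-bit accumulation
err64 = abs(sum(values) - exact)           # error with 64-bit accumulation
print("32-bit error:", err32)
print("64-bit error:", err64)
```

The 32-bit running total drifts far more than the 64-bit one, and a long scientific calculation performs many more operations than this loop does.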
What I remember about co-processing cards and "intelligent peripheral cards" (like raid controllers or network cards with an onboard processor) is this:
There is a certain overhead because a communications protocol is to be established between the main processor and the co-processor. For simple tasks the main processor often stops and waits for the co-processor to complete the task and retrieves the results. For more complicated tasks, the main processor continues but later an interrupt occurs that the main processor must service.
You must be very careful or the extra overhead of this communication makes the execution of the task slower than without the co-processor. This is certainly going to happen at some time in the future, when you increase central processor power all the time but keep using the same co-processor.
For example, your matrix co-processor needs to be fed the matrix data, start working, and signal when it is finished. Your performance would be limited not only by the processor speed, but also by the bus transfer rate, and by the impact those fast bus transfers have on the CPU-memory bandwidth available and the on-CPU cache validity.
When you are unlucky, the next CPU you buy is faster in performing the task itself.
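A back-of-the-envelope model of that trade-off, as a sketch. Every number below is made up for illustration, not a measurement of any real card or bus, and the model ignores the cache and memory-bandwidth side effects mentioned above.

```python
def offload_time(bytes_in, bytes_out, flops, bus_in_rate, bus_out_rate,
                 coproc_rate, latency):
    """Seconds to send data to the co-processor, compute there, and read back.

    Rates are in bytes/s (bus) and operations/s (co-processor); latency is a
    fixed per-job setup and synchronisation cost in seconds.
    """
    return (latency
            + bytes_in / bus_in_rate
            + flops / coproc_rate
            + bytes_out / bus_out_rate)


def cpu_time(flops, cpu_rate):
    """Seconds to do the same work directly on the main processor."""
    return flops / cpu_rate


# Hypothetical job: 1000 bytes each way, one million operations.
job = dict(bytes_in=1000, bytes_out=1000, flops=1e6,
           bus_in_rate=1e6, bus_out_rate=1e6, coproc_rate=1e9, latency=0.001)

t_offload = offload_time(**job)          # dominated by bus and latency
t_old_cpu = cpu_time(job["flops"], 1e8)  # today's CPU: offloading pays off
t_new_cpu = cpu_time(job["flops"], 1e9)  # next year's CPU: it no longer does
```

With these invented figures the co-processor beats the slower CPU but loses to the faster one, even though the co-processor itself did not get any slower. The bus and the handshake are what it cannot outrun.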
Math co-processor boards would be great, but still quite fixed-function.
It would be much more efficient to implement a co-processor with an FPGA: first program the FPGA with the functions to execute, then feed the data to it. When the calculation is completed, you just reprogram it to become whatever you want.
This way you would not have a math-only board, but a board that could perform many, many functions. You just need to write algorithms to exploit them.
I think the real reason Apple comes out with newer and better technology is because they have to fight for their user base. After all, if Apple's products were the same as Microsoft's, who would care?
Microsoft can afford to be lazy with their products; they make money either way. I don't think that will last forever though. Sometimes they do try hard, NT for example, but then they pile a bunch of poorly designed stuff on top of it and that ruins it. If you can, check out OS X's directory structure, it's beautiful. Now compare that to Windows' cryptic system...
"Microsoft, as usual, announced the feature after Apple shipped it"
"God I'm tired of hearing that phrase over and over again when 95% of the time it's just because Apple can control the hardware and it would be a total disaster if MS included a technology as fast as they do..."
I wonder if it's deliberate, to sell the "pro" cards they use for the rendering farms
No, it's just the way that the OpenGL and DirectX APIs evolved. There never was any need in the past to have substantial data feedback. The only need back then was to read pixelmaps and selection tags for determining when an object had been picked.
However I do know that a lot of people had been wondering about this for a while, could it be done, and was it worth attempting, so now we know. Maybe we shall soon see PCI cards containing an array of GPUs, I imagine the cooling arrangements will be quite interesting!
There are other things which are faster than a typical CPU, are not some of the processors in games machines 128-bit? Again, you could in theory put some of these together as a co-processor of some sort.
This was a good piece of work technically, but it says something about society that the fastest mass-produced processors, whether for GPUs or games consoles, exist because people want a higher frame rate in Quake. I can't think of any professional application that needs really fast graphics output, but many that could use faster processing. So why can't Intel and AMD stop putting everything in the one CPU (multiple CPUs with one memory are not really much better), and make co-processors again, which will do fast matrix operations on very large arrays, etc, for those who need them? The ultimate horror of the one CPU philosophy was the winmodem and winprinter, both ridiculous. Silicon is in fact quite cheap, as Nvidia have proved, people's time while they wait for long calculations to finish is not.
Maybe we are going to see an architectural change coming, I expect it will be supported by FOSS long before Longhorn, just like the AMD64.
Touché. However, with the upcoming advances in bus speeds (read: PCI Express) and the available bandwidth to the PCI bus, we won't have to worry about latency when using a coprocessor type piece of hardware. There is room to grow with this new bus to almost outlandish amounts of bandwidth. Not a problem we'll run into any time soon.
I would be interested in a reference for that, since the 1541 serial link was so slow. If you are talking about Mindsmear, that was not actually released, but a demo would have to be pretty clever to make the communication time worthwhile (and accurate with the screen still turned on).
I disagree. The performance of AltiVec-aware apps doing heavily vectorized computation shows that they beat similar apps on the Wintel side over and over, even against higher MHz (although not by as much as Apple claims). I know that other factors such as optimization, the non-vector code, etc. influence the outcome, but in the absence of true vector computation benchmarking, I can accept that AltiVec is better than SSE2.
Now, compared to GPUs, I think SIMD instructions suck. Why do you think 3D games utilize GPUs rather than AltiVec or SSE2? In general, you can't compare the performance of one part of a general utility chip to a specifically designed chip tuned to gain the highest performance without having to worry about trade-offs.