Using GPUs For General-Purpose Computing
Paul Tinsley writes "After seeing the press releases from both Nvidia and ATI announcing their next generation video card offerings, it got me to thinking about what else could be done with that raw processing power. These new cards weigh in with transistor counts of 220 and 160 million (respectively) with the P4 EE core at a count of 29 million. What could my video card be doing for me while I am not playing the latest 3d games? A quick search brought me to some preliminary work done at the University of Washington with a GeForce4 TI 4600 pitted against a 1.5GHz P4. My Favorite excerpt from the paper:
'For a 1500x1500 matrix, the GPU outperforms the CPU by a factor of 3.2.' A PDF of the paper is available here."
http://developers.slashdot.org/article.pl?sid=03/1 2/21/169200&mode=thread&tid=152&tid=18 5
Here's a HTML version of the PDF, thanks to Google.
General-purpose computation using graphics hardware has been a significant topic of study for the last few years. Pointers to a lot of papers and discussion on the subject are available at: www.gpgpu.org
Yes, it's true that it has that many transistors BUT, only 29 million of them are part of the core, the rest is memory. The transistor count on the video cards does not count the ram.
Is a course being offered at caltech since last summer on using gpus for numerical work. Course page is here.
:wq
Before you get excited just remember how asymmetric the APG bus is. Those GPUs will be at much better use when we get them as 64bit pci cards.
If you have a matrix solver, there is no telling what you can do. And i remember, these papers show that the speed is faster than the matrix calculations of the same stuff using the CPU.
# Linear Algebra Operators for GPU Implementation of Numerical Algorithms
Jens Krüger, Rüdiger Westermann
# Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
Jeff Bolz, Ian Farmer, Eitan Grinspun, Peter Schröder
# Nonlinear Optimization Framework for Image-Based Modeling on Programmable Graphics Hardware
Karl E. Hillesland, Sergey Molinov, Radek Grzeszczuk
BrookGPU
from the BrookGPU website...
As the programmability and performance of modern GPUs continues to increase, many researchers are looking to graphics hardware to solve problems previously performed on general purpose CPUs. In many cases, performing general purpose computation on graphics hardware can provide a significant advantage over implementations on traditional CPUs. However, if GPUs are to become a powerful processing resource, it is important to establish the correct abstraction of the hardware; this will encourage efficient application design as well as an optimizable interface for hardware designers.
From what I understand this project it aimed at making an abstraction layer for GUP hardware so writing code to run on it is easier and standardsied.
I thought this looked familiar:
1 /169200.shtml?tid=152&tid=185
http://developers.slashdot.org/developers/03/12/2
At least, I would imagine most of the comments would be the same or similar....
Well they already make DSP cards for audio processing. Simply do a google(TM) search for "DSP card" and you will get several vendors.
I can't imagine it would take a whole lot to hack them for just their processing power outside of audio applications.
I've been thinking about using the GPU for audio DSP work for some time, even got to a point where I could transform some signal by "rendering" it into a texture (in a simple way, I could mix two sounds using the alpha as factor).
The problem is that these cards are made to be "write only" and that basicaly fetching back anything from them is *very* slow, which makes them totaly useless for the purpose, since you *kmow* the results are there, but you can't fetch them in an usefull/fast maneer.
I wonder if it's deliberate, to sell the "pro" cards they use for the rendering farms
QE is cool, but it doesn't do anything similar at all to what they're talking about here. FFTs on an NV30 are only incidentally related to texture mapping window contents. Check out gpgpu.org or BrookGPU. In a sense, the idea is to treat modern graphics hardware as the next step beyond SIMD instruction sets. Incidentally, e17 exploited (hardware) GL rendering of 2D graphics via evas a bit before Apple put that into OS X.
Having done a similar work for my final year project this year, I have some experience attempting general purpose computation on a GPU. The results that I recieved when comparing the CPU with the GPU were very different with many of the applications coming in at 7-15 times slower on the GPU. Further, I discovered some problems which I mention below:
! Matrix results
As in mentioned earlier in the report, the graphics pipeline does not support a branch instruction. So with a limitied number of assembly instructions that can be executed in each stage of the pipeline (either 128 or 256 in current cards), how is it possible for them to perform a calculation on a 1500x1500 matrix multiplication. To calculate a single result 1500 multiplications would need to take place and if they are really clever about how they encode the data into texture s to optimise access, they would need two texture accesses for even 4 multiplications. By my calculations that is 1875 instructions, where you can only do 128 or 256.
My tests found that using the Cg compiler provided by NVidia, that a matrix of size 26x26 could be multiplied before the unrolling of the for loop exceed the 256 limitation.
One aspect that my evaluation did not get to examine was the possiblity of reading partial results back from the framebuffer to the texture memory along with loading a slightly modified program to generate the next partial result. They don't mention if they used this strategy so I assume that they don't.
! Inclusion of a branch instruction
Even if a branch instruction were to be included into the vertex and fragment stages of the pipeline, it would cause serious timing issues. As student of Computer Science, I have been taught that the pipeline operates at the speed of the slowest stage and from designing simple pipelined ALUs, I see the logic behind it. However, if a branch instruction is included then the fragment processing stage could become the slowest as the pipeline stalls waiting for the fragment processor to output its information into the framebuffer. I believe it for this reason that the GPU designers specifically did not include a branch instruction.
! Accuracy
My work also found a serious accuracy issue with attempting compuation on the GPU. Firstly, the GPU hardware represents all number in the pipeline as floating point values. As many of you can probably guess, this brings up the ever present problem of 'floating point error'. The interface between GPU and CPU are traditionally 8-bit values. Once they are imported into the 32-bit floating point pipeline the representation has them falling between 0 and 1, meaning that these numbers must be scaled up to their intended representations (integers between 0 and 255 for example) before computation can begin. Combine these two necessary operations and what I saw was a serious accuracy issue where five of my nine results(in the 3x3 matrix) were one integer value out.
While I don't claim to be an expert on these matters, I do think there is the possiblity of using commodity graphics cards for general purpose computation. However, using hardware that is not designed for this purpose holds some serious constraints in my opinion. Anyone who cares to look at my work can find it here