thurin_the_destroyer · Slashdot Mirror

User: thurin_the_destroyer

thurin_the_destroyer's activity in the archive.

Stories: 0
Comments: 2
First seen: 2004-05-09
Last seen: 2004-05-09

Comments · 2

Re:Interesting work that raises some questions... on Using GPUs For General-Purpose Computing · 2004-05-09 22:40 · Score: 1

> On NV40-based architectures, you get a branch instruction as well as a 65536 instruction program limit.

Interesting. I would like to see how this affects the timing in the pipeline, and whether you would see a noticeable slowdown when you make use of the full 512 instructions, executing 65k times.

> A deep unwavering belief is a sure sign you're missing something...

I don't know about a `deep unwavering belief`, but my work was done on a GeForce FX 5600, so all of my observations are based on that card.

Interesting work that raises some questions... on Using GPUs For General-Purpose Computing · 2004-05-09 02:54 · Score: 4, Informative

Having done similar work for my final-year project this year, I have some experience attempting general-purpose computation on a GPU. The results I received when comparing the CPU with the GPU were very different, with many of the applications coming in at 7-15 times slower on the GPU. Further, I discovered some problems, which I describe below:

! Matrix results
As is mentioned earlier in the report, the graphics pipeline does not support a branch instruction. So, with a limited number of assembly instructions that can be executed in each stage of the pipeline (either 128 or 256 in current cards), how is it possible for them to perform a 1500x1500 matrix multiplication? To calculate a single result, 1500 multiplications would need to take place, and even if they are really clever about how they encode the data into textures to optimise access, they would need two texture accesses for every 4 multiplications. By my calculations that is 1875 instructions, where you can only do 128 or 256.
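
The post does not break the 1875 figure down. One reading that reproduces it, and this is only an assumption, is five instructions for every packed group of four multiplications (two of them being the texture accesses). A minimal Python sketch of that arithmetic, with `instructions_per_result` and its parameters being hypothetical names:

```python
import math

def instructions_per_result(terms, lanes=4, instr_per_group=5):
    """Estimate the instructions needed for one dot product of `terms` products.

    lanes: multiplications packed into one vector operation (4 in the post).
    instr_per_group: ASSUMED cost of one packed group. The value 5 is a guess
    chosen because it reproduces the post's total; it is not from the post.
    """
    groups = math.ceil(terms / lanes)
    return groups * instr_per_group

needed = instructions_per_result(1500)
for limit in (128, 256):
    # Show how far the estimate overshoots each per-program instruction limit.
    print(f"limit {limit}: need {needed}, {needed / limit:.1f}x over")
```

Real compiler output will differ from this estimate; the point is only that the count per result is several times larger than either limit.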

My tests found that, using the Cg compiler provided by NVIDIA, a matrix of size 26x26 could be multiplied before the unrolling of the for loop exceeded the 256-instruction limit.

One aspect that my evaluation did not get to examine was the possibility of reading partial results back from the framebuffer into texture memory, along with loading a slightly modified program to generate the next partial result. They don't mention whether they used this strategy, so I assume that they didn't.
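
The partial-result strategy described above can be sketched on the CPU. This is a plain Python simulation, not GPU code: the `result` list stands in for the framebuffer, and `terms_per_pass` is a hypothetical parameter for how many terms of each dot product fit within one program's instruction limit.

```python
def matmul_multipass(a, b, terms_per_pass):
    """Multiply a (n x m) by b (m x p), a slice of the inner loop per pass."""
    n, m, p = len(a), len(b), len(b[0])
    result = [[0] * p for _ in range(n)]  # stands in for the framebuffer
    passes = 0
    for start in range(0, m, terms_per_pass):
        stop = min(start + terms_per_pass, m)
        # One pass: read the previous partial results back in and add the
        # contribution of terms start..stop-1 to every output element.
        result = [
            [
                result[i][j] + sum(a[i][k] * b[k][j] for k in range(start, stop))
                for j in range(p)
            ]
            for i in range(n)
        ]
        passes += 1
    return result, passes

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
product, passes = matmul_multipass(a, b, terms_per_pass=2)
print(product, passes)
```

Each pass costs a readback and a program load, so the number of passes is the price paid for staying under the instruction limit.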

! Inclusion of a branch instruction
Even if a branch instruction were to be included in the vertex and fragment stages of the pipeline, it would cause serious timing issues. As a student of Computer Science, I have been taught that a pipeline operates at the speed of its slowest stage, and from designing simple pipelined ALUs, I see the logic behind it. However, if a branch instruction is included, then the fragment processing stage could become the slowest, as the pipeline stalls waiting for the fragment processor to output its information into the framebuffer. I believe it is for this reason that the GPU designers specifically did not include a branch instruction.
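
The "slowest stage" rule can be illustrated with a textbook lock-step pipeline model. This is a simplified sketch, not a model of any real GPU; the stage latencies are made-up numbers in arbitrary units, and `pipeline_time` is a hypothetical helper.

```python
def pipeline_time(stage_latencies, n_items):
    """Total time for n_items to pass through a lock-step pipeline.

    Every stage advances on a shared clock, so the clock period is the
    latency of the slowest stage. Filling the pipeline takes one period
    per stage, after which one item completes per period.
    """
    period = max(stage_latencies)
    return (len(stage_latencies) + n_items - 1) * period

# Hypothetical latencies for the vertex, raster and fragment stages.
fixed_length = [2, 3, 4]
with_branching = [2, 3, 10]  # assume branching lets a fragment program run longer

print(pipeline_time(fixed_length, 1000))
print(pipeline_time(with_branching, 1000))
```

In this model, lengthening only the fragment stage raises the cost of every item, which is the timing concern raised above.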

! Accuracy
My work also found a serious accuracy issue with attempting computation on the GPU. Firstly, the GPU hardware represents all numbers in the pipeline as floating point values. As many of you can probably guess, this brings up the ever-present problem of 'floating point error'. The interface between GPU and CPU traditionally carries 8-bit values. Once they are imported into the 32-bit floating point pipeline, the representation has them falling between 0 and 1, meaning that these numbers must be scaled up to their intended representations (integers between 0 and 255, for example) before computation can begin. Combine these two necessary operations and what I saw was a serious accuracy issue where five of my nine results (in the 3x3 matrix) were one integer value out.
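
A small Python sketch for experimenting with the normalise-then-rescale round trip described above. It emulates single precision on the CPU with `struct`, which is an assumption and not how the GeForce FX stores or rounds values, so it does not claim to reproduce the five-of-nine result. It compares truncating the rescaled floats with rounding them to the nearest integer; all function names are hypothetical.

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def matmul_exact(a, b):
    """Reference integer matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matmul_normalised(a, b):
    """Matrix product computed on 8-bit inputs mapped to floats in [0, 1]."""
    n = len(a)
    an = [[f32(v / 255) for v in row] for row in a]  # upload: 0..255 -> 0..1
    bn = [[f32(v / 255) for v in row] for row in b]
    out = []
    for i in range(n):
        row = []
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc = f32(acc + f32(an[i][k] * bn[k][j]))
            # Both factors were divided by 255, so scale back by 255 * 255.
            row.append(f32(acc * 65025.0))
        out.append(row)
    return out

a = [[200, 131, 17], [93, 255, 64], [7, 180, 222]]
b = [[45, 210, 99], [250, 3, 128], [76, 161, 240]]
exact = matmul_exact(a, b)
approx = matmul_normalised(a, b)
truncated = [[int(x) for x in row] for row in approx]
rounded = [[round(x) for x in row] for row in approx]
off_by_one = sum(t != e for tr, er in zip(truncated, exact) for t, e in zip(tr, er))
print("truncation mismatches:", off_by_one)
print("rounding matches exact:", rounded == exact)
```

Truncation turns any result that lands a hair below the true integer into an off-by-one, while rounding only needs the float to land within 0.5 of it, so how the final conversion is done matters as much as the precision of the pipeline.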

While I don't claim to be an expert on these matters, I do think there is the possibility of using commodity graphics cards for general-purpose computation. However, using hardware that is not designed for this purpose imposes some serious constraints, in my opinion. Anyone who cares to look at my work can find it here