Using GPUs For General-Purpose Computing

video stuff by rexguo · 2004-05-08 18:54 · Score: 4, Interesting

At my workplace, I'm looking into using GPUs to do video analysis: things like cut-scene detection, generating multi-resolution versions of a video frame, applying video effects and other proprietary technologies that were previously done on the CPU. The combination of pixel shaders and floating-point buffers really makes the GPU a Super-SIMD machine if you know how to exploit it.
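To make "multi-resolution versions of a video frame" concrete, here is a minimal CPU-side sketch of the idea, assuming a grayscale frame stored as a list of rows with even dimensions at every level. The names `downsample_2x` and `pyramid` are made up for illustration and are not from the post; on a GPU the same averaging would typically run as a pixel shader pass per level.

```python
def downsample_2x(frame):
    """Halve width and height by averaging each 2x2 block of pixels."""
    height, width = len(frame), len(frame[0])
    return [
        [
            (frame[y][x] + frame[y][x + 1]
             + frame[y + 1][x] + frame[y + 1][x + 1]) / 4
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]


def pyramid(frame):
    """Return the frame followed by successively halved versions of it."""
    levels = [frame]
    while len(levels[-1]) > 1 and len(levels[-1][0]) > 1:
        levels.append(downsample_2x(levels[-1]))
    return levels


# Hypothetical 4x4 grayscale frame, used only as sample input.
frame = [
    [0, 0, 8, 8],
    [0, 0, 8, 8],
    [4, 4, 12, 12],
    [4, 4, 12, 12],
]
levels = pyramid(frame)
for level in levels:
    print(level)
```

Every output pixel depends only on its own 2x2 block, which is why this kind of work maps well onto hardware that processes many pixels at once.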

--
www.rexguo.com - Technologist + Designer

Re:Not the Point by JonoPlop · 2004-05-08 18:59 · Score: 4, Interesting

The whole point of graphic cards is that they have a dedicated purpose. Using the cards for anything that is general purpose is like using a motorcycle to tow a pop-up camper.

No, it's like using your pop-up camper for storage space when you're using it on holidays.

DSP using GPUs by crushinghellhammer · 2004-05-08 19:01 · Score: 3, Interesting

Does anybody know of pointers to papers/research pertaining to using GPUs to perform digital signal processing for, say, real-time audio? Replies would be much appreciated.

Not just the GPU : the RAM by ratboot · 2004-05-08 19:10 · Score: 5, Interesting

What's interesting about new video cards is their memory capacity, 128 or 256 MB, and that on some new cards this memory is accessible at 900 MHz over a 256-bit data path (which is a lot faster than a CPU with DDR 400 installed).
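A rough back-of-the-envelope check of that comparison, as a sketch only. It assumes the 900 MHz figure is the effective transfer rate and that "DDR 400" means a single 64-bit channel at 400 million transfers per second; neither detail is stated in the post.

```python
def peak_bandwidth_gb_s(transfers_per_second, bus_width_bits):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return transfers_per_second * (bus_width_bits / 8) / 1e9


# Assumed figures, see the note above.
gpu = peak_bandwidth_gb_s(900e6, 256)  # video card memory
cpu = peak_bandwidth_gb_s(400e6, 64)   # single-channel DDR400

print(f"video memory: {gpu:.1f} GB/s")
print(f"DDR400:       {cpu:.1f} GB/s")
print(f"ratio:        {gpu / cpu:.1f}x")
```

These are theoretical peaks; sustained rates on real hardware would be lower on both sides.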

Re:Not just the GPU : the RAM by drinkypoo · 2004-05-11 12:46 · Score: 2, Interesting

The part that annoys me is that it's all the same speed. Texture memory doesn't have to be near as fast as video memory and furthermore you could have two classes of texture memory, which will make sense as video cards reach and exceed 512MB. There have in the past been video cards with high speed video memory and something like EDO for textures, which makes a lot of sense, especially if you're willing to cache most-used textures somewhere in video memory.
Is it just me, or should the cards have maybe 64 or 128MB of high speed memory, and then a couple of DIMM slots that take ordinary DDR SDRAM? That would still be pretty fast stuff, especially if the cards had dual-channel memory controllers, and plenty fast enough for textures. The card could cache the most-used textures in whatever video memory was left after drawing screens.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"

Wow by cubicledrone · 2004-05-08 19:10 · Score: 5, Interesting

All that processing power, and the latest games still run at about 22 frames per second, if that.

The CPU can do six billion instructions a second, the GPU can do 18 billion, and every last cycle is being used to stuff a 40MB texture into memory faster. What a waste. Yeah, the walls are even more green and slimy. Whoop-de-fucking-do.

Wouldn't it be great if all that processing power could be used for something other than yet-another-graphics-demo?

Like, maybe some new and innovative gameplay?

--
Business isn't willing to pay for products, innovation and careers, so we get brands, mortgage commercials and layoffs.

audio stuff by RobPiano · 2004-05-08 19:12 · Score: 4, Interesting

At my work we do audio stuff. It would be really neat if I could do some of the more complicated audio analysis (FFT etc.) that requires lots of vector math using the video card's GPU. There is probably even some way you could sync the timing for multimedia stuff.

I know nothing about CPU design though
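For readers wondering what the "lots of vector math" in an FFT looks like, here is a minimal pure-Python sketch of a recursive radix-2 FFT. It is a textbook formulation, not anything from the post, and the 16-sample sine wave is a made-up test signal.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return list(samples)
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # The same multiply-add "butterfly" is applied to every pair.
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


# Hypothetical test signal: a sine wave with 3 cycles across 16 samples.
n = 16
signal = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
magnitudes = [abs(x) for x in fft(signal)]
peak_bin = max(range(n // 2), key=lambda k: magnitudes[k])
print("strongest frequency bin:", peak_bin)
```

The inner loop does identical arithmetic on independent pairs of values, which is the property that would make it a candidate for a vector unit or GPU.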

Re:audio stuff by Anonymous Coward · 2004-05-09 07:24 · Score: 1, Interesting

I use MATLAB at work a lot. It runs a mathematical simulation of a system. Lots of digital communications people do it. I wonder if a system can be converted into a matrix and then solved numerically.

http://www.gpgpu.org/ is a great resource by aancsiid · 2004-05-08 19:21 · Score: 4, Interesting

http://www.gpgpu.org/ is a great resource for general purpose graphics processor usage.

Not so... by oboylet · 2004-05-08 19:27 · Score: 4, Interesting

High-powered GPUs can make for really good general-purpose devices.

Apple's Newton had no CPU, only a GPU that was more than adequate.

Ideas like these are good in general. I'd like to see the industry move away from the CPU-as-chief status quo. Amigas were years ahead of their time in large part because the emphasis wasn't as much on central processing. The CPU did only what it was supposed to do -- hand out instructions to the gfx and audio subsystems.

Hardly using a "motorcycle to tow a pop-up camper." If anything, the conventional wisdom is, "when all you have is a hammer, everything looks like a nail."

So when do we get unified memory? by Anonymous Coward · 2004-05-08 19:28 · Score: 2, Interesting

Many of the problems stated in using a GPU for non-graphics tasks would be implicitly solved if the GPU and CPU shared memory. While this would slightly slow down the GPU's memory access, in 3 years, I don't think that would be an issue. Especially compared to the benefits of having only one memory pool.

I can see it now.... by TypoNAM · 2004-05-08 19:29 · Score: 3, Interesting

...Several indies and companies figure out how to use the powerful GPU's in an efficient manner that would benefit everyone who uses computers on a daily basis and improves the usefulness of the computer making it the best thing in the world again then some greedy bastard comes along flashing his granted patent by the U.S. Patent Office which makes us all screwed...

Ohh well the idea was good while it lasted. ;)

--
This space is not for rent.

Imagine... by rokzy · 2004-05-08 19:32 · Score: 4, Interesting

a beowulf cluster of them.

seriously, we have a 16 node beowulf cluster and each node has an unnecessarily good graphics card in it. a lot of the calculations are matrix-based, e.g. several variables each 1 x thousands (1D) or hundreds x hundreds (2D).

how feasible and worthwhile do you think it would be to tap into the extra processing power?

Re:Imagine... by BiggerIsBetter · 2004-05-08 20:12 · Score: 2, Interesting

It's a good idea if your datasets take a long enough time to process. You could run 6 or so cards (maybe 1 AGP super fast, 5 PCI slowish (eg FX5200)) in your machine and send a dataset to each GPU and the main CPU, then get the results back. The trick is to keep them working without blowing all your bandwidth or PSU. Also depends on the resolution required, because the GPU is only 32 bits FP, compared to 80 bits for the CPU.

All I can suggest is download the Brook libraries and try it out. See if it helps, and see if the results are accurate enough. And yes, Fortran can be used if you can bind it - Intel's compiler suite worked for me.
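A small sketch of why the precision point matters. Python has no 80-bit type, so this compares 32-bit against 64-bit floats only, rounding to 32 bits through the standard `struct` module. The `accumulate` helper and the choice of summing 0.1 are illustrative assumptions, not part of Brook or of the post.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, count, single_precision):
    """Add `step` to a running total `count` times.

    With single_precision=True every intermediate value is rounded to
    32 bits, imitating arithmetic done entirely in float32.
    """
    if single_precision:
        step = to_float32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            total = to_float32(total)
    return total


count = 100_000
single = accumulate(0.1, count, single_precision=True)
double = accumulate(0.1, count, single_precision=False)
print(f"32-bit total: {single!r}  error: {abs(single - 10000.0):.3g}")
print(f"64-bit total: {double!r}  error: {abs(double - 10000.0):.3g}")
```

Long running sums are where the 32-bit error shows up most clearly, so a test like this on your own data is a cheap way to judge whether GPU results would be accurate enough.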

--
Forget thrust, drag, lift and weight. Airplanes fly because of money.

Altivec by ensignyu · 2004-05-08 19:41 · Score: 1, Interesting

I'm curious how GPUs stack up against the Altivec engine in G4/G5s.

Re:As has been said many time before ... by lazy_arabica · 2004-05-08 19:50 · Score: 5, Interesting

The GPU are very fast ... at performing vector and matrix calculations. This is the whole point. If general computing CPUs were capable of doing vector or matrix calcs very efficiently, we would probably not have GPUs.

Yes. But 3D graphics are not the only use of these mathematical objects; I wonder if it would be possible to use a GPU to perform video encoding or digital sound manipulation at a higher speed, as both operations require matrices. I'm also sure they could take advantage of these processors' vector manipulation capabilities.

Documentation by Detritus · 2004-05-08 19:51 · Score: 2, Interesting

Do any of the video chip manufacturers make free and complete documentation available for their GPUs? Everything that I have read in the past has said that they are encumbered with NDAs and claims of trade secrets. I'd prefer not to waste my time dealing with companies that treat their customers as potential enemies.

--
Mea navis aericumbens anguillis abundat

Frogger by BiggerIsBetter · 2004-05-08 19:52 · Score: 4, Interesting

Some dude wrote Frogger almost entirely in pixel shaders. http://www.beyond3d.com/articles/shadercomp/results/ (2nd from the bottom).

--
Forget thrust, drag, lift and weight. Airplanes fly because of money.

SETI by ryanw · 2004-05-08 20:04 · Score: 1, Interesting

I wonder what SETI could do with the extra cycles in parallel with the CPU. Is it possible to get 2x or 3x the crunching of data for SETI clients?

Re:Maybe that's the answer... by trg83 · 2004-05-08 20:11 · Score: 3, Interesting

From the link you mentioned: "while Apple used a compiler you've never heard of (at least in the x86 world)."

My understanding is that they used GCC.

Further, "Another said that some version of Linux had to be used to compare apples to apples. Well, MacOS X isn't Linux, and the desktop standard for x86 machines is Windows (not that using a properly optimized Linux bothered the Opterons very much). You want to know what machine is fastest, you test in their native environment."

Oh, silly me. Processors are so obviously made to run only one operating system!

I'll take this site's info with a grain of salt.

Re:Link to previous discussion on same/similar sub by hype7 · 2004-05-08 20:15 · Score: 4, Interesting

There's some good stuff in there.

However, it seems a few organisations have actually beaten us to it.

Apple, for example, uses the 3D aspect of the GPU to accelerate its 2D compositing system with Quartz Extreme. Microsoft, as usual, announced the feature after Apple shipped it, and with any luck Windows users might have it by 2007.

-- james

Unused computing Power? by JLang22 · 2004-05-08 20:26 · Score: 1, Interesting

I am a novice in a lot of these discussions so I don't post much. Let me see if I understand this:

The graphics card has a lot of unused computing power, nearly equal to the main processor chip in the computer if not more, that is not being used when there is no game or video being played, right?

Is there no way to tap into this power?

Perhaps it could be used for the main display on the computer (I think you guys call it GUI?)?

What else could it be used for?

Could Linux be modified to make use of this power?

Just a know nothing, nobody with questions.
J

Let me check my notes... by Impeesa · 2004-05-08 20:58 · Score: 4, Interesting

I did a paper on the topic of general-purpose GPU programming for my parallel computing course just this last semester here, interestingly enough. I believe our research indicated that even a single PCI card was so badly throttled by the bus throughput that it was basically useless. AGP does a lot better taking data in, but it's still pretty costly sending data back to the CPU. I have a feeling your proposed setup will be a whole lot more feasible if/when PCI Express becomes mainstream.

Re:Let me check my notes... by sonamchauhan · 2004-05-09 02:21 · Score: 3, Interesting

Somewhere in this story, I found a post with a link that explains this is a software problem:
Notice that they're quick to point out the problem isn't likely a hardware issue. There should be plenty of bandwidth on the AGP bus, but graphics chip makers don't seem to have written their drivers to handle transfers from AGP cards to main memory properly.

Then they run some tests and conclude:
That means even if you can render high-quality images at 30 frames per second, you won't be able to get them out of the graphics card at anything near that rate.

Alternative use by Zog+The+Undeniable · 2004-05-08 21:01 · Score: 2, Interesting

Remember the story about PS2's being used in Iraqi WMDs? No doubt the next "outlaw state" will be accused of using GeForce Ti4600's to manage fast breeder reactors.

--
When I am king, you will be first against the wall.

Re:Maybe that's the answer... by phatsharpie · 2004-05-08 21:09 · Score: 2, Interesting

Actually, GCC may have optimization for the G5, but it is far from being optimal:

The compiler that seems to be best/fully optimized for the G5 is the new IBM XL compilers, released at the beginning of the year.

http://forums.macnn.com/showthread.php?s=&threadid=197118

There don't seem to be many benchmarks done using it yet, but all information points to a significant gain in performance when using the IBM compiler versus GCC (not surprising, since IBM built the chip). The only benchmark I can find is from a German site:

http://www.heise.de/ct/Redaktion/as/spec/ct0408230/

I don't believe the G5 is indeed the "fastest" personal computer in the world as claimed by Apple, but it certainly is comparable to the best in the x86 world. Not to mention it is a very new architecture, and there are still plenty of optimizations that can be made to make it faster. But to claim that GCC is fully optimized for the G5, and that Apple was using it to justify its claim of being the "fastest", is incorrect. It used a compiler that is arguably good, but certainly not excellent for it.

In regards to comparing Mac OS X to Linux rather than Windows, I think the comparison is valid considering the market Apple has been targeting recently. Apple seems to have backed off from wooing the MS crowd, and is instead focusing on firms that use UNIX workstations. Apple wants these companies to switch to the PowerMac rather than to an x86/Linux platform. This is highlighted by their advocacy of using OS X for biotech and film/video effects production. I remember one of their earlier OS X ads even told the reader to send all of their old UNIX boxes to "/dev/null" - or something like that.

-B

Dual Core by BrookHarty · 2004-05-08 21:16 · Score: 4, Interesting

With Dual Core CPU's going to be the norm, why not a Dual Core GPU for even faster gfx cards? With everyone wanting 16x antialiasing at 1600x1200 to get over 100fps, it's gonna take some very powerful GPU's (or some dual cores).

Even with the ATI 800XT, 1600x1200 can dip below 30FPS with AA/AF on higher settings. Still a ways to go for that full virtual reality look.

Re:Dual Core by BrookHarty · 2004-05-08 22:00 · Score: 2, Interesting

Video cards are already able to run many things in parallel- they are beyond dual-core.

There were dual ATI GPU's or Matrox or even the old Voodoo2 SLI. Seems you can increase speed with more cores.

Commodore 64 by curator_thew · 2004-05-08 22:01 · Score: 5, Interesting

This concept was being used back in 1988. The Commodore 64 (1 MHz 6510, a 6502-like microprocessor) had a peripheral 5.25-inch disk drive called the 1541, which itself had a 1 MHz 6510 CPU in it, connected via a serial link.

It became common practice to introduce fast loaders: these were partially resident in the C64, and also in the 1541: effectively replacing the 1541's limited firmware.

However, demo programmers figured out how to utilise the 1541: one particular demo involved uploading a program to the 1541 at start, then upon every screen rewrite, uploading vectors to the 1541, on which the 1541 would perform calculations in parallel with the C64. Then at the end of the screen, the C64 would fetch the results from the 1541 and incorporate them into the next screen frame.
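The upload, work in parallel, fetch pattern described above can be sketched in a few lines. This is only an illustration of the control flow, with a worker thread standing in for the second processor; `rotate_vectors`, `draw_frame` and `run` are hypothetical names, and Python threads will not show a real speed-up for this kind of arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def rotate_vectors(vectors):
    """Stand-in for the work handed to the second processor."""
    # Rotate each 2D vector by 90 degrees: (x, y) -> (-y, x).
    return [(-y, x) for x, y in vectors]


def draw_frame(frame_number):
    """Stand-in for the host's own per-frame work."""
    return f"frame {frame_number} drawn"


def run(frames, vectors):
    log = []
    with ThreadPoolExecutor(max_workers=1) as coprocessor:
        for frame in range(frames):
            # Start of frame: upload the vectors to the coprocessor.
            pending = coprocessor.submit(rotate_vectors, vectors)
            # The host keeps working while the coprocessor computes.
            log.append(draw_frame(frame))
            # End of frame: fetch the results for use in the next frame.
            vectors = pending.result()
    return vectors, log


final_vectors, log = run(frames=4, vectors=[(1, 0), (0, 2)])
print(final_vectors)
print(log)
```

The key design point is that the fetch happens once per frame, so the cost of the slow link is paid in one batch rather than per calculation.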

Equally, GPU provides similar capability if so used.

Expand this thinking! by Osty · 2004-05-08 22:06 · Score: 3, Interesting

You're absolutely correct that these "game snobs" are looking at the past through rose-colored graphics, forgetting all of the stinkers of yesteryear. However, it's not just games where this applies. How many times have you heard people complain about how bad movies are now, or music, or books? It's exactly the same phenomenon. When your grandfather tells you how much better things were "back in the day", it's for exactly the same reason. He's looking back at all the good things, while ignoring all of the bad.

Face it, everything mostly sucks. It always has, and it always will. There will always be some gems that really stand out, and those will be what are remembered when people fondly look back on "the old days". Get over it.

Re:and a sourceforge project too by WinterpegCanuck · 2004-05-08 22:11 · Score: 2, Interesting

What about a general abstraction layer at the OS level? I am by no means at that programming level, but could you not have calculations that are proven to run well on GPUs (ints maybe?) be redirected by the OS, and the rest just sent to the CPU as normal? To me this would benefit all programs running on the system (except the games that want exclusive GPU use) instead of only those coded to take advantage of it. I know a few programs in the oil industry that could use all the bogomips they could get.

So? How do you program this? by Anonymous Coward · 2004-05-08 22:50 · Score: 1, Interesting

Are there interrupts? Available to userland?

The 'acceleration' layer (DirectX, Xv?) is not even available to the programmers. The programmer requests from DirectX or SDL to draw a polygon. Then DirectX or SDL invoke acceleration features of the card. But we do not have direct access to those features. They are not even documented.

Will the kernel provide those facilities?
Because it would be stupid to go through SDL to perform FFT with the video card's capabilities.

what's really needed by curator_thew · 2004-05-09 00:11 · Score: 3, Interesting

What's really needed is to couple the GPU and CPU in such a way that the GPU actually runs a very low level O/S, like an L4Ka style kernel (http://l4ka.org/), and becomes "just another" MP resource.

Then, on top of this low level, actually runs the UI graphics driver and so on. Other tasks can also run, but ultimately the priority is given to the UI driver.

Then, the O/S on the CPU needs to be able to know generally how to distribute tasks across to the GPU. Fairly standard for a tightly coupled MP that has shared bus memory.

Why do I say this? Because the result is

(a) if you're using an especially high performance application, the GPU runs full throttle dedicated to rendering/etc and acts as per normal;

(b) if you're not, e.g. when running Office, engineering or other compute-intensive tasks (such as recoding video without displaying the video), then the GPU is just another multiprocessor resource to soak up cycles.

Then, CPU/GPU is just a seamless computing resource. The fantastic benefit of this is that if the O/S is designed properly, then it could allow simply buying/plugging in additional PCI (well, PCI probably not good because of low speed, perhaps AGP?) cards that are simply "additional processors" - then you get a relatively cheaper way of putting more MP into your machine.

Re:Maybe time for a new generation of math-process by Temkin · 2004-05-09 00:35 · Score: 2, Interesting

Remember the co-processors? Well, actually I don't (I'm a tad too young). But I know about them.

Dig deeper. 8087 FPUs were nice, though they ran hot enough to cook on, but the idea had existed for 15 or more years before they appeared. Try looking into the old DEC PDP-11 archives. There you'll find DEC's own "CIS" or "commercial instruction set", which was a set of boards (later an add-on chip) that added string, character and BCD math instructions. DEC also had an FPU card set that implemented a 64-bit FPU out of AMD 2901 bit slice processors. Many low-budget not-quite-supercomputers were really add-on hardware boxes for general purpose computers. Basically add-on stunt boxes.

Damn... I'm too young to feel this old! Most of this stuff was in play when I was in grade school.

Temkin

compiler? by Anonymous Coward · 2004-05-09 00:40 · Score: 1, Interesting

Could this be integrated with a compiler, so that the compiler could elect to use the GPU? That would be really cool:

gcc --with-gpu somebigprog.c

Very bad article by Slash.ter · 2004-05-09 01:58 · Score: 3, Interesting

This is a very poor quality article, I analyzed it before. There are possibly better ones mentioned by others.

Just look at the matrix multiplication case. Look at the graph and see that 1000x1000 takes 30 seconds on CPU and 7 seconds on GPU. Let's translate it to millions of operations per second: CPU -> 33 Mop/s, GPU -> 142 Mop/s. Matrix multiplication has cubic complexity, so for the CPU: 1000 * 1000 * 1000 / 30 seconds / 1000000 = 33 Mop/s.
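The arithmetic is easy to reproduce. A minimal sketch, counting n cubed operations for an n x n multiply as the post does; the `mops` helper name is made up here, and the timings are the 30 and 7 seconds read off the article's graph.

```python
def mops(n, seconds):
    """Millions of operations per second for an n x n matrix multiply,
    counting n**3 operations in total."""
    return n ** 3 / seconds / 1e6


cpu = mops(1000, 30.0)  # 30 seconds on the CPU
gpu = mops(1000, 7.0)   # 7 seconds on the GPU

# int() truncates, matching the rounded-down figures quoted in the post.
print(f"CPU: {int(cpu)} Mop/s, GPU: {int(gpu)} Mop/s")
```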

Now think a while: 33 million operations per second on a 1.5 GHz Pentium 4 with SSE (I assume there is no SSE2). Pentium 4 has a fused multiply-add unit which makes it do two ops per clock. So we get 3 billion ops per second peak performance! What they claim is that the CPU runs matrix multiply about 100 times slower than its peak. That is unlikely. You can get 2/3 of peak on Pentium 4. Just look at the ATLAS or FLAME projects. If you use one of these projects you can multiply 1000x1000 matrices in half a second: 14 times faster than the quoted GPU.
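The estimate in that paragraph can be worked through step by step. This sketch simply takes the post's own figures at face value (1.5 GHz, two operations per clock, 2/3 of peak achievable, 7 seconds for the GPU); it does not verify any of them.

```python
clock_hz = 1.5e9     # Pentium 4 clock quoted in the post
ops_per_clock = 2    # the post's assumption
efficiency = 2 / 3   # fraction of peak the post says is reachable
n = 1000             # matrix dimension
gpu_seconds = 7.0    # GPU time read off the article's graph

peak = clock_hz * ops_per_clock   # operations per second at peak
sustained = peak * efficiency     # what a tuned library might reach
cpu_seconds = n ** 3 / sustained  # time for one 1000x1000 multiply
speedup = gpu_seconds / cpu_seconds

print(f"peak: {peak / 1e9:.1f} Gop/s, sustained: {sustained / 1e9:.1f} Gop/s")
print(f"tuned CPU time: {cpu_seconds:.2f} s, {speedup:.0f}x faster than the GPU figure")
```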

Another thing is the floating point arithmetic. GPU uses 32-bit numbers (at most). This is too small for most scientific codes. CPU can do 64-bits. Also, if you use 32-bits on CPU it will be 4 times as fast as 64-bit (SSE extension). So in 32-bit mode, Pentium 4 is 28 times faster than the quoted GPU.

Finally, the length of the program. The reason matrix multiply was chosen is because it can be encoded in very short code - three simple loops. This fits well with the 128-instruction vertex code length. You don't have to keep reloading the code. More challenging codes will exceed the allowed vertex code length. The three-loop matrix multiply implementation stresses memory bandwidth. And CPU has MB/s and GPU has GB/s. No wonder GPU wins. But I can guess that without making any tests.
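For reference, the "three simple loops" look like this. It is a plain textbook version in Python, not the code from the article, and the 2x2 inputs are arbitrary sample values.

```python
def matmul(a, b):
    """Naive matrix multiply: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                # Each step reads two values from memory for one multiply-add.
                c[i][j] += a[i][k] * b[k][j]
    return c


product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)
```

Because the inner loop does so little arithmetic per memory read, this naive form is limited by memory bandwidth, which is the point the paragraph above makes. Tuned libraries reorder the loops into blocks that stay in cache.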

Three questions by pvera · 2004-05-09 02:50 · Score: 2, Interesting

1. Is anyone except Apple trying to leverage the GPU for non-3D tasks? Apple has been doing Quartz Extreme for a while but I have not heard if anyone else is doing it.

2. Has anyone tried something similar to what Quartz Extreme does but for non-graphical tasks?

3. How come GPU makers are not trying to make a CPU by themselves?

--
Pedro
----
The Insomniac Coder

Folding@Home is actually working on this... by pointwood · 2004-05-09 04:13 · Score: 3, Interesting

Some day you may be able to Fold proteins with your GPU.

Re:The day is saved by Directrix1 · 2004-05-09 07:39 · Score: 4, Interesting

Doesn't anybody find it annoying that 3-D operation is being hardwired into the video card to begin with? Why aren't we making 200 million transistor math coprocessors with high bus speeds, uncoupled from the video card? This way we wouldn't have to keep getting a new video card every time we want to upgrade our system's 3-D performance. Since these operations are highly symmetric, you could put an array of these into one machine to upgrade incrementally. Also, this would make the issue of how to access your GPU for other purposes irrelevant, as it would be a math coprocessor expected to be used as such anyway. And the best reason for doing it this way: OpenGL (and DirectX too) could become more of a thick software layer on top of the generic coprocessor, and since the coprocessors would eventually standardize on a common instruction set, you wouldn't need a new version of OpenGL or DirectX for every new coprocessor release. What do you guys think?

--
Occam's razor is the blind faith in the natural selection of least resistance and in universal oversimplification. -- EF

Re:The day is saved by Directrix1 · 2004-05-09 17:37 · Score: 2, Interesting

I don't understand your first statement. The fact that these GPUs exist and are being used to do so many things would imply that it's actually not that specialized. It just has a fat pipeline. Matrix operations are very common, and many common tasks, even web browsing, can easily take advantage of them for image decompression and video/audio streaming. And maybe in the future, if we get the whole "we don't need a dedicated coprocessor" idea out of our heads, it could be used for things like Neural Network Assistants, faster/better speech recognition, and other more complex tasks which aren't commonplace on the desktop only because the desktop can't effectively handle them right now.

For the positioning and cooling, well there is one in there right now. There is enough space more than likely even for more than one.

Also, I'm not saying lets not give the sucker a cache. It would more than likely need a cache of its own dedicated memory to effectively operate just like any processor.

When I was about 15 and I first started reading about the first GPUs, all I could think about was, "Boy is this a step in the wrong direction." I believe in hardware whose purposes are cleanly separated. Well, the GPU thing has had its heyday, why not start making general purpose coprocessors now so every application can get a nice boost (well, a lot of applications). The instructions already resemble a normal processor's anyway, so why not.

--
Occam's razor is the blind faith in the natural selection of least resistance and in universal oversimplification. -- EF

Slashdot Mirror
