Using GPUs For General-Purpose Computing

The day is saved by drsmack1 · 2004-05-08 18:50 · Score: 5, Funny

Now I finally have a use for the 20 Voodoo 2 cards I have in a box in the basement. Now I can have my very own supercomputer. I just need some six-PCI-slot motherboards... Instant cluster!

--

Humor from a Genetically Molested Mind

Re:The day is saved by PygmySurfer · 2004-05-09 00:30 · Score: 5, Funny

Unless those Voodoo 2s have magically grown T&L units, they're not going to do you much good.

Maybe they have. They've been trapped in that box together in the basement for a long time.
Re:The day is saved by Directrix1 · 2004-05-09 07:39 · Score: 4, Interesting

Doesn't anybody find it annoying that 3-D operations are being hardwired into the video card to begin with? Why aren't we making 200-million-transistor math coprocessors with high bus speeds, decoupled from the video card? This way we wouldn't have to keep getting a new video card every time we want to upgrade our system's 3-D performance. Since these operations are highly symmetric, you could put an array of these coprocessors into one machine and upgrade incrementally. This would also make the question of how to access your GPU for other purposes irrelevant, since it would be a math coprocessor expected to be used as such anyway. And the best reason for doing it this way: OpenGL (and DirectX too) could become more of a thick software layer on top of the generic coprocessor, and since the coprocessors would eventually standardize on a common instruction set, you wouldn't need a new version of OpenGL or DirectX for every new coprocessor release. What do you guys think?

--
Occam's razor is the blind faith in the natural selection of least resistance and in universal oversimplification. -- EF
Re:The day is saved by Metasquares · 2004-05-09 08:07 · Score: 4, Insightful

This way we wouldn't have to keep getting a new video card every time we want to upgrade our system's 3-D performance.
I think you've just answered your own question.
Re:The day is saved by Directrix1 · 2004-05-09 17:37 · Score: 2, Interesting

I don't understand your first statement. The fact that these GPUs exist and are being used to do so many things would imply that the GPU is actually not that specialized. It just has a fat pipeline. Matrix operations are very common, and many everyday tasks, even web browsing, can easily take advantage of them for image decompression and video/audio streaming. And maybe in the future, if we get the whole "we don't need a dedicated coprocessor" idea out of our heads, it could be used for things like neural network assistants, faster/better speech recognition, and other more complex tasks that are uncommon on the desktop right now only because the desktop can't handle them effectively.

As for positioning and cooling, well, there is one in there right now. There is more than likely enough space for more than one.

Also, I'm not saying let's not give the sucker a cache. Like any processor, it would more than likely need a cache of its own dedicated memory to operate effectively.

When I was about 15 and first started reading about the first GPUs, all I could think was, "Boy, is this a step in the wrong direction." I believe in hardware whose purposes are cleanly separated. Well, the GPU thing has had its heyday; why not start making general-purpose coprocessors now so every application (well, a lot of applications) can get a nice boost? The instructions already resemble a normal processor's anyway, so why not?

--
Occam's razor is the blind faith in the natural selection of least resistance and in universal oversimplification. -- EF

What?!?!?! by DarkHelmet · 2004-05-08 18:50 · Score: 5, Funny

What? Matrix operations run faster on a massively parallel vector processor than on a general-purpose processor? How can that be?

Intel's been telling me for years that I need faster hardware from THEM to get the job done...

You mean........ they were lying?!?!?

CRAP!

--
/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i

Re:What?!?!?! by Anonymous Coward · 2004-05-08 20:11 · Score: 5, Funny

Don't worry, the Intel processor is *much* faster at the internet thingy. Graphics cards only do the upload to screen thing, and everyone knows the internet is all about downloading.

And besides, nobody needs or wants Matrix operations anyway. Did you see how bad Matrix Reloaded was? That was *just* reloading, imagine how bad Matrix Multiplying is. You get the idea.

Link to previous discussion on same/similar sub... by 8282now · 2004-05-08 18:51 · Score: 5, Informative

http://developers.slashdot.org/article.pl?sid=03/12/21/169200&mode=thread&tid=152&tid=185

Googled HTML by balster+neb · 2004-05-08 18:54 · Score: 5, Informative

Here's an HTML version of the PDF, thanks to Google.

video stuff by rexguo · 2004-05-08 18:54 · Score: 4, Interesting

At my workplace, I'm looking into using GPUs to do video analysis: things like cut-scene detection, generating multi-resolution versions of a video frame, applying video effects, and other proprietary technologies that were previously done on the CPU. The combination of pixel shaders and floating-point buffers really makes GPUs a Super-SIMD machine if you know how to exploit it.
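A minimal sketch of the kind of per-pixel work described above, assuming frames arrive as equal-sized 2D lists of grayscale values. The function names and the threshold of 30 are hypothetical, not rexguo's actual method. The per-pixel absolute difference is the sort of independent operation a pixel shader would run on every pixel at once; the averaging step is a reduction, which on hardware of this era is commonly done with repeated downsampling passes.

```python
from typing import List

Frame = List[List[int]]  # grayscale frame, values 0-255

def pixel_diff(a: Frame, b: Frame) -> Frame:
    """Per-pixel absolute difference: each output pixel depends only on
    the two input pixels at the same position (shader-style work)."""
    return [[abs(pa - pb) for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

def mean_value(frame: Frame) -> float:
    """Reduction step: average over all pixels."""
    total = sum(sum(row) for row in frame)
    count = sum(len(row) for row in frame)
    return total / count

def detect_cuts(frames: List[Frame], threshold: float = 30.0) -> List[int]:
    """Return indices i where frame i differs sharply from frame i-1.
    The default threshold is a hypothetical tuning value."""
    cuts = []
    for i in range(1, len(frames)):
        if mean_value(pixel_diff(frames[i - 1], frames[i])) > threshold:
            cuts.append(i)
    return cuts

if __name__ == "__main__":
    dark = [[10] * 4 for _ in range(4)]
    bright = [[200] * 4 for _ in range(4)]
    print(detect_cuts([dark, dark, bright, bright]))  # [2]
```

A real detector would likely compare histograms or use adaptive thresholds; the mean absolute difference is just the simplest measure that shows the map-then-reduce shape.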

--
www.rexguo.com - Technologist + Designer

As has been said many time before ... by keltor · 2004-05-08 18:55 · Score: 5, Insightful

GPUs are very fast ... at performing vector and matrix calculations. This is the whole point. If general computing CPUs were capable of doing vector or matrix calcs very efficiently, we would probably not have GPUs.

Re:As has been said many time before ... by lazy_arabica · 2004-05-08 19:50 · Score: 5, Interesting

GPUs are very fast ... at performing vector and matrix calculations. This is the whole point. If general computing CPUs were capable of doing vector or matrix calcs very efficiently, we would probably not have GPUs.
Yes. But 3D graphics are not the only use of these mathematical objects; I wonder if it would be possible to use a GPU to perform video encoding or digital sound manipulation at a higher speed, as both operations require matrices. I'm also sure they could take advantage of these processors' vector manipulation capabilities.
Re:As has been said many time before ... by Slack3r78 · 2004-05-09 04:24 · Score: 3, Informative

Actually, the GeForce 6800 includes the hardware to do just that. I'm surprised no one else has mentioned it by now, as I thought it was one of the cooler features of the new NV40 chip.
Re:As has been said many time before ... by Anonymous Coward · 2004-05-09 06:10 · Score: 3, Informative

ATI has had this for even longer. The All-in-Wonder series uses the video card to do accelerated encoding and decoding.

Also, I believe that MPlayer, the best video player/encoder I have seen, also uses OpenGL (and thus the video card, on a properly configured system) for playback.

Personally, I don't think there is anything really new in this article.

178 Million in the P4EE by 2megs · 2004-05-08 18:57 · Score: 5, Insightful

The Pentium 4 EE actually has 178 million transistors, which puts it in between ATI's and NVIDIA's latest.

In all of this, keep in mind that there's computing and there's computing... The kind of computing power in a GPU is excellent for doing the same numeric computation to every element of a large vector or matrix, not so much for branchy, decision-heavy things like walking a binary tree. You wouldn't want to run a database on something structured like a GPU (or an old vector-processing Cray), but something like a weather simulation or molecular modeling could be perfect for it.
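To make the distinction concrete, here is a minimal Python sketch (plain lists, no GPU involved; all names are illustrative). The first function is the "same computation on every element" pattern; the second walks a binary search tree, where each step depends on the outcome of the previous comparison, so there is nothing to run side by side.

```python
from typing import Callable, List, Optional, Tuple

def apply_kernel(kernel: Callable[[float], float], data: List[float]) -> List[float]:
    """Data-parallel pattern: the same function runs on every element,
    and no result depends on another, so all of them could be computed
    at once (as a GPU runs a shader over every pixel)."""
    return [kernel(x) for x in data]

# A binary search tree node: (key, left, right); None is an empty subtree.
Node = Optional[Tuple[int, "Node", "Node"]]

def tree_contains(node: Node, key: int) -> bool:
    """Branchy pattern: each step picks a path based on the previous
    comparison, so the work is sequential and data-dependent."""
    while node is not None:
        k, left, right = node
        if key == k:
            return True
        node = left if key < k else right
    return False

if __name__ == "__main__":
    print(apply_kernel(lambda x: 2.0 * x + 1.0, [0.0, 1.0, 2.0]))  # [1.0, 3.0, 5.0]
    tree = (5, (2, None, None), (8, None, None))
    print(tree_contains(tree, 8), tree_contains(tree, 3))  # True False
```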

The similarities of a GPU to a vector processing system bring up an interesting possibility...could Fortran see a renaissance for writing shader programs?

Re:178 Million in the P4EE by Knightmare · 2004-05-08 18:59 · Score: 5, Informative

Yes, it's true that it has that many transistors BUT, only 29 million of them are part of the core, the rest is memory. The transistor count on the video cards does not count the ram.
Re:178 Million in the P4EE by LinuxGeek · 2004-05-08 19:27 · Score: 3, Informative

If they are ignoring the cache on the P4 EE, then why mention the Extreme Edition at all? Cache size is the only difference between the Xeon-based EE and a regular Northwood P4. Also, modern GPUs certainly do have cache. Read this old GeForce4 preview.
The Light Speed Memory Architecture (LMA) that was present in the GeForce3 has been upgraded as well, with its major advancements in what nVidia calls Quad Cache. Quad Cache includes a Vertex Cache, Primitive Cache, Texture Cache and Pixel Caches. With functions similar to those of caches on CPUs, these are specific: they store exactly what they say.
Another good article has a block diagram showing the cache structures of the GeForce FX GPU. Nvidia and ATI both keep quiet about the cache sizes on their GPUs, but that doesn't mean that the full transistor count is dedicated to the processing core.

--

Kindness is the language which the deaf can hear and the blind can see. - Mark Twain
Re:178 Million in the P4EE by alphakappa · 2004-05-08 19:44 · Score: 2, Funny

Please ANYTHING BUT FORTRAN!!!!!!! Seriously, FORTRAN needs serious reworking to be user friendly in today's age. It was fine a decade or two ago when people were not used to user friendly languages. COBOL anyone? FORTRAN has its uses, but it's horribly, horribly tough to use if you want to combine number crunching with other stuff such as string manipulation.

--
"When the only tool you own is a hammer, every problem begins to resemble a nail." - Abraham Maslow (1908-1970)
Re:178 Million in the P4EE by gunix · 2004-05-08 20:09 · Score: 5, Insightful

Well, it's like UNIX: it's user-friendly, it just selects its friends very carefully.
IMHO, the perfect friend is someone interested in maximum performance who knows how to program and knows something about computer hardware.

Have you looked at Fortran 90, 95 or 2000?

--
Evolution of Language Through The Ages: 6000 BC : ungh, grrf, booga 2000 AD : grep, awk, sed
Re:178 Million in the P4EE by Bender_ · 2004-05-08 21:02 · Score: 3, Insightful

The transistor count on the video cards does not count the ram

How do you know? In fact, modern GPUs require a large number of small, scattered memory blocks: texture caches, FIFOs for fragments/pixels/texels when they are not in sync, caches for vertex shader and pixel shader programs, etc.

More recent GPUs are notorious for their incredibly long latencies. Long latencies imply that a lot of data has to be stored on chip.
Re:178 Million in the P4EE by Hast · 2004-05-08 21:31 · Score: 2, Informative

Well, it's really more that the pipelines are very long: on the order of 600 pipeline stages, and that's pretty damned long. (The P4, a CPU with a deep pipeline, has 21 stages IIRC.)

They do of course store data between those stages, and there are caches on the chip. Otherwise performance would be shot all to hell.

I doubt that the original statement that GPU designs don't count the on chip memory is correct. That just seems like an odd way to do it.
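A quick worked equation for why a very deep pipeline need not hurt throughput on streaming work, assuming an idealized pipeline of S stages that accepts one new item per cycle and never stalls (real GPUs do stall, so treat this as a best case). Processing N items then takes:

```latex
T(N) = S + N - 1 \quad \text{cycles}

S = 600,\; N = 10^{6}: \quad T = 600 + 10^{6} - 1 = 1\,000\,599 \ \text{cycles}

\text{overhead} = \frac{S - 1}{N} = \frac{599}{10^{6}} \approx 0.06\%
```

The depth only costs extra latency once per stream, which is why it suits the uniform pixel workloads discussed here. On branchy CPU code, every pipeline flush (e.g. after a branch mispredict) pays a cost that grows with S, which is one reason CPUs stay closer to the roughly 20 stages mentioned above.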
Re:178 Million in the P4EE by squiggleslash · 2004-05-09 01:49 · Score: 3, Informative

Seriously, FORTRAN needs serious reworking to be user friendly in today's age.(...) it's horribly, horribly tough to use if you want to combine number crunching with other stuff such as string manipulation.
Methinks you're confusing "user friendly" with "powerful". It's not that FORTRAN's string manipulation functions aren't user friendly, they're just crap.
(For those unaware of how FORTRAN does strings: they're stored in fixed-length arrays that are padded at the end with spaces. When you want to compare two strings you have to make them the same length with a "substr"-type operation (e.g. "STRING1(1:37) .EQ. STRING2(1:37)"). It's easy to use, just too crude to be usable.)
Saying FORTRAN isn't user friendly on the basis of its string handling is like saying Commodore BASIC 2 wasn't user friendly because of its procedures, erm, I mean subroutines. What could be hard about GOSUB and RETURN? Nothing. It's just they're not very useful.

--
You are not alone. This is not normal. None of this is normal.
Re:178 Million in the P4EE by mc6809e · 2004-05-09 03:00 · Score: 2, Informative

Yes, it's true that it has that many transistors BUT, only 29 million of them are part of the core, the rest is memory. The transistor count on the video cards does not count the ram.

Sure it does, it's just that the ram isn't cache, it's mostly huge register files.

Website on this topic by Anonymous Coward · 2004-05-08 18:57 · Score: 5, Informative

General-purpose computation using graphics hardware has been a significant topic of study for the last few years. Pointers to a lot of papers and discussion on the subject are available at: www.gpgpu.org

Re:Not the Point by JonoPlop · 2004-05-08 18:59 · Score: 4, Interesting

The whole point of graphic cards is that they have a dedicated purpose. Using the cards for anything that is general purpose is like using a motorcycle to tow a pop-up camper.

No, it's like using your pop-up camper for storage space when you're not using it on holidays.

While not playing games? by pyrrhonist · 2004-05-08 19:01 · Score: 4, Funny

What could my video card be doing for me while I am not playing the latest 3d games?

Two words: virtual pr0n

--
Show me on the doll where his noodly appendage touched you.

DSP using GPUs by crushinghellhammer · 2004-05-08 19:01 · Score: 3, Interesting

Does anybody know of pointers to papers/research pertaining to using GPUs to perform digital signal processing for, say, real-time audio? Replies would be much appreciated.

PDF to HTML by Libraryman · 2004-05-08 19:02 · Score: 2, Informative

Here is a link at Adobe where you can turn any PDF into HTML.

Hacking the GPU by nihilogos · 2004-05-08 19:03 · Score: 5, Informative

"Hacking the GPU" is a course that has been offered at Caltech since last summer on using GPUs for numerical work. The course page is here.

--
:wq

What comes next. by CherniyVolk · 2004-05-08 19:04 · Score: 5, Funny

"Utilize the sheer computing power of your video card!"

New market blitz, hmmmm.

SETI ports their code, and within five days their average completed work units increase 1000-fold. 13 hours later, they have evidence of intelligent life at 30000 locations within one degree.

Microsoft gets the hint, and comes out with a brilliant plan to utilize GPUs to speed up their OS and add bells and whistles to their UI.

And, once again, Apple and Quartz Extreme are ignored.

Re:What comes next. by Barbarian · 2004-05-08 19:12 · Score: 4, Funny

Then they throw away the results because the GPUs are not able to calculate in double-precision floating point, but only at 24 or 32 bits.
Re:What comes next. by renoX · 2004-05-08 20:55 · Score: 2, Insightful

Yes, one thing shocked me in their paper: they don't talk much about the precision they use.

Strange, because it is a big problem for using GPUs as coprocessors: scientific computation usually uses 64-bit floats, or 80-bit floats on Intel!
Re:What comes next. by SmackCrackandPot · 2004-05-09 08:56 · Score: 3, Informative

64-bit floating point texture filtering and blending and support for the D3D vertex and pixel shader 3.0 standard,

That's 64-bits for a four element vector (RGBA) or (XYZW), which is thus 16-bits per float. This is referred to as the 'half' floating point data type, as opposed to 'float' or 'double'. This is compatible with Renderman.
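To see what the bit widths in this subthread mean in practice, here is a minimal sketch using Python's standard `struct` module, which can pack IEEE 754 half ('e'), single ('f') and double ('d') values. It checks the 64-bits-per-four-halves arithmetic above and shows how much of 0.1 survives each format. It says nothing about how any particular GPU rounds internally, and Python's `struct` has no 80-bit x87 format, so that case is not shown.

```python
import struct

def round_trip(value: float, fmt: str) -> float:
    """Store `value` in an IEEE 754 format and read it back.
    'e' = 16-bit half, 'f' = 32-bit single, 'd' = 64-bit double."""
    return struct.unpack("<" + fmt, struct.pack("<" + fmt, value))[0]

if __name__ == "__main__":
    # Four half-precision components (RGBA or XYZW) fill 64 bits.
    print(len(struct.pack("<4e", 0.1, 0.2, 0.3, 1.0)) * 8)  # 64
    for fmt, name in (("e", "half"), ("f", "single"), ("d", "double")):
        stored = round_trip(0.1, fmt)
        print(f"{name:6} {stored!r:24} error={abs(stored - 0.1):.2e}")
```

Half precision keeps only about three significant decimal digits, which is fine for color blending but the reason the thread's scientific users worry about precision.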

It's nice, but could be nicer by Anonymous Coward · 2004-05-08 19:05 · Score: 5, Informative

Before you get excited, just remember how asymmetric the AGP bus is. Those GPUs will be put to much better use when we get them as 64-bit PCI cards.

Re:Not the Point by Amiga+Lover · 2004-05-08 19:08 · Score: 4, Insightful

The whole point of graphic cards is that they have a dedicated purpose. Using the cards for anything that is general purpose is like using a motorcycle to tow a pop-up camper.

What's relevant is that to the processor on a graphics card, its dedicated purpose is simply a bunch of logic. There's no dedicated "this must be used for pixels only, all else is waste" logic inherent in the system. There are MANY purposes for which the same or similar logic that applies in generating 3D imagery can be used, and that seems to be the purpose of this paper: run THOSE types of operations on the GPU. No doubt there are some things they won't be able to do well, but the things they can do, they can do extremely well.

Not just the GPU : the RAM by ratboot · 2004-05-08 19:10 · Score: 5, Interesting

What's interesting about new video cards is their memory capacity, 128 or 256 MB, and that on some new cards this memory is accessible at 900 MHz over a 256-bit data path (which is a lot faster than a CPU with DDR400 installed).
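The comparison can be made concrete with peak-bandwidth arithmetic, assuming the quoted 900 MHz is the effective data rate and that the CPU side is a single 64-bit DDR400 channel (dual-channel boards double that figure). These are theoretical peaks, not measured throughput.

```latex
\text{GPU: } 900 \times 10^{6}\ \tfrac{\text{transfers}}{\text{s}} \times \tfrac{256\ \text{bits}}{8\ \text{bits/byte}} = 28.8 \times 10^{9}\ \text{bytes/s}

\text{DDR400: } 400 \times 10^{6}\ \tfrac{\text{transfers}}{\text{s}} \times \tfrac{64\ \text{bits}}{8\ \text{bits/byte}} = 3.2 \times 10^{9}\ \text{bytes/s}
```

That is roughly 9 times the single-channel figure, or 4.5 times a dual-channel DDR400 setup (6.4 GB/s).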

Re:Not just the GPU : the RAM by drinkypoo · 2004-05-11 12:46 · Score: 2, Interesting

The part that annoys me is that it's all the same speed. Texture memory doesn't have to be nearly as fast as video memory, and furthermore you could have two classes of texture memory, which will make sense as video cards reach and exceed 512MB. There have in the past been video cards with high-speed video memory and something like EDO for textures, which makes a lot of sense, especially if you're willing to cache the most-used textures somewhere in video memory.
Is it just me, or should the cards have maybe 64 or 128MB of high speed memory, and then a couple of DIMM slots that take ordinary DDR SDRAM? That would still be pretty fast stuff, especially if the cards had dual-channel memory controllers, and plenty fast enough for textures. The card could cache the most-used textures in whatever video memory was left after drawing screens.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"

Wow by cubicledrone · 2004-05-08 19:10 · Score: 5, Interesting

All that processing power, and the latest games still run at about 22 frames per second, if that.

The CPU can do six billion instructions a second, the GPU can do 18 billion, and every last cycle is being used to stuff a 40MB texture into memory faster. What a waste. Yeah, the walls are even more green and slimy. Whoop-de-fucking-do.

Wouldn't it be great if all that processing power could be used for something other than yet-another-graphics-demo?

Like, maybe some new and innovative gameplay?

--
Business isn't willing to pay for products, innovation and careers, so we get brands, mortgage commercials and layoffs.

Re:Wow by PitaBred · 2004-05-08 20:23 · Score: 4, Insightful

You don't seem to understand that GPUs are very special-purpose computing devices. They aren't general-purpose processors like your CPU. They crunch matrices, and that's about it. Even all the programmable stuff is just putting parameters on the matrix churning.

--
My blog. Good stuff (when I remember to update it). Read it.

audio stuff by RobPiano · 2004-05-08 19:12 · Score: 4, Interesting

At my work we do audio stuff. It would be really neat if I could do some of the more complicated audio analysis (FFT etc.) that requires lots of vector math using the video card's GPU. There is probably even some way you could sync the timing for multimedia stuff.

I know nothing about CPU design though

Re:audio stuff by Hast · 2004-05-08 21:25 · Score: 3, Informative

Look at gpgpu.org; I believe they have papers on doing FFTs on GPUs. They also have a collection of papers on using GPUs as CPUs.
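For readers wondering why the FFT keeps coming up in GPU discussions, here is a minimal CPU-side Python sketch of the textbook iterative radix-2 FFT. It is not code from any of the gpgpu.org papers. The point is its structure: after a bit-reversal reordering, each of the log2(n) stages applies the same butterfly to n/2 independent pairs, so a stage maps naturally onto one data-parallel pass. The naive DFT is included only to check the result.

```python
import cmath
from typing import List, Sequence

def bit_reverse(i: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft(signal: Sequence[complex]) -> List[complex]:
    """Iterative radix-2 Cooley-Tukey FFT; len(signal) must be a power of two."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    out = [signal[bit_reverse(i, bits)] for i in range(n)]  # butterfly order
    size = 2
    while size <= n:  # one stage per doubling of the butterfly span
        half = size // 2
        step = cmath.exp(-2j * cmath.pi / size)  # twiddle-factor base for this stage
        for start in range(0, n, size):  # independent groups: parallel work
            w = 1 + 0j
            for k in range(half):
                even = out[start + k]
                odd = w * out[start + k + half]
                out[start + k] = even + odd
                out[start + k + half] = even - odd
                w *= step
        size *= 2
    return out

def dft(signal: Sequence[complex]) -> List[complex]:
    """O(n^2) reference DFT, used only to check fft()."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

if __name__ == "__main__":
    x = [1, 2, 3, 4, 0, 0, 0, 0]
    print(max(abs(a - b) for a, b in zip(fft(x), dft(x))))
```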
Re:audio stuff by zsazsa · 2004-05-09 01:17 · Score: 2, Informative

It would be really neat if I could do some of the more complicated audio analysis (FFT etc.) that requires lots of vector math using the video card's GPU.

There's a company that actually does this. The Universal Audio UAD-1 audio DSP had a previous life as a video card and a DVD hardware accelerator. Check out this thread on the UAD forums for more technical information.

This is BIG by macrealist · 2004-05-08 19:18 · Score: 5, Insightful

Creating a way to use the specialized GPUs for vector processing that is not graphics-related is ingenious. Like a lot of great ideas, it is sooo obvious AFTER you see someone else do it.

Don't miss the point that this is not intended for general purpose computing. Don't port OoO to the graphics chip.

Where it is huge is in signal processing. FPGAs have begun replacing even the G4s in this area recently because of the huge gains in speed vs. power consumption an FPGA affords. However, FPGAs are not bought and used as is, and end up costing a significant amount (of development time/money) to become useful. Being able to use these commodity GPUs for vector processing creates a very desirable price/processing power/power consumption option. If I were nVIDIA or ATI, I would be shoveling these guys money to continue their work.

--
I am living proof of the Peter Principle

Siggraph 2003 by Adam_Trask · 2004-05-08 19:21 · Score: 5, Informative

Check out the publication list in Siggraph 2003. There is a whole section named "Computation on GPUs" (papers listed below). And the papers for Siggraph 2004 should be out shortly.

If you have a matrix solver, there is no telling what you can do. And as I remember, these papers show that the GPU versions run faster than the same matrix calculations on the CPU.

# Linear Algebra Operators for GPU Implementation of Numerical Algorithms
Jens Krüger, Rüdiger Westermann

# Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
Jeff Bolz, Ian Farmer, Eitan Grinspun, Peter Schröder

# Nonlinear Optimization Framework for Image-Based Modeling on Programmable Graphics Hardware
Karl E. Hillesland, Sergey Molinov, Radek Grzeszczuk
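The second paper above names conjugate gradients, so here is a minimal CPU-side Python sketch of textbook CG for a small dense symmetric positive-definite matrix. The GPU papers work with their own (often sparse) storage layouts, which this sketch does not attempt. It shows why the method suits a matrix engine: each iteration is just one matrix-vector product, two dot products and three element-wise vector updates, building blocks of the kind the first paper's title refers to. The tolerance and iteration cap are arbitrary illustrative defaults.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]

def matvec(a: Matrix, x: Vector) -> Vector:
    """Matrix-vector product: each output row is independent."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def dot(x: Vector, y: Vector) -> float:
    """Dot product: an element-wise multiply followed by a reduction."""
    return sum(xi * yi for xi, yi in zip(x, y))

def axpy(alpha: float, x: Vector, y: Vector) -> Vector:
    """Return alpha*x + y, an element-wise vector update."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def conjugate_gradient(a: Matrix, b: Vector, tol: float = 1e-10,
                       max_iter: int = 1000) -> Vector:
    """Solve a @ x = b for symmetric positive-definite a, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - a @ x, with x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = matvec(a, p)
        alpha = rs_old / dot(p, ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, ap, r)
        rs_new = dot(r, r)
        p = axpy(rs_new / rs_old, p, r)  # p = r + beta * p
        rs_old = rs_new
    return x

if __name__ == "__main__":
    print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```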

http://www.gpgpu.org/ is a great resource by aancsiid · 2004-05-08 19:21 · Score: 4, Interesting

http://www.gpgpu.org/ is a great resource for general purpose graphics processor usage.

here ya go by dave1g · 2004-05-08 19:22 · Score: 3, Informative

Someone else posted this...

www.gpgpu.org

Website on this topic (Score:0)
by Anonymous Coward on Sunday May 09, @01:57AM (#9098550)
General-purpose computation using graphics hardware has been a significant topic of study for the last few years. Pointers to a lot of papers and discussion on the subject are available at: www.gpgpu.org

Not so... by oboylet · 2004-05-08 19:27 · Score: 4, Interesting

High-powered GPUs can make for really good general-purpose devices.

Apple's Newton had no CPU, only a GPU that was more than adequate.

Ideas like these are good in general. I'd like to see the industry move away from the CPU-as-chief status quo. Amigas were years ahead of their time in large part because the emphasis wasn't as much on central processing. The CPU did only what it was supposed to do -- hand out instructions to the gfx and audio subsystems.

Hardly using a "motorcycle to tow a pop-up camper." If anything, the conventional wisdom is, "when all you have is a hammer, everything looks like a nail."

Re:Not so... by Anonymous Coward · 2004-05-08 20:21 · Score: 2, Informative

Hmm. My Newton has a "160Mhz StrongARM SA-110 RISC Processor". Doesn't sound like a GPU to me.

and a sourceforge project too by Lord+Prox · 2004-05-08 19:28 · Score: 4, Informative

BrookGPU
from the BrookGPU website...
As the programmability and performance of modern GPUs continues to increase, many researchers are looking to graphics hardware to solve problems previously performed on general purpose CPUs. In many cases, performing general purpose computation on graphics hardware can provide a significant advantage over implementations on traditional CPUs. However, if GPUs are to become a powerful processing resource, it is important to establish the correct abstraction of the hardware; this will encourage efficient application design as well as an optimizable interface for hardware designers.

From what I understand, this project is aimed at making an abstraction layer for GPU hardware so that writing code to run on it is easier and standardized.

Re:and a sourceforge project too by WinterpegCanuck · 2004-05-08 22:11 · Score: 2, Interesting

What about a general abstraction layer at the OS level? I am by no means at that programming level, but couldn't calculations that are proven to run well on GPUs (ints, maybe?) be redirected there by the OS, with the rest sent to the CPU as normal? To me this would benefit all programs running on the system (except the games that want exclusive GPU use), instead of only those coded to take advantage of it. I know a few programs in the oil industry that could use all the bogomips they could get.

So when do we get unified memory? by Anonymous Coward · 2004-05-08 19:28 · Score: 2, Interesting

Many of the problems stated in using a GPU for non-graphics tasks would be implicitly solved if the GPU and CPU shared memory. While this would slightly slow down the GPU's memory access, in 3 years, I don't think that would be an issue. Especially compared to the benefits of having only one memory pool.

I can see it now.... by TypoNAM · 2004-05-08 19:29 · Score: 3, Interesting

...Several indies and companies figure out how to use the powerful GPUs in an efficient manner that would benefit everyone who uses computers on a daily basis, improving the usefulness of the computer and making it the best thing in the world again; then some greedy bastard comes along flashing a patent granted by the U.S. Patent Office, which leaves us all screwed...

Ohh well the idea was good while it lasted. ;)

--
This space is not for rent.

Imagine... by rokzy · 2004-05-08 19:32 · Score: 4, Interesting

a beowulf cluster of them.

Seriously, we have a 16-node Beowulf cluster, and each node has an unnecessarily good graphics card in it. A lot of the calculations are matrix-based, e.g. several variables, each 1 × thousands (1D) or hundreds × hundreds (2D).

how feasible and worthwhile do you think it would be to tap into the extra processing power?

Re:Imagine... by BiggerIsBetter · 2004-05-08 20:12 · Score: 2, Interesting

It's a good idea if your datasets take long enough to process. You could run six or so cards in your machine (maybe one super-fast AGP card and five slowish PCI ones, e.g. FX5200s), send a dataset to each GPU and to the main CPU, then get the results back. The trick is to keep them all working without blowing all your bandwidth or your PSU. It also depends on the precision required, because the GPU only does 32-bit FP, compared to 80 bits for the CPU.

All I can suggest is download the Brook libraries and try it out. See if it helps, and see if the results are accurate enough. And yes, Fortran can be used if you can bind it - Intel's compiler suite worked for me.

--
Forget thrust, drag, lift and weight. Airplanes fly because of money.

When... by alexandre · 2004-05-08 19:32 · Score: 2, Insightful

...will someone finally port John the Ripper to a new video card's graphics pipeline? :)

Pseudo repost by grape+jelly · 2004-05-08 19:42 · Score: 4, Informative

I thought this looked familiar:

http://developers.slashdot.org/developers/03/12/21/169200.shtml?tid=152&tid=185

At least, I would imagine most of the comments would be the same or similar....

Finally by Pan+T.+Hose · 2004-05-08 19:45 · Score: 5, Funny

Using GPUs For General-Purpose Computing

I'm glad that finally they started to use the General-Purpose Unit. What took them so long?

--
Sincerely,
Pan Tarhei Hosé, PhD.
"Homo sum et cogito ergo odi profanum vulgus et libido."

Maybe time for a new generation of math-processor? by Anonymous Coward · 2004-05-08 19:47 · Score: 4, Insightful

Remember the co-processors? Well, actually I don't (I'm a tad too young). But I know about them.

Maybe it's time to start making co-processing add-on cards for advanced operations such as matrix mults and other operations that can be done in parallel at a low level. Add to that a couple of hundred megs of RAM and you have a neat little helper for raytracing etc. You could easily emulate the cards if you didn't have (or need) them. The branchy nature of the program itself would not affect the performance of the co-processor, since it would only be used for calculations.

I for one would like to see this.

Re:Altivec by John+Starks · 2004-05-08 19:48 · Score: 2, Informative

I would guess the difference would be comparable. Altivec is no more impressive than the SSE/SSE2/etc. types of instructions of the modern x86.

Re:Not the Point by kfg · 2004-05-08 19:48 · Score: 5, Funny

Dude, you obviously have never tried to sleep in a motorcycle.

KFG

Documentation by Detritus · 2004-05-08 19:51 · Score: 2, Interesting

Do any of the video chip manufacturers make free and complete documentation available for their GPUs? Everything that I have read in the past has said that they are encumbered with NDAs and claims of trade secrets. I'd prefer not to waste my time dealing with companies that treat their customers as potential enemies.

--
Mea navis aericumbens anguillis abundat

Frogger by BiggerIsBetter · 2004-05-08 19:52 · Score: 4, Interesting

Some dude wrote Frogger almost entirely in pixel shaders. http://www.beyond3d.com/articles/shadercomp/results/ (2nd from the bottom).

--
Forget thrust, drag, lift and weight. Airplanes fly because of money.

Re:Not the Point-headbanger. by Amiga+Lover · 2004-05-08 20:00 · Score: 4, Insightful

There is, however, one thing to keep in mind. Presently our GPUs may have the headroom to play with, but with Apple's Quartz and Microsoft's Longhorn, let alone what's coming with X, that headroom may disappear, and our video cards will have to go back to being video cards.

On those operating systems that require them, that could very well be.

Still, it's a nice thought that a Linux box without even X installed, but with a kickass graphics card, could crunch away at something 4 times quicker than any windowed machine.

Bass Ackwards? by Anonymous Coward · 2004-05-08 20:01 · Score: 5, Insightful

Perhaps offloading the CPU to the GPU is the wrong way to look at things? With the apparently imminent arrival of commodity (low-power) multi-CPU chips, maybe we should be considering what we need to add to perform graphics more efficiently (à la MMX et al.)?

While it's true that general-purpose hardware will never perform as well or as efficiently as a design specifically targeted to the task (or at least it had better not), it is equally true that general-purpose/commodity hardware will eventually reach a price-performance point where it is more than "good enough" for the majority.

Re:Maybe that's the answer... by trg83 · 2004-05-08 20:11 · Score: 3, Interesting

From the link you mentioned: "while Apple used a compiler you've never heard of (at least in the x86 world)."

My understanding is that they used GCC.

Further, "Another said that some version of Linux had to be used to compare apples to apples. Well, MacOS X isn't Linux, and the desktop standard for x86 machines is Windows (not that using a properly optimized Linux bothered the Opterons very much). You want to know what machine is fastest, you test in their native environment."

Oh, silly me. Processors are so obviously made to run only one operating system!

I'll take this site's info with a grain of salt.

Violation of Compartmentalization by BlakeB395 · 2004-05-08 20:13 · Score: 2, Insightful

From a design standpoint, I can imagine a GPU that donates its power to the CPU would be a nightmare. It violates the fundamental tenet that everything should do one thing and do it well. OTOH, that tenet focuses on simplicity and maintainability over performance. Is such a tradeoff worth it?

Re:Violation of Compartmentalization by evilviper · 2004-05-08 21:55 · Score: 3, Insightful

It violates the fundamental tenet that everything should do one thing and do it well.

No, having a CPU that does everything is what violates the tenet.

I don't know about you, but I don't have a chip that does my video processing for me, I don't have a chip that does all the encryption for me, and I don't have a chip that handles (en/de)capsulating network traffic, as well as handling interrupts and routing.

Having a second processor that does some specialized work that a CPU isn't good at is an improvement, not a nightmare. I'd love to be able to plug a chip or two into my PC and have them do better-than-realtime MPEG-4 encoding that doesn't affect my processor at all... Who wouldn't?

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant

Re:Link to previous discussion on same/similar sub by hype7 · 2004-05-08 20:15 · Score: 4, Interesting

There's some good stuff in there.

However, it seems a few organisations have actually beaten us to it.

Apple, for example, uses the 3D aspect of the GPU to accelerate its 2D compositing system with Quartz Extreme. Microsoft, as usual, announced the feature after Apple shipped it, and with any luck Windows users might have it by 2007.

-- james

Re:Maybe that's the answer... by John+Starks · 2004-05-08 20:23 · Score: 3, Insightful

GCC is an inferior compiler for the x86, whether you like it or not. Intel's optimizing C/C++ compiler is much faster according to numerous benchmarks (I'm sorry, it's too late to find the links.) On the other hand, I understand that GCC is great on the Mac, since Apple optimized it properly. (Certainly I appreciate the hard work of the various GCC teams over the years; hopefully new optimizations will continue to improve the quality of the release until it is as fast as Intel's offerings.)

In any case, why do you believe all of Apple's conveniently high numbers, but you don't believe SPEC numbers reported by Dell, AMD, etc.? These are not numbers pulled out of a hat; they are standard SPEC results. Thus, the numbers should be comparable from company to company. But Apple retested other companies' products and released new numbers without properly optimizing for the x86. Why is it that when Microsoft pays for benchmarks, people freak out, but when Apple PERFORMS benchmarks, people believe them instantly?

There are plenty of other links out there that provide similar information. It is patently false advertising for Apple to claim that they use the fastest chip of any PC.

Oh, and re: the Linux issue, you're right. But you'll find that the x86 is faster in Linux with a proper optimizing compiler.

My issue is basically that at best -- at best! -- the results are inconclusive. At worst, Apple blatantly lied. It's foolish to believe Apple blindly just because they're the underdogs and produce a pretty, Unix-based OS. And it's foolish to hold this strange hatred for all that is x86. I don't understand this mentality.

GPU = by greppling · 2004-05-08 20:29 · Score: 4, Funny

Now I finally understand that acronym: General purpose unit!

AGP read latency not important when not real time. by anti-NAT · 2004-05-08 20:32 · Score: 2, Insightful

These applications are not likely to generate or process data at such a rate that the slow AGP read speed will matter that much, if at all.

--
The Internet's nature is peer to peer - 20050301_cs_profs.pdf

Re:Unused computing Power? by PitaBred · 2004-05-08 20:37 · Score: 4, Insightful

Lemme try to help:
a) Not equal. Apples and oranges. A GPU will do repeated calculations very, very fast, like matrix transforms and the like. A CPU, on the other hand, will make decisions based on input, rather than just crunching numbers.
b) The main display (the GUI) already uses many tricks on the graphics card. The hard part is making sure that all graphics cards support the features. Things like the XRender extension are becoming more common as graphics cards and drivers get "standard" capabilities.
c) Your imagination is the limit as to what it could be used for. Just realize that it's a good data processing unit, not a good program execution unit. Use each for their strengths.
d) Modified? With new cards/drivers, all it takes is OpenGL calls to start taking advantage of this power. All it really takes is someone who knows what they're doing and has a bit of inspiration.

--
My blog. Good stuff (when I remember to update it). Read it.

Re:Maybe time for a new generation of math-process by BlueJay465 · 2004-05-08 20:37 · Score: 4, Informative

Well they already make DSP cards for audio processing. Simply do a google(TM) search for "DSP card" and you will get several vendors.

I can't imagine it would take a whole lot to hack them for just their processing power outside of audio applications.

transistor counts through the ages by nothings · 2004-05-08 20:42 · Score: 5, Informative

Transistor counts keep growing, so I keep updating this and reposting it about once a year.

486 : 1.2 million transistors
Pentium : 3 million transistors
Pentium Pro : 5.5 million transistors
Pentium 2 : 7.5 million transistors
Nvidia TNT2 : 9 million transistors
Alpha 21164 : 9.3 million (1994)
Alpha 21264 : 15.2 million (1998)
GeForce 256 : 23 million transistors
Pentium 3 : 28 million transistors
Pentium 4 : 42 million transistors
P4 Northwood : 55 million transistors
GeForce 3 : 57 million transistors
GeForce 4 : 63 million transistors
Radeon 9700 : 110 million transistors
GeForce FX : 125 million transistors
P4 Prescott : 125 million transistors
Radeon X800 : 160 million transistors
P4 EE : 178 million transistors
GeForce 6800 : 220 million transistors

here's the non-sucky version since <ecode> doesn't actually preserve spacing like <pre>.

Re: transistor counts through the ages by Black+Parrot · 2004-05-08 22:21 · Score: 2, Informative

> Transistor counts keep growing, so I keep updating this and reposting it about once a year.

For those who don't already know, what we now think of as "Moore's Law" was originally a statement about the rate of growth in the number of transistors on a chip, not about CPU speed.
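
The doubling rate can be checked against the list above. A minimal sketch, assuming the 486 shipped in 1989 and the P4 Prescott in 2004 (those years are my assumption, not part of the list): the doubling time is the elapsed years divided by log2 of the transistor-count ratio.

```python
import math


def doubling_time(count_start: float, count_end: float, years: float) -> float:
    """Years per doubling implied by growth from count_start to count_end."""
    doublings = math.log2(count_end / count_start)
    return years / doublings


# Assumed introduction years (not in the list): 486 in 1989, P4 Prescott in 2004.
t = doubling_time(1.2e6, 125e6, 2004 - 1989)
print(f"about {t:.1f} years per doubling")
```

For those two chips it comes out at a little over two years per doubling, in line with the usual two-year statement of Moore's Law.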

--
Sheesh, evil *and* a jerk. -- Jade

I think I speak for many of us by Sycraft-fu · 2004-05-08 20:42 · Score: 5, Insightful

When I say oh shut the fuck up.

Sorry for the flames, but seriously, I get so damn sick of all the "all new games suck" whiners. Look, there are legit reasons to want new technology. It is nice to have better graphics, more realistic sound, etc. It is NICE to have a game that looks and sounds more like reality. Yes, that doesn't make the game great, but that doesn't mean it's worthless.

What's more, don't pretend like all modern games suck while old games ruled. That's a bunch of bullshit. Sure, there are plenty of modern games that suck, but guess what? There are tons of old games that suck too. Thing is, you just tend to forget about them. You remember the greats that you enjoyed or heard about, the ones that helped shape gaming today. You forget all the utter shit that was released, just as is released today.

So get off it. If you don't like nice graphics, fine. Stick with old games, no one is forcing you to upgrade. But don't pretend like there is no reason to want better graphics in games.

Re:I think I speak for many of us by Tim+C · 2004-05-08 21:16 · Score: 5, Insightful

Hear, hear.

There's something that's always puzzled me a little about this site - attached to every single article about some new piece of PC tech - a faster processor, better graphics card, etc - there are a number of comments bemoaning the advance. All of them saying that people don't need the power/speed they have already, that they personally are just fine with 4 year old hardware, or, in this case, that better graphics don't make for better games. Hell, the same is true for mobile phones - I've lost count of the number of comments bemoaning advances in them, too.

It's funny, but I thought this was supposed to be a site for geeks; aren't geeks supposed to *like* newer, better toys?

To get back on topic - no, better graphics are not sufficient for a better game. However, if the gameplay is there, then they can certainly make the experience more enjoyable. Would Quake have been as much fun if it was rendered in wireframes?

Better graphics help add to the sense of realism, making the game a more immersive experience. The whole point of the majority of games is entertainment and (to an extent) escapism. Additionally, what a lot of people like the grand-parent poster seem to forget is that most of the big-name game engines are licensed for use in a number of games. Let people like id spend their time and money coming up with the most graphically intensive, realistic engine they can. Think Doom 3 will suck because the gameplay will be crap? Fine, then wait for someone to license the engine and create a better game with it. In the meantime, please shut up and remember that there are those of us who like things to be pretty, as well as useful/well made/fun/(good at $primaryPurpose).

Good graphics on their own won't make a good game, but they will help make a good game great.

--
It's official. Most of you are morons.

Re:Unused computing Power? by NanoGator · 2004-05-08 20:51 · Score: 3, Informative

"The graphics card has a lot of unused computing power, nearly equal to the main processor chip in the computer if not more, that is not being used when there is no game or video being played, right?"

Longhorn is supposed to offload a lot of the GUI stuff to the card. So yeah, it'd take advantage of the untapped power of the card. However, for other general-purpose stuff, it wouldn't be so interesting. It's kinda like comparing a Ferrari to a school bus. The Ferrari will run circles around the bus, but can only ferry 2 people. The bus can move a LOT of cargo, but not as fast as the Ferrari. We're talking about specialization here. The trick is to find ways to take what the GPU is good at and make it useful.

--
"Derp de derp."

Let me check my notes... by Impeesa · 2004-05-08 20:58 · Score: 4, Interesting

I did a paper on the topic of general-purpose GPU programming for my parallel computing course just this last semester here, interestingly enough. I believe our research indicated that even a single PCI card was so badly throttled by the bus throughput that it was basically useless. AGP does a lot better taking data in, but it's still pretty costly sending data back to the CPU. I have a feeling your proposed setup will be a whole lot more feasible if/when PCI Express becomes mainstream.

Re:Let me check my notes... by sonamchauhan · 2004-05-08 22:38 · Score: 2, Informative

Seems worth checking out: GPGPU.ORG - "General-Purpose Computation Using Graphics Hardware"

> AGP does a lot better taking data in, but it's still pretty
> costly sending data back to the CPU.
I've heard that mentioned a few times, is it true?

From the AGP 3.0 spec:
The AGP3.0 interface is designed to support several platform generations based upon 0.25 µm (and smaller) component silicon technology, spanning several technology generations. As with AGP2.0, the physical interface is designed to operate at a common clock frequency of 66 MHz. Its source synchronous data strobe operation, however, is octal-clocked and transfers eight double words (Dwords) of data within the span of time consumed by a single common clock cycle. The AGP3.0 data bus provides a peak theoretical bandwidth of 2.1 GB/s (32 bits per transfer at 533 MT/s). Both the common clock and source synchronous data strobe operation and protocols are similar to those employed by AGP2.0.
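
The 2.1 GB/s figure falls straight out of the quoted parameters. A quick sketch of the arithmetic, assuming the nominal 66.67 MHz common clock that the 533 MT/s figure implies (the quote rounds it to 66 MHz): clock times eight transfers per clock times 4 bytes per transfer. It is a theoretical peak and says nothing about what a driver actually achieves on reads back to the CPU.

```python
def peak_bandwidth(common_clock_hz: float, transfers_per_clock: int,
                   bytes_per_transfer: int) -> float:
    """Peak theoretical bus bandwidth in bytes per second."""
    return common_clock_hz * transfers_per_clock * bytes_per_transfer


# AGP 3.0: ~66.67 MHz common clock, octal-clocked strobe, 32 bits (4 bytes) per transfer.
agp3 = peak_bandwidth(200e6 / 3, 8, 4)
print(f"{agp3 / 1e9:.2f} GB/s")
```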

Later on Page 96:
Traditional AGP devices can demand up to the maximum bandwidth available over the AGP ports.
However, the AGP system does not guarantee to deliver the requested bandwidth, nor does it guarantee
transfers will take place within some clearly specified request/transfer latency time. ...
This is done by the system guaranteeing to process a specified number (N) of read or write transactions of a specified size (Y) during each isochronous time period (T). An AGP3.0 device can divide this bandwidth between read and write traffic as appropriate. Further, the system transfers isochronous data over the AGP3.0 Port within a specified latency (L).
(emphasis mine)

I'm no expert, just asking if the "low upstream bandwidth" assumption is true. If it is, there could still be some applications (e.g. simple data compression) that could use it. Also, maybe output from the VGA/DVI ports could be tapped.

Re:Let me check my notes... by sonamchauhan · 2004-05-09 02:21 · Score: 3, Interesting

Somewhere in this story, I found a post with a link that explains this is a software problem:
Notice that they're quick to point out the problem isn't likely a hardware issue. There should be plenty of bandwidth on the AGP bus, but graphics chip makers don't seem to have written their drivers to handle transfers from AGP cards to main memory properly.

Then they run some tests and conclude:
That means even if you can render high-quality images at 30 frames per second, you won't be able to get them out of the graphics card at anything near that rate.
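
To put that in numbers: a sketch of the bandwidth needed to pull every frame back to main memory, assuming a hypothetical 1024x768 frame at 4 bytes per pixel (RGBA8) and 30 frames per second.

```python
def readback_bandwidth(width: int, height: int, bytes_per_pixel: int, fps: float) -> float:
    """Bytes per second needed to copy every rendered frame back to main memory."""
    return width * height * bytes_per_pixel * fps


# Hypothetical target: 1024x768 RGBA8 frames at 30 frames per second.
need = readback_bandwidth(1024, 768, 4, 30)
print(f"{need / 1e6:.1f} MB/s")
```

That works out to roughly 94 MB/s, a small fraction of the 2.1 GB/s peak in the AGP 3.0 spec quoted above, which fits the quoted claim that the limit lies in the drivers rather than the bus.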

Alternative use by Zog+The+Undeniable · 2004-05-08 21:01 · Score: 2, Interesting

Remember the story about PS2s being used in Iraqi WMDs? No doubt the next "outlaw state" will be accused of using GeForce Ti4600s to manage fast breeder reactors.

--
When I am king, you will be first against the wall.

Re:Maybe that's the answer... by phatsharpie · 2004-05-08 21:09 · Score: 2, Interesting

Actually, GCC may have optimizations for the G5, but it is far from optimal:

The compiler that seems to be best/fully optimized for the G5 is the new IBM XL compilers, released at the beginning of the year.

http://forums.macnn.com/showthread.php?s=&threadid=197118

There don't seem to be many benchmarks done using it yet, but all information points to a significant gain in performance when using the IBM compiler versus GCC (not surprising, since IBM built the chip). The only benchmark I can find is from a German site:

http://www.heise.de/ct/Redaktion/as/spec/ct0408230/

I don't believe the G5 is indeed the "fastest" personal computer in the world as claimed by Apple, but it certainly is comparable to the best in the x86 world. Not to mention it is a very new architecture, and there are still plenty of optimizations that can be made to make it faster. But to claim that GCC is fully optimized for the G5, and that Apple was using it to justify its claim of being the "fastest", is incorrect. It used a compiler that is arguably good, but certainly not excellent for it.

As for comparing Mac OS X to Linux rather than Windows, I think the comparison is valid considering the market Apple has been targeting recently. Apple seems to have backed off from wooing the MS crowd, and is instead focusing on firms that use UNIX workstations. Apple wants these companies to switch to the PowerMac rather than to an x86/Linux platform. This is highlighted by their advocacy of using OS X for biotech and film/video effects production. I remember one of their earlier OS X ads even told the reader to send all of their old UNIX boxes to "/dev/null" - or something like that.

-B

Re:Maybe time for a new generation of math-process by pe1chl · 2004-05-08 21:11 · Score: 4, Insightful

What I remember about co-processing cards and "intelligent peripheral cards" (like raid controllers or network cards with an onboard processor) is this:

There is a certain overhead because a communications protocol is to be established between the main processor and the co-processor. For simple tasks the main processor often stops and waits for the co-processor to complete the task and retrieves the results. For more complicated tasks, the main processor continues but later an interrupt occurs that the main processor must service.

You must be very careful or the extra overhead of this communication makes the execution of the task slower than without the co-processor. This is certainly going to happen at some time in the future, when you increase central processor power all the time but keep using the same co-processor.

For example, your matrix co-processor needs to be fed the matrix data, start working, and tell it is finished. Your performance would not only be limited by the processor speed, but also by the bus transfer rate, and by the impact those fast bus transfers have on the CPU-memory bandwidth available and the on-CPU cache validity.
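
The trade-off can be written as a simple inequality. A back-of-the-envelope sketch, not a model of any real card: offloading pays off only when the fixed handshake/interrupt overhead, the bus transfer time and the co-processor's compute time together beat doing the job on the CPU. The job sizes and timings below are hypothetical; 133 MB/s is the peak of plain 32-bit/33 MHz PCI.

```python
def offload_pays_off(bytes_moved: float, bus_bytes_per_s: float,
                     coproc_seconds: float, cpu_seconds: float,
                     fixed_overhead_s: float = 0.0) -> bool:
    """Return True if sending a job to a co-processor beats running it on the CPU."""
    transfer_s = bytes_moved / bus_bytes_per_s  # data in plus results out
    offload_total = fixed_overhead_s + transfer_s + coproc_seconds
    return offload_total < cpu_seconds


# Hypothetical job: 12 MB across classic PCI, 0.5 s of work on the co-processor.
print(offload_pays_off(12e6, 133e6, 0.5, 2.0))  # current CPU needs 2.0 s
print(offload_pays_off(12e6, 133e6, 0.5, 0.5))  # a CPU four times faster
```

The first call prints True and the second False: keep the co-processor the same, make the CPU four times faster, and the same job stops being worth offloading, which is the "next CPU you buy" case.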
When you are unlucky, the next CPU you buy is faster in performing the task itself.

Dual Core by BrookHarty · 2004-05-08 21:16 · Score: 4, Interesting

With dual-core CPUs set to become the norm, why not a dual-core GPU for even faster gfx cards? With everyone wanting 16x antialiasing at 1600x1200 at over 100 FPS, it's gonna take some very powerful GPUs (or some dual cores).

Even with the ATI X800 XT, 1600x1200 can dip below 30 FPS with AA/AF on higher settings. Still a ways to go for that full virtual-reality look.

Re:Dual Core by PhrostyMcByte · 2004-05-08 21:58 · Score: 2, Informative

Video cards are already able to run many things in parallel- they are beyond dual-core.

Re:Dual Core by BrookHarty · 2004-05-08 22:00 · Score: 2, Interesting

Video cards are already able to run many things in parallel- they are beyond dual-core.

There were dual ATI GPUs, or Matrox, or even the old Voodoo2 SLI. Seems you can increase speed with more cores.

Re:Dual Core by BrookHarty · 2004-05-09 02:50 · Score: 2, Informative

I can tell up to about 80-ish FPS, but I run the refresh rate at 85 or 100 Hz for no flicker. So yes, there is a point to higher FPS. But you didn't say you played video games. And if you turn vsync off, you get tearing.

I remember a while back someone did Quake 2 benchmarks on accuracy vs. FPS, and how 79 FPS (I think) was the sweet spot; faster and lower refresh rates had a negative effect on accuracy.

But I won't argue 20 FPS over 80; 100 seems to be the target, IMHO.

Audio DSP by buserror · 2004-05-08 21:23 · Score: 4, Informative

I've been thinking about using the GPU for audio DSP work for some time, even got to a point where I could transform some signal by "rendering" it into a texture (in a simple way, I could mix two sounds using the alpha as factor).
The problem is that these cards are made to be "write only", and fetching anything back from them is *very* slow, which makes them totally useless for the purpose, since you *know* the results are there, but you can't fetch them in a useful/fast manner.
I wonder if it's deliberate, to sell the "pro" cards they use for the rendering farms?
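
The mixing trick described above is the standard alpha-blend equation, out = alpha*a + (1 - alpha)*b, evaluated per sample instead of per pixel. A CPU-side sketch of the same arithmetic for reference; the function name is mine, and on the card each sample would live in a texel rather than a Python list.

```python
def alpha_mix(a: list[float], b: list[float], alpha: float) -> list[float]:
    """Blend two equal-length sample buffers as alpha*a + (1 - alpha)*b."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    return [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]


# Mix 25% of the first signal with 75% of the second.
print(alpha_mix([1.0, 0.0], [0.0, 1.0], 0.25))
```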

Re:Audio DSP by SmackCrackandPot · 2004-05-08 23:23 · Score: 3, Insightful

I wonder if it's deliberate, to sell the "pro" cards they use for the rendering farms

No, it's just the way that the OpenGL and DirectX APIs evolved. There never was any need in the past for substantial data feedback. The only need back then was to read pixelmaps and selection tags for determining when an object had been picked.

Re:Audio DSP by attaka · 2004-05-09 03:41 · Score: 2, Funny

Easy!
You just have to figure out how to connect that toslink cable to the digital monitor connector.

Re:Link to previous discussion on same/similar sub by Crazy+Eight · 2004-05-08 22:00 · Score: 5, Informative

QE is cool, but it doesn't do anything similar at all to what they're talking about here. FFTs on an NV30 are only incidentally related to texture mapping window contents. Check out gpgpu.org or BrookGPU. In a sense, the idea is to treat modern graphics hardware as the next step beyond SIMD instruction sets. Incidentally, e17 exploited (hardware) GL rendering of 2D graphics via evas a bit before Apple put that into OS X.

Commodore 64 by curator_thew · 2004-05-08 22:01 · Score: 5, Interesting

This concept was being used back in 1988. The Commodore 64 (a 1 MHz 6510, a 6502-like microprocessor) had a peripheral 5.25-inch disk drive called the 1541, which itself had a 1 MHz 6502 CPU in it, connected via a serial link.

It became common practice to introduce fast loaders: these were partially resident in the C64, and also in the 1541: effectively replacing the 1541's limited firmware.

However, demo programmers figured out how to utilise the 1541: one particular demo involved uploading a program to the 1541 at the start, then, on every screen rewrite, uploading vectors to the 1541, on which the 1541 would perform calculations in parallel with the C64; then, at the end of the screen, the C64 fetched the results from the 1541 and incorporated them into the next screen frame.

Equally, a GPU provides similar capability if used that way.
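
The pattern described (hand the helper the next frame's work, draw the current frame, collect the results at the end of the frame) is ordinary software pipelining. A minimal sketch with a worker thread standing in for the 1541 or the GPU; compute_vectors and render are hypothetical placeholders, and under CPython's GIL this shows the overlap structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_vectors(frame: int) -> list[int]:
    """Hypothetical stand-in for the helper's per-frame calculation."""
    return [frame * k for k in range(4)]


def render(frame: int, vectors: list[int]) -> str:
    """Hypothetical stand-in for the main CPU drawing a frame."""
    return f"frame {frame}: {vectors}"


def run(frames: int) -> list[str]:
    out = []
    with ThreadPoolExecutor(max_workers=1) as helper:
        pending = helper.submit(compute_vectors, 0)  # prime the pipeline
        for f in range(frames):
            vectors = pending.result()  # fetch this frame's results
            if f + 1 < frames:
                # Start the next frame's work before drawing this one.
                pending = helper.submit(compute_vectors, f + 1)
            out.append(render(f, vectors))  # main CPU draws meanwhile
    return out


for line in run(3):
    print(line)
```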

Re:Commodore 64 by pommiekiwifruit · 2004-05-09 03:05 · Score: 2, Insightful

I would be interested in a reference for that, since the 1541 serial link was so slow. If you are talking about Mindsmear, that was not actually released, but a demo would have to be pretty clever to make the communication time worthwhile (and accurate with the screen still turned on).
Re:Commodore 64 by curator_thew · 2004-05-09 03:42 · Score: 3, Informative

I don't recall exactly: maybe Horizon, definitely Scandinavian. I remember because I decompiled it! What happened was that I started the demo and, unusually, the disk drive kept spinning, so I turned it off, which caused the demo to fail. I tested loading, then tried to start the demo, and it didn't work, so out of curiosity, an Action Replay and an IRQ investigation revealed what was going on. I think it was a single-part demo: the most memorable C64 demo for me because of that trick.

Expand this thinking! by Osty · 2004-05-08 22:06 · Score: 3, Interesting

You're absolutely correct that these "game snobs" are looking at the past through rose-colored graphics, forgetting all of the stinkers of yesteryear. However, it's not just games where this applies. How many times have you heard people complain about how bad movies are now, or music, or books? It's exactly the same phenomenon. When your grandfather tells you how much better things were "back in the day", it's for exactly the same reason. He's looking back at all the good things, while ignoring all of the bad.

Face it, everything mostly sucks. It always has, and it always will. There will always be some gems that really stand out, and those will be what are remembered when people fondly look back on "the old days". Get over it.

Re:Maybe time for a new generation of math-process by Squant · 2004-05-08 22:47 · Score: 2, Insightful

Math coprocessor boards would be great, but still quite fixed-function.

It would be much more efficient to implement a coprocessor with an FPGA: first program the FPGA with the functions to execute, then feed the data to it, and when the calculation is completed, just reprogram it to become whatever you want.

This way you would not have a math-only board, but a board that could perform many, many functions. You just need to write algorithms to exploit them.

Re:Link to previous discussion on same/similar sub by cehardin · 2004-05-08 23:08 · Score: 3, Insightful

I think the real reason Apple comes out with newer and better technology is because they have to fight for their user base. After all, if Apple's products were the same as Microsoft's, who would care?

Microsoft can afford to be lazy with their products; they make money either way. I don't think that will last forever though. Sometimes they do try hard, NT for example, but then they pile a bunch of poorly designed stuff on top of it and that ruins it. If you can, check out OS X's directory structure; it's beautiful. Now compare that to Windows' cryptic system...

"Microsoft, as usual, announced the feature after Apple shipped it"

"God I'm tired of hearing that phrase over and over again when 95% of the time it's just because Apple can control the hardware and it would be a total disaster if MS included a technology as fast as they do..."

All very impressive, but.... by tiger99 · 2004-05-08 23:53 · Score: 3, Insightful

... there are a few snags, such as the fact that a GPU will not have (because it normally does not need) memory management and protection, so it is really only safe to run one task at a time. And, does this not need the knowledge of the architecture and instruction set that Nvidia seem to be unable or unwilling to disclose, hence the continuing controversy over the binary-only Linux drivers?

However I do know that a lot of people had been wondering about this for a while, could it be done, and was it worth attempting, so now we know. Maybe we shall soon see PCI cards containing an array of GPUs, I imagine the cooling arrangements will be quite interesting!

There are other things which are faster than a typical CPU, are not some of the processors in games machines 128-bit? Again, you could in theory put some of these together as a co-processor of some sort.

This was a good piece of work technically, but it says something about society that the fastest mass-produced processors, whether for GPUs or games consoles, exist because people want a higher frame rate in Quake. I can't think of any professional application that needs really fast graphics output, but many that could use faster processing. So why can't Intel and AMD stop putting everything in the one CPU (multiple CPUs with one memory are not really much better), and make co-processors again, which will do fast matrix operations on very large arrays, etc., for those who need them? The ultimate horror of the one-CPU philosophy was the winmodem and winprinter, both ridiculous. Silicon is in fact quite cheap, as Nvidia have proved; people's time while they wait for long calculations to finish is not.

Maybe we are going to see an architectural change coming, I expect it will be supported by FOSS long before Longhorn, just like the AMD64.

what's really needed by curator_thew · 2004-05-09 00:11 · Score: 3, Interesting

What's really needed is to couple the GPU and CPU in such a way that the GPU actually runs a very low level O/S, like an L4Ka style kernel (http://l4ka.org/), and becomes "just another" MP resource.

Then, on top of this low level, actually runs the UI graphics driver and so on. Other tasks can also run, but ultimately the priority is given to the UI driver.

Then, the O/S on the CPU needs to be able to know generally how to distribute tasks across to the GPU. Fairly standard for a tightly coupled MP that has shared bus memory.

Why do I say this? Because the result is

(a) if you're using an especially high performance application, the GUI runs full throttle dedicated to rendering/etc and acts as per normal;

(b) if you're not, e.g. when running Office, engineering tools, or other compute-intensive tasks (e.g. recoding video without displaying it), then the GPU is just another multiprocessor resource to soak up cycles.

Then, CPU/GPU is just a seamless computing resource. The fantastic benefit of this is that if the O/S is designed properly, then it could allow simply buying/plugging in additional PCI (well, PCI probably not good because of low speed, perhaps AGP?) cards that are simply "additional processors" - then you get a relatively cheaper way of putting more MP into your machine.

Re:Maybe time for a new generation of math-process by Temkin · 2004-05-09 00:35 · Score: 2, Interesting

Remember the co-processors? Well, actually I don't (I'm a tad too young). But I know about them.

Dig deeper. 8087 FPUs were nice, though they ran hot enough to cook on, but the idea had existed for 15 or more years before they appeared. Try looking into the old DEC PDP-11 archives. There you'll find DEC's own "CIS" or "commercial instruction set", which was a set of boards (later an add-on chip) that added string, character and BCD math instructions. DEC also had an FPU card set that implemented a 64-bit FPU out of AMD 2901 bit-slice processors. Many low-budget not-quite-supercomputers were really add-on hardware boxes for general-purpose computers. Basically add-on stunt boxes.

Damn... I'm too young to feel this old! Most of this stuff was in play when I was in grade school.

Temkin

Very bad article by Slash.ter · 2004-05-09 01:58 · Score: 3, Interesting

This is a very poor quality article, I analyzed it before. There are possibly better ones mentioned by others.

Just look at the matrix multiplication case. Look at the graph and see that 1000x1000 takes 30 seconds on the CPU and 7 seconds on the GPU. Let's translate it to millions of operations per second: CPU -> 33 Mop/s, GPU -> 142 Mop/s. Matrix multiplication has cubic complexity, so for the CPU: 1000 * 1000 * 1000 / 30 seconds / 1000000 = 33 Mop/s.

Now think a while: 33 million operations on a 1.5 GHz Pentium 4 with SSE (I assume there is no SSE2). Pentium 4 has a fused multiply-add unit which makes it do two ops per clock. So we get 3 billion ops per second peak performance! What they claim is that the CPU is 100 times slower for matrix multiply. That is unlikely. You can get 2/3 of peak on a Pentium 4. Just look at the ATLAS or FLAME projects. If you use one of these projects you can multiply 1000x1000 matrices in half a second: 14 times faster than the quoted GPU.
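
The arithmetic in the last two paragraphs, as a sketch. It uses the post's convention of n^3 operations for an n x n multiply (counting the multiply and the add separately would double every figure) and the post's assumption of 2 ops per clock at 1.5 GHz.

```python
def mops(n: int, seconds: float) -> float:
    """Millions of operations per second, counting n**3 ops for an n x n multiply."""
    return n ** 3 / seconds / 1e6


cpu = mops(1000, 30.0)   # article's CPU timing
gpu = mops(1000, 7.0)    # article's GPU timing
peak = 1.5e9 * 2 / 1e6   # 1.5 GHz x 2 ops per clock, in Mop/s
tuned = peak * 2 / 3     # ATLAS/FLAME-style code at ~2/3 of peak
tuned_seconds = 1000 ** 3 / (tuned * 1e6)

print(f"CPU {cpu:.0f}, GPU {gpu:.0f}, peak {peak:.0f}, tuned {tuned:.0f} Mop/s")
print(f"tuned 1000x1000 multiply: {tuned_seconds:.2f} s, {7.0 / tuned_seconds:.0f}x the GPU")
```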

Another thing is the floating point arithmetic. GPU uses 32-bit numbers (at most). This is too small for most scientific codes. CPU can do 64-bits. Also, if you use 32-bits on CPU it will be 4 times as fast as 64-bit (SSE extension). So in 32-bit mode, Pentium 4 is 28 times faster than the quoted GPU.

Finally, the length of the program. The reason matrix multiply was chosen is because it can be encoded in very short code - three simple loops. This fits well with the 128-instruction vertex code length. You don't have to keep reloading the code. For more challenging codes it will exceed the allowed vertex code length. The three-loop matrix multiply implementation stresses memory bandwidth. And the CPU has MB/s and the GPU has GB/s. No wonder the GPU wins. But I can guess that without making any tests.
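
For reference, the "three simple loops" version looks like this (a plain Python sketch, not the article's shader code). The innermost loop walks b down a column, which is why the naive form leans so heavily on memory bandwidth; tuned libraries such as ATLAS restructure the loops into cache-sized blocks.

```python
def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Textbook three-loop multiply of an n x m matrix by an m x p matrix.

    Each output element takes m multiply-adds, so an n x n product
    costs n**3 multiply-adds in total.
    """
    n, m, p = len(a), len(b), len(b[0])
    if any(len(row) != m for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = 0.0
            for k in range(m):
                acc += a[i][k] * b[k][j]  # strided access down column j of b
            c[i][j] = acc
    return c


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```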

Ever heard of PCI Express? by Egekrusher2K · 2004-05-09 02:04 · Score: 2, Insightful

Touché. However, with the upcoming advances in bus speeds (read: PCI Express) and the available bandwidth to the PCI bus, we won't have to worry about latency when using a coprocessor type piece of hardware. There is room to grow with this new bus to almost outlandish amounts of bandwidth. Not a problem we'll run into any time soon.

--
Listen to my experimental-industrial-techno!

Three questions by pvera · 2004-05-09 02:50 · Score: 2, Interesting

1. Is anyone except Apple trying to leverage the GPU for non-3D tasks? Apple has been doing Quartz Extreme for a while but I have not heard if anyone else is doing it.

2. Has anyone tried something similar to what Quartz Extreme does but for non-graphical tasks?

3. How come GPU makers are not trying to make a CPU by themselves?

--
Pedro
----
The Insomniac Coder

Re:Three questions by be-fan · 2004-05-09 10:12 · Score: 2, Informative

1. Is anyone except Apple trying to leverage the GPU for non-3D tasks? Apple has been doing Quartz Extreme for a while but I have not heard if anyone else is doing it.
Microsoft, for Longhorn, and freedesktop.org, for X11. Both go quite a bit beyond Quartz Extreme by using D3D/OpenGL for all drawing, not just compositing.

3. How come GPU makers are not trying to make a CPU by themselves?
GPUs are very different from CPUs. Graphics is almost infinitely parallelizable, so you are really just limited by how many execution units you can stick on the chip. Assuming enough memory bandwidth, you get a nearly linear increase with increasing numbers of execution units. CPUs, on the other hand, deal with general-purpose code that has an inherent parallelism of about 3-way to 4-way at most. So CPU manufacturers have to do clever things like SMT to take advantage of increased execution resources, but mainly must concentrate on ramping up clock speed and memory bandwidth.
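
The scaling argument can be captured in one line: the best-case speedup from adding execution units is capped by how many independent work items the code exposes. A simplified sketch, treating each pixel of a hypothetical 1024x768 frame as an independent item and taking the 4-way figure for general-purpose code from the post.

```python
def ideal_speedup(execution_units: int, available_parallelism: int) -> int:
    """Best-case speedup; units beyond the available parallelism sit idle."""
    return min(execution_units, available_parallelism)


pixels = 1024 * 768  # hypothetical frame: one independent work item per pixel
for units in (4, 8, 16):
    print(units, "units:", ideal_speedup(units, pixels), "x graphics,",
          ideal_speedup(units, 4), "x general-purpose")
```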

Interestingly enough, GPU makers wouldn't be very good at making CPUs. GPUs are designed using high-level software, like VHDL. This has a big impact on their maximum clock speed, but that doesn't really matter, because they can always double the number of pipelines and get a nearly 2x increase in performance. Meanwhile, CPUs are designed by hand, and tweaked to get every last MHz, because throwing twice as many execution units on the CPU wouldn't help performance much at all.

--
A deep unwavering belief is a sure sign you're missing something...

Interesting work that raises some questions... by thurin_the_destroyer · 2004-05-09 02:54 · Score: 4, Informative

Having done similar work for my final year project this year, I have some experience attempting general-purpose computation on a GPU. The results I received when comparing the CPU with the GPU were very different, with many of the applications coming in 7-15 times slower on the GPU. I also discovered some problems, which I describe below:

! Matrix results
As mentioned earlier in the report, the graphics pipeline does not support a branch instruction. So with a limited number of assembly instructions that can be executed in each stage of the pipeline (either 128 or 256 in current cards), how is it possible for them to multiply 1500x1500 matrices? Calculating a single result takes 1500 multiplications, and even if they are really clever about how they encode the data into textures to optimise access, they would need two texture accesses for every 4 multiplications. By my calculations that is 1875 instructions per result, where you can only do 128 or 256.
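The estimate above can be written out as a small calculation. The post does not break down the 1875 figure, so the cost of five instructions per packed group of four multiplications used here is an assumption (for instance, two texture fetches plus three arithmetic instructions), chosen only because 1500 / 4 x 5 reproduces it. The function names are hypothetical.

```python
import math


def instructions_per_result(n: int, per_group: int = 5, width: int = 4) -> int:
    # One element of an n x n matrix product needs n multiplications.
    # Packing `width` values per RGBA texel gives ceil(n / width) groups,
    # each assumed (not measured) to cost `per_group` instructions.
    return math.ceil(n / width) * per_group


def fits_in_one_pass(n: int, limit: int) -> bool:
    # A straight-line fragment program must stay under the instruction cap.
    return instructions_per_result(n) <= limit


if __name__ == "__main__":
    print(instructions_per_result(1500))   # 1875
    print(fits_in_one_pass(1500, 256))     # False
```

Under this model a 256-instruction cap would allow n up to 204, far beyond the 26x26 the poster measured with Cg, so real compiled code evidently costs much more per element; treat the numbers as illustrative only.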

My tests using the Cg compiler provided by NVIDIA found that 26x26 was the largest matrix that could be multiplied before the unrolled for loop exceeded the 256-instruction limit.

One aspect that my evaluation did not get to examine was the possibility of reading partial results back from the framebuffer into texture memory and loading a slightly modified program to generate the next partial result. They don't mention whether they used this strategy, so I assume that they didn't.
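The multipass idea described above can be simulated on the CPU. This is a sketch under assumptions, not GPU code: `target` stands in for the render target whose contents each pass reads back, and `terms_per_pass` is a hypothetical stand-in for however many multiply-adds fit under the instruction limit.

```python
def multipass_matmul(A, B, terms_per_pass):
    """Multiply A (n x k) by B (k x m) in passes, each adding at most
    `terms_per_pass` products to every output element."""
    n, k, m = len(A), len(B), len(B[0])
    # Stand-in for the render target holding the partial results.
    target = [[0.0] * m for _ in range(n)]
    passes = 0
    for start in range(0, k, terms_per_pass):
        stop = min(start + terms_per_pass, k)
        # One pass: every output element runs the same short program over
        # the slice [start, stop) and adds the previous pass's value.
        target = [[target[i][j] + sum(A[i][t] * B[t][j] for t in range(start, stop))
                   for j in range(m)]
                  for i in range(n)]
        passes += 1
    return target, passes


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(multipass_matmul(A, B, terms_per_pass=2))
    # ([[58.0, 64.0], [139.0, 154.0]], 2)
```

Each pass only needs a program as long as `terms_per_pass` multiply-adds, so a larger matrix costs more passes rather than a longer program.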

! Inclusion of a branch instruction
Even if a branch instruction were included in the vertex and fragment stages of the pipeline, it would cause serious timing issues. As a student of Computer Science, I have been taught that a pipeline operates at the speed of its slowest stage, and from designing simple pipelined ALUs, I see the logic behind it. However, if a branch instruction were included, the fragment processing stage could become the slowest, as the pipeline stalls waiting for the fragment processor to output its information into the framebuffer. I believe it is for this reason that the GPU designers specifically did not include a branch instruction.
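For context, fragment programs without branches typically express a conditional by computing both alternatives and choosing one with a 0/1 mask. The sketch below mimics that in Python; `step` and `lerp` are modeled on the Cg/HLSL functions of the same names, and the absolute-value example is only an illustration.

```python
def step(edge: float, x: float) -> float:
    # 1.0 where x >= edge, else 0.0. On the GPU this is a compare
    # instruction, not a jump; the Python conditional only emulates it.
    return 1.0 if x >= edge else 0.0


def lerp(a: float, b: float, t: float) -> float:
    # Linear blend: returns a when t == 0 and b when t == 1.
    return a + t * (b - a)


def branch_free_abs(x: float) -> float:
    # Both "branches" are always evaluated; the mask picks one result,
    # so every fragment executes the same instruction sequence.
    negated = -x
    return lerp(negated, x, step(0.0, x))


if __name__ == "__main__":
    print([branch_free_abs(v) for v in (-3.0, 0.0, 2.5)])  # [3.0, 0.0, 2.5]
```

The price of this approach is that both sides are always paid for, so long conditional code costs the sum of its paths.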

! Accuracy
My work also found a serious accuracy issue with attempting computation on the GPU. Firstly, the GPU hardware represents all numbers in the pipeline as floating point values. As many of you can probably guess, this brings up the ever-present problem of 'floating point error'. Secondly, the interface between GPU and CPU traditionally uses 8-bit values. Once they are imported into the 32-bit floating point pipeline, they are represented as values between 0 and 1, meaning that these numbers must be scaled up to their intended representations (integers between 0 and 255, for example) before computation can begin. Combine these two necessary operations and what I saw was a serious accuracy issue where five of my nine results (in the 3x3 matrix) were one integer value out.
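The hazard in converting floats back to integers can be reproduced on a CPU. This sketch emulates single precision with Python's `struct` module (an assumption; real hardware may round differently) and compares chopping the fraction with rounding to nearest. The 0.7-times-10 case is a generic illustration, not a value from the poster's experiments.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (a double) to the nearest IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def normalize(byte: int) -> float:
    # An 8-bit channel value enters the pipeline as a float in [0, 1].
    return f32(byte / 255.0)


def to_int_truncate(x: float) -> int:
    # Chopping the fraction turns 6.9999998... into 6.
    return int(x)


def to_int_round(x: float) -> int:
    # Rounding to nearest absorbs small representation errors
    # (valid for the non-negative values used here).
    return int(x + 0.5)


if __name__ == "__main__":
    scaled = f32(0.7) * 10          # intended value: 7
    print(scaled, to_int_truncate(scaled), to_int_round(scaled))
    # Every 8-bit value survives normalize-then-rescale with rounding.
    print(all(to_int_round(normalize(v) * 255) == v for v in range(256)))
```

Truncation and rounding agree whenever the float lands on or just above the intended integer, so a truncating conversion can pass many checks and still fail on some elements.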

While I don't claim to be an expert on these matters, I do think there is the possibility of using commodity graphics cards for general-purpose computation. However, using hardware that is not designed for this purpose imposes some serious constraints, in my opinion. Anyone who cares to look at my work can find it here

the magic of "streaming i/o" by peter303 · 2004-05-09 04:01 · Score: 3, Informative

GPUs pass input and output from GPU memory at 4-12 bytes per flop. This is much faster than CPUs, which are limited by bus speeds that are likely to deliver a number only every several operations. So CPU benchmarks are bogus: they use algorithms that reuse internal memory over and over again.
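A back-of-the-envelope way to see why streaming bandwidth matters: if a kernel needs B bytes of memory traffic per floating-point operation, memory alone caps throughput at bandwidth / B. The peak and bandwidth figures below are placeholders, not measurements of any card, and SAXPY (y = a*x + y) is just a convenient example kernel.

```python
def bytes_per_flop(flops_per_elem: int, bytes_per_elem: int) -> float:
    # Memory traffic the kernel needs for each floating-point operation.
    return bytes_per_elem / flops_per_elem


def attainable_gflops(peak_gflops: float, bandwidth_gb_s: float, bpf: float) -> float:
    # Throughput is limited by whichever runs out first: ALUs or memory.
    return min(peak_gflops, bandwidth_gb_s / bpf)


if __name__ == "__main__":
    # SAXPY: one multiply and one add per element; reads x and y and
    # writes y, each a 4-byte float.
    saxpy = bytes_per_flop(flops_per_elem=2, bytes_per_elem=3 * 4)
    print(saxpy)                                   # 6.0
    print(attainable_gflops(40.0, 30.0, saxpy))    # 5.0 (placeholder numbers)
```

SAXPY needs 6 bytes per flop, so a machine that delivers only a fraction of a byte per flop from main memory spends most of its time waiting, while one that streams several bytes per flop can keep its ALUs much busier.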

It's not always easy to reformulate algorithms to fit streaming memory and the other limitations of GPUs. This issue came up in earlier generations of custom computers, so there are techniques like cyclic matrices that map multi-dimensional matrix operations into 1-D streams, and so on.
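As a generic illustration of putting a 2-D operation onto a 1-D stream (not the specific "cyclic matrix" formulation mentioned above, which the post does not spell out), a matrix-vector product can consume the matrix as a single row-major stream. The function names are made up for the example.

```python
from typing import Iterable, Iterator, List, Sequence


def row_major_stream(matrix: Sequence[Sequence[float]]) -> Iterator[float]:
    # Flatten a 2-D matrix into a 1-D stream, one row after another.
    for row in matrix:
        yield from row


def streamed_matvec(stream: Iterable[float], cols: int, x: Sequence[float]) -> List[float]:
    # Consume the matrix once, strictly in order, holding only a running
    # sum, so no matrix element is ever re-read.
    out: List[float] = []
    acc = 0.0
    for idx, value in enumerate(stream):
        acc += value * x[idx % cols]
        if idx % cols == cols - 1:      # last element of the current row
            out.append(acc)
            acc = 0.0
    return out


if __name__ == "__main__":
    M = [[1.0, 2.0], [3.0, 4.0]]
    print(streamed_matvec(row_major_stream(M), cols=2, x=[1.0, 1.0]))  # [3.0, 7.0]
```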

The 2003 SIGGRAPH had a session on this topic showing you could implement a wide variety of algorithms outside of graphics.

Folding@Home is actually working on this... by pointwood · 2004-05-09 04:13 · Score: 3, Interesting

Some day you may be able to Fold proteins with your GPU.

Re:Link to previous discussion on same/similar sub by Tablizer · 2004-05-09 05:01 · Score: 2, Funny

Quote from that topic: "Reminds me of the good old days when you used the processors in the C64 tapedrive to compute stuff. Wouldn't want to waste those precious cycles."

I wonder if that is where they kept their porn back then also.

--
Table-ized A.I.

Re:Link to previous discussion on same/similar sub by mrchaotica · 2004-05-09 09:07 · Score: 2, Funny

Files aren't supposed to go all together; they're supposed to be divided by type: /bin, /etc, /lib, etc.!

- UNIX fanboy

(yes, that was a joke; actually, I'm looking forward to database-based file systems - but not proprietary ones)

--

"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

Re:Link to previous discussion on same/similar sub by cehardin · 2004-05-09 10:59 · Score: 2, Informative

Utter crap, fanboy. OS X's directory structure is a basic UNIX system hidden by the file manager, with applications thrown on to '/'.

Boy, you really have no idea what the heck you are talking about, do you? Of course the basic UNIX stuff is there, /bin, /sbin, /usr/local, all that stuff.

Those directories have very few files in them, and you will also notice a lack of init.d startup scripts. Most of the system is contained in /System.

For example, rather than /etc/init.d, it has startup services in /System/Library/StartupItems. There is an Apache folder, for instance, containing the scripts necessary to start Apache along with a file that describes Apache's dependencies. These startup items are also multilingual: you can boot into any language you want. All of this in one folder. That's f*cking elegance, yet it is only a very small example.
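For readers unfamiliar with the mechanism, here is a sketch of the kind of dependency ordering such a dependency file enables. It assumes each item's StartupParameters.plist declares `Provides` and `Requires` arrays; the three items and their contents are made up for illustration, and the real boot process handles more than hard requirements (soft dependencies and ordering hints are ignored here).

```python
import plistlib
from graphlib import TopologicalSorter

# Made-up StartupParameters.plist for a hypothetical Apache item.
APACHE_PLIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>Description</key><string>Apache web server</string>
  <key>Provides</key><array><string>Web Server</string></array>
  <key>Requires</key><array><string>Network</string></array>
</dict>
</plist>
"""


def boot_order(items: dict) -> list:
    """Return item names so that every item starts after what it requires."""
    # Map each advertised service to the item that provides it.
    provider = {svc: name for name, params in items.items()
                for svc in params.get("Provides", [])}
    # TopologicalSorter takes node -> set of predecessors.
    graph = {name: {provider[req] for req in params.get("Requires", [])}
             for name, params in items.items()}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    items = {
        "SystemLog": {"Provides": ["System Log"]},
        "Network": {"Provides": ["Network"], "Requires": ["System Log"]},
        "Apache": plistlib.loads(APACHE_PLIST),
    }
    print(boot_order(items))  # ['SystemLog', 'Network', 'Apache']
```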

Check it out, you will see.

Re:Link to previous discussion on same/similar sub by cliffwoolley · 2004-05-09 16:47 · Score: 2, Informative

As for organizations beating slashdot to the punch on this one, that's true... but it's good to see this getting even more exposure. :)

GPGPU (General-Purpose computation on GPUs) was a hot topic at various conferences in 2003; a number of papers were published on the subject. At SIGGRAPH 2004 there will be a full-day course on GPGPU given by eight of the experts in the field (including myself).

Mark Harris of NVIDIA maintains a website dedicated to GPGPU topics, including discussion forums and news postings. Well worth a browse if you're interested in GPGPU topics.

I look forward to seeing some of you at SIGGRAPH! :)

--Cliff
