BrookGPU: General Purpose Programming on GPUs
An anonymous reader writes "
BrookGPU is a compiler and runtime system that provides an easy, C-like programming environment (read: No GPU programming experience needed) for today's GPUs. A shader program running on the NVIDIA GeForce FX 5900 Ultra achieves over 20 GFLOPS, roughly equivalent to a 10 GHz Pentium 4. Combine this with the increased memory bandwidth, 25.3 GB/sec peak compared to the Pentium 4's 5.96 GB/sec peak, and you've got a seriously fast compute engine, but programming GPUs has been a real pain. BrookGPU adds simple data parallel language additions to C which allow programmers to specify certain parts of their code to run on the GPU. The compiler and runtime take care of the rest. Here is the Project Page and Sourceforge page."
I suspect that this high performance is only attainable for the field the GPU is specialized for, i.e. graphics-related things. Or isn't it?
What kind of instructions does the GPU actually accept?
I mean, you probably just can't run any kind of algorithm on there can you?
The path I walk alone is endlessly long.
30 minutes by bike, 15 by bus.
I wonder how long till we see a (insert worthwhile cause here)-At-Home client that supports this?
... can you say 'software synthesists' wet dream?
... $5 to the first person to use Brook to make a synthesizer. :)
Oh, suddenly, that 'game investment' also gives you a few 100 extra voices of polyphony?
Sweet
; -- the corruption of government starts with its secrets. a truly free people keep no secrets. --
but the link to the project page is correct.
Reminds me of the good old days when you used the processors in the C64 tapedrive to compute stuff. Wouldn't want to waste those precious cycles.
:)
I'm sure a lot of old farts will tell me how they used some serial controller to compute stuff back in the 60's and that I'm just a little kid.
A shader program running on the NVIDIA GeForce FX 5900 Ultra achieves over 20 GFLOPS, roughly equivalent to a 10 GHz Pentium 4.
wait, if there is a technology that allows construction of a GPU that is 3 times faster than the fastest CPUs, why don't Intel and AMD use this technology to build those 3-times-faster CPUs?
are you sure that you can compare the speed of GPU and CPU?
#
#\ @ ? Colonize Mars
#
I'm completely new to meddling with graphics cards, so apologies if this is a silly question: when programs utilising the GPU for arbitrary calculations are running, does the screen go weird, or is there a way of stopping the output being displayed? A screenful of junk might not matter to a scientist leaving their computer to crunch numbers for a few months, but it wouldn't be good for a general-purpose program.
"'I pass the test,' she said. 'I will diminish, and go into the West, and remain Galadriel.'"
- JRR Tolkien.
It would seem to me that the GPU is not going to be as general-purpose as the CPU, but could still attain the high mathematical throughput with vector-oriented processing.
Doing string searches, complex logic analyses, etc. would probably suck, but big data manipulations, such as SETI-style wave transformations, molecular analysis, etc., might be able to take advantage of them.
Design for Use, not Construction!
"Brook is an extension of standard ANSI C and is designed to incorporate the ideas of data parallel computing and arithmetic intensity into a familiar, efficient language." I'll qualify that this is the first I've heard of Brook; however, the words "efficient language" ring loud in my ears. If operating systems were programmed as such, imagine how fast bootup and operation of a computer would be. Instead we have bloated software on all sides of the board that can barely show us the difference between MHz and GHz machines. What century are we in again?
After taking a quick peek at the language part of the project, it seems right now that most of its functions are all about sets of data and how to move them around.
Makes sense of course, as that is what a GPU is all about. (Yes, I'm vastly oversimplifying here.) So I would gather that it might be used for types of data that are streamed a lot? Maybe used for video editing, real time video, etc., where you're trying to deal with a lot of data at once that you're trying to move around, and not just store or perform some more complicated types of functions upon.
However, I'm no 3D programmer and I sure would love a more detailed analysis of the potential for this.
Really, I know what I'm doing...Ohhhh, look at the shiny buttons!
I'd love to see an FFT implementation (maybe it's not so hard ... will have to download and play with it.)
A lot of scientific code is constrained by how fast you can do an FFT, perhaps of arbitrary size. And a fast graphics card is a lot cheaper than a high-end processor.
For embarrassingly parallel vector problems, this is just the sort of thing for cheap, powerful clusters based around a cheap PC and a fast GPU.
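For anyone wondering what would actually have to be mapped onto the card, here is a minimal CPU-side sketch of a radix-2 Cooley-Tukey FFT in Python. It is not Brook code and not taken from the BrookGPU distribution; the function name `fft` and the power-of-two restriction are assumptions made for illustration. The point is that each pass applies the same butterfly to many independent pairs, which is the data-parallel shape a GPU wants.

```python
import cmath


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT.

    `samples` is a list of numbers whose length must be a power of two.
    Returns a list of complex frequency bins.
    """
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(samples[0])]

    # Split into even- and odd-indexed halves and transform each half.
    even = fft(samples[0::2])
    odd = fft(samples[1::2])

    out = [0j] * n
    # Every butterfly below is independent of the others in this pass,
    # so in principle all n/2 of them could run in parallel.
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


if __name__ == "__main__":
    # A constant signal puts all of its energy into bin 0.
    print(fft([1, 1, 1, 1]))
```

A real GPU version would presumably replace the recursion with a fixed sequence of passes over a texture, one pass per butterfly stage, but that part is speculation on my side.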
I didn't pay.
Someone paid for me to have an account. I think it was someone's idea of a sick joke.
Brook is an extension of standard ANSI C and is designed to incorporate the ideas of data parallel computing and arithmetic intensity into a familiar and efficient language. The general computational model, referred to as streaming, provides two main benefits over conventional languages:
- Data Parallelism: Allows the programmer to specify how to perform the same operations in parallel on different data.
- Arithmetic Intensity: Encourages programmers to specify operations on data which minimize global communication and maximize localized computation.
More about Brook can be found at the Merrimac web site, which contains a complete specification for the language. The BrookGPU compilation and runtime architecture consists of two components. BRCC, the BrookGPU compiler, is a source-to-source metacompiler which translates Brook source files (.br) into
The BRT is an architecture-independent software layer which implements the backend support of the Brook primitives for particular hardware. The BRT is a class library which presents a generic interface for the compiler to use. The implementation of the class methods is customized for each hardware platform supported by the system. The backend implementation is chosen at runtime based on the hardware available on the system or at the request of the user. The backends include: DirectX9, OpenGL ARB, NVIDIA NV3x, and C++ reference.
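To make the streaming model described above a bit more concrete, here is a rough Python sketch of the idea: a small "kernel" function is applied to every element of one or more equally sized streams, and each output depends only on the matching inputs. This is not Brook syntax and not the BRT API; `map_kernel` and `saxpy` are hypothetical names invented for the illustration.

```python
def map_kernel(kernel, *streams):
    """Apply `kernel` elementwise across equally sized input streams.

    Each output element depends only on the matching input elements,
    so a runtime would be free to evaluate all of them in parallel.
    """
    if len({len(s) for s in streams}) != 1:
        raise ValueError("all streams must have the same length")
    return [kernel(*elems) for elems in zip(*streams)]


def saxpy(alpha):
    """Build a kernel computing alpha * x + y for one element pair."""
    def kernel(x, y):
        # Purely local arithmetic: no reads from neighbouring elements
        # and no shared state, which keeps arithmetic intensity high.
        return alpha * x + y
    return kernel


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]
    ys = [10.0, 20.0, 30.0]
    print(map_kernel(saxpy(2.0), xs, ys))
```

The restriction that a kernel only sees its own elements is what corresponds to the "data parallelism" and "arithmetic intensity" bullets: no global communication, only local computation.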
...but I assume that in any advanced texturing/shading/bump mapping/other GFX function rendering, you apply all the different effects, and when you're done, you explicitly signal that the frame is to be displayed on screen. (E.g. why your FPS != your monitor refresh rate)
I would assume that this program simply never calls the drawing function, but instead gets the results back from the GPU. The normal screen should be able to run in the meantime (I assume you can e.g. build a 3D environment while showing a 2D cutscene), so I would think you can have a plain GUI, as long as it doesn't need to use anything advanced.
Kjella
Live today, because you never know what tomorrow brings
www.gpgpu.org
Very cool. Vector/Graphics processors could one day overtake General processors. They are way more energy efficient too.
1) Each character would have its own shader program.
2) You would set the shader program, draw a rectangle, and the character would appear.
3) The shader programs would be automatically generated by processing TrueType files.
To implement:
1) Break Truetype outline up into a number of convex curve segments.
2) Each of these curve segments would be represented as a set of constants in the shader program
3) For each pixel, test a line from pixel to an edge.
4) If the number of segments crossed is odd the pixel is black else white.
The algorithm can be refined to add antialiasing and hinting.
What you end up with is text that is clear at any resolution. The size of the text is controlled by the rectangle you draw it in. The text can also be clearly rotated and sheared.
An obvious optimization is to get the GPU vendors to add a shader instruction that calculates which side of the Bézier curve segment the current point lies on.
While not important for games, drawing text is critical for desktops. And we all know about the current trends to draw desktops with 3D hardware.
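The odd/even crossing test in steps 3 and 4 of the implementation list can be sketched on the CPU in a few lines of Python. As a simplifying assumption this uses straight line segments (a polygon outline) instead of the curve segments proposed above, and `inside` is a hypothetical name; in the shader version the per-pixel loop body would be the fragment program and the outline would be its constants.

```python
def inside(px, py, outline):
    """Even-odd rule: cast a ray from (px, py) toward +x and count crossings.

    `outline` is a list of (x, y) vertices of a closed polygon.
    Returns True when the number of crossed edges is odd.
    """
    crossings = 0
    n = len(outline)
    for i in range(n):
        x1, y1 = outline[i]
        x2, y2 = outline[(i + 1) % n]
        # Only edges that straddle the horizontal line y = py can be crossed.
        if (y1 > py) != (y2 > py):
            # x coordinate where the edge meets the line y = py.
            x_at = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if x_at > px:
                crossings += 1
    return crossings % 2 == 1


if __name__ == "__main__":
    square = [(0, 0), (4, 0), (4, 4), (0, 4)]
    # Render a tiny bitmap: '#' marks pixels whose centre is inside.
    for row in range(-1, 6):
        print("".join("#" if inside(col + 0.5, row + 0.5, square) else "."
                      for col in range(-1, 6)))
```

Scaling, rotating or shearing the text then amounts to transforming the pixel coordinates before the test, which is why the result stays sharp at any size.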
What OS are you using? A 2.6 kernel screams with performance, and 2.4.24 is running beautifully on some of my systems.
The only thing that interests me about Brook is that it may allow for more efficient cryptography operations, if the processors it runs on are very good at running calculations in parallel. I would love to see something like this become a cheaper alternative to dedicated hardware encryption accelerator cards, because graphics cards benefit from an economy of scale that encryption accelerators do not, and thus are much cheaper.
The Ro Factor - Jeep/Linux Weblog
Beyond3D had a shader contest and some genius wrote frogger in a pixel shader. That's some great stuff.
If GPUs are to become general purpose, will the AGP bus problems have to be fixed (fast at delivering data, slow at getting it back from the GPU)? Does PCI-X fix this?
This looks like a straightforward and clean extension that experienced C/C++ programmers won't find difficult to learn, but it isn't entirely clear to me whether just using this language, without any knowledge of GPU architecture, will lead to big improvements in performance. Granted, you don't need to know the details, but you've got to have an idea of what it is that you're trying to do and in a general way how the special constructs of the language allow you to do that. As with other such language extensions, you can nominally write in the language but not really use the extensions (how many "C++" programs have you seen that were really C programs with // comments and a few couts?) or use them in unintended ways that prevent the intended optimization. It seems to me that if the project really is aiming at programmers who are not familiar with GPUs, they need at least to provide a brief introduction to the special properties of GPU architecture and some guidelines as to how to use the features of the language to take advantage of them. At present I don't find this either on the web sites or in the distribution.
One thing that would interest me is the development of improvements to the mesaGL drivers for Linux that would use Cg programs running on an NVIDIA card to compute some graphics calculations.
I know this does not take advantage of the card's built-in capabilities for 3D, but it would allow for the creation of a better GPL'ed driver for the NVIDIA GeForce, one suitable for distribution with GPL'ed kernels. The user could decide at a later point to switch to the closed source driver if they want to go even faster, but at least the default driver that comes with a distribution wouldn't be so crappy.
The Ro Factor - Jeep/Linux Weblog
Shades of programming the Amiga Blitter. I think Dave Haynie had Life running at 60FPS in about 1986-87 on a 68000.
I emailed the editors, but they didn't do a damn thing about it.
I had submitted an AskSlashdot on this subject:
2003-04-20 01:51:36 Using video processing as "attached processor" (askslashdot,hardware) (rejected)
But as you can see it was rejected. I was particularly interested in the use of the GPU for cryptographic functions (e.g., with a loopback encrypted filesystem), to offload the processing from the main CPU. Is anyone aware of any work in this area?
Is this even a viable implementation, or would the overhead of continually dispatching work to the GPU exceed the benefit derived?
Can You Say Linux? I Knew That You Could.
Wasn't there a Slashdot story about the slowness of reading back across the AGP bus? How will that affect the usefulness of GPUs?
I've always wondered why certain research programs (like Folding@home or SETI@home) don't use this type of code. My GPU sees more free time than my CPU, plus it would probably get the work done faster. Also, imagine the speed increase of utilizing both the GPU and the CPU to their fullest potential. Now that's some fast folding!
SIGFAULT
But what I'm really looking forward to is a physics-specific processor that sits alongside the graphics processor and is responsible for collision detection.
The last few SIGGRAPHs had numerous approaches to detecting collisions between complex volumes in real time, using only the GPU. With some minor tweaking, graphics manufacturers can make this 100x more efficient and easier to implement.
With the 'shader' languages now being able to create and modify meshes procedurally, this is the best place to detect collisions (beaming the mesh data back to your motherboard so that your local CPU can figure out what collided is not efficient).
-Malakai
A Dragon Lives in my Garage
We used the AT&T DSP32, a 12.5MFLOPS DSP, 15 years ago at Array Technologies. Programmable in native C, with multiply-accumulate (MAC) instructions optimized in microcode, the DSP32 was lightning fast at y = mx + b equations in its arithmetic logic unit (ALU), and its control logic unit (CLU) was also very fast at branching, including no-overhead looping. Linux runs on one of its many fascinating descendants, the Xilinx Virtex-2 Pro.
--
make install -not war
i don't have a Windows machine with a graphics card, and it doesn't appear to support Linux. Which is a shame. Might make a nice video encoding accelerator, no worries about integer precision for example.
yes, I am a post debian and not currently a Gentoo nut, but there are reasons why people still use Debian: its STABILITY, and the fact that their stable trees are very stable. Gentoo is great for people like you and me, obviously, who use Linux as a desktop environment and don't care much about uptime, but I've had many problems with Gentoo after updating packages, going to the extreme of spending hours on end trying to fix things that I shouldn't have to. This is something that shouldn't happen, and hence the good reason why Debian is still a great distro, just in a different way.
/me almost forgot
--
there are 10 type of people in this world, one's that repeat a stupid joke over and over, and one's that learn that it is freaking retarded and never use it... oh yah, I am funny because I spelt 10 in binary.....
Here is a Beyond3d link that has some opcode info. Look around their site for a NV30 vs R300 architecture document that has lots of great stuff. If you are looking for the best s/n ratio, Beyond3d is one of the best. All meat, little fanboyism.
Nvidia has this already!
"About Cg The Cg Language Specification is a high-level C-like graphics programming language that was developed by NVIDIA in close collaboration with Microsoft Corporation. The Cg environment consists of two components: the Cg Toolkit including the NVIDIA Cg Compiler Beta 1.0 optimized for DirectX(R) and OpenGL(R); and the NVIDIA Cg Browser, a prototyping/visualization environment with a large library of Cg shaders. Developers also have access to user documentation and a range of training classes and online materials being developed for the Cg language."
http://www.nvidia.com/object/IO_20020612_7133.htm
Ok,
Let's say I'm a glutton for punishment. Does BROOK support MULTIPLE graphics cards? I read the docs and they don't mention it explicitly.
Let's say you have a case where calculations are well parallelized and you would like to consider factoring a problem across as many GPUs as your system can hold. Let's also say that your calculations are just hard enough to keep you below bandwidth saturation even at a 10GHz P4 equivalent... multiple GPUs might be useful.
What, no screenshots???
No doubt Linux is the cream of our crop at the kernel level. I'm finding my gentoo KDE Mozilla performance still at the speeds of a similar MS setup.
I'd just like to see more efficient use of the processing power. These GHz machines should be screaming...but alas, all I hear is grinding.
Same AC as parent - happy to have discovered this
"Fixed In This Release (12/19/03) * nv30gl backend compiles and runs on Linux. Requires Linux cgc compiler from NVIDIA and the latest drivers."
Now I know what the NSA is using to crack PGP.. a whole farm of Geforce FXs and 2.6 :)
Yes.
If they are not working, then they should not post a link at the bottom of the story saying to email in any mistakes.
What about the possibility to have a kernel module doing this stuff?
I'd stack my PCI slots full of spiffy videocards and telnet to the machine with my i386.
Just got it to build using version on SourceForge. Makefile has a spurious object "ihash.o" - delete it, then gram.y barfs with an error - stick a semicolon at the end of the offending line. The cg compiler from nvidia builds OK, you need it.
Of course, a shader isn't general purpose programming. GPUs are optimized for this sort of task. If any of the standard benchmarks were recompiled for a GPU, you would see just how poorly they perform certain tasks.
Also, GPUs are designed for one-way processing, much like a DSP. If you're not aware, even a 600MHz TI DSP will DESTROY any x86 or x86-64 microprocessor when it comes to FFTs. But they would fail miserably at text parsing.
I've wondered about this very thing for a few years now. Good to see that it really was possible.
Is there a prize for the first optimized encoder for some flavor of MPEG4? Imagine ripping a DVD in one hour. Hopefully ATI users on OSX are not left behind.
If you want to compare peak performances, then do it right. P3, P4, K6-2, K7 and K8 can all do 2 single precision adds and multiplies per clock cycle when programmed carefully. This means that you only need 5 GHz in order to achieve 2GFlops.
s/2GFlops/20GFlops/
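The arithmetic behind the corrected figure is easy to check. The sketch below assumes the post means 2 adds plus 2 multiplies, i.e. 4 single-precision operations per clock cycle; that reading, and the helper names, are my assumptions rather than anything stated in the thread.

```python
def peak_gflops(clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: cycles per second times operations per cycle."""
    return clock_ghz * flops_per_cycle


def clock_needed_ghz(target_gflops, flops_per_cycle):
    """Clock rate in GHz needed to reach a target peak at a given issue rate."""
    return target_gflops / flops_per_cycle


# Assumed reading of the post: 2 adds + 2 multiplies per cycle.
FLOPS_PER_CYCLE = 2 + 2

if __name__ == "__main__":
    print(peak_gflops(5.0, FLOPS_PER_CYCLE))        # peak of a 5 GHz part
    print(clock_needed_ghz(20.0, FLOPS_PER_CYCLE))  # clock needed for 20 GFLOPS
```

Under that reading a 5 GHz CPU peaks at 20 GFLOPS, which matches the s/2GFlops/20GFlops/ correction and halves the "10 GHz Pentium 4" comparison from the story. These are peak numbers only; sustained throughput depends on feeding the units with data.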
The reason these units are fast is that they use floating point tricks. The numbers aren't very accurate and shouldn't be used for computation of real things like physics etc.
Even though general purpose CPUs approach the flop rate of GPUs, you can't feed the memory fast enough for many data intensive computations. A GPU may give you 12 or so bytes of data per cycle, where very few commodity CPU buses can do that.
Researchers at Caltech and other institutions have been looking at this for about three years. See "Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid" by Bolz, Farmer, Grinspun and Schroder (SIGGRAPH 2003), for example. The paper, illustrations, and movies are available from Dr. Grinspun's homepage. The primary problem with the approach at the time this work was done was the limited bandwidth of texture-related operations in OpenGL, caused by improper assumptions in pipeline optimization.
Joseph R. Kiniry
http://kind.ucd.ie/~kiniry/
Lecturer
UCD School of Computer Science and Informatics
Back in the late part of the eighties, my company got started by making (almost) real-time control systems based on IBM PS-2 machines running Xenix, with the CPU-intensive stuff on Artic RIC cards.
Although the RIC card was meant to be an intelligent serial communications gizmo, a lot of the higher level processing was delegated to those as well.
It seems to me that we are about to get to the other side of the everchanging wheel of "lots of chips" vs "One huge CPU". Again.
//Wegge
Earlier this year, when I was considering the options for using GPUs to do some quantum physics, I found out that there were serious bandwidth problems in the GPU drivers, and even though it was possible to send the data to a GPU for fast processing, the readback of results was awfully inefficient (so much so you lost any processing advantage). This has been solved now?
Weren't Virginia Tech's G5 supercomputer nodes all equipped with standard ATI cards? If used right, there could be 1100 more processors to use...
When will the new client be out for this platform ?
:=)
I know my PC eats 20 Watts more of power when in 3D mode, but still, I want the faster agent
Is Cg cross-platform? (i.e., can you write programs for Radeon GPUs with it?)
Then again, is Brook cross-platform?
If either is general-purpose enough, they could be used to implement routines on less expensive GPUs, such as those from SiS et al.
WTF?! they seriously have a website designated to promote terrorism against /.
Grow up.
We've talked a decent amount about doing crypto on GPUs. The fundamental issue is that such processors are massively optimized for operating on floating point numbers, and almost all crypto is integer based -- lots of bitshifts, MODs, and XORs, only the latter of which this gear handles correctly. Even if the problem with getting data back off the card was solved, the card itself couldn't do the job.
Indeed, I only know of one crypto hack that uses floats -- being from DJB, it's predictably brilliant. Basically, it's easy to compute the floating point error from a given operation, but computationally hard to find an operation that yields a given error. So you can effectively sign (or at least MAC) arbitrary content. Nice!
--Dan
open4free
C is already an efficient language....and that's what most operating systems are written in.
It's not the fault of the language how people use it, I'm sure people would be able to write big slow things just as well with Brook.
Advanced users are users too!
This cluster has 70 Playstations (one article said that they'd ordered 100, but only 70 are in the cluster... Obviously the others are being used for "research".)
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
Precision can be a problem, because GPUs and DSPs often run single-precision floating point, at least for the widest-parallelism parts like shaders. They may have some double-precision capability as well, but it's usually used for less-parallel activities like geometry crunching.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
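To see what the single-precision limit mentioned above means in practice, here is a small Python sketch that rounds values through a 32-bit IEEE 754 float using the standard `struct` module. The helper name `to_float32` is made up for the example; nothing here is specific to any GPU, and actual shader hardware may round differently from strict IEEE behaviour.

```python
import struct


def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit IEEE 754 float."""
    return struct.unpack("f", struct.pack("f", value))[0]


if __name__ == "__main__":
    # A float32 has a 24-bit significand, so integers above 2**24
    # can no longer all be represented exactly.
    big = float(2 ** 24)
    print(to_float32(big + 1.0) == big)   # the +1 is lost in single precision
    print((big + 1.0) == big)             # but kept in double precision

    # Ordinary decimal fractions also pick up more error.
    print(abs(to_float32(0.1) - 0.1))
```

For sonar samples or pixel colours this loss is harmless, but for long accumulations (large sums, iterative solvers) the error can grow, which is why the precision question keeps coming up in this thread.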
The problem with your statement is that both ATI and Nvidia use the same GPU on their CAD cards as used on their Gaming cards. The difference is either drivers(1), firmware, or a pin setting. Internally there's no difference.
(1) One reason not to open-source.
So now I can port my slow as tar software rendering engine to this and finally make my DOOM killer 3D Game a reality!
Oh wait.. never mind
-Jason
http://portal.acm.org/citation.cfm?doid=566654.566 640l n line.siggraph.org/2002/Papers/13_GraphicsHardware/ purcell.ppt
http://www.theregister.co.uk/content/54/25312.htm
http://online.cs.nps.navy.mil/DistanceEducation/o
I've never had any issues with Gentoo past the two-day install process. Even that went off without a hitch and now I have a Linux distro which runs faster than yours. Ha ha ha.
Damn you Paul Muaddib!
A typical application was to use a couple of the processors to do geometry while the rest crunched shading, or alternatively to do lots of FFTs for signal processing - the box was mainly designed for the Navy, and 32-bit floating point was more than enough precision given the A/D converters on sonar input.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
You could wonder if this could be used to improve the performance of X, by having more of it running on the GPU?
But beyond that, CPUs spend a lot of their time doing various kinds of I/O, handling interrupts, and talking to multiple coprocessors, while GPUs normally just get handed the stuff they want. Some of this work gets farmed out to other chipset members - NorthBridge memory controllers, Southbridge I/O controllers - but the CPU still ends up in charge of the process.
Also, the CPU gets to do operating system jobs like deciding which chunks of memory belong to which applications, while the GPU doesn't worry about that - everything's going to get drawn on the screen. Perhaps Trusted Computing Digital Ridiculousness Management will change this, and the graphics processors will need to start keeping track of who owns each pixel or vector so it can use the right decryption context, to prevent you from watching movies you haven't paid for, but for the moment it's ignorance and bliss.
Also, the real line is "Sorry, Kids, I need your Playstation 3 to find a cure for Grandma's cancer, go out and play soccer or something."
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
moron
6 graphics cards in parallel ? Anyone writing a new BIOS for that. And fix that slow PCI bus while you are at it :).
Would that speed up my processing ? Will I be able to play Half Life 2 on my Pentium now ?
The Internet's nature is peer to peer - 20050301_cs_profs.pdf
Someone ports a GPU Linux and some asshole loads 8 PCI cards into his machine and makes a beowulf cluster inside of one case?
"Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano
Granted, accessing main memory from a G4/G5 processor will be slower, but doesn't this sound familiar to Apple users? That said... it's a cool idea. :-)
ATI can execute up to 96 ALU instructions in a fragment program on their top-of-the-line video card. That is actually 64 native instructions. The GeForceFX can handle 1024. Think about that for a second. Let's say you want to compute 1,000,000 things in parallel, so you create a 1000x1000 floating point texture to render to. You have to fit your algorithm into just 64 instructions. And you can't make function calls (yes, DX9 HLSL has functions, but they are all "inline"). And your algorithm has to have no loops. And your algorithm has to have essentially no branches. Well, it can have "branches" that are of the form a = x >= y ? b : c; but no real branches. And you can't have static or global variables. Each pixel is executed by the same 64-instruction program. If you want a static variable to save between different programs you have to store it to another 1000x1000 texture. Which is what this system does. If you want an if(whatever){ bunch of instructions } you have to compute all those instructions and then multiply by 0 or 1 at the end in order to emulate branching. Terribly inefficient. BTW, currently no graphics hardware exists which can emulate the standard T&L pipeline with eight lights! What would be a simple loop like this: for( i = 0; i We need far more flexible video cards before this is useful. ATI's ASHLI is sort of like this. It allows you to compile long Renderman or OpenGL shaders into multiple passes so they can be executed on our current hardware like the crappy ;) 9800XT
http://www.ati.com/developer/ashli.html
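The "compute both sides, then multiply by 0 or 1" trick described above looks roughly like this when written out in Python. It is only an illustration of the idea, not shader code; `shade` and its two made-up branches are hypothetical.

```python
def shade(x, threshold):
    """Emulate `if x >= threshold: x * 2 else: x * 0.5` without a real branch."""
    # Both sides are always evaluated, as they would be on the hardware
    # described above. This is where the wasted work comes from.
    then_side = x * 2.0
    else_side = x * 0.5

    # The comparison yields a 0.0 or 1.0 mask, the equivalent of
    # a = x >= y ? 1 : 0 in the post.
    mask = float(x >= threshold)

    # Blend: exactly one of the two terms survives.
    return mask * then_side + (1.0 - mask) * else_side


if __name__ == "__main__":
    print(shade(4.0, 3.0))  # takes the "then" value
    print(shade(2.0, 3.0))  # takes the "else" value
```

The cost is the sum of both sides rather than the cost of the side actually taken, and the blend also breaks down if the discarded side produces an infinity or NaN, since multiplying those by zero does not give zero.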
This will be a lot more interesting when we get pixel shaders that can do emulation of the standard T&L pipeline.
So I have to wonder how much POVray could be sped up -- if any -- by modifying it so that suitable calculations were run on the GPU, in parallel, while the CPU took care of the rest.
Proud member of the Weirdo-American community.
Sorry, dist.net uses integer math exclusively, while GPUs use vector floating point math almost exclusively. (That, and dist.net likes a hardware 'rotate' function, which I don't know if any modern GPUs have.)
But, I'm sure you could get a few extra cycles out of a GPU. Just not as much as you'd hope.
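For reference, this is the kind of integer 'rotate' operation being talked about, sketched in Python for 32-bit words. The function name `rotl32` is just a label for the example; the point is that it is built from shifts, ORs and masking on integers, none of which map naturally onto floating point vector units.

```python
def rotl32(value, count):
    """Rotate a 32-bit unsigned integer left by `count` bits."""
    count %= 32
    value &= 0xFFFFFFFF
    # Bits shifted out on the left re-enter on the right.
    return ((value << count) | (value >> (32 - count))) & 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(rotl32(0x12345678, 8)))
    print(hex(rotl32(0x80000001, 1)))
```

On a CPU this is a single instruction; emulating it with float multiplies and divisions costs many operations and runs into the precision limit of a float's significand.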
Another non-functioning site was "uncertainty.microsoft.com."
The purpose of that site was not known.
Currently, the problem with this is AGP. AGP is meant for downstream data (cpu/memory -> agp -> video card) but is miserable at upstream data.
On the other hand, PCI-X or PCI-Express will solve this problem with very fast upstream AND downstream data paths. So your output would be easily sent back to the CPU for data handling.
With this approach, you could optimize your code to run integer math on the main CPU, and export floats to the graphics card(s).
Also, with PCI+?, additional graphics cards could be used as accelerator cards to handle additional math that the primary graphics card can't handle well, and improve performance on the primary card through some sort of scan line interleaving or something. The fact that the upstream data path is so fast makes this possible without proprietary tech like the Voodoo2 cards used.
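The upstream/downstream trade-off above can be put into a back-of-the-envelope model: offloading only helps if GPU compute time plus both transfers is less than the CPU time. The function below is a hypothetical sketch of that model; all parameter names are invented, and the numbers in the demo are made up purely to show the shape of the problem, not measured from any bus or card.

```python
def offload_pays_off(cpu_seconds, gpu_seconds,
                     bytes_up, bytes_down,
                     to_card_bytes_per_s, from_card_bytes_per_s):
    """Return True when GPU compute plus transfers beats the CPU alone."""
    transfer = (bytes_up / to_card_bytes_per_s
                + bytes_down / from_card_bytes_per_s)
    return gpu_seconds + transfer < cpu_seconds


if __name__ == "__main__":
    # Illustrative, invented numbers: same job, same upload speed,
    # only the readback speed differs.
    job = dict(cpu_seconds=1.0, gpu_seconds=0.1,
               bytes_up=1e8, bytes_down=1e8,
               to_card_bytes_per_s=1e9)
    print(offload_pays_off(from_card_bytes_per_s=1e8, **job))  # slow readback
    print(offload_pays_off(from_card_bytes_per_s=1e9, **job))  # fast readback
```

With an asymmetric bus the readback term can dominate, which is also why workloads that send a lot of data in and get a small answer out are the better fit.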
We got our first boards from the developers of an antiaircraft RADAR signature decoder/sight. We wound up using DSP32Cs, 25MFLOPS as I recall, by late 1990. We had an EISA card (PCI was in the future) with an FPGA for linearly scalable pluggable DSPs. We had experimented with a transputer, but found we could use the DSPs to preprocess the video sensor data during calibration, and load custom logic and buses into the FPGAs for maximum efficiency routing the data. When the company folded and reformed, the technology had evolved into a general-purpose FPGA image processor, with scalable utility DSPs embedded in the hyperarray of FPGAs. The lead engineer went to Xilinx, which has consistently produced the most advanced FPGAs since then, including the Virtex-II line with embedded RISCs (PPCs).
One DSP SW engineer I worked with at Array had come from the academic computational music world. He had hooked each DSP32's six parallel ports to the other members of a cube topology, with buses around the surface of the cube. The buses were connected to actual I/O buffers. Some of those were connected to input controls, like sensors on big cans, or tuned monochords, or hard rubber blocks. Outputs from the cube were hooked to output actuators, like solenoids strapped to gongs, motorized clappers against barrels, and rows of mallets aimed at piano keys. Musicians would bang their parts out on the inputs, with computed rhythms and "pitches" spewed out of the actuators. Keyboard/monitor stations allowed musicians on the parallel network to sample parametrized rhythms, sequences, timbres and other values in realtime from other musicians.
The whole thing was totally insane, but then we were a Silicon Valley company in Oakland during the last recession that recruited exclusively musicians, philosophers, exhippie mathematicians, and yours truly (college dropout) as their mascot, for an imageprocessing startup. I've never been the same since, and the industry has yet to catch up with any of us.
--
make install -not war
It's not good to use the currently available GPUs for computation tasks. Since the GPU specs change all the time and there is no standard (especially when it comes to shaders), it will create all sorts of nightmarish problems concerning OSes, especially open source ones.
It is the ideas behind the GPU that must move down to CPU; mainly the vector unit and the high bandwidth. Remember that the PowerPC with the Altivec extensions is a very impressive CPU; also remember that the Playstation 2 can do 48 GB / sec data transfer. The PC needs a really fast bus and lots of custom hardware for most parallelisable tasks.
I'd love to see an FFT implementation (maybe it's not so hard ... will have to download and play with it.)
A paper on that very subject was presented at Graphics Hardware 2003. You should be able to find it here
Neural nets on the GPU might be a better application. Lots of data going into the net. Lots of computation within the net. A small amount of data (the answer) leaving the net.
If you actually read anything you would have noticed that modern GPUs are indeed Turing complete, and many, many mathematical operations can be performed with great parallel efficiency.
Music speeds up when you yawn, but does not change pitch.
Interesting, at least since a GPU is really a sort of DSP (Digital Signal Processor). And as I am deeply into both audio and broadband signal processing hardware systems development, I find using those chips on high performance video cards to be extremely useful for processing waveforms using the basis of all of it: matrix calculations. It allows FFT and iFFT (and of course DFTs, DCT and so on), as well as QMF and PQF filtering and synthesis.
I could dream of making a hi-fi vocoder out of a video card - crazy but interesting! =)
On the other hand, I am a little bit sceptical about all this, as it will not work even at half or a third of its GFLOPS performance when not used for the "native application". This means it could, after some time and hellish effort, show that the "PAIN vs BENEFIT" ratio falls more and more to the pain side.
I remember the times I tried to use the 6510 CPU + 8 KB (don't remember exactly) of RAM inside the C64 disk drive to process graphics in parallel with the main processor. Efforts fell
And from the third point of view -- see how Intel processors suck (~flamebait;) - "price>performance" like always. Any small embedded chip outrivals it.
p.s.
Still hold on for the coming of the FPGA
Your music application sounds like fun. I didn't know anybody was still doing anything quite like that by 1990 - there was a whole range of people around John Cage's time who did lots of prepared piano stuff.
Some of the people who were trying to sell our multi-processor supercomputer flavor came up with a music studio application, doing lots of audio processing and mixing, sort of like your device turned inside out. Don't know if they sold more than one of them before the Lucent spinoff took them away.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
I'd think that the obvious applications that would benefit from such techniques would be in renderfarms. Perhaps it would be in the best interest of Weta Digital, Pixar, and ILM to invest some money into this project. After all, it would be much cheaper to acquire a bunch of used PCI-based Voodoo and GeForce cards and load multiple units into their PCs than to just increase the number of PCs sitting next to each other; they'd save money in terms of electricity consumption as well. There is a precedent after all; I'm thinking about the pooling of resources that Paramount, Fox, and Disney did a few months ago in getting Adobe Photoshop and other mission-critical Windows programs to run on Linux...
"Right now, somewhere in this world, Scott Baio is plowing a woman he doesn't love," - Peter Griffin, *Family Guy*
One major difference is that GPUs were built using whatever architecture they cared to use, and the GPU's entire architecture can be turned upside-down in every version if the hardware manufacturer cares to. There's a powerful preprocessor - your CPU - running on the nVidia or ATI drivers to ensure that what goes in is right, and optimal.
This idea isn't new to CPU manufacturers. This is exactly what Intel tried to accomplish with Itanium by doing the preprocessing at compile-time by the software distributor, and sure enough, Itanium blows the P4's doors off.
It is telling though that nVidia's entry into the motherboard domain drastically increased system memory bandwidth, something they've needed to focus on in the "machine in an AGP card" videocards they've made.
It would be interesting to compare the performance of an FX GPU running a shader, and an Itanium on a crossbar memory controller running the same. I bet it would be comparable, and in some cases the FX GPU would lose.
It would be fun to see the SETI@Home screensaver doing its own math. A GPU is basically a DSP, right?
This is not my sandwich.
I'm willing to bet that before long we will be seeing video drivers that essentially upload a frame-decrypter to the video card, which will unscramble and display streaming video there. Your computer itself will never know what it's displaying.
This is not my sandwich.
if you dont shut up, we'll make a website to promote terrorism against you!
Setting the base of GPUs at the Nvidia NV30 level is excluding way too many mainstream videocards that aren't currently being used. If you are a gamer, ask yourself how many videocards you have cluttering your home because you are constantly upgrading to the next best card that hit the market. I myself could spare a couple of Voodoo1 cards, two Voodoo3 cards, a TNT2, and a couple of GeForce2's.
The base of this project should be something like the 3dfx Voodoo1 or Nvidia's Riva128. While you can no longer count on any updated drivers for Windows XP for these models, they surely have suitable Linux drivers. You could mount multiple PCI versions into a single PC (obviously, it would probably be best if they were all the same cards). Now that's the way to get some extra performance for distributed computing projects...
Disclaimer: Yes, anything below a Voodoo3 (on the 3dfx front) would have issues with OpenGL because 3dfx used a mini-GL driver, since at that time they still favored their own Glide format over both OpenGL and DirectX.
"Right now, somewhere in this world, Scott Baio is plowing a woman he doesn't love," - Peter Griffin, *Family Guy*
Matt...
Save the Bottom Line
but alas, all I hear is grinding.
If you can hear rapid clicking noises whenever your computer is doing something, then your computer is swapping data to and from the hard drive. In that case, you either need more RAM or you shouldn't try to run so many programs at once.
...a compiler and runtime system that provides an easy, C-like programming environment...
Well, which is it - easy or C-like?
My favorite was the IBM/Tecmar M-ACPA sound card. It sold for $495, at a 10 MHz clock (the TI TMS320C25 DSP), it was pipelined, and if you ordered instructions to take advantage of the pipeline, it could spit out a multiply-accumulate once for every clock cycle. It also came with no software to speak of, even to record and play sound, and the A/D was clocked at a fixed 44.1 kHz and the D/A at a fixed 88.2 kHz sampling rate. Unlike the Motorola DSP in competing products, the C25 could be halted, and the PC could read and write DSP memory through I/O ports whether the thing was halted or running, so you could just dump hex programs into the thing without having to bring up a bootstrap loader on the DSP, and you could start, halt, and inspect for debugging purposes, and it could generate an "A/D buffer full" interrupt for continuous A/D operation.
IBM supplied sample code where they used some simple macros in the PC macro assembler to generate op codes for the C25 DSP and wrote a simple PC-side loader in 8086 assembler. This was fairly easy to do because the C25 instructions are all fixed 16-bit words, and while the hacked-up assembler was all fixed address with no relocation, it all worked pretty reliably because you had complete control over all aspects of the software and hardware -- TI has some software tools for their TMS320CXX eval boards that were a POS because they were buggy as heck, but this setup was the ultimate in Keep it Simple and Stupid.
Digging into Crochiere and Rabiner on something called polyphase multirate FIR filter design, I had a bank of filters that upsampled and downsampled to support rates of 10, 11, 16, 20, and 22 kHz to match existing digitized speech files (the TI-MIT-NIST-DARPA speech database was sampled at 16, I had stuff sampled at 20). I also had subroutines for the card for real-valued forwards and inverse FFT, FIR filter, and second-order-section cascade IIR filter, and the whole wad of software fit in DSP memory at absolute addresses, and I had a little Turbo Pascal interface library to the whole thing, and I was king of the world.
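For anyone who has not met the polyphase trick, here is a small sketch in plain Python of the decimation half of it. This is not the code that ran on the card; the function names, the integer taps and the decimation factor are assumptions picked so the two versions can be compared exactly. The idea is that when you are going to throw away all but every m-th output, you can split the filter taps into m phases and only compute the outputs you keep.

```python
def fir_then_downsample(x, h, m):
    """Reference version: run the FIR at the full input rate,
    then keep every m-th output sample."""
    full = [
        sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))
        for n in range(len(x))
    ]
    return full[::m]

def polyphase_decimate(x, h, m):
    """Polyphase version: split the taps into m phases
    (h[p], h[p+m], h[p+2m], ...) and compute only the kept outputs."""
    n_out = (len(x) + m - 1) // m  # same length as full[::m]
    out = []
    for n in range(n_out):
        acc = 0
        for p in range(m):
            for j, tap in enumerate(h[p::m]):
                idx = (n - j) * m - p  # input sample that tap p + j*m touches
                if 0 <= idx < len(x):
                    acc += tap * x[idx]
        out.append(acc)
    return out

# Hypothetical test data: integer samples and taps, decimate by 3.
samples = list(range(1, 13))
taps = [1, 2, 3, 2, 1]
slow = fir_then_downsample(samples, taps, 3)
fast = polyphase_decimate(samples, taps, 3)
```

Both return the same samples, but the polyphase form does roughly 1/m of the multiply-accumulates, which matters a lot when one multiply-accumulate per clock is all the hardware gives you. Rational rate changes (say 16 kHz to 20 kHz) combine this with an interpolating stage, which is not shown here.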
Guess what. The M-ACPA card kind of went by the wayside by the mid 1990's when Windows 95 came along: there were much cheaper Windows sound cards and IBM never could get a Windows driver for the darned thing that didn't leave gaps (sound clicks) with every interrupt cycle. And the Pentium came along which just blew the thing away speed wise, and the DSP library could be written in C or other compiler language, portable to any other computer, and using floating point so one could stop worrying about scaling and overflow and fixed-point roundoff.
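A quick sketch of what "worrying about scaling and overflow and fixed-point roundoff" looks like, in plain Python. It models a generic Q15 format with a hypothetical 16-bit wrapping accumulator; that is an assumption chosen to keep the example short, not a description of the C25's actual accumulator, and all the helper names are made up.

```python
Q15_ONE = 1 << 15  # 1.0 in Q15 would be 32768, one past the largest value

def to_q15(x):
    """Convert a float in [-1, 1) to a Q15 integer, saturating at the ends."""
    return max(-Q15_ONE, min(Q15_ONE - 1, int(round(x * Q15_ONE))))

def from_q15(v):
    """Convert a Q15 integer back to a float."""
    return v / Q15_ONE

def q15_mul(a, b):
    """Q15 multiply: full product, then shift back down (truncating)."""
    return (a * b) >> 15

def wrap16(v):
    """Wrap to signed 16 bits, as an unsaturated 16-bit add would."""
    return ((v + Q15_ONE) & 0xFFFF) - Q15_ONE

a = to_q15(0.75)

# Overflow: 0.75 + 0.75 does not fit in [-1, 1), so the sum wraps around.
wrapped = from_q15(wrap16(a + a))

# Scaling: halve the inputs first and remember the result is off by 2x.
scaled = from_q15(wrap16((a >> 1) + (a >> 1))) * 2

# Roundoff: 0.1 * 0.1 is only approximately 0.01 after quantization.
product = from_q15(q15_mul(to_q15(0.1), to_q15(0.1)))
```

In floating point all three of these just work, which is a big part of why the C version on a Pentium was so much less painful to maintain.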
For all my work on the DSP library, I think I got at most about 3 years useful life out of it, and besides, I had the versioning problem of deciding who had one of these M-ACPA's installed and who didn't and reverting to a non-ACPA library for those who didn't (remember when the 8087 was optional and the pain that caused?). After that experience, I don't want to touch another array processor/DSP/GPU whatever: I am going to program whatever CPU is available in whatever compiler is available, and I am not going to mess in assembler with any strange instruction set enhancement promoted by disco dancers in clean room suits. I am just going to sit back and wait for Moore's law and wait for the CPU and compiler to catch up.