BrookGPU: General Purpose Programming on GPUs
An anonymous reader writes "
BrookGPU is a compiler and runtime system that provides an easy, C-like programming environment (read: no GPU programming experience needed) for today's GPUs. A shader program running on the NVIDIA GeForce FX 5900 Ultra achieves over 20 GFLOPS, roughly equivalent to a 10 GHz Pentium 4. Combine this with the increased memory bandwidth, 25.3 GB/sec peak compared to the Pentium 4's 5.96 GB/sec peak, and you've got a seriously fast compute engine, but programming GPUs has been a real pain. BrookGPU adds simple data-parallel extensions to C that let programmers specify which parts of their code run on the GPU. The compiler and runtime take care of the rest. Here is the Project Page and Sourceforge page."
I suspect that this high performance is only attainable for the field the GPU is specialized for, i.e. graphics-related things. Or isn't it?
What kind of instructions does the GPU actually accept?
I mean, you probably can't run just any kind of algorithm on there, can you?
The path I walk alone is endlessly long.
30 minutes by bike, 15 by bus.
I wonder how long till we see a (insert worthwhile cause here)-At-Home client that supports this?
... can you say 'software synthesists' wet dream?
... $5 to the first person to use Brook to make a synthesizer. :)
Oh, suddenly, that 'game investment' also gives you a few 100 extra voices of polyphony?
Sweet
; -- the corruption of government starts with its secrets. a truly free people keep no secrets. --
but the link to the project page is correct.
Reminds me of the good old days when you used the processors in the C64 tapedrive to compute stuff. Wouldn't want to waste those precious cycles.
:)
I'm sure a lot of old farts will tell me how they used some serial controller to compute stuff back in the 60's and that I'm just a little kid.
A shader program running on the NVIDIA GeForce FX 5900 Ultra achieves over 20 GFLOPS, roughly equivalent to a 10 GHz Pentium 4.
Wait, if there is a technology that allows construction of a GPU that is 3 times faster than the fastest CPUs, why don't Intel and AMD use this technology to build CPUs that are 3 times faster?
Are you sure that you can compare the speed of a GPU and a CPU?
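For what it's worth, the "3 times" figure can be reproduced from the numbers in the submission. The sketch below only redoes that arithmetic; the 3.2 GHz clock for the fastest Pentium 4 is an assumption that does not appear in the story, and peak figures like these say nothing about sustained performance on real code.

```python
# Figures quoted in the submission.
gpu_gflops = 20.0      # GeForce FX 5900 Ultra shader throughput
p4_equiv_ghz = 10.0    # "roughly equivalent to a 10 GHz Pentium 4"
gpu_bw_gb_s = 25.3     # GPU peak memory bandwidth
p4_bw_gb_s = 5.96      # Pentium 4 peak memory bandwidth

# The comparison implies this many floating-point operations per P4 clock.
flops_per_cycle = gpu_gflops / p4_equiv_ghz

# ASSUMPTION (not in the story): the fastest shipping Pentium 4 runs at 3.2 GHz.
p4_clock_ghz = 3.2
p4_gflops = flops_per_cycle * p4_clock_ghz

speedup = gpu_gflops / p4_gflops
bw_ratio = gpu_bw_gb_s / p4_bw_gb_s

print(f"implied P4 flops per cycle: {flops_per_cycle:.1f}")
print(f"GPU vs assumed 3.2 GHz P4 (peak flops): {speedup:.2f}x")
print(f"GPU vs P4 (peak bandwidth): {bw_ratio:.2f}x")
```

So the comparison is really a peak-throughput ratio on the kind of arithmetic a shader does, not a statement that a GPU would run arbitrary CPU code three times faster.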
#
#\ @ ? Colonize Mars
#
I'm completely new to meddling with graphics cards, so apologies if this is a silly question: when programs using the GPU for arbitrary calculations are running, does the screen go weird, or is there a way of stopping the output from being displayed? A screenful of junk might not matter to a scientist leaving their computer to crunch numbers for a few months, but it wouldn't be good for a general-purpose program.
"'I pass the test,' she said. 'I will diminish, and go into the West, and remain Galadriel.'"
- JRR Tolkien.
It would seem to me that the GPU is not going to be as general-purpose as the CPU, but could still attain the high mathematical throughput with vector-oriented processing.
Doing string searches, complex logic analyses, etc. would probably suck, but big data manipulations, such as SETI-style wave transformations, molecular analysis, etc., might be able to take advantage of them.
Design for Use, not Construction!
"Brook is an extension of standard ANSI C and is designed to incorporate the ideas of data parallel computing and arithmetic intensity into a familiar, efficient language" I'll qualify that this is the first I've heard of Brook, however, the words: "efficient language", ring loud in my ears. If Operating systems were programmed as such, imagine how fast bootup and operation of a computer would be. Instead we have bloated software on all sides of the board, that can barely show us the differences of MHz to GHz machines. What century are we in again?
After taking a quick peek at the language part of the project, it seems that right now most of its functions are about sets of data and how to move them around.
Makes sense, of course, as that is what a GPU is all about. (Yes, I'm vastly oversimplifying here.) So I would gather that it might be used for types of data that are streamed a lot? Maybe video editing, real-time video, etc., where you're trying to move a lot of data around at once rather than just store it or perform more complicated functions on it.
However, I'm no 3D programmer and I sure would love a more detailed analysis of the potential of this.
Really, I know what I'm doing...Ohhhh, look at the shiny buttons!
I'd love to see an FFT implementation (maybe it's not so hard ... will have to download and play with it.)
A lot of scientific code is constrained by how fast you can do an FFT, perhaps of arbitrary size. And a fast graphics card is a lot cheaper than a high-end processor.
For embarrassingly parallel vector problems, this is just the sort of thing for cheap, powerful clusters based around a cheap PC and a fast GPU.
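To make the FFT point concrete, here is a minimal radix-2 FFT sketch in plain Python. It is not taken from BrookGPU and is not how a GPU implementation would be laid out; it only shows that each stage applies the same butterfly to independent pairs of elements, which is the data-parallel structure the comment is counting on. It assumes the input length is a power of two.

```python
import cmath

def fft(values):
    """Recursive radix-2 FFT. Assumes len(values) is a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    # Every iteration of this loop is independent of the others,
    # so a data-parallel machine could run all butterflies of a stage at once.
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

impulse_spectrum = fft([1, 0, 0, 0])
constant_spectrum = fft([1, 1, 1, 1])
print(impulse_spectrum)
print(constant_spectrum)
```

An impulse transforms to a flat spectrum and a constant signal to a single DC term, which is a quick sanity check for any implementation.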
Brook is an extension of standard ANSI C and is designed to incorporate the ideas of data parallel computing and arithmetic intensity into a familiar and efficient language. The general computational model, referred to as streaming, provides two main benefits over conventional languages:
- Data Parallelism: Allows the programmer to specify how to perform the same operations in parallel on different data.
- Arithmetic Intensity: Encourages programmers to specify operations on data which minimize global communication and maximize localized computation.
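A rough illustration of the two ideas above, written in plain Python rather than Brook (this is not Brook syntax, and the names `saxpy_kernel` and `run_kernel` are made up for the example). The kernel sees one element of each input stream at a time and has no shared state, so every element could in principle be processed in parallel. The intensity figure assumes 4-byte floats with two reads and one write per element.

```python
def saxpy_kernel(a, x, y):
    """Per-element kernel: depends only on its own inputs, no global state."""
    return a * x + y

def run_kernel(kernel, a, xs, ys):
    # A sequential loop stands in for the hardware here; on a GPU each
    # element could be handled by a different processing unit at once.
    return [kernel(a, x, y) for x, y in zip(xs, ys)]

result = run_kernel(saxpy_kernel, 2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])

# Arithmetic intensity: operations performed per byte moved.
# ASSUMPTION: 4-byte floats, two reads and one write per element.
flops_per_element = 2            # one multiply, one add
bytes_per_element = 3 * 4
intensity = flops_per_element / bytes_per_element

print(result)
print(f"flops per byte: {intensity:.3f}")
```

A kernel this simple has low arithmetic intensity, so it would be limited by memory traffic; doing more computation per element fetched is what the second bullet encourages.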
More about Brook can be found at the Merrimac web site, which contains a complete specification of the language. The BrookGPU compilation and runtime architecture consists of two components. BRCC, the BrookGPU compiler, is a source-to-source metacompiler which translates Brook source files (.br) into
The BRT is an architecture-independent software layer which implements the backend support of the Brook primitives for particular hardware. The BRT is a class library which presents a generic interface for the compiler to use. The implementation of the class methods is customized for each hardware platform supported by the system. The backend implementation is chosen at runtime based on the hardware available on the system or at the request of the user. The backends include: DirectX9, OpenGL ARB, NVIDIA NV3x, and C++ reference.
...but I assume that in any advanced texturing/shading/bump mapping/other GFX rendering, you apply all the different effects and, when you're done, explicitly request that the frame be displayed on screen. (That's why your FPS != your monitor refresh rate.)
I would assume that this program simply never calls the drawing function, but instead reads the results back from the GPU. The normal screen should be able to run in the meantime (I assume you can e.g. build a 3D environment while showing a 2D cutscene), so I would think you can have a plain GUI, as long as it doesn't need to use anything advanced.
Kjella
Live today, because you never know what tomorrow brings
www.gpgpu.org
Very cool. Vector/Graphics processors could one day overtake General processors. They are way more energy efficient too.
1) Each character would have its own shader program.
2) You would set the shader program, draw a rectangle, and the character would appear.
3) The shader programs would be automatically generated by processing TrueType files.
To implement:
1) Break the TrueType outline up into a number of convex curve segments.
2) Each of these curve segments would be represented as a set of constants in the shader program.
3) For each pixel, test a line from pixel to an edge.
4) If the number of segments crossed is odd the pixel is black else white.
The algorithm can be refined to add antialiasing and hinting.
What you end up with is text that is clear at any resolution. The size of the text is controlled by the rectangle you draw it in. The text can also be clearly rotated and sheared.
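The crossing test in steps 3 and 4 is the usual even-odd rule. Below is a small CPU-side sketch of it in Python, not a shader. It makes one simplifying assumption that is not in the proposal: the outline is a single contour already flattened into straight line segments, whereas the proposal works with curve segments. The function names and the square test outline are invented for the example.

```python
def inside(px, py, outline):
    """Even-odd rule: cast a ray to the right of (px, py) and count crossings."""
    crossings = 0
    n = len(outline)
    for i in range(n):
        x0, y0 = outline[i]
        x1, y1 = outline[(i + 1) % n]
        # Only segments that straddle the ray's height can cross it.
        if (y0 > py) != (y1 > py):
            x_hit = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
            if x_hit > px:
                crossings += 1
    return crossings % 2 == 1   # odd number of crossings means inside

def rasterize(outline, width, height):
    """Sample every pixel centre independently, as a per-pixel shader would."""
    return [
        "".join("#" if inside(col + 0.5, row + 0.5, outline) else "."
                for col in range(width))
        for row in range(height)
    ]

# Hypothetical outline: a square standing in for a glyph contour.
square = [(1, 1), (5, 1), (5, 5), (1, 5)]
rows = rasterize(square, 6, 6)
print("\n".join(rows))
```

Because each pixel is decided without looking at its neighbours, the same test maps naturally onto one shader invocation per pixel; scaling the outline coordinates is all it takes to render at a different size.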
An obvious optimization is to get the GPU vendors to add a shader instruction that calculates which side of a Bézier curve segment the current point lies on.
While not important for games, drawing text is critical for desktops. And we all know about the current trend toward drawing desktops with 3D hardware.
This looks like a straightforward and clean extension that experienced C/C++ programmers won't find difficult to learn, but it isn't entirely clear to me whether just using this language, without any knowledge of GPU architecture, will lead to big improvements in performance. Granted, you don't need to know the details, but you've got to have an idea of what it is that you're trying to do and in a general way how the special constructs of the language allow you to do that. As with other such language extensions, you can nominally write in the language but not really use the extensions (how many "C++" programs have you seen that were really C programs with // comments and a few couts?) or use them in unintended ways that prevent the intended optimization. It seems to me that if the project really is aiming at programmers who are not familiar with GPUs, they need at least to provide a brief introduction to the special properties of GPU architecture and some guidelines as to how to use the features of the language to take advantage of them. At present I don't find this either on the web sites or in the distribution.
I had submitted an AskSlashdot on this subject:
2003-04-20 01:51:36 Using video processing as "attached processor" (askslashdot,hardware) (rejected)
But as you can see it was rejected. I was particularly interested in the use of the GPU for cryptographic functions (e.g., with a loopback encrypted filesystem), to offload the processing from the main CPU. Is anyone aware of any work in this area?
Is this even a viable implementation, or would the overhead of continually dispatching work to the GPU exceed the benefit derived?
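One way to frame the overhead question is a simple break-even model: offloading only pays if dispatch, upload, GPU compute, and readback together take less time than the CPU doing the work itself. The sketch below is that model and nothing more; the function name and every throughput number in it are hypothetical placeholders, not measurements of any real card or bus.

```python
def offload_times(n_bytes, cpu_rate, gpu_rate, upload_rate, readback_rate,
                  dispatch_overhead_s):
    """Return (cpu_seconds, gpu_seconds) for processing n_bytes of data.

    All rates are in bytes per second.
    """
    cpu_seconds = n_bytes / cpu_rate
    gpu_seconds = (dispatch_overhead_s
                   + n_bytes / upload_rate      # host -> card
                   + n_bytes / gpu_rate         # compute on the card
                   + n_bytes / readback_rate)   # card -> host
    return cpu_seconds, gpu_seconds

# HYPOTHETICAL numbers, chosen only to show both outcomes.
block = 1_000_000
slow_readback = offload_times(block, 100e6, 1e9, 1e9, 50e6, 0.001)
fast_readback = offload_times(block, 100e6, 1e9, 1e9, 1e9, 0.001)

for label, (cpu_s, gpu_s) in (("slow readback", slow_readback),
                              ("fast readback", fast_readback)):
    verdict = "offload wins" if gpu_s < cpu_s else "CPU wins"
    print(f"{label}: cpu={cpu_s:.4f}s gpu={gpu_s:.4f}s -> {verdict}")
```

With these made-up figures, a GPU that is ten times faster at the computation still loses once the readback path is slow, so for something like an encrypted filesystem the answer would depend heavily on the return bandwidth and on how large each dispatched block is.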
Can You Say Linux? I Knew That You Could.
Wasn't there a Slashdot story about the slowness of reading back across the AGP bus? How will that affect the usefulness of GPUs?
I've always wondered why certain research programs (like Folding@home or SETI@home) don't use this type of code. My GPU sees more free time than my CPU, plus it would probably get the work done faster. Also, imagine the speed increase from utilizing both the GPU and the CPU to their fullest potential. Now that's some fast folding!
SIGFAULT
But what I'm really looking forward to is a physics-specific processor that sits alongside the graphics processor and is responsible for collision detection.
The last few SIGGRAPHs had numerous approaches that detect collisions between complex volumes in real time using only the GPU. With some minor tweaking, graphics manufacturers can make this 100x more efficient and easier to implement.
With the 'shader' languages now able to create and modify meshes procedurally, this is the best place to detect collisions (sending the mesh data back to your motherboard so that your local CPU can figure out what collided is not efficient).
-Malakai
A Dragon Lives in my Garage
We used the AT&T DSP32, a 12.5 MFLOPS DSP, 15 years ago at Array Technologies. Programmable in native C, with multiply-accumulate (MAC) instructions optimized in microcode, the DSP32 was lightning fast at y = mx + b equations in its arithmetic logic unit (ALU), and its control logic unit (CLU) was also very fast at branching, including no-overhead looping. Linux runs on one of its many fascinating descendants, the Xilinx Virtex-II Pro.
--
make install -not war
Here is a Beyond3d link that has some opcode info. Look around their site for a NV30 vs R300 architecture document that has lots of great stuff. If you are looking for the best s/n ratio, Beyond3d is one of the best. All meat, little fanboyism.
Nvidia has this already!
"About Cg The Cg Language Specification is a high-level C-like graphics programming language that was developed by NVIDIA in close collaboration with Microsoft Corporation. The Cg environment consists of two components: the Cg Toolkit including the NVIDIA Cg Compiler Beta 1.0 optimized for DirectX(R) and OpenGL(R); and the NVIDIA Cg Browser, a prototyping/visualization environment with a large library of Cg shaders. Developers also have access to user documentation and a range of training classes and online materials being developed for the Cg language."
http://www.nvidia.com/object/IO_20020612_7133.htm
PCI-X can fix this data bus in other ways as well. Motherboards come with one AGP slot, but PCI-X can and will provide many expansion slots.
Picture five high end GPUs on the motherboard eclipsing the single high-end cpu for a fraction of the price. Intel and AMD would be forced to cut the asking price of their products to compete. We could finally see some real four-way competition for "processors".
TW
Even though general-purpose CPUs approach the flop rate of GPUs, you can't feed them data from memory fast enough for many data-intensive computations. A GPU may give you 12 or so bytes of data per cycle, which very few commodity CPU buses can match.
Researchers at Caltech and other institutions have been looking at this for about three years. See "Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid" by Bolz, Farmer, Grinspun and Schröder (SIGGRAPH 2003), for example. The paper, illustrations, and movies are available from Dr. Grinspun's homepage. The primary problem with the approach at the time this work was done was the limited bandwidth of texture-related operations in OpenGL, which stemmed from improper assumptions in pipeline optimization.
Joseph R. Kiniry
http://kind.ucd.ie/~kiniry/
Lecturer
UCD School of Computer Science and Informatics
Weren't Virginia Tech's G5 supercomputer nodes all equipped with standard ATI cards? If used right, there could be 1100 more processors to use...
When will the new client be out for this platform ?
:=)
I know my PC draws 20 watts more power when in 3D mode, but still, I want the faster agent.
We've talked a decent amount about doing crypto on GPUs. The fundamental issue is that such processors are massively optimized for operating on floating point numbers, and almost all crypto is integer based -- lots of bitshifts, MODs, and XORs, only the latter of which this gear handles correctly. Even if the problem with getting data back off the card was solved, the card itself couldn't do the job.
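A small demonstration of why float-only hardware is awkward for integer crypto, sketched in Python on the CPU. It assumes the card stores values as 32-bit IEEE 754 single-precision floats, which is an assumption about the hardware rather than something stated above; a format with fewer mantissa bits would be stricter still. The helper `to_float32` is made up for the example and just round-trips a number through single precision.

```python
import struct

def to_float32(value):
    """Round a number to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]

# A single-precision float has a 24-bit significand, so integers are
# only represented exactly up to 2**24.
limit = 2 ** 24
limit_ok = int(to_float32(limit)) == limit
next_ok = int(to_float32(limit + 1)) == limit + 1

# A typical 32-bit word from a cipher state does not survive the round trip.
word = 0xDEADBEEF
word_ok = int(to_float32(word)) == word

print(f"2**24 exact:      {limit_ok}")
print(f"2**24 + 1 exact:  {next_ok}")
print(f"0xDEADBEEF exact: {word_ok}")
```

Anything that depends on every bit of a 32-bit word, such as a modular addition, a rotate, or an XOR, would therefore have to be split into smaller pieces before float arithmetic could carry it exactly.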
Indeed, I only know of one crypto hack that uses floats -- being from DJB, it's predictably brilliant. Basically, it's easy to compute the floating point error from a given operation, but computationally hard to find an operation that yields a given error. So you can effectively sign (or at least MAC) arbitrary content. Nice!
--Dan
This cluster has 70 Playstations (one article said that they'd ordered 100, but only 70 are in the cluster... Obviously the others are being used for "research".)
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
So now I can port my slow as tar software rendering engine to this and finally make my DOOM killer 3D Game a reality!
Oh wait.. never mind
-Jason
I thought the real reason to get a *professional level* card is to get a guarantee of reliability
Well, ISV certification - a CAD vendor will assert "with this card, our software produces no rendering artifacts".
http://portal.acm.org/citation.cfm?doid=566654.566 640l n line.siggraph.org/2002/Papers/13_GraphicsHardware/ purcell.ppt
http://www.theregister.co.uk/content/54/25312.htm
http://online.cs.nps.navy.mil/DistanceEducation/o
A typical application was to use a couple of the processors to do geometry while the rest crunched shading, or alternatively to do lots of FFTs for signal processing - the box was mainly designed for the Navy, and 32-bit floating point was more than enough precision given the A/D converters on sonar input.
Bill Stewart
Would that speed up my processing? Will I be able to play Half-Life 2 on my Pentium now?
The Internet's nature is peer to peer - 20050301_cs_profs.pdf
Someone ports Linux to a GPU and some asshole loads 8 PCI cards into his machine and makes a beowulf cluster inside of one case?
"Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano
So I have to wonder how much POVray could be sped up -- if any -- by modifying it so that suitable calculations were run on the GPU, in parallel, while the CPU took care of the rest.
Proud member of the Weirdo-American community.
We got our first boards from the developers of an antiaircraft radar signature decoder/sight. We wound up using DSP32Cs, 25 MFLOPS as I recall, by late 1990. We had an EISA card (PCI was in the future) with an FPGA for linearly scalable pluggable DSPs. We had experimented with a transputer, but found we could use the DSPs to preprocess the video sensor data during calibration, and load custom logic and buses into the FPGAs for maximum efficiency in routing the data. When the company folded and reformed, the technology had evolved into a general-purpose FPGA image processor, with scalable utility DSPs embedded in the hyperarray of FPGAs. The lead engineer went to Xilinx, which has consistently produced the most advanced FPGAs since then, including the Virtex-II line with embedded RISCs (PPCs).
One DSP SW engineer I worked with at Array had come from the academic computational music world. He had hooked each DSP32's six parallel ports to the other members of a cube topology, with buses around the surface of the cube. The buses were connected to actual I/O buffers. Some of those were connected to input controls, like sensors on big cans, or tuned monochords, or hard rubber blocks. Outputs from the cube were hooked to output actuators, like solenoids strapped to gongs, motorized clappers against barrels, and rows of mallets aimed at piano keys. Musicians would bang their parts out on the inputs, with computed rhythms and "pitches" spewed out of the actuators. Keyboard/monitor stations allowed musicians on the parallel network to sample parametrized rhythms, sequences, timbres and other values in realtime from other musicians.
The whole thing was totally insane, but then we were a Silicon Valley company in Oakland during the last recession that recruited exclusively musicians, philosophers, ex-hippie mathematicians, and yours truly (college dropout) as their mascot, for an image-processing startup. I've never been the same since, and the industry has yet to catch up with any of us.
Interesting, at least since a GPU is really a sort of DSP (digital signal processor). As I am deeply into both audio and broadband signal processing hardware development, I find the chips on high-performance video cards extremely useful for processing waveforms using the basis of it all: matrix calculations. That covers FFT and iFFT (and of course DFTs, DCTs and so on), as well as QMF and PQF filtering and synthesis.
I could dream of building a hi-fi vocoder out of a video card. Crazy but interesting! =)
On the other hand, I am a little skeptical about all this, as it will not reach even half or a third of its GFLOPS performance when not used for its "native application". This means that, after some time and hellish effort, the "PAIN vs BENEFIT" ratio could tip more and more to the pain side.
I remember the times I tried to use the 6510 CPU + 8 KB (I don't remember exactly) of RAM inside the C64 disk drive to process graphics in parallel with the main processor. Efforts fell
And from the third point of view: see how Intel processors suck (~flamebait ;) on "price > performance", like always. Any small embedded chip outrivals them.
p.s.
Still hold on for the coming of the FPGA
Your music application sounds like fun. I didn't know anybody was still doing anything quite like that by 1990 - there was a whole range of people around John Cage's time who did lots of prepared piano stuff.
Some of the people who were trying to sell our multi-processor supercomputer flavor came up with a music studio application, doing lots of audio processing and mixing, sort of like your device turned inside out. Don't know if they sold more than one of them before the Lucent spinoff took them away.
Bill Stewart