Slashdot Mirror


BrookGPU: General Purpose Programming on GPUs

An anonymous reader writes "BrookGPU is a compiler and runtime system that provides an easy, C-like programming environment (read: no GPU programming experience needed) for today's GPUs. A shader program running on the NVIDIA GeForce FX 5900 Ultra achieves over 20 GFLOPS, roughly equivalent to a 10 GHz Pentium 4. Combine this with the increased memory bandwidth, 25.3 GB/sec peak compared to the Pentium 4's 5.96 GB/sec peak, and you've got a seriously fast compute engine, but programming GPUs has been a real pain. BrookGPU adds simple data-parallel language extensions to C which allow programmers to specify certain parts of their code to run on the GPU. The compiler and runtime take care of the rest. Here is the Project Page and Sourceforge page."
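
As a rough check on the summary's figures, here is a small sketch. The only inputs are the numbers quoted in the summary; the variable names are made up for illustration. It works out the bandwidth ratio and the per-cycle throughput that the "10 GHz Pentium 4" comparison appears to assume.

```python
# Figures quoted in the story summary (peak values, not independent measurements).
gpu_gflops = 20.0          # GeForce FX 5900 Ultra shader throughput
p4_equiv_ghz = 10.0        # the "10 GHz Pentium 4" the summary equates it to
gpu_bw_gb_s = 25.3         # GPU peak memory bandwidth
p4_bw_gb_s = 5.96          # Pentium 4 peak memory bandwidth

# The comparison implicitly assumes this many floating-point ops per P4 cycle.
implied_flops_per_cycle = gpu_gflops / p4_equiv_ghz

# How much more peak memory bandwidth the GPU has.
bandwidth_ratio = gpu_bw_gb_s / p4_bw_gb_s

print(f"implied P4 FLOPs per cycle: {implied_flops_per_cycle:.1f}")
print(f"GPU/P4 peak bandwidth ratio: {bandwidth_ratio:.2f}x")
```

These are peak-versus-peak numbers, so they say nothing about how a particular program would actually perform on either chip.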

275 comments

  1. High Performance for General Purpose? by tempfile · · Score: 3, Interesting

    I suspect that this high performance is only attainable for the field the GPU is specialized for, i.e. graphics-related things. Or isn't it?

    1. Re:High Performance for General Purpose? by fidget42 · · Score: 4, Informative

      Actually, since "graphics-related things" are all matrix operations, this would turn the GPU into a high-end vector (matrix) engine.
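
      To make the matrix point concrete, here is a minimal sketch in plain Python (not Brook, and not how any driver does it). The translation matrix and the vertex list are hypothetical; it only shows that moving vertices is a matrix-vector multiply applied independently to each one.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

# Hypothetical transform: translate by (2, 3, 4) in homogeneous coordinates.
translate = [
    [1, 0, 0, 2],
    [0, 1, 0, 3],
    [0, 0, 1, 4],
    [0, 0, 0, 1],
]

vertices = [[0, 0, 0, 1], [1, 1, 1, 1], [-1, 2, 0, 1]]

# Each vertex is independent of the others, which is what makes the work
# easy to spread across parallel hardware.
moved = [mat_vec(translate, v) for v in vertices]
print(moved)
```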

      --
      The dogcow says "Moof!"
    2. Re:High Performance for General Purpose? by Anonymous Coward · · Score: 5, Insightful

      "Graphics-related" things include floating-point mathematics, linear algebra, and vector operations. If you are doing anything computationally intensive, this might be useful. You don't have to actually use the hardware to do anything graphical if you are just interested in crunching numbers.

    3. Re:High Performance for General Purpose? by Anonymous Coward · · Score: 1, Interesting

      Sweet. Morphing windows that don't hog the CPU.

    4. Re:High Performance for General Purpose? by JonnyRo88 · · Score: 2

      Aren't several cryptography operations related to matrix manipulation?

      --
      The Ro Factor - Jeep/Linux Weblog
    5. Re:High Performance for General Purpose? by Total_Wimp · · Score: 3, Interesting

      I can't help but notice the similarity between shader operations and how neurons interact. These processors might be a good platform for some AI tasks.

      I especially like the idea that the GPU and CPU can work together on the task. If the GPU was handling neuron tasks and the CPU was handling other necessary tasks we could get a very big boost to desktop AI

      TW

    6. Re:High Performance for General Purpose? by BrainInAJar · · Score: 4, Interesting

      would the precision be enough though? as far as i know, GPU's do a lot of rounding off

    7. Re:High Performance for General Purpose? by Anonymous Coward · · Score: 0

      It would depend on the application... but I suspect many cards are capable of doing double precision floating point operations in hardware (and certainly you could always program routines in software for as much precision as you want). I agree with you, often the card uses a faster precision when it doesn't need to be as accurate, but I don't think that means the hardware isn't as capable, because the graphics card still has other tasks that might require more accurate calculations.

    8. Re:High Performance for General Purpose? by Anonymous Coward · · Score: 3, Informative

      NVIDIA's parts are OK, precision wise. You get IEEE floats, more or less. ATI's parts don't quite get you there at the moment, but their next series is planned to.

    9. Re:High Performance for General Purpose? by larry+bagina · · Score: 1
      and certainly you could always program routines in software for as much precision as you want

      ... which means you'd lose the advantages of having the GPU do the math natively.

      --
      Do you even lift?

      These aren't the 'roids you're looking for.

    10. Re:High Performance for General Purpose? by Anonymous Coward · · Score: 0

      Beware of low floating point precision. Most GPUs are designed to render fast, but not accurately. Small calculation errors are not noticeable in a video image display at 30 frames per second. GPUs may not be suitable for scientific simulations. Check your processor documentation carefully.

    11. Re:High Performance for General Purpose? by sql*kitten · · Score: 2, Insightful

      as far as i know, GPU's do a lot of rounding off

      It depends. If you have a gaming card, it will sacrifice precision for speed to hit its price. If you're rendering 100 fps in a game and in a couple of noncontiguous frames the walls don't quite line up, no big deal. But on a professional CAD card, speed is sacrificed for precision - the risk of an engineer making a mistake or failing to spot one in an assembly alignment because of a rendering artefact is too high.

      In practice, a CAD card is just as fast as a gaming card, it just costs 5x as much (or more). Still, if your computation was well suited to matrix multiply and add (even a modest GPU will spank a good CPU at this) it might still be worth it.

    12. Re:High Performance for General Purpose? by Directrix1 · · Score: 3, Interesting

      Yes, anything computationally intensive that works over a range of data can usually find a parallel solution. Such as image/video manipulation/encoding/decoding, encryption, and cracking (and hopefully this will give us a platform for better software RF). I've always wondered why this stuff didn't just get worked into a coprocessor, because very little new stuff actually happened that was directly related to the video card (as in taking output from the machine and displaying it on a screen). I think the card manufacturers saw this, so they jumped on the 3D acceleration bandwagon, touting it as a new video card feature, when it should've just been in the domain of a new math coprocessor.

      --
      Occam's razor is the blind faith in the natural selection of least resistance and in universal oversimplification. -- EF
    13. Re:High Performance for General Purpose? by axxackall · · Score: 3, Insightful
      Matrix and vector calculations with floating point make the GPU an excellent place to host neural network (NN) computation.
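
      A minimal sketch of why that is, assuming a plain fully connected layer with a tanh activation (the weights, biases and inputs are made-up values): one layer is a matrix-vector product followed by an elementwise function, and each output neuron can be computed independently of the others.

```python
import math

def layer(weights, biases, inputs):
    """One fully connected layer: tanh(W @ x + b), written with plain lists."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical 2-input, 2-neuron layer.
weights = [[0.5, -0.25], [1.0, 1.0]]
biases = [0.0, -1.0]
out = layer(weights, biases, [2.0, 4.0])
print(out)
```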

      Of course NNs can be used for "graphics-related things" such as image recognition, but not only for images: voice recognition, for example. And not only for recognition: forecasting on huge sequences with explicit and implicit (hidden) side-factors, for example.

      Stock market trader on GPU, anyone?

      --

      Less is more !
    14. Re:High Performance for General Purpose? by N+Monkey · · Score: 1

      I suspect that this high performance is only attainable for the field the GPU is specialized for, i.e. graphics-related things. Or isn't it?

      It is for "more specialised" tasks but it certainly isn't restricted to graphics. Computation such as FFTs and Fluid Flow is also possible.

      If, like me, you couldn't get through the home page, this sort of thing (including, IIRC, Brook) was discussed at Siggraph and also this year's Graphics Hardware conference.
      Scroll down to the "Panel: GPUs as Stream Processors" and "Session 4: Simulation and Computation" sections for slides.

    15. Re:High Performance for General Purpose? by Talez · · Score: 1

      nVidia use mixed 128-bit and 64-bit datatypes while ATi use 96-bit datatypes

      All three are enough for double floats and the top two precisions are good enough for 80-bit reals that x87 uses.

    16. Re:High Performance for General Purpose? by Short+Circuit · · Score: 1

      Vendor-specific patents and such. Be glad the GPUs can be plugged into a high-speed expansion bus, as-is.

    17. Re:High Performance for General Purpose? by Viking+Coder · · Score: 2, Informative

      Nope.

      Those are 4-component (RGBA) types, with 32, 16, and 24 bits per component, respectively.

      None of them are enough for double floats, and none of them are good enough for 80-bit reals that x87 uses.
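
      The gap between per-component widths and a 64-bit double can be seen with a small sketch using Python's standard `struct` module, which can round-trip a value through 32-bit and 16-bit IEEE formats. There is no `struct` code for a 24-bit float, so that case is not shown; the helper names are made up for illustration.

```python
import struct

def to_float32(x):
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def to_float16(x):
    """Round a Python float to the nearest 16-bit half-precision float."""
    return struct.unpack("e", struct.pack("e", x))[0]

big = 16777217.0                 # 2**24 + 1, exactly representable as a double
print(to_float32(big))           # a 32-bit float has a 24-bit significand
print(to_float16(2049.0))        # a 16-bit half has an 11-bit significand
print(to_float32(0.1) == 0.1)    # 0.1 as a float32 differs from 0.1 as a double
```

      Both integers lose their last bit when narrowed, which is the kind of rounding a double would not introduce.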

      --
      Education is the silver bullet.
    18. Re:High Performance for General Purpose? by neural+cooker · · Score: 1

      or possibly AI PCI cards? that could be sweet if somebody could nail down a possible AI API for that. perhaps based on neural-nets and/or fuzzy logic

  2. Cool, but by MooCows · · Score: 3, Interesting

    What kind of instructions does the GPU actually accept?
    I mean, you probably just can't run any kind of algorithm on there can you?

    --
    The path I walk alone is endlessly long.
    30 minutes by bike, 15 by bus.
    1. Re:Cool, but by Anonymous Coward · · Score: 1, Informative

      Any algorithm will run (memory allowing), but not any instruction. The card runs on some flavor of assembly, as any microprocessor does; with this tool you can compile code for the GPU from C, and it gets loaded into the GPU when your main program runs on the CPU.

    2. Re:Cool, but by scrytch · · Score: 3, Informative

      > I mean, you probably just can't run any kind of algorithm on there can you?

      Probably. I should imagine it has local storage with the corresponding fetch and store instructions, basic math, and the ability to jump to arbitrary points in the shader program, which makes it very much Turing-complete. Everything else is a matter of a compiler backend. Bus latency would be an issue, so it'd be painful for programs that need a lot of I/O, but that's not an issue for a lot of programs.

      --
      I've finally had it: until slashdot gets article moderation, I am not coming back.
    3. Re:Cool, but by Hast · · Score: 1

      You typically have all standard math related operations, and then some. (Since it's for graphics originally it also supports cross and dot products for instance.) What it typically lacks is flow control. So no branching on a GPU (just yet). OTOH that is pretty much why you can get them so blazingly fast.
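
      One common way to live without flow control is to compute a 0/1 mask and blend both outcomes arithmetically. The sketch below is plain Python with hypothetical function names, not a claim about how any particular GPU or Brook handles conditionals; it just shows the two forms giving the same answer.

```python
def clamp_branchy(x, limit):
    """Ordinary version: uses an if, i.e. flow control."""
    if x > limit:
        return limit
    return x

def clamp_branchless(x, limit):
    """Same result with arithmetic only: compute a 0/1 mask, then blend."""
    mask = float(x > limit)          # 1.0 where the condition holds, else 0.0
    return mask * limit + (1.0 - mask) * x

samples = [0.25, 0.75, 1.5, -2.0]
print([clamp_branchless(s, 1.0) for s in samples])
```

      The cost is that both sides of the condition are always evaluated, which is fine when each side is cheap.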

  3. Basically like having two processors... by Anonymous Coward · · Score: 4, Interesting

    I wonder how long till we see a (insert worthwhile cause here)-At-Home client that supports this?

    1. Re:Basically like having two processors... by Anonymous Coward · · Score: 0

      I've been asking about this, with those distributed client folks, for years. It's nice to finally see it being done!

      I wonder how many practical things this could aid. Like password cracking (brute force).

    2. Re:Basically like having two processors... by Doc+Ruby · · Score: 1, Offtopic

      Keeping the prehistoric Atari/Commodore flamewar alive, I point out that I used to program multiprocessing on my Atari 400. Syncing its ANTIC vertical blank server routines with 6502 client routines through its wonderfully generic SIO scheduler, I was multiprocessing in 1981! Harnessing fast GPUs to speed general logic is at least as old as the Roman spatial metaphor for value, where "superior" means both "higher" and "better", et cetera.

      --

      --
      make install -not war

    3. Re:Basically like having two processors... by Chordonblue · · Score: 2, Interesting

      Yeah, I remember that! Lucasfilm used it to animate a mothership in 'Rescue on Fractalus' (itself a marvel of tech for the Atari) while the game loaded. There were cool (de)compression routines that harnessed this as well.

      I also seem to recall certain music pieces that could play extra parts by blanking the screen. There was also a really cool 9 second sample of 'You really got me' - the Van Halen version - and it blanked the screen to play it.

      Wow! Them were the salad days!

      --
      "...Well, there's egg and bacon; egg sausage and bacon; egg and spam; egg bacon and spam; egg bacon sausage and spam..."
    4. Re:Basically like having two processors... by CTalkobt · · Score: 1

      Bah humbug. Just to continue the Atari/Commodore flamewar...

      I coded a basic multi-tasker that would allow different threads of a basic program to be run at the same time for the Commodore. It got confusing if you tried to modify a variable in more than 1 spot. It was more fun to play with than really practical.

      --
      There's a gorilla from Manilla whose a fella that stinks of vanilla and has salmonella.
    5. Re:Basically like having two processors... by Doc+Ruby · · Score: 1

      I accept your concession of the Commodore OS as inferior, supporting only an unsafe multithreader, while the Atari OS included simple, safe support for the Atari's true-multiprocessing HW (something like 5 CPUs on the DMA bus). You are most gracious ;).

      --

      --
      make install -not war

    6. Re:Basically like having two processors... by JDWTopGuy · · Score: 0

      Screw worthwhile causes, I want optimized RC5-72 cores for distributed.net! The mere thought of it makes me drool.

      Anybody know if it will work with a Radeon Mobility 9200? Because if somebody could port this to OS X and include it in distributed.net, that alone would be justification for the iBook I plan on getting.

      Naturally, somebody is going to come along and tell me why a GPU would suck at RC5-72, and crush my dreams. Oh well.

      --
      Ron Paul 2012
    7. Re:Basically like having two processors... by cybergibbons · · Score: 3, Interesting

      Ha! The C64 disk drive had its own processor which you could use to run programs as long as you could deal with the painfully slow serial link. Beat that.

    8. Re:Basically like having two processors... by Doc+Ruby · · Score: 1

      The Atari 6502 machines also had ANTIC and POKEY CPUs running in parallel on a local bus, with their own instruction sets. The OS had CIO, Centralized I/O, which mapped all the data being pumped as devices with a symmetric interface. So the multiprocessing on the mainboard was nicely integrated, with a highly programmable interface. Many programs actually used the architecture, which was more mature, stable and programmable than a hack. If you want hacks, you can look at the programs written to run on the Atari's floppy disk drive, or those written to run on the Z80 coprocessors plugged into the memorymapped ROM slots. This is why the company was named "Atari": Japanese for "Checkmate" :).

      --

      --
      make install -not war

    9. Re:Basically like having two processors... by Anonymous Coward · · Score: 0

      to be fair, rc5 is the only thing demonstrably superior about the G4 (or G5), so you'll be in rc5 heaven anyway.

    10. Re:Basically like having two processors... by fitten · · Score: 1

      I thought the MMU of the Atari XE line was pretty novel for the times as well. One instruction page flipping and it could access up to 1M, iirc.

    11. Re:Basically like having two processors... by tengwar · · Score: 1

      Well the original PC had an 8048 8-bit processor to control the keyboard, and presumably there's something like that still in there. Does anyone know if we can get at it?

    12. Re:Basically like having two processors... by Doc+Ruby · · Score: 1

      Modern PC-AT style computers have an x86-compatible CPU, keyboard processor, IDE microcontroller, GPU, sound DSP, ethernet ASIC, and other processors with their own instruction sets that run in parallel. Not to mention PCI, USB, FireWire, SCSI, and PCMCIA bus-accessible processor devices. I've seen Linux apps that program the keyboard microcontroller, and others. What would be fascinating would be Java virtual machines running on each of the different chips, and a distributed multiprocessing scheduler coordinating messages and resources among java objects moving around the system to be "near" their data. That's a way to soak instructions into all the nooks and crannies in the cheap multiprocessors on our desks, minus the JVM overhead.

      --

      --
      make install -not war

  4. Cool ... by torpor · · Score: 5, Interesting

    ... can you say 'software synthesists' wet dream?

    Oh, suddenly, that 'game investment' also gives you a few 100 extra voices of polyphony?

    Sweet ... $5 to the first person to use Brook to make a synthesizer. :)

    --
    ; -- the corruption of government starts with its secrets. a truly free people keep no secrets. --
    1. Re:Cool ... by usrusr · · Score: 2, Informative

      think fx not synth... just use it as a bad-ass real time convolver, and _then_ get wet.

      isn't it much more interesting to do things that were not possible before, than to just do the same thing, but in increased quantity? Also convolution is the single most universal operation in audio dsp (fir filters, reverb), one well-built plugin would suffice for everything. synth development creativity would certainly suffer from the increased development costs.
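
      For readers who have not met it, here is a minimal sketch of direct-form convolution, the operation behind FIR filters and convolution reverb. The 2-tap averaging kernel and the sample values are hypothetical, and a real-time convolver would use a much faster method for long kernels; this only shows the multiply-accumulate pattern.

```python
def convolve(signal, kernel):
    """Direct-form convolution; output length is len(signal) + len(kernel) - 1."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

# Hypothetical 2-tap averaging filter (a very simple low-pass FIR).
fir = [0.5, 0.5]
samples = [1.0, 3.0, 5.0, 7.0]
print(convolve(samples, fir))
```

      Swapping the kernel for a recorded room impulse response turns the same loop into a reverb, which is why one plugin could cover so many effects.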

      --
      [i have an opinion and i am not afraid to use it]
    2. Re:Cool ... by torpor · · Score: 2, Interesting

      What does 'synth' mean to you?

      To me it doesn't just mean Virtual Analog, or subtractive... it can be anything that makes noise ... so yeah, filters, yeah, effects, yeah, a single monster filter...

      It's all good. Let's see what the GPUs can do ...

      --
      ; -- the corruption of government starts with its secrets. a truly free people keep no secrets. --
    3. Re:Cool ... by Anonymous Coward · · Score: 0

      tttiiiiiiiiiimmmmmmmeeeesssssssssccccccccccaaaaaaa lllllllliiiiiiiinnnnnnnggggggggggg

      eventide's machines cost a fortune and im sure a geforce 3 would beat it!

  5. first link is incorrect by 2.246.1010.78 · · Score: 5, Informative

    but the link to the project page is correct.

  6. Like the good old days by fiskbil · · Score: 5, Funny

    Reminds me of the good old days when you used the processors in the C64 tapedrive to compute stuff. Wouldn't want to waste those precious cycles.

    I'm sure a lot of old farts will tell me how they used some serial controller to compute stuff back in the 60's and that I'm just a little kid. :)

    1. Re:Like the good old days by Trracer · · Score: 3, Informative

      I guess you mean in the C1541 floppydrive.

      --
      English is not my first language, so cut me some slack -: Om du kan lasa det har sa kan du Svenska :-
    2. Re:Like the good old days by tzanger · · Score: 3, Informative

      Reminds me of the good old days when you used the processors in the C64 tapedrive to compute stuff. Wouldn't want to waste those precious cycles.

      Actually it was the old 1540/1541 and later 1571/1581 disk drives. The tape drive did not have a processor in it.

    3. Re:Like the good old days by Anonymous Coward · · Score: 0

      > Reminds me of the good old days when you used the processors in the C64 tapedrive to compute stuff

      I suspect you meant the 1541 disk drive. The Datasette tape drives were extremely *dumb* devices, with no kind of CPU in them.

    4. Re:Like the good old days by hemanman · · Score: 1

      I'm sure a lot of old farts will tell me how they used some serial controller to compute stuff back in the 60's and that I'm just a little kid. :)

      Yeah, a little kid that's on way too much crack, like the moderators of your post!

      Else you would have remembered that the tapedrive, was in fact just a tapedrive without any processors whatsoever.

      -H

    5. Re:Like the good old days by Anonymous Coward · · Score: 1, Interesting

      Actually, I made an image codec on the Amiga that programmed the blitter (Amiga graphics co-processor) to do delta decoding, using 3 bitplanes to describe -4 to +3 delta, and summing with previous pixel on a 5 bitplane image. Decoding was done in parallel with the main processor decoding a runlength+huffman stage for next frame. I think that was the first codec I ever made, and certainly the one I had most fun making. Ah, those were the days..
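
      The delta step can be sketched as follows. This is an illustration only: it assumes the 3-bit deltas are stored in two's complement and that pixel values wrap within 5 bits, neither of which the post states, and it runs pixel by pixel rather than across bitplanes as a blitter would.

```python
def decode_deltas(first_pixel, deltas3):
    """Rebuild 5-bit pixels from 3-bit two's-complement deltas (-4..+3)."""
    pixels = [first_pixel & 0x1F]
    for d in deltas3:
        delta = d - 8 if d & 0x4 else d             # sign-extend the 3-bit value
        pixels.append((pixels[-1] + delta) & 0x1F)  # wrap within 5 bits (0..31)
    return pixels

# 3-bit codes: 1 -> +1, 3 -> +3, 7 -> -1, 4 -> -4
print(decode_deltas(10, [1, 3, 7, 4]))
```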

      For the interested, it was used on PMC's Alpha & Omega released on The Gathering 1991.

  7. wait a minute by Janek+Kozicki · · Score: 5, Interesting

    A shader program running on the NVIDIA GeForce FX 5900 Ultra achieves over 20 GFLOPS, roughly equivalent to a 10 GHz Pentium 4.

    wait, if there is a technology that allows construction of GPU that is 3 times faster than the fastest CPUs, why Intel and AMD do not use this technology to build those 3times faster CPUs?

    are you sure that you can compare the speed of GPU and CPU?

    --
    #
    #\ @ ? Colonize Mars
    #
    1. Re:wait a minute by Anonymous Coward · · Score: 1, Informative

      No. GPUs are basically matrix crunchers for vector calculations. You can make them do other stuff, probably, but it'd be about as efficient as emulating an Amiga on an x86.

    2. Re:wait a minute by MooCows · · Score: 2, Informative

      The keywords are:
      A shader program

      The GPU is designed for CG, not for 'general purpose computing'.
      I guess the instruction set is pretty limited too.

      --
      The path I walk alone is endlessly long.
      30 minutes by bike, 15 by bus.
    3. Re:wait a minute by AvitarX · · Score: 2, Informative

      You can compare their ability to run shader programs (see the example given).

      It does not mean you can use the GPU as a general purpose processor effectively, or that it is even Turing complete.

      All it means is that certain types of programs could possibly run 3 times faster if ported to this system.

      --
      Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
    4. Re:wait a minute by Anonymous Coward · · Score: 0
      wait, if there is a technology that allows construction of GPU that is 3 times faster than the fastest CPUs, why Intel and AMD do not use this technology to build those 3times faster CPUs?


      It's not the construction but the design. CPUs are more general purpose and so require more logic and are consequently slower.

      are you sure that you can compare the speed of GPU and CPU?


      For certain types of operations? Sure. They're both just implementations of the Turing machine after all.
    5. Re:wait a minute by Anonymous Coward · · Score: 0, Informative

      The CPU is a general purpose computing device. GPU is a specific purpose computing platform.

      Think about the PlayStation 2, where its 300 MHz CPU outperforms a 700 MHz or higher Pentium CPU in graphics performance, but would also run a word processor at a speed that would be slow compared to a 486.

      I bet that if you ran a web browser on the same GPU it would run just a bit faster than on a 286 computer.

    6. Re:wait a minute by ankit · · Score: 2, Informative

      It's probably because the Pentium 4 needs to be more generic. It needs to support a far greater number of instructions.

      A GPU on the other hand can do only so much. But its strength lies in areas where the CPU lags: fast memory interfacing, extreme parallelization, etc.

      Now there exist computing problems that can be solved very efficiently on the GPU, even with its limited instruction set. This is what this project is all about - to provide a generic programming language that compiles to a vertex/pixel shader that runs on the GPU, but does non-graphics tasks. Awesome!

      --
      Don't Panic
    7. Re:wait a minute by the+uNF+cola · · Score: 5, Informative

      You are assuming that GPU technologies can be used in a CPU. Because something is applicable in one instance doesn't mean it is in all instances. Making some things efficient may take away from the efficiency of others, but in the case of such a specialized chip, it may not matter.

      It may be ok to compare the speed of a GPU and a CPU if they are in fact different. If a GPU was a CPU made with cheaper material, yeah, it would be unfair. But as life goes, they both have their merits.. so why not? A GPU is prolly best at some matrix math transforms.. or not. :)

      --

      --
      "I'm not bright. Big words confuse me. But Wanda loves me and that should be enough for you." - Cosmo

    8. Re:wait a minute by enigma48 · · Score: 5, Insightful

      Definitely possible - general purpose CPUs have to do everything, whereas graphics cards can specialize and do what little they can, faster.

      Also, good point about comparing GHz to GHz - AMD CPUs do more per cycle than Intel, but are also clocked much lower. You could look at a subset of instructions (ie: FLoating-point OPerations (FLOPS)) but this only gives you a piece of the overall performance picture.

      Without having read the article, my guess is they extrapolated (educated, math-based guess) how fast a 10GHz P4 would perform and compared the results that way.

      I'd LOVE to see this tech built into a SETI or Folding@Home client (steroids version). (Imagine the kids - "Mom, I need the Radeon 9800XT to find a cure for Grandma's cancer!")

    9. Re:wait a minute by Jah-Wren+Ryel · · Score: 2, Interesting

      All the world is not a FLOP. GPU = Graphics Processing Unit, not General Purpose Unit.

      --
      When information is power, privacy is freedom.
    10. Re:wait a minute by Entropy_ajb · · Score: 5, Informative

      Because CPUs are limited to running instructions (for the most part) in serial. GPUs get to run a large number of instructions in parallel. As some above posts mentioned, a lot of the stuff the GPU can do is vector and matrix multiplication, therefore the GPU is really good at multiplying a lot of numbers times a lot of numbers at once. But in everyday life you aren't multiplying a bunch of numbers times a bunch of numbers at once, you are multiplying one number times another, then multiplying the result times a number, and so on. GPUs are built to a specific task, and at that task they are very fast, but outside that task they won't be able to compete with a real CPU. And on top of all of that I can buy 3 2.4Ghz P4s for the price of a Geforce FX5950.
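
      The distinction can be sketched in a few lines of plain Python (the arrays and the recurrence are made up for illustration). In the first computation every product is independent; in the second, each step needs the result of the one before it.

```python
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

# Data-parallel: every product is independent, so all four could be computed
# at the same time on hardware with enough multipliers.
elementwise = [x * y for x, y in zip(a, b)]

# Dependent chain: each step uses the previous result (x = x*x + 1), so as
# written the steps run one after another however many units are available.
x = 1.0
for _ in range(3):
    x = x * x + 1.0

print(elementwise)
print(x)
```

      Work shaped like the first case is where wide parallel hardware pays off; the second gains nothing from extra multipliers.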

    11. Re:wait a minute by mdpye · · Score: 4, Interesting
      And on top of all of that I can buy 3 2.4Ghz P4s for the price of a Geforce FX5950

      But you forget the 256MB (at least) RAM on a steaming fast interface that you get with the GeForce... It makes the P4s' cache look pretty paltry in size by comparison.

      MP
    12. Re:wait a minute by Kjella · · Score: 4, Informative

      wait, if there is a technology that allows construction of GPU that is 3 times faster than the fastest CPUs, why Intel and AMD do not use this technology to build those 3times faster CPUs?

      are you sure that you can compare the speed of GPU and CPU?


      Well, yes and no. In the same way you can take a render farm and say that "this provides the equivalent of a 100GHz Pentium", which might be true, for that specific task. You see it already between CPUs: compare Pentium, Xeon, Athlon XP and Athlon 64. Do you get one benchmark "X is 3% faster than Y"? No. Faster at some, slower at others. For a specific benchmark, the difference can be pretty big already among "general" processors.

      A specialized processor like a GPU will show much greater variation. It might really shine on some, really suck on others. Which is why it's no good using a GPU as a CPU. Those numbers tell you that it can be much faster than the fastest CPU around. Or better yet, if you can make it run in parallel with the normal CPU, give you a total performance which may theoretically be about 13GHz (10 + 3), where 3 of those can be general-purpose operations. Or it may be a task the GPU runs like a dog, and isn't even worth the overhead.

      Kjella

      --
      Live today, because you never know what tomorrow brings
    13. Re:wait a minute by paulerdos · · Score: 1

      you meant Floating Point Operations Per Second (FLOPS), right?

    14. Re:wait a minute by barik · · Score: 5, Interesting

      Are you sure that you can compare the speed of GPU and CPU?

      Professor Pat Hanrahan, of Stanford University, made a stab at answering this question in his presentation 'Why is Graphics Hardware so Fast?'. The first half of the presentation focuses on this question, while the second half of the presentation covers programming languages that utilize this hardware. Specifically, the Stanford Real-Time Shading Language (RTSL) and Brook are discussed. Overall, it's a good presentation that should get you up to speed with the basics of what's happening in this area of research.

    15. Re:wait a minute by SenzarLarkin · · Score: 1

      It's simply a result of what you dedicate your transistors towards. GPUs have very little cache, whereas the greater bulk of chip area on a CPU is cache. Since all you have is a bunch of transistors, every block that you make for computation instead of cache increases arithmetic performance.

    16. Re:wait a minute by SirDaShadow · · Score: 2, Funny

      2 words: X86 architecture. Everyone who hated it told you it sucks. Now you see why.

    17. Re:wait a minute by Anonymous Coward · · Score: 0

      Yes, you can compare the theoretical throughput of a GPU and CPU in terms of FLOPS or some other operations-per-second figure.
      That doesn't mean they are interchangeable. For example you can compare the horsepower of the engine in a Formula 1 car and an SUV. That doesn't mean you can interchange the vehicles. If you were interested, however, in subverting the engine to run a generator, the HP comparison is meaningful. That's what BROOK is all about -- letting you hijack the horsepower in the GPU.

    18. Re:wait a minute by larkost · · Score: 2, Funny

      *arrrg*!!

      PowerPoint-like presentation... going dumb... noooooo...

    19. Re:wait a minute by larry+bagina · · Score: 1
      why Intel and AMD do not use this technology to build those 3times faster CPUs?

      They do. MMX, SSE, etc. do parallel math operations. Except for the intel compiler, you need to specifically write code yourself (in asm or gcc's pseudo-high level front end). And most general-purpose computing doesn't deal in math that can be optimized by it.

      --
      Do you even lift?

      These aren't the 'roids you're looking for.

    20. Re:wait a minute by sql*kitten · · Score: 1

      wait, if there is a technology that allows construction of GPU that is 3 times faster than the fastest CPUs, why Intel and AMD do not use this technology to build those 3times faster CPUs?

      MIPS did this in the R8000 chipset used in some of SGI's Impact systems, back in the early 90s. The result was a machine that was deadly for floating point, but the trade-off was merely average general-purpose (i.e. integer) performance and it was very difficult to optimize code for.

    21. Re:wait a minute by TheRoachMan · · Score: 1

      All right, the interface between GPU and RAM on a GeForce card might be streaming fast, but it isn't on-die like a P4's cache. Increasing the on-die cache on a P4 to a whopping 256MB instead of the standard 256KB or 512KB wouldn't help much speed-wise. The cache is larger, but it will also take more cycles to find something that's stored there, and return it to the register. You can't just go and compare a graphics card's RAM to a CPU's on-die cache!!

    22. Re:wait a minute by mdpye · · Score: 1

      True, but you did compare a P4 with its cache to a GeForce and its entire memory. All I'm saying is you have to remember that a gfx card isn't just the GPU when you do price comparisons. Granted memory is hardly expensive nowadays, but the interface is still *fast* by FSB standards.

      I'm not suggesting a P4 with 256MB cache, I'm suggesting a P4 with a FSB as fast as the GeForce memory bus...

      MP

    23. Re:wait a minute by Anonymous Coward · · Score: 0

      They do. MMX, SSE, etc. do parallel math operations. Except for the intel compiler, you need to specifically write code yourself (in asm or gcc's pseudo-high level front end).


      Doesn't GCC3.something (3.3?) offer some support for SSE and friends? At least it has some options for them, such as -mfpmath=sse, -msse, -mmmx and -m3dnow.

    24. Re:wait a minute by Anonymous Coward · · Score: 0

      I paid AU$500 (Jan03) for two MP Athlons which run at 1.6GHz yet my AU$900 ATI 9800XT runs at 418MHz. While the AMD's are nothing to scream about the ATI video card is, even at AGP 4x. It simply rocks!

      It should be evident that it's very difficult to get a GPU running at 1GHz, while some Intel CPUs run at 3GHz. GPUs are more efficient at what they do, which is in itself very specialized. I'm certain it would benefit nVidia and ATI greatly if they could double the clock speed of current GPUs.

      I'm certain AMD and Intel would have added GPU type 3D instructions beyond SSE and 3DNow if their processors could reach the sort of frequencies they run at today.

    25. Re:wait a minute by larry+bagina · · Score: 1
      Those flags, as far as I can tell, don't automatically cause mmx/sse code to be generated.

      They *do* enable recognition of some builtin mmx/sse functions (which are front ends for the asm instructions of the same name).

      See here for the mmx header.
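To make the point about builtins concrete, here is a minimal sketch of hand-written SSE code using the intrinsics declared in `xmmintrin.h` (the SSE counterpart of the mmx header mentioned above). The function name `add_floats` is made up for illustration, and the scalar fallback is an assumption added so the code still builds where SSE is not enabled.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Adds two float arrays element by element (hypothetical helper). */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* One packed add handles four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar loop handles the tail, or everything when SSE is unavailable. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The parallelism here is spelled out by the programmer through the intrinsics, which is the situation the comment describes: the flags make these functions available rather than vectorizing ordinary loops for you.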

      --
      Do you even lift?

      These aren't the 'roids you're looking for.

  8. How does this look? by adrianbaugh · · Score: 5, Interesting

    I'm completely new to meddling with graphics cards, so apologies if this is a silly question: when programs utilising the GPU for arbitrary calculations are running, does the screen go weird, or is there a way of stopping the output being displayed? A screenful of junk might not matter to a scientist leaving their computer to crunch numbers for a few months, but it wouldn't be good for a general-purpose program.

    --
    "'I pass the test,' she said. 'I will diminish, and go into the West, and remain Galadriel.'"
    - JRR Tolkien.
    1. Re:How does this look? by DAldredge · · Score: 1

      I WISH you could get it to display. Having direct access to memory on systems like the C64 led to some neat effects.

    2. Re:How does this look? by Anonymous Coward · · Score: 5, Informative

      Nope. Nothing appears on your screen until the contents of the area of memory known as the "frame buffer" are rewritten by a program (on either the GPU or CPU). The GPU can execute math code all day and you won't see the results unless it deliberately modifies the frame buffer.

    3. Re:How does this look? by Anonymous Coward · · Score: 0

      No, these cards work better than the humble ZX-80. These GPU calculations are no less arbitrary and no more intrusive than the calculations being performed for your UT or BF1942, or even Duke Nukem: Forever!

    4. Re:How does this look? by tzanger · · Score: 1

      The GPU is the video processor. What's on the display is somewhere in video memory. So long as you're not writing to the particular section of video memory that is being used to show things on screen, you will see nothing. I often use video memory for swap on Linux since I never am in anything but textmode and you can't seem to buy cheap video cards with small amounts of memory.

    5. Re:How does this look? by Anonymous Coward · · Score: 0

      Eh, it would look cool. Matrix-y.

    6. Re:How does this look? by -noefordeg- · · Score: 1

      You usually have double or triple buffering, where you draw to one buffer, then swap it with the one that was displayed.
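A rough sketch of that buffer swap in plain C, assuming a tiny in-memory "screen" rather than real video hardware. Every name here (`DoubleBuffer`, `db_back`, `db_swap`, and so on) is hypothetical and only meant to show the idea.

```c
#include <stdint.h>
#include <string.h>

enum { FB_WIDTH = 4, FB_HEIGHT = 2, FB_PIXELS = FB_WIDTH * FB_HEIGHT };

/* Two pixel buffers: one is shown, the other is drawn into. */
typedef struct {
    uint32_t pixels[2][FB_PIXELS];
    int front;                      /* index of the buffer "on screen" */
} DoubleBuffer;

void db_init(DoubleBuffer *db)
{
    memset(db->pixels, 0, sizeof db->pixels);
    db->front = 0;
}

/* Buffer that rendering code may write to without affecting the display. */
uint32_t *db_back(DoubleBuffer *db)
{
    return db->pixels[1 - db->front];
}

/* Buffer the display would scan out. */
const uint32_t *db_front(const DoubleBuffer *db)
{
    return db->pixels[db->front];
}

/* Exchange the roles of the two buffers; no pixel data is copied. */
void db_swap(DoubleBuffer *db)
{
    db->front = 1 - db->front;
}
```

Anything written through `db_back` stays invisible until `db_swap` is called, which is why intermediate results never have to show up on screen.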

    7. Re:How does this look? by Anonymous Coward · · Score: 0

      Something like this:

      1010100100101001000110111000111000101010...

    8. Re:How does this look? by Anonymous Coward · · Score: 0

      Isn't there a huge time penalty for reading the contents of a graphics card's frame buffer from the CPU?

    9. Re:How does this look? by CableModemSniper · · Score: 1

      That sounds really interesting. Could you point me at a how-to or something to pull that off?

      --
      Why not fork?
    10. Re:How does this look? by DrSkwid · · Score: 1

      then you might like plan9

      It has r/w addressable video memory.

      And being plan9 it is even available across the network, like all memory, modulo permissions.

      --
      There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
    11. Re:How does this look? by tzanger · · Score: 1

      That sounds really interesting. Could you point me at a how-to or something to pull that off?

      Sure. Google's a good friend here. :-)

    12. Re:How does this look? by Bazman · · Score: 1

      Reminds me of my ZX Spectrum programming days. We were working on a program that really had to use as little memory as possible, so chunks of the code were copied into bitmap screen RAM. The Speccie had attribute screen RAM as well, so we set that to black ink on black paper so the code was invisible, and designed the rest of our screen graphics around it.

      Of course if the screen ever scrolled we were in trouble :)

    13. Re:How does this look? by JasonAsbahr · · Score: 1

      Seems to me that if you are writing to the frame buffer it will appear on screen via the normal vga signal generation process.

  9. I am not an EE, but... by unfortunateson · · Score: 5, Interesting

    It would seem to me that the GPU is not going to be as general-purpose as the CPU, but could still attain the high mathematical throughput with vector-oriented processing.

    Doing string searches, complex logic analyses, etc. would probably suck, but big data manipulations, such as SETI-style wave transformations, molecular analysis, etc., might be able to take advantage of them.

    --
    Design for Use, not Construction!
    1. Re:I am not an EE, but... by segmond · · Score: 1

      yeah, doing everything will suck using general purpose algorithms, so to make them not to suck, we develop new vector-oriented algorithms...

      --
      ------ Curiosity killed the cat. {satisfaction brought it back | it didn't die ignorant | lack of it is killing mankind
  10. A good example of how an OS should be programmed. by qualico · · Score: 2, Insightful

    "Brook is an extension of standard ANSI C and is designed to incorporate the ideas of data parallel computing and arithmetic intensity into a familiar, efficient language." I'll qualify this by saying it's the first I've heard of Brook; however, the words "efficient language" ring loud in my ears. If operating systems were programmed as such, imagine how fast bootup and operation of a computer would be. Instead we have bloated software on all sides that can barely show us the difference between MHz and GHz machines. What century are we in again?

  11. Good point. by yoshi_mon · · Score: 4, Insightful

    After taking a quick peek at the language part of the project, it seems right now that most of its functions are all about sets of data and how to move them around.

    Makes sense of course, as that is what a GPU is all about. (Yes, I'm vastly oversimplifying here.) So I would gather that it might be used for types of data that are streamed a lot? Maybe used for video editing, real-time video, etc., where you're trying to deal with a lot of data at once that you're trying to move around and not just store or have to perform some more complicated types of functions upon.

    However, I'm no 3D programmer and I sure would love a more detailed analysis of the potential for this.

    --

    Really, I know what I'm doing...Ohhhh, look at the shiny buttons!
    1. Re:Good point. by Anonymous Coward · · Score: 0

      wrong.

      it is not about streaming video or data.

      it is about linear algebra. vector processing as was done in supercomputers before massively parallel clusters became popular.

      unless you want to perform matrix calculations, this will not help you at all.

    2. Re:Good point. by Anonymous Coward · · Score: 0

      wrong.
      it is not about streaming video or data.
      it is about linear algebra.


      Well, I never said it was about that type of data, I just theorized that maybe those types of data sets would be able to find practical uses for the new functions.

  12. Fast Fourier Transform by HalfFlat · · Score: 3, Interesting

    I'd love to see an FFT implementation (maybe it's not so hard ... will have to download and play with it.)

    A lot of scientific code is constrained by how fast you can do an FFT, perhaps of arbitrary size. And a fast graphics card is a lot cheaper than a high-end processor.

    For embarrassingly parallel vector problems, this is just the sort of thing for cheap, powerful clusters based around a cheap PC and a fast GPU.

    1. Re:Fast Fourier Transform by Kazymyr · · Score: 4, Interesting

      Not to mention that you can put several PCI video cards in the same cheap PC. Multiply power by N.

      --
      I hadn't known there were so many idiots in the world until I started using the Internet -Stanislaw Lem
    2. Re:Fast Fourier Transform by Anonymous Coward · · Score: 0

      Why settle for one graphics card?

    3. Re:Fast Fourier Transform by TCM · · Score: 1

      Don't SETI@Home and Prime95 do something with FFT? Or am I mixing something up here?

      --
      Of course it runs NetBSD. BTC: 1NT7QvbetmANwaMzhpVL6
    4. Re:Fast Fourier Transform by jonsmirl · · Score: 5, Informative

      http://www.cs.unm.edu/~kmorel/documents/fftgpu/

      The FFT on a GPU
      This page contains supplemental material for the following paper.

      Moreland, K and Angel, E. "The FFT on a GPU." In SIGGRAPH/Eurographics Workshop on Graphics Hardware 2003 Proceedings, pp. 112-119, July 2003.

    5. Re:Fast Fourier Transform by SharpFang · · Score: 1

      *sigh* I've seen so much more or less sophisticated code to do the bit mirror-reversing in FFT, so why haven't they made a CPU (ASM) instruction for that yet (or did they?) That's SO easy in hardware, just twist the bus 180 degrees.

      --
      45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2
    6. Re:Fast Fourier Transform by Anonymous Coward · · Score: 0

      Heh.. I've always wondered if you could do that with a bunch of glcopytexsubimages and creative blending.

    7. Re:Fast Fourier Transform by BiggerIsBetter · · Score: 4, Funny

      Multiply power by N.

      You work for Nvidia, don't you?

      --
      Forget thrust, drag, lift and weight. Airplanes fly because of money.
    8. Re:Fast Fourier Transform by Fembot · · Score: 1

      Sorry to disappoint, but I doubt it scales linearly... and the PCI bus isn't exactly fast either.

    9. Re:Fast Fourier Transform by tius · · Score: 1

      It's called a DSP, otherwise it's a waste of silicon and expense in a general purpose CPU.

    10. Re:Fast Fourier Transform by Anonymous Coward · · Score: 0

      Altivec's permute instructions do this kind of thing iirc.

    11. Re:Fast Fourier Transform by rrkap · · Score: 1

      One application for this might be using a video card to handle mp3/ogg encoding/decoding. Those are basically frequency space calculations, so the math is similar. You might be able to encode data really quickly this way or use a pci video card to handle this leaving the processor free for other things.

      --
      I like my beverages with warning labels!
    12. Re:Fast Fourier Transform by Kazymyr · · Score: 1

      If I did, I'd say "multiply by n" - it's nVidia. :)

      --
      I hadn't known there were so many idiots in the world until I started using the Internet -Stanislaw Lem
    13. Re:Fast Fourier Transform by Kazymyr · · Score: 1

      Maybe not exactly linear, but each card could run different code, for a different task, out of its own memory - which would make it pretty close.

      --
      I hadn't known there were so many idiots in the world until I started using the Internet -Stanislaw Lem
    14. Re:Fast Fourier Transform by SharpFang · · Score: 1

      Can be done on ONE standard latch with two extra inputs (CS and ~WR). 1 CPU command to replace snippets like:

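      n = ((n >> 1) & 0x55555555) | ((n << 1) & 0xaaaaaaaa);
      n = ((n >> 2) & 0x33333333) | ((n << 2) & 0xcccccccc);
      n = ((n >> 4) & 0x0f0f0f0f) | ((n << 4) & 0xf0f0f0f0);
      n = ((n >> 8) & 0x00ff00ff) | ((n << 8) & 0xff00ff00);
      n = ((n >> 16) & 0x0000ffff) | ((n << 16) & 0xffff0000);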

      (how many CPU cycles is the above?)
      And how many cycles to initialise the DSP to perform it?
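For reference, the software bit reversal being complained about can be written as a complete C function along these lines. This is a sketch of the usual swap-adjacent-groups trick for 32-bit words; the names `reverse_bits32` and `reverse_index` are illustrative, not from any particular FFT library.

```c
#include <stdint.h>

/* Reverses all 32 bits by swapping neighbouring groups of 1, 2, 4, 8 and 16 bits. */
uint32_t reverse_bits32(uint32_t n)
{
    n = ((n >> 1)  & 0x55555555u) | ((n << 1)  & 0xaaaaaaaau);
    n = ((n >> 2)  & 0x33333333u) | ((n << 2)  & 0xccccccccu);
    n = ((n >> 4)  & 0x0f0f0f0fu) | ((n << 4)  & 0xf0f0f0f0u);
    n = ((n >> 8)  & 0x00ff00ffu) | ((n << 8)  & 0xff00ff00u);
    n = ((n >> 16) & 0x0000ffffu) | ((n << 16) & 0xffff0000u);
    return n;
}

/* FFT-style index reversal: reverse only the low `bits` bits (1 <= bits <= 32). */
uint32_t reverse_index(uint32_t i, unsigned bits)
{
    return reverse_bits32(i) >> (32u - bits);
}
```

For an 8-point FFT (3 index bits), `reverse_index(1, 3)` gives 4 and `reverse_index(3, 3)` gives 6, which is the reordering the butterflies expect.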

      --
      45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2
    15. Re:Fast Fourier Transform by mysticbob · · Score: 1
      Not to mention that you can put several PCI video cards in the same cheap PC. Multiply power by N.

      not exactly true -- as people have pointed out, the pci bus is shared, and graphics (even of the sort brook can do) are still bandwidth intensive, so this is a bottleneck which will limit scalability.

      second, and perhaps more importantly, almost nobody makes pci gfx anymore, and nvidia, ati, and everyone else is deprecating pci, and moving to pci-express post-haste.

    16. Re:Fast Fourier Transform by Anonymous Coward · · Score: 0

      Even better , use the graphics card to handle MPEG2/MPEG4/VP4 encoding and decoding.

      oh wait...

  13. Re:Wow, thats pretty sad by DAldredge · · Score: 0, Offtopic

    I didn't pay.

    Someone paid for me to have an account. I think it was someones idea of a sick joke.

  14. Site is slow by Anonymous Coward · · Score: 2, Informative
    As the programmability and performance of modern GPUs continues to increase, many researchers are looking to graphics hardware to solve problems previously performed on general purpose CPUs. In many cases, performing general purpose computation on graphics hardware can provide a significant advantage over implementations on traditional CPUs. However, if GPUs are to become a powerful processing resource, it is important to establish the correct abstraction of the hardware; this will encourage efficient application design as well as an optimizable interface for hardware designers.

    Brook is an extension of standard ANSI C and is designed to incorporate the ideas of data parallel computing and arithmetic intensity into a familiar and efficient language. The general computational model, referred to as streaming, provides two main benefits over conventional languages:
    • Data Parallelism: Allows the programmer to specify how to perform the same operations in parallel on different data.
    • Arithmetic Intensity: Encourages programmers to specify operations on data which minimize global communication and maximize localized computation.
    More about Brook can be found at the Merrimac web site, which contains a complete specification for the language.

    The BrookGPU compilation and runtime architecture consists of two components. BRCC, the BrookGPU compiler, is a source-to-source metacompiler which translates Brook source files (.br) into .cpp files. The compiler converts Brook primitives into legal C++ syntax with the help of the BRT, the Brook RunTime library.

    The BRT is an architecture-independent software layer which implements the backend support of the Brook primitives for particular hardware. The BRT is a class library which presents a generic interface for the compiler to use. The implementation of the class methods is customized for each hardware platform supported by the system. The backend implementation is chosen at runtime based on the hardware available on the system or at the request of the user. The backends include: DirectX9, OpenGL ARB, NVIDIA NV3x, and C++ reference.
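To give a feel for the streaming model described above, here is a sketch in plain C rather than actual Brook syntax. It assumes the simplest possible case: a kernel that reads one element from each input stream and writes one element of the output, with a loop standing in for what a runtime would dispatch in parallel. The names `saxpy_kernel` and `run_saxpy` are invented for this example.

```c
#include <stddef.h>

/* Kernel: touches only its own inputs and produces one output value,
   so every element can be computed independently (data parallelism). */
static float saxpy_kernel(float alpha, float x, float y)
{
    return alpha * x + y;
}

/* Stand-in for the runtime: applies the kernel to each stream element.
   On a GPU the iterations could run in parallel, because none of them
   communicates with any other (no global communication). */
void run_saxpy(float alpha, const float *x, const float *y,
               float *result, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        result[i] = saxpy_kernel(alpha, x[i], y[i]);
}
```

The restriction that the kernel cannot reach into arbitrary memory is what lets a backend map the same code onto shader hardware or onto a CPU reference loop like this one.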
  15. The deaf leading the blind... by Kjella · · Score: 4, Informative

    ...but I assume that in any advanced texturing/shading/bump mapping/other GFX function rendering, you apply all the different effects, and when you're done, specifically call that the frame is to be displayed on screen. (E.g. why your FPS != your monitor refresh rate)

    I would assume that this program simply never calls the drawing function, but instead gets the results back from the GPU. The normal screen should be able to run in the meanwhile (I assume you can e.g. build a 3D environment while showing a 2D cutscreen), so I would think you can have a plain GUI, as long as it doesn't need to use anything advanced.

    Kjella

    --
    Live today, because you never know what tomorrow brings
  16. Homepage of GPGPU research by zymano · · Score: 4, Informative


    www.gpgpu.org

    Very cool. Vector/Graphics processors could one day overtake General processors. They are way more energy efficient too.

  17. Drawing text with GPU shader units? by jonsmirl · · Score: 4, Interesting
    Has anyone tried drawing text with GPU shader units? It would work something like this:

    1) Each character would have its own shader program.
    2) You would set the shader program, draw a rectangle, and the character would appear.
    3) The shader programs would be automatically generated by processing TrueType files.

    To implement:
    1) Break the TrueType outline up into a number of convex curve segments.
    2) Each of these curve segments would be represented as a set of constants in the shader program
    3) For each pixel, test a line from pixel to an edge.
    4) If the number of segments crossed is odd the pixel is black else white.
    The algorithm can be refined to add antialiasing and hinting.
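The per-pixel test in steps 3 and 4 is the even-odd rule. A CPU-side sketch in C, under the simplifying assumption that the curve segments have already been flattened into straight edges (a real shader version would test against the curves themselves), might look like this. `Point` and `inside_even_odd` are hypothetical names.

```c
#include <stddef.h>

typedef struct { float x, y; } Point;

/* Returns 1 if p lies inside the closed outline, 0 otherwise.
   Casts a horizontal ray from p toward +x and counts edge crossings. */
int inside_even_odd(const Point *outline, size_t count, Point p)
{
    int crossings = 0;
    for (size_t i = 0, j = count - 1; i < count; j = i++) {
        Point a = outline[i], b = outline[j];
        /* Does edge a-b straddle the horizontal line through p? */
        if ((a.y > p.y) != (b.y > p.y)) {
            /* x coordinate where the edge meets that line */
            float x_cross = a.x + (p.y - a.y) * (b.x - a.x) / (b.y - a.y);
            if (p.x < x_cross)
                ++crossings;        /* crossing is to the right of p */
        }
    }
    return crossings & 1;           /* odd count means inside */
}
```

Because each pixel is tested independently against constant outline data, the same logic fits the one-program-per-glyph idea: the outline becomes shader constants and the pixel position is the only varying input.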

    What you end up with is text that is clear at any resolution. The size of the text is controlled by the rectangle you draw it in. The text can also be clearly rotated and sheared.

    An obvious optimization is to get the GPU vendors to add a shader instruction to do the calculation for which side of the bezier curve segment the current point lies.

    While not important for games, drawing text is critical for desktops. And we all know about the current trend toward drawing desktops with 3D hardware.

    1. Re:Drawing text with GPU shader units? by asparagus · · Score: 1

      It's a nice idea, but far simpler to rasterize the characters one time to a buffer, and then use them as 2d-textures. Then it's easier to code, optimize, and tweak said textures/characters if you don't like how they look. It eats memory, but that's one thing there's plenty of on modern graphics cards.

    2. Re:Drawing text with GPU shader units? by jonsmirl · · Score: 2, Interesting
      Think about a compositing system where the window the app is being drawn into has been transformed into a non-rectangular shape by the compositing engine.

      The app thinks it is drawing into a flat rectangle. But the compositing engine distorts the font bitmap with its transform. With the shader approach the distortion doesn't happen. The same problem happens when the compositing engine does scaling.

      You only need one shader program per glyph no matter what point size you want to draw. There is a lot of overhead in managing the bitmaps for all of the different point sizes. These bitmaps can get quite big on a 4K by 3K resolution screen.

    3. Re:Drawing text with GPU shader units? by Have+Blue · · Score: 1

      It might be more efficient to convert the fonts beforehand into polygons that can be processed by the GPU's hardware tessellator (I know the Radeons have this, not sure about the GeForces). Then they can be rasterized using a faster process.

    4. Re:Drawing text with GPU shader units? by whovian · · Score: 1

      The bezier stuff you mentioned whiffs of PostScript. Maybe the IT lawyers could chime in and say whether the extant patents that companies (like Adobe, IIRC) have also apply to GPU-rendered curves.

      --
      To-do List: Receive telemarketing call during a tornado warning. Check.
    5. Re:Drawing text with GPU shader units? by jonsmirl · · Score: 1

      With tessellating you can't perform hinting at tiny point sizes.

    6. Re:Drawing text with GPU shader units? by BiggerIsBetter · · Score: 2, Insightful

      Why not? Printers have been doing this for years... There's no reason you couldn't make a graphics card to display postscript in hardware.

      --
      Forget thrust, drag, lift and weight. Airplanes fly because of money.
    7. Re:Drawing text with GPU shader units? by One+Louder · · Score: 1

      I don't think there are any patents on simply rendering the filled bezier outlines - the patents usually have to do with adjustments to improve appearance at very small sizes, such as hinting or level-of-detail substitutions, or taking advantage of pixel geometry like in flat panels.

    8. Re:Drawing text with GPU shader units? by Anonymous Coward · · Score: 0

      If Adobe can stretch out its patents to cover anything that converts TrueType font vectors to another general-purpose vector format, a whole lot of projects more important than this one are screwed too. My guess is that if they had the patents to do that, they would already be using them to squash freetype and all the various SVG rendering libraries out there.

    9. Re:Drawing text with GPU shader units? by Allen+Akin · · Score: 1

      Yeah, I have code that does this (and I'm sure lots of other folks do, too).

      The plus is that it can be used to produce fairly nice antialiased text that intermixes well with other primitives, and rendering is very fast.

      The minus is that a single set of geometric primitives for a character won't work for all point sizes if you need to use hinting. (Whether this is important or not depends on your application -- especially whether you need very small text, or have so little graphics memory available that you can't store more than one set of primitives.)

      The interesting challenge is to see whether hinting can be implemented with a vertex program. Some preprocessing of the outline probably would be necessary, as well as some cleverness triggering the vertex program to generate the primitives for the rasterizer. If anyone already has this working, I'd be interested in hearing about it.

      Allen

    10. Re:Drawing text with GPU shader units? by whovian · · Score: 1
      Thanks. Self correction: I guess that should be Apple not Adobe, according to FreeType:

      Apple Computer owns three patents that are related to the processing of glyph outlines within TrueType fonts. This process is also called hinting or grid-fitting and is used to enhance the quality of glyphs at small bitmap sizes.
      ( http://freetype.sourceforge.net/patents.html )

      --
      To-do List: Receive telemarketing call during a tornado warning. Check.
    11. Re:Drawing text with GPU shader units? by EddWo · · Score: 1

      I believe Microsoft is doing this in Longhorn. They are reworking their ClearType code to use pixel shaders where available.

      --
      "Taligent is still pure vapor. Maybe they'll be the last who jumps up on Openstep... "
    12. Re:Drawing text with GPU shader units? by whovian · · Score: 1

      It appears instead that Adobe is embracing SVG by releasing their own viewer.

      --
      To-do List: Receive telemarketing call during a tornado warning. Check.
    13. Re:Drawing text with GPU shader units? by jonsmirl · · Score: 1

      Do you have any reference supporting this?

    14. Re:Drawing text with GPU shader units? by EddWo · · Score: 1
      --
      "Taligent is still pure vapor. Maybe they'll be the last who jumps up on Openstep... "
    15. Re:Drawing text with GPU shader units? by EddWo · · Score: 1
      Windows Longhorn Graphics Infrastructure and Text Rendering

      A self-extracting zipped Powerpoint presentation.

      --
      "Taligent is still pure vapor. Maybe they'll be the last who jumps up on Openstep... "
    16. Re:Drawing text with GPU shader units? by orasio · · Score: 1

      something like this?
      Display Postscript

      In the engineering school where I study, there are some old Sparc workstations with PostScript displays, and even optical mice, from the eighties!! (the ones with the grid on the mousepad)

  18. Re:A good example of how an OS should be programme by JonnyRo88 · · Score: 1

    What OS are you using? A 2.6 kernel screams with performance, and 2.4.24 is running beautifully on some of my systems.

    The only thing that interests me about Brook is that it may allow for more efficient cryptography operations, if the processors it runs on are very good at running calculations in parallel. I would love to see something like this become a cheaper alternative to dedicated hardware encryption accelerator cards, because graphics cards benefit from an economy of scale that encryption accelerators do not, and thus are much cheaper.

    --
    The Ro Factor - Jeep/Linux Weblog
  19. The future is the past by Anonymous Coward · · Score: 1, Insightful

    Beyond3D had a shader contest and some genius wrote frogger in a pixel shader. That's some great stuff.

    If GPUs are to become general purpose, will the AGP bus problems have to be fixed (fast at delivering data, slow at getting it back from the GPU)? Does PCI-X fix this?

    1. Re:The future is the past by Total_Wimp · · Score: 4, Interesting

      PCI-X can fix this data bus in other ways as well. Motherboards come with one AGP slot, but PCI-X can and will provide many expansion slots.

      Picture five high end GPUs on the motherboard eclipsing the single high-end cpu for a fraction of the price. Intel and AMD would be forced to cut the asking price of their products to compete. We could finally see some real four-way competition for "processors".

      TW

    2. Re:The future is the past by Anonymous Coward · · Score: 0

      Do you live in a cave-grave somewhere in Trikith or something?

      The last time I checked, the high end Nvidia or ATI GPU was at least 3 times as expensive as a Pentium 4 2.2 ghz. And don't get me started at how much more expensive it is than a Athlon.

      Once again, how do you write sensible sounding posts without actually basing it on reality?

    3. Re:The future is the past by Total_Wimp · · Score: 1

      "The last time I checked, the high end Nvidia or ATI GPU was at least 3 times as expensive as a Pentium 4 2.2 ghz. And don't get me started at how much more expensive it is than a Athlon."

      Top of the line CPUs are more expensive than the top of the line GPUs, just like mid-range CPUs are more expensive than mid-range GPUs. Why are you comparing a High-end part with a mid range part and calling me unsensible?

      But wait, there's more. If you were able to put several high-end video cards in PCI-X slots and use the CPU to manage them then you'd have to compare 5 high-end GPUs + 1 high-end CPU + 1 single processor motherboard to 6 high-end CPUs capable of six-way SMP(much higher price than the "other" high end CPUs) + 1 motherboard capable of six-way SMP.

      Question: which would have higher processing power and which would cost more? Based on stats I've seen, the 5 GPUs will kick butt for a fraction of the price.

      TW

    4. Re:The future is the past by parnold · · Score: 1

      You are thinking of PCI Express. PCI-X is faster than standard PCI but nowhere near as fast as PCI Express and is not intended as a replacement for AGP.

      PCI Express, which is set to replace AGP, comes in several speeds, and motherboards will only have one 16x PCI Express slot suitable for use with graphics cards.

      So unfortunately this will have no effect on CPU prices.

      --
      this sig intentionally left blank
    5. Re:The future is the past by Total_Wimp · · Score: 1

      You're right, I had them confused. That said, PCI-Express is still promising. I found this gem in the PCI-Express article that you linked:

      "One fairly exciting possibility for this upcoming technology is the potential for multiple graphics cards using the same high-speed connection. Currently it is possible to use more than one video card in a given system by combining an AGP card with a 32-bit PCI card. There is no apparent reason why the X16 spec could not be extended to include support for more than one slot, allowing multiple high-speed PCI-Express graphics cards... though at the moment, such a feature does not exist as far as we know."

      The article seemed to believe PCI-Express would tend toward 1x connections for cost reasons and usually that does have a very large effect in the marketplace, but the market for multiple monitors is growing so I'll hold out some hope for the future :-)

      TW

      PS. Thanks for the links! I needed to catch up.

  20. Brook by belmolis · · Score: 5, Insightful

    This looks like a straightforward and clean extension that experienced C/C++ programmers won't find difficult to learn, but it isn't entirely clear to me whether just using this language, without any knowledge of GPU architecture, will lead to big improvements in performance. Granted, you don't need to know the details, but you've got to have an idea of what it is that you're trying to do and in a general way how the special constructs of the language allow you to do that. As with other such language extensions, you can nominally write in the language but not really use the extensions (how many "C++" programs have you seen that were really C programs with // comments and a few couts?) or use them in unintended ways that prevent the intended optimization. It seems to me that if the project really is aiming at programmers who are not familiar with GPUs, they need at least to provide a brief introduction to the special properties of GPU architecture and some guidelines as to how to use the features of the language to take advantage of them. At present I don't find this either on the web sites or in the distribution.

  21. MESA OpenGL Acceleration by JonnyRo88 · · Score: 1

    One thing that would interest me is the development of improvements to the mesaGL drivers for linux that would utilize Cg programs running on an nvidia card to compute some graphics calculations.

    I know this does not take advantage of the card's built-in capabilities for 3D, but it would allow for the creation of a better GPL'ed driver for the nvidia geforce, one suitable for distribution with GPL'ed kernels. The user could decide at a later point to switch to the closed source driver if they want to go even faster, but at least the default driver that comes with a distribution wouldn't be so crappy.

    --
    The Ro Factor - Jeep/Linux Weblog
    1. Re:MESA OpenGL Acceleration by DAldredge · · Score: 1

      I think you have to have the NVIDIA drivers installed to be able to use this on NVIDIA cards.

    2. Re:MESA OpenGL Acceleration by JonnyRo88 · · Score: 1

      That sucks, good point though. I wonder how hidden the interface to the Cg hardware is on the nvidia card.

      Perhaps nvidia would be willing to release specs on this to open source developers? I don't know how closely this is tied into their proprietary graphics algorithms they want to keep secret from competitors.

      --
      The Ro Factor - Jeep/Linux Weblog
  22. Obligatory Amiga Comment by Anonymous Coward · · Score: 1, Informative

    Shades of programming the Amiga Blitter. I think Dave Haynie had Life running at 60FPS in about 1986-87 on a 68000.

    1. Re:Obligatory Amiga Comment by Anonymous Coward · · Score: 0

      The problem is that this is a brute force algorithm. You can't go directly to bazillionth iteration.
      Converting hashlife to blitter/gpu is going to be more challenging.
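For context, "brute force" here means computing every generation from the previous one, cell by cell. A minimal sketch of one such step in C follows, assuming a small fixed grid with dead cells beyond the border. The names `life_step` and `live_neighbours` and the 5x5 size are choices made for this example only.

```c
enum { LIFE_W = 5, LIFE_H = 5 };

/* Counts live neighbours of (row, col); cells are 0 or 1 and anything
   outside the grid is treated as dead. */
static int live_neighbours(const unsigned char *grid, int row, int col)
{
    int count = 0;
    for (int dr = -1; dr <= 1; ++dr) {
        for (int dc = -1; dc <= 1; ++dc) {
            int r = row + dr, c = col + dc;
            if ((dr != 0 || dc != 0) &&
                r >= 0 && r < LIFE_H && c >= 0 && c < LIFE_W)
                count += grid[r * LIFE_W + c];
        }
    }
    return count;
}

/* Computes one generation: birth on exactly 3 neighbours,
   survival on 2 or 3, death otherwise. */
void life_step(const unsigned char *in, unsigned char *out)
{
    for (int r = 0; r < LIFE_H; ++r) {
        for (int c = 0; c < LIFE_W; ++c) {
            int n = live_neighbours(in, r, c);
            int alive = in[r * LIFE_W + c];
            out[r * LIFE_W + c] = (unsigned char)(n == 3 || (alive && n == 2));
        }
    }
}
```

Every output cell depends only on a fixed neighbourhood of the input, which is why this maps well to a blitter or a GPU, and also why reaching generation N costs N full passes.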

    2. Re:Obligatory Amiga Comment by Anonymous Coward · · Score: 0

      I misread 'Life' as 'Half-Life' and thought it was a (very funny) dig at the Amiga and its users...

      Sorry. :-)

  23. And the editors knew it. by DAldredge · · Score: 1

    I emailed the editors, but they didn't do a damn thing about it.

  24. Excellent! by macemoneta · · Score: 2, Interesting

    I had submitted an AskSlashdot on this subject:

    2003-04-20 01:51:36 Using video processing as "attached processor" (askslashdot,hardware) (rejected)

    But as you can see it was rejected. I was particularly interested in the use of the GPU for cryptographic functions (e.g., with a loopback encrypted filesystem), to offload the processing from the main CPU. Is anyone aware of any work in this area?

    Is this even a viable implementation, or would the overhead of continually dispatching work to the GPU exceed the benefit derived?

    --

    Can You Say Linux? I Knew That You Could.

    1. Re:Excellent! by Quixote · · Score: 1
      But as you can see it was rejected. I was particularly interested in the use of the GPU for cryptographic functions (e.g., with a loopback encrypted filesystem), to offload the processing from the main CPU.

      I'd say it won't work. The AGP bus is slow at pushing data out.

    2. Re:Excellent! by macemoneta · · Score: 1
      I'd say it won't work. The AGP bus is slow at pushing data out.

      AGP 8x can move 2.1 gigabytes per second (GB/s), according to Intel.

      --

      Can You Say Linux? I Knew That You Could.

    3. Re:Excellent! by larkost · · Score: 3, Informative

      2.1 GB/s is very nice, but it only refers to transfers in one direction: to the card. There is a (much) smaller bandwidth back to the motherboard. This is because for their designed purpose, graphics cards do not need to talk back to the system much, they just crunch the numbers and spit out the results to a monitor.

      With encryption you are usually looking at processing streams of data. If your encryption method involves a lot of floating point math (almost never) on every bit of information, then it would be nice. But encryption is almost always integer based (GPUs don't shine in integer like they do in floating point), and involves just as much data going in as coming back.

      If you are looking for a great (co) processor for integers, look at the Altivec section of the G4 (and the similar one in the G5.. I forget the IBM name).

    4. Re:Excellent! by Fulcrum+of+Evil · · Score: 1

      AGP 8x can move 2.1 gigabytes per second (GB/s)

      And how fast can it read that data? 133MB/s?

      --
      "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
    5. Re:Excellent! by Tarqwak · · Score: 1

      2.1 GB/s is very nice, but it only refers to transfers in one direction: to the card. There is a (much) smaller bandwidth back to the motherboard. This is because for their designed purpose, graphics cards do not need to talk back to the system much, they just crunch the numbers and spit out the results to a monitor.

      How about using that nice 9.9 Gbit/sec DVI link to get the data out?

    6. Re:Excellent! by WoTG · · Score: 1

      Well, PCI-Express isn't too far off. It's a general replacement for PCI at something like AGP8x speeds. This would eliminate the bus as the limiter of the card->memory data transfer. Video card designs, however, may not be set up for high back flow data rates.

  25. HP for GP?-AGP Bottleneck. by Anonymous Coward · · Score: 2, Interesting

    Wasn't there a Slashdot story about the slowness of reading back across the AGP bus? How will that affect the usefulness of GPUs?

    1. Re:HP for GP?-AGP Bottleneck. by Nexx · · Score: 5, Insightful

      WARNING: Lots of conjecture involved.

      That said, if you can fit your data sets and your program on to the video memory (128MB isn't uncommon on high-end), and you're doing lengthy calculations on these sets while being only interested in the results (again, not uncommon in HPC), then the relative slowness of reading these results back becomes a nonissue.

      Does that help? :)
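
      A rough back-of-the-envelope version of that argument, as a Python sketch. The 2.1 GB/s upload figure and the 133MB/s readback guess are the numbers quoted elsewhere in this discussion, not measurements; the helper name and the 1MB result size are made up for illustration.

```python
MB = 1024 * 1024


def transfer_seconds(num_bytes, bytes_per_second):
    """Time to move num_bytes over a link with the given sustained rate."""
    return num_bytes / bytes_per_second


# Figures quoted in this discussion (assumptions, not benchmarks).
upload_rate = 2.1e9      # AGP 8x, system -> card, bytes/s
readback_rate = 133e6    # guessed card -> system rate, bytes/s

data_set = 128 * MB      # fills a 128MB card
result = 1 * MB          # hypothetical small result

upload_time = transfer_seconds(data_set, upload_rate)
full_readback_time = transfer_seconds(data_set, readback_rate)
small_readback_time = transfer_seconds(result, readback_rate)

print(f"upload 128MB:        {upload_time:.3f} s")
print(f"read back all 128MB: {full_readback_time:.3f} s")
print(f"read back 1MB:       {small_readback_time:.4f} s")
```

      Under those assumed rates, reading everything back costs far more than the upload, while reading back only a small result is cheap by comparison.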

    2. Re:HP for GP?-AGP Bottleneck. by Anonymous Coward · · Score: 2, Insightful

      In 12 months, AGP will be obsolete.

      It will be replaced by PCI-Express, which as a general purpose bus supposedly won't have these issues.

    3. Re:HP for GP?-AGP Bottleneck. by skookum · · Score: 2, Informative

      Yes, the AGP -> main memory transfer rate of most video cards is abysmally slow, because it's not something that's needed for gaming. Maybe newer cards have changed, but I don't see why they would. background article

    4. Re:HP for GP?-AGP Bottleneck. by Directrix1 · · Score: 1

      No, it's needed for gaming if textures or vertex data are not in the video card's RAM. That was kind of the whole point of AGP: you can keep vertex and texture data in system RAM, in lieu of sufficient video RAM. It is slow, but then again access to RAM is slow (though it's getting a lot better).

      --
      Occam's razor is the blind faith in the natural selection of least resistance and in universal oversimplification. -- EF
    5. Re:HP for GP?-AGP Bottleneck. by skookum · · Score: 2, Insightful

      You're missing the point completely. Main memory -> AGP is blazing fast, for the reasons you just stated. AGP -> main memory is painfully slow, because there's almost no requirement for much data to flow in this direction. 99.9999% of the output that the video card computes is displayed on the screen and then discarded with the next refresh.

  26. Research by dfj225 · · Score: 4, Insightful

    I've always wondered why certain research programs (like Folding@home or SETI@home) don't use this type of code. My GPU sees more free time than my CPU, plus it would probably get the work done faster. Also, imagine the speed increase of utilizing both the GPU and the CPU to their fullest potential. Now that's some fast folding!

    --
    SIGFAULT
    1. Re:Research by BiggerIsBetter · · Score: 4, Interesting

      I (and presumably others) have asked some project leaders about this, but it seems to come down to testing and support of various cards. Also, remember that this is relatively unknown technology - Amiga blitting aside ;-) - you have to be pretty sure it's going to give accurate and consistent results before using it seriously. Find-A-Drug was my project of interest, and they have a Linux version too.

      --
      Forget thrust, drag, lift and weight. Airplanes fly because of money.
    2. Re:Research by JFMulder · · Score: 1

      The problem is that projects like SETI@Home strive for precision, and 32-bit vertex shaders just aren't precise enough compared to the 80-bit x87 FPU of an x86 CPU.

    3. Re:Research by Kris_J · · Score: 1
      Wake me when I can run a Distributed.Net client on my GeForce4 TI 4600, while it continues to display my 1280x960x32 desktop (and pauses gracefully while I run a 3D app).

      Or possibly when I can run a nice compression package like 7-zip with hardware assist from the GPU...

    4. Re:Research by Vijay+Pande · · Score: 1

      Yes, I definitely agree. We (I'm the head of the Folding@Home project) are actively working on this with the Brook group and other collaborators at Stanford. Stay tuned! For those who are curious, there has been some discussion on the Folding forum http://forum.folding-community.org/

  27. I've always wondered when this would happen... by malakai · · Score: 2, Interesting

    But what I'm really looking forward to is a physics-specific processor that sits alongside the graphics processor and is responsible for collision detection.

    The last few SIGGRAPHs had numerous approaches using GPUs to detect collisions, in real time, between complex volumes using only the GPU. With some minor tweaking, graphics manufacturers can make this 100x more efficient and easier to implement.

    With the 'shader' languages now being able to create and modify meshes procedurally, this is the best place to detect collisions (beaming the mesh data back to your motherboard so that your local CPU can figure out what collided is not efficient).

    1. Re:I've always wondered when this would happen... by Animats · · Score: 3, Interesting
      But what I'm really looking forward to is a physics-specific processor that sits alongside the graphics processor and is responsible for collision detection.

      It's been done. The Havok game physics system is available for the Playstation 2, and the physics is running in the vector processors, where most of the PS2's compute power resides.

      Collision detection isn't that CPU-intensive. (This may surprise people not familiar with the field. But it's true. If collision detection is using substantial CPU time, you're doing it wrong.) Correct collision resolution is where the time goes.

      Physics code works better with double-precision FPUs. You need both dynamic range and long mantissas to do it well. Some of the game consoles, and most of the GPUs, only have single-precision FPUs. It's possible to make physics code work in single precision, but fast-moving objects that cover considerable distance may have problems.
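
      A small Python sketch of that last point, assuming IEEE 754 single precision. It uses the standard `struct` module to round to 32-bit floats; the position and step values are arbitrary and the helper name is made up for illustration.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


position = 100_000.0   # object far from the origin (arbitrary units)
step = 0.001           # small per-frame movement

# Double precision keeps the small step.
moved_double = position + step

# Single precision: near 100000 adjacent floats are 1/128 apart,
# so a 0.001 step is rounded away entirely.
moved_single = to_float32(to_float32(position) + to_float32(step))

print(moved_double - position)   # close to 0.001
print(moved_single - position)   # 0.0
```

      The object simply stops moving in single precision once it is far enough from the origin, which is the kind of problem described above.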

    2. Re:I've always wondered when this would happen... by Anonymous Coward · · Score: 0

      I've heard it said that insisting on double precision is a sign of one or both of bad algorithms or bad programming. I was pleasantly surprised to hear this from experts in Physics simulation who had tried out a single precision only product I was involved with (and had expected to see rejected on these grounds)

    3. Re:I've always wondered when this would happen... by Anonymous Coward · · Score: 0

      I've often thought a physics coprocessor would be an excellent idea -- not only for collisions, but for the rest of Newtonian and thermodynamic physics as well. It wouldn't be much use for word processing, but can you imagine a version of GTA5 that figures out the friction between the tire and the road, the aerodynamics and inertia of the car, the trajectory of the bullets, and the explosive power of your gas tank catching on fire, all in hardware? A few dozen physical laws programmed into a massively-parallel chip would reduce the load on the main CPU while at the same time allowing for stunts and eye-candy not dreamed of by the programmers.

  28. DSPs = linear equation processors by Doc+Ruby · · Score: 2, Interesting

    We used the AT&T DSP32, a 12.5MFLOPS DSP, 15 years ago at Array Technologies. Programmable in native C source code, with multiply-accumulate (MAC) instructions optimized in microcode, the DSP32 was lightning fast at y = mx + b equations in its arithmetic logic unit (ALU), and its control logic unit (CLU) was also very fast at branching, including no-overhead looping. Linux runs on one of its many fascinating descendants, the Xilinx Virtex-2 Pro.

    --

    make install -not war

  29. Windows only by Anonymous Coward · · Score: 0

    i don't have a Windows machine with a graphics card, and it doesn't appear to support Linux. Which is a shame. Might make a nice video encoding accelerator, no worries about integer precision for example.

  30. Re:Who the FUCK uses Debian anymore? by isolenz · · Score: 0, Offtopic

    yes, I am a post debian and not currently a Gentoo nut, but there are reasons why people still use Debian. For its STABILITY, and the fact that their stable trees are very stable. Gentoo is great for, obviously, you and me who use linux as a desktop environment and don't care about uptimes much, but I've had many problems with gentoo after updating packages, going to the extreme of hours on end of trying to fix things up, which I shouldn't have to. This is something that shouldn't be, and hence the good reason why debian is still a great distro, just in a different way.

    --
    there are 10 type of people in this world, one's that repeat a stupid joke over and over, and one's that learn that it is freaking retarded and never use it... oh yah, I am funny because I spelt 10 in binary..... /me almost forgot

  31. GPU opcodes by Anonymous Coward · · Score: 4, Informative

    Here is a Beyond3d link that has some opcode info. Look around their site for a NV30 vs R300 architecture document that has lots of great stuff. If you are looking for the best s/n ratio, Beyond3d is one of the best. All meat, little fanboyism.

  32. Nivida CG by Popsikle · · Score: 3, Informative

    Nvidia has this already!
    "About Cg The Cg Language Specification is a high-level C-like graphics programming language that was developed by NVIDIA in close collaboration with Microsoft Corporation. The Cg environment consists of two components: the Cg Toolkit including the NVIDIA Cg Compiler Beta 1.0 optimized for DirectX(R) and OpenGL(R); and the NVIDIA Cg Browser, a prototyping/visualization environment with a large library of Cg shaders. Developers also have access to user documentation and a range of training classes and online materials being developed for the Cg language."

    http://www.nvidia.com/object/IO_20020612_7133.html

    1. Re:Nivida CG by dimator · · Score: 1

      Cg is the first thing I thought of when I saw this post... I need to read the project page, but it seems to me Cg would be the technology to learn given its strong corporate backing and maturity.

      --
      python -c "x='python -c %sx=%s; print x%%(chr(34),repr(x),chr(34))%s'; print x%(chr(34),repr(x),chr(34))"
    2. Re:Nivida CG by Anonymous Coward · · Score: 0

      Actually, the BrookGPU implementation that this article refers to requires an installation of Cg, because what it actually does is convert the Brook code into Cg code. That's why Cg is on its list of requirements page. Brook is just a slightly easier language to code in than Cg, that's all.

  33. multi gpu? by Anonymous Coward · · Score: 0

    Ok,

    Let's say I'm a glutton for punishment. Does BROOK support MULTIPLE graphics cards? I read the docs and they don't mention it explicitly.

    Let's say you have a case where calculations are well parallelized and you would like to consider factoring a problem across as many GPU's as your system can hold. Let's also say that your calculations are just hard enough to keep you below bandwidth saturation even at a 10GHz P4 equivalent... multiple GPU's might be useful.

    1. Re:multi gpu? by BiggerIsBetter · · Score: 1

      Hmmm. What's the fastest PCI graphics card you can buy these days?

      --
      Forget thrust, drag, lift and weight. Airplanes fly because of money.
    2. Re:multi gpu? by BiggerIsBetter · · Score: 1

      And to answer my own question... maybe GeForceFX 5200 is it? Radeon 9000 seems to be available too.

      --
      Forget thrust, drag, lift and weight. Airplanes fly because of money.
  34. BrookGPU's web page by hab136 · · Score: 1

    What, no screenshots???

  35. Re:A good example of how an OS should be programme by Anonymous Coward · · Score: 0

    No doubt Linux is the cream of our crop at the kernel level. I'm finding my gentoo KDE Mozilla performance still at the speeds of a similar MS setup.

    I'd just like to see more efficient use of the processing power. These GHz machines should be screaming...but alas, all I hear is grinding.

  36. Re:Windows only (not any more) by Anonymous Coward · · Score: 1, Informative

    Same AC as parent - happy to have discovered this

    "Fixed In This Release (12/19/03) * nv30gl backend compiles and runs on Linux. Requires Linux cgc compiler from NVIDIA and the latest drivers."

  37. Re:A good example of how an OS should be programme by Anonymous Coward · · Score: 0

    Now I know what the NSA is using to crack PGP.. a whole farm of Geforce FXs and 2.6 :)

  38. remember those 3dfx tv ad's... by agent2 · · Score: 2, Funny
    Imagine the kids - "Mom, I need the Radeon 9800XT to find a cure for Grandma's cancer!"
    ...that went something like "we have the technology...blah...something....to save lives....but instead....we've used 'em for games!!"
    1. Re:remember those 3dfx tv ad's... by Tim+Browse · · Score: 1

      "Blast his freaking head off!!!"

    2. Re:remember those 3dfx tv ad's... by Mysticalfruit · · Score: 1

      I was wondering when someone would bring that funny commercial up...

      I've got a pile of old pci voodoo2 cards lying around. I should get a box with 10 PCI slots and use the libraries to build a box that'll smash 3des in about 5 minutes ;-)

      --
      Yes Francis, the world has gone crazy.
  39. Re:Why subscribe? by DAldredge · · Score: 0, Offtopic

    Yes.

    If they are not working, then they should not post a link at the bottom of the story saying to email in any mistakes.

  40. Kernel driver? by Anonymous Coward · · Score: 0

    What about the possibility to have a kernel module doing this stuff?
    I'd stack my PCI slots full of spiffy videocards and telnet to the machine with my i386.

  41. Re:Windows only (not any more) by Anonymous Coward · · Score: 0

    Just got it to build using version on SourceForge. Makefile has a spurious object "ihash.o" - delete it, then gram.y barfs with an error - stick a semicolon at the end of the offending line. The cg compiler from nvidia builds OK, you need it.

  42. Shading's a specialized task. Try Perl interpreter by Anonymous Coward · · Score: 0

    Of course, a shader isn't general purpose programming. GPUs are optimized for this sort of task. If any of the standard benchmarks were recompiled for a GPU, you would see just how poorly they perform certain tasks.
    Also, GPUs are designed for one-way processing, much like a DSP. In case you're not aware, even a 600MHz TI DSP will DESTROY any x86 or x86-64 microprocessor when it comes to FFTs. But they would fail miserably at text parsing.

  43. Interesting by Stonent1 · · Score: 1

    I've wondered about this very thing for a few years now. Good to see that it really was possible.

  44. DivX by iamacat · · Score: 1

    Is there a prize for the first optimized encoder for some flavor of MPEG4? Imagine ripping a DVD in one hour. Hopefully ATI users on OSX are not left behind.

    1. Re:DivX by ColaMan · · Score: 1

      Hmmm - I can rip a DVD to a 1600kbps xVid file in about 2 hours using the ripper in MythTV (MythDVD and transcode as the backend). It takes 15 min or so to rip the VOB file, about 2 hours or so to encode it.
      This is with a XP2000+ and gentoo linux, strangely DVD rips on my XP1800+ under winXP take 8 hours or so and even with some careful scrutiny I can't see the difference.

      Beats me why it does that , but I don't complain, I just rip them all with MythDVD now ;-)

      --

      You are in a twisty maze of processor lines, all alike.
      There is a lot of hype here.
  45. 10 GHz? by Anonymous Coward · · Score: 0

    If you want to compare peak performances, then do it right. P3, P4, K6-2, K7 and K8 can all do 2 single precision adds and multiplies per clock cycle when programmed carefully. This means that you only need 5 GHz in order to achieve 2GFlops.

  46. argh by Anonymous Coward · · Score: 0

    s/2GFlops/20GFlops/
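
    With that correction applied, the arithmetic can be sketched in a few lines of Python. The 4 flops per cycle figure is simply the claim from the previous comment (2 adds plus 2 multiplies), the 20 GFLOPS is the number from the story summary, and the function name is made up for illustration.

```python
def required_clock_ghz(target_gflops, flops_per_cycle):
    """Clock rate (GHz) needed to reach target_gflops at a given peak rate."""
    return target_gflops / flops_per_cycle


# Assumption taken from the comment above: 2 adds + 2 multiplies per cycle.
flops_per_cycle = 2 + 2
gpu_gflops = 20          # figure quoted in the story summary

clock = required_clock_ghz(gpu_gflops, flops_per_cycle)
print(clock)   # 5.0
```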

  47. Accuracy by Anonymous Coward · · Score: 0

    The reason these units are fast is because they use floating point tricks, the numbers aren't very accurate and shouldn't be used for computation of real things like physics etc.

  48. memory bandwidth is the key by peter303 · · Score: 2, Insightful

    Even though general purpose CPUs approach the flop rate of GPUs, you can't feed the memory for many data intensive computations fast enough. A GPU may give you 12 or so bytes of data per cycle, where very few commodity CPU buses can do that.

    1. Re:memory bandwidth is the key by renoX · · Score: 1

      You're right that the memory bandwidth and parallelism of GPUs are their strong points.

      But everyone here is forgetting their weak points:
      - limited precision: scientific computing usually uses 64-bit doubles for computations; currently GPUs are limited to 32-bit.
      And I don't foresee GPUs using doubles any time soon! This level of precision is not needed for 3D graphics calculations.
      - when you want to read the data back from the GPU, you get little bandwidth, but it may become better with PCI Express, we'll see soon.

  49. GPU use for scientific programming. by kiniry · · Score: 4, Interesting

    Researchers at Caltech and other institutions have been looking at this for about three years. See "Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid" by Bolz, Farmer, Grinspun and Schroder (SIGGRAPH 2003), for example. The paper, illustrations, and movies are available from Dr. Grinspun's homepage. The primary problems with the approach at the time this work was done was the limited bandwidth of texture-related operations in OpenGL based upon improper assumptions in pipeline optimization.

    --
    Joseph R. Kiniry
    http://kind.ucd.ie/~kiniry/
    Lecturer
    UCD School of Computer Science and Informatics
    1. Re:GPU use for scientific programming. by echion · · Score: 2, Interesting

      The bandwidth limitations you highlight and the others mentioned in other papers by Grinspun are probably similar to quantum-computing limitations: e.g., in GPUs you can read some read-only registers, multiply/add them (in parallel) tons of times, and then write to some other write-only registers in the GPU; in quantum computing you can take some atoms whose state you knew, apply tons of (parallel) quantum operations, and then observe the results (so they're useless for more quantum computations).

  50. What comes around goes around by /Wegge · · Score: 1

    Back in the late part of the eighties, my company got started by making (almost) real-time control systems based on IBM PS-2 machines running Xenix, with the CPU-intensive stuff on Artic RIC cards.

    Although the RIC card was meant to be an intelligent serial communications gizmo, a lot of the higher level processing was delegated to those as well.

    It seems to me that we are about to get to the other side of the everchanging wheel of "lots of chips" vs "One huge CPU". Again.

    --
    //Wegge
  51. CPU-GPU bandwidth solved? by jkantola · · Score: 1

    Earlier this year, when I was considering the options for using GPUs to do some quantum physics, I found out that there were serious bandwidth problems in the GPU drivers, and even though it was possible to send the data to a GPU for fast processing, the readback of results was awfully inefficient (so much so you lost any processing advantage). This has been solved now?

    1. Re:CPU-GPU bandwidth solved? by Saville · · Score: 1

      PCI Express will help, but it is still slow. It only is faster if you can send a large batch for processing to the VPU and then later read it back.

  52. More speed for the Terascale cluster? by Anonymous Coward · · Score: 2, Interesting

    Weren't the Virginia Tech's G5 supercomputer nodes all equipped with standard ATI cards? If used right, there could be 1100 more processors to use...

  53. distributed.net by terminal.dk · · Score: 2, Interesting

    When will the new client be out for this platform ?

    I know my PC eats 20 Watts more of power when in 3D mode, but still, I want the faster agent :=)

  54. Yeah, but by toganet · · Score: 1

    Is Cg cross-platform? (i.e., can you write programs for Radeon GPUs with it?)

    Then again, is Brook cross-platform?

    If either is general-purpose enough, they could be used to implement routines on less expensive GPUs, such as those from SiS et al.

  55. Re:Jihad! by Anonymous Coward · · Score: 0

    WTF?! they seriously have a website designated to promote terrorism against /.
    Grow up.

  56. Crypto by Effugas · · Score: 2, Interesting

    We've talked a decent amount about doing crypto on GPU's. The fundamental issue is that such processors are massively optimized for operating on floating point numbers, and almost all crypto is integer based -- lots of bitshifts, MODs, and XOR's, only the latter of which this gear handles correctly. Even if the problem with getting data back off the card was solved, the card itself couldn't do the job.

    Indeed, I only know of one crypto hack that uses floats -- being from DJB, it's predictably brilliant. Basically, it's easy to compute the floating point error from a given operation, but computationally hard to find an operation that yields a given error. So you can effectively sign (or at least MAC) arbitrary content. Nice!

    --Dan

    1. Re:Crypto by sql*kitten · · Score: 1

      We've talked a decent amount about doing crypto on GPU's

      You can already get hardware accelerated crypto, Sun even bundles it with some servers. I remember back in '96 accidentally melting an nCipher unit that fitted into a drive bay...

    2. Re:Crypto by Panaflex · · Score: 1

      Well, those rainbow and ncipher cards are great general purpose crypto cards but they don't help much when doing number sieving etc..

      pan

      --
      I said no... but I missed and it came out yes.
    3. Re:Crypto by Effugas · · Score: 1

      The nCipher accelerates modular exponentiation -- stuff like 2^3000 mod 15001 (or the remainder of 2 to the three thousandth power divided by 15001). This is inherently an integer op, and is only fast because you can break it down like one.

      The computational pathways of GPU's are not appropriate for such calculations.
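
      For anyone unfamiliar with the operation, here is a minimal Python sketch of square-and-multiply modular exponentiation. It is illustrative only: this is not a description of how the nCipher hardware works, and `modexp` is a name made up here. Everything is exact integer arithmetic, which is the part float-oriented shader hardware doesn't give you.

```python
def modexp(base, exponent, modulus):
    """Compute (base ** exponent) % modulus by square-and-multiply."""
    result = 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:                 # low bit set: multiply it in
            result = (result * base) % modulus
        base = (base * base) % modulus   # square for the next bit
        exponent >>= 1
    return result


# The example from the text: 2^3000 mod 15001.
value = modexp(2, 3000, 15001)

# Python's built-in three-argument pow does the same job.
assert value == pow(2, 3000, 15001)
```

      The loop runs once per bit of the exponent, so 2^3000 needs about a dozen squarings rather than 3000 multiplications.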

      --Dan

  57. It has not L2 cache, xDDD by Anonymous Coward · · Score: 0
    The GPU doesn't have the 128 KiB L2 cache of a Duron Applebred, xDDD.

    open4free

  58. Re:A good example of how an OS should be programme by mabinogi · · Score: 1

    C is already an efficient language....and that's what most operating systems are written in.

    It's not the fault of the language how people use it, I'm sure people would be able to write big slow things just as well with Brook.

    --
    Advanced users are users too!
  59. Imagine a Beowulf Cluster... no, seriously by billstewart · · Score: 4, Interesting
    There's a cluster of Sony Playstations at UIUC (BBC) that's using the Emotion Engine to do numbercrunching and running Linux on the main processors to do communications and I/O. It's probably not strictly Beowulf, because it's using the Playstation version of Linux.

    This cluster has 70 Playstations (one article said that they'd ordered 100, but only 70 are in the cluster... Obviously the others are being used for "research".)

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  60. GPUs and DSPs often do 32-bit by billstewart · · Score: 1

    Precision can be a problem, because GPUs and DSPs often run single-precision floating point, at least for the widest-parallelism parts like shaders. They may have some double-precision capability as well, but it's usually used for less-parallel activities like geometry crunching.

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  61. HP for GP?-Fakeout. by Anonymous Coward · · Score: 0

    The problem with your statement is that both ATI and Nvidia use the same GPU on their CAD cards as used on their Gaming cards. The difference is either drivers(1), firmware, or a pin setting. Internally there's no difference.

    (1) One reason not to open-source.

    1. Re:HP for GP?-Fakeout. by sql*kitten · · Score: 1

      The problem with your statement is that both ATI and Nvidia use the same GPU on their CAD cards as used on their Gaming cards.

      That's probably just to save on die/fab costs - there's no reason why different circuitry couldn't be activated by "gaming" mode than by "CAD" mode.

    2. Re:HP for GP?-Fakeout. by Directrix1 · · Score: 1

      There can be no difference. I thought the real reason to get a *professional level* card is to get a guarantee of reliability (regardless of whether it is software in origin or hardware, since it all boils down to the same thing in the end).

      --
      Occam's razor is the blind faith in the natural selection of least resistance and in universal oversimplification. -- EF
    3. Re:HP for GP?-Fakeout. by sql*kitten · · Score: 3, Informative

      I thought the real reason to get a *professional level* card is to get a guarantee of reliability

      Well, ISV certification - a CAD vendor will assert "with this card, our software produces no rendering artifacts".

  62. At Last!! by sryx · · Score: 2, Funny

    So now I can port my slow as tar software rendering engine to this and finally make my DOOM killer 3D Game a reality!

    Oh wait.. never mind

    -Jason

  63. A real-world example - ray tracing by ron_ivi · · Score: 3, Informative

    http://portal.acm.org/citation.cfm?doid=566654.566640
    http://www.theregister.co.uk/content/54/25312.html
    http://online.cs.nps.navy.mil/DistanceEducation/online.siggraph.org/2002/Papers/13_GraphicsHardware/purcell.ppt

  64. Re:Who the FUCK uses Debian anymore? by Anonymous Coward · · Score: 0

    I've never had any issues with Gentoo past the two-day install process. Even that went off without a hitch and now I have a Linux distro which runs faster than yours. Ha ha ha.

  65. Re:Jihad! by V.P. · · Score: 0, Offtopic

    Damn you Paul Muaddib!

  66. AT&T DSP32 Cluster Supercomputer in late 80s by billstewart · · Score: 2, Interesting
    The AT&T DSP32 definitely rocked. In addition to doing 32-bit floating point multiply and accumulate, it could simultaneously do 24-bit integer calculations. The supercomputer cluster was up to 128 of them (I forget if they were 8 or 16 per board), with communications structured as a tree, which could give you 1 GFLOPS sustained and up to 2 GFLOPS if you could keep them busy doing multiply-and-accumulate. Not bad for a desktop in the late 80s, though of course you can get that for $49 today:-)

    A typical application was to use a couple of the processors to do geometry while the rest crunched shading, or alternatively to do lots of FFTs for signal processing - the box was mainly designed for the Navy, and 32-bit floating point was more than enough precision given the A/D converters on sonar input.

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  67. Faster X.... by A+Little+Goblin · · Score: 1

    You could wonder if this could be used to improve the performance of X, by having more of it running on the GPU?

  68. I/O and Interrupts and Parallelism, Oh My! by billstewart · · Score: 1
    GPUs get to do lots of parallelism, because most graphics work is inherently parallel (do the same stuff to a whole bunch of pixels) and they're built to exploit it by using multiple sets of multiply/add units.

    But beyond that, CPUs spend a lot of their time doing various kinds of I/O, handling interrupts, and talking to multiple coprocessors, while GPUs normally just get handed the stuff they want. Some of this work gets farmed out to other chipset members - NorthBridge memory controllers, Southbridge I/O controllers - but the CPU still ends up in charge of the process.

    Also, the CPU gets to do operating system jobs like deciding which chunks of memory belong to which applications, while the GPU doesn't worry about that - everything's going to get drawn on the screen. Perhaps Trusted Computing Digital Ridiculousness Management will change this, and the graphics processors will need to start keeping track of who owns each pixel or vector so it can use the right decryption context, to prevent you from watching movies you haven't paid for, but for the moment it's ignorance and bliss.

    Also, the real line is "Sorry, Kids, I need your Playstation 3 to find a cure for Grandma's cancer, go out and play soccer or something."

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  69. Required Slashdot reading list by Anonymous Coward · · Score: 0

    moron

  70. GPU SMP :) by Anonymous Coward · · Score: 0

    6 graphics cards in parallel ? Anyone writing a new BIOS for that. And fix that slow PCI bus while you are at it :).

  71. My old AHA1542 had a Z80 - can I use that ? by anti-NAT · · Score: 2, Funny

    Would that speed up my processing ? Will I be able to play Half Life 2 on my Pentium now ?

    --
    The Internet's nature is peer to peer - 20050301_cs_profs.pdf
  72. How long until by Lord+Kano · · Score: 3, Funny

    Someone ports a GPU Linux and some asshole loads 8 PCI cards into his machine and makes a beowulf cluster inside of one case?

    --
    "Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano
  73. Sounds like Velocity Engine (Altivec) by NeoBeans · · Score: 1

    Granted, accessing main memory from a G4/G5 processor will be slower, but doesn't this sound familiar to Apple users? That said... it's a cool idea. :-)

    1. Re:Sounds like Velocity Engine (Altivec) by Anonymous Coward · · Score: 0

      Altivec is the same as MMX, SSE, 3DNow!, etc.

      It allows your CPU to issue the same instruction to multiple pieces of data. Your CPU still processes them, it just processes them a bit quicker.

      This is using your GPU to do extra processing.

  74. currently close to useless by Saville · · Score: 1

    ATI can execute up to 96 ALU instructions in a fragment program on their top-of-the-line video card. That is actually 64 native instructions. The GeForceFX can handle 1024.

    Think about that for a second. Let's say you want to compute 1,000,000 things in parallel, so you create a 1000x1000 floating point texture to render to. You have to fit your algorithm into just 64 instructions. And you can't make function calls (yes, DX9 HLSL has functions, but they are all "inline"). And your algorithm has to have no loops. And your algorithm has to have essentially no branches. Well, it can have "branches" of the form a = x >= y ? b : c; but no real branches. And you can't have static or global variables.

    Each pixel is executed by the same 64-instruction program. If you want a static variable to save between different programs, you have to store it to another 1000x1000 texture, which is what this system does. If you want an if(whatever){ bunch of instructions }, you have to compute all those instructions and then multiply by 0 or 1 at the end in order to emulate branching. Terribly inefficient.

    BTW currently no graphics hardware exists which can emulate the standard T&L pipeline with eight lights! What would be a simple loop like this: for( i = 0; i

    We need far more flexible video cards before this is useful. ATI's ASHLI is sort of like this. It allows you to compile long Renderman or OpenGL shaders into multiple passes so they can be executed on our current hardware like the crappy ;) 9800XT http://www.ati.com/developer/ashli.html

    This will be a lot more interesting when we get pixel shaders that can do emulation of the standard T&L pipeline.
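
    The "compute everything, then multiply by 0 or 1" trick described in this comment can be sketched in ordinary Python. This is only an illustration of the idea, not shader code: the function names and the arithmetic on each side of the branch are invented for the example.

```python
def branchy(x, y, b, c):
    """What you would write on a CPU: only one side is evaluated."""
    if x >= y:
        return b * b + 1.0   # "then" side
    return c - 2.0           # "else" side


def predicated(x, y, b, c):
    """Branch-free form: evaluate both sides, then blend with a 0/1 mask."""
    mask = float(x >= y)     # 1.0 if the condition holds, otherwise 0.0
    then_val = b * b + 1.0   # always computed
    else_val = c - 2.0       # always computed
    return mask * then_val + (1.0 - mask) * else_val


print(branchy(3.0, 2.0, 4.0, 10.0), predicated(3.0, 2.0, 4.0, 10.0))
print(branchy(1.0, 2.0, 4.0, 10.0), predicated(1.0, 2.0, 4.0, 10.0))
```

    Both versions give the same answer, but the predicated one always pays for both sides, which is the inefficiency the comment complains about.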

    1. Re:currently close to useless by Saville · · Score: 1

      Oops, that was supposed to be in POT, not HTML.
      This is what was missing:

      BTW currently no graphics hardware exists which can emulate the standard T&L pipeline with eight lights! What would be a simple loop like this:
      for( i = 0; i < 8; ++i)
      {
        if(!light[i].enabled)
          continue;
        switch( light[i].type)
        {
          case directional: ...
            break;
          case point: ...
            break;
          case spotlight: ...
            break;
        }
      }
      is impossible. One program needs to be written for every permutation of different lighting settings, which is actually 65536 (4^8) different programs! Only many of those programs are impossible to write because you only have 128 instructions (or 256 on Radeon 9500+ and GeForceFX). By the time you calculate specular and colour for a point light (or worse, spot light) you've already used a ton of instructions. You simply cannot have eight point lights on current cards. Of course, to make vertex programs useful you really need to use other instructions for things other than calculating lighting, such as skinning of characters.
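
      The 65536 figure follows from each of the eight lights being in one of four states (disabled, or one of the three types in the switch). A small sketch that enumerates the combinations, with the state labels assumed from the loop above:

```python
from itertools import product

# Assumed state labels; "disabled" covers the !light[i].enabled case.
LIGHT_STATES = ("disabled", "directional", "point", "spotlight")
NUM_LIGHTS = 8

# One tuple per distinct lighting configuration, i.e. one per
# specialised program that would have to be written without loops
# or branches.
configs = list(product(LIGHT_STATES, repeat=NUM_LIGHTS))
print(len(configs))
```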

  75. Ray tracing with a GPU? by Angst+Badger · · Score: 2, Interesting

    So I have to wonder how much POVray could be sped up -- if any -- by modifying it so that suitable calculations were run on the GPU, in parallel, while the CPU took care of the rest.

    --
    Proud member of the Weirdo-American community.
  76. crushing... by Anonymous+Freak · · Score: 1

    Sorry, dist.net uses integer math exclusively, while GPUs use vector floating point math almost exclusively. (That, and dist.net likes a hardware 'rotate' function, which I don't know if any modern GPUs have.)
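
    For readers unfamiliar with the 'rotate' operation: it shifts bits out of one end of a word and back in at the other. Where there is no hardware rotate, it has to be built from two shifts and an OR, as in this sketch. The helper name `rotl32` is made up for the example, and nothing here says which of these integer operations a given GPU actually provides.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit value left by n bits using shifts, OR and AND."""
    x &= MASK32
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


print(hex(rotl32(0x12345678, 8)))
```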

    But, I'm sure you could get a few extra cycles out of a GPU. Just not as much as you'd hope.

    --
    Another non-functioning site was "uncertainty.microsoft.com."
    The purpose of that site was not known.
  77. AGP! by itzdandy · · Score: 1

    Currently, the problem with this is AGP. AGP is meant for downstream data (cpu/memory -> agp -> video card) but is miserable at upstream data.

    On the other hand, PCI-X or PCI-Express will solve this problem with very high-bandwidth upstream AND downstream data paths. So your output would be easily sent back to the CPU for data handling.

    With this approach, you could optimize your code to run integer math on the main CPU, and export floats to the graphics card(s).

    Also, with PCI+? additional graphics cards could be used as accelerator cards to handle additional math that the primary graphics card can't handle well, and improve performance on the primary card through some sort of scan-line interleaving or something. The fact that the upstream bandwidth is so high makes this possible without proprietary tech like the Voodoo2 cards used.

    1. Re:AGP! by Anonymous Coward · · Score: 0

      PCI-Express != PCI-X
      PCI-Express == PCI-E

  78. Re:AT&T DSP32 Cluster Supercomputer in late 80 by Doc+Ruby · · Score: 2, Funny

    We got our first boards from the developers of an antiaircraft RADAR signature decoder/sight. We wound up using DSP32Cs, 25MFLOPS as I recall, by late 1990. We had an EISA card (PCI was in the future) with an FPGA for linearly scalable pluggable DSPs. We had experimented with a transputer, but found we could use the DSPs to preprocess the video sensor data during calibration, and load custom logic and buses into the FPGAs for maximum efficiency routing the data. When the company folded and reformed, the technology had evolved into a general-purpose FPGA imageprocessor, with scalable utility DSPs embedded in the hyperarray of FPGAs. The lead engineer went to Xilinx, which has consistently produced the most advanced FPGAs since then, including the Virtex-II line with embedded RISCs (PPCs).

    One DSP SW engineer I worked with at Array had come from the academic computational music world. He had hooked each DSP32's six parallel ports to the other members of a cube topology, with buses around the surface of the cube. The buses were connected to actual I/O buffers. Some of those were connected to input controls, like sensors on big cans, or tuned monochords, or hard rubber blocks. Outputs from the cube were hooked to output actuators, like solenoids strapped to gongs, motorized clappers against barrels, and rows of mallets aimed at piano keys. Musicians would bang their parts out on the inputs, with computed rhythms and "pitches" spewed out of the actuators. Keyboard/monitor stations allowed musicians on the parallel network to sample parametrized rhythms, sequences, timbres and other values in realtime from other musicians.

    The whole thing was totally insane, but then we were a Silicon Valley company in Oakland during the last recession that recruited exclusively musicians, philosophers, exhippie mathematicians, and yours truly (college dropout) as their mascot, for an imageprocessing startup. I've never been the same since, and the industry has yet to catch up with any of us.

    --
    make install -not war

  79. GPU for computation = wrong architecture by master_p · · Score: 1

    It's not good to use the currently available GPUs for computation tasks. Since the GPU specs change all the time and there is no standard (especially when it comes to shaders), it will create all sorts of nightmarish problems concerning OSes, especially open source ones.

    It is the ideas behind the GPU that must move down to the CPU, mainly the vector unit and the high bandwidth. Remember that the PowerPC with the Altivec extensions is a very impressive CPU; also remember that the Playstation 2 can do 48 GB/sec data transfer. The PC needs a really fast bus and lots of custom hardware for most parallelisable tasks.

  80. Re:Fast Fourier Transform - and as if by magic! by N+Monkey · · Score: 1

    I'd love to see an FFT implementation (maybe it's not so hard ... will have to download and play with it.)

    A paper on that very subject was presented at Graphics Hardware 2003. You should be able to find it here

  81. Neural Nets by Anonymous Coward · · Score: 0

    Neural nets on the GPU might be a better application. Lots of data going into the net. Lots of computation within the net. A small amount of data (the answer) leaving the net.

    1. Re:Neural Nets by pclminion · · Score: 1

      Good idea, since the layer computations within the net are merely a series of matrix multiplications. However, you still need to apply the transfer function to the output vectors before flowing into the next layer. How do you implement such a nonlinear transfer function on a GPU?
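
      To make the structure concrete, here is a minimal CPU-side sketch of one layer: a matrix-vector product followed by a transfer function applied to each element. The sigmoid is just one common choice, and the weights are arbitrary example values. The sketch shows the shape of the computation and does not answer how to express the transfer function on a GPU.

```python
import math


def sigmoid(z):
    """A common nonlinear transfer function."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(weights, inputs, transfer=sigmoid):
    """One layer: matrix-vector product, then the transfer function
    applied independently to each element of the result."""
    sums = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
    return [transfer(s) for s in sums]


weights = [[1.0, -1.0],
           [0.5, 0.5]]
outputs = layer_forward(weights, [2.0, 2.0])
print(outputs)
```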

  82. It IS turing complete! by glyph42 · · Score: 1

    If you actually read anything you would have noticed that modern GPUs are indeed Turing complete, and many, many mathematical operations can be performed with great parallel efficiency.

    --
    Music speeds up when you yawn, but does not change pitch.
  83. GPU=DSP by mr.Spike+(edd+sonic) · · Score: 2, Informative


    Interesting, at least as the GPU is really a sort of DSP (Digital Signal Processor). And as I am deeply into both audio and broadband signal processing hardware systems development, I find using those chips on high-performance video cards to be extremely useful for processing waveforms using the base of it all: matrix calculations. It allows FFT and iFFT (and of course DFTs, DCTs and so on), as well as QMF and PQF filtering and synthesis.
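
    As a small illustration of the "matrix calculations" point, a DFT can be written as a single matrix-vector product. The sketch below is the naive O(n^2) form in plain Python, not an FFT and not GPU code; an FFT computes the same result with far fewer operations.

```python
import cmath


def dft_matrix(n):
    """Build the n x n DFT matrix W[k][t] = exp(-2*pi*i*k*t/n)."""
    return [[cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)]
            for k in range(n)]


def dft(signal):
    """Naive DFT: multiply the DFT matrix by the signal vector."""
    w = dft_matrix(len(signal))
    return [sum(wkt * x for wkt, x in zip(row, signal)) for row in w]


# One cycle of a cosine sampled at four points.
spectrum = dft([1.0, 0.0, -1.0, 0.0])
print([round(abs(c), 6) for c in spectrum])
```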

    I could dream of doing a hi-fi vocoder out of a video card - crazy but interesting! =)

    On the other hand, I am a little bit sceptical about all this, as it will not work even at half or a third of its GFLOPS performance when not used for the "native application". This means it could, after some time and hellish efforts, show that the "PAIN vs BENEFIT" ratio falls more and more to the pain side.
    I remember times I tried to use the 6510 CPU + 8 kbytes (don't remember exactly) of RAM inside the C64 disk drive to process graphics in parallel with the main processor. Efforts fell

    And from the third point of view -- see how Intel processors suck (~flamebait ;) - "price>performance" like always. Any small embedded chip outrivals it.


    p.s.
    Still hold on for the coming of the FPGA ;) I will do some article ASAP and post it near there, about what we could do about free & open computing on open hardware, not using proprietary chips.

  84. Re:AT&T DSP32 Cluster Supercomputer in late 80 by billstewart · · Score: 3, Interesting
    Yes, they were 25 MFLOPS. The chip had a 12.5 MHz cycle rate (I think that was also the clock speed), and each cycle could do a 32-bit multiply, a 32-bit add, and a 24-bit simple integer operation (some integer ops took multiple clocks, I think?)
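
    The arithmetic behind that figure, as a quick sketch. It assumes the multiply and the add are the two floating-point operations counted per cycle and that the integer operation is not counted; the constant names are made up for the example.

```python
CYCLE_RATE_HZ = 12.5e6   # 12.5 MHz cycle rate, as recalled in the comment
FLOPS_PER_CYCLE = 2      # one multiply plus one add per cycle (assumption)

peak_mflops = CYCLE_RATE_HZ * FLOPS_PER_CYCLE / 1e6
print(peak_mflops)
```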

    Your music application sounds like fun. I didn't know anybody was still doing anything quite like that by 1990 - there was a whole range of people around John Cage's time who did lots of prepared piano stuff.


    Some of the people who were trying to sell our multi-processor supercomputer flavor came up with a music studio application, doing lots of audio processing and mixing, sort of like your device turned inside out. Don't know if they sold more than one of them before the Lucent spinoff took them away.

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  85. renderfarm wet dream.... by The+Lynxpro · · Score: 1

    I'd think that the obvious applications that would benefit from such techniques would be in renderfarms. Perhaps it would be in the best interest of Weta Digital, Pixar, and ILM to invest some money into this project. After all, it would be much cheaper to acquire a bunch of used PCI-based Voodoo and GeForce cards and load multiple units into their PCs than to just increase the number of PCs sitting next to each other; they'd save money in terms of electricity consumption as well. There is a precedent, after all; I'm thinking about the pooling of resources that Paramount, Fox, and Disney did a few months ago in getting Adobe Photoshop and other mission-critical Windows programs to run on Linux...

    --
    "Right now, somewhere in this world, Scott Baio is plowing a woman he doesn't love," - Peter Griffin, *Family Guy*
  86. Not x86 by SoopahMan · · Score: 1

    One major difference is that GPUs were built using whatever architecture they cared to use, and the GPU's entire architecture can be turned upside-down in every version if the hardware manufacturer cares to. There's a powerful preprocessor - your CPU - running on the nVidia or ATI drivers to ensure that what goes in is right, and optimal.

    This idea isn't new to CPU manufacturers. This is exactly what Intel tried to accomplish with Itanium by doing the preprocessing at compile-time by the software distributor, and sure enough, Itanium blows the P4's doors off.

    It is telling though that nVidia's entry into the motherboard domain drastically increased system memory bandwidth, something they've needed to focus on in the "machine in an AGP card" videocards they've made.

    It would be interesting to compare the performance of an FX GPU running a shader, and an Itanium on a crossbar memory controller running the same. I bet it would be comparable, and in some cases the FX GPU would lose.

  87. Makes for an interesting screensaver by theonetruekeebler · · Score: 1

    It would be fun to see the SETI@Home screensaver doing its own math. A GPU is basically a DSP, right?

    --
    This is not my sandwich.
  88. on-board DRM by theonetruekeebler · · Score: 1

    I'm willing to bet that before long we will be seeing video drivers that essentially upload a frame-decrypter to the video card, which will unscramble and display streaming video there. Your computer itself will never know what it's displaying.

    --
    This is not my sandwich.
  89. Re:Jihad! by Anonymous Coward · · Score: 0

    if you don't shut up, we'll make a website to promote terrorism against you!

  90. the base hardware is too recent... by The+Lynxpro · · Score: 0

    Setting the base of GPUs at the Nvidia NV30 level is excluding way too many mainstream videocards that aren't currently being used. If you are a gamer, ask yourself how many videocards you have cluttering your home because you are constantly upgrading to the next best card that hit the market. I myself could spare a couple of Voodoo1 cards, two Voodoo3 cards, a TNT2, and a couple of GeForce2's.

    The base of this project should be something like the 3dfx Voodoo1 or Nvidia's Riva128. While you can no longer count on any updated drivers for WindowsXP for these models, they surely have suitable Linux drivers. You could mount multiple PCI versions into a single PC (obviously, it would probably be best if they were all the same cards). Now that's the way to get some extra performance for distributed computing projects...

    Disclaimer: Yes, anything below a Voodoo3 (on the 3dfx front) would have issues with OpenGL because 3dfx used a mini-GL driver since at that time they still favored their own GLide format over both OpenGL and DirectX.

    --
    "Right now, somewhere in this world, Scott Baio is plowing a woman he doesn't love," - Peter Griffin, *Family Guy*
  91. Fortran 90/95 for GPU anyone? by DoctorRad · · Score: 1
    For scientific applications, it would be good to see a Fortran 90/95 compiler which used a GPU to good advantage. Since matrix operations are part of the language spec, this would appear to be an obvious direction to take, especially given the amount of existing code out there.

    Matt...

  92. You're swapping by tepples · · Score: 1

    but alas, all I hear is grinding.

    If you can hear rapid clicking noises whenever your computer is doing something, then your computer is swapping data to and from the hard drive. In that case, you either need more RAM or you shouldn't try to run so many programs at once.

  93. A small contradiction by drpatt · · Score: 1

    ...a compiler and runtime system that provides an easy, C-like programming environment...

    Well, which is it - easy or C-like?

  94. Back in the days of "array processors" by Latent+Heat · · Score: 1
    Back in the days of PDP-11s used as lab data acquisition computers, there were these array processor boards used for FFTs and related calculations. When PCs came along, there was a generation of "array processor" board products based on DSP chips. Some were floating point, some fixed point, some were only accessible through a numeric subroutine library, others were programmable in hex.

    My favorite was the IBM/Tecmar M-ACPA sound card. It sold for $495 and ran at a 10 MHz clock (the TI TMS320C25 DSP). It was pipelined, and if you ordered instructions to take advantage of the pipeline, it could spit out a multiply-accumulate once for every clock cycle. It also came with no software to speak of, even to record and play sound, and the A/D was clocked at a fixed 44.1 kHz and the D/A at a fixed 88.2 kHz sampling rate. Unlike the Motorola DSP in competing products, the C25 could be halted, and the PC could read and write DSP memory through I/O ports whether the thing was halted or running. So you could just dump hex programs into the thing without having to bring up a bootstrap loader on the DSP, you could start, halt, and inspect for debugging purposes, and it could generate an "A/D buffer full" interrupt for continuous A/D operation.

    IBM supplied sample code where they used some simple macros in the PC macro assembler to generate op codes for the C25 DSP and wrote a simple PC-side loader in 8086 assembler. This was fairly easy to do because the C25 instructions are all fixed 16-bit words, and while the hacked-up assembler was all fixed address with no relocation, it all worked pretty reliably because you had complete control over all aspects of the software and hardware -- TI had some software tools for their TMS320CXX eval boards that were a POS because they were buggy as heck, but this setup was the ultimate in Keep it Simple and Stupid.

    Digging into Crochiere and Rabiner on something called polyphase multirate FIR filter design, I had a bank of filters that upsampled and downsampled to support rates of 10, 11, 16, 20, and 22 kHz to match existing digitized speech files (the TI-MIT-NIST-DARPA speech database was sampled at 16, I had stuff sampled at 20). I also had subroutines for the card for real-valued forward and inverse FFT, FIR filter, and second-order-section cascade IIR filter, and the whole wad of software fit in DSP memory at absolute addresses, and I had a little Turbo Pascal interface library to the whole thing, and I was king of the world.
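
    For anyone who hasn't met the term, the multiply-accumulate is the inner step of an FIR filter: each output sample is a running sum of coefficient-times-sample products. The sketch below is a plain direct-form FIR in floating-point Python, with invented names and example values. It is not the fixed-point C25 code and not a polyphase design, only the basic loop that a one-MAC-per-clock DSP is built to run quickly.

```python
def fir_filter(coeffs, samples):
    """Direct-form FIR: each output is a sum of multiply-accumulates."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(coeffs):
            if n - k >= 0:                 # samples before the start count as zero
                acc += h * samples[n - k]  # one multiply-accumulate
        out.append(acc)
    return out


# Two-tap moving average applied to a step from 1.0 to 3.0.
smoothed = fir_filter([0.5, 0.5], [1.0, 1.0, 3.0, 3.0])
print(smoothed)
```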

    Guess what. The M-ACPA card kind of went by the wayside by the mid 1990's when Windows 95 came along: there were much cheaper Windows sound cards and IBM never could get a Windows driver for the darned thing that didn't leave gaps (sound clicks) with every interrupt cycle. And the Pentium came along which just blew the thing away speed wise, and the DSP library could be written in C or other compiler language, portable to any other computer, and using floating point so one could stop worrying about scaling and overflow and fixed-point roundoff.

    For all my work on the DSP library, I think I got at most about 3 years useful life out of it, and besides, I had the versioning problem of deciding who had one of these M-ACPA's installed and who didn't and reverting to a non-ACPA library for those who didn't (remember when the 8087 was optional and the pain that caused?). After that experience, I don't want to touch another array processor/DSP/GPU whatever: I am going to program whatever CPU is available in whatever compiler is available, and I am not going to mess in assembler with any strange instruction set enhancement promoted by disco dancers in clean room suits. I am just going to sit back and wait for Moore's law and wait for the CPU and compiler to catch up.