AMD Demonstrates "Teraflop In a Box"

Posted by kdawson on Thursday March 1, 2007 @04:10AM from the speedy-silicon dept.

UncleFluffy writes "AMD gave a sneak preview of their upcoming R600 GPU. The demo system was a single PC with two R600 cards running streaming computing tasks at just over 1 Teraflop. Though a prototype, this beats Intel to ubiquitous Teraflop machines by approximately 5 years." Ars has an article exploring why it's hard to program such GPUs for anything other than graphics applications.

16 of 182 comments

It isn't that they are hard to use for more... by Assmasher · 2007-03-01 04:35 · Score: 3, Informative

...generic purposes, it is that they're (GPUs) suited better for certain types of operations. Image processing, as an example, is very well suited to working on a GPU because the GPU excels at addressing and operating on elements of arrays (textures basically.) I've used it as a proof of concept at work for processing large numbers of video feeds simultaneously for things like photometric normalization, image stabilization, et cetera, and the things are awesome. They work well in this scenario because the problem I'm trying to solve fits the caveats of using the GPU well. Slow upload of data, miraculously fast action upon that data, slow download of the data. Now, slow is relative and getting more and more relative as new chipsets are released.
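A rough sketch of the trade-off described above, in plain Python: the GPU round trip only pays off when the time saved on the computation outweighs the slow upload and download. The function name and all timings here are hypothetical, chosen only to illustrate the point; they are not measurements from the poster's system.

```python
def gpu_is_worthwhile(cpu_seconds, upload_seconds, gpu_seconds, download_seconds):
    """Return True when the full GPU round trip beats doing the work on the CPU."""
    gpu_total = upload_seconds + gpu_seconds + download_seconds
    return gpu_total < cpu_seconds


# Hypothetical timings (assumptions, not benchmarks).
# A heavy job: lots of arithmetic per byte moved, so the transfers are amortized.
heavy_job = gpu_is_worthwhile(cpu_seconds=2.0, upload_seconds=0.2,
                              gpu_seconds=0.05, download_seconds=0.2)

# A light job: the transfers alone cost more than just running it on the CPU.
light_job = gpu_is_worthwhile(cpu_seconds=0.1, upload_seconds=0.2,
                              gpu_seconds=0.01, download_seconds=0.2)

print(heavy_job, light_job)
```

With these made-up numbers the heavy job favors the GPU and the light one does not, which is the sense in which a problem has to "fit the caveats".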

The actual framework for doing this is relatively simple although it certainly did help that I've a background in OpenGL and DirectXGraphics (so I've done shader work before); however, again, progress is removing those caveats as well. Generic GPU programming toolsets are imminent, the only problem being that ATI has no interest in their toolsets working with nVidia hardware and nVidia has even less interest in their toolset(s) running on ATI hardware. Something we'll just have to learn to deal with.

BTW, DirectX10 will make this a little easier as well with changes to how you have to pipeline data in order to operate on it in a particular fashion.

Nitpick by 91degrees · 2007-03-01 04:36 · Score: 4, Informative

That should be Teraflops. Flops is Floating-point operations per second, so always has an s on the end even if singular.
Re:Never thought of that by Anonymous Coward · 2007-03-01 04:44 · Score: 3, Informative

Check out this web site: http://www.gpgpu.org/

It is up to date and contains a lot of related information.

WP
General Purpose Programmers by Doc+Ruby · 2007-03-01 04:47 · Score: 3, Informative

it's hard to program such GPUs for anything other than graphics applications.

"Anything other" is "general purpose", which they cover at GPGPU.org. But the general community of global developers hasn't gotten hooked on the cheap performance yet. Maybe if someone got an MP3 encoder working on one of these hot new chips, the more general purpose programmers would be delivering supercomputing to the desktop on these chips.

--
--
make install -not war
No, Ars didn't say why. Here's why. by Animats · 2007-03-01 04:59 · Score: 4, Informative

Ars has an article exploring why it's hard to program such GPUs for anything other than graphics applications.
No, Ars has an article blithering that it's hard to program such GPUs for anything other than graphics applications. It doesn't say anything constructive about why.
Here's a reasonably readable tutorial on doing number-crunching in a GPU. The basic concepts are that "Arrays = textures", "Kernels = shaders", and "Computing = drawing". Yes, you do number-crunching by building "textures" and running shaders on them. If your problem can be expressed as parallel multiply-accumulate operations, which covers much classic supercomputer work, there's a good chance it can be done fast on a GPU. There's a broad class of problems that work well on a GPU, but they're generally limited to problems where the outputs from a step have little or no dependency on each other, allowing full parallelism of the computations of a single step. If your problem doesn't map well to that model, don't expect much.
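To make the "Arrays = textures", "Kernels = shaders", "Computing = drawing" mapping concrete, here is a minimal CPU-only sketch in plain Python. It is a sequential stand-in, not real shader code: `run_kernel` and `multiply_accumulate` are hypothetical names, and the nested lists play the role of textures. The point is that each output element is computed from the inputs alone, never from another output of the same step, so every element could in principle be computed in parallel.

```python
def run_kernel(kernel, width, height, *arrays):
    """'Draw' one pass: evaluate the kernel once per output element (x, y)."""
    return [[kernel(x, y, *arrays) for x in range(width)]
            for y in range(height)]


def multiply_accumulate(x, y, a, b, c):
    """The 'shader': reads only input arrays, returns one output value."""
    return a[y][x] * b[y][x] + c[y][x]


# Three 2x2 input 'textures' (illustrative values).
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]

result = run_kernel(multiply_accumulate, 2, 2, a, b, c)
print(result)  # [[6, 13], [22, 33]]
```

A step whose outputs needed each other (a running sum across a row, say) would not fit this pattern without being restructured into several passes.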
1. Re:No, Ars didn't say why. Here's why. by Chris+Ashton+84 · 2007-03-01 07:05 · Score: 3, Informative
  
  Yes, you used to have to do everything in a graphical environment, but not any more. With nVidia's CUDA you program in C/C++, have a general memory model (you can access texture memory if it's efficient for what you need, but you also have general device memory and several other types of memory to choose from) and run on fully capable stream processors. As far as the programmer is concerned, the gpu is just a stream processor add-in card. You do have to manually transfer to and from device memory, but once you have your data on the gpu you're free to access it however you want (arrays, textures, linear memory, whatever). It's not a difficult system to understand, though tuning your program for performance will be challenging. Check out http://developer.nvidia.com/object/cuda.html for more info.
Re:Never thought of that by theantipop · 2007-03-01 05:00 · Score: 4, Informative

http://folding.stanford.edu/FAQ-ATI.html

It's still in beta AFAIK, but it has been in development for quite some time.
Re:The first rule of teraflop club... by dlapine · 2007-03-01 05:02 · Score: 5, Informative

LOL- you're complaining about wattage for 1 TF when they did it on a pair of friggin' video cards?? That's gotta be what, 500 watts total for whole PC?

We've run several PC clusters and IBM mainframes that didn't have 1TF of capacity. You don't want to know how much power went into them. Yes, our modern blade-based clusters are more condensed, but they're still power hogs for dual and quad core systems.
Blue Gene is considered to be a power efficient cluster and the fastest, but it still draws 7kW per rack of 1024 CPUs. At 4.71 TF per rack, even Blue Gene pulls 1.5kW per teraflop.
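The arithmetic behind that figure can be checked in a few lines of Python. The Blue Gene numbers are the ones quoted above; the 500 W value for the demo PC is only the guess made earlier in this comment, not a measured figure, and `kw_per_teraflop` is just an illustrative helper name.

```python
def kw_per_teraflop(power_kw, teraflops):
    """Power drawn per teraflop of capacity, in kilowatts."""
    return power_kw / teraflops


# Blue Gene: 7 kW per rack, 4.71 TF per rack (figures from the comment).
blue_gene = kw_per_teraflop(7.0, 4.71)

# Demo PC: roughly 500 W for about 1 TF (the commenter's guess, an assumption).
demo_pc = kw_per_teraflop(0.5, 1.0)

print(round(blue_gene, 2))            # 1.49
print(round(blue_gene / demo_pc, 1))  # 3.0
```

On those assumptions the two-card PC would use roughly a third of the power per teraflop, though the two systems are not measuring the same kind of FLOPS.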
Yes, it's a pair of video cards, and not a general purpose CPU, but your average user doesn't have the ability to program and use a Blue Gene style solution either. They just might get some real use out of this with a game physics engine that taps into this computing power.
This is cool.

--
The Internet has no garbage collection
SuperCell by Doc+Ruby · 2007-03-01 05:14 · Score: 2, Informative

The Playstation 3 is reported to harness 2 TFLOPS. But "only" 204GFLOPS run on the Cell CPU, 10%. The other 1.8TFLOPS runs on the nVidia G70 GPU. But the G70 runs shaders, very limited application to anything but actually rendering graphics.

The Cell itself is notoriously hard to code for. If just some extra effort can target the nVidia, that's TWO TeraFLOPS in a $500 box. A huge leap past both AMD and Intel.

--
--
make install -not war
1. Re:SuperCell by Doc+Ruby · 2007-03-01 09:12 · Score: 2, Informative
  
  PCI-Express offers 64 "lanes" pumping up to 500MBps each (since January, 250MBps in actual shipping HW). In a switched hub, for 256Gbps total. The Cell's EIB is probably its most interesting feature: 200Gbps token ring that transparently connects offchip. So the new IBM Cells, with 4 cores (Power970 + 8-SPEs each) on one die (or SoC) has 32x 25.6GFLOPS + 4x 970 all moving at 200Gbps. Or just a single Cell at 204GFLOPS feeds 200Gbps to a PCIe stuffed with 20x 10Gig ethernet cards (10 double-10GigE PCIe cards).
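For anyone following the numbers, this Python snippet just reproduces the arithmetic in the paragraph above using the figures as quoted. It does not vouch for the figures themselves, and the variable names are illustrative.

```python
# PCI-Express as quoted: 64 lanes at 500 MBps each, converted to Gbps.
lanes = 64
mbytes_per_sec_per_lane = 500
pcie_gbps = lanes * mbytes_per_sec_per_lane * 8 / 1000

# Twenty 10-gigabit Ethernet ports (10 dual-port cards).
ethernet_gbps = 20 * 10

# Cell: 8 SPEs at 25.6 GFLOPS each; the four-core part has 32 SPEs.
single_cell_gflops = 8 * 25.6
quad_cell_gflops = 32 * 25.6

print(pcie_gbps)                    # 256.0
print(ethernet_gbps)                # 200
print(round(single_cell_gflops, 1)) # 204.8
print(round(quad_cell_gflops, 1))   # 819.2
```

So the 256Gbps and 204GFLOPS figures are at least internally consistent with the per-lane and per-SPE numbers given.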
  
  The single Cell therefore offers 32 (bits) * 8 (SPEs) = 256 FLOPs per (5 picosecond) loop on a full pipeline. The four-way offers 1KFLOPs. There are 1024-SPE Cells in the product line, which are 32KFLOPs; a 4-way would offer 128KFLOPs per loop. Even complex video codecs and mixing need at most 100MIPS, which such a beast would run at 5ns, or 200 million times realtime. 200Gbps is 5 thousand simul Blu-Ray video streams, so we're talking about the beast working at 40 thousand times its max video throughput, while a single Cell works at 8 times video throughput. Audio throughput is much lower FLOPS:bit.
  
  So the Cell data transfer is certainly ample to its high processing speed.
  
  --
  --
  make install -not war
Re:Not misleading at all by ArcherB · 2007-03-01 05:46 · Score: 2, Informative

That must be why nVidia has decided to enter the x86 chip market and Intel has significantly improved their GPU offerings, as well as indicated they may include vector units in future chips, because these companies plan to work together in the future! It's so obvious! I wish I hadn't paid attention these past 6 months, as it's clearly confused me!

Sarcasm suits you well.

While Intel and nVidia may both be independently reinventing the wheel right now, neither seems to be getting very far very fast. Intel's video offerings have been poor at best and no one has seen an nVidia x86 processor. AMD has already demo'd a prototype, which means they are further along with this Fusion than both Intel and nVidia combined. I don't think it will take long for the decision makers at both of these companies to realize that the other has the missing component.

Of course, you could be right. This is pure speculation on my part and I am pretty much talking from my ass. Still, the idea makes perfect sense to me.

--
There is no "I disagree" mod for a reason. Flamebait, Troll, and Overrated are not substitutes.
Re:Step 1 by Anonymous Coward · 2007-03-01 05:47 · Score: 3, Informative

Step 1: Put your chip in the box. Dude. You have to cut a hole in the box first, otherwise you will pinch your junk...err...your chip under the lid.
Re:Compatibility by UncleFluffy · 2007-03-01 05:55 · Score: 4, Informative

Even if Nvidia's CUDA is as hard as the Ars Technica article suggests, I still hope AMD either makes their chips binary compatible, or makes a compiler that works for CUDA code.

From what I saw at the demo, the AMD stuff was running under Brook. As far as I've been able to make out from nVidia's documentation, CUDA is basically a derivative of Brook that has had a few syntax tweaks and some vendor-specific shiny things added to lock you in to nVidia hardware.

--
What would Lemmy do?
Re:OOOoooo by End+Program · 2007-03-01 06:05 · Score: 5, Informative

Don't forget that you need at least a 60MHz (yes, sixty megahertz) ADC and DSP pair to do what was suggested. The cost of building useful supporting electronics around a DSP capable of implementing a direct sampling receiver at 60MHz would be prohibitive in the range $ridiculous-$ludicrous.

Maybe there aren't any DSP available and low cost, if you aren't a hardware designer:

400 MHz DSP $10.00 http://www.analog.com/en/epProd/0,,ADSP-BF532,00.html
14-bit, 65 MSPS ADC $30.00 http://www.analog.com/en/prod/0,,AD6644,00.html
Catching non-designers talking smack ...priceless
Re:The first rule of teraflop club... by Duncan3 · 2007-03-01 06:07 · Score: 2, Informative

Count real, usable FLOPS. GPUs don't win.

But for ~$500, it's what's going to be used.

--
- Adam L. Beberg - The Cosm Project - http://www.mithral.com/
Re:Compatibility by Anonymous Coward · 2007-03-01 13:39 · Score: 1, Informative

CUDA isn't a derivative of Brook; it's a more general programming model. Whereas Brook is a streaming architecture, meaning that each iteration of the kernel writes one value at the end, the threads in CUDA are able to write many values, as well as perform some communication during the processing.

This new capability will enable CUDA to support more general algorithms.
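The difference can be illustrated with a sequential Python stand-in. This is not Brook or CUDA code; the function names are hypothetical and each loop iteration merely plays the role of one kernel invocation or thread. In the streaming style every invocation yields exactly one output in its own slot. In the scatter style each "thread" decides where to write, which is what a histogram needs.

```python
def stream_map(kernel, inputs):
    """Streaming style: one kernel invocation produces exactly one output value."""
    return [kernel(value) for value in inputs]


def scatter_histogram(inputs, bin_count):
    """Scatter style: each 'thread' picks its own output location."""
    bins = [0] * bin_count
    for value in inputs:              # one iteration stands in for one thread
        bins[value % bin_count] += 1  # on real hardware this would need an atomic add
    return bins


squares = stream_map(lambda v: v * v, [1, 2, 3, 4])
histogram = scatter_histogram([0, 1, 1, 3, 5], 4)

print(squares)    # [1, 4, 9, 16]
print(histogram)  # [1, 3, 0, 1]
```

The histogram cannot be written as a single one-output-per-input pass, because several inputs land in the same bin. That is the kind of algorithm the extra write flexibility opens up.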