AMD Demonstrates "Teraflop In a Box"
UncleFluffy writes "AMD gave a sneak preview of their upcoming R600 GPU. The demo system was a single PC with two R600 cards running streaming computing tasks at just over 1 Teraflop. Though a prototype, this beats Intel to ubiquitous Teraflop machines by approximately 5 years." Ars has an article exploring why it's hard to program such GPUs for anything other than graphics applications.
It shouldn't be a TERAble FLOP at the stores anyway. Nice performance...
OK, yes, bad pun, bad spelling, you can "-1 get a real sense of humor" me now.
Connection too slow for X forwarding? Try "ssh -CX user@host"
Even if Nvidia's CUDA is as hard as the Ars Technica article suggests, I still hope AMD either makes their chips binary compatible, or makes a compiler that works for CUDA code.
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Look up 'ubiquitous' before you whine about how far behind Intel might seem to be.
Though having one demonstration will help spur the demand, and the demand will spur production, I still think it'll be five years before everybody's grandmother will have a Tf lying around on their checkbook-balancing credenza, and every PHB will have one under their desk warming their feet during long conference calls.
Oh no.
I mean, the PS3 does 2 Teraflops! OMG, they're like 20 years ahead of Intel, who are so RUBBISH.
And what would be the theoretical floppage of, say, an Intel Core 2 Extreme with 2 x nVidia GTXs in a dual SLI arrangement using CUDA? I'm willing to bet it would be somewhat higher than this setup.
Step 1: Put your chip in the box.
How much is that in BogoMIPS?
I might be (read: am mostly) retarded but I never thought of using a graphics processor for anything else, but with the super cards around the corner it makes sense that some normal processing jobs could be farmed out to the GPU when it's not being occupied with graphics duties. Does anyone know where I can find some extra info on this, or to what extent this is being implemented? My curiosity is piqued!
imagine a Beow....ah, screw it.
Beowulf cluster... I think that's all that needs to be said
It might be hard, but then again, it might be worthwhile. For instance (I'm a ham radio operator) I ran into a sampling shortwave radio receiver the other day. Thing samples from the antenna at 60+ MHz, thereby producing a stream of 14-bit data that can resolve everything happening below 30 MHz, or in other words, the entire shortwave spectrum and longwave and so on basically down to DC.
Now, a radio like this requires that the signal be processed; first you separate it from the rest, then you demodulate it, then you apply things like notch filters (or you can do that prior to demodulation, that's very nice), you build an automatic gain control to handle amplitude swings, provide a way to vary the bandwidth and move the filter skirts (low and high) independently... you might like to produce a "panadapter" display of the spectrum around the signal of interest, where there is a graph that lays out signal strengths for a defined distance up and down spectrum... you might want to demodulate more than one signal at once (say, a FAX transmission into a map on the one hand, and a voice transmission of the weather on the other.) And so on - I could really go on for a while.
The thing is, as with all signal processing, the more you try to do with a real-time signal, the more resources you have to dedicate. And this isn't audio, or at least, not at the early stages; a 60+ MHz stream of data requires quite a bit more in terms of how fast you have to do things to it than does an audio stream at, say, 44 kHz.
But signal processing typically uses fairly simple math; a lot of it, but you can do a lot without having to resort to real craziness. A teraflop of processing that isn't even happening on the CPU is pretty attractive. You'd have to get the data to it, and I'm thinking that would be pretty resource intensive, but between the main CPU and the GPU you should have enough "ooomph" left over to make a beautiful and functional radio interface.
There is an interesting set of tasks in the signal processing space; forming an image of what is going on under water from sound (not sonar... I'm talking about real imaging) requires lots and lots of signal processing. Be a kick to have it in a relatively standard box, with easily replaceable components. Maybe you could do the same thing above-ground; after all, it's still sound and there are still reflections that can tell you a lot (just observe a bat.)
The cool thing about signal processing is that a lot of it is like graphics, in a way; generally, you set up some horrible sequence of things to do to your data, and then thrash each sample just like you did the last one.
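To make the "same sequence of things done to every sample" idea concrete, here is a minimal sketch of the first two receiver steps (shift the wanted signal down to 0 Hz, then low-pass it). It is plain Python rather than GPU code, and the function names, the toy sample rate and the 4-tap moving-average filter are all assumptions chosen for illustration, not anything taken from a real receiver.

```python
import cmath
import math


def mix_to_baseband(samples, sample_rate_hz, tune_hz):
    """Shift the signal at tune_hz down to 0 Hz, one sample at a time."""
    return [
        s * cmath.exp(-2j * math.pi * tune_hz * n / sample_rate_hz)
        for n, s in enumerate(samples)
    ]


def fir_filter(samples, taps):
    """FIR filter: every output is the same multiply-accumulate recipe
    applied to the most recent len(taps) inputs."""
    out = []
    for n in range(len(taps) - 1, len(samples)):
        acc = 0j
        for k, tap in enumerate(taps):
            acc += tap * samples[n - k]
        out.append(acc)
    return out


def tune(samples, sample_rate_hz, tune_hz, taps):
    """Hypothetical front end: mix, then low-pass."""
    return fir_filter(mix_to_baseband(samples, sample_rate_hz, tune_hz), taps)


if __name__ == "__main__":
    rate = 64.0            # toy sample rate, not 60+ MHz
    taps = [0.25] * 4      # crude moving-average low-pass (assumption)
    wanted = [math.cos(2 * math.pi * 8 * n / rate) for n in range(256)]
    print(abs(tune(wanted, rate, 8.0, taps)[0]))
```

Each output of `fir_filter` depends only on a fixed window of inputs, so the outputs can be computed independently of each other. A real receiver would use a properly designed filter and decimate afterwards; this sketch skips both.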
Anyway, it just struck me that no matter how hard it is to program, it could certainly be useful for some of these really resource intensive tasks.
I've fallen off your lawn, and I can't get up.
Shouldn't we be talking about nVidia, since this is a GPU?
120 characters for a sig? That's bloody useless.
Don't mention the wattage...
And the second rule of teraflop club...
Don't mention the wattage...
Back here in the real world where we PAY FOR ELECTRICITY, we're waiting for some nice FLOPS/Watt, keep trying guys.
And they announced this some time ago didn't they?
- Adam L. Beberg - The Cosm Project - http://www.mithral.com/
...generic purposes, it is that they're (GPUs) suited better for certain types of operations. Image processing, as an example, is very well suited to working on a GPU because the GPU excels at addressing and operating on elements of arrays (textures basically.) I've used it as a proof of concept at work for processing large numbers of video feeds simultaneously for things like photometric normalization, image stabilization, et cetera, and the things are awesome. They work well in this scenario because the problem I'm trying to solve fits the caveats of using the GPU well. Slow upload of data, miraculously fast action upon that data, slow download of the data. Now, slow is relative and getting more and more relative as new chipsets are released.
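The "slow upload, fast compute, slow download" trade-off can be put into a back-of-the-envelope model. This is only a sketch: the function names are made up, and the bus and processor rates in the example are hypothetical round numbers, not measurements of any real card.

```python
def offload_time(n_bytes, flops_needed, bus_bytes_per_s, gpu_flops_per_s):
    """Seconds to ship the data to the GPU, process it, and ship it back."""
    transfer = 2 * n_bytes / bus_bytes_per_s   # upload + download
    compute = flops_needed / gpu_flops_per_s
    return transfer + compute


def cpu_time(flops_needed, cpu_flops_per_s):
    """Seconds to do the same work without leaving the CPU."""
    return flops_needed / cpu_flops_per_s


def worth_offloading(n_bytes, flops_per_byte,
                     bus_bytes_per_s, gpu_flops_per_s, cpu_flops_per_s):
    """True when the GPU round trip beats staying on the CPU."""
    flops = n_bytes * flops_per_byte
    return (offload_time(n_bytes, flops, bus_bytes_per_s, gpu_flops_per_s)
            < cpu_time(flops, cpu_flops_per_s))


if __name__ == "__main__":
    # Hypothetical rates: 1 GB/s bus, 100 GFLOPS GPU, 10 GFLOPS CPU.
    rates = dict(bus_bytes_per_s=1e9, gpu_flops_per_s=1e11, cpu_flops_per_s=1e10)
    print(worth_offloading(1e6, 10, **rates))    # little work per byte
    print(worth_offloading(1e6, 1000, **rates))  # lots of work per byte
```

Under these assumed rates the transfer cost dominates when there is little arithmetic per byte, and the GPU only wins once each byte gets a lot of work done on it.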
The actual framework for doing this is relatively simple, although it certainly did help that I've a background in OpenGL and DirectXGraphics (so I've done shader work before); however, again, progress is removing those caveats as well. Generic GPU programming toolsets are imminent; the only problem is that ATI has no interest in their toolsets working with nVidia, and nVidia has even less interest in their toolset(s) running on ATI hardware. Something we'll just have to learn to deal with.
BTW, DirectX10 will make this a little easier as well with changes to how you have to pipeline data in order to operate on it in a particular fashion.
That should be Teraflops. Flops is Floating-point operations per second, so always has an s on the end even if singular.
So how is an image being formed under water using sound without using sonar? Also, I bet we could do the same thing above ground and maybe above the water we could try to image using radio waves. Since it is using radio waves, let's call it a radar.
So the preview could be boiled down to: Card still in development will be faster than cards currently available for sale.
It also included some pictures of the cooling solution that will completely dominate the card. Not that a picture of a microchip with "R600" written on it would be a lot better I guess. Although the pictures are fuzzy and hard to see, it looks like it might require two separate molex connections just like the 8800s.
I read the internet for the articles.
I thought the dual CPU G5 machines were rated at 1 teraflop. Certainly PowerPC AltiVec processors are super floating-point engines (but I don't know exactly how they rank at flops/MHz....)
But then maybe the issue depends on the notion of what is "ubiquitous" and Macs don't qualify. I dunno, but I'm sure someone on /. will correct me :-)
dave
How long before they put it on the HT bus using an HTX slot?
Make her open the box...
which is fully connected to the Internet so that I can put my toast down or pop it up remotely.
Wait...from some of the other comments about electricity usage, I might be able to do away with the heating coils and use the circuits themselves to toast. That would really be an environmental plus. Wonder how it would affect the taste of the bread?
Help end the use of Sigs. Tomorrow
"Anything other" is "general purpose", which they cover at GPGPU.org. But the general community of global developers hasn't gotten hooked on the cheap performance yet. Maybe if someone got an MP3 encoder working on one of these hot new chips, the more general purpose programmers would be delivering supercomputing to the desktop on these chips.
--
make install -not war
In other news, Martha Stewart explains why screwdrivers don't make good hammers.
Wanna fight ? Bend over, stick your head up your ass, and fight for air.
Specialized hardware units rack up impressive benchmark numbers on specific tasks relative to general-purpose CPUs. News at 11.
There's a real difference between getting something to happen on a quasi-DSP like a GPU and on a real, general purpose processor like a CPU. If GPUs were full out CPU replacements, well then we wouldn't have CPUs any more, would we? The problem is that they are very very fast, but only at some things. Now that's fine, because that's what they were designed for. They are made to push pixels really fast and if they can do anything else, well bonus. However it does mean that they aren't a general purpose computing replacement.
Also, the more specialized you get your DSP, the easier it is to get speed out of it. I'm sure it wouldn't be hard to design (I'm sure they already exist) a very narrow purpose DSP that does over 1 trillion floating point ops per second. However that's real different than having a CPU that will do the same, and do it across many kinds of ops.
So as nifty as shit like this might be, it is real disingenuous to pretend that they've "beat" Intel. Intel isn't talking about a graphics card, they are talking about their CPUs. By the numbers my GPU has always been faster than my CPU, as well it should. There'd be no point in paying for specialized hardware if I had general purpose hardware that was faster.
Ars has an article exploring why it's hard to program such GPUs for anything other than graphics applications.
No, Ars has an article blithering that it's hard to program such GPUs for anything other than graphics applications. It doesn't say anything constructive about why.
Here's a reasonably readable tutorial on doing number-crunching in a GPU. The basic concepts are that "Arrays = textures", "Kernels = shaders", and "Computing = drawing". Yes, you do number-crunching by building "textures" and running shaders on them. If your problem can be expressed as parallel multiply-accumulate operations, which covers much classic supercomputer work, there's a good chance it can be done fast on a GPU. There's a broad class of problems that work well on a GPU, but they're generally limited to problems where the outputs from a step have little or no dependency on each other, allowing full parallelism of the computations of a single step. If your problem doesn't map well to that model, don't expect much.
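As a rough illustration of that model (not code from the tutorial), here is a CPU-side sketch in Python. Plain lists stand in for textures, a small function stands in for the shader, and "drawing" is just calling that function once per output element. The names `run_kernel` and `saxpy_kernel` are invented for this example.

```python
def run_kernel(kernel, *arrays):
    """'Draw' one pass: call the kernel once for every output index.

    No call sees another call's output, so a GPU could run all of
    them in parallel. Here they simply run in a loop.
    """
    size = len(arrays[0])
    if any(len(a) != size for a in arrays):
        raise ValueError("all input arrays must have the same length")
    return [kernel(i, *arrays) for i in range(size)]


def saxpy_kernel(alpha):
    """Build a multiply-accumulate 'shader': out[i] = alpha * x[i] + y[i]."""
    def kernel(i, x, y):
        return alpha * x[i] + y[i]
    return kernel


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    y = [10.0, 20.0, 30.0]
    print(run_kernel(saxpy_kernel(2.0), x, y))
```

Something like a running total, where each output needs the previous output, does not fit `run_kernel` as written. That is the kind of dependency between outputs that maps poorly to this model.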
Step 2: Don't leave your box in Boston.
Imagine turning the flop on a million million hands of Texas Hold'em.
To all the fellas out there with geek friends to impress
It's easy to do, just follow these steps:
One: Cut a hole in a box
Two: Stick your chip in that box
Three: Make her open the box
And that's the way you do it
It's my chip in a box
Could this be the start of some really good opensource drivers for ATI cards?
Just how much of X and OpenGL could they offload on this card?
What about Theora, Ogg, Speex, or DivX encoding and decoding?
I know it is a radical idea but since they are optimized for graphics and graphics like operations why not use them for that?
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
1. Cut a hole in a box
2. Put your chips in that box
3. Make AMD open the box
That's the way you do it
It's a teraflop in a box!
You cut a hole in the box.
It is 1 Teraflops since it stands for Tera floating point operations per second. The ending "s" is not plural.
The Playstation 3 is reported to harness 2 TFLOPS. But "only" 204GFLOPS run on the Cell CPU, 10%. The other 1.8TFLOPS runs on the nVidia G70 GPU. But the G70 runs shaders, very limited application to anything but actually rendering graphics.
The Cell itself is notoriously hard to code for. If just some extra effort can target the nVidia, that's TWO TeraFLOPS in a $500 box. A huge leap past both AMD and Intel.
--
make install -not war
You don't need greater than 32-bit precision for any of the MAC ops. Usually that kind of limitation can be overcome by rethinking the algorithm, and doing some accumulation or error analysis outside of the GPU.
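One way to read "doing some accumulation outside of the GPU" is sketched below, under stated assumptions: the 32-bit arithmetic is emulated on the CPU with Python's `struct` module, the block size of 256 is arbitrary, and the helper names are made up. Each block's multiply-accumulates stay in 32 bits, and only the per-block sums are combined in 64-bit.

```python
import struct


def f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_f32(a, b):
    """Dot product with every product and running sum rounded to 32 bits."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(x * y))
    return acc


def dot_blocked(a, b, block=256):
    """32-bit MACs inside each block (the 'GPU' part); the short list of
    block sums is then added up in 64-bit (the 'outside' part)."""
    total = 0.0
    for start in range(0, len(a), block):
        total += dot_f32(a[start:start + block], b[start:start + block])
    return total


if __name__ == "__main__":
    a = [f32(0.1)] * 100_000
    b = [1.0] * 100_000
    exact = f32(0.1) * 100_000
    print("all 32-bit error:", abs(dot_f32(a, b) - exact))
    print("blocked error:   ", abs(dot_blocked(a, b) - exact))
```

The long all-32-bit sum drifts because the running total grows large compared with each new term. Keeping the blocks short keeps that gap small, which is the sort of algorithm rethinking the comment is pointing at.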
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
The problem is they have multiple new platforms and they do not release docs for the bare metal. Compare that behaviour to common architectures like x86, 68K, PPC, ARM, MIPS or the big bunch of DSPs and microcontrollers. You can get books or PDFs with all the instructions, memory ranges, timings... you get all you need to really program them, build compilers, or even design new systems around a chip, and from experience you know they do not change such things at will; some systems are decades old. Thus, investing time is better done in systems that have proven stable and are clearly well documented. They have to choose whether they want to be a market standard or keep their "precious" IP. Intel must have lost all of it by now, publishing docs like how to use the SSE instructions, yeah.
That's not right; AMD is cheating. AMD is also using the video cards. AMD did not beat Intel. If they are going to go against Intel with processing stuff they should do it via CPUs only. If Intel did the same thing with 2 nVidia cards in SLI I bet they could get the same results.
1. Cut a hole in the box. 2. Put your flop in that box. 3. Make her open the box. And that's the way you do it!
GPGPU is hard because we're still in the very early days of this particular revolution. As I think about it, and from what we know of AMD's plans in particular, I think this is kind of like the evolution of FPU.
See, in the early days the FPU was a separate chip (anyone remember buying an 80387 to plug into their mobo?). Writing code to use the FPU was also a complete pain in the ass, because you had to use assembly, with all the memory management and interrupt handling headaches inherent. FPUs from different vendors weren't guaranteed to have completely compatible instruction sets. Because it was such a pain in the ass, only highly special purpose applications made use of FPU code. (And, it's not that computer scientists hadn't thought up appropriate abstractions to make writing floating point easy. Compilers just weren't spitting out FPU code).
Then, things began to improve. The FPU was brought on die, but as an optional component (think 486SX vs 486DX). Languages evolved to support FPUs, hiding all the difficulty under suitable abstractions so programmers could write code that just worked. More applications began to make use of floating point capabilities, but very few required an FPU to work.
Finally, the FPU was brought on die as a bog standard part of the CPU. At that point, FPU capabilities could be taken for granted and an explosion of applications requiring an FPU to achieve decent performance ensued (see, for instance, most games). And writing FPU code is now no longer any more difficult than declaring type float. The compiler handles all the tricky parts.
I think GPGPU will follow a similar trajectory. Right now, we're in phase one. Using a GPU for general purpose computation is such an incredible pain that only the most specialized applications are going to use GPGPU capabilities. High level languages haven't really evolved to take advantage of these capabilities yet. And yes, it's not as though computer scientists don't have appropriate abstractions that would make coding for GPGPU vastly easier. Eventually, GPGPU will become an optional part of the CPU. Eventually high level languages (in addition to the C family, perhaps FORTRAN or Matlab or other languages used in scientific computing) will be extended to use GPGPU capabilities. Standards will emerge, or where hardware manufacturers fail to standardize, high level abstraction will sweep the details under the rug. When this happens, many more applications will begin to take advantage of GPGPU capabilities. Even further down the road, GPGPU capabilities will become bog standard, at which point we will see an explosion of applications that need these capabilities for decent performance.
Granted, the curve for GPGPU is steeper because this isn't just a matter of different instructions, but a change in memory management as well. But I think this kind of transition can and will eventually happen.
ehh?? I posted this in this article's discussion.
I have no idea how it ended up here. I didn't have this story open yet when posting this. Ohh well. Shit happens..lol
So I take it that AMD will be ready for Vista's successor?
Well, there's spam egg sausage and spam, that's not got much spam in it.
Cue the "Dick In A Box" jokes...
I do not think it means what you think it means.
There are no trails. There are no trees out here.
You lose at life.
Google:
1-gigaflop (957 results)
1-gigaflops (11,200 results)
If you say "1 gigaflop" you're as much a moron as someone saying "1 Gbp" instead of "1 Gbps". And yes, folks who write "lazer" instead of "laser" are morons too because acronyms are not supposed to be written phonetically, they're written based on what their letters stand for!
"Simple: they aren't available. PC's don't typically come with DSPs. "
Of course they don't.
Right? Right? Right?
There is one aspect of this that I am a bit worried about. In the graphics card market, you have artificial price discrimination based upon application. Namely, if all you want to do is play games you buy a GeForce card, but if you want to do pro 3D work you buy a Quadro. The only difference between the two is that the Quadro has different drivers that disable some features while enabling others. That, and, of course, a *very* steep price difference. If and when GPUs are used for even more applications, might we see even more price discrimination?
It is argued that the video card companies are well within their rights to sell the same product at different prices because the drivers are different, but imagine if that happened to CPUs. What if the same piece of silicon was sold at different prices depending on whether you were a professional writer or someone who just surfs the web? Imagine if Intel and AMD did not give you direct access to the hardware and instead put an extra layer between the programmer and the chip so they could sell the same chip to different people for different prices. What would something like that do to open-source?
In this merger of the CPU and GPU either the openness of the CPU will extend to the GPU, or the closed-ness of the GPU will extend to the CPU.
unless you use a decent compiler that actually works with almost standard C++:
http://www.rapidmind.net/technology.php
And boy, was she disappointed after being told she'd be getting a hard disk!
Do it yourself, because no one else will do it yourself. [beta blockade 10-17 Feb]
Maybe Intel could just buy a graphics company that already has the technology and demo something in 4 months. Like AMD did, again (recall that their last two biggest wins came from acquiring NexGen's intellectual property). AMD has a habit of making money by plagiarizing the work of other smaller companies they acquire. Whereas Intel apparently buys smaller companies and loses money :) :) (er, whatever they bought and sold to Marvell for a huge loss). --ducks for cover--
https://www.accountkiller.com/removal-requested
AMD/ATi have a GPU, Intel will make a CPU.
Carbon based humanoid in training.
"PCI-Express offers 64 "lanes" pumping up to 500MBps each (since January, 250MBps in actual shipping HW). In a switched hub, for 256Gbps total. The Cell's EIB is probably its most interesting feature: 200Gbps token ring that transparently connects offchip. So the new IBM Cells, with 4 cores (Power970 + 8-SPEs each) on one die (or SoC) has 32x 25.6GFLOPS + 4x 970 all moving at 200Gbps. Or just a single Cell at 204GFLOPS feeds 200Gbps to a PCIe stuffed with 20x 10Gig ethernet cards (10 double-10GigE PCIe cards)."
Deja Vu. Welcome to the Transputer/Microway business model.
I laugh every day at the tags people assign to articles, but today I laughed the hardest with the tag "dickinabox" ...
http://phlite.net Lay out on the beach in Rocky Point, Mexico : http://www.granizo.com
this teraFLOP are on 64-bit doubles. Single precision teraFLOPs are close to useless for anything that requires a teraFLOP.
This sounds truly fascinating. How far along is anybody in building an actual working system along these lines? Or is this all still at the drawing board phase, awaiting the required horsepower to really take off? Where should I look to find out more (for a complete layman, at that)?
Cheers,
"What in the name of Fats Waller is that?"
"A four-foot prune."