Boost UltraSPARC T1 Floating Point w/ a Graphics Card?
alxtoth asks: "All over the web, Sun's UltraSPARC T1 is described as 'not fit for floating point calculations'. Somebody has benchmarked it for HPC applications, and got results that weren't that bad. What if one of the threads could do the floating point in the GPU, as suggested here? Even if the factory setup does not expect a video card, could you insert a low profile PCI-E video card, boot Ubuntu and expect decent performance?"
Ever heard of CAD?
The T2 is supposed to have an FPU for each core, so it would be a simpler solution than trying to use a graphics card. The T2 is also supposed to have double the number of threads per core and even more memory bandwidth.
[...]will probably fall foul of Debian's firmware loading policy
No, it won't. The firmware won't be shipped with Debian; it would be run directly from the ROM on the very card that is to be initialized. Debian has shipped XFree86 for a long time, and it supports a similar method to initialize secondary graphics cards that require their BIOS to set them up to function properly (probably only works on x86 CPUs).
Those DSPs you mention aren't CPUs, and they're not available on PCI cards - plus the programmability you mention.
The way to think about the use of GPGPU in a host with its own (GP) CPU is client/server computing. I put together such a system in 1990, a 12MHz 80286, with 4 12.5MFLOPS DSPs (AT&T DSP32c) and an FPGA "scheduler" on the ISA card. The 286 ran a loop sending data and commands to a memory mapped page on the card's SRAM, and copying the page when a status register was set. I had realtime 24bit VGA renderings of megapolygons at 30FPS, all processed on the DSPs. The systems have all scaled up, but the price improvement per FLOPS of the GPUs over the CPU is even better now than then.
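To make the host/card handshake concrete, here is a minimal Python simulation of that kind of loop. It is only a sketch under assumptions: the page layout, the `SharedPage` class, the status values, and the `scale2`/`square` commands are all hypothetical and not taken from the original board, and the "DSP" is just a function called from the polling loop.

```python
from dataclasses import dataclass, field

STATUS_IDLE = 0  # hypothetical status register values
STATUS_DONE = 1

@dataclass
class SharedPage:
    """Stand-in for the memory-mapped SRAM page on the card (hypothetical layout)."""
    command: str = ""
    data: list = field(default_factory=list)
    result: list = field(default_factory=list)
    status: int = STATUS_IDLE

def coprocessor_step(page):
    """Simulated DSP side: run the pending command, then set the status flag."""
    if not page.command:
        return  # nothing queued
    if page.command == "scale2":
        page.result = [2.0 * x for x in page.data]
    elif page.command == "square":
        page.result = [x * x for x in page.data]
    else:
        page.result = []  # unknown command: return an empty result
    page.command = ""
    page.status = STATUS_DONE

def host_loop(page, jobs):
    """Host side: write a job to the page, poll the status flag, copy results out."""
    outputs = []
    for command, data in jobs:
        page.data = list(data)
        page.status = STATUS_IDLE
        page.command = command
        while page.status != STATUS_DONE:
            coprocessor_step(page)  # on real hardware the card runs on its own
        outputs.append(list(page.result))  # copy the page before reusing it
    return outputs

results = host_loop(SharedPage(), [("scale2", [1.0, 2.0]), ("square", [3.0, 4.0])])
print(results)
```

The point of the shape is that the host only ever does three cheap things: write, poll, copy. All the arithmetic happens on the other side of the page.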
As you say, the key is keeping the compute servers full, which amortizes the signalling overhead best, and keeping the signalling across the bus high-level enough that bandwidth doesn't become the bottleneck. There are lots of demanding apps now which could use that architecture. Audio compression is my favorite - I'm waiting to stuff a $1000 P4 with 6 $400 dual GPUs, and beat the performance of any <$10K server, scalable down to $1500. That's the kind of host that could really transform telephony.
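The amortization argument can be put as back-of-the-envelope arithmetic. The sketch below assumes a simple model where every transfer to the card pays one fixed overhead and each item costs a fixed compute time; the function name and the numbers are made up for illustration, not measurements.

```python
def efficiency(batch_items, per_item_compute_s, per_transfer_overhead_s):
    """Fraction of wall time spent computing when one transfer carries a whole batch."""
    compute = batch_items * per_item_compute_s
    return compute / (compute + per_transfer_overhead_s)

# Hypothetical figures: 1 microsecond of work per item, 100 microseconds per bus round trip.
PER_ITEM_S = 1e-6
OVERHEAD_S = 1e-4

for batch in (1, 100, 10_000):
    print(batch, round(efficiency(batch, PER_ITEM_S, OVERHEAD_S), 4))
```

Under this model a batch of one spends almost all its time on signalling, while a large batch pushes the overhead toward noise, which is why the commands crossing the bus should be as high-level as possible.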
--
make install -not war