NVIDIA's $10K Tesla GPU-Based Personal Supercomputer
gupg writes "NVIDIA announced a new category of supercomputers — the Tesla Personal Supercomputer — a 4 TeraFLOPS desktop for under $10,000. This desktop machine has 4 of the Tesla C1060 computing processors. These GPUs have no graphics out and are used only for computing. Each Tesla GPU has 240 cores and delivers about 1 TeraFLOPS single precision and about 80 GigaFLOPS double-precision floating point performance. The CPU + GPU is programmed using C with added keywords using a parallel programming model called CUDA. The CUDA C compiler/development toolchain is free to download. There are tons of applications ported to CUDA including Mathematica, LabView, ANSYS Mechanical, and tons of scientific codes from molecular dynamics, quantum chemistry, and electromagnetics; they're listed on CUDA Zone."
...to see a company established in a certain market branch out so aggressively and boldly into something... well, completely new, really.
Does anyone know if Comsol Multiphysics can be ported to CUDA?
"The agriculture ministry is not in charge of Gundam" - Japanese ministry official.
Port john the ripper/aircrack-ng? Buy a few terabyte drives and start generating hash tables?
At first glance I thought these used actual Tesla coils in the processor, or the devices were at least powered or cooled by some apparatus that used Tesla coils.
Turns out "Tesla" is just the name of the product.
Drat. I demand a refund.
So many scientists use the word "codes" when they mean "program(s)".
Why is this?
According to http://folding.stanford.edu/English/Stats, about 250,000 "normal" users are folding proteins at home.
Personally, I would use it as a render farm, but Blender compatibility could take a while if Nvidia keeps the drivers and specification locked up.
What they don't seem to mention is the amount of memory/core (at 960 cores). I'd guess about 32 MB/core, and 240 cores sharing the same memory bus...
> Each Tesla GPU has 240 cores and delivers about 1 Teraflop single precision and about 80 Gigaflops double-precision floating point performance.
The 80GFlops are per card. So you end up with 320GFlops total.
Not much better, but still better than nothing ;)
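A quick check of the totals quoted above, as a minimal sketch. The per-GPU figures (about 1 TFLOPS single precision, about 80 GFLOPS double precision) and the count of four GPUs come from the summary; the variable names are only illustrative, and these are peak figures rather than measured throughput.

```python
# Peak figures quoted in the summary (approximate, per Tesla C1060 GPU).
GPUS = 4
SP_TFLOPS_PER_GPU = 1.0    # single precision, about 1 TFLOPS
DP_GFLOPS_PER_GPU = 80.0   # double precision, about 80 GFLOPS

# Totals for the four-GPU desktop.
sp_total_tflops = GPUS * SP_TFLOPS_PER_GPU
dp_total_gflops = GPUS * DP_GFLOPS_PER_GPU

# Ratio of single- to double-precision peak on one GPU.
sp_to_dp_ratio = SP_TFLOPS_PER_GPU * 1000.0 / DP_GFLOPS_PER_GPU

print(f"single precision total: {sp_total_tflops} TFLOPS")
print(f"double precision total: {dp_total_gflops} GFLOPS")
print(f"single/double ratio per GPU: {sp_to_dp_ratio}x")
```

On these numbers the double-precision peak sits more than an order of magnitude below the single-precision headline figure.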
So how do you get an Erlang system to run on this?
It's not about how many cores you have but how efficiently they can be used. If your CUDA application is in any way memory intensive, you're going to experience a serious drop in performance. A read from the local cache is 100 times faster than a read from main memory. This cache is only 16 KB. I spend most of my time figuring out how to minimise data transfers. That said, CUDA is probably the only platform that offers a realistic means for a single machine to tackle problems requiring gargantuan computing resources.
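To illustrate why data movement dominates, here is a rough back-of-the-envelope model, assuming the 100x gap quoted above between the fast local memory and main memory. The function name `average_access_cost`, the unit cost of 1 for a local read, and the hit fractions are hypothetical; real GPU memory behaviour depends on access patterns and latency hiding that this sketch ignores.

```python
def average_access_cost(local_fraction, local_cost=1.0, main_cost=100.0):
    """Average cost per read when `local_fraction` of reads hit fast local memory.

    Costs are in arbitrary units; the 100x ratio is the figure quoted in the
    comment above, not a measured value.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_cost + (1.0 - local_fraction) * main_cost


# Show how quickly the average cost rises as reads fall out of local memory.
for fraction in (0.0, 0.5, 0.9, 0.99, 1.0):
    cost = average_access_cost(fraction)
    print(f"{fraction:.0%} local reads -> average cost {cost:.2f}")
```

Under this simple model, even with 90% of reads served locally the average cost is still roughly eleven times that of a purely local workload, which is why keeping the working set inside the 16 KB matters so much.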
prepare the survey weasels.
Shameless exploitation of the good name of one of the greatest inventors of all time. :-)
Ahh yes, the idea of personal supercomputing. Back in '99 I worked for Patmos International. We were at the Linux Expo that year as well, if some of you remember. Our dream was to have a parallel supercomputer in everyone's home. We used mostly Lisp and Daisy for the programming aspect. The idea was wonderful, but eventually came to a screeching halt when nothing was being sold. It was ahead of its time for sure. You can find out a little more about it here. I find the whole idea of symbolic multiprocessing very fascinating though.
*plays the Apogee theme song music*
Neural nets.
This setup sounds ideal as a training bed for FANN programs. I can't recall if there's a port of FANN for CUDA, but I think there might be.
So being naive to the ways of the world is bad karma now? I thought Buddhism stressed being free from the material things of the world.
The problem is how you actually define a supercomputer. I mean, do only machines released in the past month count? Or do you still count the original bad boys like the Cray? After all, when first built, most Crays were multi-million-dollar number-crunching beasts. Does the fact that you can get the same performance in a desktop now mean the Cray no longer counts? The power of computers is still growing at such a pace that a machine that cost millions a decade ago can probably be beaten by a cluster that would cost you less than 25K today, so how exactly would you suggest they define supercomputer?
ACs don't waste your time replying, your posts are never seen by me.
A single Radeon 4870 X2 uses two chips:
2.4 / 2 = 1.2
Each Tesla GPU has 240 cores and delivers about 1 TeraFLOPS single precision...
Each Radeon HD 4870 produces 1.2 TFLOPS, about 0.2 more than one Tesla GPU.
"NVIDIA announced...the Tesla Personal Supercomputer -- a 4 TeraFLOPS desktop...
Two 4870 X2s equal 4.8 TFLOPS, 0.8 more than four Tesla GPUs.
I think the parent's point was that even though a Radeon HD 4870 X2 is made up of two GPUs, they're still connected and recognized as one card. Thus, with "fewer" cards and fewer slots you could achieve more performance. Or you could use the other two vacant slots for yet another two 4870 X2s: four of them in CrossFire would equal 9.6 TFLOPS, 5.6 more than four Tesla GPUs.
Furthermore, I would assume two GPUs that are closely interconnected as a "single" card (4870 X2) would be better than a pair of GPUs connected through a combination of the motherboard (x2 Tesla GPU) and custom interconnects.
I'm not implying that an HD 4870 is a viable alternative to a Tesla GPU, but the "performance" is more than just comparable. As has been mentioned before, the hardware concerned is meant for precision and not speed, otherwise known as performance. Then again, you could compensate for inaccuracy by using all that computing power to make multiple passes rather than making sure your initial calculations were accurate.
Note: Emphasis by me in all quotes provided.
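The arithmetic in the comparison above can be reproduced as follows. This is a sketch using only the peak single-precision figures quoted in the thread (2.4 TFLOPS per 4870 X2 card, about 1 TFLOPS per Tesla GPU); the names are illustrative, and peak numbers say nothing about double precision, memory, or tooling.

```python
# Peak single-precision figures as quoted in the thread (TFLOPS).
RADEON_4870_X2_TFLOPS = 2.4   # one card carrying two GPUs
TESLA_GPU_TFLOPS = 1.0        # one Tesla C1060
TESLA_SYSTEM_GPUS = 4

# Per-GPU figure for the Radeon and the total for the four-GPU Tesla desktop.
per_radeon_gpu = RADEON_4870_X2_TFLOPS / 2
tesla_system = TESLA_SYSTEM_GPUS * TESLA_GPU_TFLOPS

print(f"one Radeon GPU: {per_radeon_gpu:.1f} TFLOPS")
for cards in (1, 2, 4):
    total = cards * RADEON_4870_X2_TFLOPS
    print(f"{cards} x 4870 X2: {total:.1f} TFLOPS "
          f"({total - tesla_system:+.1f} vs. four Tesla GPUs)")
```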
NVIDIA has done a good job of making the processing power accessible to programmers that are not GPU coding experts. In addition, they have made hardware changes to better support the type of scientific computation being done on these devices.
So, while in theory you could put together some Radeons, work with their API and achieve the same thing, NVIDIA has significantly reduced the level of effort to make it happen.
The 1 byte/flop ratio is more about memory bandwidth than capacity. Each Tesla processing unit may only have 1GB of onboard memory, but that doesn't stop you from transferring data in and out of your system's main memory, which could be as large as you need it to be. PCI-Express 2.0 bandwidth would probably still be a bottleneck though. I don't have the exact specifications on hand, but even a previous-generation G80 has about 80GB/s of bandwidth to its on-board memory, and there are 4 of these, so you're looking at a minimum of 320GB/s. Against the 4 TFLOPS peak that works out to about 0.08 bytes per flop for a nominal input dataset of 4GB, and lower than that for larger data sets. Not ideal, but not useless either.
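The bytes-per-flop estimate can be checked with a short calculation. This sketch assumes the figures given in the thread (about 80 GB/s of on-board bandwidth per GPU, four GPUs, about 1 TFLOPS peak each); the helper name `bytes_per_flop` is hypothetical. Dividing the aggregate 320 GB/s by the full 4 TFLOPS gives about 0.08 bytes per flop; a figure near 1/3 appears only when the same bandwidth is divided by a single TFLOPS.

```python
# Figures quoted in the thread (approximate).
GPUS = 4
BANDWIDTH_GB_PER_S_PER_GPU = 80.0   # on-board memory bandwidth, G80-class
PEAK_GFLOPS_PER_GPU = 1000.0        # about 1 TFLOPS single precision


def bytes_per_flop(bandwidth_gb_per_s, gflops):
    """Bytes of memory traffic available per floating-point operation."""
    return bandwidth_gb_per_s / gflops


aggregate_bandwidth = GPUS * BANDWIDTH_GB_PER_S_PER_GPU   # GB/s across all GPUs
aggregate_gflops = GPUS * PEAK_GFLOPS_PER_GPU             # GFLOPS across all GPUs

# Aggregate bandwidth against aggregate compute.
system_ratio = bytes_per_flop(aggregate_bandwidth, aggregate_gflops)
# Aggregate bandwidth against a single GPU's compute, for comparison.
single_tflops_ratio = bytes_per_flop(aggregate_bandwidth, PEAK_GFLOPS_PER_GPU)

print(f"aggregate bandwidth: {aggregate_bandwidth:.0f} GB/s")
print(f"bytes/flop against 4 TFLOPS: {system_ratio:.2f}")
print(f"bytes/flop against 1 TFLOPS: {single_tflops_ratio:.2f}")
```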