NVIDIA's $10K Tesla GPU-Based Personal Supercomputer
gupg writes "NVIDIA announced a new category of supercomputers — the Tesla Personal Supercomputer — a 4 TeraFLOPS desktop for under $10,000. This desktop machine has 4 of the Tesla C1060 computing processors. These GPUs have no graphics out and are used only for computing. Each Tesla GPU has 240 cores and delivers about 1 TeraFLOPS single precision and about 80 GigaFLOPS double-precision floating point performance. The CPU + GPU is programmed using C with added keywords using a parallel programming model called CUDA. The CUDA C compiler/development toolchain is free to download. There are tons of applications ported to CUDA including Mathematica, LabView, ANSYS Mechanical, and tons of scientific codes from molecular dynamics, quantum chemistry, and electromagnetics; they're listed on CUDA Zone."
Sweet. I got myself a tesla board... now what the heck to do with it... no kidding... I got one of these beasties... any suggestions?
Perhaps the coolest Personal Super Computer was the one shown by the good folks at Penguin Computing. It was rumored to be over-clocked and featured liquid cooling for silent operation!
http://www.penguincomputing.com/products/linux/workstations
Wow, that's some serious computing power! I wonder if anyone has thought of using these for graphics or rendering? I imagine they could make some killer games, especially with advanced technology like Direct 3D.
...to see a company established in a certain market, to branch out so aggressively and boldly into something... well, completely new, really.
Does anyone know if Comsol Multiphysics can be ported to CUDA?
"The agriculture ministry is not in charge of Gundam" - Japanese ministry official.
A single Radeon 4870x2 is 2.4 TFLOPS. Some supercomputer, that.
Seriously, why is this even news? nVidia makes a product, which is OK, but nothing revolutionary. The devaluation of the "supercomputer" term is appalling.
Also, how much of that 4 TFLOPS can you get on actual applications? How's FFT? Or LINPACK?
..will it run Vista? ;-)
What a rip.
FAQs are evil.
Can't wait for him to come and tell us how Nvidia, ATI, Intel and all are idiots, and this is completely and totally unusable and we're in a parallel crisis!
At first glance I thought these used actual Tesla coils in the processor, or the devices were at least powered or cooled by some apparatus that used Tesla coils.
Turns out "Tesla" is just the name of the product.
Drat. I demand a refund.
The toolchain is binary only and has an EULA that prohibits reverse engineering.
While the inner nerd in me screams to take out a loan against my house to buy one, I can't imagine this being very popular outside academia. Most users don't use the power of their crappy computers, let alone this. And then there is the whole "ECONOMY" thing.
"Chance favors the prepared mind." ~Me
But will it run Crys- ...
Oh.
That'll frighten a lot of the OO fanboys who have to have a friggin inheritance tree and a factory based abstracted class design before they can write Hello World.
Sorry, it's early, I'm feeling grouchy.
4 Teraflops should be more than enough for anybody...
Most human behaviour can be explained in terms of identity.
Well cool. I dunno why people have such a tendency to start commenting on slashdot posts with ridicule. This is laudable, not laughable.
If this can be enabled to work for Digital Audio Workstations, to offer massive processing for VST/RTAS away from the CPU, I can still see a thriving market for this in multimedia...or even with real-time video processing.
Who remembers BionicFX? This will certainly make up for their vaporware...
I would save up for this
---
actual CAPTCHA:
costed
No wait, I want one of these and the skills to be able to write something cool in C that would actually use it.
So many scientists use the word "codes" when they mean "program(s)".
Why is this?
Scientific and rendering outfits don't work to the same M.O. as gamers, ie. "it works, I'll use it, game on".
Open-source has become the name of the game over the last few years, and vendor tie-in has become the arse-end of the computing world, especially for this particular customer base.
While you may like the smell at the arse end, those who need HPC don't. Your blind attachment to the concept of closed-source accelerator solutions is so myopic that it's in danger of becoming an Internet joke meme.
I predict the worst possible outcome for a company with its head in the sand and a chip up its arse.
No, I don't, and that's a piss-poor plea (theirs, not yours) for mod points. People are generally smart; they just lower themselves for social reasons, in groups or otherwise. You may or may not know that, but I think it's more accurate than saying groups are inherently stupid.
Imagine a Beowulf cluster of these!
Most human behaviour can be explained in terms of identity.
And then there is the whole "ECONOMY" thing.
The whole reason the ECONOMY is in the tank is because there are not enough people like you taking loans out against their house to buy random stuff like this.
Basically... IT'S ALL YOUR FAULT!
In supercomputing circles (i.e. Top500.org), double-precision floating point performance seems to be what is desired. 4 TFLOPS single precision, while impressive, is overshadowed by the comparatively weak 80 GFLOPS double precision, which is beaten by a single PowerXCell 8i (successor to the Cell in the PS3) or the latest crop of Xeons. I'm sure Tesla will find its users, but we won't see them on the Top500 list anytime soon.
- Henrik
- when the Shadows descend -
All you need to do is follow the fscking link. Plenty of examples there.
There were a lot of early efforts trying to implement realtime raytracing engines for games (e.g. at Intel recently), let's port that stuff and have some fun.
"I love my job, but I hate talking to people like you" (Freddie Mercury)
I went to the site and tried to configure one. The disk partition options are: "General Purpose, Internet Server, Developer's Workstation, File Server". I wonder, who needs three Tesla cards in a file server or an internet server?
Is it possible to build a smaller version of this configuration? I do not have 10K, but I can come up with something smaller for my PhD research. In that case, is this a package that can be replicated via off the shelf nvidia hardware, or do I need to wait for NVidia to release a smaller version?
The S stands for "seconds". The singular is therefore "FLOPS".
Look, there's Python here. You can do the low-level high-performance core routines in C, and use Python to do all the OO programming. This is how God intended us to program.
So how do you get an Erlang system to run on this?
... AMD has announced today its new Edison Personal Supercomputer technology.
The game is on.
It's not about how many cores you have but how efficiently they can be used. If your CUDA application is in any way memory intensive, you're going to experience a serious drop in performance. A read from the local cache is 100 times faster than a read from main memory, and this cache is only 16 KB. I spend most of my time figuring out how to minimise data transfers. That said, CUDA is probably the only platform that offers a realistic means for a single machine to tackle problems requiring gargantuan computing resources.
prepare the survey weasels.
someone speak python here?
HHHHHSSSSSHSSS
SSSSS
the programming language
http://www.bash.org/?400459
wow, as if I didn't do that before plunking down the money for the darn card...
I was asking for innovative ideas... not their existing boring ones...
So, who's got some cool ideas of what to do with Tesla?
People are always coming out of the woodwork to claim supercomputer performance with such and such a solution. Go back and look at GRAPE (which is really cool) http://arstechnica.com/news.ars/post/20061212-8408.html or a lot of other supercomputer clusters. When you want something flexible, you look for "balance": a good relationship between memory capacity, latency and bandwidth, as well as compute power. In terms of memory capacity, the number people talk about is 1 byte/flop... that is, 1 Tbyte of memory is about right to keep 1 TFLOP flexibly useful. This thing has 4 G of memory for 4 TF... in other words, 1 byte / 1000 flops. It's going to be hard to use in a general purpose way.
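The capacity figures quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name `bytes_per_flop` is made up for illustration, and the 4 GB and 4 TFLOPS inputs are the numbers given in the comment, not measured values.

```python
def bytes_per_flop(memory_bytes, flops):
    """Memory capacity available per floating point operation per second."""
    return memory_bytes / flops

# Rule of thumb from the comment: 1 Tbyte of memory for 1 TFLOPS.
rule_of_thumb = bytes_per_flop(1e12, 1e12)

# Figures quoted in the comment for this machine: 4 G of memory, 4 TFLOPS.
tesla_box = bytes_per_flop(4e9, 4e12)

print(f"rule of thumb: {rule_of_thumb:g} byte/flop")
print(f"this machine:  {tesla_box:g} byte/flop")
print(f"that is 1 byte per {round(1 / tesla_box)} flops")
```

With these inputs the machine sits three orders of magnitude below the 1 byte/flop rule of thumb, which is the gap the comment points at.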
In the past I was not very impressed by things such as http://www-graphics.stanford.edu/projects/brookgpu/ because of the latency involved in actually transferring data back and forth between CPU and GPU memory. Thus I observed the same thing. But now that the actual transfer latency seems to be reduced because of PCI-e, one might wonder if decent compiler technology is able to optimise 'normal' code for GPU instructions.
Support Eachother, Copy Dutch Property!
Shameless exploitation of the good name of one of the greatest inventors of all time. :-)
On that note, it would be a good development platform for realtime raytraced game engines. That way the code would be mature when affordable GPUs come out that can match that level of performance.
Ahh yes, the idea of personal supercomputing. Back in '99 I worked for Patmos International. We were at the Linux Expo that year as well, if some of you might remember. Our dream was to have a parallel supercomputer in everyone's home. We used mostly Lisp and Daisy for the programming aspect. The idea was wonderful, but eventually came to a screeching halt when nothing was being sold. It was ahead of its time for sure. You can find out a little more about it here. I find the whole idea of symbolic multiprocessing very fascinating though.
*plays the Apogee theme song music*
how about go outside and grow up...
. . . that's probably exactly the person who would buy one of these.
Folks who are professionally working on mainstream problems that require supercomputers, well, they probably have access to one already. (Maybe one of the supercomputing folks might want to chime in here; do you have enough access/time? Would a baby-supercomputer be useful to you?)
But there is certainly someone out there who was denied access, because his idea was rejected by peer review. He is considered a loopy nut bag, because he wants to prove that the Higgs boson is made of cottage cheese, or something like that.
Yep, look for rejected supercomputing program proposals, and you have a list of potential customers.
Schroedinger's Brexit: The UK is both in and out of the EU at the same time!
Neural nets.
This setup sounds ideal as a training bed for FANN programs. I can't recall if there's a port of FANN for CUDA, but I think there might be.
Perhaps there will be a resurgence in mad, unethical experimentation. In 20 years, this computer might acquire a status similar to that of the Altair 8800 home computer kit.
I still say that 640 human embryos should be enough for anybody.
Michael Reed, freelance tech writer.
Can you port Dan Bernstein's DJBFFT to it? And then benchmark a complex double-precision 8192-limb FFT against the CUDA libraries. If you can provide me with the benchmark results, then I'll be able to tell if it's a good platform for big-number number-crunching. (In particular, prime number hunting.)
Also FatPhil on SoylentNews, id 863
So, you bought a product without having any idea on what to do with it..
Not a smart thing to do, and bitching about nobody else having an idea doesn't make you appear any smarter.
You could ask some community to come up with ideas and offer to lend the board to whoever's idea you like best..
This is relevant because the compiler creates device specific binaries that you can't get the assembler code for.
Yes you can. Just give the proper switch to ask NVCC to keep all intermediate files.
You'll get both the high-level shaders that got compiled and the resulting assembler, which is subsequently compiled into op-codes.
(I just don't have CUDA handy at home to check what the options were.)
My main objection is that CUDA is nVidia hardware-specific only, and ties you to a single provider.
The various incarnations of Brook (currently supported by ATI's cards) are much more interesting as they are vendor neutral and support several back-ends (BrookGPU has an OpenGL back-end).
OpenCL looks like another interesting thing to follow regarding interoperability.
(My other objection is that CUDA isn't as high-level as nVidia would like you to believe. Only the code to call kernels has C extensions. Everything else on the CPU side uses an API which is rather low-level management of memory, initialisation, etc. Also, the different types of memory aren't properly abstracted: texture memory is still accessed with functions in kernels, not simply as plain C arrays like the other types of memory.)
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
As opposed to astroflops?
but I don't know enough about it to be able to give useful information on the subject.
I do write some CUDA code, so I'll try to help.
I believe that each of the chips has a 512 bit wide bus to 4GiB of memory.
Indeed, each physical package has full access to its own whole chunk of memory, regardless of how many "cores" the package contains (between 2 for the lowest-end laptop GPUs and 16 for the highest-end 8800/9800 cards. I don't know about the GT280, but the summary is wrong: 240 is probably the number of ALUs or the width of the SIMD) and regardless of how many "stream processors" there are (each core has 8 ALUs, which are exposed as 32-wide SIMD processing units, which in turn can keep up to 768 threads in flight thanks to some clever hyperthreading-like scheduling).
So in one single GPU card all the memory is accessible.
In a dual-GPU SLI card, each GPU has a full access to its own memory.
So, in our situation, it's 4GiB for each Tesla Card.
Then each core has a special internal memory which is shared by all the 32-to-768 threads running in parallel on the SIMD. (A couple of KiB, don't have the exact number handy).
I'm not sure what the memory allocation per stream processor is but I think the other parts of the chip control what goes where.
There's no actual per-stream-processor control of memory. There is something that looks like a "per-thread memory" but it's actually memory auto-allocated from the global memory.
(It's all the same global memory, and the compiler just makes sure that each thread uses a different chunk of it to avoid conflicts).
And you actually do not control the stream-processors themselves.
You write a kernel (a piece of code which will process a mass of data) and throw a number of threads at one GPU (one physical package, i.e. either 1 normal graphics card, or half of an SLI dual-GPU graphics card).
The scheduler will dynamically spread all the concurrent threads among the SIMD processors on the GPU.
There probably are some bottlenecks
Yes, indeed :
- These 4GiB aren't cached at all (that's why it's preferable to use them only at the beginning and the end of a calculation and use other types of memory during the calculations), have a big latency (that's why it's better to have more threads running together so the scheduler can switch threads to hide latency), and you have to access them in a special fashion to group together the reads and writes for faster access.
- Then there's the texture access. Using a special set of functions you can access the memory not directly but as if it were a texture. It still has a big latency and it is read-only. On the other hand, it has a cache, so it has much better bandwidth, and the texture units don't require special ordering of the accesses.
- The last type of memory is an ultra fast on-chip read-write memory which is shared by all the threads executed at the same time on the same core. But its access pattern is weird because everything is accessed in banks (one bank per thread, or all threads on the same bank. Never many-to-many).
So, in the end writing good CUDA code requires some voodoo magic to correctly organise your stuff into memory in the most efficient way.
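To make the bank rule for the on-chip shared memory a bit more concrete, here is a small Python model of it. It is a sketch under stated assumptions, not a description of the real hardware scheduler: it assumes 16 banks of 32-bit words and 16 threads accessing shared memory together (the figures documented for this GPU generation), it models "all threads on the same bank" as all threads reading the same word, and the name `conflict_degree` is made up for illustration.

```python
from collections import Counter

NUM_BANKS = 16   # assumption: number of shared-memory banks on this GPU generation
HALF_WARP = 16   # assumption: threads whose shared-memory accesses are issued together

def conflict_degree(stride_words, num_threads=HALF_WARP, num_banks=NUM_BANKS):
    """Worst-case number of threads hitting one bank when thread t reads word t * stride_words.

    1 means conflict-free; n means the access is serialised into n steps.
    """
    words = [t * stride_words for t in range(num_threads)]
    if len(set(words)) == 1:
        return 1  # every thread reads the same word: treated as a broadcast
    banks = Counter(word % num_banks for word in words)
    return max(banks.values())

for stride in (0, 1, 2, 3, 16):
    print(f"stride {stride:2d}: {conflict_degree(stride)}-way access")
```

In this model a stride of 1 (one bank per thread) and a stride of 0 (everyone on the same word) are both conflict-free, while even strides pile several threads onto the same bank, which is the kind of layout the voodoo is meant to avoid.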
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
The real reason this is interesting is because of OpenCL (http://www.khronos.org/opencl/) which just got approved by Khronos:
"OpenCL (Open Computing Language) is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems. OpenCL provides a uniform programming environment for software developers to write efficient, portable code for high-performance compute servers, desktop computer systems and handheld devices using a diverse mix of multi-core CPUs, GPUs, Cell-type architectures and other parallel processors such as DSPs."
It's similar to OpenGL / OpenAL except that it's designed for general purpose computing and is already approved by the vast majority of players in the industry. Developing for proprietary CUDA is riddled with problems, but OpenCL should open up the doors for some very interesting applications. In my opinion, support for OpenCL is the single biggest feature in Apple's Snow Leopard.
I just put up a post on this in my blog - http://blog.expensivedna.com/?p=82
If it doesn't shoot arcs of lightning from it, then it shouldn't be named Tesla.
Will it run Duke Nukem For... eh, you all know where this is going...
This sig is false.
Also left out of the calculations are the glial cells. There are 10x more glial cells than neurons. They were previously thought to not be part of brain calculation but have since been shown to modulate the activity of the neurons. We've got a long way to go.
NVIDIA has done a good job of making the processing power accessible to programmers that are not GPU coding experts. In addition, they have made hardware changes to better support the type of scientific computation being done on these devices.
So, while in theory you could put together some Radeons, work with their API and achieve the same thing, NVIDIA has significantly reduced the level of effort to make it happen.
in terms of memory capacity, the number people talk about is:
1 byte/flop... that is 1 Tbyte of memory is about right to keep 1 TFLOP flexibly useful.
this thing has 4 G of memory for 4 TF... in other words:
1 byte / 1000 flops.
it's going to be hard to use in a general purpose way.
Note that these processors are decidedly not general purpose. As is typical with many scientific computing applications, these processors specialize in matrix and vector operations.
For most signal/image processing uses, memory bandwidth is the limiting factor, something in which GPUs have had a very large lead over their more general CPU counterparts for some time. The real reason people get excited about this technology is cost. For 10K we replaced an SGI with 128 Itanium 2s that cost well over 2 million. Talk about big iron. That thing required a six-figure cooling unit just to run it.
10 Gs ? I'd pay that.
Personal supercomputer? Surely it's cool, but how about turning the whole Internet into a supercomputer?
Make the Internet fast enough and equip every node with a network operating system to share its resources with all other nodes. Sounds like a security nightmare, but let's focus on the performance part for now. Every one of us has a CPU, a storage device (e.g. an SSD), and some RAM. But not all of us use all of our CPU, SSD, or RAM at the same time. While I play a game, effectively making my CPU work at 100% capacity, my neighbour may let their CPU sleep. If we had a fast communication link between us and we trusted each other, we could share the work and let my CPU and my neighbour's each work at 50% instead. And two CPUs at 50% deliver faster results than one CPU at 100% if the software is designed to take advantage of multiprocessing and there are no communication overheads.
Similarly for storage: the need to take backup copies would be made obsolete if we could implement a worldwide RAID system of all of our SSDs, HDDs, etc. Our data would be replicated all over the planet's computers in a P2P fashion, and we would never have to worry about backups and lost data. Plus, assuming zero communication costs, such a RAID system would be extremely fast.
The only obstacles to a worldwide supercomputer are communication costs and human trust. Unfortunately with the currently deployed Internet technologies the communication overhead is significant, and we cannot seem to be able to trust our neighbours in this world. The trust problem can potentially be solved (right now whenever you talk on VoIP your data get transmitted through other nodes and yet no one seems to have a problem with this), but I am not so sure about the communication technology and infrastructure. But once we solve the communication problem, the global supercomputer could become reality.
The 1 byte/flop ratio is more about memory bandwidth than capacity. Each Tesla processing unit may only have 1GB of onboard memory, but that doesn't restrict you from transferring data in and out of your system's main memory, which could be as large as you need it to be. The bandwidth of PCI-Express 2.0 would probably still be a bottleneck, though. I don't have the exact specifications on hand, but even a previous generation G80 has about 80GB/s of memory bandwidth to its on-board memory, and there are 4 of these, so you're looking at a minimum of 320GB/s or about 1/3 of a byte per flop for a nominal input dataset of 4GB, lower than that for larger data sets. Not ideal, but not useless either.
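The bandwidth side of the ratio can be worked through the same way. This is a rough sketch using only the figures quoted in the thread (about 80GB/s per GPU, 4 GPUs, about 1 TFLOPS per GPU and 4 TFLOPS for the box); which peak figure belongs in the denominator is an assumption, so both are shown.

```python
PER_GPU_BANDWIDTH = 80e9   # bytes/s, the "about 80GB/s" quoted for a G80
NUM_GPUS = 4

total_bandwidth = PER_GPU_BANDWIDTH * NUM_GPUS   # aggregate on-board bandwidth

# Bytes of memory traffic available per floating point operation,
# against the two peak figures mentioned in the thread.
vs_one_tflops = total_bandwidth / 1e12    # denominator: about 1 TFLOPS
vs_four_tflops = total_bandwidth / 4e12   # denominator: 4 TFLOPS for the whole box

print(f"aggregate bandwidth: {total_bandwidth / 1e9:.0f} GB/s")
print(f"bytes per flop at 1 TFLOPS: {vs_one_tflops:.2f}")
print(f"bytes per flop at 4 TFLOPS: {vs_four_tflops:.2f}")
```

The "about 1/3 of a byte per flop" figure corresponds to the 1 TFLOPS denominator; measured against the full 4 TFLOPS peak the same bandwidth works out to a quarter of that.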
Excellent idea! A growth algorithm for an AI. Sweet.
I'll bite. I manage a cluster as well as what I deem to be a supercomputer. In my spare time, I run codes on them and try to get the best possible efficiency out of them, so I can show the guys that write their own codes on these machines.
About 5 months ago I told my boss we need to get one of these. We'll get one but have to wait for end of year budget cleaning. See, I also experiment with the GPU (8800 GTX) in my workstation. I had a client at an institute across town that needed to run 8*10^7 2D FFTs with local minimisation and was going to be adding another half million jobs each day with the remote sensing devices he had. Our cluster and SMP machine are totally full, with a backlog of work 10x the compute power (no free time for about a month). The GPU did this FFT work wonderfully. I rolled out a machine to the institute, and computation caught up on the backlog and processed any incoming work on the fly.
To get this done on the big machines, we would have had to talk to the security guys to open up some ports (reluctantly = time lost) and then would have to figure out some workflow to push the data across the network to process, compute, and push results back.
A supercomputer is the world's fastest computer and computers within one order of magnitude of it. That is 100 teraflops and faster these days. The latest list was announced last week.
What's the point if it can't play Crysis?
There are 240 shaders on the GT280, which is why there is something like 1.4bn transistors on this chip.
Shaders aren't cores.
GPUs tend to use a massively wide SIMD architecture (Single Instruction Multiple Data). When you're going to run the exact same shader on a huge number of pixels, it's redundant to put a complete pipeline and control for each thread. Instead you group the data you have to process into SIMDs.
A SIMD processor runs one piece of code, one shader, but applies this shader to several pixels at the same time. In terms of GPGPU: you write one kernel function, and each SIMD processor executes one instance but applies it to lots of elements of the data array at the same time.
The 9800GTX was advertised as having "128 shaders" where in fact that is 16 cores, each with 8 ALUs.
There aren't 128 discrete processors. There are only 16, but each can process 8 pieces of data at the same time.
The "240 shaders" of the GT280 are technically 30 cores with 8 ALUs each.
There are only 30 processors. They just have 8 units each, and are thus able to run 32 to 768 threads per processor.
Check Appendix A of the CUDA programming guide if you don't trust me.
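The shader arithmetic is easy to sanity-check. A minimal sketch, using only the figures given in this comment (8 ALUs per core, 16 or 30 cores, 32 to 768 threads per processor); the function names are made up for illustration and the thread limits are the comment's numbers, not something checked against the programming guide here.

```python
ALUS_PER_CORE = 8   # figure given in the comment

def advertised_shaders(cores, alus_per_core=ALUS_PER_CORE):
    """Marketing 'shader' count: every ALU of every core counted separately."""
    return cores * alus_per_core

def threads_in_flight(cores, threads_per_core):
    """Total threads the chip keeps in flight at a given per-core thread count."""
    return cores * threads_per_core

for name, cores in (("9800GTX", 16), ("GT280", 30)):
    print(f"{name}: {cores} cores -> {advertised_shaders(cores)} 'shaders', "
          f"{threads_in_flight(cores, 32)} to {threads_in_flight(cores, 768)} threads in flight")
```

So the 128 and 240 on the box are ALU counts, while the number of independently scheduled processors is 16 and 30.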
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
If you go to purchase it, depending on the configuration you can get it for well under $10000. More like $8000.
"Sure would be cool to build such a beast, do some random connections, and see what happens..."
Just don't give it a modem.
"...and see what happens..." can be pretty fucking scary sometimes, and letting scary out of the building just might be a bad idea.
Never know if you're going to end up with Skynet or the mother of all spammers...
It runs Linux and the Windows virus cannot infect it. Sounds like a super computer to me :-)
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
Actually I do have a whole bunch of purposes for it. I'm just interested in what you guys would do with it if you had one...
Play Crysis on max settings
are you on DVORAK?
Nope, not at all. It's fr_CH on buckling springs. Why do you ask ?