Inside Tsubame, Japan's GPU-Based Supercomputer
Startled Hippo writes "Japan's Tsubame supercomputer was ranked 29th-fastest in the world in the latest Top 500 ranking with a speed of 77.48 TFLOPS (trillion floating-point operations per second) on the industry-standard Linpack benchmark. Why is it so special? It uses NVIDIA GPUs. Tsubame includes hundreds of graphics processors of the same type used in consumer PCs, working alongside CPUs in a mixed environment that some say is a model for future supercomputers serving disciplines like material chemistry." Unlike the GPU-based Tesla, Tsubame definitely won't be mistaken for a personal computer.
Imagine what a Beowulf cluster of these could do! Oh, wait! ;-)
On reading the article: the box has 30 thousand cores, the vast majority of which are AMD Opterons in Sun boxes. There's no mention of how, or in what, you'd program this to actually put the GPUs to good use.
Why is it so special? It uses NVIDIA GPUs
So we can expect binary-only (i.e. no patchable source) driver issues when running Linux on it? Or will it be frequent nv_disp BSODs from a Windows OS? And I'm sure the kernel(s) will have a fun time managing all of this on top of SMP across several real CPUs.
Sounds "special" all right...
Ironic name: tsubame means swallow (the bird) in Japanese, and it also has the slang usage of toy-boy (as in a cougar's toy-boy).
Not sure what to read into that ...
Why call it a GPU when it has no graphics output? Is it still a GRAPHICS Processing Unit when it doesn't calculate any graphics and doesn't display any graphics? HUH? ;)
They have a whole lot of these boosting a whole lot of quad-cores.
I think it's only a matter of time before many of these clusters start using all the processing power available to them. Hell, even on desktops, whatever app you build should detect and use your GPU! If compilers got even smarter, they could automatically route pieces of code containing calculations the GPU can do faster to the GPU, and otherwise just use one of the other available cores. This *should* be the future, IMO.
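For what it's worth, here's a toy version of the "route it to whichever is faster" decision a smart compiler or runtime would have to make. It's a sketch only: the cost model (launch overhead + bus transfer + compute) and every device number below are hypothetical, not measurements of any real CPU or GPU.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical device model: peak throughput in FLOP/s, a fixed launch
    overhead in seconds, and the bandwidth (bytes/s) needed to ship data to it."""
    name: str
    flops: float
    launch_overhead_s: float = 0.0
    transfer_bw: float = float("inf")  # inf = data is already local (the CPU case)

    def estimated_time(self, flop_count: float, bytes_moved: float) -> float:
        transfer = bytes_moved / self.transfer_bw
        return self.launch_overhead_s + transfer + flop_count / self.flops


def choose_device(flop_count, bytes_moved, cpu, gpu):
    """Offload only if the model predicts the GPU finishes first, transfers included."""
    cpu_t = cpu.estimated_time(flop_count, bytes_moved)
    gpu_t = gpu.estimated_time(flop_count, bytes_moved)
    return gpu if gpu_t < cpu_t else cpu


cpu = Device("cpu", flops=10e9)  # hypothetical 10 GFLOP/s CPU
gpu = Device("gpu", flops=500e9, launch_overhead_s=10e-6, transfer_bw=5e9)  # hypothetical GPU on a 5 GB/s bus

# tiny job, big compute-heavy job, big but memory-heavy job
for flop_count, bytes_moved in [(1e4, 8e4), (1e11, 8e8), (1e8, 8e8)]:
    print(flop_count, bytes_moved, choose_device(flop_count, bytes_moved, cpu, gpu).name)
```

With these made-up numbers only the big compute-heavy job goes to the GPU; the tiny job loses to launch overhead and the memory-heavy one loses to the bus transfer, which is exactly why automatic offloading is harder than it sounds.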
Quack damn you!
A system that can run Crysis at full settings.
You may want to read the article again; if not, here's a recap:
655 Sun boxes, each with 16 AMD cores = 10,480 CPU cores
680 Tesla cards, each with 240 processors = 163,200 GPU processors
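A quick sanity check of the recap arithmetic, using only the figures in the two lines above (240 per card is the recap's number; the variable names are just for illustration):

```python
# Figures from the recap: 655 Sun boxes x 16 cores, 680 Tesla cards x 240 processors.
SUN_BOXES, CORES_PER_BOX = 655, 16
TESLA_CARDS, PROCESSORS_PER_CARD = 680, 240

cpu_cores = SUN_BOXES * CORES_PER_BOX                # 10,480
gpu_processors = TESLA_CARDS * PROCESSORS_PER_CARD   # 163,200

print(f"{cpu_cores:,} CPU cores, {gpu_processors:,} GPU processors")
```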
As for how to use the GPUs: I use my GTX 280 (almost the same thing as a Tesla) to crunch through lots of numeric calculations in parallel. I'm sure these guys are doing the same thing, as that is the strength of the GPU. NVIDIA has made it easier to access the processing power of the GPU with CUDA. You write a program (a kernel) in C that gets loaded onto the GPU, and when you launch it you tell it how many copies to run at once; each copy typically operates on a different portion of the data. Because you can launch more threads than there are processors, the GPU can keep some threads reading data from global video memory while other threads perform calculations, which hides much of the memory latency.
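To make the "many copies, each on its own slice of the data" idea concrete, here's a CPU-side emulation in Python. It is not CUDA: `saxpy_kernel` and `launch` are hypothetical stand-ins for a `__global__` function and the `kernel<<<grid, block>>>` launch, and the loops run sequentially where a GPU would run the threads concurrently (so the latency hiding isn't modeled). What it does show is the standard global-index calculation and the bounds guard.

```python
def saxpy_kernel(block_idx, block_dim, thread_idx, a, x, y, out):
    """Body of one 'thread'. In CUDA these indices would come from
    blockIdx.x, blockDim.x and threadIdx.x."""
    i = block_idx * block_dim + thread_idx  # global index: one element per thread
    if i < len(x):                          # guard: the grid may overshoot the data
        out[i] = a * x[i] + y[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Sequential stand-in for kernel<<<grid_dim, block_dim>>>(args...)."""
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(b, block_dim, t, *args)


n = 1000
x = [float(k) for k in range(n)]
y = [1.0] * n
out = [0.0] * n

block_dim = 256
grid_dim = (n + block_dim - 1) // block_dim  # round up so every element gets a thread
launch(saxpy_kernel, grid_dim, block_dim, 2.0, x, y, out)
print(grid_dim, out[:3], out[-1])
```

Note that 4 blocks of 256 threads give 1,024 threads for 1,000 elements, which is why the `if i < len(x)` guard is there.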
I don't think anyone who actually works with CUDA refers to individual thread processors as "GPU processors." Even nVidia refers to the Tesla itself as a GPU (singular). Your terminology is like saying that my desktop PC has "4 Core 2 Processors" because the one Core 2 in my PC has four cores.
Just to put it in perspective, the GPUs provide about 10 of the 77 TFLOPS benchmarked in LINPACK, according to the HPC article.
ATI's latest cards give more punch per dollar, and they are designed specifically to be clustered/linked/CrossFired and whatnot.
A Tesla comes in two form factors: a PCI Express card, or a 1U rack-mount system that contains four of the Tesla cards and connects to a server or cluster node through two PCIe cards. Not sure how you could confuse that with a PC. Also, I was just at a conference with the gentleman in charge of Tsubame, and if I recall correctly they had some of the 1U Tesla systems in the cluster. They may have used high-end graphics cards too; they may have had only a limited number of the rack-mount Tesla systems for testing.
What makes a supercomputer *a* supercomputer, as opposed to a network of not-necessarily-super computers which all happen to be in the same building and connected to the same high-speed network? From the way this is described, it certainly seems to be a network of many computers working together, rather than one single almighty computer.
Well, I actually work with CUDA and I just used that term, so that makes at least one person.
The term "GPU processor" was merely a shorthand method of stating that the number 163,200 related to circuitry that performs calculations but without as much flexibility as a core on a traditional CPU. They do work, but groups of them share the same instruction. The term "core" would have seemed inaccurate also, maybe I should have said "streaming processor cores"??
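The "groups of them share the same instruction" point is what makes these units less flexible than CPU cores. Here's a deliberately simplified SIMT model in Python, assuming 32-thread warps that execute each distinct branch path in turn with the other threads masked off. Real hardware reconvergence is more involved (and newer architectures schedule threads more independently), so treat this as an illustration, not a performance model.

```python
WARP_SIZE = 32  # threads that issue the same instruction together


def warp_issue_passes(branch_taken):
    """branch_taken[i] is the path chosen by thread i at one if/else.
    Each warp must run every distinct path its threads took, one after
    another, so the cost is the number of distinct paths per warp, summed."""
    passes = 0
    for start in range(0, len(branch_taken), WARP_SIZE):
        warp = branch_taken[start:start + WARP_SIZE]
        passes += len(set(warp))
    return passes


n = 256  # 8 warps
uniform = [True] * n                                         # everyone agrees
by_warp = [(i // WARP_SIZE) % 2 == 0 for i in range(n)]      # differs only between warps
interleaved = [i % 2 == 0 for i in range(n)]                 # differs inside every warp

print(warp_issue_passes(uniform), warp_issue_passes(by_warp), warp_issue_passes(interleaved))
```

Branching per warp costs nothing extra, but branching on odd/even thread IDs doubles the work, something a "real" core wouldn't suffer from.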
I suppose there's a first for everything. I work with CUDA in an HPC research capacity and I've never heard any colleague or anyone from nVidia refer to individual thread processors as a "GPU processor", for what it's worth. nVidia's official terminology is "scalar processor", and I mainly hear that and "thread processor".
Many Internets to whomever gets the reference: Hiken Tsubame Gaeshi!
Well, anyway, such computers are required for Skynet. The rise continues.
Sorry for the double post; I forgot the most important part of my reply. More so than the "number of cores", I would consider Tsubame still very much a CPU-based cluster, in that only approximately one eighth of the LINPACK work was done by GPUs.
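The "one eighth" figure follows from the numbers already in the thread: about 10 TFLOPS from the GPUs (the figure quoted earlier) out of the 77.48 TFLOPS Linpack result in the story summary.

```python
from fractions import Fraction

total_tflops = 77.48  # Tsubame's Linpack result from the summary
gpu_tflops = 10.0     # "about 10" TFLOPS attributed to the GPUs earlier in the thread

share = gpu_tflops / total_tflops
print(f"GPU share of Linpack: {share:.1%}")
print(Fraction(share).limit_denominator(10))  # closest fraction with denominator <= 10
```

That works out to roughly 12.9%, and the nearest simple fraction is 1/8.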
I don't know about CUDA, but when Microsoft discusses the number of "processors" a single instance of their OS supports, they are generally referring to logical processors, which they define as:
(# of physical processors) * (# of cores per processor) * (# of threads per core)
That's why Microsoft claims Windows 7 will scale up to 256 processors. In reality that's 64 physical processors * 2 cores * 2 threads, or 32 physical processors * 4 cores * 2 threads, etc.
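The formula above as a few lines of Python. The function name is made up for illustration, and the enumeration of other layouts is just a hypothetical exercise (power-of-two sockets and cores, at most 2 hardware threads per core), not a list of real Windows-supported configurations.

```python
def logical_processors(sockets, cores_per_socket, threads_per_core):
    """Physical processors x cores per processor x hardware threads per core."""
    return sockets * cores_per_socket * threads_per_core


# The two configurations from the comment both land on 256.
print(logical_processors(64, 2, 2), logical_processors(32, 4, 2))

# Hypothetical power-of-two layouts that also total 256 logical processors.
layouts = [(s, c, t)
           for t in (1, 2)
           for c in (1, 2, 4, 8, 16)
           for s in (1, 2, 4, 8, 16, 32, 64, 128, 256)
           if logical_processors(s, c, t) == 256]
print(len(layouts), layouts[:3])
```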
It's a little frustrating to me that they don't mention Roadrunner, which is an IBM Cell-accelerated Opteron cluster. We're doing plenty of real science, with some applications achieving ~400 TF sustained performance.
Can I run Crysis now?
In soviet Russia, God creates you!
My complaint was primarily that "GPU processors" is a rather ambiguous (and unusual) term. As I mentioned in another reply, "scalar processors" and "thread processors" are more common and clearer names. "GPU processor" would (in some cases) leave one wondering whether you were referring to:
(a) an entire GPU
(b) a GPU multiprocessor, consisting of (in nVidia's case) 8 scalar processors
(c) a scalar processor
I'll also note that Microsoft counts a Tesla as one GPU :)
This makes plenty of sense. I've personally dealt with several IBM BlueGene supercomputers (more than 200,000 cores) that didn't perform nearly this well.
The GPUs definitely made a huge difference in this case.
How does something like an nVidia scalar processor stack up against something like a Clearspeed accelerator?
No wonder the Japanese don't play any PC games.
With this much computing power, one should be able to use higher math to determine the optimal times to invest in the stock market and take advantage of trends. Unfortunately, since things are headed downward, this technology can be used most efficiently to help you lose money at the optimum rate. Actually, I have an idea, but I have to work with CUDA more before I know whether it's real. Unfortunately, putting these cards in Mac Pros is problematic. You would think Apple would have made a deal with NVIDIA to ensure these cards could run in Apple's fastest desktop, but no joy there. In fact, it is a long and unhappy story trying to get Apple and CUDA superpower in the same box.
I've never worked with Clearspeed, but on paper they look pretty good. They claim 22 TFLOP/s from a 1U unit; nVidia can get about 4 TFLOP/s from a 1U unit. They're a bit deficient in terms of memory bandwidth, though: they can achieve a total of 96 GByte/s in one of those 1U units, whereas nVidia can achieve over 400 GByte/s.
Of course, this is all based on theoretical numbers for the clearspeed units. Also, I have no idea what they cost compared to a Tesla 1U unit.
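One way to see why the bandwidth gap matters is a simple roofline bound: a kernel can't run faster than the compute peak, nor faster than memory can feed it (bandwidth x FLOPs per byte). The sketch below plugs in the per-1U figures quoted above, which are claimed/theoretical numbers, and a hypothetical low-intensity kernel doing 2 FLOPs per byte.

```python
def attainable_tflops(peak_tflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: min(compute peak, bandwidth x arithmetic intensity)."""
    memory_bound_tflops = bandwidth_gbs * flops_per_byte / 1000.0  # GFLOP/s -> TFLOP/s
    return min(peak_tflops, memory_bound_tflops)


# Per-1U figures from the comment above: (peak TFLOP/s, memory bandwidth GByte/s).
units = {"Clearspeed 1U": (22.0, 96.0), "Tesla 1U": (4.0, 400.0)}

for name, (peak, bw) in units.items():
    ridge = peak * 1000.0 / bw  # FLOPs per byte needed before compute becomes the limit
    print(f"{name}: ridge point {ridge:.0f} FLOP/byte; "
          f"at 2 FLOP/byte -> {attainable_tflops(peak, bw, 2):.2f} TFLOP/s")
```

On these numbers the Clearspeed box needs a kernel doing roughly 229 FLOPs per byte of memory traffic to reach its peak, versus 10 for the Tesla unit, so for bandwidth-hungry codes the Tesla could come out ahead despite the lower peak.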