NEC SX-9 to be World's Fastest Vector Computer
An anonymous reader writes "NEC has announced the NEC SX-9 claiming it to be the fastest vector computer, with single core speeds of up to 102.4 GFLOPS and up to 1.6TFLOPS on a single node incorporating multiple CPUs. The machines can be used in complex large-scale computation, such as climates, aeronautics and space, environmental simulations, fluid dynamics, through the processing of array-handling with a single vector instruction. Yes, it runs a UNIX System V-compatible OS."
A user would pay the extremely high cost of a supercomputer - with its proprietary memory architecture and interconnects - precisely because it can scale up parallel processes much more effectively than a cluster. If that benefit did not outweigh the cost of tailoring software to fit the device, these devices would never be made.
Put simply, the problems that vector processors are geared towards (those involving large matrix ops) are the type that clusters perform horribly at.
There's an interesting paper that analyzes the data accumulated in the top500 list site, which ranks the 500 most powerful supercomputers twice a year: it shows that, over time, the share of vector machines within the list is sharply declining, both in aggregated power and in number: from around 60% in 1993 to around 10% in 2003 (see Figure 3, page 6, in said paper). Still, vector machines refuse to die and always seem to maintain a presence in the top500, as is evident from the above slashdot post. Will vector machines live forever?
Well, distributed is often seen as poor man's parallel, but in this case they don't compare. Vector units have large arrays of data and perform the same operation on all of them at once. Think array or matrix operations being done in one step rather than needing loops. This is where a SIMD architecture takes off.
The only unit I ever got to play with had a 64x32 grid of processors; you could add a row of numbers in log2(n) steps instead of n. It was cool because you could tell each processor to grab a value from the guy next to him (or n steps in a given direction from him) and so on. You could calculate dot products of matrices very quickly.
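For anyone wondering how a row gets summed in log2(n) steps, here is a rough sketch in Python. It only simulates the lockstep behaviour sequentially; the function name `tree_sum` and the "grab from the neighbour `stride` cells to the right" convention are my own assumptions for illustration, not how that particular machine was programmed.

```python
def tree_sum(values):
    """Sum a row in ceil(log2(n)) lockstep steps, SIMD-style (simulated)."""
    cells = list(values)
    n = len(cells)
    if n == 0:
        return 0, 0
    steps = 0
    stride = 1
    while stride < n:
        # Every "processor" i grabs the value held by the processor `stride`
        # positions to its right (0 if that falls off the edge of the row)...
        shifted = cells[stride:] + [0] * stride
        # ...and all of them add it to their own value in the same step.
        cells = [own + grabbed for own, grabbed in zip(cells, shifted)]
        stride *= 2
        steps += 1
    # Processor 0 now holds the sum of the whole row.
    return cells[0], steps


total, steps = tree_sum(range(64))
print(total, steps)
```

On real SIMD hardware each pass through the loop is one step for the whole row, which is where the log2(n) comes from; a 64-wide row takes 6 steps rather than 63 sequential additions.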
The distributed stuff you mentioned is mostly farming. Take a big loop of independent steps, break them up and pass them out to a (possibly) heterogeneous collection of processing nodes. Collect the answers when they finish. Render farms work the same way. It's a good way to break up some problems, but it's not what a vector unit does.
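By contrast, farming looks roughly like the sketch below: independent work items handed to a pool of workers, answers collected at the end. `render_frame` is a hypothetical stand-in for the expensive job, and a thread pool on one machine stands in for what would really be a collection of separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def render_frame(frame_number):
    # Hypothetical stand-in for an expensive job that needs no data
    # from any other frame.
    return frame_number * frame_number


def farm(tasks, workers=4):
    # Hand the independent tasks out to the workers and collect the
    # answers; pool.map returns results in the order the tasks were given.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(render_frame, tasks))


print(farm(range(8)))
```

Note that no worker ever talks to another worker, which is exactly the property the tightly coupled matrix problems do not have.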
Now, I haven't touched this stuff for eleven years so my facts are possibly wrong. I'm sure someone will be along to correct me.
The cost of supercomputers is so high that several man-months spent tailoring the software to run as efficiently as possible on the hardware can sometimes be recovered in a couple of days of processing.
For the kind of computation the supercomputer market demands, a 5% improvement in running speed on a supercomputer can be worth millions.
http://www.nec.de/hpc/hardware/sx-series/index.html
There are four PDFs there; the brochure is a four-colour glossy, but there is some real information. Sadly, the interesting-looking white papers are for the SX6, two generations earlier.
SX9 summary: 65nm technology, 3.2GHz clock speed, eight vector elements handled per cycle with two multiply and two add units, which is where the 102.4Gflop/CPU figure comes from. 16 CPUs in a box about the size of a standard 42U rack.
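A quick check of that arithmetic, using only the numbers quoted above. The assumption (mine) is that each of the four floating-point units retires eight vector elements per cycle; the constant names are made up for the sketch.

```python
CLOCK_HZ = 3_200_000_000      # 3.2 GHz clock
ELEMENTS_PER_CYCLE = 8        # vector elements handled per cycle
FP_UNITS = 2 + 2              # two multiply units plus two add units
CPUS_PER_NODE = 16            # CPUs in one box


def peak_gflops_per_cpu():
    # One floating-point operation per element per unit per cycle.
    return CLOCK_HZ * ELEMENTS_PER_CYCLE * FP_UNITS / 1e9


def peak_tflops_per_node():
    return peak_gflops_per_cpu() * CPUS_PER_NODE / 1e3


print(peak_gflops_per_cpu(), peak_tflops_per_node())
```

That gives 102.4 GFLOPS per CPU, and sixteen of them come to about 1.64 TFLOPS, which lines up with the "up to 1.6TFLOPS on a single node" in the summary.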
Totally absurdly fast (ten 64-bit words per cycle per CPU) access to a large (options are 512GB or 1TB) shared main memory; absurdly fast (128GB/second) inter-node bandwidth.
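To put the memory figure in more familiar units, here is the same kind of back-of-the-envelope calculation. It assumes "ten 64-bit words per cycle" is measured against the 3.2GHz clock and that all 16 CPUs can stream at that rate simultaneously; both are my reading of the brochure numbers, not something it spells out.

```python
CLOCK_HZ = 3_200_000_000       # 3.2 GHz clock
WORDS_PER_CYCLE = 10           # 64-bit words per cycle per CPU
BYTES_PER_WORD = 8
CPUS_PER_NODE = 16
PEAK_FLOPS_PER_CPU = 102_400_000_000   # 102.4 GFLOPS per CPU


def memory_gb_per_s_per_cpu():
    return WORDS_PER_CYCLE * BYTES_PER_WORD * CLOCK_HZ / 1e9


def memory_tb_per_s_per_node():
    return memory_gb_per_s_per_cpu() * CPUS_PER_NODE / 1e3


def bytes_per_flop():
    # How many bytes of memory traffic are available for each peak flop.
    return WORDS_PER_CYCLE * BYTES_PER_WORD * CLOCK_HZ / PEAK_FLOPS_PER_CPU


print(memory_gb_per_s_per_cpu(), memory_tb_per_s_per_node(), bytes_per_flop())
```

Under those assumptions it works out to 256 GB/s per CPU, roughly 4 TB/s per node, and 2.5 bytes of memory bandwidth per peak flop.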
There is a video news release and interview with the project manager here: http://movie.diginfo.tv/2007/10/26/07-0502-r.php
That's the current, popular, Blue Gene/L architecture. The Blue Genes are composed of densely packed boards, each of which has a PowerPC chip and many vector processors. The PowerPC chips run a Linux-like OS and do some normal-looking I/O (filesystems, networking, etc), while the vector processors churn lots of data and have simplistic I/O.
That GP who suggests that Xen is used to distribute tasks obviously isn't familiar with the needs of big iron.
So here's what you're missing: Vector processors aren't about doing a lot of math. True, they do that very well, but that's not where they excel. Where vector processors really shine is in memory bandwidth. Vector operations let you actually use that 4 terabytes/second of memory bandwidth, rather than spending it all flushing out cache lines. On this machine, a single load instruction can fetch 2KB of data.
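To give a feel for that 2KB figure, a small sketch of what it implies. The 64-bit word size comes from the spec summary earlier in the thread; the 64-byte cache line is an assumption about a typical commodity CPU, used purely for comparison.

```python
LOAD_BYTES = 2 * 1024      # data fetched by one vector load, per the comment
WORD_BYTES = 8             # one 64-bit word
CACHE_LINE_BYTES = 64      # assumption: typical commodity CPU cache line


def elements_per_vector_load():
    # Number of 64-bit elements a 2KB load implies.
    return LOAD_BYTES // WORD_BYTES


def equivalent_cache_lines():
    # How many cache-line fills move the same amount of data.
    return LOAD_BYTES // CACHE_LINE_BYTES


print(elements_per_vector_load(), equivalent_cache_lines())
```

So one instruction asks for 256 words at once, where a cache-based CPU would have to pull in 32 separate lines to see the same data.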
Cell (and many GPUs or future whatever) have the ability to do a LOT of math, but they do it on a very tiny amount of data. These vector CPUs have dozens or hundreds of memory controllers. That's a lot of RAM chips, and a lot of copper wires between memory and the CPU. I'm sure the motherboard is dozens of layers thick for all the traces. In short, you can't get all that capability on a commodity processor, because the commodity market won't pay for all the memory bandwidth, which is expensive to engineer, and expensive per-unit.
Unless/until there is a major change in the cost of memory and memory bandwidth, there will still be a need for special-purpose supercomputing processors. This is not to say that Cray and NEC will continue to be the people to make such a thing. I'm sure IBM could come up with a Cell-derived processor with a TON of real memory bandwidth, or maybe Nvidia. The question is: will they want to? I figure there's a lot more money to be made selling videogame consoles than there is at the high end of the supercomputer market.