World's Fastest Supercomputer To Be Built At ORNL
Homey R writes "As I'll be joining the staff there in a few months, I'm very excited to see that Oak Ridge National Lab has won a competition within the DOE's Office of Science to build the world's fastest supercomputer at Oak Ridge National Lab in Oak Ridge, Tennessee. It will be based on the promising Cray X1 vector architecture. Unlike many of the other DOE machines that have at some point occupied #1 on the Top 500 supercomputer list, this machine will be dedicated exclusively to non-classified scientific research (i.e., not bombs)."
Cowards Anonymous adds that the system "will be funded over two years by federal grants totaling $50 million. The project involves private companies like Cray, IBM, and SGI, and when complete it will be capable of sustaining 50 trillion calculations per second."
Some problems are easily partitioned up and distributed to separate nodes. In particular, code where the nodes do not need to talk to each other much are ripe for clusters, as the interconnect speed is less important. Therefore, you can build a commodity cluster fairly cheaply.
:)
For other problems, where interprocess/node communication is high or very high, you need a high speed interconnect (like NUMAflex in SGI's) to get you the scalability you need, as you increase the number of processors/nodes and the size of the data set increases. The big systems like Crays and the bigger SGI's and IBM Power series have those high speed interconnects and will allow you to scale more efficiently than the clusters. They cost a lot more though
A good book to read on the subject of HPC is High Performance Computing by Severance and Dowd (O'Reilly). It's a little old now, but it covers a lot of the concepts you need to know about building a truly HPC system (architecture as well as code).
Yes, DOE is the Federal Government's Department of Energy. Oak Ridge is a large federal govt. lab.
So each node is directly connected to six ajacent nodes. Contrast this with the Thinking Machines Connection Machine CM2 topology, which had 2^N nodes connected in an N dimensional hypercube. So each node in a 16384 node CM2 was directly connected to 16 other nodes. There's a theorem that you can always embed a lower dimensional torus in an N dimensional hypercube, so the CM2 had all the benefits of a torus and more. This topology was criticized because you never needed as much connectivity as you got in the higher node-count machines, to CM2 was in effect selling you too much wiring.
Thinking Machines changed the topology to fat trees in the CM5. One of the cool things about the fat tree is it allows you to buy as much connectivity as you need. I'm really surprised that it seems to have died when Thinking Machines collapsed. On the other hand, any kind of 3D mesh is probably pretty good for simulating physics in 3D. You can have each node model a block of atmosphere for a weather simulation, or a little wedge of hydrogen for an H-bomb simulation. But it might be useful to have one more dimension of connection for distributing global results to the nodes.
--- Often in error; never in doubt!
Two radically different designs, will probably solve very different sorts of problems. Linpack is extremely good at giving a computer an impressive number. It's the sort of problem that fills up execution piplines to their maximum. Blue Gene was origionally designed to do protein-folding calculations. While many other tasks will work well on that machine, others will work very poorly.
It's a mesh of a LOT of microcontroller-class processors. The theory being that these processors give you the best performance per transistor. Thus you can run them at a moderate clock, get decent performance out of them, and cram a whole hell of a lot of them into a cabinet. It's a cool design, I'm interested to see what it will be able to do, once deployed. However, for the problems they have at ORNL, I'm sure the X1 was a better machine. Otherwise they would have bought IBM. They already have a farm of p690s, so they have a working relationship.
The number of processors isn't as important as the memory architecture. Clusters of workstation-class machines have isolated memory spaces connected by I/O channels. Many non-clustered supercomputers have a single unified memory space where all processors have equal access to all of the memory in the system. This can be important for algorithms that heavily use intermediate results from all parts of the problem space.
Even so, for a given number of FLOPS, a vector machine would generally require fewer CPUs than a cluster of general-purpose machines. This reduces the amount of splitting that has to be done to the problem in the first place.