North America's Fastest Linux Cluster Constructed

← Back to Stories (view on slashdot.org)

North America's Fastest Linux Cluster Constructed

Posted by CowboyNeal on Thursday May 13, 2004 @02:53PM from the where-was-the-lightning dept.

SeanAhern writes "LinuxWorld reports that 'A Linux cluster deployed at Lawrence Livermore National Laboratory and codenamed 'Thunder' yesterday delivered 19.94 teraflops of sustained performance, making it the most powerful computer in North America - and the second fastest on Earth.'" Thunder sports 4,096 Itanium 2 processors in 1,024 nodes, some big iron by any standard.

3 of 325 comments (clear)

Min score:

Reason:

Sort:

Re:Very great and all... by tap · 2004-05-13 15:57 · Score: 5, Informative

Do you have any kind of benchmark where the Itanium smokes the Opteron? The Itanium does have a greater memory bandwidth, but not by a lot. If you look at the spec benchmarks, it can be faster on some of them, but not by a lot. However, the Itamium is a lot more expensive!

Compared to a Xeon or AthlonMP cluster, the Itanium faired poorly in price/performance. The only reason to use Itaniums was if you needed 64 bits for more than 4GB of memory, or needed high single CPU performance for a pooly parallized application. (Of course if your application parallizes poorly, a cluster is probably a bad choice to begin with). Then Opterion came out and changed all that. It's 64 bits, it's fast, and it's a fraction of the price of the Itanium2.

I just purchased a new Beowulf cluster. The decision was between Xeons vs Opterons. The Opterons had better price/performance, but the Xeons would fit in better with our existing Pentium3 Beowulf, other ia32 servers, and existing software. In the end, we went with Opterons. Itanium2 was never even in contention. Just one look at the price and performce of a Itanium2 system was all it took to cross it of the list.
Re:"Most" powerful by tap · 2004-05-13 16:13 · Score: 5, Informative

I think you've got that backwards, Quadrics is the performance leading, not the price/performance leader. Myrinet, SCI, and Infiniband all beat it in price/performance. Quadrics is faster, and scales to more nodes than the others.

According to Quadrics latest price list, the cards are $1200 each, $913 per port for a 64 node switch, and $185-$265 for a cable. That's $2300/node.

Myrinet cards are $595, the switch is $400 per port for 64 nodes, and the cables are ~$50. That's $1050/node.

Quadric's price for a 1024 node interconnect is $4,176,094. That's hardly chump change. The bandwith is about 10x higher than gigabit ethernet, and the latency about 100x lower.
Re:before everyone starts shouting at once... by slamb · 2004-05-13 17:23 · Score: 5, Informative
Some of the coolest features of the Itanium are also some of the reasons why a lot of people don't want to use it. The EPIC ISA, for example. It was designed ( along w/ the physical hardware ) to expose a lot of the internal workings of the processor to the user. But rather than recompile and re-optimize their code, people would rather bitch about migration. That's fine for workstations and servers, but in an HPC environment, you want the nifty features, you want to occasionally hand-tune code segments in assembler, etc.
I just coded some IA-64 assembly and from what I've seen, this comment is dead-on. They've got a lot of interesting features:
- Speculation. The idea is to do memory fetches far in advantage to avoid waiting for the (much slower) memory system. You can do a LD.S operation that tells the machine something like "I might want the value from this memory address in a few instructions." It fetches it from memory, if it's in a good mood. If the address is paged out, it doesn't get it. (Instead, it sets a NaT (not a thing) bit to tell you nothing useful is there.) Later, you do a CHK.S. If it turns out that the speculative load fails, it jumps to some "recovery" code which gets it for real.
- Lots of registers. 128 general-purpose 64-bit registers. Floating point registers. Some specialized ones, I think.
- EPIC. (Explicitly Parallel Instruction Computing.) It has different types of instructions, aimed at different execution units. In the current incarnation, there are two sets of these in each processor. You give it bundles of three instructions, more broadly divided into groups. Instructions in a group don't depend on any earlier results calculated by the group, so they can be executed in parallel.
- Rotating registers. This lets you make different iterations of the same loop work with different registers, to take advantage of EPIC more fully.
- Predicated instructions. There are a bunch (16? 64? don't remember) of predicate bits, set by the CMP instruction and the like. Every instruction has an associated predicate. (p0 is hardcoded to true, so you normally don't notice.) So you can do conditional execution without jumping. More efficient, especially if it's just a few instructions that differ.
If you just have a simple sequence of operations, each dependant on the one before, you can't really take advantage of these capabilities. (My code was like this. Even though performance wasn't my reason for writing assembly, it was a little disappointing that I couldn't play with the new toys.) If you're expecting these features to make Word start faster, you'll probably be disappointed.
But if you're doing intensive computations in a tight loop, you can do amazing things. If you can get all the execution units working simultaneously, it will fly. And the features like rotating registers are designed to make that possible. You need a very good compiler or a very smart person to hand-tune it. You may need to recompile to tune if your memory latency changes (affecting how many iterations to run at once) or they come out with a new chip with more sets of execution units. But in a situation like this, none of that is a problem. They'll have applications designed to run as fast as possible on this machine. They may never be run anywhere else.