JPL Clusters XServes
burgburgburg writes "MacSlash has a brief note on how NASA's JPL has put together a cluster of 33 XServes that was able to achieve 1/5 teraflop. The original article notes that the Applied Cluster Computing Group, using Pooch (Parallel OperatiOn and Control Heuristic Application), ran the AltiVec Fractal Carbon demo and achieved over 217 billion floating-point operations per second on this XServe cluster. More importantly, their research indicates that no evidence of an intrinsic limit to the size of a Macintosh-based cluster could be found."
http://www.spymac.com/gallery/showphoto.php?photo=4665
-- I was raised on the command line, bitch
OTOH, if you can take advantage of it, that would put this cluster at #250 in the Top 500 list of supercomputers. In fact, it is just a tick behind an IBM NetFinity cluster with 512x733MHz Pentium IIIs. Not bad for 66x1GHz G4s.
--
The internet is the greatest source of biased information in the history of mankind.
Maybe the computations were not communications bound... Fractal calculations can be done with a Monte Carlo method, which is highly parallelizable, and requires very little inter-node communication.
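For what it's worth, here is a minimal sketch of why a Monte Carlo fractal calculation needs so little communication. It estimates the area of the Mandelbrot set by random sampling: each worker draws its own samples from its own seed, and the only data exchanged is one hit count per worker at the end. Everything in it is an illustrative assumption (the function names, the sample counts, the bounding box, threads standing in for cluster nodes); it is not the algorithm used by the AltiVec Fractal Carbon demo.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def in_mandelbrot(c, max_iter=100):
    """Return True if c has not escaped after max_iter iterations of z -> z*z + c."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return False
    return True


def count_hits(seed, n_samples):
    """One independent 'node': sample points in a box and count those in the set."""
    rng = random.Random(seed)  # private random stream, no shared state
    hits = 0
    for _ in range(n_samples):
        c = complex(rng.uniform(-2.0, 0.5), rng.uniform(-1.25, 1.25))
        if in_mandelbrot(c):
            hits += 1
    return hits


def estimate_area(n_workers=4, samples_per_worker=5000):
    """Farm out the sampling, then combine one integer per worker."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        hits = list(pool.map(count_hits,
                             range(n_workers),
                             [samples_per_worker] * n_workers))
    box_area = 2.5 * 2.5  # area of the sampling box
    return box_area * sum(hits) / (n_workers * samples_per_worker)


area = estimate_area()
print(f"Estimated Mandelbrot area: {area:.3f}")
```

Python threads will not actually make this faster; the point is only that the workers never talk to each other until the final sum, which is why slow interconnects hurt this kind of job far less than they hurt something like LINPACK.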
If you can take advantage of it, that would put this cluster at #250 in the Top 500 [top500.org] list of supercomputers. In fact, it is just a tick behind an IBM NetFinity cluster with 512x733MHz Pentium IIIs. Not bad for 66x1GHz G4s.
No, it is not. The Top500 ranking is based on *actual* parallel performance in *DOUBLE PRECISION* linpack.
The _theoretical_ peak performance of 66*1 GHz G4 boxes in double precision floating point is 66 Gflops. In practice the G4 has large scheduling problems with the normal floating-point unit, so I would be surprised if it could even achieve 30 Gflops. And ethernet is not going to scale very well for LINPACK. The real performance of parallel LINPACK on this machine would probably be on the order of 10 Gflops.
The Xserve is a nice box, and Altivec is cool for some applications, but real scientific applications are VERY different from a single precision fractal demo.
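The peak figure above is just CPUs x clock x floating-point results per cycle. A small sketch of that arithmetic, assuming (as the comment does) one double-precision result per cycle per G4; the function and variable names are made up for illustration:

```python
def peak_gflops(n_cpus, clock_ghz, flops_per_cycle):
    """Theoretical peak in Gflops: every CPU retires flops_per_cycle each cycle."""
    return n_cpus * clock_ghz * flops_per_cycle


N_CPUS = 66        # 33 dual-processor Xserves
CLOCK_GHZ = 1.0    # 1 GHz G4

# Assumption taken from the comment: 1 double-precision result per cycle.
scalar_peak = peak_gflops(N_CPUS, CLOCK_GHZ, 1)

# Results per cycle per CPU implied by the reported 217 Gflops demo figure.
implied_flops_per_cycle = 217 / (N_CPUS * CLOCK_GHZ)

print(f"Scalar double-precision peak: {scalar_peak:.0f} Gflops")
print(f"Implied by 217 Gflops: {implied_flops_per_cycle:.2f} flops/cycle/CPU")
```

Under that one-per-cycle assumption, the reported 217 Gflops works out to more than 3 results per cycle per CPU, which a scalar unit could not deliver and which fits a single-precision AltiVec demo rather than double-precision LINPACK.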
The G4 has 3 parallel floating point execution units, so the theoretical peak is actually 198 GFLOPS. Also, the LINPACK performance number from beowulf-style clusters is derived from the aggregate total performance of each node. This puts true single-image systems at a disadvantage, since real applications are quite a bit more latency-sensitive and interconnect-bandwidth-intensive.
<Amanda`> I just went out to the parking lot in my bathrobe to exchange warez CDs.
Actually...
33 machines x ($3,999 + $200 for 1.5 GB of aftermarket DDR) = $138,567
$138,567 for 217 gigaflops = $638.56/gigaflop
I hate Grammar Nazi's
No, I don't know, but my uninformed opinion would be that there's no measurable difference in latency between gigabit and 100BASE-T. The vast majority of the latency happens in the TCP stack inside the computer and in the NIC, as packets are generated and whatnot. Actual transmission latency will be so tiny as to make no meaningful difference.
I write in my journal
You don't know what you're talking about. The G4 does NOT have three floating-point units. (hint: an integer unit doesn't do floating-point)
If you don't believe me, you might at least believe Motorola
Or, check out a summary.
LLNL Linux NetworX/Quadrics supercluster, if built out of Penguin Computing 1U Relion 140s:
$4,747,392 offering 11.2 Teraflops...
$423.87/Gigaflop...
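The two price-per-gigaflop figures in this thread are easy to re-derive. A quick sketch using only the prices and performance numbers quoted in the comments (the helper name is hypothetical, and the prices themselves are the posters' estimates, not verified quotes):

```python
def dollars_per_gflop(total_cost_usd, gflops):
    """Cost efficiency: total hardware cost divided by sustained or peak Gflops."""
    return total_cost_usd / gflops


# Xserve cluster: 33 machines at $3,999 plus $200 of aftermarket RAM each,
# measured at 217 Gflops on the fractal demo.
xserve_cost = 33 * (3999 + 200)
xserve_ratio = dollars_per_gflop(xserve_cost, 217)

# LLNL-style cluster as priced in the comment: $4,747,392 for 11.2 Tflops.
llnl_cost = 4_747_392
llnl_ratio = dollars_per_gflop(llnl_cost, 11_200)

print(f"Xserve cluster: ${xserve_cost:,} -> ${xserve_ratio:.2f}/Gflop")
print(f"LLNL cluster:   ${llnl_cost:,} -> ${llnl_ratio:.2f}/Gflop")
```

Keep in mind the two Gflops numbers are not measured the same way (one is a single-precision demo result, the other is the figure the poster quotes for the whole machine), so the ratio is only a rough comparison.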
I hate Grammar Nazi's
The theoretical peak performance for 33 XServes in the test done here was actually 495 GFLOPS, BTW. I don't know what the theoretical performance of double precision on Altivec is, though. LINPACK is all linear algebra (IIRC), so it would see some benefit.
I will admit that there are plenty of applications where the G4 is not the best processor available. I for one will certainly be happy to see the IBM PPC 970, but you shouldn't discount the XServe until the test is actually run.
--
The internet is the greatest source of biased information in the history of mankind.
Actually I'm using the Apple libraries a lot, they include double precision, but the double precision routines do NOT use the Altivec unit.
It is simply not possible - the Altivec unit doesn't have any instructions that can handle double precision, and emulating it with single precision would be an order of magnitude slower than doing it in the normal FPU. This is exactly why Intel introduced SSE2, which does double precision.
The latency question is a good one. I'd say the answer to this lies in the driver for the NIC. I've written an IOKit ethernet driver and experienced pretty decent performance at 100 Mb. The system is processing packets as incoming data causes interrupts in the system.
However, I think the interrupt overhead for a 1000Mb link would be so high as to bring the machine to a screeching halt (okay, slow it down perceptibly). What a lot of driver writers do for gigabit links is to move their driver into polling mode. They essentially set a timer to go off every X milliseconds and process all the packets that have been copied into memory during that timeframe.
This gives a lower bound on the latency. A packet will always take X milliseconds to be noticed and processed by the system. Interrupt overhead stays low, but packet latency goes up a smidge.
It's a good trade-off. I would bet that on a saturated link, packet latency at gigabit speeds is equivalent to or WORSE than 100Mb. I might have to test that out...
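A rough way to check the polling side of this without any hardware: a toy simulation, assuming packets arrive at uniformly random times and the driver only looks at them on a fixed timer tick. The interval, packet count and names are hypothetical, and it models only the wait for the next poll, not TCP stack or NIC processing time.

```python
import math
import random


def polling_delays(arrivals_ms, poll_interval_ms):
    """For each arrival time, return how long the packet waits for the next poll tick."""
    delays = []
    for t in arrivals_ms:
        next_poll = math.ceil(t / poll_interval_ms) * poll_interval_ms
        delays.append(next_poll - t)
    return delays


POLL_INTERVAL_MS = 1.0          # the "X milliseconds" timer, assumed value
rng = random.Random(0)          # fixed seed so runs are repeatable

# 100,000 packets arriving at random times over one simulated second.
arrivals = [rng.uniform(0.0, 1000.0) for _ in range(100_000)]
delays = polling_delays(arrivals, POLL_INTERVAL_MS)

mean_delay = sum(delays) / len(delays)
max_delay = max(delays)

print(f"Mean wait for next poll: {mean_delay:.3f} ms")
print(f"Worst wait for next poll: {max_delay:.3f} ms")
```

In this model a packet waits anywhere from 0 to X ms for the next tick, about X/2 on average, so X is the worst case that polling adds on top of whatever the rest of the stack costs.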
cr
As pointed out on As the Apple Turns, the difference is that while the PowerMacs reach higher performance levels (233 vs 217 Gflops), they take up a whole heap more space than a single rack of 42 1U servers...
Pessimism of the intellect, optimism of the will! - Antonio Gramsci.
I thought a lot of gig-e cards implemented a good deal of packet processing in hardware to deal with this very problem. Am I mistaken? I remember that the first PCI gig-e card I ever saw was installed in an SGI Origin, and when running full-out it pegged an entire CPU with interrupt handlers. Later versions of the card-- or perhaps an entirely different card, but sold by SGI and used with the same Origin servers-- had hardly any interrupt activity at all, even when moving data at rates exceeding 50 MB/s.
I write in my journal
Indeed, I know of an Xserve cluster significantly larger than this one at a certain nearby university; they just aren't up to full capacity yet. But no doubt that news will come too...once they benchmark and spank the JPL cluster :)