tmattox · Slashdot Mirror

**Supercomputer** price/performance on Big Mac Benchmark Drops to 7.4 TFlops · 2003-10-23 03:18 · Score: 1

No, I never said anything about a linear scale. I've been doing research in this field for longer than I care to admit, and scaling parallel computers to larger and larger numbers of processors is always a hard thing to do. That is the focus of my research in Sparse FNN technology, to reduce the extra cost of increasing the size of the parallel machine, by making the network cost increase more slowly.

Although KASY0 may be considered "small" by some standards, it definitely merits the definition of "supercomputer". Its theoretical double-precision (64-bit) peak is over half a TFLOPS, and over a full TFLOPS for single precision (32-bit), and it uses about 13 kilowatts of electrical power.

Saying that various single processor commodity electronic gadgets get better price/performance is meaningless. The slashdot subject line was too short to add the word supercomputer earlier... it was implied by context.

But again, not to detract from the VT achievement, Big Mac is very impressive. I anxiously await more details on the fault tolerance software they are using, as well as the network topology with the 96-port Infiniband switches.

The high percentage of peak Linpack performance of Big Mac on just a subset of its nodes tells me more about their network topology than anything else. The nature of the Linpack benchmark is that it scales very well if each node has enough memory (its operation count scales as the cube of the problem size, while its memory references and communications scale as the square). The fact that its efficiency dropped so much on the full machine indicates to me that they have some networking bottleneck between switches. It also means that they have a VERY nicely tuned matrix multiply core for within a single CPU or node. Looking at the typical percentage-of-peak numbers on single CPUs from Automatically Tuned Linear Algebra Software (ATLAS), it's difficult to get over 80% of peak on just a single CPU.
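To make the cube-versus-square point concrete, here is a rough sketch. It assumes the commonly quoted HPL operation count of 2/3·n³ + 2·n² for an n×n system and counts the n² matrix entries as the "words" that must be stored and moved; the function names are just for illustration.

```python
def hpl_flops(n: int) -> float:
    """Commonly quoted HPL operation count for a dense n x n solve."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def matrix_words(n: int) -> int:
    """Number of matrix entries, a stand-in for memory and communication volume."""
    return n * n


def flops_per_word(n: int) -> float:
    """Arithmetic done per matrix entry; grows linearly with n."""
    return hpl_flops(n) / matrix_words(n)


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(f"n={n:>7}: {flops_per_word(n):>10.1f} flops per matrix entry")
```

The ratio grows in proportion to n, so a larger problem (which needs more memory per node) leaves more arithmetic to hide each byte of network traffic behind.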

So, again, the VT machine is VERY impressive.

Re:Important items of note on Big Mac Benchmark Drops to 7.4 TFlops · 2003-10-22 15:55 · Score: 2, Informative

I have yet to find a satisfactory description of the network topology they are using. The specs on the Infiniband switches they are using are quite impressive for latency and bandwidth numbers, but without knowing how they are interconnected, it's hard to say whether it's latency or maybe bisection-bandwidth issues limiting their efficiency. The early report of 80% efficiency on 128 CPUs (or was it 128 nodes?) would seem to indicate the problem is with the switch fabric in some way. With ~1100 nodes, communications have to cross multiple switches in any traditional network topology, resulting in higher latency, and possibly bandwidth bottlenecks.

I saw some indication that they were using a Fat-Tree topology, which would eliminate any bandwidth bottlenecks between switches, but the number of switches used didn't seem large enough for a fat-tree. But again, VT just hasn't, as of the last time I looked, released enough information about the cluster to tell.
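For a feel of what "large enough for a fat-tree" means, here is a back-of-the-envelope sketch. It assumes a two-level full-bisection fat tree in which every leaf switch uses half its ports for nodes and half for uplinks; that layout is my assumption for illustration, not VT's published design.

```python
import math


def two_level_fat_tree_switches(nodes: int, ports: int) -> tuple[int, int]:
    """Return (leaf, spine) switch counts for a full-bisection two-level fat tree.

    Assumes each leaf dedicates ports // 2 ports to nodes and the rest to uplinks.
    """
    down = ports // 2
    leaves = math.ceil(nodes / down)
    if leaves > ports:
        raise ValueError("too many nodes for two levels with this port count")
    # Every uplink must land on a spine port.
    spines = math.ceil(leaves * down / ports)
    return leaves, spines


if __name__ == "__main__":
    leaves, spines = two_level_fat_tree_switches(1100, 96)
    print(f"{leaves} leaf + {spines} spine = {leaves + spines} switches")
```

Under those assumptions, ~1100 nodes on 96-port switches would want on the order of 35 switches; a noticeably smaller count would suggest oversubscribed uplinks or a different topology.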

BTW - My thesis work on Flat Neighborhood Networks (FNNs) used in the KLAT2 and KASY0 supercomputers is finding better ways to interconnect the nodes, given a particular set of network components.
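As a minimal sketch of the property an FNN is built around, assuming the usual definition that every pair of nodes shares at least one switch (so any two nodes are a single switch hop apart): the wiring below is a made-up six-node example, not the KLAT2 or KASY0 layout.

```python
from itertools import combinations


def is_flat_neighborhood(node_switches: dict[str, set[int]]) -> bool:
    """True if every pair of nodes has at least one switch in common."""
    return all(
        node_switches[a] & node_switches[b]
        for a, b in combinations(node_switches, 2)
    )


# Hypothetical wiring: 6 nodes, 2 NICs each, three 4-port switches (0, 1, 2).
wiring = {
    "A": {0, 1}, "B": {0, 1},
    "C": {0, 2}, "D": {0, 2},
    "E": {1, 2}, "F": {1, 2},
}

if __name__ == "__main__":
    print(is_flat_neighborhood(wiring))
```

Six nodes would not fit on any one 4-port switch, yet with two NICs per node every pair still meets on a common switch. Finding such wirings for larger node counts and a given set of switches is the hard part.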

It's a good price/performance, but not best. on Big Mac Benchmark Drops to 7.4 TFlops · 2003-10-22 15:31 · Score: 3, Interesting

I guess the original submitter didn't see the slashdot article from August 23 about our KASY0 supercomputer breaking the $100 per GFLOPS barrier.

KASY0 achieved 187.3 GFLOPS on the 64-bit floating point version of HPL, the same benchmark used on "Big Mac". While "Big Mac" is about 40 times faster on that benchmark, it is about 130 times the cost of KASY0 (~$40K vs ~$5200K). Considering the size difference, "Big Mac" is VERY impressive, but it can't claim to be the best price/performance supercomputer on the HPL benchmark.

Note: KASY0 gets 482.6 GFLOPS (0.48 TFLOPS) on a 32-bit precision version of Linpack, satisfying our under $100 per GFLOPS claim.
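The arithmetic behind those claims, as a quick sketch using only the approximate figures quoted in this thread (~$40K and ~$5200K for cost, and the 7.4 TFLOPS from the story title for Big Mac):

```python
def dollars_per_gflops(cost_dollars: float, gflops: float) -> float:
    """Price/performance: lower is better."""
    return cost_dollars / gflops


KASY0_COST = 40_000        # approximate, from the comment above
BIG_MAC_COST = 5_200_000   # approximate, from the comment above

kasy0_64 = dollars_per_gflops(KASY0_COST, 187.3)      # 64-bit HPL
kasy0_32 = dollars_per_gflops(KASY0_COST, 482.6)      # 32-bit Linpack
big_mac_64 = dollars_per_gflops(BIG_MAC_COST, 7400.0)  # 7.4 TFLOPS, story title

if __name__ == "__main__":
    print(f"KASY0 64-bit:   ${kasy0_64:.0f} per GFLOPS")
    print(f"KASY0 32-bit:   ${kasy0_32:.0f} per GFLOPS")
    print(f"Big Mac 64-bit: ${big_mac_64:.0f} per GFLOPS")
    print(f"Speed ratio: {7400.0 / 187.3:.1f}x, cost ratio: {BIG_MAC_COST / KASY0_COST:.0f}x")
```

With these inputs, the 32-bit KASY0 figure lands under $100 per GFLOPS, the 64-bit figure does not, and both are well below the Big Mac figure.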

Regardless, Virginia Tech's "Big Mac" is a very impressive machine. My congratulations to them!

Re:Macs ? on Virginia Tech to Build Top 5 Supercomputer? · 2003-08-31 07:35 · Score: 2, Interesting

I am one of the designers of KLAT2 and KASY0, and the guy who ran the Linpack benchmarks on both. Over 3 years ago when we submitted our results for KLAT2 to the top500 list, there was no public indication that 64-bit floating point was required. It took them a while, but the top500 website now has a FAQ that indicates "full precision" is required, and they interpret that as 64-bit for most machines. FYI, 32-bit FLOPs are useful in many situations, and machines had been on the top500 list that had used 32-bit FLOPs. You might take a look at our KASY0 FAQ on GFLOPS. As a means to rank the top500, I think it is quite legitimate to require 64-bit FLOPS, but that doesn't make it "illegal" to use 32-bit Linpack FLOPS for other comparisons.

As for the G5, it won't need AltiVec to get good Linpack numbers due to its fused multiply-add capability in its dual floating point pipes. That's 4 FLOPs per clock peak! I hope VT was able to get Apple to leave out, and not charge for, the components not needed in a cluster node. The PCI-X slots in the G5 should allow VT to better use a high-speed cluster network technology. Commodity x86 boxes tend to only have 32-bit 33MHz PCI, limiting the usable link bandwidth between nodes to under a gigabit per second. For 64-bit Linpack GFLOPS per dollar, a cluster of G5s could be competitive. I look forward to seeing their results, and any similar work using the upcoming Athlon 64.
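A rough sketch of the two numbers in play here. The 2.0 GHz clock and the 64-bit/133 MHz PCI-X variant are assumptions for illustration only; the comment above gives just "4 FLOPs per clock" and "32-bit 33MHz PCI".

```python
def peak_gflops(flops_per_clock: int, clock_ghz: float) -> float:
    """Theoretical peak for one CPU: FLOPs per clock times clock rate."""
    return flops_per_clock * clock_ghz


def bus_peak_mbps(width_bits: int, clock_mhz: int) -> int:
    """Raw parallel-bus peak in Mb/s, before protocol overhead or sharing."""
    return width_bits * clock_mhz


if __name__ == "__main__":
    # Assumed 2.0 GHz clock; 4 FLOPs/clock = 2 pipes x fused multiply-add.
    print(f"Peak per CPU: {peak_gflops(4, 2.0):.1f} GFLOPS")
    print(f"32-bit/33 MHz PCI:    {bus_peak_mbps(32, 33)} Mb/s raw")
    # Assumed PCI-X variant: 64-bit at 133 MHz.
    print(f"64-bit/133 MHz PCI-X: {bus_peak_mbps(64, 133)} Mb/s raw")
```

The raw figure for plain PCI is only barely above 1 Gb/s, and the bus is shared and carries protocol overhead, which is why the usable node-to-node bandwidth ends up under a gigabit per second.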

Standard Engineering Practice from the 1980's on Cleartype In Depth · 2000-06-13 02:45 · Score: 2

When Microsoft mentioned this around two years ago, my advisor posted his application of a standard engineering practice (i.e., non-patentable) technique for making use of LCD subpixels. He posted it on the WWW simply to ensure that no company can patent these obvious, but perhaps useful, methods.

See Color LCD Panel Subpixel Rendering by Prof. Hank Dietz, December 15, 1998.
--

Re:KLAT2, a more powerful "cheap" cluster on FreeBSD Cluster At Purdue · 2000-06-06 08:21 · Score: 1

We are releasing the technology into the Public Domain as soon as we can (i.e. when the hacked code has at least some clarity/documentation). So, yes, you can apply the concepts from KLAT2 to most any size cluster. See my other comment for a little more info.
--

Re:KLAT2, any one? on FreeBSD Cluster At Purdue · 2000-06-06 08:12 · Score: 1

The technology used in KLAT2 scales up and down in size. The Flat Neighborhood Network architecture can be scaled down to use several eight-port 100 Mb/s Ethernet switches (about $80 each) to make a very formidable network for a small cluster on the cheap. Check out our new CGI for designing your own FNN. However, for their full-up cluster with 27 nodes, it is more practical to use 16/24 port switches...

As for using the 3DNow! stuff, their K6-2's can have some real punch if they are willing to code for it... Check out our SWAR - SIMD Within A Register compiler technology for doing just that. Actually, the Ph.D. student doing most of the work on SWAR is AT Purdue.
--

KLAT2, a more powerful "cheap" cluster on FreeBSD Cluster At Purdue · 2000-06-06 07:24 · Score: 1

Our Athlon-based KLAT2 Beowulf cluster at the University of Kentucky achieved over 64 GFLOPS on LINPACK for only $41K using 3DNow! instructions. The FreeBSD Cluster at Purdue doesn't even mention ANY benchmarks for performance. I'm a Purdue Alum, so I think this is great that they are getting slashdot coverage for an inexpensive cluster. However, when we submitted KLAT2's (Kentucky Linux Athlon Testbed 2) story to slashdot last month, which in many respects is much more "news for nerds", it got passed over. Ah well, that's the way of slashdot.
--

Comments · 8