Slashdot Mirror


Supercomputer Breaks the $100/GFLOPS Barrier

Hank Dietz writes "At the University of Kentucky, KASY0, a Linux cluster of 128+4 AMD Athlon XP 2600+ nodes, achieved 471 GFLOPS on 32-bit HPL. At a cost of less than $39,500, that makes it the first supercomputer to break $100/GFLOPS. It also is the new record holder for POV-Ray 3.5 render speed. The reason this 'Beowulf' is so cost-effective is a new network architecture that achieves high performance using standard hardware: the asymmetric Sparse Flat Neighborhood Network (SFNN)." Because this was a university project, KASY0 was assembled entirely by university students, which, besides being a source of cheap labor, is also a good way to get a lot of students involved in a great project.
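
As a quick check of the headline claim, here is a minimal sketch that uses only the two figures quoted above. The cost is treated as an upper bound, since the summary says "less than $39,500"; the variable names are illustrative.

```python
# Figures quoted in the summary. The cost is an upper bound
# ("less than $39,500"), so the result is an upper bound too.
cost_usd = 39_500
hpl_gflops = 471

cost_per_gflops = cost_usd / hpl_gflops
print(f"at most ${cost_per_gflops:.2f} per GFLOPS")
```

That works out to roughly $84 per GFLOPS, comfortably under the $100 mark.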

6 of 281 comments

  1. Re:Asymmetric Sparse Flat Neighborhood Network by flymolo · · Score: 5, Informative

    Due to "creative" (computed) wiring, if all switches are functioning, no node is more than one hop from any other node. This requires a routing table written for each PC. The wiring could be used for redundancy, but it is being used to minimize latency and collisions, which are both killers in clusters.
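
    The one-hop property can be checked mechanically. The sketch below is not the KASY0 design software; it uses a hypothetical toy wiring (6 nodes with 2 NICs each, three 4-port switches) and made-up function names, and it only tests whether every pair of nodes shares at least one switch and what a per-node routing table might look like.

```python
from itertools import combinations

# Hypothetical toy wiring: switch name -> set of node IDs plugged into it.
# Each node appears on two switches, i.e. it has two NICs.
wiring = {
    "A": {0, 1, 2, 3},
    "B": {2, 3, 4, 5},
    "C": {0, 1, 4, 5},
}

def shared_switches(wiring, a, b):
    """Switches that both node a and node b are plugged into."""
    return sorted(name for name, ports in wiring.items()
                  if a in ports and b in ports)

def is_flat_neighborhood(wiring):
    """True if every pair of nodes can talk through a single switch."""
    nodes = sorted(set().union(*wiring.values()))
    return all(shared_switches(wiring, a, b)
               for a, b in combinations(nodes, 2))

def routing_table(wiring, node):
    """Per-node table: which switch to use to reach each peer.

    Assumes the wiring is already a flat neighborhood; otherwise the
    lookup for an unreachable peer raises IndexError.
    """
    nodes = sorted(set().union(*wiring.values()))
    return {peer: shared_switches(wiring, node, peer)[0]
            for peer in nodes if peer != node}

print(is_flat_neighborhood(wiring))
print(routing_table(wiring, 0))
```

    Dropping any one switch from this toy wiring leaves some pair of nodes with no switch in common, which is why the property holds only "if all switches are functioning".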

    --
    "Sometimes it's hard to tell the dancer from the dance." --Corwin Of Amber in CoC
  2. Cooling by bengoerz · · Score: 4, Informative

    I toured the previous cluster these guys did (KLAT2) and was very impressed. However, using AMD Athlon Thunderbirds last time, it did get quite hot. I remember standing by the cluster looking at all the wiring and being bombarded by an overhead cooling vent. I'm also assuming that these cooling issues are the reason that each case has two blow-holes. I'd also like to see these guys post in-depth specs of each machine. Being a hardware nut, I'd like to see how they got so many machines so cheaply, and maybe even what vendor they used. As I remember, they worked REALLY hard on their last cluster to keep costs to an absolute minimum.

  3. Re:Asymmetric Sparse Flat Neighborhood Network by Rich+Dougherty · · Score: 3, Informative

    Here's a quote from the site:

    Does The World Need Yet Another Network Topology?

    One would think (well, we did ;-) that the latest round of Gb/s network hardware would have made the design of a high-bandwidth cluster network a trivial exercise. However, that isn't the case when the prices are considered:

    • When we invented FNNs in 2000, the cheapest of the Gb/s NICs available were PCI Ethernet cards priced under $300 each; now they are $50-$100. Prices have continued to drop. Prices on custom high-performance NICs (e.g., Myrinet) start at close to $1000 and have not been going down.
    • In late 2002, 48-port 100Mb/s Fast Ethernet switches have dropped to less than $25/port. Gigabit Ethernet switches are starting to follow the same trend, with $100/port pricing in sight for switches up to about 48 ports. Wider switches with the needed performance are unlikely to become cheap in the near future. Thus, it would be necessary to build a hierarchical switch fabric using multiple layers of switches, yielding higher cost, higher latency, and significantly lower bisection bandwidth (unless you use a "fat tree" or other scheme, which adds still more expense -- especially because cheap layer 2 Ethernet switches don't support those topologies).

    In summary, the cost of the "obvious" Gb/s network for KLAT2's 66 single-processor nodes was OVER 30 TIMES the cost of the network we built for KLAT2. In fact, to match KLAT2's bisection bandwidth, a network built using Gb/s hardware would have cost even more. Gigabit Ethernet is getting cheaper, but obvious topologies just are not competitive with FNN performance. So, if you've got tons of money that you have to spend immediately, you can impress your friends by buying expensive custom network hardware that can use an obvious topology and still be competitive with FNN performance. Otherwise, read on.... ;-)

  4. Wrong by imsabbel · · Score: 3, Informative

    In reality, Beowulf clusters are good for only a subset of supercomputing tasks, and the "real" supercomputers are still best at general-purpose supercomputing.

    If you can parallelize your application well enough, Beowulf rules, but if you need a lot of node-to-node communication, the network cost quickly surpasses the CPU cost of the system.

    --
    HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
  5. Re:University students by panda · · Score: 4, Informative

    Having worked there, and knowing what Hank Dietz and his students are doing, I can tell you that it is different from just slapping PCs together, stringing wire between them and installing clustering software.

    Dietz specializes in networking and all the wiring that you see in the photos is charted out by custom software that he's written just for this purpose.

    He works in the realm of optimizing communications among the nodes to avoid network latency and so on. If you read the POV-Ray benchmarks, you'll notice that the author comments that several clusters' CPUs spend most of their time idle due to network latency. Dietz is researching the best ways to eliminate much of that latency so that the CPUs in the cluster can spend more of their time crunching data rather than just throwing off heat. To my knowledge, he is succeeding at this, and doing so better than most other researchers in the field.

    As for what his students learned from this, I don't know exactly which students helped him on this. For KLAT2, there were several undergrad volunteers who helped with wiring and assembly, mostly from the campus Linux Users' Group. I know his grad students and research assistants are learning a lot about how clustering and network tech work, and a couple are doing their Ph.D. dissertations in this very subfield of E.E.

    --
    Just be sure to wear the gold uniform when you beam down -- you know what happens when you wear the red one.
  6. Re:why not DSP? by SmackCrackandPot · · Score: 3, Informative

    Actually, they do, but they are referred to as vector processors rather than DSPs. Probably the most famous and the first was the Cray supercomputer. And there was also the INMOS "Transputer".

    DSPs are optimised to handle streamed data of a particular maximum size (e.g., 4-element floating-point vectors). That is useful for image processing (red, green, blue, alpha) and 3D graphics (XYZW), but if you're modelling something like ocean currents or global weather, every data element is more than likely going to have more than four variables (e.g., temperature, humidity, velocity, pressure, salinity, ground temperature), so you may not get full optimisation.

    Plus, you also need a means of getting all these processors to talk to each other. DSPs are nearly always optimised to operate in single pipelines, so they don't need much communication support (e.g., Sony PlayStation 2). However, if you're designing a supercomputer system, the major bottleneck is the communication between processors (network topology). Some applications might only need adjacent processors to talk to each other (global weather simulation usually represents the atmosphere as a single large block of air, with sub-blocks assigned to separate processors). Other applications might assign individual processors to different tasks, which complete at different rates (e.g., the Mandelbrot set). A configurable network architecture allows the system to be used for many more different applications.
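
    To make the Mandelbrot point concrete, here is a rough sketch under assumed parameters (a 60x40 grid, 200 iterations, 4 workers, and hypothetical function names). It measures how much work each image row costs, then compares handing each worker a fixed block of rows with letting whichever worker is least loaded take the next row.

```python
def mandel_iters(c, max_iter=200):
    """Iterations of z -> z*z + c before |z| exceeds 2 (capped at max_iter)."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter

def row_costs(width=60, height=40, max_iter=200):
    """Total iteration count for each row of the region [-2, 1] x [-1.2, 1.2]."""
    costs = []
    for j in range(height):
        y = -1.2 + 2.4 * j / (height - 1)
        row = sum(mandel_iters(complex(-2.0 + 3.0 * i / (width - 1), y), max_iter)
                  for i in range(width))
        costs.append(row)
    return costs

def static_block_loads(costs, workers):
    """Each worker gets one contiguous block of rows, decided up front."""
    size = len(costs) // workers
    return [sum(costs[w * size:(w + 1) * size]) for w in range(workers)]

def dynamic_loads(costs, workers):
    """Rows are handed out one at a time to the least-loaded worker."""
    loads = [0] * workers
    for cost in costs:
        loads[loads.index(min(loads))] += cost
    return loads

costs = row_costs()
static = static_block_loads(costs, 4)
dynamic = dynamic_loads(costs, 4)
print("static :", static, "spread", max(static) - min(static))
print("dynamic:", dynamic, "spread", max(dynamic) - min(dynamic))
```

    Rows that cross the middle of the set cost far more than rows near the edges, so the fixed split leaves some workers idle while others are still busy. The dynamic scheme evens that out, but it relies on workers asking for new work over the network, which is part of why the communication pattern differs from the nearest-neighbour weather case.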