Slashdot Mirror


Supercomputer Breaks the $100/GFLOPS Barrier

Hank Dietz writes "At the University of Kentucky, KASY0, a Linux cluster of 128+4 AMD Athlon XP 2600+ nodes, achieved 471 GFLOPS on 32-bit HPL. At a cost of less than $39,500, that makes it the first supercomputer to break $100/GFLOPS. It also is the new record holder for POV-Ray 3.5 render speed. The reason this 'Beowulf' is so cost-effective is a new network architecture that achieves high performance using standard hardware: the asymmetric Sparse Flat Neighborhood Network (SFNN)." Because this was a university project, KASY0 was assembled entirely by university students. Besides being a source of cheap labor, that is a good way to get a lot of students involved in a great project.
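
For reference, the headline figure can be reproduced from the two numbers quoted in the summary. This is only a quick sketch: the cost is given as an upper bound ("less than $39,500"), so the result is an upper bound too, and the variable names are just illustrative.

```python
# Figures quoted in the story summary.
cost_usd = 39_500      # stated as "less than $39,500", so treat as an upper bound
hpl_gflops = 471       # 32-bit HPL result

usd_per_gflops = cost_usd / hpl_gflops
print(f"at most ${usd_per_gflops:.2f}/GFLOPS")
```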

15 of 281 comments

  1. To those who might not know... by qrash · · Score: 2, Informative

    gigaflop

    As a measure of computer speed, a gigaflop is a billion floating-point operations per second (FLOPS).

    --
    you may find the Higgs in this signature.
    1. Re:To those who might not know... by ant_slayer · · Score: 2, Informative

      If you're going to try to be informative, at least be accurate. There's no such thing as a "gigaflop". That would mean "Billions of Floating point Operations Per..." without the unit of time.

      It's a gigaflops (singular). The 's' is very important. It's how we know how long it takes to perform a billion floating point operations.

      It's like when people say "I had my engine up to 6000 rpms". What's an rpms? Is it a plural rpm? If so, what is pluralized? The acronym expansion yields "revolutions per minute", so would it be "revolutionses per minute", "revolutions per minutes", or "revolutions per minutes"s? None of 'em make sense. Technically, anything that revolves revolves at 6000 revolutions per some number of minutes... oi.

      The Earth rotates at 6000rpms... if the unit of time is in blocks of 8640000 minutes... This type of confusion is why the time unit in our units of speed is usually unity.

      -Josh O-

  2. Re:Asymmetric Sparse Flat Neighborhood Network by flymolo · · Score: 5, Informative

    Due to "creative" (computed) wiring, as long as all switches are functioning, no node is more than one switch hop from any other node. This requires a routing table written for each PC. The wiring could be used for redundancy, but here it is being used to minimize latency and collisions, both of which are killers in clusters.
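
    A toy sketch of that property, not the actual KASY0 wiring or the design software behind it: the six-node, three-switch layout and the function names below are made up for illustration. Each node has two NICs, each NIC plugs into a different switch, and the check is simply that every pair of nodes has a switch in common.

```python
from itertools import combinations

def is_flat_neighborhood(nic_switches):
    """True if every pair of nodes shares at least one switch (single hop)."""
    return all(
        nic_switches[a] & nic_switches[b]
        for a, b in combinations(nic_switches, 2)
    )

def routing_table(node, nic_switches):
    """Per-node table: which shared switch to use for each reachable peer."""
    mine = nic_switches[node]
    return {
        peer: min(mine & theirs)          # arbitrary but deterministic choice
        for peer, theirs in nic_switches.items()
        if peer != node and mine & theirs
    }

# Hypothetical layout: node -> set of switches its two NICs are plugged into.
toy = {
    0: {"A", "B"}, 1: {"A", "B"},
    2: {"A", "C"}, 3: {"A", "C"},
    4: {"B", "C"}, 5: {"B", "C"},
}

# Simulate switch "A" failing: some pairs lose their only shared switch.
without_a = {node: switches - {"A"} for node, switches in toy.items()}

print(is_flat_neighborhood(toy))        # every pair shares a switch
print(routing_table(0, toy))
print(is_flat_neighborhood(without_a))  # single-hop property no longer holds
```

    In the toy layout, node 0 reaches nodes 1 to 3 through switch A and nodes 4 and 5 through switch B, which is why each PC needs its own table.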

    --
    "Sometimes it's hard to tell the dancer from the dance." --Corwin Of Amber in CoC
  3. Cooling by bengoerz · · Score: 4, Informative

    I toured the previous cluster these guys did (KLAT2) and was very impressed. However, using AMD Athlon Thunderbirds last time, it did get quite hot. I remember standing by the cluster looking at all the wiring and being bombarded by an overhead cooling vent. I'm also assuming that these cooling issues are the reason that each case has two blow-holes. I'd also like to see these guys post in-depth specs of each machine. Being a hardware nut, I'd like to see how they got so many machines so cheap, and maybe even what vendor they used. As I remember, they worked REALLY hard on their last cluster to keep costs to an absolute minimum.

  4. Re:Please mod parent down by gorim · · Score: 2, Informative

    A PlayStation 2 costs $199. That information is in your local newspaper. Actually, sales peg it at $179 lately, my mistake. The PlayStation 2 has 2 vector processing units, each with 4-float-wide (128-bit) registers and capable of doing a multiply-add operation per clock cycle on whole registers, running at 300MHz independent of the main CPU, which still has its own scalar floating-point coprocessor. It handily does 5.5GFLOPS, and is well documented as such if you google around. Check out http://playstation2-linux.com/
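
    A back-of-the-envelope check using only the figures in this comment. One assumption is mine: a multiply-add is counted as two floating-point operations per lane. Under that assumption the two vector units alone come to a bit less than the quoted 5.5 GFLOPS, so the rest of that figure would have to come from the other units mentioned.

```python
# Figures taken from the comment above.
vector_units = 2
lanes_per_register = 4          # 128-bit register holding four 32-bit floats
clock_hz = 300e6                # 300 MHz as quoted

# ASSUMPTION: one multiply-add counts as 2 floating-point operations.
flops_per_lane_per_cycle = 2

vector_peak_gflops = (
    vector_units * lanes_per_register * flops_per_lane_per_cycle * clock_hz / 1e9
)

quoted_gflops = 5.5
price_usd = 199
usd_per_quoted_gflops = price_usd / quoted_gflops

print(f"vector units only: {vector_peak_gflops:.1f} GFLOPS peak")
print(f"${usd_per_quoted_gflops:.2f}/GFLOPS at the quoted 5.5 GFLOPS")
```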

  5. In cache maybe by msgmonkey · · Score: 2, Informative

    These numbers for microprocessors etc. mean nothing because they are usually referring to operations on data in cache. You'll find that real-life performance is 10-20x slower because that's how much slower accessing main memory is.

  6. Re:Asymmetric Sparse Flat Neighborhood Network by Rich+Dougherty · · Score: 3, Informative

    Here's a quote from the site:

    Does The World Need Yet Another Network Topology?

    One would think (well, we did ;-) that the latest round of Gb/s network hardware would have made the design of a high-bandwidth cluster network a trivial exercise. However, that isn't the case when the prices are considered:

    • When we invented FNNs in 2000, the cheapest of the Gb/s NICs available were PCI Ethernet cards priced under $300 each; now they are $50-$100. Prices have continued to drop. Prices on custom high-performance NICs (e.g., Myrinet) start at close to $1000 and have not been going down.
    • In late 2002, 48-port 100Mb/s Fast Ethernet switches have dropped to less than $25/port. Gigabit Ethernet switches are starting to follow the same trend, with $100/port pricing in sight for switches up to about 48 ports. Wider switches with the needed performance are unlikely to become cheap in the near future. Thus, it would be necessary to build a hierarchical switch fabric using multiple layers of switches, yielding higher cost, higher latency, and significantly lower bisection bandwidth (unless you use a "fat tree" or other scheme, which adds still more expense -- especially because cheap layer 2 Ethernet switches don't support those topologies).

    In summary, the cost of the "obvious" Gb/s network for KLAT2's 66 single-processor nodes was OVER 30 TIMES the cost of the network we built for KLAT2. In fact, to match KLAT2's bisection bandwidth, a network built using Gb/s hardware would have cost even more. Gigabit Ethernet is getting cheaper, but obvious topologies just are not competitive with FNN performance. So, if you've got tons of money that you have to spend immediately, you can impress your friends by buying expensive custom network hardware that can use an obvious topology and still be competitive with FNN performance. Otherwise, read on.... ;-)

  7. Wrong by imsabbel · · Score: 3, Informative

    In reality, Beowulf clusters are good for only a subset of supercomputing tasks, and the "real" supercomputers are still best at general-purpose supercomputing.

    If you can parallelize your application well enough, Beowulf rules, but if you need a lot of node-to-node communication, the network cost quickly surpasses the CPU cost of the system.

    --
    HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
  8. Re:Hmm Math? by r00zky · · Score: 2, Informative

    128+4...
    That's like 132 isn't it?


    From the FAQ:

    KASY0's configuration is:
    128 + 4 "cold spare" PC nodes, each containing:
    One AMD Athlon XP 2600+ (the 2.075GHz version)
    One 512MB PC2700 DDR SDRAM
    BioStar M7VIT Pro motherboard
    Two Linksys LNE100TX NICs
    Codegen 6042L case with 400W power supply
    18 BenQ SE0024 24-port Fast Ethernet switches
    405 Cat5 Fast Ethernet cables
    RedHat Linux 9.0, modified Warewulf 1.11


    So it's 128, the other 4 are spares!

    --
    I'm a chainsmokin' alcoholic sociopath, so-ci-o-path
  9. Re:Also I wonder by rusty0101 · · Score: 2, Informative

    Per the FAQ on the site, the supercomputer draws 210A. Power requirements work out to a yearly cost equivalent to the cost of the network equipment connecting the nodes.

    210A at 120Vac comes to 25.2kW (power = current x voltage). Triple that to allow for cooling (it takes approx. 2 watts of power to remove the heat generated by 1 watt of power usage) and you come to almost 76kW, i.e. about 76kWh for every hour it runs. Take a look at your utility bill to come up with the hourly cost for electricity while this thing is on.
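
    The same arithmetic as a small sketch. The 210A and 120V figures and the 2-watts-of-cooling-per-watt rule of thumb come from the text above; the electricity rate is a placeholder I made up, so substitute the number from your own bill.

```python
amps = 210
volts = 120
cooling_watts_per_watt = 2          # rule of thumb from the comment above

cluster_kw = amps * volts / 1000                       # electrical draw of the cluster
total_kw = cluster_kw * (1 + cooling_watts_per_watt)   # cluster plus cooling

def hourly_cost_usd(load_kw, usd_per_kwh):
    """Cost of running a constant load for one hour."""
    return load_kw * usd_per_kwh

# HYPOTHETICAL rate, for illustration only.
example_rate_usd_per_kwh = 0.05

print(f"cluster: {cluster_kw:.1f} kW, with cooling: {total_kw:.1f} kW")
print(f"hourly cost at the example rate: ${hourly_cost_usd(total_kw, example_rate_usd_per_kwh):.2f}")
```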

    The equipment does not have cooling isolated from the rest of the building. As a result the cooling costs will ultimately be absorbed by the operational cost of the building, and probably will not show up as a line item for the cost of this cluster.

    -Rusty

    p.s. more wire does not mean more juice, just more paths for signals to follow.

    --
    You never know...
  10. Re:University students by panda · · Score: 4, Informative

    Having worked there, and knowing what Hank Dietz and his students are doing, I can tell you that it is different from just slapping PCs together, stringing wire between them and installing clustering software.

    Dietz specializes in networking and all the wiring that you see in the photos is charted out by custom software that he's written just for this purpose.

    He works in the realm of optimizing communications among the nodes to avoid network latency and so on. If you read the POVRay benchmarks, you'll notice that the author comments that several clusters' CPUs spend most of their time idle due to network latency. Dietz is researching the best ways to eliminate much of that latency so that the CPUs in the cluster can spend more of their time crunching data rather than just throwing off heat. To my knowledge, he is succeeding at this and better than most other researchers in the field.

    As for what his students learned from this, I don't know exactly which students helped him on this. For KLAT2, there were several undergrad volunteers who helped with wiring and assembly, mostly from the campus Linux Users' Group. I know his grad students and research assistants are learning a lot about how clustering and network tech work, and a couple are doing their Ph.D. dissertations in this very subfield of E.E.

    --
    Just be sure to wear the gold uniform when you beam down -- you know what happens when you wear the red one.
  11. Re:Playstation2 at 5.5GFLOPS costs only $199 $40/G by Arker · · Score: 2, Informative

    Gah feel free to mod the previous version of this comment into oblivion, I hit submit accidentally.

    The numbers you're looking at are marketing numbers first off, and overly generous. Second you don't scale for free - you never get anything like 100 times the performance of a single box when you wire 100 together, for the same reason that you don't get twice the horsepower out of an engine twice the size.

    The previous price/performance champ was in fact a PS2 cluster, mentioned here, but this AMD cluster is roughly three times the performance for the dollar. You can check the stats with different assumptions on their FAQ page, particularly the section labeled 'Is KASY0 really the first supercomputer under $100/GFLOPS?'

    --
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-
    Friends don't let friends enable ecmascript.
  12. Re:cable management by HBI · · Score: 2, Informative

    I would consider something like this, bent into a circular pattern.

    We use this in one of my closets to provide a feed from the ceiling to a rack with a patch panel. We had moved the rack 4' to the left and we needed something to bring the ceiling stuff down to the new rack. We used fittings with tek screws and wire ties to hold it firmly in place and linked it into the drop ceiling framework as well for additional stability.

    I could see doing something like that here between the stainless shelving units. Other adequate solutions could be arrived at. Also, the snake tray probably cost us $150 tops.

    --
    HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
  13. Re:why not DSP? by SmackCrackandPot · · Score: 3, Informative

    Actually, they do, but they are referred to as vector processors rather than DSPs. Probably the most famous and the first was the Cray supercomputer. And there was also the INMOS "Transputer".

    DSPs are optimised to handle streamed data of a particular maximum size (e.g. 4-element floating-point vectors). That is useful for image processing (red, green, blue, alpha) and 3D graphics (XYZW), but if you're modelling something like ocean currents or global weather, every data element is more than likely going to have more than four variables (e.g. temperature, humidity, velocity, pressure, salinity, ground temperature), so you may not get full optimisation.

    Plus, you also need a means of getting all these processors to talk to each other. DSPs are nearly always optimised to operate in single pipelines, so they don't need much communication support (e.g. Sony PlayStation 2). However, if you're designing a supercomputer system, the major bottleneck is the communication between processors (network topology). Some applications might only need adjacent processors to talk to each other (global weather simulation usually represents the atmosphere as a single large block of air, with sub-blocks assigned to separate processors). Other applications might assign individual processors to different tasks, which complete at different rates (e.g. the Mandelbrot set). A configurable network architecture allows the system to be used for many more different applications.

  14. Re:Mckenzie Cluster, faster, cheaper per TFlop by afidel · · Score: 2, Informative

    $600,000 dollars / 1,200 Gflops = $500/Gflop. I think you misplaced a decimal in the $Cad->$US conversion =)

    --
    There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.