Supercomputer Breaks the $100/GFLOPS Barrier
Hank Dietz writes "At the University of Kentucky, KASY0, a Linux cluster of 128+4 AMD Athlon XP 2600+ nodes, achieved 471 GFLOPS on 32-bit HPL. At a cost of less than $39,500, that makes it the first supercomputer to break $100/GFLOPS. It is also the new record holder for POV-Ray 3.5 render speed.
The reason this 'Beowulf' is so cost-effective is a new network architecture that achieves high performance using standard hardware: the asymmetric Sparse Flat Neighborhood Network (SFNN)." Because this was a university project, KASY0 was assembled entirely by university students, which, besides being a source of cheap labor, is also a good way to get a lot of students involved in a great project.
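For anyone who wants to check the headline figure, here is a quick sketch of the arithmetic using only the two numbers quoted in the story. The cost is stated as "less than $39,500", so the result is an upper bound on dollars per GFLOPS.

```python
# Figures quoted in the story; variable names are just for illustration.
cost_usd = 39_500     # "less than $39,500", so treat this as an upper bound
hpl_gflops = 471      # 32-bit HPL result

usd_per_gflops = cost_usd / hpl_gflops
print(f"at most about ${usd_per_gflops:.2f} per GFLOPS")
print("under $100/GFLOPS:", usd_per_gflops < 100)
```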
gigaflop
As a measure of computer speed, a gigaflop is a billion floating-point operations per second (FLOPS).
you may find the Higgs in this signature.
Due to "creative" (computed) wiring, as long as all the switches are functioning, every node is at most one switch hop from every other node. This requires a routing table written for each PC. The extra paths could be used for redundancy, but here they are being used to minimize latency and collisions, both of which are killers in clusters.
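A toy sketch of the idea, in Python. The wiring below is made up for illustration (6 nodes, 2 NICs each, three 4-port switches) and is not KASY0's actual layout, and the function names are hypothetical. It checks the "every pair of nodes shares a switch" property and builds the kind of per-node routing table the comment describes.

```python
from itertools import combinations

# Hypothetical toy wiring, NOT the real KASY0 layout:
# each switch maps to the set of node IDs plugged into it.
switch_ports = {
    "sw0": {0, 1, 2, 3},
    "sw1": {0, 1, 4, 5},
    "sw2": {2, 3, 4, 5},
}

def shared_switches(a, b, wiring):
    """Return the sorted list of switches that nodes a and b both plug into."""
    return sorted(sw for sw, nodes in wiring.items() if a in nodes and b in nodes)

def is_flat_neighborhood(wiring):
    """True if every pair of nodes shares at least one switch (one hop apart)."""
    nodes = sorted(set().union(*wiring.values()))
    return all(shared_switches(a, b, wiring) for a, b in combinations(nodes, 2))

def routing_table(node, wiring):
    """Per-node table: destination -> one switch that reaches it in a single hop.

    Raises IndexError if some destination shares no switch with this node.
    """
    nodes = sorted(set().union(*wiring.values()))
    return {dst: shared_switches(node, dst, wiring)[0] for dst in nodes if dst != node}

print(is_flat_neighborhood(switch_ports))
print(routing_table(0, switch_ports))

# Losing a switch breaks the one-hop property, as the comment notes.
degraded = {sw: nodes for sw, nodes in switch_ports.items() if sw != "sw2"}
print(is_flat_neighborhood(degraded))
```

The real design problem is the reverse of this check: finding a wiring that satisfies it for 128 nodes with a limited number of NICs and switch ports, which is presumably why the wiring has to be computed.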
"Sometimes it's hard to tell the dancer from the dance." --Corwin Of Amber in CoC
I toured the previous cluster these guys did (KLAT2) and was very impressed. However, using AMD Athlon Thunderbirds last time, it did get quite hot. I remember standing by the cluster looking at all the wiring and being bombarded by an overhead cooling vent. I'm also assuming that these cooling issues are the reason that each case has two blow-holes. I'd also like to see these guys post in-depth specs of each machine. Being a hardware nut, I'd like to see how they got so many machines so cheaply, and maybe even what vendor they used. As I remember, they worked REALLY hard on their last cluster to keep costs to an absolute minimum.
A PlayStation 2 costs $199. That information is in your local newspaper. Actually, sales peg it at $179 lately, my mistake. The PlayStation 2 has 2 vector processing units, each with registers 4 floats wide (128-bit) and capable of a multiply-add operation per clock cycle on whole registers, running at 300 MHz independent of the main CPU, which still has its own scalar floating-point coprocessor. It handily does 5.5 GFLOPS, and is well documented as such if you google around. Check out http://playstation2-linux.com/
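A rough sketch of where a figure like that comes from, using only the numbers in this comment. Counting a multiply-add as 2 floating-point operations is an assumption (it is the usual convention), and this is a theoretical peak for the vector units alone, not a measured result.

```python
# Figures from the comment; names are illustrative only.
vector_units = 2
floats_per_register = 4          # 128-bit register = 4 x 32-bit floats
flops_per_multiply_add = 2       # ASSUMPTION: one multiply + one add counted as 2
clock_hz = 300e6                 # 300 MHz

vu_peak_gflops = (vector_units * floats_per_register
                  * flops_per_multiply_add * clock_hz) / 1e9
claimed_gflops = 5.5             # total claimed in the comment

def usd_per_gflops(price_usd, gflops):
    """Price divided by (peak) GFLOPS."""
    return price_usd / gflops

print(f"vector units alone, peak: {vu_peak_gflops:.1f} GFLOPS")
print(f"$199 / 5.5 GFLOPS = ${usd_per_gflops(199, claimed_gflops):.2f} per peak GFLOPS")
print(f"$179 / 5.5 GFLOPS = ${usd_per_gflops(179, claimed_gflops):.2f} per peak GFLOPS")
```

By this arithmetic the vector units account for most of the 5.5 GFLOPS, with the rest left to the other floating-point hardware. Keep in mind this is peak per box, while the KASY0 number is a measured HPL result across the whole cluster.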
These numbers for microprocessors etc. mean nothing because they usually refer to operations on data in cache. You'll find that real-life performance is 10-20x slower, because that's how much slower accessing main memory is.
Here's a quote from the site:
In reality, Beowulf clusters are good for only a subset of supercomputing tasks, and the "real" supercomputers are still best at general-purpose supercomputing.
If you can parallelize your application well enough, Beowulf rules, but if you need a lot of node-to-node communication, the network cost quickly surpasses the CPU cost of the system.
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
128+4...
That's like 132 isn't it?
From the FAQ:
KASY0's configuration is:
128 + 4 "cold spare" PC nodes, each containing:
One AMD Athlon XP 2600+ (the 2.075GHz version)
One 512MB PC2700 DDR SDRAM
BioStar M7VIT Pro motherboard
Two Linksys LNE100TX NICs
Codegen 6042L case with 400W power supply
18 BenQ SE0024 24-port Fast Ethernet switches
405 Cat5 Fast Ethernet cables
RedHat Linux 9.0, modified Warewulf 1.11
So it's 128, the other 4 are spares!
I'm a chainsmokin' alcoholic sociopath, so-ci-o-path
Per the FAQ on the site, the supercomputer draws 210A. The power requirements work out to a yearly cost equivalent to the cost of the network equipment connecting the nodes.
210A at 120Vac comes to 25.2 kW (P = V x I). Triple that to allow for cooling (it takes approximately 2 watts of power to remove the heat generated by 1 watt of power usage) and you come to almost 76 kW, or about 76 kWh for every hour it runs. Take a look at your utility bill to come up with the hourly cost for electricity while this thing is on.
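Here is the same estimate as a small sketch. The 210A and 120V figures and the "triple it for cooling" rule of thumb come from the text above; the electricity rate is a placeholder assumption, so plug in the number from your own bill.

```python
def running_cost(amps, volts, cooling_factor, usd_per_kwh, hours):
    """Return (total_kw, cost_usd) for a load run for the given number of hours."""
    load_kw = amps * volts / 1000.0         # P = V x I, converted to kilowatts
    total_kw = load_kw * cooling_factor     # load plus the power spent on cooling
    return total_kw, total_kw * hours * usd_per_kwh

RATE = 0.05   # ASSUMPTION: placeholder price in $/kWh, not a quoted figure

total_kw, hourly = running_cost(210, 120, cooling_factor=3, usd_per_kwh=RATE, hours=1)
_, yearly = running_cost(210, 120, cooling_factor=3, usd_per_kwh=RATE, hours=24 * 365)

print(f"total draw including cooling: {total_kw:.1f} kW")
print(f"cost per hour at ${RATE}/kWh: ${hourly:.2f}")
print(f"cost per year if run continuously: ${yearly:,.0f}")
```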
The equipment does not have cooling isolated from the rest of the building. As a result, the cooling costs will ultimately be absorbed by the operational cost of the building, and probably will not show up as a line item for the cost of this cluster.
-Rusty
p.s. More wire does not need more juice, just more paths for signals to follow.
You never know...
Having worked there, and knowing what Hank Dietz and his students are doing, I can tell you that it is different from just slapping PCs together, stringing wire between them and installing clustering software.
Dietz specializes in networking and all the wiring that you see in the photos is charted out by custom software that he's written just for this purpose.
He works in the realm of optimizing communications among the nodes to avoid network latency and so on. If you read the POV-Ray benchmarks, you'll notice that the author comments that several clusters' CPUs spend most of their time idle due to network latency. Dietz is researching the best ways to eliminate much of that latency so that the CPUs in the cluster can spend more of their time crunching data rather than just throwing off heat. To my knowledge, he is succeeding at this, and doing so better than most other researchers in the field.
As for what his students learned from this, I don't know exactly which students helped him on this. For KLAT2, there were several undergrad volunteers who helped with wiring and assembly, mostly from the campus Linux Users' Group. I know his grad students and research assistants are learning a lot about how clustering and network tech works, and a couple are doing their Ph.D. disserts in this very subfield of E.E.
Just be sure to wear the gold uniform when you beam down -- you know what happens when you wear the red one.
Gah, feel free to mod the previous version of this comment into oblivion. I hit submit accidentally.
First off, the numbers you're looking at are marketing numbers, and overly generous. Second, you don't scale for free: you never get anything like 100 times the performance of a single box when you wire 100 together, for the same reason that you don't get twice the horsepower out of an engine twice the size.
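One standard way to put numbers on the scaling point is Amdahl's law. That is my choice of model for illustration, not something claimed in this thread, and the parallel fractions below are made-up inputs rather than measurements of any real cluster.

```python
def amdahl_speedup(parallel_fraction, n_boxes):
    """Amdahl's law: speedup over one box when only part of the work parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_boxes)

# Hypothetical parallel fractions, for illustration only.
for p in (1.0, 0.99, 0.9):
    print(f"parallel fraction {p:.2f}: 100 boxes give {amdahl_speedup(p, 100):.1f}x")
```

Even with 99% of the work perfectly parallel, 100 boxes come out at roughly half of the ideal 100x, and this simple model does not even charge anything for network communication.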
The previous price/performance champ was in fact a PS2 cluster, mentioned here, but this AMD cluster delivers roughly three times the performance for the dollar. You can check the stats with different assumptions on their FAQ page, particularly the section labeled 'Is KASY0 really the first supercomputer under $100/GFLOPS?'
=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Friends don't let friends enable ecmascript.
I would consider something like this, bent into a circular pattern.
We use this in one of my closets to provide a feed from the ceiling to a rack with a patch panel. We had moved the rack 4' to the left and we needed something to bring the ceiling stuff down to the new rack. We used fittings with tek screws and wire ties to hold it firmly in place and linked it into the drop ceiling framework as well for additional stability.
I could see doing something like that here between the stainless shelving units. Other adequate solutions could be arrived at. Also, the snake tray probably cost us $150 tops.
HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
Actually, they do, but they are referred to as vector processors rather than DSPs. Probably the most famous, and one of the earliest, was the Cray supercomputer. And there was also the INMOS "Transputer".
DSPs are optimised to handle streamed data of a particular maximum size (e.g. 4-element floating-point vectors). That is useful for image processing (red, green, blue, alpha) and 3D graphics (XYZW), but if you're modelling something like ocean currents or global weather, every data element is more than likely going to have more than four variables (e.g. temperature, humidity, velocity, pressure, salinity, ground temperature), so you may not get full optimisation.
Plus, you also need a means of getting all these processors to talk to each other. DSPs are nearly always optimised to operate in single pipelines, so they don't need much communication support (e.g. Sony PlayStation 2). However, if you're designing a supercomputer system, the major bottleneck is the communication between processors (network topology). Some applications might only need adjacent processors to talk to each other (a global weather simulation usually represents the atmosphere as a single large block of air, with sub-blocks assigned to separate processors). Other applications might assign individual processors to different tasks, which complete at different rates (e.g. the Mandelbrot set). A configurable network architecture allows the system to be used for many more different applications.
$600,000 / 1,200 GFLOPS = $500/GFLOPS. I think you misplaced a decimal in the $CAD->$US conversion =)
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.