Domain: aggregate.org
Stories and comments across the archive that link to aggregate.org.
Comments · 77
-
Re:Try These
-
Cluster software & GPU experience
I assume this is an epic troll, but I'm going to give an honest answer anyway, because there are some legitimate questions buried in there.
I work with aggregate.org, a university research group that has a decent claim to having built the very first Linux PC cluster, set some records with them (KLAT2 and KASY0 were both ours), and still operates a number of Linux clusters, including some containing GPUs, so I feel like I have some idea of the lay of the land in cluster technology. We also maintain TLDP's widely circulated Parallel Processing HOWTO, which was the go-to resource for this kind of question for some time. It is *way* overdue for an update (and one is in progress, we swear!).
In a cluster of any size, you do _not_ want to be handling nodes individually. Every organization with a large number of machines needs a tool to avoid doing so, so there are several popular provisioning and administration systems. The clusters I deal with are mostly provisioned with Perceus, with a few ROCKS holdovers, and I'm aware of a number of other solutions (xCAT is the most popular one I've never tinkered with). Perceus can pass out pretty much any correctly configured Linux image to the machines, although it is specifically tailored to work with Caos NSA (Red Hat-like) or GravityOS (a Debian derivative) payloads. Infiscale, the company that supports Perceus, releases the basic tools and some sample modifiable OS images for free and makes its money off support and custom images, so it is a pretty flexible option in terms of required financial and/or personnel commitment. The various provisioning and administration tools are generally designed to interact with monitoring tools (e.g. Warewulf or Ganglia) and job management systems (see next paragraph).
Accounting and billing users is largely a job for your job management system. Our clusters aren't billed this way, so I can't claim to be closely familiar with the tools, but most of the established job management systems, like Slurm and GridEngine (to name two of many), have accounting built in.
The "standard" images or image-building tools provided with the provisioning systems generally offer a few nicely integrated combinations of tools, which makes it remarkably easy to throw a functioning cluster stack together.
As for GPUs... be aware that the claimed performance for GPUs, especially in clusters, is virtually unattainable. You have to write code in their nasty domain-specific languages (CUDA or OpenCL for Nvidia, just OpenCL for AMD), and there isn't really any concept of IPC baked into the tools to allow for distributed operations. Furthermore, GPUs are generally extraordinarily starved for memory and memory bandwidth (remember, the speed comes from there being hundreds of processing elements on the card, all sharing the same memory and interface), so simply keeping them fed with data is challenging. GPGPU is also an unstable area in both relevant senses: the GPGPU software itself has a nasty tendency to hang the host when something goes wrong (which is extra fun in clusters without BMCs), and the platforms are changing at an alarming clip. AMD is somewhat worse in the "moving target" regard: they recently dropped GPGPU tool support for all 4000-series cards, and they abandoned their CTM, CAL, and Brook+ environments before settling on OpenCL, and only OpenCL. Nvidia still supports both their C -
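A minimal roofline-style sketch of why "keeping them fed" is the hard part: a kernel's attainable speed is bounded by the smaller of the card's compute peak and (FLOPs per byte moved) × (memory bandwidth). The 1000 GFLOPS / 100 GB/s card below is a hypothetical placeholder, not the spec of any real GPU.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gb_s: float, flops_per_byte: float) -> float:
    """Roofline bound: no faster than the compute peak, and no faster than
    memory can deliver operands (bandwidth x arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

# Hypothetical card: 1000 GFLOPS peak, 100 GB/s memory bandwidth.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GB_S = 100.0

# Single-precision vector add, c[i] = a[i] + b[i]: 1 FLOP per 12 bytes
# moved (two 4-byte loads and one 4-byte store).
VECTOR_ADD_INTENSITY = 1 / 12

# Arithmetic intensity at which this card stops being memory-bound.
RIDGE_POINT = PEAK_GFLOPS / BANDWIDTH_GB_S   # FLOPs per byte

if __name__ == "__main__":
    va = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, VECTOR_ADD_INTENSITY)
    print(f"vector add: {va:.1f} GFLOPS ({100 * va / PEAK_GFLOPS:.2f}% of peak)")
    print(f"ridge point: {RIDGE_POINT:.0f} FLOPs/byte")
```

Under those assumed numbers, a vector add tops out around 8 GFLOPS, under 1% of the advertised peak; only kernels doing roughly 10 or more FLOPs per byte of memory traffic can get near it.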
Empowering...
Way too many intro electronics "experiments" are either underwhelming (I lit an LED!) or black-box magic (Build a radio transmitter by following these 37 simple steps! The following paragraph explains how the circuit works...). So I sympathize with the original request. Personally, I think the trick is to use a low-end microcontroller and some cool I/O, and get the kids doing some simple, minimal programming so that they feel ownership. The best answer to this used to be a BASIC Stamp, but the cost is prohibitive. Doing this on a budget today, I'd probably get a low-end 8-pin PIC, some switches and lights, and a cheap servo motor (one of the sub-$4 HXT ones from HobbyCity). Then I'd have the kids share a few PCs loaded with MPLAB (free) and maybe a cheap BASIC or C compiler (there are free ones). Finally, you'll need a cheap programmer (a PICkit 2 or a third-party one). It's a bit more work, but that's enough for the kids to do some really cool stuff. The goal should be to give the kids tools so that they can be confident enough to go off and make their own cool stuff. To get the flavor of some of these ideas, check out: http://aggregate.org/hankd/piaee12.pdf
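For the servo part of such a project, the kids' first program mostly has to produce the right pulse train. A minimal sketch of the arithmetic, assuming typical hobby-servo timing (a pulse every 20 ms, roughly 1.0 to 2.0 ms wide across the travel; the 0 to 180 degree span is also an assumption, since many servos cover less) and a PIC running from a 4 MHz oscillator, where the instruction clock is Fosc/4 = 1 MHz. It is written in Python only to show the numbers; on the PIC itself this logic would live in the BASIC or C program.

```python
# Assumed typical hobby-servo timing; check the datasheet for your servo.
FRAME_US = 20_000       # one pulse every 20 ms (50 Hz)
MIN_PULSE_US = 1_000    # ~1.0 ms pulse -> one end of travel
MAX_PULSE_US = 2_000    # ~2.0 ms pulse -> the other end

def pulse_width_us(angle_deg: float, max_angle_deg: float = 180.0) -> int:
    """Map an angle linearly onto the pulse-width range, in microseconds."""
    if not 0 <= angle_deg <= max_angle_deg:
        raise ValueError("angle out of range")
    span = MAX_PULSE_US - MIN_PULSE_US
    return round(MIN_PULSE_US + span * angle_deg / max_angle_deg)

def timer_ticks(pulse_us: int, instr_clock_hz: int = 1_000_000, prescale: int = 1) -> int:
    """Timer counts for a pulse, assuming the timer is clocked by the
    instruction clock (Fosc/4: 1 MHz on a 4 MHz PIC, so 1 tick = 1 us at 1:1)."""
    return round(pulse_us * instr_clock_hz / (1_000_000 * prescale))

if __name__ == "__main__":
    for angle in (0, 90, 180):
        high = pulse_width_us(angle)
        print(f"{angle:>3} deg: high {high} us ({timer_ticks(high)} ticks), low {FRAME_US - high} us")
```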
-
Re:Kentucky
Here in Lexington we've had them for at least 8 years... http://aggregate.org/KLAT2/press.html
-
A Pragmatic Introduction to the Art of EE
I'm surprised no one mentioned this one: http://aggregate.org/hankd/piaee12.pdf It's a bit dated now (from the mid-'90s), but it still has lots of good info. It was originally designed as a college text for non-EE majors, but it is much more project-oriented than a classic text. By the way, the title was meant as an homage to the Art of EE: a great book, but a bit intimidating in scope for a newbie...
-
FTFA: the first to cost less than $100/Gflop?
What about KASY0, which had $84 per GFLOP in 2003?
-
Price/Performance not new...
The University of Kentucky (where he is, coincidentally, going to grad school) beat his price point years ago on a "real" supercomputer. This supercomputer was built for about $84 per GFLOP in 2003, and it made the Top500 list when it was built. The Aggregate team at UK is one of the best in the field when it comes to supercomputers on the cheap.
-
This has been done before...
by a cluster computing group at the University of Kentucky called the Aggregate (http://aggregate.org/). They built a nine-laptop display panel that is basically what you are trying to do. It is much harder to do than I thought it would be. Here is a video of the panel in action: http://aggregate.org/IMG/mvi_5158.avi. And here is the software they created to do it: http://aggregate.org/VWLib/.
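One piece of what makes a display wall harder than it looks is deciding which slice of the overall image each panel draws. The sketch below assumes the nine panels sit in a 3×3 grid of identical screens at a hypothetical 1024×768 each, with a hypothetical 40-pixel bezel allowance; it illustrates the geometry only and does not reflect VWLib's actual API.

```python
from __future__ import annotations
from typing import NamedTuple

class Tile(NamedTuple):
    col: int
    row: int
    x: int        # left edge of this panel's slice, in wall-wide pixel coordinates
    y: int        # top edge
    width: int
    height: int

def tile_layout(cols: int, rows: int, panel_w: int, panel_h: int, bezel_px: int = 0) -> list[Tile]:
    """Which rectangle of the overall image each panel should draw.
    bezel_px counts the frame between adjacent screens as hidden pixels,
    so a straight line crossing a bezel stays straight."""
    return [
        Tile(col, row, col * (panel_w + bezel_px), row * (panel_h + bezel_px), panel_w, panel_h)
        for row in range(rows)
        for col in range(cols)
    ]

def wall_size(cols: int, rows: int, panel_w: int, panel_h: int, bezel_px: int = 0) -> tuple[int, int]:
    """Size of the virtual framebuffer the application renders into."""
    return (cols * panel_w + (cols - 1) * bezel_px,
            rows * panel_h + (rows - 1) * bezel_px)

if __name__ == "__main__":
    print(wall_size(3, 3, 1024, 768, bezel_px=40))
    for t in tile_layout(3, 3, 1024, 768, bezel_px=40):
        print(t)
```

Each panel's machine would then render (or be sent) only its own rectangle of the wall-sized image; keeping all nine in sync frame to frame is a separate, harder problem this sketch doesn't touch.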
-
My Experience
I have a stack of five original Pentium boxes with 32 MB of RAM and 2 GB hard drives (except for one, which has a larger drive for a software repository). I originally built it to experiment with AFAPI-based clustering, but since AFAPI is a reasonably non-invasive setup, it works well for trying other techniques too, everything from simply running distcc on the nodes to speed up i586 software builds to briefly fiddling about with some of the other clustering options mentioned. Fiddling around with options on a real cluster is a great way to learn, especially on one that can be reinstalled from scratch in a few hours and whose machines aren't worth enough to matter if they get physically damaged. (Running cluster software on a single node really doesn't give a good impression.)
-
I was expecting something more detailed than this
I was expecting something a little better than this, like maybe some fast code to study and use.
-
On U of K...
Seriously, we have some really good programs. Hank Dietz, Bill Dieter, and Tim Mattox have some exceptional results in parallel computing. Until recently, their $40,000 home-made cluster beat UK's million dollar HP Superdome cluster in Linpack ratings. The $40k even factors in the cost of student labor (in the form of pizza) to wire the cluster.
I just wish I could say the same for our CS department... It's been getting steadily better since the College of Engineering adopted it, but they switched to M$ Visual Studio .NET this year, and that really worries me... program internals shouldn't be hidden from the student at lower levels of computer science.
I hope I didn't /. aggregate.org too badly... -
Aggregate.org
For some very good information on F/OSS-based clustering, check out aggregate.org. They have really neat ideas that are reasonably well documented and freely implementable/usable. I built a little cluster (AFAPI on a WAPERS switch) with them for my high school senior project, and it was a great experience.
-
just ask Hank Dietz!
-
FNN
Flat neighborhood networks: basically, you get to use "cheap" cards and switches in a web configuration to provide a fast interconnect between nodes.
Other than that, a Cisco 6513 with eleven 48-port 10/100/1000 switch cards would fit the bill to provide a single-chassis switch for all 500 nodes. Hope you've got a decent budget, because it will cost you. -
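The defining property of a flat neighborhood network is that every pair of nodes shares at least one switch, so any two nodes are a single switch hop apart even though no one switch connects everybody. A toy checker for that property, with a hypothetical six-node wiring (two NICs per node, three 4-port switches). This is an illustration, not the Aggregate group's actual FNN design software, and it ignores the bandwidth side of FNN design.

```python
from __future__ import annotations
from itertools import combinations

def is_flat_neighborhood(wiring: dict[str, set[int]], ports_per_switch: int) -> bool:
    """True if every pair of nodes shares at least one switch and no switch
    has more cables plugged into it than it has ports."""
    load: dict[int, int] = {}
    for switches in wiring.values():
        for s in switches:
            load[s] = load.get(s, 0) + 1
    if any(n > ports_per_switch for n in load.values()):
        return False
    # A non-empty set intersection means the two nodes have a switch in common.
    return all(wiring[a] & wiring[b] for a, b in combinations(wiring, 2))

# Hypothetical toy wiring: node -> set of switch IDs its NICs plug into.
toy_fnn = {
    "n0": {0, 1}, "n1": {0, 1},
    "n2": {0, 2}, "n3": {0, 2},
    "n4": {1, 2}, "n5": {1, 2},
}

if __name__ == "__main__":
    print(is_flat_neighborhood(toy_fnn, ports_per_switch=4))   # every pair shares a switch
    print(is_flat_neighborhood({"a": {0, 1}, "b": {2, 3}}, 4))  # a and b never meet
```

For comparison, the single-chassis option works out to 11 × 48 = 528 ports, which covers 500 nodes with 28 to spare.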
Re:http://aggregate.org/
I've been keeping up with Dr. Dietz's work since Purdue. I really admire his work, and I even ran a small 2-node PAPERS cluster at home using his AFAPI library.
PeTS may be applicable here, especially his research into Flat Neighborhood Networks (FNNs). However, I think that AMD/Intel systems use too much power (70 watts or so each). A computationally equivalent cluster of VIA EPIA motherboards (maybe 10 watts each) would be both physically smaller and much easier on the electric bill. At $100 each for a VIA EPIA V10000A or $163 for the newer VIA EPIA M10000 Nehemiah, I could afford to both buy a cluster and run it. Running an AMD cluster would use more electricity than I could afford.
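The electric-bill argument as arithmetic: a minimal sketch that takes the comment's per-node wattages at face value. The 16-node cluster size and the $0.10/kWh rate are hypothetical, and the "computationally equivalent" premise is the comment's assumption, not something this calculation checks.

```python
HOURS_PER_YEAR = 24 * 365

def annual_energy_cost(nodes: int, watts_per_node: float, usd_per_kwh: float) -> float:
    """Yearly electricity bill for running every node flat out, 24/7.
    Ignores cooling, power-supply losses, and network gear."""
    kwh_per_year = nodes * watts_per_node * HOURS_PER_YEAR / 1000
    return kwh_per_year * usd_per_kwh

if __name__ == "__main__":
    # Hypothetical: 16 nodes at an assumed $0.10/kWh.
    for label, watts in (("AMD/Intel (~70 W)", 70), ("VIA EPIA (~10 W)", 10)):
        print(f"{label}: ${annual_energy_cost(16, watts, 0.10):,.2f}/year")
```

With those assumptions the gap is roughly $981 vs. $140 per year, and it grows linearly with node count.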
The picture in the middle of the PeTS page, KAOSlab.jpg, is my background desktop at work, and I often get comments. I wish I were so lucky as to work with that sort of thing every day. :) -
http://aggregate.org/
I would have the research group that I work with at the University of Kentucky build it. Maybe you should contact my professor, Dr. Hank Dietz.
KASY0
University Of Kentucky Supercomputer Breaks The $100 Per GFLOPS Barrier
They built the supercomputer for under $40,000 with 128 nodes + 4 spare nodes. Just think how many nodes, and how powerful it could be, with $700,000!
-
Low Cost Cluster Computing
I would very much recommend this research site from one of my professors at the University of Kentucky. He has been doing work with cluster supercomputing for quite some time now and has managed to build some very impressive systems at low cost, much lower than what your current grant is for. With a grant of that size, using this professor's techniques, you could build a whole bunch of clusters. I would suggest taking a look at his group's research site, aggregate.org.
You can also see one specific example of a very low-cost, efficient cluster computer: KASY0 -
disagree
I'm sure Hank Dietz would disagree: http://www.aggregate.org
-
Re:Important items of note
I have yet to find a satisfactory description of the network topology they are using. The specs on the InfiniBand switches they are using are quite impressive for latency and bandwidth, but without knowing how the switches are interconnected, it's hard to say whether latency or perhaps bisection bandwidth is limiting their efficiency. The early report of 80% efficiency on 128 CPUs (or was it 128 nodes?) would seem to indicate that the problem is with the switch fabric in some way. With ~1100 nodes, communications have to cross multiple switches in any traditional network topology, resulting in higher latency and possibly bandwidth bottlenecks.
I saw some indication that they were using a fat-tree topology, which would eliminate any bandwidth bottlenecks between switches, but the number of switches used didn't seem large enough for a fat tree. But again, as of the last time I looked, VT just hasn't released enough information about the cluster to tell.
BTW, my thesis work on Flat Neighborhood Networks (FNNs), used in the KLAT2 and KASY0 supercomputers, is about finding better ways to interconnect the nodes given a particular set of network components.
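To put "didn't seem large enough for a fat tree" in numbers: the textbook k-ary fat tree built from identical k-port switches supports k²/2 hosts with 3k/2 switches in two levels, and k³/4 hosts with 5k²/4 switches in three. The 24-port switch size below is a hypothetical example, since the actual switch models aren't given here, and the partial-tree count is only a rough port-counting lower bound.

```python
from __future__ import annotations
import math

def fat_tree_limits(k: int) -> dict[int, tuple[int, int]]:
    """(max hosts, switches) for full-bisection fat trees of identical k-port switches.
    2 levels: k leaf + k/2 spine switches           -> k^2/2 hosts
    3 levels: k pods of k/2 edge + k/2 aggregation,
              plus (k/2)^2 core switches            -> k^3/4 hosts"""
    if k < 2 or k % 2:
        raise ValueError("k must be a positive even number")
    return {2: (k * k // 2, 3 * k // 2),
            3: (k ** 3 // 4, 5 * k * k // 4)}

def min_switches_3level(hosts: int, k: int) -> int:
    """Rough port-counting lower bound for a partly populated 3-level tree:
    every host-facing port needs a matching uplink at each level above it."""
    half = k // 2
    edge = math.ceil(hosts / half)      # k/2 hosts down, k/2 uplinks per edge switch
    agg = edge                          # each aggregation switch: k/2 down, k/2 up
    core = math.ceil(agg * half / k)    # every aggregation uplink needs a core port
    return edge + agg + core

if __name__ == "__main__":
    k = 24  # hypothetical switch size
    print(fat_tree_limits(k))
    print(min_switches_3level(1100, k))
```

With 24-port switches, two levels top out at 288 hosts, so ~1100 nodes needs three levels, and even a trimmed full-bisection tree needs on the order of 230 switches by this count.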
-
It's a good price/performance, but not best.
I guess the original submission didn't see the Slashdot article from August 23 about our KASY0 supercomputer breaking the $100 per GFLOPS barrier.
KASY0 achieved 187.3 GFLOPS on the 64-bit floating point version of HPL, the same benchmark used on "Big Mac". While "Big Mac" is about 40 times faster on that benchmark, it is about 130 times the cost of KASY0 (~$40K vs ~$5200K). Considering the size difference, "Big Mac" is VERY impressive, but it can't claim to be the best price/performance supercomputer on the HPL benchmark.
Note: KASY0 gets 482.6 GFLOPS (0.48 TFLOPS) on a 32-bit precision version of Linpack, satisfying our under $100 per GFLOPS claim.
Regardless, Virginia Tech's "Big Mac" is a very impressive machine. My congratulations to them!
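The price/performance comparison above, worked through with the comment's own rounded figures (~$40K, ~$5200K, and "about 40 times faster"). Since the inputs are approximations, the outputs are approximate too.

```python
KASY0_COST_USD = 40_000        # "~$40K"
KASY0_HPL64_GFLOPS = 187.3     # 64-bit HPL
KASY0_HPL32_GFLOPS = 482.6     # 32-bit Linpack
BIGMAC_COST_USD = 5_200_000    # "~$5200K"
BIGMAC_SPEEDUP = 40            # "about 40 times faster" on 64-bit HPL

def dollars_per_gflops(cost_usd: float, gflops: float) -> float:
    return cost_usd / gflops

kasy0_64 = dollars_per_gflops(KASY0_COST_USD, KASY0_HPL64_GFLOPS)
kasy0_32 = dollars_per_gflops(KASY0_COST_USD, KASY0_HPL32_GFLOPS)
bigmac_64 = dollars_per_gflops(BIGMAC_COST_USD, KASY0_HPL64_GFLOPS * BIGMAC_SPEEDUP)

if __name__ == "__main__":
    print(f"KASY0, 64-bit HPL:   ${kasy0_64:.0f}/GFLOPS")
    print(f"KASY0, 32-bit:       ${kasy0_32:.0f}/GFLOPS")
    print(f"Big Mac, 64-bit HPL: ${bigmac_64:.0f}/GFLOPS")
    # 130x the cost for 40x the speed -> 3.25x the cost per GFLOPS
    print(f"ratio: {bigmac_64 / kasy0_64:.2f}")
```

That comes to roughly $214, $83, and $694 per GFLOPS respectively: the 32-bit figure is the one under $100, and on 64-bit HPL Big Mac costs about 3.25 times as much per GFLOPS.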
-
Re:Price vs Preformance
The cost of cooling hundreds of machines running full bore 24/7 would increase nonlinearly: double the machines and you will more than double the power consumption needed to keep them from melting. One G5 under normal conditions can cool itself without extra air conditioning, but stick a bunch in a room and each one is pulling in the heated air put out by the others and basking in their radiant warmth. With a few of them, you can put a fan on them, have a normal AC system, and be fine. Cheaper Beowulfs often include a few big fans among their costs, along with the pizza used to bribe students into putting the thing together and incidentals like the nodes themselves. Moving up from a few dozen nodes to 1100 nodes is a big jump, though. I don't know whether the numbers given are accurate, but I would certainly expect it to be more than 300 midrange homes.
-
Re:What happened to the federal controls?
Hell, if you want Myrinet on the cheap, check out Flat Neighborhood Networks. Lots of 100Mb/s cards and lots of switches; wiring them is a bitch, but it works. PCI bus saturation is an issue in some applications, but PCI-X and/or HyperTransport should solve that.
-
Re:Macs ?
I am one of the designers of KLAT2 and KASY0, and the guy who ran the Linpack benchmarks on both. Over 3 years ago, when we submitted our results for KLAT2 to the Top500 list, there was no public indication that 64-bit floating point was required. It took them a while, but the Top500 website now has a FAQ that indicates "full precision" is required, and they interpret that as 64-bit for most machines. FYI, 32-bit FLOPs are useful in many situations, and machines that used 32-bit FLOPs had been on the Top500 list. You might take a look at our KASY0 FAQ on GFLOPS. As a means to rank the Top500, I think it is quite legitimate to require 64-bit FLOPS, but that doesn't make it "illegal" to use 32-bit Linpack FLOPS for other comparisons.
As for the G5, it won't need AltiVec to get good Linpack numbers, thanks to the fused multiply-add capability in its dual floating-point pipes. That's 4 FLOPs per clock peak! I hope VT was able to get Apple to leave out, and not charge for, the components not needed in a cluster node. The PCI-X slots in the G5 should allow VT to make better use of a high-speed cluster network technology. Commodity x86 boxes tend to have only 32-bit 33MHz PCI, limiting the usable link bandwidth between nodes to under a gigabit per second. For 64-bit Linpack GFLOPS per dollar, a cluster of G5s could be competitive. I look forward to seeing their results, and any similar work using the upcoming Athlon 64.
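Back-of-the-envelope versions of two numbers in this comment. The G5 peak follows from two floating-point pipes each retiring one fused multiply-add (2 FLOPs) per clock; the dual 2.0 GHz node configuration is an assumption for illustration. The PCI figure is the theoretical burst rate of a 32-bit bus at its nominal 33.33 MHz, which every device on the bus shares; the 64-bit/133 MHz PCI-X line is likewise an example configuration.

```python
def peak_gflops(cpus: int, ghz: float, fp_pipes: int = 2, flops_per_fma: int = 2) -> float:
    """Peak = CPUs x clock (GHz) x FP pipes x FLOPs per fused multiply-add."""
    return cpus * ghz * fp_pipes * flops_per_fma

def pci_peak_gbps(bus_bits: int, mhz: float) -> float:
    """Theoretical PCI/PCI-X burst bandwidth in Gbit/s (one transfer per clock).
    Arbitration, protocol overhead, and other devices on the bus eat into this."""
    return bus_bits * mhz / 1000

if __name__ == "__main__":
    print(f"dual 2.0 GHz G5 node (assumed): {peak_gflops(2, 2.0):.0f} GFLOPS peak")
    print(f"32-bit/33 MHz PCI:    {pci_peak_gbps(32, 33.33):.2f} Gbit/s, shared")
    print(f"64-bit/133 MHz PCI-X: {pci_peak_gbps(64, 133):.2f} Gbit/s")
```

Since ~1.07 Gbit/s is the bus ceiling before any overhead, a gigabit NIC on plain 32-bit/33 MHz PCI can't sustain line rate in practice.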
-
Re:Macs ?
I am one of the designers of KLAT2 and KASY0, and the guy who ran the Linpack benchmarks on both. Over 3 years ago when we submitted our results for KLAT2 to the top500 list, there was no public indication that 64-bit floating point was required. It took them awhile, but the top500 website now has a FAQ that indicates "full precision" is required, and they interpret that as 64-bit for most machines. FYI, 32-bit FLOPs are useful in many situations, and machines had been on the top500 list that had used 32-bit FLOPs. You might take a look at our KASY0 FAQ on GFLOPS. As a means to rank the top500, I think it is quite legitimate to require 64-bit FLOPS, but that doesn't make it "illegal" to use 32-bit Linpack FLOPS for other comparisons.
As for the G5, it won't need AltiVec to get good Linpack numbers due to its fused multiply-add capability in its dual floating point pipes. That's 4 FLOPs per clock peak! I hope VT was able to get Apple to leave out, and not charge for, the components not needed in a cluster node. The PCI-X slots in the G5 should allow VT to better use a high-speed cluster network technology. Commodity x86 boxes tend to only have 32-bit 33MHz PCI, limiting the usable link bandwidth between nodes to under a gigabit per second. For 64-bit Linpack GFLOPS per dollar, a cluster of G5's could be competative. I look forward to seeing their results, and any similar work using the upcoming Athlon 64.
-
Re:Macs ?
I am one of the designers of KLAT2 and KASY0, and the guy who ran the Linpack benchmarks on both. Over 3 years ago when we submitted our results for KLAT2 to the top500 list, there was no public indication that 64-bit floating point was required. It took them awhile, but the top500 website now has a FAQ that indicates "full precision" is required, and they interpret that as 64-bit for most machines. FYI, 32-bit FLOPs are useful in many situations, and machines had been on the top500 list that had used 32-bit FLOPs. You might take a look at our KASY0 FAQ on GFLOPS. As a means to rank the top500, I think it is quite legitimate to require 64-bit FLOPS, but that doesn't make it "illegal" to use 32-bit Linpack FLOPS for other comparisons.
As for the G5, it won't need AltiVec to get good Linpack numbers due to its fused multiply-add capability in its dual floating point pipes. That's 4 FLOPs per clock peak! I hope VT was able to get Apple to leave out, and not charge for, the components not needed in a cluster node. The PCI-X slots in the G5 should allow VT to better use a high-speed cluster network technology. Commodity x86 boxes tend to only have 32-bit 33MHz PCI, limiting the usable link bandwidth between nodes to under a gigabit per second. For 64-bit Linpack GFLOPS per dollar, a cluster of G5's could be competative. I look forward to seeing their results, and any similar work using the upcoming Athlon 64.
-
Re:What about latency?
Latency is paramount for some tasks, less important for those that *can* make a good distributed project over the Internet of today.
Now, since today's supercomputers are *all* massively parallel constructions, the difference between a commercial design and an off-the-shelf cluster is in the quality and speed of the interconnects. NEC's Earth Simulator, the prime example of 'custom' supercomputer architecture, puts many processor units on *ridiculously* fast 'local' buses, and its racks are all interconnected with still_pretty_insanely_fast (and rather expensive) custom links.
Meanwhile, more 'commercial' designs use various interconnects. IIRC, NEC's 'regular' supercomputers, which formed the design basis for the Earth Simulator architecture, use Fibre Channel 'mesh' networks between racks. The Opteron - sure to be an up-and-coming player in this market - offers HyperTransport, which it looks like Cray will be stretching to its limits on Red Storm; I'm not sure *how* long an HT bus can be, but one gets the impression they'll be stretching it as far as possible, and it's certainly high throughput/low-latency versus the technologies you'd usually find in use for 'networking.'
Anyhow, point is, those designs pack a lot of CPUs together with *very* fast interconnects (equivalent to 16-, 32-, 64+-way SMP), and have lots and lots of racks of those. (The Opteron/Red Storm approach sounds sexy to me, because I think HyperTransport should let them pack 'lots and lots' of CPUs together compared with existing designs. I've yet to read anything about what they're actually doing with it, though.)
Now... In contrast, an 'off the shelf' cluster is usually going to stick with Ethernet, and will only have 1 to perhaps 4 processors per [node-unit-where-the-CPUs-are-connected-on-a-fast-local-bus], depending on how affordable 'cheap' multiprocessor systems are at the time. But *everyone* building supercomputers bumps up against the latency/routing problem; it's just a question of whether it's a problem for, say, 50 Earth Simulator racks (aren't there quite a few more?) vs. 1100 PowerMacs. Experimenting with 'lots of little nodes' has led us to better understand the problem, and to learn how to produce tuned topologies that can compete favorably with 'purpose-built' hardware. See: http://aggregate.org/KASY0/
Now, the question *is* one of cost-benefit. Large supercomputers tend to be built with maintenance features and power efficiency in mind. On the other hand, a totally 'off the shelf' cluster like KASY0 has some advantages, because each machine is a cheap, practically disposable 'module' unto itself: it can doubtless be taken off the cluster, pulled out, and replaced with another while being easily bench-repaired (since, after all, it's a self-contained PC rather than a CPU blade or some other random card that would require an expensive test rack to troubleshoot). Meanwhile, if you absolutely demand low latency, you want one sort of design (Red Storm seems to be achieving it 'on the cheap' by combining off-the-shelf, and thus cheap, chips and buses with smart 'custom-design' engineering), while if you can sacrifice some latency for throughput (jobs with few conditionals), you want another... (like 1100 G5 Macs on a shelf, wired with 'boring' gigabit Ethernet, especially if Apple is giving you a bulk discount on the hardware).
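The latency-vs-throughput trade-off is often reasoned about with the simple alpha-beta (Hockney) model: sending n bytes costs a fixed startup latency plus n divided by bandwidth. The two interconnect profiles below are hypothetical round numbers chosen to show the shape of the trade-off, not measurements of Ethernet, Myrinet, or any machine mentioned here.

```python
def transfer_time_us(msg_bytes: int, latency_us: float, bandwidth_mb_s: float) -> float:
    """alpha-beta model. With bandwidth in MB/s (10**6 bytes/s),
    msg_bytes / bandwidth_mb_s comes out directly in microseconds."""
    return latency_us + msg_bytes / bandwidth_mb_s

def half_performance_size(latency_us: float, bandwidth_mb_s: float) -> float:
    """n_1/2: message size (bytes) at which startup latency and transfer time are equal."""
    return latency_us * bandwidth_mb_s

# Hypothetical profiles: (startup latency in us, bandwidth in MB/s)
COMMODITY = (50.0, 100.0)
LOW_LATENCY = (5.0, 125.0)

if __name__ == "__main__":
    for size in (64, 1_000_000):
        slow = transfer_time_us(size, *COMMODITY)
        fast = transfer_time_us(size, *LOW_LATENCY)
        print(f"{size:>9} B: {slow:10.2f} us vs {fast:8.2f} us ({slow / fast:.1f}x)")
```

With these assumed profiles, the low-latency link is about 9 times faster for a 64-byte message but only about 1.3 times faster for a 1 MB transfer, where bandwidth dominates; that is why chatty, tightly coupled jobs care so much more about latency than bulk-throughput jobs do.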
So what I'm trying to say is... this is a *combination* of PR stunt and intelligent planning, and there's certainly a lot of 'good science' they could do with the beast - both in number-crunching and 'computer science' a-la cluster topologies. Whether they'll actually *use* it for such, or if it'll be solely a topology toy is anyone's guess.
I think there's some hope that it'll be the "Real Thing," though, since this would explain some of the weird rumors about FC-on-the-mainboard Macs. So they get a Real Monster, made of what will be revealed as "the new G5 Xserves" at the unveiling. The best of COTS *and* fresh d -
Re:And how much HEAT?
Did you guys notice from the pics [aggregate.org] that there don't seem to be any fans in the holes on the sides?
See here:
-
Re:hot damn, they're case modders!
No, the case came that way. But you will notice that they are stacked next to each other, blocking the side ports for all except the ones on the left end.
That's probably why they did this:
-
Re:Playstation2 at 5.5GFLOPS costs only $199 $40/G
Gah, feel free to mod the previous version of this comment into oblivion; I hit submit accidentally.
First off, the numbers you're looking at are marketing numbers, and overly generous. Second, you don't scale for free: you never get anything like 100 times the performance of a single box when you wire 100 together, for the same reason that you don't get twice the horsepower out of an engine twice the size.
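One standard way to see why 100 boxes don't give 100×: Amdahl's law, which accounts only for the part of the work that can't be parallelized (communication overhead, the other big culprit, makes things worse still). The 95% parallel fraction below is a hypothetical example.

```python
def amdahl_speedup(parallel_fraction: float, nodes: int) -> float:
    """Amdahl's law: speedup = 1 / ((1 - p) + p / N).
    The serial share (1 - p) takes the same time no matter how many nodes you add."""
    if not 0.0 <= parallel_fraction <= 1.0 or nodes < 1:
        raise ValueError("need 0 <= parallel_fraction <= 1 and nodes >= 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / nodes)

if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} nodes: {amdahl_speedup(0.95, n):6.2f}x")
```

With 95% of the work parallel, 100 boxes give under 17×, and no number of boxes can exceed 1 / 0.05 = 20×.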
The previous price/performance champ was in fact a PS/2 cluster, mentioned here, but this AMD cluster is roughly three times the performance for the dollar. You can check the stats with different assumptions on their FAQ page, particularly the section labeled 'Is KASY0 really the first supercomputer under $100/GFLOPS?'
-
This beat the PS/2
The previous price/performance champ was in fact a PS/2 cluster, mentioned here, but this AMD cluster is roughly three times the performance for the dollar. You can check the stats with different assumptions on their FAQ page, particularly the section labeled 'Is KASY0 really the first supercomputer under $100/GFLOPS?'
-
How are these booted?
Am I missing something? They say:
KASY0 nodes are completely diskless; there isn't even a floppy. (from the FAQ)
So how are the nodes booted? Are there BIOSes out there that can netboot?
-c
-
And how much HEAT?
I mean these things are Athlons! Heck, they're saving money just from the fact that they'll never have to turn on the furnace again!
Did you guys notice from the pics that there don't seem to be any fans in the holes on the sides? Are they crazy? These are Athlons. I hope they put enough fans in those things. -
Nice wiring!
Looks like most of the wiring jobs I've seen done by students: kasy0core.jpg.
God forbid they use cable gutters ;-)
Other than that, kick ass job guys!
-nate -
cable management
What a mess of cables! I understand they were hitting a price point, but would it have killed them to spring $500 or so for a cable management system?
There's something professional-looking about having the cables neat. On the other hand, maybe I'm just anal about things. -
Also interesting
The University of Kentucky is still doing interesting things with Athlons & Linux. Just about two weeks ago, a group there built KASY0, which they expect to set a new price/performance record at better than 1GFLOPS/$100. More about KASY0 here.
-
Network Infrastructure
The University of Kentucky's KLAT2 project used an FNN to get insane bandwidth without worrying about gigabit cards and switches. I'd suggest you take a look at it.
-
Re:Architecture matching Algorithm
Indeed, and it depends on what your cluster is supposed to be good for. Ideal for "number crunch" clustering are tasks that require low bandwidth and high CPU performance, like movie rendering or testing alternative simulation parameters. For the latter, projects like SETI@home, distributed.net, or Folding@home have become famous: mostly CPU work, negligible network load. For SETI@home I have an average network throughput of ~50 bit/second. To saturate a 100Mbit/s network (not even switched) with SETI@home, you'd need approx. one million (1.000.000) PCs.
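The saturation estimate as arithmetic. Dividing 100 Mbit/s by ~50 bit/s gives two million clients at the link's nominal rate; the one-million figure corresponds to assuming only about half the nominal rate is usable on an unswitched (shared) segment, which is an assumption added here for illustration.

```python
SETI_BPS_PER_PC = 50        # bit/s, the average quoted above
LINK_BPS = 100_000_000      # 100 Mbit/s Ethernet

def clients_to_saturate(link_bps: float, per_client_bps: float, usable_fraction: float = 1.0) -> float:
    """Clients needed to fill a link. usable_fraction < 1 models a shared
    (hub) segment that can't sustain its nominal rate."""
    return link_bps * usable_fraction / per_client_bps

if __name__ == "__main__":
    print(f"{clients_to_saturate(LINK_BPS, SETI_BPS_PER_PC):,.0f} at nominal rate")
    print(f"{clients_to_saturate(LINK_BPS, SETI_BPS_PER_PC, 0.5):,.0f} if only ~half is usable (assumed)")
```

Either way, the network is nowhere near the bottleneck for this kind of workload.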
As for the network: do you need throughput or low latency? Depending on your problem, small changes in the algorithm can do wonders. E.g. for film rendering you might choose a few NAS boxes and a horde of dumb/diskless rendering slaves. If you copy the model libraries (for the included figures, textures, etc.) onto a local disk at the beginning of a scene render run, you will decrease network load a great deal (I've done that with POV-Ray rendering myself).
If you don't have the resources to buy e.g. Myrinet, try alternative architectures if they might fit your problem, e.g. hypercubes (see other posts) or models like Flat Neighbourhood Networks. -
Klatz 2
Here's the link.
Remember kids, preview your posts.