Apple Wins VT in Cost vs. Performance
danigiri writes "Detailed notes about a presentation at Virginia Tech have been posted by an attending student, who copied most of the slides from the facts presentation and wrote down the accompanying comments. He included some insightful notes and info snippets, like the fact that Apple gave the cheapest deal for machines with chassis, beating Dell, IBM, and HP. They are definitely going to use some in-house fault-tolerance software to guard against the odd memory-bit error in such a large bunch of non-error-tolerant RAM, and against any other hard or soft glitches. The G5 cluster will be accepting its first apps around November."
mfago adds, "Apple beat Dell, IBM and others based on Cost vs. Performance alone, and it will run Mac OS X because 'there is not enough support for Linux.'"
One of the primary concerns for a multi-node cluster is assured latency among all components within the cluster. The interconnect doesn't have to be the fastest; it just needs to ensure exact timing for latency across all nodes. IBM can do this with their "wormhole" switch routing on SP and has done this with Myrinet on their Intel X-series clusters.
From most of my reading on Infiniband, it was designed from the ground up as a NAS-style solution rather than for large multi-node cluster computing. I'm curious whether they have any issues with cluster latency.
http://www.nwfusion.com/news/2002/1211sandia.ht
The primary timings and white papers I've seen published for Infiniband have been for small clustered filesystem access. Although its burst rate is much higher than Myrinet's, it's hard to find any raw details on its latency normalization across multiple nodes.
I hope it scales, since Intel's solution appears to be less cost prohibitive than some of the other solutions offered on the market, and would really open up the market even for smaller clusters (16-36 node) for business use.
# 3 MW power, double redundant with backups - UPS and diesel
  * 1.5 MW reserved for the TCF
# 2+ million BTUs of cooling capacity using Liebert's extreme density cooling (rack mounted cooling via liquid refrigerant)
  * traditional methods [fans] would have produced windspeeds of 60+ MPH
Seems that they did talk about both.
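For anyone who wants to put the power and cooling figures on the same scale, here is a rough unit-conversion sketch. It assumes the slide's "BTUs" means BTU per hour, which the notes do not state, and it says nothing about how much of the 1.5 MW actually ends up as heat in the machine room. The function names are made up for illustration.

```python
# Rough sketch: compare the quoted power and cooling figures in one unit.
# Assumption (not in the notes): "BTUs" means BTU per hour.

BTU_PER_HOUR_PER_WATT = 3.412142  # standard conversion: 1 W is about 3.412 BTU/h


def watts_to_btu_per_hour(watts):
    """Convert a power figure in watts to BTU per hour."""
    return watts * BTU_PER_HOUR_PER_WATT


def btu_per_hour_to_watts(btu_per_hour):
    """Convert a cooling figure in BTU per hour to watts."""
    return btu_per_hour / BTU_PER_HOUR_PER_WATT


if __name__ == "__main__":
    tcf_power_w = 1.5e6        # 1.5 MW reserved for the TCF (from the notes)
    cooling_btu_h = 2.0e6      # "2+ million BTUs", taken as a lower bound

    print(f"1.5 MW is about {watts_to_btu_per_hour(tcf_power_w):,.0f} BTU/h")
    print(f"2 million BTU/h is about {btu_per_hour_to_watts(cooling_btu_h) / 1e6:.3f} MW")
```

Since the slide says "2+ million", the real cooling capacity could be well above the lower bound used here.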
Okay, before we get going with the same discussion about ECC vs. non-ECC, and all the flames start from people perusing Slashdot who think they are more in the know than the PhDs at VT who have been working on this for months, I want to point a few things out.
1. The majority, if not all, of the bit errors that ECC corrects are caused by thermal noise. Thermal noise is an issue in a cluster of rack mounted 1U units because it is difficult to cool such tightly spaced units generating so much heat in so small a space. It is not an issue in a cluster of DESKTOP machines utilizing a Liebert system with way more cooling capacity than is needed.
2. Even if a non-thermal bit error somehow occurs, each node has 4GB of RAM. The probability of it landing in a piece of RAM critical to the OS or application, as opposed to an empty piece of RAM, is small (especially given the converging nature of many long running calculations).
How many of you are reading this from a desktop without ECC RAM that has an obnoxiously huge uptime? ECC is a non-issue in a well-cooled cluster of desktop cased machines.
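Point 2 can be put into rough numbers. The sketch below assumes a single bit error lands uniformly at random somewhere in a node's RAM, and the 256 MiB "critical" footprint is a purely hypothetical figure picked for illustration; only the 4GB per node comes from the comment above.

```python
# Minimal sketch of the "small probability" argument in point 2.
# Assumptions (not from the presentation): the flipped bit is uniformly
# distributed over physical RAM, and the critical footprint is hypothetical.

GIB = 2**30
MIB = 2**20


def critical_hit_probability(critical_bytes, total_bytes):
    """Chance that one uniformly placed bit error lands in critical memory."""
    if total_bytes <= 0 or not 0 <= critical_bytes <= total_bytes:
        raise ValueError("need 0 <= critical_bytes <= total_bytes and total_bytes > 0")
    return critical_bytes / total_bytes


if __name__ == "__main__":
    total = 4 * GIB        # 4GB of RAM per node, from the comment
    critical = 256 * MIB   # hypothetical OS + application critical footprint
    p = critical_hit_probability(critical, total)
    print(f"Chance a single bit error hits critical memory: {p:.4f}")  # 0.0625
```

The ratio grows with how much of the 4GB a job really uses, so a memory-hungry calculation weakens the argument accordingly.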
Certain things are easy to imagine in large quantities, but dude.
Just....dude....
Kick in the Head