A Look Inside the NCSA
Peter Kern writes "The National Center for Supercomputing Applications (NCSA) is one of the great supercomputing facilities in the world and is home to 'Abe', one of the top 10 supercomputers on the current Top 500 list. TG Daily recently toured the facility and published a stunning report about their computing capabilities (more than 140 teraflops), power requirements (a sustained 1.7 megawatts), enormous 20-ft chillers in four cooling systems and other installations that keep the NCSA online."
I have been in the newer facility dozens of times, back when I worked as an intern for the architect on the building. I actually drafted the final drawings for this project. It is a VERY nice facility, with some pretty cool under-floor cooling systems and things like that. I am pretty sure I have 3D digital models of the facility somewhere in my work records.
:)
The lecture auditorium bites the big one though. Purple seats? Nasty. The Siebel Center across the mini-quad is a much more interesting building, at least to an architect.
Much of the software which is run at the NCSA is home-grown software written by computational scientists, not computer scientists.
I've seen code written by computational guys before. While not really terrible, it's not terribly re-usable or maintainable. Obviously these guys don't study computer science, but I truly think there are gains to be made if they better understood the tool they are using.
For many of these massively parallel codes, written on top of MPI, fault tolerance really isn't all that easy. For a commercial production code on the order of Gaussian, this may be doable, but for bleeding-edge research codes, it may be a better use of the (human) time to push the algorithms rather than worry about fault tolerance. From the user's perspective, jobs that are killed due to a hardware failure have their service units refunded, so there isn't a huge incentive to worry about it.
As long as your job runs under 6 hours, sure. But if it takes over 6 hours, you're already doing some kind of saving and re-starting. That's probably about 80% of what I'm talking about, just on a larger scale. I bet you're right though: it's all going to come to a head as there are more and more components that could fail, so it has to be fixed either at a higher level or at the programmer level. Maybe you can fix the problem with virtualization, but how much of a performance hit do you take, or how much costlier is the machine?
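For anyone wondering what "saving and re-starting" looks like in practice, here is a minimal single-process sketch in Python. It is not how any particular NCSA code does it, and a real MPI job would have to coordinate the checkpoint across ranks. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the pickle file format, and the `fail_at` knob for simulating a hardware failure are all made up for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically swap it into place.

    A crash in the middle of writing leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Toy 'simulation' that resumes from the last checkpoint if one exists.

    fail_at is a hypothetical knob that simulates a node failure at a step.
    """
    state = load_checkpoint(path) or {"step": 0, "value": 0.0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        # Stand-in for one expensive time step of the real computation.
        state["value"] += state["step"] * 0.5
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the job dies, you just resubmit the same `run(...)` call and it picks up from the last saved step instead of step 0. You only lose the work done since the last checkpoint, so the interval is a trade-off between I/O cost and how much you are willing to recompute.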