$208 Million Petascale Computer Gets Green Light

Posted by samzenpus on Wednesday September 3, 2008 @10:46AM from the that's-a-lot-of-solitaire dept.

coondoggie writes "The 200,000-processor-core system known as Blue Waters got the green light recently as the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications (NCSA) said they have finalized the contract with IBM to build the world's first sustained petascale computational system. Blue Waters is expected to deliver sustained performance of more than one petaflop on many real-world scientific and engineering applications. A petaflop equals about 1 quadrillion calculations per second. The processor cores will be coupled to more than a petabyte of memory and more than 10 petabytes of disk storage. All of that memory and storage will be globally addressable, meaning that processors will be able to share data from a single pool exceptionally quickly, researchers said. Blue Waters is supported by a $208 million grant from the National Science Foundation and will come online in 2011."

2 of 174 comments

Re:Yes, but the article doesn't address a few ques by PsychoElf · 2008-09-03 14:50 · Score: 0, Troll

But only on minimal settings.

I've modeled large supercomputers, this is bogus! by woolio · 2008-09-03 15:00 · Score: 0, Troll

I was once employed in a position where I created detailed performance/reliability models for large supercomputers (BlueGene/L, etc.).
Say you have an application that is infinitely parallelizable [an overly idealistic assumption]. Adding processors (and ignoring the communications overhead, etc.) speeds up the application, but only up to a point.
At some point, adding processors starts to slow down the entire application. Why? The probability diminishes that all processors will stay up long enough for the application to finish. Even if spare processors are available and the distributed application uses checkpoints, this effect still occurs.
Say a single node/processor has a mean time to failure (MTTF) of 5 years (157,680,000 seconds). Two hundred thousand such nodes, failing independently, have a combined MTTF of *approximately* 788.4 seconds (it's actually worse). In other words, there is a probability of 1/e [roughly a third] that 788.4 seconds will elapse without any failures. Wouldn't it just be cheaper & easier to have a 20k-node computer and an application that runs for 1 hr instead of 10 minutes on 200k nodes?
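The arithmetic above can be reproduced with a short Python sketch. It assumes node failures are independent and exponentially distributed, which is the usual assumption behind a 1/e figure but is not spelled out in the comment. The function names `system_mttf` and `prob_no_failure` are made up for illustration, and the model ignores checkpoints, spares, and repair.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s, ignoring leap years


def system_mttf(node_mttf_s: float, nodes: int) -> float:
    """Mean time until the first failure among `nodes` independent nodes.

    Assumes exponentially distributed failures, so the failure rates add.
    """
    return node_mttf_s / nodes


def prob_no_failure(node_mttf_s: float, nodes: int, run_s: float) -> float:
    """Probability that no node fails during a run lasting `run_s` seconds."""
    return math.exp(-run_s * nodes / node_mttf_s)


node_mttf = 5 * SECONDS_PER_YEAR  # 157,680,000 s, the figure used above

for nodes in (2_000, 20_000, 200_000):
    mttf = system_mttf(node_mttf, nodes)
    survive = prob_no_failure(node_mttf, nodes, mttf)
    print(f"{nodes:>7} nodes: system MTTF {mttf:10.1f} s, "
          f"P(no failure within one MTTF) = {survive:.3f}")
```

Passing a different `run_s` to `prob_no_failure` lets you compare, under the same assumptions, a short run on many nodes against a longer run on fewer nodes.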
Yes, you could use 3/4 of the processors for active computation and keep the other 1/4 as hot spares, etc. But wouldn't it just be simpler to use fewer processors in the first place? I'm not even convinced there are applications that can be efficiently parallelized over 50k nodes, much less 200k nodes. When communication overhead and redundancies are taken into account, the utility of much more than a few thousand nodes starts to drop radically.
I've also noticed that those in the "supercomputer" field tend to have Computer Science or Physics backgrounds. These developers are more focused on obtaining exact results, which leads to very slow applications. I suspect there are very accurate (and fast) approximations for many of the calculations in their applications. They use distributed application frameworks (MPI) that are fairly low-level and rigid. This means complex applications that run (slowly but well) on 1k nodes may not even be scalable to 100k nodes.
In short, 200k nodes cannot be used efficiently for any meaningful amount of time. For long-running applications (a few hours), there is little need to use more than a few thousand nodes.
Aside: Don't intelligent people have anything better to do than to blow each other up?