TACC "Stampede" Supercomputer To Go Live In January

Posted by samzenpus on Wednesday September 12, 2012 @05:58PM from the coming-soon dept.

Nerval's Lobster writes "The Texas Advanced Computing Center plans to go live on January 7 with "Stampede," a ten-petaflop supercomputer predicted to be the most powerful Intel supercomputer in the world once it launches. Stampede should also be among the top five supercomputers in the TOP500 list when it goes live, Jay Boisseau, TACC's director, said at the Intel Developer Forum Sept. 11. Stampede was announced a bit more than two years ago. Specs include 272 terabytes of total memory and 14 petabytes of disk storage. TACC said the compute nodes would include "several thousand" Dell Stallion servers, with each server boasting dual 8-core Intel E5-2680 processors and 32 gigabytes of memory. In addition, TACC will include a special pre-release version of the Intel MIC, or "Knights Corner" architecture, which has been formally branded as Xeon Phi. Interestingly, the thousands of Xeon compute nodes should generate just 2 teraflops worth of performance, with the remaining 8 generated by the Xeon Phi chips, which provide highly parallelized computational power for specialized workloads."

Why so little memory? by afidel · 2012-09-12 18:11 · Score: 4, Interesting

I wonder why it's got so little memory? You can easily run 64GB per socket at full speed with the E5-2600 (16GB x 4 channels) without spending that much money. Heck, for maybe 10% more you can run 128GB per socket (you need RDIMMs to run two 16GB modules per bank). They're apparently only running one 16GB DIMM per socket (any other configuration would be slower on the E5), which IMHO is crazy, as you're going to have a hard time keeping 8 cores busy with such a small amount.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
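The capacity arithmetic in the comment above can be sketched in a few lines. This is only an illustration: the function name and its parameters are hypothetical, and the four-channel count comes from the comment itself, not from a datasheet lookup.

```python
def socket_capacity_gb(dimm_gb, channels=4, dimms_per_channel=1):
    """Memory per socket for a uniform DIMM population (hypothetical helper)."""
    return dimm_gb * channels * dimms_per_channel

# One 16GB module on each of the 4 channels, as the comment suggests.
one_per_channel = socket_capacity_gb(16)

# Two 16GB modules per channel (the RDIMM case from the comment).
two_per_channel = socket_capacity_gb(16, dimms_per_channel=2)

# What the summary implies for Stampede: 32GB per dual-socket node.
stampede_per_socket = 32 / 2

print(one_per_channel, two_per_channel, stampede_per_socket)
```

Under these assumptions the two suggested configurations come out at 4x and 8x the per-socket memory the summary describes.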
1. Re:Why so little memory? by Taco+Cowboy · 2012-09-12 18:24 · Score: 2
  
  You can easily run 64GB per socket at full speed with the E5-2600 (16GB x 4 channels) without spending that much money. Heck, for maybe 10% more you can run 128GB per socket (you need RDIMMs to run two 16GB modules per bank).
  As TFA has put it:
  
  " ... the compute nodes would include "several thousand" Dell Stallion servers, with each server boasting dual 8-core Intel E5-2680 processors and 32 gigabytes of memory"
  I am guessing it might have something to do with the budget.
  
  From the way I look at it, they are populating each memory slot with 4GB of el-cheapo DDR3 DRAM, and that way they may be saving quite a bit of $$$ to buy more Dell servers.
  
  --
  Muchas Gracias, Señor Edward Snowden !
2. Re:Why so little memory? by Anonymous Coward · 2012-09-12 20:08 · Score: 2, Insightful
  
  That's 2GB per core, a fine amount for supercomputer problems requiring compute density and bandwidth. No virtualization there and the compilers, middleware and programmers are probably sufficiently educated to know how to split the problem.
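The 2GB-per-core figure follows directly from the node specs in the summary (32GB, dual socket, 8 cores per socket). A minimal sketch, with a hypothetical helper name:

```python
def memory_per_core_gb(node_memory_gb, sockets, cores_per_socket):
    """Node memory divided evenly across all cores (hypothetical helper)."""
    return node_memory_gb / (sockets * cores_per_socket)

# Stampede node as described in the summary.
stampede = memory_per_core_gb(32, sockets=2, cores_per_socket=8)

# The 64GB-per-socket configuration proposed at the top of the thread.
proposed = memory_per_core_gb(128, sockets=2, cores_per_socket=8)

print(stampede, proposed)
```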
3. Re:Why so little memory? by Shinobi · 2012-09-12 21:00 · Score: 2, Informative
  
  Esoteric? Nearly impossible to program for? Methinks you haven't read through the actual docs for it. You can use all the standard Intel tools to program for it, which are also MIC-aware, just like you program for a standard multi-core CPU. That includes the threading and math kernel libraries, as well as OpenCL if you want to go that route.
4. Re:Why so little memory? by loufoque · 2012-09-12 22:43 · Score: 3, Interesting
  
  You will be parallelizing, and each thread will only ever be able to use max_mem/N for its own processing.
  When you parallelize, you avoid sharing memory between threads. Your data set is split over the threads and synchronization is minimized. In an SMP/NUMA model, this is done transparently by simply not accessing memory that other threads are working on. In other models, you have to explicitly send the chunk of memory that each thread will be working on (through DMA, the network, an in-memory FIFO or whatever), but it doesn't change anything from a conceptual point of view.
  If your parallel decomposition is much more efficient when the data per thread is larger than 1GB, then you cannot possibly run 64 threads set up like this on the MIC platform. There is often a minimum size required for a parallel primitive to be efficient, and if that minimum size is greater than max_mem/N then you have a problem. This is the limiting factor I'm talking about.
  128 MB, however, is IMO quite large enough.
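The max_mem/N constraint described above can be written as a simple check. This is a sketch under an assumption: the 8 GiB of card memory is inferred from the 128 MB and 64-thread figures in the comment, not stated in the thread, and the function names are hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def per_thread_budget(total_bytes, n_threads):
    """Memory each thread can use if the data set is split evenly (max_mem/N)."""
    return total_bytes // n_threads

def decomposition_fits(total_bytes, n_threads, min_chunk_bytes):
    """True if every thread can hold the minimum chunk its primitive needs."""
    return per_thread_budget(total_bytes, n_threads) >= min_chunk_bytes

card_memory = 8 * GIB  # assumption: inferred from 128 MB x 64 threads
budget = per_thread_budget(card_memory, 64)

fits_small = decomposition_fits(card_memory, 64, 100 * MIB)
fits_large = decomposition_fits(card_memory, 64, 1 * GIB)

print(budget // MIB, fits_small, fits_large)
```

A primitive that needs 1GB per thread fails the check with 64 threads, which is the limiting factor the comment describes.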
  
  In fact this is a major advantage of MIC versus GPUs.
  The advantage of MIC lies in ease of programming thanks to compatibility with existing tools and the more flexible programming model.
  Memory on GPUs is global as well, so I have no idea what you're talking about. There is also so-called "shared" memory (CUDA terminology, OpenCL is different) which is per block, but that's just some local scratch memory shared by a group of threads.
  
  There is nothing nightmarish about the above
  Please stop distorting what I'm saying. What is nightmarish is finding the optimal work distribution and scheduling for a heterogeneous or irregular system.
  Platforms like GPUs are only fit for regular problems. Most HPC applications written using OpenMP or MPI are regular as well. Whether the MIC will be able to enable good scalability of irregular problems remains to be seen, but the first applications will definitely be regular ones.
5. Re:Why so little memory? by Nite_Hawk · 2012-09-12 23:24 · Score: 2
  
  The cynic in me says that you don't get into the Top 5 by spending all of your budget on memory. :)
  Practically speaking there are a lot of research codes out there that are using 1GB or less of memory per core. Our systems at MSI typically had somewhere between 2-3GB of memory per core and often were only using half of their memory or less. There's a good chance that TACC has looked at the kinds of computations that would happen on the machine and determined that they don't need more.
  We had another, much smaller cluster with significantly more memory per node that we tried to push big-memory people to use. They of course didn't like it, because they wanted to run on the big fancy glorious machine that gets mentioned in all of the press articles, even though their jobs weren't well suited to it. Such is the way academia works, though.
6. Re:Why so little memory? by gentryx · 2012-09-12 23:42 · Score: 3, Informative
  
  Agreed. 2 GB/core seems to be the current agreement on almost all machines except for IBM BlueGene which has just 1 GB per core.
  
  --
  Computer simulation made easy -- LibGeoDecomp
Summary: s/tera/peta/ by gentryx · 2012-09-12 18:16 · Score: 3, Informative

The summary mentions that 2 teraflops are generated by the CPUs while 8 are generated by the Knights Corner chips. It should say petaflops.

--
Computer simulation made easy -- LibGeoDecomp
Time For A New Supercomputer Metric by Jane+Q.+Public · 2012-09-12 18:20 · Score: 3, Insightful

"Petaflops" is not representative of the power of modern supercomputers, many of which use massively parallel integer processing to perform their duties. Sure, you can say that simulating floating point operations with the integer units amounts to the same thing, but it actually doesn't. We have discovered that there are a great many real-world problems for which parallel integer math works just fine, or even better (more efficient) than floating point. And for those, flops is a completely meaningless metric.

We need a standard that actually makes sense.
1. Re:Time For A New Supercomputer Metric by afidel · 2012-09-12 19:01 · Score: 2
  
  Well, as far as achievable computation goes, that's why Linpack reports Rmax and Rpeak. However, the one big area where Linpack is lacking as a yardstick for many real workloads is its small communications overhead: it's much easier to achieve high utilization on Linpack than it is on many other workloads.
  
  --
  There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
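The relationship between the Rmax and Rpeak figures mentioned above is just a ratio: measured Linpack performance over theoretical peak. A minimal sketch, where the numbers are purely illustrative and not measurements of any real system:

```python
def linpack_efficiency(rmax, rpeak):
    """Fraction of theoretical peak (Rpeak) achieved on Linpack (Rmax)."""
    if rpeak <= 0:
        raise ValueError("rpeak must be positive")
    return rmax / rpeak

# Illustrative values only, in petaflops; not taken from any TOP500 entry.
efficiency = linpack_efficiency(rmax=8.0, rpeak=10.0)

print(f"{efficiency:.0%}")
```

A communication-heavy workload on the same machine would typically sit well below this ratio, which is the commenter's point about Linpack being easy to run at high utilization.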
GDDR5 by PCK · 2012-09-12 18:21 · Score: 2

The Knights Corner chips use GDDR5 memory; bandwidth is a big problem when you have 50+ cores to feed.
Ah, memories! by hyades1 · 2012-09-12 18:27 · Score: 4, Funny

This reminds me of an old science fiction story. The designers, builders and programmers assemble. The Switch is flipped. The computer boots. The first question they ask is, "Is there a God?" The machine hums away for a few seconds, then arc welds the power switch open and responds, "There is now!"

--
I've calculated my velocity with such exquisite precision that I have no idea where I am.
Ooops. by PCK · 2012-09-12 18:31 · Score: 2

Ooops, scratch that, I misread the summary. There probably isn't a need for that much memory, because the kind of problems they are most likely to be dealing with will have massive datasets that don't fit in memory anyway. The limiting factor will be CPU and node interconnect bandwidth, so adding extra memory won't make much if any difference to performance.
Umm, No. by slew · 2012-09-12 18:52 · Score: 2

I'm pretty sure you are mistaken on this point.
Most modern supercomputers get their "flop" count from SSE3/4 and/or GPUs, which are not integer but floating-point processing machines (at least 32-bit single-precision FP, but also double precision, albeit at a slower rate). These machines most certainly do NOT simulate floating point with their integer units (nor cheat by counting an integer op as an approximate FP op), and they have massive amounts of dedicated hardware SIMD FP processing units to do their heavy lifting.
Of course there are many real-world problems that could use parallel integer math, and CPUs and GPUs are capable of lots of SIMD integer ops as well, but that's not how supercomputers are rated these days. They are rated by the number of IEEE FP operations (mostly FMA, or fused multiply-add, counting as 2 ops) with at least 32 bits of precision.
Integer ops don't count in the current ratings and I don't see that changing any time soon. Important scientific operations like matrix inversion, finite-element analysis, FFTs, and linear programming don't work the same with integer ops, so it is unfair to compare supercomputers by their integer ops.
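A rough sketch of how a theoretical peak rating is put together from FP operations per cycle. Two inputs are assumptions not stated anywhere in this thread: a 2.7 GHz base clock for the E5-2680, and 8 double-precision flops per cycle per core (a 4-wide AVX add plus a 4-wide AVX multiply issued together). The function name is hypothetical.

```python
def peak_flops(nodes, sockets, cores, clock_hz, flops_per_cycle):
    """Theoretical peak: every core retires flops_per_cycle FP ops each cycle."""
    return nodes * sockets * cores * clock_hz * flops_per_cycle

# One dual-socket, 8-core-per-socket node as described in the summary.
# Assumptions: 2.7 GHz clock, 8 double-precision flops/cycle/core.
node_peak = peak_flops(1, 2, 8, 2.7e9, 8)

# Nodes needed to reach the CPU share (2 petaflops, per the correction above).
nodes_for_2pf = 2e15 / node_peak

print(f"{node_peak / 1e9:.1f} GFLOPS per node, ~{nodes_for_2pf:.0f} nodes")
```

Under those assumptions the node count lands in the "several thousand" range the summary mentions. Integer throughput does not appear anywhere in the formula.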