HPE Announces World's Largest ARM-based Supercomputer (zdnet.com)
The race to exascale speed is getting a little more interesting with the introduction of HPE's Astra -- what will be the world's largest ARM-based supercomputer. From a report: HPE is building Astra for Sandia National Laboratories and the US Department of Energy's National Nuclear Security Administration (NNSA). The NNSA will use the supercomputer to run advanced modeling and simulation workloads for things like national security, energy, science and health care.
HPE is involved in building other ARM-based supercomputing installations, but when Astra is delivered later this year, "it will hands down be the world's largest ARM-based supercomputer ever built," Mike Vildibill, VP of Advanced Technologies Group at HPE, told ZDNet. The HPC system comprises 5,184 ARM-based processors -- the ThunderX2 processor, built by Cavium. Each processor has 28 cores and runs at 2 GHz. Astra will deliver over 2.3 theoretical peak petaflops of performance, which should put it well within the top 100 supercomputers ever built -- a milestone for an ARM-based machine, Vildibill said.
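As a rough sanity check on the quoted figures, the peak number can be reproduced from the processor count, core count and clock speed. This is only a sketch: the value of 8 double-precision floating-point operations per core per cycle is an assumption (it is not stated in the report), and the function name `peak_petaflops` is made up for illustration.

```python
def peak_petaflops(processors, cores_per_processor, clock_hz, flops_per_cycle):
    """Theoretical peak = total cores x clock rate x FLOPs per core per cycle."""
    total_flops = processors * cores_per_processor * clock_hz * flops_per_cycle
    return total_flops / 1e15  # convert FLOP/s to petaflops


# Figures from the report: 5,184 processors, 28 cores each, 2 GHz.
PROCESSORS = 5184
CORES_PER_PROCESSOR = 28
CLOCK_HZ = 2.0e9

# ASSUMPTION (not in the report): 8 double-precision FLOPs per core per cycle.
FLOPS_PER_CYCLE = 8

total_cores = PROCESSORS * CORES_PER_PROCESSOR
peak = peak_petaflops(PROCESSORS, CORES_PER_PROCESSOR, CLOCK_HZ, FLOPS_PER_CYCLE)

print(f"total cores: {total_cores}")
print(f"theoretical peak: {peak:.3f} petaflops")
```

Under that assumption the result lands just above 2.3 petaflops, consistent with the "over 2.3" quoted above. A different per-cycle figure would scale the result proportionally.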
You are naive. That's how you make a really crappy supercomputer.
This machine will have more than 100,000 cores. At that scale there are many things that must be carefully thought out. Even just _launching_ a job at 100k procs presents challenges (enough so that people who do it well put out press releases about it: http://mvapich.cse.ohio-state.... ). Beyond choosing the processor (obvious) here are some of the things that must be thought about / balanced:
1. Power - for machines this large you often have to make special deals with local utility companies to power it efficiently.
2. Cooling - The heat load will be immense; deciding how to cool it is incredibly important.
3. Interconnect - There are many options here (although fewer than in the past). Choosing e.g. InfiniBand vs Ethernet, etc. comes with different tradeoffs and can depend on what your average application will be doing (many short messages vs large messages, etc.)
4. Switching - How many switches are needed? What topology will you use (fat-tree, hypercube, etc.)? It depends somewhat on how much you want to spend on switches and somewhat on what your typical application workload looks like.
5. RAM - RAM is currently incredibly expensive (thanks to cell-phones using so much of it!). How much RAM, what type, and how fast can greatly tip the scales in price / performance.
6. OS - Most of these machines these days run Linux - but there are many different flavors. Things get optimized all the way down to exactly which kernel version to use - and everything is hand-tuned.
7. Job Scheduler - Several options here from PBS to Slurm and proprietary vendor-specific options. How good your job scheduler is can have a HUGE impact on the usability of the machine.
8. Filesystem - Most of these machines have at least two types of filesystems: "home" and "scratch"... where "home" can be something reliable - maybe even using NFS - and "scratch" is typically some high-speed filesystem (Lustre, Panasas, etc.). Choosing the balance between the two is critical. Note that 100,000 processes reading/writing simultaneously can bring even carefully crafted filesystems to their knees.
9. Local disk - for a long time it was in vogue to run a "diskless" system - but now "disks" are making a comeback (in the form of NV-RAM). Depending on what your applications look like this can provide huge speedups.
(I'm sure I missed something - but these are the big ones)
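To give a feel for the switching question in item 4, here is a back-of-the-envelope sketch of how many hosts and switches a full (non-oversubscribed) fat-tree needs at two or three tiers. Everything specific here is an assumption for illustration: the function name `fat_tree_capacity`, the 36-port switch radix, and the node count of 2,592 (which assumes two processors per node, something not stated above). Real machines often use partially populated or tapered trees, so treat the numbers as upper bounds rather than a design.

```python
def fat_tree_capacity(radix, levels):
    """Return (max_hosts, switch_count) for a full fat-tree of identical switches.

    radix  -- ports per switch (must be even)
    levels -- number of switch tiers, 2 or 3
    """
    if radix % 2 != 0:
        raise ValueError("radix must be even")
    half = radix // 2
    if levels == 2:
        # radix leaf switches (half the ports face hosts), half spine switches.
        return radix * half, radix + half
    if levels == 3:
        # radix pods, each with half edge + half aggregation switches,
        # plus half*half core switches.
        return radix * half * half, radix * radix + half * half
    raise ValueError("only 2 or 3 levels are modelled here")


def levels_needed(nodes, radix):
    """Smallest tier count (2 or 3) whose full fat-tree can hold `nodes` hosts."""
    for levels in (2, 3):
        max_hosts, _ = fat_tree_capacity(radix, levels)
        if nodes <= max_hosts:
            return levels
    raise ValueError("more than 3 levels required")


# ASSUMPTIONS: 36-port switches; 2,592 nodes (two processors per node).
RADIX = 36
NODES = 2592

two_tier = fat_tree_capacity(RADIX, 2)
three_tier = fat_tree_capacity(RADIX, 3)
needed = levels_needed(NODES, RADIX)

print(f"2 tiers: up to {two_tier[0]} hosts using {two_tier[1]} switches")
print(f"3 tiers: up to {three_tier[0]} hosts using {three_tier[1]} switches")
print(f"tiers needed for {NODES} nodes: {needed}")
```

The jump from two tiers to three is where the switch bill grows quickly, which is why the choice depends on both budget and the typical communication pattern.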
Anyway: It's not simple. Purchasing for these machines typically takes at least a year just in the phase where you're defining the requirements and then another 6 months or so to put out bids and go through the selection process.
In case you're wondering - I do work in the national lab system, I use these machines daily and am part of procurement decisions for them...