HPE Announces World's Largest ARM-based Supercomputer (zdnet.com)
The race to exascale speed is getting a little more interesting with the introduction of HPE's Astra -- what will be the world's largest ARM-based supercomputer. From a report: HPE is building Astra for Sandia National Laboratories and the US Department of Energy's National Nuclear Security Administration (NNSA). The NNSA will use the supercomputer to run advanced modeling and simulation workloads for things like national security, energy, science and health care.
HPE is involved in building other ARM-based supercomputing installations, but when Astra is delivered later this year, "it will hands down be the world's largest ARM-based supercomputer ever built," Mike Vildibill, VP of Advanced Technologies Group at HPE, told ZDNet. The HPC system comprises 5,184 ARM-based processors -- the ThunderX2 processor, built by Cavium. Each processor has 28 cores and runs at 2 GHz. Astra will deliver over 2.3 theoretical peak petaflops of performance, which should put it well within the top 100 supercomputers ever built -- a milestone for an ARM-based machine, Vildibill said.
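For anyone who wants to sanity-check the "over 2.3 petaflops" figure, here is a rough back-of-envelope sketch in Python. The processor count, core count and clock come from the summary; the 8 double-precision flops per cycle per core is an assumption (two 128-bit fused multiply-add units per core), not something the article states.

```python
# Back-of-envelope theoretical peak for the figures quoted in the summary.
PROCESSORS = 5_184          # from the article
CORES_PER_PROCESSOR = 28    # from the article
CLOCK_HZ = 2.0e9            # 2 GHz, from the article
FLOPS_PER_CYCLE = 8         # ASSUMPTION: two 128-bit FMA units per core, double precision


def peak_flops(processors, cores, clock_hz, flops_per_cycle):
    """Theoretical peak = total cores x clock rate x flops issued per cycle per core."""
    return processors * cores * clock_hz * flops_per_cycle


total_cores = PROCESSORS * CORES_PER_PROCESSOR
peak = peak_flops(PROCESSORS, CORES_PER_PROCESSOR, CLOCK_HZ, FLOPS_PER_CYCLE)
print(f"{total_cores:,} cores, {peak / 1e15:.2f} petaflops peak")
# prints "145,152 cores, 2.32 petaflops peak"
```

Under that assumption the numbers line up with the quoted peak. Sustained performance on real workloads will be lower.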
There's lots of naivete in the "connect up bunches" part.
The supercomputer has far higher interconnect bandwidth and better latency than typically networked commercial servers.
There needs to be high-performance support (meaning assembly-level drivers in some cases) for the APIs used by the heavily multiprocessed workloads. Think about massive partial differential equation solvers where each gridpoint talks to its neighbors and updates at every timestep.
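To make the gridpoint chatter concrete, here is a toy sketch in plain Python of a 1D heat-equation solver split across simulated "ranks". Everything in it (the function names, the grid size, the coefficient r) is made up for illustration. A real code would pass the boundary values between nodes with MPI rather than list indexing, but the communication pattern is the same: every rank needs fresh values from its neighbors before every single timestep.

```python
# Explicit 1D heat equation split across simulated "ranks".
# Each rank owns a slice of the grid plus one ghost cell per side; the ghost
# cells are refreshed from the neighbours before every timestep (a halo exchange).

def update(padded, r):
    """Apply the 3-point stencil to every cell except the two end (ghost) cells."""
    return [padded[i] + r * (padded[i - 1] - 2.0 * padded[i] + padded[i + 1])
            for i in range(1, len(padded) - 1)]


def serial_run(interior, left_bc, right_bc, r, steps):
    """Reference run: the whole grid lives in one memory space."""
    u = list(interior)
    for _ in range(steps):
        u = update([left_bc] + u + [right_bc], r)
    return u


def decomposed_run(interior, left_bc, right_bc, r, steps, ranks):
    """Same computation, with the grid cut into `ranks` equal slices."""
    size = len(interior) // ranks  # assumes ranks divides the grid evenly
    chunks = [list(interior[k * size:(k + 1) * size]) for k in range(ranks)]
    messages = 0
    for _ in range(steps):
        padded = []
        for k, chunk in enumerate(chunks):
            # "Receive" one boundary value from each neighbour that exists;
            # the outermost ranks use the fixed boundary condition instead.
            if k > 0:
                left = chunks[k - 1][-1]
                messages += 1
            else:
                left = left_bc
            if k < ranks - 1:
                right = chunks[k + 1][0]
                messages += 1
            else:
                right = right_bc
            padded.append([left] + chunk + [right])
        # Only after all halos are filled can any rank advance a timestep.
        chunks = [update(p, r) for p in padded]
    return [x for chunk in chunks for x in chunk], messages


initial = [0.0] * 16
initial[8] = 1.0  # a single hot spot in the middle of the grid
reference = serial_run(initial, 0.0, 0.0, r=0.25, steps=50)
split, messages = decomposed_run(initial, 0.0, 0.0, r=0.25, steps=50, ranks=4)
worst = max(abs(a - b) for a, b in zip(reference, split))
print(f"max difference vs single-domain run: {worst:.1e}, messages: {messages}")
```

Even this toy with 4 ranks sends 6 messages per timestep, and no rank can move on until its messages arrive. Scale that to 100k processes and it is the latency of each exchange, not raw compute, that sets the pace.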
Conventional networked servers and their bad latency: http://www.scs.stanford.edu/~rumble/papers/latency_hotos11.pdf
You are naive. That's how you make a really crappy supercomputer.
This machine will have more than 100,000 cores. At that scale there are many things that must be carefully thought out. Even just _launching_ a job at 100k procs presents challenges (enough so that people who do it well put out press releases about it: http://mvapich.cse.ohio-state.... ). Beyond choosing the processor (obvious), here are some of the things that must be thought about / balanced:
1. Power - for machines this large you often have to make special deals with local utility companies to power it efficiently.
2. Cooling - The heat load will be immense; deciding how to cool it is incredibly important.
3. Interconnect - There are many options here (although fewer than in the past). Choosing e.g. Infiniband vs Ethernet, etc. comes with different tradeoffs and can depend on what your average application will be doing (many short messages vs large messages, etc.)
4. Switching - How many switches are needed? What topology will you use (fat-tree, hypercube, etc.)? It depends somewhat on how much you want to spend on switches and somewhat on what your typical application workload looks like.
5. RAM - RAM is currently incredibly expensive (thanks to cell-phones using so much of it!). How much RAM, what type, how fast can greatly tip the scales in price / performance
6. OS - Most of these machines these days run Linux - but there are many different flavors. Things get optimized all the way down to exactly which kernel version to use - and everything is hand-tuned.
7. Job Scheduler - Several options here, from PBS to Slurm and proprietary vendor-specific options. How good your job scheduler is can have a HUGE impact on the usability of the machine.
8. Filesystem - Most of these machines have at least two types of filesystems: "home" and "scratch"... where "home" can be something reliable - maybe even using NFS - and "scratch" is typically some high-speed filesystem (Lustre, Panasas, etc.). Choosing the balance between the two is critical. Note that 100,000 processes reading/writing simultaneously can bring even carefully crafted filesystems to their knees.
9. Local disk - for a long time it was in vogue to run a "diskless" system - but now "disks" are making a comeback (in the way of NV-RAM). Depending on what your applications look like this can provide huge speedups.
(I'm sure I missed something - but these are the big ones)
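To give a feel for the switching question in item 4, here is a rough sizing sketch in Python. It uses the textbook formulas for a full three-level fat-tree built from identical k-port switches (k^3/4 hosts, 5k^2/4 switches) and for a binary hypercube. The node count of 2,592 is an assumption (the 5,184 processors from the summary at two per node), and the function names are made up for illustration.

```python
def fat_tree(k):
    """Hosts and switches in a full three-level fat-tree of k-port switches (k even)."""
    hosts = k ** 3 // 4
    switches = 5 * k ** 2 // 4
    return hosts, switches


def smallest_fat_tree(nodes):
    """Smallest even switch radix k whose full fat-tree can hold `nodes` hosts."""
    k = 2
    while fat_tree(k)[0] < nodes:
        k += 2
    hosts, switches = fat_tree(k)
    return k, hosts, switches


def hypercube_dimension(nodes):
    """Dimension of the smallest binary hypercube with at least `nodes` vertices.

    The dimension is also the number of links per node and the worst-case hop count.
    """
    return (nodes - 1).bit_length()


NODES = 2_592  # ASSUMPTION: 5,184 processors at two per node
k, hosts, switches = smallest_fat_tree(NODES)
dim = hypercube_dimension(NODES)
print(f"fat-tree: {k}-port switches, {switches} switches, room for {hosts} hosts")
print(f"hypercube: dimension {dim}, {dim} links per node, up to {dim} hops")
```

Real procurements are messier than this: switches come in fixed port counts, and sites often oversubscribe the upper levels to save money, which is exactly the cost vs. workload tradeoff described above.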
Anyway: It's not simple. Purchasing for these machines typically takes at least a year just in the phase where you're defining the requirements and then another 6 months or so to put out bids and go through the selection process.
In case you're wondering - I do work in the national lab system, I use these machines daily and am part of procurement decisions for them...
The biggest difference between a simple compute cluster and a supercomputer is the speed of the interconnect. A compute cluster might have individually fast nodes, potentially decked out with RAM, but it's not going to be able to access the contents of any other node's memory effectively. So a big problem needs to be partitioned into slices that fit on a node.
Supercomputers have fast enough interconnects that multiple nodes can act as a single machine image. Nodes can read and write to the shared memory so they can access the global state of the computation. So you can model a trillion particles in a system rather than millions or billions.
I'm a loner Dottie, a Rebel.
Well - ease of programming for one thing.
With the death of Intel Phi... the HPC community really only has GPUs to offer good flops/watt. The problem with that? Not all workloads map well to GPUs and you often can't rewrite millions of dollars of software that doesn't use GPUs.
ARM offers another alternative: it can run anything an x86 processor can at better flops/watt.
The rise of ARM in HPC is _definitely_ an interesting development!
The flops/socket is still better than what BlueGene procs deliver - and I suspect the flops/watt will be a LOT better than the Xeon system you pointed out.
An exascale computer can't simply use 10M Xeons... you would need to build a small nuclear reactor next to it to power it. And while GPUs are useful for generating flops... not all workloads map well to them. These cores are general purpose: they can run anything a Xeon can run... but should use a lot less power.
These machines are still "distributed memory" supercomputers. It's rare to see a true "shared memory" cluster in HPC these days.
InfiniBand is built on RDMA (Remote Direct Memory Access) - but you wouldn't consider it to be "shared" memory (and programmers don't typically interact with the RDMA calls directly - most often they still use MPI... and MPI then uses RDMA to achieve the transfer).
That said: you are correct that interconnect is one of the things that makes a supercomputer "super". The price of the interconnect can be a significant percentage of the purchase price of the machine. The number of network cards, number of switches (and hence topology) and length of cabling all make a difference in the price... and the performance.
Game over for ARM designs from China. Supercomputers don't need that kind of risk. Adjust your investment portfolio as needed and lol at the crash that's due.
The cool part for me is learning there are 28-core ARM processors out there, which gives me hope that Apple's delays with the MacBook Air and Mac mini mean they're close to releasing their first-ever ARM-powered Macs running macOS.
#DeleteFacebook
It does raise the question: what exactly is a "super-computer"? The boundaries seem fuzzy to me, based on being tuned for an intent rather than on physical characteristics.
Table-ized A.I.