SGI to Scale Linux Across 1024 CPUs
im333mfg writes "ComputerWorld has an article up about an upcoming SGI Machine, being built for the National Center for Supercomputing Applications, "that will run a single Linux operating system image across 1,024 Intel Corp. Itanium 2 processors and 3TB of shared memory.""
The limit for Windows Server 2003, Datacenter edition for 64 bit Itaniums is actually 64 processors and 512 GB RAM. http://www.microsoft.com/windowsserver2003/64bit/i pf/datacenter.mspx
well, sgi uses/hacks NUMA, spinlocks, etc to make this happen in a more efficient manner. We recently had a SGI rep come and explain their 512CPU architechture at our LUG meeting... and he basically said that SGI has their own implementation of all of the clustering/cpu stacking techs... which they will eventually feed back into the community.. all good stuff.. understandably they will wait for a year or so so they can get their money's worth before they release their changes.
SGI has had 512 and 1024-cpu MIPS-based systems in operation for more than 5 years. Much work was done on the Irix systems to initialize large parallel computations and provide libraries and compiler support for these configurations. One technique is to provide message-passing libraries that use shared memory. A better technique is to morph (slightly) parallel mesh apps so that each computational mesh node exposes the array elements to be shared with neighbors. No message-passing needed - you push data after a big iteration and then use the (really fast) sync primitives to launch into the next iteration. With shared-nothing clusters (i.e. Beowulf) a computation (and its memory) must be partitioned among the compute nodes. The improvement over a "classical" cluster can be startling especially with computations that are more communications-bound than compute-bound. This means there is no value for replacing a render farm with a big system. But there are big compute problems, e.g. finite element, for which the shared-nothing cluster is often inadequate.
With a single memory image system the computation can easily repartition dynamically as the computation proceeds. Its very costly (never say impossible!) to do this on a cluster because you have to physically move memory segments from one machine to another. On the NUMA system you just change a pointer. The hardware is good enough that you don't really have to worry about memory latency.
And let's not forget io. Folks seem to forget that you can dump any interesting section of the computation to/from the file system with a single io command. On these systems the io bandwidth is limited only by the number of parallel disk channels - a system like the one mentioned in the article can probably sustain a large number of GBytes/sec to the file system.
Let's not forget page size. The only way you can traverse a few TB of memory without TLB-faulting to death is to have multi-MByte-size pages (because TLB size is limited). SGI allowed a process to map regions of main memory with different page sizes (upto 64 MB I think) at least 10 years ago in order to support large image data base and compute apps.
When I used to work at SGI (5 years ago) the memory bandwidth at one cpu node was about 800 MBytes/s. My understanding is that the Altix compute nodes now deliver 12 GBytes/s at each memory controller. Although I haven't had a chance to test drive one of these new systems, it sounds like they have gradually been porting well-seasoned Irix algorithms to Linux. It is unlikely that a commodity computer really needs all of this stuff, but I'm looking at a 4-cpu Opteron that could really use many of the memory management improvements.
g