SGI to Scale Linux Across 1024 CPUs
im333mfg writes "ComputerWorld has an article up about an upcoming SGI Machine, being built for the National Center for Supercomputing Applications, "that will run a single Linux operating system image across 1,024 Intel Corp. Itanium 2 processors and 3TB of shared memory.""
It seems that if they pull this off, one of the strongholds of Solaris (namely massively parallel computing) will have been conquered by Linux. I wonder how Sun are feeling at the moment?
Solaris scales to hundreds of processors out-of-the-box. Until the vanilla Linux kernel accepts these changes and scales, Solaris still has a big edge in this area.
Lame analogy: many people have demonstrated that they can hack their Honda Civic to outperform a Corvette; however, I can walk into a dealership and purchase the latter, which performs quite well without mods.
With the exception of the NUMA stuff, is there software available to re-create this? I'm not even sure what to search for; would this still be considered a "cluster"?
RISC and CISC offer no final advantage over the other, so the one that dominated is the one that was here first.
Quick examples: RISC uses less power because it has less logic? No, it needs to run at a higher frequency to maintain the same speed as a slower CISC.
RISC is easier to program? Depends on the person. A compiler can take good advantage of large, hardware-optimized instructions.
RISC easier to develop/manage? I'll say yes for RISC on this one. There's simply less logic on the chip, so fewer logic errors are possible. There's plenty more cache which can break, but broken parts can be fused off.
RISC is physically smaller? No. RISC needs a higher clock frequency because many more instructions need to be executed. The result of this is that a much larger instruction cache is needed on chip.
I don't remember every comparison, but it pretty much comes out that neither is better than the other. That being said, RISC is better than x86. Everything is better than x86. However, CISC vs RISC is much harder to judge. Having done x86, 68k, and MIPS, I must say that RISC is a pleasure.
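The clock-frequency argument in the examples above is basically the textbook relation: run time = instruction count x cycles per instruction / clock rate. Here is a rough sketch of it. The instruction counts, the CPI of 1.0 and the 1 GHz clock are made-up numbers for illustration, not measurements of any real chip, and real RISC and CISC designs usually differ in CPI too, which this toy holds equal.

```python
def execution_time(instruction_count, cycles_per_instruction, clock_hz):
    """Seconds to run a program: instructions x CPI / clock rate."""
    return instruction_count * cycles_per_instruction / clock_hz


def clock_needed(instruction_count, cycles_per_instruction, target_seconds):
    """Clock rate (Hz) required to finish within target_seconds."""
    return instruction_count * cycles_per_instruction / target_seconds


# Hypothetical numbers, chosen only for illustration.
cisc_time = execution_time(1.0e9, 1.0, 1.0e9)     # 1e9 instructions at 1 GHz
risc_clock = clock_needed(1.5e9, 1.0, cisc_time)  # 1.5x the instructions, same CPI (assumption)
print(risc_clock / 1.0e9)                         # clock in GHz needed to match the CISC run time
```

If the RISC part gets a lower CPI out of its simpler pipeline, the required clock drops accordingly, which is one reason the comparison is hard to call.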
Until the vanilla Linux kernel accepts these changes and scales, Solaris still has a big edge in this area.
I wouldn't be surprised to see these changes in the 2.8 kernel. And what will people do until then, I hear some people ask. I can tell you that right now very few people actually need to scale to 1024 CPUs. And that will probably also be true by the time Linux 2.8.0 is released. AFAIK Linux 2.6 does scale well to 128 CPUs, but I don't have hardware to test it, and neither do any of my friends. So I'd say there is no need to rush to get this into mainstream; the few people that need this can patch their kernels. My guess is that in the time from now until 2.8.0 is released, we will see fewer than 1000 such machines worldwide.
If someone buys one of these clusters from SGI, then it does scale "out of the box" as far as they're concerned.
A better retort would be "There's a world market for maybe 5 computers" by the IBM dude.
Claims are very difficult to make, and impossible to prove. However, putting a time limit on a claim is easy. 2.8.0 will be released in 05 or 06. Maybe we'll all have 1024-CPU boxes in 20 years, but in 20 months?
Hot swapping components sounds great, but what if the screwdriver slips out of the finger of the engineer and causes a short?
The systems I've seen that have hot-swap PCI cards have plastic partitions between the slots to prevent the cards from touching each other when hot swapping them.
I'm not sure why the hypothetical screwdriver would be in such a tech's hands. Many systems have non-screw means of retaining memory, PCI cards, CPUs and such.
Yes... I was an engineer for SGI for over 16 years (laid off about a year ago)... I have hot-swapped modules on a running system many, many times without problems. With the NUMAflex architecture, you have "modules" that house a separate set of CPUs, memory, power supply, etc. You shut the offending module down (after the OS has migrated all processes off of it, on the fly). After parts are replaced, you run a diagnostic off of your laptop/terminal, and bring the module back into the system (the OS "sees" the change on the fly and re-integrates the module). It works extremely well.
What happens if both pipelines make the same mistake because the L1-cache feeds them both the same corrupted data?
The point here is that if performance continues to grow like it is today, they will be selling these machines for $1,000 at Walmart in just 14 years. It will be about the same size as the computer you own now.
The problem with 1024 CPUs is much more than just the operating system. It is the mess of communication hardware needed to wire everything together. It is about special power feeds and air conditioning, and sometimes floor loading requirements.
Take a quick look at the end of this PDF. It talks about heat output and the need for 3-phase 240V power coming into this computer. It is not unusual to hire both an electrical and a cooling expert when you talk about installing one of these babies. Not for the home user, and never will be; however, identical compute power is coming in just 14 years, so get ready...
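For what it's worth, the 14-year figure can be sanity-checked with a bit of arithmetic. This is only a sketch under a loud assumption: that "identical compute power" means one box doing the work of 1024 of today's CPUs, ignoring memory, interconnect and price. The function names are mine.

```python
import math


def doublings_needed(factor):
    """How many times performance must double to grow by `factor`."""
    return math.log2(factor)


def implied_doubling_period_years(factor, years):
    """Doubling period implied by reaching `factor` growth in `years`."""
    return years / doublings_needed(factor)


# Assumption: 1024 CPUs today ~ a 1024x speedup needed in a single box.
print(doublings_needed(1024))                   # number of doublings
print(implied_doubling_period_years(1024, 14))  # years per doubling
```

Since 1024 is 2^10, the claim works out to ten doublings in 14 years, or one doubling roughly every 17 months.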
- SGI (at the time still called Silicon Graphics Inc) purchased Cray Research.
- Well before the purchase, Cray had a hand in developing and marketing Sun's larger machine, the "Super Dragon", sold by Sun as the 64SC, and referred to within Cray as the 64CS -- I'm probably messing up the number, but I do recall the difference in the letter ordering. :-)
- Prior to the purchase, Cray had completed the design for a new shared memory system based on a high speed switch and single image OS.
- Prior to the purchase SGI had already completed the design of the first NUMAflex systems, the Origin2000 and Onyx2.
- So after the purchase, the new merged Cray/SGI had two large SMP/NUMA systems, the Origin line and the Cray developed line. Since they didn't need two, they sold the Cray design to Sun, where it was marketed as the E10000. They also called the NUMA fabric on the Origin2K "CrayLink" even though Cray had little or nothing to do with its design.
- For a few years afterwards, there were a few within Cray CF (Chippewa Falls) that were somewhat bitter about SGI's decision to pawn off the E10000 design, pointing out repeatedly that Sun was selling plenty of E10000s...
If it matters, the HPC procurement I was involved in opted for the SGI, which was probably the correct decision. As unstable as the SGI hardware was, the sites I knew running E10000's for general HPC loads had far worse stability problems (though the E10K's undoubtedly better at running Oracle).
As far as the E10000 being NUMA or SMP -- depends on how you look at it. The Origin line used a bristled hypercube interconnect topology, so memory on the same node as a CPU was one hop thru the fabric, memory on another node connected to the same router was three hops, and memory on a distant node might be multiple router hops. The E10K (and I think the E15K) used a star topology where memory was either on the same bus as the CPU or was on another bus that had to go through the switch. So the Sun has basically two levels of memory latency, whereas the SGI could have many levels. The SGI is definitely NUMA, the Sun is either SMP or "slightly NUMA", or however you want to parse it.
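To make the "two levels vs. many levels" point concrete, here is a toy model that counts distinct hop counts as seen from one CPU. The one-hop and three-hop figures come from the description above. Everything else is an assumption for illustration: routers numbered as hypercube corners, two nodes per router, and one extra hop per router-to-router link. It is not a model of the real Origin or E10K hardware.

```python
def hypercube_hops(router_a, router_b):
    """Router-to-router hops in a hypercube: Hamming distance of the ids."""
    return bin(router_a ^ router_b).count("1")


def bristled_latency_levels(dimension, nodes_per_router=2):
    """Distinct hop counts seen from node 0 on router 0 (toy model)."""
    levels = set()
    for router in range(2 ** dimension):
        for node in range(nodes_per_router):
            if router == 0 and node == 0:
                levels.add(1)  # local memory: one hop (figure from the post)
            elif router == 0:
                levels.add(3)  # other node, same router: three hops (from the post)
            else:
                # Assumption: each router-to-router link adds one hop.
                levels.add(3 + hypercube_hops(0, router))
    return levels


def star_latency_levels():
    """Star topology: same bus, or across the central switch."""
    return {"same bus", "through switch"}


print(sorted(bristled_latency_levels(4)))
print(len(star_latency_levels()))
```

In this toy the number of levels grows with the hypercube dimension, while the star stays at two no matter how many boards hang off the switch.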
If you've never seen it, the tech papers on how the SGI NUMA systems work are worth reading. Build a fast 8-port crossbar chip (the "spyder chip"), then use it to glue CPUs, memory, and peripherals together. Keep a couple ports open, and you can glue the crossbars together in a fabric. Presto, you can now build a system with 200 CPUs or 100 PCI busses. Pretty cool, even if it was expensive, proprietary, and all the rest.
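The "keep a couple ports open" trick is easy to play with on paper. A hedged sketch, assuming the crossbars are wired as a plain hypercube (one fabric port per dimension) and that every remaining port takes a CPU, memory or I/O attachment. The 8-port figure is from the comment above; the wiring scheme and function name are my assumptions, not SGI's actual design.

```python
def hypercube_fabric(dimension, ports_per_crossbar=8):
    """Return (crossbar_count, endpoint_ports) for a toy hypercube fabric."""
    fabric_ports = dimension  # one link per hypercube dimension (assumption)
    if fabric_ports > ports_per_crossbar:
        raise ValueError("not enough ports to build this hypercube")
    crossbars = 2 ** dimension
    free_ports = ports_per_crossbar - fabric_ports
    return crossbars, crossbars * free_ports


for dim in range(5):
    crossbars, endpoints = hypercube_fabric(dim)
    print(dim, crossbars, endpoints)
```

Each extra dimension doubles the crossbar count but costs one attachment port per crossbar, so the number of things you can plug in grows a bit slower than the fabric itself.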