SGI to Scale Linux Across 1024 CPUs
im333mfg writes "ComputerWorld has an article up about an upcoming SGI machine, being built for the National Center for Supercomputing Applications, 'that will run a single Linux operating system image across 1,024 Intel Corp. Itanium 2 processors and 3TB of shared memory.'"
It seems that if they pull this off, one of the strongholds of Solaris (namely massively parallel computing) will have been conquered by Linux. I wonder how Sun are feeling at the moment?
Sun hardware has additional, wonderful resiliency features, such as allowing CPUs to "fail over" to other CPUs in case of failure. The same holds true for memory, network interfaces, etc. Solaris is aware of these hardware features and can "map out" the bad memory and CPUs on the fly (or allow swap-in replacements). The engineers can then replace the broken CPUs/memory/interfaces WITHOUT BRINGING THE MACHINE DOWN. This lends itself to an environment that can enjoy nearly 100% uptime. Finally, since Sun has been doing the "lots of CPUs" thing for many years, their process management and scalability tend to be much better.
I don't work for Sun, I'm just an SA who deals with both Solaris and Linux boxes. You don't pick Sun for just "lots of CPUs", you pick it for a very scalable OS and amazing hardware that allows for a very, very solid datacenter. If downtime costs a lot (i.e., you lose a lot of money for being down), you should have Sun and/or IBM zSeries hardware. Unfortunately those features cost a lot, and most times you can use Linux clustering instead for a fraction of the cost and a high percentage of the availability.
I have replaced Sun hardware/software combos in the core datacenter for many of our customers, and I can tell you that yes, Sun brings some amazing features to the table, most of which are there to serve old technology. Linux on simple CPUs delivers amazing price/performance: depending on the job, we see an average 3x to 4x performance increase for 25% of the cost. That means that if I were to spend the same, lifecycle-wise, on a Linux cluster as I would on a big Sun box like the 10K or 15K, I'd end up with 12x to 16x the performance of the Sun solution.
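For anyone who wants to check the arithmetic, here is a rough sketch. It assumes cluster performance scales linearly with money spent, which is an idealization and not something measured here; the function name is made up for illustration.

```python
def performance_at_equal_spend(speedup, cost_fraction):
    """Performance multiple if the whole big-box budget goes into the cluster.

    speedup:       cluster performance relative to the big box (e.g. 3 or 4)
    cost_fraction: cluster cost relative to the big box (e.g. 0.25)
    Assumption: performance scales linearly with spend.
    """
    return speedup / cost_fraction


# The 3x and 4x cases from the comment, both at 25% of the cost.
for speedup in (3, 4):
    multiple = performance_at_equal_spend(speedup, 0.25)
    print(f"{speedup}x at 25% of the cost -> {multiple:g}x at equal spend")
```

Real jobs rarely scale perfectly, so treat the result as an upper bound rather than a promise.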
The same tolerance for CPU, RAM, and other hardware failure is available on the Linux cluster, albeit in less graceful form. The magic spell goes like this: if I have 300 machines crunching my data, I can afford to lose a couple, and can afford to keep a few hot standbys.
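To put a number on "can afford to lose a couple", here is a back-of-the-envelope sketch. It assumes nodes fail independently with the same probability, and the 1% failure chance and 295-node requirement below are made-up illustrative values, not figures from any real cluster.

```python
from math import comb


def prob_enough_nodes(total, needed, p_fail):
    """Probability that at least `needed` of `total` nodes are still up.

    Assumption: each node fails independently with probability p_fail,
    so the number of failed nodes follows a binomial distribution.
    """
    max_failures = total - needed
    return sum(
        comb(total, k) * p_fail**k * (1 - p_fail) ** (total - k)
        for k in range(max_failures + 1)
    )


# Hypothetical example: 300 nodes, the job needs 295 of them,
# and each node has a 1% chance of being down at any given time.
print(prob_enough_nodes(300, 295, 0.01))
```

Correlated failures (a dead switch, a power loss in one rack) break the independence assumption, so this is optimistic for real datacenters.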
Of course, the massively parallel cluster architecture does not work for all applications, and in those cases you would look to use either openMosix or the (relatively expensive) SGI box mentioned in this article.
People who think they know everything are a great annoyance to those of us who do.