SGI to Scale Linux Across 1024 CPUs
im333mfg writes "ComputerWorld has an article up about an upcoming SGI Machine, being built for the National Center for Supercomputing Applications, "that will run a single Linux operating system image across 1,024 Intel Corp. Itanium 2 processors and 3TB of shared memory.""
Sweet, now we'll be able to run Doom3 at highest detail in *SOFTWARE*-rendering mode!
But does it run--crap. I mean what about a Beowulf--doh!
Damn you SGI!
Why not fork?
Yeah, but can it run Longhorn?
Intel's sales figures for Itanic^Hum CPUs more than doubled as a result.
--
"Open source is good." - Steve Jobs
"Open source is evil." - Microsoft
It seems that if they pull this off, one of the strongholds of Solaris (namely massively parallel computing) will have been conquered by Linux. I wonder how Sun is feeling at the moment?
Hmm, quite possibly.
"Proudly Posting Without Reading The Article"
Obviously this would be overkill for Doom3 (although I'd still like to have it in my apartment as a space heater/server)! OK, so it would be more than a space heater; I'd have to run my a/c 24/7/365.25, with all my windows open in the winter. But rendering would be sooooo sweet.
The link to the press release as of July 14.
CC.
TaijiQuan (Huang, 5 loosenings)
so does this mean KDE and Openoffice will finally run at decent speed?
No, you're going to need quantum computing for that.
...how easy it is to install printer and sound drivers?
Microsoft made a statement today reminding everyone that Windows Server 2003 can handle as many as 32 processors, at the same time even.
When shown the report about Linux running on 1024 processors, Gates purportedly responded, "32 processors ought to be enough for anybody."
Unknown host pong.
They decided it was too RISCy maybe?
Why not fork?
yes, according to the project leader "on this supercomputer, OpenOffice will finally *run* at decent speed, but waiting for the JVM to start up will still be a bitch" As for KDE, he stated "we're still waiting for the qt toolkit to initialize, but we're confident we can be fully logged in before August"
you should see the specs for longhorn's minimum install...
Join Team Mozilla #38050 Folding@home
AMD and Intel happened. What do you think is running your computer right now (assuming it's an x86)? It's a RISC chip with an x86 translator attached; the core of the chip is RISC.
They said Itanium cluster, not VAX cluster!
Well, this system is neither RISC nor CISC. Itaniums are VLIW. IIRC, it too does have an x86 translator somewhere, but they work far better with native code.
Scientific computing means data crunching (floating point). Complex, powerful processors are needed. The "stupider, but more" tradeoff doesn't work anymore. Sun processors have fallen behind in this respect.
The Raven
RISC stands for "reduced instruction set computer". It made sense in the 1980s, when "CISC" (complex instruction set) computers took tens or hundreds of clock cycles to execute some instructions. With RISC one had fewer instructions, but each instruction executed in fewer clock cycles, resulting in a faster computer. Today, CPUs with full-size instruction sets execute most of them as fast as a RISC CPU does, so there is no need to limit the instruction set anymore. Even such complex instructions as multiplying double-precision floating point numbers are executed in a single clock cycle in a Pentium 4.
Sun hardware has additional, wonderful resiliency features, like allowing CPUs to "fail over" to other CPUs in case of failure. The same holds true for memory, network interfaces, etc. Solaris is aware of these hardware features and can "map out" the bad memory and CPUs on the fly (or allow swap-in replacements). The engineers can then replace the broken CPUs/memory/interfaces WITHOUT BRINGING THE MACHINE DOWN. This lends itself to an environment that can enjoy nearly 100% uptime. Finally, since Sun has been doing the "lots of CPUs" thing for many years, their process management and scalability tend to be much better.
I don't work for Sun, I'm just an SA that deals with both Solaris and Linux boxes. You don't pick sun for just "lots of cpus", you pick it for a very scalable OS and amazing hardware that allows for a very, very solid datacenter. If downtime costs a lot (ie. you lose a lot of money for being down), you should have Sun and/or IBM zseries hardware. Unfortunately those features cost a lot and most times you can use Linux clustering instead for a fraction of the cost and a high percentage of the availability.
With the exception of the NUMA stuff, is there software available to re-create this? I'm not even sure what to search for; would this still be considered a "cluster"?
They bought HP's overstock of them for pennies on the dollar.
I wish I had that much disk space...
True. There are at least two different x86 emulators available. There is the HW one that is built in and the newer and faster IA-32 Execution Layer (currently only available for windows).
RISC and CISC offer no final advantage over the other, so the one that dominated is the one that was here first.
Quick examples: RISC use less power because it has less logic? No, it needs to run at a higher frequency to maintain the same speed as a slower CISC.
RISC is easier to program? Depends on the person. A compiler can take advantage of large instructions very well which are hardware optimized.
RISC easier to develop/manage? I'll say yes for RISC on this one. There's simply less logic on the chip so less logical errors possible. There's plenty more cache which can break but broken parts can be fused off.
RISC is physically smaller? No. RISC needs a higher clock frequency because many more instructions need to be executed. The result of this is that a much larger instruction cache is needed on chip.
I don't remember every comparison but it pretty much comes out that neither is better than the other. That being said RISC is better than x86. Everything is better than x86. However CISC vs RISC is much harder to judge. Having done x86, 68k, and MIPS I must say that RISC is a pleasure.
The Sun hardware is more difficult to deal with, since there isn't a virtual machine abstraction. You can't do everything below the OS. Still, Linux 2.6 has hot-plug CPU support that will do the job without help from a virtual machine. Hot-plug memory patches were posted a day or two ago. Again, this is NOT required for hot-plug on the zSeries. IBM whips Sun.
I'd trust the zSeries hardware far more than Sun's junk. A zSeries CPU has two pipelines running the exact same operations. Results get compared at the end, before committing them to memory. If the results differ, the CPU is taken down without corrupting memory as it dies. This lets the OS continue that app on another CPU without having the app crash.
"will run a single Linux operating system image across 1,024 Intel Corp. Itanium 2 processors..."
"The National Center for Supercomputing Applications will use it for research"
1. Make a system that generates more heat than a supernova.
2.Research a solution to global warming.
3. Profit!
SCO gained $715,776
I have replaced Sun hardware/software combos in the core datacenter for many of our customers, and I can tell you that yes, Sun brings some amazing features to the table, most of which are there to serve old technology. Linux on simple CPUs delivers amazing price/performance: depending on the job, we see an average 3x to 4x performance increase for 25% of the cost. That means that if I were to spend the same, lifecycle-wise, on a Linux cluster as I would on a big Sun box like the 10k or 15k, I'd end up with 12x to 16x the performance of the Sun solution.
The same functionality in terms of CPU and RAM (and other hardware) failure is available on the Linux cluster, albeit in less graceful form. The magic spell to invoke goes like this: if I have 300 machines crunching my data, I can afford to lose a couple, and can afford to have a few hot standbys.
Of course, the massively parallel architecture does not work for all applications, and in those cases you would look to use either OpenMOSIX or of course the (relatively expensive) SGI box mentioned in this article.
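For anyone checking the 12x to 16x figure above, here is the arithmetic as a small Python sketch. The 3x to 4x and 25% numbers come from the comment; the function name is made up for illustration, and the calculation assumes performance scales linearly with the number of boxes bought, which the last paragraph already warns is not true for every application.

```python
# Worked version of the price/performance claim.
# Assumption: buying N times as many machines gives N times the performance.

def performance_at_equal_spend(speedup: float, cost_fraction: float) -> float:
    """Relative performance if the whole big-box budget goes to the cheaper system."""
    # The same money buys 1 / cost_fraction copies of the cheaper system.
    copies = 1.0 / cost_fraction
    return speedup * copies

low = performance_at_equal_spend(3.0, 0.25)
high = performance_at_equal_spend(4.0, 0.25)
print(f"equal spend gives {low:.0f}x to {high:.0f}x the performance")
```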
People who think they know everything are a great annoyance to those of us who do.
Hot damn, this is one server that could survive a slashdotting.
...Right on the heels of this too.
The purpose of that computer is to solve complex scientific problems such as weather simulations, high-energy particle simulations, protein folding, etc. Many of these simulations involve iterated systems of equations that can take decades to solve on the fastest CPUs we have today.
The only way to get meaningful results in a meaningful amount of time is to break the problem apart into smaller problems and solve them in parallel.
Some projects, such as Folding@Home and Find-A-Drug go the distributed computing route -- use many disconnected systems to solve the problem.
The downside to that approach is that not all problems can be easily broken apart -- and some classes of problems can exist without tight coupling but they lose efficiency. The impressive thing about this particular supercomputer is that it has a single, unified memory image.
This is very useful for some classes of simulation problems when the entire simulation must be present for each iteration.
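A rough Python sketch of why a shared memory image helps with iterated simulations: each worker updates its own slice of a grid, but every cell needs its neighbours, so workers at a slice boundary read data that "belongs" to someone else. The smoothing rule, the function names and the use of threads are all assumptions for illustration (Python threads will not actually run this faster); the point is only that with one shared grid no data has to be shipped between workers.

```python
from concurrent.futures import ThreadPoolExecutor

def step_slice(grid, start, stop):
    """New values for grid[start:stop]; each cell averages itself and both neighbours."""
    n = len(grid)
    out = []
    for i in range(start, stop):
        left = grid[i - 1] if i > 0 else grid[i]
        right = grid[i + 1] if i < n - 1 else grid[i]
        out.append((left + grid[i] + right) / 3.0)
    return out

def parallel_step(grid, workers=4):
    """One iteration: all workers read the same shared grid, each computes one slice."""
    n = len(grid)
    bounds = [(w * n // workers, (w + 1) * n // workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda b: step_slice(grid, b[0], b[1]), bounds))
    # Stitch the slices back together in order.
    return [value for part in parts for value in part]

def serial_step(grid):
    """Reference version: the whole grid in one go."""
    return step_slice(grid, 0, len(grid))

grid = [0.0] * 8 + [90.0] + [0.0] * 8
print(parallel_step(grid) == serial_step(grid))
```

On a distributed setup the same update would need each worker to fetch its neighbours' edge cells over the network on every iteration, which is where the loss of efficiency comes from.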
It's ok for embedded and other areas (slower CPUs) but with desktop/server CPUs being much faster than memory speeds and remaining so for the foreseeable future, having common and popular instructions being shorter than other instructions is actually an advantage despite the complexity that involves.
It's like having on-the-fly instruction decompression. e.g. CISC programs tend to be smaller in main memory+cache, and they travel in CISC/"compressed" form taking up less memory bandwidth over the memory/cache buses to the CPU instruction decoder where they are "decompressed" to RISC micro-ops to be executed.
Look at the mainstream desktop/workstation/server CPUs. Only the SPARC is RISC. IBM POWER/PowerPC is barely RISC[1], some people think it's more CISC than RISC. Itanium isn't RISC. x86 isn't. The rest (Alpha, MIPS, PA-RISC) are either out of the market or on their way out.
As long as CPUs are fast and much faster than RAM (and cache remaining expensive), it's often worth doing the compression/decompression thing.
[1] I believe IBM's POWER chips actually decode their "RISC" instructions to simpler instructions, some of their "RISC" instructions are pretty complex- kinda oxymoronic... But as I mentioned, that may not be such a bad thing.
Pentium 4 reduces the CISC instructions to a series of RISC-like "microops" that, for the most complex of the bunch, can take hundreds of cycles to complete.
Well, we know that the kernel can be made to scale but what about the applications? The same issues the kernel had to face, the applications have to face also. For parallel computing you naturally try to avoid too much sharing by "parallelizing" the programs. For applications like databases, you are talking about a lot of sharing of a lot of data. Not all the techniques the Linux kernel used are available to the applications yet.
Fire up apache and then post a link to it here on slashdot. We love a challenge.
The UNIX made by SGI (the company making the machine referenced in the article) is more scalable than Solaris. Remember, IRIX was the first OS to scale a single Unix OS image across 512 CPUs. And now they've eclipsed that, with Linux.
None of that is unique to Sun.
Better than what? And says who? They've never decisively convinced the market that they're better at this than HP, SGI, IBM or Compaq.
In addition to ignoring the other good Unix architectures out there in a dumb way with this comparison, you're also totally missing the point of the article. Linux supercomputing isn't just about cheap clusters anymore. Expensive UNIX machines on one side and cheap Linux clusters on the other is a false dichotomy.
Now before I get modded down, I beg to remind whoever might read this that what I am saying is FACT. - bogaboga
That's almost enough to run Emacs!
Generally, bash is superior to python in those environments where python is not installed.
It's already out...in Japan
That much hard drive space rivals my porn collection! :O
35 TFLOPS is the peak performance number sitewide. Cobalt itself should be able to clear between 6 and 7, making it a much more modest 25ish place. There are rumours that a bigger cluster-style machine is in the works, once the issues with Tungsten (NCSA's biggest and #5 in the world) are ironed out.
"Quick examples: RISC use less power because it has less logic? No, it needs to run at a higher frequency to maintain the same speed as a slower CISC."
No. This is exactly wrong. G5s are a good example of this. They easily outperform P4s at the same clock speed, and it's the P4 which must run at the higher speed to compensate.
The overhead of supporting all the various instructions and addressing modes, as well as being able to fit the whole CPU on one die, were what made RISC a good choice in the past. Now, that overhead is dwarfed by other parts of the chip, and they're all running weird u-ops internally, so it makes little difference.
"RISC is easier to program? Depends on the person. A compiler can take advantage of large instructions very well which are hardware optimized."
Compilers are notorious for not utilizing esoteric opcodes. And when they do, there's almost never a significant performance advantage in doing so.
For example, none of the code I've ever tested with icc (one of the only compilers that can use weird opcodes on i386) has been more than about 5% faster than "gcc -Os -msse2", and a lot of it has been slower.
"RISC is physically smaller? No. RISC needs a higher clock frequency because many more instructions need to be executed. The result of this is that a much larger instruction cache is needed on chip.
RISC does generally need a larger cache, but it does not need a higher frequency.
"I don't remember every comparison but it pretty much comes out that neither is better than the other. That being said RISC is better than x86. Everything is better than x86. However CISC vs RISC is much harder to judge. Having done x86, 68k, and MIPS I must say that RISC is a pleasure."
Just use a compiler. Anything with a proper MMU will be good enough.
I rarely criticize things I don't care about.
The point here is that if performance continues to grow like it is today, they will be selling these machines for $1,000 at Walmart in just 14 years. It will be about the same size as the computer you own now.
The problem with 1024 CPUs is much more than just the operating system. It is a mess of communication hardware needed to wire everything together. It is about special power feeds and air conditioning, and sometimes floor loading requirements.
Take a quick look at the end of this PDF. It talks about heat output and the need for 3 phase 240V power coming into this computer. It is not unusual to hire both an electrical and a cooling expert when you talk about installing one of these babies. Not for the home user, and never will be; however, identical compute power is coming in just 14 years, so get ready...
"I just wanted to point out you mentioned GCC. Sadly GCC is about the worst compiler in existence for performance."
That was my point. A shitty compiler with moderate optimization settings is very close in performance to one of the top compilers out there.
"The top compiler is infact the Intel compiler in part because it knows about unpublished instructions. Have fun reading the code it generates."
Yes, this was the example I used. The vectorized loops are a bitch to read.
"On the subject of G5s being faster, there are a whole host of differences between G5's and P4's. You can't just pick one difference and claim that's the reason."
That's true. However, I never gave a reason for the performance difference, so I'm not sure why you're saying this.
You said that RISC CPUs needed to run at a higher frequency to get the same performance as a CISC CPU. Since you're wrong, I gave an example to prove you wrong.
There is basically only one RISC CPU architecture that has the benefit of a really large R&D effort these days, and that's POWER/PowerPC. Itanium is not strictly RISC, and nothing else has the benefit of such a huge R&D effort.
Thus, the only RISC CPUs that can be fairly compared to x86 are the POWER/PowerPC chips from IBM. The only two x86 CPUs that have a really huge R&D effort behind them are the Athlons from AMD and the Pentiums from Intel.
They all have relatively similar performance (with advantages going to one or the other in a few niches). PowerPC chips are shipped at similar clock speeds to the Athlons and much lower clock speeds than the Pentiums.
Therefore, your statement that RISC CPUs need higher clock speeds to get the same performance has been demonstrated to be false in a comparison between the only 3 large chip makers in operation.
Further comparisons, such as those between Sparc and the VIA C3, which are smaller but significant efforts, show the RISC CPU getting more done per clock cycle, again demonstrating your statement to be false.
I will avoid the tech terms (partly because they would confuse you, partly because I don't know them all, but mostly because they ain't needed).
A single-CPU computer can execute ONE instruction at a time, meaning one program thread running at a time. But wait, you say, my OS can run multiple programs at the same time. WRONG. It can't. It is a trick. It is running one program at a time, but it is switching the program it is running really fast. There is, however, a problem with this. When it has switched to a program, all the other programs are effectively at the mercy of the program now running, INCLUDING the OS. Which is why DOS and Windows and Linux and Mac OS and all the others had "hangups". With an extremely well written OS these hangups (when a program doesn't switch back to the OS) can be avoided, but it still remains the case that all the programs and the OS are fighting for time on 1 single CPU.
So what happens when you add a CPU? Well, a lot less switching, PLUS if a program for whatever reason does not switch properly, the OS can still be run on the other processor. Just making a Windows box a dual CPU instantly makes it far more robust. I encountered this myself with an old Dell P3 that had a dual board but no second CPU installed. Before I added a second CPU it was the usual Windows crap of hangs and reboots and BSoD. Afterwards it ran as stable as a Unix machine. Simple things like opening a complex folder in exploder no longer "froze" the desktop, as it could simply run exploder on one CPU and, say, Word or my mp3 player on the other.
Don't forget too that things like ATA hard drives and CD-ROMs need the CPU to drive them. This takes a lot of long cycles and a lot of waiting, not so much CPU power as just time on the CPU. With a second one to do all the other tasks, this makes everything run far smoother.
So what is better? Running 1 2 GHz CPU or 2 1 GHz CPUs? Depends. If you are running 1 program thread, go with the 1 CPU. It will take all the CPU time but will not need to share it. If however you are running countless small threads, go with the 2 or more solution. Threads will get access faster and you will lose less CPU time on the time needed to execute switches.
Oh yeah, that is another problem. Switching between programs takes CPU time as well. It is not unknown for single-CPU systems to spend so much time on switching that they don't have time to run anything anymore. The old too-many-running-programs problem known from Windows, but which affects every OS.
Lastly there is a simple problem. Say you want real power: do you go for a quad 2 GHz or a single 8 GHz? Answer? It is a trick, no such thing as an 8 GHz CPU.
If you get the chance, buy a second-hand dual P3 and install Windows 2000+ or Linux on it and be amazed. That old system will respond a lot faster under load than your 3 GHz monster.
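The "one fast CPU versus two slow ones" trade-off can be poked at with a toy round-robin model in Python. Everything here is an assumption made for illustration: the function name, the fixed time slice, the flat switch cost charged every time a CPU picks up a task, and the simplification that a task never runs on two CPUs at once. It is not how any real scheduler works, just a way to see the two cases from the comment.

```python
from collections import deque

def finish_time(work_units, cpus, speed, quantum=1.0, switch_cost=0.1):
    """Time at which the last task finishes in a toy round-robin scheduler.

    work_units: work per task; speed: work done per unit of time on one CPU.
    A CPU pays switch_cost each time it picks a task off the queue.
    """
    queue = deque((work, 0.0) for work in work_units)  # (remaining work, earliest start)
    cpu_free_at = [0.0] * cpus
    while queue:
        remaining, ready_at = queue.popleft()
        cpu = cpu_free_at.index(min(cpu_free_at))      # the CPU that frees up first
        start = max(cpu_free_at[cpu], ready_at)        # a task cannot overlap itself
        done = min(remaining, quantum * speed)         # work completed in this slice
        end = start + switch_cost + done / speed
        cpu_free_at[cpu] = end
        if remaining - done > 1e-12:
            queue.append((remaining - done, end))      # unfinished: back of the queue
    return max(cpu_free_at)

one_thread = [4.0]
many_threads = [1.0] * 8
for name, tasks in (("one thread", one_thread), ("eight threads", many_threads)):
    fast = finish_time(tasks, cpus=1, speed=2.0)
    dual = finish_time(tasks, cpus=2, speed=1.0)
    print(f"{name}: 1 fast CPU {fast:.1f}, 2 slow CPUs {dual:.1f}")
```

In this model the single thread finishes sooner on the one fast CPU, while the eight small tasks finish slightly sooner on the pair, because the switch costs are paid in parallel rather than one after another.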
MMO Quests are like orgasms:
You may solo them, I prefer them in a group.
I've been working all weekend to cluster 4 Honda Civics. When I'm done, I expect it to go 280MPH, get 12MPG and 0-60 in under 3 seconds.
"The UNIX made by SGI (the company making the machine referenced in the article) is more scalable than Solaris. Remember, IRIX was the first OS to scale a single Unix OS image across 512 CPUs. And now they've eclipsed that, with Linux."
Scalability is a complex issue. SGI has put a whole lot of processors together and put a single Linux image on it (so that a single program can use all memory), but this says nothing about how that setup will actually perform for general purpose use. Just because the hardware allows threads on hundreds of processors to make calls into a single Linux kernel, does not mean that there will not be major performance issues if this actually happens.
There are performance issues with memory even on single processor systems with nominally a single large address space, and a developer may need to put a lot of work into ensuring that data is arranged to make best use of the various levels of cache.
Many of the multi-processor architectures require even greater care to ensure that the processors are actually used effectively.
The fact that a single Linux image has been attached to hundreds of processors is no indication of scalability. A certain program may scale well, or not.
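One standard way to put numbers on "a certain program may scale well, or not" is Amdahl's law, which is not mentioned in the comment above but fits the argument: if a fraction p of a program's work can be spread over n processors and the rest is serial, the best possible speedup is 1 / ((1 - p) + p / n). A small Python sketch, with the parallel fractions picked arbitrarily for illustration:

```python
def amdahl_speedup(parallel_fraction: float, cpus: int) -> float:
    """Upper bound on speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cpus)

# Hypothetical programs with different amounts of serial work, all on 1024 CPUs.
for p in (0.50, 0.95, 0.99, 0.999):
    print(f"parallel fraction {p:.3f}: at most {amdahl_speedup(p, 1024):.1f}x")
```

Even a program that is 99% parallel tops out well under 100x on 1024 processors by this bound, and that is before counting lock contention, memory placement or cache effects.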
Being an administrator of some 24-way boxes, I have to ask a more detailed question about the error handling. Is the L2 cache in the CPUs just ECC'd, Parity, or fully mirrored? You'll find that on a large installation of CPUs, not being fully mirrored on your L2 will cause quite a bit of downtime over the course of a year with that many CPUs. I don't have those Itanium 2 specs. Anyone?
UPDATE: I looked. Itanium 2's L2 cache is ECC. It'll correct a 1 bit failure, detect and die on a 2 bit failure. Believe it or not, on a large number of CPUs running over a long period of time, it happens more often than you think. It also says it has an L3. No idea on the L3 cache protection method used. Because they don't say, I'd also guess ECC. Wheee! Lots of high speed RAM around the CPU with ECC protection. Well, nobody called this an enterprise solution, so I guess it's okay.
Also, you're going to have regular issues with soft ECC errors on that many TB of RAM. And then your eventual outright failures that'll bring down the whole image of the OS. (An OS could potentially handle it 'gracefully' by seeing if there is a userspace process on that page and killing/segfaulting it, but that's more of an advanced OS feature.)
Boy, I'd really hate to be the guy in charge of hardware maintenance on THAT platform.
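For readers wondering what "correct a 1 bit failure, detect a 2 bit failure" looks like in practice, here is a toy SECDED (single error correct, double error detect) code in Python: a Hamming(7,4) word plus one overall parity bit. This is only a sketch of the general technique. The function names are invented, and real cache ECC protects much wider words with a layout this example makes no attempt to match.

```python
def encode(nibble):
    """Encode 4 data bits (0..15) as 8 bits: index 0 is overall parity,
    indices 1..7 form a Hamming(7,4) word with parity at positions 1, 2, 4."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total number of 1s even
    return code

def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of set positions is 0 for a valid word
    parity_broken = sum(code) % 2 == 1
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips: treat as one flipped bit and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is nonzero: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble

word = encode(0b1011)
word[5] ^= 1                                # simulate a single soft error
print(decode(word))
```

The "detect and die" case in the comment corresponds to the 'uncorrectable' branch: the hardware knows the data is bad but cannot tell which bits to fix.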
I am happy to say that I have worked, and continue to work on the current state of the art:
http://www.ccs.ornl.gov/Ram/Ram.html
A few notes:
Linux kernel: 2.4.21-sgi240rp04051808_10074
From df, a 1 TB ram disk:
none 1023700704 0 1023700704 0% /dev/shm
From /etc/redhat-release:
Red Hat Linux Advanced Server release 2.1AS (Derry)
The machine is actually not nice to work on. It is prone to frequent short freezes (2-15 seconds long; about one every 2-3 minutes, although not evenly spaced out).