SGI to Scale Linux Across 1024 CPUs

Press Release by foobsr · 2004-07-18 03:44 · Score: 3, Informative

The link to the press release as of July 14.

CC.

--
TaijiQuan (Huang, 5 loosenings)

Re:What happened to RISC? by DAldredge · 2004-07-18 03:53 · Score: 3, Informative

AMD and Intel happened. What do you think is running your computer right now (assuming it's an x86)? It a RISC chip that has x86 translater attached, the core of the chip is RISC.

Re:Solaris by justins · 2004-07-18 03:59 · Score: 4, Informative

Solaris is not a leader in supercomputing, never has been.

http://top500.org/list/2004/06/

There's no "stronghold" for Sun to lose.

--
Now before I get modded down, I be to remind whoever might read this that what I am saying is FACT. - bogaboga

Sun != scientific computing by vlad_petric · 2004-07-18 04:03 · Score: 4, Informative

Sun processors execute server workloads (database, app server) very well, but that's pretty much it. The emphasis with such workloads is on the memory system. Boatloads of caches do the job. It's also more effective to have tons of processors that are very simple, than just a couple of them that are complex and powerful.

Scientific computing means data crunching (floating point). Complex, powerful processors are needed. The "stupider, but more" tradeoff doesn't work anymore. Sun processors have fallen behind in this respect.

--

The Raven

It became obsolete by mangu · 2004-07-18 04:05 · Score: 3, Informative

RISC stands for "reduced instruction set computer". It made sense in the 1980's when the "CISC", complex instruction set computers, took tens or hundreds of clock cycles to execute some instructions. With RISC one had less instructions, but each instruction executed in less clock cycles, resulting in a faster computer. Today, CPU's with full-size instruction sets execute most of them as fast as a RISC CPU does, so there is no need to limit the instruction set anymore. Even such complex instructions as multyplying double-precision floating point numbers are executed in a single clock cycle in a Pentium 4.

Re:It became obsolete by Johan+Veenstra · 2004-07-18 04:55 · Score: 4, Informative

Actually RISC is a bad name for what it stand for, it should have been SISC (Simplified Instruction Set Computer), since the key difference between the two are the complexity of the instructions and not the quantity.

A CISC instruction could do things like: take the value in register BP, add 4, get the value from the memory at the address you just computed, add the value in the register AX, and put the result back at the same memory location. Execution would take several clock-ticks.

To do the same in RISC, you would need several instructions (add 4, get from memory, add ax, store to memory). The execution of the individual instructions would take one tick each, so the sequence would take several. But on average RISC was a bit faster.

CISC was invented in a time that the memory was small, in the CISC way you could store larger programs in the same amount of memory.

RISC was invented when memory-size was not limited anymore, and looked to displace CISC in the long run.

CISC was still around when the memory bandwidth became a limiting factor. And since fewer instructions needed to be fetched from memory, more bandwidth was left for other data traffic. RISC lost some of it's speed advantage.

Modern CISC processors, get CISC instructions from memory, chop them up in smaller instructions, and executes those smaller instructions really fast. So in fact they can be seen as RISC processors, posing as CISC processors, ie the best of both worlds.

So CISC is a way of compressing RISC instructions, so they take up less memory/bandwidth.
Re:It became obsolete by AKAImBatman · 2004-07-18 04:56 · Score: 3, Informative

With RISC one had less instructions, but each instruction executed in less clock cycles, resulting in a faster computer.

Technically, RISC chips were supposed to execute all instructions in ONE cycle. This simplified the chip architecture, allowing it to scale up much farther. The downside was that it put the onus on the compiler writer to produce efficient code. (MIPS is a perfect example of this architecture.) All he had to do was make sure that fewer instructions were executed per task, and the code would run faster.

That is, until the chip designers started introducing SuperScaler and Out of Order execution. You see, simplifying the chip design provided chip designers with a way to add new optimizations in how instructions were loaded and executed. Unfortunately, this again meant more work for the compiler writer. Now he not only had to optimize the number of instructions, but he also had to optimize the ordering so that multiple instructions could be executed simultaneously or out of order.

--
Javascript + Nintendo DSi = DSiCade

Re:In other news... by levram2 · 2004-07-18 04:08 · Score: 5, Informative

The limit for Windows Server 2003, Datacenter edition for 64 bit Itaniums is actually 64 processors and 512 GB RAM. http://www.microsoft.com/windowsserver2003/64bit/i pf/datacenter.mspx

Sun and/or IBM zseries hardware by r00t · 2004-07-18 04:20 · Score: 3, Informative

Linux runs on both of these, with official IBM support on the zSeries. On the IBM hardware, go ahead and swap out CPUs and memory. It's supported, today, with Linux.

The Sun hardware is more difficult to deal with, since there isn't a virtual machine abstraction. You can't do everything below the OS. Still, Linux 2.6 has hot-plug CPU support that will do the job without help from a virtual machine. Hot-plug memory patches were posted a day or two ago. Again, this is NOT required for hot-plug on the zSeries. IBM whips Sun.

I'd trust the zSeries hardware far more than Sun's junk. A zSeries CPU has two pipelines running the exact same operations. Results get compared at the end, before committing them to memory. If the results differ, the CPU is taken down without corrupting memory as it dies. This lets the OS continue that app on another CPU without having the app crash.

Re:Similar software available? by dwgranth · 2004-07-18 04:39 · Score: 5, Informative

well, sgi uses/hacks NUMA, spinlocks, etc to make this happen in a more efficient manner. We recently had a SGI rep come and explain their 512CPU architechture at our LUG meeting... and he basically said that SGI has their own implementation of all of the clustering/cpu stacking techs... which they will eventually feed back into the community.. all good stuff.. understandably they will wait for a year or so so they can get their money's worth before they release their changes.

Re:Advantages...? by myg · 2004-07-18 04:47 · Score: 4, Informative

Because a machine like that isn't about running Apahce or serving files.

The purpose of that computer is to solve complex scientific problems such as weather simulations, high-energy particle simulations, protine folding, etc. Many of these simulations involve iterated systems of equations that can take decades to solve on the fastest CPU's we have today.

The only way to get meaningful results in a meaningful amount of time is to break the problem apart into smaller problems and solve them in parallel.

Some projects, such as Folding@Home and Find-A-Drug go the distributed computing route -- use many disconnected systems to solve the problem.

The downside to that approach is that not all problems can be easily broken apart -- and some classes of problems can exist without tight coupling but they loose efficiency. The impressive thing about this particular super computer is that it has a single, unified memory image.

This is very useful for some classes of simulation problems when the entire simulation must be present for each iteration.

RISC overrated by TheLink · 2004-07-18 04:47 · Score: 3, Informative

It's ok for embedded and other areas (slower CPUs) but with desktop/server CPUs being much faster than memory speeds and remaining so for the forseeable future, having common and popular instructions being shorter than other instructions is actually an advantage despite the complexity that involves.

It's like having on-the-fly instruction decompression. e.g. CISC programs tend to be smaller in main memory+cache, and they travel in CISC/"compressed" form taking up less memory bandwidth over the memory/cache buses to the CPU instruction decoder where they are "decompressed" to RISC micro-ops to be executed.

Look at the mainstream desktop/workstation/server CPUs. Only the SPARC is RISC. IBM POWER/PowerPC is barely RISC[1], some people think it's more CISC than RISC. Itanium isn't RISC. x86 isn't. The rest (Alpha, MIPS, PA-RISC) are either out of the market or on their way out.

As long as CPUs are fast and much faster than RAM (and cache remaining expensive), it's often worth doing the compression/decompression thing.

[1] I believe IBM's POWER chips actually decode their "RISC" instructions to simpler instructions, some of their "RISC" instructions are pretty complex- kinda oxymoronic... But as I mentioned, that may not be such a bad thing.

--

Too many replies beneath your current threshold

Let me clue you in on a few things by justins · 2004-07-18 04:53 · Score: 4, Informative

You don't pick sun for just "lots of cpus", you pick it for a very scalable OS and amazing hardware that allows for a very, very solid datacenter.

The UNIX made by SGI (the company making the machine referenced in the article) is more scalable than Solaris. Remember, IRIX was the first OS to scale a single Unix OS image across 512 CPUs. And now they've eclipsed that, with Linux.

Sun hardware has additional, wonderful resiliency features like - allowing cpu's to "fail-over" to other cpus in case of failure.

None of that is unique to Sun.

Finally, since Sun has been doing the "lots of cpus" thing for many years, their process management and scalability tends to be much better.

Better than what? And says who? They've never decisively convinced the market that they're beter at this than HP, SGI, IBM or Compaq.

If downtime costs a lot (ie. you lose a lot of money for being down), you should have Sun and/or IBM zseries hardware. Unfortunately those features cost a lot and most times you can use Linux clustering instead for a fraction of the cost and a high percentage of the availability.

In addition to ignoring the other good Unix architectures out there in a dumb way with this comparison, you're also totally missing the point of the article. Linux supercomputing isn't just about cheap clusters anymore. Expensive UNIX machines on one side and cheap Linux clusters on the other is a false dichotomy.

--
Now before I get modded down, I be to remind whoever might read this that what I am saying is FACT. - bogaboga

Re:from MPI to multithreaded ? by Sangui5 · 2004-07-18 05:42 · Score: 4, Informative

Does this mean that the applications running on the "old" clusters, presumably using some flavor of MPI to communicate between nodes, will have to be ported somehow to become multithreaded applications ?

NCSA still has plenty of "old" style clusters around. Two of the more aging clusters, Platinum and Titan are being retired, to make room for newer systems like Cobalt. Indeed, the official notice was made just recently--they're going down tommorrow. However, as the retirement notice points out, we still have Tungsten, Copper, and Mercury (Terragrid). Indeed, Tungsten is number 5 on the Top 500, so it should provide more than enough cycles for any message-passing jobs people require.

So, anyone has any insights as to why/how this matters for the programmers ?

What it means is that programming big jobs is easier. You no longer need to learn MPI, or figure out how to structure your job so that individual nodes are relatively loosely-coupled. Also, jobs that have more tightly-coupled parallelism are now possible. The older clusters used high-speed interconnects like Myrinet or Infiniband (NCSA doesn't own any Infiniband AFAIK, but we're looking at it for the next cluster supercomputer). Although they provided really good latency and bandwidth, they aren't as high-performing as shared memory. Also, Myrinet's ability to scale to huge numbers of nodes isn't all that great--Tugsten may have 1280 compute nodes, but a job that uses all 1280 nodes isn't practical. Indeed, untill recently the Myrinet didn't work at all, even after partitioning the cluster into smaller subclusters.

This new shared-memory machine will be more powerful, more convienient, and easier to maintain than the cluster-style supercomputers. Hopefully it will allow better scheduling algorithms than on the clusters too--an appaling number of cycles get thrown away because cluster scheduling is non-preemptive.

I'd also like to point out some errors in the Computerworld article. NCSA is *currently* storing 940 TB in near-line storage (Legato DiskXtender running on an obscenely big tape library), and growing at 2TB a week. The DiskXtender is licenced for up to 2 petabytes--we're coming close to half of that now. The article therefore vastly understates our storage capacity. On the other hand, I'd like to know where we're hiding all those teraflops of compute--35 TFLOPS after getting 6 TFLOPS from Cobalt sounds more than just a little high. That number smells of the most optimistic peak performance values of all currently connected compute nodes. I.e. - how many single-precision operations could the nodes do if they didn't have to communicate, everything was in L1 cache, we managed to schedule something on all of them, and they were all actually functioning. Realistically, I'd guess that we can clear maybe a quarter of that figure, given machines being down, jobs being non-ideal, etc. etc. etc.

As a disclaimer, I do work at NCSA, but in Security Research, not High-Performance Computing.

Re:Scalability of applications by xtp · 2004-07-18 05:48 · Score: 5, Informative

SGI has had 512 and 1024-cpu MIPS-based systems in operation for more than 5 years. Much work was done on the Irix systems to initialize large parallel computations and provide libraries and compiler support for these configurations. One technique is to provide message-passing libraries that use shared memory. A better technique is to morph (slightly) parallel mesh apps so that each computational mesh node exposes the array elements to be shared with neighbors. No message-passing needed - you push data after a big iteration and then use the (really fast) sync primitives to launch into the next iteration. With shared-nothing clusters (i.e. Beowulf) a computation (and its memory) must be partitioned among the compute nodes. The improvement over a "classical" cluster can be startling especially with computations that are more communications-bound than compute-bound. This means there is no value for replacing a render farm with a big system. But there are big compute problems, e.g. finite element, for which the shared-nothing cluster is often inadequate.

With a single memory image system the computation can easily repartition dynamically as the computation proceeds. Its very costly (never say impossible!) to do this on a cluster because you have to physically move memory segments from one machine to another. On the NUMA system you just change a pointer. The hardware is good enough that you don't really have to worry about memory latency.

And let's not forget io. Folks seem to forget that you can dump any interesting section of the computation to/from the file system with a single io command. On these systems the io bandwidth is limited only by the number of parallel disk channels - a system like the one mentioned in the article can probably sustain a large number of GBytes/sec to the file system.

Let's not forget page size. The only way you can traverse a few TB of memory without TLB-faulting to death is to have multi-MByte-size pages (because TLB size is limited). SGI allowed a process to map regions of main memory with different page sizes (upto 64 MB I think) at least 10 years ago in order to support large image data base and compute apps.

When I used to work at SGI (5 years ago) the memory bandwidth at one cpu node was about 800 MBytes/s. My understanding is that the Altix compute nodes now deliver 12 GBytes/s at each memory controller. Although I haven't had a chance to test drive one of these new systems, it sounds like they have gradually been porting well-seasoned Irix algorithms to Linux. It is unlikely that a commodity computer really needs all of this stuff, but I'm looking at a 4-cpu Opteron that could really use many of the memory management improvements.

g

Re:from MPI to multithreaded ? by kscguru · 2004-07-18 06:33 · Score: 3, Informative

Caveat: I think MPI itself is very recent (standardized only w/in the past few years), before that everyone used custom message-passing libraries.

It's a tradeoff. MPI is "preferred" because a properly written MPI program will run on both clusters and shared-memory equally fast, because all communication is explicit. It's also much harder to program, because all communication must be made explicit.

Shared-memory (e.g. pthreads) is easier to program in the first place (since you don't have to think about as many sharing issues) and more portable. However, it is very error-prone - get a little bit off on the cache alignment or contend too much for a lock, and you've lost much of the performance gain. And it can't run it on a cluster without horrible performance loss.

If it's the difference between spending two months writing the shared-memory sim and four months writing the message-passing sim that runs two times faster on cheaper hardware, well, which would you choose? Is the savings in CPU time worth the investment in programmer time?

Alas, the latencies on a 1024-way machine are pretty bad anyway. If they use the same interconnect as the SGI Origin, it's 300-3000 cycles for each interconnect transaction (depending on distance and number of hops in the transaction). Technically that's low-latency... but drop below 32 processors or so, and the interconnect is a bus with 100 cycle latencies, so those extra processors cause a lot of lost cycles.

--

A witty [sig] proves nothing. --Voltaire

Not sure if your serious but lets explain. by SmallFurryCreature · 2004-07-18 07:24 · Score: 3, Informative

I will avoid the tech terms (partly because they would confuse you, partly because I don't know them all but mostly because they ain't needed.

A single CPU computer can execute ONE instruction at the time. Meaning one program thread running at the time. But wait you say, my OS can run multiple programs at the same time. WRONG. It can't. It is a trick. It is running one program at the time but it is switching the program it is running really fast. There is however a problem with this. When it has switched to a program all the other programs are effectevily at the the mercy of the program now running INCLUDING the OS. Wich is why DOS and Windows and Linux and Mac OS and all the others had "hangups". With an extremely well written OS these hangups (when a program doesn't switch back to the OS) can be avoided but it still remains a case that all the programs and the OS are fighting for time on 1 single cpu.

So what happens when you add a cpu? Well a lot less switching PLUS if a program for whatever reason does not switch properly the OS can still be run on the other processor. Just making a windows box a dual CPU instantly makes it far more robuust. I encountered this myself with an old dell P3 that had a dual board but no dual CPU installed. Before I added a second CPU it was the usual windows crap of hangs and reboots and BSoD. Afterwards it ran as stable as a unix machine. Simple things like openeing a complex folder in exploder no longer "froze" the desktop as it could simple run exploder on one CPU and say word or my mp3 player on the other.

Don't forget too that there think like ATA harddrives and CD-ROM need the cpu to drive them. This takes a lot of long cycles and a lot of waiting, not so much CPU power as just time on the CPU. With a second one to do all the other tasks this makes everything run far smoother.

So what is better? Running 1 2ghz cpu or 2 1ghz cpu's? Depends. If you are running 1 program thread go with the 1 cpu. It will take all the cpu time but will not need to share it. If however you are running countless small threads go with the 2 or more solution. Threads will have access faster and you will loose less cpu time on the time needed to execute switches.

Oh yeah that is another problem. Switching between programs takes cpu time as well. It is not unknown for single CPU systems to spend so much time on switching they don't have time to run anything anymore. The old to many running programs problem known from windows but wich affects every OS.

Lastly there is a simple problem. Say you want real power do you go for a quad 2ghz or a single 8ghz. Answer? It is a trick, no such thing as a 8ghz cpu.

If you get the chance buy a second hand dual P3 and install windows 2000+ or Linux on it and be amazed. That old system will respond a lot faster underload then your 3ghz monster.

--

MMO Quests are like orgasms:

You may solo them, I prefer them in a group.

Scalability of sorts by Decaff · 2004-07-18 08:58 · Score: 3, Informative

The UNIX made by SGI (the company making the machine referenced in the article) is more scalable than Solaris. Remember, IRIX was the first OS to scale a single Unix OS image across 512 CPUs. And now they've eclipsed that, with Linux.

Scalability is a complex issue. SGI has put a whole lot of processors together and put a single Linux image on it (so that a single program can use all memory), but this says nothing about how that setup will actually perform for general purpose use. Just because the hardware allows threads on hundreds of processors to make calls into a single Linux kernel, does not mean that there will not be major performance issues if this actually happens.

There are performance issues with memory even on single processor systems with nominally a single large address space, and a developer may need to put a lot of work into ensuring that data is arranged to make best use of the various levels of cache.

Many of the multi-processor architectures require even greater care to ensure that the processors are actually used effectively.

The fact that a single Linux image has been attached to hundreds of processors is no indication of scalability. A certain program may scale well, or not.

Slashdot Mirror

SGI to Scale Linux Across 1024 CPUs

18 of 360 comments (clear)