IBM unveils 64-way NUMA server; Promises Linux support
I just found this article at InfoWorld which talks about IBM releasing a 64-way NUMA-Q server. The interesting part is that IBM promises to release a version of Linux optimized for NUMA servers. What do you think about it?
Well, this isn't really an "IBM" system. It's a Sequent system which was far along in the design process when IBM bought them a year or so ago. There seems to have been a tremendous brain drain since the purchase (*), so this machine may be born as a legacy system.
(*) According to one of those drained brains, IBM didn't seem to have a clue what to do with them. Lacking any top-down direction, they tried to launch some bottom-up initiatives, which IBM management squashed.
- Old Man of the Mountain ---- "I want to disturb my neighbor"
So the cool thing about this announcement is it means Linux will be getting good, efficient NUMA support even sooner than expected! Which should help it compete favorably with NT and perhaps even Solaris on high-end servers.
"Freedom means freedom for everybody" -- Dick Cheney
Going for the geek families market, Rowenta unwrapped two new toasters -- a 4 toast QUAD-BURNER and an entry level 2-way energy saving version. "We hope to give all the toaster lovers out there a more satisfying experience, from bachelors to large grandma and grandpa and lot of grandchildren type families," said John Williams, who manages the R&D at Rowenta. Company officials also announced their intent to deliver a version of Linux optimized for QUAD-BURNER. "It sure makes sense to use open source OS to make open faced sandwiches," John remarked.
Rowenta is based in Standfort Offenbach in Germany.
J.
(mention Linux and get posted on /.!)
User numbers?
Bah. I remember before we had user accounts. I didn't like having /. putting cookies on my machine at the time, and it took Rob quite a while for there to be any advantage to user accounts other than having your name automatically attached to the post. I held out til the mid-3000s. My neighbor got one just over 1000, and both of us had started reading at the same time.
What are user numbers up to these days, anyway?
-- This and all my posts are in the public domain. I am a lawyer. I am not your lawyer, and this is not legal advice.
An actual NUMA-Q server is a (up to) 4-way Xeon box in a 4X rack cabinet.
From what I remember from the last time I visited Sequent (3-4 years ago), they had some NUMA-Q systems with multiple proc boards, each holding 4 PPros. I think the one they had running had 16 PPros in it. It was a refrigerator-sized box, too.
What really intrigues me though is the SMP boxes with mixed x86 architectures. And for that, I shall be ever impressed with Dynix/ptx
Cursed are those who use Sequent (IBM, whatever) products in buildings with no elevators.
Their hardware interconnect was an SCI (Scalable Coherent Interface) ring with a bandwidth of around 1 Gb/sec/link. This is not, IMHO, the best link to use today. In fairness to Sequent, SCI may have been the best thing available at the time. They started shipping in '96 or '97, so the R&D must have happened a few years earlier.
I also thought the multi-path I/O was pretty cool. It is a FibreChannel SAN (Storage Area Network) with multiple controllers, F-C switches, and EMC RAID boxes, with fail-overs that work a little like the Internet, rerouting around dead hardware.
--
"You've crossed my Line of Death!" "What? No! Where is it?" "Here in the fine print...."
NUMA is not really a clustering method. It's a way of addressing some of the drawbacks of large-scale SMP. You have quads (4 CPUs), and each quad has memory, and cache. Memory access is non-uniform, because special techniques are used when memory is accessed between quads. Think of it as a distributed shared memory computer, all in one computer.
Each quad has connections to other quads and IO buses.
Because the quads are actually separate, you can subdivide your machine and run different operating systems on different quads, yet they can share (at bus speeds) data between them, such as a fibre-channel array.
And Dynix is a good UNIX, too. It has its problems (like it's low on the port list of just about everything) but it runs all the GNU software I've tried on it and is very reliable.
No.
> /var has to be on the root partition
No.
> What is normally in var (logs and such) is in /usr/adm
vi /etc/syslogd.conf
> Most sysadmin tasks (such as adding a user) should not be done on the command line
Oh really? Seems to work here.
> You should use their menu system because it twiddels with bits on some internal-not-text database
Nonsense.
Dynix is a great OS. Sure, you need to spend a few hours installing bash, a decent sendmail, etc etc but that's no different from Solaris or AIX.
The new code has completely rewritten the locking so that there is no longer a single global lock, but separate discrete locks for each thing that needs them.
The impact of this, plus a few other IO related changes, should be that Linux will scale better than Solaris to large numbers of CPUs. Well, that's the theory anyway...
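To make the global lock versus per-resource lock distinction concrete, here is a minimal Python sketch. It is not kernel code: the class names, the `hammer` helper and the counter "resources" are all invented for illustration, and Python's interpreter lock means it shows the structure of the change rather than any real speedup.

```python
import threading


class GlobalLockCounters:
    """One lock guards every counter, like a single big global lock."""

    def __init__(self, names):
        self._lock = threading.Lock()
        self._counts = {name: 0 for name in names}

    def increment(self, name):
        # Every caller queues on the same lock, even for unrelated counters.
        with self._lock:
            self._counts[name] += 1

    def value(self, name):
        with self._lock:
            return self._counts[name]


class PerResourceLockCounters:
    """Each counter has its own lock, so unrelated updates never contend."""

    def __init__(self, names):
        self._locks = {name: threading.Lock() for name in names}
        self._counts = {name: 0 for name in names}

    def increment(self, name):
        # Only callers touching the same counter wait on each other.
        with self._locks[name]:
            self._counts[name] += 1

    def value(self, name):
        with self._locks[name]:
            return self._counts[name]


def hammer(counters, names, per_thread=1000):
    """Run one thread per name, each incrementing only its own counter."""

    def work(name):
        for _ in range(per_thread):
            counters.increment(name)

    threads = [threading.Thread(target=work, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {n: counters.value(n) for n in names}
```

Both classes give the same answers. The difference is only in who waits for whom: with the global lock, a thread updating one counter blocks a thread updating a completely different one, which is the contention that hurts as the CPU count grows.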
"NUMA" stands for "Non-Uniform Memory Architecture". It's one approach to dealing with system memory on a machine with a large number of processors.
The idea is that each processor module has its own dedicated RAM, which can be accessed both locally and remotely by other machines across the network. System memory as a whole is the aggregate of the local memory banks on all of the processor modules. While this is all in one physical address space, access time will vary depending on whether you're accessing a local or non-local bank. Hence, "Non-Uniform".
There are undoubtedly extensions to NUMA that do more complicated things with system memory; this is just the version that I was told about at university.
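The local/non-local distinction described above can be sketched as a toy model in Python. Everything here is an assumption made for illustration: the `NumaMachine` class, the contiguous layout of banks in the address space, and the cost numbers (arbitrary units, not measurements of any real machine).

```python
class NumaMachine:
    """Toy model: one physical address space built from per-node memory banks."""

    def __init__(self, nodes, bytes_per_node, local_cost=1, remote_cost=4):
        self.nodes = nodes
        self.bytes_per_node = bytes_per_node
        self.local_cost = local_cost      # assumed cost of a local access
        self.remote_cost = remote_cost    # assumed cost of a remote access

    @property
    def total_bytes(self):
        # System memory is the aggregate of all the local banks.
        return self.nodes * self.bytes_per_node

    def home_node(self, addr):
        """Return the node whose local bank holds this physical address."""
        if not 0 <= addr < self.total_bytes:
            raise ValueError("address outside physical memory")
        return addr // self.bytes_per_node

    def access_cost(self, cpu_node, addr):
        """Access time depends on whether the bank is local to the CPU."""
        if self.home_node(addr) == cpu_node:
            return self.local_cost
        return self.remote_cost
```

For example, on a 4-node machine with 1024 bytes per node, a CPU on node 0 pays the local cost for address 100 and the remote cost for address 2048, even though both are ordinary addresses in the same space.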
Actually, there's nothing special about the non-uniformity (NU) of access. You have that feature whenever you throw in caches, for example: the time to access a memory location depends on whether it's in cache or not. A Beowulf cluster is also a non-uniform access architecture: accessing your local memory is cheaper than accessing the memory on a remote machine (via some sort of message passing layer, such as MPI or PVM or RPC, for example). Network file systems also exhibit non-uniform access cost behavior: the speed at which you can access /net/machine/filename depends on whether "machine" is local or remote, and how far away it is.
So what makes a "NUMA machine" so special? It's the hardware cache coherence. That is, a cc-NUMA turns a (cheap) Beowulf cluster into a (relatively expensive) IBM machine.
Whenever access costs are non-uniform, algorithms must be tuned to be latency sensitive. If algorithms are not aware of this nonuniformity, then they'll run really slow!!
Why is hardware cache coherence silly? eheh (my opinion). If we need to make algorithms latency sensitive to take advantage of non-uniform access architectures, then we need explicit control of data movement throughout the system. But a cc-NUMA takes this control away from you.
Notice that your L2 cache effectively has hardware cache coherence. You can't really control what's in L2 cache. And neither do your algorithms need to be conscious of the existence of an L2 cache to perform correctly. To perform well, however, they do.
So that's what they call it. I wrote the numerical analysis code for my thesis research on a Kendall Square Research KSR-1; 32 nodes 20MHz each (yeah!), 32 MB each, UNIX derivative OS with Posix threads (mutexes, barriers, thread-private/public data). You treated it like an SMP with the additional knowledge that memory access had an affinity for the processor.
KSR went under, as I recall, but I always wondered what happened with the technique. Now I know that it is NUMA!
Well, before we get too excited, keep in mind that both companies support (or flip-flop between if you want to be less charitable) a wide variety of operating systems. This is SGI's third "bet the company" OS strategy in 5 years. IBM's wide support for Linux is far more interesting, but to a certain extent (the 390 comes to mind) it has the feel of a stunt. This is all great news for Linux, but it's still early in the game.
Using a standard platform like Linux that has developed an independent following will give both companies a definitive advantage over Sun and their Solaris platform.
Until there actually is anything resembling a "standard platform" for Linux, I don't think this is a serious point. There are already plenty of differences between (for example) Debian and Red Hat on the x86 platform alone, so it seems like a huge stretch to suggest that SGI and IBM/Sequent machines will provide a "standard platform" simply because they both have Linux-based kernels available. Again, this is all good for Linux, but don't set your expectations too high.
- Old Man of the Mountain ---- "I want to disturb my neighbor"
NUMA-Q is the Sequent technology. It is also cache coherent according to this paper, but the details are lacking. It does not appear to scale as well as SGI's NUMA, though.
A well-crafted lie appears unquestionable - Dama Mahaleo
Pah! 4GB RAM? Call that maxed out? From one of the DG AViiONs I'm using at work:
Fully laden, it'll take 32 CPUs and significantly more than 100GB of storage. The newer AV25000 takes up to 64 CPUs and 64GB RAM. I'm hoping that if IBM add NUMA support to Linux for the Sequent box, it'll help with getting it running on the DG NUMA boxen too...
"The invisible and the non-existent look very much alike." -- Delos B. McKown
Hi!
We already have 7, count'em SEVEN FIRST POSTS! I wonder if IBM's including a 64-way First Post server with their NUMA boxes...
--Joe--
Program Intellivision!
It's always a good thing when a company undertakes a major port of Linux to a new architecture. Remember, more eyes find more bugs, and these are VERY talented eyes that are going to be adapting and scrutinizing the kernel for the sorts of multiprocessing bugs that only show up in configurations with large numbers of CPUs.
Everyone wins.
I wonder if IBM will release patches for Linux plus the other patches/drivers needed to operate, and let Redhat/SuSE/Caldera/Turbo-Linux port their distributions - the same way they did with the S/390 and the RS/6000 - or will they create a whole distribution?
Hetz (Heunique)
I may be misunderstanding but, from what I understand NUMA is not strictly for multiple boxen. I was more under the impression it was the middle step between SMP and MPP clusters...
An actual NUMA-Q server is a (up to) 4-way Xeon box in a 4X rack cabinet. NUMA is software that lets a bunch of those share RAM and processor time. Sequent (IBM) recently overcame the old 64 processor limit on their NUMA implementation.
Maxed out, with the enterprise cabinet, 4GB of RAM and 100GB of storage, you're looking at hundreds of thousands of dollars.
Sequent's web site seems to be down right now.. (cough). Heh.
And, as to why you might want one, we average over 400 processes at any given time with a load avg of around, uhm, zero, on our production box. These things can take a LOT of abuse.
--
blue
i browse at -1 because they're funnier than you are.
Assuming that they've got this even close to right, managing a 64-CPU NUMA-Q system should be no more complicated than managing a 1 CPU NUMA-Q system. Until the sysadmin tools for Beowulfs get a hell of a lot more sophisticated, managing a 64-CPU Beowulf is going to be much more complicated than a 1-CPU Beowulf.
Again, assuming that they've gotten things right, programmers should be able to continue working with a model they know and have experience with. Applications have to be re-written (and frequently re-designed) from scratch to run on a Beowulf, and programmers need to use a totally different mindset. In theory, any application that works on an SMP should "just work" on a NUMA machine - possibly with a recompile. To really get peak performance, applications may well need some tuning, but that's certainly easier than rewriting.
I'll bet that when this machine ships, Oracle (just to pick a big example) will already be running on it. When will Oracle ship a Beowulf-aware version?
- Old Man of the Mountain ---- "I want to disturb my neighbor"
Other posters have done a pretty good job of discussing NUMA (ccNUMA is what SGI uses in the Origin machines).
Each Origin consists of node boards that each contain 2 CPUs and some RAM. The system can scale up to 512 node boards (1024 CPUs), but you obviously can't fit all of those CPUs in one Origin case (the little purple mini-fridge in the case of the Origin 2000). So, the CrayLINK is used to expand the CPU network topology beyond one box - it is basically an extremely large bandwidth short-range cable that connects Origin machines together to form one big cluster that is the equivalent of one box with 1024 CPUs.
Do you even know anything about perl? -- AC Replying to Tom Christiansen post.
The zoned memory management in 2.3 is a starting point for NUMA. It allows you to break the address space into segments which are treated differently (ie. Non-Uniformly).
SGI and IBM will have to cooperate, although maybe not officially. They'll each have teams adding features to the kernel and they'll talk to each other the same way all the other developers do. There's just no other way to do it.
It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail. - Abraham Maslow
I'm not sure how IBM/Sequent does it, but we (SGI) do it that way - the Origin 2000 and the follow-on to it have a board with processors, memory, I/O channels, and a "network link" to the intramachine memory network. These are linked together with a big ASIC, which we call the Hub. The Hub is linked to routers and the routers are connected in something approximating a hypercube. The latency to memory gets longer the further you are away and the more router hops you have to go. Our architecture scales to 512 (and perhaps 1024 in the future) processors. A future version will be based on IA64 and will run Linux also. I'm not sure if IBM needs routers as they only go to 64p.
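As a rough illustration of why latency grows with distance in a hypercube-like network, here is a small Python sketch. It assumes an ideal hypercube, where the hop count between two routers is the number of bits in which their IDs differ; the real topology is only described as "approximating" one. The function names and the nanosecond figures are made up for the example.

```python
def router_hops(src, dst):
    """Hops between two routers in an ideal hypercube.

    Each hop flips one bit of the router ID, so the distance is the
    Hamming distance between the two IDs.
    """
    return bin(src ^ dst).count("1")


def memory_latency(src, dst, base_ns=300, per_hop_ns=100):
    """Assumed latency model: a fixed base cost plus a cost per router hop."""
    return base_ns + per_hop_ns * router_hops(src, dst)
```

In an 8-router (3-dimensional) cube, router 0 is one hop from routers 1, 2 and 4 and three hops from router 7, so under this model memory behind router 7 is the most expensive for it to reach.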
So what do you have to do to make the OS run? The big thing is getting the OS to recognize holes in memory which may exist between nodes. Once you have done that, you can probably boot. To actually run well, you have to modify the memory allocation and scheduler (among other things) to try to keep jobs physically close to their memory. This gets *really* hard, especially when the job takes more than one node's worth of memory.
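A very simplified sketch of "keep jobs physically close to their memory", in Python. This is not how any real scheduler works; `best_node` and `remote_fraction` are hypothetical helpers, and the only inputs assumed are a list saying which node each of the job's pages lives on and a count of free CPUs per node.

```python
from collections import Counter


def best_node(page_homes, free_cpus):
    """Pick the node with a free CPU that holds the most of the job's pages.

    page_homes: list of node numbers, one entry per page the job uses.
    free_cpus:  dict mapping node number -> number of idle CPUs there.
    Returns None if no node has an idle CPU.
    """
    counts = Counter(page_homes)
    candidates = [node for node, free in free_cpus.items() if free > 0]
    if not candidates:
        return None
    # Prefer the most local pages; break ties by the lowest node number.
    return max(candidates, key=lambda node: (counts[node], -node))


def remote_fraction(page_homes, node):
    """Fraction of the job's pages that would be remote when run on node."""
    if not page_homes:
        return 0.0
    remote = sum(1 for home in page_homes if home != node)
    return remote / len(page_homes)
```

The hard case mentioned above shows up immediately: once a job's pages are spread over several nodes, even the best choice leaves a large remote fraction, so placement alone cannot fix it.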
One last thing I should point out is that there are two flavors of NUMA: regular NUMA and ccNUMA. In regular NUMA machines (the Cray T3D/T3E is the only one that comes to mind) you can get access to memory that is not on your node, but you will only get a snapshot of what is there at a given moment. In ccNUMA, the caches of all the processors are coherent so not only do you get a snapshot, but if you modify your copy, everyone will perceive that the modification happened at the same time. The T3E runs a separate copy of the OS on every processor partly because of the lack of the "cc" part. The Origin stuff can run a single kernel on the entire system and my group is doing software which breaks the machine up and runs separate kernel images in different pieces for reliability reasons. The images can then talk to each other over the memory interconnect at very high speeds.
Finally, 4GB of memory is nothing :) I regularly play with a machine that has 196 gigs of RAM (and 512 processors :) and it's nowhere near maxed out :)
Go Badgers! -- #include "std/disclaimer.h"
Ideally, yes, although to meet the strict definition of NUMA, it only needs dedicated RAM. That RAM doesn't have to be shared with other (remote) CPU blocks. All it needs is for the memory to not be equally available to all CPUs. See the following example from a DG AViiON:
As you can see, the memory is shared between each block of 4 CPUs, but it's not accessible by remote blocks. NUMA AViiONs have 3 basic memory types -- shared UMA, local NUMA and remote NUMA.
"The invisible and the non-existent look very much alike." -- Delos B. McKown
Does Linux currently have NUMA-aware memory management? The article states that IBM is developing their own patches for this, but it would be interesting to know what else is out there.
Optimizing for local vs. nonlocal memory accesses can have a big performance impact, as (if memory serves) Sun found a while back.
"it" being porting linux to a NUMA-Q.
(NUMA is a method of sharing CPU and RAM access across multiple boxen)
NUMA-Q's are x86 boxen. They have some really, really, really cool features. I do wonder if IBM plans to write drivers for the FibreChannel SCSI adaptors and etc that come standard with most NUMAs.
OTOH, there is noooo reason not to use dynix on a NUMA. It's included with the (MASSIVE) cost of the box, it's based on BSD, it's a nice OS with tons of kick-ass features, and it's symbiotically enmeshed into these servers.
Hey, I wonder if IBM is actually gonna write a NUMA layer for linux? I mean, if they don't then all you end up with is a buncha 4-way rack mounted linux boxes.. for $365,000 apiece.
One other thing, Sequent (now IBM) has the absolute best support I have ever seen. I have sent email to their web site about completely esoteric crap and had them call me back and get a dialogue open with the developer of whatever I was having trouble with. If you've got the cash and don't wanna deal with Solaris, DYNIX is the way to go.
--
blue, who is wearing his Sequent hat today.
i browse at -1 because they're funnier than you are.
Why would someone bother to use a NUMA-Q server for any application? Beowulf clustering (which IBM is also pushing) provides a significantly better price/performance ratio for most applications, and gives you a better interconnect fabric for interprocess communications. With prices starting at around $73,000 for a two processor NUMA-Q setup, you can buy a LOT of Celeron or Athlon systems, or even Alphas, using Myrinet for an interconnect. That will give you a much more robust solution, at a lower price point, and give you the flexibility to optimize the system for YOUR apps, not how the scheduler wants to distribute your data across a non uniform access time memory pool. And don't get me started about SANs, compared to a true cluster IO system...
Okay, I remember from one of my classes the difference between SMP and NUMA. I remember a very brief discussion on SGI's CC-NUMA, and how it was basically a switched network for processors to talk to different segmented areas of memory, so that not all processors are trying to access the same address space of memory and blocking each other's access to the memory bus.
Now, does anyone know what the differences are between SGI's CC-NUMA and IBM's NUMA-Q?
If it's for-profit but free, you're not the customer -- you're the product (e.g., the Slashdot Beta's "audience").
Linux does do multiprocessing, but not as well as Solaris and other commercial Unixes. Still, as the article says, IBM will be releasing a beta version that will work on it. I wonder how much of that code will make it into 2.4? (Or rather 2.5, probably...) 2.4 is supposed to have better SMP support than 2.2. If the fellas over at IBM make a version of Linux that supports 64-way processing, I would guess that there are direct modifications to the kernel, and many people have been posting here about how Linux needs better support for multiple (more than 4) processors.
(I mention the below because I KNOW it will come up in this thread...)
Kernel fragmentation? Possible, yes, but unlikely I think, because the code from this version that benefits most Linux platforms will eventually make it into the kernel anyway. Besides, even if it is a totally different version of Linux, how many people can afford the $73,000 price tag on one of these things?
Try to hack my 31337 firewall!
OS/2 Warp Server theoretically already supports 64 CPUs, and since it scales better than NT/2000, it should work great on these machines. Unfortunately, I can't get to www.sequent.com, so I can't tell if OS/2 is supported. Does anyone know?
And the men who hold high places must be the ones who start
To mold a new reality... closer to the heart
We are definitely working on an IA64/Linux version of this hardware that will scale to the same number of processors as the MIPS version, though it is unclear how far it will go as a single system image. However, we can split the thing up and run multiple kernel images that talk over shared memory in the same box.
Go Badgers! -- #include "std/disclaimer.h"
You don't have to imagine - check out ASCI Blue - it's not a Beowulf, but it is a cluster of 48 boxes where each box is an SGI Origin 2000 with 128 processors. It is pretty high up the Top 500 list of the world's fastest supercomputers :)
Go Badgers! -- #include "std/disclaimer.h"
Who would spend that kind of money on a system and then run Linux on it? If you've got the money to afford a system like that, you might as well shell out the extra few dollars and get an OS that can handle that many processors. I'm sure the people who buy these things aren't thinking, "Duh, it'd be real keen-o if I ran Linux on this thing. Golly gee, I wonder if I could play Quake3 on it."
Somehow I don't see a lot of cooperation between Big Blue and Small Purple on this one.
A well-crafted lie appears unquestionable - Dama Mahaleo