BigTux Shows Linux Scales To 64-Way

Posted by timothy on Tuesday January 18, 2005 @03:28PM from the can't-let-you-do-that-dave dept.

An anonymous reader writes "HP has been demonstrating a Superdome server running the Stream and HPL benchmarks, which shows that the standard 2.6 Linux kernel scales to 64 processors. Compiling the kernel didn't scale quite so well, but that was because it involves intermittent serial processing by a single processor. The article also notes that HP's customers are increasingly using Linux for enterprise applications, and getting more interested in using it on the desktop..."
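
The kernel-compile result is the behaviour Amdahl's law describes: any serial portion of a job caps the achievable speedup, however many processors are added. A minimal sketch follows. The 5% serial fraction is an illustrative assumption, not a figure from the article, and `amdahl_speedup` is a name made up for this example.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the work cannot run in parallel."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# ASSUMPTION: 5% serial work is a hypothetical value chosen only for illustration.
SERIAL_FRACTION = 0.05

for cpus in (1, 4, 16, 64):
    bound = amdahl_speedup(SERIAL_FRACTION, cpus)
    print(f"{cpus:2d} CPUs -> at most {bound:.2f}x")
```

Under that assumed 5%, 64 processors give at most roughly a 15x speedup, which is why a build with single-processor phases can look poor on a machine that runs STREAM and HPL well.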

17 of 247 comments

excuse my ignorance by g0dsp33d · 2005-01-18 15:48 · Score: 2, Informative

I know Linux is pretty good from a security sense (compared to Windows, at least), and I'm not surprised to find it operates on exotic setups, but are there that many programs out there that support such a setup? Or ones that will actually benefit from this many processors? Or is the point of this system to develop custom business for their use? Or is it for a data server of some sort that can benefit from multiple cores answering requests?

--
lol: You see no door there!
1. Re:excuse my ignorance by AstroDrabb · 2005-01-18 15:58 · Score: 4, Informative
  
  There are still many uses for this many processors. Think of a monster DB. It is much easier to have more processors on your DB than to have many small systems and have to worry about syncing the data.
  Think about virtualization. I would love to have a 64-way system and break that up into 32 2-way systems or 16 4-way systems. It would make system management much easier. And with software, you can instantly assign more processors in a virtualized system to a server that is being hit hard. So your 4-way DB can turn into an 8-way or 16-way DB in an instant. Once the load is gone, you set it back to a 4-way DB.
  I personally still prefer to load balance many smaller servers to save costs. However, this could be an excellent option for some enterprises. I know where I work we have some big Sun boxes and we just add processors as we need. However, that has proven to be rather expensive and virtualizing could help save some big costs.
  
  --
  If Tyranny and Oppression come to this land,
  it will be in the guise of fighting a foreign enemy. -James Madison
2. Re:excuse my ignorance by PornMaster · 2005-01-18 17:02 · Score: 3, Informative
  
  There are plenty of them on the market, and as the price comes down, there will be even more.
  
  To whom do you think HP has been selling the SuperDome line? And to whom has Sun been selling the E10/12/15K?
  
  One of the benefits of using a huge multiprocessor Sun box, though, besides the massive numbers of CPUs you can have in a single frame running under a single system image, is the ability to dynamically reconfigure resources, like a few other posters have touched on.
  
  Imagine this... you have a box with 64 CPUs and 128GB of RAM. During the day, you have developers who are working with 16 CPUs and 32GB of RAM, working on the next generation of the database you'll be running for your business. A development domain.
  
  You have another domain of 16 CPUs and 32GB as a test domain. Like when stuff goes out to beta, you run tests on the stuff you've pushed out from your development copy to see if it's ready for prime-time.
  
  You have a third domain of 32 CPUs and 64GB in which you run production. It's a bit oversized for your needs for the work throughout the day, but it's capable of handling peak loads without slowing down.
  
  Then, you have a nightly database job that runs recalculating numbers for all the accounts, dumping data out to be sent to a reporting server somewhere, batch data loads coming in that need to be integrated into your database. Plus you have to keep servicing minimal amounts of requests from users throughout the night, but hey, nobody's really on between 10PM and 4AM.
  
  Wouldn't it be nice to drop the dev and test databases down to maybe 4 CPUs if they're still running minimal tasks, and throw 56 CPUs and 112GB of RAM at your nightly batch jobs? They get what's almost the run of the machine... until you're done with the batch jobs. Then you shrink production back to half the machine, and boost up the test and dev to a quarter each... so everyone's happy when the day starts.
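
The day and night layouts described above can be written down as two partition plans and checked against the size of the box. This is only a sketch of the bookkeeping, not how a Sun or HP domain manager is actually driven. `Domain` and `validate` are hypothetical names, and the night-time figures for dev and test (4 CPUs and 8GB each) are inferred from what is left over after production takes 56 CPUs and 112GB.

```python
from dataclasses import dataclass

TOTAL_CPUS = 64
TOTAL_RAM_GB = 128


@dataclass(frozen=True)
class Domain:
    name: str
    cpus: int
    ram_gb: int


def validate(plan):
    """Raise ValueError if a plan hands out more CPUs or RAM than the box has."""
    cpus = sum(d.cpus for d in plan)
    ram = sum(d.ram_gb for d in plan)
    if cpus > TOTAL_CPUS or ram > TOTAL_RAM_GB:
        raise ValueError(f"over-committed: {cpus} CPUs, {ram} GB")


# Daytime split taken from the post: a quarter, a quarter, and half.
DAY = [Domain("dev", 16, 32), Domain("test", 16, 32), Domain("prod", 32, 64)]

# ASSUMPTION: the post only says "maybe 4 CPUs" for dev/test at night;
# 4 CPUs and 8 GB each is simply the remainder once prod has 56 CPUs / 112 GB.
NIGHT = [Domain("dev", 4, 8), Domain("test", 4, 8), Domain("prod", 56, 112)]

for label, plan in (("day", DAY), ("night", NIGHT)):
    validate(plan)
    print(label, [(d.name, d.cpus, d.ram_gb) for d in plan])
```

Both plans use the whole machine exactly, so switching between them is a pure reshuffle of resources with nothing left idle.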
  
  --
  500GB of disk, 5TB of transfer, $5.95/mo
3. Re:excuse my ignorance by jd · 2005-01-18 17:12 · Score: 2, Informative
  
  An SSI cluster that supported roles for defining the distribution of tasks would probably be more cost-effective. You'd also need Distributed Shared Memory, though, and distribution of threads as well as processes.
  
  Having the entire engine on one multi-way motherboard wouldn't really gain you much, because none of the work you described needs tight interconnects.
  
  --
  It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
4. Re:excuse my ignorance by ppanon · 2005-01-18 17:54 · Score: 2, Informative
  
  You'd also need Distributed Shared Memory, though, and distribution of threads as well as processes.
  
  Right, and that's exactly the situation where a single honking box is going to kick on any kind of cluster that's connected more loosely than what you get with a high-cpu count multiprocessor box.
  
  It all depends on how much interdependence there is on memory access between threads/processes (i.e. how well you can partition your data set to match your cluster topology). Often, it's a lot cheaper for a company to buy a $200,000 box than to throw three top-level programmers at rewriting the problem for a cluster (assuming they have the source code and it's a problem that can be tackled that way). Of course, companies like Microsoft or Oracle or IBM can sell enough copies to make that development worthwhile, but because the market is fairly small, they don't get their usual 90% profit margin, even when they charge many times more than they charge for the non-clustered versions. The only reason some of them do it is competition for bragging rights on benchmarks like TPC.
  
  Having the entire engine on one multi-way motherboard wouldn't really gain you much, because none of the work you described needs tight interconnects.
  Say what? Databases don't need tight interconnects in a dynamic scaling environment? Are you planning on repartitioning that 180GB data set each day yourself or were you planning on hiring a handful of university interns to do it?
  
  --
  Laissez lire, et laissez danser; ces deux amusements ne feront jamais de mal au monde. - Voltaire
5. Re:excuse my ignorance by jd · 2005-01-18 19:36 · Score: 2, Informative
  
  No, databases don't need tight interconnects, unless the data is changing rapidly, relative to the number of queries.
  
  I'd personally expect to see a system where common views of the data were cached locally, where the "authoritative" database was accessed via a SAN rather than the processor network, and where interprocessor communication was practically nil. There's not a whole lot that different threads would need to send to each other.
  
  The whole point of SANs, "local busses" and other such technologies is to take all the heavy work off the lines that need to be highly responsive. It's generally better to have several specialized networks than one network that over-generalizes and is therefore not as good at any specific thing.
  
  --
  It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
6. Re:excuse my ignorance by Cajal · 2005-01-19 02:18 · Score: 2, Informative
  
  Just remember that almost no open-source databases use parallelized algorithms. PostgreSQL, Firebird and MySQL certainly don't. OpenIngres is the only one I know of with a parallel query engine. By this I mean the ability of a single query to use multiple processors (say, for handling a complex join and a large sort). The only way PG, FB and MySQL can use multiple CPUs is if you have multiple queries running. But for OLAP-style workloads, you won't see much benefit from SMP.
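
To make "a single query using multiple processors" concrete, here is a rough sketch of the shape of a parallel sort: split the rows, sort the pieces concurrently, then merge. It is not how any of the databases named above work internally, and `parallel_sort` is a made-up name. It also uses threads for brevity, which in CPython will not actually spread pure-Python sorting across CPUs because of the global interpreter lock, so treat it as a picture of the structure only.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def parallel_sort(rows, workers=4):
    """Sort a list by sorting chunks in worker threads and merging the results."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if not rows:
        return []
    chunk_size = -(-len(rows) // workers)  # ceiling division
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(sorted, chunks))
    # heapq.merge combines already-sorted inputs into one sorted stream.
    return list(heapq.merge(*sorted_chunks))


print(parallel_sort([5, 3, 9, 1, 7, 2, 8], workers=3))
```

An engine without this kind of intra-query parallelism runs the whole sort on one CPU, so extra processors only help when several independent queries arrive at once.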
Re:A little factoid for you by AstroDrabb · 2005-01-18 15:49 · Score: 5, Informative

A little factoid for you
Where did you get your facts from? You are way off champ or should I say troll?
While FreeBSD is a great OS/kernel, it doesn't scale as well as Linux, end of story.
Until you start talking about double that amount of procs, which is what Windows Server does these days
Huh? What crack are you smoking? Here is the comparison of MS's latest and greatest Windows 2003 server editions:
Web Edition supports up to 2-way.
Standard Edition supports up to 4-way.
Enterprise Edition supports up to 8-way.
Datacenter Edition supports up to 64-way.

So, umm, where is this double of what Linux supports? Plain vanilla Linux 2.6 can do 64-way no problem. Actually, SGI has had single image 128-way Linux systems out for a while. They should have a 256-way, single image Linux system out soon. That is more than MS can even touch. Maybe do some research before you just shoot off FUD.

--
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
Re:A little factoid for you by tetromino · 2005-01-18 15:51 · Score: 3, Informative

If it can scale to 16 procs well, it will scale to 64 procs well.
Until you start talking about double that amount of procs, which is what Windows Server does these days

Wrong. Windows Server 2003 supports a maximum of only 64 processors, and I believe it was significantly tested only on 32-way and smaller machines.
Re:A little factoid for you by AstroDrabb · 2005-01-18 16:02 · Score: 4, Informative

This is where reading TFM would help.
In the STREAM benchmark, memory bandwidth rose from 5GB/s with one 'cell' of four processors, to 10GB/s with two cells, and continued to double until all 64 processors -- or all 16 cells -- were switched on to provide 80GB/s of bandwidth.
The HPL benchmark, which is used to measure performance when solving large linear equations, produced similar results, rising from 18 gigaflops with one cell of four processors to 277 gigaflops with all 16 cells, or 64 processors, running.
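
Working those quoted figures through gives the scaling efficiency directly. The sketch below only does the arithmetic on the numbers in the two paragraphs above; `scaling` is a hypothetical helper name, and "efficiency" here just means measured speedup divided by the increase in cell count.

```python
def scaling(base_value, full_value, base_cells, full_cells):
    """Return (speedup, efficiency) going from base_cells to full_cells."""
    speedup = full_value / base_value
    ideal = full_cells / base_cells
    return speedup, speedup / ideal


# STREAM: 5 GB/s on 1 cell, 80 GB/s on 16 cells (figures quoted above).
stream_speedup, stream_eff = scaling(5.0, 80.0, 1, 16)

# HPL: 18 gigaflops on 1 cell, 277 gigaflops on 16 cells (figures quoted above).
hpl_speedup, hpl_eff = scaling(18.0, 277.0, 1, 16)

print(f"STREAM: {stream_speedup:.1f}x speedup, {stream_eff:.0%} of linear")
print(f"HPL:    {hpl_speedup:.1f}x speedup, {hpl_eff:.0%} of linear")
```

So STREAM bandwidth scales linearly across 16 cells, and HPL comes in at about 15.4x, or roughly 96% of linear.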

--
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
Re:A little factoid for you by pantherace · 2005-01-18 16:03 · Score: 2, Informative

NASA's Columbia cluster: 512-way SGI machines running Linux (actually 20 of them...). Not to mention "Columbia's record results were achieved running the LINPACK benchmark on 8,192 of the NASA supercomputer's 10,240 processors. Columbia also achieved an 88 percent efficiency rating on the LINPACK benchmark, the highest efficiency rating ever attained in a LINPACK test on large systems." from http://www.sgi.com/company_info/newsroom/press_releases/2004/october/worlds_fastest.html
Re:Threads vs. Processes by Anonymous Coward · 2005-01-18 16:05 · Score: 2, Informative

Nice thing about processes is that they do not share memory. As such, the processes will be localized, as would all the memory access. OTOH, if you had just ONE big process loaded with nothing but threads, you would likely find the memory backplane going into high gear as data would be moved around a bit.
Re:Interesting. by Anonymous Coward · 2005-01-18 16:36 · Score: 5, Informative

The problem is that most resources (memory, the bus, disks, etc) can only be used by one CPU at a time. So, for problems which are resource-intensive, you're generally better to cluster than to use SMP, so that each processor has its own bus, memory, etc.

No, you have a misconception. On these REAL big iron systems, each CPU (or each few CPUs) does have its own busses, memory, and io busses.

So in that regard it is as good as a cluster, but then add the fact that they have a global, cache coherent shared memory and interconnects that shame any cluster.

The only advantage of a cluster is cost. Actually redundancy plays a role too, although less so with proper servers, as they have redundancy built in, and you can partition off the system to run multiple operating systems too.

To be efficient, the processors would need gigantic caches, to keep the load on the rest of the system down. Either that, or you COULD run the CPUs out of step over a bus that is 64 times faster than normal. I'd hate to be the person designing such a system, though.

Now, this system could be of extreme interest in the supercomputer world. One of the biggest complaints about clustering is the poor interconnects. This would seem to get round that problem. A Blue Gene-style cluster where each node is a 64-way SMP board, and you're running a few thousand nodes, would likely be an order of magnitude faster than anything currently on the supercomputer charts.

Not really. Check the world's second fastest supercomputer. It is a cluster of 20 512-way IA64 systems running Linux.
Re:The SGI Altix is scaling to 256 cpus... by stratjakt · 2005-01-18 17:14 · Score: 5, Informative

This is an unmodified stock 2.6 kernel (well it's patched with stuff that's in distros, and will be in the next kernel). Out of the box, it detected the NUMA set up, memory partitions, the whole bit.

The SGI boxes are nothing like the stock kernel.

--
I don't need no instructions to know how to rock!!!!
Re:Threads vs. Processes by AstroDrabb · 2005-01-18 17:32 · Score: 3, Informative

First of all, Windows does not have very efficient threads. OK, compared to Linux they might be good

Linux is nowhere close to scaling its threads up to 64 processors.
Dude, what crack are you smoking? Have you used any _recent_ Linux thread? LinuxThreads is an implementation of the Posix 1003.1c thread package.
Unlike other implementations of Posix threads for Linux, LinuxThreads provides kernel-level threads: threads are created with the new clone() system call and all scheduling is done in the kernel.
The main strength of this approach is that it can take full advantage of multiprocessors. It also results in a simpler, more robust thread library, especially w.r.t. blocking system calls.

Oh, and if you think the latest implementation of Linux threads is slower, especially slower than MS Windows, you are an idiot. Here are some tests from IBM. Current Linux threads were spawning at more than 10,000 PER SECOND while MS Windows was spawning barely 6,000. Linux Thread performance, scroll down to the "pretty" graphs. Oh, and these numbers are higher than Solaris. Linux threads and Linux processes spawn _MUCH_ faster than the best MS has to offer and faster than Solaris.
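
For anyone who wants a rough feel for thread spawn rates on their own box, a small timing loop is enough. This is not the IBM test and is not comparable to its numbers: it measures Python-level create, start and join, which includes interpreter overhead on top of the underlying kernel thread creation. `spawn_rate` and the default count of 2000 are assumptions made for this sketch.

```python
import threading
import time


def spawn_rate(count=2000):
    """Create, start and join `count` no-op threads one at a time; return threads per second."""
    start = time.perf_counter()
    for _ in range(count):
        worker = threading.Thread(target=lambda: None)
        worker.start()
        worker.join()
    elapsed = time.perf_counter() - start
    return count / elapsed


print(f"{spawn_rate():.0f} threads/second")
```

Results vary a lot with hardware, kernel and Python version, so the useful comparison is between runs on the same machine rather than against the figures quoted above.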

--
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
Re:Threads vs. Processes by tetromino · 2005-01-18 18:10 · Score: 5, Informative

Have you used any _recent_ Linux thread? LinuxThreads is an implementation of the Posix 1003.1c thread package.

Dude, get with the times, LinuxThreads are obsolete. Kernel 2.6 / glibc 2.3 use NPTL, which launches new threads four times faster than LinuxThreads, allows you to have more than 8192 threads per process, doesn't require you to have lots of manager threads that don't do anything useful, delivers signals to threads as opposed to processes, and is actually more-or-less POSIX compliant.

I've been using NPTL on my workstation for 12 months, and I haven't looked back (except when early versions of Mono were incompatible with NPTL). You talk about "any _recent_ Linux thread" - but it looks like you are using a Debian Woody...
Re:Interesting. by Anonymous Coward · 2005-01-18 18:58 · Score: 2, Informative

Global shared-memory can be done on OpenMOSIX, using the Migshm extension, which provides you with Distributed Shared Memory.

There is a world of difference between emulating it with the operating system / programming environment, and having hardware cache coherent global shared memory.

The Altix uses 4-way CPU "bricks", along with networking and memory bricks, which you can then use to assemble a system. Yes, resources are visible globally, and it is a LOT faster than a PoP (pile-of-pcs) cluster using ethernet, but it is still a cluster of 4-way nodes.

No it is not. The big difference is that it isn't just "networking" them any more than 2 CPUs on an SMP motherboard are networked. It is a specialty interconnect with higher bandwidth and lower latency than you'll find in anything you think of as a network. It also directly carries the cache directory protocol on the wire rather than TCP packets.

It is not a cluster. If you think it is then you either don't know what a cluster is or you don't know what an Altix is.

It also doesn't avoid the main point, which is that any given resource can only be used by one CPU at a time. If processor A on brick B is passing data along wire C, then wire C cannot be handling traffic for any other processor at the same time. That resource is claimed, for that time.

I'll repeat it for you for the 100th time. This does not get any better in a cluster. In fact, it gets *much* worse because the latency and bandwidth on the interconnect are so much worse.

Why do you think people pay so much money for one when they could get 1000 cheap P4's and cluster them? Do you seriously think you know more about the subject than the people making and buying these things? (Hint: you don't)