BigTux Shows Linux Scales To 64-Way

Posted by timothy on Tuesday January 18, 2005 @03:28PM from the can't-let-you-do-that-dave dept.

An anonymous reader writes "HP has been demonstrating a Superdome server running the Stream and HPL benchmarks, which shows that the standard 2.6 Linux kernel scales to 64 processors. Compiling the kernel didn't scale quite so well, but that was because it involves intermittent serial processing by a single processor. The article also notes that HP's customers are increasingly using Linux for enterprise applications, and getting more interested in using it on the desktop..."

Re:A little factoid for you by Cheeze · 2005-01-18 15:50 · Score: 2, Insightful

Hey, at least they tried. How many news articles have you read that compare Linux kernel compiles on a 64-processor machine? Probably only one.

it took 19 minutes to compile with a single cpu, and 26x faster for the 64 processor machine. Does that equate to about 43 seconds for a kernel compile? It'd probably take longer than that just to untar/unbzip2 the source, since that would be running on only 2 cpus (one process for tar, one for bzip2).
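As a quick sanity check on that arithmetic, here is a minimal sketch. The 19-minute and 26x figures are the ones quoted above; the variable names are made up for illustration.

```python
# Figures quoted in the comment above; the names are illustrative only.
single_cpu_minutes = 19
speedup_on_64_cpus = 26

single_cpu_seconds = single_cpu_minutes * 60              # 1140 seconds
parallel_seconds = single_cpu_seconds / speedup_on_64_cpus

print(f"{parallel_seconds:.1f} seconds")
```

That works out to a little under 44 seconds, so the estimate above is in the right neighborhood.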

--
Why read the article when I can just make up a snap judgement?
Threads vs. Processes by Dancin_Santa · 2005-01-18 15:53 · Score: 5, Insightful

Looking at the literature, Linux and Unix in general seem to be designed to keep processes as lightweight as possible. OTOH, Windows processes are a little heavier and take longer to start up.

Then, OTOH, Windows threads are very lightweight compared to the equivalent thread model in Linux. Benchmarks have shown that in multi-process setups, Unix is heavily favored, but in multi-threaded setups Windows comes out on top.

When it comes to multi-processors, is there a theoretical advantage to using processes vs threads? Leaving out the Windows vs Linux debate for a second, how would an OS that implemented very efficient threads compare to one that implemented very efficient processes?

Would there be a difference?
Re:excuse my ignorance by Anonymous Coward · 2005-01-18 15:56 · Score: 3, Insightful

is there that many programs out there that support such a setup?

As they say, if you have to ask, you don't need it.

The point for stuff like this isn't the number of programs that will support it, it's that you already have *one* program that not only supports it, but requires it.

Think weather modeling. It's a specialized application that requires massive CPU horsepower - and it's written specifically for the task at hand. This isn't something you'd pick up at Best Buy, or download from Freshmeat - it's a custom app that requires massive amounts of horsepower to do a specific task.
Re:A little factoid for you by MBCook · 2005-01-18 16:03 · Score: 4, Insightful

This is almost a troll, as far as I'm concerned.
First of all, a 26x speedup is GOOD. That said, if you are trying to use a cluster of 64 Itanium 2 processors to compile things, you're an idiot. IIRC, the long pipeline and VLIW, highly scheduled, architecture of the Itanium 2 make it bad at compiling. You could get that performance with cheaper Athlon 64s or Xeons. Not only that, but compiling one thing will ALWAYS be partly serial. Now if they were to compile multiple things (say 3 kernels, or the kernel, X, and KDE) at the same time, they should see closer to that 64x speedup. It's all about how much you can make parallel.
Which is something else. If you were to give that same thing a better application, it WOULD give you near 64x performance. If you used it to batch convert WAVs to MP3s, or RAW images to JPEGs, or MPEG4 to DivX, or even just raytrace images (all things where no part is dependent on another part so they are highly parallelizable), things will go great. In the article, they give the example of some bandwidth benchmark where the bandwidth scales almost perfectly with the number of processors they throw at it.
PS: Interesting fact I saw the other day. The human brain can only do about 200 operations per second, which is why computers are much faster at math. But the brain can do MILLIONS of things at once. So while it may only be able to process the image from our eyes at 200 "operations" per second, it can do that for the millions of little bits of information all at once, which is why people are so good at visual things, pattern matching, chess, etc. Just FYI.

--
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
Re:Hrmm by Anonymous Coward · 2005-01-18 16:25 · Score: 1, Insightful

*yawn* Why stay up that late when it booted on a 512p Altix this morning?
Interesting. by jd · 2005-01-18 16:27 · Score: 2, Insightful

The problem is that most resources (memory, the bus, disks, etc) can only be used by one CPU at a time. So, for problems which are resource-intensive, you're generally better off clustering than using SMP, so that each processor has its own bus, memory, etc.

To be efficient, the processors would need gigantic caches, to keep the load on the rest of the system down. Either that, or you COULD run the CPUs out of step over a bus that is 64 times faster than normal. I'd hate to be the person designing such a system, though.

Now, this system could be of extreme interest in the supercomputer world. One of the biggest complaints about clustering is the poor interconnects. This would seem to get round that problem. A Blue Gene-style cluster where each node is a 64-way SMP board, and you're running a few thousand nodes, would likely be an order of magnitude faster than anything currently on the supercomputer charts.

On the other hand, do we need to know what the weather is not going to be, ten times as often?

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
1. Re:Interesting. by jd · 2005-01-18 17:21 · Score: 4, Insightful
  
  Global shared-memory can be done on OpenMOSIX, using the Migshm extension, which provides you with Distributed Shared Memory.
  
  The Altix uses 4-way CPU "bricks", along with networking and memory bricks, which you can then use to assemble a system. Yes, resources are visible globally, and it is a LOT faster than a PoP (pile-of-PCs) cluster using Ethernet, but it is still a cluster of 4-way nodes.
  
  It also doesn't avoid the main point, which is that any given resource can only be used by one CPU at a time. If processor A on brick B is passing data along wire C, then wire C cannot be handling traffic for any other processor at the same time. That resource is claimed, for that time.
  
  When you are talking a massive cluster of hundreds or thousands of CPU bricks, it becomes very hard to efficiently schedule the use of resources. That's one reason such systems often have an implementation of SCP, the Scheduled Communications Protocol, where you reserve networking resources in advance. That way, it becomes possible to improve the efficiency. Otherwise, you run the risk of gridlock, which is just as real a problem in computing as it is on the streets.
  
  --
  It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:Wow... by Bingo+Foo · 2005-01-18 16:31 · Score: 1, Insightful

It says something about how provincial the IT world has become when someone says "the best thing about multi-processor systems past two or four is really the ability to run virtualized servers with two or four dedicated CPUs each inside an uber-CPU'd system."
There's a reason you pay so much more per CPU for an SMP or NUMA system, and it ain't for network services.

--
taken! (by Davidleeroth) Thanks Bingo Foo!
Re:Pardon my ignorance, but... by rgmoore · 2005-01-18 16:36 · Score: 3, Insightful

That type of processing is frequently called "embarrassingly parallel", and it's far more common than you seem to think. I think that 3D rendering and web serving that doesn't require writing to a database can both be handled this way. There are also many categories of scientific data processing - think SETI@home - that work this way. The real reason that this kind of workload isn't interesting for SMP is that it's so easy that you don't need fancy hardware like 64-way servers to take advantage of it. It can be farmed out to clusters of cheap PCs, or even distributed over the network to volunteers.
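To make "embarrassingly parallel" concrete, here is a minimal sketch. The `render_frame` function is a hypothetical stand-in for one independent unit of work (a frame, a file, a work unit); nothing here comes from the article.

```python
from concurrent.futures import ThreadPoolExecutor


def render_frame(frame_number: int) -> int:
    """Hypothetical stand-in for one independent unit of work."""
    return sum(i * i for i in range(frame_number * 1000))


frames = list(range(1, 33))

# Each task touches only its own input, so the workers never need to
# lock anything or talk to each other. Threads keep the sketch short;
# for CPU-bound work in CPython you would likely swap in
# ProcessPoolExecutor, or separate machines, as described above.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(render_frame, frames))

print(len(results), "frames done")
```

Because no task depends on another, the same loop could be split across cheap PCs with nothing shared but the list of inputs.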

--
There's no point in questioning authority if you aren't going to listen to the answers.
Re:excuse my ignorance by AstroDrabb · 2005-01-18 16:51 · Score: 4, Insightful

Why would I want a 16-way processor in place of 8 dual processor boxes with a gigabit backbone network to them?
It all depends on what you are doing. Where I work we replaced a few bigger boxes with a bunch of smaller/cheaper boxes behind a load balancer for web apps. However, when it came to DB performance, the bigger boxes were much better. Well, at least to a point. Our 8-way DB was much better than our four 2-way DBs. The cost wasn't much more, so an 8-way worked well.
I do agree that "big iron" is losing the power it once had. Especially when one can cluster a bunch of much cheaper 2-way boxes.

--
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
Re:A little factoid for you by jd · 2005-01-18 17:05 · Score: 5, Insightful

If it can scale to 16 procs well, it will scale to 64 procs well.

Someone wasn't awake when their Comp Sci class covered Amdahl's Law. Or the Dining Philosophers Problem. Or vector processing. Or networking. Or the parallelization problem. Or...

Actually, the troll can be made to serve a useful purpose, because there are probably a lot of people who read Slashdot who didn't do Comp Sci.

Part of the problem with parallelization is that not all problems can be divided up that way. If one man takes 60 seconds to dig a posthole, how long would it take 60 men to dig a single posthole? Answer - 60 seconds. Exactly the same amount of time is spent, because only one person can be digging the posthole at a time. Having more people doesn't help.
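Amdahl's Law puts a number on this. The sketch below uses the standard idealized formula, speedup = 1 / (s + (1 - s) / N) for serial fraction s and N processors; the function names are made up for illustration, and the 26x-on-64 figure is the kernel compile result quoted elsewhere in the thread. Treat the result as what the model implies, not a measurement.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Amdahl's Law: only the non-serial part is spread over the processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def implied_serial_fraction(speedup: float, processors: int) -> float:
    """Invert Amdahl's Law: the serial fraction that would explain a speedup."""
    return (processors / speedup - 1.0) / (processors - 1.0)


# The posthole is 100% serial, so 60 workers give no speedup at all.
print(amdahl_speedup(1.0, 60))

# The 26x speedup on 64 processors reported for the kernel compile.
s = implied_serial_fraction(26, 64)
print(f"implied serial fraction: {s:.1%}")
```

Under this model a serial portion of only a couple of percent is enough to hold 64 processors to a 26x speedup.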

Another part of the problem is sharing resources. Let's say you have some computer memory that can respond to a read operation in one clock cycle. Let's also say that the computer program never reads data from memory. (Very unlikely.) The first processor fetches an instruction (which is a read operation) and then executes it. The second processor can't do anything while the first one is reading, so has to wait until it has finished with that part, before it can do a read of its own.

If the instruction takes 1 clock cycle to execute, then the first processor will be ready after the second one has performed its fetch. In which case, you will be running the memory flat-out with just 2 processors. Any more than that, and the system will actually slow down, because the processors will have to wait.

Likewise, if the average time to run an instruction is N clock cycles, you will (on average) be able to have N+1 processors, before the memory is maxed out.
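The N+1 rule can be sketched as a toy model. The assumptions are the ones stated above (one shared memory, one-cycle instruction fetches, no data reads, no caches), and the function names are invented for illustration.

```python
def max_processors_before_saturation(exec_cycles: int) -> int:
    """Each CPU needs memory 1 cycle out of every exec_cycles + 1."""
    return exec_cycles + 1


def total_instructions_per_cycle(processors: int, exec_cycles: int) -> float:
    """Aggregate throughput, capped by memory serving one fetch per cycle."""
    uncontended = processors / (exec_cycles + 1)
    return min(uncontended, 1.0)


# With 3-cycle instructions, throughput stops growing at 4 processors.
for cpus in (1, 2, 3, 4, 8):
    print(cpus, total_instructions_per_cycle(cpus, exec_cycles=3))
```

In this idealized model throughput simply flattens once memory is busy every cycle. An actual slowdown beyond that point would come from arbitration and waiting overhead, which the sketch does not model.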

In practice, processors run about an order of magnitude faster than RAM, which is why modern systems have lots of L1 and L2 cache (and sometimes L3), pipelining, etc. These are all tricks to try and access the somewhat slower main memory as little as possible.

Also in practice, programmers try to avoid "expensive" (in terms of clock cycles) operations because you can generally get the same results faster by other means. (That's why RISC technology became popular - make the fast operations faster, rather than adding stuff that people will try to avoid.)

In consequence, sharing resources is a very difficult problem. It is not the only problem that many-way systems face, though. If you have N processors, there are N! possible ways for those processors to communicate. In this case, it would be 64! (64x63x62x...x2x1), which is a horribly large number. You couldn't have one link per pathway, for example, which means you've got to share links, which means you've got to have some damn good scheduling and routing mechanisms. Even then, with limited resources, you can only have so many processors talking at a time, before you are overwhelmed. Which means that "chatty" problems will involve a lot of processors spending a lot of time simply waiting for their turn to chat.
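For a sense of scale, here is a small sketch of the numbers involved. The factorial is the figure used above; the count of distinct processor pairs, N(N-1)/2, is a different measure added here only for comparison and is not from the comment.

```python
import math

n = 64
orderings = math.factorial(n)   # 64 x 63 x ... x 2 x 1, the figure used above
pairs = math.comb(n, 2)         # distinct processor pairs: n * (n - 1) / 2

print(f"64! has {len(str(orderings))} digits")
print(f"dedicated point-to-point links for {n} processors: {pairs}")
```

Even the much smaller pair count is more dedicated wiring than anyone would want to build, which is the point about having to share links.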

(This goes back to why people generally build clusters, rather than many-way SMP systems, and why high-end clusters use the fastest networking technology on the planet. Clustering is easy. Getting the communication speeds up is the problem. Getting communication speeds to the point of being useful for scientific applications is a very complex, expensive problem. Which is the main reason Mr. Cray charged more than Mr. Dell for his computers - and why people would pay it.)

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:excuse my ignorance by afabbro · 2005-01-18 17:56 · Score: 4, Insightful
It would make system management much easier.
I prefer to say "might" make systems management much easier. The problem with the One Big Box is the same whether it's Sun, HP, Linux, etc.:
- Something bad happens to the One Or Two Critical Components. If you know of any open systems box that has no single point of failure, I'd sure like to see it. If you want one big box without a single point of failure, you buy a mainframe. Every open systems big box I'm aware of has at least one or ten SPOFs...and I've had the backplane go out on more than one Sun E10K. At that point, you don't lose just one system, you lose everything if you've consolidated to one 64-way box.
- It's time to do some hardware maintenance. Good luck coordinating that with 32 different user groups. "Ah, but we can do everything hot with this big box." Always sounds good on paper. I've always run into things for each of them that required power-off maintenance.
- Or perhaps it's not even maintenance...it's just something weird. I had a Big Box once where a power supply made a popping noise and emitted a small puff of smoke. It burned out. Not a big deal in the end - it could be replaced hot - but it was a nervous couple of hours. Versus a cluster where you'd fail over to the spare (yes, I know you could cluster your Two Big Boxes, but we start getting into financial justifications).
- ISVs say things like "You want to run XYZ 1.0 on your 64-way box? That's a tier 9 platform and that will be $100,000, thank you." "But I'm only using it on one 2-way partition!" "You might dynamically reconfigure it after we sell you the license and our software isn't that smart, so it's $100K or no deal. And then you can use it on all your partitions!" "But I don't need it on all of them!" You'd be amazed how many prominent software companies tier based on the overall box and don't support virtual partitions, etc. from a licensing perspective. And you're guaranteed to have a user who needs one of their products.
- Department B bought SAN gizmo X and your big box is exotic enough that there is no driver for it. They really want SAN gizmo X, so they go off and buy a new 4-way box for themselves. Or they want to run SuSE and SuSE doesn't support your box. Or everyone wants his own gig-E or two and you don't have 128 ports out the back. Etcetera - there are lots of scenarios where you can't get the technical architecture brainiacs to think ahead or you can't get the vendors' stars to line up and you wind up with people who don't want to be on the big box...and pretty soon the data center is proliferating again.
Etcetera...of course, there are just as many if not more problems with the "we'll just build a giant cluster of 64 boxes and scale across it!" approach...I'll rant on that some other day.
It's all trade-offs. And no matter which way you go, you'll discover some truly ugly hidden costs that never seem to show up in those vendor white papers. And none of it works exactly the way it should or you'd like it to. But I'm not jaded or anything ;)
--
Advice: on VPS providers