BigTux Shows Linux Scales To 64-Way
An anonymous reader writes "HP has been demonstrating a Superdome server running the Stream and HPL benchmarks, which shows that the standard 2.6 Linux kernel scales to 64 processors. Compiling the kernel didn't scale quite so well, but that was because it involves intermittent serial processing by a single processor. The article also notes that HP's customers are increasingly using Linux for enterprise applications, and getting more interested in using it on the desktop..."
Does it run Linux well?
What parallel-computing activity doesn't involve intermittent activity by a single processor? You have to spawn the parallel job somehow, and typically that starts as a single process. Is the implication here that compiling is pipelined, but linking is a single-CPU job?
If you mod me down, I shall become more powerful than you can possibly imagine.
I haven't had a 64-way since college.
And you?
"Look, Smithers! I'm Davy Crockett!"
SGI
Unisys
Fujitsu
HP
It looks like there might actually be a competitive marketplace for scalable multiprocessor Linux systems real soon now (if not already).
I know Linux is pretty good from a security sense (compared to Windows, at least), and I'm not surprised to find it operates on exotic setups, but are there that many programs out there that support such a setup? Or ones that will actually benefit from this many processors? Or is the point of this system to develop custom business software for their own use? Or is it for a data server of some sort that can benefit from multiple cores answering requests?
lol: You see no door there!
While FreeBSD is a great OS/kernel, it doesn't scale as well as Linux, end of story.
Huh? What are you smoking? Here is the comparison of MS's latest and greatest Windows 2003 server editions. So, umm, where is this double of what Linux supports? Plain vanilla Linux 2.6 can do 64-way no problem. Actually, SGI has had a single-image 128-way Linux system out for a while. They should have a 256-way, single-image Linux system out soon. That is more than MS can even touch. Maybe do some research before you just shoot off FUD.
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
Hey, at least they tried. How many news articles have you read that compare Linux kernel compiles on a 64-processor machine? Probably only one.
It took 19 minutes to compile with a single CPU, and it was 26x faster on the 64-processor machine. Does that equate to about 44 seconds for a kernel compile? It'd probably take longer than that just to untar/unbzip2 the source, since that would be running on only 2 CPUs (one process for tar, one for bzip2).
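A quick sanity check of that arithmetic, as a rough sketch. The 19 minutes and the 26x speedup are the figures quoted above; the variable names are made up for illustration.

```python
# Back-of-the-envelope check of the kernel compile numbers quoted above.
single_cpu_minutes = 19   # compile time on one CPU, as quoted from the article
speedup = 26              # reported speedup on the 64-processor machine
processors = 64

parallel_seconds = single_cpu_minutes * 60 / speedup
efficiency = speedup / processors   # fraction of the ideal 64x actually achieved

print(f"{parallel_seconds:.1f} s per compile, {efficiency:.0%} parallel efficiency")
```

So the compile lands just under 44 seconds, and the machine is getting roughly 40% of the ideal 64x.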
Why read the article when I can just make up a snap judgement?
That should be enough for anybody :)
Living better through chemicals
If it can scale to 16 procs well, it will scale to 64 procs well.
Until you start talking about double that amount of procs, which is what Windows Server does these days
Wrong. Windows Server 2003 supports a maximum of only 64 processors, and I believe it was significantly tested only on 32-way and smaller machines.
Looking at the literature, Linux and Unix in general seem to be designed to keep processes as lightweight as possible. OTOH, Windows processes are a little heavier and take longer to start up.
Then, OTOH, Windows threads are very lightweight compared to the equivalent thread model in Linux. Benchmarks have shown that in multi-process setups, Unix is heavily favored, but in multi-threaded setups Windows comes out on top.
When it comes to multi-processors, is there a theoretical advantage to using processes vs threads? Leaving out the Windows vs Linux debate for a second, how would an OS that implemented very efficient threads compare to one that implemented very efficient processes?
Would there be a difference?
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
First of all, a 26x speedup is GOOD. That said, if you are trying to use a cluster of 64 Itanium 2 processors to compile things, you're an idiot. IIRC, the long pipeline and the highly scheduled VLIW architecture of the Itanium 2 make it bad at compiling. You could get that performance with cheaper Athlon 64s or Xeons. Not only that, but compiling one thing will ALWAYS be partly serial. Now if they were to compile multiple things (say 3 kernels, or the kernel, X, and KDE) at the same time, they should see closer to that 64x speedup. It's all about how much you can make parallel.
Which is something else. If you were to give that same thing a better application, it WOULD give you near 64x performance. If you used it to batch convert WAVs to MP3s, or RAW images to JPEGs, or MPEG4 to DivX, or even just raytrace images (all things where no part is dependent on another part, so they are highly parallelizable), things will go great. In the article, they give the example of some bandwidth benchmark where the bandwidth scales almost perfectly with the number of processors they throw at it.
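To put a rough number on "partly serial", here is a minimal sketch using Amdahl's law. It assumes the kernel compile fits that idealized model (a fixed serial fraction, everything else perfectly parallel), which is an assumption and not something the article states. The function names are made up for illustration.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Amdahl's law: speedup when a fixed fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def serial_fraction_for(speedup: float, processors: int) -> float:
    """Invert Amdahl's law: the serial fraction implied by an observed speedup."""
    return (processors / speedup - 1.0) / (processors - 1.0)


# 26x on 64 processors, the figures quoted in this thread.
s = serial_fraction_for(26, 64)
print(f"implied serial fraction: {s:.1%}")

# A job with no serial part (batch conversions, raytracing) gets the full 64x.
print(f"speedup with no serial part: {amdahl_speedup(0.0, 64):.0f}x")
```

Under that assumption, only a couple of percent of the compile needs to be serial to drag 64 processors down to 26x, which is why the independent batch jobs do so much better.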
PS: Interesting fact I saw the other day. The human brain can only do about 200 operations per second, which is why computers are much faster at math. But the brain can do MILLIONS of things at once. So while it may only be able to process the image from our eyes at 200 "operations" per second, it does that for the millions of little bits of information all at once, which is why people are so good at visual things, pattern matching, chess, etc. Just FYI.
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
NASA's Columbia cluster: 512-way SGI machines running Linux (actually 20 of them...). Not to mention "Columbia's record results were achieved running the LINPACK benchmark on 8,192 of the NASA supercomputer's 10,240 processors. Columbia also achieved an 88 percent efficiency rating on the LINPACK benchmark, the highest efficiency rating ever attained in a LINPACK test on large systems." from http://www.sgi.com/company_info/newsroom/press_releases/2004/october/worlds_fastest.html
Correct, AFAIK the biggest Windows 2003 Datacenter installs are on Unisys ES7000s, and those only support 32-way Windows partitions. The box can hold 64 Xeons, so I would say that Unisys isn't comfortable with the scalability of Windows to the full system size; otherwise they'd be shouting it from the rooftops.
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Never mind Linux for a moment, I'm just amazed that 64 Itanium 2's have actually been sold...
To be efficient, the processors would need gigantic caches, to keep the load on the rest of the system down. Either that, or you COULD run the CPUs out of step over a bus that is 64 times faster than normal. I'd hate to be the person designing such a system, though.
Now, this system could be of extreme interest in the supercomputer world. One of the biggest complaints about clustering is the poor interconnects. This would seem to get round that problem. A Blue Gene-style cluster where each node is a 64-way SMP board, and you're running a few thousand nodes, would likely be an order of magnitude faster than anything currently on the supercomputer charts.
On the other hand, do we need to know what the weather is not going to be, ten times as often?
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Looks like someone was up to those challenges, eh? 64-processor support *and* 64-bit support. Awesome news.
I have no special gift, I am only passionately curious. --Albert Einstein
Smaller, say 4 or 8 way NUMA boards, that are within the means of the average geek?
I'm not talking about mere mortal SMP systems, I want all the crazy memory partitioning and whatnot.
I don't need no instructions to know how to rock!!!!
Someone wasn't awake when their Comp Sci class covered Amdahl's Law. Or the Dining Philosophers Problem. Or vector processing. Or networking. Or the parallelization problem. Or...
Actually, the troll can be made to serve a useful purpose, because there are probably a lot of people who read Slashdot who didn't do Comp Sci.
Part of the problem with parallelization is that not all problems can be divided up that way. If one man takes 60 seconds to dig a posthole, how long would it take 60 men to dig a single posthole? Answer - 60 seconds. Exactly the same amount of time is spent, because only one person can be digging the posthole at a time. Having more people doesn't help.
Another part of the problem is sharing resources. Let's say you have some computer memory that can respond to a read operation in one clock cycle. Let's also say that the computer program never reads data from memory. (Very unlikely.) The first processor fetches an instruction (which is a read operation) and then executes it. The second processor can't do anything while the first one is reading, so it has to wait until the first has finished with that part before it can do a read of its own.
If the instruction takes 1 clock cycle to execute, then the first processor will be ready after the second one has performed its fetch. In which case, you will be running the memory flat-out with just 2 processors. Any more than that, and the system will actually slow down, because the processors will have to wait.
Likewise, if the average time to run an instruction is N clock cycles, you will (on average) be able to have N+1 processors, before the memory is maxed out.
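The N+1 rule above can be sketched as a toy model. Everything here is a simplification for illustration: one shared memory that serves one instruction fetch per clock cycle, no caches, and a made-up function name.

```python
def throughput(processors: int, exec_cycles: int, fetch_cycles: int = 1) -> float:
    """Instructions completed per clock cycle across all processors (toy model).

    Each instruction costs one fetch on the shared memory plus exec_cycles of
    private execution. The memory can serve at most 1/fetch_cycles fetches per
    cycle, so total throughput is capped there.
    """
    unconstrained = processors / (fetch_cycles + exec_cycles)
    memory_limit = 1.0 / fetch_cycles
    return min(unconstrained, memory_limit)


# With 1-cycle instructions the memory is flat-out at 2 processors;
# with 3-cycle instructions it takes 3 + 1 = 4 processors.
for n_exec in (1, 3):
    for p in (1, 2, 4, 64):
        print(f"exec={n_exec} cycles, {p:2d} processors: {throughput(p, n_exec):.2f} instr/cycle")
```

In this idealized model the throughput just flattens out past N+1 processors. Any actual slowdown would come from contention overhead that the sketch does not model.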
In practice, processors run about an order of magnitude faster than RAM, which is why modern systems have lots of L1 and L2 cache (and sometimes L3), pipelining, etc. These are all tricks to try and access the somewhat slower main memory as little as possible.
Also in practice, programmers try to avoid "expensive" (in terms of clock cycles) operations because you can generally get the same results faster by other means. (That's why RISC technology became popular - make the fast operations faster, rather than adding stuff that people will try to avoid.)
In consequence, sharing resources is a very difficult problem. It is not the only problem that many-way systems face, though. If you have N processors, there are N! possible ways for those processors to communicate. In this case, it would be 64! (64x63x62x...x2x1), which is a horribly large number. You couldn't have one link per pathway, for example, which means you've got to share links, which means you've got to have some damn good scheduling and routing mechanisms. Even then, with limited resources, you can only have so many processors talking at a time, before you are overwhelmed. Which means that "chatty" problems will involve a lot of processors spending a lot of time simply waiting for their turn to chat.
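For a sense of scale, here is a small sketch of the numbers involved. The factorial is the figure given above; it counts orderings of the 64 processors. The second figure is a different count that the comment does not give: the dedicated point-to-point links a full mesh would need, one per pair of processors, N(N-1)/2.

```python
import math

n = 64
orderings = math.factorial(n)      # 64 x 63 x ... x 2 x 1, the figure in the comment
pairwise_links = math.comb(n, 2)   # one dedicated link per pair: n * (n - 1) / 2

print(f"{n}! has {len(str(orderings))} digits")
print(f"a full mesh of {n} processors needs {pairwise_links} links")
```

Even on the smaller pairwise count, that is a couple of thousand links to wire, so the point about having to share links and route traffic still stands.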
(This goes back to why people generally build clusters, rather than many-way SMP systems, and why high-end clusters use the fastest networking technology on the planet. Clustering is easy. Getting the communication speeds up is the problem. Getting communication speeds to the point of being useful for scientific applications is a very complex, expensive problem. Which is the main reason Mr. Cray charged more than Mr. Dell for his computers - and why people would pay it.)
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
This is an unmodified stock 2.6 kernel (well, it's patched with stuff that's in distros, and will be in the next kernel). Out of the box, it detected the NUMA setup, memory partitions, the whole bit.
The SGI boxes are nothing like the stock kernel.
I don't need no instructions to know how to rock!!!!
They did try Windows Server 2003 on a 64-way machine, but the kernel got scared and hid under the disk controller.
-- Microsoft is the most expensive commodity operating system and office suite vendor in the marketplace.
Mandatory... Imagine a Beowulf cluster of those!
Two mod points if you can work a good goatse or overlord joke into this topic. Although, the thought of a 64-way goatse overlord gives me the jeebies.
Table-ized A.I.
You must have high ceilings in your office.
-- Microsoft is the most expensive commodity operating system and office suite vendor in the marketplace.
Linux scaling to 512 processors: http://www.sgi.com/features/2004/oct/columbia/
The story should be HP has finally caught up to where SGI were 2 years ago.
There is folly and foolishness on the one side, and daring and calculation on the other. - Admiral Pellew, Hornblower
My kernel only goes up to 11.
How'd you get a three processor system? Is it a quad board, discounted heavily because one socket was broken? That'd be neat, where'd you get it?
Infuriate left and right