BigTux Shows Linux Scales To 64-Way
An anonymous reader writes "HP has been demonstrating a Superdome server running the Stream and HPL benchmarks, which shows that the standard 2.6 Linux kernel scales to 64 processors. Compiling the kernel didn't scale quite so well, but that was because it involves intermittent serial processing by a single processor. The article also notes that HP's customers are increasingly using Linux for enterprise applications, and getting more interested in using it on the desktop..."
Looking at the literature, Linux and Unix in general seems to be designed to keep processes as lightweight as possible. OTOH, Windows processes are a little heavier and take longer to start up.
Then, OTOH, Windows threads are very lightweight compared to the equivalent thread model in Linux. Benchmarks have shown that in multi-process setups, Unix is heavily favored, but in multi-threaded setups Windows comes out on top.
When it comes to multi-processors, is there a theoretical advantage to using processes vs threads? Leaving out the Windows vs Linux debate for a second, how would an OS that implemented very efficient threads compare to one that implemented very efficient processes?
Would there be a difference?
First of all, a 26x speedup is GOOD. That said, if you are trying to use a cluster of 64 Itanium 2 processors to compile things, you're an idiot. IIRC, the long pipeline and VLIW, highly scheduled, architecture of the Itanium 2 make it bad at compiling. You could get that performance with cheapter Athlon 64s or Xeons. Not only that, but compiling one thing will ALWAYS be partly serial. Now if they were to compile multiple things (say 3 kernels, or the kernel, X, and KDE) at the same time, they should see closer to that 64x speedup. It's all about how much you can make parallel.
Which is something else. If you were to give that same thing a better application, it WOULD give you near 64x performance. If you used it to batch convert WAVs to MP3s, or RAW images to JPEGs, or MPEG4 to DiVx, or even just raytrace images (all things where no part is dependant on another part so they are highly parallizable), things will go great. In the article, they give the example of some bandwidth benchmark where the bandwidth scales almost perfectly with the number of processors they throw at it.
PS: Interesting fact I saw the other day. The human brain can only do about 200 operations per second, which is why computers are much faster at math. But the brain can do MILLIONS of things at once. So while it may only be able to process the image from our eyes at 200 "operations" per second, it do that for the millions of little bits of information all at once, which is why people are so good at visual things, pattern matching, chess, etc. Just FYI.
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
I do agree, that "big iron" is losing the power it once had. Especially when one can cluster a bunch of much cheaper 2-way boxes.
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
Someone wasn't awake when their Comp Sci class covered Ahmdal's Law. Or the Dining Philosopher's Problem. Or vector processing. Or networking. Or the parallelization problem. Or...
Actually, the troll can be made to serve a useful purpose, because there are probably a lot of people who read Slashdot who didn't do Comp Sci.
Part of the problem with parallelization is that not all problems can be divided up that way. If one man takes 60 seconds to dig a posthole, how long would it take 60 men to dig a single posthole? Answer - 60 seconds. Exactly the same amount of time is spent, because only one person can be digging the posthole at a time. Having more people doesn't help.
Another part of the problem is sharing resources. Let's say you have some computer memory that can respond to a read operation in one clock cycle. Let's also say that the computer program never reads from memory. (Very unlikely.) The first processor fetches an instruction (which is a read operation) and then executes it. The second processor can't do anything while the first one is reading, so has to wait until it has finished with that part, before it can do a read of its own.
If the instruction takes 1 clock cycle to execute, then the first processor will be ready after the second one has performed its fetch. In which case, you will be running the memory flat-out with just 2 processors. Any more than that, and the system will actually slow down, because the processors will have to wait.
Likewise, if the average time to run an instruction is N clock cycles, you will (on average) be able to have N+1 processors, before the memory is maxed out.
In practice, processors run about an order of magnitude faster than RAM, which is why modern systems have lots of L1 and L2 cache (and sometimes L3), pipelining, etc. These are all tricks to try and access the somewhat slower main memory as little as possible.
Also in practice, programmers try to avoid "expensive" (in terms of clock cycles) operations because you can generally get the same results faster by other means. (That's why RISC technology became popular - make the fast operations faster, rather than adding stuff that people will try to avoid.)
In consequence, sharing resources is a very difficult problem. It is not the only problem that many-way systems face, though. If you have N processors, there are !N possible ways for those processors to communicate. In this case, it would be !64 (64x63x62x...x2x1), which is a horribly large number. You couldn't have one link per pathway, for example, which means you've got to share links, which means you've got to have some damn good scheduling and routing mechanisms. Even then, with limited resources, you can only have so many processors talking at a time, before you are overwhelmed. Which means that "chatty" problems will involve a lot of processors spending a lot of time simply waiting for their turn to chat.
(This goes back to why people generally build clusters, rather than many-way SMP systems, and why high-end clusters use the fastest networking technology on the planet. Clustering is easy. Getting the communication speeds up is the problem. Getting communication speeds to the point of being useful for scientific applications is a very complex, expensive problem. Which is the main reason Mr. Cray charged more than Mr. Dell for his computers - and why people would pay it.)
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
The Altix uses 4-way CPU "bricks", along with networking and memory bricks, which you can then use to assemble a system. Yes, resources are visible globally, and it is a LOT faster than a PoP (pile-of-pcs) cluster using ethernet, but it is still a cluster of 4-way nodes.
It also doesn't avoid the main point, which is that any given resource can only be used by one CPU at a time. If processor A on brick B is passing data along wire C, then wire C cannot be handling traffic for any other processor at the same time. That resource is claimed, for that time.
When you are talking a massive cluster of hundreds or thousands of CPU bricks, it becomes very hard to efficiently schedule the use of resources. That's one reason such systems often have an implementation of SCP, the Scheduled Communications Protocol, where you reserve networking resources in advance. That way, it becomes possible to improve the efficiency. Otherwise, you run the risk of gridlock, which is just as real a problem in computing as it is on the streets.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
I prefer to say "might" make systems management much easier. The problem with the One Big Box is the same whether it's Sun, HP, Linux, etc.:
Etcetera...of course, there are just as many if not more problems with the "we'll just build a giant cluster of 64 boxes and scale across it!" approach...I'll rant on that some other day.
It's all trade-offs. And no matter which way you go, you'll discover some truly ugly hidden costs that never seem to show up in those vendor white papers. And none of it works exactly the way it should or you'd like it to. But I'm not jaded or anything ;)
Advice: on VPS providers