Slashdot Mirror


BigTux Shows Linux Scales To 64-Way

An anonymous reader writes "HP has been demonstrating a Superdome server running the Stream and HPL benchmarks, which shows that the standard 2.6 Linux kernel scales to 64 processors. Compiling the kernel didn't scale quite so well, but that was because it involves intermittent serial processing by a single processor. The article also notes that HP's customers are increasingly using Linux for enterprise applications, and getting more interested in using it on the desktop..."

22 of 247 comments (clear)

  1. So this time.. by puntloos · · Score: 4, Funny
    The age-old Slashdot question should read:

    Does it run Linux well?

    1. Re:So this time.. by ikewillis · · Score: 5, Interesting
      This is the real question which is oft ignored. There is far too great an emphasis of being able to manage n CPUs rather than how effectively kernel services operate on n CPUs.

      The answers have to do with fine grained locking of kernel services, so that the number of resource contentions between processors can be mitigated through a diverse number of locks with the hope that diversifying locks will ensure that fewer will be likely to be held at a given time, or designing interfaces that don't require locking of kernel structures at all.

      At any rate, Amazon successfully powers their backend database with Linux/IA64 running on HP servers. YMMV, but if it's good for what most would consider the preminent online merchant, it's probably good enough for you too.

  2. Geez... by byronne · · Score: 4, Funny

    I haven't had a 64-way since college.

    And you?

    --
    "Look, Smithers! I'm Davy Crockett!"
    1. Re:Geez... by nacturation · · Score: 4, Funny

      Most slashdotters are still working on upgrading to a 2-way.

      --
      Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
  3. Hrmm by Nailer · · Score: 4, Interesting

    SGI
    Unisys
    Fujitsu
    HP

    It looks like there might actually be a competitive marketplace for scalable multiprocessor Linux systems real soon now (if not already).

  4. Re:A little factoid for you by AstroDrabb · · Score: 5, Informative
    A little factoid for you
    Where did you get your facts from? You are way off champ or should I say troll?

    While FreeBSD is a great OS/kernel, it doesn't scale as well as Linux, end of story.

    Until you start talking about double that amount of procs, which is what Windows Server does these days
    Huh? What smoke are you craking? Here is the comparison of MS's latest and greatest Windows 2003 server editions
    Web Edition supports up to 2-way.
    Standard Edition supports up to 4-way.
    Enterprise Edition supports up to 8-way.
    Datacenter Edition supports up to 64-way.
    So, umm where is this double of what Linux supports? Plain vanilla Linux 2.6 can do 64-way no problem. Actually, SGI has had single image 128-way Linux system out for a while. They should have 256-way, single image Linux system out soon. That is more then MS can even touch. Maybe do some research before you just shoot off FUD.
    --
    If Tyranny and Oppression come to this land,
    it will be in the guise of fighting a foreign enemy. -James Madison
  5. Re:64 Itanics! by cicadia · · Score: 4, Funny
    That's what, 640.000?

    That should be enough for anybody :)

    --
    Living better through chemicals
  6. Threads vs. Processes by Dancin_Santa · · Score: 5, Insightful

    Looking at the literature, Linux and Unix in general seems to be designed to keep processes as lightweight as possible. OTOH, Windows processes are a little heavier and take longer to start up.

    Then, OTOH, Windows threads are very lightweight compared to the equivalent thread model in Linux. Benchmarks have shown that in multi-process setups, Unix is heavily favored, but in multi-threaded setups Windows comes out on top.

    When it comes to multi-processors, is there a theoretical advantage to using processes vs threads? Leaving out the Windows vs Linux debate for a second, how would an OS that implemented very efficient threads compare to one that implemented very efficient processes?

    Would there be a difference?

    1. Re:Threads vs. Processes by tetromino · · Score: 5, Informative

      Have you used any _recent_ Linux thread? LinuxThreads is an implementation of the Posix 1003.1c thread package.

      Dude, get with the times, LinuxThreads are obsolete. Kernel 2.6 / glibc 2.3 use NPTL, which launches new threads four times faster than LinuxThreads, allows you to have more than 8192 threads per process, doesn't require you to have lots of manager threads that don't do anything useful, delivers signals to threads as opposed to processes, and is actually more-or-less POSIX compliant.

      I've been using NPTL on my workstation for 12 months, and I haven't looked back (except when early versions of Mono were incompatible with NPTL). You talk about "any _recent_ Linux thread" - but it looks like you are using a Debian Woody...

  7. Re:excuse my ignorance by AstroDrabb · · Score: 4, Informative
    There are still many uses for this many processors. Think of a monster DB. It is much easier to have more processors on you DB than to have many small systems and have to worry about syncing the data.

    Think about virtualization. I would love to have a 64-way system and break that up into 32 2-way systems or 16 4-ways systems. It would make system management much easier. And with software, you can instantly assign more processors in a virtualized system to a server that was being hit hard. So your 4-way DB can turn into a 8-way or 16-way DB in an instant. Once the load is gone, you set it back to a 4-way DB.

    I personally still prefer to load balance many smaller servers to save costs. However, this could be an excellent option for some enterprises. I know where I work we have some big Sun boxes and we just add processors as we need. However, that has proven to be rather expensive and virtualizing could help save some big costs.

    --
    If Tyranny and Oppression come to this land,
    it will be in the guise of fighting a foreign enemy. -James Madison
  8. Re:A little factoid for you by AstroDrabb · · Score: 4, Informative
    This is where reading TFM whould help.
    In the STREAM benchmark, memory bandwidth rose from 5GB/s with one 'cell' of four processors, to 10GB/s with two cells, and continued to double until all 64 processors -- or all 16 cells -- were switched on to provide 80GB/s of bandwidth.

    The HPL benchmark, which is used to measure performance when solving large linear equations, produced similar results, rising from 18 gigaflops with one cell of four processors to 277 gigaflops with all 16 cells, or 64 processors, running.

    --
    If Tyranny and Oppression come to this land,
    it will be in the guise of fighting a foreign enemy. -James Madison
  9. Re:A little factoid for you by MBCook · · Score: 4, Insightful
    This is almost troll, as far as I'm concerned.

    First of all, a 26x speedup is GOOD. That said, if you are trying to use a cluster of 64 Itanium 2 processors to compile things, you're an idiot. IIRC, the long pipeline and VLIW, highly scheduled, architecture of the Itanium 2 make it bad at compiling. You could get that performance with cheapter Athlon 64s or Xeons. Not only that, but compiling one thing will ALWAYS be partly serial. Now if they were to compile multiple things (say 3 kernels, or the kernel, X, and KDE) at the same time, they should see closer to that 64x speedup. It's all about how much you can make parallel.

    Which is something else. If you were to give that same thing a better application, it WOULD give you near 64x performance. If you used it to batch convert WAVs to MP3s, or RAW images to JPEGs, or MPEG4 to DiVx, or even just raytrace images (all things where no part is dependant on another part so they are highly parallizable), things will go great. In the article, they give the example of some bandwidth benchmark where the bandwidth scales almost perfectly with the number of processors they throw at it.

    PS: Interesting fact I saw the other day. The human brain can only do about 200 operations per second, which is why computers are much faster at math. But the brain can do MILLIONS of things at once. So while it may only be able to process the image from our eyes at 200 "operations" per second, it do that for the millions of little bits of information all at once, which is why people are so good at visual things, pattern matching, chess, etc. Just FYI.

    --
    Comment forecast: Bits of genius surrounded by a sea of mediocrity.
  10. Re:excuse my ignorance by jd · · Score: 4, Interesting
    A 64-way system may or may not be useful. It depends on the speed of the interconnects, and the way it handles bus locking. (On a 64-way system, any given CPU can only have control of a given resource 1/64th of the time. Unless this is handled extremely well, this is Bad News.)


    In general, people use clusters of single or dual-processor systems, because many problems demand lots of hauling of data but relatively little communication between processors. For example, ray-tracing involves a lot of processor churning, but the only I/O is getting the information in at the start, and the image out at the end.


    Databases are OK for this, so long as the data is relatively static (so you can do a lot of caching on the separate nodes and don't have to access a central disk much).


    A 64-way superscaler system, though, is another thing altogether. Here, we're talking about some complex synchronization issues, but also the ability to handle much faster inter-processor I/O. Two processors can "talk" to each other much more efficiently than two ethernet devices. Far fewer layers to go through, for a start.


    Not a lot of problems need that kind of performance. The ability to throw small amounts of data around extremely fast would most likely be used by a company looking at fluid dynamics (say, a car or aircraft manufacturer) because of the sheer number of calculations needed, or by someone who needed the answer NOW (fly-by-wire systems, for example, where any delay could result in a nice crater in the ground).


    The problem is, most manufacturers out there already have plenty of computing power, and the only fly-by-wire systems that would need this much computing power would need military-grade or space-grade electronics, and there simply aren't any superscaler electronics at that kind of level. At least, not that the NSA is admitting to.


    So, sure, there are people who could use such a system, but I cannot imagine many of them are in the market.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  11. Wow by Anonymous Coward · · Score: 5, Funny

    Never mind Linux for a moment, I'm just amazed that 64 Itanium 2's have actually been sold...

  12. Re:Interesting. by Anonymous Coward · · Score: 5, Informative

    The problem is that most resources (memory, the bus, disks, etc) can only be used by one CPU at a time. So, for problems which are resource-intensive, you're generally better to cluster than to use SMP, so that each processor has its own bus, memory, etc.

    No, you have a misconception. On these REAL big iron systems, each CPU (or each few CPUs) does have its own busses, memory, and io busses.

    So in that regard it is as good as a cluster, but then add the fact that they have a global, cache coherent shared memory and interconnets that shame any cluster.

    The only advantage of a cluster is cost. Actually redundancy plays a role too, although less so with proper servers, as they have redundancy built in, and you can partition off the system to run multiple operating systems too.

    To be efficient, the processors would need gigantic caches, to keep the load on the rest of the system down. Either that, or you COULD run the CPUs out of step over a bus that is 64 times faster than normal. I'd hate to be the person designing such a system, though.

    Now, this system could be of extreme interest in the supercomputer world. One of the biggest complaints about clustering is the poor interconnects. This would seem to get round that problem. A Blue Gene-style cluster where each node is a 64-way SMP board, and you're running a few thousand nodes, would likely be an order of magnitude faster than anything currently on the supercomputer charts.


    Not really. Check the world's second fastest supercomputer. It is a cluster of 20 512-way IA64 systems running Linux.

  13. Re:excuse my ignorance by AstroDrabb · · Score: 4, Insightful
    Why would I want a 16-way processor in place of 8 dual processor boxes with a gigabit backbone network to them?
    It all depends on what you are doing. Where I work we replaced a few bigger boxes with a bunch of smaller/cheaper boxes behind a load balancer for web apps. However, when it came to DB performance, the bigger boxes were much better. Well, at least to a point. Our 8-way DB was much better then are 4 2-way DB's. The cost wasn't much more, so an 8-way worked well.

    I do agree, that "big iron" is losing the power it once had. Especially when one can cluster a bunch of much cheaper 2-way boxes.

    --
    If Tyranny and Oppression come to this land,
    it will be in the guise of fighting a foreign enemy. -James Madison
  14. Re:excuse my ignorance by Sir+Nimrod · · Score: 4, Interesting

    Take this with a grain of salt, because I was part of the group that developed the chipset for the first Superdome systems (PA-RISC). I'm probably a little biased.

    A 64-way Superdome system is spread across sixteen plug-in system boards. (Imagine two refrigerators next to each other; it really is that big.) A partition is made up of one or more system boards. Within a partition, each processor has all of the installed memory in its address space. The chipset handled the details of getting cache blocks back and forth among the system boards.

    That's a huge amount of memory to have by direct access. Access is pretty fast, too.

    Still, they were doubtless pretty expensive. HP-UX didn't allow for on-the-fly changes to partitions, but the chipset supports it. (The OS always lagged a bit behind. We built a chip to allow going above 64-way, but the OS just couldn't support it. A moral victory.) Perhaps Linux could get that support in place a little more quickly....

    --
    The United States of America: We mean well.
  15. Re:A little factoid for you by jd · · Score: 5, Insightful
    If it can scale to 16 procs well, it will scale to 64 procs well.


    Someone wasn't awake when their Comp Sci class covered Ahmdal's Law. Or the Dining Philosopher's Problem. Or vector processing. Or networking. Or the parallelization problem. Or...


    Actually, the troll can be made to serve a useful purpose, because there are probably a lot of people who read Slashdot who didn't do Comp Sci.


    Part of the problem with parallelization is that not all problems can be divided up that way. If one man takes 60 seconds to dig a posthole, how long would it take 60 men to dig a single posthole? Answer - 60 seconds. Exactly the same amount of time is spent, because only one person can be digging the posthole at a time. Having more people doesn't help.


    Another part of the problem is sharing resources. Let's say you have some computer memory that can respond to a read operation in one clock cycle. Let's also say that the computer program never reads from memory. (Very unlikely.) The first processor fetches an instruction (which is a read operation) and then executes it. The second processor can't do anything while the first one is reading, so has to wait until it has finished with that part, before it can do a read of its own.


    If the instruction takes 1 clock cycle to execute, then the first processor will be ready after the second one has performed its fetch. In which case, you will be running the memory flat-out with just 2 processors. Any more than that, and the system will actually slow down, because the processors will have to wait.


    Likewise, if the average time to run an instruction is N clock cycles, you will (on average) be able to have N+1 processors, before the memory is maxed out.


    In practice, processors run about an order of magnitude faster than RAM, which is why modern systems have lots of L1 and L2 cache (and sometimes L3), pipelining, etc. These are all tricks to try and access the somewhat slower main memory as little as possible.


    Also in practice, programmers try to avoid "expensive" (in terms of clock cycles) operations because you can generally get the same results faster by other means. (That's why RISC technology became popular - make the fast operations faster, rather than adding stuff that people will try to avoid.)


    In consequence, sharing resources is a very difficult problem. It is not the only problem that many-way systems face, though. If you have N processors, there are !N possible ways for those processors to communicate. In this case, it would be !64 (64x63x62x...x2x1), which is a horribly large number. You couldn't have one link per pathway, for example, which means you've got to share links, which means you've got to have some damn good scheduling and routing mechanisms. Even then, with limited resources, you can only have so many processors talking at a time, before you are overwhelmed. Which means that "chatty" problems will involve a lot of processors spending a lot of time simply waiting for their turn to chat.


    (This goes back to why people generally build clusters, rather than many-way SMP systems, and why high-end clusters use the fastest networking technology on the planet. Clustering is easy. Getting the communication speeds up is the problem. Getting communication speeds to the point of being useful for scientific applications is a very complex, expensive problem. Which is the main reason Mr. Cray charged more than Mr. Dell for his computers - and why people would pay it.)

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  16. Re:The SGI Altix is scaling to 256 cpus... by stratjakt · · Score: 5, Informative

    This is an unmodified stock 2.6 kernel (well it's patched with stuff that's in distros, and will be in the next kernel). Out of the box, it detected the NUMA set up, memory partitions, the whole bit.

    The SGI boxes are nothing like the stock kernel.

    --
    I don't need no instructions to know how to rock!!!!
  17. Re:Interesting. by jd · · Score: 4, Insightful
    Global shared-memory can be done on OpenMOSIX, using the Migshm extension, which provides you with Distributed Shared Memory.


    The Altix uses 4-way CPU "bricks", along with networking and memory bricks, which you can then use to assemble a system. Yes, resources are visible globally, and it is a LOT faster than a PoP (pile-of-pcs) cluster using ethernet, but it is still a cluster of 4-way nodes.


    It also doesn't avoid the main point, which is that any given resource can only be used by one CPU at a time. If processor A on brick B is passing data along wire C, then wire C cannot be handling traffic for any other processor at the same time. That resource is claimed, for that time.


    When you are talking a massive cluster of hundreds or thousands of CPU bricks, it becomes very hard to efficiently schedule the use of resources. That's one reason such systems often have an implementation of SCP, the Scheduled Communications Protocol, where you reserve networking resources in advance. That way, it becomes possible to improve the efficiency. Otherwise, you run the risk of gridlock, which is just as real a problem in computing as it is on the streets.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  18. Read my lips by Chatz · · Score: 5, Interesting

    Linux scaling to 512 processors:
    http://www.sgi.com/features/2004/oct/ columbia/

    The story should be HP has finally caught up to where SGI were 2 years ago.\

    --
    There is folly and foolishness on the one side, and daring and calculation on the other. - Admiral Pellew, Hornblower
  19. Re:excuse my ignorance by afabbro · · Score: 4, Insightful
    It would make system management much easier.

    I prefer to say "might" make systems management much easier. The problem with the One Big Box is the same whether it's Sun, HP, Linux, etc.:

    • Something bad happens to the One Or Two Critical Components. If you know of any open systems box has no one single point of failure, I'd sure like to see it. If you want one big box without a single point of failure, you buy a mainframe. Every open systems big box I'm aware has at least one or ten SPOFs...and I've had the backplane go out on more than one Sun E10K. At that point, you don't lose just one system, you lose everything if you've consolidated to one 64-way box.
    • It's time to do some hardware maintenance. Good luck coordinating that with 32 different user groups. "Ah, but we can do everything hot with this big box." Always sounds good on paper. I've always run into things for each of them that required a power-off maintenance.
    • Or perhaps it's not even maintenance...it's just something weird. I had a Big Box once where a power supply made a popping noise and emitted a small puff of smoke. It burned out. Not a big deal in the end - it could be replaced hot - but it was a nervous couple of hours. Versus a cluster where you'd fail over to the spare (yes, I know you could cluster your Two Big Boxes, but we start getting into financial justifications).
    • ISVs say things like "You want to run XYZ 1.0 on your 64-way box? That's a tier 9 platform and that will be $100,000, thank you." "But I'm only using it on one 2-way partition!" "You might dynamically reconfigure it after we sell you the license and our software isn't that smart, so it's $100K or no deal. And then you can use it on all your partitions!" "But I don't need it on all of them!" You'd be amazed how many prominent software companies tier based on the overall box and don't support virtual partitions, etc. from a licensing perspective. And you're guaranteed to have a user who needs one of their products.
    • Department B bought SAN gizmo X and your big box is exotic enough that there is no driver for it. They really want SAN gizmo X, so they go off and buy a new 4-way box for themselves. Or they want to run SuSE and SuSE doesn't support your box. Or everyone wants his own gig-E or two and you don't have 128 ports out the back. Etcetera - there are lots of scenarios where you can't get the technical architecture brainiacs to think ahead or you can't get the vendors' stars to line up and you wind up with people who don't want to be on the big box...and pretty soon the data center is proliferating again.

    Etcetera...of course, there are just as many if not more problems with the "we'll just build a giant cluster of 64 boxes and scale across it!" approach...I'll rant on that some other day.

    It's all trade-offs. And no matter which way you go, you'll discover some truly ugly hidden costs that never seem to show up in those vendor white papers. And none of it works exactly the way it should or you'd like it to. But I'm not jaded or anything ;)

    --
    Advice: on VPS providers