Slashdot Mirror


Running 100,000 Parallel Threads

An anonymous reader writes "This story explains how the latest Linux development kernel is now able to start and stop over 100,000 threads in parallel in only 2 seconds (about 14 minutes 58 seconds faster than with earlier Linux kernels)! Much of this impressive work is thanks to Ingo Molnar, author of the O(1) scheduler recently merged with the 2.5 Linux development kernel."

21 of 387 comments (clear)

  1. Re:Posix thread... by Wolfier · · Score: 5, Informative

    Your answer:

    http://www.cs.wustl.edu/~schmidt/ACE.html

    This is so far the best library I have used for pthread programming. Powerful, easy to use, and encapsulates message passing really well...

  2. Not 100,000 threads in parallel, just 50. by Tronster · · Score: 1, Informative
    The title and description is misleading. From the comments further down in the article, Linus points out that only 50 threads at a time were running in parallel:

    From: Linus Torvalds
    Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1
    Date: Fri, 20 Sep 2002 06:01:47 +0000 (UTC)

    Rik van Riel wrote:

    >I agree, it's pretty silly. But still, I was curious how they
    >managed to achieve it ;)

    You didn't read the post carefully.

    They started and waited for 100,000 threads.

    They did not have them all running at the same time. I think the
    original post said something like "up to 50 at a time".

    Basically, the benchmark was how _fast_ thread creation is, not now many
    you can run at the same time. 100k threads at once is crazy, but you can
    do it now on 64-bit architectures if you really want to.

    Linus
    1. Re:Not 100,000 threads in parallel, just 50. by vvikram · · Score: 5, Informative


      Yeah right. And modded to "Informative"? Slashdot moderators are the _pits_.

      Read ingo's reply to Linus. They _did_ start
      one test serially and also _parallelly_ . In short he says that its possible.

      vv

    2. Re:Not 100,000 threads in parallel, just 50. by kinnunen · · Score: 5, Informative
      Read Ingo's posts too:
      actually, that was Ulrich's other test, which tests the serial starting of 100,000 threads. the test i did started up 100,000 concurrent threads which shot up the load-average to a couple of thousands. [the default timeslice the parent has is enough to start more than 50,000 parallel threads a pop or so.]
      And another one:
      Anton tested 1 million concurrent threads on one of his bigger PowerPC boxes, which started up in around 30 seconds. I think he saw a load average of around 200 thousand. [ie. the runqueue was probably a few hundred thousand entries long at times.]
    3. Re:Not 100,000 threads in parallel, just 50. by the_quark · · Score: 5, Informative

      No, seriously. Process creation under Linux was time-similar to thread creation on other OSs. That's because Linux was as fast at creating *a process* as other OSs are at creating *a thread*. IIRC, threading was initially implemented in Linux from the process-creation methods, so it was similar in speed (the main advantage in Linux from threads was the shared memory space if your application wanted that sort of thing). That's why Apache 2.0 is bringing NT performance more in line with Linux 1.3 performance: NT's threading speed is a lot closer to Linux's forking speed. Again, I'd like to underscore I'm not an expert on this, and it's possible I'm mistaken about relative benchmarks (is NT w/Apache 2.0 a little faster than Linux w/Apache 1.3? Could be...) but I'm very confident of the basic underlying point, that Linux process creation is essentially comparable to other OSs' thread creation, perhaps even faster.

      See, for example, http://www.linux.cu/pipermail/linux-prog/2001-Febr uary/000027.html, just one of the first Google links that popped up when I went looking for proof that I'm not on crack: "Linux newcomers often are unaware of the substantial differences between Linux and other operating systems. To implement concurrency, they use multithreading exclusively, mistakenly assuming as high an overhead associated with Linux multiprocessing as on other platforms." In fact, knowing how fast Linux's process creation is relative to other systems' thread creation makes this even more impressive in my mind. This isn't just a bug fix; much like with process creation before it, Linux is doing something fundamentally better than its counterparts.

      Don't forget: Just because this is /. doesn't mean I'm just a Windows-hating troll. I try to make sure all my Windows-hating-troll-posts are at least backed up by facts. ;)

    4. Re:Not 100,000 threads in parallel, just 50. by brianpane · · Score: 4, Informative
      Apache 2.0 doesn't actually do thread creation very frequently. The thread creation cost occurs mostly at startup. So the limiting factors for threaded Apache performance on Linux are mainly:
      • The speed with which the kernel can schedule and context-switch among threads
        For some recent data on this, see http://marc.theaimsgroup.com/?l=apache-httpd-dev&m =103228014211983. The O(1) scheduler patch for 2.4 seems to help here.
      • Memory usage per thread
      • Concurrency limitations of the Apache code itself
        This has been improving gradually with successive 2.0 releases, as the remaining global locks are removed or optimized.
      • General robustness of the thread implementation
        The current (2.4) Linux threading implementation doesn't work well with debuggers.
      At first glance, it looks like the NPTL could be a win for threaded Apache on Linux, as offers some solutions first the first and last of these issues.
    5. Re:Not 100,000 threads in parallel, just 50. by Karellen · · Score: 5, Informative

      It's not process/thread _creation_ times that make the difference, it's the process/thread _context_switch_ times that really mount up, which is where Linux shines.

      And yes, Linux's process context switches are on a par (possibly faster - can't be bothered to look up benchmarks) with NT's thread context switches.

      K.

      --
      Why doesn't the gene pool have a life guard?
  3. Re:I'm only a humble C programmer, but.... by bm_luethke · · Score: 5, Informative

    probably none. On the other hand the field I work in (high performance computing) this will be a great help. Currently we are running a 500,000 processor simulation on a four node cluster, startup and running both is a pain. Remeber, on of the great things about linux is some of the neat/usefull applications being ran on it (human genome, nuclear simulations, fluid simulations). Windows is a toy and geared toward "normal" users (read very few threads not processor intensive). Linux is more of a workhorse (many threads, computationally expensive, and high uptimes). While there are exceptions to this look at advances such as this in that light. And finally, just because you won't use it compiling a kernel doesn't mean it's not needed.

    --
    ------- Sorry about the spelling, I suffer from two problems. Dyslexia makes it difficult to spell well, lazy makes it
  4. Re:Possible use by vsync64 · · Score: 2, Informative

    Except that threads, as far as I am aware, share the same address space. Multiple processes need to arrange to share memory, and therefore are less likely to trample on one another or careen out of control.

    --
    TO BUY A NEW CAR WOULD MAKE YOU SEXUALLY ATTRACTIVE.
  5. Re:Alternative headline by Dahan · · Score: 4, Informative
    Gigantic performance problem in Linux code fixed after several years of "many eyes" scanning over it.

    Uh, why did that get moderated as a troll? Oh, right, Linux is absolutely perfect, and anyone who says otherwise must be a troll.

    Come on, Linux's scheduler has long been known to have performance problems once you have a lot of processes/threads... for example, read this paper [text version] (appropriately subtitled "How I Learned to Love the Alpha and Hate the Scheduler"):

    0.8.1 Create a fixed priority scheduler.
    Currently, the Linux scheduler is very different than the traditional Unix schedulers. Although the Linux scheduler is very efficient when only several processes are running, it is not scalable. In order to match the performance of *BSD and other Unices, another scheduling algorithm must be used.
    Moderators, don't be Slashbots, moderating according to the groupthink. Educate yourselves, and you'll be better moderators, and better people.
  6. Re:Windows comparison by Courageous · · Score: 4, Informative

    Very thread uses a minimum of *1 PAGE* of reserve memory for its statck, which is 64K. However, you have to go out of your way to use less than 1 megabyte of reserve memory. Since only 2GB of reserve memory (addressable memory) is available to user applications, this would fit your 2000 thread figure like a glove.

    C//

  7. Re:nice, but... by Magnus+Reftel · · Score: 4, Informative

    According to a mail from Ingo Molnar halfway down the linked article, M:N threading doesn't really solve the real problem - it's good at switching back and forth between running threads, but the real reason for having very large amounts of threads (be they kernel or user space threads) to begin with, is to do IO, and for that, there is no real advantage of user space threads.

    More info on the 1:1 vs M:N issue can be read in the white paper

    --
    print "Yet another p{erl,ython} hacker\n",
  8. POSIX compliance ahead? by rkit · · Score: 2, Informative

    Scalability is a good thing, no doubt about that. However, there is another aspect that should be pointed out: the current thread API in linux is quite different from the POSIX specification and somewhat crufty. Just to mention the biggest problems:
    missing cancellation points: testing whether a thread has been cancelled should be done in lots of system calls, but linux pthreads do not support this. Instead, you have to call pthread_testcancel() before and after every such call. A real drag.
    signal handling: linux pthread signal handling is very different from the POSIX specification. However, proper signal handling is crucial for any real world application.
    fork() will not work as expected. This is a real nuissance if you want proper daemon behaviour for your application.
    documentation of linux-specific behaviour is poor. As a result, most of the existing literature on thread programming is pretty useless for linux.
    All these points can be worked around, for sure. Nevertheless, it makes writing portable software a nightmare. Porting threaded software to linux, well ... All in all, linux threads really need much better integration with the standard system API. A lot of applications could profit from multithreading. Just think of GUI responsiveness. Also, using threads makes some programming tasks much easier. No need for asynchronous hostname lookup, for example.
    A solid, well documented, standard conforming threads implementation will make linux a much nicer environment for serious programming than it already is. I am really looking forward to this.

    --
    sig intentionally left blank
  9. Re:Real World Example by khuber · · Score: 2, Informative
    I was talking about a hybrid design, not pure select. Of course you are right about the limitations of pure select. Thread per client starts to bog as the number of simultaneous clients increases.

    It's not practical to serve hundreds/thousands of clients with a thread per client model. A typical machine can't handle the load well because it has limited resources. It will thrash. By having a thread pool you place a limit (throttle if you will) on resource utilization. Most high performance, highly scalable web and app servers use this model or a variant.

    There is another architecture based on event driven state machines aka SPED (single process event driven) that is high performance and single process/single thread in its pure form. The Zeus web server does this.

    -Kevin

  10. Re:Threads? by Anonymous Coward · · Score: 1, Informative

    A thread is a "light weight process". Threads have many definitions and types.
    In java's green threads, the JVM has 1 process and many user-space threads.
    In Solaris, there are actually 2 types of threads: a kernel and a user-space thread.

    Linux has had a fast process creation and context-switch so Linus chose to implement kernel threads in terms of a process (clone is the creation function). Pthread is simply a posix API wrapper for a linux thread.

    While there are many thread libraries, pthread is the main one now. NGPT (Next Gen Posix Thread from IBM) was being positioned to replace pthread with a M:N implementation. It is heavier than current pthread, but with ability to create both kernel and user space threads.

    Redhat's library may actually be the replacement just due to it's simplicity (read this as fast and less buggy).

  11. Re:Windows comparison by Anonymous Coward · · Score: 1, Informative

    The 1 MB is the default stack size for every Posix thread. It takes some effort to determine what the smallest valid stack size since PTHREAD_STACK_MIN doesn't specify enough space for the start_func stack frame. The parent post merely stated that the default is 1MB and you have to work to lower that, and he is correct. If the grand-parent poster was just creating threads without specify a stack size, he would run out of RAM pretty quickly.

    I wouldn't want to guess where the 64k figure came from.

  12. Re:I'm only a humble C programmer, but.... by cduffy · · Score: 5, Informative

    KDE actively discourages threads. Perhaps that will change now. Likewise servers, such as apache, will speed up.

    I'm not so sure about that.

    A threaded model doesn't necessarily offer advantages -- Apache's multiprocess model is really just as good on platforms without serious performance penalties on fork(), and Boa (which neither forks nor threads) is much, much faster than either Apache mode (though of course on SMP systems multiple instances must be run to use all the available CPUs).

    Indeed, unless SMP is being taken advantage of, a well-written single-threaded application will always be faster than an equivalent multithreaded application. Such an application has less overhead and is able to jump between its "subprocesses" only when needed -- and without the latencies involved by letting the OS handle said scheduling. Back in the Real World, I still write threaded code -- but because writing unthreaded code (in the problem spaces where threads are useful) is harder, not because it's faster.

  13. Re:Ur browser??? by Anonymous Coward · · Score: 2, Informative

    I can only suppose you don't know what Ur is, maybe because you come from a very different culture...

    Anyway, and I'm really not well qualified to answer this, Ur was an ancient city-state from which a prominent ancestral of the Jewish-Christian-Islamic heritage (Abraham, if I'm not wrong).

    This city, IIRC already found, was sumerian (I'm not sure about this), the folks who are said to be the inventors of the wheel, among other neat things.

    So an Ur browser would be the primeval browser, in other words.

    Upon writing a note, one must be sure it will be understood; nonetheless, the "Ur" mention boosted the note level way up. All in all, I think it was great and I'm all for it.

    But explanations as these sometimes become necessary.

  14. Re:Windows comparison by sagei · · Score: 3, Informative

    Each Linux thread has two things of its own: its own stack, which can be as small as 1 or 2 pages if the code to run is simple enough, and also its own task_struct, which is 1 page including kernel stack for the thread.

    This is not true; the kernel stack is two pages in size, i.e. 8KB on i386.

    Also, in 2.5 (where these tests were done), the task_struct is no longer allocated on the stack. It is allocated off the slab cache, while the thread_info struct is on the stack. The task_struct slab object is another ~1.7KB per task.

    Finally, I do not know what the pthreads default stack size is (user-space? what is that?) but it is certainly larger than one page.

    --

    Robert Love

  15. Re:Windows comparison by Saint+Stephen · · Score: 2, Informative

    I've created over 200,000 process on a PIII 550 laptop with 256 mb of ram running Windows XP. Of course, it took a while (swapping).

    The process is called nothing.exe. Source Code: int WinMain(...) {Sleep(INFINITE);}

    I work at a lab, so I also ran it on a Compaq 8-way with 4-GB of ram. It worked but I don't remember how fast it went.

    However, there is a big gnarley limit in Windows that will limit the # of processes: the amount of memory allocated to virtual desktops or something. We researched it -- Look it up. This is why you get limited to a few thousand processes or threads if they all do GUI stuff. The bad thing is basically any function you call in user32 will register the thread as a GUI thread. It explains it all in the book Inside Windows 2000.

    Not meaning to troll, I'm just going to share basic fact: It sucks that Windows threads are so expensive, but tens of thousands of threads *DOES* suck (read: thread per client) on Windows. However, this is not the same thing as saying Windows doesn't scale -- you just have to code it differently. (Check out how many SQL Server uses when it's processing thousands of clients.) Stuff like IO Completion ports, AWE memory, and Scatter/Gather IO is the way that you have to go.

    Just because you *can* create hundreds of thousands of threads, doesn't mean it's a good idea or that your app won't run like shit on a 32-CPU machine!

  16. It is not broken. by 1101z · · Score: 2, Informative

    No you will see a pid per thread because, that is how the scheduler knows to schedule things. The getpid() c library call from within the program. When they said it is a 1-to-1 mapping that means that there is a process per thread. Just look when you see all those proccesses with the same name, and see if they have the exact same memory usage. If they do it means they are using the same memory and are threads. No matter how you implement threads there has to be more than one proccess other wise when the program blocks for I/O all threads would be blocked.

    --
    One day people will learn the folly of Winbloze, Linux Rules!