Running 100,000 Parallel Threads
An anonymous reader writes "This story explains how the latest Linux development kernel is now able to start and stop over 100,000 threads in parallel in only 2 seconds (about 14 minutes 58 seconds faster than with earlier Linux kernels)! Much of this impressive work is thanks to Ingo Molnar, author of the O(1) scheduler recently merged with the 2.5 Linux development kernel."
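For a rough sense of scale, the figures in the summary can be turned into a per-thread cost. This is only a back-of-the-envelope sketch: it assumes the old figure is exactly 2 seconds plus 14 minutes 58 seconds, and that the cost is spread evenly over all threads, which the story itself does not claim.

```python
# Back-of-the-envelope arithmetic from the summary's figures.
# Assumption: cost is spread evenly over all threads.
THREADS = 100_000
new_seconds = 2
old_seconds = new_seconds + 14 * 60 + 58  # "14 minutes 58 seconds faster"


def per_thread_us(total_seconds, threads=THREADS):
    """Average start+stop cost per thread, in microseconds."""
    return total_seconds * 1_000_000 / threads


speedup = old_seconds / new_seconds

print(f"old: {per_thread_us(old_seconds):.0f} us per thread")
print(f"new: {per_thread_us(new_seconds):.0f} us per thread")
print(f"speedup: {speedup:.0f}x")
```

Under those assumptions the old kernels averaged milliseconds per thread, while the new code averages tens of microseconds.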
The linux song
Arbitrary sig
Your answer:
http://www.cs.wustl.edu/~schmidt/ACE.html
This is so far the best library I have used for pthread programming. Powerful, easy to use, and encapsulates message passing really well...
this image springs to mind
It takes two seconds to start 100,000 threads???? Piff! With my ME computer, it doesn't matter how many parallel threads I am running... I can stop them all instantly by simply attempting to use my computer :P.
And this is great news, and, indeed, impressive. But my question is, what (if any) change is this going to make to my daily use of linux (for gcc, reading slashdot, and that's about it...) Am I going to notice any performance differences?
Launch 100,000 threads while I walk away. . .
OK I'll shut up now.
This is very cool; but does it scale to multiple CPU systems? More and more, SMP, split-bus and multi-core architectures are going to be taking over. If this holds up in those environments, Linux may actually have a leg up on some of the dedicated task heavyweights.
Says the RIAA: When you EQ, you're stealing bass!
So now I'm able to open up 100.000 pr0n pictures in just 2 sec. Ubercool ;-)
Thomas S. Iversen
Yeah right. And modded to "Informative"? Slashdot moderators are the _pits_.
Read Ingo's reply to Linus. They _did_ start one test serially and also one in _parallel_. In short, he says that it's possible.
vv
This could be huge for things like webservers, though, which spend a lot of their time kicking off new (logical) processes. As I understand it, on Linux, a big part of the reason Apache 2.0 hasn't taken off (aside from lack of availability of major packages) is that Apache 2.0's main win is in threading support. Under Linux, thread creation hasn't been much faster than process creation, because process creation was so dang fast.
So, am I right in thinking this means threading (and hence Apache 2.0) will be a big win for Linux web servers, now?
A later post pointed out that Linus was wrong. They actually did both tests: one test created and destroyed threads as fast as possible; the other created 100K threads first and then killed them all.
True, however the feat is still quite impressive. By making the creation and destruction of threads cheaper, it frees developers from having to worry so much about the overall system impact when spawning threads.
For instance, because of the expense, many applications use thread pools, which are simply a bunch of idle threads that sit around doing nothing, waiting for work to do. These idle threads still take up system resources even though they're not actually using CPU. Not to mention the extra work the developers have to do to make the thread pools work for their applications.
Arbitrary sig
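The trade-off described above (a fixed pool of reusable workers versus one fresh thread per job) can be sketched in a few lines. This is an illustration in Python rather than pthreads; `handle_request`, `run_with_pool` and `run_thread_per_job` are hypothetical names made up for the example, and it says nothing about how either approach performs on any particular kernel.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def handle_request(n):
    """Stand-in for real work; also reports which thread ran it."""
    return n * n, threading.current_thread().name


def run_with_pool(jobs, workers=4):
    """Thread pool: a few long-lived threads are reused for every job."""
    with ThreadPoolExecutor(max_workers=workers, thread_name_prefix="pool") as pool:
        return list(pool.map(handle_request, jobs))


def run_thread_per_job(jobs):
    """Thread per job: only sensible when thread creation is cheap."""
    results = [None] * len(jobs)

    def worker(i, n):
        results[i] = handle_request(n)

    threads = [threading.Thread(target=worker, args=(i, n)) for i, n in enumerate(jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    jobs = list(range(20))
    pooled = run_with_pool(jobs)
    per_job = run_thread_per_job(jobs)
    print(len({name for _, name in pooled}), "pool threads used")
    print(len({name for _, name in per_job}), "one-shot threads used")
```

The thread-per-job version is shorter and needs no pool management, which is the simplification cheaper thread creation is supposed to make affordable.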
- Have a picture
"Hello, my name is Ingo Molnar. You killed -9 my process: prepare to die."
:P
Sorry, had to
At school (before I graduated so long ago) we would "fork bomb" the compute servers [ while (1) { fork(); } ] in an attempt to extend deadlines or simply be assholes :)
Religion is a gateway psychosis. -- Dave Foley
Just out of curiosity, how does the benchmark on Windows compare?
- Jeff Brubaker
Under Linux, thread creation hasn't been much faster than process creation, because process creation was so dang fast.
That's called "making lemonade out of lemons". Clearly this test has shown that thread creation in Linux was horribly broken, not the flip side that process creation was so wonderfully good.
I'm building a project where there will be one huge database with up to 200 different companies connected to it pretty much nonstop. 1-10 users from every company depending on the time of the year. 2 threads for every connection.
200*10*2=4000 threads.
Could you please refrain from using "boxen"? It makes my head hurt.
Im not here now... Im out KILLING pepperoni
I have no idea what the hell you're talking about but it certainly sounds impressive. :)
-
Now we finally have the power to run 99,999 pop up ads when we visit that pr0n site
Interestingly enough, either Windows has a quota, or there is some sort of memory leak or something...
The max I can create in a process is 2031 threads... That being done in 700ms.
It's odd because I can create more if I run several processes. It doesn't look like the kernel is choking on thread creation...
Will investigate more.
Normally I am of the "use only as many threads as CPUs" school of thought, but I can think of a reason to use 100,000 threads - imagine a large FTP server, or a multi-homed HTTP server, where you need to provide each connected user with his own set of access privileges or filesystem context. A one-thread-per-connection server may be the easiest way to build security into the system.
No, seriously. Process creation under Linux was time-similar to thread creation on other OSs. That's because Linux was as fast at creating *a process* as other OSs are at creating *a thread*. IIRC, threading was initially implemented in Linux from the process-creation methods, so it was similar in speed (the main advantage in Linux from threads was the shared memory space if your application wanted that sort of thing). That's why Apache 2.0 is bringing NT performance more in line with Linux/Apache 1.3 performance: NT's threading speed is a lot closer to Linux's forking speed. Again, I'd like to underscore I'm not an expert on this, and it's possible I'm mistaken about relative benchmarks (is NT w/Apache 2.0 a little faster than Linux w/Apache 1.3? Could be...) but I'm very confident of the basic underlying point, that Linux process creation is essentially comparable to other OSs' thread creation, perhaps even faster.
See, for example, http://www.linux.cu/pipermail/linux-prog/2001-February/000027.html, just one of the first Google links that popped up when I went looking for proof that I'm not on crack: "Linux newcomers often are unaware of the substantial differences between Linux and other operating systems. To implement concurrency, they use multithreading exclusively, mistakenly assuming as high an overhead associated with Linux multiprocessing as on other platforms." In fact, knowing how fast Linux's process creation is relative to other systems' thread creation makes this even more impressive in my mind. This isn't just a bug fix; much like with process creation before it, Linux is doing something fundamentally better than its counterparts.
Don't forget: Just because this is /. doesn't mean I'm just a Windows-hating troll. I try to make sure all my Windows-hating-troll-posts are at least backed up by facts. ;)
Uh, why did that get moderated as a troll? Oh, right, Linux is absolutely perfect, and anyone who says otherwise must be a troll.
Come on, Linux's scheduler has long been known to have performance problems once you have a lot of processes/threads... for example, read this paper [text version] (appropriately subtitled "How I Learned to Love the Alpha and Hate the Scheduler"):
Moderators, don't be Slashbots, moderating according to the groupthink. Educate yourselves, and you'll be better moderators, and better people.
Every thread uses a minimum of *1 PAGE* of reserve memory for its stack, which is 64K. However, you have to go out of your way to use less than 1 megabyte of reserve memory. Since only 2GB of reserve memory (addressable memory) is available to user applications, this would fit your 2000 thread figure like a glove.
C//
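The arithmetic behind the "fits like a glove" remark can be checked directly. A minimal sketch, using only the figures quoted in the comment (2GB of user address space, a 1 megabyte default stack reservation, a 64K minimum) and assuming nothing else in the process reserves address space, which is of course not true in practice:

```python
# Upper bound on threads per process when each thread reserves
# a fixed chunk of a 32-bit user address space for its stack.
USER_ADDRESS_SPACE = 2 * 1024**3      # 2GB available to user applications
DEFAULT_STACK_RESERVE = 1 * 1024**2   # 1 megabyte default reservation
MIN_STACK_RESERVE = 64 * 1024         # 64K minimum reservation


def max_threads(stack_reserve, address_space=USER_ADDRESS_SPACE):
    """Ceiling on thread count if stacks were the only reservations."""
    return address_space // stack_reserve


print(max_threads(DEFAULT_STACK_RESERVE))  # ceiling with default stacks
print(max_threads(MIN_STACK_RESERVE))      # ceiling with minimum stacks
```

With default stacks the ceiling is 2048, so the observed 2031 threads is what you would expect once the program image, heap and DLLs have taken their share of the address space.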
It's nice that the Linux kernel can handle that many threads. But user level threads generally are even more lightweight, and high performance implementations like those on Solaris provide both user level and kernel level threads and map the former onto the latter. Is Linux going to get something similar? Is Sun perhaps donating their implementation? Or are these new kernel threads so lightweight and quick that they are competitive with Solaris on their own, without the mess and complication of adding user level threads?
How will this change affect Mozilla, the Sun JVM and OpenOffice, for instance?
While it is probably true in general that it will take some time for most applications to start using the new threading model, some larger applications could support it fairly soon.
Can we expect these applications to be adapted to the new threading model some time soon, and how will it affect performance?
The Internet is full. Go Away!!!
Be careful who you call a dumb fuck. Netscape had a functional browser long before IE3, arguably the first usable version of IE. And it would not surprise me if Netscape 1 predated IE 1, though I can't say I know that for sure.
Speeding The Net is an excellent book about Netscape vs Microsoft, in case anybody cares (it's been a long while since I read it, which is why my memory for dates is rusty).
Anarchy$ dd if=/dev/random of=~/.signature bs=120 count=1
...will start writing horrible monsters running hundreds and thousands of threads, and their creations will suffer from all other shortcomings of that decision.
Contrary to the popular belief, there indeed is no God.
- The speed with which the kernel can schedule and context-switch among threads. For some recent data on this, see http://marc.theaimsgroup.com/?l=apache-httpd-dev&m=103228014211983. The O(1) scheduler patch for 2.4 seems to help here.
- Memory usage per thread
- Concurrency limitations of the Apache code itself. This has been improving gradually with successive 2.0 releases, as the remaining global locks are removed or optimized.
- General robustness of the thread implementation. The current (2.4) Linux threading implementation doesn't work well with debuggers.
At first glance, it looks like the NPTL could be a win for threaded Apache on Linux, as it offers some solutions for the first and last of these issues.
I ran this in DOS:
prompt "Enter Password:"
No one could figure out that all I did was change the prompt from "$P$G" to that, and everyone was asking what the password was. Haha, good old teacher was infinitely frustrated as well! IT WAS BEAUTIFUL.
I got kicked out for a year (not beautiful).
100.000 threads? What nonsense; everybody knows that no computer would ever use more than 640.
Wenn ist das Nunstueck git und Slotermeyer? Ja! Beiherhund das Oder die Flipperwaldt gersput.
" - - libpthread should now be much more resistant to linking problems: even if the application doesn't list libpthread as a direct dependency functions which are extended by libpthread should work correctly."
This ought to be a big help for those of us who write plug-in modules for servers like Apache 1.x and PHP. The existing thread library doesn't work properly unless the program executable explicitly links to it, which means that my shared libraries can't take advantage of standard thread management such as pthread_atfork().
Given that Apache 2.x can utilise threads as well as processes, does this mean that you can configure a large web server with, say "MaxSpareThreads 1000000" so that you can cope when you're slashdotted ;-)?
640 should be enough for anybody!
LEXX
"Gold still represents the ultimate form of payment in the world." - Alan Greenspan, 1999
Combine this with Apache2's Multi-threaded or Hybrid MPM and you'll have a heck of a web-server!
It's not process/thread _creation_ times that make the difference, it's the process/thread _context_switch_ times that really mount up, which is where Linux shines.
And yes, Linux's process context switches are on a par (possibly faster - can't be bothered to look up benchmarks) with NT's thread context switches.
K.
Why doesn't the gene pool have a life guard?
Alternatively, you might want to consider that Linux's scheduler was very nicely tuned for far and away the most common case - where you have only a small number of running processes.
Likewise, threading support under Linux has been oriented towards what the developers considered sane: a fairly small number of threads. They had good reasons for considering that the right way to do it - for a start, it worked nicely for what they wanted, and it was sufficiently simple that they didn't have to put in lots of complex code. Further, it's almost never a good idea to have a program architecture that requires very large numbers of threads - it generally only shows up in naive code where people simply don't understand the problems it brings. So, as far as the kernel developers were concerned, stupid people hurting themselves wasn't something to put any effort into ameliorating. This has changed recently, as people have started using Linux in areas where this kind of thing /isn't/ insane, and hence these new developments have come along.
You need to understand the reasoning behind a lot of these decisions before you can start complaining about them. First and foremost, you simply /have/ to realise that the kernel developers care about how people actually use the system, rather than crappy benchmarketing numbers. These developments have come about because people needed them, and they didn't happen earlier because no one had needed them before. Go back and read the last few years of the lkml archives, and /then/ come back and talk about this kind of thing, when you understand /why/.
himi
My very own DeCSS mirror.
Scalability is a good thing, no doubt about that. However, there is another aspect that should be pointed out: the current thread API in linux is quite different from the POSIX specification and somewhat crufty. Just to mention the biggest problems:
- missing cancellation points: testing whether a thread has been cancelled should be done in lots of system calls, but linux pthreads do not support this. Instead, you have to call pthread_testcancel() before and after every such call. A real drag.
- signal handling: linux pthread signal handling is very different from the POSIX specification. However, proper signal handling is crucial for any real world application.
- fork() will not work as expected. This is a real nuisance if you want proper daemon behaviour for your application.
- documentation of linux-specific behaviour is poor. As a result, most of the existing literature on thread programming is pretty useless for linux.
All these points can be worked around, for sure. Nevertheless, it makes writing portable software a nightmare. Porting threaded software to linux, well
All in all, linux threads really need much better integration with the standard system API. A lot of applications could profit from multithreading. Just think of GUI responsiveness. Also, using threads makes some programming tasks much easier. No need for asynchronous hostname lookup, for example.
A solid, well documented, standard conforming threads implementation will make linux a much nicer environment for serious programming than it already is. I am really looking forward to this.
sig intentionally left blank
Then, with NGPT (Next-Generation Posix Threads), those 100,000 threads would be in user space and may be even cheaper.
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
So maybe there is a heavyweight library for some applications, and a lighter weight one for common use.
Probably you do the light one, and include it in the heavy when required.
Ah, the one-size-fits-all thought process...
Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
I think we need to pull some old stats out of our ass. This paper is about the 2.2.x kernel. Correct me if I'm wrong, but hasn't there been massive overhauling of the 2.4.x and 2.5.x kernels in the scheduling area?
I think I'll just slam XP performance based off of NT benchmarks and articles. What the hell, they're both from MS, so the argument must be valid.
Get a grip!
-- Many men would appreciate a woman's mind more if they could fondle it
..actually. ;)
Your answer:
http://www.linux.ncsu.edu/lug/lectures/rpm-pres/mgp00033.html
This is so true to all of us
Wrong in every respect.
First Mosaic was not the 'Ur browser'. Tim's NextStep browser was. Mosaic was browser number 15 or so. The significant things about Mosaic were that 1) it actually compiled without having to hack the code yourself or mess with 6 different support packages like tkwww and 2) it was the first X-Windows browser that did not look really amateur.
Second, Netscape does not contain any code from Mosaic, although it was written by the same main author - Eric Bina. NCSA sold the commercial rights to Mosaic to Spyglass.
Third IE was originally based on the Spyglass code, so if any browser is 'the direct descendant' it would be IE. Go look at the 'about' box on IE, although the original Mosaic actually had more lines of CERN code than NCSA code which were never acknowledged.
Looking for an Information Security student project suggestion?
Try http://dotcrimeManifesto.com/
Ingo:...Anton tested 1 million concurrent threads on one of his bigger PowerPC boxes, which started up in around 30 seconds. I think he saw a load average of around 200 thousand. [ie. the runqueue was probably a few hundred thousand entries long at times.]
Wow.. this is pretty good. The ability to spawn & run 1 million concurrent threads should keep even the most demanding users happy for a few years...
OTOH, I hope this post doesn't become the butt of jokes a few months from now ("and you thought 1 million was a lot! Ha! My Palm 5000XL does more than that!")...
> The title and description is misleading. From the
> comments further down in the article, Linus
> points out that only 50 threads at a time were
> running in parallel:
And the next comment down is from Ingo:
actually, that was Ulrich's other test, which
tests the serial starting of 100,000 threads.
the test i did started up 100,000 concurrent
threads which shot up the load-average to a
couple of thousands. [the default timeslice the
parent has is enough to start more than 50,000
parallel threads a pop or so.]
So, yes, they did manage 100,000 threads running in parallel.
Matt
Your egregious use of the word "egregious".
No one ever had to evacuate a city because the solar panels broke!
I can only suppose you don't know what Ur is, maybe because you come from a very different culture...
Anyway, and I'm really not well qualified to answer this, Ur was an ancient city-state from which a prominent ancestor of the Jewish-Christian-Islamic heritage came (Abraham, if I'm not wrong).
This city, IIRC already found, was Sumerian (I'm not sure about this); the Sumerians are said to be the inventors of the wheel, among other neat things.
So an Ur browser would be the primeval browser, in other words.
Upon writing a note, one must be sure it will be understood; nonetheless, the "Ur" mention boosted the note level way up. All in all, I think it was great and I'm all for it.
But explanations as these sometimes become necessary.
See subject. A useful 'heads up' post for folks like myself who tend to assume that Linux will follow the general Un*x-family behaviours we're familiar with from the commercially-sold variants.
And yes, I would of course ;) check this assumption if I were to do some significant implementation for the Linux platform.
Don't you mean
My name is ingo Molnar.
You kill -15 my parent process - Prepare to die.
I'm a loner Dottie, a Rebel.
User-level threads cannot take advantage of multiple CPUs. True, they are somewhat faster on a single CPU system due to lower overhead, but that's all they are good for.
___
If you think big enough, you'll never have to do it.
ACE is nice for big systems.
But it's also way overkill for small stuff. It's a whole distributed framework, not a wrapper around pthreads.
May we never see th
It's a Windows limit, and it's in the documentation.
C//
The 64K page size is Windows' page size. I can only assume that the poster is stating that the Intel hardware page size is 4K. I would suppose this means that a Windows (2K, NT) page of 64K is assembled from 16 hardware pages, then. The Windows page size of 64K is in their documentation. I never paused to think about how this interfaces with hardware pages...
C//
Currently in Linux every thread is assigned a distinct process ID, and as such, a process has as many entries in `top' and `ps' as it has threads. This makes it difficult to monitor processes externally, or even see the other processes' information. Has this issue been addressed? (I realize this is a user-space program issue, not a kernel issue).
Linus was... Wrong?!
Whoa, that's going to completely shatter the world view of many Slashdotters.
Computer Science is no more about computers than astronomy is about telescopes. --E. W. Dijkstra
I can't seem to find any info on whether Linux core files still produce one core file per thread or just one core file per process (as does Solaris). Has `gdb' been enhanced to handle multithreaded programs (or multithreaded core file) on Linux? If I have a thousand threads - I sure don't want 1000 core files in the event of a crash. Is there a way around this?
Okay, you're wrong. This O(1) scheduler in 2.5.x is the "massive overhauling." (Yes, the patch has been around for a while... but as the article says, it's only recently been merged into 2.5)
Err, Windows NT does use the native 4KB page size on Intel, but is designed to be expandable to systems with up to a 64KB page size. As a result, certain operations (like the reserve mapping that goes on for the thread stack) align data in 64KB increments. IIRC, there is also 64KB of virtual slack between memory mapped objects as well.
A deep unwavering belief is a sure sign you're missing something...
Each Linux thread has two things of its own: its own stack, which can be as small as 1 or 2 pages if the code to run is simple enough, and also its own task_struct, which is 1 page including kernel stack for the thread.
This is not true; the kernel stack is two pages in size, i.e. 8KB on i386.
Also, in 2.5 (where these tests were done), the task_struct is no longer allocated on the stack. It is allocated off the slab cache, while the thread_info struct is on the stack. The task_struct slab object is another ~1.7KB per task.
Finally, I do not know what the pthreads default stack size is (user-space? what is that?) but it is certainly larger than one page.
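To see what these per-thread figures add up to at the scale of the benchmark, here is a rough calculation. It uses the 8KB kernel stack and the roughly 1.7KB task_struct quoted above; the exact byte count used for "~1.7KB" (1740) is an assumption, and the user-space stack and any other per-thread structures are deliberately left out.

```python
# Rough kernel-side memory per thread on i386, from the figures above.
KERNEL_STACK_BYTES = 8 * 1024   # two 4KB pages
TASK_STRUCT_BYTES = 1740        # "~1.7KB" slab object; exact size assumed
THREADS = 100_000


def kernel_bytes(threads=THREADS):
    """Kernel stack plus task_struct for the given number of threads."""
    return threads * (KERNEL_STACK_BYTES + TASK_STRUCT_BYTES)


total = kernel_bytes()
print(f"{total / 1024**2:.0f} MiB of kernel memory for {THREADS} threads")
```

Under those assumptions 100,000 threads need on the order of a gigabyte of kernel memory before any user-space stacks are counted.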
No, seriously. Process creation under Linux was time-similar to thread creation on other OSs. That's because Linux was as fast at creating *a process* as other OSs are at creating *a thread*. IIRC, threading was initially implemented in Linux from the process-creation methods, so it was similar in speed
It was and still is implemented by the process creation methods. Threads were (and still are) the same as processes in Linux (to the kernel, anyhow). All process creation is done by do_fork(), which accepts clone() flags that specify what to share between the parent and the child. "Threads" (as opposed to normal processes) just happen to share a few things: address space, signal handlers, open files, etc.
But yah, process creation in Linux is sick. Hold your head high.
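The practical difference between "shares the address space" and "doesn't" is easy to observe from user space. The sketch below is Python on a POSIX system, not the kernel's do_fork() path: it only demonstrates the visible effect of sharing memory (a thread) versus getting a private copy (a forked child). The function names are made up for the example.

```python
import os
import threading


def mutate_in_thread(shared):
    """A thread shares the parent's address space, so its append is visible."""
    t = threading.Thread(target=shared.append, args=("thread",))
    t.start()
    t.join()


def mutate_in_forked_child(shared):
    """A forked child gets its own copy, so its append never reaches the parent."""
    pid = os.fork()
    if pid == 0:
        shared.append("child")  # modifies the child's private copy only
        os._exit(0)
    os.waitpid(pid, 0)


def demo():
    data = []
    mutate_in_thread(data)
    mutate_in_forked_child(data)
    return data


if __name__ == "__main__":
    print(demo())
```

Only the thread's change survives in the parent, which is the "few things shared" distinction the comment describes.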
You're quite right, and in fact this predates Linux and NT -- Unix was always good at process creation, whereas VMS process startup was very heavy on overhead.
It's not surprising that Linux (modelled on Unix) and NT (originally modelled on VMS) show similar characteristics. It's the reason that many Unix applications tend to be written as a bunch of cooperating processes, whereas NT apps are monolithic monsters with lots of threads.
Unfortunately, thanks to a generation of CS students having learned bad habits on Windows, we're starting to see a lot of Linux apps written as monolithic monsters. (Of course there are a few old Unix apps out there like that too, perhaps some old mainframe mentality leaking through.) There are advantages to cooperating/communicating processes vs the monolithic multithreaded approach: it's easier to test the components separately, it's easier to reuse the components to make different systems, and a bug in one place won't necessarily clobber the whole thing.
-- Alastair
Some guys I know copied a Windows error dialog box and set it as a background image for the desktop, centered.
Imagine the poor victim vainly clicking on the buttons, and getting more and more worried. Said victim actually rebooted the machine to see it reappear, and was not happy when he started to notice the sniggering bunch behind him...
For example pic:
http://www.adobe.com/support/techguides/operatingsystem/windows/winerrors.html ;).
Probably want to replace CCmail with Explorer or something more dear to heart :).
I also installed a bluescreen STOP screensaver on April Fool's day on a colleague's PC. Heh, he was shocked enough to actually call another colleague over and make the usual worried mumbles.
http://www.sysinternals.com/ntw2k/freeware/bluescreensaver.shtml
Since I had admin privs, I was also tempted to have ad.doubleclick.net and similar dns names to resolve to a private webserver which served out custom banner ads.
Wonder how users would take it if they see the "Staff Meeting at 2pm banner ad". Or "Company Slogan here". Or "Big boss is watching you!". Or for search result sensitive ads: "Stop downloading mp3s/movies/porn!"
I could actually justify that as a useful application. It's probably more useful than a doubleclick ad...
But I'd probably need the 100K parallel thread kernel to serve up all those ad banners
Bwahaha!
Link.
SCO and Solaris can both create threads 10,000 times faster than the current Linux kernels, according to Sun's and SCO's marketing departments. My guess is that this was exaggerated, but it is one of the benefits of the big Unixes. Heavily threaded Linux apps have been rumoured to fly on UnixWare where they would run slower on their own native platforms! I guess Linux is maturing in this aspect. Does anyone who knows anything about Unix/Linux threading care to comment? I wonder if this will help Linux in server environments.
http://saveie6.com/
I've created over 200,000 processes on a PIII 550 laptop with 256 MB of RAM running Windows XP. Of course, it took a while (swapping).
The process is called nothing.exe. Source Code: int WinMain(...) {Sleep(INFINITE);}
I work at a lab, so I also ran it on a Compaq 8-way with 4-GB of ram. It worked but I don't remember how fast it went.
However, there is a big gnarly limit in Windows that will limit the # of processes: the amount of memory allocated to virtual desktops or something. We researched it -- look it up. This is why you get limited to a few thousand processes or threads if they all do GUI stuff. The bad thing is basically any function you call in user32 will register the thread as a GUI thread. It explains it all in the book Inside Windows 2000.
Not meaning to troll, I'm just going to share a basic fact: Windows threads are expensive, and tens of thousands of threads (read: thread per client) *DOES* suck on Windows. However, this is not the same thing as saying Windows doesn't scale -- you just have to code it differently. (Check out how many threads SQL Server uses when it's processing thousands of clients.) Stuff like I/O completion ports, AWE memory, and scatter/gather I/O is the way that you have to go.
Just because you *can* create hundreds of thousands of threads, doesn't mean it's a good idea or that your app won't run like shit on a 32-CPU machine!
Hidden in the article was a reference to a new locking primitive, futex. I don't see a manpage on line for it, though. Where is this documented?
See here ( http://lwn.net/Articles/9632/ )
and here ( http://lwn.net/Articles/10248/ )
--Linus is being pigheaded about this patch, wanting to "keep the code simple" instead of implementing Ingo's **fast** + Fixed solution.
To quote LWN:
[ So it's fast - though a few extra features have been requested. But this patch has stirred up a bit of a debate. Rather than put in a complicated new PID allocator, it is asked, why not just make the maximum PID be very large? Then, in theory, the quadratic part of get_pid() will never run so the performance problems go away, and the code stays simpler. Linus prefers this approach, as do a number of other developers; he has put a simple patch along these lines into his pre-2.5.37 BitKeeper tree.
Ingo disagrees, pointing out that any reasonable maximum PID size can be exceeded eventually. He would rather fix the problem than try to hide it behind a large process ID space. In the absence of real-world examples that show people being bitten by get_pid()'s behavior in a larger PID space, though, Linus appears unlikely to accept any more complicated fix.
]
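The "quadratic part" mentioned in the LWN excerpt can be illustrated with a toy model. To be clear, this is not the kernel's get_pid(): `allocate_pid` is a hypothetical allocator that checks each candidate PID by walking a list of tasks, which is enough to show why a crowded PID space is expensive and why a very large, mostly empty one hides the problem rather than removing it.

```python
def allocate_pid(task_pids, last_pid, max_pid):
    """Toy allocator: probe PIDs after last_pid, scanning every task per probe.

    Returns (pid, comparisons). task_pids stands in for the task list.
    """
    comparisons = 0
    candidate = last_pid
    for _ in range(max_pid):
        candidate = candidate + 1 if candidate < max_pid else 1
        in_use = False
        for pid in task_pids:
            comparisons += 1
            if pid == candidate:
                in_use = True
                break
        if not in_use:
            return candidate, comparisons
    raise RuntimeError("no free PID")


if __name__ == "__main__":
    tasks = list(range(1, 1001))  # 1000 live tasks with PIDs 1..1000

    # Crowded case: the search starts inside the occupied range.
    print(allocate_pid(tasks, last_pid=0, max_pid=2000))

    # Sparse case: the search starts just past the occupied range.
    print(allocate_pid(tasks, last_pid=1000, max_pid=2000))
```

In this model the crowded search costs roughly the square of the task count in comparisons, while the sparse search costs a single pass, which mirrors the two positions in the debate.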
== WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??
I remember that Linus made a remark that he thought the O(1) scheduler wouldn't impact Linux much at all, and that its development would not be a biggie for Linux, downplaying the importance of what it can achieve. Go Ingo for keeping at it!
--- Hindsight is 20/20, but walking backwards is not the answer.
- Consider that the Linux scheduler hasn't changed significantly in those THREE years.
- Consider Ingo Molnar's post on the subject.
- Consider providing some evidence for your position, rather than just saying that I'm wrong.
- Bulleted lists are pretty. I can do that too.
If you guys have some evidence that the paper I referenced is no longer valid, please post it (or references to it). Don't just tell me "oh, that paper's ancient; things are different now." 'Cuz up until fairly recently, they weren't.
P.S. And if anyone wants to compare Windows XP's scheduling performance with NT's, be my guest... I don't think you'll see much of a change. Remember that XP is just NT 5.1, and I haven't heard about any significant performance improvements in NT's scheduler. (The only vaguely scheduler-related change I remember is the addition of "fibers" in NT 4.0 SPsomething (3?))
The latency issues that cause mp3 skipping under heavy load in Linux have nothing at all to do with context switching, and everything to do with /scheduling/ latency: how long it takes for a process that has work to do to actually get control of the cpu. Context switching has /nothing/ to do with that.
The low latency patches go through the kernel breaking up areas where spinlocks are held for long periods of time. That's what causes massive scheduling latency in the kernel.
Context switching under Linux /is/ extremely fast - it's actually been measured (a lot), and it's something the kernel developers pay a lot of attention to and optimise very carefully. They literally count cpu cycles in these code paths. Context switching time is a serious performance limiter in many areas, so getting it right is important, and it's something that Linux does /very/ well.
Go do some real research before you accuse someone who's right of karma whoring bullshit.
himi
My very own DeCSS mirror.
Hardware page size is 4KB, as was noted elsewhere. The key element that I haven't seen mentioned is that Windows' virtual memory system has several ways to 'allocate' memory. There's reserving pages, and there's committing pages. In the case where you tell the OS you want memory, it reserves pages. That is to say, it does not actually take memory from the free physical memory, but instead creates a contiguous address space large enough for your request, but allocates no hardware RAM at those addresses.
A page is committed when you access (read or write) an address that has no page mapped to it: the access trips a hardware fault, and the VM then searches for a free page and links the two together.
The end result is, even if Windows does try to create 64k worth of memory segment space for a process, unless it is actually reading or writing to a byte in each 4k chunk, its internal VM will not allocate physical memory for the whole 64k. Furthermore, there's no such advantage or realistic way for the operating system to align anything in memory physically, except in AGP ram. The VM system handles physical pages of memory exclusively, but does not manage AGP-allocated memory (IIRC). In other words, though the OS can align the address space to anything it likes, the OS layer cannot request any physical allocation mapping or alignment. So that comment about aligning memory for processes is quite unlikely.
Now, the XBox (which runs a variant of the Win2k kernel) has a bit more control over VM, but it also does not support demand paging, so it cannot swap to the hard disk and give you RAM+HD effective memory. Shame, that. But, as a result, you have an API that allows hardware level allocation control. Still, the OS doesn't take advantage of it, AFAIK. It's for developers.
Any connection between your reality and mine is purely coincidental.
The end result is, even if Windows does try to create 64k worth of memory segment space for a process, unless it is actually reading or writing to a byte in each 4k chunk, its internal VM will not allocate physical memory for the whole 64k.
Yes. Quite true. I had a problem a while back on Windows which took me a bit of reading through the documentation (and verifying with some low level sys calls) to determine that what was happening is that I was running out of "reserve memory". Which is to say that, while I had plenty of physical memory left, all the address space had been used up. You can do this very easily by creating thousands of threads on your computer. To get a large number of these threads, you'll have to push the default stack size to its minimum, 64K. I was a bit dissatisfied with this minimum, but I suppose I'll live with it now (or port to linux) if I have to, or upgrade to a 64 bit OS if it becomes a practical limit in the future.
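A rough back-of-the-envelope for the address space limit described above. The 64K minimum stack reservation is the figure from the post; the 2 GiB user address space and the 1 MiB default reservation are assumptions made for this sketch, and the result is an upper bound that ignores everything else living in the address space.

```python
KIB = 1024
GIB = 1024 ** 3

USER_ADDRESS_SPACE = 2 * GIB         # assumption: 32-bit process, 2 GiB user half
MIN_STACK_RESERVE = 64 * KIB         # per-thread minimum quoted in the post
DEFAULT_STACK_RESERVE = 1024 * KIB   # assumption: a 1 MiB default reservation


def max_threads(address_space, stack_reserve):
    """Upper bound on threads if stacks were the only users of address space."""
    return address_space // stack_reserve


print(max_threads(USER_ADDRESS_SPACE, MIN_STACK_RESERVE))      # 32768
print(max_threads(USER_ADDRESS_SPACE, DEFAULT_STACK_RESERVE))  # 2048
```

Under these assumptions the reservations alone cap the thread count at a few tens of thousands, regardless of how much physical memory is free.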
C//
Yes, that's a performance issue, but it's not a /latency/ issue - the new process is running, and from there on in the latencies are only a few hundred cycles rather than measurable in microseconds. Until the next time the process enters the kernel, or page faults, or whatever. As far as latency goes, context switching is of minimal importance unless you're worried by latencies on the order of less than a microsecond (depending on hardware and the like, of course).
The argument that threads trash cache less than full processes seems fairly bogus to me - the cache trashing will be much more dependent on the size of the working sets of all the running processes, and there's nothing to say that a thread will have a smaller working set than a process. The text segment will be shared, yes, but it's the same with multiple instances of /any/ process, because the program text will be mmapped read only, allowing the memory to be shared, and thus kept in cache. The TLB flush needed would be an added cost, but unless the cache really is being trashed completely by your program it'll be reloaded straight from cache, and that shouldn't be more than a few hundred cycles (I think - don't quote me on that).
In any case, the real performance comparison isn't between multiple processes versus multiple threads, it's between a multithreaded implementation and a single-threaded one. In /that/ comparison threads come last, simply because they /have/ those kinds of cache interactions and so forth, where a single-threaded version won't. They also have overhead due to locking, greater debugging difficulties, and other added complexities. On the other hand, though, you can't make use of more than one processor without having multiple processes, whether they're threads or full processes . . .
I think the biggest thing making threads attractive to people is the fact that a threaded approach will often make things simpler to think about in the design stage. You can make all the independent threads of control in your design /real/ threads of control in the implementation. That comes at a cost, though . . .
Personally, I like the quote from Alan Cox that I've seen in a few people's .sig: "Threads are for people who can't program state machines". It's more complex than that, but it does seem to capture a lot of what motivates threaded designs.
himi
My very own DeCSS mirror.
Err, Windows NT does use the native 4KB page size on Intel, but is designed to be expandable to systems with up to a 64KB page size. As a result, certain operations (like the reserve mapping that goes on for the thread stack) align data in 64KB increments.
That's boneheaded. Linux supports page sizes up to at least 4MB, but it doesn't align everything on 4MB boundaries on the off chance that you might be using 4MB pages. It uses the appropriate alignments for the page sizes actually in use.
An OS that has dropped all support for non-Intel hardware citing a portability concern which doesn't exist in portable OSes? As they say in Snatch, "It's spurious, mate. Not genuine."
Sumner
rage, rage against the dying of the light
And yes, Linux's process context switches are on a par (possibly faster - can't be bothered to look up benchmarks) with NT's thread context switches.
Last time I benchmarked, which was a long time ago (NT 3.51 days), Linux process switch times were 5x faster than NT thread-switch times on the same hardware. Linux thread-switch times were on a par with process-switch times, NT thread-switch times were about 20x faster than NT process-switch times.
I'd expect all those numbers to have changed, though.
Sumner
rage, rage against the dying of the light
Yes, I know you are right. Amongst other things, I won't be stuck with 64K per thread stack in Linux, and as you say, I could use 64 bit alpha linux. I'm looking forward to Hammer, actually.
C//
Why does it need to be larger than one page? The kernel will trap page faults caused by stack overflow, and will allocate additional stack anyway.
It does not need to be bigger than one page, it just is. You are right, the stack is expanded via implicit mmap as it grows... but for performance reasons the default stack is usually measured in megabytes, not pages.
Anything but the simplest of applications would use a page rather quickly. User-space applications are programmed to assume they have any size stack they want. Local variables are huge.
In short, I was just commenting on the default. It can surely be lowered...
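As a sketch of lowering the default: the posts are about C and pthreads, but Python's standard `threading.stack_size()` exposes a comparable knob for threads created afterwards. The 256 KiB figure and the helper names `run_with_stack` and `shallow_work` are assumptions chosen purely for illustration.

```python
import threading

SMALL_STACK = 256 * 1024  # assumption: 256 KiB, chosen only for illustration


def run_with_stack(size, func):
    """Run func in a new thread whose stack size is `size` bytes."""
    result = []
    previous = threading.stack_size(size)  # applies to threads created afterwards
    try:
        worker = threading.Thread(target=lambda: result.append(func()))
        worker.start()
        worker.join()
    finally:
        threading.stack_size(previous)     # restore the earlier setting
    return result[0]


def shallow_work():
    # Uses very little stack, so a small stack is enough.
    return sum(range(1000))


print(run_with_stack(SMALL_STACK, shallow_work))  # 499500
```

A thread that recurses deeply or keeps large local buffers would need more than this, which is the usual argument for the generous defaults.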
I don't understand what the issue is here.
I was able to run 1,600,000 simultaneous connections with a modified FreeBSD kernel, in June of 2001. Couldn't get much work done, but at about 300 baud per connection, after dividing up a gigabit ethernet link... you shouldn't expect to do much work.
Without modifications, after a patch to the credential reference counting (since committed to FreeBSD 4.5), as long as a stock kernel is tuned correctly, it can still *easily* handle 100,000 simultaneous connections (16K of window space for each connection = 1.6G of mbufs).
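The arithmetic behind these figures, as a quick check. The connection counts and the 16K window come from the posts; the per-connection bandwidth is a raw division of the line rate with no framing or protocol overhead modeled, so it is only an upper bound.

```python
CONNECTIONS = 100_000
WINDOW_BYTES = 16 * 1024                 # 16K of window space per connection

total_bytes = CONNECTIONS * WINDOW_BYTES
print(total_bytes / 1e9)                 # 1.6384, roughly the 1.6G quoted

LINK_BITS_PER_SECOND = 1_000_000_000     # gigabit Ethernet, raw line rate
MANY_CONNECTIONS = 1_600_000

raw_share = LINK_BITS_PER_SECOND / MANY_CONNECTIONS
print(raw_share)                         # 625.0 bits per second per connection
```

The raw share of 625 bit/s is the same order of magnitude as the "about 300 baud" quoted; the lower figure would be consistent with overhead that this sketch does not account for.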
-- Terry
So? Use non-blocking I/O instead. Problem solved.
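A minimal sketch of what that looks like in practice, assuming Python's standard `selectors` module: one thread services several connections without ever blocking on any single one. `socket.socketpair()` stands in for accepted network connections, and the function name `serve_pairs` is hypothetical.

```python
import selectors
import socket


def serve_pairs(pair_count, payload=b"ping"):
    """Echo one message on each of `pair_count` connections from one thread."""
    selector = selectors.DefaultSelector()
    clients = []
    for _ in range(pair_count):
        client, server = socket.socketpair()  # stand-in for an accepted connection
        server.setblocking(False)             # reads never block the loop
        selector.register(server, selectors.EVENT_READ)
        client.sendall(payload)
        clients.append(client)

    handled = 0
    while handled < pair_count:
        for key, _events in selector.select(timeout=1.0):
            conn = key.fileobj
            data = conn.recv(4096)            # ready, so this returns at once
            conn.sendall(data.upper())        # tiny reply, fits the socket buffer
            selector.unregister(conn)
            conn.close()
            handled += 1

    replies = [c.recv(4096) for c in clients]
    for c in clients:
        c.close()
    selector.close()
    return replies


print(serve_pairs(3))  # [b'PING', b'PING', b'PING']
```

The loop only touches sockets that the selector reports as ready, so no per-connection thread is needed to avoid stalling on I/O.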
-- Terry
No, you will see a pid per thread, because that is how the scheduler knows to schedule things. The getpid() C library call from within the program. When they said it is a 1-to-1 mapping, that means there is a process per thread. Just look when you see all those processes with the same name, and see if they have the exact same memory usage. If they do, it means they are using the same memory and are threads. No matter how you implement threads there has to be more than one process, otherwise when the program blocks for I/O all threads would be blocked.
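A small way to inspect which identifiers each thread sees, assuming Python 3.8 or later for `threading.get_native_id()`. Note the assumption: the post describes the thread library of its day, where each thread showed up with its own pid; on current Linux systems threads share one pid and differ in their kernel thread id, which is what this sketch would report there. The helper name `collect_ids` is hypothetical.

```python
import os
import threading


def collect_ids(thread_count):
    """Return (pid, kernel thread id) pairs seen by concurrently running threads."""
    seen = []
    lock = threading.Lock()
    barrier = threading.Barrier(thread_count)  # keep all threads alive together

    def worker():
        ids = (os.getpid(), threading.get_native_id())
        with lock:
            seen.append(ids)
        barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


ids = collect_ids(4)
# Number of distinct pids, then number of distinct kernel thread ids.
print(len({pid for pid, _ in ids}), len({tid for _, tid in ids}))
```

Either way, the kernel schedules each thread as its own entity, which is the 1-to-1 mapping the post refers to.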
One day people will learn the folly of Winbloze, Linux Rules!
_Need_ the low latency patch? We don't.
karellen $ uname -a
Linux foo 2.4.17 #1 Sat Jul 13 12:21:18 GMT 2002 i686 unknown
karellen $ cat /proc/cpuinfo | grep -E "model|cpu"
cpu family : 6
model : 3
model name : AMD Duron(tm) Processor
cpu MHz : 757.485
cpuid level : 1
karellen $ cat /proc/meminfo | grep MemTotal
MemTotal: 126732 kB
karellen $
So, I'm running 2.4.17 on an AMD 750 with 128MB of RAM. You'll have to take my word that that's a stock 2.4.17, with no patches, but I'm playing a list of .ogg files with xmms, while ripping and ogging a CD in the background, with Mozilla running, and grabbing a mozilla window and moving it around the desktop (with opaque window moving switched on) really quickly for 20 seconds results in - no skipping.
Yeah, reducing latency will be nice, but as far as I can tell, it's not actually needed for anything to do with the `user experience' at the moment.
Don't know what you've got running in the background, but it must be pretty hefty.
K.
Why doesn't the gene pool have a life guard?
...run Ada 83 programs.
But while their threads will be slow, they will be able to handle the text the users are entering; vastly more useful than the most optimized eight-bit character horror you would turn out.
Trolling is supposed to be:
1. Fast! Writing random mild insults almost a week after the original posting isn't as great as making a real-time flamewar immediately after posting.
2. Accessible to a potential reader. Referring to an obscure recurring theme of my rants made months before this article (byte-value transparency of protocols vs. Unicode references in RFCs) would require a lot of googling from a potential troll spectator before he could appreciate your comment.
Contrary to the popular belief, there indeed is no God.