Linux 2.6 Multithreading Advances
chromatic writes "Jerry Cooperstein has just written an excellent article explaining the competing threading implementations in the upcoming 2.6 kernel for the O'Reilly Network."
← Back to Stories (view on slashdot.org)
While it's great that Linux has excellent multithreading support, it's a shame, however, that many programmers do not take advantage of multi-threading in their programs.
The worst example of this was the Quake I source code, which was used for many games, including Half-Life. The code was not multi-threaded, and the network code sat idle while everything else drew -- adding about 20ms of lag, unless you cut the frame rate down to about 15 or so.
The problem wasn't fixed in Half-Life -- the most popular multiplayer game of all time -- until sometime in 2000. We can only imagine how many other programs are not taking full advantage of multithreading.
Sigh, I'll do your web searching for you.
Basically, while Linus was incommunicado sailing across the ocean, someone got jumpy and suggested 3.0 should be the next step.
It might be more likely that it proceeds through 2.10 and higher before going to 3, though. Just to confuse the people who think version numbers are floating-point.
In terms of software archeology, there is an important intermediate ancestor.
Quake's original networking was meant for LANs only- the fact that it was even barely playable over the internet suprised the authors.
idsoftware soon released QuakeWorld free to Quake owners. It used the same interface and most of the graphics resources as Quake, so its arguably not a different program. But it came as a separate executable, with many Quake features removed (like monsters). And most importantly, the networking code was entirely re-written.
It is that code that QuakeII and successors derived from.
From what I understand NGPT is mainly a user space thing. Why not go with the 1:1 one in the kernel (NPTL or whatever), just have a libpthread.so (NPTL runtime) and libpthread-mn.so (NGPT). From a programmer's standpoint, when I say pthread_create() I want to know exactly what that does: with NPTL I know what happens. With NGPT I don't. Also, the old rule of "Don't pay for what you don't use" applies. If I'm going to have just, say, four threads, those four threads are going to run better as four kernel threads as opposed to 2 LWP's dynamically mapped between 4 CPU contexts.
But, again, I might want to write a server of some sort which handles hundreds of thousands of connections at once, but 99% are idle at any given time and the other 1% require some nontrivial processing sometimes and/or a long stream of data to be sent without prejudicing the other 99%. Now, for ANY 1:1 threading system, I can't just create x * 10^5 threads because the overhead would be colossal. But equally so, implementing this with poll() is going to be horrid, and if the amount of processing done on a connection is nontrivial and/or DoS'able, there's going to be tons of hairy context management code in there, until lo and behold you end up with a 1:N or M:N scheduling implementation yourself. NGPT could be very useful as a portable userspace library here, as these people have implemented an efficient M:N scheduler under GPL, something that hasn't existed before and could be very useful. I think these libraries might be much more complimentary than the article makes out.
I can easily imagine that one of them might be more efficient for gigantic numbers of threads that don't individually do much, or maybe one might be more efficient for very large numbers of processors, but I don't know jack about the issues involved, so I'm just talking out my ass. (Hello! I'd like to "ass" you a few questions!)
So, someone who knows... Are these threading systems good for different things? And would it really be that hard to make them both come with the kernel?
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
I really don't understand where people get that ridiculous idea.
Half-Life was mostly based off Quake1. The network protocol and prediction code was taken from QuakeWorld. Some small Quake2 functionality was merged later on.
The initial release of Half-Life was approximatly 65% Quake1, 15% QuakeWorld, 5% Quake2 and 15% Original(Not including the bastardisation of the code into MFC/C++).
And yes, people from Valve have confirmed the base was Quake1, not (as some people continue to claim, and I really wish I knew where the rumor started) Quake2.
Also, the percentages are based off some reverse engineering work I done a while ago when I was playing with making a compatible Linux clone of Half-Life.
(FYI, I took the Quake1 engine.. added Half-Life map loading and rendering within about three hours... Half-Life model support took about four days, and adding a mini-WINE dll loader for the gameplay code took about a week. I gave up on the project when it came down to having to keep it up-to-date with Valves patches)
- Ender
Founder, http://www.quakesrc.org/
Project Leader, http://www.scummvm.org/
I don't understand this all that well myself, but I did just read the whitepaper linked to in the article written by Ingo Molnar and Ulrich Drepper. From the looks of things, NGPT's M:N model will cause a lot more problems because of the difficulty of getting the two schedulers (userspace and kernelspace) to dance well together.
/proc directory when a process has tons of threads.
By sticking with the 1:1 solution that's currently used in the kernel and the NPTL model, there's really only the kernel scheduler to worry about, making things run a lot more smoothly generally. I'd imagine latency being a big issue with M:N (I'm pretty sure that it was mentioned in the whitepaper). I haven't read the other side of the issue, but I think that pretty graph in the O'Reilly article says it all performance-wise.
There are other issues though, like getting full POSIX compliance with signal handling. The 1:1 model apparently makes signal handling much more difficult (I don't know anything about the POSIX signaling model, but there's a paper about it on Drepper's homepage that could probably shed some light on the subject if you were so inclined. There are other issues in the current thread model that have to be dealt with in a new 1:1 model (and are) such as a messy
From the whitepaper, it seems that the development of the O(1) scheduler was meant to facilitate the new thread model they've developed, which I hadn't thought about before even though it makes sense. There's still some issues to work through, but both models look promising. If the signal handling issues can be resolved it looks like from the article that NPTL's model will win on sheer performance.
As for making them both come with the kernel, that's really really difficult, since this stuff touches on some major pieces of the kernel like signal handling. The same way you're only going to get one scheduler and VM subsystem, you're only going to get one threading model. You're able to patch your own tree to your heart's content, but as per a default install, there can be only one.
"I may not have morals, but I have standards."
I know most of the Tenebrae developers, and trust me, none of them have ANY time to do this.
:o)
Besides which, Tenebrae's code is just a proof of concept, and not remotely suitable for a real workable port.
That, and I'm too busy with ScummVM to help anyone implement this stuff, it's very disjointed and hacky. And I'm already revered, thanks
(If anyone does want to play with this type of thing theirselves, check the Forums section of my QuakeSrc site. Both the half-life map rendering code, and an older version of my Half-Life MDL renderer, are floating around in several engines. I'm sure someone on the forums would be happy to help you.
Your on your own on network protocol and merging the WINE stuff tho. I done this port some years ago, and I don't have much of the source left)
- Ender
Founder, http://www.quakesrc.org/
Project Leader, http://www.scummvm.org/
Wheren't we going to go straight to 3.0?
I don't think the 2.6 vs. 3.0 debate is over yet. But it seems to be quiet right now. I think the discussion will live up when the release date starts getting close about a year from now. And I even think there will be discussion after the release, because the version number come as a complete surprise to some people. And I will not try to guess how much doubt will be in Linus mind once he actually wants to release the thing.
But if you want to be unambigious when talking about it, you should call it 2.6. If it turns out not to be the case everybody will know that it was indeed 3.0 you were talking about. But if you talk about 3.0 already now, and it turns out to be called 2.6, then 3.0 might be something else released in the future.
Do you care about the security of your wireless mouse?
has a nice article about the state of threading on Linux. See the Sept. 27th Weekly Edition.
scheduler does not immediately respond to priority changes
thread-specific storage access is slow
There is a well known effect in multi-threaded programming called priority inversion that can cause deadlocks when a low-priority thread has acquired a resource that a high priority thread is waiting for, but a medium priority thread keeps the low priority thread from beeing executed and so the medium priority thread effectively gets more cycles than the high priority thread.
One way to overcome this problem is to use priority ceiling locks where the priority of a thread is boosted to a ceiling value when it acquires a lock. Unfortunately I found that changing the priority of a thread for a short interval does not have any effect at all with the current 2.4.x standard pthreads implementation.
The second problem I ecountered is that accessing thread-specific storage with pthread_getspecific() takes 100-200 processor cycles on 1 Ghz PIII, which makes this common approach to overcome hotspots almost as slow as locking.
Does anyone know if any of these issues are adressed by the new implementations ?
p.
Without order, nothing can exist. Without chaos, nothing can be created.
I don't see how someone can say that "kernel thread scheduling" is slower than "user thread scheduling". Whatever algorithms pthreads library is using could also be used by a kernel process scheduler and offer the same benefits for daemons that fork() a lot of processes. Indeed, most of the time threads are not used to take advantage of multiple processors. Instead they are used in place of multiple processes with some shared memory that handle multiple requests at once. If they could be re-written to really be multiple processes with some shared memory, the resulting application will be simplier and possibly more stable/secure because only some portions would need to worry about concurrent access. Conceptually, there is no reason why kernel code shouldn't use virtual memory, start system-use processes/threads, load shared libraries and so on. Or why "user" code shouldn't handle IRQs, call internal kernel functions or run in CPU supervisor code. Some tasks demand a certain programming model. For example, one would hope that a disk IRQ handler doesn't use virtual memory. But there is no need to place artificial restrictions to the point that multi-level schedulers and duplicated code are needed to run a nice Java web server.
Someone should start a site that covers long term issues, rather than the week by week stuff I've found on the web... or maybe someone has, and I'm just too out of the loop....
KernelTrap.
ex$$
Mostly because the was unix VMs are designed is much more efficient at multiple process programs than windows is. Windows started doing threads long before smp was all that common. They did it because multi process was slow as hell. But for 90% of tasks it worked just fine in linux. And its not like linux is just now moving to a thread model. Its just making the existing one (which worked well until you scale to many many threads) a bit better. And by better I don't mean similar to windows performance, I mean similar to solaris (which has threading from the gods).
Switching between user and kernel mode takes time. If all your primitive operations are implemented in user mode, synchronisation (for instance) takes several cycles in the best case (resource is free, lock it), and a bit over a hundred in the worst (resource is busy, context switch). When you also add the user/kernel mode transition (which may be a couple dozen cycles on some RISCs but takes more than a hundred on some x86 architectures), you can see how performance may degrade.
Multics had multithreading in the early 1970s.
Windows was still launched from DOS in 1992.
Please go back to your "innovating" with Windows.
-Kevin
Seriously, what do you use threads for? Unix traditionally uses seperate processes to handle most things Windows folks use threads for. Since Linux has things like copy-on-write and shared memory, it's not that much less efficient. Plus you get the advantage of complete address seperation, aka they can't crash each other.
Try lwn.net. They have a weekly overview of the kernel status. Since they moved to a subscription model, you have to pay to see the latest news, but previous weeks can be viewed for free.
"Just like Communism, GNU/HURD will never take off."
More realistically, like communism GNU/HURD is a great idea in theory, but the only available example right now is run by fascist lunatics and doesn't work...
AC because I'm too cruel
The "complete advantage" of running 'would-be threads' in separate process spaces also has the "complete disadvantage" of an inability to share data without using some heavyweight mechanism. Not to mention a heavyweight context switch.
Yes, threads can be misused and abused, but for some problems they're the right tool. All the developers in this article are trying to do is develop a hammer that won't break the first time they hit a nail with it (IMHO, Linux's inability to deal with massive numbers of threads without severe performance degredation makes the threading implementation approach uselessness. I'm glad someone is fixing it).
Debugging multithreaded programs in Linux is a complete bitch. As the article mentioned, the core dump only has the stack of the thread that caused the fault. Yes, I know any competant multithreaded programmer uses log files extensively in debugging such code but any additional tool helps. Either of these LinuxThreads replacements would be a major improvement. I just hope the major distros roll in either package in their next release.
I bet the 1:1 package would have finer-grained context switching, though. M:N models tend to switch thread contexts only during I/O or blocking system calls. With finer-grained thread switching you tend to expose more bugs in multithreaded code, which is a very good thing. But I suppose even in an M:N model you could always set M=N to acheive similar results.
Regarding thread specific data access: If your LinuxThreads library uses floating stacks (for ix86 this means it has been built with --enable-kernel=2.4 and for i686) it already will be faster.
For other TLS enhancements take a look at http://people.redhat.com/drepper/tls.pdf.
Here, here: acknowledgement of technical excellence where present, disparagement of ethical vacuum where noticed.
It's far more instructive for those of us a little lower on the learning curve to observe these architectural questions in public, than to have them made for us by Those Who Know Best.
Prediction: both implementations have their appropriate place. In the ideal world, you'd install the proper one based on the application.
The one that becomes the usual suspect in commercial distributions is likely the one that is best suited for business tasks, which aren't that thread-happy in the first place, if my guess isn't too far off...
Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
This is not true. Every process in Linux is a full-weight process. The fact that some of those processes may map to the same memory space does not make them in any way lighter to the other parts of the kernel. What Linux does have is a lightweight clone() system call.
Solaris is an excellent example of the differences between processes, lightweight processes (lwp's or kernel threads), and user threads.
I am not well-versed in the world of Linux, ( have my own allegiances but am being drawn to it more and more. Reading the article, it felt very clear to me that Linux will prevail (with a nod to William Faulkner's Nobel speech).
Consider a few quotes from the article:
Perhaps others have already pointed this out, but I am newly impressed with the universal nature of Linux. The power of an operating system that *everyone* is interested in improving, and has the opportunity to improve, is awesome. Yes, Microsoft has tremendous resources, and very earnest, good-willed, brilliant people. But to improve Microsoft's kernels, you have to work for Microsoft. That means switching the kid's schools, moving to Redmond, etc. etc. On the other hand, everyone from IBM to HP to some kid in, say, Finland, can add a good idea to Linux. When the kernel's threads implementation is a topic for conversation at conferences, with multiple independent teams coming up with their best ideas, Linux is sure to win in the long run.
I'm struck by the parallels to my own field of scientific research: Yes, the large multinational companies have made tremendous contributions in materials science, seminconductors, and biotech. They work on the "closed-source", or perhaps "BSD" model of development. But it is the "GPL"-like process of peer-reviewed, openly shared, and collaborative academic science that has truly prevailed.
four nine eighteen twenty-7 thirty-nine forty-7 fiftyeight sixty-nine seventy-9 eighty-8 one-hundred-and-nine one-twenty
Correctly programming threads is hard, so they should only be used when necessary. Many of the things that can be done with threads can be done more safely with fork() and/or select(). Since Windows lacks the former and has a broken version of the latter, Windows programmers tend to use threads when Unix programmers would use an alternative.
LOL! Why did this get modded up? He provided a link to the exact same link and he's modded up? It's not as if it's another unique link...it's the exact same link.
Come on people..mod him back down.
LOL.
Regarding changing of priorities, I think that with SCHED_OTHER the priority is beeing automatically modified by the scheduler to distibute cycles in a more fair fashion.
I tried both SCHED_RR and SCHED_FIFO and changing priorities basically works, but it seemed to me that changing priorities did not have an immediate effect as required to implement priority ceiling locks.
For example, when boosting the priority of a thread to the ceiling priority, and the thread is the only one with this priority, I expect it to run without beeing preempted by anyone before the priority is lowered or the process blocks. On the other hand, when lowering the priority, I expect a higher prio thread to be executed immediately. I would also expect the order of unblocking threads is correctly adjusted when their priority was changed while suspended.
However, it seems that priority changes do not much affect the actual timeslice or the unblocking order, but I did not have the means to find out what exactly happens; using a debugger is outright impossible with fine-grained multi threaded programs.
Is it possible that some system thread needs to run inbetween to do some housekeeping ? Do you have any hints about the scheduler's inner workings ?
Thank you
p.
Without order, nothing can exist. Without chaos, nothing can be created.
It's because of Solaris's inability to do 1:1 decently that we had Sun pushing for POSIX being M:N and consequently making the entire Linux world miserable.
1:1 is a cleaner, simpler model.
May we never see th
It's not going to be 3.0?? I thought that was the decision since so many changes and additions/features are being put into this kernel..
Berto
You poor Unix guys are struggling through something we all went through years ago -- learning how to think more sophisticated than a single thread of control correctly.
What the heck does altering the structure of a thread *library* have to do with application-level thread programming? What are you talking about?
May we never see th
Because 1:1 implementations are well known to not scale well because of context switch overhead and synchronization overhead.
For systems that don't require true high-end scalability, 1:1 works fairly well. It's because of this that M:N has some proponents.
The change in thinking for this is argued in this Sun Whitepaper , and this FAQ .
If one believes the Sun guys have a clue, you can take this as a vote in favor of 1:1.
IMO, anyone who runs more than about 4*NCPUS threads in a program is an idiot; the benchmarks on 10^5 threads are absurd and irrelevant.
Once you run a reasonable number of threads, you can be quickly driven to internal queueing of work from thread to thread; and by the time you have done that, you may already have reached a point of state abstraction that lets you run event driven in a very small number of threads, approaching NCPUs as the lower useful limit. Putting all your state in per-thread storage or on the thread stack is a sign of weak state abstraction.
-dB
"It if was easy to do, we'd find someone cheaper than you to do it."
Nice, I look forward to the improvemenets. One big problem is that I can not find a way to determine that two processes are actually threads of the same process. It is possible to guess, of course, but is there a way to conclusively determine for sure, while we wait for these improvements?
We should only go to 3 if the new kernel breaks binary compatability with 2.4/2.5.
/proc pseudofile. And ktop sundenly thought every process on the system was owned by root. Does this count as binary incompatibility?
First of all make that if it breaks binary compatability with 2.4, because if binary compatibility is not preserved between 2.4 and the latest 2.5 kernel, there will be lot of incompatibilities within the 2.5 series as well. And what happened during the development is not what we want to talk about anyway.
But what do you have in mind, when talking about binary compatibility? AFAIK Linus does consider binary compatibility on the kernel/user mode interface to be important. Deprecated interfaces might be removed at some time, but before that it might even have produced warnings for some time if any programs used it.
When I upgraded from 2.2 to 2.4 binary compatibility was preserved (mostly), I did not replace any libraries or executables because of the kernel change. But I did find a few problems: rpc.rstatd would core dump on the first request, because of a change of the format of a
I don't know how many interfaces changed between 1.2 and 2.0, I never used 1.2 my first kernel was 2.2. But I know that the reasons for Linus to choose 2.0 as version number was primarily the addition of SMP and portability to different platforms. Yet it might have remained binary compatible with 1.2. I suspect the executable format used by 1.2 was aout while today all executables are ELF. But kernels still have optional support for aout binaries. But actually I think even ELF support is optional, so I could build two 2.4 kernels with one supporting aout and the other supporting ELF, they will be same version number but obviously binary incompatible.
Don't expect any Linux kernel in the future to introduce major binary incompatibilities with your previous kernel. There will be changes, but they will be slow, so most executables will be usable across three or more kernel releases. It shouldn't be hard to find an executable working on 2.0, 2.2, 2.4, and the latest 2.5.
Do you care about the security of your wireless mouse?
The point is, it does nothing to address context switches. Switching to another thread still requires a complete context change. Period. While the O(1) schedule allows for more timely (thus lower latency), it does nothing to prevent the actual required context change. Other thread models, given *some types* of workloads, can continue to out scale 1:1 implementations (where the inverse is also true). This is especially true when the threaded workloads exceed available hardware. The result is the same amount of work with fewer context switches. This can result in improved cached hits as well as the obvious gain of work from not having to switch process/thread context.
So your POINT is actually incorrect. The correct statement is that the 2.6 kernel when using a 1:1 model will allow for greater scalability than it's previous implementation but has yet to prove it addresses all workloads better than all other threading models. In fact, it's expected that M:N will still have it's place for some types of workloads even after 2.6 is released.
I personally expect that 1:1 and M:N will both become more common and that there is a time and place for both.
which is the first real advance in scehdulng tech for about 15 years...
IIRC, BSD has had a O(1) scheduler for a very long time now. Many RT OS's have also had them. This is evolutionary for Linux and is not a "real advance in scheduling tech". It is, however, a real advance on an evolutionary basis, for Linux.
Counter Strike could be amazing with high-end graphics, but perhaps there is a reason that DoomIII isn't going to have really advanced multiplayer? Carmack said, some multiplayer, but not that much multiplayer. Could it be that DoomIII is too cumbersome to play over the net? Maybe too cumbersome to play over a LAN? (*ahem*Blood2*ahem*)
I really think that a Counter Strike realism factor could rock, but what about the new Nvidia graphics language everyone keeps talking about? Maybe it won't be hard for us to make leaps toward a fast net CS game with super cool graphics. Maybe the new code would enable the CS team to split off from the Half Life engine and do their own thing?
If you look at the stats on Gamespy, Counterstrike takes the biggest share of online use. Take that and think about Valve and how much money they made on the backs of that project!
*shudder*
To clear up my point, I think that a glossy Doom3 version of CS might be immersive, but it might also lack the networking capability that the current version has. CS today doesn't take up much resources on higher end systems, or even lower ones. Today it's pretty cheap to get CS enabled.
Maybe that plays a factor.