Linux 2.6 Multithreading Advances

Site down or not found by kareejb · 2002-11-08 20:39 · Score: 0, Informative

I couldn't get to the article linked in the story. I found this one, though, which looks like the same thing.

Re:Call me stupid but by Minna+Kirai · 2002-11-08 20:43 · Score: 3, Informative

Sigh, I'll do your web searching for you.

Basically, while Linus was incommunicado sailing across the ocean, someone got jumpy and suggested 3.0 should be the next step.

It might be more likely that it proceeds through 2.10 and higher before going to 3, though. Just to confuse the people who think version numbers are floating-point.

Re:Non-threaded programs by Silh · 2002-11-08 20:55 · Score: 5, Informative

While Quake 1 was developed on NeXT, the target platform at that time would have been DOS, so multithreading would be a bit of a problem...

As for later licensees of the engine, revamping it to use multithreading was probably not a very high priority in making a game.

On the other hand, for someone writing an engine from scratch, it is a different matter.

--
-- Silhouette

Re:Actually, Quake II by Minna+Kirai · 2002-11-08 20:57 · Score: 3, Informative

In terms of software archeology, there is an important intermediate ancestor.

Quake's original networking was meant for LANs only; the fact that it was even barely playable over the internet surprised the authors.

id Software soon released QuakeWorld free to Quake owners. It used the same interface and most of the graphics resources as Quake, so it's arguably not a different program. But it came as a separate executable, with many Quake features removed (like monsters). And most importantly, the networking code was entirely rewritten.

It is that code that Quake II and its successors derived from.

Re:I don't see why the two are mutually exclusive. by jpc · 2002-11-08 21:29 · Score: 2, Informative

If you have hundreds of thousands of connections you should be using aio, which is the new scalable replacement for lots of polls...

Full Text by Anonymous Coward · 2002-11-08 21:32 · Score: 0, Informative

Published on The O'Reilly Network (http://www.oreillynet.com/)
http://www.oreillynet.com/pub/a/onlamp/2002/11/07/linux_threads.html
Linux Multithreading Advances
by Jerry Cooperstein
11/07/2002

Recent advances in Linux's threading implementation are expected to continue to ease migration from other Unix-like operating systems. These advancements have arrived with intense activity on two fronts. First, thread-handling improvements have greatly enhanced the kernel's scalability even to thousands of threads. Second, there are now two fresh, competing implementations of the POSIX pthreads standard (NGPT and NPTL) set to replace the aging LinuxThreads library.

In typical open source fashion, only time will tell exactly who wins out in which arena. However, both new library implementations should be API-compatible with the standard, so the choice will depend on performance and stability. The required changes will appear in the upcoming Linux 2.6 (or 3.0) kernel and can already be tested in late development versions.

Multithreading on Linux

Threading implementations typically have components in both user and kernel space. It is possible to do everything from the one side or the other, but each approach has problems. With everything on the user side, all related threads are part of one single process (which can only run on one CPU at a time), and multi-processor systems are underutilized. With everything on the kernel side, the kernel scheduler must bear a heavy load.

Approaches have ranged from the 1:1 pure kernel thread model in which each user thread has its own kernel thread, to the M:1 model in which the kernel sees only one normal process, with an arbitrary number of threads with which to schedule in user space. The M:N model falls in between, associating M user threads with each of N kernel threads.

The Linux kernel uses the clone() function to create new processes. Flags control parent/child resource sharing, where resources range from everything (memory, signal handlers, file descriptors, etc.) to nothing. The usual fork() sits at the "nothing" end: the child starts out with the parent's resources but shares none of them. Copy-on-write techniques ensure each process gets its own copy as soon as either one tries to modify a shared resource.

Programs can call the clone() function as a system call, using it directly to produce multithreaded programs. However, it is completely Linux-specific and non-portable. Since there is no external standard, there is no guarantee that its interface will be stable. Threading library implementations do in fact use the clone() system call, and it is the job of library maintainers to keep up with kernel changes.
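To make the shared-everything versus shared-nothing distinction concrete, here is a minimal sketch of calling clone() directly through the glibc wrapper. The names run_clone_demo, child_main, shared_counter, and CHILD_STACK_SIZE are invented for this illustration, and the flag set is just one plausible choice, not what any particular thread library passes.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

/* Written by the child, read by the parent. The parent only sees the
 * update because CLONE_VM makes both run in the same address space. */
static volatile int shared_counter = 0;

static int child_main(void *arg)
{
    shared_counter += *(int *)arg;
    return 0;
}

/* Returns shared_counter after the child has run, or -1 on error. */
int run_clone_demo(int increment)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    shared_counter = 0;

    /* The stack grows downward on most architectures, so pass its top.
     * SIGCHLD in the low bits lets the parent use plain waitpid(). */
    pid_t pid = clone(child_main, stack + CHILD_STACK_SIZE,
                      CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD,
                      &increment);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int ok = (waitpid(pid, NULL, 0) == pid);
    free(stack);
    return ok ? shared_counter : -1;
}
```

Calling run_clone_demo(5) should hand back 5 on a Linux system. Dropping CLONE_VM from the flags would give fork()-like behavior, where the child increments its own private copy and the parent still reads 0.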

The LinuxThreads implementation of the POSIX threads standard (pthreads), originally written by Xavier Leroy, has been the dominant one for years and is now incorporated and maintained in glibc. It has two problems on Linux:

* Compliance issues as compared to the POSIX standard
* Performance issues, especially when dealing with many threads; i.e., a lack of scalability

Compliance Issues

Virtually all compliance problems can be traced to the decision to use lightweight processes (LWPs, or the 1:1 model described above) as the basis of the implementation. New processes are created by clone(), with a shared-everything approach. While the new process is lighter due to the sharing, fundamentally it is still a process in its own right, with its own process identifier (pid), process descriptor, etc.

This led to the following standards compatibility problems:

* Signal handling is incorrect.
* An extra management thread is created by the pthreads library.
* ps shows all threads in a process.
* Core dumps don't contain the stack and machine register information for all threads.
* getpid() returns a different result for each thread.
* A thread cannot wait for a thread created by another thread.
* Threads have parent-child, not peer, relationships.
* Threads don't share user and group IDs.

If a pthreads application were written for Linux, one could expect easy portability. However, the inverse process, porting to Linux, was more difficult and slowed Linux deployment since important applications were now broken.

Some problems were resolvable by relatively minor kernel adjustments. For example, one problem could be solved by modifying the basic data structure describing each process (struct task_struct) to store a thread group identifier and some other bookkeeping, and then modifying the getpid() system call to return this identifier rather than the process identifier.
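The thread group identifier idea can be pictured with a toy model. This is only a sketch: struct task_model and the model_* functions are hypothetical names made up here, and the real struct task_struct carries far more state.

```c
#include <sys/types.h>

/* Hypothetical, heavily simplified stand-in for struct task_struct. */
struct task_model {
    pid_t pid;   /* unique per task, so unique per thread */
    pid_t tgid;  /* thread group identifier shared by all threads */
};

/* What a thread-group-aware getpid() would report for a task. */
pid_t model_getpid(const struct task_model *task)
{
    return task->tgid;
}

/* A new thread gets a fresh pid but keeps its creator's tgid. */
struct task_model model_new_thread(const struct task_model *creator,
                                   pid_t fresh_pid)
{
    struct task_model t = { .pid = fresh_pid, .tgid = creator->tgid };
    return t;
}
```

With a leader of { pid 100, tgid 100 } and a thread created from it with pid 101, model_getpid() returns 100 for both, which is the behavior POSIX expects from getpid().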

However, many key kernel developers resisted attempts to modify the kernel for compliance's sake. On one hand, their taste runs to technically superior solutions rather than to "cut the toes to fit the shoes" to comply with standards. On the other hand, there was an aversion to creating many threads. Sentiments like "there is no need to create more threads than there are processors" were common. Thread-prolific languages such as Java were looked at with contempt for many reasons.

Performance Issues

The main performance problem has been scalability with growing numbers of threads. These difficulties are not unique to threads, but apply in all cases where the number of processes grows large.

Consider the process of obtaining a new pid. In the 2.4 kernel, Linux has to loop through all processes to ensure a candidate pid is not already assigned. With an outer loop on possible candidates, the time spent may scale quadratically; if there are thousands of processes, the system can slow down to a crawl. Since each thread has its own pid, creating zillions of threads is poisonous.
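The shape of that cost can be sketched in a few lines. This is an illustrative model, not the actual 2.4 code: the process list is reduced to a plain array of pids, and pid_in_use and scan_for_free_pid are names invented for the example.

```c
#include <stddef.h>

/* Inner loop: walk every task to see whether the candidate is taken. */
static int pid_in_use(const int *task_pids, size_t ntasks, int candidate)
{
    for (size_t i = 0; i < ntasks; i++)
        if (task_pids[i] == candidate)
            return 1;
    return 0;
}

/* Outer loop over candidates, each needing a full pass over the tasks,
 * which is where the quadratic worst case comes from.
 * Returns -1 when no pid in [first, max_pid) is free. */
int scan_for_free_pid(const int *task_pids, size_t ntasks,
                      int first, int max_pid)
{
    for (int candidate = first; candidate < max_pid; candidate++)
        if (!pid_in_use(task_pids, ntasks, candidate))
            return candidate;
    return -1;
}
```

With n tasks occupying a dense run of pids, finding the next free one costs on the order of n candidates times n comparisons.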

New Generation POSIX Threads

A group at IBM and Intel, led by Bill Abt at IBM, released the first version of the New Generation POSIX Threads (NGPT) library in May 2001. This consisted of a drop-in replacement for LinuxThreads, together with patches for kernels beginning with 2.4.0.

To ease acceptance, the group made a conscious effort to keep the impact on the kernel small. They worked to get the kernel modifications they needed through patient, piece-by-piece promotion and expected to have NGPT eventually replace LinuxThreads in the glibc system.

NGPT is a derivative of the GNU Pth (GNU Portable Threads) package, which up to now is based on an M:1 model. A user space priority and event-based, non-preemptive scheduler manages the M user threads. This was seen as an improvement over the 1:1 pure kernel thread model used by LinuxThreads where the kernel has to do a lot of scheduling work.

NGPT adopted the M:N hybrid model. Many developers saw this as the best path to good performance: keep all CPUs humming, minimize context switching between kernel threads, and switch mostly between user space threads. However, the M:N model is complex. It requires two cooperating schedulers, one each in user and kernel space. Signal handling is difficult and much work has to be done in user space. It takes fancy footwork to prevent one blocked thread from blocking other threads running in the same process.

Nonetheless, the NGPT team succeeded in implementing the full pthreads standard, and the kernel changes they needed were accepted in the mainline kernel early in the 2.5 development process (at kernel 2.5.4). They were also back-ported into the 2.4.19 kernel. Depending on the metric used, performance gains were claimed of up to 100 percent, and work continues on improvements.

On March 26-27, 2002, Compaq hosted a meeting to discuss the future replacement for the LinuxThreads library. In attendance were members of the NGPT team, some employees of (then distinct) Compaq and Hewlett-Packard, and representatives of the glibc team, including the head maintainer, Ulrich Drepper (a Red Hat employee), who wrote a summary of the meeting.

Pursuing the M:N approach, the report said:
"This is one of the reasons why it is absolutely necessary to think about two-level scheduling for the threads. I.e., the actual user threads are different from the kernel threads (or light-weighted process, or what ever one wants to call them) and scheduled separately. This is generally called the M-on-N model for a thread implementation. The 1-on-1 model dedicates a unique kernel thread for each user-level thread; this is the model used by the current, inadequate thread library implementation."

The report contains detailed analysis of how to get kernel and user-space schedulers to cooperate using the scheduler activations method.

It seemed the replacement for LinuxThreads would be based on NGPT.

Native POSIX Thread Library

On September 19, 2002, Ulrich Drepper and Ingo Molnar (also of Red Hat) released an alternative to NGPT called the Native POSIX Thread Library (NPTL). The project included a new user space library, changes to glibc, and kernel modifications. The initial announcement said in part:
"Unless major flaws in the design are found this code is intended to become the standard POSIX thread library on Linux system, and it will be included in the GNU C library distribution."

NPTL is based squarely on the 1:1 pure kernel thread model. A white paper explains why in detail.

Recent changes to kernel thread handling (mostly due to Ingo Molnar) had vastly improved thread performance. With these changes in place, the relative simplicity of the 1:1 model became very attractive.

There is only one scheduler. Signal handling remains in the kernel's hands. Blocking problems are handled naturally because each kernel thread schedules independently. In addition, the user space implementation becomes fundamentally simpler.

In some sense, one has come full circle; developers who wanted to ensure full POSIX compliance were frustrated by the kernel maintainers' unwillingness to adapt the Linux kernel to fit their needs, and thus NGPT was developed in part as a polite end run requiring minimal kernel changes. Now a programming tour de force, mostly by one key kernel programmer, is claimed to enable reversion to a much simpler approach.

Linux Kernel Improvements

What changes have been made in the Linux kernel to make threads perform and scale better?

Consider the previous example of obtaining a new pid. The potentially quadratic search is gone. Instead, the kernel sets aside a small but dynamic number of memory pages as bitmaps for process identifiers. Obtaining a new pid means finding a page with free entries and then finding and clearing the first set bit. No locking is required, and the search time is very short and almost independent of the number of running processes.
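A bitmap allocator of this kind might look roughly like the following. It is a single-threaded sketch under stated assumptions: one fixed-size map instead of dynamically added pages, plain bit operations where a kernel would need atomic ones to go lock-free, and the bit polarity taken from the description above (a set bit means "free"), which may not match the real kernel. MODEL_MAX_PID, struct pid_bitmap, and the pid_bitmap_* functions are hypothetical names.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_PID 4096   /* hypothetical, kept small for illustration */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NWORDS        (MODEL_MAX_PID / BITS_PER_WORD)

struct pid_bitmap {
    unsigned long words[NWORDS];   /* bit set => that pid is free */
};

void pid_bitmap_init(struct pid_bitmap *map)
{
    memset(map->words, 0xff, sizeof map->words);   /* everything free */
}

/* Finds the first set bit, clears it, and returns its index as the pid.
 * Returns -1 when the map is exhausted. */
int pid_bitmap_alloc(struct pid_bitmap *map)
{
    for (size_t w = 0; w < NWORDS; w++) {
        if (map->words[w] == 0)
            continue;              /* whole word in use: skip many pids at once */
        for (size_t b = 0; b < BITS_PER_WORD; b++) {
            unsigned long mask = 1UL << b;
            if (map->words[w] & mask) {
                map->words[w] &= ~mask;
                return (int)(w * BITS_PER_WORD + b);
            }
        }
    }
    return -1;
}

/* Marks a pid as free again; pid must be in [0, MODEL_MAX_PID). */
void pid_bitmap_free(struct pid_bitmap *map, int pid)
{
    size_t p = (size_t)pid;
    map->words[p / BITS_PER_WORD] |= 1UL << (p % BITS_PER_WORD);
}
```

The point of the design is that the cost depends on the size of the bitmap, not on how many processes exist, and a fully used word is skipped with one comparison.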

Another key improvement is the O(1) scheduler, which no longer has to cycle through all processes to find the most deserving one. Each CPU has its own queue, a simple priority-sorted bitmask. Once again, finding a new process is very fast and scales fantastically.
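The bitmask trick can be sketched as follows. This is a toy model and not the kernel's run queue: it assumes 64 priority levels so the mask fits in one word, it keeps a per-level count where a real scheduler keeps a list of tasks, and it relies on the GCC/Clang builtin __builtin_ctzll to find the lowest set bit. struct runqueue_model, rq_enqueue, and rq_pick_next are hypothetical names.

```c
#include <stdint.h>

#define NUM_PRIOS 64   /* hypothetical: lets the bitmask fit in one word */

struct runqueue_model {
    uint64_t bitmap;                   /* bit p set => level p has runnable tasks */
    unsigned int nr_tasks[NUM_PRIOS];  /* count per level, standing in for a list */
};

/* Adds one runnable task at the given level; prio must be < NUM_PRIOS. */
void rq_enqueue(struct runqueue_model *rq, unsigned int prio)
{
    rq->nr_tasks[prio]++;
    rq->bitmap |= UINT64_C(1) << prio;
}

/* Picks a task from the most urgent non-empty level (lower number means
 * more urgent here) and returns that level, or -1 if nothing is runnable.
 * The cost is one find-first-set, however many tasks are queued. */
int rq_pick_next(struct runqueue_model *rq)
{
    if (rq->bitmap == 0)
        return -1;

    int prio = __builtin_ctzll(rq->bitmap);   /* index of lowest set bit */
    if (--rq->nr_tasks[prio] == 0)
        rq->bitmap &= ~(UINT64_C(1) << prio);
    return prio;
}
```

Enqueueing at levels 30, 5, and 5 and then picking four times yields 5, 5, 30, and finally -1, without ever scanning the tasks themselves.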
Where Do We Go From Here?

The NPTL team posted some benchmarks, such as this display of the minimum time needed to create a number of top-level threads.

In general, while NGPT beat the old methods by a factor of two, NPTL could do better by another factor of two.

It remains to be seen exactly how the two implementations will rank against each other. NGPT may not yet be tuned to take advantage of recent kernel improvements the way NPTL has. Furthermore, benchmarks are often used to misrepresent. It will take further development by both teams, independent benchmarks, and real life comparisons to see who really beats whom.

You can test drive NGPT by simply downloading the library and installing it, as long as you have kernel 2.4.19 or later. For NPTL, you can download the library, but you will need a very recent development kernel as well as bleeding edge glibc and gcc. The announcement contains detailed instructions.

While there may be some hard feelings on the socio-political side about how NPTL seemed to come out of the blue, the maintainers of NGPT have not griped in public. It seems that any battle between the two implementations will now be played out in public, in good open source fashion. Either one library will win out over the other, or each will become the preferred tool in some universe for some load. At any rate it will be fun to see what happens. Linux will benefit by having a standards-compliant and well-performing threads implementation(s).

Jerry Cooperstein is a senior consultant and Linux training specialist at Axian Inc., in Beaverton, Oregon, and lives in Corvallis, Oregon.

oreillynet.com Copyright © 2000 O'Reilly & Associates, Inc.

Yes and No by krmt · 2002-11-08 21:33 · Score: 5, Informative

I don't understand this all that well myself, but I did just read the whitepaper linked to in the article written by Ingo Molnar and Ulrich Drepper. From the looks of things, NGPT's M:N model will cause a lot more problems because of the difficulty of getting the two schedulers (userspace and kernelspace) to dance well together.

By sticking with the 1:1 solution that's currently used in the kernel and the NPTL model, there's really only the kernel scheduler to worry about, making things run a lot more smoothly generally. I'd imagine latency being a big issue with M:N (I'm pretty sure that it was mentioned in the whitepaper). I haven't read the other side of the issue, but I think that pretty graph in the O'Reilly article says it all performance-wise.

There are other issues though, like getting full POSIX compliance with signal handling. The 1:1 model apparently makes signal handling much more difficult (I don't know anything about the POSIX signaling model, but there's a paper about it on Drepper's homepage that could probably shed some light on the subject if you were so inclined). There are other issues in the current thread model that have to be dealt with in a new 1:1 model (and are), such as a messy /proc directory when a process has tons of threads.

From the whitepaper, it seems that the development of the O(1) scheduler was meant to facilitate the new thread model they've developed, which I hadn't thought about before even though it makes sense. There are still some issues to work through, but both models look promising. If the signal handling issues can be resolved, it looks from the article like NPTL's model will win on sheer performance.

As for making them both come with the kernel, that's really really difficult, since this stuff touches on some major pieces of the kernel like signal handling. The same way you're only going to get one scheduler and VM subsystem, you're only going to get one threading model. You're able to patch your own tree to your heart's content, but as per a default install, there can be only one.

--

"I may not have morals, but I have standards."

Re:Non-threaded programs by DarkHelmet · 2002-11-08 21:42 · Score: 3, Informative

Yeah yeah yeah... When life isn't perfect, blame Abrash...

Troll! ;)

---
(And yes, Mike Abrash did WinQuake, not Carmack)

--
/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i

LWN by KidSock · 2002-11-08 21:46 · Score: 5, Informative

has a nice article about the state of threading on Linux. See the Sept. 27th Weekly Edition.

Re:Someone should start a site.... by taviso · 2002-11-08 23:07 · Score: 4, Informative

Someone should start a site that covers long term issues, rather than the week by week stuff I've found on the web... or maybe someone has, and I'm just too out of the loop....

KernelTrap.

--
ex$$

Mode switching times. by zak · 2002-11-09 00:04 · Score: 2, Informative

Switching between user and kernel mode takes time. If all your primitive operations are implemented in user mode, synchronisation (for instance) takes several cycles in the best case (resource is free, lock it), and a bit over a hundred in the worst (resource is busy, context switch). When you also add the user/kernel mode transition (which may be a couple dozen cycles on some RISCs but takes more than a hundred on some x86 architectures), you can see how performance may degrade.

Re:Oh crap, I wish I didn't have to say this... by khuber · 2002-11-09 00:13 · Score: 2, Informative

IBM's System 360 had multithreading in 1964.

Multics had multithreading in the early 1970s.

Windows was still launched from DOS in 1992.

Please go back to your "innovating" with Windows.

-Kevin

Re:I don't see why the two are mutually exclusive. by rweir · 2002-11-09 00:21 · Score: 5, Informative

Now, for ANY 1:1 threading system, I can't just create x * 10^5 threads because the overhead would be colossal.

Actually, it's kind of famous for that.

Re:Someone should start a site.... by Salsaman · 2002-11-09 00:40 · Score: 4, Informative

Try lwn.net. They have a weekly overview of the kernel status. Since they moved to a subscription model, you have to pay to see the latest news, but previous weeks can be viewed for free.

Re:So are they both useful? by barneyfoo · 2002-11-09 01:32 · Score: 2, Informative

With the new O(1) scheduler and other improvements made by Ingo Molnar, scheduling kernel threads is no longer a major bottleneck. Besides, NGPT goes against the Linux philosophy of minimally invasive changes to the kernel API, and it's doubtful Linus will accept it into the kernel.

Re:How about scheduling & thread-specific stor by Juergen+Kreileder · 2002-11-09 01:35 · Score: 3, Informative

In the current version, priorities only work with SCHED_RR and SCHED_FIFO (both require superuser privileges); SCHED_OTHER (the default policy) doesn't support changing priorities.
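For anyone who wants to try it, requesting one of those policies goes through the thread attributes. A minimal sketch using the standard pthread_attr_* calls; make_realtime_attr is a helper name made up for this example.

```c
#include <pthread.h>
#include <sched.h>

/* Fills *attr with an explicit scheduling policy and priority.
 * Returns 0 on success or the error code of the call that failed. */
int make_realtime_attr(pthread_attr_t *attr, int policy, int priority)
{
    int err = pthread_attr_init(attr);
    if (err != 0)
        return err;

    /* Without this, the new thread inherits the creator's scheduling
     * settings and the policy and priority set below are ignored. */
    err = pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
    if (err == 0)
        err = pthread_attr_setschedpolicy(attr, policy);
    if (err == 0) {
        struct sched_param param = { .sched_priority = priority };
        err = pthread_attr_setschedparam(attr, &param);
    }
    if (err != 0)
        pthread_attr_destroy(attr);
    return err;
}
```

Building the attribute object should work for any user; it is the later pthread_create() with such an attribute that typically fails with EPERM unless the process has the needed privileges. Valid priorities for a policy can be queried with sched_get_priority_min() and sched_get_priority_max().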

Regarding thread specific data access: If your LinuxThreads library uses floating stacks (for ix86 this means it has been built with --enable-kernel=2.4 and for i686) it already will be faster.

For other TLS enhancements take a look at http://people.redhat.com/drepper/tls.pdf.

Re:I don't see why the two are mutually exclusive. by Quixote · 2002-11-09 01:46 · Score: 3, Informative

Now, for ANY 1:1 threading system, I can't just create x * 10^5 threads because the overhead would be colossal

If you read the article, it shows benchmarks done by the NPTL folks which show a 2x improvement in thread start/stop timings over NGPT (which itself is a 2x improvement over POLT (plain old Linux threads)).

Read more about NPTL here (PDF file).

Re:Kernel vs user doesn't make sense by inquis · 2002-11-09 02:28 · Score: 3, Informative

Two words: context switches.

Whenever execution switches between user mode and kernel mode, a context switch is required. Context switches are expensive.

Incidentally, this is one of the advantages of the microkernel approach: by severely limiting the code that must be run in kernel space, you can minimize context switches between kernel and user mode and save a lot of time.

Re:Non-threaded programs by Salamander · 2002-11-09 03:31 · Score: 5, Informative

applications where deadlocks or race conditions would be an integral problem in a multithreaded implementation whilst a single thread has none of these problems.

That's a common myth. In fact, there are some kinds of deadlock that do go away, but there are also some kinds that merely change their shape. For example, the need to lock a data structure to guarantee consistent updates goes away, and so do deadlocks related to locking multiple data structures. OTOH, resource-contention deadlocks don't go away. You might still have two "tasks" contending for resources A and B, except that in the non-threaded model the tasks might be chained event handlers for some sort of state machine instead of threads. If task1 tries to get A then B, and task2 tries to get B then A, then task1's "B_READY" and task2's "A_READY" events will never fire and you're still deadlocked. Sure, you can solve it by requiring that resources be taken in order, but you can do that with threads too; the problem's solvable, but isn't solved by some kind of single-threading magic.
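For the threaded side, "take resources in order" can be as small as this sketch, which orders two mutexes by address so every caller locks them in the same sequence. lock_pair and unlock_pair are names invented for the illustration, and address order is only one possible global ordering.

```c
#include <pthread.h>
#include <stdint.h>

/* Locks two distinct mutexes in a globally consistent order (lowest
 * address first), so callers passing (A, B) and (B, A) cannot end up
 * each holding one and waiting on the other. Returns 0 on success. */
int lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_t *first = a, *second = b;
    if ((uintptr_t)a > (uintptr_t)b) {
        first = b;
        second = a;
    }

    int err = pthread_mutex_lock(first);
    if (err != 0)
        return err;

    err = pthread_mutex_lock(second);
    if (err != 0)
        pthread_mutex_unlock(first);   /* do not keep half of the pair */
    return err;
}

/* Unlock order does not matter for deadlock avoidance. */
void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}
```

The event-driven equivalent would be to make every state machine request A before B, which is the same discipline without the mutexes.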

I've written several articles on this topic for my website in the past. In case anyone's interested...

--
Slashdot - News for Herds. Stuff that Splatters.

Re:Kernel vs user doesn't make sense by mesocyclone · 2002-11-09 05:16 · Score: 3, Informative

Typically, the microkernel approach INCREASES the number of context switches. However, a microkernel also normally has very fast context switches.

The context switches are increased because a single operation (say, an I/O read) requires switching into the kernel from the user process, and then out into a device driver. A non-microkernel would have the device driver in the kernel. This is just an example - it may be that the switch is to the file system manager instead, or some other helper process. The point is that the nature of a microkernel is to have lots of helper processes that perform what are normally macro-kernel functions.

Context switches typically are expensive because they involve more than just a switch into kernel mode. They are likely to involve some effort to see if there is other work to do (such as preempting this thread). They may involve some privilege checks, and some statistics gathering.

A microkernel just does less of this stuff.

BTW... the first elegant running micro-kernel I ran into was the original Tandem operating system. The kernel was primarily a messaging system and scheduler (I think scheduling *policy* may have been handled by a task, btw). I/O, file system activity, etc was handled by privileged tasks. It was very elegant, and conveniently fit into their "Non-Stop (TM)" operation.

--

The only good weather is bad weather.

Re:I thought it was 3.0? by WNight · 2002-11-09 05:26 · Score: 3, Informative

There aren't really any incompatibilities with older code, so you don't need to go to a new kernel version like you would if you broke anything.

In one of the discussions with Linus on this issue he said there was a planned change that broke something but it wouldn't be in for this version. Because that would warrant a major version change of its own, he didn't want to go from 2.5 to 3.0 then from 3.3 or something to 4.0, he'd rather go from 2.9(or so) to 3.0, and avoid the version inflation.

I agree. There's no stigma in having a product numbered 1.x or 2.x, it simply means you got it right early on, without needing to break old applications too often.

Re:Non-threaded programs by Salamander · 2002-11-09 16:20 · Score: 3, Informative

Every context switch burns hundreds, if not tens of thousands, of clock cycles.

A well-designed multi-threaded implementation will organize its thread usage in such a way that under light load and/or on a single processor it will not have significantly more context switches than a single-threaded equivalent. Under such conditions it will exhibit the same performance characteristics as that single-threaded version, and yet it will also be able to take advantage of inherent parallelism and multiple processors when they exist. Been there.

Bad multithreaded implementations schedule so many computationally active threads that TSE switches are inevitable. Bad multithreaded implementations force two context switches per request as work is handed off between "listener" and "worker" threads. Bad multithreaded implementations do lots of stupid things, but not all multithreaded implementations are bad. The main overhead involved in running a well-designed multithreaded program on a uniprocessor is not context switches but locking, and that will be buried in the noise. Done that.

A handful of extra context switches per second and a fraction of a percent of extra locking overhead are a small price to pay for multiprocessor scalability.

It's trivial to run multiple copies of a single-threaded program on the different CPUs, and let them interact over IPC.

Trivial, but stupid. You really will context-switch yourself to death that way, as every occasion where you need to coordinate between processes generates at least one IPC and every IPC generates at least one context switch (usually two or more)...and those are complete process/address-space switches, not relatively lightweight thread switches. That's how to build a really slow application.

this approach scales trivially to large numbers of networked processors.

No, it doesn't. There's simply no comparison between the speed of using the same memory - often the same cache on the same processor, if you do things right - and shipping stuff across the network...any network, and I was working on 1.6Gb/s full-duplex 2us app-to-app round-trip interconnects five years ago. Writing software to run efficiently on even the best-provisioned loosely-coupled system is even more difficult than writing a good multithreaded program. That's why people only bother for the most regularly decomposable problems with very high compute-to-communicate ratios.

catastrophic failure of one process does not necessarily corrupt the state of another process. (While one thread crashing is almost certain to bring down an entire multi-threaded program.)

Using separate processes instead of threads on a single machine might allow your other processes to stay alive if one dies, but your application will almost certainly be just as dead. The causal dependencies don't go away just because you're using processes instead of threads. In many ways having the entire application go down is better, because at least then it can be restarted. When I used to work in high availability, a hung node was considered much worse than a crash, and the same applies to indefinite waits when part of a complex application craps out.

--
Slashdot - News for Herds. Stuff that Splatters.
