Should Servers be Mono-Process or Multithreaded?

info by illuminatedwax · 2006-07-12 12:54 · Score: 5, Informative

Check out the C10K page for a very detailed discussion about this.

--
Did you ever notice that *nix doesn't even cover Linux?

combination by TheSHAD0W · 2006-07-12 13:02 · Score: 2, Informative

If you're going to be serving more than a few connections at a time, it's easy for threads to eat monstrous amounts of resources. It's better if you can handle network connections via a single thread. On the other hand, moving other tasks to separate threads can help. For instance, you will probably want to run the UI with a separate thread (yes, even if it's text-only), and it's useful to be able to split file operations among several threads and let the OS optimize disk access.

Re:combination by agent+dero · 2006-07-12 13:17 · Score: 3, Informative

On what operating system? With FreeBSD'sSMPng project, they've made most of the network stack (from my understanding) SMP safe, and the kernel now supports pushing multiple threads across multiple CPUs (like Solaris, Xnu, and Linux)

it's easy for threads to eat monstrous amounts of resource

Don't you mean forked processes? The advantage of threads is that they're lightweight and use shared memory (dead locks hoorah!), forked processes are heavy because they need their own memory, etc.

--
Error 407 - No creative sig found
Re:combination by ILikeRed · 2006-07-12 14:40 · Score: 2, Informative

disclaimer: I have no experience doing this in Linux
That's because it's only needed (as a performance hack) with Windows. Forking is cheap with *nix.
"Those who don't understand UNIX are doomed to reinvent it, poorly."
--Henry Spencer

Get ESR's "The Art of UNIX Programming" - or if cheap read it online. Chapter 7 might be a good place for you to start.

--
I have come to a conclusion that one useless man is a shame, two is a law firm, and three or more is a congress -J Adams

fastest way by r00t · 2006-07-12 13:25 · Score: 4, Informative

Well first, you probably should keep things simple and just buy nice hardware. Most servers sit idle most of the time anyway. If you truly do need the perfornamce though...

Have 1 process per node. I mean "node" in a NUMA sense. 64-bit AMD systems have one node per chip package. Other PCs (except exotic stuff) have one node for the whole system. Lock your processes to separate nodes, so that they all get local memory. If you don't do this, at least remember to use the new system call for moving pages from one node to the other. (eh, "move_pages" if I remember right -- see unistd.h in the kernel source)

You'll need extra threads to do disk IO. Not counting those: On each node, have at least 1 thread per bottom-level (usually L2 or L3) cache, but not more than 1 thread per virtual core (hyperthreading thing). If you go with 1 thread per physical core but have virtual cores (hyperthreading) enabled, lock your threads to virtual cores that don't share phycical cores.

A lot of this should be configurable. Hopefully you'll make an easy way to automatically determine the best configuration, writing out the appropriate config file so that manual config hacking is not required to get the best performance.

Multiple processes by Alex+Belits · 2006-07-12 13:38 · Score: 3, Informative

You almost never have less performance-critical processes (say, web server + database server processes) than CPUs, so for most of applications, in situation where you don't have a shared context that you need a read-write access to from all requests, use multiple processes. If you do have such a shared context, you need to consider the overhead from synchronization between threads vs. no simultaneous access. Also take into account that socket i/o is not synchronous -- a process may have sent a buffer to the kernel and switched to processing another request, but kernel is still sending the data from that buffer. The same happens with receiving data -- it may have arrived already, but the kernel is filling the input buffer while the server process is not looking there. If the data never arrives faster than it can be processed, you gain no benefit from trying to process it in parallel -- your threads will be sleeping more, and performance will remain the same.

On systems with crippled schedulers and VM, threads are very efficient compared to everything else because your application becomes its own scheduler, and it reduces the total physical footprint and amount of cache invalidations. With better scheduler and VM, it makes more sense to rely on the OS insulating processes and scheduling their i/o, so the solution with multiple processes becomes more efficient in the absence application's need to share some data between multiple requests.

--
Contrary to the popular belief, there indeed is no God.

Re:Multiple processes by Alex+Belits · 2006-07-12 21:34 · Score: 2, Informative

I don't understand how processes could possibly be faster than threads doing the same task. If processes are better because there is no interaction between them, then threads doing the same task would also have no interaction between them, while incurring faster context switches.

Because there IS interaction between threads -- mutexes handling takes resources, too, and checking i/o status also requires time and syscalls (OS just supplies pending i/o information to its scheduler by itself). And because data access mechanism within your application is likely to be a much worse scheduler than scheduler in the OS.
Of course it seems silly in this day and age to have a single thread/process handling a single connection in anything resembling a high-performance server.

Actually, it's not (unless you are in Windows). For quite a while it was the only way network i/o was possible in Java at all. And while it was a bad decision for Java, it was based on a valid idea that sleeping threads or processs eat less resources than what is necessary for multiplexed handling of completely unrelated connections within one thread.
In that case, each thread/process should be able to handle multiple connections at once. It should execute a request, grab the next request off the queue or wait for one, ad infinitum. With all threads in a single process, it is easy to balance the load queues between threads; with separate processes it is almost impossible to move a request from a busy process to a waiting one.

You don't need to do that -- i/o scheduler in kernel does that for you already, and when that is not good enough, you can have a separate process that dispatches request by whatever set of rules. As opposed to Windows, Unixlike systems allow socket passing between processes.
Also, it's hard to understand why reducing total physical footprint and amount of cache invalidations is only good on systems with bad VM systems or "crippled" schedulers (whatever that means).

Because with good VM they are smaller in the first place. In any case, when you deal with scalale systems that have to handle huge amounts of requests simultaneously, your cache will be severely beaten just because of the large amount of data involved.

--
Contrary to the popular belief, there indeed is no God.

Consider Fault Tolerance & Thread Safety Too by thebiss · 2006-07-12 14:11 · Score: 5, Informative

Since you didn't say what kind of server you're building, I'm going to assume:
- that you're building a custom-purpose, client-server or message processing application,
- it needs to be highly parallel to be efficient
- the language is C, C#, or C++, and not Java (process-based servers in Java?)

I have done this before, using both processes and threads, for the same application. Consider the impact of application faults on your design, and then consider how hard it will be to create thread-safe code.

o A highly multithreaded server, where threads are performing complex and/or memory demanding tasks, will be susceptable to complete failure of all running jobs on all threads, if just a single thread SEGFAULTs. And despite your best testing efforts, complex code (1M+ lines) will at some point, somewhere, fail.

o Threaded code must be thread safe. Static variables, shared data structures, and factories all previously accessed through a singleton must now be protected with guard functions and semaphores. Race conditions need to be considered. Design for this up front. It will be much harder to add it later.

The Project

I worked on a team which added an asychronous processing engine to a web application. The engine was responsible for performing memory and time-intensive financial analysis and reporting for 16,000 accountants, so that they could close a large company's financial books. Unlike a webserver triggered by on-line end users, this engine is triggered by events in the company's financial database: once the database raises the "ready" flag, this engine begins running as many reports as it can, as fast as it can, on behalf of the 16K users. The analysis and report code was 2 million lines of C++, running on AIX.

Process implementation

The initial implementation used processes. A dispatcher job monitored the database for the ready flag, and then forked children of itself to analyze slices of the data, and generate the reports. One child job was used for each analysis and report pair, and the manager controlled how many jobs ran in parallel, maintaining a scoreboard of which jobs succeeded, and which failed.

Due to the complexity of the system, failures (core) occasionally occurred. The monitor would record this, retry the failed analysis up to 3 times, and keep a uniquely named core file of the event. Other analysis reports would continue to be generated, otherwise unharmed by the thrashing thread. Approximately once every 90 days, the development team would collect the few cores generated, use the gbd/xldb debugger to determine the cause of failure, and correct the fault.

The downsides of this? The solution was slowed because couldn't re-use resources like database connections (they were destroyed with each process), and more memory was used than need be. DB2 caching helped somewhat, but potential performance improvements remained.

Threaded implementation

In a large company, there are IT standards, and one of the standards at my company is that applications shall never, ever, ever fork(), even if running on a large dedicated machine. After losing the fight against this, my team re-architected the report engine. Largely
the same as the previous, the new engine waits for the "ready" signal, and then spawns pthreads (POSIX threads) as workers to analyze the data and generate the report. In theory, it was robust.

The alpha version of this solution immediately failed (cored) during testing. We neglected to identify the less obvious non-thread-safe code in the application, and failed to identify several race conditions. Unlike previous failures, this faults were total: a SEGFAULT in code on one of 20 threads would halt the entire application. And the corefile generated was now huge - it contained a snapshot of memory for all 20 running jobs, instead of just the one of interest.

Extensive root-cause analysis, design, and restart management solved this, and the current version is as robust, and a good bit faster, than the previous. At a significant price.

--
Beware: I believe all are created equal, and have the right to life, liberty, and the pursuit of happiness.

outdated info by Anonymous Coward · 2006-07-12 14:51 · Score: 2, Informative

The last time this page was updated was November 2003, since there there have been two major revisions to Java and at least one major revision to the linux kernel (as we;; as changes to FreeBSD, OpenBSD and Windows)... and in both all cases these revisions introduced changes to address scalability/concurrency. This page is incredibly out of date.

Also... they opening statement and its bias toward Unix "for obvious reasons" doesn't lend towards it's credibility. I've tuned out high volume systems on both platforms and I could care less wether the systems are running on unix or windows. Anyone who is claims that one has inherent performance advantages over the other is just showing their inexperience or bias. Nine times out of ten, any serious performance problems are with the application design and implementation.

Re:outdated info by thesandbender · 2006-07-12 16:59 · Score: 2, Informative

I'm the original anonymous coward... wish it hadn't posted as such.

Windows support for IPC is actually very robust, shared memory and semaphores is about as fast as you can get and it exists on both Windows and *nix just under different names.

A lot of people either don't know or forget that the NT->XP->2k3 kernel owes a lot to the Vax/VMS group from Digital who were snatched away to build the NT kernel. There seems to be this illusion that XP is the bastard step child of Windows 1.x and that is not the case, a lot of thought was put into the system from a multi-process/user stand point. The current problem is that thought didn't extend that far into today's current security evironment (although Dave Cutler and crew did try... I remeber the fight to move the video card drivers out of ring zero... Cutler argued that they were ancillary to the system, which is true from a server standpoint, but he lost).

Real life examples by sdfad1 · 2006-07-12 15:38 · Score: 5, Informative

I cannot speculate, but I can look at what people are doing today. One thing that I have noticed, is the widespread research into, with compelling arguments, for massively multithreaded programming techniques. See Erlang for example. It is designed right from the beginning for this sort of problem - high throughput, high reliability, high uptime telephony networks.

As a rough benchmark, someone's got this.

That's an order of magnitude increase in "performance" (depends on what you mean by performance". I thought I'll do a casual informal test of my own, with a decent static file size (instead of the 1 byte used in that benchmark)

Server Software: Yaws/1.56
Document Length: 402 bytes

Concurrency Level: 500
Time taken for tests: 8.480740 seconds
Complete requests: 5000
Requests per second: 589.57 [#/sec] (mean)
Time per request: 848.074 [ms] (mean)

Server Software: Apache/2.0.54
Document Length: 402 bytes

Concurrency Level: 500
Time taken for tests: 29.787216 seconds
Complete requests: 5000
Requests per second: 167.86 [#/sec] (mean)
Time per request: 2978.722 [ms] (mean)

Output edited to get past lameness filter.

Err crap, I could have sworn the first time I tried this, when Yaws was first installed, its performance was worse! Oh well, perhaps it's something I've inadvertently done since then. Could have been due to my computer reboot (this is a desktop PC). It seems I've proven my point, although I was trying to disprove it. Standard caveats regarding benchmarks apply. Both servers are default Ubuntu installs with no configuration changes - I didn't compile anything manually.

Additionally it has also been noted that:

> Linus Torvalds: 100k threads at once is crazy Using Posix style threads, I'd have to agree. Posix threads were just not designed with this level of usage in mind. Which is why concurrent lanugages like Erlang and Mozart/Oz don't use Posix threads.

Well, that's where it could be headed anyway - a multiprocessor system with green threads (ie simulated threads, like Java ones) implementing massive concurency and redundancy. Some prototypes for systems like this are already available, and being used. Cheers.

non-blocking by pizza_milkshake · 2006-07-12 16:24 · Score: 4, Informative

higher throughput can be achieved with one process or thread (whichever floats your boat) per CPU, using epoll() (linux 2.6 only, use poll() for more portability) with non-blocking I/O.

however, it's easier conceptually to write a threaded server, it's more natural to write, and you just launch a single thread per connection. unfortunately, currently, this doesn't scale (see Why Events Are A Bad Idea (for High-concurrency Servers) http://www.usenix.org/events/hotos03/tech/vonbehre n.html for an argument that thread implementations, and not their design, are the issue).

the former method can handle thousands of simultaneous connections with high throughput, even on a decent workstation; the latter cannot. threads simply have an inherent overhead that cannot be eliminated.

i've actually been working on writing a non-portable insanely fast httpd in my spare time (svn co svn://parseerror.dyndns.org/web/) over the past few weeks as a way to explore non-blocking I/O + epoll() and it performs very well (~600% faster conns/sec than a traditional fork()ing server (which i wrote first)).

for further discussion see The C10K Problem http://www.kegel.com/c10k.html which goes in-depth on these very subjects

Re:Threading --- hype, more hype and extra hyped h by Charan · 2006-07-12 17:25 · Score: 3, Informative

There is a genuine difference between multithreading and forking. The kernel does take longer to switch between processes than between threads since there's an address space change between processes. 10,000 threads in one process will use fewer per-process resources than would 10,000 processes of one thread each. I want to say that process accounting (on creation/destruction) takes more time than thread accounting, but I'm not intimately familiar with their implementation on Linux. For some applications, sharing a heap among all threads might make passing data a bit simpler than using IPC or shared memory.

As for utilizing the CPU, threads and processes should be close in performance. I would still expect threads to be slightly faster, since the (x86) processor's TLB is flushed on a context change that wouldn't happen if you switch between threads. For a server where any of this really matters, there will be thousands of worker threads/processes compared to a small number of system threads, so the probability of switching between two worker threads will be high.

I'm sure I'm leaving out some other important differences, but I can't think of them at the moment.

User-level thread libraries can let a process run even faster than with kernel threads or processes (less kernel involvement = faster), but in order to get good performance, asynchronous IO is a necessity.

It depends on software and hardware setup. by master_p · 2006-07-12 20:46 · Score: 2, Informative

If your software is destined to serve few clients and your hardware has a single core CPU, then mono-process is better: easier to debug, easier to change etc. You may use threads for long computations.

If your software is destined to serve few clients and your hardware has more than one CPU core, and then your services are not I/O bound, then performance would increase if your O/S can dispatch threads to different cores. If your services are I/O bound, then increased performance from threading depends on O/S and hardware architecture (in other words, how fast I/O can be multiplexed).

If your software is destined to serve thousands of clients, you need clustering: a few thousand machines that can process requests plus a few others to dispatch requests to the cluster. I actually have no idea how this is done though, so take this advice lightly. In this case, your software is going to be multiprocess/multithreaded anyway.

Re:Consider Fault Tolerance & Thread Safety To by Anonymous Coward · 2006-07-12 23:49 · Score: 1, Informative

A highly multithreaded server, where threads are performing complex and/or memory demanding tasks, will be susceptable to complete failure of all running jobs on all threads, if just a single thread SEGFAULTs.

In Windoze and OS/2 (am I showing my age here :) it is possible to trap these type of exceptions on a per thread basis. You can then create a "manager" that does effectively what you had in the multi-process scenario. The exception handling code does whatever cleanup it can, and then triggers some action that will cause a thread to be spun up if an existing one bites the dust. Works well, though the biggest drawback is that the OS does a lot of cleanup work for you if you're in a process that doesn't occur when you're in a thread. Therefore resources that the thread may had locked/opened won't be cleaned up, you have to track it all yourself. In both OS's, if these were kernel type resources then you'd be screwed, but this was not very common.

Threaded code must be thread safe. Static variables, shared data structures, and factories all previously accessed through a singleton must now be protected with guard functions and semaphores. Race conditions need to be considered. Design for this up front. It will be much harder to add it later.

Agree about the designing up front part. Successful multithreaded coding is all about planning and forethought. You have to understand that debugging after the fact is a complete nightmare scenario given a highly multithreaded app (or even a moderately multithreaded app for that matter). One comment about your above comment though, if you considering going from a multiproccess to multithreaded app, you should be careful of going too overboard and taking advantage of those static variables/shared data/factories. In a MP app, you pay the performance hit so you try to minimize any sharing, you should keep that approach in a MT app as well and not fall into the trap of thinking that the access is now somehow "free".

Overall an interesting post, thanks.

Re:This hits home. What I did. What should I do? by TheRaven64 · 2006-07-13 00:05 · Score: 3, Informative

First, you do not want to be using semaphores for inter-thread synchronisation. Semaphores are IPC primitives, not ITC. The POSIX threading API provides mutexes and condition variables for this kind of thing. If you are doing message passing, then you have a classic producer-consumer situation which is exactly what condition variables were created for.

Each condition variable has a mutex associated with it. The first thing you do, is lock the mutex. If someone else has the mutex locked, then this blocks. You then (atomically) release the mutex and wait on the condition variable. The other thread then acquires the mutex, signals the condition variable, and releases the mutex. The first thread then wakes up with the mutex.

Really though, you should be using an asynchronous model if you want to be able to reason about your code. Difficulty in debugging scales linearly with the number of asynchronous threads/processes and exponentially with a synchronous approach.

To be honest, if you need concurrency, you would be better off using a language like Erlang which is designed for concurrency, rather than trying to shoehorn it into a hacked version of PDP-11 assembler.

--
I am TheRaven on Soylent News

Having just written a multithreaded server... by pieterh · 2006-07-13 00:09 · Score: 4, Informative

My company designed high-performance mono-process servers (portable ones too) starting in 1995, using event-driven virtual threads and state-machine frameworks. Very elegant, very fast, and really easy programming. The Xitami web server was one example - I remember seeing a Win95 system with Xitami survive a slashdotting (it was serving static pages but that was still impressive.)

We worked in C, because we needed guaranteed low latencies.

In 2004 we decided to rebuild these frameworks to handle OS multithreading. The reason was that on a single CPU we could not get the performance we needed, and the choice was either to use clusters, or multithreading.

We continued to work in C. C, and C++ are really nasty for multithreading because the languages have zero support for concurrency. You need to handle everything yourself, and most threading errors are extremely hard to detect.

It cost us about 10 times more to write our software as multithreaded code than using virtualised threads and we had to build whole reference management frameworks to ensure that threads could share data safely.

We did keep virtual threading, in fact, but virtual threads get handled by a pool of OS threads. Using 1 OS thread per connection is not scalable beyond a few hundred threads. Modern Linux kernels handle lots of threads but we also target Solaris, and Windows with the same code. So we use two virtual threads per connection, for full-duplex traffic, and we design most of the major server components as threaded objects, which are asynchronous event-driven objects.

Doing multithreading in C is a *huge* work. C++ has frameworks like ACE that help a lot.

But there is a performance gain. Our software is a messaging server (implementing the AMQP draft standard). We maxed out at around 55,000 messages per second using a pure virtual-threaded model. Very efficient code. On a single CPU the multithreaded code hits 35,000 messages per second. With two CPUs we're back at 55k, and with 4 dual-core Opterons we're at 120k-150k and higher. (Our software runs a massive trading application that processes 1.5bn messages per day). We still need to improve some of the low-level locking functions to use lock-free mechanisms, and we max out a gigabit network. It is difficult to find machines powerful enough to really stress test the software.

Without very robust frameworks, I'd never attempt such a project. As it was, we paid a lot for the extra performance. Our frameworks will eventually be released as free software, along with the middleware server.

Interestingly, a very similar application written in Java 1.5 and using the BEA runtime gets similar performance to ours. Java's threading is so good that I'd be hesitant to chose C on the basis of performance again. I'm not sure whether ACE can reach the levels of performance we need; 100k messages per second is extreme.

Other questions that are very important to ask:

- The number of clients you expect to connect at once. If it's less than 500 you can probably use one or two OS threads per connection. If it's more you need to virtualise connections or share your OS threads.
- The footprint. If you don't care, then I'd advise using Java. If you want a native Linux service, consider C++ and ACE. If you really want to write multithreaded C code, and don't have a full toolkit, consider seeing a doctor.

When it comes to the future, clearly multiple cores are the way we're heading. This was clear two years ago, and was the main reason we bit the bullet and chose to write our software multithreaded rather than using a clustering model. It seemed clear to me that within a decade, systems would have 32, 64, 128 cores, and software that could take advantage of this would survive for longer. Clustering is not as powerful an abstraction as multithreading.

--
My blog

Slashdot Mirror

Should Servers be Mono-Process or Multithreaded?

17 of 96 comments (clear)