Should Servers be Mono-Process or Multithreaded?

Error by donald_the_laker · 2006-07-12 12:52 · Score: 0

I get this: "Nothing for you to see here. Please move along".
Is this an error? I'd like some help.

Re:Error by donald_the_laker · 2006-07-12 13:02 · Score: 2, Funny

Oh...I'm sorry. I am new around here, I signed up for the tour, but Natalie Portman was too busy eating grits, and they were gonna send Natalie Wood instead, but the insensitive clod had to go drown. I for one welcome her new Neptunean overlords.
Re:Error by captnitro · 2006-07-12 13:03 · Score: 1

According to submitter, you need more cores.
Re:Error by donald_the_laker · 2006-07-12 13:08 · Score: 1

Oh see I was running slashdot in one thread and firefox in another...I guess that's why. Maybe if I buy another core this will fix it?

info by illuminatedwax · 2006-07-12 12:54 · Score: 5, Informative

Check out the C10K page for a very detailed discussion about this.

--
Did you ever notice that *nix doesn't even cover Linux?

combination by TheSHAD0W · 2006-07-12 13:02 · Score: 2, Informative

If you're going to be serving more than a few connections at a time, it's easy for threads to eat monstrous amounts of resources. It's better if you can handle network connections via a single thread. On the other hand, moving other tasks to separate threads can help. For instance, you will probably want to run the UI with a separate thread (yes, even if it's text-only), and it's useful to be able to split file operations among several threads and let the OS optimize disk access.

Re:combination by agent+dero · 2006-07-12 13:17 · Score: 3, Informative

On what operating system? With FreeBSD's SMPng project, they've made most of the network stack (from my understanding) SMP safe, and the kernel now supports pushing multiple threads across multiple CPUs (like Solaris, Xnu, and Linux).

it's easy for threads to eat monstrous amounts of resource

Don't you mean forked processes? The advantage of threads is that they're lightweight and use shared memory (dead locks hoorah!), forked processes are heavy because they need their own memory, etc.

--
Error 407 - No creative sig found
Re:combination by PhrostyMcByte · 2006-07-12 13:23 · Score: 2, Insightful

If you're going to be serving more than a few connections at a time, it's easy for threads to eat monstrous amounts of resources. It's better if you can handle network connections via a single thread.

I disagree. While using a single thread per connection is definitely a terrible way to go, making I/O operations asynchronous and handling callbacks via a thread pool is definitely the fastest and most efficient way to build a scalable daemon today. And once you get used to the idea, it is even much easier than using select()/poll()-based approaches. (disclaimer: I have no experience doing this in Linux)

If anyone is interested I have been slowly piecing together a library to make this easier for Windows developers.
Re:combination by ILikeRed · 2006-07-12 14:40 · Score: 2, Informative

disclaimer: I have no experience doing this in Linux
That's because it's only needed (as a performance hack) with Windows. Forking is cheap with *nix.
"Those who don't understand UNIX are doomed to reinvent it, poorly."
--Henry Spencer

Get ESR's "The Art of UNIX Programming" - or if cheap read it online. Chapter 7 might be a good place for you to start.

--
I have come to a conclusion that one useless man is a shame, two is a law firm, and three or more is a congress -J Adams
Re:combination by PhrostyMcByte · 2006-07-12 14:53 · Score: 1

This is not about how cheap threads/forking are. It's about how many context switches you want to waste when you shouldn't be.
Re:combination by Anonymous Coward · 2006-07-12 15:27 · Score: 0

I guarantee that if your experience is based on performance profiles on Windows, then you have absolutely no clue how differently Linux is going to perform. Sure, context switches have an unavoidable minimum cost. But that cost is far less than the actual cost is in Windows. The performance profiles in your benchmarks are nowhere near the true cost in any modern *nix system.

And if you have a multi-CPU machine, that cost can quickly be paid for by the fact that schedulers are more willing to move processes around to balance load than they are to move threads around to balance load. The reason schedulers are reluctant is that when you have multiple threads running on multiple CPUs, there are often memory contention issues between those CPUs. With multiple processes, those issues are far less common. (Yes, you can write threaded programs that will have no such issues, and with multiple processes and shared memory pools you can create contention. But those are less likely, and schedulers are designed to make the likely case fast.)

This becomes an even bigger consideration when you move to a compute cluster, or a NUMA architecture.

So the long and short of it is that it is not obvious that threads are faster than independent processes. In fact if you really want to scale, processes are likely to be faster.
Re:combination by PhrostyMcByte · 2006-07-12 15:59 · Score: 2, Insightful

So the long and short of it is that it is not obvious that threads are faster than independent processes. In fact if you really want to scale, processes are likely to be faster.
I'm not arguing on threads or processes, I don't care about that. Just that 2 threads will be more performant than 200 threads on a 2 cpu machine. And apparently modern production web daemons agree with me.
Re:combination by sd4l · 2006-07-12 20:29 · Score: 1

AIUI forked processes on Linux have a very lightweight memory model using copy-on-write. So the memory for each process is mapped to a single block of memory until one of them tries to write to it, in which case that page is duplicated for the other process(es).

--
-- Andy Jeffries Scramdisk for Linux (Change the orgy to org to reply)
Re:combination by gbjbaanb · 2006-07-12 22:46 · Score: 1

That's good - Windows does copy-on-write too. But, in the real world, how much of the memory that is used by a process will be written to? I'd say quite a lot of it. Executable code and static data (strings etc) will be effectively read-only, as will startup initialisation, but past that, and especially in a dynamic process that is a small engine that works with a lot of dynamic data (e.g. a web server), you'll see a lot of private memory used for each process.

On the other hand, with threads you'll still use a lot of that memory too, only in a single process instead of several.

The benefits of threads are mainly process startup time - a single process doesn't have to re-authenticate itself with the OS, but with a threaded program all threads run as the same user. Synchronisation objects are faster (much faster in some cases - a critical section is super fast compared to a cross-process mutex).
Inter-thread communication is faster than inter-process, and certainly easier to use. (This is a bit of a generalisation - you can use shared memory between processes, but that's not nearly as easy as using a single variable shared between threads. Don't forget the cost of the sync object in there too.)

That's it off the top of my head, I'm sure there are more reasons but even so, there's not as much between them as popular wisdom thinks.
Re:combination by gbjbaanb · 2006-07-12 23:43 · Score: 1

Forking is cheaper than Windows, not cheap. (define cheap. lol).

If you have to fork to serve a new request, regardless of how efficient unix makes it, it will still require a fair amount of resources and start up some context switching. In addition you then have to set up the network connection and so on. It's a lot more expensive than a single process running and sharing work within itself. (You can see this in action by comparing cgi based webservers with modular ones. The cgi ones are a tenth of the performance of the internally managed ones.)

The fastest systems are ones that start up a set of workers (whether forked processes or threads doesn't matter) and then pass work to them as needed. Note that you do no worker creation once the system is set up, and you generally block incoming requests once all your workers are busy. Creating a new worker each time is terribly inefficient, so even though unix may be fast at forking, it's still a slow process and shouldn't be used as if it's a cost-free solution.
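
A minimal sketch of that pre-started worker model, using threads and a bounded queue from the Python standard library. The names `run_worker_pool`, `handler` and `num_workers` are made up for illustration, and a real server would use accepted connections as jobs rather than plain values.

```python
import queue
import threading


def run_worker_pool(jobs, handler, num_workers=4):
    """Process jobs with a fixed set of workers created once, up front."""
    # Bounded queue: put() blocks once all workers are busy and the
    # backlog is full, which is the "block incoming requests" behaviour.
    tasks = queue.Queue(maxsize=num_workers * 2)
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            job = tasks.get()
            if job is None:  # sentinel: this worker should exit
                break
            outcome = handler(job)
            with results_lock:
                results.append((job, outcome))

    # All worker creation happens here; none happens per request.
    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()

    for job in jobs:
        tasks.put(job)
    for _ in workers:
        tasks.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    print(sorted(run_worker_pool(range(5), lambda n: n * n, num_workers=2)))
```

The sketch assumes `handler` does not raise; a worker that dies on an exception would never consume its sentinel, so production code would need error handling around the call.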
Re:combination by nahdude812 · 2006-07-13 02:15 · Score: 1

Unless you're heavily IO bound.

I think the long and short of this whole discussion is that there's no silver bullet here. If there was, then the opposing system would by and large fall into disuse. Depending on what sort of thing you're making, how it behaves, and what it interacts with, different scenarios will be easier / harder to program, and more / less efficient with system resources.

--
Slay a dragon... over lunch!
Re:combination by Phillup · 2006-07-13 03:28 · Score: 1

But, in the real world, how much memory that is used by a process will be written to? I'd say quite a lot of it. Executable code, and static data (strings etc) will be effectively read-only, as will startup initialisation, but past that and especially in a dynamic process that is a small engine that works with a lot of dynamic data (ie a web server), you'll see a lot of private memory used for each process.
In my real world very little of the process memory is written to.

Specifically, I'm talking about a web application written in perl and run via mod_perl (apache 1.3).

I load almost all used perl modules during apache startup... which makes for a very fat parent memory wise. But, when the child processes are forked most of their code (percentage wise) is shared with the parent process.

In real world terms I found that I could easily run (actually running and serving client requests) ten times the number of apache processes in the same memory footprint.

So... I think it would really depend on the process.

And that is the answer to the original question: It depends on the nature of the process.

--

--Phillip

Can you say BIRTH TAX
Re:combination by BigCheese · 2006-07-13 05:48 · Score: 1

The rule of thumb for the number of network threads is (# of CPUs) + 1. Much more than that and you lose performance from context switches.
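
As a rough illustration, the rule of thumb can be turned into a pool size like this. The helper names are hypothetical, and `os.cpu_count()` reports logical CPUs, which may or may not be what the rule intends on hyperthreaded machines.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def network_thread_count():
    """Rule of thumb from the post above: number of CPUs, plus one."""
    cpus = os.cpu_count() or 1  # cpu_count() can return None
    return cpus + 1


def make_network_pool():
    """Create a thread pool sized by the rule of thumb (hypothetical helper)."""
    return ThreadPoolExecutor(max_workers=network_thread_count())


if __name__ == "__main__":
    with make_network_pool() as pool:
        print(list(pool.map(len, ["GET /", "GET /index.html"])))
```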

--
The obscure we see eventually. The completely obvious, it seems, takes longer. - Edward R. Murrow

The answer is by Billly+Gates · 2006-07-12 13:10 · Score: 0

yes

--
http://saveie6.com/

Re:The answer is by Anonymous Coward · 2006-07-13 00:57 · Score: 0

No, the answer (my friend) is either:

* Blowing in the wind

or

* 42 (Now, what's the question?)

the answer is yes... by riprjak · 2006-07-12 13:13 · Score: 1

Of course, it would help if you asked a meaningful question.

Absent context, servers should be both and neither and something we haven't invented yet.

What is the server for? is it serving a database to 27 hojillion simultaneous users?? serving a static intranet to 5?? filtering mail for AOL.com??

One suspects that each problem will have an optimum solution... so, as the first post noted, nothing to see here.

err!
jak.

Re:the answer is yes... by fastgood · 2006-07-12 13:23 · Score: 1

is it serving a database to 27 hojillion simultaneous users?
I'd say any time there's over 27 brazilian simultaneous users.
Re:the answer is yes... by Anonymous Coward · 2006-07-12 13:55 · Score: 1, Interesting

What is the server for? is it serving a database to 27 hojillion simultaneous users?? serving a static intranet to 5?? filtering mail for AOL.com??

Bingo.

I recently replaced a server for a client. His previous one was just too slow, and his response was to keep replacing it with a new one with a faster CPU and more RAM. His result? Only marginal increases in speed.

I profiled the machine as he was using it, noticed that the bottleneck was disk access, and replaced the IDE drives he was using with a caching SCSI RAID. The result? 20x speed improvement (yes, 20X. operations that used to take 8 to 10 minutes were being run in under 20 seconds.)

fastest way by r00t · 2006-07-12 13:25 · Score: 4, Informative

Well first, you probably should keep things simple and just buy nice hardware. Most servers sit idle most of the time anyway. If you truly do need the performance though...

Have 1 process per node. I mean "node" in a NUMA sense. 64-bit AMD systems have one node per chip package. Other PCs (except exotic stuff) have one node for the whole system. Lock your processes to separate nodes, so that they all get local memory. If you don't do this, at least remember to use the new system call for moving pages from one node to the other. (eh, "move_pages" if I remember right -- see unistd.h in the kernel source)

You'll need extra threads to do disk IO. Not counting those: On each node, have at least 1 thread per bottom-level (usually L2 or L3) cache, but not more than 1 thread per virtual core (hyperthreading thing). If you go with 1 thread per physical core but have virtual cores (hyperthreading) enabled, lock your threads to virtual cores that don't share physical cores.

A lot of this should be configurable. Hopefully you'll make an easy way to automatically determine the best configuration, writing out the appropriate config file so that manual config hacking is not required to get the best performance.

forgot to mention IRQs by r00t · 2006-07-12 13:28 · Score: 1

Disable the IRQ balancing. Direct IRQs to the correct CPUs so that you don't send/receive packets on a different CPU from where the app will be running. This has been proven to help with network cards. The same might apply to disk.

Debugging by Spazmania · 2006-07-12 13:28 · Score: 2, Interesting

Debugging a preforked C program like Apache 1.3 is still much easier than debugging a monolithic process with non-blocking I/O like Squid.
Debugging a monolithic C process like Squid is still much easier than debugging multithreaded software like Microsoft Windows.

This isn't likely to change regardless of how many processors you throw in the machine.

The rules change when you move to something like Java where cross-thread contamination of the data structures is relatively difficult. But then if you're talking about Java then why are you seriously considering hardware performance issues?

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.

Re:Debugging by samkass · 2006-07-12 13:39 · Score: 2, Insightful

Is this intended as flamebait? Performance in Java is as important as performance anywhere else. Most of the performance in any complex system depends more on the algorithms and how easy it is to debug a complex, fast algorithm, than on any inherent speed advantage of a particular language. I'd say the question of what design of network I/O and threading maximizes performance is at least as important a question in Java as any other language.

--
E pluribus unum
Re:Debugging by Anonymous Coward · 2006-07-12 13:49 · Score: 1, Interesting

You are right....for a C/C++ based program. For Erlang (or even Oz) I would disagree. Ejabberd and Tsung are good examples of what Erlang is capable of in terms of simultaneous users and performance. Check it out.
Re:Debugging by quanticle · 2006-07-12 15:04 · Score: 1

The original poster distinctly said hardware performance issues. Algorithm efficiency is something that I'd probably characterize as software performance. I think the OP was trying to point out that, if your hardware is fast enough to run a Java VM comfortably, it'll be fast enough for you to not worry about the performance impact of "low level" design decisions.

--
We all know what to do, but we don't know how to get re-elected once we have done it
Re:Debugging by Al+Dimond · 2006-07-12 17:29 · Score: 1

This topic is about whether you should write single-threaded or multi-threaded programs for the highest performance. If for your application there is a cost of one over the other that grows faster than linearly as the number of users scales upwards, that cost would soon eclipse the difference in performance due to implementation language, which is probably a percentage speedup that holds pretty constant as number of users grows.
Re:Debugging by Kj0n · 2006-07-12 17:37 · Score: 1

Debugging a monolithic C process like Squid is still much easier than debugging multithreaded software like Microsoft Windows.

Why would you want to debug Microsoft Windows?
Re:Debugging by julesh · 2006-07-12 22:22 · Score: 3, Funny

Why would you want to debug Microsoft Windows?

You work for Microsoft, don't you?
Re:Debugging by Spazmania · 2006-07-13 00:25 · Score: 1

Yeah, I worded it badly. The point I was trying to make is that if you aren't talking about something written in C/C++ then the performance issues associated with threaded/not-threaded are the least of your concerns. It's only when you get to a fairly low-level language like C that they start to make a noticeable difference.

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Re:Debugging by samkass · 2006-07-13 05:22 · Score: 1

I'd be curious to see actual data that backs this up. Link?

--
E pluribus unum
Re:Debugging by Spazmania · 2006-07-13 11:15 · Score: 1

Actually, I'd be curious to see data on this too. I haven't really heard anything of note about preforking or single-thread non-blocking IO outside of the C/Unix universe. It strikes me that you only use those methods when you're trying to squeeze out the last ounce of performance from the hardware. Other languages (like Java) generally have different priorities and the programmers using those other languages generally have priorities in line with their language's strengths.

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.

Multiple processes by Alex+Belits · 2006-07-12 13:38 · Score: 3, Informative

You almost never have fewer performance-critical processes (say, web server + database server processes) than CPUs, so for most applications, in situations where you don't have a shared context that needs read-write access from all requests, use multiple processes. If you do have such a shared context, you need to weigh the overhead of synchronization between threads against having no simultaneous access. Also take into account that socket i/o is not synchronous -- a process may have sent a buffer to the kernel and switched to processing another request, but the kernel is still sending the data from that buffer. The same happens with receiving data -- it may have arrived already, but the kernel is filling the input buffer while the server process is not looking there. If the data never arrives faster than it can be processed, you gain no benefit from trying to process it in parallel -- your threads will be sleeping more, and performance will remain the same.

On systems with crippled schedulers and VM, threads are very efficient compared to everything else because your application becomes its own scheduler, and it reduces the total physical footprint and amount of cache invalidations. With a better scheduler and VM, it makes more sense to rely on the OS insulating processes and scheduling their i/o, so the solution with multiple processes becomes more efficient in the absence of an application's need to share some data between multiple requests.

--
Contrary to the popular belief, there indeed is no God.

Re:Multiple processes by Anonymous Coward · 2006-07-12 20:33 · Score: 0

I don't understand how processes could possibly be faster than threads doing the same task. If processes are better because there is no interaction between them, then threads doing the same task would also have no interaction between them, while incurring faster context switches.

Of course it seems silly in this day and age to have a single thread/process handling a single connection in anything resembling a high-performance server. In that case, each thread/process should be able to handle multiple connections at once. It should execute a request, grab the next request off the queue or wait for one, ad infinitum. With all threads in a single process, it is easy to balance the load queues between threads; with separate processes it is almost impossible to move a request from a busy process to a waiting one.

Also, it's hard to understand why reducing total physical footprint and amount of cache invalidations is only good on systems with bad VM systems or "crippled" schedulers (whatever that means).

dom
Re:Multiple processes by Alex+Belits · 2006-07-12 21:30 · Score: 2, Insightful

I don't understand how processes could possibly be faster than threads doing the same task. If processes are better because there is no interaction between them, then threads doing the same task would also have no interaction between them, while incurring faster context switches.

Because there IS interaction between threads -- mutexes handling takes resources, too, and checking i/o status also requires time and syscalls (OS just supplies pending i/o information to its scheduler by itself). And because data access mechanism within your application is likely to be a much worse scheduler than scheduler in the OS.

Of course it seems silly in this day and age to have a single thread/process handling a single connection in anything resembling a high-performance server.

Actually, it's not (unless you are in Windows). For quite a while it was the only way network i/o was possible in Java at all. And while it was a bad decision for Java, it was based on a valid idea that sleeping threads or processes eat fewer resources than what is necessary for multiplexed handling of completely unrelated connections within one thread.

In that case, each thread/process should be able to handle multiple connections at once. It should execute a request, grab the next request off the queue or wait for one, ad infinitum. With all threads in a single process, it is easy to balance the load queues between threads; with separate processes it is almost impossible to move a request from a busy process to a waiting one.

You don't need to do that -- i/o scheduler in kernel does that for you already, and when that is not good enough, you can have a separate process that dispatches request by whatever set of rules. As opposed to Windows, Unixlike systems allow socket passing between processes.
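
For anyone who hasn't seen socket passing, here is a small sketch of the mechanism on a Unix-like system: an open descriptor travels as SCM_RIGHTS ancillary data over a Unix-domain socket. The helper names `send_fd` and `recv_fd` are made up, and the demo sends the descriptor within a single process to stay short; between a dispatcher and a worker the two ends of `channel` would live in different processes.

```python
import array
import os
import socket


def send_fd(channel, fd):
    """Send one open file descriptor over a Unix-domain socket."""
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))]
    # At least one byte of ordinary data must accompany the ancillary data.
    channel.sendmsg([b"F"], ancillary)


def recv_fd(channel):
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    _data, ancdata, _flags, _addr = channel.recvmsg(
        1, socket.CMSG_LEN(fds.itemsize)
    )
    for level, ctype, payload in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            usable = len(payload) - (len(payload) % fds.itemsize)
            fds.frombytes(payload[:usable])
            return fds[0]
    raise RuntimeError("no file descriptor received")


if __name__ == "__main__":
    chan_a, chan_b = socket.socketpair()   # dispatcher <-> worker channel
    client, peer = socket.socketpair()     # stands in for a client connection
    send_fd(chan_a, client.fileno())
    passed = recv_fd(chan_b)
    os.write(passed, b"hello")             # the receiver can now serve the client
    print(peer.recv(5))
    os.close(passed)
```

The receiver gets a new descriptor number that refers to the same underlying connection, so the sender can close its copy once the handoff is done.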

Also, it's hard to understand why reducing total physical footprint and amount of cache invalidations is only good on systems with bad VM systems or "crippled" schedulers (whatever that means).

Because with good VM they are smaller in the first place. In any case, when you deal with scalable systems that have to handle huge amounts of requests simultaneously, your cache will be severely beaten just because of the large amount of data involved.

--
Contrary to the popular belief, there indeed is no God.
Re:Multiple processes by Pseudonym · 2006-07-13 17:10 · Score: 1

Because there IS interaction between threads -- mutexes handling takes resources, too, and checking i/o status also requires time and syscalls (OS just supplies pending i/o information to its scheduler by itself). And because data access mechanism within your application is likely to be a much worse scheduler than scheduler in the OS.

One of the worst offenders is memory deallocation.

Yes, you heard that right: deallocation.

A decent memory allocator these days has multiple memory arenas. When you allocate memory, the system tries to pick one that currently isn't in use and allocate from that. On deallocation, however, you have no choice: a block of memory must be returned to the arena from which it came. This usually means thread contention.

--
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});

Neither. Multiprocess usually Prevails by KidSock · 2006-07-12 14:03 · Score: 4, Insightful

You forgot multiprocess. Like anything in software the answer is, "it depends on the application". But one of the most overlooked and frequently very important factors that affects performance is cache locality. If the CPU has to fetch something from main memory (or heaven forbid it actually has to dredge it up from disk) the program has to wait. That wait time is often much much greater than the execution time of the target code. Aside from simply writing small code (that only gets you so far), one way to get better cache locality is to break up your processing into a pipeline. Mail servers frequently do this. One process will accept connections, do some sanity checking and write the message to another process. The next process juggles addresses for routing and writes it to another process. That process might then work on delivery either locally or remotely. What happens (or what is supposed to happen under high load) is that one process becomes hot and processes as many messages as it can until the buffer to the next process is full. Then the next process runs, processing all of those messages until it either runs out of stuff to process or can't write anything more to the next process in the pipeline. If you have multiple cores / CPUs this scales pretty well too.

But again, "it depends on the application". The above pipelining method only performs well if you're processing items in an assembly line fashion. If you're an HTTP proxy server you wouldn't want that model. You would probably want a single process libevent type of thing. I have some code that doesn't use either of those models. It's a multiprocess model but event driven with *everything* in shared memory. It's very close to a multithreaded model but I needed security context switching. Also, contrary to popular belief, threaded servers are slower than equivalent multiprocess servers. So in general, the benefit of a multithreaded server is pretty much just about convenience for the programmer. Since you can achieve the same effect by just creating an allocator from a big chunk of shared memory mapped before anything is forked, there's very little reason to use threads at all.

Simple rule. by megaditto · 2006-07-12 14:10 · Score: 1

How about this?

Either use asynchronous and non-blocking calls throughout, OR use multiple threads.

--
Obama likes poor people so much, he wants to make more of them.

Consider Fault Tolerance & Thread Safety Too by thebiss · 2006-07-12 14:11 · Score: 5, Informative

Since you didn't say what kind of server you're building, I'm going to assume:
- that you're building a custom-purpose, client-server or message processing application,
- it needs to be highly parallel to be efficient
- the language is C, C#, or C++, and not Java (process-based servers in Java?)

I have done this before, using both processes and threads, for the same application. Consider the impact of application faults on your design, and then consider how hard it will be to create thread-safe code.

o A highly multithreaded server, where threads are performing complex and/or memory demanding tasks, will be susceptible to complete failure of all running jobs on all threads, if just a single thread SEGFAULTs. And despite your best testing efforts, complex code (1M+ lines) will at some point, somewhere, fail.

o Threaded code must be thread safe. Static variables, shared data structures, and factories all previously accessed through a singleton must now be protected with guard functions and semaphores. Race conditions need to be considered. Design for this up front. It will be much harder to add it later.
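
To make the second point concrete, here is a small sketch of shared state guarded by a lock. `ReportCounter` and `hammer` are hypothetical names, not anything from the project described below, and Python's `threading.Lock` stands in for whatever guard or semaphore the real C++ code would use.

```python
import threading


class ReportCounter:
    """Hypothetical shared state, of the kind a singleton might hold."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def increment(self):
        # Read-modify-write must happen under the lock, otherwise two
        # threads can read the same old value and lose an update.
        with self._lock:
            self._count += 1

    @property
    def value(self):
        with self._lock:
            return self._count


def hammer(counter, threads=8, per_thread=10000):
    """Increment the counter from several threads at once."""
    def work():
        for _ in range(per_thread):
            counter.increment()

    pool = [threading.Thread(target=work) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()


if __name__ == "__main__":
    shared = ReportCounter()
    hammer(shared)
    print(shared.value)
```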

The Project

I worked on a team which added an asynchronous processing engine to a web application. The engine was responsible for performing memory and time-intensive financial analysis and reporting for 16,000 accountants, so that they could close a large company's financial books. Unlike a webserver triggered by on-line end users, this engine is triggered by events in the company's financial database: once the database raises the "ready" flag, this engine begins running as many reports as it can, as fast as it can, on behalf of the 16K users. The analysis and report code was 2 million lines of C++, running on AIX.

Process implementation

The initial implementation used processes. A dispatcher job monitored the database for the ready flag, and then forked children of itself to analyze slices of the data, and generate the reports. One child job was used for each analysis and report pair, and the manager controlled how many jobs ran in parallel, maintaining a scoreboard of which jobs succeeded, and which failed.

Due to the complexity of the system, failures (core) occasionally occurred. The monitor would record this, retry the failed analysis up to 3 times, and keep a uniquely named core file of the event. Other analysis reports would continue to be generated, otherwise unharmed by the thrashing thread. Approximately once every 90 days, the development team would collect the few cores generated, use the gdb/xldb debugger to determine the cause of failure, and correct the fault.
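
A rough sketch of that dispatcher-and-scoreboard idea follows. It is not the original engine: the names are invented, jobs run one after another instead of in parallel, `subprocess` stands in for fork(), and "retry up to 3 times" is read here as one initial attempt plus at most three retries.

```python
import subprocess
import sys

MAX_RETRIES = 3  # assumption: 1 initial attempt + up to 3 retries


def run_job(command):
    """Run one job in its own process and report how it went.

    A crash in the child only produces a non-zero exit status here;
    it cannot take the dispatcher or the other jobs down with it.
    """
    attempts = 0
    while attempts <= MAX_RETRIES:
        attempts += 1
        result = subprocess.run(command)
        if result.returncode == 0:
            return ("succeeded", attempts)
    return ("failed", attempts)


def dispatch(jobs):
    """Build the scoreboard: job name -> (status, attempts used)."""
    return {name: run_job(command) for name, command in jobs.items()}


if __name__ == "__main__":
    scoreboard = dispatch({
        "good_report": [sys.executable, "-c", "pass"],
        "bad_report": [sys.executable, "-c", "import sys; sys.exit(1)"],
    })
    for name, (status, attempts) in scoreboard.items():
        print(name, status, attempts)
```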

The downsides of this? The solution was slowed because it couldn't re-use resources like database connections (they were destroyed with each process), and more memory was used than need be. DB2 caching helped somewhat, but potential performance improvements remained.

Threaded implementation

In a large company, there are IT standards, and one of the standards at my company is that applications shall never, ever, ever fork(), even if running on a large dedicated machine. After losing the fight against this, my team re-architected the report engine. Largely the same as the previous, the new engine waits for the "ready" signal, and then spawns pthreads (POSIX threads) as workers to analyze the data and generate the report. In theory, it was robust.

The alpha version of this solution immediately failed (cored) during testing. We neglected to identify the less obvious non-thread-safe code in the application, and failed to identify several race conditions. Unlike previous failures, these faults were total: a SEGFAULT in code on one of 20 threads would halt the entire application. And the corefile generated was now huge - it contained a snapshot of memory for all 20 running jobs, instead of just the one of interest.

Extensive root-cause analysis, design, and restart management solved this, and the current version is as robust as, and a good bit faster than, the previous. At a significant price.

--
Beware: I believe all are created equal, and have the right to life, liberty, and the pursuit of happiness.

Synchronous or Asynchronous IO by c0d3r · 2006-07-12 14:12 · Score: 1

Mono-process is synchronous IO (select system call) which blocks, whereas multithreading is asynchronous and non-blocking. You can get far more throughput with the same amount of memory with threads, and be able to leverage multi-core, multi-processor and distributed processing architectures.

mb

Re:Synchronous or Asynchronous IO by Anonymous Coward · 2006-07-13 16:27 · Score: 0

Uhh... If you set O_NONBLOCK on a socket with fcntl(), then a read() or write() that would otherwise block returns -1 with errno set to EAGAIN, and you can poll it later instead of blocking. This is non-blocking operation and together with select() or poll() you can handle multiple clients in a single process. Your post is a bit confused. You can do non-blocking I/O without using threads.
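
For anyone who wants to see that behaviour without writing C, here is a small sketch in Python. `setblocking(False)` puts the descriptor into non-blocking mode (O_NONBLOCK), and Python surfaces EAGAIN as `BlockingIOError`. The helper name `try_read` is made up for the example, and a socketpair stands in for a real client and server.

```python
import select
import socket

# A connected pair of sockets stands in for a client and a server.
server_side, client_side = socket.socketpair()
server_side.setblocking(False)  # puts the descriptor in O_NONBLOCK mode


def try_read(sock):
    """Return available bytes, or None if the read would have blocked."""
    try:
        return sock.recv(4096)
    except BlockingIOError:  # errno EAGAIN / EWOULDBLOCK
        return None


# Nothing has been sent yet, so the read reports "try again" at once.
first = try_read(server_side)

# After the peer writes, select() says the socket is readable.
client_side.sendall(b"hello")
readable, _, _ = select.select([server_side], [], [], 1.0)
second = try_read(readable[0]) if readable else None

server_side.close()
client_side.close()
```

No thread is involved at any point: the single process simply asks select() which descriptors are worth touching.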

outdated info by Anonymous Coward · 2006-07-12 14:51 · Score: 2, Informative

The last time this page was updated was November 2003; since then there have been two major revisions to Java and at least one major revision to the Linux kernel (as well as changes to FreeBSD, OpenBSD and Windows)... and in all cases these revisions introduced changes to address scalability/concurrency. This page is incredibly out of date.

Also... the opening statement and its bias toward Unix "for obvious reasons" doesn't lend to its credibility. I've tuned high volume systems on both platforms and I couldn't care less whether the systems are running on unix or windows. Anyone who claims that one has inherent performance advantages over the other is just showing their inexperience or bias. Nine times out of ten, any serious performance problems are with the application design and implementation.

Re:outdated info by illuminatedwax · 2006-07-12 15:49 · Score: 1

Yet the methods contained therein have yet to be put into widespread use. At least use this as a starting point, which most people may not have known about.

Does Windows have the kind of kernel control necessary to implement things quickly like on this page? Serious question; I remember discovering that Windows' support for asynchronous process communication left me severely wanting.

--
Did you ever notice that *nix doesn't even cover Linux?
Re:outdated info by PhrostyMcByte · 2006-07-12 16:11 · Score: 1

Windows has had correct async support for sockets, files, and pipes ever since winnt. Pipes are usually the method of choice for IPC.
Re:outdated info by thesandbender · 2006-07-12 16:59 · Score: 2, Informative

I'm the original anonymous coward... wish it hadn't posted as such.

Windows support for IPC is actually very robust; shared memory and semaphores are about as fast as you can get, and they exist on both Windows and *nix, just under different names.

A lot of people either don't know or forget that the NT->XP->2k3 kernel owes a lot to the Vax/VMS group from Digital who were snatched away to build the NT kernel. There seems to be this illusion that XP is the bastard step child of Windows 1.x and that is not the case; a lot of thought was put into the system from a multi-process/user standpoint. The current problem is that thought didn't extend that far into today's security environment (although Dave Cutler and crew did try... I remember the fight to move the video card drivers out of ring zero... Cutler argued that they were ancillary to the system, which is true from a server standpoint, but he lost).
Re:outdated info by illuminatedwax · 2006-07-12 17:08 · Score: 1

I want my asynchronous signals though!!

--
Did you ever notice that *nix doesn't even cover Linux?
Re:outdated info by Pogue+Mahone · 2006-07-12 22:52 · Score: 1

IIRC the graphics subsystem *did* run in user space in NT up to and including v3.51. It was NT4 that put them all in the kernel, allegedly on performance grounds, with the inevitable loss of stability. Not that I ever noticed any performance gain when I "up"graded from 3.51 to 4.0.

--
Every bloody emperor has his hand up history's skirt [Peter Hammill/VdGG]
Re:outdated info by Phillup · 2006-07-13 03:18 · Score: 2, Interesting

Also... the opening statement and its bias toward Unix "for obvious reasons" doesn't lend to its credibility.

Eh?

Here is what I see:
The discussion centers around Unix-like operating systems, as that's my personal area of interest, but Windows is also covered a bit.

Perhaps the bias, and lack of credibility, is elsewhere...

--

--Phillip

Can you say BIRTH TAX
Re:outdated info by BigCheese · 2006-07-13 05:45 · Score: 1

Yes, OpenVMS is very robust (if a bit arcane) but multithreading support is abysmal. We had a project here to remove the multithreading from OpenVMS processes and it was providing unbelievable performance improvements. QIO is the way to go in OpenVMS.

To be fair Windows handles threads much better.

--
The obscure we see eventually. The completely obvious, it seems, takes longer. - Edward R. Murrow
Re:outdated info by Foolhardy · 2006-07-13 06:17 · Score: 1

If you're asking for a queue arrangement, how about shared memory for data transfer and an IO completion port for synchronization? IOCPs are great for async IO as well, and can schedule a set number of active threads, even taking into account the number currently sleeping.

Both by Anonymous Coward · 2006-07-12 15:01 · Score: 0

One word: SEDA

Real life examples by sdfad1 · 2006-07-12 15:38 · Score: 5, Informative

I cannot speculate, but I can look at what people are doing today. One thing that I have noticed is the widespread research into, and the compelling arguments for, massively multithreaded programming techniques. See Erlang for example. It is designed right from the beginning for this sort of problem - high throughput, high reliability, high uptime telephony networks.

As a rough benchmark, someone's got this.

That's an order of magnitude increase in "performance" (depends on what you mean by performance). I thought I'd do a casual informal test of my own, with a decent static file size (instead of the 1 byte used in that benchmark).

Server Software: Yaws/1.56
Document Length: 402 bytes

Concurrency Level: 500
Time taken for tests: 8.480740 seconds
Complete requests: 5000
Requests per second: 589.57 [#/sec] (mean)
Time per request: 848.074 [ms] (mean)

Server Software: Apache/2.0.54
Document Length: 402 bytes

Concurrency Level: 500
Time taken for tests: 29.787216 seconds
Complete requests: 5000
Requests per second: 167.86 [#/sec] (mean)
Time per request: 2978.722 [ms] (mean)

Output edited to get past lameness filter.

Err crap, I could have sworn the first time I tried this, when Yaws was first installed, its performance was worse! Oh well, perhaps it's something I've inadvertently done since then. Could have been due to my computer reboot (this is a desktop PC). It seems I've proven my point, although I was trying to disprove it. Standard caveats regarding benchmarks apply. Both servers are default Ubuntu installs with no configuration changes - I didn't compile anything manually.

Additionally it has also been noted that:

> Linus Torvalds: "100k threads at once is crazy." Using Posix style threads, I'd have to agree. Posix threads were just not designed with this level of usage in mind. Which is why concurrent languages like Erlang and Mozart/Oz don't use Posix threads.

Well, that's where it could be headed anyway - a multiprocessor system with green threads (ie simulated threads, like Java ones) implementing massive concurrency and redundancy. Some prototypes for systems like this are already available, and being used. Cheers.

Re:Real life examples by TheRaven64 · 2006-07-12 23:43 · Score: 1

The important thing about Erlang is it doesn't show threads to the user. Everything you do in Erlang is asynchronous and based on message passing. Erlang processes do not share any state with each other, and often do not have much state of their own (state lives in messages in a well-designed Erlang program). This means it's much easier to reason about your code, which means it's easier to debug, and easier to write working code.

--
I am TheRaven on Soylent News
Re:Real life examples by KidSock · 2006-07-13 05:21 · Score: 1

Concurrency Level: 500
Time taken for tests: 8.480740 seconds
Complete requests: 5000
Requests per second: 589.57 [#/sec] (mean)
Time per request: 848.074 [ms] (mean)

I don't understand this. These results are HORRIBLE. 5000 requests in 8 seconds? 1 request in .8 seconds? I think your decimal point is off.
Re:Real life examples by sdfad1 · 2006-07-13 16:15 · Score: 1

This is not a beefed up server - it's a very old computer, and is used only for development. Remember that there are 500 simultaneous connections, with the test itself also running on the same machine, and I'm also running a shitload of other stuff (X, mozilla-firefox - heaps of tabs open, Lisp (this one shows hundreds of megs of memory usage, but I've read that for this kind of application, top is a deceptive measure), emacs etc).
The benchmarks I've shown are from running "ab". I was going to show all program outputs, but had to edit it extensively (and finally gave up because I got tired of negotiating with the lameness filter), but it seems accurate. Do you think it could be much faster? Here's some more info:
cpuinfo:
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 8
model name : AMD Athlon(tm) XP 1700+
stepping : 1
cpu MHz : 1466.795
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips : 2916.35

uname -a gives Linux leia 2.6.12-9-386 #1 Mon Oct 10 13:14:36 BST 2005 i686 GNU/Linux
And finally, it's relative performance that counts, and for the other benchmark I show, where the outperformance is in the orders of magnitude, I wouldn't worry about processor speed/software versions or other configurations.
Oh wait, do you mean something else? Here, it's 500 concurrent requests, so at any time, there's only 500 connections. After 5000 connections have passed, we get this. Check out the output from ab, or read man ab, for a rough handle on what I have shown before.
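
To make the arithmetic behind those ab figures explicit, here is a small sketch. The function name `ab_summary` is made up; the formulas are the ones ab reports as I understand them: requests per second is completed requests over elapsed time, and "Time per request (mean)" is multiplied by the concurrency level, so it is the latency each of the 500 simultaneous clients sees, not the server's cost per request.

```python
def ab_summary(concurrency, total_requests, seconds):
    """Recompute ab's headline numbers from its raw inputs."""
    requests_per_second = total_requests / seconds
    # Mean latency as seen by one of the concurrent clients, in ms.
    ms_per_request = concurrency * seconds * 1000 / total_requests
    # Mean time between completed requests across all clients, in ms.
    ms_across_all = seconds * 1000 / total_requests
    return requests_per_second, ms_per_request, ms_across_all


yaws = ab_summary(500, 5000, 8.480740)
apache = ab_summary(500, 5000, 29.787216)
speedup = yaws[0] / apache[0]
```

With the Yaws numbers, a request completes roughly every 1.7 ms even though each client waits about 848 ms, and the Yaws run comes out at roughly 3.5 times the Apache throughput on this particular machine.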

Threading --- hype, more hype and extra hyped hype by inflex · 2006-07-12 16:06 · Score: 4, Insightful

start-rant:

Threads are useful, that's granted - but it would seem a lot of people are trying to convert wholesale over to this threading model just for the hell of it, running along with the apparent reasoning that threading is "lighter" than processes. Maybe threads are lighter/cheaper on Windows systems - but a Unix system with copy-on-demand paging forking/process system is _DESIGNED_ to handle processes. Right now a lot of the time threads are a hack. Unix and processes work nicely together.

As for "maximising" available resources, well don't forget there's typically another couple of dozen processes running on any given Unix setup, more so on a multi-user multi-purpose machine (let's say WWW, email and DNS setup - throw in SpamAssassin for lots of fun) there's no shortage of available processes to use up a CPU. On a monolithic system where it's running only one process, sure, threads become useful there to spread the load.

My gripe basically boils down to a lot of people going along and choosing to use threads rather than forking because they think that it's "cool" or (supposedly) "lighter" - not because they've done any real world testing/checking. Remember, Unix was built around the idea of many small processes/programs working together, so that'd tend to naturally allow usage of multiple CPUs without any exotic hacks. /rant.

non-blocking by pizza_milkshake · 2006-07-12 16:24 · Score: 4, Informative

higher throughput can be achieved with one process or thread (whichever floats your boat) per CPU, using epoll() (linux 2.6 only, use poll() for more portability) with non-blocking I/O.

however, it's easier conceptually to write a threaded server, it's more natural to write, and you just launch a single thread per connection. unfortunately, currently, this doesn't scale (see Why Events Are A Bad Idea (for High-concurrency Servers) http://www.usenix.org/events/hotos03/tech/vonbehren.html for an argument that thread implementations, and not their design, are the issue).

the former method can handle thousands of simultaneous connections with high throughput, even on a decent workstation; the latter cannot. threads simply have an inherent overhead that cannot be eliminated.

i've actually been working on writing a non-portable insanely fast httpd in my spare time (svn co svn://parseerror.dyndns.org/web/) over the past few weeks as a way to explore non-blocking I/O + epoll() and it performs very well (~600% faster conns/sec than a traditional fork()ing server (which i wrote first)).

for further discussion see The C10K Problem http://www.kegel.com/c10k.html which goes in-depth on these very subjects
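
a minimal sketch of the single-threaded, non-blocking approach, written in Python with the standard `selectors` module (which picks epoll on Linux) rather than raw epoll() in C. the function names and the upper-casing "protocol" are invented for the example, and a real server would also buffer writes instead of calling sendall() on a non-blocking socket.

```python
import selectors
import socket


def make_server(sel):
    """Create a non-blocking listener on an ephemeral loopback port."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ, on_accept)
    return listener


def on_accept(sel, listener):
    """Accept one pending connection and watch it for input."""
    conn, _ = listener.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, on_readable)


def on_readable(sel, conn):
    """Answer whatever arrived, or clean up if the peer hung up."""
    try:
        data = conn.recv(4096)
    except BlockingIOError:
        return  # spurious wakeup; try again on the next event
    if data:
        conn.sendall(data.upper())  # toy response, fine for tiny payloads
    else:
        sel.unregister(conn)
        conn.close()


def serve_once(sel, timeout=1.0):
    """Dispatch every socket that is ready right now."""
    events = sel.select(timeout)
    for key, _ in events:
        key.data(sel, key.fileobj)  # key.data holds the callback
    return len(events)
```

one thread calling serve_once() in a loop services every connection; running one such loop per CPU is the arrangement described above.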

Re:non-blocking by julesh · 2006-07-12 22:19 · Score: 1

I think you misunderstood the original submitter's question: as I read it, he's not considering thread-per-connection as is discussed in the paper you link, but rather a small number of threads (e.g. 4 on a dual-processor machine with hyperthreading) each running an event-based server, the idea being that putting everything in a single thread wastes CPU time as only 1 of the 4 CPUs presented by such a system can be used.

You'd have to write such a system carefully, but I think if it was done right you'd get better performance than a pure event-driven server.

Re:Threading --- hype, more hype and extra hyped h by Charan · 2006-07-12 17:25 · Score: 3, Informative

There is a genuine difference between multithreading and forking. The kernel does take longer to switch between processes than between threads since there's an address space change between processes. 10,000 threads in one process will use fewer per-process resources than would 10,000 processes of one thread each. I want to say that process accounting (on creation/destruction) takes more time than thread accounting, but I'm not intimately familiar with their implementation on Linux. For some applications, sharing a heap among all threads might make passing data a bit simpler than using IPC or shared memory.

As for utilizing the CPU, threads and processes should be close in performance. I would still expect threads to be slightly faster, since the (x86) processor's TLB is flushed on a context change that wouldn't happen if you switch between threads. For a server where any of this really matters, there will be thousands of worker threads/processes compared to a small number of system threads, so the probability of switching between two worker threads will be high.

I'm sure I'm leaving out some other important differences, but I can't think of them at the moment.

User-level thread libraries can let a process run even faster than with kernel threads or processes (less kernel involvement = faster), but in order to get good performance, asynchronous IO is a necessity.

SEDA: Mixing events and threads by Charan · 2006-07-12 17:38 · Score: 1

For an interesting hybrid approach between threads and events, check out SEDA - Architecture for Highly-Concurrent Server Applications. Basically, you write a server as a collection of stages connected by event queues. A stage receives an event on an incoming queue, does some processing on it, and then places it on a queue to some other stage. This mirrors the way an event-driven system is designed. Each stage has its own thread pool to handle events. All IO is asynchronous and is treated like any other event in the system.

The state of the art is hybrid of the two by NittanyTuring · 2006-07-12 18:37 · Score: 2, Insightful

Threads are used because they are easy for development. They can also keep the multiple units of a parallel processor busy. However, using more threads than there are processing units introduces overhead into the system. There are better ways to do task scheduling... like event-driven models.

So, why not combine the multi-threaded and event-driven models? Some very interesting research has been done in this area. Check out Staged Event-Driven Architecture, or SEDA. Like threads, it has high fairness in scheduling. Like event-driven models, it scales to a large number of concurrent requests. In fact, it degrades gracefully even during a Slashdot effect.

Re:Consider Fault Tolerance & Thread Safety To by julesh · 2006-07-12 19:24 · Score: 1

Threaded code must be thread safe. Static variables, shared data structures, and factories all previously accessed through a singleton must now be protected with guard functions and semaphores.

Not necessarily. It is entirely possible to design code that is threadsafe without using locks for data access. Instead, in many cases you can ensure that due to program logic variables you are accessing can only be used by a single thread (e.g. by only allowing a single thread to work on a connection's state at once and storing all data in association with the connection, or by using thread-local data for shared resources like database connections); in others algorithms can be designed that work without a lock: generally you'd use this approach in the code that reacts to an incoming event and selects a thread to handle it.
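
As a rough illustration of the thread-local option, here is a sketch in Python. `FakeConnection` and `get_connection` are hypothetical stand-ins for a real database handle and its accessor; the point is only that each thread lazily creates and then reuses its own object, so no lock is needed around it.

```python
import threading

_local = threading.local()  # each thread sees its own attributes


class FakeConnection:
    """Stand-in for a real database handle (hypothetical)."""

    def __init__(self, owner):
        self.owner = owner


def get_connection():
    """Return this thread's private connection, creating it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = FakeConnection(threading.current_thread().name)
        _local.conn = conn
    return conn


def worker(results, index):
    """Fetch the connection twice and record both handles in our own slot."""
    results[index] = (get_connection(), get_connection())
```

Every thread gets the same object back on repeated calls, and no two threads ever see each other's object.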

Thread vs Process is a linguistic debate, not tech by Anonymous Coward · 2006-07-12 19:25 · Score: 0

Short summary of the facts:

Shared memory is an efficient form of communication between processes.
Non-shared memory is an effective form of damage control/safety.
By default memory in a multi-process model (old apache, etc) is not-shared, but you can jump through some hoops to share some (read docs on shared memory).
By default memory in a multi-thread model is shared, but you can jump through some hoops to not share some.
The right way to write a server is to protect what you can (via non-shared memory) and share what you must (via shared memory)
In both the cases of threads and processes the OS needs to be able to account for private and shared memory - so get your OS vendor to make sure that cross-process context switches aren't that much more expensive than thread switches
- if that's not the case for you, your OS is buggy
- If it is the case for you, the whole process/thread debate is pretty moot, isn't it

But apart from the terminology of threads vs processes, the actual rule for writing an efficient and safe server is "share the memory you want shared between the streams of execution, and do NOT share the memory you don't want shared".
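
A small sketch of that rule in a multi-process setting, using Python on a Unix system. An anonymous `mmap` is explicitly shared across fork(), while an ordinary object is private to each process. The function name `fork_and_write` and the value 42 are arbitrary choices for the example.

```python
import mmap
import os
import struct


def fork_and_write():
    """Show which writes made by a forked child are visible to the parent."""
    shared = mmap.mmap(-1, 8)  # anonymous mapping, shared across fork()
    private = [0]              # ordinary object, copied into the child
    pid = os.fork()
    if pid == 0:
        shared[:8] = struct.pack("q", 42)  # lands in the shared page
        private[0] = 42                    # changes only the child's copy
        os._exit(0)
    os.waitpid(pid, 0)
    shared_value = struct.unpack("q", shared[:8])[0]
    shared.close()
    return shared_value, private[0]
```

The parent sees the child's write to the mapping but not the write to the list, which is the "protect what you can, share what you must" default of the process model.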

This hits home. What I did. What should I do? by kingradar · 2006-07-12 20:09 · Score: 4, Interesting

What should I do?

This debate hits home with me. I wrote a server daemon to handle the SMTP and POP protocols, and when I first started out I had to make a choice. The choice I made back then was to use a threaded model. The way it works is I spawn X threads which collectively use blocking calls to the accept() function. Each thread will only return from accept() once it has been assigned a new connection by the kernel. For performance I spawn the threads ahead of time. This architecture was a mistake. The issue is that I have to spawn a separate pool of threads to listen on port 25 (SMTP), port 110 (POP), port 465 (SMTP over SSL) and port 995 (POP over SSL). With this model I could end up with extra threads listening on port 25 when I need more threads listening and processing connections on port 465. This problem leads me to overcompensate by spawning _extra_ threads just in case. Of course this strategy wastes resources as now extra threads eat memory without benefit.

To address the SEGFAULT issue, ie one rogue thread taking the whole system down, I also fork multiple processes. In my case I fork 12 processes with 128 threads each. If one process gets killed by a SEGFAULT, the remaining processes continue to work. When I first launched the system, and it faced a torrent of email... 100K+ messages a day, I would have about one process die every 24 hours. With careful debugging work, I've gotten the code stable enough now that I haven't lost a process in about 9 months.

My theory when I first wrote this code was to leave scheduling to the kernel. I figured that if a thread was blocked waiting for IO data the kernel wouldn't schedule time slices for it. This meant those extra threads sat in the background waiting, but not using CPU time. I am starting to wonder whether this is a good theory? I am considering switching to a different model (more on that in a second), but am not sure which one is best? By the way, the reason each process has so many threads is for DB connection pooling. Each process gets 8 DB connections which are shared between the 128 threads. Each process also has its own copy of the antivirus database. I know it's possible, but trying to share DB connections and data between processes is much more difficult.

I plan to refactor this code soon, and have been struggling with what to do. I am curious to hear the thoughts of others?

The current plan is to move to a model where I spawn a single thread for each port. When these listening threads have a new connection, they dump the socket handle and the protocol into a buffer. I would then also spawn a pool of worker threads which read the incoming connections out of the buffer. Using semaphores and reflection these worker threads would pick up incoming connections and feed them to the right function depending on the protocol. I think this model would work much better than what I have now, but is this the best option?
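
A rough sketch of that plan in Python. It swaps in a thread-safe `queue.Queue` for the hand-rolled buffer and semaphores, and a plain dictionary lookup for "reflection"; the handler names and the `(protocol, payload)` tuples standing in for real socket handles are all assumptions made to keep the example self-contained.

```python
import queue
import threading


def handle_smtp(payload):
    return "SMTP:" + payload


def handle_pop(payload):
    return "POP:" + payload


# Protocol name to handler function; stands in for reflection.
HANDLERS = {"smtp": handle_smtp, "pop": handle_pop}
STOP = object()  # sentinel telling a worker to exit


def worker(incoming, results):
    """Pull connections off the shared queue and dispatch by protocol."""
    while True:
        item = incoming.get()
        if item is STOP:
            break
        protocol, payload = item
        results.put(HANDLERS[protocol](payload))


def run_pool(connections, num_workers=4):
    """Feed connections to a fixed worker pool; return sorted results."""
    incoming, results = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=worker, args=(incoming, results))
               for _ in range(num_workers)]
    for t in workers:
        t.start()
    for conn in connections:  # what each listener thread would do
        incoming.put(conn)
    for _ in workers:
        incoming.put(STOP)
    for t in workers:
        t.join()
    out = []
    while not results.empty():
        out.append(results.get())
    return sorted(out)
```

Because any worker can take any protocol, there is no longer a pool of idle port-25 threads sitting next to an overloaded port-465 pool.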

The other option is to create a system where I spawn only 8 worker threads (or some similar number). This pool of 8 threads then uses epoll() to find out which sockets need attention and address them accordingly. The problem with this model is that if an incomplete message is received, the thread couldn't process all the way into the output stage. Instead the data would need to be stored until the message sending was complete. Let me give an example, the thread might get "RCPT TO: " the first time it checked a socket. The thread stores this incomplete message. Then the second time around another thread picks up "example@example.com". The thread assembles the message into "RCPT TO: example@example.com" and then processes the entire command accordingly.
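
The per-connection buffering described here can be sketched in a few lines of Python. `LineAssembler` is a made-up name, connections are identified by an arbitrary key, and the sketch assumes commands end in CRLF as SMTP commands do. It also assumes only one thread touches a given connection's buffer at a time.

```python
class LineAssembler:
    """Collect partial reads per connection and emit complete lines."""

    def __init__(self):
        self._buffers = {}  # connection id -> bytes not yet terminated

    def feed(self, conn_id, data):
        """Add freshly read bytes; return any commands now complete."""
        buf = self._buffers.get(conn_id, b"") + data
        # Everything before the last CRLF is complete; the tail is not.
        *lines, rest = buf.split(b"\r\n")
        self._buffers[conn_id] = rest
        return lines

    def drop(self, conn_id):
        """Forget a connection that has closed."""
        self._buffers.pop(conn_id, None)
```

The first read of "RCPT TO: " yields nothing to process; once the rest of the line and its CRLF arrive, feed() hands back the whole command, regardless of which worker thread happened to do each read.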

Does this model work better? Keep in mind that when DB calls need to be made, the MySQL library won't work the same way. A slow database server could hang all 8 worker threads, effectively killing the model. There are also SPF, SPAM and Virus libraries. Any one of them could tie up a thread for an extended period, thereby killing this model. What does everyone else think? Am I not thinking about an event processing model correctly? Or is it that this type of daemon is better served by the one thread per connection model?

Re:This hits home. What I did. What should I do? by kingradar · 2006-07-12 21:05 · Score: 1

I should probably mention that I've chosen to use CentOS 4.3, running on dual processor Dell 1650 servers. All of my code is written in C, and compiled with GCC.

This is important because it means I use the 2.6 version of the Linux kernel, and the CentOS/RHEL version of the kernel uses the Native Posix Thread Library (NPTL) by default.
Re:This hits home. What I did. What should I do? by TheRaven64 · 2006-07-13 00:05 · Score: 3, Informative

First, you do not want to be using semaphores for inter-thread synchronisation. Semaphores are IPC primitives, not ITC. The POSIX threading API provides mutexes and condition variables for this kind of thing. If you are doing message passing, then you have a classic producer-consumer situation which is exactly what condition variables were created for.
Each condition variable has a mutex associated with it. The first thing you do, is lock the mutex. If someone else has the mutex locked, then this blocks. You then (atomically) release the mutex and wait on the condition variable. The other thread then acquires the mutex, signals the condition variable, and releases the mutex. The first thread then wakes up with the mutex.
Really though, you should be using an asynchronous model if you want to be able to reason about your code. Difficulty in debugging scales linearly with the number of asynchronous threads/processes and exponentially with a synchronous approach.
To be honest, if you need concurrency, you would be better off using a language like Erlang which is designed for concurrency, rather than trying to shoehorn it into a hacked version of PDP-11 assembler.
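
For readers who have not used condition variables, here is the producer-consumer handshake described above as a minimal sketch, in Python rather than C with pthreads. `threading.Condition` bundles the mutex and the condition variable; the `Mailbox` class name is made up for the example.

```python
import threading
from collections import deque


class Mailbox:
    """A tiny message queue guarded by one condition variable."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()  # carries its own lock

    def put(self, item):
        with self._cond:          # acquire the mutex
            self._items.append(item)
            self._cond.notify()   # wake one waiting consumer

    def get(self):
        with self._cond:          # acquire the mutex
            # wait() releases the mutex atomically while sleeping and
            # re-acquires it before returning; loop to re-check the state.
            while not self._items:
                self._cond.wait()
            return self._items.popleft()
```

The `while` loop around `wait()` matters: a woken consumer must re-check that there really is something to take.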

--
I am TheRaven on Soylent News
Re:This hits home. What I did. What should I do? by zero-one · 2006-07-13 06:08 · Score: 1

One thing I have found useful is moving from having a thread for each request to having an object per request and using asynchronous network and database calls. With this model, when you see anything interesting happen on the network or in the database, you dig out the right object and call the appropriate function. The advantage is you can keep all your request handling code in one place along with all the data required for that request.

It is also good to avoid having a thread per request (particularly if you are handling a lot of short requests) as each thread can take up a lot of resources. For example, on Windows the default stack size for a new thread is 1 MB (I assume it is similar on other operating systems). If you have 200 concurrent requests you have lost 200 MB of memory for very little gain.

Event devices by bytesex · 2006-07-12 20:23 · Score: 2, Insightful

We need proper event devices in systems programming, both between threads and processes. I want to be able to select() (or poll() or epoll() dammit) and wait on not only file-descriptors, but semaphores, system signals, threads waking up out of a sleep() call, and even a crucial variable changing its value. And I want it API-proof (no signals) and statically initializable. And tomorrow. And a pony.

Seriously though, I'm not sure what methods the WIN32 API has, under the hood, of its event waiting calls, but they have a good side to them. For any device that has a potentially blocking system call associated with it, I can define an event that I can wait for. Of course the rest of the whole interface is clunky as hell, but why hasn't anyone come up with something similar in *IX ?

--
Religion is what happens when nature strikes and groupthink goes wrong.

Re:Event devices by ziggyboy · 2006-07-12 21:34 · Score: 1

I totally agree. But wouldn't a pthread mutex+condition variable suffice? pthread_cond_wait() and pthread_cond_signal() do the sleep-wake up thingamajig you're after.
Re:Event devices by TheRaven64 · 2006-07-13 00:01 · Score: 1

So, really, what you want is FreeBSD (and now NetBSD)'s kevent API?

--
I am TheRaven on Soylent News

It depends on software and hardware setup. by master_p · 2006-07-12 20:46 · Score: 2, Informative

If your software is destined to serve few clients and your hardware has a single core CPU, then mono-process is better: easier to debug, easier to change etc. You may use threads for long computations.

If your software is destined to serve few clients and your hardware has more than one CPU core, and your services are not I/O bound, then performance would increase if your O/S can dispatch threads to different cores. If your services are I/O bound, then increased performance from threading depends on O/S and hardware architecture (in other words, how fast I/O can be multiplexed).

If your software is destined to serve thousands of clients, you need clustering: a few thousand machines that can process requests plus a few others to dispatch requests to the cluster. I actually have no idea how this is done though, so take this advice lightly. In this case, your software is going to be multiprocess/multithreaded anyway.

Re:Consider Fault Tolerance & Thread Safety To by bytesex · 2006-07-12 20:51 · Score: 1

>> In a large company, there are IT standards, and one of the standards at my company is that applications shall never, ever, ever fork(), even if running on a large dedicated machine.

May I ask - Gods, man ! Why oh why ?

--
Religion is what happens when nature strikes and groupthink goes wrong.

Whatever fits the reqs by ziggyboy · 2006-07-12 21:27 · Score: 3, Interesting

I just happen to be in the middle of the design process of a server for a large telco in Australia. We have decided to use both select() and threads in handling client connections. Clients of the same class/type will be handled each by a thread. Each has its uses, pros and cons, but if you intend on using threads for spawning each client then that's not a very good idea. Pre-created threads would be ok, though. Better than pre-forked processes.

My only complaint with POSIX threads is they do not have a "generic" join function that grabs *any* threads that have exited.
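
One common workaround is to have each thread announce its own exit on a queue, so a supervisor can wait for "whichever finishes first". A sketch of the idea in Python; the helper names `spawn` and `join_any` are invented, and the same pattern can be built in C with a mutex, a condition variable and a list of finished thread ids.

```python
import queue
import threading


def _run_and_report(name, fn, finished):
    """Run the thread body, then announce that this thread is done."""
    try:
        fn()
    finally:
        finished.put(name)


def spawn(name, fn, finished):
    """Start a named thread that reports its exit on the finished queue."""
    t = threading.Thread(target=_run_and_report,
                         args=(name, fn, finished), name=name)
    t.start()
    return t


def join_any(threads, finished):
    """Block until any tracked thread exits, reap it, return its name."""
    name = finished.get()  # sleeps until some thread reports in
    threads.pop(name).join()
    return name
```

The supervisor keeps a dict of name to thread and calls join_any() in a loop, much like a waitpid(-1, ...) loop over child processes.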

Contrary to popular belief by Anonymous Coward · 2006-07-12 22:18 · Score: 0

is another way of saying "I have an unsupported assertion". There are reasons a multi-threaded server could be slower than a multi-process server. The competence of the programmer is one possibility. Particularly in the area of locking or maybe not bothering to use locking w/ multi-process to access the shared memory.

Multi-process is more robust but only if you do not use shared memory or else ignore shared memory corruption problems.

Re:Contrary to popular belief by KidSock · 2006-07-13 05:25 · Score: 1

Don't believe me? Ok. Andrew Tridgell from the Samba team has some experience with this. Here's what he has to say about the topic:

http://lists.samba.org/archive/samba-technical/2004-December/038301.html

Re:Consider Fault Tolerance & Thread Safety To by Anonymous Coward · 2006-07-12 23:49 · Score: 1, Informative

A highly multithreaded server, where threads are performing complex and/or memory demanding tasks, will be susceptible to complete failure of all running jobs on all threads, if just a single thread SEGFAULTs.

In Windoze and OS/2 (am I showing my age here :) it is possible to trap these types of exceptions on a per thread basis. You can then create a "manager" that does effectively what you had in the multi-process scenario. The exception handling code does whatever cleanup it can, and then triggers some action that will cause a thread to be spun up if an existing one bites the dust. Works well, though the biggest drawback is that the OS does a lot of cleanup work for you if you're in a process that doesn't occur when you're in a thread. Therefore resources that the thread may have locked/opened won't be cleaned up; you have to track it all yourself. In both OS's, if these were kernel type resources then you'd be screwed, but this was not very common.

Threaded code must be thread safe. Static variables, shared data structures, and factories all previously accessed through a singleton must now be protected with guard functions and semaphores. Race conditions need to be considered. Design for this up front. It will be much harder to add it later.

Agree about the designing up front part. Successful multithreaded coding is all about planning and forethought. You have to understand that debugging after the fact is a complete nightmare scenario given a highly multithreaded app (or even a moderately multithreaded app for that matter). One comment about your above comment though: if you are considering going from a multiprocess to a multithreaded app, you should be careful of going overboard and taking advantage of those static variables/shared data/factories. In an MP app, you pay the performance hit so you try to minimize any sharing; you should keep that approach in an MT app as well and not fall into the trap of thinking that the access is now somehow "free".
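The "guard functions" idea from the quoted paragraph can be sketched in a few lines. This is only an illustration in Python rather than the C/C++ being discussed; HitCounter and hammer are made-up names, and the point is simply that every access to the shared value goes through one lock that is designed in from the start.

```python
import threading

class HitCounter:
    """Shared state that is only ever touched while holding its lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._hits = 0

    def record(self):
        # The read-modify-write is a race unless it is serialized.
        with self._lock:
            self._hits += 1

    def total(self):
        with self._lock:
            return self._hits

def hammer(counter, num_threads=8, per_thread=10_000):
    """Hit the counter from several threads at once and return the final total."""
    def work():
        for _ in range(per_thread):
            counter.record()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.total()
```

Every call to record() pays for the lock, which is the reply's point: sharing is not free just because the threads live in one address space.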

Overall an interesting post, thanks.

Having just written a multithreaded server... by pieterh · 2006-07-13 00:09 · Score: 4, Informative

My company designed high-performance mono-process servers (portable ones too) starting in 1995, using event-driven virtual threads and state-machine frameworks. Very elegant, very fast, and really easy programming. The Xitami web server was one example - I remember seeing a Win95 system with Xitami survive a slashdotting (it was serving static pages but that was still impressive.)

We worked in C, because we needed guaranteed low latencies.

In 2004 we decided to rebuild these frameworks to handle OS multithreading. The reason was that on a single CPU we could not get the performance we needed, and the choice was either to use clusters, or multithreading.

We continued to work in C. C and C++ are really nasty for multithreading because the languages have zero support for concurrency. You need to handle everything yourself, and most threading errors are extremely hard to detect.

It cost us about 10 times more to write our software as multithreaded code than using virtualised threads and we had to build whole reference management frameworks to ensure that threads could share data safely.

We did keep virtual threading, in fact, but virtual threads get handled by a pool of OS threads. Using 1 OS thread per connection is not scalable beyond a few hundred threads. Modern Linux kernels handle lots of threads but we also target Solaris, and Windows with the same code. So we use two virtual threads per connection, for full-duplex traffic, and we design most of the major server components as threaded objects, which are asynchronous event-driven objects.
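As a rough illustration of virtual threads being serviced by a small pool of OS threads, here is a toy sketch in Python. It is not the framework described above: virtual_thread and run_on_pool are hypothetical names, each virtual thread is just a generator that yields wherever it would otherwise block, and the ready queue stands in for a real event dispatcher.

```python
import queue
import threading

def virtual_thread(name, steps, log):
    """A 'virtual thread': yields at every point where it would block."""
    for i in range(steps):
        log.append((name, i))  # one step of work
        yield

def run_on_pool(tasks, num_os_threads=4):
    """Run many virtual threads to completion on a few OS threads."""
    ready = queue.Queue()
    for task in tasks:
        ready.put(task)

    def worker():
        while True:
            task = ready.get()
            if task is None:          # shutdown sentinel
                return
            try:
                next(task)            # run until the next yield
            except StopIteration:
                pass                  # this virtual thread has finished
            else:
                ready.put(task)       # still has work: back on the ready queue
            finally:
                ready.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_os_threads)]
    for t in threads:
        t.start()
    ready.join()                      # wait until no runnable task remains
    for _ in threads:
        ready.put(None)
    for t in threads:
        t.join()
```

A virtual thread sits in the queue at most once at a time, so no two OS threads ever run the same one concurrently; the number of OS threads stays fixed however many connections there are.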

Doing multithreading in C is a *huge* amount of work. C++ has frameworks like ACE that help a lot.

But there is a performance gain. Our software is a messaging server (implementing the AMQP draft standard). We maxed out at around 55,000 messages per second using a pure virtual-threaded model. Very efficient code. On a single CPU the multithreaded code hits 35,000 messages per second. With two CPUs we're back at 55k, and with 4 dual-core Opterons we're at 120k-150k and higher. (Our software runs a massive trading application that processes 1.5bn messages per day). We still need to improve some of the low-level locking functions to use lock-free mechanisms, and we max out a gigabit network. It is difficult to find machines powerful enough to really stress test the software.

Without very robust frameworks, I'd never attempt such a project. As it was, we paid a lot for the extra performance. Our frameworks will eventually be released as free software, along with the middleware server.

Interestingly, a very similar application written in Java 1.5 and using the BEA runtime gets similar performance to ours. Java's threading is so good that I'd be hesitant to choose C on the basis of performance again. I'm not sure whether ACE can reach the levels of performance we need; 100k messages per second is extreme.

Other questions that are very important to ask:

- The number of clients you expect to connect at once. If it's less than 500 you can probably use one or two OS threads per connection. If it's more you need to virtualise connections or share your OS threads.
- The footprint. If you don't care, then I'd advise using Java. If you want a native Linux service, consider C++ and ACE. If you really want to write multithreaded C code, and don't have a full toolkit, consider seeing a doctor.

When it comes to the future, clearly multiple cores are the way we're heading. This was clear two years ago, and was the main reason we bit the bullet and chose to write our software multithreaded rather than using a clustering model. It seemed clear to me that within a decade, systems would have 32, 64, 128 cores, and software that could take advantage of this would survive for longer. Clustering is not as powerful an abstraction as multithreading.

--
My blog

Programmers by SpeedBump0619 · 2006-07-13 02:18 · Score: 2, Insightful

While most of the discussion here covers the technical aspects of the question, I think that the biggest factor here is being overlooked. I've been employed now in 5 different environments, 4 of which used threading for parallelism. For some reason many programmers just have a problem wrapping their minds around shared memory problems. The number of times I have reviewed code changes and seen someone not lock a mutex, or forget to release a semaphore, is mind-boggling.

As far as I'm concerned threading should only be considered when you have either:
1) a high tolerance for extremely difficult debugging
2) a customer who likes it (pays you more) when your program crashes
3) a team of programmers experienced with threading (at least 75%...if not 100% you *must* review code changes)

In modern Unix systems most people won't be able to really tell the speed difference between create_thread and fork. If you screw up pushing data through a pipe it just doesn't come out the other end, but if you screw up using shared memory you won't know until something unexpected (and seemingly unrelated) happens.
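To make the pipe half of that comparison concrete, here is a minimal Unix-only sketch in Python. compute_in_child is a hypothetical helper, not anything from the post: the child does the work and writes the result into a pipe, and if the child fails the parent simply reads nothing back.

```python
import os

def compute_in_child(func, arg):
    """Run func(arg) in a forked child and return its repr, or None on failure."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: keep only the write end, send the result, and leave.
        os.close(read_fd)
        code = 1
        try:
            os.write(write_fd, repr(func(arg)).encode())
            code = 0
        finally:
            os._exit(code)  # never fall back into the parent's code
    # Parent: keep only the read end and read until the child closes its end.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()
    os.waitpid(pid, 0)
    return data.decode() if data else None
```

A broken child shows up as a missing answer at the read end, which is an easy failure to notice compared with a corrupted structure in shared memory.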

If you can build it that way... by mengel · 2006-07-13 02:47 · Score: 2, Insightful

... the fastest implementation is always the lean, event-loop (e.g. while/select loop) version that never blocks. This is because when a threaded implementation works properly on a single CPU, that's what it ends up doing anyway -- code in each thread runs, until it does something that blocks that thread, and another thread wakes up... If you split out those same code segments into a common event loop, you get rid of the thread context-switch overhead. You can even break up long event-handlers with "Idle events" -- you do part of the work, and send yourself an "idle event" to remind yourself to do the next chunk when you don't have new work coming in. This is generally less overhead than a thread time-slicing context switch.
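A minimal sketch of that while/select loop with idle events, in Python for brevity. Everything here is illustrative: run_demo, on_readable and sum_chunk are invented names, and a socketpair stands in for a real client connection.

```python
import select
import socket
from collections import deque

def run_demo():
    client, server = socket.socketpair()
    server.setblocking(False)
    idle_events = deque()
    state = {"echoed": False, "total": 0, "chunks_left": 3}

    def on_readable(sock):
        # I/O handler: read what arrived and answer without blocking.
        data = sock.recv(4096)
        sock.send(data.upper())
        state["echoed"] = True

    def sum_chunk():
        # One slice of a long job; re-post ourselves as an idle event if unfinished.
        state["total"] += 10
        state["chunks_left"] -= 1
        if state["chunks_left"]:
            idle_events.append(sum_chunk)

    handlers = {server: on_readable}
    idle_events.append(sum_chunk)
    client.sendall(b"ping")

    while not state["echoed"] or idle_events:
        # Poll without waiting if idle work is pending, otherwise wait briefly.
        timeout = 0 if idle_events else 0.1
        readable, _, _ = select.select(list(handlers), [], [], timeout)
        for sock in readable:
            handlers[sock](sock)
        if not readable and idle_events:
            idle_events.popleft()()   # no new I/O: do the next chunk

    reply = client.recv(4096)
    client.close()
    server.close()
    return reply, state["total"]
```

New I/O always wins over idle work, so the long job only advances when nothing else is waiting, and there is no thread context switch anywhere in the loop.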

If you then want to take advantage of multiple CPU's efficiently, you need to look at the event handler code in the above implementation, and see if you have race conditions if two branches are run in parallel. If not, you can just fork a couple of processes and/or threads to tag-team the same event stream, and the code just cruises and uses more CPUs.

If there are race conditions between branches, you take those branches that need to share a resource, make a single separate thread/process for those handlers, and forward the required events to that thread/process so they get handled sequentially.

This style of code is often harder to design than a more obvious multiple-threads type implementation, but it is faster (and often easier to maintain) when properly done. In either case, the source of obscure bugs is race conditions that are overlooked.

--
- "History shows again and again how nature points out the folly of men" -- Blue Oyster Cult, 'Godzilla'

c10k updated a bit... by Anonymous Coward · 2006-07-13 03:17 · Score: 0

I've updated http://kegel.com/c10k.html to make the "... for obvious reasons" section clearer, and I now point to Cory's library, as well as to an updated version of Polyakov's kevent and naio pages.

Re:Threading --- hype, more hype and extra hyped h by BigCheese · 2006-07-13 06:02 · Score: 1

I've run into that too. A lot of people seem to think that threads are free when they aren't. They use memory and add context switches.

Which reminds me. Why the hell does the new Yahoo! Messenger use 25 threads? There's no good reason for that.

--
The obscure we see eventually. The completely obvious, it seems, takes longer. - Edward R. Murrow

Re:Consider Fault Tolerance & Thread Safety To by kurtdg · 2006-07-13 10:24 · Score: 1

Since you didn't say what kind of server you're building, I'm going to assume:
- ...
- the language is C, C#, or C++, and not Java (process-based servers in Java?)

Standard API package java.nio, serving you since 2002.

Introduced in J2SE 1.4 (Merlin) after an almost unanimous vote in the Java democracy: http://jcp.org/en/jsr/detail?id=51

threads are not nice to the CPU by r00t · 2006-07-13 18:16 · Score: 1

In general, every thread will need a stack. The CPU does not have an infinite cache, nor does it have infinitely many TLB slots. You probably get something like 128 TLB slots, to be used for more than just stacks. Right there you have a thread limit; exceed it and your CPU goes really slow.
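A back-of-the-envelope version of that limit, as a Python sketch. The 128-entry figure is the one quoted above; the 4 KiB page size is an assumption (it is not stated in the comment), and the model ignores code and heap pages, multi-level TLBs and large pages, so treat the numbers as an upper bound only.

```python
TLB_ENTRIES = 128        # figure quoted in the comment
PAGE_SIZE = 4 * 1024     # assumption: 4 KiB pages

def tlb_reach_bytes(entries=TLB_ENTRIES, page_size=PAGE_SIZE):
    """Memory that can be mapped by the TLB at once."""
    return entries * page_size

def max_hot_threads(pages_per_stack=1, entries=TLB_ENTRIES):
    """Upper bound on threads whose active stack pages all fit in the TLB."""
    return entries // pages_per_stack
```

Under these assumptions the TLB covers 512 KiB in total, and if each thread touches two stack pages the bound is already down to 64 threads before any code or heap pages are counted.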

OT, yes, but... by tmasssey · 2006-07-14 12:20 · Score: 1

To quote Michael Abrash: Profile before you optimize.

I had a very similar experience with a client in 1997 or so. They had an old file server: dual Pentium 133's running Windows NT 3.51 with 6 hot-swap SCSI drives in RAID-5. It would take approximately 45 minutes for a designer to open a single file (about 1GB in size). Their previous consultants told them that their server was hopelessly out-of-date (which it was) and that they needed a new file server: $15,000, please.

I came and did some profiling on their server. Their CPU utilization was about 30% and even their disk utilization was relatively low: they weren't even using the hardware they had! Turns out that they had two hubs: a high-end 10MBit switch with 2 100MBit uplinks, and a 12-port 100Mbit *hub*. This was at a point where 100Mbit switching was very expensive, and Gigabit was fiber-only.

So, we replaced the hub with a Cisco 2924M-XL switch. Instantly the designers were getting a 6x improvement, for only $3500 or so. Further profiling on the server showed that we had *still* not maxed out disk performance: the 6 drives could *easily* pump out the 10MB/s necessary to saturate a full 100Mbit connection, so we decided to go Gigabit by adding a module on the Cisco and a Gigabit . That alone tripled the performance *again*, with their exact same hopelessly old server! And we had spent less than $5000.

If the previous consultants had replaced the old server, my estimation is that things would have run at best about 15% faster: that's all that the old server was adding to the process. When I was done, we had spent 1/3 as much, and had cut the time to 1/18! A process that took 45 minutes now only took 2-3 minutes! And this was for a process (opening or saving a file) that they did 5-10 times a day! We estimated that the upgrade paid for itself in the first week! :)
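The arithmetic behind the 1/18 figure can be checked in a few lines of Python. The inputs are the numbers from the story (45 minutes, a 6x gain from the switch, a further 3x from Gigabit); time_after is just an illustrative helper name.

```python
def time_after(baseline_minutes, speedups):
    """Apply successive speedup factors to a baseline duration."""
    minutes = baseline_minutes
    for factor in speedups:
        minutes /= factor
    return minutes

after_switch = time_after(45, [6])       # 100Mbit hub replaced by a switch
after_gigabit = time_after(45, [6, 3])   # then the Gigabit uplink
```

The two factors multiply to 18, so 45 minutes becomes 7.5 after the switch and 2.5 after Gigabit, which matches the "2-3 minutes" quoted above.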

And the very best part was, 3 months after we finished that, they bought the new server, anyway. Once they saw the tremendous performance improvement, they could *justify* the expense: they were getting so much more work done! (And I ended up putting in a more capable, IBM-brand server for $10,000!).

So, long story short: profile before you optimize.

Oh, and *always* use SCSI! :)

--
Linux IT Consulting and Domino Development in Michigan

Don't pick any - pick them all with Flux by Ristretto · 2006-07-14 15:21 · Score: 1

We recently developed a programming language system called Flux that we presented at USENIX this June that addresses exactly this problem. Flux is a domain-specific language for building server applications from serial C and C++ components. The compiler then generates event-driven or multithreaded code -- it's just a compiler flag and a different runtime system. You as a programmer do not deal with the nitty-gritty details of managing concurrency. Moreover, Flux can optionally generate a simulator that lets you load-test your app before deployment in order to isolate bottlenecks. We've built four servers so far in Flux - including the web server hosting Flux, and a BitTorrent seed - and its performance matches or exceeds that of hand-tuned high-performance servers. It's still a prototype system, but worth checking out.

-- Emery Berger
-- Assistant Professor
-- Department of Computer Science
-- University of Massachusetts Amherst

Re:Consider Fault Tolerance & Thread Safety To by Anonymous Coward · 2006-07-15 12:55 · Score: 0

note that on proper threading impls your app doesn't have to die when one thread crashes, such impls include BeOS and Windows.

I believe QNX and Solaris might also be among these impls, but it's been a while and I don't feel like checking Mac OS X's.

basically what happens is a thread crashes, and a crash handler runs for that thread, in beos you get a dialog saying a thread crashed, and all the other threads continue running. In windows the default handler tends to halt the process, but if you attach a debugger you can let the other threads continue if you feel like it.

Now, of course, if a thread dies, you're probably in a bit of a pickle since it probably has /some/ resource that you would like to talk to, and you won't know that you can't talk to it. Ideally when enough threads die, you'd have started a second process and arranged for new requests to go to it, you finish off whatever things you can do and then trigger a core dump.

now of course, if the problem you had related to some interaction between threads, sometimes you really need something that the time memory dump won't show. there are now some tools which will try to recognize and complain about thread unsafe patterns, so you should probably use them, but yes, programming with threads does involve being more careful. but working with a big database as you described shouldn't be done carelessly no matter what path you take.

i'm glad things worked out in the end. i'm sorry about the price.

Re:Consider Fault Tolerance & Thread Safety To by Dolda2000 · 2006-07-18 22:37 · Score: 1

In Windoze and OS/2 (am I showing my age here :) it is possible to trap these types of exceptions on a per-thread basis. You can then create a "manager" that does effectively what you had in the multi-process scenario. The exception handling code does whatever cleanup it can, and then triggers some action that will cause a thread to be spun up if an existing one bites the dust.

However, if the dying thread has run amok over the heap already, that's not going to help much. Instead, you are likely to get other mysterious failures on other threads 5 hours later with no debugging information. Neither do you get a core file of the dying thread for post mortem debugging.
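For contrast, here is roughly what the multi-process version of that isolation looks like, as a Unix-only Python sketch. run_isolated and crashing_job are hypothetical names, and the crash is simulated by the child sending itself SIGSEGV rather than by a real wild pointer.

```python
import os
import signal

def run_isolated(job):
    """Run job() in a forked child and report how the child ended."""
    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            job()
            code = 0
        finally:
            os._exit(code)  # the child never returns into the parent's code
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return ("signal", os.WTERMSIG(status))
    return ("exit", os.WEXITSTATUS(status))

def crashing_job():
    # Simulated crash: deliver SIGSEGV to this process only.
    os.kill(os.getpid(), signal.SIGSEGV)
```

Whatever the child did to its own heap dies with it; the parent only sees the wait status and can decide whether to start a replacement.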

Mono-process? by Anonymous Coward · 2006-07-21 09:34 · Score: 0

Why would you want a single threaded server? Dispatch tasks to pooled threads. This works great for just about any application.
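A minimal sketch of dispatching to pooled threads, using Python's standard thread pool. handle_request and serve_all are placeholder names, and the uppercase "work" stands in for whatever a real request handler would do.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_request(payload):
    """Stand-in for real per-request work."""
    return payload.upper()

def serve_all(requests, pool_size=4):
    """Dispatch every request to a fixed-size pool; results keep request order."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(handle_request, requests))
```

The pool size is fixed up front, so the thread count does not grow with the number of requests; any state shared between handlers still needs the locking discussed elsewhere in the thread.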
