High Performance Network Applications
An Anonymous Coward sent in this: "An article over at SysAdmin magazine seeks the truth while comparing network application performance under RH Linux, Solaris x86, FreeBSD 4.2, and Windows 2000. I'm a little suspicious of the writer's results, but you be the judge."
They say:
> At Lyris Technologies, we write high-performance, cross-platform,
> email-based server applications. Better application performance is
> a competitive advantage, so we spend a great deal of time tuning all
> aspects of an application's performance profile (software, hardware,
> and operating system). Our customers frequently ask us which operating
> system is best for running our software. Or, if they have already chosen
> an OS, they ask how to make their system run our applications faster.
> Additionally, we run a hosting (outsourcing) division and want to reduce
> our hardware cost while providing the best performance for our hosting
> customers.
What crap! They're claiming to be experts! Ha!
They just don't know how to tune Solaris or FreeBSD properly.
Results would be completely different if they had tuned them well.
Solaris Tuning Guide
1) Apply the latest recommended patches from http://sunsolve.sun.com
2) Add the following to the end of /etc/system:
* Raise TCP connection buffer size
set tcp:tcp_conn_hash_size=262144
* Increase various kernel buffers
set maxusers=2048
* Set hard limit on file descriptors
set rlim_fd_max=1024
* Set soft limit on file descriptors
set rlim_fd_cur=1024
* Increase directory name lookup cache
set ncsize=100000
* Should be the same as setting above
set ufs_ninode=100000
* Enable priority paging
set priority_paging=1
(These settings are based on information taken from:
http://docs.iplanet.com/docs/manuals/messaging/nms415/patch1/TuningGuide.html )
3) The following should be at the bottom of /etc/init.d/inetinit:
# TCP stack tuning
# default is 7200000
ndd -set /dev/tcp tcp_keepalive_interval 30000
# default is 240000
# change to "tcp_close_wait_interval" on Solaris 2.6
ndd -set /dev/tcp tcp_time_wait_interval 15000
# default is 128
ndd -set /dev/tcp tcp_conn_req_max_q 1024
# default is 1024
ndd -set /dev/tcp tcp_conn_req_max_q0 1024
# default is 8192
ndd -set /dev/tcp tcp_xmit_hiwat 32768
# default is 8192
ndd -set /dev/tcp tcp_recv_hiwat 32768
4) Speed up filesystem access under Solaris 2.7 and later.
Add logging to filesystem mount options in /etc/vfstab, like this:
/dev/dsk/c0t1d0s7 /dev/rdsk/c0t1d0s7 /opt ufs 2 yes logging,noatime
I have added noatime - this is another setting that might help
on a very busy filesystem, though not as much as logging.
FreeBSD Tuning Guide
Recompile the kernel with an increased MAXUSERS (a good number
to start with is 256) and NMBCLUSTERS (I use 10000; see netstat -m
under load to get a number that is good for you).
You might want to play with "options HZ=1000".
Add this to /etc/sysctl.conf:
kern.maxfiles=65536
kern.maxfilesperproc=32768
net.inet.tcp.delayed_ack=0
net.local.stream.recvspace=65535
net.local.stream.sendspace=65535
net.inet.tcp.sendspace=65535
net.inet.tcp.recvspace=65535
Turn on softupdates on all filesystems
using tunefs -n enable (noatime might help as well).
Vadim Mikhailov
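For anyone who wants to confirm that descriptor limits like rlim_fd_cur and rlim_fd_max actually reached a running process, here is a minimal sketch using Python's standard resource module (Unix only). The helper names fd_limits and raise_soft_limit are made up for illustration; nothing here comes from the article or from the tuning guide above.

```python
import resource


def fd_limits():
    """Return the (soft, hard) per-process file descriptor limits."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Raise the soft limit toward target, never past the hard limit.

    Hypothetical helper: an unprivileged process may move its soft limit
    up to the hard limit, but only root can raise the hard limit itself.
    """
    soft, hard = fd_limits()
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return fd_limits()[0]


if __name__ == "__main__":
    print("before:", fd_limits())
    print("soft limit now:", raise_soft_limit(1024))
```

If the soft limit printed here is still the OS default after a reboot, the kernel-level setting did not take effect for that process.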
You're absolutely right. Their "benchmark" is perfectly valid, for their product running on a naively tuned operating system. But only a neophyte would put an out-of-the-box OS -- whether Linux, Solaris, Windows, or BSD -- into production as a high-performance network server. All the complaining boils down to two things:
The FreeBSD folks are especially upset because the article states that the OS was logging resource failures but the testers still didn't perform any tuning. That's an amazing level of incompetence to display in a magazine which is supposed to inform system administrators.
Now do you see what all the noise is about?
Agreed -- it's been a long time since I've seen a "benchmark" as poor as this one. But I don't think Windows was treated any more poorly than the other OSes. It wasn't a fair test of any of them.
The "tuning" for the Unix systems consisted in bumping up the maximum number of file descriptors. That's it. The FreeBSD system in particular was left completely mistuned and clearly running out of socket resources -- they report that it was logging errors but seem entirely ignorant of what those errors were (beyond their being load-related) and how to correct them.
Polling is hardly the best system interface for multiplexing TCP connections on either Windows or FreeBSD. As you mention, completion ports are best for Windows. Kqueue is best for FreeBSD. It just happens that polling is used in the crappy commercial SPAM program they "benchmarked". (All the OSes support scatter/gather, BTW, so you can't claim Windows was treated unfairly by its omission.)
None of the systems were tested in a way that shows their actual capabilities. The article is just a thinly disguised commercial for a (barely-)cross-platform "bulk email" product.
You misunderstand 'poll' completely. poll asks the OS to suspend your process until one of the indicated events happens, then you get to go respond to it. It's essentially the same thing.
Say, for example, that you're dumping data into a socket. Under Unix, you write to the socket until the OS tells you that the socket buffer is full by setting the socket to non-blocking and writing until write returns EAGAIN as an error. Then you put the ability to write to that socket on the list of OS events you're interested in. Then, you go do whatever else it is you have to do. After you get done servicing everything you can service, you call poll and it blocks your process (possibly running others) until one of the indicated events happens and there's something else to service. Same basic paradigm.
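The sequence above can be sketched in a few lines of Python on a Unix system. This is only an illustration under stated assumptions: a local socketpair stands in for a real network connection, the 4096-byte chunk size is arbitrary, and the function names are invented for the example.

```python
import select
import socket

CHUNK = b"x" * 4096  # arbitrary payload size, chosen for illustration


def write_until_full(sock):
    """Send on a non-blocking socket until the kernel reports EAGAIN."""
    sent = 0
    while True:
        try:
            sent += sock.send(CHUNK)
        except BlockingIOError:  # Python's mapping of EAGAIN/EWOULDBLOCK
            return sent


def drain(sock, nbytes):
    """Read nbytes from the peer so the writer's buffer empties again."""
    received = 0
    while received < nbytes:
        received += len(sock.recv(65536))
    return received


def demo():
    writer, reader = socket.socketpair()
    writer.setblocking(False)
    sent = write_until_full(writer)          # write until EAGAIN
    poller = select.poll()
    poller.register(writer, select.POLLOUT)  # ask to be told when writable
    received = drain(reader, sent)           # the peer catches up meanwhile
    events = poller.poll(1000)               # sleep until an event (or 1 s)
    writer.close()
    reader.close()
    return sent, received, events


if __name__ == "__main__":
    print(demo())
```

The process does no work inside poll; the kernel wakes it only when one of the registered descriptors has something to service.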
Need a Python, C++, Unix, Linux develop
Also, VirtualAlloc there sounds an awful lot like 'mmap'. Again, same basic idea, and Microsoft does it completely differently.
I know a fair amount about the insides of NT, and most design choices they made that are different than Unix's are worse.
Here are just two:
I'm sure Linux will talk just fine to Linux, but other platforms might not be tuned the same. (2.4 kernels were having trouble because of this recently. Linux implemented some feature that lots of routers didn't, and performance was hosed sometimes.)
You don't seem to understand ECN. ECN is now (as of June 12) an internet standard. It will improve the performance of the Internet by allowing ECN-aware stacks to note congestion and respond appropriately instead of waiting for packets to fail to be acked and backing off on the transmission speed. (Ever got a 'stalled' message loading a /. page? ECN is supposed to help avoid that.)
Buggy routers responded incorrectly to ECN packets by terminating the connection. It appears as if the other computer isn't even on the net. Cisco has released bug fixes to correct this bug. They have not been applied by all of the admins.
Yes, Linux 2.4 shipped with ECN enabled. The distribution packagers generally (all?) included a command in the start-up scripts to disable the feature.
Because TCP/IP is a standard, there should not be performance differences where a stack performs better when speaking to another stack of the same design. TCP/IP should be completely interoperable.
I have discovered a truly marvelous sig, unfortunately the sig limit is too small to contain i
While your point that this benchmark is somewhat flawed is correct, you also point out a large problem with Windows:
You are forced to use proprietary MS-only extensions rather than straight, standardized POSIX calls to achieve the best performance. That means you have to suffer proprietary lock-in if you want to code high performance network applications for Windows.
I think this is deliberate: there is no reason why calls like malloc, creat, mmap, poll, whatever, couldn't have been tuned to get similar performance to the Windows specific VirtualAlloc, CreateFile, etc. Microsoft wants you to trade off portability for speed.
> I think this is deliberate: there is no reason why calls like malloc, creat, mmap, poll, whatever, couldn't have been tuned to get similar performance to the Windows specific VirtualAlloc, CreateFile, etc.
... apart from the fact that they expose different paradigms entirely?
Malloc - heap based allocation
VirtualAlloc - allocates entire pages from the VMM. Allows you to reserve or commit pages when and as you need them.
fopen - opens a file handle
CreateFile - Allows you to open a file handle, specifying buffers to use, etc etc etc.
poll - you sit there waiting and doing nothing most of the time because you're asking all your connections "are we there yet?"
CompletionPorts - the OS comes back to you when it's done, and tells you that it's finished. You can now use those spare cycles doing something else - like another 1000 network connections.
Simon
Coming soon - pyrogyra
Nice! So in other words, they used straight BSD sockets for their
implementation - which is NOT the way to get performance from Windows. You
need to use:
1. Asynchronous, Event based socket handling.
2. Completion ports.
3. Scatter/Gather buffering.
Polling is lousy no matter what way you do it. You'll lose most of your
performance spent going round a small loop.
Similarly you can infer that they used straight malloc() for their memory
handling, and most likely file handling - again very lousy
performance-wise on windows compared to the alternatives, such as
VirtualAlloc, CreateFile(), scatter-gather file handling and more.
As for the second test, we can guess (from their comments) that they're
using straight C++/C file operations under windows instead of tuning them to
the architecture, so of course performance is going to be lousy -- they're
benchmarking Microsoft's C runtime implementation, nothing more, nothing
less.
Also note that:
1. They don't provide details of which compiler they're using.
2. They don't provide details of the actual benchmark code for test 2.
3. They only tuned the Linux, FreeBSD and Solaris setups -- they should have
tuned Win2k server as well.
Sheesh. Talk about a crappy way to benchmark.
Simon
Coming soon - pyrogyra
Anyone else notice the heavy concentration in that article on the efficiency of mailing out large numbers of email messages? Now, I'm certain there are many MANY legitimate reasons why someone would have a "test list" of 200,000 email addresses, it's just that I can't seem to think of any at the moment.
-Restil
Play with my webcams and lights here
I'm not sure. It looks like they've tried to use the same methods on 4 different operating systems. This is something that is doomed to failure in a benchmark situation as there are different programming paradigms for the different systems.
A much better benchmark would have been simply comparing IIS to Apache or Tux. Oh yeah. That's been done. Tux won. Hehe.
Fear: When you see B8 00 4C CD 21 and know what it means
True. There's another way that's also very fast in NT that would be really difficult to emulate on Unix (probably because it wouldn't be fast on Unix):
To set this up you treat the sockets as file handles and use ReadFileEx() and WriteFileEx() with the lpCompletionRoutine parameter set to point to a function that the OS should call directly when the I/O is done. When you are blocked waiting for activity, put the thread in an alertable wait state using one of the WaitForXXXObjectEx() functions and the completion routine you specified will be called by magic (actually via an Asynchronous Procedure Call or APC, but close enough to magic) when the I/O has finished.
This works very quickly on NT because it mirrors the way the underlying kernel and device driver stack works. Basically the I/O completion can come straight up from the driver routine into user space with a minimal delay and minimal number of context switches. The second advantage is you don't have to open event handles for every I/O you have outstanding, and so you don't run into the limit of waiting on 64 objects at a time.
The only drawback to this method (if you can call it a drawback) is that I/O that is initiated on one thread is always sent back to that thread, so you have to run one thread per CPU and round robin them.
The closest thing on Unix to this sort of behaviour is signals, but signals and multithreaded code tend not to mix very well.
Just a FYI really, not saying it's good or bad compared to Unix - just another thing to have in your bag of tricks.
Fear: When you see B8 00 4C CD 21 and know what it means
The method used here for programming Windows 2000 is almost certain to guarantee slow results. Assuming he's written his code to use select() or even WaitForSingleObject() then he's significantly slowing down the system.
If you want to write high performance socket applications on Windows you MUST use I/O completion ports (something this article failed to mention at all). Most high load applications I've written using sockets have shown a 50% to 100% improvement in throughput for the same CPU load when switching to I/O completion ports from a traditional (Unix-style) asynchronous I/O model.
I'm not saying in this case that Win2k would beat Linux, just that the tests were skewed by the author's inadequate knowledge of writing high performance code on Windows 2000.
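To show what the "start the I/O, get told when it's done" style looks like without writing Win32 code, here is a hedged sketch using Python's asyncio. It is not the ReadFileEx() or completion-port API itself: on Windows (Python 3.8 and later) the default asyncio event loop is built on I/O completion ports, while on Unix the same code runs on a readiness-based selector loop. The uppercase-echo handler, the loopback address, and the function names are all assumptions made for the example.

```python
import asyncio


async def handle(reader, writer):
    # Each await returns control to the event loop until the I/O finishes.
    data = await reader.read(1024)
    writer.write(data.upper())
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def round_trip(payload):
    # Port 0 lets the OS pick a free port on the loopback interface.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(payload)
        await writer.drain()
        reply = await reader.read()  # read until the server closes
        writer.close()
        await writer.wait_closed()
    return reply


def run(payload=b"hello"):
    return asyncio.run(round_trip(payload))


if __name__ == "__main__":
    print(run())
```

The application code is identical on both platforms; only the mechanism underneath the event loop differs, which is roughly the portability layer the benchmarked product would have needed.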
Fear: When you see B8 00 4C CD 21 and know what it means
I read this a couple of weeks back when a linux-centric friend sent it to me... my main observation: This is obviously a commercial masquerading as a 'test'. When the 'device' being used to do this so called 'benchmark' is a software application written by the testers for something else, there is nothing else to call it. Maybe the title of the article is a bit misleading, the meat clearly says all they are doing is showing which OS they have optimized their application for. They then use that as the FLAWED basis for determining which OS is 'best'? Give me a break.
My own complimentary subscription for presenting at LISA '99 just ran out, but as anyone who's read this journal before can tell you, this article was just written by Joe Admin, and was about on par for the magazine. Even if you haven't read the journal before, you could click on the big "Write For Us" link at the top of the page, and see that "all of our articles are written by readers."
Now, I'm not slamming the magazine! It's a decent piece of work, and actually has some good articles about tricks and tools that help sys admins get their day to day jobs done. But at the same time, it's also subject to some one-sided reviews and some articles take a lot of flak for their controversial positions. Just look at who wrote the article (the original developer of the mail engine) and take it with a grain of salt.
And if you really disagree, write them a counter piece, or at least a letter to the editor pointing out the flaws.
It's clear from their comments that they did not turn on Softupdates on the filesystems when they set up their FreeBSD machine for the testing. It's no wonder that they found disk I/O to be slower on FreeBSD, therefore.
Traditionally, Linux has traded safety for speed in filesystem metadata handling. FreeBSD has always refused to do so, insisting that metadata be updated synchronously. With softupdates, the metadata is cached, but the cache is flushed in the right order. The upshot is that you get the speed and the safety.
In short (too late), I am sure that their opinion of FreeBSD would improve markedly if they would set it up properly.
From what I see, just about every other OS represented has a defender saying exactly the same thing. That doesn't speak well for the thoroughness of the testing. I'll leave it at that.
I was going to read this article and make an informed comment about it. But, because of my laziness to wait forever for it to load, I'm just going to post this summary of comments to come:
Linux users: Linux is better, Windows is unstable.
Win users: Windows is better, Linux is hard.
BSD users: You're both wrong.
Mac users: Hey, look at us. We are pretty.
Top 3: Mac, shut up.
BeOS users: We're better but y'all will never know it.
Bill Gates: All your $$ is belong to me.
---
It means nothing if "A" is fastest, if it runs on a bad OS, cheap commodity hardware or isn't supported. You go with "B" because it DOES.
Fast != correct all the time.
Solaris is much more finely grained in its locking than any of the other OSes mentioned. Because of that, comparisons with other OSes running on one or two CPUs (usually on PCs) do not do Solaris its due justice. Sure, Linux or FreeBSD, which aren't very finely grained in their locking (but are working towards changing that) spend less overhead in locking calls, so they run faster.
But how fast can they run on a 32-cpu machine? Or a 64-cpu machine? According to some public documents I saw, Sun will release a 72-cpu machine this summer. They currently support 64 cpus on their E10000 machines. Solaris is a highly scalable OS. Linux is not. FreeBSD is most certainly not. Windows2000 may like to style itself scalable, but come on, we all know they are dreaming. Maybe scalable to 4 CPUs (if you own Pentium Xeons), and maybe in someone's wet dream it could scale to 16 CPUs or so, but none, I repeat none, of these OSes can scale like Solaris.
Solaris' strength isn't the fact that it's blazing fast on a single CPU, because a lot of tests can show Linux is faster. But Solaris *is* blazing fast on massively parallel machines. Solaris shows time and again an amazing ability to scale performance with the addition of more CPUs. The overhead required to build that scalability into the OS penalizes Solaris on single or dual-cpu machines, and that *must* be taken into account by people.
And don't even talk about 64-bit. Sure, Solaris for Intel is limited to 32-bit address spaces due to the constraints of the CPU architecture on which it runs, but Solaris the OS is built through and through as a 64-bit OS, and Solaris running on UltraSparc hardware supports zillions of bytes of RAM. The new SunFire 6800s can support in the hundreds of gigabytes of RAM.
Can Windows2000 do that? Can Linux do that? Can FreeBSD do that? Really we are talking about different markets here, that's all. You really need to test the OSes in the areas they are designed to operate, and then you'll see who the real champ is.