Boosting Socket Performance on Linux
Cop writes "The Sockets API lets you develop client and server applications that can communicate across a local network or across the world via the Internet. Like any API, you can use the Sockets API in ways that promote high performance -- or inhibit it. This article explores four ways to use the Sockets API to squeeze the greatest performance out your application and to tune the GNU/Linux® environment to achieve the best results."
Some engineers at Berkeley have been looking at this for a while, but haven't gotten much credit for it.
Bell South hear about this.
Judging by the response time from IBM's web server, it looks like they have yet to put their advice into practice.
'aunses'
Did you mean anuses? Or the correct pluralization:
Anii ?
Usage: You are an anus! You and your kin are anii!
Beat 'Em and Eat 'Em
Most time is spent in select()/poll() anyway. And there's sendfile() for web/ftp servers, hey, that saves syscalls!
Want nodelay? use UDP! :-)
Hehehe, go spend your time on serious issues, folks ;-)
I mean really, I think we understand what you mean by just saying Linux.
If we don't fight for ourselves no one will.
This reads like an article from the 90's. This being 2006 and all, I would hope that programmers know how to make effective use of TCP/IP sockets. I wonder if maybe they just yanked an article from 1995 and did a search/replace on s/Windows/GNU Linux/g.
Here is the summary:
The Sockets API lets you develop client and server applications that can communicate across a local network or across the world via the Internet. Like any API, you can use the Sockets API in ways that promote high performance -- or inhibit it. This article explores four ways to use the Sockets API to squeeze the greatest performance out your application and to tune the GNU/Linux® environment to achieve the best results.
Here is the first paragraph of the article:
The Sockets API lets you develop client and server applications that can communicate across a local network or across the world via the Internet. Like any API, you can use the Sockets API in ways that promote high performance -- or inhibit it. This article explores four ways to use the Sockets API to squeeze the greatest performance out your application and to tune the GNU/Linux® environment to achieve the best results.
Unless Cop (the submitter) is actually M. Tim Jones (the article author), Cop didn't write a darn thing.
Didn't we just have this discussion on
I want to drag this out as long as possible. Bring me my protractor.
Are you a /. editor? If your suggestion appeared on the front page it would be a dupe. See here.
Haha quality post! Made me chuckle anyway.
http://www.contactmusic.com/new/film.nsf/reviews/
Tuning socket parameters is great and all, but the real performance problem with socket IO has to do with using select and poll. There are high-performance alternatives (which admittedly tend to vary from OS to OS) that are so far superior that I wouldn't even consider the default methods unless complete code portability were a crucial factor.
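For illustration, here is a minimal Python sketch of readiness polling that does not hard-code select or poll. It assumes the standard-library `selectors` module, whose `DefaultSelector` picks the most efficient mechanism the platform provides (epoll on Linux, kqueue on the BSDs). The helper name `wait_readable` is made up for this example.

```python
import selectors
import socket


def wait_readable(socks, timeout=1.0):
    """Return the sockets in `socks` that have data ready to read."""
    # DefaultSelector resolves to epoll, kqueue, etc. where available and
    # falls back to poll or select otherwise.
    with selectors.DefaultSelector() as sel:
        for s in socks:
            sel.register(s, selectors.EVENT_READ)
        return [key.fileobj for key, _events in sel.select(timeout)]
```

A real server would keep one long-lived selector and register each connection once. Building a selector per call, as this sketch does, throws away most of the benefit and is only meant to show the API shape.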
It's funny you should mention this. I was thinking of the class libraries or frameworks, if you will, included with Java, MFC (if it's still used these days), Visual Age, and so on. Does this mean, and are you saying, that the only way to get the best performance from TCP/IP is to roll your own?
Yikes!
Check his back catalogue, he's been serious about nothing else since the dawn of time...
going from Socket 7 to Socket 462 to Socket 478 boosted it quite a damn bit over the years.
Bonus tip: Experimentation with Samba demonstrates that disabling the Nagle algorithm results in almost doubling the read performance when reading from a Samba drive on a Microsoft® Windows® server.
What is he saying here? Is it faster when using Samba to access a Windows server? Perhaps he was talking about using Samba as a Windows server and making it serve faster with this technique? What it actually says, about running Samba on a Windows server, makes the least sense of all!
Why the corporate-style circle-R? Is this a subtle bit of sarcasm or trollery targeting RMS's followers?
...on developerWorks, not the least of which, if I may say so, is the GLib tutorial I wrote for them this past summer. If you want to learn how to use various GLib collections and utilities - lists, tables, trees, quarks, relations, and all that - check it out. You can even download a nice PDF file for offline perusing.
Folks who are thinking about writing something technical - give dW a shot. The editors are savvy folks and there's lots of good stuff up there already.
Oh, and book plug!
The Army reading list
to send signals to a network socket without writing code but using some ready made command-line tool (netstat?)? I've looked around for this but can't seem to find anything...
Yam, yam, uga booga, yam, yam, yade, yade, uga booga, yam, yam, yade, yade
I believe that Windows is using (or about to use) 'completion ports' - this is where the network hardware makes a callback directly to an OS-nominated routine. Apparently the idea works very well in practice and as such, on appropriate hardware the Windows network stack really flies.
Can a network I/O engineer advise how true this is, and whether any other OSes are implementing this for such hardware?
(Yes I do realise that I could Google for the results but I'd like some local opinions here).
Cheers.
I know somebody who thought that's what Linux was, you insensitive clod!
To get around the above problems, I came up with the following scheme: Leave Nagle's algorithm enabled, but create a FlushSocket() function that merely disables Nagle on the socket, then calls send() on the socket with a 0-byte buffer, then enables Nagle again. This apparently forces the TCP stack to immediately send any data that it may have accumulated in its Nagle-buffer. Therefore the only thing the calling code has to remember to do is to call FlushSocket() whenever it has called send() one or more times and doesn't think it will be sending any more data any time soon.
The above technique seems to work pretty well under Linux, Windows, and OS X (and is more portable than Linux-specific flags like TCP_CORK, etc), but I haven't seen it documented anywhere. Is that simply an oversight, or is there some nasty downside to this technique that I'm overlooking?
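The FlushSocket() scheme described above could look roughly like this in Python, whose `socket` module mirrors the C calls. The name `flush_socket` is a stand-in for the poster's function, and whether the zero-byte send is actually required is an assumption carried over from the post, not something verified here.

```python
import socket


def flush_socket(sock):
    """Ask the stack to push out data Nagle may be holding, then restore Nagle."""
    # Step 1: disable Nagle's algorithm on this TCP socket.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    try:
        # Step 2: zero-byte send, as the post describes. Some stacks may
        # already flush when TCP_NODELAY is set; that is not checked here.
        sock.send(b"")
    finally:
        # Step 3: re-enable Nagle so later small writes are coalesced again.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)
```

The caller would invoke `flush_socket(sock)` after the last `send()` of a logical message, the same way fflush() is used on a stdio stream.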
Their description of Nagling seems a little oversimplified. Unless I'm mistaken, the Nagle algorithm acts to consolidate short packets only when TCP is waiting on an acknowledgement from a previously-transmitted packet.
Consequently, using TCP_NODELAY wouldn't necessarily make a difference in the Telnet example the article cites... at least, not as long as the ping time is better than the user's typing speed.
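The rule described above can be written down as a toy decision function. This is a simplified sketch, not kernel code: it ignores the receive window and congestion control, and the default `mss` of 1460 bytes is only an assumed typical Ethernet value.

```python
def nagle_can_send_now(pending_bytes, unacked_bytes, mss=1460):
    """Return True if Nagle's rule lets the pending data go out immediately."""
    if pending_bytes >= mss:
        # A full-sized segment is never held back.
        return True
    # A small segment is sent only when nothing is awaiting acknowledgement;
    # otherwise it is queued until the outstanding data is ACKed.
    return unacked_bytes == 0
```

In the Telnet case, if each keystroke's ACK returns before the next key is pressed, `unacked_bytes` is 0 every time and every keystroke is sent at once, with or without TCP_NODELAY.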
Use libevent.
The Sockets API lets you develop client and server applications that can communicate across a local network or across the world via the Internet.
This kind of knowledge is why I keep pressing the "reload current page" button after subscribing to slashdot.
Wouldn't it be nice if C programmers were given an option similar to what fflush does for streams? Something like flush(sd) whenever you need to ignore Nagle's algorithm. In this way you can enable and disable nagling dynamically in your program without calling setsockopt to switch nagling on and off. This option is given for Java since you can easily convert a socket to any type of stream you wish, while most Stream objects have a member function flush(). Perhaps I am wrong and such an interface is already provided in C but I personally never found one, while the necessity for it appears to be obvious.
throughput = window_size / RTT
110KB / 0.050 = 2.2MBps
If instead you use the window size calculated above, you get a whopping 31.25MBps, as shown here:
625KB / 0.050 = 31.25MBps
That's funny, I get 12.5MBps
???
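A quick check of the quoted arithmetic, as a small Python sketch. It assumes 1MB = 1000KB, which is what the 2.2MBps figure implies; the function name is made up for the example.

```python
def max_throughput_kb_per_s(window_kb, rtt_s):
    """Upper bound on TCP throughput: one window of data per round trip."""
    return window_kb / rtt_s


for window_kb in (110, 625):
    kb_per_s = max_throughput_kb_per_s(window_kb, 0.050)
    # Convert KB/s to MB/s using 1MB = 1000KB (an assumption, see above).
    print(f"{window_kb}KB / 0.050s = {kb_per_s / 1000:.2f}MBps")
```

This prints 2.20MBps for the 110KB window and 12.50MBps for the 625KB window, which agrees with the 12.5MBps figure rather than 31.25MBps.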
I've used Azureus a lot on my Linux box, and one of its features is tunability and graphs. Number of connections, max up and down, etc, and watch the results. Now, I have a very asymmetric line (10:1 ratio). I've noticed that trying to use maximum upload and download at once can create sinewave patterns of slow response that look a lot like resonant feedback, and in extreme cases can wedge the line completely, throughput zero on all net apps. Running uploads at 20K and leaving the top 5K unused gets a far better total rate both up and down.
What I'm wondering is, might it be possible to make this sort of calculation in the kernel, detect congestion feedback and back off automatically? I'm not talking about the regular exponential backoff algorithm, but about some sort of best-rate prediction based on detecting the characteristic shape of feedback waves and backing off until they disappear.
The GNU community has been divided by Linux into two subsets: dogmatic types who work on HURD and the pragmatic majority who just use what they call "GNU/Linux". Therefore, "GNU/Linux" (also spelt "GNU÷Linux") is completely accurate :-)
Documentation like this is great and extremely valuable. It would be much more valuable, however, if it remained current. For example, can the ABISS project (which improves block I/O) be used at all? What do the numbers look like, when using profiling tools like Web100 (which profiles TCP communications)?
Has anyone run the Linux or one of the *BSD kernels through DAKOTA, KOJAK or PAPI to determine where, precisely, bottlenecks are within the kernels? It's easy to theorise, but isn't it cleaner to measure?
Now, I'm not saying these things aren't being done. They probably are, somewhere, by someone, but if the results aren't getting published we don't really know what impact what changes are going to have. The current method of evolving Operating System code in general is often a mix of personal theory and subjective experience based on non-random samples of activity. That can't really be a good way to do things, can it?
If I'm wrong, feel free to say. If I'm right, then maybe it would be a good thing if someone (possibly me) put together some kind of testing kit for measuring Linux kernel performance and actually measured the stats for Linux kernels on some kind of regular basis.
GNU/Linux® ®? WTF®
The 200-millisecond delay you're experiencing is the delayed ACKs, which is independent of Nagle. Well, delayed ACKs are incompatible with Nagle. I have implemented TCP with Nagle and no delayed ACKs.
The reason for delayed ACKs is that the OS would like to withhold sending an empty ACK packet right away because the application is likely to respond to the packet. So the kernel implements a hack, whereby it waits a while (200 ms) to give the application time to react.
The socket API could be enhanced easily to get rid of the 200-millisecond penalty. For example, calling send(2) with zero bytes (maybe with a special flag) could allow the application to tell the kernel that a response is not forthcoming.
"tcp/ip's swiss army knife"
However Lame List contains a lot of wonderful nuggets.
I must disagree with the article, however. There are so, SO few times that disabling the Nagle algorithm is the correct answer that the standard answer when someone asks about it on the networking forums is that the asker doesn't understand Nagle, and to reenable it. Telnet is even a bastard case in that your networking performance may actually go UP sending smaller bursts of network characters, rather than one at a time, each in its own packet. But you have to measure your own performance.
Frankly, none of these suggestions will get you ultimate performance from a 10 Gig networking stack, and that is where networking finally becomes fun.
I have mod points and I am not afraid to use them
the article is all about TCP, which is great. how about an article on optimizing UDP though?
Here's the real problem, and its solution.
The concept behind delayed ACKs is to bet, when receiving some data from the net, that the local application will send a reply very soon. So there's no need to send an ACK immediately; the ACK can be piggybacked on the next data going the other way. If that doesn't happen, after a 500ms delay, an ACK is sent anyway.
The concept behind the Nagle algorithm is that if the sender is doing very tiny writes (like single bytes, from Telnet), there's no reason to have more than one packet outstanding on the connection. This prevents slow links from choking with huge numbers of outstanding tinygrams.
Both are reasonable. But they interact badly in the case where an application does two or more small writes to a socket, then waits for a reply. (X-Windows is notorious for this.) When an application does that, the first write results in an immediate packet send. The second write is held up until the first is acknowledged. But because of the delayed ACK strategy, that acknowledgement is held up for 500ms. This adds 500ms of latency to the transaction, even on a LAN.
The real problem is that 500ms unconditional delay. (Why 500ms? That was a reasonable response time for a time-sharing system of the 1980s.) As mentioned above, delaying an ACK is a bet that the local application will reply to the data just received. Some apps, like character echo in Telnet servers, do respond every time. Others, like X-Windows "clients" (really servers, but X is backwards about this), only reply some of the time.
TCP has no strategy to decide whether it's winning or losing those bets. That's the real problem.
The right answer is that TCP should keep track of whether delayed ACKs are "winning" or "losing". A "win" is when, before the 500ms timer runs out, the application replies. Any needed ACK is then coalesced with the next outgoing data packet. A "lose" is when the 500ms timer runs out and the delayed ACK has to be sent anyway. There should be a counter in TCP, incremented on "wins", and reset to 0 on "loses". Only when the counter exceeds some number (5 or so), should ACKs be delayed. That would eliminate the problem automatically, and the need to turn the "Nagle algorithm" on and off.
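The win/lose counter described above might be modelled like this. It is a toy per-connection policy object in Python, not TCP stack code; the class and method names are invented for the example, and the threshold of 5 is just the "5 or so" from the post.

```python
class DelayedAckPolicy:
    """Track whether delaying ACKs has been paying off on one connection."""

    def __init__(self, threshold=5):
        self.threshold = threshold  # assumed value, "5 or so" in the post
        self.wins = 0

    def record_win(self):
        # The application replied before the timer ran out, so the ACK
        # was piggybacked on outgoing data.
        self.wins += 1

    def record_loss(self):
        # The timer expired and a bare ACK had to be sent anyway.
        self.wins = 0

    def should_delay_ack(self):
        # Delay only after a run of consecutive wins exceeds the threshold.
        return self.wins > self.threshold
```

With this policy a connection starts out ACKing immediately, and a single loss switches delayed ACKs off again until the application has proven itself with another run of prompt replies.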
So that's the proper fix, at the TCP internals level. But I haven't done TCP internals in years, and really don't want to get back into that. If anyone is working on TCP internals for Linux today, I can be reached at the e-mail address above. This really should be fixed, since it's been annoying people for 20 years and it's not a tough thing to fix.
The user-level solution is to avoid write-write-read sequences on sockets. write-read-write-read is fine. write-write-write is fine. But write-write-read is a killer. So, if you can, buffer up your little writes to TCP and send them all at once. Using the standard UNIX I/O package and flushing write before each read usually works.
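A minimal sketch of the buffering advice above, in Python: small writes are collected in user space and handed to the socket in one call, with a flush before every read. `CoalescingWriter` and its method names are hypothetical; `sock` is assumed to be any connected stream socket.

```python
class CoalescingWriter:
    """Collect small writes and hand them to the socket in one call."""

    def __init__(self, sock):
        self.sock = sock
        self.chunks = []

    def write(self, data):
        # Buffer only; no system call is made here.
        self.chunks.append(data)

    def flush(self):
        if self.chunks:
            # One write to the socket instead of several tiny ones.
            self.sock.sendall(b"".join(self.chunks))
            self.chunks.clear()

    def read_reply(self, nbytes):
        # Flush before each read, so write-write-read on the wire
        # becomes write-read.
        self.flush()
        return self.sock.recv(nbytes)
```

Python's own `sock.makefile("wb")` returns a buffered writer with a `flush()` method and plays much the same role as the standard UNIX I/O package mentioned above.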
John Nagle
Linux IS a registered trademark, you know. Especially if you are an Australian...
Just "gittin-r-done," day after day.
hahahaha!
Geez. That was so simple. Couldn't you come up with a personality that requires a little more thought?
Loser.
What really bums me out about doing network services on the Linux platform is that Linux does not support doors, a la Solaris, so you can't have multiple processes collaborating on a single socket service without a scheduler burp. There was a guy who implemented doors for 2.4, but his code was never adopted into the kernel, and now it's rotting away....
Linux is quite tragic that way. Hopefully there will be a Debian user-land on the OpenSolaris kernel soon, and then I can rock-n-roll again.
yeah, it seems to be an error.
looks like they were doing the calculations with a calculator and somebody pressed '*' instead of '/' !!!!
625 * 5 = 3125
a forgivable slip of the finger's tip.
In this case, though, "GNU/Linux®" isn't just overly wordy, it's incorrect. The advice is all about tuning the kernel's TCP stack, which is pure Linux®.