Boosting Socket Performance on Linux

Be aware by 2.7182 · 2006-01-19 08:58 · Score: 4, Funny

Some engineers at Berkeley have been looking at this for a while, but haven't gotten much credit for it.

Re:Be aware by leonmergen · 2006-01-19 09:02 · Score: 3, Insightful

Exactly... especially with things like these, it's usually best for the entire internet if you just stick with the defaults... they are defaults for a reason, it might not be the best for you, but it's most likely the best for the internet as a whole.

Reminds me of those people tweaking firefox settings to hammer all kinds of webservers... sure, your browsing might be a slight bit faster, at the expense of the browsing of lots of other people...

--
- Leon Mergen
http://www.solatis.com
Re:Be aware by heavy+snowfall · 2006-01-19 09:43 · Score: 2, Interesting

Fasterfox also trips a lot of traps intended to catch content stealing bots.
Re:Be aware by zcat_NZ · 2006-01-19 11:52 · Score: 2, Interesting

imbsc but I vaguely recall in the early days of web browsers, they would pull down the base page, and then one image at a time. Netscape opening multiple requests in parallel seemed like a massive abuse of webserver resources at the time, to me at least.

--
455fe10422ca29c4933f95052b792ab2
Re:Be aware by jas0n · 2006-01-19 17:50 · Score: 4, Informative

Looks like a rip off of an OnLamp article from a few months ago, and not a very good one at that! At least the OnLamp article explained how to tweak a few more OS's and the math was correct. And just to add insult to injury the article on OnLamp was written by one of those Berkeley guys ;-)
Re:Be aware by gbjbaanb · 2006-01-19 23:15 · Score: 2, Interesting

best for the internet as a whole
are you sure?

From a paper written by Phil Dykstra, back in 1999.

"A recent example comes from the Pacific Northwest Gigapop in Seattle which is based on a collection of Foundry gigabit ethernet switches. At Supercomputing '99, Microsoft and NCSA demonstrated HDTV over TCP at over 1.2 Gbps from Redmond to Portland. In order to achieve that performance they used 9000 byte packets and thus had to bypass the switches at the NAP! Let's hope that in the future NAPs don't place 1500 byte packet limitations on applications."

Ok, forget it mentions the M word, this article is about using jumbo frames (9000 byte packets) instead of the 1500 byte ones that were originally specced in 1980 (back when ethernet was.. not quite as fast as it is today).

Seeing as how the internet as a whole is based on this packet size, and the article (http://sd.wareonearth.com/~phil/jumbo.html) describes the stunning performance gains that can be had with jumbo frames, the internet as a whole is actually being held back significantly by it (i.e. increase the frame size by a factor of 6 and you get about 40 times the throughput). Frames bigger than 9000 bytes are not practical due to other TCP design limitations.

His recommendations are - if you're on a LAN, enable jumbo frames today.

IPv6 will not have this restriction and so will be faster, maybe things like HDTV on demand will drive its adoption on the internet.

slashdotted? by ChipMonk · 2006-01-19 09:00 · Score: 4, Funny

Judging by the response time from IBM's web server, it looks like they have yet to put their advice into practice.

GNU/Linux®...A less efficient way to say Linux by Real+World+Stuff · 2006-01-19 09:01 · Score: 2, Funny

I mean really, I think we understand what you mean by just saying Linux.

--
If we don't fight for ourselves no one will.

Hello 1995 by AKAImBatman · 2006-01-19 09:02 · Score: 4, Insightful

This reads like an article from the 90's. This being 2006 and all, I would hope that programmers know how to make effective use of TCP/IP sockets. I wonder if maybe they just yanked an article from 1995 and did a search/replace on s/Windows/GNU Linux/g.

--
Javascript + Nintendo DSi = DSiCade

Re:Hello 1995 by epiphani · 2006-01-19 09:37 · Score: 3, Interesting

Agreed. In fact, as someone who learned socket coding around 1999/2000 (and as a result do not have a good grasp on how to actively define register variables, compilers do that stuff for you these days) I did all of these things out of habit, and didn't fully understand them until this article.

In the same line - where is the discussion of different FD table polling mechanisms? select() versus poll(), and where's the writeup about Linux's epoll()? I would have been interested in an epoll() article, especially how it compares to FreeBSD's kqueue().

--
.
Re:Hello 1995 by AKAImBatman · 2006-01-19 09:45 · Score: 2, Insightful

One of the great things about computers is they allow different implementations of the same idea. Because of this, someone who knows how to tune the networking on one OS may not know how to on Linux.

Now if only the article actually covered something specific to Linux, I'd agree with you. About the most useful thing it does is tell you the location of the same parameters that you muck with on every other system in existence. This info has only been around for Linux for, oh, more than a decade. Pick up any book or tutorial on TCP/IP for the same info.

Do you also complain when the weather report comes on the local news, because you've seen a weather report before?

No more than I complain that I just ate dinner yesterday. But I do tend to get annoyed when TV Networks show reruns of my favorite TV shows in slots where they're supposed to be showing new episodes! (Star Trek: Enterprise was probably the worst at this. You never knew when they were actually going to show something new. I didn't even consider it "one of my favorite TV Shows," and it was still annoying.)

--
Javascript + Nintendo DSi = DSiCade
Re:Hello 1995 by pthisis · 2006-01-19 10:39 · Score: 4, Informative

In the same line - where is the discussion of different FD table polling mechanisms? select() versus poll(), and where's the writeup about Linux's epoll()? I would have been interested in an epoll() article, especially how it compares to FreeBSD's kqueue().

For the overview, you want Dan Kegel's c10k page:

http://www.kegel.com/c10k.html

--
rage, rage against the dying of the light
Re:Hello 1995 by AKAImBatman · 2006-01-19 14:14 · Score: 3, Informative

The Linux kernel automatically doubles the buffer for its own use. In the article:

Within the Linux 2.6 kernel, the window size for the send buffer is taken as defined by the user in the call, but the receive buffer is doubled automatically. You can verify the size of each buffer using the getsockopt call.

From the MAN page:

NOTES

Linux assumes that half of the send/receive buffer is used for internal kernel structures; thus the sysctls are twice what can be observed on the wire.

The article could have better explained that in context. For the most part it's automatic though, so don't worry about it.

--
Javascript + Nintendo DSi = DSiCade
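
The getsockopt check mentioned above is easy to try for yourself. What follows is only a minimal sketch, assuming Python 3 on a Linux box; the helper name effective_buffer_sizes is made up for illustration. It asks for a buffer size with setsockopt and then prints what the kernel reports back, so you can see on your own kernel whether one or both values come back doubled (the socket(7) man page describes doubling for both SO_RCVBUF and SO_SNDBUF, and the result is also capped by the rmem_max/wmem_max sysctls).

```python
import socket


def effective_buffer_sizes(requested_bytes):
    """Request a buffer size, then return what the kernel reports back.

    Hypothetical helper for illustration; returns (SO_RCVBUF, SO_SNDBUF).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested_bytes)
        rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        sndbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    return rcvbuf, sndbuf


if __name__ == "__main__":
    requested = 32 * 1024
    rcvbuf, sndbuf = effective_buffer_sizes(requested)
    print(f"requested {requested}, SO_RCVBUF={rcvbuf}, SO_SNDBUF={sndbuf}")
```

The numbers depend on the kernel version and sysctl limits, so treat the output as an observation about your machine rather than a rule.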

Summary ripped directly from article (again) by sczimme · 2006-01-19 09:06 · Score: 2, Informative

Here is the summary:

The Sockets API lets you develop client and server applications that can communicate across a local network or across the world via the Internet. Like any API, you can use the Sockets API in ways that promote high performance -- or inhibit it. This article explores four ways to use the Sockets API to squeeze the greatest performance out your application and to tune the GNU/Linux® environment to achieve the best results.

Here is the first paragraph of the article:

The Sockets API lets you develop client and server applications that can communicate across a local network or across the world via the Internet. Like any API, you can use the Sockets API in ways that promote high performance -- or inhibit it. This article explores four ways to use the Sockets API to squeeze the greatest performance out your application and to tune the GNU/Linux® environment to achieve the best results.

Unless Cop (the submitter) is actually M. Tim Jones (the article author), Cop didn't write a darn thing.

Didn't we just have this discussion on /. a few days ago?

--
I want to drag this out as long as possible. Bring me my protractor.

No mention of alternatives to select? by complexmath · 2006-01-19 09:15 · Score: 4, Informative

Tuning socket parameters is great and all, but the real performance problem with socket IO has to do with using select and poll. There are high-performance alternatives (which admittedly tend to vary from OS to OS) that are so far superior that I wouldn't even consider the default methods unless complete code portability were a crucial factor.
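
For anyone who wants to see what "an alternative to select" looks like without committing to one OS, here is a rough sketch, assuming Python 3. The standard selectors module picks the best mechanism it can find (epoll on Linux, kqueue on the BSDs, falling back to poll or select), which sidesteps some of the portability worry. The function name wait_for_readable is invented for the example.

```python
import selectors
import socket


def wait_for_readable(socks, timeout=1.0):
    """Return the sockets that are readable within `timeout` seconds."""
    # DefaultSelector is an alias for the most efficient selector class
    # available on this platform (EpollSelector on Linux, for instance).
    with selectors.DefaultSelector() as sel:
        for s in socks:
            sel.register(s, selectors.EVENT_READ)
        return [key.fileobj for key, _events in sel.select(timeout)]


if __name__ == "__main__":
    a, b = socket.socketpair()
    a.sendall(b"ping")
    ready = wait_for_readable([b])
    print(selectors.DefaultSelector.__name__, ready[0].recv(16))
    a.close()
    b.close()
```

A real server would keep one selector alive for the life of the process and register sockets as they connect; building a fresh selector per call, as this sketch does, throws away most of the benefit and is only done here to keep the example short.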

Re:No mention of alternatives to select? by hackstraw · 2006-01-19 11:02 · Score: 4, Informative

Try this:

http://www.xmailserver.org/linux-patches/nio-improve.html /dev/epoll

The website is hideous, but there used to be benchmarks against different polling/selecting methods. If I remember correctly, it's kinda trial and error, YMMV, kind of stuff. It's worth a look.
Re:No mention of alternatives to select? by statusbar · 2006-01-19 16:39 · Score: 2, Informative

This page, while out of date, and referenced earlier during this discussion, needs re-emphasis. I hope it gets updated soon:

http://www.kegel.com/c10k.html

Very awesome paper. How do _you_ make a server that handles 10,000 connections?

--jeff++

--
ipv6 is my vpn

IBM is getting some good Linux content... by tcopeland · 2006-01-19 09:40 · Score: 4, Interesting

...on developerWorks, not the least of which, if I may say so, is the GLib tutorial I wrote for them this past summer. If you want to learn how to use various GLib collections and utilities - lists, tables, trees, quarks, relations, and all that - check it out. You can even download a nice PDF file for offline perusing.

Folks who are thinking about writing something technical - give dW a shot. The editors are savvy folks and there's lots of good stuff up there already.

Oh, and book plug!

--
The Army reading list

Re:GNU/Linux®? by wfberg · 2006-01-19 09:45 · Score: 2, Informative

Most probably it's just IBM policy to always acknowledge someone else's trademarks, so as not to get in trouble. Both GNU (yeah, I know! I knooow..) and Linux are registered trademarks (... of their respective owners, of course..)

--
SCO employee? Check out the bounty

Re:I've always wanted to know if it is possible by pclminion · 2006-01-19 09:52 · Score: 4, Informative

Netcat might be what you want. It has two modes, a "client" and a "server" mode. In client mode, it connects to an IP/port that you specify, then reads data from stdin and sends it through that socket. In server mode, it listens on a port you specify, and prints any data it receives to stdout.

Is that what you're looking for?

Re:Code Portability by complexmath · 2006-01-19 10:00 · Score: 5, Informative

There was a Boost library in the works to encapsulate all of this rather nicely, but I'm not sure if it ever made it out of beta. ACE is another option, though that tends to be overkill for some projects. I rolled my own class wrapper around this stuff, but then I enjoy library programming.

Nagle's algorithm by Jeremi · 2006-01-19 10:12 · Score: 5, Interesting

For an application where I want both low latency AND high bandwidth, it's not enough to leave Nagle's algorithm on or off. If I leave it on, I'll get increased bandwidth, but >200ms latency due to the Nagle delay. If I leave it off, I get low latency, but the computer will (typically?) send out one network packet per send() call, which means inefficient use of bandwidth unless the calling code is very careful to call send() only with large amounts of data per call.

To get around the above problems, I came up with the following scheme: Leave Nagle's algorithm enabled, but create a FlushSocket() function that merely disables Nagle on the socket, then calls send() on the socket with a 0-byte buffer, then enables Nagle again. This apparently forces the TCP stack to immediately send any data that it may have accumulated in its Nagle-buffer. Therefore the only thing the calling code has to remember to do is to call FlushSocket() whenever it has called send() one or more times and doesn't think it will be sending any more data any time soon.

The above technique seems to work pretty well under Linux, Windows, and OS/X (and is more portable than Linux-specific flags like TCP_CORK, etc), but I haven't seen it documented anywhere. Is that simply an oversight, or is there some nasty downside to this technique that I'm overlooking?

--

I don't care if it's 90,000 hectares. That lake was not my doing.
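
The FlushSocket() scheme described above might look roughly like this. It is a sketch only, assuming Python 3 and a connected TCP socket; flush_socket and loopback_pair are names made up for the example, and whether the zero-byte send is needed at all will depend on the TCP stack. On Linux, the tcp(7) man page says that setting TCP_NODELAY forces an explicit flush of pending output, which would explain why the trick seems to work there.

```python
import socket


def flush_socket(sock):
    """Briefly disable Nagle so the stack sends whatever it is holding."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # Nagle off
    sock.send(b"")  # zero-byte send, as in the description above
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)  # Nagle on


def loopback_pair():
    """Return a connected (client, server) TCP pair on 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _addr = listener.accept()
    listener.close()
    return client, server


if __name__ == "__main__":
    client, server = loopback_pair()
    client.send(b"small write")
    flush_socket(client)
    print(server.recv(64))
    client.close()
    server.close()
```

Note that each flush costs three extra system calls, which is the trade-off raised in the reply below.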

Re:Nagle's algorithm by convolvatron · 2006-01-19 10:42 · Score: 2, Interesting

aren't you just drastically increasing the number of system
calls you have to pay for?

if you have some knowledge about the natural grouping of data,
it would be better to just turn nagle off and do buffering
in user space (collect up enough data and send it all in one
go)
Re:Nagle's algorithm by buck68 · 2006-01-19 13:35 · Score: 3, Informative

You may be interested in a paper we wrote a few years back [1]. We also started with the premise that some applications require both minimal latency and maximal bandwidth. In our case the application was our own media streaming system. We came up with our own patch to TCP (in Linux). The patch provided a new socket option we call TCP_MINBUF. The idea is that you need a certain minimum amount of buffer to allow TCP's congestion window to function, but no more. Indeed, in the paper we show that the delay due to socket buffer beyond the congestion window is often by far the dominant source of latency--not retransmissions, or delayed acks, or all the other more commonly cited things. So basically what TCP_MINBUF does is dynamically size the socket buffer to follow the congestion window size. It had a huge impact on latency.

[1] "Supporting Low Latency TCP-Based Media Streams", Ashvin Goel, Charles Krasic, Kang Li, and Jonathan Walpole. Tenth International Workshop on Quality of Service (IWQoS), May 2002.

http://www.eecg.toronto.edu/~ashvin/publications/iwqos2002.pdf

Re:I've always wanted to know if it is possible by temojen · 2006-01-19 10:37 · Score: 4, Funny

which acts very much like cat

It ignores you except at feeding time, and pees in your shoes when it's mad at you?

Math error in paper? by Stiletto · 2006-01-19 11:46 · Score: 2, Informative

throughput = window_size / RTT

110KB / 0.050 = 2.2MBps

If instead you use the window size calculated above, you get a whopping 31.25MBps, as shown here:

625KB / 0.050 = 31.25MBps

That's funny, I get 12.5MBps

???
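
For what it's worth, the arithmetic is quick to check. This is a small sketch, assuming Python 3 and taking 1 MB as 1000 KB the way the quoted figures do; the function names are invented for the example. It applies throughput = window_size / RTT to both window sizes and also works backwards to the window that 31.25MBps would actually need at a 50 ms round trip.

```python
def throughput_mb_per_s(window_kb, rtt_ms):
    """throughput = window_size / RTT, with 1 MB taken as 1000 KB."""
    kb_per_s = window_kb * 1000 / rtt_ms  # KB per second
    return kb_per_s / 1000                # MB per second


def window_kb_needed(throughput_mb, rtt_ms):
    """Invert the formula: window_size = throughput * RTT."""
    # (MB/s * 1000 KB/MB) * (ms / 1000 s/ms) simplifies to MB/s * ms.
    return throughput_mb * rtt_ms


if __name__ == "__main__":
    print(throughput_mb_per_s(110, 50))   # 2.2
    print(throughput_mb_per_s(625, 50))   # 12.5
    print(window_kb_needed(31.25, 50))    # 1562.5
```

So 625KB over a 0.050 second round trip gives 12.5MBps, and the 31.25MBps figure would need a window of about 1562.5KB at that RTT.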

Hello 2003. by jd · 2006-01-19 12:29 · Score: 4, Interesting

The paper is 2 years, 2 months old. Many of the arguments will still be valid, but the code in all cases will have evolved considerably. In addition, other code has certainly been developed (there's a hard real-time UDP patch for Linux, for example) and the state of affairs is - if anything - much more muddled today.

Documentation like this is great and extremely valuable. It would be much more valuable, however, if it remained current. For example, can the ABISS project (which improves block I/O) be used at all? What do the numbers look like, when using profiling tools like Web100 (which profiles TCP communications)?

Has anyone run the Linux or one of the *BSD kernels through DAKOTA, KOJAK or PAPI to determine where, precisely, bottlenecks are within the kernels? It's easy to theorise, but isn't it cleaner to measure?

Now, I'm not saying these things aren't being done. They probably are, somewhere, by someone, but if the results aren't getting published we don't really know what impact what changes are going to have. The current method of evolving Operating System code in general is often a mix of personal theory and subjective experience based on non-random samples of activity. That can't really be a good way to do things, can it?

If I'm wrong, feel free to say. If I'm right, then maybe it would be a good thing if someone (possibly me) put together some kind of testing kit for measuring Linux kernel performance and actually measured the stats for Linux kernels on some kind of regular basis.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)

Re:flush( sd ) would be nice by Jeremi · 2006-01-19 13:27 · Score: 2, Interesting

Recall that fflush() blocks until the data makes it to disk; I expect he'd want to block until the socket buffers were empty, too.

I don't know if that really makes sense for networking though... the reason you'd want fflush() to block until the data makes it to disk is so that once your call to fflush() returns you know that your written data is safe in the event of a crash or power failure. (Although with too-clever hard drive firmware I'm not so sure even that's true anymore!) With networking on the other hand, even once the data has left your Ethernet port there is no guarantee that it will get to its destination... so what would be the purpose of waiting?

--

I don't care if it's 90,000 hectares. That lake was not my doing.

The trouble with the Nagle algorithm by Animats · 2006-01-19 13:38 · Score: 4, Interesting

I really should fix the bad interaction between the "Nagle algorithm" and "delayed ACKs". Both ideas went into TCP around the same time, and the interaction is terrible. That fixed timer for ACKs is all wrong.

Here's the real problem, and its solution.

The concept behind delayed ACKs is to bet, when receiving some data from the net, that the local application will send a reply very soon. So there's no need to send an ACK immediately; the ACK can be piggybacked on the next data going the other way. If that doesn't happen, after a 500ms delay, an ACK is sent anyway.

The concept behind the Nagle algorithm is that if the sender is doing very tiny writes (like single bytes, from Telnet), there's no reason to have more than one packet outstanding on the connection. This prevents slow links from choking with huge numbers of outstanding tinygrams.

Both are reasonable. But they interact badly in the case where an application does two or more small writes to a socket, then waits for a reply. (X-Windows is notorious for this.) When an application does that, the first write results in an immediate packet send. The second write is held up until the first is acknowledged. But because of the delayed ACK strategy, that acknowledgement is held up for 500ms. This adds 500ms of latency to the transaction, even on a LAN.

The real problem is that 500ms unconditional delay. (Why 500ms? That was a reasonable response time for a time-sharing system of the 1980s.) As mentioned above, delaying an ACK is a bet that the local application will reply to the data just received. Some apps, like character echo in Telnet servers, do respond every time. Others, like X-Windows "clients" (really servers, but X is backwards about this), only reply some of the time.

TCP has no strategy to decide whether it's winning or losing those bets. That's the real problem.

The right answer is that TCP should keep track of whether delayed ACKs are "winning" or "losing". A "win" is when, before the 500ms timer runs out, the application replies. Any needed ACK is then coalesced with the next outgoing data packet. A "lose" is when the 500ms timer runs out and the delayed ACK has to be sent anyway. There should be a counter in TCP, incremented on "wins", and reset to 0 on "loses". Only when the counter exceeds some number (5 or so), should ACKs be delayed. That would eliminate the problem automatically, and the need to turn the "Nagle algorithm" on and off.
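
To make the bookkeeping concrete, here is a toy model of that counter. It is a sketch in Python 3 and nothing more: the class and method names are hypothetical, it is not code from any real TCP stack, and it leaves open how a stack would observe "wins" during the period when it is not yet delaying ACKs.

```python
class DelayedAckBet:
    """Toy model of the win/lose counter; not real TCP stack code."""

    def __init__(self, threshold=5):
        self.threshold = threshold  # the "5 or so" from the text
        self.wins = 0

    def record_win(self):
        """The application replied before the ACK timer ran out."""
        self.wins += 1

    def record_loss(self):
        """The timer ran out and a bare ACK had to be sent."""
        self.wins = 0

    def should_delay_ack(self):
        """Delay ACKs only after a run of wins longer than the threshold."""
        return self.wins > self.threshold


if __name__ == "__main__":
    bet = DelayedAckBet()
    for _ in range(6):
        bet.record_win()
    print(bet.should_delay_ack())  # True: six wins in a row
    bet.record_loss()
    print(bet.should_delay_ack())  # False: one loss resets the counter
```

The point of resetting to zero rather than decrementing is that a single lost bet is enough to stop delaying, so a connection like the X-Windows case would quickly stop paying the timer penalty.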

So that's the proper fix, at the TCP internals level. But I haven't done TCP internals in years, and really don't want to get back into that. If anyone is working on TCP internals for Linux today, I can be reached at the e-mail address above. This really should be fixed, since it's been annoying people for 20 years and it's not a tough thing to fix.

The user-level solution is to avoid write-write-read sequences on sockets. write-read-write-read is fine. write-write-write is fine. But write-write-read is a killer. So, if you can, buffer up your little writes to TCP and send them all at once. Using the standard UNIX I/O package and flushing write before each read usually works.

John Nagle
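
The user-level advice above (buffer the little writes, flush once, then read) can be sketched as follows. This assumes Python 3, where socket.makefile() plays the role of the standard I/O package; send_together is a made-up name, and the example is an illustration rather than anything from the comment's author.

```python
import socket


def send_together(sock, parts):
    """Buffer several small writes in user space and send them in one go."""
    # makefile("wb") gives a buffered writer, much like stdio on a socket.
    # Closing the writer flushes it but leaves the socket itself open.
    with sock.makefile("wb") as writer:
        for part in parts:
            writer.write(part)  # held in the user-space buffer
        writer.flush()          # one send for all the parts


if __name__ == "__main__":
    a, b = socket.socketpair()
    send_together(a, [b"HELO ", b"example", b"\r\n"])
    print(b.recv(64))  # the three pieces arrive as a single write
    a.close()
    b.close()
```

With this pattern the socket sees write-read-write-read instead of write-write-read, so the second small write is never left waiting on a delayed ACK.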
