Slashdot Mirror


Boosting Socket Performance on Linux

Cop writes "The Sockets API lets you develop client and server applications that can communicate across a local network or across the world via the Internet. Like any API, you can use the Sockets API in ways that promote high performance -- or inhibit it. This article explores four ways to use the Sockets API to squeeze the greatest performance out of your application and to tune the GNU/Linux® environment to achieve the best results."

138 comments

  1. Be aware by 2.7182 · · Score: 4, Funny

    Some engineers at Berkeley have been looking at this for a while, but haven't gotten much credit for it.

    1. Re:Be aware by leonmergen · · Score: 3, Insightful

      Exactly... especially with things like these, it's usually best for the entire internet if you just stick with the defaults... they are defaults for a reason. They might not be the best for you, but they're most likely the best for the internet as a whole.

      Reminds me of those people tweaking Firefox settings to hammer all kinds of web servers... sure, your browsing might be slightly faster, but at the expense of lots of other people's browsing...

      --
      - Leon Mergen
      http://www.solatis.com
    2. Re:Be aware by mordors9 · · Score: 1

      But did they get their work patented? Otherwise in these days and times it (depressingly) doesn't seem to count.

    3. Re:Be aware by heavy+snowfall · · Score: 2, Interesting

      Fasterfox also trips a lot of traps intended to catch content stealing bots.

    4. Re:Be aware by zcat_NZ · · Score: 2, Interesting

      imbsc but I vaguely recall in the early days of web browsers, they would pull down the base page, and then one image at a time. Netscape opening multiple requests in parallel seemed like a massive abuse of webserver resources at the time, to me at least.

      --
      455fe10422ca29c4933f95052b792ab2
    5. Re:Be aware by pyrrhonist · · Score: 1
      Netscape opening multiple requests in parallel seemed like a massive abuse of webserver resources at the time, to me at least.

      I'm glad I'm not the only one who remembers this. We used to call it, "Netrape", because of this behavior.

      I still kind of miss Mosaic.

      --
      Show me on the doll where his noodly appendage touched you.
    6. Re:Be aware by jas0n · · Score: 4, Informative

      Looks like a rip-off of an OnLAMP article from a few months ago, and not a very good one at that! At least the OnLAMP article explained how to tweak a few more OSes and the math was correct. And just to add insult to injury, the OnLAMP article was written by one of those Berkeley guys ;-)

    7. Re:Be aware by gbjbaanb · · Score: 2, Interesting

      best for the internet as a whole
      are you sure?

      From a paper written by Phil Dykstra, back in 1999.

      "A recent example comes from the Pacific Northwest Gigapop in Seattle which is based on a collection of Foundry gigabit ethernet switches. At Supercomputing '99, Microsoft and NCSA demonstrated HDTV over TCP at over 1.2 Gbps from Redmond to Portland. In order to achieve that performance they used 9000 byte packets and thus had to bypass the switches at the NAP! Let's hope that in the future NAPs don't place 1500 byte packet limitations on applications."

      OK, forget that it mentions the M word; this article is about using jumbo frames (9000-byte packets) instead of the 1500-byte ones that were originally specced in 1980 (back when Ethernet was... not quite as fast as it is today).

      Seeing as the internet as a whole is based on this packet size, and the article (http://sd.wareonearth.com/~phil/jumbo.html) describes the stunning performance gains that can be had with jumbo frames, the internet as a whole is actually being held back significantly by it (i.e., increase the frame size by a factor of 6 and you get about 40 times the throughput; frames bigger than 9000 bytes are not practical due to other TCP design limitations).

      His recommendations are - if you're on a LAN, enable jumbo frames today.
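To put numbers on the per-packet side of this: at a fixed bit rate, the number of frames a host must handle each second is inversely proportional to the frame size, so 9000-byte jumbo frames mean one sixth as many frames (and one sixth as much per-frame processing) as 1500-byte frames. The sketch below is only that arithmetic; `frames_per_second` is a hypothetical helper, it ignores the Ethernet header, preamble and inter-frame gap, and it does not attempt to reproduce the TCP throughput figures from the linked paper.

```c
/*
 * Frames per second needed to move link_bps bits per second when each
 * frame carries frame_bytes bytes. Hypothetical helper; Ethernet framing
 * overhead (header, preamble, inter-frame gap) is deliberately ignored.
 */
double frames_per_second(double link_bps, double frame_bytes)
{
    return link_bps / (frame_bytes * 8.0);
}
```

At 1 Gbit/s this comes to roughly 83,333 frames per second with 1500-byte frames versus roughly 13,889 with 9000-byte frames.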

      IPv6 will not have this restriction and so will be faster, maybe things like HDTV on demand will drive its adoption on the internet.

    8. Re:Be aware by d1rty_d0gg_ · · Score: 1

      Netscape opening multiple requests in parallel seemed like a massive abuse of webserver resources

      ...as opposed to Internet Exploder? AAMOF this had more to do with the absence of persistent connections in HTTP 1.0. The server would simply close the socket at its end after servicing a request, so the client had to open a new connection for each object in the page. That changed in HTTP 1.1, partly because servers were maxing out the number of open connections on the host.

      --
      "Show me your tables and I won't usually need your flow charts; they'll be obvious".
    9. Re:Be aware by zcat_NZ · · Score: 1

      As opposed to NCSA Mosaic and Spyglass Mosaic. MSIE didn't even exist at the time. Some history for you.

      I think my point is; what might seem like network abuse today is likely to be SOP in a few years time.

      --
      455fe10422ca29c4933f95052b792ab2
  2. Don't let... by jimand · · Score: 0, Offtopic

    Bell South hear about this.

  3. slashdotted? by ChipMonk · · Score: 4, Funny

    Judging by the response time from IBM's web server, it looks like they have yet to put their advice into practice.

    1. Re:slashdotted? by leonmergen · · Score: 1

      Judging by the response time from IBM's web server, it looks like they have yet to put their advice into practice.

      ... or too many Slashdot visitors already did that exact thing... :-)

      --
      - Leon Mergen
      http://www.solatis.com
    2. Re:slashdotted? by nahdude812 · · Score: 1

      Because it seems to be beginning to crawl under a good ol' fashioned /.ing, here's the article text:

      Boost socket performance on Linux

      Four ways to speed up your network applications

      M. Tim Jones (mtj@mtjones.com), Senior Principal Software Engineer, Emulex

      17 Jan 2006

      The Sockets API lets you develop client and server applications that can communicate across a local network or across the world via the Internet. Like any API, you can use the Sockets API in ways that promote high performance -- or inhibit it. This article explores four ways to use the Sockets API to squeeze the greatest performance out of your application and to tune the GNU/Linux® environment to achieve the best results.

      When developing a sockets application, job number one is usually establishing reliability and meeting the necessary requirements. With the four tips in this article, you can design and develop your sockets application for best performance, right from the beginning. This article covers use of the Sockets API, a couple of socket options that provide enhanced performance, and GNU/Linux tuning.

      To develop applications with lively performance capabilities, follow these tips:

      * Minimize packet transmit latency.
      * Minimize system call overhead.
      * Adjust TCP windows for the Bandwidth Delay Product.
      * Dynamically tune the GNU/Linux TCP/IP stack.

      Tip 1. Minimize packet transmit latency

      When you communicate through a TCP socket, the data are chopped into blocks so that they fit within the TCP payload for the given connection. The size of the TCP payload depends on several factors (such as the maximum packet size along the path), but these factors are known at connection initiation time. To achieve the best performance, the goal is to fill each packet as much as possible with the available data. When insufficient data exist to fill a payload (otherwise known as the maximum segment size, or MSS), TCP employs the Nagle algorithm to automatically concatenate small buffers into a single segment. Doing so increases the efficiency of the application and reduces overall network congestion by minimizing the number of small packets that are sent.
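The rule just described can be written down as a small decision function. This is a simplified model, assuming the classic formulation from RFC 896 as refined in RFC 1122 (a full-sized segment may always go out; a smaller one waits while earlier data is still unacknowledged) plus the TCP_NODELAY override discussed in the article; `nagle_can_send` and its parameters are hypothetical, and real stacks add timers, FIN/PSH handling, and options such as TCP_CORK.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified model of Nagle's send decision (RFC 896, as refined by
 * RFC 1122). Hypothetical helper, not a kernel or Sockets API function.
 *   queued   bytes written by the application but not yet transmitted
 *   mss      maximum segment size for the connection
 *   unacked  bytes already transmitted but not yet acknowledged
 *   nodelay  true if TCP_NODELAY is set on the socket
 * Returns true if a segment may be transmitted now.
 */
bool nagle_can_send(size_t queued, size_t mss, size_t unacked, bool nodelay)
{
    if (queued == 0)
        return false;        /* nothing waiting to go out */
    if (nodelay)
        return true;         /* Nagle disabled: small segments go immediately */
    if (queued >= mss)
        return true;         /* a full-sized segment is always allowed */
    return unacked == 0;     /* small segment only when nothing is in flight */
}
```

For a telnet session, this means the first keystroke on an idle connection goes out at once, but the next one waits for the ACK of the first unless TCP_NODELAY is set.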

      John Nagle's algorithm works well to minimize small packets by concatenating them into larger ones, but sometimes you simply want the ability to send small packets. A simple example is the telnet application, which allows a user to interact with a remote system, typically through a shell. If the user were required to fill a segment with typed characters before the packet was sent, the experience would be less than desirable.

      Another example is the HTTP protocol. Commonly, a client browser makes a small request (an HTTP request message), resulting in a much larger response by the Web server (the Web page).

      The solution

      The first thing you should consider is that the Nagle algorithm fulfills a need. Because the algorithm coalesces data to try to fill a complete TCP packet segment, it does introduce some latency. But it does this with the benefit of minimizing the number of packets sent on the wire, and so it minimizes congestion on the network.

      But in cases where you need to minimize that transmit latency, the Sockets API provides a solution. To disable the Nagle algorithm, you can set the TCP_NODELAY socket option, as shown in Listing 1.

      Listing 1. Disabling the Nagle algorithm for a TCP socket

      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <netinet/tcp.h>

      int sock, flag, ret;

      /* Create new stream socket */
      sock = socket( AF_INET, SOCK_STREAM, 0 );

      /* Disable the Nagle (TCP No Delay) algorithm */
      flag = 1;
      ret = setsockopt( sock, IPPROTO_TCP, TCP_NODELAY, (char *)&flag, sizeof(flag) );
      if (ret == -1) {
        perror("setsockopt(TCP_NODELAY)");
        exit(EXIT_FAILURE);
      }

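      Bonus tip: Experimentation with Samba demonstrates that disabling the Nagle algorithm results in almost doubling the read performance when reading from a Samba drive on a Microsoft® Windows® server.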
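A slightly more defensive variant of Listing 1 wraps the option in a helper and reads it back with getsockopt(), so a caller can confirm the kernel accepted the setting. This is a sketch; `set_tcp_nodelay` is a hypothetical name, not part of the Sockets API, and like Listing 1 it assumes an AF_INET stream socket.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

/*
 * Hypothetical helper: disable Nagle on a TCP socket, then read the option
 * back to confirm it took effect.
 * Returns 0 on success, -1 on failure (perror() reports the failing call).
 */
int set_tcp_nodelay(int fd)
{
    int on = 1;
    int check = 0;
    socklen_t len = sizeof(check);

    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) == -1) {
        perror("setsockopt(TCP_NODELAY)");
        return -1;
    }
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &check, &len) == -1) {
        perror("getsockopt(TCP_NODELAY)");
        return -1;
    }
    return check != 0 ? 0 : -1;   /* nonzero means Nagle is now disabled */
}
```

Call it right after socket() on the client, or on each descriptor returned by accept() on the server.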

  4. Re:Don't forget to wipe - and stay off the ice pip by heauxmeaux · · Score: 0, Informative

    'aunses'

    Did you mean anuses? Or the correct pluralization:
    Anii ?

    Usage: You are an anus! You and your kin are anii!

    --
    Beat 'Em and Eat 'Em
  5. somewhat old... by midom · · Score: 0, Redundant
    Where's the news? Where's the discovery? This has been around for ages... And still...

    Most time is spent in select()/poll() anyway. And there's sendfile() for web/ftp servers, hey, that saves syscalls!

    Want nodelay? Use UDP! :-)

    Hehehe, go spend your time on serious issues, folks ;-)
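For the sendfile() point: on Linux, sendfile(2) copies data between two descriptors inside the kernel, so a server can push a file to a client without a read()/write() loop through a user-space buffer. A minimal sketch, assuming a blocking output descriptor and a regular input file; `send_whole_file` is a hypothetical helper name. Before Linux 2.6.33 the output descriptor had to be a socket; later kernels accept any file.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send the whole of in_fd (a regular file) to out_fd using sendfile(2).
 * sendfile() advances `offset` itself, so the loop simply repeats until
 * the file has been fully sent. Returns bytes sent, or -1 on error.
 * Assumes out_fd is blocking; a non-blocking socket would need poll().
 */
ssize_t send_whole_file(int out_fd, int in_fd)
{
    struct stat st;
    off_t offset = 0;

    if (fstat(in_fd, &st) == -1)
        return -1;

    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n == -1) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0)
            break;                 /* file was truncated while sending */
    }
    return (ssize_t)offset;
}
```

A caller would typically open() the file with O_RDONLY (hence `<fcntl.h>`) and pass the connected client socket as out_fd: one system call per chunk instead of a read() plus a write().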

    1. Re:somewhat old... by AuMatar · · Score: 1

      UDP is lacking a lot of features that TCP has. Such as resend on lost transmission.

      I agree though, nothing earth shaking. Nagle's algorithm is discussed in depth in most TCP/IP books, and so is how to turn it off. Wake me up when they post something new.

      --
      I still have more fans than freaks. WTF is wrong with you people?
  6. GNU/Linux®...A less efficient way to say Linux by Real+World+Stuff · · Score: 2, Funny

    I mean really, I think we understand what you mean by just saying Linux.

    --
    If we don't fight for ourselves no one will.
  7. Hello 1995 by AKAImBatman · · Score: 4, Insightful

    This reads like an article from the 90's. This being 2006 and all, I would hope that programmers know how to make effective use of TCP/IP sockets. I wonder if maybe they just yanked an article from 1995 and did a search/replace on s/Windows/GNU Linux/g.

    1. Re:Hello 1995 by ClamIAm · · Score: 1
      This being 2006 and all, I would hope that programmers know how to make effective use of TCP/IP sockets.

      One of the great things about computers is they allow different implementations of the same idea. Because of this, someone who knows how to tune the networking on one OS may not know how to on Linux. Also, not everyone has been programming since 1995. Do you also complain when the weather report comes on the local news, because you've seen a weather report before?

    2. Re:Hello 1995 by epiphani · · Score: 3, Interesting

      Agreed. In fact, as someone who learned socket coding around 1999/2000 (and as a result does not have a good grasp of how to actively define register variables; compilers do that stuff for you these days), I did all of these things out of habit, and didn't fully understand them until this article.

      In the same line: where is the discussion of different FD table polling mechanisms? select() versus poll(), and where's the write-up about Linux's epoll()? I would have been interested in an epoll() article, especially how it compares to FreeBSD's kqueue().

      --
      .
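For the epoll() question, a minimal sketch of the Linux-specific pattern: descriptors are registered once with epoll_ctl(), and each epoll_wait() returns only the ones that are ready, instead of re-passing the whole set as select() and poll() do on every call. `epoll_watch` and `epoll_collect_ready` are hypothetical helper names; FreeBSD's kqueue() has a different API and is not shown.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 64   /* events fetched per epoll_wait() call in this sketch */

/*
 * Register fd once in the epoll interest list for readability
 * (level-triggered). The kernel keeps the registration, so later waits
 * do not have to pass the descriptor again.
 * Returns 0 on success, -1 on error.
 */
int epoll_watch(int epfd, int fd)
{
    struct epoll_event ev;

    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/*
 * Wait up to timeout_ms milliseconds and store the ready descriptors in
 * out[0..max-1]. Returns the number stored, 0 on timeout, -1 on error.
 */
int epoll_collect_ready(int epfd, int *out, int max, int timeout_ms)
{
    struct epoll_event events[MAX_EVENTS];
    int i, n;

    if (max > MAX_EVENTS)
        max = MAX_EVENTS;
    n = epoll_wait(epfd, events, max, timeout_ms);
    for (i = 0; i < n; i++)
        out[i] = events[i].data.fd;
    return n;
}
```

A server would create the instance once with epoll_create1(0), call epoll_watch() for the listening socket and for each accepted connection, and loop on epoll_collect_ready(); the cost of each wait then depends on how many descriptors are ready, not on how many are registered.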
    3. Re:Hello 1995 by pair-a-noyd · · Score: 1

      Could be. But considering that I live in the past anyway, I find the article particularly useful.

      Vuja de rules!

    4. Re:Hello 1995 by AKAImBatman · · Score: 2, Insightful

      One of the great things about computers is they allow different implementations of the same idea. Because of this, someone who knows how to tune the networking on one OS may not know how to on Linux.

      Now if only the article actually covered something specific to Linux, I'd agree with you. About the most useful thing it does is tell you the location of the same parameters that you muck with on every other system in existence. This info has only been around for Linux for, oh, more than a decade. Pick up any book or tutorial on TCP/IP for the same info.

      Do you also complain when the weather report comes on the local news, because you've seen a weather report before?

      No more than I complain that I just ate dinner yesterday. But I do tend to get annoyed when TV networks show reruns of my favorite TV shows in slots where they're supposed to be showing new episodes! (Star Trek: Enterprise was probably the worst at this. You never knew when they were actually going to show something new. I didn't even consider it "one of my favorite TV shows," and it was still annoying.)

    5. Re:Hello 1995 by pclminion · · Score: 1
      This reads like an article from the 90's. This being 2006 and all, I would hope that programmers know how to make effective use of TCP/IP sockets.

      Actually, given that it's 2006, I would have thought that the socket layer would be smart enough to perform these sorts of "optimizations" for you automatically, by analyzing your usage patterns. There's no reason the programmer should have to deal with any of this crap, except maybe by providing a broad hint such as "Maximize throughput" or "Minimize latency."

    6. Re:Hello 1995 by AKAImBatman · · Score: 1

      Actually, given that it's 2006, I would have thought that the socket layer would be smart enough to perform these sorts of "optimizations" for you automatically

      To a certain degree, they are optimized. Since most network activity occurs through a higher level networking API (e.g. HTTP), the network performance is already optimized by the library. It's not all that often that you have to open a direct socket unless you happen to be writing such a library or server.

      Which just further points out how much this article is NOT news and DOESN'T matter. :-)

    7. Re:Hello 1995 by hackstraw · · Score: 1

      Yes, I thought most of the stuff has been addressed for years too. But I'm confused about this, which is new to me:

      BDP = link_bandwidth * RTT
      100Mbps * 0.050 sec / 8 = 0.625MB = 625KB
      Note: I divide by 8 to convert from bits to bytes communicated.
      So, set your TCP window to the BDP, or 1.25MB.
      Where does .625MB turn into 1.25MB? If it was double, it might make sense for a send and receive window, but I doubt that is the case either.

      Is this a typo, or am I missing something in the calculation?
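The quoted arithmetic can be checked with a one-line helper. This sketch assumes bandwidth in bits per second (so the 100 figure has to be megabits, given the divide by 8) and RTT in seconds; `bdp_bytes` is a hypothetical name. It reproduces the 625KB figure; the 1.25MB in the quoted text is exactly twice that.

```c
/*
 * Bandwidth-delay product in bytes: the amount of data that must be in
 * flight to keep the path full. bandwidth_bps is in bits per second and
 * rtt_s in seconds; dividing by 8 converts bits to bytes.
 */
double bdp_bytes(double bandwidth_bps, double rtt_s)
{
    return bandwidth_bps * rtt_s / 8.0;
}
```

bdp_bytes(100e6, 0.050) gives 625000 bytes, i.e. the 625KB in the example.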

    8. Re:Hello 1995 by Anonymous Coward · · Score: 0

      Maybe .625 in and .625 out?

    9. Re:Hello 1995 by dtfinch · · Score: 1

      It's probably a "better safe than sorry" sort of thing. 0.625MB is just the minimum before you're guaranteed to have poor performance.

    10. Re:Hello 1995 by pthisis · · Score: 4, Informative

      In the same line - where is the discussion of different FD table polling mechanisms? select() versus poll(), and wheres the writeup about Linux's epoll(). I would have been interested in an epoll() article, especially how it compares to FreeBSD's kqueue().

      For the overview, you want Dan Kegel's c10k page:

      http://www.kegel.com/c10k.html

      --
      rage, rage against the dying of the light
    11. Re:Hello 1995 by KidSock · · Score: 1

      Why is this flagged "Insightful"? I thought it was a very well written article, and I do a lot of network programming. What is an article about an API designed in 1983 in a language dating back to 1972 supposed to look like? And I doubt the poster actually read it, considering it describes features specific to Linux 2.6 (e.g. I don't think 2.4 actually supported setting SO_{SND,RCV}BUF).

    12. Re:Hello 1995 by nagora · · Score: 1
      Where does .625MB turn into 1.25MB? If it was double, it might make sense

      What do you mean "if"?

      TWW

      --
      "Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
    13. Re:Hello 1995 by AKAImBatman · · Score: 1
      Why is this flagged "Insightful"?

      Because most of us know more than you think you do? ;-)

      What is an article about an API designed in 1983 in a language dating back to 1972 supposed to look like?

      Old.

      Barring that, definitely not "News for Nerds" or "Stuff that Matters".

      And I doubt the poster actually read it considering it describes features specific to Linux 2.6 (e.g. I don't think 2.4 actually supported setting SO_{SND,RCV}BUF).

      You do realize that SO_SNDBUF and SO_RCVBUF are part of the POSIX standard, don't you? They've been in Linux for as long as I can remember. At least as long as the 2.x kernels have been in production. The socket man page tells you what new features were added during development:

      VERSIONS

      SO_BINDTODEVICE was introduced in Linux 2.0.30. SO_PASSCRED is new in Linux 2.2. The sysctls are new in Linux 2.2. SO_RCVTIMEO and SO_SNDTIMEO are supported since Linux 2.3.41. Earlier, timeouts were fixed to a protocol specific setting, and could not be read or written.


      And wonders upon wonders, it even tells you about the buffer doubling!

      NOTES

      Linux assumes that half of the send/receive buffer is used for internal kernel structures; thus the sysctls are twice what can be observed on the wire.


      Who'd have thunk that you could check the online documentation to get such amazing info?!

      Yes, I'm being horribly sarcastic. I probably shouldn't be, so I apologize in advance. But I stand by my contention that there's nothing about this article that makes it newsworthy. Especially not as a front page story.
    14. Re:Hello 1995 by AKAImBatman · · Score: 3, Informative
      The Linux kernel automatically doubles the buffer for its own use. In the article:

      Within the Linux 2.6 kernel, the window size for the send buffer is taken as defined by the user in the call, but the receive buffer is doubled automatically. You can verify the size of each buffer using the getsockopt call.


      From the MAN page:

      NOTES

      Linux assumes that half of the send/receive buffer is used for internal kernel structures; thus the sysctls are twice what can be observed on the wire.


      The article could have better explained that in context. For the most part it's automatic though, so don't worry about it.
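The doubling is easy to observe: set SO_RCVBUF and read it back with getsockopt(). A minimal sketch, assuming Linux; per socket(7) the kernel doubles the value passed to setsockopt() for both SO_RCVBUF and SO_SNDBUF (after capping the request at net.core.rmem_max or net.core.wmem_max), so asking for 32768 bytes typically reads back as 65536. `set_rcvbuf` is a hypothetical helper.

```c
#include <sys/socket.h>
#include <unistd.h>

/*
 * Ask for a receive buffer of `bytes` and return what the kernel actually
 * granted, as reported by getsockopt(). On Linux the granted value is
 * normally twice the (rmem_max-capped) request, with the extra space
 * reserved for kernel bookkeeping. Returns -1 on error.
 */
int set_rcvbuf(int fd, int bytes)
{
    int granted = 0;
    socklen_t len = sizeof(granted);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) == -1)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) == -1)
        return -1;
    return granted;
}
```

Printing the return value next to the requested size makes it obvious whether the kernel doubled, capped, or both.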
    15. Re:Hello 1995 by Anonymous Coward · · Score: 0


      Well, hello 1992_Called. Remember me? Yes, the one you got your current sig from. I'm just posting to tell you that I'm back, and I'm not only going to be stalking && attacking you, but also your other trolls this time.

      Boy, these next few days are gonna be fun!

      --
      Trolling all trolls since 2001.
      (btw: your post is actually pretty funny)

    16. Re:Hello 1995 by hackstraw · · Score: 1

      Linux assumes that half of the send/receive buffer is used for internal kernel structures; thus the sysctls are twice what can be observed on the wire.

      The article could have better explained that in context. For the most part it's automatic though, so don't worry about it.


      Thanks, that is the answer. Hopefully others will see it.

    17. Re:Hello 1995 by KidSock · · Score: 1

      You do realize that SO_SNDBUF and SO_RCVBUF are part of the POSIX standard, don't you?

      Yeah? So does this mean you think Linux is POSIX compliant? If so, then maybe you should spend more time coding than posting drivel on /.

    18. Re:Hello 1995 by AKAImBatman · · Score: 1

      Yeah? So does this mean you think Linux is POSIX compliant?

      For the most part? Yes. It's not fully POSIX compliant, but close enough. Patches exist in the wild that make it 100% POSIX. It's actually been a pretty big thing to the Linux community to reach a compliant state.

      If so, then maybe you should spend more time coding than posting drivel on ./

      I'm sorry, is your point that SO_[SND|RCV]BUF wasn't in 2.4? Or 2.2? Because (as we can see from this pretty manpage for Linux 2.0) it was. So there's no reason to get all upset over things. You thought one thing, it's actually another way. Such is life.

    19. Re:Hello 1995 by AKAImBatman · · Score: 1

      P.S. I noticed your previous post about physics lectures. You might find this link to be of great interest. It kind of helps visualize the Special Theory of relativity.

  8. Summary ripped directly from article (again) by sczimme · · Score: 2, Informative


    Here is the summary:

    The Sockets API lets you develop client and server applications that can communicate across a local network or across the world via the Internet. Like any API, you can use the Sockets API in ways that promote high performance -- or inhibit it. This article explores four ways to use the Sockets API to squeeze the greatest performance out of your application and to tune the GNU/Linux® environment to achieve the best results.

    Here is the first paragraph of the article:

    The Sockets API lets you develop client and server applications that can communicate across a local network or across the world via the Internet. Like any API, you can use the Sockets API in ways that promote high performance -- or inhibit it. This article explores four ways to use the Sockets API to squeeze the greatest performance out of your application and to tune the GNU/Linux® environment to achieve the best results.

    Unless Cop (the submitter) is actually M. Tim Jones (the article author), Cop didn't write a darn thing.

    Didn't we just have this discussion on /. a few days ago?

    --
    I want to drag this out as long as possible. Bring me my protractor.
    1. Re:Summary ripped directly from article (again) by c0dedude · · Score: 0, Offtopic

      Fun facts about slashdot editors: Slashdot editors do not edit. Hire the damn copy editor, taco.

      --
      Since when has this country used intellectual elite as a pejorative term?
    2. Re:Summary ripped directly from article (again) by Luyseyal · · Score: 1
      Rob already tried having a copy editor and was unsatisfied with the results.

      -l

      --
      Help cure AIDS, cancer, and more. Donate your unused computer time to worldcommunitygrid.org. Join Team Slashdot!
    3. Re:Summary ripped directly from article (again) by sammy+baby · · Score: 1
      From the story to which you link:

      As an aside, for awhile we actually had an editor reading Slashdot articles and correcting grammatical mistakes. Turns out it doesn't really matter much. People found other things to complain about. It's almost as if some percentage of the population wants to complain. And they will find something to complain about no matter what. Perhaps by leaving a few typos on the site, I am making their day a little easier! Leave them some low hanging fruit I guess.

      No wonder he was unsatisfied with the results. Why fix the minor issues when they do such a good job of obscuring the real ones?
  9. Re:The missing article by jimand · · Score: 0, Offtopic

    Are you a /. editor? If your suggestion appeared on the front page it would be a dupe. See here.

  10. Re:Re:CRITICAL MUST READ by Anonymous Coward · · Score: 0

    Haha quality post! Made me chuckle anyway.

  11. Don't plug a 110 device into 220 or else.... by Anonymous Coward · · Score: 0
    I always get those socket converters, which work sort of well. However, don't plug a 110 tool into 220 or else you'll end up like the poor sap in "Top Secret".


    http://www.contactmusic.com/new/film.nsf/reviews/topsecret

  12. No mention of alternatives to select? by complexmath · · Score: 4, Informative

    Tuning socket parameters is great and all, but the real performance problem with socket IO has to do with using select and poll. There are high-performance alternatives (which admittedly tend to vary from OS to OS) that are so far superior that I wouldn't even consider the default methods unless complete code portability were a crucial factor.

    1. Re:No mention of alternatives to select? by ratboy666 · · Score: 1

      Are select() and poll() really that bad?

      OK, the issue is how many fds you can pass. With select() you are limited to a bitmap's worth. And performance has never been much of an issue.

      Of course, poll() is a different matter -- if you are passing 100s or thousands of fds.

      But, what has this got to do with the tcp connection? Not much.

      So, you speed up poll() and still write small packets, and Nagle won't write them out immediately... That's about the only connection here.

      Ratboy.

      --
      Just another "Cubible(sic) Joe" 2 17 3061
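On the bitmap limit: an fd_set holds descriptors 0 through FD_SETSIZE - 1 (1024 with glibc), and passing a larger descriptor to FD_SET() is undefined behavior. A minimal sketch of a guarded single-descriptor select(); `select_readable` is a hypothetical helper. poll() has no such bitmap limit, but like select() it hands the kernel the full descriptor list on every call, which is where the cost with hundreds or thousands of fds comes from.

```c
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable using select().
 * Refuses descriptors that do not fit in the fixed-size fd_set bitmap.
 * Returns 1 if readable, 0 on timeout, -1 on error or unsupported fd.
 */
int select_readable(int fd, long timeout_ms)
{
    fd_set readfds;
    struct timeval tv;
    int n;

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;                 /* would overflow the select() bitmap */

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    n = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (n == -1)
        return -1;
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}
```

Note that select() rewrites the fd_set in place, which is why real loops rebuild it with FD_ZERO()/FD_SET() before every call.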
    2. Re:No mention of alternatives to select? by complexmath · · Score: 1

      But, what has this got to do with the tcp connection? Not much.

      Agreed. But the topic of his article is "Boost socket performance on Linux," not "How to optimize TCP layer use on Linux." And the article deals almost entirely with API-level settings. It just seems odd to me that he'd not mention issues with some of the classic BSD functions.

    3. Re:No mention of alternatives to select? by hackstraw · · Score: 4, Informative

      Try this:

      http://www.xmailserver.org/linux-patches/nio-improve.html /dev/epoll

      The website is hideous, but there used to be benchmarks against different polling/selecting methods. If I remember correctly, it's kinda trial and error, YMMV kind of stuff. It's worth a look.

    4. Re:No mention of alternatives to select? by statusbar · · Score: 2, Informative

      This page, while out of date, and referenced earlier during this discussion, needs re-emphasis. I hope it gets updated soon:

      http://www.kegel.com/c10k.html

      Very awesome paper. How do _you_ make a server that handles 10,000 connections?

      --jeff++

      --
      ipv6 is my vpn
  13. Code Portability by IAAP · · Score: 1
    There are high-performance alternatives (which admittedly tend to vary from OS to OS) that are so far superior that I wouldn't even consider the default methods unless complete code portability were a crucial factor.

    It's funny you should mention this. I was thinking of the class libraries or frameworks, if you will, included with Java, MFC (if it's still used these days), Visual Age, and so on. Does this mean, and are you saying, that the only way to get the best performance from TCP/IP is to roll your own?

    Yikes!

    1. Re:Code Portability by complexmath · · Score: 5, Informative

      There was a Boost library in the works to encapsulate all of this rather nicely, but I'm not sure if it ever made it out of beta. ACE is another option, though that tends to be overkill for some projects. I rolled my own class wrapper around this stuff, but then I enjoy library programming.

    2. Re:Code Portability by sankeld · · Score: 0
      You're speaking of boost.asio and it is currently being reviewed for inclusion. It has yet to receive a final verdict although things look positive.

      Also, there are no beta libraries in boost. A library is either in boost (and is production grade) or it isn't.

    3. Re:Code Portability by complexmath · · Score: 1

      I'm pretty sure that isn't it. This was a project that had been going for a while and the involved parties weren't able to continue development. Thus it was left unfinished, last I heard. But then I haven't been on the Boost list for a few years now.

  14. Re:Re:CRITICAL MUST READ by Anonymous Coward · · Score: 0

    Check his back catalogue, he's been serious about nothing else since the dawn of time...

  15. Nothing new by Anonymous Coward · · Score: 1, Funny

    going from Socket 7 to Socket 462 to Socket 478 boosted it quite a damn bit over the years.

  16. Already, I'm Confused. by Anonymous Coward · · Score: 0

    Bonus tip: Experimentation with Samba demonstrates that disabling the Nagle algorithm results in almost doubling the read performance when reading from a Samba drive on a Microsoft® Windows® server.

    What is he saying here? Is it faster when using Samba to access a Windows server? Perhaps he was talking about using Samba as a Windows server and making it serve faster with this technique? What it actually says, about running Samba on a Windows server, makes the least sense of all!

  17. GNU/Linux®? by Caspian · · Score: 1

    Why the corporate-style circle-R? Is this a subtle bit of sarcasm or trollery targeting RMS's followers?

    --
    With spending like this, exactly what are "conservatives" conserving?
    1. Re:GNU/Linux®? by wfberg · · Score: 2, Informative

      Most probably it's just IBM policy to always acknowledge someone else's trademarks, so as not to get in trouble. Both GNU (yeah, I know! I knooow..) and Linux are registered trademarks (... of their respective owners, of course..)

      --
      SCO employee? Check out the bounty
    2. Re:GNU/Linux®? by ratboy666 · · Score: 1

      Because they ARE registered trademarks?

      Duh?

      --
      Just another "Cubible(sic) Joe" 2 17 3061
    3. Re:GNU/Linux®? by infinityxi · · Score: 1

      Probably because of content provided at http://www.linuxmark.org/. I'm not 100% sure if that includes GNU/Linux as well, but in terms of the term Linux there is this on that same site: The registered trademark Linux® is used pursuant to a license from Linus Torvalds, owner of the mark in the U.S. and other countries.

      --
      Turn based strategy game that runs over XMPP. Phalanx
  18. IBM is getting some good Linux content... by tcopeland · · Score: 4, Interesting

    ...on developerWorks, not the least of which, if I may say so, is the GLib tutorial I wrote for them this past summer. If you want to learn how to use various GLib collections and utilities (lists, tables, trees, quarks, relations, and all that), check it out. You can even download a nice PDF file for offline perusing.

    Folks who are thinking about writing something technical - give dW a shot. The editors are savvy folks and there's lots of good stuff up there already.

    Oh, and book plug!

    1. Re:IBM is getting some good Linux content... by Anonymous Coward · · Score: 0

      Does your article cover what a steaming pile of shit glibc is?

      If not, I won't be reading it.

    2. Re:IBM is getting some good Linux content... by Anonymous Coward · · Score: 0

      Probably not, given that his article is about glib, not glibc.

  19. I've always wanted to know if it is possible by presarioD · · Score: 1

    to send signals to a network socket without writing code, using some ready-made command-line tool (netstat?)? I've looked around for this but can't seem to find anything...

    --
    Yam, yam, uga booga, yam, yam, yade, yade, uga booga, yam, yam, yade, yade
    1. Re:I've always wanted to know if it is possible by Anonymous Coward · · Score: 0

      Maybe you want something like tcpclient and company, by DJB.

      http://cr.yp.to/ucspi-tcp.html

    2. Re:I've always wanted to know if it is possible by JeanBaptiste · · Score: 1

      telnet

    3. Re:I've always wanted to know if it is possible by m50d · · Score: 1

      There's netcat, which acts very much like cat but will either listen on a given port or output to a given host/port. Not sure whether that's what you're after.

      --
      I am trolling
    4. Re:I've always wanted to know if it is possible by pclminion · · Score: 4, Informative
      Netcat might be what you want. It has two modes, a "client" and "server" mode. In client mode, it connects to an IP/port that you specify, then reads data from stdin and sends it through that socket. In server mode, it listens on a port you specify, and prints any data it received to stdout.

      Is that what you're looking for?
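      For the curious, here is roughly what netcat's client mode amounts to, as a minimal C sketch rather than netcat's actual source: resolve the host, connect over TCP, then copy stdin into the socket until EOF. The function names connect_tcp() and copy_all() are made up for this illustration, and the sketch never reads replies from the server.

```c
#define _POSIX_C_SOURCE 200809L
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy everything from fd `in` to fd `out` until EOF.
   Returns the number of bytes copied, or -1 on error. */
long copy_all(int in, int out)
{
    char buf[4096];
    long total = 0;
    for (;;) {
        ssize_t n = read(in, buf, sizeof buf);
        if (n == 0)
            return total;                 /* EOF on input */
        if (n < 0)
            return -1;
        for (ssize_t off = 0; off < n; ) { /* handle short writes */
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0)
                return -1;
            off += w;
        }
        total += n;
    }
}

/* Connect to host:port over TCP (IPv4 or IPv6).
   Returns a connected socket, or -1 if no address worked. */
int connect_tcp(const char *host, const char *port)
{
    struct addrinfo hints, *res, *ai;
    int s = -1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        s = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (s < 0)
            continue;
        if (connect(s, ai->ai_addr, ai->ai_addrlen) == 0)
            break;                        /* connected */
        close(s);
        s = -1;
    }
    freeaddrinfo(res);
    return s;
}
```

      A client-mode run would then be something like: int s = connect_tcp("example.com", "80"); followed by copy_all(STDIN_FILENO, s);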

    5. Re:I've always wanted to know if it is possible by Anonymous Coward · · Score: 0

      You mean like a packet crafter? Yeah google it...

    6. Re:I've always wanted to know if it is possible by jne_oioioi · · Score: 0

      Socat, for every kind of tinkering you want to do on a socket: http://www.dest-unreach.org/socat/

    7. Re:I've always wanted to know if it is possible by temojen · · Score: 4, Funny
      which acts very much like cat

      It ignores you except at feeding time, and pees in your shoes when it's mad at you?

    8. Re:I've always wanted to know if it is possible by blofeld42 · · Score: 0, Redundant

      Signals, as in Unix signals? kill sends a signal to the process controlling the socket.

      kill -SIGHUP 1234

      If you want to send data to the process running on the socket, just use telnet

      telnet foo.com 80
      GET /index.html

    9. Re:I've always wanted to know if it is possible by presarioD · · Score: 1

      Is that what you're looking for?

      Hmm I'll look into it. Thanks!

      --
      Yam, yam, uga booga, yam, yam, yade, yade, uga booga, yam, yam, yade, yade
    10. Re:I've always wanted to know if it is possible by Anonymous Coward · · Score: 0

      If you'd ever looked at Hobbit's code, you wouldn't have to ask!

  20. Completion Ports by Anonymous Coward · · Score: 0

    I believe that Windows is using (or about to use) 'completion ports' - this is where the network hardware makes a callback directly to an OS-nominated routine. Apparently the idea works very well in practice and, on appropriate hardware, the Windows network stack really flies.

    Can a network I/O engineer advise how true this is, and whether any other OSes are implementing support for this hardware?
    (Yes I do realise that I could Google for the results but I'd like some local opinions here).

    Cheers.

    1. Re:Completion Ports by Anonymous Coward · · Score: 0

      I don't know that much about Completion Ports, but they seem to be the Windows version of the poll()/select() APIs. Nothing really that new, and not covered by this story's article.

      Poll/Select has been used since the days of mainframe terminals. The threading model provided a much cleaner solution on multi-tasking systems, but at a slight performance cost. The only time you'd bother with poll/select is if you're trying to squeeze the last few drops out of a server's performance.

  21. depends on how it's used by j1m+5n0w · · Score: 1
    the real performance problem with socket IO has to do with using select and poll
    That is true, but only under workloads where one process has a lot of sockets open. A (somewhat old) article on this subject is here.
    1. Re:depends on how it's used by Anonymous Coward · · Score: 0

      It doesn't matter how many sockets you have open, it's just as useful with one socket as many.

      If you don't want to block on a network read, use select(). That way you can do other stuff while waiting for network data to arrive.

      It seems a lot of Linux gui programs hang while waiting for network data - they're obviously not taking advantage of this.
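      A minimal sketch of that pattern, assuming the caller checks the socket with a short timeout and goes back to its other work (redrawing the UI, say) when nothing has arrived. wait_readable() is a hypothetical helper name; select() and the FD_* macros are the standard POSIX interface.

```c
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Wait up to timeout_ms for fd to become readable.
   Returns 1 if readable, 0 on timeout, -1 on error.
   Note: fd must be below FD_SETSIZE for select() to handle it. */
int wait_readable(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int rc = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    return rc > 0 && FD_ISSET(fd, &rfds);
}
```

      An event loop would call wait_readable(sock, 50) on each pass and only call read() when it returns 1, so the program never sits blocked in read().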

    2. Re:depends on how it's used by complexmath · · Score: 1

      That is true, but only under workloads where one process has a lot of sockets open. A (somewhat old) article on this subject is here [kegel.com].
      True enough. But how many applications nowadays are written expecting no more than ~32 simultaneous connections?

    3. Re:depends on how it's used by j1m+5n0w · · Score: 1
      It doesn't matter how many sockets you have open, it's just as useful with one socket as many.

      The argument isn't that select and poll don't work with large numbers of sockets; the problem is that they don't scale. Both system calls take, as an argument, a list of file descriptors the program is interested in listening to. If there are hundreds (or thousands) of sockets open, those lists can become unwieldy. The scalable alternatives remember which file descriptors the process has previously asked to be notified about, rather than requiring the process to send the kernel a complete list every time.
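      On Linux, the scalable alternative is epoll (FreeBSD has kqueue). The sketch below shows the key difference: each descriptor is registered once with epoll_ctl(), and epoll_wait() then returns only the ready ones, instead of the program handing the kernel its whole list on every call. watch_readable() and wait_ready() are hypothetical helper names, and the 64-event cap is arbitrary.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Register fd once for readability notifications. Returns 0 or -1.
   The kernel keeps this interest list between calls. */
int watch_readable(int epfd, int fd)
{
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Wait for up to max_ready ready descriptors and copy them into out[].
   Returns the number ready, 0 on timeout, -1 on error. */
int wait_ready(int epfd, int *out, int max_ready, int timeout_ms)
{
    struct epoll_event evs[64];
    if (max_ready > 64)
        max_ready = 64;
    int n = epoll_wait(epfd, evs, max_ready, timeout_ms);
    for (int i = 0; i < n; i++)
        out[i] = evs[i].data.fd;
    return n;
}
```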

      It seems a lot of Linux gui programs hang while waiting for network data - they're obviously not taking advantage of this.
      I don't doubt that that's probably true in some cases. Few client programs have very many sockets open at once, so select or poll would be a good choice.
  22. Re:Huh? by Anonymous Coward · · Score: 0

    I know somebody who thought that's what Linux was, you insensitive clod!

  23. Nagle's algorithm by Jeremi · · Score: 5, Interesting
    For an application where I want both low latency AND high bandwidth, it's not enough to leave Nagle's algorithm on or off. If I leave it on, I'll get increased bandwidth, but >200ms latency due to the Nagle delay. If I leave it off, I get low latency, but the computer will (typically?) send out one network packet per send() call, which means inefficient use of bandwidth unless the calling code is very careful to call send() only with large amounts of data per call.


    To get around the above problems, I came up with the following scheme: Leave Nagle's algorithm enabled, but create a FlushSocket() function that merely disables Nagle on the socket, then calls send() on the socket with a 0-byte buffer, then enables Nagle again. This apparently forces the TCP stack to immediately send any data that it may have accumulated in its Nagle-buffer. Therefore the only thing the calling code has to remember to do is to call FlushSocket() whenever it has called send() one or more times and doesn't think it will be sending any more data any time soon.


    The above technique seems to work pretty well under Linux, Windows, and OS/X (and is more portable than Linux-specific flags like TCP_CORK, etc), but I haven't seen it documented anywhere. Is that simply an oversight, or is there some nasty downside to this technique that I'm overlooking?

    --


    I don't care if it's 90,000 hectares. That lake was not my doing.
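    For comparison, this is roughly what the Linux-specific TCP_CORK approach looks like: cork the socket, issue the small writes, then uncork to push out whatever partial segment is left. A minimal sketch, assuming a connected blocking TCP socket on Linux; set_cork() and send_corked() are hypothetical names, and the code will not compile on systems that lack TCP_CORK.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Turn the cork on (1) or off (0). Returns 0 or -1. */
int set_cork(int s, int on)
{
    return setsockopt(s, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
}

/* Send a header and a body while corked, so the kernel can pack them
   into full segments, then uncork to flush the remainder.
   A short send (unusual on a blocking socket) is treated as an error. */
int send_corked(int s, const void *hdr, size_t hlen,
                const void *body, size_t blen)
{
    if (set_cork(s, 1) < 0)
        return -1;
    int ok = send(s, hdr, hlen, 0) == (ssize_t)hlen &&
             send(s, body, blen, 0) == (ssize_t)blen;
    if (set_cork(s, 0) < 0)   /* uncorking pushes out any queued partial segment */
        return -1;
    return ok ? 0 : -1;
}
```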
    1. Re:Nagle's algorithm by convolvatron · · Score: 2, Interesting

      aren't you just drastically increasing the number of system calls you have to pay for?

      if you have some knowledge about the natural grouping of data, it would be better to just turn nagle off and do buffering in user space (collect up enough data and send it all in one go)
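      A minimal sketch of that user-space buffering, assuming a fixed 4 KB buffer (an arbitrary size) and a blocking socket: small messages are appended with outbuf_append() and go out in one write() when the buffer fills or when the caller calls outbuf_flush(). The struct and function names are hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define OUTBUF_SIZE 4096   /* arbitrary; tune to your message sizes */

struct outbuf {
    int fd;
    size_t len;
    char data[OUTBUF_SIZE];
};

/* write() until all n bytes are out. Returns 0 or -1. */
static int write_all(int fd, const char *p, size_t n)
{
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w < 0)
            return -1;
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Send everything buffered so far with as few write() calls as possible. */
int outbuf_flush(struct outbuf *b)
{
    if (write_all(b->fd, b->data, b->len) < 0)
        return -1;
    b->len = 0;
    return 0;
}

/* Queue n bytes; flush first if they would not fit. */
int outbuf_append(struct outbuf *b, const void *p, size_t n)
{
    if (b->len + n > sizeof b->data && outbuf_flush(b) < 0)
        return -1;
    if (n > sizeof b->data)          /* too big to buffer: send directly */
        return write_all(b->fd, p, n);
    memcpy(b->data + b->len, p, n);
    b->len += n;
    return 0;
}
```

      With Nagle turned off, the caller decides packet boundaries by choosing when to call outbuf_flush(), typically just before waiting for a reply.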

    2. Re:Nagle's algorithm by slashdotmsiriv · · Score: 1

      if you have some knowledge about the natural grouping of data, it would be better to just turn nagle off and do buffering in user space (collect up enough data and send it all in one go)

      It is not about the "natural grouping" of the data in user space. Most programmers do this "natural grouping" anyway, and write the data to the socket in a single buffer only when they want it to be sent. The problem is that sometimes their grouping is not good enough, and they perform multiple writes when they could perform only one, causing the short-packet problem. You don't want the programmer to have to worry about MTU sizes or about the best packet size by which to group the data.

      I think the best solution would be to have Nagle's on by default to address these issues, and to have a simple system call, flush(), that forces the transmission of a segment, to be used whenever you write a small buffer with time-sensitive data.

    3. Re:Nagle's algorithm by buck68 · · Score: 3, Informative

      You may be interested in a paper we wrote a few years back [1]. We also started with the premise that some applications require both minimal latency and maximal bandwidth. In our case the application was our own media streaming system. We came up with our own patch to TCP (in Linux). The patch provided a new socket option, which we call TCP_MINBUF. The idea is that you need a certain minimum amount of buffer to allow TCP's congestion window to function, but no more. Indeed, in the paper we show that the delay due to socket buffer beyond the congestion window is often by far the dominant source of latency, not retransmissions, delayed ACKs, or all the other more commonly cited things. So basically, what TCP_MINBUF does is dynamically size the socket buffer to follow the congestion window size. It had a huge impact on latency.

      [1] "Supporting Low Latency TCP-Based Media Streams", Ashvin Goel, Charles Krasic, Kang Li, and Jonathan Walpole. Tenth International Workshop on Quality of Service (IWQoS), May 2002.

      http://www.eecg.toronto.edu/~ashvin/publications/iwqos2002.pdf

    4. Re:Nagle's algorithm by wtarreau · · Score: 1

      To get around the above problems, I came up with the following scheme: Leave Nagle's algorithm enabled, but create a FlushSocket() function that merely disables Nagle on the socket, then calls send() on the socket with a 0-byte buffer, then enables Nagle again.

      I tried this in the past and it was not that good because of the added syscalls. In a pure network application, your worst enemies are syscalls. By avoiding this trick and carefully grouping your data into large writes, you reduce syscalls twice over: fewer calls to write(), and no more pair of setsockopt() calls per flush. It's a win-win, and I'm not considering going back.

      willy

    5. Re:Nagle's algorithm by wtarreau · · Score: 1

      I think the best solution would be to have Nagle's on by default to address these issues and having a simple system call flush() that forces the transmission of a segment to be used whenever ever you write a small buffer with time-sensitive data.

      Not exactly; you'd be better off with a send() flag to tag data to be sent immediately. No need to slow down your application with another syscall.

      willy

    6. Re:Nagle's algorithm by IBitOBear · · Score: 1

      In the Linux kernel you don't need to do the empty send(). Turning Nagle off causes an immediate flush, so it is enough to strobe Nagle off and then back on.

      At one point I submitted a patch that would add a TCP_FLUSH "option" that saved the TCP_CORK and TCP_NDELAY flag values, called the low-level flush routine, and then reestablished the flags.

      It was rejected. (But I still use it from time to time on my own, love that Open Source. 8-)

      Meanwhile, just drop and restore Nagle as fast as you can, it will save you a needless syscall for the send().

      --
      Innocent people shouldn't be forced to pay for inferior software development.
      --"Code Complete" Microsoft Press
    7. Re:Nagle's algorithm by Jeremi · · Score: 1
      In the Linux kernel you don't need to do the empty send(). Turning Nagle off causes an imediate flush so it is enough to strobe nagle off and then back on.


      That's a good point -- the only reason the send() is in there was because otherwise the trick doesn't work under MacOS/X. I will #ifndef __linux__ that line in my code though.

      --


      I don't care if it's 90,000 hectares. That lake was not my doing.
  24. Nagle algorithm by Anonymous Coward · · Score: 0

    Their description of Nagling seems a little oversimplified. Unless I'm mistaken, the Nagle algorithm acts to consolidate short packets only when TCP is waiting on an acknowledgement from a previously-transmitted packet.

    Consequently, using TCP_NODELAY wouldn't necessarily make a difference in the Telnet example the article cites... at least, not as long as the ping time is better than the user's typing speed.
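    That matches Nagle's rule as described in RFC 896: a small segment is held back only while earlier data is still unacknowledged. A simplified model of the decision, not actual kernel code; nagle_can_send() and its parameters are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* May the sender transmit its queued data right now?
   queued_bytes:  data waiting in the send buffer
   mss:           maximum segment size
   unacked_bytes: data already sent but not yet acknowledged
   nodelay:       true if TCP_NODELAY has disabled Nagle */
bool nagle_can_send(size_t queued_bytes, size_t mss,
                    size_t unacked_bytes, bool nodelay)
{
    if (queued_bytes == 0)
        return false;                 /* nothing to send */
    if (nodelay || queued_bytes >= mss)
        return true;                  /* Nagle off, or a full segment */
    return unacked_bytes == 0;        /* small segment: only if nothing is in flight */
}
```

    On an idle connection (unacked_bytes == 0), a single keystroke goes out immediately, which is why TCP_NODELAY changes little for a Telnet user whose keystrokes are further apart than the round-trip time.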

  25. libevent by Anonymous Coward · · Score: 0

    Use libevent.

  26. Whoa... by Anonymous Coward · · Score: 0

    The Sockets API lets you develop client and server applications that can communicate across a local network or across the world via the Internet.

    This kind of knowledge is why I keep pressing the "reload current page" button after subscribing to slashdot.

  27. flush( sd ) would be nice by slashdotmsiriv · · Score: 1

    Wouldn't it be nice if C programmers were given an option similar to what fflush does for streams? Something like flush(sd) whenever you need to ignore Nagle's algorithm. In this way you can enable and disable nagling dynamically in your program without calling setsockopt to switch nagling on and off. This option is given for Java since you can easily convert a socket to any type of stream you wish, while most Stream objects have a member function flush(). Perhaps I am wrong and such an interface is already provided in C but I personally never found one, while the necessity for it appears to be obvious.

    1. Re:flush( sd ) would be nice by Jeremi · · Score: 1
      Perhaps I am wrong and such an interface is already provided in C but I personally never found one, while the necessity for it appears to be obvious.


      See my previous post above ("Nagle's Algorithm") for a way to do it.

      --


      I don't care if it's 90,000 hectares. That lake was not my doing.
    2. Re:flush( sd ) would be nice by slashdotmsiriv · · Score: 1

      Yes I read your post before and it seems like a nice way to do it. However I am asking for a clean system call solution-flush() that would do this without invoking setsockopt(). Also could you post the code of your Flush function? I find the description a little confusing at some points.

    3. Re:flush( sd ) would be nice by Jeremi · · Score: 1
      However I am asking for a clean system call solution-flush() that would do this without invoking setsockopt().


      I agree, that would be nice... good luck getting it into the POSIX standard anytime soon though. :^(


      Also could you post the code of your Flush function? I find the description a little confusing at some points.


      Sure, here is the code:

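      #include <stdbool.h>
      #include <netinet/in.h>    /* IPPROTO_TCP */
      #include <netinet/tcp.h>   /* TCP_NODELAY */
      #include <sys/socket.h>    /* setsockopt(), send() */

      /* Enable or disable Nagle's algorithm (TCP_NODELAY is its inverse). */
      void SetNaglesEnabled(int s, bool enabled)
      {
            int enableNoDelay = enabled ? 0 : 1;
            setsockopt(s, IPPROTO_TCP, TCP_NODELAY, (char *) &enableNoDelay, sizeof(enableNoDelay));
      }

      /* Push out anything Nagle is holding back, then turn Nagle back on. */
      void FlushSocketOutput(int s)
      {
            SetNaglesEnabled(s, false);
            send(s, NULL, 0, 0);   /* needed for Mac OS X; on Linux, toggling TCP_NODELAY alone flushes */
            SetNaglesEnabled(s, true);
      }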

      --


      I don't care if it's 90,000 hectares. That lake was not my doing.
    4. Re:flush( sd ) would be nice by multipartmixed · · Score: 1

      That's essentially the solution I would suggest. Note: I have a good sockets background, but I've never needed to do something like this.

        - disable nagle
        - set blocking mode
        - set tcp buffer to 0 bytes
        - write 0 bytes
        - put things back the way they were

      ...I suspect this would have the fflush()-like functionality he's looking for, not that I've ever tried it!

      Recall that fflush() blocks until the data makes it to disk; I expect he'd want to block until the socket buffers were empty, too.

      Note 2 - this might also be OS-dependent. Read POSIX.1 for a better clue.

      --

      Do daemons dream of electric sleep()?
    5. Re:flush( sd ) would be nice by Jeremi · · Score: 2, Interesting
      Recall that fflush() blocks until the data makes it to disk; I expect he'd want to block until the socket buffers were empty, too.


      I don't know if that really makes sense for networking, though... the reason you'd want fflush() to block until the data makes it to disk is so that once your call to fflush() returns, you know that your written data is safe in the event of a crash or power failure. (Although with too-clever hard drive firmware, I'm not so sure even that's true anymore!) With networking, on the other hand, even once the data has left your Ethernet port there is no guarantee that it will get to its destination... so what would be the purpose of waiting?

      --


      I don't care if it's 90,000 hectares. That lake was not my doing.
    6. Re:flush( sd ) would be nice by multipartmixed · · Score: 1

      > so what would be the purpose of waiting?

      Beats the hell out of me!

      I've noticed that in the vast majority of instances (but not all) programmers looking for this type of solution are trying to apply a band-aid to a poor design anyhow. I try not to think too hard about poor designs. :)

      --

      Do daemons dream of electric sleep()?
  28. Math error in paper? by Stiletto · · Score: 2, Informative


    throughput = window_size / RTT

    110KB / 0.050 = 2.2MBps

    If instead you use the window size calculated above, you get a whopping 31.25MBps, as shown here:

    625KB / 0.050 = 31.25MBps


    That's funny, I get 12.5MBps

    ???
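    The arithmetic in question, as a small C sketch: throughput = window_size / RTT, with bytes and seconds in and bytes per second out. Dividing 625 KB by 0.050 s gives 12.5 MB/s, as the parent says; 31.25 is what you get by multiplying 625 by 0.050 instead. The second helper, bdp_window_bytes(), is an added illustration of the usual bandwidth-delay-product rule for sizing a window (for example, a 100 Mbit/s link with a 50 ms RTT needs 625,000 bytes in flight); both function names are hypothetical.

```c
/* throughput = window_size / RTT
   window_bytes in bytes, rtt_seconds in seconds -> bytes per second. */
double tcp_throughput(double window_bytes, double rtt_seconds)
{
    return window_bytes / rtt_seconds;
}

/* Bandwidth-delay product: the window (in bytes) needed to keep a link
   of link_bits_per_sec full when the round-trip time is rtt_seconds. */
double bdp_window_bytes(double link_bits_per_sec, double rtt_seconds)
{
    return link_bits_per_sec / 8.0 * rtt_seconds;
}
```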

    1. Re:Math error in paper? by Tom+Young · · Score: 1

      We've fixed this in the article... thanks for pointing it out.

      Boost socket performance on Linux

      Tom Young
      dW Linux editor

  29. Socket tuning by Julian+Morrison · · Score: 1

    I've used Azureus a lot on my Linux box, and one of its features is tunability and graphs. Number of connections, max up and down, etc, and watch the results. Now, I have a very asymmetric line (10:1 ratio). I've noticed that trying to use maximum upload and download at once can create sinewave patterns of slow response that look a lot like resonant feedback, and in extreme cases can wedge the line completely, throughput zero on all net apps. Running uploads at 20K and leaving the top 5K unused gets a far better total rate both up and down.

    What I'm wondering is, might it be possible to make these sort of calculations in kernel, detect congestion feedback and back off automatically? I'm not talking about the regular exponential backoff algorithm, but about some sort of best-rate prediction based on detecting the characteristic shape of feedback waves and backing off until they disappear.

    1. Re:Socket tuning by winphreak · · Score: 0

      Uploading at max is more like a DoS of your modem. Yes, it does basically "eat" the connection. Unfortunately, since my DSL isn't 1:1, I cap my download-to-upload ratio at 5:1 (a personal limit), just to be nice, because a D/U ratio of 15:1 isn't nice to the BT community.

      --
      "I'm a well-wisher, in that I don't wish you any specific harm."
    2. Re:Socket tuning by IBitOBear · · Score: 1

      I have found that I also get this pattern, and it has a lot to do with overloading the transmit feed on my cable modem. The cable modem works as a strict time-division multiplexer. So to get maximum throughput you want to keep the transmit buffer full, but if you overfill it, the packets are silently discarded. As your number of connections goes up, the likelihood of overrunning your modem buffer approaches certainty.

      I run a Linux firewall, so I put in a six layer quality-of-service set. I put "very small" TCP packets in the highest priority, then ssh traffic, then games, then regular services, then bulk services. The absolute throttle is then set to just a hair larger (~5%) than the theoretical maximum output.

      [Don't bother with any receive rate throttling, there is no point.]

      This greatly increases the period of the overflow; if the period becomes longer than the BitTorrent segment size, you get essentially uniform performance.

      In Azureus I also cap the upload speed to about 90% of the cable upload speed to get (both) uniform utilization [which is all you _really_ have to do] and still keep the interactive web experience at "no apparent delay".

      I didn't tune just for Azureus; I can also game (Unreal Tournament 2004) while my roommates use the net, and suffer virtually no lag.

      --
      Innocent people shouldn't be forced to pay for inferior software development.
      --"Code Complete" Microsoft Press
    3. Re:Socket tuning by voxel · · Score: 1

      This is something a good router will take care of, but very few of them do: neither the customized Linksys routers nor Linux routers like Smoothwall do BOTH of the required things.

      All you need to do to get outstanding performance on an asymmetric line is the following:

      [ON THE ROUTER]
      1. Prioritize TCP ACK Packets as #1 to always go upstream first to your connection
      2. Restrict upload rate by 2% - 5% of the actual upload rate of the connection

      Do these two things, and enjoy a fast connection in both directions.

      Your downloads choke when you upload because the TCP ACK packets that need to go UP to the site you are downloading from (to get the next chunk of your download) can't get out while you're saturating your upload connection. If you prioritize these ACK packets, they go out first, and your download runs at full speed while your uploads fly as well.

      The reason for #2 is that most DSL and cable modems have a large buffer in them... If you saturate this buffer, then your router can't guarantee the TCP ACK packets go out first, since other packets are already sitting in your DSL/cable modem's buffer... So it is very important to do #2.
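      The two router rules can be sketched as follows. This is a hypothetical illustration, not code from any router firmware: is_bare_ack() picks out pure ACKs (ACK flag set, no payload) for a priority queue that pick_queue() always serves first, and a token bucket paces output at a configurable rate, which for rule #2 would be set 2 to 5 percent below the modem's upload rate.

```c
#include <stdbool.h>
#include <stddef.h>

/* Rule 1 classifier: a bare ACK has the ACK flag and no payload. */
bool is_bare_ack(bool ack_flag, size_t payload_len)
{
    return ack_flag && payload_len == 0;
}

/* Serve the ACK queue first. 0 = ACK queue, 1 = bulk queue, -1 = idle. */
int pick_queue(size_t acks_waiting, size_t bulk_waiting)
{
    if (acks_waiting > 0)
        return 0;
    return bulk_waiting > 0 ? 1 : -1;
}

/* Rule 2: token bucket. rate would be about 0.95-0.98 x the upload rate. */
struct token_bucket {
    double rate;     /* bytes added per second */
    double burst;    /* maximum tokens that can accumulate */
    double tokens;   /* tokens currently available */
};

/* Credit tokens for `elapsed` seconds, then try to spend `bytes`.
   Returns true if a packet of that size may be sent now. */
bool bucket_allow(struct token_bucket *tb, double elapsed, double bytes)
{
    tb->tokens += tb->rate * elapsed;
    if (tb->tokens > tb->burst)
        tb->tokens = tb->burst;
    if (tb->tokens < bytes)
        return false;
    tb->tokens -= bytes;
    return true;
}
```

      Because the bucket never lets more than the configured rate out, the modem's own buffer stays nearly empty, so the router's ordering (ACKs first) is what actually reaches the wire.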

      There are more things you can do to help, but they get more sophisticated. If you are interested in managing asymmetric connections, check out this document, which covers everything including the kitchen sink:

      http://www.faqs.org/docs/Linux-HOWTO/ADSL-Bandwidth-Management-HOWTO.html

      - Voxel

      --
      Modesty is one of life's greatest attributes
  30. It's "GNU divided by Linux" by tepples · · Score: 1

    The GNU community has been divided by Linux into two subsets: dogmatic types who work on HURD and the pragmatic majority who just use what they call "GNU/Linux". Therefore, "GNU/Linux" (also spelt "GNU÷Linux") is completely accurate :-)

  31. Hello 2003. by jd · · Score: 4, Interesting
    The paper is 2 years, 2 months old. Many of the arguments will still be valid, but the code in all cases will have evolved considerably. In addition, other code has certainly been developed (there's a hard real-time UDP patch for Linux, for example) and the state of affairs is - if anything - much more muddled today.


    Documentation like this is great and extremely valuable. It would be much more valuable, however, if it remained current. For example, can the ABISS project (which improves block I/O) be used at all? What do the numbers look like, when using profiling tools like Web100 (which profiles TCP communications)?


    Has anyone run the Linux or one of the *BSD kernels through DAKOTA, KOJAK or PAPI to determine where, precisely, bottlenecks are within the kernels? It's easy to theorise, but isn't it cleaner to measure?


    Now, I'm not saying these things aren't being done. They probably are, somewhere, by someone, but if the results aren't getting published we don't really know what impact what changes are going to have. The current method of evolving Operating System code in general is often a mix of personal theory and subjective experience based on non-random samples of activity. That can't really be a good way to do things, can it?


    If I'm wrong, feel free to say. If I'm right, then maybe it would be a good thing if someone (possibly me) put together some kind of testing kit for measuring Linux kernel performance and actually measured the stats for Linux kernels on some kind of regular basis.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  32. GNU/Linux® by McGiraf · · Score: 1

    GNU/Linux® ®? WTF®

    1. Re: GNU/Linux® by Antity-H · · Score: 1

      I believe it has to do with the recent trademarking of linux, see the linuxmark institute

  33. Nothing to do with Nagle by Anonymous Coward · · Score: 0

    The 200-millisecond delay you're experiencing is the delayed ACKs, which is independent of Nagle. Well, delayed ACKs are incompatible with Nagle. I have implemented TCP with Nagle and no delayed ACKs.

    The reason for delayed ACKs is that the OS would like to withhold sending an empty ACK packet right away because the application is likely to respond to the packet. So the kernel implements a hack, whereby it waits a while (200 ms) to give the application time to react.

    The socket API could be enhanced easily to get rid of the 200-millisecond penalty. For example, calling send(2) with zero bytes (maybe with a special flag) could allow the application to tell the kernel that a response is not forthcoming.

  34. man nc by Solilok · · Score: 1

    "tcp/ip's swiss army knife"

  35. Always liked the Winsock Lame List by MerlynEmrys67 · · Score: 1
    Of course it is rather windows centric, but most of the issues apply across platforms (only a few talk about WSA functions)

    Even so, the Lame List contains a lot of wonderful nuggets.

    I must disagree with the article, however: there are so SO few times that disabling the Nagle algorithm is the correct answer that the standard reply, when someone asks about it on the networking forums, is that the asker doesn't understand Nagle and should re-enable it. Telnet is even a bastard case in that your networking performance may actually go UP sending smaller bursts of network characters, rather than one at a time, each in its own packet. But you have to measure your own performance.

    Frankly, none of these suggestions will get you ultimate performance from a 10 Gig networking stack, and that is where networking finally becomes fun.

    --
    I have mod points and I am not afraid to use them
  36. ...what about UDP? by bani · · Score: 0, Offtopic

    the article is all about TCP, which is great. how about an article on optimizing UDP though?

  37. The trouble with the Nagle algorithm by Animats · · Score: 4, Interesting
    I really should fix the bad interaction between the "Nagle algorithm" and "delayed ACKs". Both ideas went into TCP around the same time, and the interaction is terrible. That fixed timer for ACKs is all wrong.

    Here's the real problem, and its solution.

    The concept behind delayed ACKs is to bet, when receiving some data from the net, that the local application will send a reply very soon. So there's no need to send an ACK immediately; the ACK can be piggybacked on the next data going the other way. If that doesn't happen, after a 500ms delay, an ACK is sent anyway.

    The concept behind the Nagle algorithm is that if the sender is doing very tiny writes (like single bytes, from Telnet), there's no reason to have more than one packet outstanding on the connection. This prevents slow links from choking with huge numbers of outstanding tinygrams.

    Both are reasonable. But they interact badly in the case where an application does two or more small writes to a socket, then waits for a reply. (X-Windows is notorious for this.) When an application does that, the first write results in an immediate packet send. The second write is held up until the first is acknowledged. But because of the delayed ACK strategy, that acknowledgement is held up for 500ms. This adds 500ms of latency to the transaction, even on a LAN.

    The real problem is that 500ms unconditional delay. (Why 500ms? That was a reasonable response time for a time-sharing system of the 1980s.) As mentioned above, delaying an ACK is a bet that the local application will reply to the data just received. Some apps, like character echo in Telnet servers, do respond every time. Others, like X-Windows "clients" (really servers, but X is backwards about this), only reply some of the time.

    TCP has no strategy to decide whether it's winning or losing those bets. That's the real problem.

    The right answer is that TCP should keep track of whether delayed ACKs are "winning" or "losing". A "win" is when, before the 500ms timer runs out, the application replies. Any needed ACK is then coalesced with the next outgoing data packet. A "lose" is when the 500ms timer runs out and the delayed ACK has to be sent anyway. There should be a counter in TCP, incremented on "wins", and reset to 0 on "loses". Only when the counter exceeds some number (5 or so), should ACKs be delayed. That would eliminate the problem automatically, and the need to turn the "Nagle algorithm" on and off.
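    The proposed counter could look something like the following. This is a hypothetical sketch of the idea as described, not code from any TCP stack; the names are invented, and the threshold of 5 follows the "5 or so" above. One assumption: a "win" is counted whenever the application replies before the timer would have fired, even if the ACK was not actually delayed, so the counter can climb from zero.

```c
#include <stdbool.h>

#define DELACK_WIN_THRESHOLD 5   /* "5 or so", per the proposal */

struct delack_state {
    unsigned wins;   /* consecutive wins since the last loss */
};

/* Delay the ACK only after the counter exceeds the threshold. */
bool should_delay_ack(const struct delack_state *st)
{
    return st->wins > DELACK_WIN_THRESHOLD;
}

/* The application replied within the delay window: a "win".
   (Assumption: counted even if the ACK was sent immediately.) */
void delack_win(struct delack_state *st)
{
    st->wins++;
}

/* The delay window ran out before any reply: a "lose", reset to 0. */
void delack_lose(struct delack_state *st)
{
    st->wins = 0;
}
```

    A single loss switches delayed ACKs off until the application has again replied promptly several times in a row, so a write-write-read client stops paying the timer on every transaction.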

    So that's the proper fix, at the TCP internals level. But I haven't done TCP internals in years, and really don't want to get back into that. If anyone is working on TCP internals for Linux today, I can be reached at the e-mail address above. This really should be fixed, since it's been annoying people for 20 years and it's not a tough thing to fix.

    The user-level solution is to avoid write-write-read sequences on sockets. write-read-write-read is fine. write-write-write is fine. But write-write-read is a killer. So, if you can, buffer up your little writes to TCP and send them all at once. Using the standard UNIX I/O package and flushing write before each read usually works.

    John Nagle
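    The user-level advice can be sketched with stdio: wrap the socket in a fully buffered FILE*, let the small writes collect there, and fflush() once before blocking in read(), turning write-write-read into write-read. A minimal sketch, assuming a connected blocking socket; open_write_stream() and write_pieces_then_read() are hypothetical names. The dup() is there so that fclose() on the stream does not close the socket the reader is still using.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Wrap a connected socket in a fully buffered stdio stream for writing. */
FILE *open_write_stream(int fd)
{
    int copy = dup(fd);              /* fclose() will close only the copy */
    if (copy < 0)
        return NULL;
    FILE *out = fdopen(copy, "w");
    if (out == NULL) {
        close(copy);
        return NULL;
    }
    setvbuf(out, NULL, _IOFBF, BUFSIZ);   /* explicit full buffering */
    return out;
}

/* Write several small pieces, flush once, then wait for the reply.
   Returns bytes read, or -1 on error. */
ssize_t write_pieces_then_read(FILE *out, int fd,
                               const char *const *pieces, size_t n,
                               char *reply, size_t cap)
{
    for (size_t i = 0; i < n; i++)
        if (fputs(pieces[i], out) == EOF)  /* lands in the stdio buffer */
            return -1;
    if (fflush(out) == EOF)   /* one write() while the total fits in BUFSIZ */
        return -1;
    return read(fd, reply, cap);           /* only now block for the peer */
}
```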

    1. Re:The trouble with the Nagle algorithm by kris_lang · · Score: 1

      Ah, so you are the Nagle of the algorithm? How about an extension onto TCP as a concept:

      you can tell TCP that you are willing to accept d amount of delay, with the default being the 500 ms previously used and assigned. Thus protocols like X could state that they don't need to hang waiting for an ACK, while programs that should hang waiting for ACK will continue to do so.

      This extension would only require recompiling the programs that attempt to not do the prior default action of that delay, such as recompiling X11 or XFree86, and could be transparent/invisible to programs that do not care.

      On the obverse side, emails to Mars, or TCP packets to a satellite orbiting Mars (or Saturn, or beyond...), could accept a much larger delay, allowing more to be sent before an ACK is received. In other words, a Nagle algorithm with an additional calling parameter that adjusts the 500 ms delay as needed.

      What do you think?

  38. Re:Never trust an article with a (R) symbol... by level_headed_midwest · · Score: 1

    Linux IS a registered trademark, you know. Especially if you are an Australian...

    --
    Just "gittin-r-done," day after day.
  39. Parent? by Anonymous Coward · · Score: 0


    hahahaha!

    Geez. That was so simple. Couldn't you come up with a personality that requires a little more thought?

    Loser.

    --
    Trolling all trolls since 2001.

  40. Pining for Doors by aminorex · · Score: 1

    What really bums me out about doing network services on the Linux platform is that Linux does not support doors, a la Solaris, so you can't have multiple processes collaborating on a single socket service without a scheduler burp. There was a guy who implemented doors for 2.4, but his code was never adopted into the kernel, and now it's rotting away....

    Linux is quite tragic that way. Hopefully there will be a Debian user-land on the OpenSolaris kernel soon, and then I can rock-n-roll again.

    --
    -I like my women like I like my tea: green-
    1. Re:Pining for Doors by Anonymous Coward · · Score: 0

      As far as I can tell, doors on Solaris are like a local-only, synchronous-only implementation of Windows named pipes. Since they don't work over the network, I don't see how they would help you.

      dom

  41. Reason for the error by sonofdelphi · · Score: 1

    yeah, it seems to be an error.
    looks like they were doing the calculations with a calculator and somebody pressed '*' instead of '/' !!!!

    625 * 5 = 3125

    a forgivable slip of the finger's tip.

  42. Re:GNU/Linux®...A less efficient way to say Lin by JohnQPublic · · Score: 1

    In this case, though, "GNU/Linux®" isn't just overly wordy, it's incorrect. The advice is all about tuning the kernel's TCP stack, which is pure Linux®.