Slashdot Mirror


Zero-Copy TCP and UDP Output in NetBSD

-is writes "Jason R. Thorpe has recently added experimental code to NetBSD-current that enables zero-copy for TCP and UDP on the transmit side. These changes could mean significant performance improvements for FTP, WWW, and Samba servers. See Jason's announcement to the current-users mailing list for details." From the text: " On tests on an embedded system with limited memory bandwidth, TCP transmit performance on 100baseTX-FDX went from ~6500KB/s to ~11100KB/s, a significant improvement." Excellent!

8 of 74 comments

  1. Re:Packet loss by AdamBa · · Score: 5, Informative
    You know the packet was received when you get an ack for it (for TCP; with UDP you can just dump it to the network card and forget about it, since delivery is unreliable). Obviously in this situation you need to hold on to the user's buffer until you have gotten an ack, otherwise you won't have the data if you need to retransmit. But you want to optimize for the general case, which is that you do get an ack real quick and then you can just return the buffer...I suppose a clever TCP implementation using this would have some spare buffers, and after some time without an ack (maybe the first retransmission, maybe sooner), it would copy the data and give the user's buffer back. But again that is a rare case so the extra copy there doesn't matter.
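
    The hold-until-ack idea above can be sketched as a toy model in Python. This is only an illustration, not how NetBSD does it: the `SendQueue` and `PendingSend` names, the per-segment sequence numbers, and the `returned` list are all hypothetical.

```python
class PendingSend:
    """One unacknowledged segment that still points at the caller's buffer."""

    def __init__(self, seq, view):
        self.seq = seq
        self.data = view        # memoryview: borrowed from the user, not copied
        self.borrowed = True


class SendQueue:
    """Toy model: keep the user's buffer until the ack, copy only on timeout."""

    def __init__(self):
        self.pending = {}
        self.returned = []      # seqs whose user buffers have been handed back

    def send(self, seq, user_buffer):
        # Zero-copy path: remember a reference to the user's memory.
        self.pending[seq] = PendingSend(seq, memoryview(user_buffer))

    def on_ack(self, seq):
        seg = self.pending.pop(seq)
        if seg.borrowed:        # common case: ack arrived before any timeout
            seg.data.release()
            self.returned.append(seq)

    def on_timeout(self, seq):
        # Rare case: take a private copy so the user's buffer can go back early.
        seg = self.pending[seq]
        if seg.borrowed:
            private = bytes(seg.data)
            seg.data.release()
            seg.data = private
            seg.borrowed = False
            self.returned.append(seq)
        return seg.data         # bytes to retransmit
```

    The copy happens only on the timeout path, so the common case stays copy-free.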

    As for how noticeable they will be...according to the email, it increased from 52 megabits/second to almost 90 megabits/second. But transmitting over a series of hops (like on the Internet), your speed is gated by the slowest link or router, and it is doubtful the whole end-to-end chain will be able to handle 90 Mbps (or 52 Mbps for that matter). So even with a monster send window you would still need to stop transmitting well before you got an ack and could send more.
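
    For reference, the conversion behind those figures, assuming the announcement's "KB" means 1000 bytes (with 1024-byte KB the results come out slightly higher):

```python
def kb_per_s_to_mbps(kb_per_s, bytes_per_kb=1000):
    """Convert a KB/s figure to megabits per second (1 Mbps = 10**6 bits/s)."""
    return kb_per_s * bytes_per_kb * 8 / 1_000_000


before = kb_per_s_to_mbps(6500)     # figure quoted before the change
after = kb_per_s_to_mbps(11100)     # figure quoted after the change
print(f"{before:.1f} Mbps -> {after:.1f} Mbps, ratio {after / before:.2f}")
```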

    Of course on a local 100 Mbps Ethernet playing networked games this could be quite nice.

    - adam

  2. what about zero copy on receive? by AdamBa · · Score: 3, Interesting
    I'm surprised nobody has come up with hardware to do this. The problem is the network card needs to know about the user's buffer ahead of time. In the old days (i.e. 5 years ago) you had a mix of Netbeui and IPX and TCP on a network and it didn't make much sense to make a card intelligent enough to figure out where to put packets.

    But now all anyone cares about is TCP. Furthermore, a typical copy of data to a server goes something like:

    1) packet sent by the client to a known port on the server
    2) a few packets to set things up and assign a dedicated server port
    3) lots of data blasting from the client to the dedicated server port
    4) some cleanup packets at the end

    Step 3 is what you care about. So you would need to tell the network card, when you get packets for this port, put the data in this buffer in the order received, and put the headers here (in some small header-sized buffers TCP would also provide). Now you might get bad checksums (although the hardware could check that also) or drops or out of order, then you would need to rearrange...but in the 99%+ normal case you get all the packets in order with valid checksums. So the card stuffs the data in the right place, TCP checks the header buffers to make sure everything is kosher, and boom your data is in memory with no copies and off to disk (or wherever) it goes.

    You need some other stuff like TCP has to be able to hint this to the network card driver, and figure out if more than one app is using a port (so it can turn all this optimization off) and so on. But hey when it worked it would be cool.

    The other way this would work is if the network card was set up with a big chain of receive buffers and it would actually hand a buffer up to TCP (so it got taken out of the chain) and then eventually it would get it back...but this requires a lot of trust of the levels above TCP that ultimately decide when the receive data isn't needed anymore.

    As Dilbert said this weekend...if you can understand the preceding, you have my sympathy.

    - adam

    1. Re:what about zero copy on receive? by Espen+Skoglund · · Score: 3, Insightful
      The problem is not necessarily that of early demultiplexing of incoming packets to the appropriate ports. The problem is that in order to achieve zero-copy you'll have to store the packet at the appropriate location in memory. I.e., you must store the incoming packet in the buffer location that the user level application uses for receive. Now, obviously you can not store the packet there unless the user has requested to receive more data (the buffer may be used for other purposes). A solution to this problem is to program all the applications so that their receive buffers are aligned on page-boundaries, and the page containing the receive buffer is only used for containing the receive buffer. This allows the kernel to receive incoming data onto empty pages and map these pages into the application when the application eventually issues a receive operation.
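
      From the application side, the page-alignment discipline described above might look like the following Python sketch. It only shows a receive buffer that starts on a page boundary and owns whole pages; whether a kernel actually remaps pages instead of copying is outside what this code can show. The helper names are hypothetical.

```python
import ctypes
import mmap
import socket

PAGE = mmap.PAGESIZE


def alloc_receive_buffer(pages):
    """Anonymous mapping: starts on a page boundary and spans whole pages."""
    return mmap.mmap(-1, pages * PAGE)


def buffer_address(buf):
    """Address of the first byte of the mapping (used to check alignment)."""
    return ctypes.addressof(ctypes.c_char.from_buffer(buf))


rx = alloc_receive_buffer(4)
aligned = buffer_address(rx) % PAGE == 0

# Receive straight into the dedicated pages; nothing else lives in them.
sender, receiver = socket.socketpair()
sender.sendall(b"payload")
n = receiver.recv_into(memoryview(rx))
sender.close()
receiver.close()
print(aligned, rx[:n])
```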

      Of course, there are more quirks to the problem than what I've discussed here. However, the point is that one can not easily implement zero-copy TCP receive without having well behaved applications (i.e., without modification of the application). Zero-copy TCP send is easier since the location of the outgoing packet is known to the kernel once the send operation starts.

    2. Re:what about zero copy on receive? by AdamBa · · Score: 3, Insightful
      Sure, you would only do it when certain things were true -- the user had already posted a buffer, nobody else was using that port, etc. But a standard situation like somebody ftp'ing up a huge file would match those conditions.

      Now I confess I don't know anything about the internals of any Unix version. What I worked on was NT. And in NT this would be very easy (memory-management-wise I mean) as long as you had the user's buffer ahead of time...no need to have receive buffers aligned on a page boundary or anything. Since I don't know *nix, I don't know why that restriction would be needed. The card could receive anywhere in memory (doesn't need to be aligned)...and in NT you can map any user buffer into kernel space so any device driver can access it, then lock it to a physical address so the card can access it.

      - adam

    3. Re:what about zero copy on receive? by AdamBa · · Score: 3, Interesting
      Yes, you need separate buffers for the headers (I tried to explain this in my first post but wasn't that clear).

      Let's say you get a 64K buffer from the user. So you hand it to the card and say "all data for port 0x1234 goes in here." Then you also give a chain of receive headers, 64 bytes or whatever. The processing after that should be pretty straightforward...when the card interrupts you with a packet received, it sets a flag saying that the data was put in a user buffer. Then the network card driver tells TCP it has a packet and that it has split header/data and the data is at location XXX. At that point the processing should be basically the same for TCP, verifying checksums and headers etc, but then at the last step where it would copy the data to the user's buffer, it just doesn't have to -- as long as the data was supposed to wind up at XXX.
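
      A toy simulation of that split header/data receive, in Python. Everything here is hypothetical (the `SplitReceiveCard` class, the `Header` fields, using the stream byte offset as the sequence number); it only models the bookkeeping: the card writes payloads into the user buffer in arrival order, and TCP later checks whether each one landed where it belongs.

```python
from dataclasses import dataclass


@dataclass
class Header:
    seq: int        # byte offset of this payload within the stream
    length: int
    placed_at: int  # where the card wrote the payload in the user buffer


class SplitReceiveCard:
    """Models a card that puts payloads in a user buffer, headers elsewhere."""

    def __init__(self, user_buffer):
        self.buf = user_buffer
        self.cursor = 0
        self.headers = []

    def receive(self, seq, payload):
        end = self.cursor + len(payload)
        if end > len(self.buf):
            raise BufferError("user buffer full; fall back to general buffers")
        self.buf[self.cursor:end] = payload          # data goes to the user buffer
        self.headers.append(Header(seq, len(payload), self.cursor))
        self.cursor = end


def misplaced(headers):
    """Headers whose payload did not land at its final position."""
    return [h for h in headers if h.placed_at != h.seq]
```

      If `misplaced()` comes back empty, the final copy can be skipped; otherwise TCP has to rearrange, which is the tricky case.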

      The tricky case is handling drops and dups and out of order. For example if the fifth packet in a transfer is received third, then TCP can't just move it to the right spot because the card may be using that spot to receive another packet. In general trying to tell the card "oh you should back up and start receiving new packets here instead of here" is tricky timing because a packet may be coming in while you are trying to tell the card that.

      Of course in situations like this TCP doesn't have to be perfectly optimized since you will likely need to retransmit anyway, but it shouldn't be terrible. And the card will also need to be given a set of general buffers for packets that are not to an expected port, or where the user buffer runs out of room, etc. Then TCP has to be clever about putting those packets in the right place.

      You could have the card be smarter and actually know where to put each packet; it could even do acks and retransmit requests...but you don't want to make it too complicated. Plus I think you want to avoid having the card need to interpret any part of the packet that is encrypted during an IPSEC session (I don't know exactly where that begins). Some cards do IPSEC in hardware but that is another issue.

      And of course this only helps if the server is CPU-bound, as opposed to disk or network etc.

      - adam

  3. Linux has better zerocopy TCP, and here's why by Jamie+Lokier · · Score: 4, Informative

    True zerocopy has certain hardware and driver requirements. These are the network drivers in linux 2.5.9 which support zerocopy TCP: 3c59x, acenic, sunhme, 8139cp, e100, ns8320, starfire, via-rhine, sungem, e1000, 8139too, tg. (Disclaimer: Not all cards supported by those drivers necessarily support full zero copy). That's from grepping for the NETIF_F_SG and at least one of the NETIF_F_(IP|NO|HW)_CSUM flags.
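
    The test that grep stands in for is "scatter-gather plus at least one checksum capability". A sketch in Python, where the flag names come from the comment above but the bit values are placeholders chosen for illustration, not taken from the kernel headers:

```python
# Placeholder bit values for illustration only; check the kernel headers
# for the real definitions.
NETIF_F_SG = 1 << 0
NETIF_F_IP_CSUM = 1 << 1
NETIF_F_NO_CSUM = 1 << 2
NETIF_F_HW_CSUM = 1 << 3

ANY_CSUM = NETIF_F_IP_CSUM | NETIF_F_NO_CSUM | NETIF_F_HW_CSUM


def supports_zerocopy(features):
    """True if the driver advertises scatter-gather and some checksum flag."""
    return bool(features & NETIF_F_SG) and bool(features & ANY_CSUM)
```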

    In Linux, zerocopy is performed using the sendfile(2) system call, rather than writing to a socket from a memory-mapped file, as you are meant to do with the new BSD code. Although the mmap method is a neat way to make a few existing programs faster, it is less efficient than the sendfile() method, to some degree, and certainly more complicated to implement.
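
    A minimal sketch of the sendfile path, written in Python rather than C. Python's `socket.sendfile()` uses `os.sendfile()` where the platform provides it and falls back to plain `send()` otherwise, so this shows the calling pattern, not a guarantee of zero-copy. The socketpair and temporary file are stand-ins for a real connection and a real document.

```python
import os
import socket
import tempfile


def send_whole_file(sock, path):
    """Hand the file to the kernel; no read() into a user-space buffer."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


payload = b"GET me with sendfile\n" * 10
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

left, right = socket.socketpair()
sent = send_whole_file(left, path)
left.close()

received = b""
while True:
    chunk = right.recv(4096)
    if not chunk:
        break
    received += chunk
right.close()
os.unlink(path)
```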

    A write-from-mmap implementation has to make certain allowances for user space behaviour. Although it's advised not to touch the pages from user space, allowing for this basically requires the OS to "pin" pages, either by modifying page tables, which implies TLB and page walking cost (if the pages are actually mapped, which they probably are not in a Samba/www/ftp server), or by at least pinning the underlying page cache pages in case someone does a write() to the mapped file. sendfile() does not require the pages to be pinned, because it provides different guarantees about which data is transmitted if the data is being modified concurrently.

    Another nice thing about sendfile() is that it's quite fast even for small files. The overhead of calling mmap() and then munmap() may outweigh the copying time for a small transmission. Basically, why bother with mmap/write/munmap when you can just do a sendfile, which doesn't require the kernel to jump through hoops to decode what you meant.
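
    For comparison, the mmap/write/munmap sequence looks roughly like this in Python. It is a sketch of the calling pattern only: whether the kernel avoids the copy when writing from a mapping depends on the OS, and `send_via_mmap` is a hypothetical helper name. Note the extra map and unmap steps around the one send.

```python
import mmap
import os
import socket
import tempfile


def send_via_mmap(sock, path):
    """mmap() the file, write the mapping to the socket, then unmap it."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size          # must be non-zero to map
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as m:
            sock.sendall(m)                          # write() from mapped pages
        # leaving the inner block closes the mapping (the munmap step)
    return size


body = b"small file\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(body)
    small_path = tmp.name

tx, rx_sock = socket.socketpair()
n_sent = send_via_mmap(tx, small_path)
tx.close()
got = rx_sock.recv(4096)
rx_sock.close()
os.unlink(small_path)
```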

    Well, I don't know if it makes much difference in performance if you only mmap() a file without referencing the pages from user space, and write it to a socket. We'll have to wait for the numbers.

    But there is another great thing about sendfile! You can use it to transmit user-space generated data, such as HTTP headers, too. This is done by memory-mapping a shared file (such as a pure virtual memory "tmpfs" file, but you can use a real disk file too). Then you can write to that mapped memory from user space, and call sendfile() to transmit what you have just generated.
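
    That trick can be sketched in Python as follows, assuming a Unix-like host. An ordinary temporary file stands in for the tmpfs file, and `send_generated` and `POOL_SIZE` are hypothetical names. The program writes its headers into the shared mapping and then asks the kernel to send that byte range of the backing file.

```python
import mmap
import socket
import tempfile

POOL_SIZE = mmap.PAGESIZE

# Backing file for the data pool (stand-in for a tmpfs file).
pool_file = tempfile.TemporaryFile()
pool_file.truncate(POOL_SIZE)
pool = mmap.mmap(pool_file.fileno(), POOL_SIZE)      # shared, writable mapping


def send_generated(sock, data, offset=0):
    """Write data into the mapped pool, then sendfile that range."""
    pool[offset:offset + len(data)] = data           # generate in user space
    pool.flush()
    # The caller must not reuse this range until the peer has the data.
    return sock.sendfile(pool_file, offset=offset, count=len(data))


header = b"HTTP/1.0 200 OK\r\n\r\n"
out_sock, in_sock = socket.socketpair()
count = send_generated(out_sock, header)
echoed = in_sock.recv(1024)
out_sock.close()
in_sock.close()
```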

    You can do what I just described, with a mapped, shared file, using the new BSD zerocopy patches too. If using sendfile(), the weaker concurrency guarantees of sendfile() vs. write() mean it is your responsibility not to modify the data until you are sure it's been received at the far end. In some ways user space has more responsibility to carefully manage the data pool with this method of using sendfile() for program-generated data than with BSD-style write(). On the other hand, that's exactly why the BSD kernel must do more work pinning pages, and in this mode of usage there is definitely TLB flushing cost and cross-CPU synchronisation cost, so if you are really crazy for performance, sendfile() may just have the edge. (Well I expect so anyway, I haven't done performance comparisons).

    By the way, write() from memory-mapped files has been discussed among linux kernel developers several times in the past, and each time the idea lost due to the feeling that page table manipulation is not that cheap (especially not on SMP), and now that we have sendfile... Well, if you were writing a really high performance user space server, you'd use sendfile anyway so writing from mmap becomes a bit moot.

    Finally, zerocopy UDP is not implemented in linux at present as far as I know, but some gory details were discussed recently on the kernel list so it is sure to arrive quite soon. The difficult infrastructure (drivers, page-referencing skbuffs) which is used by the zerocopy TCP implementation has been part of the 2.4 kernel since 2.4.4 (more than 1 year ago) and I believe it has been thoroughly tested since then.

    Enjoy,
    -- Jamie Lokier

  4. FreeBSD has zerocopy sendfile... by lamontg · · Score: 5, Informative

    A basic bit of research by running 'man sendfile' on a FreeBSD system would have told you this:

    SENDFILE(2) FreeBSD System Calls Manual SENDFILE(2)

    NAME
    sendfile - send a file to a socket

    LIBRARY
    Standard C Library (libc, -lc)

    SYNOPSIS
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    int
    sendfile(int fd, int s, off_t offset, size_t nbytes,
    struct sf_hdtr *hdtr, off_t *sbytes, int flags);

    DESCRIPTION
    Sendfile() sends a regular file specified by descriptor fd out a stream
    socket specified by descriptor s.

    [...]

    IMPLEMENTATION NOTES
    The FreeBSD implementation of sendfile() is "zero-copy", meaning that it
    has been optimized so that copying of the file data is avoided.

  5. hoo baby by AdamBa · · Score: 3, Insightful
    Don't know anything about Linux, but compared to NT this seems incredibly complicated. In NT you can take any user buffer, probe it, map it, and lock it. Don't have to worry about "user space behaviour" and whatnot. If the user is doing synchronous write()s (or whatever) the call won't even return until the kernel code says it is done with it. And for asynchronous, you would hope the user won't be walking all over the buffers it has handed over until the call is done. And in all cases the kernel components handle the buffers in the same way, you don't need your network card to indicate with a flag whether it can do zero-copy sends (?!?!?!?!?!?!?).

    - adam