Zero-Copy TCP and UDP Output in NetBSD
-is writes "Jason R. Thorpe has recently added experimental code to NetBSD-current that enables zero-copy for TCP and UDP on the transmit side. These changes could mean significant performance improvements for FTP, WWW, and Samba servers. See Jason's announcement to the current-users mailing list for details." From the text: "On tests on an embedded system with limited memory bandwidth, TCP transmit performance on 100baseTX-FDX went from ~6500KB/s to ~11100KB/s, a significant improvement." Excellent!
With no copy on transmit, how does one know whether the packet was received? It does no one any good to double bandwidth but drop half the packets unwittingly.
;-)
Also, does this have any effect on the amount of level 1 transmission requirements (assuming that packet loss has been accounted for)? Couldn't such an increase in throughput decrease the need for level 1 and 2 transmission requirements? I'd be afraid to imagine the amount of spam that is created to take up the slack.
While we're on the topic, at what level will these transmission improvements be noticeable? Will they simply replace the underlying infrastructure and be invisible to ordinary users?
When will this be ported to Linux? :-P
But now all anyone cares about is TCP. Furthermore, a typical copy of data to a server goes something like:
1) packet sent by the client to a known port on the server
2) a few packets to set things up and assign a dedicated server port
3) lots of data blasting from the client to the dedicated server port
4) some cleanup packets at the end
Step 3 is what you care about. So you would need to tell the network card: when you get packets for this port, put the data in this buffer in the order received, and put the headers here (in some small header-sized buffers TCP would also provide). Now you might get bad checksums (although the hardware could check that also), drops, or out-of-order packets, and then you would need to rearrange...but in the 99%+ normal case you get all the packets in order with valid checksums. So the card stuffs the data in the right place, TCP checks the header buffers to make sure everything is kosher, and boom, your data is in memory with no copies and off to disk (or wherever) it goes.
You need some other stuff too: TCP has to be able to hint this to the network card driver, figure out if more than one app is using a port (so it can turn all this optimization off), and so on. But hey, when it worked it would be cool.
The other way this would work is if the network card was set up with a big chain of receive buffers and it would actually hand a buffer up to TCP (so it got taken out of the chain) and then eventually it would get it back...but this requires a lot of trust of the levels above TCP that ultimately decide when the receive data isn't needed anymore.
As Dilbert said this weekend...if you can understand the preceding, you have my sympathy.
- adam
True zerocopy has certain hardware and driver requirements. These are the network drivers in linux 2.5.9 which support zerocopy TCP: 3c59x, acenic, sunhme, 8139cp, e100, ns8320, starfire, via-rhine, sungem, e1000, 8139too, tg. (Disclaimer: Not all cards supported by those drivers necessarily support full zero copy). That's from grepping for the NETIF_F_SG and at least one of the NETIF_F_(IP|NO|HW)_CSUM flags.
In Linux, zerocopy is performed using the sendfile(2) system call, rather than writing to a socket from a memory-mapped file, as you are meant to do with the new BSD code. Although the mmap method is a neat way to make a few existing programs faster, it is less efficient than the sendfile() method, to some degree, and certainly more complicated to implement.
A write-from-mmap implementation has to make certain allowances for user-space behaviour. Although it's advised not to touch the pages from user space, allowing for this basically requires the OS to "pin" pages, either by modifying page tables, which implies TLB and page-walking cost (if the pages are actually mapped, which they probably are not in a Samba/www/ftp server), or by at least pinning the underlying page cache pages in case someone does a write() to the mapped file. sendfile() does not require the pages to be pinned, because it provides different guarantees about which data is transmitted if the data is being modified concurrently.
Another nice thing about sendfile() is that it's quite fast even for small files. The overhead of calling mmap() and then munmap() may outweigh the copying time for a small transmission. Basically, why bother with mmap/write/munmap when you can just do a sendfile, which doesn't require the kernel to jump through hoops to decode what you meant?
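For anyone who hasn't used it, here is a rough sketch of the Linux sendfile(2) path described above. The function names send_file_prefix() and sendfile_loopback_check() are made up for illustration, and the AF_UNIX socketpair is only a stand-in for a real TCP connection so the thing can be run on one machine. Whether the transmit is truly zero-copy still depends on the driver and hardware, as noted earlier.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send the first `count` bytes of in_fd to out_fd with Linux sendfile(2).
 * Returns the number of bytes sent, or -1 on error (errno is set). */
static ssize_t send_file_prefix(int out_fd, int in_fd, size_t count)
{
    off_t offset = 0;           /* sendfile() advances this for us */
    size_t left = count;

    assert(out_fd >= 0 && in_fd >= 0);
    while (left > 0) {
        ssize_t n = sendfile(out_fd, in_fd, &offset, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before sending: retry */
            return -1;
        }
        if (n == 0)
            break;              /* file is shorter than `count` */
        left -= (size_t)n;
    }
    return (ssize_t)(count - left);
}

/* Hypothetical loopback check: returns 0 if the bytes arrive intact. */
static int sendfile_loopback_check(void)
{
    static const char msg[] = "hello, zero-copy";
    const size_t len = sizeof msg - 1;
    char path[] = "/tmp/zcsendXXXXXX";
    char buf[64];
    int sv[2];
    int rc = -1;

    int fd = mkstemp(path);     /* temporary file standing in for real content */
    if (fd < 0)
        return -1;
    unlink(path);               /* file disappears once fd is closed */
    if (write(fd, msg, len) != (ssize_t)len)
        goto out_fd;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        goto out_fd;

    /* No user-space buffer is involved on the sending side. */
    if (send_file_prefix(sv[0], fd, len) == (ssize_t)len &&
        read(sv[1], buf, sizeof buf) == (ssize_t)len &&
        memcmp(buf, msg, len) == 0)
        rc = 0;

    close(sv[0]);
    close(sv[1]);
out_fd:
    close(fd);
    return rc;
}
```

Call sendfile_loopback_check() from a main() to try it. Note there is no mmap()/munmap() pair anywhere on the send path, which is the point being made about small files.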
Well, I don't know if it makes much difference in performance if you only mmap() a file without referencing the pages from user space, and write it to a socket. We'll have to wait for the numbers.
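For comparison, this is roughly what the mmap-then-write pattern looks like from user space. It is only a sketch: write_from_mmap() and mmap_write_loopback_check() are illustrative names, the socketpair again stands in for a TCP socket, and whether the kernel actually avoids the copy is entirely up to the kernel. As far as I know a plain Linux kernel will still copy here; the pattern only pays off where the kernel specifically optimizes it, as the new BSD code is meant to.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Map the first `len` bytes of fd read-only and write() them to sock.
 * Returns the number of bytes written, or -1 on error. */
static ssize_t write_from_mmap(int sock, int fd, size_t len)
{
    size_t done = 0;
    char *p;

    assert(len > 0);            /* mmap() rejects zero-length mappings */
    p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    /* The program never reads the mapping itself; it only hands the
     * address to write(), which is the case discussed above. */
    while (done < len) {
        ssize_t n = write(sock, p + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: retry */
            munmap(p, len);
            return -1;
        }
        done += (size_t)n;
    }
    munmap(p, len);
    return (ssize_t)done;
}

/* Hypothetical loopback check: returns 0 if the bytes arrive intact. */
static int mmap_write_loopback_check(void)
{
    static const char msg[] = "written from a mapping";
    const size_t len = sizeof msg - 1;
    char path[] = "/tmp/zcmmapXXXXXX";
    char buf[64];
    int sv[2];
    int rc = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, msg, len) != (ssize_t)len)
        goto out_fd;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        goto out_fd;

    if (write_from_mmap(sv[0], fd, len) == (ssize_t)len &&
        read(sv[1], buf, sizeof buf) == (ssize_t)len &&
        memcmp(buf, msg, len) == 0)
        rc = 0;

    close(sv[0]);
    close(sv[1]);
out_fd:
    close(fd);
    return rc;
}
```

Three system calls (mmap, write, munmap) per transmission instead of one, which is where the small-file overhead comes from.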
But there is another great thing about sendfile! You can use it to transmit user-space generated data, such as HTTP headers, too. This is done by memory-mapping a shared file (such as a pure virtual memory "tmpfs" file, but you can use a real disk file too). Then you can write to that mapped memory from user space, and call sendfile() to transmit what you have just generated.
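A minimal sketch of that trick on Linux, under some assumptions: the temporary file in /tmp stands in for a tmpfs file, POOL_SIZE and generated_header_check() are names invented for the example, the HTTP header is just sample data, and an AF_UNIX socketpair replaces the real client connection.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define POOL_SIZE 4096          /* hypothetical size of the shared data pool */

/* Returns 0 if a header generated in mapped memory arrives intact. */
static int generated_header_check(void)
{
    char path[] = "/tmp/zcpoolXXXXXX";
    char buf[256];
    int sv[2];
    int rc = -1;
    off_t off = 0;
    char *pool;
    int len;

    int fd = mkstemp(path);     /* backing file for the pool */
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, POOL_SIZE) < 0)
        goto out_fd;

    /* MAP_SHARED so that stores through `pool` land in the file's pages. */
    pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pool == MAP_FAILED)
        goto out_fd;

    /* Generate the data in user space, straight into the mapping. */
    len = snprintf(pool, POOL_SIZE,
                   "HTTP/1.0 200 OK\r\nContent-Length: %d\r\n\r\n", 0);
    assert(len > 0 && len < POOL_SIZE);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        goto out_map;

    /* Transmit from the file behind the mapping. From here on, this
     * region of the pool must not be modified until the data is known
     * to have reached the peer. */
    if (sendfile(sv[0], fd, &off, (size_t)len) == (ssize_t)len &&
        read(sv[1], buf, sizeof buf) == (ssize_t)len &&
        memcmp(buf, pool, (size_t)len) == 0)
        rc = 0;

    close(sv[0]);
    close(sv[1]);
out_map:
    munmap(pool, POOL_SIZE);
out_fd:
    close(fd);
    return rc;
}
```

A real server would keep one long-lived pool and hand out regions of it, rather than creating a file per response as this check does.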
You can do what I just described, with a mapped, shared file, using the new BSD zerocopy patches too. If using sendfile(), the weaker concurrency guarantees of sendfile() vs. write() mean it is your responsibility to not modify the data until you are sure it's been received at the far end. In some ways user space has more responsibility to carefully manage the data pool with this method of using sendfile() for program-generated data than with BSD-style write(). On the other hand, that's exactly why the BSD kernel must do more work of pinning pages, and in this mode of usage there is definitely TLB flushing cost and cross-CPU synchronisation cost, so if you are really crazy for performance, sendfile() may just have the edge. (Well, I expect so anyway; I haven't done performance comparisons.)
By the way, write() from memory-mapped files has been discussed among linux kernel developers several times in the past, and each time the idea lost due to the feeling that page table manipulation is not that cheap (especially not on SMP), and now that we have sendfile... Well, if you were writing a really high performance user space server, you'd use sendfile anyway so writing from mmap becomes a bit moot.
Finally, zerocopy UDP is not implemented in linux at present as far as I know, but some gory details were discussed recently on the kernel list so it is sure to arrive quite soon. The difficult infrastructure (drivers, page-referencing skbuffs) which is used by the zerocopy TCP implementation has been part of the 2.4 kernel since 2.4.4 (more than 1 year ago) and I believe it has been thoroughly tested since then.
Enjoy,
-- Jamie Lokier
A basic bit of research by running 'man sendfile' on a FreeBSD system would have told you this:
SENDFILE(2)            FreeBSD System Calls Manual            SENDFILE(2)

NAME
     sendfile - send a file to a socket

LIBRARY
     Standard C Library (libc, -lc)

SYNOPSIS
     #include <sys/types.h>
     #include <sys/socket.h>
     #include <sys/uio.h>

     int
     sendfile(int fd, int s, off_t offset, size_t nbytes,
         struct sf_hdtr *hdtr, off_t *sbytes, int flags);

DESCRIPTION
     Sendfile() sends a regular file specified by descriptor fd out a stream
     socket specified by descriptor s.

[...]

IMPLEMENTATION NOTES
     The FreeBSD implementation of sendfile() is "zero-copy", meaning that it
     has been optimized so that copying of the file data is avoided.
We see a lot of posts indicating that *BSD is dying. OTOH, we see a lot (literally a LOT!) of enthusiastic and excited FreeBSD (*BSD) newbies looking for help with basic *BSD questions.
An important question we all need to ask ourselves: do we want to be plain sideline observers, or do we choose to help grow *BSD? Again, I am sure all of you *BSD geeks will agree that it would be in our best interests to promote *BSD!!!
This is not an infomercial on our part, but a request for all you "expert" *BSD geeks to give back to the *BSD cause by PARTICIPATING, visiting your favorite *BSD sites (or whatever else !!!), and by helping answer newbie questions.
Newbies sometimes feel intimidated/overwhelmed by mailing lists, complex howtos, etc. and don't exactly know where to start. Some prefer asking simple questions in forums, or following simple howtos, etc. It would be in our best interest to encourage these folks and turn them towards FreeBSD, OpenBSD and NetBSD!! Encouraging them will promote *BSD commitment (committers) with ports/applications, OS and security enhancements. *BSD needs our help!!
...that NetBSD will be the new choice of warez and DoS kiddies everywhere?
No way mang! MACOSX is better than dumb old *BSD!
You're right, (hangs head in shame...). ;-)
I had a nagging feeling when I wrote the article.
Ah well, at least I managed to explain some of the differences between mmap-write and sendfile.
Thanks,
-- Jamie
- adam
Subject says it all.
Cool. I have a bunch of zombie servers then. Here come FreeBSD zombies for you BSD troll.
Funny post - BSD is now the highest volume *nix in the computer biz. Thanks to Apple shipping on Darwin. What's more, users *REAL USERS* don't give a hoot about "basic *BSD questions". They just want it to work.
And it can. And it does.
OK, so maybe this is a troll. But maybe it's insightful, and maybe it's just funny...
I wonder if OpenBSD will copy this new NetBSD code. Does anyone know which BSD has more users, OpenBSD or NetBSD? I hear few people ever talking about NetBSD, but many of the OpenBSD changelogs have comments like "copied this from NetBSD". Does OpenBSD simply "ripoff" NetBSD development?
cpeterso
I know that NT has been able to pin user buffers forever, and that can be used for synchronous I/O. It's not suitable for asynchronous I/O, though: it's no good "hoping" the user doesn't overwrite buffers until the TCP transmitted data has been acknowledged. Applications are often written to assume they can call the equivalent of the unix write() call, and then overwrite their application buffer to prepare for another write.
You said that all kernel components handle buffers in the same way, and thus all network cards handle zero-copy sends. In fact this is not true for all supported network cards: some cards simply don't have the hardware to transmit a packet that is in physically discontiguous memory. And that's what you have when doing zero-copy sends from user buffers. So, either NT's kernel or the vendor-supplied device driver must copy the data into contiguous memory, or onto the card itself (typical for ISA cards). When that happens it isn't zero-copy any more, although it is transparent to the application -- just like the Linux and BSD mechanisms!
have a nice day,
-- Jamie
Can someone tell me if Overlapped IO under Win32 is also zero-copy? It seems like it could be - basically you pass buffers to the OS, and the OS lets you know when it's finished with them. This works for input and output.
The other nice thing about OIO is that you don't have to populate your fd_set every time you do a select - which means if you're writing a server app with thousands of simultaneous connections, it's a whole lot faster.
Is there a Linux/BSD equivalent to this?
I guess in general avoiding copies is a good thing, it frees the CPU to do something else. You just may not always have something else for the CPU to do, but hey why not give it the opportunity.
- adam