Zero-Copy TCP and UDP Output in NetBSD

Re:Packet loss by AdamBa · 2002-05-06 15:23 · Score: 5, Informative

You know when the packet was received when you get an ack for it (for TCP, with UDP you can just dump it to the network card and forget about it since it's unreliable delivery). Obviously in this situation you need to hold on to the user's buffer until you have gotten an ack, otherwise you don't have the data until you need to retransmit. But you want to optimize for the general case which is you do get an ack real quick and then you can just return the buffer...I suppose a clever TCP implementation using this would have some spare buffers, and after some time without an ack (maybe the first retransmission, maybe sooner), it would copy the data and give the user's buffer back. But again that is a rare case so the extra copy there doesn't matter.

As fow how noticeable they will be...according to the email, it increased from 52 megabits/second to almost 90 megabits/second. But transmitting over a series of hops (like on the Internet), your speed is gated by the slowest link or router, and it is doubtful the whole end-to-end chain will be able to handle 90 Mbps (or 52 Mbps for that matter). So even with a monster send window you would still need to stop transmitting well before you got an ack and could send more.

Of course on a local 100 Mbps Ethernet playing networked games this could be quite nice.

- adam

what about zero copy on receive? by AdamBa · 2002-05-06 15:32 · Score: 3, Interesting

I'm surprised nobody has come up with hardware to do this. The problem is the network card needs to know about the user's buffer ahead of time. In the old days (i.e. 5 years ago) you had a mix of Netbeui and IPX and TCP on a network and it didn't make much sense to make a card intelligent enough to figure out where to put packets.

But now all anyone cares about is TCP. Furthermore, a typical copy of data to a server goes something like:

1) packet sent by the client to a known port on the server
2) a few packets to set things up and assign a dedicated server port
3) lots of data blasting from the client to the dedicated server port
4) some cleanup packets at the end

Step 3 is what you care about. So you would need to tell the network card, when you get packets for this port, put the data in this buffer in the order received, and put the headers here (in some small header-sized buffers TCP would also provide). Now you might get bad checksums (although the hardware could check that also) or drops or out of order, then you would need to rearrange...but in the 99%+ normal case you get all the packets in order with valid checksums. So the card stuffs the data in the right place, TCP checks the header buffers to make sure everything is kosher, and boom your data is in memory with no copies and off to disk (or wherever) it goes.

You need some other stuff like TCP has to be able to hint this to the network card driver, and figure out if more than one app is using a port (so it can turn all this optimization off) and so on. But hey when it worked it would be cool.

The other way this would work is if the network card was set up with a big chain of receive buffers and it would actually hand a buffer up to TCP (so it got taken out of the chain) and then eventually it would get it back...but this requires a lot of trust of the levels above TCP that ultimately decide when the receive data isn't needed anymore.

As Dilbert said this weekend...if you can understand the preceding, you have my sympathy.

- adam

Re:what about zero copy on receive? by EvlG · 2002-05-06 15:43 · Score: 2

This paper describes a distributed shared memory system in which just such a hardware mechanism was used, called direct deposit. The paper has some information on it.
Re:what about zero copy on receive? by Espen+Skoglund · 2002-05-06 22:37 · Score: 3, Insightful

The problem is not necessarily that of early demultiplexing of incoming packets to the appropriate ports. The problem is that in order to achieve zero-copy you'll have to store the packet at the appropriate location in memory. I.e., you must store the incoming packet in the buffer location that the user level application uses for receive. Now, obviously you can not store the packet there unless the user has requested to receive more data (the buffer may be used for other purposes). A solution to this problem is to program all the applications so that their receive buffers are aligned on page-boundaries, and the page containing the receive buffer is only used for containing the receive buffer. This allows the kernel to receive incoming data onto empty pages and map these pages into the application when the application eventually issues a receive operation.
Of course, there are more quirks to the problem than what I've discussed here. However, the point is that one can not easily implement zero-copy TCP receive without having well behaved applications (i.e., without modification of the application). Zero-copy TCP send is easier since the location of the outgoing packet is known to the kernel once the send operation starts.
Re:what about zero copy on receive? by AdamBa · 2002-05-07 01:51 · Score: 3, Insightful

Sure you would have to only do it with certain things were true -- the user had already posted a buffer, nobody else was using that port, etc. But a standard situation like somebody ftp'ing up a huge file would match those conditions.
Now I confess I don't know anything about the internals of any Unix version. What I worked on was NT. And in NT this would be very easy (memory-management-wise I mean) as long as you had the user's buffer ahead of time...no need to have receive buffers aligned on a page boundary or anything. Since I don't know *nix, I don't know why that restriction would be needed. The card could receive anywhere in memory (doesn't need to be aligned)...and in NT you can map any user buffer into kernel space so any device driver can access it, then lock it to a physical address so the card can access it.
- adam
Re:what about zero copy on receive? by AdamBa · 2002-05-07 04:35 · Score: 3, Interesting

Yes, you need separate buffers for the headers (I tried to explain this in my first post but wasn't that clear).
Let's say you get a 64K buffer from the user. So you hand it to the card and say "all data for port 0x1234 goes in here." Then you also give a chain of receive headers, 64 bytes or whatever. The processing after that should be pretty straightforward...when the card interrupts you with a packet received, it sets a flag saying that the data was put in a user buffer. Then the network card driver tells TCP it has a packet and that it has split header/data and the data is at location XXX. At that point the processing should be basically the same for TCP, verifying checksums and headers etc, but then at the last step where it would copy the data to the user's buffer, it just doesn't have to -- as long as the data was supposed to wind up at XXX.
The tricky case is handling drops and dups and out of order. For example if the fifth packet in a transfer is received third, then TCP can't just move it to the right spot because the card may be using that spot to receive another packet. In general trying to tell the card "oh you should back up and start receiving new packets here instead of here" is tricky timing because a packet may be coming in while you are trying to tell the card that.
Of course in situations like this TCP doesn't have to be perfectly optimized since you will likely need to retransmit anyway, but it shouldn't be terrible. And the card will also need to be given a set of general buffers for packets that are not to an expected port, or where the user buffer runs out of room, etc. Then TCP has to be clever about putting those packets in the right place.
You could have the card be smarter and actually known about where to put each packet, it couldn't even do acks and retransmit requests...but you don't want to make it too complicated. Plus I think you want to avoid having the card need to interpret any part of the packet that is encrypted during an IPSEC session (which I don't know exactly where that begins). Some cards do IPSEC in hardware but that is another issue.
And of course this only helps if the server is CPU-bound, as opposed to disk or network etc.
- adam
Re:what about zero copy on receive? by thorpej · 2002-05-08 09:18 · Score: 2, Insightful

A zero-copy receive path is a significantly harder problem to solve.

Basically, devices DMA into host memory. These buffers must be preallocated, since you never know when a packet might arrive, and when it does, it needs to go into memory immediately, since the temporary storage in the Ethernet MAC itself is quite small.

When the data arrives, we still don't know which application it is for. We have to parse headers, etc. to determine that. And once we do, we have a buffer that is:

1. Not page-aligned.
2. Not page-rounded.

This makes it very difficult to "page flip" the buffer into userspace.

The Trapeeze/IP project at Duke implemented a zero-copy receive for FreeBSD, but it required special modificaitons to the firmware on the Alteon ACEnic Gig-E interfaces they were using. Those interfaces aren't even manufactured anymore, and there are essentially no Gig-E interfaces on the market today which allow you to hack the firmware in such a way. So, their solution pretty much can't be used unless you have full control over the hardware that's going into your device (i.e. it's pretty much of use only to people building embedded systems from scratch).

Therefore, in the absense of another solution, you are forced to perform at least one copy on the receive side: from the interface's receive buffer into a page-aligned/page-sized buffer in the socket. Once you have that, you *can* page-flip into user space, however, and since the copy across the protection boundary is usually more expensive than a copy within the same address space, so there's still some benefit that can be realized.

--
-- Jason R. Thorpe, NetBSD and FreSSH developer
Re:what about zero copy on receive? by AdamBa · 2002-05-08 15:14 · Score: 2

I understand the issues you bring up, but I still think it can be done. You are thinking of it as receiving into one of the standard buffers you have preallocated into the netcard's receive ring, and then somehow making this map into the user's buffer. That's not what I mean.
I mean in certain cases (but cases that generally correspond to the ones you would care about, large transfers to a server), you have the user's buffer ahead of time, and furthermore there is just one client of the port, and TCP knows this because it has *assigned* the port to that client. For example when the ftp daemon is receiving a put command, it moves the client computer over to a random port, which it asks TCP to assign to it. Let's say TCP assigns port 0x9876. So TCP knows that only the ftp daemon is the only one using port 0x9876. And the ftp daemon has posted a big receive to get the data. So TCP can take that receive's buffer and hand it to the driver and say "all data [but not the headers] received on port 0x9876 goes in this buffer." Then the card packs the data in and in the normal case where every packet is received only once and in order, it works.
The firmware mod you would have to do is instead of the card having one big ring of receive buffers, it needs a bunch of "per-port" buffers (plus one big ring as before). Now that takes some space to set up the control, but the buffers themselves are all user buffers, so there is not a *lot* more storage needed. The firmware needs to pick out the port number "on the fly" while it is receiving the packet, but I think it could to that.
There may be alignment issues....maybe the card has to say "gee the next spot for packets to port 0x9876 has an address that is congruent to 3 modulo 4...so it has to put one extra byte somewhere (along with the header maybe) and then start its transfer at the 4-byte boundary.
If you want to slap down credentials, I was one of the main designers of both NDIS (the transport <-> network card interface) and TDI (the transport <-> level above) interfaces in NT. And I wrote several transports and netcard drivers. So I know a bit whereof I speak.
- adam
Re:what about zero copy on receive? by NoMoreNicksLeft · 2002-05-11 02:22 · Score: 2

What a retarded thing to say. For one thing, there will soon be ipv6 TCP, is this the tcp you're referring to, or ipv4's version?

There are too many legacy systems and apps out there, for me to have to worry if the replacement nic is going to try and outsmart me. That's something that happens all too often in windows, and I'll be damned if it happens in hardware without me bitching up a storm.

I regularly see NetBEUI, DECnet, and IPX on the systems I work with, and even something stranger from time to time. That was undoubtedly one of ethernet's core strengths, the ability to be protocol agnostic. On every other physical/logical protocol, you always had to jump through hoops to use a different protocol, but ethernet just doesn't care.

I do care about more than just TCP/IP. If you weren't a fool, you would too.

Suggestion to moderator: Flamebait, yes. Troll, no. If you are too stupid to see the difference, then I'm sure the meta-moderator won't be.

Linux has better zerocopy TCP, and here's why by Jamie+Lokier · 2002-05-06 15:56 · Score: 4, Informative

True zerocopy has certain hardware and driver requirements. These are the network drivers in linux 2.5.9 which support zerocopy TCP: 3c59x, acenic, sunhme, 8139cp, e100, ns8320, starfire, via-rhine, sungem, e1000, 8139too, tg. (Disclaimer: Not all cards supported by those drivers necessarily support full zero copy). That's from grepping for the NETIF_F_SG and at least one of the NETIF_F_(IP|NO|HW)_CSUM flags.

In Linux, zerocopy is performed using the sendfile(2) system call, rather than writing to a socket from a memory-mapped file, as you are meant to do with the new BSD code. Although the mmap method is a neat way to make a few existing programs faster, it is less efficient than the sendfile() method, to some degree, and certainly more complicated to implement.

A write-from-mmap implementation has to provide a certain allowances for user space behaviour. Although it's advised not to touch the pages from user space, allowance for this basically require the OS to "pin" pages, either by modifying page tables which implies TLB and page walking cost (if the pages are actually mapped, which they probably are not in a Samba/www/ftp server), or by at least pinning the underlying page cache pages in case someone does a write() to the mapped file. sendfile() does not require the pages to be pinned, because it provides different guarantees about which data is transmitted if the data is being modified concurrently.

Another nice thing about sendfile() is that it's quite fast even for small files. The overhead of calling mmap() and then munmap() may outweigh the copying time for a small transmission. Basically, why bother with mmap/write/munmap when you can just do a sendfile, which doesn't require the kernel to jump through hoops to decode what you meant.

Well, I don't know if it makes much difference in performance if you only mmap() a file without referencing the pages from user space, and write it to a socket. We'll have to wait for the numbers.

But there is another great thing about sendfile! You can use it to transmit user-space generated data, such as HTTP headers, too. This is done by memory-mapping a shared file (such as a pure virtual memory "tmpfs" file, but you can use a real disk file too). Then you can write to that mapped memory from user space, and call sendfile() to transmit what you have just generated.

You can do what I just described, with a mapped, shared file, using the new BSD zerocopy patches too. If using sendfile(), the weaker concurrency guarantees of sendfile() vs. write() mean it is your responsibility to not modify the data until you are sure it's been received at the far end. In some ways user space has more responsibility, to be carefully manage the data pool with this method of using sendfile() for program-generated data, than using BSD-style write(). On the other hand, that's exactly why the BSD kernel must do more work of pinning pages, and in this mode of usage there is definitely TLB flushing cost and cross-CPU synchronisation cost, so if you are really crazy for performance, sendfile() may just have the edge. (Well I expect so anyway, I haven't done performance comparisons).

By the way, write() from memory-mapped files has been discussed among linux kernel developers several times in the past, and each time the idea lost due to the feeling that page table manipulation is not that cheap (especially not on SMP), and now that we have sendfile... Well, if you were writing a really high performance user space server, you'd use sendfile anyway so writing from mmap becomes a bit moot.

Finally, zerocopy UDP is not implemented in linux at present as far as I know, but some gory details were discussed recently on the kernel list so it is sure to arrive quite soon. The difficult infrastructure (drivers, page-referencing skbuffs) which is used by the zerocopy TCP implementation has been part of the 2.4 kernel since 2.4.4 (more than 1 year ago) and I believe it has been thoroughly tested since then.

Enjoy,
-- Jamie Lokier

FreeBSD has zerocopy sendfile... by lamontg · 2002-05-06 19:59 · Score: 5, Informative

A basic bit of research by running 'man sendfile' on a FreeBSD system would have told you this:

SENDFILE(2) FreeBSD System Calls Manual SENDFILE(2)

NAME
sendfile - send a file to a socket

LIBRARY
Standard C Library (libc, -lc)

SYNOPSIS
#include sys/types.h
#include sys/socket.h
#include sys/uio.h

int
sendfile(int fd, int s, off_t offset, size_t nbytes,
struct sf_hdtr *hdtr, off_t *sbytes, int flags);

DESCRIPTION
Sendfile() sends a regular file specified by descriptor fd out a stream
socket specified by descriptor s.

[...]

IMPLEMENTATION NOTES
The FreeBSD implementation of sendfile() is "zero-copy", meaning that it
has been optimized so that copying of the file data is avoided.

Re:FreeBSD has zerocopy sendfile... by Espen+Skoglund · 2002-05-06 22:18 · Score: 2, Interesting

And you also forgot to mention:
HISTORY
sendfile() first appeared in FreeBSD3 .0. This manual page first appeared in FreeBSD 3.1.

hoo baby by AdamBa · 2002-05-07 01:58 · Score: 3, Insightful

Don't know anything about Linux, but compared to NT this seems incredibly complicated. In NT you can take any user buffer, probe it, map it, and lock it. Don't have to worry about "user space behaviour" and whatnot. If the user is doing synchronous write()s (or whatever) the call won't even return until the kernel code tells says it is done with it. And for asynchronous, you would hope the user won't be walking all over the buffers it has handed over until the call is done. And in all cases the kernel components handle the buffers in the same way, you don't need your network card to indicate with a flag whether it can do zero-copy sends (?!?!?!?!?!?!?).

- adam

Re:Port it to Linux!! by redcliffe · 2002-05-07 12:12 · Score: 2

Well I could say because BSD is dying, but I won't. It's a cool thing and worth porting.

OpenBSD and NetBSD? by cpeterso · 2002-05-07 13:24 · Score: 2

I wonder if OpenBSD will copy this new NetBSD code. Does anyone know which BSD has more users, OpenBSD or NetBSD? I hear few people ever talking about NetBSD, but many of the OpenBSD changelogs have comments like "copied this from NetBSD". Does OpenBSD simply "ripoff" NetBSD development?

--
cpeterso

Re:OpenBSD and NetBSD? by keramida · 2002-05-07 15:43 · Score: 2, Insightful

> Does OpenBSD simply "ripoff" NetBSD development?
Nope. The sharing of code and ideas among the various *BSD implementations is a Good Thing(TM). Developers from more than one groups that join their efforts and write code that is easy to port, clean, and useful to more than one of the BSDs are out there. And they have been doing a great job for quite some time. There's no ripping off, but a cooperative spirit that we should be glad for.
To use your example, what if the "ripping off" of code that OpenBSD is accused for, helps in revealing some bug that is caused by assumptions that are true only for NetBSD? Is then the "ripping off" justified and part of a "development process" or is it still a bad thing? What if the OpenBSD folks write back to the original authors and help in fixing possible problems with the NetBSD codebase? Aren't they then assisting in making NetBSD a better OS too?
- Giorgos

--
My other computer runs FreeBSD too.

Re:Ok, you're right. by thallgren · 2002-05-07 22:05 · Score: 2, Insightful

Right, but you completly failed to explain why the sendfile-only/no-mmap zero-copy in Linux is better. Instead you mention that one should create files in tmpfs and use sendfile on that. That can hardly be zero-copu anymore, can it?

Regards, Tommy

NT's method is the same as the new BSD code by Jamie+Lokier · 2002-05-08 04:18 · Score: 2, Insightful

I know that NT has been able to pin user buffers forever, and that can be used for synchronous I/O. It's not suitable for asynchronous I/O, though: it's no good "hoping" the user doesn't overwrite buffers until the TCP transmitted data has been acknowledged. Applications are often written to assume they can call the equivalent of the unix write() call, and then overwrite their application buffer to prepare for another write.

You said that all kernel components handle buffers in the same way, and thus all network cards handle zero-copy sends. In fact this is not true for all supported network cards: some cards simply don't have the hardware to transmit a packet that is in physically discontiguous memory. And that's what you have when doing zero-copy sends from user buffers. So, either NT's kernel or the vendor-supplied device driver must copy the data into contiguous memory, or onto the card itself (typical for ISA cards). When that happens it isn't zero-copy any more, although it is transparent to the application -- just like the Linux and BSD mechanisms!

have a nice day,
-- Jamie

Re:NT's method is the same as the new BSD code by AdamBa · 2002-05-08 06:13 · Score: 2

I meant asynchronous I/O where the app is expecting it to be asynchronous. With sockets there is some way to wait for a write to complete (which I don't remember offhand, I'm much more familiar with the kernel parts), and using the NT API you can do a wait...I think even with Win32 file APIs you can wait. But I think the default is to have I/O be synchronous, so a dumb app will work. And I disagree that you can't expect an app doing asynchronous I/O not to touch a buffer until the write is done. That's just a broken app. What if it is multithreaded and changes the buffer during the write? What if it zeros it out BEFORE the write?!? Etc etc.
It's true some cards can't do a busmaster send from discontiguous buffers. So they do have to copy to contiguous memory, or to card memory, or do DMA, or whatever...but that's something the card driver has to just deal with. Nobody above the network card driver has to care.
So it sounds like the network card flag you mentioned is just an indicator of the ability to send discontiguous buffers directly? That is, a status indication, not something an app has to worry about....so that's fine, I withdraw my ragging on that.
- adam

Overlapped IO / Win32 by strags · 2002-05-10 03:43 · Score: 2

Can someone tell me if Overlapped IO under Win32 is also zero-copy? It seems like it could be - basically you pass buffers to the OS, and the OS lets you know when it's finished with them. This works for input and output.

The other nice thing about OIO is that you don't have to populate your FDSET every time you do a select - which means if you're writing a server app with thousands of simultaneous connections, it's a whole lot faster.

Is there a Linux/BSD equivalent to this?

good point by AdamBa · 2002-05-15 05:07 · Score: 2

You are right .

I guess in general avoiding copies is a good thing, it frees the CPU to do something else. You just may not always have something else for the CPU to do, but hey why not give it the opportunity.

- adam

Slashdot Mirror

Zero-Copy TCP and UDP Output in NetBSD

21 of 74 comments (clear)