Zero-Copy TCP and UDP Output in NetBSD

Packet loss by Anonymous Coward · 2002-05-06 14:40 · Score: 0

With no copy on transmit, how does one know whether the packet was received? It does no one any good to double bandwidth but drop half the packets unwittingly.

Also, does this have any effect on the amount of level 1 transmission requirements (assuming that packet loss has been accounted for)? Couldn't such an increase in throughput decrease the need for level 1 and 2 trans reqs? I'd be afraid to imagine the amount of spam that is created to take up the slack ;-)

While we're on the topic, at what level will these transmission improvements be noticeable? Will they simply replace the underlying infrastructure and be invisible to ordinary users?

Re:Packet loss by Anonymous Coward · 2002-05-06 14:43 · Score: 0

Honestly... I have no clue what the heck you are saying! Anybody else in the same boat?
Re:Packet loss by Anonymous Coward · 2002-05-06 14:55 · Score: 0

Heh, a first post with real content. Isn't that the first sign of the apocalypse? ;-)

The article doesn't really say much about packet loss (being an announcement and all). I'd assume that they were able to double the bandwidth without increasing packet loss. The ramifications are great. In a 2 dimensional topology such as the web (3 dimensions if you consider your LAN as well), you can see the improvements in trans speed increase exponentially. 2D means 4x the improvement. 3D is 8x! We're talking significant progress here. (One wonders why it isn't on the front page. Probably waiting for the Linux port to finish, I guess)

Your questions concerning level 1 and 2 TRs... It is true that the infrastructure necessary to host UDP is highly unstable at the lower levels (1 especially), but the amount of rewiring is negligible (sp?) if you have your network set up correctly. I'd venture to say that the ripple effect of laying the zero-copy improvements will be small compared to the shaking the floor will have when your engineers jump up and down in jubilation that the network runs not only faster, but easier too. (It's the greatest thing since WinNT! *ducks*)

Anyway, what this all amounts to is that levels 1 and 2 are unlikely to be affected much at all. The bandwidth requirements of modern software are increasing geometrically, and they will no doubt fill the void, leaving the lower levels of TRs to pick up the slack introduced by the ZC protocols.
Re:Packet loss by AdamBa · 2002-05-06 15:23 · Score: 5, Informative

You know the packet was received when you get an ack for it (for TCP; with UDP you can just dump it to the network card and forget about it, since it's unreliable delivery). Obviously in this situation you need to hold on to the user's buffer until you have gotten an ack, otherwise you don't have the data when you need to retransmit. But you want to optimize for the general case, which is that you do get an ack real quick and then you can just return the buffer...I suppose a clever TCP implementation using this would have some spare buffers, and after some time without an ack (maybe the first retransmission, maybe sooner), it would copy the data and give the user's buffer back. But again that is a rare case so the extra copy there doesn't matter.
As for how noticeable they will be...according to the email, it increased from 52 megabits/second to almost 90 megabits/second. But transmitting over a series of hops (like on the Internet), your speed is gated by the slowest link or router, and it is doubtful the whole end-to-end chain will be able to handle 90 Mbps (or 52 Mbps for that matter). So even with a monster send window you would still need to stop transmitting well before you got an ack and could send more.
Of course on a local 100 Mbps Ethernet playing networked games this could be quite nice.
- adam
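
A minimal Python sketch of the hold-until-ACK idea described above. It is a toy model, not code from the NetBSD change: the class name, the method names and the byte-offset sequence numbers are all assumptions made for illustration. The fast path keeps only a reference to the user's buffer; the copy into a spare buffer happens only on the (rare) retransmission timeout.

```python
from dataclasses import dataclass, field


@dataclass
class ZeroCopySendQueue:
    """Toy model (hypothetical names): keep user buffers referenced until ACKed."""
    spare_copies_made: int = 0
    # Maps starting sequence number -> (buffer, still_owned_by_user).
    _pending: dict = field(default_factory=dict)
    _next_seq: int = 0

    def send(self, user_buffer: bytearray) -> int:
        """Queue a buffer without copying it; return its starting sequence number."""
        seq = self._next_seq
        self._pending[seq] = (user_buffer, True)
        self._next_seq += len(user_buffer)
        return seq

    def on_ack(self, ack_seq: int) -> list:
        """Drop every buffer fully covered by ack_seq; return the user buffers released."""
        released = []
        for seq in sorted(self._pending):
            buf, user_owned = self._pending[seq]
            if seq + len(buf) <= ack_seq:
                del self._pending[seq]
                if user_owned:
                    released.append(buf)
        return released

    def on_retransmit_timeout(self, seq: int):
        """Slow path: copy the data to a spare buffer and hand the user's buffer back."""
        buf, user_owned = self._pending[seq]
        if user_owned:
            self._pending[seq] = (bytes(buf), False)  # the one extra copy
            self.spare_copies_made += 1
        return buf
```

In this model a connection that always gets its ACKs promptly never increments `spare_copies_made`, which is the "optimize for the general case" point.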
Re:Packet loss by iansmith · 2002-05-13 02:45 · Score: 1

You are assuming that the whole 90 mb/s stream is going to one site.

More likely you will have a hundred processes each sending to a different endpoint on the internet. The routers in between are just that: routers, designed from the ground up to handle fast packet switching. I'd say this will be very useable in the real world to increase performance.

As for games, I have not yet seen a LAN game that can't be run on a 10 Mbps ethernet network. 100 is better, but you need a massive (64 players?) game to even be able to notice the improvement.
Re:Packet loss by neptuneb1 · 2002-05-15 05:50 · Score: 1

While it is true your server would be unable to send 52Mbps to a single client, chances are that your traffic goes over multiple different links once it leaves your ISP (possibly even within your ISP). If you pay for a good service, their internal structure should be able to handle your 90Mbps. It's the same reason that people invest in gigabit ethernet. The internet can't handle anywhere NEAR that bandwidth end-to-end, but split among multiple clients, it's not so unrealistic.

--
No.
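
The arithmetic behind the last three comments can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical and the link speeds used in the usage note are invented numbers, not measurements from the thread. Each flow is limited by the slowest hop on its own path, and the total is limited by what the server can push out.

```python
def aggregate_throughput_mbps(server_link_mbps, client_paths_mbps):
    """Sum of per-client bottlenecks, capped by the server's own output rate.

    client_paths_mbps is a list of paths; each path is a list of hop speeds.
    """
    per_client = [min(path) for path in client_paths_mbps]
    return min(server_link_mbps, sum(per_client))
```

With one client behind a 10 Mbps hop, a server capable of 90 Mbps still delivers 10. With a hundred clients each behind a 1.5 Mbps hop, the sum of the bottlenecks is 150 Mbps, so the server's 90 Mbps becomes the limit and the zero-copy gain is actually used.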
Re:Packet loss by Anonymous Coward · 2002-05-17 09:24 · Score: 0
> Obviously in this situation you need to hold on to the user's buffer until you have gotten an ack, otherwise you don't have the data when you need to retransmit. But you want to optimize for the general case which is you do get an ack real quick and then you can just return the buffer...I suppose a clever TCP implementation using this would have some spare buffers, and after some time without an ack (maybe the first retransmission, maybe sooner), it would copy the data and give the user's buffer back.

Actually, this is solved because the page loan mechanism just marks the pages COW. By marking the pages COW rather than actually copying them, one gets to avoid copying in a lot of cases. And even if you do end up copying them, you've still done better than before. Why? I'll answer each independently.
Why don't you copy in a lot of cases? Because there's a good chance that you will be done with the pages before the application tries to modify them. Many applications appear to modify the pages immediately, but please do not forget that:
1. the application may be blocked on the receipt of the ACK (depending on how large the writes are),
2. if the application is using mmap'ed space it is unlikely that it will immediately modify it, and
3. if the application modifies the pages by calling read, then the VM subsystem can just remap the page without causing a copy to occur.
If the pages are likely to be copied anyway, why does the throughput go up? Well, copying after the fact will reduce latency and hence allow higher total throughput for short-lived connections. (This one is just conjecture.) Why is this a good change, regardless of the existence or non-existence of sendfile(2)? Using sendfile(2) requires that each application with serious performance needs be modified to use a new system call. The method employed in NetBSD is complementary, and more useful because it provides performance enhancements to the existing code base.
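
A toy Python model of the copy-on-write loan described above. The real mechanism lives in the VM system and works on page tables; here the class and method names are hypothetical and a `bytearray` stands in for a page. The point it illustrates is that a copy happens only if the application writes while the loan is still outstanding.

```python
class LoanedPage:
    """Toy model of a page loaned to the network stack under copy-on-write."""

    def __init__(self, data: bytes):
        self.user_view = bytearray(data)    # what the application sees
        self.kernel_view = self.user_view   # loan: same storage, no copy made
        self.on_loan = False
        self.copies = 0

    def loan_for_transmit(self) -> None:
        """write() on the socket: mark the page COW instead of copying it."""
        self.on_loan = True

    def user_write(self, offset: int, data: bytes) -> None:
        """Application store; triggers a copy only while the loan is outstanding."""
        if self.on_loan and self.kernel_view is self.user_view:
            # COW fault: the application gets a private copy,
            # the data queued for transmission stays untouched.
            self.user_view = bytearray(self.kernel_view)
            self.copies += 1
        self.user_view[offset:offset + len(data)] = data

    def ack_received(self) -> None:
        """The peer has the data; the loan ends and writes are free again."""
        self.on_loan = False
        self.kernel_view = self.user_view
```

If `ack_received()` runs before the first `user_write()`, `copies` stays at zero, which is the "done with the pages before the application tries to modify them" case.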

Port it to Linux!! by redcliffe · 2002-05-06 14:46 · Score: 1, Troll

When will this be ported to Linux? :-P

Re:Port it to Linux!! by Anonymous Coward · 2002-05-06 15:03 · Score: 0

port it to windows first, oh wait windows has had it since windows 2.0
Re:Port it to Linux!! by Anonymous Coward · 2002-05-06 15:42 · Score: 0

Absolutely and there was never a single first party remote exploit against windows 2.0 either, great networking code there.
Re:Port it to Linux!! by ozzmosis · 2002-05-06 19:18 · Score: 1

Windows didn't have TCP code until 3.11 (remember Windows for Workgroups?)
Re:Port it to Linux!! by Anonymous Coward · 2002-05-06 19:20 · Score: 0

Windows 2.0 didn't even have TCP. But if it had, getting zero copy TCP on a flaky system without memory protection isn't hard--every embedded system does it. The hard part is to do it on a multitasking, secure system. And Windows still doesn't do that.
Re:Port it to Linux!! by Anonymous Coward · 2002-05-06 19:28 · Score: 0

Zero copy networking was integrated in the middle of the 2.4 series.
Re:Port it to Linux!! by Anonymous Coward · 2002-05-06 21:52 · Score: 0

Further proof that slashdotters will bite on anything if you repeat it enough... :-P
Re:Port it to Linux!! by Anonymous Coward · 2002-05-06 23:04 · Score: 0

YHBT:HAND
Re:Port it to Linux!! by jo42 · 2002-05-07 03:58 · Score: 1

> When will this be ported to Linux?
Why?
Re:Port it to Linux!! by Anonymous Coward · 2002-05-07 04:03 · Score: 0

Because Linux is a better system with more users.
Re:Port it to Linux!! by Anonymous Coward · 2002-05-07 09:54 · Score: 0

Well, the fact that it has more users is the prime reason it's NOT a better system. I'll let you think about that for a while.
Re:Port it to Linux!! by Anonymous Coward · 2002-05-07 10:44 · Score: 0

Linux already does this, and has done it for a while.
Re:Port it to Linux!! by Anonymous Coward · 2002-05-07 11:31 · Score: 0

Yes, it makes it harder to be a contrarian kiddie.

So what's the real reason?
Re:Port it to Linux!! by redcliffe · 2002-05-07 12:12 · Score: 2

Well I could say because BSD is dying, but I won't. It's a cool thing and worth porting.
Re:Port it to Linux!! by Anonymous Coward · 2002-05-07 14:37 · Score: 0

I can't comment on NT 3.51/4 or win 95/98/Me, but for 2000 and presumably XP, the tcp stack already does something similar.
The TCP stack runs as a pseudo-privileged client/server slave protocol system. In brief, this means there is no context switch when making a TCP read/write call, and throughput is optimized. When writing, the protections on the out buffer are modified so the application is copy on write until the TCP stack receives an ACK, after which time it is released.
Additionally, when transmitting between 2 win2k servers on a LAN, the md5 checksum is computed with regards to the first/last packet interval, resulting in a 15-25% speed boost. That's one reason why Apache under windows, when viewed with IE, has a faster response time than Apache under linux on the same hardware.
Re:Port it to Linux!! by Anonymous Coward · 2002-05-07 14:39 · Score: 0

No it wasn't. Linus still refuses to integrate it, since it doesn't address the underlying problems involved with the layer 2 / layer 3 differential (which are a result of an ambiguous spec), but merely provides a workaround to increase system response latency.
Re:Port it to Linux!! by Anonymous Coward · 2002-05-07 19:37 · Score: 0

Wrong
Re:Port it to Linux!! by elbles · 2002-05-10 12:36 · Score: 1

IIRC, the Linux 2.4.x kernels *already* support zero-copy TCP operations; not sure about UDP though . . .
Re:Port it to Linux!! by Anonymous Coward · 2002-05-15 07:54 · Score: 0

> That's one reason why Apache under windows, when viewed with IE, has a faster response time than Apache under linux on the same hardware.
Because it cheats?

what about zero copy on receive? by AdamBa · 2002-05-06 15:32 · Score: 3, Interesting

I'm surprised nobody has come up with hardware to do this. The problem is the network card needs to know about the user's buffer ahead of time. In the old days (i.e. 5 years ago) you had a mix of Netbeui and IPX and TCP on a network and it didn't make much sense to make a card intelligent enough to figure out where to put packets.

But now all anyone cares about is TCP. Furthermore, a typical copy of data to a server goes something like:

1) packet sent by the client to a known port on the server
2) a few packets to set things up and assign a dedicated server port
3) lots of data blasting from the client to the dedicated server port
4) some cleanup packets at the end

Step 3 is what you care about. So you would need to tell the network card, when you get packets for this port, put the data in this buffer in the order received, and put the headers here (in some small header-sized buffers TCP would also provide). Now you might get bad checksums (although the hardware could check that also) or drops or out of order, then you would need to rearrange...but in the 99%+ normal case you get all the packets in order with valid checksums. So the card stuffs the data in the right place, TCP checks the header buffers to make sure everything is kosher, and boom your data is in memory with no copies and off to disk (or wherever) it goes.

You need some other stuff like TCP has to be able to hint this to the network card driver, and figure out if more than one app is using a port (so it can turn all this optimization off) and so on. But hey when it worked it would be cool.

The other way this would work is if the network card was set up with a big chain of receive buffers and it would actually hand a buffer up to TCP (so it got taken out of the chain) and then eventually it would get it back...but this requires a lot of trust of the levels above TCP that ultimately decide when the receive data isn't needed anymore.

As Dilbert said this weekend...if you can understand the preceding, you have my sympathy.

- adam

Re:what about zero copy on receive? by EvlG · 2002-05-06 15:43 · Score: 2

This paper describes a distributed shared memory system in which just such a hardware mechanism was used, called direct deposit. The paper has some information on it.
Re:what about zero copy on receive? by Espen+Skoglund · 2002-05-06 22:37 · Score: 3, Insightful

The problem is not necessarily that of early demultiplexing of incoming packets to the appropriate ports. The problem is that in order to achieve zero-copy you'll have to store the packet at the appropriate location in memory. I.e., you must store the incoming packet in the buffer location that the user level application uses for receive. Now, obviously you can not store the packet there unless the user has requested to receive more data (the buffer may be used for other purposes). A solution to this problem is to program all the applications so that their receive buffers are aligned on page-boundaries, and the page containing the receive buffer is only used for containing the receive buffer. This allows the kernel to receive incoming data onto empty pages and map these pages into the application when the application eventually issues a receive operation.
Of course, there are more quirks to the problem than what I've discussed here. However, the point is that one can not easily implement zero-copy TCP receive without having well behaved applications (i.e., without modification of the application). Zero-copy TCP send is easier since the location of the outgoing packet is known to the kernel once the send operation starts.
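
A small Python sketch of what a "well behaved" application buffer could look like from user space: page-aligned, a whole number of pages long, and used for nothing but receiving. The function names are hypothetical, and nothing here makes the kernel remap pages; it only shows how an application can arrange a buffer with those properties (an anonymous `mmap` region) and receive straight into it with `recv_into()`.

```python
import mmap
import socket


def page_rounded(nbytes: int) -> int:
    """Round a length up to a whole number of pages."""
    return -(-nbytes // mmap.PAGESIZE) * mmap.PAGESIZE


def receive_into_page_buffer(sock: socket.socket, nbytes: int) -> bytes:
    """Receive up to nbytes (must be > 0) into a dedicated page-aligned buffer."""
    # An anonymous mapping starts on a page boundary and shares its pages
    # with no other object in the process.
    buf = mmap.mmap(-1, page_rounded(nbytes))
    try:
        with memoryview(buf) as view:
            got = sock.recv_into(view, nbytes)
            # Copied out only so the demo can return a value after unmapping.
            return bytes(view[:got])
    finally:
        buf.close()
```

The final `bytes(...)` copy exists only to keep the example short; a real application would keep working on the mapped buffer.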
Re:what about zero copy on receive? by AdamBa · 2002-05-07 01:51 · Score: 3, Insightful

Sure, you would only do it when certain things were true -- the user had already posted a buffer, nobody else was using that port, etc. But a standard situation like somebody ftp'ing up a huge file would match those conditions.
Now I confess I don't know anything about the internals of any Unix version. What I worked on was NT. And in NT this would be very easy (memory-management-wise I mean) as long as you had the user's buffer ahead of time...no need to have receive buffers aligned on a page boundary or anything. Since I don't know *nix, I don't know why that restriction would be needed. The card could receive anywhere in memory (doesn't need to be aligned)...and in NT you can map any user buffer into kernel space so any device driver can access it, then lock it to a physical address so the card can access it.
- adam
Re:what about zero copy on receive? by Espen+Skoglund · 2002-05-07 02:28 · Score: 1

True. One can indeed optimize for the common case (i.e., application is waiting to receive data), and handle other cases in another manner. Just to hand the ball over to you again, though: when receiving data into a user buffer you don't want to receive packet headers, etc., into the same buffer. As such, you must have some way to split up the receive so that the headers and trailer go into one location and the packet data goes into another. This can, depending on the networking hardware, occur completely at the network card (i.e., programmable cards) or at some higher level in the OS. I don't think that one general solution to the problem exists (although I must admit that I by no means qualify as being knowledgeable in the field).
Re:what about zero copy on receive? by AdamBa · 2002-05-07 04:35 · Score: 3, Interesting

Yes, you need separate buffers for the headers (I tried to explain this in my first post but wasn't that clear).
Let's say you get a 64K buffer from the user. So you hand it to the card and say "all data for port 0x1234 goes in here." Then you also give a chain of receive headers, 64 bytes or whatever. The processing after that should be pretty straightforward...when the card interrupts you with a packet received, it sets a flag saying that the data was put in a user buffer. Then the network card driver tells TCP it has a packet and that it has split header/data and the data is at location XXX. At that point the processing should be basically the same for TCP, verifying checksums and headers etc, but then at the last step where it would copy the data to the user's buffer, it just doesn't have to -- as long as the data was supposed to wind up at XXX.
The tricky case is handling drops and dups and out of order. For example if the fifth packet in a transfer is received third, then TCP can't just move it to the right spot because the card may be using that spot to receive another packet. In general trying to tell the card "oh you should back up and start receiving new packets here instead of here" is tricky timing because a packet may be coming in while you are trying to tell the card that.
Of course in situations like this TCP doesn't have to be perfectly optimized since you will likely need to retransmit anyway, but it shouldn't be terrible. And the card will also need to be given a set of general buffers for packets that are not to an expected port, or where the user buffer runs out of room, etc. Then TCP has to be clever about putting those packets in the right place.
You could have the card be smarter and actually know where to put each packet; it could even do acks and retransmit requests...but you don't want to make it too complicated. Plus I think you want to avoid having the card need to interpret any part of the packet that is encrypted during an IPSEC session (which I don't know exactly where that begins). Some cards do IPSEC in hardware but that is another issue.
And of course this only helps if the server is CPU-bound, as opposed to disk or network etc.
- adam
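
The per-port header/data split described above, as a toy Python model. No real network card exposes this interface as far as the thread establishes; the class name, the `(port, seq, payload)` packet shape and the use of byte offsets as sequence numbers are assumptions for illustration. In-order packets for the expected port land directly in the user's buffer, and everything else falls back to a general receive ring for TCP to sort out.

```python
class SplitReceiver:
    """Toy model of a card that splits headers from payload for one port."""

    def __init__(self, port: int, user_buffer: bytearray):
        self.port = port
        self.user_buffer = user_buffer
        self.next_seq = 0        # next byte offset expected in the stream
        self.headers = []        # small per-packet header records
        self.general_ring = []   # fallback buffers for everything else

    def on_packet(self, port: int, seq: int, payload: bytes) -> bool:
        """Return True if the payload went straight into the user buffer."""
        fits = seq + len(payload) <= len(self.user_buffer)
        if port == self.port and seq == self.next_seq and fits:
            self.user_buffer[seq:seq + len(payload)] = payload
            self.headers.append((port, seq, len(payload)))
            self.next_seq += len(payload)
            return True
        # Wrong port, out of order, duplicate, or buffer full: slow path.
        self.general_ring.append((port, seq, payload))
        return False
```

The sketch sidesteps the timing problem mentioned above (re-pointing the card while a packet is arriving) by never moving the fast-path cursor backwards; anything unexpected simply takes the slow path.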
Re:what about zero copy on receive? by thorpej · 2002-05-08 09:18 · Score: 2, Insightful

A zero-copy receive path is a significantly harder problem to solve.

Basically, devices DMA into host memory. These buffers must be preallocated, since you never know when a packet might arrive, and when it does, it needs to go into memory immediately, since the temporary storage in the Ethernet MAC itself is quite small.

When the data arrives, we still don't know which application it is for. We have to parse headers, etc. to determine that. And once we do, we have a buffer that is:

1. Not page-aligned.
2. Not page-rounded.

This makes it very difficult to "page flip" the buffer into userspace.

The Trapeze/IP project at Duke implemented a zero-copy receive for FreeBSD, but it required special modifications to the firmware on the Alteon ACEnic Gig-E interfaces they were using. Those interfaces aren't even manufactured anymore, and there are essentially no Gig-E interfaces on the market today which allow you to hack the firmware in such a way. So, their solution pretty much can't be used unless you have full control over the hardware that's going into your device (i.e. it's pretty much of use only to people building embedded systems from scratch).

Therefore, in the absence of another solution, you are forced to perform at least one copy on the receive side: from the interface's receive buffer into a page-aligned/page-sized buffer in the socket. Once you have that, however, you *can* page-flip into user space, and since the copy across the protection boundary is usually more expensive than a copy within the same address space, there's still some benefit that can be realized.
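
The two conditions listed above (page-aligned, page-rounded) reduce to a pair of modulus checks. A short Python sketch, with a hypothetical function name and made-up addresses, assuming a 4096-byte page in the usage note:

```python
import mmap


def can_page_flip(address: int, length: int, page_size: int = mmap.PAGESIZE) -> bool:
    """True only if the buffer starts on a page boundary and covers whole pages."""
    return length > 0 and address % page_size == 0 and length % page_size == 0
```

A 1460-byte TCP payload sitting at some offset inside a receive buffer, e.g. `can_page_flip(0x10042, 1460, 4096)`, fails both checks, whereas a socket buffer such as `can_page_flip(0x10000, 8192, 4096)` passes.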

--
-- Jason R. Thorpe, NetBSD and FreSSH developer
Re:what about zero copy on receive? by AdamBa · 2002-05-08 15:14 · Score: 2

I understand the issues you bring up, but I still think it can be done. You are thinking of it as receiving into one of the standard buffers you have preallocated into the netcard's receive ring, and then somehow making this map into the user's buffer. That's not what I mean.
I mean in certain cases (but cases that generally correspond to the ones you would care about, large transfers to a server), you have the user's buffer ahead of time, and furthermore there is just one client of the port, and TCP knows this because it has *assigned* the port to that client. For example when the ftp daemon is receiving a put command, it moves the client computer over to a random port, which it asks TCP to assign to it. Let's say TCP assigns port 0x9876. So TCP knows that the ftp daemon is the only one using port 0x9876. And the ftp daemon has posted a big receive to get the data. So TCP can take that receive's buffer and hand it to the driver and say "all data [but not the headers] received on port 0x9876 goes in this buffer." Then the card packs the data in and in the normal case where every packet is received only once and in order, it works.
The firmware mod you would have to do is instead of the card having one big ring of receive buffers, it needs a bunch of "per-port" buffers (plus one big ring as before). Now that takes some space to set up the control, but the buffers themselves are all user buffers, so there is not a *lot* more storage needed. The firmware needs to pick out the port number "on the fly" while it is receiving the packet, but I think it could do that.
There may be alignment issues....maybe the card has to say "gee, the next spot for packets to port 0x9876 has an address that is congruent to 3 modulo 4"...so it has to put one extra byte somewhere (along with the header maybe) and then start its transfer at the 4-byte boundary.
If you want to slap down credentials, I was one of the main designers of both NDIS (the transport <-> network card interface) and TDI (the transport <-> level above) interfaces in NT. And I wrote several transports and netcard drivers. So I know a bit whereof I speak.
- adam
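
The alignment case mentioned above (an address congruent to 3 modulo 4 needs one stray byte handled separately) is a one-line calculation. A Python sketch with a hypothetical helper name and made-up addresses:

```python
def bytes_to_boundary(address: int, alignment: int = 4) -> int:
    """Number of bytes to place elsewhere before an aligned transfer can start."""
    return (-address) % alignment
```

For an address ending in 3 modulo 4, such as `0x1003`, this gives 1, matching the "one extra byte" in the comment; an already aligned address gives 0.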
Re:what about zero copy on receive? by NoMoreNicksLeft · 2002-05-11 02:22 · Score: 2

What a retarded thing to say. For one thing, there will soon be ipv6 TCP, is this the tcp you're referring to, or ipv4's version?

There are too many legacy systems and apps out there, for me to have to worry if the replacement nic is going to try and outsmart me. That's something that happens all too often in windows, and I'll be damned if it happens in hardware without me bitching up a storm.

I regularly see NetBEUI, DECnet, and IPX on the systems I work with, and even something stranger from time to time. That was undoubtedly one of ethernet's core strengths, the ability to be protocol agnostic. On every other physical/logical protocol, you always had to jump through hoops to use a different protocol, but ethernet just doesn't care.

I do care about more than just TCP/IP. If you weren't a fool, you would too.

Suggestion to moderator: Flamebait, yes. Troll, no. If you are too stupid to see the difference, then I'm sure the meta-moderator won't be.
Re:what about zero copy on receive? by Anonymous Coward · 2002-05-14 06:27 · Score: 0

Damn, what happened ? Did someone shove a star-lan up your ass today ?

Linux has better zerocopy TCP, and here's why by Jamie+Lokier · 2002-05-06 15:56 · Score: 4, Informative

True zerocopy has certain hardware and driver requirements. These are the network drivers in linux 2.5.9 which support zerocopy TCP: 3c59x, acenic, sunhme, 8139cp, e100, ns8320, starfire, via-rhine, sungem, e1000, 8139too, tg. (Disclaimer: Not all cards supported by those drivers necessarily support full zero copy). That's from grepping for the NETIF_F_SG and at least one of the NETIF_F_(IP|NO|HW)_CSUM flags.

In Linux, zerocopy is performed using the sendfile(2) system call, rather than writing to a socket from a memory-mapped file, as you are meant to do with the new BSD code. Although the mmap method is a neat way to make a few existing programs faster, it is less efficient than the sendfile() method, to some degree, and certainly more complicated to implement.
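
The two styles contrasted above, side by side in a minimal Python sketch. The function names are hypothetical. `socket.sendfile()` uses `os.sendfile()` where the platform supports it and falls back to plain `send()` otherwise, and whether the mmap-then-write path avoids a copy depends entirely on the kernel, so this shows the shape of the two call sequences, not their performance.

```python
import mmap
import socket


def send_with_sendfile(sock: socket.socket, path: str) -> int:
    """sendfile style: hand the open file to the kernel, no user buffer involved."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


def send_with_mmap_write(sock: socket.socket, path: str) -> int:
    """mmap style: map the (non-empty) file, then write the mapping to the socket."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            sock.sendall(m)
            return len(m)
```

The second function needs a map and an unmap around every transfer, which is the per-call overhead the post argues can outweigh the copy for small files.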

A write-from-mmap implementation has to provide certain allowances for user space behaviour. Although it's advised not to touch the pages from user space, allowance for this basically requires the OS to "pin" pages, either by modifying page tables which implies TLB and page walking cost (if the pages are actually mapped, which they probably are not in a Samba/www/ftp server), or by at least pinning the underlying page cache pages in case someone does a write() to the mapped file. sendfile() does not require the pages to be pinned, because it provides different guarantees about which data is transmitted if the data is being modified concurrently.

Another nice thing about sendfile() is that it's quite fast even for small files. The overhead of calling mmap() and then munmap() may outweigh the copying time for a small transmission. Basically, why bother with mmap/write/munmap when you can just do a sendfile, which doesn't require the kernel to jump through hoops to decode what you meant.

Well, I don't know if it makes much difference in performance if you only mmap() a file without referencing the pages from user space, and write it to a socket. We'll have to wait for the numbers.

But there is another great thing about sendfile! You can use it to transmit user-space generated data, such as HTTP headers, too. This is done by memory-mapping a shared file (such as a pure virtual memory "tmpfs" file, but you can use a real disk file too). Then you can write to that mapped memory from user space, and call sendfile() to transmit what you have just generated.
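
A minimal Python sketch of the trick described above: generate data in user space inside a shared file mapping, then transmit it with sendfile. The function name is hypothetical and an ordinary temporary file stands in for a tmpfs file. It assumes a platform where a shared mapping and the file's cached pages are coherent (as on Linux), so that what was stored through the mapping is what gets sent.

```python
import mmap
import socket
import tempfile


def send_generated(sock: socket.socket, payload: bytes) -> int:
    """Store generated bytes in a shared file mapping, then sendfile() them."""
    with tempfile.TemporaryFile() as f:
        f.truncate(len(payload))            # size the backing file before mapping
        with mmap.mmap(f.fileno(), len(payload)) as m:  # shared, writable mapping
            m[:] = payload                  # "generate" the data in user space
            # From here on the mapped bytes must not be modified until the
            # transmission is done; the application is responsible for that.
            return sock.sendfile(f, offset=0, count=len(payload))
```

A long-lived server would keep one such mapping around and manage it as a pool instead of creating a file per response; the per-call file here just keeps the example short.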

You can do what I just described, with a mapped, shared file, using the new BSD zerocopy patches too. If using sendfile(), the weaker concurrency guarantees of sendfile() vs. write() mean it is your responsibility to not modify the data until you are sure it's been received at the far end. In some ways user space has more responsibility, to carefully manage the data pool, with this method of using sendfile() for program-generated data than with BSD-style write(). On the other hand, that's exactly why the BSD kernel must do more work of pinning pages, and in this mode of usage there is definitely TLB flushing cost and cross-CPU synchronisation cost, so if you are really crazy for performance, sendfile() may just have the edge. (Well I expect so anyway, I haven't done performance comparisons).

By the way, write() from memory-mapped files has been discussed among linux kernel developers several times in the past, and each time the idea lost due to the feeling that page table manipulation is not that cheap (especially not on SMP), and now that we have sendfile... Well, if you were writing a really high performance user space server, you'd use sendfile anyway so writing from mmap becomes a bit moot.

Finally, zerocopy UDP is not implemented in linux at present as far as I know, but some gory details were discussed recently on the kernel list so it is sure to arrive quite soon. The difficult infrastructure (drivers, page-referencing skbuffs) which is used by the zerocopy TCP implementation has been part of the 2.4 kernel since 2.4.4 (more than 1 year ago) and I believe it has been thoroughly tested since then.

Enjoy,
-- Jamie Lokier

Re:Linux has better zerocopy TCP, and here's why by Anonymous Coward · 2002-05-06 17:27 · Score: 0

Thanks
Re:Linux has better zerocopy TCP, and here's why by slashtop · 2002-05-06 20:08 · Score: 0

One of zero-copy's goals is to improve the performance of old programs without forcing them to be rewritten!
sendfile() is a possible solution, but many programs were not written that way, and because sendfile() can only send file-like data, not intermediate buffers, it cannot cover everything. Improving traditional recv() and send() performance is worth doing.
Nice to see this was implemented in NetBSD. Anyone considering porting it to FreeBSD?
Re:Linux has better zerocopy TCP, and here's why by qeL3-i · 2002-05-06 20:56 · Score: 1

The GNU manpage for sendfile says it's not portable and shouldn't be used. That's good enough for me, don't use it.

VERSIONS
sendfile is a new feature in Linux 2.2. The include file
<sys/sendfile.h> is present since glibc2.1.

Other Unixes often implement sendfile with different
semantics and prototypes. It should not be used in
portable programs.
Re:Linux has better zerocopy TCP, and here's why by lamontg · 2002-05-08 06:47 · Score: 1

No, the GNU manpage says that it should not be used in programs that you expect to be easily portable. That is a warning to the programmer that if they use sendfile() that they will likely need to code for multiple different APIs, and will need to use tools like GNU autoconf to work around differences. That does not mean that sendfile() "should not be used."
In fact, apache 2.0 now does exactly this -- determines the version of sendfile() on your system and implements the appropriate code to use it.
Slashdot: Bored at work? Come explain elementary coding practices to 15 year old hax0rs!
Re:Linux has better zerocopy TCP, and here's why by thorpej · 2002-05-08 09:06 · Score: 1

Yes, Linux does have sendfile(2) while NetBSD does not. No argument there. However, sendfile(2) has some issues:

1. It only works for sending complete files, and I seem to recall that it closes the file at the end of the transaction. This doesn't work for e.g. Samba servers, which need to send chunks of files.

2. sendfile(2) doesn't work for data sourced from something other than a file. Consider a database server which maintains a memory-resident cache. Or consider piping output from a command, say, dump(8), over the network. Or consider the case of an iSCSI target device, which might have mmap'd a chunk of disk/file blocks to serve on-demand.

My change works for those 2 (important!) cases above.

That is not to say that NetBSD won't get a sendfile(2)-like mechanism in the future (it will probably get a splice(2) system call, which allows you to hook together 2 arbitrary file descriptors, one source, one sink -- essentially a generalization of sendfile(2)).

It's also worth noting that the NetBSD zero-copy TCP/UDP transmit path doesn't require anything special from a device driver; the driver simply needs to be able to DMA from arbitrary memory addresses. And even if a device can't do this, you have still reduced memory bandwidth consumption by eliminating the copy from user space to kernel space.

--
-- Jason R. Thorpe, NetBSD and FreSSH developer
Re:Linux has better zerocopy TCP, and here's why by Anonymous Coward · 2002-05-13 02:48 · Score: 0

> 1. It only works for sending complete files, and I seem to recall that it closes the file at the
> end of the transaction. This doesn't work for e.g. Samba servers, which need to send chunks of files.

No, sendfile() accepts the arguments:

ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

You can send arbitrary chunks of files by specifying offset and count. No, sendfile() doesn't close the file when it completes sending, either.
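
A short Python sketch of sending an arbitrary chunk, using `os.sendfile()`, which takes the same four arguments as the prototype quoted above. The wrapper name is hypothetical. It assumes a platform where `os.sendfile()` accepts a socket as the output descriptor (Linux does); passing an explicit offset leaves the input file open and usable afterwards.

```python
import os
import socket


def send_chunk(sock: socket.socket, in_fd: int, offset: int, count: int) -> int:
    """Send count bytes of an open file starting at offset; return bytes sent."""
    sent = 0
    while sent < count:
        n = os.sendfile(sock.fileno(), in_fd, offset + sent, count - sent)
        if n == 0:      # hit end of file before count bytes were available
            break
        sent += n
    return sent
```

A Samba-style server could call this repeatedly with different offsets on the same descriptor, which is the "chunks of files" case raised in the parent comment.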

> 2. sendfile(2) doesn't work for data sourced from something other than a file. Consider a
> database server which maintains a memory-resident cache. Or consider piping output
> from a command, say, dump(8), over the network. Or consider the case of an iSCSI target device,
> which might have mmap'd a chunk of disk/file blocks to serve on-demand.

You can create a buffer acceptable to sendfile by just creating a shared memory segment and using shm_open(). True, this requires rewriting applications to take advantage of it.

Anything you can do with mmap() can be done with sendfile().

> It's also worth noting that the NetBSD zero-copy TCP/UDP transmit path doesn't require anything
> special from a device driver; the driver simply needs to be able to DMA from arbitrary memory addresses.

That is special; a lot of cheap PCI cards require 8 or 16 byte alignment on transmit.

I assume the card also needs to be able to compute IP checksums in hardware; otherwise you haven't really gained anything here.

> And even if a device can't do this, you have still reduced memory bandwidth consumption by
> eliminating the copy from user space to kernel space.

I'm not sure what you mean by this.
In traditional Linux, the copy/checksum is done simultaneously from user space to the DMA buffer for the network card.
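
To make "copy/checksum done simultaneously" concrete, here is a portable sketch that copies a buffer and computes the Internet checksum (the 16-bit one's-complement sum described in RFC 1071) in the same pass. It is not the kernel's routine, which is typically hand-tuned; the function name copy_and_checksum is invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `len` bytes from src to dst and return the RFC 1071 Internet
 * checksum of the data, touching each source byte only once. */
uint16_t copy_and_checksum(void *dst, const void *src, size_t len)
{
    const uint8_t *s = src;
    uint8_t *d = dst;
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        d[i] = s[i];
        d[i + 1] = s[i + 1];
        sum += ((uint32_t)s[i] << 8) | s[i + 1];  /* big-endian 16-bit word */
    }
    if (i < len) {                /* odd trailing byte, padded with a zero byte */
        d[i] = s[i];
        sum += (uint32_t)s[i] << 8;
    }
    while (sum >> 16)             /* fold the carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

If the card computes the checksum in hardware and the data is never copied, neither half of this loop has to run on the CPU.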

Anyway, you need to address 3 issues regarding zerocopy:

1. Do you support full correctness on write() data? (i.e. what happens when a page is shared between two threads or processes and one does write() to a socket? If the other process modifies the page, do the modifications show up in the TCP output? Will the IP checksum still be correct in all cases?)

2. To do the above, do you need to mark all pages with write() in progress as read-only and prepare to do COW in case of modification?

3. What is the overhead in terms of MMU operations and TLB flushes for doing this?

(the consensus in Linux was that #3 is a killer, especially on SMP because you might need to modify the page tables of a process running on another CPU, which means kicking the scheduler and is messy)
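
The correctness property behind question 1 can be written down as a small check: once write() has returned, the application may scribble on its buffer, and the receiver must still see the bytes as they were at the time of the call. The sketch below uses a local AF_UNIX socket pair as a stand-in for a TCP connection (an assumption made so it runs without a network), and the function name write_is_snapshot is made up.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Write buf to one end of a socket pair, overwrite buf, then read the other
 * end. Returns 1 if the receiver saw the original bytes, 0 if it saw the
 * overwritten ones, -1 on error. */
int write_is_snapshot(char *buf, size_t len)
{
    char saved[256], rx[256];
    int sv[2];
    int result = -1;

    if (len == 0 || len > sizeof saved)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    memcpy(saved, buf, len);                 /* remember what was written */
    if (write(sv[0], buf, len) == (ssize_t)len) {
        memset(buf, 'X', len);               /* reuse the buffer right away */
        if (read(sv[1], rx, len) == (ssize_t)len)
            result = (memcmp(rx, saved, len) == 0);
    }

    close(sv[0]);
    close(sv[1]);
    return result;
}
```

A copying kernel passes this trivially. A zero-copy kernel has to pass it too, which is where the read-only marking and copy-on-write in questions 2 and 3 come from.
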
Re:Linux has better zerocopy TCP, and here's why by thorpej · 2002-05-15 04:22 · Score: 1

Regarding sendfile(2)'s semantics ... thanks for correcting me on that. (I must admit it has been a while since I looked at specific sendfile(2) implementations.)

Regarding Tx DMA alignment for cheap PCI cards... I have written a fair number of Ethernet drivers over the past several years, and I can't think of any chip that required any more than 4-byte alignment of the DMA buffer. The ones that spring to mind are the RealTek 81x9, Xircom X3201, and the VIA "Rhine" series... oh, and e.g. the Alchemy Au1000 built-in Ethernet MAC also requires 4 byte alignment.

But by far the common case is that the chip can DMA from an arbitrary byte location in memory. I don't really consider this a "special" feature. Rather, I consider chips that can't do this to be "crippled".

(Note, it IS fairly common for Ethernet chips to have stupid limitations on the *receive* side, specifically 4-byte alignment of the Rx buffer, which means the IP header ends up misaligned after the 14-byte Ethernet header. This is truly annoying, since software has to copy data to fix it up.)
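
The arithmetic behind that misalignment, as a small sketch: with a 14-byte Ethernet header, a receive buffer that starts on a 4-byte boundary puts the IP header 2 bytes past one. The function name and the `pad` parameter are invented for illustration. A driver can only use a 2-byte pad if the chip accepts a receive address that is not 4-byte aligned, which is exactly what the chips described above do not.

```c
#include <stdint.h>

#define ETH_HDR_BYTES 14u   /* destination MAC + source MAC + EtherType */

/* How many bytes past a 4-byte boundary the IP header lands when a frame
 * is received at buf_addr + pad. 0 means the IP header is aligned. */
unsigned ip_header_misalignment(uintptr_t buf_addr, unsigned pad)
{
    return (unsigned)((buf_addr + pad + ETH_HDR_BYTES) % 4u);
}
```

For a 4-byte-aligned buffer, a pad of 0 gives a misalignment of 2 and a pad of 2 gives 0.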

Regarding the 3 things you have to do to do zero-copy. NetBSD's virtual memory system has a generic framework for handling "loaning" of pages from one VM object to another. The uvm_loan facility is currently used by pipe(2) and socket writes (new with my changes). That said:

1. Yes, we support full correctness of the written data. If another thread (or the same thread) touches the page before the kernel is finished with it, the loan is considered broken, and a copy-on-write fault is taken to resolve the situation.

2. This is really the same question as (1). The pages aren't marked as "write in progress", per se, but are marked as "loaned" (loans can be used for things other than just outbound I/O, although that is the most obvious use of the facility).

3. Yes, there is TLB manipulation traffic. This is why you pick a threshold for using the loaning facility. Doing it for small writes would be stupid, since the copy would be less overhead than the TLB traffic. That said, even in an MP system, it's not too bad, since NetBSD uses explicit barrier operations for low-level VM operations (so that the machine-dependent "pmap" module can coalesce TLB operations if it would be beneficial to do so). In any case, the expense of TLB shootdown traffic is largely an implementation issue.
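
The threshold idea in point 3 can be sketched in a few lines. This is an illustration only, not NetBSD's code: the function names are made up, and the threshold value is left as a parameter since the right number depends on the machine.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of pages touched by the buffer [addr, addr + len). Each one is a
 * page whose mapping may have to be adjusted if it is loaned. */
size_t pages_spanned(uintptr_t addr, size_t len, size_t pagesz)
{
    if (len == 0)
        return 0;
    uintptr_t first = addr / pagesz;
    uintptr_t last = (addr + len - 1) / pagesz;
    return (size_t)(last - first + 1);
}

/* Hypothetical policy: loan pages only for writes of at least `threshold`
 * bytes, and fall back to an ordinary copy for anything smaller. */
int should_loan(size_t len, size_t threshold)
{
    return len >= threshold;
}
```

With 4096-byte pages, a 2-byte write that straddles a page boundary touches 2 pages, so copying it is clearly cheaper than loaning them.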

--
-- Jason R. Thorpe, NetBSD and FreSSH developer

FreeBSD has zerocopy sendfile... by lamontg · 2002-05-06 19:59 · Score: 5, Informative

A basic bit of research by running 'man sendfile' on a FreeBSD system would have told you this:

SENDFILE(2) FreeBSD System Calls Manual SENDFILE(2)

NAME
sendfile - send a file to a socket

LIBRARY
Standard C Library (libc, -lc)

SYNOPSIS
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

int
sendfile(int fd, int s, off_t offset, size_t nbytes,
struct sf_hdtr *hdtr, off_t *sbytes, int flags);

DESCRIPTION
Sendfile() sends a regular file specified by descriptor fd out a stream
socket specified by descriptor s.

[...]

IMPLEMENTATION NOTES
The FreeBSD implementation of sendfile() is "zero-copy", meaning that it
has been optimized so that copying of the file data is avoided.

Re:FreeBSD has zerocopy sendfile... by Espen+Skoglund · 2002-05-06 22:18 · Score: 2, Interesting

And you also forgot to mention:
HISTORY
sendfile() first appeared in FreeBSD 3.0. This manual page first appeared in FreeBSD 3.1.

*BSD needs our help! not our pity :-) by freebsddude · 2002-05-06 20:06 · Score: 1

We see a lot of posts indicating that *BSD is dying. OTOH, we see a lot (literally a LOT!) of enthusiastic and excited FreeBSD (*BSD) newbies looking for help with basic *BSD questions.

An important question we all need to ask ourselves: do we want to be plain sideline observers, or do we choose to help grow *BSD? Again, I am sure all of you *BSD geeks will agree that it would be in our best interests to promote *BSD!!!

This is not an infomercial on our part, but a request for all you "expert" *BSD geeks to give back to the *BSD cause by PARTICIPATING, visiting your favorite *BSD sites (or whatever else !!!), and by helping answer newbie questions.

Newbies sometimes feel intimidated/overwhelmed by mailing lists, complex howtos, etc. and don't exactly know where to start. Some prefer asking simple questions in forums, or following simple howtos, etc. It would be in our best interest to encourage these folks and turn them towards FreeBSD, OpenBSD and NetBSD!! Encouraging them will promote *BSD commitment (committers) with ports/applications, OS and security enhancements. *BSD needs our help!!

So does this mean... by Anonymous Coward · 2002-05-06 20:26 · Score: 0

...that NetBSD will be the new choice of warez and DoS kiddies everywhere?

Re:*BSD needs our help! not our pity :-) by Anonymous Coward · 2002-05-06 20:58 · Score: 0

No way mang! MACOSX is better than dumb old *BSD!

Ok, you're right. by Jamie+Lokier · 2002-05-06 23:41 · Score: 1

You're right, (hangs head in shame...).
I had a nagging feeling when I wrote the article ;-)

Ah well, at least I managed to explain some of the differences between mmap-write and sendfile.

Thanks,
-- Jamie

Re:Ok, you're right. by thallgren · 2002-05-07 22:05 · Score: 2, Insightful

Right, but you completely failed to explain why the sendfile-only/no-mmap zero-copy in Linux is better. Instead you mention that one should create files in tmpfs and use sendfile on that. That can hardly be zero-copy anymore, can it?

Regards, Tommy

hoo baby by AdamBa · 2002-05-07 01:58 · Score: 3, Insightful

Don't know anything about Linux, but compared to NT this seems incredibly complicated. In NT you can take any user buffer, probe it, map it, and lock it. Don't have to worry about "user space behaviour" and whatnot. If the user is doing synchronous write()s (or whatever) the call won't even return until the kernel code says it is done with the buffer. And for asynchronous, you would hope the user won't be walking all over the buffers it has handed over until the call is done. And in all cases the kernel components handle the buffers in the same way; you don't need your network card to indicate with a flag whether it can do zero-copy sends (?!?!?!?!?!?!?).

- adam

Re:hoo baby by Anonymous Coward · 2002-05-07 16:09 · Score: 0

The question is... if you try to use zero-copy on a file that's in userspace and you have to go from usermode to kernelmode, you lose the whole advantage of zero-copy, because going from usermode to kernelmode is a big time hit, whereas if you have a kernelmode driver you won't. I could see this being used for something like a bridge or a gateway: pass packets from one interface to the next without taking the packets out of kernelmode.
Re:hoo baby by Anonymous Coward · 2002-05-11 03:41 · Score: 0

> Don't know anything about Linux, but compared to NT this seems incredibly complicated.
Yes, but in Linux, you won't get a blue screen. And it'll be faster too! The "easy" interface is still available, but might need a copy operation to work, which slows it down somewhat (... in case you didn't notice: that's the whole point of this API: speed up things...).

Install NetBSD Instead by Anonymous Coward · 2002-05-07 03:55 · Score: 0

Subject says it all.

Re:Elegy for *BSD by Anonymous Coward · 2002-05-07 09:09 · Score: 0

Cool. I have a bunch of zombie servers then. Here come FreeBSD zombies for you BSD troll.

Re:*BSD needs our help! not our pity :-) by kwerle · 2002-05-07 11:38 · Score: 1

Funny post - BSD is now the highest volume *nix in the computer biz. Thanks to Apple shipping on Darwin. What's more, users *REAL USERS* don't give a hoot about "basic *BSD questions". They just want it to work.

And it can. And it does.

OK, so maybe this is a troll. But maybe it's insightful, and maybe it's just funny...

OpenBSD and NetBSD? by cpeterso · 2002-05-07 13:24 · Score: 2

I wonder if OpenBSD will copy this new NetBSD code. Does anyone know which BSD has more users, OpenBSD or NetBSD? I hear few people ever talking about NetBSD, but many of the OpenBSD changelogs have comments like "copied this from NetBSD". Does OpenBSD simply "ripoff" NetBSD development?

--
cpeterso

Re:OpenBSD and NetBSD? by Anonymous Coward · 2002-05-07 14:28 · Score: 0

Yes.

Early on, in fact, they went through the NetBSD bug database and simply applied all the patches users had submitted. The intent was to make NetBSD +. However, they made a pile of crud that is only rumored to be more secure because they blow their horns louder.

One patch they applied, btw, was applied totally incorrectly, and was obviously done without any real thought. There were several #define constants at the top of the file, and they duplicated one value, so two IOCTLs had the same number.

With roots like that... secure or just rumor?
Re:OpenBSD and NetBSD? by keramida · 2002-05-07 15:43 · Score: 2, Insightful

> Does OpenBSD simply "ripoff" NetBSD development?
Nope. The sharing of code and ideas among the various *BSD implementations is a Good Thing(TM). Developers from more than one group who join their efforts and write code that is easy to port, clean, and useful to more than one of the BSDs are out there. And they have been doing a great job for quite some time. There's no ripping off, but a cooperative spirit that we should be glad of.
To use your example, what if the "ripping off" of code that OpenBSD is accused of helps in revealing some bug that is caused by assumptions that are true only for NetBSD? Is the "ripping off" then justified and part of a "development process", or is it still a bad thing? What if the OpenBSD folks write back to the original authors and help in fixing possible problems with the NetBSD codebase? Aren't they then assisting in making NetBSD a better OS too?
- Giorgos

--
My other computer runs FreeBSD too.
Re:OpenBSD and NetBSD? by Anonymous Coward · 2002-05-08 01:39 · Score: 0

> I wonder if OpenBSD will copy this new NetBSD code.

Most likely not, at least not any time soon. OpenBSD developers are very conservative when it comes to new, untested, and not time-proven technology. On the other hand, Theo is progressive when it comes to enforcing security; the recent removal of the rlogin/rexec clients and servers from the tree is proof of this.

> Does anyone know which BSD has more users, OpenBSD or NetBSD?

OpenBSD.

> but many of the OpenBSD changelogs have comments like "copied this from NetBSD".

It's the same with NetBSD and FreeBSD changelogs. Even Sun integrates OpenSSH in Solaris-9, but OpenSSH is part of the OpenBSD project, same guys.
Re:OpenBSD and NetBSD? by alan_d_post · 2002-05-08 23:18 · Score: 1

The neat thing about information is that copying it does not put any burden on the original author. Theo forked NetBSD -- you could too, if you wanted. It hasn't exactly killed NetBSD. Changes flow between all three projects.

Of course, I've probably just been trolled . . . .
Re:OpenBSD and NetBSD? by Art+Deco · 2002-05-09 02:50 · Score: 1

As others said, OpenBSD was an offshoot of NetBSD. NetBSD itself started life as a system that combined the Jolitzes' 386BSD with collected patches, so you could say that NetBSD "ripped off" 386BSD. Since all the software is free, nobody is ripping anyone else off though. Since OpenBSD is based in Canada they can ship their system with strong encryption since they are not subject to the USA's fascist crypto laws.
Re:OpenBSD and NetBSD? by yerricde · 2002-05-12 16:55 · Score: 1

> Since OpenBSD is based in Canada they can ship their system with strong encryption since they are not subject to the USA's fascist crypto laws.

The USA crypto laws are no longer fascist. The Bureau of Industry and Security (BIS, formerly the BXA) has lifted most restrictions on exporting free cryptographic software.

--
Will I retire or break 10K?
Re:OpenBSD and NetBSD? by Anonymous Coward · 2002-05-13 02:13 · Score: 0

The laws are still not good enough.

NT's method is the same as the new BSD code by Jamie+Lokier · 2002-05-08 04:18 · Score: 2, Insightful

I know that NT has been able to pin user buffers forever, and that can be used for synchronous I/O. It's not suitable for asynchronous I/O, though: it's no good "hoping" the user doesn't overwrite buffers until the TCP transmitted data has been acknowledged. Applications are often written to assume they can call the equivalent of the unix write() call, and then overwrite their application buffer to prepare for another write.

You said that all kernel components handle buffers in the same way, and thus all network cards handle zero-copy sends. In fact this is not true for all supported network cards: some cards simply don't have the hardware to transmit a packet that is in physically discontiguous memory. And that's what you have when doing zero-copy sends from user buffers. So, either NT's kernel or the vendor-supplied device driver must copy the data into contiguous memory, or onto the card itself (typical for ISA cards). When that happens it isn't zero-copy any more, although it is transparent to the application -- just like the Linux and BSD mechanisms!
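
A user-space model of the fallback described here, as a hedged sketch: when the hardware cannot gather a packet from scattered fragments, something has to copy the fragments into one contiguous "bounce" buffer first. It borrows struct iovec from <sys/uio.h> to describe the fragments; the function name linearize is made up, and a real driver would work on physical pages, not user pointers.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>    /* struct iovec */

/* Copy `nfrags` fragments into one contiguous bounce buffer of `cap` bytes.
 * Returns the total number of bytes copied, or -1 if they do not fit. */
ssize_t linearize(const struct iovec *frags, int nfrags, void *bounce, size_t cap)
{
    size_t used = 0;

    for (int i = 0; i < nfrags; i++) {
        if (frags[i].iov_len > cap - used)
            return -1;                       /* bounce buffer too small */
        memcpy((char *)bounce + used, frags[i].iov_base, frags[i].iov_len);
        used += frags[i].iov_len;
    }
    return (ssize_t)used;
}
```

This memcpy is the copy that a card with scatter-gather DMA lets the system skip.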

have a nice day,
-- Jamie

Re:NT's method is the same as the new BSD code by AdamBa · 2002-05-08 06:13 · Score: 2

I meant asynchronous I/O where the app is expecting it to be asynchronous. With sockets there is some way to wait for a write to complete (which I don't remember offhand, I'm much more familiar with the kernel parts), and using the NT API you can do a wait...I think even with Win32 file APIs you can wait. But I think the default is to have I/O be synchronous, so a dumb app will work. And I disagree that you can't expect an app doing asynchronous I/O not to touch a buffer until the write is done. That's just a broken app. What if it is multithreaded and changes the buffer during the write? What if it zeros it out BEFORE the write?!? Etc etc.
It's true some cards can't do a busmaster send from discontiguous buffers. So they do have to copy to contiguous memory, or to card memory, or do DMA, or whatever...but that's something the card driver has to just deal with. Nobody above the network card driver has to care.
So it sounds like the network card flag you mentioned is just an indicator of the ability to send discontiguous buffers directly? That is, a status indication, not something an app has to worry about....so that's fine, I withdraw my ragging on that.
- adam

Overlapped IO / Win32 by strags · 2002-05-10 03:43 · Score: 2

Can someone tell me if Overlapped IO under Win32 is also zero-copy? It seems like it could be - basically you pass buffers to the OS, and the OS lets you know when it's finished with them. This works for input and output.

The other nice thing about OIO is that you don't have to populate your FDSET every time you do a select - which means if you're writing a server app with thousands of simultaneous connections, it's a whole lot faster.

Is there a Linux/BSD equivalent to this?

Re:Overlapped IO / Win32 by thorpej · 2002-05-10 04:07 · Score: 1

POSIX defines an asynchronous I/O interface, "aio", and it is implemented in several Unix variants. It pretty much has the semantics you describe.
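
A minimal sketch of that interface, assuming a system that provides POSIX <aio.h> (some older C libraries need -lrt at link time). The wrapper name aio_write_and_wait is invented; it queues one write and then waits for it, so it shows the hand-over of the buffer but not any overlap with other work.

```c
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Queue an asynchronous write of `len` bytes at file offset `off`, wait for
 * it to complete, and return the byte count (or -1 with errno set). */
ssize_t aio_write_and_wait(int fd, const void *buf, size_t len, off_t off)
{
    struct aiocb cb;
    const struct aiocb *list[1] = { &cb };

    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = (void *)buf;    /* the buffer belongs to the system until completion */
    cb.aio_nbytes = len;
    cb.aio_offset = off;

    if (aio_write(&cb) != 0)
        return -1;

    /* A real server would do other work here and poll or be notified later. */
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);

    int err = aio_error(&cb);
    ssize_t n = aio_return(&cb); /* collect the result exactly once */
    if (err != 0) {
        errno = err;
        return -1;
    }
    return n;
}
```

The application must leave the buffer alone between aio_write() and completion, which is the same contract as the overlapped I/O described in the question.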

--
-- Jason R. Thorpe, NetBSD and FreSSH developer

good point by AdamBa · 2002-05-15 05:07 · Score: 2

You are right.

I guess in general avoiding copies is a good thing, it frees the CPU to do something else. You just may not always have something else for the CPU to do, but hey why not give it the opportunity.

- adam
