Torvalds Has Harsh Words For FreeBSD Devs
An anonymous reader writes "In a relatively technical discussion about the merits of Copy On Write (COW) versus a very new Linux kernel system call named vmsplice(), Linux creator Linus Torvalds had some harsh words for Mach and FreeBSD developers that utilize COW: 'I claim that Mach people (and apparently FreeBSD) are incompetent idiots. Playing games with VM is bad. memory copies are _also_ bad, but quite frankly, memory copies often have _less_ downside than VM games, and bigger caches will only continue to drive that point home.' The discussion goes on to explain how the new vmsplice() avoids this extra overhead."
Copy on Write saves you real memory, cache memory, and CPU time by pretending that each forked process has a true copy of a memory segment when it in fact is looking at the original. That is, right up until a fork tries to write to that memory location, in which case an exception is handled by making an actual copy to a new location and allowing the write.
No. Updating the page tables twice and having a fault in there is very expensive.
Linus believes that the exception will occur enough in real world usage that it will be slower than just doing the copy in the first place.
And he's right too. But he's not recommending the copy "in the first place" - he's recommending explicit notification that the pages aren't used anymore instead of an implicit notification by-way of a page fault.
Linus wants to push the manual use of zero-copy memory sharing through the vmsplice() routine. He believes that the programmer will always know better than the system when to share memory.
That's correct.
Does the exception generated really cost that much more
Yes. There isn't a grey area on it either- it's basic math: cost of page copy + exception + 2 * (page table update) is greater than cost of page copy + page table update.
The real issue is that the userland knows what it's doing. Eventually it'll want to reuse a buffer. Now does the userland start reusing pages when malloc() fails- thus incuring the exceptions when memory is tight? Or does it reuse them when the kernel says they're reusable?
The latter makes more sense if you're actually concerned about performance. The former may be easier to code, but I doubt many people will actually do that because it's hard to test.
In practice what people do is use a static buffer- that's even EASIER to code, but it means page faults happen ALL the time.
Is it really feasible to expect program developers to do manual memory management in a day in age when programs easily weigh in at hundreds of megs?
They already have to do it. Whether it's the BSD implementation or the new Linux implementation they already have to do it if they want reasonable performance in the real world.
To really take advantage of the BSD implementation, your program needs to monitor malloc() usage, and start attempting to reuse pages when it fails- oldest to newest. This is complicated and hard to test.
To really take advantage of the Linux implementation, your program waits until it gets notification (via select() or poll()) on the vmsplice() recvmsg() operation. Once that occurs, the notification says exactly which pages can be used.
The result? Userland on Linux is easier to write, and easier to test. It'll also be faster.
* Lightbulb goes on
Oohhhh, I see! So something like this is the problem:What you're saying is that every time through the loop, there's going to be a page fault as the CoW pages are wiped away by the new copy into the same logical buffer. CoW is dependent on allocating new pages every time so that you don't ever write to the old CoW pages. Correct?
Of course, this is where I'd really like to hear from the *BSD developers. Surely they must be aware of this issue? Do they expect programmers to throw away their buffers, or do they have a plan?
Javascript + Nintendo DSi = DSiCade
Well, I am an expert in kernel programming, and I can tell you that Linus has little tolerance for anyone who doesn't program the way he does. That's one reason, for example, that he doesn't support debuggers. Every other OS has a kernel debugger built-in (and therefore, generally stable and full-featured), but not Linux. Even the OS/2 kernel debugger that was created 10 years ago is better than anything Linux has.
And the men who hold high places must be the ones who start
To mold a new reality... closer to the heart
What you're saying is that every time through the loop, there's going to be a page fault as the CoW pages are wiped away by the new copy into the same logical buffer. CoW is dependent on allocating new pages every time so that you don't ever write to the old CoW pages. Correct?
//Do some stuff, but don't free the buffer!
:)
Exactly correct. Those frequent CoW operations are slow- the page faults are expensive. If you had instead written:
char *buffer;
int read = 0;
int length;
while(read < totalSize)
{
buffer = malloc(1024);
length = fread(buffer, 1, 1024, &file);
read += length;
}
Then it would operate quickly on FreeBSD. The problem then becomes exactly when do you free all those malloc()s?
On Linux, you can get a signal from the kernel- via a recvmsg() call that will tell you exactly which pages are now available to be freed- or better still, reused.
It'll be easy to check and test correctness AND the programmer has to be aware it's going on in order to use it at all.
Under FreeBSD the programmer can use the syscall, but never get the performance unless they know exactly what's going on.
Of course, this is where I'd really like to hear from the *BSD developers. Surely they must be aware of this issue?
I don't know. The article wasn't about that- I doubt Linus pays attention to what the BSD people know- in fact, I don't even think he knows for certain if FreeBSD even works this way.
The point is that using CoW is stupid for this. It makes things complicated in the hard case, and in the easy case, it makes things slower.
This isn't about fork() it's about zero copy buffers, not code and data pages in general.
Consider a block like this:Now, on the whole, if zero_write() works like write() then an awful lot of copying is going on. But if zero_write() uses the buffer for kernel space as well, it's much faster (1 less copy).
Now the trick is returning to userspace before the buffer is completely used. In FreeBSD a page fault would occur immediately during read().
Both FreeBSD and Linux agree that you shouldn't do this. Instead something like this:The trick at this point, is that elsewhere in your code, Linux can tell you when those malloc() buffers can be reused, whereas FreeBSD doesn't. It relies on the fact that you'll either make a blocking call on fd2 before you free buffer _OR_ you'll accept a page fault.
But if you can be told when it will occur, you don't need to do either of these things, and as a result, you NEVER have to wait. This means your program will be simpler and go faster.
When I need to fork(), I do not have the time to think of all the memory management invovled with fork().
This has NOTHING to do with fork(). You are used to CoW (copy-on-write for anyone else reading along) only applying to fork(), but that is not the issue under discussion at all. You, and probably 95% of the responders here, need to go RTFA.
The issue is implementing zero-copy IO. FreeBSD's way of doing it do a setsockopt() that causes any write() on that socket to mark the buffer CoW so that it can use it exclusively for handing down to the device driver. The "magic" is that if the programmer tries to use that buffer while the device driver owns it he will get a copy. BUT, the programmer has no way of knowing when that buffer is available again.
Linus's point is that marking a page CoW is very expensive - especially in an SMP environment, almost as expensive as just copying that page to begin with would be. He also argues that taking a page-fault to invoke the CoW to a new page, or simply to turn off the CoW attribute, is orders of magnitude more expensive than just copying it in the first place.
So that means the CoW for sockets is only really useful if you rarely or never reuse your buffers again. And the only place that happens is in synthetic benchmarks.
If Linus had said "Microsoft is a bunch of idiots for implementing a feature that only looks good on benchmarks" everybody would be nodding their heads in agreement. I think the reason people are not doing the same here is because they just don't understand the details.
The complaint is not about general copy-on-write, it's about BSD's ZERO_COPY_SOCKET feature vs. vmsplice().
Basic explanation: Suppose that a program is doing a lot of output to a file or socket. The program can generate data faster
than the kernel can consume it, say. So what should the kernel do with the buffer it receives from the user on each write()?
There are three options.
1) Copy its content immediately elsewhere, so that on return to User Mode, the buffer remains writable and writes are safe.
2) Change the access rights of the page containing the buffer, so that no copy need be made unless User Mode attempts
to modify its content before the kernel has completed the write(). If the user attempts to write, it either gets
permission to do so (because the kernel is done) or it gets a writable copy.
3) Let User Mode promise to not modify the buffer's content until told that it's safe to do so, leaving it writable in
the meantime.
The default behavior is (1); BSD's zero copy socket feature is (2), and the point of Torvalds' complaint; vmsplice() is (3).
"Skill shows through where genius wears thin." -Wittgenstein || Religion: uniting aviation and architecture.
There seem to be a LOT of misconceptions about the discussion of vmslice() vs COW vs copy. This has nothing to do with conserving memory and everything to do with high performance I/O. If your app just needs to send a couple small files from A to B, you probably don't care about this at all.
A little background is needed on the terminology and mechanisms of I/O for any of this to make sense. For an example, let's say your app is a very busy web server sending dynamic (but trivial to compute) pages out.
The oldest and simplest method is copy. The app calls write(int sock, char *buffer, int length) on a socket. The kernel coppies the contents of buffer from userspace memory into a kernel space buffer and at least queues the data to the TCP stack before returning.
COW is an attempt to avoid the cost of copying the outgoing data.. In that case, the reference count on the physical pages that make up buffer is bumped up (since now kernel and application are both interested in them), and marks the pages as COW. That is, the virtual memory addresses are set as read only and a flag bit is set (more or less). The latter is done so the kernel needn't worry about them again. By the time the write call returns, the app is able to immediatly write to that memory (sorta) without worry.
When that write happens, the app takes a page fault (writing to a read-only page). The kernel sees that the pages are COW, copies the data to a new physical page, and maps the page in read/write. Then it returns from the fault. OTOH, if the kernel finished with the page first (the data goes on to the wire), it re-marks the page(s) so the app can access them without a copy.
The hope is that often enough, the app WON'T try to write to the pages while they're busy and so the cost of that copy is saved. If that hope comes through often enough it MIGHT be vaguely uesful. I say MIGHT since there is a significant cost just for marking the pages (the CPU's TLB must be flushed for the change to take effect). If the faults happen, it's a BIG loss since handling a fault takes thousands of CPU cycles.
So, for it to have any chance to help, the application programmer must already know enough to TRY to avoid writing to the same buffer again until it gets to the wire. Unfortunatly, it can never be sure so most apps don't bother.
The vmsplice() proposal is fairly simple. In this case, the app explicitly requests special treatment of the write. The pages are NOT marked as read only at all. Instead, the app is on it's honor to leave them alone until the kernel notifies it that they are again available. This saves the copy and the costs of TLB flush AND the (potential) cost of page faults. If the app breaks it's promise, it is the only one to suffer as the data it sent is corrupted (no kernel housekeeping is ever stored in such pages so there are no security implications). Any damage the app might do by sending screwy data could also be done using the old copy method.
What it all comes down to is that playing tricks with page mapping LOOKS nice at first glance since it SEEMS reasonable that not copying bytes around will save CPU cycles and memory bandwidth. The re-mapping (or just permission changes) on pages SEEMS lightweight. Unfortunatly, in fact, re-mapping or changing permission forces cache invalidations and page faults are just plain expensive. With the direction CPU design is going, these things will likely get more expensive rather than less (as they have for most of the history of microprocessor design).
It's really not that complex for an application to use. At least in comparison to the complexities and level of knowledge required to write an app that performs well enough to need this in the first place.
> what the? L4Linux has to run on top another REAL kernel, usually Linux.
You're quite mistaken. L4Linux runs Linux in usermode on top of the L4 kernel.
http://os.inf.tu-dresden.de/L4/LinuxOnL4/
Done with slashdot, done with nerds, getting a life.