Interesting. So would the use of alloca() make a difference?
No. alloca() allocates memory from the stack. The above is still hiding some details- malloc() isn't really appropriate- but some device that allocates an actual page-sized chunk is what is required.
I know this is almost pseudocode, but this is a memory leak. Especially if you don't ever free the buffer.
That's the point. You have to keep track of those pages as allocated and free them later- after the operating system has had an opportunity to send the data that's in them.
In theory the OS could allow the write and check for overlapping calls (and avoid the COW fault), but note that the read() example really isn't interesting for zero-copy unless you're using hardware TCP offloading. Zero copy is more interesting for write(). The usual case is then:
TCP ``offloading'' is exactly what's going on here. The above was a demonstration of part of a sendfile loop. The idea is that it would really be triggered by a select() or poll() notification.
To avoid COW faults you have to be really careful that you don't accidentally write to the same page as the buffer--even indirectly by malloc updating its inline data structures. That's pretty nasty to do--the easiest way is to allocate 8K at a time, and use a page-aligned chunk from the middle of it. Talk about a waste of memory.
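A minimal sketch of the two allocation tricks mentioned here, assuming a POSIX system. The helper names alloc_page() and align_up() are made up for illustration; posix_memalign() and sysconf() are the standard calls. Where the allocator keeps its own bookkeeping is implementation-defined, so treat this as an illustration rather than a guarantee that nothing else touches the page.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate exactly one page, aligned on a page boundary.
 * Returns NULL on failure. Release with free(). */
static void *alloc_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p = NULL;

    if (page <= 0 || posix_memalign(&p, (size_t)page, (size_t)page) != 0)
        return NULL;
    return p;
}

/* The "allocate 8K, use the aligned chunk in the middle" trick:
 * round an address up to the next page boundary.
 * page must be a power of two. */
static void *align_up(void *raw, size_t page)
{
    uintptr_t addr = (uintptr_t)raw;

    assert(page != 0 && (page & (page - 1)) == 0);
    return (void *)((addr + page - 1) & ~((uintptr_t)page - 1));
}
```

With align_up() you would malloc() two pages and hand only the aligned page to the kernel, which is exactly the waste of memory being complained about.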
That's the point. How does one avoid COW faults on FreeBSD? You can't tell whether or not the kernel is done "offloading" your data yet.
The manual for FreeBSD's zero_copy mechanism suggests using a buffer 2x the size of the TCP buffer size. It's wrong, it'll still fault, and performance will still suffer- although not as much as a fault every load of the buffer.
It's been a while since I've used straight C, I'll admit. However, why not do this via a ref count like in COM or .NET (I know, wrong crowd), and I would assume Java? It seems to me that the compiler could implement the code to free memory that was malloc'd and then forgotten, gone out of scope, or left with no references.
Except that the compiler doesn't have access to this information. Userspace doesn't have access to it at all- that's the problem with FreeBSD that's being pointed out.
Linux DOES have such a mechanism so it WOULD be possible to maintain user-space reference counts.
If it just uses a buffer bigger than the 2x the TCP window, 99% of the time no copy is necessary, without a big mess of code complexity.
This simply isn't true. This zero-copy chunk is designed to run as part of a big select-loop. Allowing one of those fd's in the select loop to trigger the frees() means the code complexity is down- one doesn't need a ring buffer or anything, they simply malloc as needed, and when notifications come in, free the pointed-to chunk.
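Roughly what "malloc as needed, free on notification" could look like in userspace. This is only a sketch: the names buf_acquire(), buf_release() and buf_pending() are hypothetical, and how the notification actually arrives from the kernel is left out entirely.

```c
#include <stdlib.h>

/* One node per buffer that has been handed to the kernel. */
struct inflight {
    void *buf;
    struct inflight *next;
};

static struct inflight *pending; /* buffers the kernel may still be using */

/* Allocate a buffer and record it as in flight. Returns NULL on failure. */
static void *buf_acquire(size_t size)
{
    struct inflight *n = malloc(sizeof *n);

    if (n == NULL)
        return NULL;
    n->buf = malloc(size);
    if (n->buf == NULL) {
        free(n);
        return NULL;
    }
    n->next = pending;
    pending = n;
    return n->buf;
}

/* Call when a notification names buf as done.
 * Returns 1 if the buffer was tracked and freed, 0 if it was unknown. */
static int buf_release(void *buf)
{
    for (struct inflight **pp = &pending; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->buf == buf) {
            struct inflight *n = *pp;

            *pp = n->next;
            free(n->buf);
            free(n);
            return 1;
        }
    }
    return 0;
}

/* Number of buffers still waiting for a notification. */
static size_t buf_pending(void)
{
    size_t count = 0;

    for (struct inflight *n = pending; n != NULL; n = n->next)
        count++;
    return count;
}
```

No ring buffer, no sizing guesses: the list grows and shrinks with whatever the kernel is actually holding.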
Anyone who wants to write a naive application is free to not enable zero-copy at all, and simply use write(). The normal overhead of copying buffers isn't terribly bad on modern hardware -- only extremely high-performance apps need zero-copy in the first place.
Nobody said naive application. I said naive implementation. A 2x buffer is a naive implementation- it doesn't catch the important cases and severely limits the performance of these extremely high-performance apps that this infrastructure addresses.
But that's not true in general. 99% of all fork() calls are followed by exec() and the entire space gets dumped. That's why COW is a huge win in the average case. The case of an application using fork() followed by actually doing something useful is exceptionally rare outside of the server space. In fact, Apache is about the only program I can think of that ever does this.
This isn't about fork(). It's about zero-copy buffers, not code and data pages in general.
Consider a block like this:
    char buffer[4096];
    for (i = 0; i < len;) {
        r = read(fd, buffer, 4096);
        if (r <= 0) break;           /* EOF or error */
        zero_write(fd2, buffer, r);  /* same buffer reused every pass */
        i += r;
    }
Now, on the whole, if zero_write() works like write() then an awful lot of copying is going on. But if zero_write() uses the buffer for kernel space as well, it's much faster (1 less copy).
Now the trick is returning to userspace before the buffer is completely used. In FreeBSD a page fault would occur immediately during read().
Both FreeBSD and Linux agree that you shouldn't do this. Instead something like this:
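    char *buffer;
    for (i = 0; i < len;) {
        buffer = malloc(4096);       /* fresh buffer every pass */
        r = read(fd, buffer, 4096);
        if (r <= 0) break;           /* EOF or error */
        zero_write(fd2, buffer, r);  /* not freed here on purpose */
        i += r;
    }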
The trick at this point, is that elsewhere in your code, Linux can tell you when those malloc() buffers can be reused, whereas FreeBSD doesn't. It relies on the fact that you'll either make a blocking call on fd2 before you free buffer _OR_ you'll accept a page fault.
But if you can be told when it will occur, you don't need to do either of these things, and as a result, you NEVER have to wait. This means your program will be simpler and go faster.
In *BSD, libc is considered part of the OS. There are a lot of interfaces that are used between libc and the kernel which aren't meant for general consumption (the threading system calls for instance).
So what?
The problem remains that if the kernel doesn't provide a mechanism for solving the problem, then the problem cannot be solved until it does.
free() isn't in a position to solve the problem because it's in libc. If it were actually a system call, then free() would be slower, BUT it would have the ability to solve this problem.
Problem is that unless you're talking about declaring the pages "free" by storing more data in the heap info structure, declaring the pages free would require trapping into the kernel, and that is every bit as slow as the exception on most architectures, only now you're doing it more often, since you're doing it every time a page changes from free to not free.
No. System calls are not as slow as exceptions.
If they are on your architecture, you're not supporting a million clients at a time on that architecture. It's unreasonable.
And besides, the kernel can coalesce multiple free-returns, thus reducing the number of messages. After all, the pages are free once an interrupt occurs, and anything ready to go out can probably go at that point.
Even if you do this by just adding info in the heap structure, it isn't clear that the performance hit of doing so will be worth it in the average case, since most fork() calls are followed by exec() and thus zero copies actually occur, so you're optimizing for the 1% case and causing a performance hit throughout the entire execution of the 99% case.
You're confused. We're not talking about fork() and exec() but about COW on buffers.
CoW on fork() and exec() is smart. An exception between the two is rare and limited to one page. When using vfork() the exception NEVER occurs.
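For reference, the fork()-then-exec() pattern being called the good case looks something like this. A sketch only; spawn_and_wait() is a name I made up. The child touches almost nothing between fork() and execvp(), so very few pages ever get copied.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] with the given arguments and wait for it.
 * Returns the child's exit status, or -1 on failure or abnormal exit. */
static int spawn_and_wait(char *const argv[])
{
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: replace the address space right away, so the
         * CoW mappings inherited from the parent are discarded. */
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec failed */
    }
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```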
CoW on kernel fifo buffers or TCP socket buffers is stupid. An exception occurs at the top of each loop- it would've been faster to copy the single page each run, instead of generating a page fault on the same page over and over and over again.
So it seems that the FreeBSD developers realized this long ago and perhaps aren't as moronic as Linus thinks.
Except that there's no API call that detects whether or not the buffer can be reused. Only by causing a blocking call can you make sure that before the thread can run again, that the buffer is no longer used by the kernel.
This is stupid. Applications that need to handle hundreds of thousands of clients simply don't block.
If there WERE an API call that can detect the page is no longer in use, then why bother COW-ing the kernel buffer? Updating the page tables at that point is unnecessary and as a result, wasted code.
But being as how there isn't such a call, why CoW the kernel buffer anyway? Why not simply require the blocking call or luck before reusing the buffer? Updating those page tables is slow, and doesn't buy very much- except that it makes the naive approach get written in the first place.
What you're saying is that every time through the loop, there's going to be a page fault as the CoW pages are wiped away by the new copy into the same logical buffer. CoW is dependent on allocating new pages every time so that you don't ever write to the old CoW pages. Correct?
Exactly correct. Those frequent CoW operations are slow- the page faults are expensive. If you had instead written:
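    char *buffer;
    size_t total = 0;   /* renamed from "read" so it doesn't shadow read() */
    size_t length;
    while (total < totalSize)
    {
        buffer = malloc(1024);
        length = fread(buffer, 1, 1024, file);  /* file is a FILE * */
        if (length == 0) break;                 /* EOF or error */
        total += length;  /* Do some stuff, but don't free the buffer! */
    }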
Then it would operate quickly on FreeBSD. The problem then becomes exactly when do you free all those malloc()s?
On Linux, you can get a signal from the kernel- via a recvmsg() call that will tell you exactly which pages are now available to be freed- or better still, reused.
It'll be easy to check and test correctness AND the programmer has to be aware it's going on in order to use it at all.
Under FreeBSD the programmer can use the syscall, but never get the performance unless they know exactly what's going on.
Of course, this is where I'd really like to hear from the *BSD developers. Surely they must be aware of this issue?
I don't know. The article wasn't about that- I doubt Linus pays attention to what the BSD people know- in fact, I don't even think he knows for certain if FreeBSD even works this way. :)
The point is that using CoW is stupid for this. It makes things complicated in the hard case, and in the easy case, it makes things slower.
When I need to fork(), I do not have the time to think of all the memory management involved with fork(). I just want it to be done reliably, and I want it to be done fast.
So what? Who's talking about fork()?
This is about copy-on-write of zero-copy fifos and TCP. If you don't know what the rest of us are talking about, please just say so, and we'll be happy to tell you exactly what's going on.
Maybe you'll have something to contribute at that point, or maybe you'll just learn something.
And if FreeBSD is not an option, then I am not going to do the optimization.
I want the kernel to run my code as fast as possible by default.
Sounds good. Use read() and write() because those operate predictably and faster than the zero-copy method on FreeBSD.
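The predictable baseline is just the ordinary copy loop. A sketch, with copy_fd() as an illustrative name; it copies until end-of-file and copes with short writes.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy everything from in to out with plain read()/write().
 * Returns the number of bytes copied, or -1 on error. */
static ssize_t copy_fd(int in, int out)
{
    char buf[4096];
    ssize_t total = 0;

    for (;;) {
        ssize_t r = read(in, buf, sizeof buf);

        if (r == 0)
            return total; /* end of file */
        if (r < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        /* write() may accept fewer bytes than asked; keep going. */
        for (ssize_t off = 0; off < r;) {
            ssize_t w = write(out, buf + off, (size_t)(r - off));

            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += r;
    }
}
```

One static buffer, one copy in and one copy out per pass, and no page table games at all.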
If scalability is important to you, investigate zero-copy methods. They aren't free- on FreeBSD you either need to wait for a competent API or use a very complicated allocator. On Linux, you already have a competent API.
In practice I think the FreeBSD approach probably does have speed advantages in most cases, and the fact that it's transparent to the userspace developer would seemingly be a big advantage.
No, it has a speed advantage over read()/write() provided you are aware of exactly how it works. The fact that it's transparent to the userspace is a bad thing because it means you have code written a certain way- that nobody will ever understand why.
Reusing the pages causes the speed benefit to go away- and in fact it'll be slower than read()/write().
This sort of thing matters almost exclusively to people doing really deep performance tuning, and for them it's better to present a simple API with large rewards for tuning, instead of transparently doing something weird to an existing API that will break in the field without you noticing and requires really weird usage to get the best performance.
I agree completely. Unfortunately, the FreeBSD API is inadequate. It's not faster in practice unless you do something really really weird (waste memory). The big difference is the Linux implementation gives explicit notification and the FreeBSD API doesn't.
FreeBSD doesn't provide an API to ask if the pages are still in use. That'd probably make their approach usable- but then why bother updating the page tables at all?
Once you're there, why bother statpage() to check to see if the page is in use? Why not have the kernel send the pages that are available via a file descriptor so you can poll() or select() on it?
At this point, you're at the Linux implementation.
I'm sorry to interrupt here, Your Holiness, but instead of being snarky and flaming the BSD kid, you could've been somewhat helpful and provided an idea as to *why* that might be the case (e.g. swappiness, etc).
Not happening. He didn't ask how to make Linux operate as he expects, better or worse, he said Linux has had a persistent problem that FreeBSD doesn't.
I said, no way, posted someone else's report on the subject, and pointed out something in the conclusion section.
If he wants help making reliable servers, he can ask, and I'll probably help. But that's about the end of it.
Basically, the problem is caused by the fact that usermode code never releases any of the memory it has allocated.
Oh no. That's the solution actually. :)
The problem is in using a static buffer instead of allocating a buffer for each send operation. If you use a static buffer, you ALWAYS cause a fault. If you malloc() each time, you won't fault- at least until you reuse the pages later (when malloc() fails).
However, given that the "free()" routine is part of the OS in FreeBSD
No, it's not unfortunately. It's a library call that mucks up [s]brk() or munmap().
Free _could_ be smart enough to avoid actually freeing the pages until notification occurred, but userspace would still need explicit notification (or just to wait for a while).
The real issue is explicit notification versus page fault. The page fault is undesirable because it wastes time, memory, and cache. The page fault can be avoided by never reusing memory like I proposed above.
OR the userspace can simply wait for notification that the pages are done. A signal could be used, but vmsplice() actually causes a fd to wake up that can receive the notification via the recvmsg() system call.
There you're assuming that the page copy will be necessary. In cases where the W in COW does NOT occur, isn't COW much better?
Well, no, it's about the same actually.
The problem is that in the naive implementation, the page copy is always necessary. A complicated implementation (in userspace) to take advantage of COW is more complicated than with explicit notification.
but what I do know is that when you start using up a lot of memory Linux totally sucks.
Correction: when _you_ start using up a lot of memory Linux totally sucks. When I start using up a lot of memory, Linux acts exactly as I expect, and better than FreeBSD.
I think the problem with this approach is that COW will only give you a copy of the particular piece of the memory that you accessed. That means that the system has to keep huge tables of what is shared and what is not and every time you make a call to request ANY memory it's going to need to check the table. This action is going to result in an overall performance degradation since the application has to check the table for every write over the long-haul, rather than just duplicate the memory and go.
It does all those things anyway. The problem is that faults are expensive, and because they do happen in real life, copying the memory IS faster in real life.
It is possible to exploit the mechanism FreeBSD uses to gain performance- Simply never touch a page after it's been sent out. Or rather, wait as long as possible- say until malloc() fails.
This would work, but it'd be hard to test and hard to get right.
What Linus suggests is explicit notification- say a select() or poll() operation that says "these pages are now free". This works out well, and is indeed faster because there aren't any copies or page faults. It's also easier to develop.
Of course, using COW for TCP buffers is stupid. That's why people don't use them on FreeBSD (at least, not once they've seen the profiler results)- it's never faster. They always use a static buffer and ALWAYS get the page fault when the system is under any load.
One thing that concerns me about making all of these copies is that it seems like a quick and easy way to blow out your L2 cache. That could in the long run have a worse performance penalty than having to play the VM tricks with CoW.
No it won't. The only way to avoid the copies is to avoid the pagefaults. Since userland doesn't get explicit notification in FreeBSD of when the pages are safe to use, the process should wait as long as possible (e.g. until malloc() starts failing)
The idea Linus pushes here is explicit notification- via select() or poll() and returnable via recvmsg(). That way the userland knows exactly which pages can be reused.
The result is that it's faster and easier to develop userland programs to take advantage of it. It's also easier to degrade gracefully into read()/write() until the FreeBSD people see the light and add support for this too.
Copy on Write saves you real memory, cache memory, and CPU time by pretending that each forked process has a true copy of a memory segment when it in fact is looking at the original. That is, right up until a fork tries to write to that memory location, in which case an exception is handled by making an actual copy to a new location and allowing the write.
No. Updating the page tables twice and having a fault in there is very expensive.
Linus believes that the exception will occur enough in real world usage that it will be slower than just doing the copy in the first place.
And he's right too. But he's not recommending the copy "in the first place" - he's recommending explicit notification that the pages aren't used anymore instead of an implicit notification by-way of a page fault.
Linus wants to push the manual use of zero-copy memory sharing through the vmsplice() routine. He believes that the programmer will always know better than the system when to share memory.
That's correct.
Does the exception generated really cost that much more
Yes. There isn't a grey area on it either- it's basic math: cost of page copy + exception + 2 * (page table update) is greater than cost of page copy + page table update.
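Written out, with C_copy for the page copy, C_fault for the exception and C_pt for one page table update (symbols introduced here just for the arithmetic):

```latex
\[
  C_{\mathrm{copy}} + C_{\mathrm{fault}} + 2\,C_{\mathrm{pt}}
  \;>\;
  C_{\mathrm{copy}} + C_{\mathrm{pt}}
  \quad\Longleftrightarrow\quad
  C_{\mathrm{fault}} + C_{\mathrm{pt}} > 0
\]
```

The right-hand side holds whenever a fault and a page table update cost anything at all, so once the write actually happens the CoW path always loses.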
The real issue is that the userland knows what it's doing. Eventually it'll want to reuse a buffer. Now does the userland start reusing pages when malloc() fails- thus incurring the exceptions when memory is tight? Or does it reuse them when the kernel says they're reusable?
The latter makes more sense if you're actually concerned about performance. The former may be easier to code, but I doubt many people will actually do that because it's hard to test.
In practice what people do is use a static buffer- that's even EASIER to code, but it means page faults happen ALL the time.
Is it really feasible to expect program developers to do manual memory management in a day and age when programs easily weigh in at hundreds of megs?
They already have to do it. Whether it's the BSD implementation or the new Linux implementation they already have to do it if they want reasonable performance in the real world.
To really take advantage of the BSD implementation, your program needs to monitor malloc() usage, and start attempting to reuse pages when it fails- oldest to newest. This is complicated and hard to test.
To really take advantage of the Linux implementation, your program waits until it gets notification (via select() or poll()) on the vmsplice() recvmsg() operation. Once that occurs, the notification says exactly which pages can be used.
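The userland side of "wait for notification, then reuse" might be sketched like this. Loud assumptions: the wire format (one pointer-sized value per released buffer) and the name wait_for_release() are hypothetical, and any readable descriptor, such as a pipe, stands in for whatever descriptor the kernel would provide. Only poll() and read() are real calls here; none of this is an actual kernel interface.

```c
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/* Wait up to timeout_ms for one released-buffer pointer on notify_fd.
 * Returns 1 and stores the pointer in *released, 0 on timeout,
 * -1 on error. The message format is an assumption (see above). */
static int wait_for_release(int notify_fd, void **released, int timeout_ms)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);

    if (ready <= 0)
        return ready; /* 0 = timeout, -1 = poll() failed */
    if (read(notify_fd, released, sizeof *released)
            != (ssize_t)sizeof *released)
        return -1;
    return 1;
}
```

In a real select()/poll() loop this descriptor would simply be one more entry in the set, and each pointer that comes back goes straight onto the free list.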
The result? Userland on Linux is easier to write, and easier to test. It'll also be faster.
Any game that is heavily reliant on the controller to make a majority of the mechanics function is doomed from the start
You mean like the NES (origin of the D-Pad)? Or maybe you meant the DS (touching outsells Playstation and PSP)?
Standardized interfaces exist because it allows developers to map the controls they want to a widely accepted standard.
Except Nintendo invented those standards. Each one introduced new games that were more fun than what was out there. Having a dozen first person shooters to pick from doesn't mean you have a dozen fun games, and I think Nintendo knows this.
I think every gamer should give Nintendo the benefit of the doubt on this one- unless you bought a ROB or Virtual Boy :)
Games should allow custom controller maps, not require custom controllers.
I think the Jaguar, with its overlay cards, pretty much proves you wrong.
you're going to add $20-30 to the cost of the game for the attachment alone.
I think that's exactly why Nintendo opted for spending less on the rest. People can make excellent games when they're not vying for the most polygons and highest levels of technical accuracy. Nintendo is doing the smart thing- companies that develop on the Revolution will have to compete on fun-ness instead of pixels. They're going to want to because they're suddenly going to have all these "great ideas" on new gimmicky input devices that they can manufacture cheaply.
the words "Glorified light gun" keep springing to mind....
Actually, I just bought a bargain-bin eyetoy thinking the same thing. You know what? At least half the games on the disc are fun. I mean, really fun. I mean, I'd like more games but the eyetoy is too uncommon and too glorified-light-gun looking for anyone to develop for.
I think this lends even better to the idea that by making the glorified light gun standard, we might actually see it get used. And if it gets used, it's probably going to be fun.
Never forget that was invented by Nintendo as well. Before the D-Pad, people used joysticks, and had no idea how much they hated them until the D-Pad.
In fact, being as how Nintendo was right about the D-Pad, right about the modern ``analog'' stick, right about the touching, why does it seem so difficult for people to believe that they might also be right about the freestyle wand?
Nintendo has demonstrated again and again that they invent excellent general purpose input devices, and again and again that new and exciting games take advantage of them.
Nobody cares about system files that can be replaced within hours. The important stuff generally does not require write access to do it.
You sound like you've lost your system files often enough to know this first-hand.
I on the other hand don't have "hours" to throw away every few days like you do.
Well done, you have bent over backwards to lower privileges. Most users won't, and so, this point doesn't really prove anything.
Prove what?
Did you think I was attempting to prove something to you?
Tell me, do you honestly think you understood a word that I wrote- besides "most users won't do that"?
Were you dropped as a child?
Not many people know about CoreForce? No, well, not many people know how to do what you have done either.
I'm sorry, lots of software follows this model. Qmail is a shining example of using privilege separation to avoid risk.
The only place it doesn't seem common is in Sendmail, ISC and Microsoft software.
Often in security discussions I see lots of uninformed speculation as to what "regular users" do. Suffice it to say that "regular users" do install software in large enough numbers that simply ignoring the issue is not enough.
I don't think you see any security discussions. Regular users always means unprivileged user, and NEVER does it mean "real good at home folk".
Basically you've put together a badly hacked up version of what toolkits like SELinux, AppArmor or CoreForce give you in a much cleaner and more elegant way, which is commendable but not a route I'd recommend nor would I expect others to follow it.
Maybe this is the problem. Pompous assholes like yourself that tell users that it's okay to get virus-infected or lose all your data every few days- because that's normal and it doesn't fucking matter.
I hope you shovel fast food for a living, because you'd be worthless in security.
And don't get me started on trusted GUI paths. No consumer OS today gets this right - none. Just go read a usability study of trusted path systems to see what fun we're going to have integrating this into mainstream technology.
I wasn't planning on it. You've already demonstrated yourself a moron and an asshole that likes hearing themselves talk.
Come back when you actually have something to say troll.
It's one thing for you to write a wrapper that chmods a file, runs vi, then chmods it back. But it's another to tell your mother that she can't edit files in her word processing program without giving her word processor write access, or telling her that she can't edit files she's opened with the File menu.
You don't know what you're talking about.
I didn't say chmod(), you assumed it. It actually sets group+write using a combination of chown() and chmod(). vi runs as another user using a setuid/setgid wrapper.
Anyway, all it takes is one buffer overflow bug in a standard library (say, gzip or a JPEG decoder) to let malware start up (i.e. spawn a mail or DDoS bot) which will be running until you log out, restart, or kill the process. Then all it takes is one local privilege escalation bug to let the malware install a rootkit, propagate like a virus, or just delete all the files in your home directory.
You have no idea what you're talking about.
_I_ don't have the ability to delete all the files in my home directory, so why would some program that is running with LESS privilege than me?
Furthermore, it's a sign of brain-damage that you think gzip or jpeg decoding is so common that every program should want to do it-such that it be considered a "standard" library.
And unfortunately, setuid is possibly the biggest security hole that Unix has ever had. Every setuid program is a privilege escalation attack waiting to happen. If you can't control a pre-existing setuid program, you can always just set some bits on a filesystem to create one.
You have no idea what you're talking about.
You're confusing setuid-root with setuid. The idea is that we setuid to an unprivileged user- one that has very little powers. Privileges never go up, only down.
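The "privileges only go down" step in such a wrapper could look roughly like this. A sketch under assumptions: drop_privileges() is a made-up name, the target uid and gid are whatever unprivileged account you set up, and a wrapper that starts as root would also want to clear supplementary groups, which is left out here.

```c
#include <sys/types.h>
#include <unistd.h>

/* Permanently switch to the given unprivileged uid/gid.
 * Returns 0 on success, -1 if any step fails. */
static int drop_privileges(uid_t uid, gid_t gid)
{
    /* Group first: once the uid is dropped we may no longer be
     * allowed to change the gid. */
    if (setgid(gid) != 0)
        return -1;
    if (setuid(uid) != 0)
        return -1;
    /* Paranoia: if we dropped to a non-root user, getting root
     * back must fail. If it succeeds, the drop did not stick. */
    if (uid != 0 && setuid(0) == 0)
        return -1;
    return 0;
}
```

Always check the return values. A wrapper that ignores a failed setuid() carries on with the privileges it started with.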
BTW, if you think that iptables will prevent the spambot from communicating over the internet, all it has to do is debug a process (e.g. firefox) that has access rights and inject its payload there.
You have no idea what you're talking about.
Debugging another user's processes on UNIX is a privileged operation- it isn't allowed unless you're root.
It turns out that Windows has excellent security facilities, just very few programs use them, and many fail outright if they're used.
How can it possibly be useful if it's not usable?
Ultimately, the only thing you can really do to keep yourself safe is simply to exercise caution with what you download. I don't download suspicious programs or visit suspicious web sites, so I am fairly secure.
Security isn't a state, it's a process. You do some parts of the process, but you fail to understand that I can and do download suspicious programs and run them in my natural sandbox without any risk.
You cannot create those kinds of sandboxes anywhere near as easily under Windows.
Of course, Windows allows you to easily run with lowered privileges. In Vista,
No, you're saying in another 2-3 years, that Windows will finally have a feature that's been available in every version of UNIX for almost 30 years.
We'll take a look in 2-3 years to see if Windows can finally compete with UNIX- but as Windows Vista is presently only promising a 40% application compatibility rate, I'd say probably not.
Except vmsplice() works on a fd, and one can do a select() or poll() on it.
This isn't about fork() it's about zero copy buffers, not code and data pages in general.
Consider a block like this:Now, on the whole, if zero_write() works like write() then an awful lot of copying is going on. But if zero_write() uses the buffer for kernel space as well, it's much faster (1 less copy).
Now the trick is returning to userspace before the buffer is completely used. In FreeBSD a page fault would occur immediately during read().
Both FreeBSD and Linux agree that you shouldn't do this. Instead something like this:The trick at this point, is that elsewhere in your code, Linux can tell you when those malloc() buffers can be reused, whereas FreeBSD doesn't. It relies on the fact that you'll either make a blocking call on fd2 before you free buffer _OR_ you'll accept a page fault.
But if you can be told when it will occur, you don't need to do either of these things, and as a result, you NEVER have to wait. This means your program will be simpler and go faster.
In *BSD, libc is considered part of the OS. There are a lot of interfaces that are used between libc and the kernel which aren't meant for general consumption (the threading system calls for instance).
So what?
The problem remains that if the kernel doesn't provide a mechanism for solving the problem, then the problem cannot be solved until it does.
free() isn't in a position to solve the problem because it's in libc. If it were actually a system call, then free() would be slower, BUT it would have the ability to solve this problem.
Problem is that unless you're talking about declaring the pages "free" by storing more data in the heap info structure, declaring the pages free would require trapping into the kernel, and that is every bit as slow as the exception on most architectures, only now you're doing it more often, since you're doing it every time a page changes from free to not free.
No. System calls are not as slow as exceptions.
If they are on your architecture, you're not supporting a million clients at a time on that architecture. It's unreasonable.
And besides, the kernel can coalesce multiple free-returns, thus reducing the number of messages. After all, the pages are free once an interrupt occurs, and anything ready to go out can probably go at that point.
Even if you do this by just adding info in the heap structure, it isn't clear that the performance hit of doing so will be worth it in the average case, since most fork() calls are followed by exec() and thus zero copies actually occur, so you're optimizing for the 1% case and causing a performance hit throughout the entire execution of the 99% case.
You're confused. We're not talking about fork() and exec() but about COW on buffers.
CoW on fork() and exec() is smart. An exception between the two is rare and limited to one page. When using vfork() the exception NEVER occurs.
CoW on kernel fifo buffers or TCP socket buffers is stupid. An exception occurs at the top of each loop- it would've been faster to copy the single page each run, instead of generate a page fault to the same page over and over and over again.
So it seems that the FreeBSD developers realized this long ago and perhaps aren't as moronic as Linus thinks.
Except that there's no API call that detects whether or not the buffer can be reused. Only by making a blocking call can you be sure that the kernel is done with the buffer before the thread runs again.
This is stupid. Applications that need to handle hundreds of thousands of clients simply don't block.
If there WERE an API call that can detect the page is no longer in use, then why bother COW-ing the kernel buffer? Updating the page tables at that point is unnecessary and as a result, wasted code.
But being as how there isn't such a call, why CoW the kernel buffer anyway? Why not simply require the blocking call or luck before reusing the buffer?
Updating those page tables is slow, and doesn't buy very much- except that it makes the naive approach get written in the first place.
What you're saying is that every time through the loop, there's going to be a page fault as the CoW pages are wiped away by the new copy into the same logical buffer. CoW is dependent on allocating new pages every time so that you don't ever write to the old CoW pages. Correct?
//Do some stuff, but don't free the buffer!
:)
Exactly correct. Those frequent CoW operations are slow- the page faults are expensive. If you had instead written:
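char *buffer;
size_t nread = 0;   /* renamed so it doesn't shadow read() */
size_t length;
while (nread < totalSize)
{
    buffer = malloc(1024);   /* fresh buffer every pass; deliberately not freed here */
    if (buffer == NULL)
        break;
    length = fread(buffer, 1, 1024, file);   /* assumes file is a FILE *, so no & */
    if (length == 0)
        break;   /* EOF or error */
    nread += length;
}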
Then it would operate quickly on FreeBSD. The problem then becomes exactly when do you free all those malloc()s?
On Linux, you can get a signal from the kernel- via a recvmsg() call that will tell you exactly which pages are now available to be freed- or better still, reused.
It'll be easy to check and test correctness AND the programmer has to be aware it's going on in order to use it at all.
Under FreeBSD the programmer can use the syscall, but never get the performance unless they know exactly what's going on.
Of course, this is where I'd really like to hear from the *BSD developers. Surely they must be aware of this issue?
I don't know. The article wasn't about that- I doubt Linus pays attention to what the BSD people know- in fact, I don't even think he knows for certain if FreeBSD even works this way.
The point is that using CoW is stupid for this. It makes things complicated in the hard case, and in the easy case, it makes things slower.
When I need to fork(), I do not have the time to think of all the memory management involved with fork(). I just want it to be done reliably, and I want it to be done fast.
So what? Who's talking about fork()?
This is about copy-on-write of zero-copy fifos and TCP. If you don't know what the rest of us are talking about, please just say so, and we'll be happy to tell you exactly what's going on.
Maybe you'll have something to contribute at that point, or maybe you'll just learn something.
And if FreeBSD is not an option, then I am not going to do the optimization.
I want the kernel to run my code as fast as possible by default.
Sounds good. Use read() and write() because those operate predictably and faster than the zero-copy method on FreeBSD.
If scalability is important to you, investigate zero-copy methods. They aren't free- on FreeBSD you either need to wait for a competent API or use a very complicated allocator. On Linux, you already have a competent API.
In practice I think the FreeBSD approach probably does have speed advantages in most cases, and the fact that it's transparent to the userspace developer would seemingly be a big advantage.
No, it has a speed advantage over read()/write() provided you are aware of exactly how it works. The fact that it's transparent to userspace is a bad thing because it means you end up with code written a certain way, and nobody will ever understand why.
Reusing the pages causes the speed benefit to go away- and in fact it'll be slower than read()/write().
This sort of thing matters almost exclusively to people doing really deep performance tuning, and for them it's better to present a simple API with large rewards for tuning, instead of transparently doing something weird to an existing API that will break in the field without you noticing and requires really weird usage to get the best performance.
I agree completely. Unfortunately, the FreeBSD API is inadequate. It's not faster in practice unless you do something really really weird (waste memory). The big difference is the Linux implementation gives explicit notification and the FreeBSD API doesn't.
FreeBSD doesn't provide an API to ask if the pages are still in use. That'd probably make their approach usable- but at that point, why bother updating the page tables at all?
Once you're there, why bother statpage() to check to see if the page is in use? Why not have the kernel send the pages that are available via a file descriptor so you can poll() or select() on it?
At this point, you're at the Linux implementation.
That's it. That's why it's better.
What? you -expect- it to suck...
Read the URL I posted troll.
Linux beats FreeBSD in every benchmark of scalability thrown at it (in that report).
I'm sorry to interrupt here, Your Holiness, but instead of being snarky and flaming the BSD kid, you could've been somewhat helpful and provided an idea as to *why* that might be the case (e.g. swappiness, etc).
Not happening. He didn't ask how to make Linux operate as he expects, better or worse, he said Linux has had a persistent problem that FreeBSD doesn't.
I said no way, posted someone else's report on the subject, and pointed out something in the conclusion section.
If he wants help making reliable servers, he can ask, and I'll probably help. But that's about the end of it.
Basically, the problem is caused by the fact that usermode code never releases any of the memory it has allocated.
:)
Oh no. That's the solution actually.
The problem is in using a static buffer instead of allocating a buffer for each send operation. If you use a static buffer, you ALWAYS cause a fault. If you malloc() each time, you won't fault- at least until you reuse the pages later (when malloc() fails).
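A minimal sketch of "a buffer for each send operation", assuming each buffer should be exactly one page and page-aligned so that nothing else (malloc()'s own bookkeeping, say) ends up sharing the page. It uses mmap() rather than malloc() for that reason. alloc_send_page() and free_send_page() are made-up names, and nothing here is the zero-copy API of either kernel:

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one private, page-aligned page for a single send.
 * On success returns the page and stores its size in *len; NULL on failure. */
void *alloc_send_page(size_t *len)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p;

    assert(len != NULL);
    if (page <= 0)
        return NULL;

    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    *len = (size_t)page;
    return p;
}

/* Hypothetical helper: give the page back once it is known to be unused. */
int free_send_page(void *p, size_t len)
{
    return munmap(p, len);
}
```

The catch is unchanged: free_send_page() is only safe to call once you know the kernel is done with the page, which is exactly the piece of information the rest of this thread is arguing about.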
However, given that the "free()" routine is part of the OS in FreeBSD
No, it's not unfortunately. It's a library call that mucks up [s]brk() or munmap().
Free _could_ be smart enough to avoid actually freeing the pages until notification occurred, but userspace would still need explicit notification (or just to wait for a while).
The real issue is explicit notification versus page fault. The page fault is undesirable because it wastes time, memory, and cache. The page fault can be avoided by never reusing memory like I proposed above.
OR the userspace can simply wait for notification that the pages are done. A signal could be used, but vmsplice() actually causes a fd to wake up that can receive the notification via the recvmsg() system call.
There you're assuming that the page copy will be necessary. In cases where the W in COW does NOT occur, isn't COW much better?
Well, no, it's about the same actually.
The problem is that in the naive implementation, the page copy is always necessary. A complicated implementation (in userspace) to take advantage of COW is more complicated than with explicit notification.
I'm not an expert on any of this,
That's obvious.
but what I do know is that when you start using up a lot of memory Linux totally sucks.
Correction: when _you_ start using up a lot of memory Linux totally sucks. When I start using up a lot of memory, Linux acts exactly as I expect, and better than FreeBSD.
http://bulk.fefe.de/scalable-networking.pdf
Hrm. Looks like FreeBSD panics under load in its default configuration. So sad.
Meanwhile, I have some systems that constantly run with a run-queue length above 100.0 and are still (albeit somewhat) responsive.
I think the problem with this approach is that COW will only give you a copy of the particular piece of the memory that you accessed. That means that the system has to keep huge tables of what is shared and what is not and every time you make a call to request ANY memory it's going to need to check the table. This action is going to result in an overall performance degradation since the application has to check the table for every write over the long-haul, rather than just duplicate the memory and go.
It does all those things anyway. The problem is that faults are expensive, and because they do happen in real life, copying the memory IS faster in real life.
It is possible to exploit the mechanism FreeBSD uses to gain performance- Simply never touch a page after it's been sent out. Or rather, wait as long as possible- say until malloc() fails.
This would work, but it'd be hard to test and hard to get right.
What Linus suggests is explicit notification- say a select() or poll() operation that says "these pages are now free". This works out well, and is indeed faster because there aren't any copies or page faults. It's also easier to develop.
Of course, using COW for TCP buffers is stupid. That's why people don't use them on FreeBSD (at least, not once they've seen the profiler results)- it's never faster. They always use a static buffer and ALWAYS get the page fault when the system is under any load.
One thing that concerns me about making all of these copies is that it seems like a quick and easy way to blow out your L2 cache. That could in the long run have a worse performance penalty than having to play the VM tricks with CoW.
No it won't. The only way to avoid the copies is to avoid the page faults. Since userland doesn't get explicit notification in FreeBSD of when the pages are safe to use, the process should wait as long as possible (e.g. until malloc() starts failing).
The idea Linus pushes here is explicit notification- via select() or poll() and returnable via recvmsg(). That way the userland knows exactly which pages can be reused.
The result is that it's faster and easier to develop userland programs to take advantage of it. It's also easier to degrade gracefully into read()/write() until the FreeBSD people see the light and add support for this too.
It's really a clever idea.
Copy on Write saves you real memory, cache memory, and CPU time by pretending that each forked process has a true copy of a memory segment when it in fact is looking at the original. That is, right up until a fork tries to write to that memory location, in which case an exception is handled by making an actual copy to a new location and allowing the write.
No. Updating the page tables twice and having a fault in there is very expensive.
Linus believes that the exception will occur enough in real world usage that it will be slower than just doing the copy in the first place.
And he's right too. But he's not recommending the copy "in the first place" - he's recommending explicit notification that the pages aren't used anymore instead of an implicit notification by-way of a page fault.
Linus wants to push the manual use of zero-copy memory sharing through the vmsplice() routine. He believes that the programmer will always know better than the system when to share memory.
That's correct.
Does the exception generated really cost that much more
Yes. There isn't a grey area on it either- it's basic math: cost of page copy + exception + 2 * (page table update) is greater than cost of page copy + page table update. The page copy cancels on both sides, so the CoW path costs exactly one exception and one page table update more.
The real issue is that the userland knows what it's doing. Eventually it'll want to reuse a buffer. Now does the userland start reusing pages when malloc() fails- thus incurring the exceptions when memory is tight? Or does it reuse them when the kernel says they're reusable?
The latter makes more sense if you're actually concerned about performance. The former may be easier to code, but I doubt many people will actually do that because it's hard to test.
In practice what people do is use a static buffer- that's even EASIER to code, but it means page faults happen ALL the time.
Is it really feasible to expect program developers to do manual memory management in a day and age when programs easily weigh in at hundreds of megs?
They already have to do it. Whether it's the BSD implementation or the new Linux implementation they already have to do it if they want reasonable performance in the real world.
To really take advantage of the BSD implementation, your program needs to monitor malloc() usage, and start attempting to reuse pages when it fails- oldest to newest. This is complicated and hard to test.
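Roughly what that bookkeeping looks like, as a sketch only. The names (sent_queue, mark_sent(), next_buffer(), always_fail()) are invented for illustration, the queue size is arbitrary, and the allocator is passed in as a function pointer purely so the "malloc() failed" path can be exercised without actually running out of memory:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK      4096   /* assumed buffer size */
#define MAX_CHUNKS 1024   /* arbitrary cap on outstanding buffers */

/* FIFO of buffers that have been handed to the kernel, oldest first. */
struct sent_queue {
    void  *slot[MAX_CHUNKS];
    size_t head;    /* index of the oldest entry */
    size_t count;   /* number of entries in use */
};

typedef void *(*alloc_fn)(size_t);

/* Record that p was just sent. Returns -1 if the queue is full. */
int mark_sent(struct sent_queue *q, void *p)
{
    assert(q != NULL);
    if (q->count == MAX_CHUNKS)
        return -1;
    q->slot[(q->head + q->count) % MAX_CHUNKS] = p;
    q->count++;
    return 0;
}

/* Get a buffer for the next send: a fresh one if the allocator still has
 * memory, otherwise the oldest sent buffer (the one most likely to be done). */
void *next_buffer(struct sent_queue *q, alloc_fn alloc)
{
    void *p;

    assert(q != NULL);
    p = alloc(CHUNK);
    if (p != NULL)
        return p;
    if (q->count == 0)
        return NULL;            /* nothing left to recycle */
    p = q->slot[q->head];
    q->head = (q->head + 1) % MAX_CHUNKS;
    q->count--;
    return p;
}

/* Stub allocator that always fails, for exercising the reuse path. */
void *always_fail(size_t n)
{
    (void)n;
    return NULL;
}
```

Note that "oldest" is only a guess that the kernel has finished with the buffer. Nothing here can confirm it, which is why this approach is hard to test.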
To really take advantage of the Linux implementation, your program waits until it gets notification (via select() or poll()) on the vmsplice() recvmsg() operation. Once that occurs, the notification says exactly which pages can be used.
The result? Userland on Linux is easier to write, and easier to test. It'll also be faster.
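For comparison, the userland side of explicit notification is about this much code. This is a simulation, not the kernel interface: an ordinary pipe() stands in for the notification fd, notify_done() plays the part of the kernel, and the "message" is simply the buffer's pointer. All of those choices, and both function names, are assumptions made for the sketch:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Stand-in for the kernel: report that buf is no longer in use. */
int notify_done(int wfd, void *buf)
{
    return write(wfd, &buf, sizeof buf) == (ssize_t)sizeof buf ? 0 : -1;
}

/* Drain every pending notification without blocking and free each buffer
 * named in one. Returns how many buffers were released. */
int reap_released(int rfd)
{
    struct pollfd pfd = { .fd = rfd, .events = POLLIN };
    int released = 0;
    void *buf;

    assert(rfd >= 0);
    while (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN)) {
        if (read(rfd, &buf, sizeof buf) != (ssize_t)sizeof buf)
            break;
        free(buf);      /* or put it back on a free list for reuse */
        released++;
    }
    return released;
}
```

In a real server rfd would just be one more descriptor in the main select() or poll() set, so reaping finished buffers never blocks and never faults.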
Any game that is heavily reliant on the controller to make a majority of the mechanics function is doomed from the start
:)
You mean like the NES (origin of the D-Pad)? Or maybe you meant the DS (touching outsells Playstation and PSP)?
Standardized interfaces exist because it allows developers to map the controls they want to a widely accepted standard.
Except Nintendo invented those standards. Each one introduced new games that were more fun than what was out there. Having a dozen first person shooters to pick from doesn't mean you have a dozen fun games, and I think Nintendo knows this.
I think every gamer should give Nintendo the benefit of the doubt on this one- unless you bought a ROB or Virtual Boy
Games should allow custom controller maps, not require custom controllers.
I think the Jaguar, with its overlay cards, pretty much proves you wrong.
you're going to add 20-30$ to the cost of the game for the attachment alone.
I think that's exactly why Nintendo opted for spending less on the rest. People can make excellent games when they're not vying for the most polygons and highest levels of technical accuracy. Nintendo is doing the smart thing- companies that develop on the Revolution will have to compete on fun-ness instead of pixels. They're going to want to because they're suddenly going to have all these "great ideas" on new gimmicky input devices that they can manufacture cheaply.
the words "Glorified light gun" keep springing to mind....
Actually, I just bought a bargain-bin eyetoy thinking the same thing. You know what? At least half the games on the disc are fun. I mean, really fun. I mean, I'd like more games but the eyetoy is too uncommon and too glorified-light-gun looking for anyone to develop for.
I think this lends even better to the idea that by making the glorified light gun standard, we might actually see it get used. And if it gets used, it's probably going to be fun.
And isn't that what really matters?
Is a D-Pad all a good game designer ever needs?
Never forget that was invented by Nintendo as well. Before the D-Pad, people used joysticks, and had no idea how much they hated them until the D-Pad.
In fact, being as how Nintendo was right about the D-Pad, right about the modern ``analog'' stick, right about the touching, why does it seem so difficult for people to believe that they might also be right about the freestyle wand?
Nintendo has demonstrated again and again that they invent excellent general purpose input devices, and again and again that new and exciting games take advantage of them.
Of course, then there's the Virtual Boy.
Nobody cares about system files that can be replaced within hours. The important stuff generally does not require write access to do it.
You sound like you've lost your system files often enough to know this first-hand.
I on the other hand don't have "hours" to throw away every few days like you do.
Well done, you have bent over backwards to lower privileges. Most users won't, and so, this point doesn't really prove anything.
Prove what?
Did you think I was attempting to prove something to you?
Tell me, do you honestly think you understood a word that I wrote- besides "most users won't do that"?
Were you dropped as a child?
Not many people know about CoreForce? No, well, not many people know how to do what you have done either.
I'm sorry, lots of software follows this model. Qmail is a shining example of using privilege separation to avoid risk.
The only place it doesn't seem common is in Sendmail, ISC and Microsoft software.
Often in security discussions I see lots of uninformed speculation as to what "regular users" do. Suffice it to say that "regular users" do install software in large enough numbers that simply ignoring the issue is not enough.
I don't think you see any security discussions. Regular users always means unprivileged user, and NEVER does it mean "real good at home folk".
Basically you've put together a badly hacked up version of what toolkits like SELinux, AppArmor or CoreForce give you in a much cleaner and more elegant way, which is commendable but not a route I'd recommend nor would I expect others to follow it.
Maybe this is the problem. Pompous assholes like yourself that tell users that it's okay to get virus-infected or lose all your data every few days- because that's normal and it doesn't fucking matter.
I hope you shovel fast food for a living, because you'd be worthless in security.
And don't get me started on trusted GUI paths. No consumer OS today gets this right - none. Just go read a usability study of trusted path systems to see what fun we're going to have integrating this into mainstream technology.
I wasn't planning on it. You've already demonstrated yourself a moron and an asshole that likes hearing themselves talk.
Come back when you actually have something to say troll.
It's one thing for you to write a wrapper that chmods a file, runs vi, then chmods it back. But it's another to tell your mother that she can't edit files in her word processing program without giving her word processor write access, or telling her that she can't edit files she's opened with the File menu.
You don't know what you're talking about.
I didn't say chmod(), you assumed it. It actually sets group+write using a combination of chown() and chmod(). vi runs as another user using a setuid/setgid wrapper.
Anyway, all it takes is one buffer overflow bug in a standard library (say, gzip or a JPEG decoder) to let malware start up (i.e. spawn a mail or DDoS bot) which will be running until you log out, restart, or kill the process. Then all it takes is one local privilege escalation bug to let the malware install a rootkit, propagate like a virus, or just delete all the files in your home directory.
You have no idea what you're talking about.
_I_ don't have the ability to delete all the files in my home directory, so why would some program that is running with LESS privilege than me?
Furthermore, it's a sign of brain-damage that you think gzip or jpeg decoding is so common that every program should want to do it, and that it should therefore be considered a "standard" library.
And unfortunately, setuid is possibly the biggest security hole that Unix has ever had. Every setuid program is a privilege escalation attack waiting to happen. If you can't control a pre-existing setuid program, you can always just set some bits on a filesystem to create one.
You have no idea what you're talking about.
You're confusing setuid-root with setuid. The idea is that we setuid to an unprivileged user- one that has very little powers. Privileges never go up, only down.
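To make "privileges never go up, only down" concrete, here is a sketch of the drop itself. This is not my actual wrapper, just the generic pattern; drop_privileges() is a made-up helper, and the target uid/gid are whatever unprivileged account you set aside for the job:

```c
#define _DEFAULT_SOURCE          /* for setgroups() on glibc */
#include <assert.h>
#include <grp.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: permanently become the given unprivileged uid/gid.
 * Returns 0 on success, -1 on failure (in which case: do not continue). */
int drop_privileges(uid_t uid, gid_t gid)
{
    if (uid == 0 || gid == 0)
        return -1;              /* refuse: switching to root is not a drop */

    /* Order matters: supplementary groups, then gid, then uid. Once the
     * uid is gone we no longer have the right to change the other two. */
    if (geteuid() == 0 && setgroups(1, &gid) != 0)
        return -1;
    if (setgid(gid) != 0)
        return -1;
    if (setuid(uid) != 0)
        return -1;

    /* Paranoia: getting root back must now fail. */
    if (setuid(0) == 0)
        return -1;
    return 0;
}
```

A wrapper would call this first and only then exec the editor, so whatever the editor (or anything that hijacks it) does, it does with the reduced identity.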
BTW, if you think that iptables will prevent the spambot from communicating over the internet, all it has to do is debug a process (e.g. firefox) that has access rights and inject its payload there.
You have no idea what you're talking about.
Debugging another user's processes on UNIX is a privileged operation- it isn't allowed unless you're root.
It turns out that Windows has excellent security facilities, just very few programs use them, and many fail outright if they're used.
How can it possibly be useful if it's not usable?
Ultimately, the only thing you can really do to keep yourself safe is simply to exercise caution with what you download. I don't download suspicious programs or visit suspicious web sites, so I am fairly secure.
Security isn't a state, it's a process. You do some parts of the process, but you fail to understand that I can and do download suspicious programs and run them in my natural sandbox without any risk.
You cannot create those kinds of sandboxes anywhere near as easily under Windows.
Of course, Windows allows you to easily run with lowered privileges. In Vista,
No, you're saying in another 2-3 years, that Windows will finally have a feature that's been available in every version of UNIX for almost 30 years.
We'll take a look in 2-3 years to see if Windows can finally compete with UNIX- but as Windows Vista is presently only promising a 40% application compatibility rate, I'd say probably not.