The Really Fair Scheduler
derrida writes "During the many threads discussing Ingo Molnar's recently merged Completely Fair Scheduler, Roman Zippel has repeatedly questioned the complexity of the new process scheduler. In a recent posting to the Linux Kernel mailing list he offered a simpler scheduler named the 'Really Fair Scheduler' saying, 'As I already tried to explain previously CFS has a considerable algorithmic and computational complexity. This patch should now make it clearer, why I could so easily skip over Ingo's long explanation of all the tricks CFS uses to keep the computational overhead low — I simply don't need them.'"
I don't think any scheduler will help you with that. The slowness is due to the swapping in and out from the disk, and that's going to be limited by the horribly slow speed of the disk.
/proc/sys/vm/overcommit_memory. No more OOM killer killing some random unrelated process. Memory allocations will fail and programs will be able to handle that correctly.
/etc/security/limits.conf
You could tweak things to make this a less likely ocurrence though.
Disable overcommit by echo 2 >
Set some memory limits in
Avoid having too much swap space. It's awfully slow, if you're using it too much all you'll manage is to run more things slower.
Get more RAM, it's cheap. If you're regularly swapping then you definitely should.
Pay per scheduler, the kind that allocates time to processes that are initialized by the highest paying bidder. I am aiming for a CEO.
You can't handle the truth.
The scheduler is at the very heart of the kernel. It's relatively hard to make the logic for choicing what and when to context-switch modular, while keeping the actual context-switches fast enough. Diferent schedulers tend to have different ideas on what stats to keep, and you all want it with good memory locality. After all, we should remember that this is a piece of code that's relevant tens or hundreds of times per second, no matter what you do with your machine.
Ingo's reply can be found here. Roman's reply to that is here and here
(on a bash shell)
ulimit -v 4096
command_that_uses_memory
This will limit the amount of memory available to command_that_uses_memory, and kill it once that limit is reached. But do you really want firefox forcibly killed every time you visit youtube?
On Fri, 31 Aug 2007, Ingo Molnar wrote:
> So the most intrusive (math) aspects of your patch have been implemented
> already for CFS (almost a month ago), in a finegrained way.
Interesting claim, please substantiate.
> Peter's patches change the CFS calculations gradually over from
> 'normalized' to 'non-normalized' wait-runtime, to avoid the
> normalizing/denormalizing overhead and rounding error.
Actually it changes wait-runtime to a normalized value and it changes nothing about the rounding error I was talking about. It addresses the conversion error between the different units I was mentioning in an earlier mail, but the value is still rounded.
> > This model is far more accurate than CFS is and doesn't add an error
> > over time, thus there are no more underflow/overflow anymore within
> > the described limits.
> ( your characterisation errs in that it makes it appear to be a common
> problem, while in practice it's only a corner-case limited to extreme
> negative nice levels and even there it needs a very high rate of
> scheduling and an artificially constructed workload: several hundreds
> of thousand of context switches per second with a yield-ing loop to be
> even measurable with unmodified CFS. So this is not a 2.6.23 issue at
> all - unless there's some testcase that proves the opposite. )
> with Peter's queue there are no underflows/overflows either anymore in
> any synthetic corner-case we could come up with. Peter's queue works
> well but it's 2.6.24 material.
Did you even try to understand what I wrote? I didn't say that it's a "common problem", it's a conceptual problem. The rounding has been improved lately, so it's not as easy to trigger with some simple busy loops. Peter's patches don't remove limit_wait_runtime() and AFAICT they can't, so I'm really amazed how you can make such claims.
> All in one, we dont disagree, this is an incremental improvement we are
> thinking about for 2.6.24. We do disagree with this being positioned as
> something fundamentally different though - it's just the same thing
> mathematically, expressed without a "/weight" divisor, resulting in no
> change in scheduling behavior. (except for a small shift of CPU
> utilization for a synthetic corner-case)
Everytime I'm amazed how quickly you get to your judgements...
BTW who is "we" and how is it possible that this meta mind can come to such quick judgements?
The basic concept is quite different enough, one can e.g. see that I have to calculate some of the key CFS variables for the debug output. The concepts are related, but they are definitively not "the same thing mathematically", the method of resolution is quite different, if you think otherwise then please _prove_ it.
bye, Roman
Hmmm, ever heard of nice?
Working in a DevOps shop is like playing in a band made up entirely of keytarists.
FreeBSD likes lots of extra swap space. An idle system will notice that some process hasn't run in a month and will push it to swap, proactively freeing RAM for something else that might want it. Note that it will only page out a process's data segment; it's code segment uses the filesystem itself for paging (why copy "firefox" into swap when there's already a perfectly readable copy on the filesystem?).
Unless, of course, you unlink its executable file, in which case it allocates swap to hold the file first. Which also illustrates that while unnecessary computational complexity is bad, willingness to do complex things when the situation demands can lead to some pretty cool stuff.
Dewey, what part of this looks like authorities should be involved?
There are also Fedora bug reports with similar problems. I have two machines that were affected with this with some of the later kernels in Fedora 5. And now with Fedora 7 I am seeing something similar, but not quite the same. I have another machine that I haven't seen the problem on.
Upgrade to 1GB of RAM (2GB on Intel) and you won't see it anymore. (usually.)
-:sigma.SB
Depends a lot on your situation.Even with many many gigabytes of ram there are many situations where Apples applications (or the os) just sit there and do nothing (or spinning that pinwheel like they've nothing better to do) and you wonder if they crashed or what... Often enough, no. They're just doing the wrong or stupid thing and it eventually recovers. How often you see it depends a lot on your usage pattern.
None of these (near as I can tell) have anything to do with the scheduler. Just shoddy code and bad decisions.
Platform advocacy is like choosing a favorite severely developmentally disabled child.
Actually he admitted that he didn't pay very much attention and may have taken one incident as the norm. That single incident was in response to a troll who submitted faulty bug reports and ignored the reasons for why they were rejected. Linus stated he didn't care that he may be wrong, since in the end he got a better schedular from a developer he knows.
As I said, this is more about management and politics than a choice based on technical details. Personally I don't care which schedular won, but it wasn't on their merits.
The Solaris kernel supports several schedulers which can be changed during runtime. You can also have different processes under different schedulers at the same time.
The meaning of "fair" in this case is that it equally allocates CPU time to programs run in the same priority. Mostly this is managed by allocating certain "timeslices". The reason you want the fairest scheduler you can get is so that your priorities are properly respected and processes at the same priority aren't discriminated against in terms of CPU time, something the old scheduler failed at.
Here's a post on how the above patchset can improve the responsiveness of the system under heavy write load:
huge improvement with per-device dirty throttling
And the thread referencing the latest version of the patch posted to lkml:
per device dirty throttling -v9
Oh my gosh, the Linux scheduler is on Slashdot. Again! :-)
Frankly, this amount of interest in the Linux scheduler is certainly flattering to all of us Linux scheduler hackers, but there are certainly more important areas that need improvement: 3D support, the MM / IO schedulers, stability, compatibility, etc. (There's also the FreeBSD scheduler that went through a total rewrite recently - and it got not a single Slashdot article that i remember.)
But i digress. A couple of quick high-level points (most of the details can be found in the discussions on lkml):
I find the RFS submission interesting and useful, and i have asked the author to split the patch up a bit better, to separate the core idea from optimizations and unrelated changes - to ease review and merging of the changes, and to make the changes bisectable during QA after they have been applied to the mainstream kernel. (That is how patches are typically submitted to the Linux-kernel mailing list - it's a basic requirement before anything can be merged. CFS for example was applied to the 2.6.23 development tree in form of a series of 50 (!) separate patches. (And the scheduler works at every patching/bisection point.))
I also pointed him to the latest "bleeding edge" scheduler tree, which already implements the same non-normalized form of math and makes some of the rounding and performance arguments moot i believe. (lkml mail).
There are some issues where i disagree with Roman at the moment: even when comparing to unmodified current upstream CFS, i think Roman makes too much out of rounding behavior and i have asked him to substantiate his claims with numbers (lkml mail).
The current precision/rounding of CFS is better than one part in a million. (in fact it's currently even better than that, but i'm saying 1:1000000 here because we could in the future consciously decrease precision, if performance or simplicity arguments justify it.)
I can understand his desire towards creating interest in his patch, but IMO it should not be done by unfairly (pun unintended ;) trash-talking other people's code. The math code in CFS that achieves precision has gone through more than 5 complete rewrites already in the 20-plus CFS versions, and the current variant was not written by me but was largely authored by Thomas Gleixner and Peter Zijlstra.
New, better approaches are possible of course and the math is relatively easy to replace, due to the internal modularity of CFS. So we are keeping an open mind towards further improvements. (which includes the possibility of total replacements as well. Dozens of times has my own kernel code been replaced with new, better implementations in the past - and that includes large parts of the scheduler too. In fact only ~30% of current kernel/sched.c was authored by me, the rest has been written by the other 90+ scheduler contributors, according to the git-annotate output that covers the past ~2.5 years of kernel history. Beyond that numerous other people have contributed to the scheduler in the past.)
About the submitted code: it was a bit hard to review it because the new code did not contain any comments - it only included raw code - which is very uncommon for patches of such type. The email gave the theoretical background but there was little implementational detail in the patch itself connecting the theory to practice.
So to drive this issue forward i have today posted a question to Roman in form of a tiny patch that extracts only his suggested new math from his patch and applies it to CFS. If it is indeed what Roman intended then we can analyze that in isolation and in more detail. The patch is as small as it gets: