The trouble is that in server workloads you generally don't see ONE LARGE I/O operation - lots of small ones instead. There are very very few server workloads that involve transferring >100MB data at a time (even when it comes to DB snapshotting).
There's lots of server workloads that involve large IO requests:
- backups
- DB startup/shutdown
- DB traffic that generates or reads a lot of new data (say report generation)
- HPC workloads that work with huge data sets
- animation farms that work with huge images/movies
- web servers streaming out big files
- fsck
- virtual desktop servers where the desktops are virtual instances running on the server. There any IO load within that 'desktop' runs on the server.
etc. There is also a fair number of server workloads that are IO heavy but which use small IO requests.
On the desktop this is common (all your AVI files).
If you have those big files in networked storage or if you are backing them up to some network host then you've already transformed those kinds of IO requests into big IO requests on the server side as well: the big file you read or write on the desktop is one that the network file/backup server will read/write from its own disks, etc.
Really, "interactivity sucks during big IO" kind of bugs can hurt servers just as much as they can hurt desktops. The boundary between desktops and servers is very fluid.
There's also the VM fix from Wu Fengguang, included in v2.6.36, which addresses similar "slowdown while copying large amounts of data" bugs.
There were about a dozen kernel bugs causing similar symptoms, which we fixed over the course of several kernel releases. They were almost evenly spread out between filesystem code, the VM and the IO scheduler. And yes, i agree that it took too long to acknowledge and address them - these problems have been going on for several years. It's a serious kernel development process failure.
If anyone here still experiences bad desktop stalls while handling big files with v2.6.36 too then we'd appreciate a quick bug report sent to linux-kernel@vger.kernel.org.
So I know some people may read this and think "haha, funny joke" but given that most users are extremely predictable regarding what programs they use and when and how they use them (same with web browsing), shouldn't it be possible to gather user activity over time and analyze it to help improve scheduling?
Yeah, that's certainly a possibility.
This is also the goal of most heuristics in the kernel: to figure out a hidden piece of information that the application (and user) has not passed to the kernel explicitly.
The problem comes when the kernel gets it wrong - the kernel and applications can easily get into a feedback loop / arms race of who knows how to trick the other one into doing what the app writer (or kernel writer) thinks is best. In such cases we get the worst of both worlds: we get the bad case and we get the cost of heuristics.
(Heuristic and predictive systems also tend to be complex and hard to analyze: you can rarely reproduce bugs without having the exact same filesystem layout and usage pattern as the user experienced, etc.)
What we found is that in terms of default behavior it's a bit better to keep things simple and predictable/deterministic and then give apps a way to inject extra information into the kernel. We have the fadvise/madvise calls: posix_fadvise(), for example, can be used with the POSIX_FADV_DONTNEED flag to drop cached content from the page cache.
Heuristics and predictive techniques are done when we can be reasonably sure that we get the decisions right: for example there's a piece of fairly advanced code in the Linux page cache trying to figure out whether to pre-fetch data or not.
The large file copy interactivity problems some have mentioned here were most likely real kernel bugs (in the filesystem, IO scheduling and VM subsystems) and were hopefully fixed in the v2.6.33 - v2.6.36 timeframe.
If you can still reproduce any such problems then please report them to linux-kernel@vger.kernel.org so we can fix it ASAP.
In any case, we could all be wrong about it, so if you have a good implementation of more aggressive predictive algorithms i'm sure a lot of people would try them out - me included. We kernel developers want a better desktop just as much as you want it.
Such drastic change! I have seen this happen on numerous systems and I just change the elevator to "deadline" and poof! The problem is gone. See this discussion for some details. The CFQ scheduler is great for a Linux server running a database, but it completely sucks for desktop or any server used to write large files to.
I see that the bug entry you referred to contains measurements from early 2010, at which point Ubuntu was using v2.6.31-ish kernels IIRC. (and that's the kernel version that is being referred to in the bug entry as well.)
A lot of work has gone into this area in the past 1-2 years, and v2.6.33 is the first kernel version where you should see the improvements. Slashdot reported on that controversy as well.
If you can still reproduce interactivity problems during large file copies with CFQ on v2.6.36 (and it goes away when you switch the IO scheduler to deadline), please report it to linux-kernel@vger.kernel.org so that it can be fixed ASAP. (You can also mail me directly, i'll forward it to the right list and people.)
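To include the active IO scheduler in such a report, it helps to read it from sysfs rather than guess. The following is a minimal sketch, assuming the usual `/sys/block/<device>/queue/scheduler` file, where the active elevator is shown in square brackets; the device name `sda` and the helper names are only illustrative.

```python
from pathlib import Path


def parse_schedulers(text):
    """Split a sysfs 'scheduler' line into (active, available)."""
    active = None
    available = []
    for name in text.split():
        # The kernel marks the active elevator with square brackets.
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_schedulers(device="sda"):
    """Read the scheduler list for one block device (device name is an assumption)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_schedulers(path.read_text())


if __name__ == "__main__":
    # Sample line in the sysfs format; on a real box call read_schedulers().
    print(parse_schedulers("noop deadline [cfq]"))
```

Switching the elevator for a test run is normally done by writing a scheduler name (for example `deadline`) into that same file as root.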
Ingo, I find delays of 29-45ms to be pretty noticeable. To put it another way, if you had a delay of 10ms before, and you're now getting a delay of 50ms due to some background copy, all of your applications went from running at 100fps to 20fps, which I think even non-sensitive people can pick up on, even outside of games and smooth scrolling. VIM feels different over a 10ms LAN connection vs. a 45ms connection from my home.
Yes i agree with you that if a 45 msecs latency happens on every frame then that will snowball and will thoroughly ruin game interactivity - but note the specific context here:
(hm, my first link above was broken, sorry about that.)
Those 45 msec delays were statistical-max outliers - with the average latency at 7.3 msecs. This got cut down to 25 msecs / 6.6 msecs respectively via the patch. Note that it's also a specific, CPU overloaded workload that was measured here, so not typical of the desktop unless you are a developer running make -j build jobs.
We care about optimizing maximum latencies because those are what can cause occasional hiccups on the desktop - a lagging mouse pointer - or some other non-smooth visual artifact.
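The frame-rate arithmetic in the quoted comment, and the difference between average and worst-case latency, can be checked with a few lines. This is only a sketch: it assumes every frame pays the full delay, which is exactly the assumption that does not hold when the large value is an occasional outlier.

```python
def frames_per_second(frame_latency_ms):
    """Frame rate if every single frame took frame_latency_ms to produce."""
    return 1000.0 / frame_latency_ms


if __name__ == "__main__":
    # Latency figures taken from the discussion above, in milliseconds.
    for label, ms in [("quoted 'before'", 10), ("quoted 'after'", 50),
                      ("measured average", 7.3), ("measured max", 45)]:
        print(f"{label}: {ms} ms per frame -> {frames_per_second(ms):.1f} fps")
```

Only the average figure says something about sustained frame rate; the max figure describes a single worst frame.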
Sorry dude, it looks like it's a hardware specific problem. I did that on nearly 700G of large files and then fired up the flight sim while it was still going. The only slow down was on file related activity, which is totally what you'd expect. I had it running full screen across two monitors without any drop in frame rate. AND I'm using economy hardware.
It may also be kernel version dependent - with older kernels still showing this bug.
A lot of work has gone into the Linux kernel in the past 2 years to improve this area - and yes, i think much of the criticism from those who have met this bug and were annoyed by it was fundamentally justified - this bug was real and it should have been fixed sooner.
Kernels post v2.6.33 ought to be much better - with v2.6.36 bringing another set of improvements in this area. The fixes were all over the place: IO scheduler, VM and filesystem code and few of them were simple.
No, massive unfairness is just as bad on the server as it is on the desktop - in all but a few select batch processing situations.
Replace 'desktop' with 'database', 'Apache', 'Samba' or 'number crunching job' and you get the same kind of badness.
There's not much difference really. If it sucks on the desktop then it sucks on the server too: why would it be a good server if it slows down a DB/Apache/Samba/number-crunching-job while prioritizing some large copy operation?
You are right, deadline is the other (much smaller/simpler) one - CFQ is the main IO scheduler remaining.
You can still test AS by going back to an older kernel - and as long as it's a performance regression that is reported (relative to that old kernel, running AS), it should not be ignored on lkml.
It would be useful if /bin/cp explicitly dropped use-once data that it reads into the pagecache - there are syscalls for that.
Other than opening the files O_DIRECT, how would you do that?
No need for O_DIRECT (which might not even work everywhere), /bin/cp could use fadvise/madvise with some size heuristics. (say if a file is larger than RAM and will be copied fully then it cannot be reasonably cached)
POSIX_FADV_DONTNEED should do the trick in terms of pagecache footprint - it will invalidate the page-cache.
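As a rough illustration of that idea (not what /bin/cp actually does), here is a minimal sketch of a copy loop that applies the "larger than RAM" heuristic and then drops each copied range with POSIX_FADV_DONTNEED. The function names, the 1 MiB buffer and the per-chunk fsync are assumptions made for the example; the fsync is there because dirty pages cannot be dropped before they are written back.

```python
import os

CHUNK = 1 << 20  # 1 MiB copy buffer (arbitrary choice for this sketch)


def total_ram_bytes():
    """Physical RAM size via sysconf (names as available on Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def looks_use_once(file_size, ram_bytes):
    """Size heuristic from the text: a file larger than RAM cannot stay cached."""
    return file_size > ram_bytes


def copy_file(src, dst, drop_cache=None):
    """Copy src to dst; optionally drop the copied ranges from the page cache."""
    src_fd = os.open(src, os.O_RDONLY)
    try:
        dst_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            if drop_cache is None:
                size = os.fstat(src_fd).st_size
                drop_cache = looks_use_once(size, total_ram_bytes())
            offset = 0
            while True:
                buf = os.read(src_fd, CHUNK)
                if not buf:
                    break
                view = memoryview(buf)
                while view:  # handle short writes
                    view = view[os.write(dst_fd, view):]
                if drop_cache:
                    # Flush first: only clean pages can be dropped.
                    os.fsync(dst_fd)
                    os.posix_fadvise(src_fd, offset, len(buf), os.POSIX_FADV_DONTNEED)
                    os.posix_fadvise(dst_fd, offset, len(buf), os.POSIX_FADV_DONTNEED)
                offset += len(buf)
            return offset
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
```

Syncing after every chunk is the simplest way to make the advice effective but it costs throughput; a real tool would more likely flush and drop in larger windows.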
He tried that before. I think he's given up on getting his scheduler (though perhaps not a suspiciously similar one written by Inigo) in the kernel after what happened with CFQ.
One reason for why the principle of CFS may seem to you so suspiciously similar to Con's SD scheduler is that i used Con's fair scheduling principle when writing the initial version of CFS. This is credited at the very top of today's kernel/sched.c [the scheduler code]:
* 2007-04-15 Work begun on replacing all interactivity tuning with a
* fair scheduling design by Con Kolivas.
The implementations (and even the user visible behavior) of the schedulers were and are very different - and that is where much of the disagreement and later flaming came from.
Note that this particular Slashdot article is about IO scheduling though - which is unrelated to CPU schedulers. Neither Con nor i wrote IO schedulers.
There are two main IO schedulers in Linux right now: CFQ and AS, written by Jens Axboe, Nick Piggin, et al.
What adds fuel to the confusion is that it is relatively easy to mix up 'CFQ' with 'CFS'.
The problem is in the IO scheduler. Switching from CFQ to AS minimizes the problem. It takes less than 5 seconds with google to see how wide spread the problem is. Lots of people squawking about it. CFQ is pure crap in my experience. Copying a couple gigabytes of data should not render all my applications unusable for five minutes.
Probably uninformed, but someone actually claimed that CFQ is designed to be tweaked with ionice. Yeah, a modern OS should require people to manually ionice every time they do a file copy!
It would be ideal for the schedulers and window manager to communicate and give priority to the foreground application.
Please help us resolve this issue: please post your experiences to linux-kernel@vger.kernel.org and Cc: the following gents: jaxboe@fusionio.com, torvalds@linux-foundation.org, akpm@linux-foundation.org, a.p.zijlstra@chello.nl, mingo@elte.hu.
It would be nice if you could attach latencytop numbers for CFQ and for AS, using the same workload. Latencytop will measure the worst-case delays you suffered - so you can demonstrate the CFQ versus AS effect numerically.
If you can reproduce that problem with a new kernel (v2.6.36 would be ideal) then please try to describe the symptoms in a mail to linux-kernel@vger.kernel.org, and also point out whether the tunings above improved things. Please Cc: Jens, Andrew, me and Linus as well.
To turn interactivity woes on the desktop into actual hard numbers you can use Arjan van de Ven's latencytop tool. It will measure your worst-case delays with and without big copies being done in the background, which numbers you can cite in your email.
That's great that you post your experiences with server scheduling in a topic about desktop scheduling. It's so relevant. No wait, it's not.
The boundary between the desktop space and the server space is rather fluid, and many of the problems visible on servers are also visible on desktops - and vice versa.
For example 'copying a large amount of data' on a server is similar to 'copying a big ISO on the desktop'. If the kernel sucks doing one then it will likely suck when doing the other as well.
So both cases should be handled by the kernel in an excellent fashion - with an optimization/tuning focus on desktop workloads, because they are almost always the more diverse ones, and hence are generally the technically more challenging cases as well.
Why does fsync not synchronously flush out *only* the data dirtied by the current process, rather than all buffers on all file systems dirtied by any process?
That's the intention and that's even how it's coded - but for example ext3 had a bug/misfeature that caused this operation to serialize on all pending writes to the same filesystem's journal area in some pretty common scenarios - with disastrous results to interactivity.
There was a big flamewar about it on lkml a year or two ago, and in that discussion Linus declared that it is a very high priority item to fix this, and that desktop interactivity is our main optimization focus. (IIRC the flamewar was big enough to make it to Slashdot - cannot find the link right now.)
The fsync/fdatasync performance problem was fixed/resolved shortly after that.
Kernels after v2.6.32 (and certainly the latest v2.6.36 kernel) should be much better in that area.
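To check whether fsync() on a small file stalls behind other writers on a given system, one can time it directly. This is a minimal sketch with a hypothetical helper name; run it once on an idle box and once while a big copy to the same filesystem is in progress, and compare the numbers.

```python
import os
import time


def timed_fsync(path, payload=b"x" * 4096):
    """Append payload to path and return how long fsync() took, in seconds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)
        start = time.monotonic()
        # Ideally this waits only for this file's dirty data.
        os.fsync(fd)
        return time.monotonic() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    print("fsync took %.1f ms" % (timed_fsync("fsync_probe.tmp") * 1000.0))
```

The probe file has to live on the filesystem under test for the measurement to mean anything.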
It seems like a bad idea for a non-root thread to have so much power over how smoothly the rest of the system runs.
Yes, indeed.
In terms of isolation guarantees, block cgroups (control groups) should be a feature for more formal isolation of one user from another. AFAIK Android puts each app into a separate user and into separate cgroups. So it's not impossible to design the user-space side properly. It was written as a server feature originally, but i think it's very useful on the desktop as well.
While certainly the whole file may end up cached, the source for cp does a simple read/write with a small buffer -- not read in the whole file and then write it out.
Many apps or DB engines will have a similar pattern: they read/write in a relatively small buffer, but then expect the exact opposite of what you'd expect /bin/cp to do: they expect the file to stay cached (because they will read it again in the future).
So the kernel cannot know why the files are being read and written: will it be needed in the future (Firefox sqlite DB) or not (cp of a big file).
(Unfortunately, the planned mind reading extension to the kernel is still a few years out.)
Even in the specific case of /bin/cp often the files might be needed shortly after they have been copied. If you have 4 GB of RAM and you are copying a 750 MB ISO, you'd expect that ISO to stay fully cached so that the CD-writer tool can access it faster (and without wasting laptop power), right?
So in 99% of the cases it is the best kernel policy to keep around cached data as much as possible.
What makes caching wrong in the "copy huge ISO around" case is that both files are too large to fit into the cache and that cp reads and writes to the totality of both files. Since /bin/cp does not declare this in advance the kernel has no way of knowing this for sure as the operation progresses - and by the time we hit limits it's too late.
It would all be easier for the kernel if cp and dd used fadvise/madvise to declare the read-once/write-once nature of big files. It would all just work out of box. The question is, how can cp figure out whether it's truly use-once...
The other thing that can go wrong is that arguably other apps should not be affected by this negatively - and this was the point of the article as well. I.e. cp may fill up the pagecache, but those new pages should not throw out well-used pages on the LRU, plus other write activities by other apps should not be slowed down just because there's a giant copy going on.
Those kinds of big file operations certainly work fine on my desktop boxes - so if you see such symptoms you should report it to linux-kernel@vger.kernel.org, where you will be pointed to the right tools to figure out where the bug is. (latencytop and powertop are both a good start.)
Note that i definitely could see similar problems two years ago, with older kernels - and a lot of work went into improving the kernel in this area. v2.6.35 or v2.6.36 based systems with ext3 or some other modern filesystem should work pretty well. (The interactivity break-through was somewhere around v2.6.32 - although a lot of incremental work went upstream after that, so you should try as new of a kernel as you can.)
Also, i certainly think that the Linux kernel was not desktop-centric enough for quite some time. We didn't ever ignore the desktop (it was always the primary focus for a simple reason: almost every kernel developer uses Linux as their desktop) - but the kernel community certainly under-estimated the desktop and somehow thought that the real technological challenge was on the server side. IMHO the exact opposite is true.
Fortunately, things have changed in the past few years, mostly because there's a lot of desktop Linux users now, either via some Linux distro or via Android or some of the other mobile platforms, and their voice is being heard.
I know some of the patches have made it back into the mainline kernel, any idea when they all will be merged?
The -tip tree contains development patches for the next kernel version for a number of kernel subsystems (scheduler, irqs, x86, tracing, perf, timers, etc.) - and i'm glad that you like it :-)
We typically send all patches from -tip into upstream in the merge window - except for a few select fixlets and utility patches that help our automated testing. We merge back Linus's tree on a daily basis and stabilize it on our x86 test-bed - so if you want some truly bleeding edge kernel but want proof that someone has at least built and booted it on a few boxes without crashing then you can certainly try -tip ;-)
Otherwise we try to avoid -tip specials. I.e. there are no significant out-of-tree patches that stay in -tip forever - there are only in-progress patches which we try to push to Linus ASAP. If we cannot get something upstream we drop it. This happens every now and then - not every new idea is a good idea. If we cannot convince upstream to pick up a particular change then we drop it or rework it - but we do not perpetuate out-of-tree patches.
So the number of extra commits/changes in -tip fluctuates, it typically ranges from up to a thousand down to a few dozen - depending on where we are in the development cycle.
Right now we are in the first few days of the v2.6.37 merge window and Linus pulled most of our pending trees already in the past two days, so -tip contains small fixes only. While v2.6.37 is being releasified in the next ~2.5 months, -tip will fill up again with development commits geared towards v2.6.38 - and we will also keep merging back Linus's latest tree - and so the cycle continues.
(1) As soon as RAM is exhausted and the kernel starts swapping out to disk, the desktop experience is severely impacted (and immediately so). [...]
Right. If a desktop starts swapping seriously then it's usually game over, interactivity wise. Typical desktop apps produce so much new dirty data that it's not funny if even a small portion of it has to hit disk (and has to be read back from disk) periodically.
But please note that truly heavy swapping is actually a pretty rare event. The typical event for desktop slowdowns isn't deadly swap-thrashing per se, but two types of scenarios:
1) dirty threshold throttling: when an app fills up enough RAM with dirty data (which has to be written to disk sooner or later), then the kernel first starts a 'gentle' (background, async) writeback, and then, when a second limit is exceeded, starts a less gentle (throttling, synchronous) writeback. The defaults are 10% and 20% of RAM - and both limits can be tuned. To see whether you are affected by this phenomenon you can try much more aggressive values.
These set async writeback to kick in ASAP (the disk can write back in the background just fine), but set the 'aggressive throttling' limit up really high. This tuning might make your desktop magically faster. It may also cause really long delays if you do hit the 90% limit via some excessively dirtying app (but that's rare).
2) fsync delays. A handful of key apps such as Firefox use periodic fsync() syscalls to ensure that data has been saved to disk - and rightfully so. Linux fsync() performance used to be pretty dismal (the fsync had to wait for a really long time on random writers to the disk, delaying Firefox all the time) and went through a number of improvements. If you have v2.6.36 and ext3 then it should be all pretty good.
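The exact tuning commands for scenario 1) are not in the text above. As a hedged sketch, assuming the two limits are the `vm.dirty_background_ratio` and `vm.dirty_ratio` sysctls under `/proc/sys/vm/`, the following shows how the percentages translate into bytes of dirty data. The 90% throttling limit comes from the text; the 1% background value is an assumption standing in for "ASAP". The kernel computes these limits against dirtyable memory rather than total RAM, so the numbers are approximations.

```python
def read_ratio(name):
    """Read a percentage from /proc/sys/vm; returns None if unavailable."""
    try:
        with open("/proc/sys/vm/" + name) as f:
            return int(f.read())
    except (OSError, ValueError):
        return None


def dirty_thresholds(ram_bytes, background_pct, hard_pct):
    """Bytes of dirty data at which async writeback and throttling begin."""
    return ram_bytes * background_pct // 100, ram_bytes * hard_pct // 100


if __name__ == "__main__":
    ram = 4 * 2**30  # example machine with 4 GiB of RAM
    print("current settings:", read_ratio("dirty_background_ratio"),
          read_ratio("dirty_ratio"))
    print("defaults (10%, 20%):", dirty_thresholds(ram, 10, 20))
    print("test tuning (1%, 90%):", dirty_thresholds(ram, 1, 90))
```

Changing the values requires root, for example by writing the new percentage into the corresponding file; the sketch deliberately only reads them.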
I think a fair chunk of the "/bin/cp /from/large.iso /to/large.iso" problem could be eliminated if cp (and dd) helped the kernel and dropped the page-cache on large copies via fadvise/madvise. Linux really defaults to the most optimistic assumption: that apps are good citizens and will dirty only as much RAM as they need. Thus the kernel will generally allow apps to dirty a fair amount of RAM, before it starts throttling them.
VM and caching heuristics are tricky here - an app or DB startup sequence can produce very similar patterns of file access and IO when it warms up its cache. In that case it would be absolutely lethal to performance to drop pagecache contents and to sync them out aggressively.
If the cp app did something as simple as explicitly dropping the page-cache via the fadvise/madvise system calls then a lot of user side grief could be avoided i suspect. DVD and CD burning apps are already rather careful about their pagecache footprint.
But, if you have a good testcase you should contact the VM and IO developers on linux-kernel@vger.kernel.org - we all want Linux desktops to perform well. (server workloads are much easier to handle in general and are secondary in that aspect.) We have various good tools that allow more than enough data to be captured to figure out where delays come from (blktrace, ftrace, perf, etc.) - we need more reporters and more testers.
I think the Phoronix article you linked to is confusing the IO scheduler and the VM (both of which can cause many seconds of unwanted delays during GUI operations) with the CPU scheduler.
The CPU scheduler patch referenced in the Phoronix article deals with delays experienced during high CPU loads - a dozen or more tasks running at once and all burning CPU time actively. Delays of up to 45 milliseconds were reported and they were fixed to be as low as 29 milliseconds.
Also, that scheduler fix is not a v2.6.37 item: i have merged a slightly different version and sent it to Linus, so it's included in v2.6.36 already: you can see the commit here.
If you are seeing human-perceptible delays - especially in the 'several seconds' time scale, then they are quite likely not related to the CPU scheduler (unless you are running some extreme workload) but more likely to the CFQ IO scheduler or to the VM cache management policies.
In the CPU scheduler we usually deal with milliseconds-level delays and unfairnesses - which rarely raise up to the level of human perception.
Sometimes, if you are really sensitive to smooth scheduling, you can see those kinds of effects visually via 'game smoothness' or perhaps 'Firefox scrolling smoothness' - but anything on the 'several seconds' timescale on a typical Linux desktop has to have some connection with IO.
Yes. Here there is another problem at play: cp reads in the whole (big) file and then writes it out. This brings the whole file into the Linux pagecache (file cache).
That, if the VM is not fully detecting that linear copy correctly, can blow a lot of useful app data (all cached) out of the pagecache. That in turn has to be read back once you click within Firefox, etc. - which generates IO and is a few orders of magnitude slower than reading the cached copy. That such data tends to be fragmented (all around on the disk in various small files) and that there is a large copy going on does not help either.
Catastrophic slowdowns on the desktop are typically such combined 'perfect storms' between multiple kernel subsystems. (for that reason they also tend to be the hardest ones to fix.)
It would be useful if /bin/cp explicitly dropped use-once data that it reads into the pagecache - there are syscalls for that.
And yes, we'd very much like to fix such slowdowns via heuristics as well (detecting large sequential IO and not letting it poison the existing cache), so good bugreports and reproducing testcases sent to linux-kernel@vger.kernel.org and people willing to try out experimental kernel patches would definitely be welcome.
FYI, the IO scheduler and the CPU scheduler are two completely different beasts.
The IO scheduler lives in block/cfq-iosched.c and is maintained by Jens Axboe, while the CPU scheduler lives in kernel/sched*.c and is maintained by Peter Zijlstra and myself.
The CPU scheduler decides the order of how application code is executed on CPUs (and because a CPU can run only one app at a time the scheduler switches between apps back and forth quickly, giving the grand illusion of all apps running at once) - while the IO scheduler decides how IO requests (issued by apps) reading from (or writing to) disks are ordered.
The two schedulers are very different in nature, but both can indeed cause similar looking bad symptoms on the desktop though - which is one of the reasons why people keep mixing them up.
If you see problems while copying big files then there's a fair chance that it's an IO scheduler problem (ionice might help you there, or block cgroups).
I'd like to note for the sake of completeness that the two kinds of symptoms are not always totally separate: sometimes problems during IO workloads were caused by the CPU scheduler. It's relatively rare though.
Analysing (and fixing ;-) such problems is generally a difficult task. You should mail your bug description to linux-kernel@vger.kernel.org and you will probably be asked there to perform a trace so that we can see where the delays are coming from.
On a related note i think one could make a fairly strong argument that there should be more coupling between the IO scheduler and the CPU scheduler, to help common desktop usecases.
Incidentally there is a fairly recent feature submission by Mike Galbraith that extends the (CPU) scheduler with a new feature which adds the ability to group tasks more intelligently: see Mike's auto-group scheduler patch.
This feature uses cgroups for block IO requests as well.
You might want to give it a try, it might improve your large-copy workload latencies significantly. Please mail bug (or success) reports to Mike, Peter or me.
You need to apply the above patch on top of Linus's very latest tree, or on top of the scheduler development tree (which includes Linus's latest), which can be found in the -tip tree.
(Continuing this discussion over email is probably more efficient.)
Oh my gosh, the Linux scheduler is on Slashdot. Again! :-)
Frankly, this amount of interest in the Linux scheduler is certainly flattering to all of us Linux scheduler hackers, but there are certainly more important areas that need improvement: 3D support, the MM / IO schedulers, stability, compatibility, etc. (There's also the FreeBSD scheduler that went through a total rewrite recently - and it got not a single Slashdot article that i remember.)
But i digress. A couple of quick high-level points (most of the details can be found in the discussions on lkml):
I find the RFS submission interesting and useful, and i have asked the author to split the patch up a bit better, to separate the core idea from optimizations and unrelated changes - to ease review and merging of the changes, and to make the changes bisectable during QA after they have been applied to the mainstream kernel. (That is how patches are typically submitted to the Linux-kernel mailing list - it's a basic requirement before anything can be merged. CFS for example was applied to the 2.6.23 development tree in form of a series of 50 (!) separate patches. (And the scheduler works at every patching/bisection point.))
I also pointed him to the latest "bleeding edge" scheduler tree, which already implements the same non-normalized form of math and makes some of the rounding and performance arguments moot i believe. (lkml mail).
There are some issues where i disagree with Roman at the moment: even when comparing to unmodified current upstream CFS, i think Roman makes too much out of rounding behavior and i have asked him to substantiate his claims with numbers (lkml mail).
The current precision/rounding of CFS is better than one part in a million. (in fact it's currently even better than that, but i'm saying 1:1000000 here because we could in the future consciously decrease precision, if performance or simplicity arguments justify it.)
I can understand his desire towards creating interest in his patch, but IMO it should not be done by unfairly (pun unintended;) trash-talking other people's code. The math code in CFS that achieves precision has gone through more than 5 complete rewrites already in the 20-plus CFS versions, and the current variant was not written by me but was largely authored by Thomas Gleixner and Peter Zijlstra.
New, better approaches are possible of course and the math is relatively easy to replace, due to the internal modularity of CFS. So we are keeping an open mind towards further improvements. (which includes the possibility of total replacements as well. Dozens of times has my own kernel code been replaced with new, better implementations in the past - and that includes large parts of the scheduler too. In fact only ~30% of current kernel/sched.c was authored by me, the rest has been written by the other 90+ scheduler contributors, according to the git-annotate output that covers the past ~2.5 years of kernel history. Beyond that numerous other people have contributed to the scheduler in the past.)
About the submitted code: it was a bit hard to review it because the new code did not contain any comments - it only included raw code - which is very uncommon for patches of such type. The email gave the theoretical background but there was little implementational detail in the patch itself connecting the theory to practice.
So to drive this issue forward i have today posted a question to Roman in form of a tiny patch that extracts only his suggested new math from his patch and applies it to CFS. If it is indeed what Roman intended then we can analyze that in isolation and in more detail. The patch is as small as it gets:
I agree - modifying a text file is a messy complicated business only suitable for the elite super hackers out there. It's much simpler for me to apply the patch and recompile the kernel.
I kid, I kid.
ok, you are kidding, but i'll still bite :-)
Firstly, the patch is mainly about modifying relatime behavior to make it more compatible and more usable.
The fact that you don't have to change fstab is no big deal, provided you have the right util-linux package installed, with the relatime user-space patch applied, which not even the latest distro devel repositories have included.
If you don't have that then adding "relatime" to your fstab might leave you with a read-only mounted root filesystem and some commandline (or rescue-image) tinkering to do.
People prefer all-in-one kernel patches that just turn on the feature they are interested in. You'd be surprised how many people are willing to try almost arbitrary kernel patches but loathe to touch their user-space environment in any way.
And... it's also kind of ironic that this relatively small patch often brings more practical benefits to the desktop than all the "big" desktop interactivity/latency features (cfs, swap-prefetch, -rt kernel) combined.
For example, your 1 gig machine only has 2^(1024*1024*1024*8) states it can go through to reach an answer, not including disk IO... and as we all know, O(2^[1024*1024*1024*8]) =~ O(10^2585827972) = O(1). :-)
You are nitpicking :-) But let me nitpick too: your 1 gig machine has _a lot more_ states than 2^(1024*1024*1024*8). It is described by a quantum wave function that has at least 10^28 particles in it, with each particle having infinite observable states. Even if we applied some common-sense granularity to the observation of the 4 coordinates of the particles in question, the total number of possible states is more on the order of 2^(10^28) than on the order of 2^(8gig) ;-)
See? What i said was totally accurate when in nitpicking mode, still it missed the big picture by being purist and it didn't bring the discussion forward even one inch. In fact it had the exact opposite effect - the only effect this paragraph had is maybe some extra global warming ;-)
What i tried to say with my O(1) comment is that even for the worst-case scenario, CFS's algorithms never go deeper than 15-20 in the tree. Which compares quite favorably to the 140 worst-case steps the "O(1) scheduler" has to take (on an architecture that has no in-hardware bit-search instruction).
So despite that you'd not try to point this out, just because the mathematical definition of its algorithm says O(log2(N))? That would plainly defeat the fundamental purpose of why we define ordo/theta notations: to be able to compare algorithms along their worst-case/best-case/average performance characteristics.
And if you'd try to point it out, wouldnt you do it similarly to how i did, by describing the worst-case behavior in words and pointing out the failure of the strict definition?
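To put rough numbers on the 15-20 figure, here is a minimal sketch of the arithmetic. It assumes a perfectly balanced binary tree (a real red-black tree can be up to about twice as deep in the worst case), and the task counts are illustrative, not measurements:

```python
def balanced_tree_depth(n_tasks: int) -> int:
    """Levels in a perfectly balanced binary tree that holds n_tasks nodes."""
    if n_tasks < 1:
        return 0
    # floor(log2(n)) + 1, computed exactly with integer arithmetic
    return n_tasks.bit_length()


if __name__ == "__main__":
    # Illustrative task counts only
    for n in (100, 32_767, 1_000_000):
        print(n, "runnable tasks ->", balanced_tree_depth(n), "levels")
```

Even a million runnable tasks stays at around 20 levels in this idealized model, which is why the logarithmic term behaves like a small constant in practice.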
There's lots of server workloads that involve large IO requests:
- backups
- DB startup/shutdown
- DB traffic that generates or reads a lot of new data (say report generation)
- HPC workloads that work with huge data sets
- animation farms that work with huge images/movies
- web servers streaming out big files
- fsck
- virtual desktop servers where the desktops are virtual instances running on the server. There any IO load within that 'desktop' runs on the server.
etc. There is also a fair number of server workloads that are IO heavy but which use small IO requests.
If you have those big files in networked storage or if you are backing them up to some network host then you've already transformed those kinds of IO requests into big IO requests on the server side as well: the big file you read or write on the desktop the network file/backup server will read/write from its own disks, etc.
Really, "interactivity sucks during big IO" kind of bugs can hurt servers just as much as they can hurt desktops. The boundary between desktops and servers is very fluid.
Yeah, that's what the discussion was about - we improved that particular case, see this commit (which can be found in v2.6.36), and Phoronix reported about that upstream fix.
Thanks,
Ingo
There's also the VM fix from Wu Fengguang, included in v2.6.36, which addresses similar "slowdown while copying large amounts of data" bugs.
There were about a dozen kernel bugs causing similar symptoms, which we fixed over the course of several kernel releases. They were almost evenly spread out between filesystem code, the VM and the IO scheduler. And yes, i agree that it took too long to acknowledge and address them - these problems have been going on for several years. It's a serious kernel development process failure.
If anyone here still experiences bad desktop stalls while handling big files with v2.6.36 too then we'd appreciate a quick bug report sent to linux-kernel@vger.kernel.org.
Thanks,
Ingo
Yeah, that's certainly a possibility.
This is also the goal of most heuristics in the kernel: to figure out a hidden piece of information that the application (and user) has not passed to the kernel explicitly.
The problem comes when the kernel gets it wrong - the kernel and applications can easily get into a feedback loop / arms race of who knows how to trick the other one into doing what the app writer (or kernel writer) thinks is best. In such cases we get the worst of both worlds: we get the bad case and we get the cost of heuristics.
(Heuristic and predictive systems also tend to be complex and hard to analyze: you can rarely reproduce bugs without having the exact same filesystem layout and usage pattern as the user experienced, etc.)
What we found is that in terms of default behavior it's a bit better to keep things simple and predictable/deterministic and then give apps the way to inject extra information into the kernel. We have the fadvise/madvise calls which can be used with the POSIX_FADV_DONTNEED flag to drop cached content from the page cache.
Heuristics and predictive techniques are done when we can be reasonably sure that we get the decisions right: for example there's a piece of fairly advanced code in the Linux page cache trying to figure out whether to pre-fetch data or not.
The large file copy interactivity problems some have mentioned here were most likely real kernel bugs (in the filesystem, IO scheduling and VM subsystems) and were hopefully fixed in the v2.6.33 - v2.6.36 timeframe.
If you can still reproduce any such problems then please report them to linux-kernel@vger.kernel.org so we can fix it ASAP.
In any case, we could all be wrong about it, so if you have a good implementation of more aggressive predictive algorithms i'm sure a lot of people would try them out - me included. We kernel developers want a better desktop just as much as you want it.
Such drastic change! I have seen this happen on numerous systems and I just change the elevator to "deadline" and poof! The problem is gone. See this discussion for some details. The CFQ scheduler is great for a Linux server running a database, but it completely sucks for desktop or any server used to write large files to.
I see that the bug entry you referred to contains measurements from early 2010, at which point Ubuntu was using v2.6.31-ish kernels IIRC. (and that's the kernel version that is being referred to in the bug entry as well.)
A lot of work has gone into this area in the past 1-2 years, and v2.6.33 is the first kernel version where you should see the improvements. Slashdot reported on that controversy as well.
If you can still reproduce interactivity problems during large file copies with CFQ on v2.6.36 (and it goes away when you switch the IO scheduler to deadline), please report it to linux-kernel@vger.kernel.org so that it can be fixed ASAP. (You can also mail me directly, i'll forward it to the right list and people.)
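For a CFQ versus deadline comparison, the active IO scheduler of a block device can be read from (and, as root, written to) /sys/block/<device>/queue/scheduler, where the active one is shown in square brackets. A minimal sketch follows; the device name "sda" and the helper names are assumptions for illustration only:

```python
from pathlib import Path
from typing import List, Optional, Tuple


def parse_schedulers(text: str) -> Tuple[Optional[str], List[str]]:
    """Parse a line such as 'noop deadline [cfq]' into (active, available)."""
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # the bracketed entry is the active scheduler
            active = token
        available.append(token)
    return active, available


def scheduler_path(device: str) -> Path:
    """sysfs file that lists and selects the IO scheduler for one device."""
    return Path("/sys/block") / device / "queue" / "scheduler"


def set_scheduler(device: str, name: str) -> None:
    """Switch the IO scheduler of a device (needs root; not called here)."""
    scheduler_path(device).write_text(name)


if __name__ == "__main__":
    path = scheduler_path("sda")  # "sda" is an assumed device name
    if path.exists():
        print(parse_schedulers(path.read_text()))
```

Running the same copy workload once per scheduler and noting which one was active makes a bug report much easier to act on.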
Thanks,
Ingo
Ingo, I find delays of 29-45ms to be pretty noticeable. To put it another way, if you had a delay of 10ms before, and you're now getting a delay of 50ms due to some background copy, all of your applications went from running at 100fps to 20fps, which I think even non-sensitive people can pick up on, even outside of games and smooth scrolling. VIM feels different over a 10ms LAN connection vs. a 45ms connection from my home.
Yes i agree with you that if a 45 msecs latency happens on every frame then that will snowball and will thoroughly ruin game interactivity - but note the specific context here:
you can see the commit referenced by Phoronix here
(hm, my first link above was broken, sorry about that.)
Those 45 msec delays were statistical-max outliers - with the average latency at 7.3 msecs. This got cut down to 25 msecs / 6.6 msecs respectively via the patch. Note that it's also a specific, CPU overloaded workload that was measured here, so not typical of the desktop unless you are a developer running make -j build jobs.
We care about optimizing maximum latencies because those are what can cause occasional hiccups on the desktop - a lagging mouse pointer - or some other non-smooth visual artifact.
Thanks,
Ingo
Sorry dude, it looks like it's a hardware specific problem. I did that on nearly 700G of large files and then fired up the flight sim while it was still going. The only slow down was on file related activity, which is totally what you'd expect. I had it running full screen across two monitors without any drop in frame rate. AND I'm using economy hardware.
It may also be kernel version dependent - with older kernels still showing this bug.
A lot of work has gone into the Linux kernel in the past 2 years to improve this area - and yes, i think much of the criticism from those who have met this bug and were annoyed by it was fundamentally justified - this bug was real and it should have been fixed sooner.
Kernels post v2.6.33 ought to be much better - with v2.6.36 bringing another set of improvements in this area. The fixes were all over the place: IO scheduler, VM and filesystem code and few of them were simple.
This Slashdot article from 1.5 years ago shows when more attention was raised to this category of Linux interactivity bugs.
Thanks,
Ingo
No, massive unfairness is just as bad on the server as it is on the desktop - in all but a few select batch processing situations.
Replace 'desktop' with 'database', 'Apache', 'Samba' or 'number crunching job' and you get the same kind of badness.
There's not much difference really. If it sucks on the desktop then it sucks on the server too: why would it be a good server if it slows down a DB/Apache/Samba/number-crunching-job while prioritizing some large copy operation?
You are right, deadline is the other (much smaller/simpler) one - CFQ is the main IO scheduler remaining.
You can still test AS by going back to an older kernel - and as long as it's a performance regression that is reported (relative to that old kernel, running AS), it should not be ignored on lkml.
Thanks,
Ingo
It would be useful if /bin/cp explicitly dropped use-once data that it reads into the pagecache - there are syscalls for that.
Other than opening the files O_DIRECT, how would you do that?
No need for O_DIRECT (which might not even work everywhere), /bin/cp could use fadvise/madvise with some size heuristics. (say if a file is larger than RAM and will be copied fully then it cannot be reasonably cached)
POSIX_FADV_DONTNEED should do the trick in terms of pagecache footprint - it will invalidate the page-cache.
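As a rough illustration of what that could look like (a sketch only, not how /bin/cp is implemented), here is a copy loop that tells the kernel the source ranges will not be needed again and drops the destination's pages once they have been written back. The chunk size and the function names are arbitrary assumptions:

```python
import os

CHUNK = 1 << 20  # 1 MiB per read/write; an arbitrary choice


def _drop_cache(fd: int, offset: int, length: int) -> None:
    """Hint that the cached pages of this range are not needed again."""
    if hasattr(os, "posix_fadvise"):  # not available on every platform
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)


def copy_use_once(src: str, dst: str) -> int:
    """Copy src to dst while keeping the pagecache footprint small."""
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(CHUNK)
            if not buf:
                break
            fout.write(buf)
            # The source range we just read is use-once data.
            _drop_cache(fin.fileno(), copied, len(buf))
            copied += len(buf)
        fout.flush()
        os.fsync(fout.fileno())  # dirty pages cannot be dropped, write them out first
        _drop_cache(fout.fileno(), 0, 0)  # length 0 means "to the end of the file"
    return copied
```

A real tool would probably sync and drop the destination in windows as well, rather than once at the end, and would only do any of this when the size heuristic says the file cannot be usefully cached.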
Thanks,
Ingo
He tried that before. I think he's given up on getting his scheduler (though perhaps not a suspiciously similar one written by Inigo) in the kernel after what happened with CFQ.
One reason for why the principle of CFS may seem to you so suspiciously similar to Con's SD scheduler is that i used Con's fair scheduling principle when writing the initial version of CFS. This is credited at the very top of today's kernel/sched.c [the scheduler code]:
* 2007-04-15 Work begun on replacing all interactivity tuning with a
* fair scheduling design by Con Kolivas.
It was added in this commit.
The implementations (and even the user-visible behavior) of the two schedulers were and are very different - and that is where much of the disagreement and later flaming came from.
Note that this particular Slashdot article is about IO scheduling though - which is unrelated to CPU schedulers. Neither Con nor i wrote IO schedulers.
There are two main IO schedulers in Linux right now: CFQ and AS, written by Jens Axboe, Nick Piggin, et al.
What adds fuel to the confusion is that it is relatively easy to mix up 'CFQ' with 'CFS'.
Thanks,
Ingo
The problem is in the IO scheduler. Switching from CFQ to AS minimizes the problem. It takes less than 5 seconds with google to see how wide spread the problem is. Lots of people squawking about it. CFQ is pure crap in my experience. Copying a couple gigabytes of data should not render all my applications unusable for five minutes.
Probably uninformed, but someone actually claimed that CFQ is designed to be tweaked with ionice. Yeah, a modern OS should require people to manually ionice every time they do a file copy!
It would be ideal for the schedulers and window manager to communicate and give priority to the foreground application.
Please help us resolve this issue: please post your experiences to linux-kernel@vger.kernel.org and Cc: the following gents:
jaxboe@fusionio.com, torvalds@linux-foundation.org, akpm@linux-foundation.org, a.p.zijlstra@chello.nl, mingo@elte.hu.
It would be nice if you could attach latencytop numbers for CFQ and for AS, using the same workload. Latencytop will measure the worst-case delays you suffered - so you can demonstrate the CFQ versus AS effect numerically.
Thanks,
Ingo
Ingo,
I believe most desktop users run into this problem when they complain about IO schedulers. Is there any immediate plan to address it?
Thanks,
Jason
Regarding plans you need to ask the VM and IO folks (Andrew Morton, Jens Axboe, Linus, et al).
Regarding that bugzilla entry, there's this suggestion in one of the comments:
echo 10 > /proc/sys/vm/vfs_cache_pressure
echo 4096 > /sys/block/sda/queue/nr_requests
echo 4096 > /sys/block/sda/queue/read_ahead_kb
echo 100 > /proc/sys/vm/swappiness
echo 0 > /proc/sys/vm/dirty_ratio
echo 0 > /proc/sys/vm/dirty_background_ratio
or use "sync" fs-mount option.
If you can reproduce that problem with a new kernel (v2.6.36 would be ideal) then please try to describe the symptoms in a mail to linux-kernel@vger.kernel.org, and also point out whether the tunings above improved things. Please Cc: Jens, Andrew, me and Linus as well.
To turn interactivity woes on the desktop into actual hard numbers you can use Arjan van de Ven's latencytop tool. It will measure your worst-case delays with and without big copies being done in the background, which numbers you can cite in your email.
Thanks,
Ingo
That's great that you post your experiences with server scheduling in a topic about desktop scheduling. It's so relevant. No wait, it's not.
The boundary between the desktop space and the server space is rather fluid, and many of the problems visible on servers are also visible on desktops - and vice versa.
For example 'copying a large amount of data' on a server is similar to 'copying a big ISO on the desktop'. If the kernel sucks doing one then it will likely suck when doing the other as well.
So both cases should be handled by the kernel in an excellent fashion - with an optimization/tuning focus on desktop workloads, because they are almost always the more diverse ones, and hence are generally the technically more challenging cases as well.
Thanks,
Ingo
That's the intention and that's even how it's coded - but for example ext3 had a bug/misfeature that caused this operation to serialize on all pending writes to the same filesystem's journal area in some pretty common scenarios - with disastrous results to interactivity.
There was a big flamewar about it on lkml a year or two ago, and in that discussion Linus declared that it is a very high priority item to fix this, and that desktop interactivity is our main optimization focus. (IIRC the flamewar was big enough to make it to Slashdot - cannot find the link right now.)
The fsync/fdatasync performance problem was fixed/resolved shortly after that.
Kernels after v2.6.32 (and certainly the latest v2.6.36 kernel) should be much better in that area.
Yes, indeed.
In terms of isolation guarantees, block cgroups (control groups) should be a feature for more formal isolation of one user from another. AFAIK Android puts each app into a separate user and into separate cgroups. So it's not impossible to design the user-space side properly. It was written as a server feature originally, but i think it's very useful on the desktop as well.
Thanks,
Ingo
Many apps or DB engines will have a similar pattern: they read/write in a relatively small buffer, but then expect the exact opposite of what you'd expect /bin/cp to do: they expect the file to stay cached (because they will read it again in the future).
So the kernel cannot know why the files are being read and written: will it be needed in the future (Firefox sqlite DB) or not (cp of a big file).
(Unfortunately, the planned mind reading extension to the kernel is still a few years out.)
Even in the specific case of /bin/cp often the files might be needed shortly after they have been copied. If you have 4 GB of RAM and you are copying a 750 MB ISO, you'd expect that ISO to stay fully cached so that the CD-writer tool can access it faster (and without wasting laptop power), right?
So in 99% of the cases it is the best kernel policy to keep around cached data as much as possible.
What makes caching wrong in the "copy huge ISO around" case is that both files are too large to fit into the cache and that cp reads and writes to the totality of both files. Since /bin/cp does not declare this in advance the kernel has no way of knowing this for sure as the operation progresses - and by the time we hit limits it's too late.
It would all be easier for the kernel if cp and dd used fadvise/madvise to declare the read-once/write-once nature of big files. It would all just work out of box. The question is, how can cp figure out whether it's truly use-once ...
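One possible size heuristic, sketched under the assumption that "larger than RAM and copied in full" is a good enough signal for use-once data. The function names and the `fraction` knob are hypothetical, not something cp or the kernel provides:

```python
import os


def physical_ram_bytes() -> int:
    """Total physical RAM as reported by sysconf (Linux and most Unixes)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def looks_use_once(file_size: int, ram_bytes: int, fraction: float = 1.0) -> bool:
    """Guess whether a fully copied file is too big to stay usefully cached.

    fraction is a hypothetical tuning knob: 1.0 means "larger than all of RAM".
    """
    return file_size > ram_bytes * fraction


if __name__ == "__main__":
    iso_size = 750 * 1024 * 1024  # the 750 MB ISO from the example above
    print(looks_use_once(iso_size, physical_ram_bytes()))
```

With 4 GB of RAM the 750 MB ISO stays cached under this rule, while a file bigger than RAM would be flagged as use-once. The heuristic still cannot know whether something else is about to read the copy, which is exactly the open question.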
The other thing that can go wrong is that arguably other apps should not be affected by this negatively - and this was the point of the article as well. I.e. cp may fill up the pagecache, but those new pages should not throw out well-used pages on the LRU, plus other write activities by other apps should not be slowed down just because there's a giant copy going on.
Those kinds of big file operations certainly work fine on my desktop boxes - so if you see such symptoms you should report it to linux-kernel@vger.kernel.org, where you will be pointed to the right tools to figure out where the bug is. (latencytop and powertop are both a good start.)
Note that i definitely could see similar problems two years ago, with older kernels - and a lot of work went into improving the kernel in this area. v2.6.35 or v2.6.36 based systems with ext3 or some other modern filesystem should work pretty well. (The interactivity break-through was somewhere around v2.6.32 - although a lot of incremental work went upstream after that, so you should try as new of a kernel as you can.)
Also, i certainly think that the Linux kernel was not desktop-centric enough for quite some time. We didn't ever ignore the desktop (it was always the primary focus for a simple reason: almost every kernel developer uses Linux as their desktop) - but the kernel community certainly under-estimated the desktop and somehow thought that the real technological challenge was on the server side. IMHO the exact opposite is true.
Fortunately, things have changed in the past few years, mostly because there's a lot of desktop Linux users now, either via some Linux distro or via Android or some of the other mobile platforms, and their voice is being heard.
Thanks,
Ingo
The -tip tree contains development patches for the next kernel version for a number of kernel subsystems (scheduler, irqs, x86, tracing, perf, timers, etc.) - and i'm glad that you like it :-)
We typically send all patches from -tip into upstream in the merge window - except for a few select fixlets and utility patches that help our automated testing. We merge back Linus's tree on a daily basis and stabilize it on our x86 test-bed - so if you want some truly bleeding edge kernel but want proof that someone has at least built and booted it on a few boxes without crashing then you can certainly try -tip ;-)
Otherwise we try to avoid -tip specials. I.e. there are no significant out-of-tree patches that stay in -tip forever - there are only in-progress patches which we try to push to Linus ASAP. If we cannot get something upstream we drop it. This happens every now and then - not every new idea is a good idea. If we cannot convince upstream to pick up a particular change then we drop it or rework it - but we do not perpetuate out-of-tree patches.
So the number of extra commits/changes in -tip fluctuates, it typically ranges from up to a thousand down to a few dozen - depending on where we are in the development cycle.
Right now we are in the first few days of the v2.6.37 merge window and Linus pulled most of our pending trees already in the past two days, so -tip contains small fixes only. While v2.6.37 is being releasified in the next ~2.5 months, -tip will fill up again with development commits geared towards v2.6.38 - and we will also keep merging back Linus's latest tree - and so the cycle continues.
Thanks,
Ingo
Right. If a desktop starts swapping seriously then it's usually game over, interactivity wise. Typical desktop apps produce so much new dirty data that it's not funny if even a small portion of it has to hit disk (and has to be read back from disk) periodically.
But please note that truly heavy swapping is actually a pretty rare event. The typical event for desktop slowdowns isn't deadly swap-thrashing per se, but two types of scenarios:
1) dirty threshold throttling: when an app fills up enough RAM with dirty data (which has to be written to disk sooner or later), then the kernel first starts a 'gentle' (background, async) writeback, and then, when a second limit is exceeded, starts a less gentle (throttling, synchronous) writeback. The defaults are 10% and 20% of RAM - and you can set them via /proc/sys/vm/dirty_background_ratio and /proc/sys/vm/dirty_ratio. To see whether you are affected by this phenomenon you can try much more aggressive values like:
echo 1 > /proc/sys/vm/dirty_background_ratio
echo 90 > /proc/sys/vm/dirty_ratio
These set async writeback to kick in ASAP (the disk can write back in the background just fine), but set the 'aggressive throttling' limit really high. This tuning might make your desktop magically faster. It may also cause really long delays if you do hit the 90% limit via some excessively dirtying app (but that's rare).
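To see what those percentages mean in absolute terms, here is a small sketch of the arithmetic. It simplifies by applying the ratios to total RAM (the kernel works from the memory it considers dirtyable, which is somewhat less), and the 4096 MiB machine is an assumed example:

```python
def dirty_thresholds_mib(ram_mib: int, background_ratio: int, dirty_ratio: int):
    """Return (background, throttle) writeback limits in MiB.

    Simplification: the percentages are applied to total RAM.
    """
    background = ram_mib * background_ratio // 100
    throttle = ram_mib * dirty_ratio // 100
    return background, throttle


if __name__ == "__main__":
    ram = 4096  # assumed example machine with 4 GiB of RAM
    print("defaults (10/20):", dirty_thresholds_mib(ram, 10, 20))
    print("tuned    (1/90): ", dirty_thresholds_mib(ram, 1, 90))
```

On that example machine the tuned values start background writeback after roughly 40 MiB of dirty data instead of roughly 400 MiB, while the hard throttling point moves from about 0.8 GiB up to about 3.6 GiB.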
2) fsync delays. A handful of key apps such as Firefox use periodic fsync() syscalls to ensure that data has been saved to disk - and rightfully so. Linux fsync() performance used to be pretty dismal (the fsync had to wait for a really long time on random writers to the disk, delaying Firefox all the time) and went through a number of improvements. If you have v2.6.36 and ext3 then it should be all pretty good.
I think a fair chunk of the "/bin/cp /from/large.iso /to/large.iso" problem could be eliminated if cp (and dd) helped the kernel and dropped the page-cache on large copies via fadvise/madvise. Linux really defaults to the most optimistic assumption: that apps are good citizens and will dirty only as much RAM as they need. Thus the kernel will generally allow apps to dirty a fair amount of RAM, before it starts throttling them.
VM and caching heuristics are tricky here - an app or DB startup sequence can produce very similar patterns of file access and IO when it warms up its cache. In that case it would be absolutely lethal to performance to drop pagecache contents and to sync them out aggressively.
If the cp app did something as simple as explicitly dropping the page-cache via the fadvise/madvise system calls then a lot of user side grief could be avoided i suspect. DVD and CD burning apps are already rather careful about their pagecache footprint.
But, if you have a good testcase you should contact the VM and IO developers on linux-kernel@vger.kernel.org - we all want Linux desktops to perform well. (server workloads are much easier to handle in general and are secondary in that aspect.) We have various good tools that allow more than enough data to be captured to figure out where delays come from (blktrace, ftrace, perf, etc.) - we need more reporters and more testers.
Thanks,
Ingo
I think the Phoronix article you linked to is confusing the IO scheduler and the VM (both of which can cause many seconds of unwanted delays during GUI operations) with the CPU scheduler.
The CPU scheduler patch referenced in the Phoronix article deals with delays experienced during high CPU loads - a dozen or more tasks running at once and all burning CPU time actively. Delays of up to 45 milliseconds were reported and they were fixed to be as low as 29 milliseconds.
Also, that scheduler fix is not a v2.6.37 item: i have merged a slightly different version and sent it to Linus, so it's included in v2.6.36 already: you can see the commit here.
If you are seeing human-perceptible delays - especially in the 'several seconds' time scale, then they are quite likely not related to the CPU scheduler (unless you are running some extreme workload) but more likely to the CFQ IO scheduler or to the VM cache management policies.
In the CPU scheduler we usually deal with milliseconds-level delays and unfairnesses - which rarely raise up to the level of human perception.
Sometimes, if you are really sensitive to smooth scheduling, you can see those kinds of effects visually via 'game smoothness' or perhaps 'Firefox scrolling smoothness' - but anything on the 'several seconds' timescale on a typical Linux desktop has to have some connection with IO.
Thanks,
Ingo
Yes. Here there is another problem at play: cp reads in the whole (big) file and then writes it out. This brings the whole file into the Linux pagecache (file cache).
That, if the VM is not fully detecting that linear copy correctly, can blow a lot of useful app data (all cached) out of the pagecache. That in turn has to be read back once you click within Firefox, etc. - which generates IO and is a few orders of magnitude slower than reading the cached copy. That such data tends to be fragmented (all around on the disk in various small files) and that there is a large copy going on does not help either.
Catastrophic slowdowns on the desktop are typically such combined 'perfect storms' between multiple kernel subsystems. (for that reason they also tend to be the hardest ones to fix.)
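The cache-eviction effect described above can be shown with a toy model. This is a strict LRU cache, which is an assumption made for simplicity (the real pagecache uses more elaborate lists), and all sizes are made-up page counts:

```python
from collections import OrderedDict


class LRUCache:
    """Toy strict-LRU page cache; not how the kernel actually does it."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pages = OrderedDict()

    def touch(self, page) -> bool:
        """Access a page; return True on a cache hit."""
        hit = page in self.pages
        if hit:
            self.pages.move_to_end(page)  # mark as most recently used
        else:
            self.pages[page] = True
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict the least recently used
        return hit


def hot_pages_left_after_copy(capacity: int, hot_pages: int, copy_pages: int) -> int:
    """Count how many of an app's cached pages survive one linear copy."""
    cache = LRUCache(capacity)
    for p in range(hot_pages):
        cache.touch(("app", p))
    for p in range(copy_pages):
        cache.touch(("copy", p))  # each page of the big file is read once
    return sum(1 for p in range(hot_pages) if ("app", p) in cache.pages)


if __name__ == "__main__":
    print(hot_pages_left_after_copy(capacity=100, hot_pages=50, copy_pages=40))
    print(hot_pages_left_after_copy(capacity=100, hot_pages=50, copy_pages=1000))
```

As long as the copy fits next to the app's pages nothing is lost, but once the copied file is larger than the cache every one of the app's pages is gone and has to be read back from disk.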
It would be useful if /bin/cp explicitly dropped use-once data that it reads into the pagecache - there are syscalls for that.
And yes, we'd very much like to fix such slowdowns via heuristics as well (detecting large sequential IO and not letting it poison the existing cache), so good bugreports and reproducing testcases sent to linux-kernel@vger.kernel.org and people willing to try out experimental kernel patches would definitely be welcome.
Thanks,
Ingo
FYI, the IO scheduler and the CPU scheduler are two completely different beasts.
The IO scheduler lives in block/cfq-iosched.c and is maintained by Jens Axboe, while the CPU scheduler lives in kernel/sched*.c and is maintained by Peter Zijlstra and myself.
The CPU scheduler decides the order of how application code is executed on CPUs (and because a CPU can run only one app at a time the scheduler switches between apps back and forth quickly, giving the grand illusion of all apps running at once) - while the IO scheduler decides how IO requests (issued by apps) reading from (or writing to) disks are ordered.
The two schedulers are very different in nature, but both can indeed cause similar looking bad symptoms on the desktop though - which is one of the reasons why people keep mixing them up.
If you see problems while copying big files then there's a fair chance that it's an IO scheduler problem (ionice might help you there, or block cgroups).
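For a quick experiment with ionice, a big copy can be started in the idle IO class (class 3), so that it only gets disk time when nothing else wants it. A minimal sketch, with hypothetical helper names and placeholder file paths; whether it helps depends on the IO scheduler honouring IO priorities:

```python
import subprocess
from typing import List


def idle_io_command(argv: List[str]) -> List[str]:
    """Prefix a command line with ionice's idle scheduling class (-c 3)."""
    return ["ionice", "-c", "3", *argv]


def run_with_idle_io(argv: List[str]) -> subprocess.CompletedProcess:
    """Run a command in the idle IO class; raises if the command fails."""
    return subprocess.run(idle_io_command(argv), check=True)


if __name__ == "__main__":
    # Placeholder paths, shown only to illustrate the resulting command line
    print(" ".join(idle_io_command(["cp", "/from/large.iso", "/to/large.iso"])))
```

If the desktop stays responsive with the copy in the idle class but not without it, that is a useful data point to include in a report.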
I'd like to note for the sake of completeness that the two kinds of symptoms are not always totally separate: sometimes problems during IO workloads were caused by the CPU scheduler. It's relatively rare though.
Analysing (and fixing ;-) such problems is generally a difficult task. You should mail your bug description to linux-kernel@vger.kernel.org and you will probably be asked there to perform a trace so that we can see where the delays are coming from.
On a related note i think one could make a fairly strong argument that there should be more coupling between the IO scheduler and the CPU scheduler, to help common desktop usecases.
Incidentally there is a fairly recent feature submission by Mike Galbraith that extends the (CPU) scheduler with a new feature which adds the ability to group tasks more intelligently: see Mike's auto-group scheduler patch
This feature uses cgroups for block IO requests as well.
You might want to give it a try, it might improve your large-copy workload latencies significantly. Please mail bug (or success) reports to Mike, Peter or me.
You need to apply the above patch on top of Linus's very latest tree, or on top of the scheduler development tree (which includes Linus's latest), which can be found in the -tip tree
(Continuing this discussion over email is probably more efficient.)
Thanks,
Ingo
Oh my gosh, the Linux scheduler is on Slashdot. Again! :-)
Frankly, this amount of interest in the Linux scheduler is certainly flattering to all of us Linux scheduler hackers, but there are certainly more important areas that need improvement: 3D support, the MM / IO schedulers, stability, compatibility, etc. (There's also the FreeBSD scheduler that went through a total rewrite recently - and it got not a single Slashdot article that i remember.)
But i digress. A couple of quick high-level points (most of the details can be found in the discussions on lkml):
I find the RFS submission interesting and useful, and i have asked the author to split the patch up a bit better, to separate the core idea from optimizations and unrelated changes - to ease review and merging of the changes, and to make the changes bisectable during QA after they have been applied to the mainstream kernel. (That is how patches are typically submitted to the Linux-kernel mailing list - it's a basic requirement before anything can be merged. CFS for example was applied to the 2.6.23 development tree in form of a series of 50 (!) separate patches. (And the scheduler works at every patching/bisection point.))
I also pointed him to the latest "bleeding edge" scheduler tree, which already implements the same non-normalized form of math and makes some of the rounding and performance arguments moot i believe. (lkml mail).
There are some issues where i disagree with Roman at the moment: even when comparing to unmodified current upstream CFS, i think Roman makes too much out of rounding behavior and i have asked him to substantiate his claims with numbers (lkml mail).
The current precision/rounding of CFS is better than one part in a million. (in fact it's currently even better than that, but i'm saying 1:1000000 here because we could in the future consciously decrease precision, if performance or simplicity arguments justify it.)
I can understand his desire towards creating interest in his patch, but IMO it should not be done by unfairly (pun unintended ;) trash-talking other people's code. The math code in CFS that achieves precision has gone through more than 5 complete rewrites already in the 20-plus CFS versions, and the current variant was not written by me but was largely authored by Thomas Gleixner and Peter Zijlstra.
New, better approaches are possible of course and the math is relatively easy to replace, due to the internal modularity of CFS. So we are keeping an open mind towards further improvements. (which includes the possibility of total replacements as well. Dozens of times has my own kernel code been replaced with new, better implementations in the past - and that includes large parts of the scheduler too. In fact only ~30% of current kernel/sched.c was authored by me, the rest has been written by the other 90+ scheduler contributors, according to the git-annotate output that covers the past ~2.5 years of kernel history. Beyond that numerous other people have contributed to the scheduler in the past.)
About the submitted code: it was a bit hard to review it because the new code did not contain any comments - it only included raw code - which is very uncommon for patches of such type. The email gave the theoretical background but there was little implementational detail in the patch itself connecting the theory to practice.
So to drive this issue forward i have today posted a question to Roman in form of a tiny patch that extracts only his suggested new math from his patch and applies it to CFS. If it is indeed what Roman intended then we can analyze that in isolation and in more detail. The patch is as small as it gets:
Firstly, the patch is mainly about modifying relatime behavior to make it more compatible and more usable.
The fact that you don't have to change fstab is no big deal, provided you have the right util-linux package installed, with the relatime user-space patch applied - which not even the latest distro devel repositories have included.
If you don't have that then adding "relatime" to your fstab might leave you with a read-only mounted root filesystem and some command-line (or rescue-image) tinkering to do.
People prefer all-in-one kernel patches that just turn on the feature they are interested in. You'd be surprised how many people are willing to try almost arbitrary kernel patches but are loath to touch their user-space environment in any way.
And ... it's also kind of ironic that this relatively small patch often brings more practical benefits to the desktop than all the "big" desktop interactivity/latency features (cfs, swap-prefetch, -rt kernel) combined.
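For readers who have not looked at what relatime actually decides, here is a rough sketch of the rule. The first two checks are plain relatime (only touch atime when it is not newer than mtime or ctime). The third check, a once-a-day refresh, is an assumption about what "more compatible" means here, modeled on the rule mainline kernels later adopted; it is not taken from the patch itself. All names (`InodeTimes`, `relatime_needs_update`, `DAY_SECONDS`) are hypothetical.

```python
from dataclasses import dataclass

# Assumed staleness threshold: refresh atime at least once per day.
DAY_SECONDS = 24 * 60 * 60


@dataclass
class InodeTimes:
    """Timestamps of one file, in seconds (hypothetical helper type)."""
    atime: float  # last access
    mtime: float  # last data modification
    ctime: float  # last inode change


def relatime_needs_update(t: InodeTimes, now: float,
                          max_age: float = DAY_SECONDS) -> bool:
    """Return True if a read at time `now` should write a new atime."""
    # Plain relatime: the file changed since it was last read.
    if t.mtime >= t.atime:
        return True
    if t.ctime >= t.atime:
        return True
    # Assumed extra rule: the stored atime is older than max_age.
    if now - t.atime >= max_age:
        return True
    # Otherwise skip the write, which is where the disk I/O is saved.
    return False
```

With the extra check, tools that compare atime against mtime or look for files unused for days keep working, while repeated reads of the same file within a day cause no inode writes.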
Hey, Slashdot posted an article about me! [ They also renamed me to Linus - what more can a geek ask for? ;-) ]
In any case, the latest version of the better-relatime patch can be picked up from:
http://redhat.com/~mingo/relatime-patches/
Apply it, build it, reboot into the new kernel and enjoy a faster (and lower latency) desktop. (no fstab twiddling needed)
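To see which atime mode a filesystem actually ended up with after the reboot, one option is to look at the options column of /proc/mounts. The sketch below parses text in that format; the function name `atime_mode`, the `SAMPLE` text and the "default" label are made up for illustration, and which option strings a given kernel prints there is an assumption.

```python
# Option names looked for in the fourth column of a /proc/mounts line.
ATIME_OPTIONS = ("noatime", "relatime")

# Made-up sample in /proc/mounts format:
# device mountpoint fstype options dump pass
SAMPLE = """\
rootfs / rootfs rw 0 0
/dev/sda1 / ext3 rw,relatime,data=ordered 0 0
/dev/sda2 /home ext3 rw,data=ordered 0 0
"""


def atime_mode(mounts_text, mount_point="/"):
    """Return 'noatime', 'relatime' or 'default' for mount_point.

    Returns None if mount_point is not listed. When a mount point
    appears more than once, the last entry wins, since later mounts
    shadow earlier ones.
    """
    mode = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        options = fields[3].split(",")
        mode = next((o for o in ATIME_OPTIONS if o in options), "default")
    return mode
```

On a live system the input would come from reading /proc/mounts, e.g. `atime_mode(open("/proc/mounts").read())`.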
You are nitpicking :-) But let me nitpick too: your 1 git machine has _a lot more_ states than 2^(1024*1024*1024*8). It is described by a quantum wave function that has at least 10^28 particles in it, with each particle having infinite observable states. Even if we applied some common-sense granularity to the observation of the 4 coordinates of the particles in question, the total number of possible states is more on the order of 2^(10^28) than on the order of 2^(8gig) ;-)
See? What i said was totally accurate when in nitpicking mode, still it missed the big picture by being purist and it didn't bring the discussion forward even one inch. In fact it had the exact opposite effect - the only effect this paragraph had is maybe some extra global warming ;-)
What i tried to say with my O(1) comment is that even for the worst-case scenario, CFS's algorithms never go deeper than 15-20 in the tree, which compares quite favorably to the 140 worst-case steps the "O(1) scheduler" has to take (on an architecture that has no in-hardware bit-search instruction).
So despite that, you'd not try to point this out just because the mathematical definition of its algorithm says O(log2(N))? That would plainly defeat the fundamental purpose of why we define ordo/theta notations: to be able to compare algorithms along their worst-case/best-case/average performance characteristics.
And if you'd try to point it out, wouldn't you do it similarly to how i did, by describing the worst-case behavior in words and pointing out the failure of the strict definition?
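The depth comparison above can be put into numbers with a small sketch. It assumes the tree is a red-black tree, for which the textbook height bound is 2*log2(N+1), and it models the bit search as a naive one-bit-at-a-time scan over 140 priority bits, which is the worst case described above for hardware without a bit-search instruction. The function names are hypothetical and none of this is kernel code.

```python
import math

# The 140 steps quoted above, one per priority level.
O1_PRIORITY_LEVELS = 140


def balanced_depth(n_tasks):
    """Depth of a perfectly balanced binary tree holding n_tasks nodes."""
    return math.ceil(math.log2(n_tasks + 1))


def rbtree_max_height(n_tasks):
    """Textbook upper bound on red-black tree height: 2*log2(N+1)."""
    return math.floor(2 * math.log2(n_tasks + 1))


def linear_bit_scan_steps(bitmap, nbits=O1_PRIORITY_LEVELS):
    """Steps a bit-by-bit scan needs to find the lowest set bit."""
    for i in range(nbits):
        if (bitmap >> i) & 1:
            return i + 1
    return nbits  # no bit set: every position was inspected


if __name__ == "__main__":
    for n in (100, 1000, 32767):
        print(n, balanced_depth(n), rbtree_max_height(n))
    # Worst case for the scan: only the last priority bit is set.
    print(linear_bit_scan_steps(1 << (O1_PRIORITY_LEVELS - 1)))
```

The point of the comparison is that the tree depth grows so slowly with the number of runnable tasks that, for any realistic task count, it stays well below the fixed 140-step worst case of the scan.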