The State of Linux IO Scheduling For the Desktop?

← Back to Stories (view on slashdot.org)

The State of Linux IO Scheduling For the Desktop?

Posted by timothy on Saturday October 23, 2010 @06:38AM from the in-and-out-and-in-and-out dept.

pinkeen writes "I've used Linux as my work & play OS for 5+ years. The one thing that constantly drives me mad is its IO scheduling. When I'm copying a large amount of data in the background, everything else slows down to a crawl while the CPU utilization stays at 1-2%. The process which does the actual copying is highly prioritized in terms of I/O. This is completely unacceptable for a desktop OS. I've heard about the efforts of Con Kolivas and his Brainfuck Scheduler, but it's unsupported now and probably incompatible with latest kernels. Is there any way to fix this? How do you deal with this? I have a feeling that if this issue was to be fixed, the whole desktop would become way more snappier, even if you're not doing any heavy IO in the background." Update: 10/23 22:06 GMT by T : As reader ehntoo points out in the discussion below, contrary to the submitter's impression, "Con Kolivas is still actively working on BFS, it's not unsupported. He's even got a patch for 2.6.36, which was only released on the 20th. He's also got a patchset out that I use on all my desktops which includes a bunch of tweaks for desktop use." Thanks to ehntoo, and hat tip to Bill Huey.

33 of 472 comments (clear)

It sucks I agree by Anonymous Coward · 2010-10-23 06:40 · Score: 4, Interesting

This issue got so bad for me I switched to FreeBSD.
1. Re:It sucks I agree by Anonymous Coward · 2010-10-23 07:06 · Score: 4, Insightful
  
  We switched our dedicated web servers from Linux to FreeBSD and OpenSolaris. When we upload videos (usually 10GB or larger) over our 100Mbps internet connection to the server, or a client was downloading the videos, those who were accessing the server using the web server complained it took seconds serve each web page. The videos were on a magnetic hard drives, the OS and web server was on SSDs (which was mirrored in RAM). Server logs were fine, CPU utilisation was low, the servers have 1Gbps connection. We put it down to I/O scheduling. Switching the OS solved the problems.
2. Re:It sucks I agree by ObsessiveMathsFreak · 2010-10-23 07:13 · Score: 4, Interesting
  
  This is the number one problem with all Linux installations I have ever used. The problem is most noticeable in Ubuntu where, any time one of the frequent update/tracker programs runs, the entire system will become all but unusable for several minutes.
  I don't know if it's all that related, but swap slowdown is an appalling issue as well. If a single program spikes in RAM usage, I often have to reboot the whole system as it hangs indefinitely. As I work with Octave a lot, often a script will gobble up a few hundred megs of memory and push the system into swap. Once that happens, it's often too late to do anything about it as programs simply will not respond.
  
  --
  May the Maths Be with you!
3. Re:It sucks I agree by Anonymous Coward · 2010-10-23 07:43 · Score: 4, Informative
  
  swapoff -a && swapon -a
  will force everything back into memory
4. Re:It sucks I agree by ChipMonk · 2010-10-23 08:42 · Score: 4, Informative
  
  It's more than that. Since most Linux systems use ext{2,3,4}, CFQ is designed to behave very well with them. However, XFS and JFS do better with deadline or no-op. In fact, on my Athlon 64 X2 w/ 4G RAM, using XFS and CFQ at 2.5GHz did worse than XFS and deadline at 1GHz. Yes, CFQ and XFS clash that badly.
  
  (Site pimp: I did some of my own testing, and reported on it here. I also provide basic shell scripts, so others can do their own tests.)
5. Re:It sucks I agree by makomk · 2010-10-23 09:12 · Score: 4, Informative
  
  For years I've wondered if it was just me; everyone I'd asked naturally denied any problems, when all I had to do was delete a 1GB file and I could kiss goodbye to my system for 20 seconds or so.
  That's a very well known ext2/ext3 problem - they're really slow at deleting huge files, and the amount of disk access involved in doing so slows down any other application accessing the disk. ext4 should fix the issue. (There's also another subtle bug, finally fixed in 2.6.36, where heavy disk IO can cause processes that aren't doing any IO to become unresponsive.)
6. Re:It sucks I agree by Yokaze · 2010-10-23 09:45 · Score: 4, Insightful
  
  Renaming is O(filename), usually a single table entry. Deleting is probably more along the lines of O(filesize).
  You have to keep track of the free blocks, too.
  
  --
  "Between strong and weak, between rich and poor [...], it is freedom which oppresses and the law which sets free"
7. Re:It sucks I agree by julesh · 2010-10-23 09:47 · Score: 4, Informative
  
  Depends on the paranoia of the user. FTFY.
  Any "sane" filesystem will simply unlink that entry in the directory or table.
  The only reason to be physically overwriting the entire space occupied by the 1GB file is some "super secret secure" filesystem used by people scared of having their porn browsing habits discovered by the FBI.
  The problem isn't overwriting the data, it's adding the space previously used by the file to the free space bitmap/list. For a 1GB file on an FS with 1k blocks (not uncommon), you're going to be deallocating about a million blocks. Now, unless your system is fragmented horrendously, a lot of those are going to be hits to the same bitmap block (or similar), but you're still looking at writing about 5,000 or so blocks, probably scattered over several cylinders of your disk (=> more than one seek), so on a typical hard disk the process is going to take tens or hundreds of milliseconds at best. If badly fragmented, this could easily take over a second.
8. Re:It sucks I agree by Waffle+Iron · 2010-10-23 13:18 · Score: 4, Interesting
  
  MythTV added a feature a while back to work around this issue. IIRC, they now keep a handle open to video files while they delete them. This causes the kernel to not actually do the delete, then over a span of about 10 minutes MythTV repeatedly shaves chunks off the end using truncate() until the file reaches 0 bytes.
  Prior to this, the system could get really bogged down right after deleting shows. I was careful not to delete too many shows at once; I had actually seen the back end lock up after telling it to delete a bunch of shows.
9. Re:It sucks I agree by Ingo+Molnar · 2010-10-23 21:02 · Score: 4, Informative
  
  Such drastic change! I have seen this happen on numerous systems and I just change the elevator to "deadline" and poof! The problem is gone. See this discussion for some details. The CFQ scheduler is great for a Linux server running a database, but it completely sucks for desktop or any server used to write large files to.
  I see that the bug entry you referred to contains measurements from early 2010, at which point Ubuntu was using v2.6.31-ish kernels IIRC. (and that's the kernel version that is being referred to in the bug entry as well.)
  A lot of work has gone into this area in the past 1-2 years, and v2.6.33 is the first kernel version where you should see the improvements. Slashdot reported on that controversy as well.
  If you can still reproduce interactivity problems during large file copies with CFQ on v2.6.36 (and it goes away when you switch the IO scheduler to deadline), please report it to linux-kernel@vger.kernel.org so that it can be fixed ASAP. (You can also mail me directly, i'll forward it to the right list and people.)
  Thanks,
  Ingo
10. Re:It sucks I agree by Ingo+Molnar · 2010-10-23 22:50 · Score: 4, Interesting
  
  There's also the VM fix from Wu Fengguang, included in v2.6.36, which addresses similar "slowdown while copying large amounts of data" bugs.
  There were about a dozen kernel bugs causing similar symptoms, which we fixed over the course of several kernel releases. They were almost evenly spread out between filesystem code, the VM and the IO scheduler. And yes, i agree that it took too long to acknowledge and address them - these problems have been going on for several years. It's a serious kernel development process failure.
  If anyone here still experiences bad desktop stalls while handling big files with v2.6.36 too then we'd appreciate a quick bug report sent to linux-kernel@vger.kernel.org.
  Thanks,
  Ingo
have you tried ionice? by larry+bagina · 2010-10-23 06:43 · Score: 5, Informative

have you tried ionice?

--
Do you even lift?
These aren't the 'roids you're looking for.
1. Re:have you tried ionice? by atrimtab · 2010-10-23 07:01 · Score: 5, Informative
  
  ionice works great in a terminal window, but isn't integrated into any of the Desktop GUIs.
  I suppose you could prefix the various file transfer commands used by the GUI with an added "ionice -c 3", but I haven't bothered to look.
  Using ionice to lower the i/o priority of various portions of MythTV like mythcommflag, mythtranscode, etc. can make it quite snappy.
  
  --
  Facebook is billions of individual "Skinner Boxes." And if you use it you are the pigeon!
2. Re:have you tried ionice? by JohnFluxx · 2010-10-23 09:35 · Score: 4, Informative
  
  Poor me! I added ionice integration into KDE since pretty much the dawn of time.
  In KDE, just press ctrl+esc to bring up my System Activity, right click on a process, then chose renice. You get a really pretty (imho heh) dialog letting you change the CPU or hard disk priority, scheduler, and so on.
BFS Isn't Unsupported by ehntoo · 2010-10-23 06:44 · Score: 5, Informative

Con Kolivas is still actively working on BFS, it's not unsupported. He's even got a patch for 2.6.36, which was only released on the 20th. http://ck.kolivas.org/patches/bfs/ He's also got a patchset out that I use on all my desktops which includes a bunch of tweaks for desktop use. http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/
Linux I/O scheduling by Animats · 2010-10-23 06:49 · Score: 5, Insightful

If the CPU utilization is that low, it's an I/O scheduling problem. See Linux I/O scheduling.
The CFQ scheduler is supposed to be a fair queuing system across processes, so you shouldn't have a starvation problem. Are you thrashing the virtual memory system? How much I/O is going into swapping. (Really, today you shouldn't have any swapping; RAM is too cheap and disk is too slow.)
Re:what about servers? by Anonymous Coward · 2010-10-23 06:55 · Score: 5, Informative

There are some interactive-response fixes queued up for 2.6.37 that may help (a lot!) with this stuff.
Start reading here: http://www.phoronix.com/scan.php?page=news_item&px=ODU0OQ
Is it really only a matter of scheduling? by mrjb · 2010-10-23 06:56 · Score: 4, Interesting

I've wondered on occasion if this problem is really only due to scheduling. After all, most of us still write our file access code more or less as follows: x=fopen('somefilename'); while ( !eof(x)) { print readln(x,1024); /* ---- */ } fclose(x); Point being, there's nothing that tells the marked line that the process should gracefully go to sleep while the drive is doing its thing, and there's no callback vector defined either- nothing that indicates we're dealing with non-blocking I/O. I'd like to think that our compilers have silently been improved to hide those implementation details from us, but I have no proof that this is the case. Unless the system functions use some dirty stack manipulation voodoo to extract the return address of the function and use that as callback vector?

--
Visit http://ringbreak.dnd.utwente.nl/~mrjb/growingbettersoftware to download your free copy of the book
1. Re:Is it really only a matter of scheduling? by Anonymous Coward · 2010-10-23 07:06 · Score: 5, Informative
  
  The kernel will preempt the process calling "readln", in other words putting it to sleep.
  The kernel will make sure the I/O happens, allowing other processes to work at the same time.
  You only need non-blocking code if your own process needs to other things at the same time.
2. Re:Is it really only a matter of scheduling? by Anonymous Coward · 2010-10-23 07:07 · Score: 4, Informative
  
  The process will go to sleep inside the read() system call (inside readln() somewhere presumably). Other processes will be able to run in the meantime. It works by interrupting into kernel code, and the kernel changes the stack pointer (and program counter, and lots of other registers) to that of another process. When the data comes back from the disk, the kernel will consult its tables and see that your process is runnable again, and when the scheduler decides it's its turn, in a timer interrupt, the stack pointer will be switched back to your stack. (So yes, dirty stack manipulation voodoo.) Every modern OS works this way.
3. Re:Is it really only a matter of scheduling? by Ingo+Molnar · 2010-10-23 07:55 · Score: 5, Informative
  
  Yes. Here there is another problem at play: cp reads in the whole (big) file and then writes it out. This brings the whole file into the Linux pagecache (file cache).
  That, if the VM is not fully detecting that linear copy correctly, can blow a lot of useful app data (all cached) out of the pagecache. That in turn has to be read back once you click within Firefox, etc. - which generates IO and is a few orders of magnitude slower than reading the cached copy. That such data tends to be fragmented (all around on the disk in various small files) and that there is a large copy going on does not help either.
  Catastrophic slowdowns on the desktop are typically such combined 'perfect storms' between multiple kernel subsystems. (for that reason they also tend to be the hardest ones to fix.)
  It would be useful if /bin/cp explicitly dropped use-once data that it reads into the pagecache - there are syscalls for that.
  And yes, we'd very much like to fix such slowdowns via heuristics as well (detecting large sequential IO and not letting it poison the existing cache), so good bugreports and reproducing testcases sent to linux-kernel@vger.kernel.org and people willing to try out experimental kernel patches would definitely be welcome.
  Thanks,
  Ingo
Re:Is Desktop Linux [still] relevant? by bieber · 2010-10-23 07:09 · Score: 4, Informative

That was a joke, right? You don't really think that all the millions of desktop Linux users just up and vanished because some idiot at PCWorld wanted a catchy headline?
Probably not the IO scheduler by crlf · 2010-10-23 07:13 · Score: 5, Informative

This is almost certainly not the IO scheduler's problem. IO scheduling priorities are orthogonal to CPU scheduling priorities.
What you are likely running into is the dirty_ratio limits. In Linux, there is a memory threshold for "dirty memory" (memory that is destined to be written out to disk), that once crossed, will cause symptoms like you've described. The dirty_ratio values can be tuned via /proc, but beware that the kernel will internally add its own heuristics to the values you've plugged in.
When the threshold is crossed, in an attempt to "slow down the dirtiers", the Linux kernel will penalized (in rate-limited fashion) any and every task on the system that tries to allocate a page. This allocation may be in response to userland needing a new page, but it can also occur if the kernel is allocating memory for internal data structures in response to a system call the process did. When this happens, the kernel will force that allocating thread (again, rate-limited) to take part in the flushing process, under the (misguided) assumption that whoever is allocating a lot of memory is the same thread that is dirtying a lot of memory.
There are a couple ways to work around this problem (which is very typical when copying large amounts of data). For one, the copying process can be fixed to rate limit itself, and to synchronously flush data at some reasonable interval. Another way that a system administrator can manage this sort of task (if automated of course) is to use Linux's support for memory controllers which essentially isolates the memory subsystem performance between tasks. Unfortunately, it's support is still incomplete and I don't know of any popular distributions that automate this cgroup subsystem's use.
Either way, it is very unlikely to be the IO scheduler.
IO scheduler != CPU scheduler by Ingo+Molnar · 2010-10-23 07:32 · Score: 5, Insightful

FYI, the IO scheduler and the CPU scheduler are two completely different beasts.
The IO scheduler lives in block/cfq-iosched.c and is maintained by Jens Axboe, while the CPU scheduler lives in kernel/sched*.c and is maintained by Peter Zijlstra and myself.
The CPU scheduler decides the order of how application code is executed on CPUs (and because a CPU can run only one app at a time the scheduler switches between apps back and forth quickly, giving the grand illusion of all apps running at once) - while the IO scheduler decides how IO requests (issued by apps) reading from (or writing to) disks are ordered.
The two schedulers are very different in nature, but both can indeed cause similar looking bad symptoms on the desktop though - which is one of the reasons why people keep mixing them up.
If you see problems while copying big files then there's a fair chance that it's an IO scheduler problem (ionice might help you there, or block cgroups).
I'd like to note for the sake of completeness that the two kinds of symptoms are not always totally separate: sometimes problems during IO workloads were caused by the CPU scheduler. It's relatively rare though.
Analysing (and fixing ;-) such problems is generally a difficult task. You should mail your bug description to linux-kernel@vger.kernel.org and you will probably be asked there to perform a trace so that we can see where the delays are coming from.
On a related note i think one could make a fairly strong argument that there should be more coupling between the IO scheduler and the CPU scheduler, to help common desktop usecases.
Incidentally there is a fairly recent feature submission by Mike Galbraith that extends the (CPU) scheduler with a new feature which adds the ability to group tasks more intelligently: see Mike's auto-group scheduler patch
This feature uses cgroups for block IO requests as well.
You might want to give it a try, it might improve your large-copy workload latencies significantly. Please mail bug (or success) reports to Mike, Peter or me.
You need to apply the above patch on top of Linus's very latest tree, or on top of the scheduler development tree (which includes Linus's latest), which can be found in the -tip tree
(Continuing this discussion over email is probably more efficient.)
Thanks,
Ingo
1. Re:IO scheduler != CPU scheduler by Ingo+Molnar · 2010-10-23 09:24 · Score: 4, Informative
  
  I know some of the patches have made it back into the mainline kernel, any idea when they all will be merged?
  The -tip tree contains development patches for the next kernel version for a number of kernel subsystems (scheduler, irqs, x86, tracing, perf, timers, etc.) - and i'm glad that you like it :-)
  We typically send all patches from -tip into upstream in the merge window - except for a few select fixlets and utility patches that help our automated testing. We merge back Linus's tree on a daily basi and stabilize it on our x86 test-bed - so if you want some truly bleeding edge kernel but want proof that someone has at least built and booted it on a few boxes without crashing then you can certainly try -tip ;-)
  Otherwise we try to avoid -tip specials. I.e. there are no significant out-of-tree patches that stay in -tip forever - there are only in-progress patches which we try to push to Linus ASAP. If we cannot get something upstream we drop it. This happens every now and then - not every new idea is a good idea. If we cannot convince upstream to pick up a particular change then we drop it or rework it - but we do not perpetuate out-of-tree patches.
  So the number of extra commits/changes in -tip fluctuates, it typically ranges from up to a thousand down to a few dozen - depending on where we are in the development cycle.
  Right now we are in the first few days of the v2.6.37 merge window and Linus pulled most of our pending trees already in the past two days, so -tip contains small fixes only. While v2.6.37 is being releasified in the next ~2.5 months, -tip will fill up again with development commits geared towards v2.6.38 - and we will also keep merging back Linus's latest tree - and so the cycle continues.
  Thanks,
  Ingo
Re:what about servers? by joaosantos · 2010-10-23 07:34 · Score: 5, Informative

I just did it and didn't notice any slowdown.
Re:if i have many gigs of data to copy over somewh by eldepeche · 2010-10-23 07:44 · Score: 4, Insightful

I would definitely ditch an OS that fucked up a file copy because I used the computer for something else while I was waiting.
Re:what about servers? by ksandom · 2010-10-23 08:21 · Score: 4, Interesting

Sorry dude, it looks like it's a hardware specific problem. I did that on nearly 700G of large files and then fired up the flight sim while it was still going. The only slow down was on file related activity, which is totally what you'd expect. I had it running full screen across two monitors without any drop in frame rate. AND I'm using economy hardware.

--
Funnyhacks - Wierd, unusual, and fun hacks
Re:what about servers? by Ingo+Molnar · 2010-10-23 08:33 · Score: 5, Informative

I think the Phoronix article you linked to is confusing the IO scheduler and the VM (both of which can cause many seconds of unwanted delays during GUI operations) with the CPU scheduler.
The CPU scheduler patch referenced in the Phoronix article deals with delays experienced during high CPU loads - a dozen or more tasks running at once and all burning CPU time actively. Delays of up to 45 milliseconds were reported and they were fixed to be as low as 29 milliseconds.
Also, that scheduler fix is not a v2.6.37 item: i have merged a slightly different version and sent it to Linus, so it's included in v2.6.36 already: you can see the commit here.
If you are seeing human-perceptible delays - especially in the 'several seconds' time scale, then they are quite likely not related to the CPU scheduler (unless you are running some extreme workload) but more likely to the CFQ IO scheduler or to the VM cache management policies.
In the CPU scheduler we usually deal with milliseconds-level delays and unfairnesses - which rarely raise up to the level of human perception.
Sometimes, if you are really sensitive to smooth scheduling, can see those kinds of effects visually via 'game smoothness' or perhaps 'Firefox scrolling smoothness' - but anything on the 'several seconds' timescale on a typical Linux desktop has to have some connection with IO.
Thanks,
Ingo
Re:easy solution: by DarkOx · 2010-10-23 10:41 · Score: 4, Insightful

Generally Windows runs badly without a swap. Don't listen to people who tell you to disable it. You should have a swap file on Windows no matter how much memory you have.
Tweakers who don't really understand anything about Windows paging often conclude turning off the swap is a good idea, because they only run trivial applications and don't experience certain memory backed I/O operation failing with it off. They do see an initial speed boost though. The reason is NT is very pessimistic about memory. Windows assumes you will need to page out to disk. It therefore flush the set of static pages to disk almost right away. This is why there is so much more disk thrashing on Windows than say Linux when you start an application and plenty of memory is free. It will do its best to keep the working set out of the page file of course. This does give Windows a performance advantage under memory pressure however. When there is not enough memory to start a new application Windows can just drop the pages from memory of the application being paged out without the need to flush them to disk because they are already there; Linux will need to write those pages.
Given that Windows boxes (desktops anyway) tend to have large numbers proccess running in the background so they usually are under that memory pressure.

--
Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
Re:if i have many gigs of data to copy over somewh by grepya · 2010-10-23 10:45 · Score: 4, Funny

I would definitely not let a monkey like you get near my computers if some intense file copy was going on and they wanted to start doing other things while that was going on, sure you can do it but that does not make it a prudent thing to do, and the file may copy over just fine, and it may lose a few bits without even reporting any errors and that can happen on any OS, BSD, Linux, Winders & etc...etc...etc...
You sir, are a perfect specimen of a BOFH. You only have a dim notion of what actually goes on inside those mysterious boxes that are unfortunately left under your care. And yet, by some curious accident of nature, you've been entrusted with root passwords for said boxes. You use phrases like "intense file copy" like they mean anything. You place every idiotic restriction that you can think of on the users of said boxes (who, incidentally, are almost always smarter and more qualified than you in whatever field of work they're in) by using words like "prudent" and "safety"... or god forbid... "security". You actually think that because I run a second program along with your "intense" copy, it can result in loss of "a few bits without even reporting any errors" due to what ? The magical fairies that dance inside those little chips getting angry ? Tired ? Can you do everybody a favor and reduce the amount of utter nonsense emanating out of that tiny, befuddled brain ?
Re:Perhaps if Con Kolivas named his scheduler .. by Ingo+Molnar · 2010-10-23 10:52 · Score: 4, Informative

He tried that before. I think he's given up on getting his scheduler (though perhaps not a suspiciously similar one written by Inigo) in the kernel after what happened with CFQ.
One reason for why the principle of CFS may seem to you so suspiciously similar to Con's SD scheduler is that i used Con's fair scheduling principle when writing the initial version of CFS. This is credited at the very top of today's kernel/sched.c [the scheduler code]:
* 2007-04-15 Work begun on replacing all interactivity tuning with a * fair scheduling design by Con Kolivas.
It was added in this commit.
The scheduler implementations (and even the user visible behavior) of the schedulers was and is very different - and there is where much of the disagreement and later flaming came from.
Note that this particular Slashdot article is about IO scheduling though - which is unrelated to CPU schedulers. Neither Con nor i wrote IO schedulers.
There are two main IO schedulers in Linux right now: CFQ and AS, written by Jens Axboe, Nick Piggin, et al.
What adds fuel to the confusion is that it is relatively easy to mix up 'CFQ' with 'CFS'.
Thanks,
Ingo
Re:what about servers? by Ingo+Molnar · 2010-10-23 20:38 · Score: 4, Informative

Sorry dude, it looks like it's a hardware specific problem. I did that on nearly 700G of large files and then fired up the flight sim while it was still going. The only slow down was on file related activity, which is totally what you'd expect. I had it running full screen across two monitors without any drop in frame rate. AND I'm using economy hardware.
It may also be kernel version dependent - with older kernels still showing this bug.
A lot of work has gone into the Linux kernel in the past 2 years to improve this area - and yes, i think much of the criticism from those who have met this bug and were annoyed by it was fundamentally justified - this bug was real and it should have been fixed sooner.
Kernels post v2.6.33 ought to be much better - with v2.6.36 bringing another set of improvements in this area. The fixes were all over the place: IO scheduler, VM and filesystem code and few of them were simple.
This Slashdot article from 1.5 years ago shows when more attention was raised to this category of Linux interactivity bugs.
Thanks,
Ingo