Slashdot Mirror


ArsTechnica Explains O(1) Scheduler

geogeek writes "The recent release of Linux's 2.6 kernel introduces a number sweeping improvements. These can be hard to understand without a background in programming. This week's Linux.ars examines the improved scheduler for an enthusiast audience, concisely explaining its nuances and the practical effects of an O(1) design."

30 of 295 comments (clear)

  1. Ars' Piece by WarpFlyght · · Score: 5, Interesting

    I read this piece yesterday, and while it did "dumb down" the basics (as the first poster noted), I thought it did a very good job of putting it all into a nutshell that those of us not as familiar with Big-O and schedulers in general might easily understand. For Linux.Ars' format, I thought it was of appropriate length, and had enough detail to "belong." I'm sure there are more detailed writeups on the O(1) scheduler in place in 2.6. Does anyone have any links?

    --

    "Aye, and if my grandmother had wheels, she'd be a wagon!" -- Montgomery Scott, ST:III
    1. Re:Ars' Piece by sfe_software · · Score: 5, Informative

      Regarding "dumbing it down":

      There is one exception to this load balancing rule -- some special processes may be fixed to a certain runqueue. This attribute is called "thread affinity;" a thread is simply another name for a process.

      I think it over-simplified this a bit. Granted, Linux doesn't make all that much distinction between a "thread" and a "process" as compared to (say) Windows, but the distinction is still important.

      Otherwise, it's a decent article, and gave me some insight to one of the major improvements in 2.6 (which I've yet to try out).

      I remember when the "pre-empt" patch came out for 2.4, before it was integrated into the kernel. For a desktop (or otherwise interactive, like my media box) system, it really made a major perceived improvement in performance (while *actual* performance wasn't affected; it was just about priorities and scheduling, making it "feel" more responsive).

      The O(1) scheduling algorithm likely improves this further, and is enough to tempt me to spend XMas evening upgrading... these things are important, especially on a desktop system.

      Windows XP, for example, isn't all that great in this area, and often one process can too easily slow things down. This is even further emphasized when all the processes you're working with are actually the main "explorer.exe" process; eg, you do something in Windows Explorer that blocks (spin up a CD drive) and everything else (the "desktop", task bar, etc) all become unresponsive...

      I enjoy seeing such improvements in the Linux kernel. The pre-empt patch (later integrated into the mainstream kernel) made a drastic usability improvement (especially when the box is also a web server, MySQL server, firewall/router, etc) and I couldn't see a Windows box handling similar tasks without becoming unusable (note: not unstable, just not practical to use when responsiveness is hindered).

      I think the kernel team is on the right track with this, specifically for desktop applications (though it does help in many other applications as well; anything interactive, really, even over the network/Internet).

      --
      NGWave - Fast Sound Editor for Windows
    2. Re:Ars' Piece by Sloppy · · Score: 5, Interesting
      Windows XP, for example, isn't all that great in this area, and often one process can too easily slow things down. This is even further emphasized when all the processes you're working with are actually the main "explorer.exe" process; eg, you do something in Windows Explorer that blocks (spin up a CD drive) and everything else (the "desktop", task bar, etc) all become unresponsive...

      I actually saw this behavior once on a customer's machine. I closed the CD door, and the machine went unresponsive for several seconds. I was shocked. Then I realized: millions of people think this is normal. And those millions of people live in 2003, not 1983. It was like some sort of Lovecraftian revelation.

      --
      As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
    3. Re:Ars' Piece by ikewillis · · Score: 3, Informative

      I'm sure there are more detailed writeups on the O(1) scheduler in place in 2.6. Does anyone have any links? http://67.163.90.104/~jim/scheduler.html Unlike the Ars Technica article which seems to be nothing more than an overview of how schedulers work in general, this article goes into great technical detail about the actual implementation detauls of Linux's O(1) scheduler.

    4. Re:Ars' Piece by spectecjr · · Score: 5, Insightful

      Windows XP, for example, isn't all that great in this area, and often one process can too easily slow things down. This is even further emphasized when all the processes you're working with are actually the main "explorer.exe" process; eg, you do something in Windows Explorer that blocks (spin up a CD drive) and everything else (the "desktop", task bar, etc) all become unresponsive...

      The kernel hasn't slowed down in that case; it's the GUI layer which has slowed down. It hit a bottleneck.

      Interestingly, it would appear that Windows has used an O(1) scheduler in Windows NT and up since at least NT 4.0 (which came out in 1996).

      It's also interesting to note that it has the same priority-boosting for IO threads that Linux has.

      "Windows NT also uses additional priority adjustments based on other events, such as momentarily boosting thread priority when it returns from an I/O call"

      Detailed explanation of the NT4.0 scheduler

      Easier to read version of the above

      Part 1 of a slightly more indepth (but in some places inaccurate)* article

      Part 2 of the article

      *The article makes it sound like it's O(N) in places; it's not. The scheduler is using a set of priority lists, which are kept in-order at insertion time when a thread exits its context, by the thread inserting itself at the end of the list for its priority.

      --
      Coming soon - pyrogyra
  2. A more detailed description of O() notation... by ctrl-alt-elite · · Score: 4, Informative

    ...can be found here (PDF file).

  3. Hmm. by Anonymous Coward · · Score: 5, Interesting

    I clicked on this article expecting to see an explanation of how it was that the O(1) scheduler worked, and by what tricks it was able to schedule in O(1) rather than having to spend extra time as extra processes are added, and what the real-life effects of such a situation are.

    Instead the article was just "This is what O(1) is. This is what a scheduler is. This is what "preemptive" means. The 2.6 SMP can load-balance effectively across many processors. I used this really cool mp3 player this week." and just about nothing else.

    Anyone want to explain exactly how it is that an O(1) scheduler is a difficult thing, and how exactly linux 2.6 achieves it?

    -- Super Ugly Ultraman

    1. Re:Hmm. by AmirPC · · Score: 5, Informative

      I'm the author of that article..

      The original article was MUCH MUCH more in depth it was toned down because of the "glaze over" effect people were getting. I didn't have much say in the editorial process in the matter.

      I think this way atleast it appeals to a broader audience, not just the CS audience who would know what a binary min heap is if I were to use it.

      To each his own I suppose.

    2. Re:Hmm. by KrispyKringle · · Score: 4, Informative

      this guy says that the scheduling itself is O(1) but the balancing is O(n). I don't know this myself, but that makes a whole lot more sense; according to the ars article, the scheduler simply keeps two stacks of processes--which is can seemingly do in constant time--but of course doing the load balancing is proportionate to the number of processes (times some constant--you remove constant multipliers in big-O--to do the context switching).

    3. Re:Hmm. by AmirPC · · Score: 5, Informative

      http://67.163.90.104/~jim/scheduler.html Thats the completely unedited first draft its all I could find. That has zero editing done, and is even missing a quote up at the top. It goes further in depth, but as you can see...my unedited writing isn't so great ;)

    4. Re:Hmm. by windows · · Score: 4, Informative

      I'm not particularly familiar with the O(1) scheduler, but I know enough about it to give an explanation that I believe is accurate enough.

      There are two queues used in the Ingo scheduler. One contains processes that are ready to run. The other contains processes that have already run. The run queue is scanned, and when a process is found that can be run, it is automatically run without scanning the rest of the queue. When the process finishes its time quantum, it is put into the second queue.

      When the first queue (the active one) is empty, the queues are switched. This is just done by swapping the pointers to the queues.

      This scheduler is O(1) because it runs the first process it can run without scanning the entire queue as an O(n) scheduler would do. That's why it's O(1).

      If you want a bit more information on it, search for the "Ingo scheduler" on Google or your other search engine of choice. I don't see any detailed explanations of what happens, but I suspect as the 2.6 kernel goes into wider use, those will come.

    5. Re:Hmm. by SilentStrike · · Score: 3, Informative

      A thread is not the same thing as a process. I hope that was the editors fault too.

    6. Re:Hmm. by windows · · Score: 4, Informative

      You raise a good question. Here's a link that might explain a little better than I can.

      Your assumption is flawed, though. You seem to be assuming that only one process in the queue is runnable. We can't make that assumption that at any given time there's only one runnable process.

      I believe you have a point about the worst case of the algorithm. That's not really relevant here, though. Consider the Quicksort algorithm. It has a nasty worst case - data that's already been sorted. As a matter of fact, the worst case is an O(n^2) and approaches the slowness that is the bubble sort. Despite this well-documented worst case, we say that quicksort is an O(n*log n) algorithm, which is a more typical case.

      I believe the worst case of the Ingo scheduler is O(n). If only one process can be run, it indeed has to check n/2 processes on average, and therefore is O(n).

      You raise a good question, and figuring out what order an algorithm is, is something I'm not particularly good at. Generally the average case is what matters, but I don't know how you determine what the average case is for the state of processes.

      Anyone care to try to explain this? ;-)

    7. Re:Hmm. by Crypto+Gnome · · Score: 4, Interesting

      Personally, I *really* like the idea of split-brained articles. (ie such a shame that wasn't done here)

      By that I mean the 'default'/'main' article having the not-too-geeky/JoeSixPack-reads-slashdot version, AND includes a link to "for those who want the gory details".

      That way the editors have an article which appeals to the vast majority of "reasonably intelligent" people, as well as catering to the hardcore techo-savvy crowd. (ie wrote a brainfuck program before breakfast)

      --
      Visit CryptoGnome in his home.
  4. Yay for Ars by be-fan · · Score: 3, Interesting

    While Ars definately isn't targeted at the same audience as, say, KernelTrap, its nice to see there are a few technology websites/publications that aren't dumbed down. I remember when Byte magazine used to publish articles detailing the PowerPC architecture, down to the level of registers and the types of pipelines in the first set of implementations. Compare this to the ZD rags, which are a hair away from calling the CPU the "brain" of the computer!

    --
    A deep unwavering belief is a sure sign you're missing something...
  5. Re:Multi core proc's by autopr0n · · Score: 4, Informative

    With all the recent talk about multi core processors, will this sort of thing be relegated to hardware; or will there still be a need for software scheduler?

    Well, only if you want to run only as many threads as you have CPUs. The windows machine I'm using right now has 37 processes, some I'm sure with multiple threads. I suppose you could have a hardware scheduler, but I don't think it would be a good idea. Scheduling doesn't take much time (certainly not this O(1) jobby), so you would lose a lot of flexibility without a lot of gain.

    --
    autopr0n is like, down and stuff.
  6. Interesting by Gary+Whittles · · Score: 5, Informative
    The scheduler is an important part of the kernel, so this is an important thing to understand. This article describes it well, but just to summarize a bit more technically, the benefits of the O(1) scheduler are:
    • Lean and mean (low overhead).
    • Scales well with the number of tasks (O(1)).
    • Scales well with the number of processors (O(1) for scheduling, O(N) for balancing).
    • Strong affinity avoids tasks bouncing needlessly between CPUs.
    • Initial affinity makes it likely that request/response-type tasks stay on the same CPU (i.e., good for LMbench lat_udp et al)
    BTW, It's good to see that the starvation and affinity problems that plagued the early versions of the O(1) scheduler have been ironed out.
  7. sidebars??? by mefus · · Score: 3, Insightful

    Isn't that what a sidebar is for?

    Couldn't they have linked out into more in depth treatments, and saved the complexity for the interested (technically) reader while saving 'readability' for the non-technical ppl?

    --
    mefus
    In Open Society, GPL Software frees YOU!
  8. Ars got it slightly wrong about O(1)... by andy666 · · Score: 5, Informative

    O(1) doesn't mean the time is constant, but that in the limit it is bounded by a constant. It can get enourmous, then shrink for large inputs. It could go to 0 in fact.

  9. Preemption defined incorrectly, I think by vrt3 · · Score: 4, Insightful
    In the article, you write:
    It is important to remember that a process can be stopped in the middle of its timeslice; this is the purpose of preemption.
    The purpose of preemption is not to stop processes in the middle of their timeslice; the purpose is to stop them when the scheduler decides to, not only when the process itself decides to give another process a chance to run. The latter is called cooperative multitasking, and is used in Windows 3.x and MacOS prior to X.

    Timeslices are used in order to implement preemption: a process stops either when it blocks (waiting for I/O) or otherwise voluntarily gives up the rest of its timeslice, or when it's timeslice is used up.

    About the article in general: I'm surprised about the dumbing down. Other articles on Ars, including but not limited to the series about processor technologies (e.g. the one about hyperthreading) are much more thorough and detailed.

    --
    This sig under construction. Please check back later.
  10. UNIX catches up with mainframes by Animats · · Score: 5, Informative
    For historical reasons, revolving around the "process table" design in PDP-11 UNIX, the schedulers in UNIX-like OSs were originally very dumb. The one in V6 UNIX was so bad that if three processes were compute-bound, the scheduler would stall.

    Mainframe schedulers have been O(1) for a long time. The one for UNIVAC 1108 EXEC 8 was O(1) in 1967. (And you can still buy a Unisys ClearPath server with that code in it.)

    The traditional implementation is a set of FIFO queues, one for each priority. For systems with only a few priorities, that's just fine. As the number of different priorities increases, you spend too much time checking the empty queues, so there are tree-like structures to get around that problem.

    Scheduling today is more complex because of cache issues. If you context-switch too much, you thrash in the CPU caches. On the other hand, everybody has so much memory today that paging is mostly unnecessary.

  11. Linus Didn't steal it from SCO. by Anonymous Coward · · Score: 3, Funny

    Please stop posting SCO code here, it only gets Darl all excited.

  12. Sounds like an RPG by blixel · · Score: 4, Funny

    When a process uses its timeslice, the scheduler calculates a new timeslice by adding the dynamic priority bonus to the static priority. The process then gets inserted in the second list. When the first list becomes empty, the second list takes the place of the first, and vice-versa. This allows the scheduler to continuously calculate timeslices with minimal computational overhead.

    Sounds like the rules to an AD&D adventure.

  13. Question by wolfb · · Score: 3, Interesting


    The mechanism for recalculation of timeslices in previous Linux kernel's was very simple. When every process had its timeslice completely depleted (they were all 0) the kernel would simply go through every process and recalculate its timeslice and start execution again at the highest priority runnable process. While this is the most obvious solution it is also very inefficient, executing in O(n) time.


    Ok, its easy to see why this is O(n).

    The 2.6 scheduler uses a simple yet effective method for getting rid of this problem, it uses two priority arrays! One priority array is for processes that are runnable, and one priority array is for processes that are not runnable (they have depleted their timeslice). This way if when a process has depleted its timeslice the scheduler simply recalculates its timeslice, removes it from the active array, and inserts it into the expired array.

    How is this not O(n)? The time slice calculation still occurs for each process, just not all at once for all processes. Each process still gets its time slice calcuated, it is removed from one queue, and inserted into another. Is there some other unmentioned trick that eliminates the calculations? Or was there something else that made the 2.4 scheduler O(n), such as finding the highest priority process?

    So when all processes have depleted their timeslices there is no need to recalculate timeslices for every process, the two arrays are just switched (for the code oriented among us: they are accessed via pointers and the pointers are simply switched).

    So the calculation is done per process as they finish their time slice, rather then at the end when all the processes are done. I still don't see why this would imply better efficiency. Am I missing something?

    At any rate, thanks for the link, it was much more informative than the published article.

    1. Re:Question by Nexx · · Score: 4, Informative

      It's not. However, instead of having a slight pause (that scales linearly with the number of processes in-flight) after all the processes have exhausted their timeslices, there is a constant *tiny* pause after each process exhausting their timeslice. Since previously, the kernel was not preemptable, the O(n) recalculation would potentially cause, on large systems especially, a pause while the timeslices were recalculated. The new scheduler increases system responsiveness by eliminating this mass-recalculation.

      I think the majority of major improvements on the 2.6 kernel over 2.4 were placed for increasing system responsiveness.

      (and yes, I'm using my Karma bonus)

    2. Re:Question by krakrjak · · Score: 5, Informative

      Let's see if I can remember anything from my algorithms class.

      How is this not O(n)?

      Read on for an explanation. I do believe you gave the answer to your own question in your post. I will also assume that you understand "Big-O" notation well enough to follow the answer.

      Each process still gets its time slice calcuated, it is removed from one queue, and inserted into another.

      Correct. You've nailed it. Insertion into a Que operates in O(1) time. Removal from the Que is also a constant time algorithm. A Que inserts at the tail of the list (or the end of the array) and De-Ques from the Head of the list (or the beginning of the array). Ques do not permit searching, sorting or accessing the data after the first element.

      If we De-Que a process, run it then calculate some information and Que it up that would run in O(c1*O(1)+c2*O(1)) = O(1).

      This is still an oversimplification and does not take into account the possibility of a Priority Que in which insertion runs in O(lg n) and De-Queing runs in O(1). So a heap sort using a Priority Que runs in O(n*lg n).

      Hope that helps.

  14. Preemption and disk requests - educate me by ortholattice · · Score: 4, Insightful
    I think this may be important:

    The benefits of preemption are great enough that the kernel developers decided to make the kernel itself preemptible. This allows a kernel task such as disk I/O to be preempted, for instance, by a keyboard event. This allows the system to be more responsive to the user's demands. The 2.6 kernel is able to efficiently manage its own tasks in addition to user processes.

    One of the most time-wasting, counterproductive things for me on Windows, and to a lesser extent on Linux, is that background tasks can essentially take over the computer because of disk thrashing. Sometimes (on Windows XP) I've seen it literally take minutes to bring a window to the front and repaint it while the disk is thrashing doing god-knows-what in the background. It is impossible to get any serious work done when that is happening. Setting my process to highest priority (on Windows) doesn't seem to help when disk thrashing is involved. I can understand how it might take a few seconds to swap another process out and bring mine in, but not minutes.

    Now I don't know exactly what's happening behind the scenes, but as soon as I click on a window to bring it in focus, I want everything on the computer to be devoted to doing just that until its done, regardless of what disk blocks from other processes are in the queue. My disk block requests should go in front of all of the others unconditionally. In other words I should see essentially no more latency than if the computer were completely idle in the background, other than just the minimum delay required to bring it back into memory if it's swapped out. For all I care just completely freeze all other processes on the computer until my request is finished, or at least give me the option to specify that it be done that way. My productivity should override everything else on my own desktop/workstation computer.

    So, educate me about what I may be overlooking here, and will 2.6 help this? I mean for Linux of course, not Windows, which may be a lost cause (:

    1. Re:Preemption and disk requests - educate me by borgasm · · Score: 4, Informative

      Trashing occurs when you have to page out to virtual memory that is mapped to the hard drive. Your computer is spending more time moving data in and out of virtual memory than it is doing the actual computation.

      This usually happens when a process is in a loop, and repeatedly uses the same memory blocks, but can't keep them all in physical memory. Thus, it has to replace the pages every time it accesses/modifies them.

      Even if a process is at a low priority...if it comes on the CPU, it has the potential to trash memory and slow the works down.

      If an application is trashing, then you might be able to get away with scheduling it to not hit the disk. If the OS is trashing, you are SOL.

    2. Re:Preemption and disk requests - educate me by Halo1 · · Score: 4, Informative
      Exactly, this is the problem. I want the ability the force all (non-critical, non-realtime) processes not associated with whatever window is currently in focus to freeze completely, until the process associated with the window in focus is idle, waiting for neither CPU nor disk.
      The problem with this scheme is that you can get deadlocks. Suppose the high and low priority processes share a block of memory, and that the high priority process is waiting for a spinlock in that block of memory to be cleared (basically a variable which is currently 1 and which has to be set to 0 by the process/thread that owns the lock) by the low priority process.

      In your scheme, the low priority process will never get a chance to run, because the high priority process is running all the time. So the high priority process keeps running forever doing nothing useful.

      In other words I am suggesting a system option to do this: if the process associated with the window that's in focus is not in the idle state, make everything else freeze.
      So if your email client was checking mail in the background, its connection will time out, file sharing will completely die, background printing fails due to communication time-outs, ...
      Or even this: an option to freeze everything else unless the process associated with the window that's in focus has been idle for at least x seconds (user settable). This latter would prevent swapping out while the user is typing.
      These kind of things were possible in older systems (e.g. Mac OS 9 and below froze every other application while you were dragging a window around), but modern operating systems cannot work that way anymore. You don't want downloads to stop when they're in the background, firewalls to stop working because they're never the frontmost application etc...
      --
      Donate free food here
  15. explorer.exe == bad design outside of the kernel by IGnatius+T+Foobar · · Score: 4, Informative

    Windows XP, for example, isn't all that great in this area, and often one process can too easily slow things down. This is even further emphasized when all the processes you're working with are actually the main "explorer.exe" process; eg, you do something in Windows Explorer that blocks (spin up a CD drive) and everything else (the "desktop", task bar, etc) all become unresponsive...

    Unfortunately, even the best kernel process scheduler in the known universe would not fix this design flaw in Microsoft Windows, because what you're seeing is not a thread/process scheduling problem.

    As you correctly observed, many Windows programs are forced to make "shell calls" -- which doesn't mean the same thing it does in Unix, where you fork a /bin/sh and tell it to do something for you. It means that you send "events" to the currently-running EXPLORER.EXE with an idea of what you want it to do for you.

    The reason it bogs down when it's busy is because it is waiting on a single event queue. Mounting/unmounting of media, network lag, other processes sending/receiving messages, etc. all give EXPLORER.EXE more events to wait on. That's why it's a bottleneck, even when there's plenty of CPU to go around.

    It's an awful design, and it's one of those things that's fundamentally broken about Windows.

    --
    Tired of FB/Google censorship? Visit UNCENSORED!