ArsTechnica Explains O(1) Scheduler
geogeek writes "The recent release of Linux's 2.6 kernel introduces a number sweeping improvements. These can be hard to understand without a background in programming. This week's Linux.ars examines the improved scheduler for an enthusiast audience, concisely explaining its nuances and the practical effects of an O(1) design."
I read this piece yesterday, and while it did "dumb down" the basics (as the first poster noted), I thought it did a very good job of putting it all into a nutshell that those of us not as familiar with Big-O and schedulers in general might easily understand. For Linux.Ars' format, I thought it was of appropriate length, and had enough detail to "belong." I'm sure there are more detailed writeups on the O(1) scheduler in place in 2.6. Does anyone have any links?
"Aye, and if my grandmother had wheels, she'd be a wagon!" -- Montgomery Scott, ST:III
...can be found here (PDF file).
I clicked on this article expecting to see an explanation of how it was that the O(1) scheduler worked, and by what tricks it was able to schedule in O(1) rather than having to spend extra time as extra processes are added, and what the real-life effects of such a situation are.
Instead the article was just "This is what O(1) is. This is what a scheduler is. This is what "preemptive" means. The 2.6 SMP can load-balance effectively across many processors. I used this really cool mp3 player this week." and just about nothing else.
Anyone want to explain exactly how it is that an O(1) scheduler is a difficult thing, and how exactly linux 2.6 achieves it?
-- Super Ugly Ultraman
With all the recent talk about multi core processors, will this sort of thing be relegated to hardware; or will there still be a need for software scheduler?
Well, only if you want to run only as many threads as you have CPUs. The windows machine I'm using right now has 37 processes, some I'm sure with multiple threads. I suppose you could have a hardware scheduler, but I don't think it would be a good idea. Scheduling doesn't take much time (certainly not this O(1) jobby), so you would lose a lot of flexibility without a lot of gain.
autopr0n is like, down and stuff.
- Lean and mean (low overhead).
- Scales well with the number of tasks (O(1)).
- Scales well with the number of processors (O(1) for scheduling, O(N) for balancing).
- Strong affinity avoids tasks bouncing needlessly between CPUs.
- Initial affinity makes it likely that request/response-type tasks stay on the same CPU (i.e., good for LMbench lat_udp et al)
BTW, It's good to see that the starvation and affinity problems that plagued the early versions of the O(1) scheduler have been ironed out.O(1) doesn't mean the time is constant, but that in the limit it is bounded by a constant. It can get enourmous, then shrink for large inputs. It could go to 0 in fact.
Timeslices are used in order to implement preemption: a process stops either when it blocks (waiting for I/O) or otherwise voluntarily gives up the rest of its timeslice, or when it's timeslice is used up.
About the article in general: I'm surprised about the dumbing down. Other articles on Ars, including but not limited to the series about processor technologies (e.g. the one about hyperthreading) are much more thorough and detailed.
This sig under construction. Please check back later.
Mainframe schedulers have been O(1) for a long time. The one for UNIVAC 1108 EXEC 8 was O(1) in 1967. (And you can still buy a Unisys ClearPath server with that code in it.)
The traditional implementation is a set of FIFO queues, one for each priority. For systems with only a few priorities, that's just fine. As the number of different priorities increases, you spend too much time checking the empty queues, so there are tree-like structures to get around that problem.
Scheduling today is more complex because of cache issues. If you context-switch too much, you thrash in the CPU caches. On the other hand, everybody has so much memory today that paging is mostly unnecessary.
When a process uses its timeslice, the scheduler calculates a new timeslice by adding the dynamic priority bonus to the static priority. The process then gets inserted in the second list. When the first list becomes empty, the second list takes the place of the first, and vice-versa. This allows the scheduler to continuously calculate timeslices with minimal computational overhead.
Sounds like the rules to an AD&D adventure.
The benefits of preemption are great enough that the kernel developers decided to make the kernel itself preemptible. This allows a kernel task such as disk I/O to be preempted, for instance, by a keyboard event. This allows the system to be more responsive to the user's demands. The 2.6 kernel is able to efficiently manage its own tasks in addition to user processes.
One of the most time-wasting, counterproductive things for me on Windows, and to a lesser extent on Linux, is that background tasks can essentially take over the computer because of disk thrashing. Sometimes (on Windows XP) I've seen it literally take minutes to bring a window to the front and repaint it while the disk is thrashing doing god-knows-what in the background. It is impossible to get any serious work done when that is happening. Setting my process to highest priority (on Windows) doesn't seem to help when disk thrashing is involved. I can understand how it might take a few seconds to swap another process out and bring mine in, but not minutes.
Now I don't know exactly what's happening behind the scenes, but as soon as I click on a window to bring it in focus, I want everything on the computer to be devoted to doing just that until its done, regardless of what disk blocks from other processes are in the queue. My disk block requests should go in front of all of the others unconditionally. In other words I should see essentially no more latency than if the computer were completely idle in the background, other than just the minimum delay required to bring it back into memory if it's swapped out. For all I care just completely freeze all other processes on the computer until my request is finished, or at least give me the option to specify that it be done that way. My productivity should override everything else on my own desktop/workstation computer.
So, educate me about what I may be overlooking here, and will 2.6 help this? I mean for Linux of course, not Windows, which may be a lost cause (:
It's not. However, instead of having a slight pause (that scales linearly with the number of processes in-flight) after all the processes have exhausted their timeslices, there is a constant *tiny* pause after each process exhausting their timeslice. Since previously, the kernel was not preemptable, the O(n) recalculation would potentially cause, on large systems especially, a pause while the timeslices were recalculated. The new scheduler increases system responsiveness by eliminating this mass-recalculation.
I think the majority of major improvements on the 2.6 kernel over 2.4 were placed for increasing system responsiveness.
(and yes, I'm using my Karma bonus)
Let's see if I can remember anything from my algorithms class.
How is this not O(n)?
Read on for an explanation. I do believe you gave the answer to your own question in your post. I will also assume that you understand "Big-O" notation well enough to follow the answer.
Each process still gets its time slice calcuated, it is removed from one queue, and inserted into another.
Correct. You've nailed it. Insertion into a Que operates in O(1) time. Removal from the Que is also a constant time algorithm. A Que inserts at the tail of the list (or the end of the array) and De-Ques from the Head of the list (or the beginning of the array). Ques do not permit searching, sorting or accessing the data after the first element.
If we De-Que a process, run it then calculate some information and Que it up that would run in O(c1*O(1)+c2*O(1)) = O(1).
This is still an oversimplification and does not take into account the possibility of a Priority Que in which insertion runs in O(lg n) and De-Queing runs in O(1). So a heap sort using a Priority Que runs in O(n*lg n).
Hope that helps.
Windows XP, for example, isn't all that great in this area, and often one process can too easily slow things down. This is even further emphasized when all the processes you're working with are actually the main "explorer.exe" process; eg, you do something in Windows Explorer that blocks (spin up a CD drive) and everything else (the "desktop", task bar, etc) all become unresponsive...
/bin/sh and tell it to do something for you. It means that you send "events" to the currently-running EXPLORER.EXE with an idea of what you want it to do for you.
Unfortunately, even the best kernel process scheduler in the known universe would not fix this design flaw in Microsoft Windows, because what you're seeing is not a thread/process scheduling problem.
As you correctly observed, many Windows programs are forced to make "shell calls" -- which doesn't mean the same thing it does in Unix, where you fork a
The reason it bogs down when it's busy is because it is waiting on a single event queue. Mounting/unmounting of media, network lag, other processes sending/receiving messages, etc. all give EXPLORER.EXE more events to wait on. That's why it's a bottleneck, even when there's plenty of CPU to go around.
It's an awful design, and it's one of those things that's fundamentally broken about Windows.
Tired of FB/Google censorship? Visit UNCENSORED!