ArsTechnica Explains O(1) Scheduler
geogeek writes "The recent release of Linux's 2.6 kernel introduces a number sweeping improvements. These can be hard to understand without a background in programming. This week's Linux.ars examines the improved scheduler for an enthusiast audience, concisely explaining its nuances and the practical effects of an O(1) design."
I read this piece yesterday, and while it did "dumb down" the basics (as the first poster noted), I thought it did a very good job of putting it all into a nutshell that those of us not as familiar with Big-O and schedulers in general might easily understand. For Linux.Ars' format, I thought it was of appropriate length, and had enough detail to "belong." I'm sure there are more detailed writeups on the O(1) scheduler in place in 2.6. Does anyone have any links?
"Aye, and if my grandmother had wheels, she'd be a wagon!" -- Montgomery Scott, ST:III
With all the recent talk about multi core processors, will this sort of thing be relegated to hardware; or will there still be a need for software scheduler?
I setup a small computer today with 2.6. It boots from an initrd but I noticed during bootup the kernel will pause for a few seconds with:
checking if image is initramfs...it isn't (no cpio magic); looks like an initrd
I couldn't google much info on it. Anybody know more about it? Or how I can stop the kernel checking for it.
...can be found here (PDF file).
I clicked on this article expecting to see an explanation of how it was that the O(1) scheduler worked, and by what tricks it was able to schedule in O(1) rather than having to spend extra time as extra processes are added, and what the real-life effects of such a situation are.
Instead the article was just "This is what O(1) is. This is what a scheduler is. This is what "preemptive" means. The 2.6 SMP can load-balance effectively across many processors. I used this really cool mp3 player this week." and just about nothing else.
Anyone want to explain exactly how it is that an O(1) scheduler is a difficult thing, and how exactly linux 2.6 achieves it?
-- Super Ugly Ultraman
While Ars definately isn't targeted at the same audience as, say, KernelTrap, its nice to see there are a few technology websites/publications that aren't dumbed down. I remember when Byte magazine used to publish articles detailing the PowerPC architecture, down to the level of registers and the types of pipelines in the first set of implementations. Compare this to the ZD rags, which are a hair away from calling the CPU the "brain" of the computer!
A deep unwavering belief is a sure sign you're missing something...
- Lean and mean (low overhead).
- Scales well with the number of tasks (O(1)).
- Scales well with the number of processors (O(1) for scheduling, O(N) for balancing).
- Strong affinity avoids tasks bouncing needlessly between CPUs.
- Initial affinity makes it likely that request/response-type tasks stay on the same CPU (i.e., good for LMbench lat_udp et al)
BTW, It's good to see that the starvation and affinity problems that plagued the early versions of the O(1) scheduler have been ironed out.That was kinda interesting though it doesn't really explain how they achieve the O(1)ness, and for example how did kernel 2.4 fare in comparison. Anyway, an O(1) scheduler is nice and all under low to moderate loads, but at high loads, when a lot of threads are running, the O(1) behaviour should enusre that the tasks are switching extremely fast, and as every context switch has some related overhead in terms of saving the context etc., that means that performance of the O(1) schedular should be worse than that of 2.4(where tasks are not switching as fast) in theory atleast. So is this really true?
What's under yellowstone?
Isn't that what a sidebar is for?
Couldn't they have linked out into more in depth treatments, and saved the complexity for the interested (technically) reader while saving 'readability' for the non-technical ppl?
mefus
In Open Society, GPL Software frees YOU!
O(1) doesn't mean the time is constant, but that in the limit it is bounded by a constant. It can get enourmous, then shrink for large inputs. It could go to 0 in fact.
Thanks; that third paragraph alone was much, much more informative than the linked article. Perhaps, if the editors can be bothered to edit a bit more (unlikely), could Ars provide links to these more technical articles?
Timeslices are used in order to implement preemption: a process stops either when it blocks (waiting for I/O) or otherwise voluntarily gives up the rest of its timeslice, or when it's timeslice is used up.
About the article in general: I'm surprised about the dumbing down. Other articles on Ars, including but not limited to the series about processor technologies (e.g. the one about hyperthreading) are much more thorough and detailed.
This sig under construction. Please check back later.
www.usenix.org/publications/library/proceedings/a
When it comes time to select a process, the scheduler finds the highest priority level containing available processes. It then selects a process from the desired priority level. Because the number of priority levels is fixed, the scheduler always takes a constant amount of time.
Yes, but it does need to inspect all the processes within a priority level, right??
As the total number of processes grows, the average number of processes within a priority level also grows, and so does scheduling time. So it's not O(1).
Unless choosing a process from a priority level is also constant time -- say, if it were a round-robin queue.
What am I missing?
Mainframe schedulers have been O(1) for a long time. The one for UNIVAC 1108 EXEC 8 was O(1) in 1967. (And you can still buy a Unisys ClearPath server with that code in it.)
The traditional implementation is a set of FIFO queues, one for each priority. For systems with only a few priorities, that's just fine. As the number of different priorities increases, you spend too much time checking the empty queues, so there are tree-like structures to get around that problem.
Scheduling today is more complex because of cache issues. If you context-switch too much, you thrash in the CPU caches. On the other hand, everybody has so much memory today that paging is mostly unnecessary.
Please stop posting SCO code here, it only gets Darl all excited.
When a process uses its timeslice, the scheduler calculates a new timeslice by adding the dynamic priority bonus to the static priority. The process then gets inserted in the second list. When the first list becomes empty, the second list takes the place of the first, and vice-versa. This allows the scheduler to continuously calculate timeslices with minimal computational overhead.
Sounds like the rules to an AD&D adventure.
Err. Oops. I blame Soviet Russia.
The mechanism for recalculation of timeslices in previous Linux kernel's was very simple. When every process had its timeslice completely depleted (they were all 0) the kernel would simply go through every process and recalculate its timeslice and start execution again at the highest priority runnable process. While this is the most obvious solution it is also very inefficient, executing in O(n) time.
Ok, its easy to see why this is O(n).
The 2.6 scheduler uses a simple yet effective method for getting rid of this problem, it uses two priority arrays! One priority array is for processes that are runnable, and one priority array is for processes that are not runnable (they have depleted their timeslice). This way if when a process has depleted its timeslice the scheduler simply recalculates its timeslice, removes it from the active array, and inserts it into the expired array.
How is this not O(n)? The time slice calculation still occurs for each process, just not all at once for all processes. Each process still gets its time slice calcuated, it is removed from one queue, and inserted into another. Is there some other unmentioned trick that eliminates the calculations? Or was there something else that made the 2.4 scheduler O(n), such as finding the highest priority process?
So when all processes have depleted their timeslices there is no need to recalculate timeslices for every process, the two arrays are just switched (for the code oriented among us: they are accessed via pointers and the pointers are simply switched).
So the calculation is done per process as they finish their time slice, rather then at the end when all the processes are done. I still don't see why this would imply better efficiency. Am I missing something?
At any rate, thanks for the link, it was much more informative than the published article.
To achieve O(1) you usually have to do some tricks that will get you quite a large constant.
I dont think that is necessarily true. You can (as they are doing here) trade memory use for speed by enumerating a bucket for every possible option and having a method to decide which bucket to look in. The memory requirements change for the number of options but the speed of the decision stays at a (low constant) O(1).In general, its a matter of building clever, info-rich lookup keys and organizing the data into a data structure that has the same access time for each key.
However, from http://67.163.90.104/~jim/scheduler.html
Basically a function (find_busiest_queue) is called which will return another runqueue if there exists a runqueue that has 25%+ more runnable processes than the current process. If there is such a runqueue then the runqueues are locked (via each runqueues spinlock) and processes are transfered from the busier runqueue to the current runqueue, then the runqueues are unlocked.
Since I think people lump the loadbalancing portion of an SMP system in with the "scheduler" then clearly the overall scheduling and running of processes is not O(1).
I haven't heard of/read about Sun/HP/IBM/SCO* speaking about their SMP system schedulers and runqueue balancers separately. Anyone? Anyone?
* Yeah, that is a joke, because SCO would never talk about their scheduler (scheduler headers files, OK, yeah, maybe in an open letter to rich CIO's) since they already have the Linux 2.6 O(1) scheduler in their secret underground stash of "scanned into TIFFs and copied onto CD's" source code - Ha!
The benefits of preemption are great enough that the kernel developers decided to make the kernel itself preemptible. This allows a kernel task such as disk I/O to be preempted, for instance, by a keyboard event. This allows the system to be more responsive to the user's demands. The 2.6 kernel is able to efficiently manage its own tasks in addition to user processes.
One of the most time-wasting, counterproductive things for me on Windows, and to a lesser extent on Linux, is that background tasks can essentially take over the computer because of disk thrashing. Sometimes (on Windows XP) I've seen it literally take minutes to bring a window to the front and repaint it while the disk is thrashing doing god-knows-what in the background. It is impossible to get any serious work done when that is happening. Setting my process to highest priority (on Windows) doesn't seem to help when disk thrashing is involved. I can understand how it might take a few seconds to swap another process out and bring mine in, but not minutes.
Now I don't know exactly what's happening behind the scenes, but as soon as I click on a window to bring it in focus, I want everything on the computer to be devoted to doing just that until its done, regardless of what disk blocks from other processes are in the queue. My disk block requests should go in front of all of the others unconditionally. In other words I should see essentially no more latency than if the computer were completely idle in the background, other than just the minimum delay required to bring it back into memory if it's swapped out. For all I care just completely freeze all other processes on the computer until my request is finished, or at least give me the option to specify that it be done that way. My productivity should override everything else on my own desktop/workstation computer.
So, educate me about what I may be overlooking here, and will 2.6 help this? I mean for Linux of course, not Windows, which may be a lost cause (:
(methinks) Someone has a little something stuck in his craw.
Could there possibly be a reason for an offtopic mod? Lets see:
-- Article/thread about an article about the linux scheduler.
-- Respondent asks a technical support question having nothing to do with the article referenced or the scheduler in general.
-- Respondent gets moded down for jumping off topic because this isn't a Linux support forum nor is it generally an audience who would want it to *become* a support forum.
-- Someone who doesn't like to see a lot of the linux-centricisim in the forum *complains* BECAUSE others who don't want to read a technical support thread mod the offtopic post... "offtopic."
===
As a side note, you rigorous thinker you, a question or comment in the range of "I don't know why" ("this pause here with that message there confuses me", et. al.) indicates that the user wants information about his system and is not prima facia evidence that there is something "making Linux look bad" in any meaningful sense.
Consider: Sort of like saying that a single customer asking "what are the ingreedents of the 'special sauce', cause it feels oily?" must automatically make McDonalds "look bad" for filling the oily sauce role with an oily sauce not understood by a single questioner.
If you took the time to gather one single iota of information you would know that the message he doesn't like is in the startup sequence and happens only once (per boot) and lasts a couple of seconds at most.
If, on the other hand, he even vaguely intimated that the message lasted X seconds and was repeated every N seconds, then there would be some "looking bad" to be had here. If he could *FURTHER* even remotely tie this to the scheduler then the post would suddnely not be "offtopic".
Shame none of that happened. Which kind of leaves you wearing the inflatable rubber ducky on the subway there dude. You probably shoud stop yelling about the rising watter too. (IF you can follow a cultural metaphore there dude. 8-)
====
Point of (bonus) fact, there wasn't enough information in the question to even address an answer. Even so, (and without even checking), if the initrd (etc) system is compiled in there but the user isn't using initrd then he should compile the kernel without this bit. If he *IS* using the initrd image then it is kind of normal that the unpacking and mounting of the initrd image would cause a pause at that point because the kernel is working on something that everything else has to wait for.
So, nothing to see here folks, conspiracy theorist presenting tinfoil-hat theory about an demonstrably offtopic post getting modded down for being demostrably off topic. These arn't the droids your looking for. Move along... 8-)
====
Not hard to see why you posted AC...
Innocent people shouldn't be forced to pay for inferior software development.
--"Code Complete" Microsoft Press
...but wouldnt' the fact that only one process is operated on at a time (i.e. a fixed amount of computing) as opposed to scanning the enitre list of processes (could be one...could be thousands) each time make the difference between O(n) and O(1)? Sure, things will be slower if you have 10,000 things to check as opposed to one, but at least here we're not checking 10,000 things for each and every one in the list...
(IANAKH)
CAn'T CompreHend SARcaSm?
Without explaining the O() of the 2.4 kernel, what good is an article explaining that 2.6 has a O(1)?
LordBodak's journal.
The article was okay, I guess. Very, very elementary. I would assert that most schedulers are O(1), though. Its really nothing new. This article was probably good for someone who never heard of a scheduler or worked with one.
Although the article was about scheduling, I would have like to have seen some more text about Big-O notation and why it is important to programmers. To me, any good programmer should have a good concept of Big-O notation, how it relates to traversing data structures (such as trees), et cetera.
The article did not make a point as to why good schedulers use priority queues. Which is simple, really. All priority queues are O(1) for getting the next item in the queue. It has nothing to do with scheduling, but it is the nature of the priority queue itself... as a data structure.
I would have liked to have seen the article in its whole form. I wonder if it went into more depth about Big-O, et cetera.
Windows XP, for example, isn't all that great in this area, and often one process can too easily slow things down. This is even further emphasized when all the processes you're working with are actually the main "explorer.exe" process; eg, you do something in Windows Explorer that blocks (spin up a CD drive) and everything else (the "desktop", task bar, etc) all become unresponsive...
/bin/sh and tell it to do something for you. It means that you send "events" to the currently-running EXPLORER.EXE with an idea of what you want it to do for you.
Unfortunately, even the best kernel process scheduler in the known universe would not fix this design flaw in Microsoft Windows, because what you're seeing is not a thread/process scheduling problem.
As you correctly observed, many Windows programs are forced to make "shell calls" -- which doesn't mean the same thing it does in Unix, where you fork a
The reason it bogs down when it's busy is because it is waiting on a single event queue. Mounting/unmounting of media, network lag, other processes sending/receiving messages, etc. all give EXPLORER.EXE more events to wait on. That's why it's a bottleneck, even when there's plenty of CPU to go around.
It's an awful design, and it's one of those things that's fundamentally broken about Windows.
Tired of FB/Google censorship? Visit UNCENSORED!