Preemptible Linux Kernel: Interviews and Info
An anonymous submitter sends: "MontaVista and Robert Love are developing a patch for the Linux kernel to make it fully preemptible. Lots of users are involved, and tests show huge reductions in latency. Robert's kernel patches are here. Finally, an interview with Robert, on preemption and more."
We're sorry, but tonight's "Linux" will not be aired. Normally you would find 2.4.12 or 2.4.13-pre2 on Sunday nights, but not this evening. Now that Linux is fully preemptible, NBC will be airing a four-hour music-and-ice-skating tribute to Bill Gates.
We apologize for any inconvenience, and for the reduced uptime. Enjoy the show.
JA: What tips and inspiration can you offer aspiring kernel hackers?
Robert Love: Read the source, play with the source, and bathe regularly.
All computer science labs should have available eye-wash style emergency showers.
Troll Like a Champion Today
So long as the console driver and the keyboard driver are alive, root should always be able to open a new shell and kill an offending process that is hanging the rest of the system. Right now, this is too frequently a non-option.
I am not a lawyer. Do not take my words as legal advice. If you need legal advice, consult an attorney.
The interrupt handlers can't allow premption during the context switch of an interrupt because the registers are intransit. Basically, you can't have an interrupt while your in the process of any kind of context switch otherwise you're never sure what registers you were able to flush to and from the CPU to the stack.
Critical Sections (such as access to the IP stack or I/O queues) have to be protected. With the advent of multi-processor systems under the SMP scheme, there is already considerable locking within the kernel to synchronize access of critical resources between processors. Critical regions also need to be protected from interrupt concurrent access as well.
Bottom Half handlers generaly are fast track implementations to quickly deal with the interrupts. To avoid concurrency collisions of reasources used within the bottom half handlers, interrupts (for that particular handler) must be disabled during the handler's execution.
All in all, this is basic non-preemptive stuff. What I don't understand is that this strategy that he is defining is a textbook NON-premtive approach to kernel design. I'm not too sure where he gets off claiming that the kernel is fully-preemptive here.
Someone you trust is one of us.
If you're wondering what the heck a preemptive kernel entails, then here's some background.
Also, if you don't like Robert Love's implementation, then Andrew Morton maintains a patch with a similar low-latency goal.
- Linux Devices Article detailed, very nice, on this issue from Sept 6
- Kernel Hacking How-to Page on this again, very detailed.
- The Kernel SpinLock metering FAQ
All around useful stuff, enough to get you started destro^H^H^H^H^H^H hacking your own kernel"It is a greater offense to steal men's labor, than their clothes"
I don't want to sound like I'm contradicting you, but did you happen to read this link from the article? It's specifically about realtime audio. Key paragraph:
*EXCITING* NEWS: things getting almost perfect ! Ingo's lowlatency-2.2.10-N6 patch with the shm.c part backed out and a modification of filemap.c (thanks to Roger Larsson) performs _REALLY_ well, using my usual latencytest parameters (4.3ms buffer), I got NO DROP-OUTS anymore, with sporadic maximum peaks of ONLY 2.9ms This is really exciting because it opens the doors to a whole new class of Realtime applications for Linux, simply using userspace processes scheduled SCHED_FIFO. I heard of comparable low-latencies only from BEOS, Windows can't simply guarantee these kind of latencies, not even using DirectX. Using a soft-synth on Win98 on my BOX I must use 15-20ms audio buffers to get _SOMEWHAT_ reliable audio. This is actually about more than 3-4times the buffer I used for testing under Linux ( 4.35ms).
I don't know much about the field, but the page seems to speak to several of the audio-related concerns mentioned above.
Kill, Tux, kill!
Only on slashdot would "IP stack", "I/O queues", "interrupt concurrent access" and "SMP" be considered laymans terms.
pre-emptive is a form of multi-threading. the other form is co-operative.
Co-operative means that threads relinquish control on their own. This meant that a greedy thread could put a serious stranglehold on the OS and lock-up the system, forcing a reboot.
Co-operative was used in every Mac prior to and including OS-9, which made it very unstable should a thread crash.
Pre-empt means the OS decides when the thread loses control. A thread can still voluntarily relinquish control, but the final call still comes down to the OS.
OS-X is fully pre-empt, meaning a crashed thread doesnt crash the entire system, bettering the stability overall as that will usually only crash the program that thread belonged to, not the entire system.
I dont know what MS has for their threading model, they seem to have a very bad hybrid system. The threading in Windows 95/98 tends to cause a good number of BSODs. NT/2000 OTOH, had a better model and crash a lot less often, which is why they have traditionally been the more stable MS OS.
Task scheduling has to do with what thread gets control next. Priority and other factors decide that. Solaris threads have 2^31 possible levels of priority, Windows (all versions, IIRC) has 5 classes and then 5 sub-classes of priority for each (a REALLY screwed up and tough to understand and explain technique, iow not a clear-cut 25 levels), and Java has 10 levels for cross-platform threading. Each model has their plusses and minuses, but that's getting offtopic from preemptive vs. co-operative.
The One Rule Of Chess You'll Ever Need: Don't play someone who carries a kit in their bookbag.
I originally felt I should stay out of any discussion here, but I want to answer some of these questions and clear some of this stuff up. To be honest, it is a little embarrassing having everyone read and comment on the interview. :)
Bottom Half handlers generaly are fast track implementations to quickly deal with the interrupts. To avoid concurrency collisions of reasources used within the bottom half handlers, interrupts (for that particular handler) must be disabled during the handler's execution.
Interrupts, even just the in question, are not disabled during a bottom half, at least in general. The reason we can't preempt bottom halves is that they are guaranteed to be serialized w.r.t CPUs (ie a given BH runs on only one CPU at a time). Because of this, the BHs are designed without a regard reentrancy. So we can't preempt them.
All in all, this is basic non-preemptive stuff. What I don't understand is that this strategy that he is defining is a textbook NON-premtive approach to kernel design. I'm not too sure where he gets off claiming that the kernel is fully-preemptive here.
Hardly. Would you say an SMP system is not SMP if it is non-concurrent inside critical sections? No, you wouldn't, and this is the same situation we have here with preemption. We can't preempt inside critical regions. We have concurrency and reentrancy concerns, just like SMP does. We also can't preempt inside interrupt handlers or bh's because they aren't designed to be preempted (nor would you want to interrupt the top half of an interrupt, anyhow).
The current kernel is not preemptive _anywhere_. The only way, in fact, kernel code ever yields execution is if it explicitly does so or returns. Since with the preempt-kernel patch we can now preempt in 90% of the kernel, I think its safe to say we have a preemptible kernel now.
I thought the Slack 2.0 release had a 1.1 kernel.
It could of, I just seem to remember a 1.0 kernel.
Can anyone give a nice layman's description of what he is talking about here?
Basically I am explaining the modifications to the kernel we made in order to make it preemptible. To try to put it more for the layman, besides just allowing the kernel to preempt itself as needed, we had to prevent some certain situations from being preempted. This is the same situation with SMP. We use SMP's locks to disallow preemption, for concerns of concurrency and reentrancy. We can't preempt during interrupt or BH handling because those things are not designed for concurrency, either.
To sum it up, we have to prevent preemption in some situations. Those situations are: while locks are held, while handling interrupts and bottom halves, and while inside the scheduler itself.
Can someone fill me in... Hasn't Microsoft been claiming windows has been preemptive since win95??? Is this some other form of 'preemptiveness'?
You are thinking of forms of multitasking. One form is preemptive, in which tasks are given a specific period in which to run (timeslice) and then forcibly preempted by the next runnable task when that quanta ends. Win95, NT, all Unices, and anything decent fit in here
The other form is cooperative, in which tasks run until they yield execution. This is how Win 3.1 is. In 3.1, tasks ran until they finished processing their current Windows Message or called yield().
This article is about a preemptive kernel, where actually the same ideas apply. Inside the kernel, things are currently cooperative in the sense the kernel code runs until it completes or yields control. This patch makes it preemptive -- it will be preempted when something more important needs to happen.
Win95 does not have a preemptive kernel (it isn't even reentrant). NT might. Solaris does. Linux does with this patch.
http://www.linuxdevices.com/articles/AT4185744181. html
Goofed that up.
Nice discussion, from Sept 6, with related links
[sigh]
"It is a greater offense to steal men's labor, than their clothes"
Really? That's the first time I hear about that.
;-)
There is a difference between pre-emptive multitasking and pre-emptible kernel.
Pre-emptive multitasking means that the kernel can interrupt any thread and give control to another thread, so that a thread cannot hog the CPU resources. This is what all modern operating systems do, except Windows 3.1/9x/Me and MacOS (pre- X), though it could be argued that Windows 3.1/9x/Me is not an operating system much less a modern one
Pre-emptible kernel is a different beast. It means that the kernel can interrupt itself (i.e. a thread running in the kernel mode) and give control to another thread running in the kernel mode. This is used in real-time operating systems where you need to have a guaranteed maximum response time (i.e. a thread must not wait longer than a certain amount of time before it gets control). However, this is not all that useful for general-purpose OSes and may even be detrimental to servers, where throughput matter more than response time. So it's good to know that this will be a compile-time option.
___
If you think big enough, you'll never have to do it.
Unless you're writing a game, where sounds have to happen at specific times synchronized with events on-screen. Or you're in KDE and you want a "minimize" sound effect to happen when you press the button, not a second afterward. Or you're writing a media player and you want to have an EQ that responds immediately rather than a second from now, making it a tedious chore to adjust the settings.
Large latency is very noticable in these situations. While it may sound like pointless whining to complain about the "minimize" sound effect being a second late, it really creates a bad perception in the user's mind about the speed of KDE. These things are actually important.
main(c,r){for(r=32;r;) printf(++c>31?c=!r--,"\n":c<r?" ":~c&r?" `":" #");}
Disclaimer: It's my patch
:)
I think this is a good short-term solution for the latency problems but I personally wouldn't include it in the main kernel releases. I believe that it *might* be a good idea to fork the kernel releases (temorarily) in two groups: One for servers and one for workstations until the problems have been solved.
I tend to look at this more of a long-term solution, and I think people who see it has a short-term solution or hack are missing the point. First, this is a feature. We aren't kludging kernel code so that we can lower latency by stopping it when needed. We are effectively using the SMP code to multitask better within the kernel.
Second, forking the kernel over this is a terrible idea. Since it is a config setting, this is a non-issue anyhow, but I really don't want to see this thing forked off. In fact, I think the ideal situation is where we can get a preemptible kernel that benefits throughput so that server processes benefit from it as well.
I think that (for now) using this patch on workstations is a pretty good idea
Agreed
And I think there should be a better solution for the problem witch should THEN be something along the lines of kernel 3.0
There isn't a better solution that is not a hack. There is a reason Solaris, NT, and all RTOS are preemptible inside the kernel: it is the only way to achieve real-time response. You just _have_ to be able to respond to events when needed.
The "better" solutions in this case are "simpler" -- if we can hack some conditional schedules into places, perhaps simplify some algorithms, etc. then we can perhaps reduce latency without preemption. This is what Andrew Morton's low-latency patches do. But we need more. The point is not that preempt-kernel is a hack, but that it is a whole new high-tech feature, and some people want to find a simpler solution.
Personally, I don't think a simpler solution exists, and I believe the preemptive kernel satisfies other problems (and it also a neat feature:>). Thus I work on it.
Disclaimer: It is my patch
but why does anyone need better latency? afaikt, the latency here is strictly for people who want to do RT audio effects. this has nothing to do with audio playback , which has no latency sensitivity (because of buffering). this also has nothing to do with "feel", since humans are terribly slow, and cannot possibly feel the difference between 5 and 10ms.
You ever have an mp3 skip? Audio become out of sync in a game? That is caused by scheduling latencies becoming greater than the duration of the audio buffer. Ie, audio playback does not just need x units of CPU but it also needs it every y units of time. The preempt-kernel patch helps alleviate this.
I hope that Linus will look at whether these patches hurt the normal case. "normal" means things like kernel compilation, not just an arbitrary latency measure and dbench (one of the least realistic benchmarks possible!)
Not only does preempt not hurt a kernel compile, but it helps it. I and many users have benchmarks. One of my requests from users is to get a lot of benchmarks and "feelings" so I can substantiate the patch. I am _not_ an audio guy. I use my Linux machine to code, go on the net, etc. just like 90% of the people here. Preemption helps me. I don't want to hurt the common case either.
Even so, it is a configure item. Merging it into the kernel does not equate to you having to use it. But I bet you would want to!
there are good reasons to be skeptical of all-out premptiveness: it will unavoidably lower throughput in easy-to-define cases. any intro OS text will talk about optimal scheduling, where 'optimal' requires a definition of throughput or some other metric.
The cases in which we lower throughput are cases in which file I/O is favored since it runs until completition. In this case, you can extend that argument to be that I/O-intense tasks should just be cooperatively scheduled. An I/O task won't be preempted unless its timeslice has run out (ie, it should be preempted, and it would be if it were in userspace). If the I/O is so critical, run it at a higher priority. Hell, maybe we should look into a higher timeslice.
Note that a lot of this is a non-issue, since we don't affect throughput (or actually improve it!) In the cases throughput is decreased, it is just a couple of percent, which could be cost-benefited to the increase in response some other application gets.
we need to set a target (5ms would be fine imo) and meet it. going beyond such a goal will hurt the normal case.
This is very very true, and an insightful point. One of the problems with this whole latency quest is that eventually we are going to reach some point and have to decide if enough-is-enough. We can always keep doing more and eventually the work _is_ going to be detrimental to the common-case. I agree we need to set a threshold and celebrate when we reach it. The super-special situations needing much lower latency can apply super-special solutions.
Linus himself said, that they should rather fix the CAUSE of those latencies instead of the symptoms.
Isn't that like saying that you'd rather fix all buggy applications instead of providing a protected memory environment?
I thought that what (certain) kernel hackers really objected to is preemption while locks are held. The complications (eg priority inversion) they talked about seem only to arise in that case.
There are a few reasons other hackers complain, although I didn't know this was one of them. Since MontaVista's original preemptive kernel work, I believe, we have never preempted inside of locks. Note that you can, but then you reach the issues with deadlocks and thus the need for priority-inversion that you spoke of.
So, first, does "fully-preemtive" traditionally mean with or without locks? Are Solaris, NT, and RTOS preemtible when locks are held?
I would say it means sans locks. None of the mentioned OS's are preemptive while holding a lock. You always have to respect the lock. Now, you can preempt during the lock and go do other things. If you do this, you are assuming the lock is going to be held long (or else it is favorable to just spin for a cycle or two). In this situation you want to use semaphores, which we _do_ preempt during.
When a process hits a semaphore that is in use, it goes to sleep and something else continues. The process awakes when the resource is available. Now we reach the problem you wrote of above: priority inversion. What if task A holds resource Y and sleeps waiting for resource X and task B holds resource X and sleeps waiting for resource Y? You deadlock.
Thus we need to use a type of semaphore called a priority-inheriting mutex, which inverts the priority of the task holding a resource so it will always complete and release the lock. I know Solaris has these. However, I would consider any kernel that can preempt itself in general a preemptible kernel.
Second, observed results aside, what reason do you have to believe that preempting the lock-less parts of the kernel is "good enough". All else equal, one would expect the latency distribution to be similar with and without locks, so you would expect plenty of "worst cases" to occur with locks. Of course, there is already a pressure to reduce the time that critical locks are held, but I wouldn't be surprised to see non-contended locks (especially outside the kernel core) held for long times. So is there a good reason that the important "worst cases" are happen without locks?
First, before I cast results aside, let me mention that observations show we are already lowering latency a great amount. But, you are right, periods in which locks are held are a problem. This is why I mentioned in the interview the use of things like Andrew Morton's low-latency patch, the preempt-stats patch (for finding the locks), etc.
Some of the problems still occur while locks are held, but thankfully the point of a spinlock is that they are held for a VERY short time. A solution to this may be to replace the spinlocks held for a long time with a priority-inhereting mutex.