Removing the Big Kernel Lock

← Back to Stories (view on slashdot.org)

Posted by CmdrTaco on Saturday May 17, 2008 @04:13AM from the wait-i-thought-locks-made-it-secure dept.

Corrado writes "There is a big discussion going on over removing a bit of non-preemptable code from the Linux kernel. 'As some of the latency junkies on lkml already know, commit 8e3e076 in v2.6.26-rc2 removed the preemptable BKL feature and made the Big Kernel Lock a spinlock and thus turned it into non-preemptable code again. "This commit returned the BKL code to the 2.6.7 state of affairs in essence," began Ingo Molnar. He noted that this had a very negative effect on the real time kernel efforts, adding that Linux creator Linus Torvalds indicated the only acceptable way forward was to completely remove the BKL.'"

18 of 222 comments (clear)

Looks like "Worse is Better" all over by paratiritis · 2008-05-17 04:33 · Score: 5, Insightful

Worse is Better (also here) basically says that fast (and crappy) approaches dominate in fast-moving software, because they may produce crappy results, but they allow you to ship products first.
That's fine, but once you reach maturity you should be trying to do the "right thing" (the exact opposite.) And the Linux kernel has reached maturity for quite a while now.
I think Linus is right on this.
Re:Translation? by Burdell · 2008-05-17 04:48 · Score: 5, Informative

When the Linux kernel first supported multiprocessor systems, it was done with a single lock protecting access to all the kernel (the Big Kernel Lock); the kernel could still only do one thing at a time. Over time, most sections of the kernel have introduced their own fine-grained locking and moved out from under the BKL, allowing many parts of the kernel to be running at the same time on multiple processors. The BKL has shrunk over time, but it still exists over a chunk of the kernel. The kernel hackers recently tried to replace the hard lock with a preemptable lock, but that had some bad interactions with the scheduler (which determines what process/kernel thread runs when), so Linus switched back to the old-style BKL.

Now, a group is trying to see if it is possible to weed out all the remaining uses of the BKL and replace them with localized locking for specific sections of the kernel. This is tricky, as there are side-effects of the BKL that are not always obvious.
Re:Translation? by Anonymous Coward · 2008-05-17 04:52 · Score: 5, Informative

Ok so here's the deal:
Linux is a preemptive multi-tasking kernel. What this means is that a hardware interrupt like a keyboard click or the system timer will interrupt whatever is currently running on the CPU, and an interrupt handler in the kernel starts running code. In order to make sure that all the states of the kernel are consistent (ie: not corrupt), the different parts of the kernel are supposed to lock the data that they are using or modifying (ie, readlock or writelock) in case another code path gets run at the same time trying to modify the same data. It becomes even more important in a multi-cpu environment where locks have to be atomic (happen at the same time on all CPUs). So what you are supposed to do is only lock the resources you currently need (a file system drivers would only lock parts of the filesystem, not a character device). Because some programmers are lazy, or not sure what they are doing, they just use the big kernel lock which locks pretty much everything in the kernel. This is bad for multi-tasking and multi-processing because it means you can only have one codepath using the lock at a time.

Note: it's been a while since I've done kernel work, so I'm sure this is not 100% true, but hope it helps you understand.
Re:I don't understand by diegocgteleline.es · 2008-05-17 04:53 · Score: 5, Informative

Because these days the BKL is barely used in the kernel core, or so Linus says: the core kernel, VM and networking already don't really do BKL. And it's seldom the case that subsystems interact with other unrelated subsystems outside of the core areas. IOW, it's rare to hit BKL contention - and in those cases, you want the contention period to be as short as possible. And spinlocks are the faster locking primitive, so making the BKL a spinlock (which is not preemptable) makes the BKL contention periods faster. A mutex/spinlock brings you "preemptability" and hides a bit the fact that there's a global lock being used sometimes at the expense of performance, which may be a good thing for RT/lowlatency users, but apparently Linus prefers to choose the solution that is faster and doesn't hid the real problem.
Punchline by Anonymous Coward · 2008-05-17 05:03 · Score: 5, Informative

Since the summary doesn't cut to the chase, and the article was starting to get a little boring and watered-down, I read Ingo's post and here's what I got from it: the BKL is released in the scheduler, so a lot of code is written that grabs the lock and assumes it will be released later, which is bad. Giving it the usual lock behavior of having explicit release will break lots of code. Ingo created a new branch that does this necessary breakage so that the broken code can be detected and fixed. He wants people to test this "highly experimental" branch and report error messages and/or fixes.

Assuming everything is stable and correct, the next step is to break the BKL into locks with finer granularity so that the BKL can go the way of the dodo.
Comment removed by account_deleted · 2008-05-17 05:07 · Score: 5, Interesting

Comment removed based on user account deletion
Re:Translation? by Anonymous Coward · 2008-05-17 05:25 · Score: 5, Informative

"That part of the code" is the difficult part. The BKL assumption is present in thousands of place all around the kernel, and nobody really know where. You can have two pieces of code, that looks totally unrelated, that happen to work because in all the code path leading to them the BKL is taken. Removing the BKL and "code it all over again" will create this new race condition.

There would be thousands of such, and you'll probably never succeed in debugging it.

The approach suggested in the article is to replace the BKL by a true lock, then "pushing it down", which means understanding WHY that code want the BKL, and get smaller locks instead in subroutines.

For instance, one piece of code could take the BKL because it will change 3 data structure. You could then remove the BKL and use, in the 3 part of code that changes those 3 structure, and use a finer grained lock for each of those.

By iterating this way, you should always get a somewhat working kernel, and slowly kill the BKL.
Keep these on Front Page... by TheNetAvenger · 2008-05-17 05:28 · Score: 5, Insightful

Keep these on Front Page...

This is the type of stuff that needs to be kept in the news, as the people who post here often have no understanding of, and the ones that do, have the opportunity to explain this stuff, bringing everyone up to a better level of understanding.

Maybe if we did this, real discussions about the designs and benefits of all technologies could be debated and referenced accruately.. Or even dare say, NT won't have people go ape when someone refers to a good aspect of its kernel design.
Bad interaction with the generic semaphores. by Anonymous Coward · 2008-05-17 05:37 · Score: 5, Informative

The recent semaphore consolidation assumed that semaphores are not timing critical. Also it made semaphores fair. This interacted badly with the BKL (see [1]) which is a semaphore.

The consensus was to not revert the generic semaphore patch, but to fix it another way. Linus decided on a path that will make people focus on removing the BKL rather than a workaround in the generic semaphore code. Also, Linus doesn't think that the latency of the non-preemptable BKL is too bad [2].

[1] http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-05/msg03526.html
[2] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8e3e076c5a78519a9f64cd384e8f18bc21882ce0
Re:Translation? by LordNimon · 2008-05-17 05:39 · Score: 5, Informative

Wouldn't it be easier and mainly better to start all over?
No.
You know, like, remove that part of the code and code it all over again, see what is broken, and continue this way?
It's not that simple. When it comes to locking, there is no "part of the code" that can be replaced. Locking governs interaction between two pieces of code, sometimes two pieces that are very different but have some small thing in common.
Besides, the kernel is too big to just start throwing parts of it out and redoing them from scratch. It's much better to make incremental improvements, because then the people working on them will actual learn how to solve the problem. The BKL is not just a coding problem, but also a people and project management problem.

--
And the men who hold high places must be the ones who start
To mold a new reality... closer to the heart
Re:Fascinating. by ResidntGeek · 2008-05-17 06:07 · Score: 5, Insightful

Slashdot's not supposed to be interesting to every reader all the time. If you want someone to cater to a least common denominator, you'd be better off somewhere else.

--
ResidntGeek
(Performance != Speed) // in an RT system by Arakageeta · 2008-05-17 06:14 · Score: 5, Insightful

That's a terrible excuse. There are many applications where a real-time Linux kernel is highly desired. Besides, it is important to note that real time systems do not focus on speed. This is a subtle difference from "performance" which usually caries speed as a connotation; it doesn't for a real time system. The real time system's focus is on completing tasks by the time the system promised to get them done (meeting scheduling contracts). It's all about deadlines, not speed. So from this point of view, the preemptible BKL, even with the degraded speed, could still be viewed as successful for a real time kernel.
Comment removed by account_deleted · 2008-05-17 06:17 · Score: 5, Interesting

Comment removed based on user account deletion
Re:Sounds like the Linux kernel needs some tests.. by pherthyl · 2008-05-17 06:28 · Score: 5, Insightful

Whatever your large project is, I'm willing to bet it's nowhere near as complex as the kernel. Whenever you get the feeling that they must have missed something that seems obvious, you're probably the one that's wrong. No offense, but they have a lot more experience dealing with unique kernel issues than you do.

You talk about unit testing, but how exactly are you going to unit test multi-threading issues? This is not some simple problem that you can run a test/fail test against. These kinds of things can really only be tested by analysis to prove it can't fail, or extensive fuzz testing to get it "good enough"..
This is why monolithic kernels do real-time badly by Animats · 2008-05-17 06:30 · Score: 5, Insightful

This task is not easy at all. 12 years after Linux has been converted to an SMP OS we still have 1300+ legacy BKL using sites. There are 400+ lock_kernel() critical sections and 800+ ioctls. They are spread out across rather difficult areas of often legacy code that few people understand and few people dare to touch.
This is where microkernels win. When almost everything is in a user process, you don't have this problem.
Within QNX, which really is a microkernel, almost everything is preemptable. All the kernel does is pass messages, manage memory, and dispatch the CPUs. All these operations either have a hard upper bound in how long they can take (a few microseconds), or are preemptable. Real time engineers run tests where interrupts are triggered at some huge rate from an external oscillator, and when the high priority process handling the interrupt gets control, it sends a signal to an output port. The time delay between the events is recorded with a logic analyzer. You can do this with QNX while running a background load, and you won't see unexpected delays. Preemption really works. I've seen complaints because one in a billion interrupts was delayed 12 microseconds, and that problem was quickly fixed.
As the number of CPUs increases, microkernels may win out. Locking contention becomes more of a problem for spinlock-based systems as the number of CPUs increases. You have to work really hard to fix this in monolithic kernels, and any badly coded driver can make overall system latency worse.
Re:I don't understand by QX-Mat · 2008-05-17 08:12 · Score: 5, Informative

There are a few added benefits from a fully real-time OS that most people gloss over.

For example, you will *know* your PC will never become utterly unusable to the point it's unsafe with your data. ie: while it's handling many IO operations (say you're being ddosed whilst transcoding a dvd and flossing with a sata cable) unless you run completely out of system memory. Nothing should run away wasn't design to. This stems from the predictability of code execution times that pre-empting offers.

The predictably allows devices to make guarantees, for example, if your mouse is aware it's going to get a time slice, at worst, every 100ms, at least it'll be doing something every 100ms and you gain a visually responsive mouse (aye, it's not as great as it could be). The non-preempt side of life has your CPU tied up doing work that was sits inside a BKL - ie: dealing with a character device or ioctl - your mouse could be waiting 500ms or 1000ms before updating it's position: giving you the impression your PC is dying.

Code that is stuck inside the BKL isn't pre-emptable (you *must* wait for it to finish.) - there's a lot of it that does a lot of regular stuff. Often this will hold up other cores if you've got a cooperative multi-threaded program. The net effect is a slow PC.

RT systems have a different use: they want a guarentee that something is unable to ever delay the system, in particular interrupts. The BKL allows code to take a time slice and run away with it, because you thought it was very important and wrapped it in [un]lock_kernel. This then delays IRQs (IRQs cant run until the lock has finish - at least, I'd like to believe second level IRQs can't run, I'm unsure of the specifics) which will delay data coming in and out of your PC - hard file, disk and display buffers suffer: they fill and you start to loose data because there's nothing dealing with it.

Preempt kernels are good :)

(viva the Desktop)

Matt
Why testing isn't enough by mikeb · 2008-05-17 09:33 · Score: 5, Informative

It's worth pointing out here that the kind of races (bugs) introduced by faulty locking in general suffer from a very important problem: YOU CANNOT TEST FOR THEM.

Race conditions are mostly eliminated by design, not by testing. Testing will find the most egregious ones but the rest cause bizarre and hard-to-trace symptoms that usually end up with someone fixing them by reasoning about them. "Hmm" you think to yourself "that sounds like a race problem. Wonder where it might be?" and thinking about it, looking at the code, inventing scenarios that might trigger a race; that's how you find them.
Re:I don't understand by QX-Mat · 2008-05-17 11:06 · Score: 5, Interesting

I'm brave enough to want per CPU microkernels (with a messaging master?). I envisage all multi-CPU systems addressing memory in an non-unified manor soon enough - it'll be like the jump from segmented addressing to protected mode, but for CPUs.

The monolithic design is slowly forming a focal point in performance: something has to do a lot of locked switching - if SMP machines could do what they do best and handle IRQs and threads concurrently without waiting for a lock (they're better spent sending/receiving lockless messages), life would be easier on the scalability gurus.