Removing the Big Kernel Lock
Corrado writes "There is a big discussion going on over removing a bit of non-preemptable code from the Linux kernel. 'As some of the latency junkies on lkml already know, commit 8e3e076 in v2.6.26-rc2 removed the preemptable BKL feature and made the Big Kernel Lock a spinlock and thus turned it into non-preemptable code again. "This commit returned the BKL code to the 2.6.7 state of affairs in essence," began Ingo Molnar. He noted that this had a very negative effect on the real time kernel efforts, adding that Linux creator Linus Torvalds indicated the only acceptable way forward was to completely remove the BKL.'"
That's fine, but once you reach maturity you should be trying to do the "right thing" (the exact opposite.) And the Linux kernel has reached maturity for quite a while now.
I think Linus is right on this.
If this bores you, every lkml thread would cause your head to explode.
ResidntGeek
Its like rubbing cheetah blood on the engine of your car to make it go faster.
Obligatory blog plug: http://www.caseybanner.ca/
Why did they remove the preemptable BKL?
I'm not a kernel developer, but I'd say it's because there's widespread belief that the preemtable BKL is "the wrong way forward". Statements like these lead me to believe this:
In any large software project there's always a path to get from where you are, to where you want to be. It sounds like any version of BKL is considered ugly and causes problems, and patching it just won't work. In other words, fixing this part of the kernel isn't really possible, so they need to start over and change any code that relies on it to rely on something different entirely.
AccountKiller
What's linux?
The future.
Seven puppies were harmed during the making of this post.
When the Linux kernel first supported multiprocessor systems, it was done with a single lock protecting access to all the kernel (the Big Kernel Lock); the kernel could still only do one thing at a time. Over time, most sections of the kernel have introduced their own fine-grained locking and moved out from under the BKL, allowing many parts of the kernel to be running at the same time on multiple processors. The BKL has shrunk over time, but it still exists over a chunk of the kernel. The kernel hackers recently tried to replace the hard lock with a preemptable lock, but that had some bad interactions with the scheduler (which determines what process/kernel thread runs when), so Linus switched back to the old-style BKL.
Now, a group is trying to see if it is possible to weed out all the remaining uses of the BKL and replace them with localized locking for specific sections of the kernel. This is tricky, as there are side-effects of the BKL that are not always obvious.
Anytime you have more than one application running, they could get into an argument about who gets to use the serial port, the video display, memory, or drive storage. This is especially critical in multi-processor systems.
The answer is to allow sections of code to "lock" access for a brief duration -- "I'm working with this right now, don't anyone else touch it." Simple in theory, very difficult in concept.
Note that I'm speaking generically; I'm not an expert on the Linux kernel. Ideally, though, you want locks to be "granular" -- in other words, you only lock that specific hardware and/or portion of memory that you need exclusive access to. Apparently, the "big kernel lock" takes a brick wall and hammer approach, locking access (and claiming exclusive access during the lock, preventing anything from running). It's not granular.
If I'm wrong, someone else here can correct me. Like I said, I'm not an expert on the Linux kernel.
new semaphore code was introduced that simplified locking. Unfortunately in many kernel situations it's proven to affect performance at around something like 40% - which isn't just considerable its disastrous.
rather than merge the old locking code back in, and reintroduce the many different locking primitives they had, someone decided to simply reenable the BKL - the downside of which is they have to either fix the regression caused by the simpler semaphore code (not likely, it's very simple and clean - everyone's favourite pet/child) or remove instances of where the semaphore code is likely to be called (the BKL).
Matt
Ok so here's the deal:
Linux is a preemptive multi-tasking kernel. What this means is that a hardware interrupt like a keyboard click or the system timer will interrupt whatever is currently running on the CPU, and an interrupt handler in the kernel starts running code. In order to make sure that all the states of the kernel are consistent (ie: not corrupt), the different parts of the kernel are supposed to lock the data that they are using or modifying (ie, readlock or writelock) in case another code path gets run at the same time trying to modify the same data. It becomes even more important in a multi-cpu environment where locks have to be atomic (happen at the same time on all CPUs). So what you are supposed to do is only lock the resources you currently need (a file system drivers would only lock parts of the filesystem, not a character device). Because some programmers are lazy, or not sure what they are doing, they just use the big kernel lock which locks pretty much everything in the kernel. This is bad for multi-tasking and multi-processing because it means you can only have one codepath using the lock at a time.
Note: it's been a while since I've done kernel work, so I'm sure this is not 100% true, but hope it helps you understand.
Because these days the BKL is barely used in the kernel core, or so Linus says: the core kernel, VM and networking already don't really do BKL. And it's seldom the case that subsystems interact with other unrelated subsystems outside of the core areas. IOW, it's rare to hit BKL contention - and in those cases, you want the contention period to be as short as possible. And spinlocks are the faster locking primitive, so making the BKL a spinlock (which is not preemptable) makes the BKL contention periods faster. A mutex/spinlock brings you "preemptability" and hides a bit the fact that there's a global lock being used sometimes at the expense of performance, which may be a good thing for RT/lowlatency users, but apparently Linus prefers to choose the solution that is faster and doesn't hid the real problem.
If this bores you, every lkml thread would cause your head to explode.
Hey, I consider myself a code junky (and yes, even consider the issue of the BKL somewhat interesting), but I realize that this topic has about as much appeal to the average Slashdotter as mowing the lawn.
Since the summary doesn't cut to the chase, and the article was starting to get a little boring and watered-down, I read Ingo's post and here's what I got from it: the BKL is released in the scheduler, so a lot of code is written that grabs the lock and assumes it will be released later, which is bad. Giving it the usual lock behavior of having explicit release will break lots of code. Ingo created a new branch that does this necessary breakage so that the broken code can be detected and fixed. He wants people to test this "highly experimental" branch and report error messages and/or fixes.
Assuming everything is stable and correct, the next step is to break the BKL into locks with finer granularity so that the BKL can go the way of the dodo.
"I like to lick butts!" by MobileTatsu-NJG (#32700246) (Score:5, Informative)
A mutex/spinlock brings you "preemptability"
Duh, I meant mutex/semaphore. And Linux semaphores have become slower, meanwhile mutexes still are fast as old semaphores were, as #23446368 says. The options were to move from a semaphore to mutexes or spinlocks, but Linus chose spinlocks because the RT/low-latency crow will notice it and will try to remove the remaining BKL users.
Comment removed based on user account deletion
here (for subscribers. I dare not post a free link here :)
you had me at #!
Hey, its not easy keep every blade of grass within 0.3mm in length and maintain cross-colour length rules while keeping a close watch on weather-judged per-species expected length margins, you insensitive clod!
...
Keep off my lawn too, you pesky kernel hackers^WWkids
Caesar si viveret, ad remum dareris.
If that is true then it sounds like a bad decision.
If the BKL code is rarely used then the general usage performance impact is minimal and the efficiency of a spinlock vs mutex is irrelevant. If this is not true then saying it is rarely used is misleading.
However for real-time use you either do or don't meet a given worst case latency spec - the fact that a glitch only rarely happens is of little comfort.
It seems like it should have been a no-brainer to leave the pre-emptable code in for the time being. If there's a clean way to redesign the lock out altogether then great, but that should be a seperate issue.
Matthew Wilcox replaced the per platform semaphore code with a generic implementation because it was likely to be less buggy, reduced code size and most places that are performance critical should be using mutexes now.
Unfortunately this caused a 40% regression in the AIM7 benchmark. The BKL was now a (slower) semaphore and the high lock contention on it was made worse by its ability to be preempted. As the ability to build a kernel without BKL preemption had been removed Linus decided that the BKL preemption would go. Ingo suggested semaphore gymnastics to try and recover performance but Linus didn't like this idea.
As the the BKL is no longer be preemptible it is now a big source of latency (since it could no longer be interrupted). People still want low latencies (that's why they made the BKL preemptible in the first place) so they took the only option left and started work to get rid of the BKL.
(Bah half a dozen other people have replied in the time it's taken me to edit and redit this. Oh well...)
"That part of the code" is the difficult part. The BKL assumption is present in thousands of place all around the kernel, and nobody really know where. You can have two pieces of code, that looks totally unrelated, that happen to work because in all the code path leading to them the BKL is taken. Removing the BKL and "code it all over again" will create this new race condition.
There would be thousands of such, and you'll probably never succeed in debugging it.
The approach suggested in the article is to replace the BKL by a true lock, then "pushing it down", which means understanding WHY that code want the BKL, and get smaller locks instead in subroutines.
For instance, one piece of code could take the BKL because it will change 3 data structure. You could then remove the BKL and use, in the 3 part of code that changes those 3 structure, and use a finer grained lock for each of those.
By iterating this way, you should always get a somewhat working kernel, and slowly kill the BKL.
Keep these on Front Page...
This is the type of stuff that needs to be kept in the news, as the people who post here often have no understanding of, and the ones that do, have the opportunity to explain this stuff, bringing everyone up to a better level of understanding.
Maybe if we did this, real discussions about the designs and benefits of all technologies could be debated and referenced accruately.. Or even dare say, NT won't have people go ape when someone refers to a good aspect of its kernel design.
Yeah we got rid of the Giant lock, this is just the big lock, totally different things.
IranAir Flight 655 never forget!
The recent semaphore consolidation assumed that semaphores are not timing critical. Also it made semaphores fair. This interacted badly with the BKL (see [1]) which is a semaphore.
The consensus was to not revert the generic semaphore patch, but to fix it another way. Linus decided on a path that will make people focus on removing the BKL rather than a workaround in the generic semaphore code. Also, Linus doesn't think that the latency of the non-preemptable BKL is too bad [2].
[1] http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-05/msg03526.html
[2] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8e3e076c5a78519a9f64cd384e8f18bc21882ce0
No.
You know, like, remove that part of the code and code it all over again, see what is broken, and continue this way?
It's not that simple. When it comes to locking, there is no "part of the code" that can be replaced. Locking governs interaction between two pieces of code, sometimes two pieces that are very different but have some small thing in common.
Besides, the kernel is too big to just start throwing parts of it out and redoing them from scratch. It's much better to make incremental improvements, because then the people working on them will actual learn how to solve the problem. The BKL is not just a coding problem, but also a people and project management problem.
And the men who hold high places must be the ones who start
To mold a new reality... closer to the heart
Your not wrong, and like you I am going to continue in an over simplified style so the non programs can understand. The part you are leaving out is why you want your locking to be granular.
The granularity is important because you want other threads(jobs) to beable to get something done. At some point there is this thing called a scheduler that assigns your thread to execute, because every job needs a CPU. You get to work until your time allotment is expired or you have to stop because something you need to continue is not availible, because its say "locked".
Think of this like working in a shop along side someone else. You have one set of tools, you need a little screw driver, and a big one to do your work. The other guy needs the little scew driver and a pair of pliers. You want him to put the screw driver down while he is using the pliers so that you can use it if you need to. If he instead puts it in his breast pocket you are going to have to wait to finish your job until he finishes his. Even though its your turn at the work bench(CPU) you can't do anything with it because you don't have what you need. So all you can do is yeild the rest of the time to the other task, and hope he finished up soon.
In the kernel world this really short circuits the work of the scheduler. It might want to give time to other threads and it will but they are going to just turn around and give that up because whichever thread is holding the BKL is likely the only one who can actually do any work. As an end user this means something like data gets read from your network card ok but your sound keeps skipping.
The tricky part with more granular locks though is avoiding circular conditions, these can crash the system. Imagine: Job One needs resource A and B and has A locked, its waiting for B. Job Two needs B and C and has B locked and is waiting on C. Job Three needs A and C and has C locked and is waiting on A. Unless the system can detect this condition which is hard to do in many cases none of these threads will ever be able to run. The kernel contributors likely have some work ahead to eliminate the BKL and not cause these types of problems.
Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
Perhaps its growth is exponential.
Wow. It sounds like it's about time someone on the kernel team reads Working Effectively With Legacy Code by Michael Feathers.
I'm a software developer myself on a very large project myself, and this book has absolutely revolutionized what I do. Having things break silently in the kernel is a sure sign that dependency problems in the code exist, and most of this book is about ways to break dependencies effectively and get code under test. And that's the other thing... if they aren't writing tests for everything they do, then even the code they write today is legacy code. Code without tests can't be easilly checked for correctness when a change is made, can fail silently easilly, and can't be understood as easilly.
That's what this book is about, and if things in the kernel have deteriorated to such a state then they need to swallow their pride and take a look at resources designed to cope with this. I know they are all uber-coders in many respects, but everyone has something they can improve on, and from the description they give of their own code, this is their area for improving.
Beware of bugs in the above code; I have only proved it correct, not tried it.
Its gone from 0% to 100% on my pcs, everything is relative.
IranAir Flight 655 never forget!
// MD_Update(&m,buf,j);
And you learn more when you write your own (virtual) device driver which crashes your kernel and renders it to unbootable state :)
Not that I know anyone who has done so... or at least I wont admit it!
You don't know what you don't know.
Slashdot's not supposed to be interesting to every reader all the time. If you want someone to cater to a least common denominator, you'd be better off somewhere else.
ResidntGeek
That's a terrible excuse. There are many applications where a real-time Linux kernel is highly desired. Besides, it is important to note that real time systems do not focus on speed. This is a subtle difference from "performance" which usually caries speed as a connotation; it doesn't for a real time system. The real time system's focus is on completing tasks by the time the system promised to get them done (meeting scheduling contracts). It's all about deadlines, not speed. So from this point of view, the preemptible BKL, even with the degraded speed, could still be viewed as successful for a real time kernel.
In other words:
TESTS DON'T VERIFY THAT YOUR CODE IS NOT BUGGY. YOU VERIFY THAT YOUR CODE ISN'T BUGGY.
Contrary to the popular belief, there indeed is no God.
Comment removed based on user account deletion
Keep dreaming ...
With all those processors, you'll want to be saving energy, so you'll be aiming to turn off individual processors until needed, and run the remaining processors at full load, so you'll still need a scheduler, locks, etc.
And yes, it's possible even today to use up more than 4 gig of ram and have to hit swap.
Whatever your large project is, I'm willing to bet it's nowhere near as complex as the kernel. Whenever you get the feeling that they must have missed something that seems obvious, you're probably the one that's wrong. No offense, but they have a lot more experience dealing with unique kernel issues than you do.
You talk about unit testing, but how exactly are you going to unit test multi-threading issues? This is not some simple problem that you can run a test/fail test against. These kinds of things can really only be tested by analysis to prove it can't fail, or extensive fuzz testing to get it "good enough"..
This task is not easy at all. 12 years after Linux has been converted to an SMP OS we still have 1300+ legacy BKL using sites. There are 400+ lock_kernel() critical sections and 800+ ioctls. They are spread out across rather difficult areas of often legacy code that few people understand and few people dare to touch.
This is where microkernels win. When almost everything is in a user process, you don't have this problem.
Within QNX, which really is a microkernel, almost everything is preemptable. All the kernel does is pass messages, manage memory, and dispatch the CPUs. All these operations either have a hard upper bound in how long they can take (a few microseconds), or are preemptable. Real time engineers run tests where interrupts are triggered at some huge rate from an external oscillator, and when the high priority process handling the interrupt gets control, it sends a signal to an output port. The time delay between the events is recorded with a logic analyzer. You can do this with QNX while running a background load, and you won't see unexpected delays. Preemption really works. I've seen complaints because one in a billion interrupts was delayed 12 microseconds, and that problem was quickly fixed.
As the number of CPUs increases, microkernels may win out. Locking contention becomes more of a problem for spinlock-based systems as the number of CPUs increases. You have to work really hard to fix this in monolithic kernels, and any badly coded driver can make overall system latency worse.
It's hard to test whether you've broken a driver when you don't have the hardware to test with. Perhaps the future will be Qemu emulation of all the different hardware in your system : )
This is not to say that there need to be tests for things that can be caught at compile time or run time regardless of hardware but there is only so far you can take it.
It's not like the kernel doesn't have any testing done on it though. There's the Linux Test Project which seems to test new kernel's nightly. If you ever look in the kernel hacking menu of the kernel configuration you will see tests ranging from Ingo Molnar's lock dependency tester (which checks to see locks are taken in the right order at run time), memory poisoning, spurious IRQ at un/registration time, rcu torture testing, softlockup testing, stack overflow checking, marking parts of the kernel readonly, changing page attributes every 30 seconds... Couple that with people like Coverity reporting static analysis checks on the code. Tools like sparse have been developed to try and so some of the static checks on kernel developer machines while they are building the code.
But this is not enough. Bugs STILL get through and there are still no go areas of code. If you've got the skills to write tests for the Linux kernel PLEASE do! Even having more people testing and reporting issues with the latest releases of the kernel would also help. It's only going to get more buggy without help...
Linux is something like nearly half the servers in existence and most of the top supercomputers. Desktop is a slower road of course, but it is still chugging along slowly but surely. Look at apple, originally a big percentage of desktops, then dropped to almost nothing, now inching its way back up because it got good. Stuff changes. The linux desktop market is big enough for there to be a lot of credible choices just within "linux" itself, there are half a dozen or so really good desktops and dozens of pretty good desktop linuxes out there now. And word gets around. It will be like FF, 0% to now upwards of one quarter to one half depending on where you look around the planet. There's some magic number that is hard to pinpoint but once anything reaches a certain level of use/adoption it really takes off then, usually near as I can see around 10%, then it makes huge jumps. Bad car analogy time, toyota prius is now more than one million cars sold from zero cars ten years ago, and the first with a mass market hybrid system that they really tried to make and sell in decent numbers (compared to honda for example who only fooled around with their insight). Now look, all the major manufacturers either have their own hybrids or will have them shortly. Ten years, that's all it takes once some threshold hits and it looks "real" to joe consumer to go from exotic to normal. I think this year the asus eeePC made linux "real" to a lot of people, so I am expecting ubiquitous linux as a choice to be along shortly with most computer makers as an option. And that is leaving out all the gadgets people use day to day running some smallish embedded linux, gps systems, cellphones, etc.
I'll agree that this should have been shorted out long since. But it wasn't, and very few people though that it was reasonable to expend time on something so obviously unreasonable. (Multiprocessors were things like Illiac IV, huge monsters that were utterly impractical.)
Time passes, technology changes, and now it's become urgent to deal with this, so now it's being dealt with.
One should, perhaps, wonder what currently unreasonable problem should actually start being addressed RIGHT NOW!! The things I can think of divide neatly into two camps. 1) We don't know enough to even get started, and 2) It really seems utterly implausible, even given this example to work from. Unfortunately, somewhere in there is something that's being overlooked, and I don't know what. Kernel support for Actors? Kernel security to control Actors? Kernel support for Language parsing? They all seem implausible.
What is clearly needed soon is software that facilitates the use of multi-processor environments. Dataflow languages have promise, but there may be other reasonable choices. Possibly some interface that would easily allow different computer languages to work together, but that may be a real impossibility. Or even a language basically like C or C++. but extended with a "foreach" operator that allowed parallel execution of the loop body...but the language would need to be smart enough to tell what needed to be read locked and what needed to be write locked, and what could just be ignored. This implies that use of pointers is *severely* circumscribed! And if you're going to do that, you probably ought to have garbage collection. It might sound like I'm talking about Java, but that would be wrong. This language would need to be close to the metal, so it could adapt itself (at run time!!) to the local machine. And since we want as much efficiency as possible, virtual machines, interpreters, etc. are probably out.
I don't know of any language that meets the specs I've outlined, but I know of many languages that meet large parts of them. Of the languages I know, D (Digital Mars D) comes the closest, but its totally missing on even the parallelization that C/C++ have (as an add-on).
But that doesn't really say where the kernel should be going...except that possibly C isn't the best language to use for a multiprocessor environment. (But C is still the most efficient in most places, and it DOES have add-ons for parallelization...though whether you can use those add-ons in kernel programming isn't something I've investigated.)
I think we've pushed this "anyone can grow up to be president" thing too far.
Are you arguing for a microkernel style solution, sir? If so, I salute your bravery! ;-)
Linux is a preemptive multi-tasking kernel. What this means is that a hardware interrupt like a keyboard click or the system timer will interrupt whatever is currently running on the CPU, and an interrupt handler in the kernel starts running code. In order to make sure that all the states of the kernel are consistent (ie: not corrupt), the different parts of the kernel are supposed to lock the data that they are using or modifying (ie, readlock or writelock) in case another code path gets run at the same time trying to modify the same data. It becomes even more important in a multi-cpu environment where locks have to be atomic (happen at the same time on all CPUs). So what you are supposed to do is only lock the resources you currently need (a file system drivers would only lock parts of the filesystem, not a character device). Because some programmers are lazy, or not sure what they are doing, they just use the big kernel lock which locks pretty much everything in the kernel. This is bad for multi-tasking and multi-processing because it means you can only have one codepath using the lock at a time. Like putting too much air in a balloon!
#DeleteChrome
You sure that's all in the kernel? I have a feeling that's mostly, if not entirely, in userland APIs, which is not uncommon to happen on Linux either. Witness X and the toolkits.
Specifically its just the bootloader for GNU Emacs, the finest most complete operating system known to mankind.
Since removing the BKL will cause deadlock situations like the one you describe, perhaps a solution to this problem is to re-think the way locking is implemented. If a program knows that it will need access to resources A, B, and C, it could put in a request to reserve all three of those resources simultaneously. If the three resources are available at that moment, they will all be locked simultaneously, the task will execute, and then they will be unlocked simultaneously. But if one or more of the resources are not available at that moment, that task will simply stop executing (it won't be scheduled) until the first instance that all three become available. This way, a resource doesn't become locked until it is actually going to be used.
McCain/Palin '08. Now THAT's hope and change!
but I realize that this topic has about as much appeal to the average Slashdotter as mowing the lawn. This topic is probably mainly of historical interest. (BKL used to be one of those bread-n-butter slashdot stories in the early days)
The funny thing is that the reply quality here is quite high for technical topics, but over time slashdot management has found that retarded political threads are much more popular.
Business. Numbers. Money. People. Computer World.
yes I agree. the BKL wasn't fair - which worked. happiness. but hey my views are pointless because I'm pro desktop/nvidia/life (and I have a soul).
The new semaphore implementation gives way in a very fair manor proving that sometimes you can be too fair and Kon was right from the get go (a desktop user's opinion, i don't mean to bait). if something didn't explicitly ask for a time slice, ie: a kernel section was preempted by an interrupt and then queued back in normally, then it doesn't get one until the pope said yes. the BKL as we knew it would allow a section enclosed in [un]lock_kernel to queue jump - at the expense of some extra code trickery.
Sadly the BKL was one of the more demanding locks - if it didn't get what it wanted fast enough, subtle differences in character device operations, which all seem to be strategically linked to the tty and other vfs layers from what I can make of KernelTrap, would work against the overall throughput of the kernel and into usermode (archaic ioctls suffer)
the best "fix" would be to implement the old semaphore types for depreciated BKL-only use (don't export them?) - but the old semaphore code was quite substantial, and required some arch specific implementations which have now... I don't see that much work being reintroduced in it's old form any time soon.
Matt
while a 2.7 branch would make sense if stability was important, its not, with 2.6 linus decided he couldn't be bothered with the boring part of stability and so told distros to do it themselves. This happens (with varying degrees of success), and has allows the kernel to develop at a much faster pace. Unfortunately this particular change is too much even for the 'unstable' kernel, so a temporary testing branch to get other peoples code sorted is being formed (a virtual 2.7 in effect only its so experimental it will never get released)
IranAir Flight 655 never forget!
Comment removed based on user account deletion
There are a few added benefits from a fully real-time OS that most people gloss over.
:)
For example, you will *know* your PC will never become utterly unusable to the point it's unsafe with your data. ie: while it's handling many IO operations (say you're being ddosed whilst transcoding a dvd and flossing with a sata cable) unless you run completely out of system memory. Nothing should run away wasn't design to. This stems from the predictability of code execution times that pre-empting offers.
The predictably allows devices to make guarantees, for example, if your mouse is aware it's going to get a time slice, at worst, every 100ms, at least it'll be doing something every 100ms and you gain a visually responsive mouse (aye, it's not as great as it could be). The non-preempt side of life has your CPU tied up doing work that was sits inside a BKL - ie: dealing with a character device or ioctl - your mouse could be waiting 500ms or 1000ms before updating it's position: giving you the impression your PC is dying.
Code that is stuck inside the BKL isn't pre-emptable (you *must* wait for it to finish.) - there's a lot of it that does a lot of regular stuff. Often this will hold up other cores if you've got a cooperative multi-threaded program. The net effect is a slow PC.
RT systems have a different use: they want a guarentee that something is unable to ever delay the system, in particular interrupts. The BKL allows code to take a time slice and run away with it, because you thought it was very important and wrapped it in [un]lock_kernel. This then delays IRQs (IRQs cant run until the lock has finish - at least, I'd like to believe second level IRQs can't run, I'm unsure of the specifics) which will delay data coming in and out of your PC - hard file, disk and display buffers suffer: they fill and you start to loose data because there's nothing dealing with it.
Preempt kernels are good
(viva the Desktop)
Matt
The best course of action would be to redesign the Linux kernel from scratch and this time integrate all possible drivers. Hardware support would be a lot easier!
I would even go so far as to suggest integrating the most important server tools into the kernel to decrease latency. Why not integrate Apache? You could even integrate the shell for added responsiveness!
Linus has demonstrated that micro kernels are a footnote in history. Nowadays memory is cheap and we can afford the have a large (or very large) kernel.
Dragonfly is just the latest branch of development of a codebase with decades of history!
On the other point (which I wasn't disputing or answer)--about the giant lock, I don't know much about the linux internals, but it sounds like the bsd GIANT is the same as the linux BKL. FreeBSD is in something like its 6th year of removing giant!
Yes, and since the kernel can and is branched, they can decline to apply this patch and keep the 2.6.7-2.6.22 or whatever style BKL... or they can help everyone and rewrite various BKL-using code to not use it. I'd rather have a kernel that has low latency AND behaves correctly, but if I have to chose I prefer correct behavior every time.
So when can we run QNX-Ubntu?
Just asking . . .
It's worth pointing out here that the kind of races (bugs) introduced by faulty locking in general suffer from a very important problem: YOU CANNOT TEST FOR THEM.
Race conditions are mostly eliminated by design, not by testing. Testing will find the most egregious ones but the rest cause bizarre and hard-to-trace symptoms that usually end up with someone fixing them by reasoning about them. "Hmm" you think to yourself "that sounds like a race problem. Wonder where it might be?" and thinking about it, looking at the code, inventing scenarios that might trigger a race; that's how you find them.
This is one approach to deadlocks; it would fall generally under "avoidance".
The problem is that if you have any serious contention over the resources, it is entirely possible that the process will _never_ get the resources (because one of them always gets snatched up before another gets released, so all n resources are never available at once to the requesting process). This leads to starvation and general sadness.
If the system has minimal contention (so the normal case is that all three resources are unclaimed) and resources are held very briefly (so if a resource is taken it is likely to be released before another is taken, anyway) then it may work. In real systems these are hard properties to guarantee.
Also, the scheme requires a process to know in advance which locks it will need. A lot of algorithms may discover this on the fly (e.g. if you are traversing a data structure), which becomes a problem. The best you could hope to do is to lock aggressively--taking everything you might need--but this is ugly, and would tend to violate the conditions above (locking everything will lead to contention, locking everything in advance will lead to holding locks for a long time). Alternatively, if you discover that you need a new lock that you don't yet have, you could give up all the locks you do have and then try to lock again (with the new lock added to the set). This is also ugly and increases the chance of starvation (since now you need to lock a bunch of resources several times). Additionally, since you have to unlock in the middle, the algorithm becomes much more complicated. For example, when you discover you need a new lock, you must put the world in a consistent state before unlocking. And when you re-lock, you must check to make sure that the world hasn't been modified under your feet (which is entirely possible, and may very well cause you to need a still-different lock).
Basically it doesn't work that well.The proposed outcome is for there to be increased opportunities to switch between programs/kernel or to run multiple things at the same time.
For those who enable the option this should reduce the chance of a hardware's buffer not being filled in time (so audio is less likely to skip in demanding environments). If you are an audio recording person or need VERY (less than hundredths of seconds) fast responses all the time your experience should improve. If you run VERY big workloads that have lots of pieces that can happen simultaneously on computers with 2-1024 CPUs, your experience (increase in work finished per second say) should improve. Typical desktop performance may improve a little if you have multiple CPUs/cores but one would guess not enough to be noticeable without careful measurement.
The trade off is increased risk of system hangs / data corruption due to the programming being trickier although instances of this happening should fall over time with popular hardware.
I'm brave enough to want per CPU microkernels (with a messaging master?). I envisage all multi-CPU systems addressing memory in an non-unified manor soon enough - it'll be like the jump from segmented addressing to protected mode, but for CPUs.
The monolithic design is slowly forming a focal point in performance: something has to do a lot of locked switching - if SMP machines could do what they do best and handle IRQs and threads concurrently without waiting for a lock (they're better spent sending/receiving lockless messages), life would be easier on the scalability gurus.
On second thoughts, disregard that... I just remembered that Emacs has a Vi emultation mode!
"Slashdot - News and Chat Sites Deviant". (Click "homepage" link above for details).
http://michaelsmith.id.au
What would happen if there were two distinct ways of handling the BKL problem, which could be implemented simultaneously: Method number 1, the existing method. This would remain in place to support all the program that use it. Method number 2, a newfangled method that new programs will need to be made aware of. The idea is that "well behaved" programs would, over time, begin to use the second method. The first method would actually use the second method in its implementation, so really there's only one method, although it appears that there are two.
Each resource in the kernel that might potentially be locked receives its own small lock. The idea is that the more finely grained the locking, the less likely it is that more than one program will need the same resource at the same time. The deadlock and resource contention problems will still exist, of course, but they will occur less frequently. Programs that know exactly which locks they'll need over the course of an operation would put in a request for all of those resources. At this moment, the kernel checks if all requested resources are available, and if so, the resources are locked immediately (and simultaneously) and program execution continues into the portion that uses those resources. However, if one or more of the resources are not available, the kernel will queue the request and one of two things will happen at the program's choice: Either the call will block until the resources are available, at which point program execution continues normally (we'll call this "synchronous"); or the program can be notified that "I'll get back to you on that" (we'll call this "asynchronous") in which case it can continue executing some other, unrelated part of its code, such as the part that updates the screen, outputs sound, or whatever. Obviously, this will be a bit trickier for application programmers to implement, but the reward will be programs that appear responsive to a user while they "wait" for locks. There should be an additional option in the locking request: a timeout value (with an option to never timeout). If the request times out, the program can do something appropriate. The kernel function that releases a lock will be modified to check the queue of lock requests, and immediately transfer control of the lock to a waiting program (according to some prioritization method). This means unblocking something that is blocked synchronously, or sending a signal to the process that queued the asynchronous lock request. The BKL will be implemented by requesting all locks, synchronous, never time out. Simple. The lock request function will return a value telling you whether you received the lock or why you didn't receive it. The scheduler "knows" if there are two or more processes in its list that are blocked waiting for a lock. If at least one of them has provided a timeout value other than "never," the scheduler does nothing since the problem will fix itself eventually, a la Office Space. Otherwise, it can examine the lock relationships. If a deadlock situation exists, it will choose one of the processes and return it an answer that it didn't receive the lock for deadlock prevention. That process can then do something appropriate. Code that doesn't know how to handle this situation will crash, but the system won't block indefinitely.
McCain/Palin '08. Now THAT's hope and change!
I'll respond because it is fantastic to see new people thinking about these issues. But I must agree with twizmer on this - grabbing multiple resources might solve the problem, but it is very clumsy. Some resources (eg storage) may take milliseconds to complete, whereas others (eg graphics) might take only microseconds. Holding up the fast ones while the slow ones complete is very undesirable (for all the reasons twizmer gives).
There are techniques that are used for problems like deadlocks and starvation: changing priorities on the fly; or enforcing mutex ordering; or even 'prodding' deadlocked tasks, but they are somewhat ugly. You'll find chapters in any book on OS design.
The essential problem is that the use of semaphores (and mutexes etc) is a low level way to control multiple processes; it is analogous to using the goto for flow control. There are languages that have attempted to address this (eg Occam or Modula I) with slightly higher level constructs, but they have not become popular, and are not totally radical.
I believe that we will need new programming languages to achieve safer parallelism. My bet would be on a language with message passing primitives (since they fit well with our object oriented models), and perhaps the use of Petri net formalism to prevent deadlocks. I gather that Nokia's phone OS uses this message passing model.
It should be noted that current processor design does not suit efficient message passing (the emphasis is more on an efficient stack, since that corresponds to the procedural flow of control - an exception may be the old Transputer architecture). However, I think the languages need to be developed first, even if they are not efficient to compile; processor development will support the most popular languages (as it has grown to support the use of C and other procedural languages).
Yup and Mica designed in the late 80's as a Microkernel replacement for VMS.
http://en.wikipedia.org/wiki/DEC_PRISM
It was canned by DEC and Cutler and most of the engineers were hired by Microsoft where they wrote Windows NT.
Interesting Mica was supposesed to have multiple personalities - posix and VMS. Similarly Windows NT had Win32, Win16/Dos, Posix and originally OS/2.
The NT model is to packetize IO as IRPs and pipeline requests across multiple CPUs. E.g. on Windows NT under very heavy load a disk driver can be starting a request on one CPU, in an ISR on another and in an APC on yet another. That makes the rules for writing drivers quite complex of course, but that's a price worth paying. That's an extreme case though, normally DPCs have an affinity for the CPU that generated the interrupt because that CPU is likely to have the driver code in its cache. But the fact it's possible illustrates how far Mica/NT is from the tradional Unix model of treating device drivers as synchronous subroutines that only one kernel thread can be in.
And it seems pretty obvious to me that Mica/NT works like this because they were designed from the start to run on a lot of small Risc processors wired up as an SMP machine. Unix wasn't - back when Unix was designed the dominant paradigm was one big CISC CPU. In that world, drivers as a synchronous library is actually better since you don't have the overhead of building packets.
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
You've gotta know that's complete nonsense, right? You wanna create an abstraction layer between every single bit of code in the kernel and the hardware it's meant to be running on to handle locking dynamically, making decisions at runtime, rather than expecting developers to be able to design a system and make those decisions for the code outside of its runtime and having the best way compiled to run directly on the hardware at full speed?
Managed languages have their place (and in fact all my work now is with managed languages)... but they also have to run on something. You think your java apps will just run without an operating system underneith it with these kinds of features to support the running of the virtual machine? What language do you think your java machine is written in? Do you think that's air you're breathing now? Hmm...
"This is innacurate, not correct and misleading at the same time"
But hey at least you prefixed what you had to say with this so we did know.
The revolution will not be televised... but it will have a page on Wikipedia
All lines of code are not equal.
There's a huge difference between typical application code and system code. Very little application code is as performance-sensitive as system code, because the goal of the system code is to use as little time as possible (consistent with some other goals), to make as much time as possible available to run application code.
OS code is performance-tuned to a degree that application code almost never is, and that focus on performance results in complexity that isn't well-represented by counting lines of code. Further, most application code isn't nearly as multiprocessor-aware as modern OS code, which introduces another huge complexity factor. Finally, the role of OS code is to interact directly with hardware, and if you've ever written on-the-metal code you know how much complexity that adds. Linux, of course, takes that even further by trying to work on a wide variety of hardware platforms, abstracting commonality where possible, but only when it doesn't interfere with performance.
No, I don't think there are many, if any, unit tests.
Altogether, I estimate that system code is an order of magnitude more complex, per line, than application code.
If you RTFA you'll see it was a series of system tests which demonstrated the problem in the first place. Although the fact the kernel doesn't seem to have a standardised set of system/unit tests does concern me. (correct me if I'm wrong)There are a bunch of sets of system tests in place for Linux. They're created and executed by multiple groups of people around the world and the results are made available to the developers (some of whom are the same people executing the tests).
This is a different approach than is common in the normal, centralized development model, but it's one that's very effective for the sort of decentralized development model used by the Linux kernel team. People who are interested in different aspects of Linux create tests designed to evaluate the kernel according to those aspects. When they see problems, or opportunities for improvement, they post their results to LKML, often with patches to address the issue they identified.
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.