Slashdot Mirror


Linux Gets O(1) SMP Patch As Late Christmas Gift

bodin writes: "Now that new-year's parties are over things are getting boring again. For those who want to see and perhaps even try something more complex, Ingo Molnar is announcing this patch that is a pretty radical rewrite of the Linux scheduler. This is big stuff!"

38 comments

  1. Oxymoron by sydb · · Score: 3, Funny

    Hehe!
    I liked O Xymoron's enthusiastic response:


    On Fri, 4 Jan 2002, Ingo Molnar wrote:

    > this is more than 6 million context switches per second!

    Everyone knows scheduling is boring.

    :)

    --
    Yours Sincerely, Michael.
    1. Re:Oxymoron by Wesley+Felter · · Score: 3, Funny

      I think that was referencing Linus's earlier comment that he wasn't worried about scheduling because it's so easy. It looks like Linus was right: Just wait around long enough and Ingo will write a new scheduler for you. :-)

  2. O(1) == constant time by adadun · · Score: 5, Informative

    For those of you who haven't read enough computer science to know what O(1) means, here is a short explanation.

    The Big O-notation is a way to describe how the (asymptotic) execution time of an algorithm depends on the inputs to the algorithm. For instance, an algorithm that loops over n values is said to have the asymptotic execution time O(n) - it is proportional to the number of times the loop is executed.

    Similarly, an algorithm that runs in constant time, i.e., that takes equally long to execute for 10 values and for 1000000 values, is said to be O(1). The execution time is proportional to 1.

    For the Linux scheduler, switching processes is O(p), where p is the number of currently running processes. This new scheduler switches processes in O(1) time.

    This means that even though the old scheduler might be fast for low numbers of running processes, it will take longer and longer timer to switch processes when the number of active processes grow. The new scheduler will switch processes just as fast for 2 processes or for 200 processes. Even though the new scheduler might be slower in when there are few processes, it will be faster when the number of running processes increase.

  3. I hope Linus listens (and I'm not making a fool.. by dalutong · · Score: 3, Insightful

    of myself)

    I don't know if in 2.ed kernels Linus still likes the "small patch" idea. but this is pretty amazing. I am no kernel coder, but some of these tests showed 600% percent improvement and (seemed to me to be) impressive scaling. All the kernel gurus out there, what is the chance that this will make it into the kernel? (2.5)

    --

    What comes first, finding a teacher or becoming a student?
  4. Will this help... by bihoy · · Score: 3, Interesting

    Before attempting to install this on my test box I'd like to know exactly what the performance inplications are to specific types of applications and services. For example I am extremely interested in improving JVM performance on a Linux box.

    Any information or direction about this would be very helpful.

    1. Re:Will this help... by p3d0 · · Score: 2

      Answer: don't install it. It's not yet stable. Some people on the mailing list had trouble booting with the patch.

      I'm looking forward to the day it appears in kernel 2.6.

      --
      Patrick Doyle
      I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
    2. Re:Will this help... by Anonymous Coward · · Score: 0

      I too am interested in improving JVM performance. Using the IBM JDK 1.3.0, I ran the volanomark with this patch. I ran the test first on 2.4.17 without the patch and then with the patch. On a single Athlon 1.3GHz the improvement was anywhere from 2.6% to 6.5%, but on a Dell 2550 with dual P3-1.13GHz the improvement was nearly 20%. Although I agree with previous posts that this patch is not yet stable (I couldn't boot with H5 version of the patch, but can with the H6 version) I would say that this improvement is nothing to ignore. It has already made it into 2.5.2-pre10 (which I also tried) and the performance is all there.

      I suggest you install it. With lilo and grub you can have your choice of production and test kernels to boot from. If you have problems you can always boot your old kernel.

  5. Warm Fuzzy by p3d0 · · Score: 5, Interesting

    Wow. I read this guy's description of what he has done. I hope he's a teacher, because his explanation of those complex issues was a joy to read.

    I have a BP6 dual-celeron Debian machine which already gives me the benefits of countless hours of volunteer time, including the SMP kernel and ReiserFS, along with dozens of free development tools. Now I see this guy working like a dog to tune the heck out of the scheduler for SMP machines, and I know that when I eventually run the 2.6 kernel, I'm again going to reap the benefits of his work.

    It's almost enough to make me learn to hack the Linux kernel out of a sense of obligation.

    --
    Patrick Doyle
    I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
    1. Re:Warm Fuzzy by ameoba · · Score: 2

      For the most part, the important big changes are all in the kernel (and I still don't use more than 640k at a time ^_^). If you really want to help Linux out, write documentation or work on userland apps that your grandmother can (and would) use.

      --
      my sig's at the bottom of the page.
    2. Re:Warm Fuzzy by Anonymous Coward · · Score: 1, Insightful
      work on userland apps that your grandmother can (and would) use.


      Why do we want everyone's grandmother to be using Linux? To do that, we'd basically need to design a system like MacOS (pre-X) -- very simple UI, lots of big-name applications (Office, Photoshop, etc), and good marketing. All 3 of these are completely unappealing to me, and are not the strengths of the open source community. Why not just design the best OS we can, the OS we want to use? For me, that means a functional GUI (Gnome or KDE more than suffice), but an absolutely rocking server OS + software development environment.



      I'd much rather see Linux replace Solaris/AIX/HPUX/Win2k. As for desktop dominance, who cares?

    3. Re:Warm Fuzzy by Anonymous Coward · · Score: 2, Insightful
      Why do we want everyone's grandmother to be using Linux?
      Good cheap hardware with Linux drivers.
  6. Playing with patch....... by orangesquid · · Score: 5, Informative

    By the way, since the links /and/ the URL's (as far as I can tell) in the post are broken, this should help:

    http://people.redhat.com/mingo/O(1)-scheduler/

    I've created diffs between 2.5.1 and 2.5.2pre8 with the O(1) scheduler, and between 2.5.2pre8 and 2.5.2pre8 with the O(1) scheduler.

    2.5.2pre8 actually patches pretty well with the original scheduler patch (drivers/char/serial.c.rej can be ignored, and you have to make a few changes to kernel/sched.c in order for it to patch correctly), but because it took me at least ten minutes of fiddling with sched.c I've decided to make a diff for 2.5.2pre8.

    No guarantee that either of these works, though :) but hey, my kernel compiled just fine.

    # diff -ru linux-2.5.1 linux-2.5.2pre8schedO1|grep -v '^Only in '|gzip -f >/home/web/patch-2.5.1-2.5.2pre8schedO1.patch.g z


    http://os.markbach.com:8080/patch-2.5.1-2.5.2pre 8s chedO1.patch.gz
    (396,961 bytes)

    # diff -ru linux-orig linux |grep -v '^Only in ' |gzip -f >/home/web/patch-2.5.2pre8-2.5.2pre8schedO1.pat ch.gz


    http://os.markbach.com:8080/patch-2.5.pre8-2.5.2 pr e8schedO1.patch.gz
    (31,124 bytes)

    Good luck to anyone who tries to use these :)

    And no, I didn't patch in the kdev_t stuff from people/aeb on kernel.org because there's lots of kdev_t stuff in the Changelog for pre7 and pre8, so I decided to assume (yes, I know, assuming makes an ass out of u and me) I didn't need it... of course, when the system crashes after five seconds, maybe I'll change my mind ;)

    And if, for some odd reason, you can't connect on port 8080, just connect on port 80 and let's hope you're not blocked by @home's or my firewall. :)

    Damn, I'm using too many smileys :-\

    --
    --TheOrangeSquid Is it any wonder things seem so awry? We swim in a sea of confusion and don't have to think to survive
  7. Ingo Molnar by Anonymous Coward · · Score: 3, Funny
    My name is Ingo Molnar. You killed my kernel. Prepare to die.

    Am I the only person that thinks that every time I see his name?

  8. What I would like to know. by Anonymous Coward · · Score: 1, Interesting

    Will this make apache 1.x as fast as 2.0? I thought one of the main performance bottle necks in the 1.x series was that for every connection a new process had to be forked. And that the system suffered because each process added overhead because it was dependent on a scheduler that didn't scale well. Apache 2.0 solved this problem by using threads instead of forks. But with this new patch will the 1.x version work just as well as a 2.0 version that doesn't use the fancy new features

    1. Re:What I would like to know. by swright · · Score: 2, Insightful

      There's a second factor in that - the cost of actually creating the processes/threads. On the whole threads are cheaper to create than processes IIRC.

    2. Re:What I would like to know. by p3d0 · · Score: 2

      Linux threads aren't tremendously more efficient than its processes, but threads can be pooled. When a connection closes, its thread doesn't terminate; instead, it blocks and waits to be assigned to a new connection. That way, the thread creation cost is amortized over a lot more connection.

      It's also possible to pool processes, but pooling threads is easier and more efficient, because you don't need interprocess communication or special shared-memory segments. Instead, you just have the threads all block on the same job queue.

      --
      Patrick Doyle
      I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
    3. Re:What I would like to know. by Anonymous Coward · · Score: 2, Informative

      Apache 1.x does in fact pool processes, as does the 2.0 prefork MPM. All processes save one block on a semaphore, and only the one gets into listen (2).

      The real reason threads can be faster than processes if all other things are equal is context switch time. Switching between threads that share all their page tables is just a matter of restoring all the registers. When switching between processes, you must flush the TLB also.

    4. Re:What I would like to know. by pthisis · · Score: 5, Informative

      The real reason threads can be faster than processes if all other things are equal is context switch time. Switching between threads that share all their page tables is just a matter of restoring all the registers. When switching between processes, you must flush the TLB also.

      But when threads modify memory on SMP, they have to ensure cache coherency between CPUs. Processes can don't have this restriction, which is why processes can often be faster than threads. It all depends on your usage patterns and OS. On e.g. Solaris or Win32, threads are far faster at context switching than processes (mainly because processes are so heavyweight on those systems; Win32 also doesn't offer an API to efficiently implement multiprocess programs). On identical hardware, Linux's processes switches are faster than Solaris or Win32 thread switches and Linux thread switches aren't that much faster than process switches (beacause process switching is implemented in an efficient manner).

      So on Linux, the fact that memory isn't shared between processes (hence CPU coherency isn't the same issue) often makes processes a substantial performance _win_.

      This is aside from the fact that threads are extremely difficult to program properly--there's a reason that OSes spent all that time implementing protected memory (remember all the hoopla about that?) and threads throw that concept out the window, plus tend to lure the programmer into architectures that need nasty locking and synchronization.

      That said, for some applications threads are the right answer. But using multiple processes, possibly with shared memory for a limited set of data, is generally a more maintainable solution that's easier to architect and implement. It certainly makes it explicit which data structures are meant to be shared, which a multithreaded solution does not.

      Sumner

      --
      rage, rage against the dying of the light
    5. Re:What I would like to know. by Mike+Miller · · Score: 1
      The original poster is correct. Just for clarification here...

      Cache coherency is maintained in a UMA SMP (i.e. all 2 and 4-way x86 platforms) at all times, and is done by the hardware not the software, with a few minor exceptions like I/O and the page tables. But the only time that you need to modify the page tables in a way that requires you to update the other CPUs is when you are mallocing or freeing large chunks of memory, and that's an OS call (read: context switch) anyway.

      Also keep in mind that x86 Linux process switches are fairly low cost to begin with because there isn't that much x86 state to save in comparison to other ISAs that have a lot more architectual state associated with any given process. Also, a lot of effort has gone into making context switches as low-cost as possible since it is a large contributor to OS performance, so the process context switches are highly optimized.

      What I would consider more interesting is how to you schedule in a SMP with on a multithreaded processor design (like Intel's upcoming Hyperthreading Xeon, or some of the MIT research CPUs) where threads 'borrow' resources from each other. It's also called Shared Resource Multiprocessing (ShReMP). And while the virtual CPUs are symetric, they differ in "effective" computing power based on what is (or isn't) running on the hardware next to them.

      &nbsp - Mike http://pooh.asric.com/~mikem/index.php

    6. Re:What I would like to know. by Electrum · · Score: 2, Insightful

      Apache would be MUCH faster if the Apache developers would give up on the outdated and innefficient method of using a separate process per connection, and went with a non blocking I/O model like Boa, thttpd or Zeus. A non blocking I/O model is much better, from many standpoints. It places the I/O handling back in the kernel where it belongs, reduces memory usage and drastically increases speed.

      I've talked to a lot of people about this issue, and spent a lot of time reading through the Apache developer mailing list. The main reasons that come up is that Apache is meant to be correct first, fast second, and that it is harder for people to extend Apache because it makes the code more complicated. The first reason is invalid, because it's quite possible to make a web server correct and fast at the same time. The three web servers I mentioned prove this. And there are several more non blocking web servers out there. The I/O is a very small part. Given a proper design, everything else can easily work with that model. Extending it is not made any more difficult, because one would very rarely, if ever, modify the base I/O model. You have normal HTTP and HTTPS, and that's it. Anyone who would be extending that critical portition of the web server should be able to understand the conceptually simple concept of non blocking I/O, especially as it is explained very well in a variety of sources.

    7. Re:What I would like to know. by p3d0 · · Score: 2
      So on Linux, the fact that memory isn't shared between processes (hence CPU coherency isn't the same issue) often makes processes a substantial performance _win_.
      I think you're comparing apples and oranges. If you have two threads that are not modifying the same portion of the address space, there is no coherency traffic. OTOH, if they do modify the same pages, presumably it's because they need to communicate, which they would need to do with processes anyway.

      Thus, I still don't see any reason that processes are inherently more efficient than threads. Unless you think you can settle the shared-memory-versus-message-passing debate once and for all. :-)

      --
      Patrick Doyle
      I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
    8. Re:What I would like to know. by pthisis · · Score: 2

      I still don't see any reason that processes are inherently more efficient than threads.

      Right, that's why I said that for some things threads are better. But a lot of people pick threads over processes for reasons other than "I need to share all my memory". Which is bad. Choosing to use threads should be a conscious decision, made only when you've determined that e.g. processes with a small shared memory segment, message passing, or other approaches are worse.

      Sumner

      --
      rage, rage against the dying of the light
  9. Scheduling details: switching process != O(p) by sorbits · · Score: 1
    For the Linux scheduler, switching processes is O(p), where p is the number of currently running processes. This new scheduler switches processes in O(1) time.

    I haven't really worked with Linux, but despite my technical aversions toward this OS then I sure don't hope that switching processes is linear in time.

    An OS maintain a list of running tasks (processes) and each time a task has used up its timeslice (which is around one tenth of a second) or volunterely given up the processor (because it has no more work to compute), then the next task is taken from the list (and the previously running task is appended to the end, unless it volunterely gave up the processor) -- this is "task switching" and the principle is often referred to as roundrobin. Something that most should be able to implement as a constant-time operation.

    Often one would use a priority queue instead of a list, that is, the task each have a priority, and instead of adding tasks to the end of the list, they are inserted according to their priority (this can still be done in constant time by e.g. having an array of lists, where the index in the array is the priority).

    Using priorities you ensure that tasks with high priority always get the processor when they need it (e.g. a mouse driver) and tasks with low priorities to never interfer with the users work (like a batch job running in the background).

    The above is often called static scheduling, because the priorities are set once, and never touched. But this is often not desireable, because it's a tough job adjusting all those priorities, and perhaps not even possible on a multi user system.

    So what do we do? Well... this is where black magic comes into play. One simple solution could be to boost the priority of all the tasks which doesn't use their full time slice (so they'll have a higher priority next time they run, because we assume that they won't run for long).

    Many heuristics exist to try and make dynamic scheduling (priority rearranging) fair. Often these heuristics only rearrange priorities a few times pr. minute, based on statistics on how the tasks have behaved the last n seconds.

    Since this process involve all the running tasks, then it is in nature linear in time (but of course some tricks exist). But as said, this algorithm runs only a few times pr. minute, but the result affect the entire fairness/responsiveness of the machine. So time is probably better spent crafting a better heuristic than making it O(1) instead of O(n).

    1. Re:Scheduling details: switching process != O(p) by Anonymous Coward · · Score: 0

      Context switches are O(1) with respect to the number of runnable processes. Picking a process to run is O(p) as long as no real-time processes are runnable (which is the vast majority of the time).

    2. Re:Scheduling details: switching process != O(p) by sorbits · · Score: 2, Informative
      Picking a process to run is O(p) as long as no real-time processes are runnable (which is the vast majority of the time)

      ehm... most of the time there are no processes to run, they all wait for a message/interrupt. When one arrives, they are moved to the list (priority queue) of running/ready processes.

      The schedular always just pick the next in this list of running/ready processes, which is O(1). At no time does it need to look at the iddle processes, to pick one to run, as they don't need to be run... this is not Microsoft's cooperative multitasking...

  10. kpreempt by Anonymous Coward · · Score: 0

    In additional good news, Ingo and Robert are right now working on making the O(1) scheduler and kernel-preempt play nice together - suddenly, linux starts looking very attractive for BeOS/media-desktop junkies...

  11. Re:I hope Linus listens (and I'm not making a fool by pthisis · · Score: 3, Insightful

    Davide Libenzi's been shooting some holes in the architecture of this patch--he's also written a highly scalable multiqueue scheduler. He and Mingo are currently hammering out some ideas, almost certainly something will go into 2.5 but this may not be it.

    --
    rage, rage against the dying of the light
  12. Learning scheduler by adam613 · · Score: 2, Interesting
    Speaking of better schedulers, something I've been thinking about: What would happen if I made a scheduler that used a neural network to decide process priorities based on past behavior of the processes spawned by a specific program (how much time the process got last time, whether it gave up the CPU voluntarily, blocked for input, or got timer-interrupted, etc)?

    Each time a process became runnable, the NN could assign a priority, and the process would be placed on a priority queue (this isn't O(1), but it's better than O(p)). It seems to me like this would work; it would slow down wake_up_process() (i can't remember the exact name; i haven't looked at the scheduler since october) a bit, but the payoff as the NN got trained should make up for it...



    (if this is a terrible idea, tell me so before you moderate me down :) )

    1. Re:Learning scheduler by pthisis · · Score: 3, Interesting

      This is a terrible idea. The fact is, for 99% of applications the scheduler isn't a bottleneck. Even on Alan Cox's worst-case 8-way SMP machines the scheduler doesn't eat tremendous amounts of CPU; _way_ more than it should, but an approach like Ingo Molnar's or Davide Libenzi's will get rid of most of that overhead.

      But the real reason it's a terrible idea is because of the cache-line impact. At this level of the system, keeping the CPU cache intact and full of process data is critical to good performance. Otherwise when you switch tasks you wind up having to go out to main memory, which is sloooow. So compact code is more important than fast code in a lot of these situations, and a neural net like you describe is going to be enormous relative to the approaches of Davide and Mingo.

      And Linus rightly pointed out that as simple as the current scheduler is, it's basically the only part of the kernel that's pretty much the same code as it was 8 years ago, mostly because it does work very well for the vast majority of cases. Not to say that Mingo, Davide, Rob Love, et al aren't doing a huge boon, and on 4-way and bigger systems it's more of a concern, but e.g. the VM and I/O subsystems are areas that could reap far greater performance rewards for the most common cases (and the VM is getting a ton of eyes, and the I/O subsystem has been the focus of most of the early 2.5 work thanks to Jens Axboe and others)

      Sumner Hayes

      --
      rage, rage against the dying of the light
  13. Software in the Public Interest Donations by Bob_Robertson · · Score: 2, Insightful

    I wish SPI would get a PayPal account so I could make these "warm fuzzy" donations on the spur of the moment.

    Bob-

    --
    The Ludwig von Mises Institute. The reasoning individuals economics
  14. SMP Quality and Speed is a very GoodIdea(tm) by Bob_Robertson · · Score: 2

    Since Linux already whups Windows in so many other ways, I am looking forward to it beating Win's SMP integration and scaling. Yes, I know it's likely just FUD, but my memory of the last time I read a scaling test was that Win did CPU.gt.2 better.

    Clustering and SMP are different answers to different questions, but they both lead to multiplying your firepower without multiplying your cost. Add that to the issue of purchase cost for Win and Lin, and the differences become even more severe. Every effective Linux improvement multiplies Linux effectiveness overall.

    As an individual user, I have little use for SMP. Games run just fine on one CPU. So for now, I'm just a fan. We shall see what happens.

    BTW, is it just me, or is the fact that these two developers, with different ideas of "right", are working together to make something even better, lighting up other peoples days too?

    Bob-

    --
    The Ludwig von Mises Institute. The reasoning individuals economics
  15. O(1) and Grandma by cpeterso · · Score: 2, Funny

    Why do we want everyone's grandmother to be using Linux?

    My grandma refuses to use Linux because its scheduler is O(n) and its "goodness" function can blow her L1 cache. If Linus adopts Ingo's O(1) scheduler, she just might reconsider.

  16. Linus has included it in 2.5 by Shanes · · Score: 4, Interesting
    Linus has included Ingo's patch in 2.5.2-pre10

    Cool!

  17. Re:in USSR by Anonymous Coward · · Score: 0

    In US, you always find party. In USSR, party always find you!

  18. Re:Will this help... (not really) by amimjf · · Score: 1

    Sorry but this is not going to magicly make your system that much faster.

    The whole point of this patch is that it makes the switch time linar O(1) as opposed to exponetial O(p), most people not run machines with enough seperate processes (i only have 80, and im in a java dev sessio now), to see a difference.

    What it does mean is that the big Oracle / IBM (the people who wanted the improvement), database systems running on 8 / 16 CPU untis with 10,000s of database transactions will be /much/ quicker.

    -matthew

    --
    Since when did ignorance become a point of view ??
  19. Re:Will this help... (not really) by peter · · Score: 1

    > it makes the switch time linar O(1) as opposed to exponetial O(p)

    No, it makes it constant, instead of linear with the number of processes that are ready to run. Processes that are waiting for I/O (or otherwise sleeping) are not on this queue, and thus aren't slowing you down with even with the old scheduler. Your 80 processes are _not_ all running.

    Besides the algorithmic complexity improvement, Ingo says he's improved CPU affinity, and made fork()+exec() more efficient by running the child on the same processor.

    But as you say, the main improvement will be in large multiproc systems that are worked to the limit (so they have lots of _runnable_ processes). You won't, of course, see 600% improvement in Oracle performance or anything. That was only in synthetic benchmark. The 600% improvement is in something that takes up a fairly small (but not insignificant) part of a DB server's time.

    Another thing this should help is people who run distributed.net or SETI clients. They run at nice 19, and if I read Ingo's message correctly, such processes will be treated as SCHED_IDLE: That is, they will get _no_ CPU time while you're recompiling something, or compressing a vorbis audio file, instead of the usual 7% or so they get now, when a nice=0 process wants all the CPU it can get.

    Well, I'm impressed. I like how well he presented his work, it was pretty easy to understand.

    --
    #define X(x,y) x##y
    Peter Cordes ; e-mail: X(peter@cordes , .ca)
  20. Re:I hope Linus listens (and I'm not making a fool by Anonymous Coward · · Score: 0

    600% improvement on something that your machine probably spends very little time doing. It's not going to turn your 1GHz system into a 6GHz system...