Linux Gets O(1) SMP Patch As Late Christmas Gift
bodin writes: "Now that new-year's parties are over things are getting boring again. For those who want to see and perhaps even try something more complex, Ingo Molnar is
announcing this patch that is a pretty radical rewrite of the Linux
scheduler. This is big stuff!"
Hehe!
I liked O Xymoron's enthusiastic response:
On Fri, 4 Jan 2002, Ingo Molnar wrote:
> this is more than 6 million context switches per second!
Everyone knows scheduling is boring.
:)
Yours Sincerely, Michael.
For those of you who haven't read enough computer science to know what O(1) means, here is a short explanation.
The Big O-notation is a way to describe how the (asymptotic) execution time of an algorithm depends on the inputs to the algorithm. For instance, an algorithm that loops over n values is said to have the asymptotic execution time O(n) - it is proportional to the number of times the loop is executed.
Similarly, an algorithm that runs in constant time, i.e., that takes equally long to execute for 10 values and for 1000000 values, is said to be O(1). The execution time is proportional to 1.
For the Linux scheduler, switching processes is O(p), where p is the number of currently running processes. This new scheduler switches processes in O(1) time.
This means that even though the old scheduler might be fast for low numbers of running processes, it will take longer and longer timer to switch processes when the number of active processes grow. The new scheduler will switch processes just as fast for 2 processes or for 200 processes. Even though the new scheduler might be slower in when there are few processes, it will be faster when the number of running processes increase.
of myself)
I don't know if in 2.ed kernels Linus still likes the "small patch" idea. but this is pretty amazing. I am no kernel coder, but some of these tests showed 600% percent improvement and (seemed to me to be) impressive scaling. All the kernel gurus out there, what is the chance that this will make it into the kernel? (2.5)
What comes first, finding a teacher or becoming a student?
Before attempting to install this on my test box I'd like to know exactly what the performance inplications are to specific types of applications and services. For example I am extremely interested in improving JVM performance on a Linux box.
Any information or direction about this would be very helpful.
Wow. I read this guy's description of what he has done. I hope he's a teacher, because his explanation of those complex issues was a joy to read.
I have a BP6 dual-celeron Debian machine which already gives me the benefits of countless hours of volunteer time, including the SMP kernel and ReiserFS, along with dozens of free development tools. Now I see this guy working like a dog to tune the heck out of the scheduler for SMP machines, and I know that when I eventually run the 2.6 kernel, I'm again going to reap the benefits of his work.
It's almost enough to make me learn to hack the Linux kernel out of a sense of obligation.
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
By the way, since the links /and/ the URL's (as far as I can tell) in the post are broken, this should help:
:) but hey, my kernel compiled just fine.
e 8s chedO1.patch.gz
2 pr e8schedO1.patch.gz
:)
;)
:)
:-\
http://people.redhat.com/mingo/O(1)-scheduler/
I've created diffs between 2.5.1 and 2.5.2pre8 with the O(1) scheduler, and between 2.5.2pre8 and 2.5.2pre8 with the O(1) scheduler.
2.5.2pre8 actually patches pretty well with the original scheduler patch (drivers/char/serial.c.rej can be ignored, and you have to make a few changes to kernel/sched.c in order for it to patch correctly), but because it took me at least ten minutes of fiddling with sched.c I've decided to make a diff for 2.5.2pre8.
No guarantee that either of these works, though
# diff -ru linux-2.5.1 linux-2.5.2pre8schedO1|grep -v '^Only in '|gzip -f >/home/web/patch-2.5.1-2.5.2pre8schedO1.patch.g z
http://os.markbach.com:8080/patch-2.5.1-2.5.2pr
(396,961 bytes)
# diff -ru linux-orig linux |grep -v '^Only in ' |gzip -f >/home/web/patch-2.5.2pre8-2.5.2pre8schedO1.pat ch.gz
http://os.markbach.com:8080/patch-2.5.pre8-2.5.
(31,124 bytes)
Good luck to anyone who tries to use these
And no, I didn't patch in the kdev_t stuff from people/aeb on kernel.org because there's lots of kdev_t stuff in the Changelog for pre7 and pre8, so I decided to assume (yes, I know, assuming makes an ass out of u and me) I didn't need it... of course, when the system crashes after five seconds, maybe I'll change my mind
And if, for some odd reason, you can't connect on port 8080, just connect on port 80 and let's hope you're not blocked by @home's or my firewall.
Damn, I'm using too many smileys
--TheOrangeSquid Is it any wonder things seem so awry? We swim in a sea of confusion and don't have to think to survive
Am I the only person that thinks that every time I see his name?
The real reason threads can be faster than processes if all other things are equal is context switch time. Switching between threads that share all their page tables is just a matter of restoring all the registers. When switching between processes, you must flush the TLB also.
But when threads modify memory on SMP, they have to ensure cache coherency between CPUs. Processes can don't have this restriction, which is why processes can often be faster than threads. It all depends on your usage patterns and OS. On e.g. Solaris or Win32, threads are far faster at context switching than processes (mainly because processes are so heavyweight on those systems; Win32 also doesn't offer an API to efficiently implement multiprocess programs). On identical hardware, Linux's processes switches are faster than Solaris or Win32 thread switches and Linux thread switches aren't that much faster than process switches (beacause process switching is implemented in an efficient manner).
So on Linux, the fact that memory isn't shared between processes (hence CPU coherency isn't the same issue) often makes processes a substantial performance _win_.
This is aside from the fact that threads are extremely difficult to program properly--there's a reason that OSes spent all that time implementing protected memory (remember all the hoopla about that?) and threads throw that concept out the window, plus tend to lure the programmer into architectures that need nasty locking and synchronization.
That said, for some applications threads are the right answer. But using multiple processes, possibly with shared memory for a limited set of data, is generally a more maintainable solution that's easier to architect and implement. It certainly makes it explicit which data structures are meant to be shared, which a multithreaded solution does not.
Sumner
rage, rage against the dying of the light
Davide Libenzi's been shooting some holes in the architecture of this patch--he's also written a highly scalable multiqueue scheduler. He and Mingo are currently hammering out some ideas, almost certainly something will go into 2.5 but this may not be it.
rage, rage against the dying of the light
This is a terrible idea. The fact is, for 99% of applications the scheduler isn't a bottleneck. Even on Alan Cox's worst-case 8-way SMP machines the scheduler doesn't eat tremendous amounts of CPU; _way_ more than it should, but an approach like Ingo Molnar's or Davide Libenzi's will get rid of most of that overhead.
But the real reason it's a terrible idea is because of the cache-line impact. At this level of the system, keeping the CPU cache intact and full of process data is critical to good performance. Otherwise when you switch tasks you wind up having to go out to main memory, which is sloooow. So compact code is more important than fast code in a lot of these situations, and a neural net like you describe is going to be enormous relative to the approaches of Davide and Mingo.
And Linus rightly pointed out that as simple as the current scheduler is, it's basically the only part of the kernel that's pretty much the same code as it was 8 years ago, mostly because it does work very well for the vast majority of cases. Not to say that Mingo, Davide, Rob Love, et al aren't doing a huge boon, and on 4-way and bigger systems it's more of a concern, but e.g. the VM and I/O subsystems are areas that could reap far greater performance rewards for the most common cases (and the VM is getting a ton of eyes, and the I/O subsystem has been the focus of most of the early 2.5 work thanks to Jens Axboe and others)
Sumner Hayes
rage, rage against the dying of the light
Cool!