Slashdot Mirror


Robert Love Explains Variable HZ

An anonymous reader writes "Robert Love, author of the kernel preemption patch for Linux, has backported a new performance-boosting patch from the 2.5 development kernel to the 2.4 stable kernel. This patch allows one to tune the frequency of the timer interrupt, defined in 2.4 as "HZ=100". Robert explains 'The timer interrupt is at the heart of the system. Everything lives and dies based on it. Its period is basically the granularity of the system: timers hit on 10ms intervals, timeslices come due at 10ms intervals, etc.' The 2.5 kernel has bumped the HZ value up to 1000, boosting performance."

62 comments

  1. In FreeBSD by CounterZer0 · · Score: 3, Interesting

    This is actually an easy-to-tune kernel config variable. Quick and easy performance boosts to be had by all!

  2. Finally... by ActiveSX · · Score: 5, Funny

    An overclockable operating system. You wouldn't believe how long I've waited for this. Now where do I get a software heatsink?

    1. Re:Finally... by vidnet · · Score: 2

      I just direct my heat to /dev/null, and use a regular heat sink on my pci bitbucket accelerator card.

    2. Re:Finally... by frozenray · · Score: 1

      The famous German computer magazine, c't, actually featured a hardware accelerated null device, the "Hypertronics 82C997 ENUL" in their 4/95 issue (as an April fool's joke, of course).

      The article is not available online unfortunately, but some of the amused reactions of their readers are here (in German), and you can even find a picture of the gizmo (note the photoshopped activity LED).

      --
      "There are already a million monkeys on a million typewriters, and Usenet is NOTHING like Shakespeare." - Blair Houghton
  3. It doesn't improve performance. by Professor+Collins · · Score: 5, Informative
    One of the great paradoxes of computer science is that perceived performance and actual performance almost always come at a tradeoff. By raising the frequency of the timer interrupt, individual timeslices are shorter and the processor needs to make more context switches, resulting in less overall processing being performed. However, because these context switches occur more frequently, it appears to the user that apps are more responsive and fluid.

    To make a long story short, for number crunching machines, servers, and other applications which don't need much user interaction, larger timeslices are preferable because it doesn't matter how responsive the user interface is. For desktop systems, the timeslice can be decreased to improve the responsiveness of the user interface and give a better "feel" to the system at the expense of a minor performance loss. Being able to tune these parameters to meet your needs is one of Linux's great strengths.

    1. Re:It doesn't improve performance. by zenyu · · Score: 5, Informative

      One of the great paradoxes of computer science is that perceived performance and actual performance almost always come at a tradeoff. By raising the frequency of the timer interrupt, individual timeslices are shorter and the processor needs to make more context switches, resulting in less overall processing being performed.

      This is not quite true. If you only have a single program running just one thread this is true. You have to do a context switch at each tick to Ring 0 and back, which takes maybe 500 cycles, or 1/20 microseconds on a 1Ghz machine. Do this 1000 times and you've lost 50 microseconds of processing time.

      BUT once you have more than one program or thread running the situation is different. Say you have one thread running flat out and another that needs to do 100microseconds of work. With 100 ticks per second you will lose 5 usec to context switching and 9900 usec to waiting for the next context switch. With 1000 ticks per second you lose 50us to context switching and 900 usec to waiting for the next context switch. So you get more work done.
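
      As a quick check of the arithmetic above, here is a small Python sketch. It takes the figures in this comment as given (1/20 usec per tick, a 100 usec job); the function names are made up for illustration and nothing here measures a real kernel.

```python
def tick_overhead_us(hz, cost_per_tick_us=0.05):
    """Microseconds per second spent entering and leaving the kernel on ticks.

    cost_per_tick_us is the comment's assumed figure, not a measurement.
    """
    return hz * cost_per_tick_us


def wait_for_next_tick_us(hz, work_us=100):
    """Rest of one tick period once work_us of useful work is accounted for.

    This is the comment's simple model: tick period minus the job length.
    """
    period_us = 1_000_000 / hz
    return period_us - work_us


for hz in (100, 1000):
    print(hz, tick_overhead_us(hz), wait_for_next_tick_us(hz))
```

      The point of the comparison is that the overhead column grows by tens of microseconds while the waiting column shrinks by thousands.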

      For someone who always runs at 100% processor utilization 1000 ticks per second is probably a fine setting, since you are probably just running one thread 99% of the time and once in a while writing logs to disk or responding to some other events. If you are more like me and run at 1% of your processor utilization most of the time, with the 100% utilization only happening when you compile, so you would rather be able to continue to use the computer than save 1ms on the 5 minute compile, then an even higher value might make sense. 10000 maybe, assuming there aren't limitations in the kernel that prevent the higher value.

      Disclaimer: I've been applying Love's patches for a while now. They make a real difference in the responsiveness of X, esp if you're running stuff like Mozilla or Gnome/KDE on your box. I haven't applied it on any servers cuz the preempt patch is not quite stable.

    2. Re:It doesn't improve performance. by jquirke · · Score: 4, Informative

      Actually the last time I checked, the kernel had to be recompiled to change the HZ variable. Not trolling or anything, but it's been pointed out FreeBSD has this as a sysctl parameter. Hopefully Linux will offer this (correct me if I'm wrong!).

      Also, you don't necessarily have to increase the clock frequency by a whole order of magnitude. A fair compromise could be 200Hz, or 250Hz, or 500Hz. A typical workstation running X-Windows could use 250 or 500, for example.

    3. Re:It doesn't improve performance. by pyman · · Score: 2, Interesting
      Interestingly enough, that is basically the only major difference between NT4 Server and NT4 Workstation. I believe there was also a flag in the registry somewhere which apps could query to determine whether they were on a Server or Workstation. If you changed this setting, it inexplicably was reset when you next viewed the registry. After some research I discovered there is a tiny thread in the kernel specifically for checking and resetting that particular registry setting...

      NT Server has a larger timeslice and more caching for some system functions, while NT Workstation has a smaller timeslice with caching geared for user apps.

      I know NT is old technology, and I'm not sure if this still applies to the latest MS offerings. Hardly justifies the price difference between Server and Workstation!

      --
      a ^= b; b ^= a; a ^= b;
    4. Re:It doesn't improve performance. by Piquan · · Score: 2, Redundant

      I disagree with your analysis.

      If a process isn't doing processing, that's because it's blocked in the kernel. (Q: What does a HLT do in userland?) As soon as the kernel puts a process on a wait queue, it reschedules. So you don't have any loss 'waiting for the next context switch'; that's just time that another process is running, or if nothing has anything to do, that the kernel halts the processor.

      Note: I haven't studied how process scheduling is handled under Linux, but I can't imagine any OS that wouldn't do what I said here... or at least, I can't imagine one that would halt the processor after a process blocks, while it waits for a timer interrupt to schedule the next process.

      Okay, maybe one.

    5. Re:It doesn't improve performance. by Anonymous Coward · · Score: 0

      A quantum (time slice) doesn't have to be used up.

      >9900 usec to waiting for the next context switch.

      The quantum will only be used up, when the process requires it.

      When a thread "idles", it either blocks on a resource or some synchronisation object (mutex/wait condition) or maybe a timer event.

      When a process blocks, the scheduler comes into play and the next eligible process gets the CPU context. The rest of the quantum will be attributed to the old scheduled process and gives it a priority boost.

      The only situation, where your description applies is when the thread would be polling some variable.
      In that case, the programmer is plain stupid, (or he better really knows, what he is doing) and no modification of the OS could improve the performance.

    6. Re:It doesn't improve performance. by Piquan · · Score: 3, Interesting

      Okay, after re-reading the article, I did see one performance gain this could get: the case of select/poll. (This is blatantly stated in the article; I shot my mouth off before reading the article closely enough.)

      Under BSD, as I understand it (I don't have the Daemon Book handy, but a quick reading of the source seems to agree), select will put a process on the wait queue until something arrives. During a select, the kernel does nothing with the process-- timer or not.

      From the look of the article, under Linux, select actually does some sort of polling at or related to HZ. It may be on some sort of almost-run queue: a selecting process gets allocated timeslices; on its slice, it polls and either returns to userland or goes back onto the almost-run queue. I don't have time to verify that-- I don't know my way around the Linux kernel-- but it seems to be reasonable, based on the article. Can I get a Linux developer to confirm/deny my guess?

      So it seems that in the case of something selecting, primarily on an otherwise idle or near-idle system, increasing HZ may improve performance. This situation is less common than it used to be in today's world of multithreaded servers (since each thread typically blocks only on a single fd), but it's still potentially significant.

    7. Re:It doesn't improve performance. by Anonymous Coward · · Score: 0

      Can't confirm it, but I do know that the author of the preempt patch quite rightly hates the select/poll mechanism, and has proposed a saner replacement, that would seriously up linux's web/file-serving performance (provided apache knew about it).

      Another interesting hack is IO-Lite, which is a proposal to add new file system calls that use sort of read-only buffers in most situations, and unify caching across the whole system. Can increase server performance by 40-80% (!) - but would represent a change to the very core semantics of a Unix box - Unix-FD-IO and stdio and sockets would be backward-compatibility wrappers on top of a new IO subsystem...

    8. Re:It doesn't improve performance. by Kopretinka · · Score: 2, Informative
      AFAIK, the parent is not quite true. 8-)

      A reschedule does not happen only on the timer tick (100 or 1000 times a second depending on HZ setting), it happens on a number of occasions, timer tick being one of them. The other ones remove the concerns zenyu seems to be having:

      1. when a process sleeps - when a process calls the kernel in order to sleep, the kernel reschedules because sleeping can be handled using normal timer and in the meantime other processes may work
      2. when a process yields - when a process says that it's done its stuff in this tick, whatever that means
      3. when idle, on any interrupt - when no process wants to work, the first one that wants to work is scheduled right away

      The third point may seem a little weird, but a process can only become willing to do something as a result of some interrupt - a timer if the process was sleeping for a given amount of time; an I/O interrupt if the process handles the keyboard or the mouse. In any case, interrupts are handled by the kernel and so if a process is to wake up from its sleep or if a process gets something in some stream on which it is waiting (stdin on keyboard interrupt, socket on network card interrupt etc.), that process is just scheduled to wake up and work.

      So on an idle machine the HZ does not really have much impact, and on a utilized machine the smoothness of process interaction (like window manager vs. X server) increases with increased HZ but this also increases the overhead.

      Hope it's clearer.

      --
      Yesterday was the time to do it right. Are we having a REVOLUTION yet?
    9. Re:It doesn't improve performance. by p3d0 · · Score: 2
      Say you have one thread running flat out and another that needs to do 100microseconds of work. With 100 ticks per second you will lose 5 usec to context switching and 9900 usec to waiting for the next context switch. With 1000 ticks per second you lose 50us to context switching and 900 usec to waiting for the next context switch. So you get more work done.

      This is not true. When a process/thread has nothing to do, it does not just sit around waiting to be preempted. By definition, if it has "nothing to do", that is because it has yielded the CPU. For instance, if the process does 100us of work and then makes an IO call, it will immediately yield to another process.

      Sure, there are some poorly-written apps that do excessive busy-waiting, but they are the exception, and there's not much the OS can do about it anyway.

      The only benefit of increasing HZ is latency.

      <RANT>
      BTW, I'd just like to mention a pet peeve of mine. In the article, they mention that "RedHat shipped their 8.0 kernel at HZ=512". There is no reason whatsoever that this should be a power of two, so I believe it should not be. Powers of two have a magical status in the computer world, but I think you should not give your code this kind of connotation unless you have actually decided that a power of two is the best choice. Otherwise, you should pick a number that reflects the ad-hoc nature of your choice. Powers of ten reflect this better than powers of two. Thus, all else being equal, they should have chosen 500 over 512.
      </RANT>

      --
      Patrick Doyle
      I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
    10. Re:It doesn't improve performance. by Anonymous Coward · · Score: 1, Interesting

      This setting is still there in W2K
      System Control Panel -> Performance -> Optimize Performance for 'Applications' or 'Background Services'.

      NT 4.0 Server's default was oddly "Applications"!

      NT also has a priority boost for interactive apps. However, if the GUI is 'dead' for a long period of time (such as on a server), it will stop doing this. That's why if you walk up to an even lightly loaded W2K server, it's got that X11-style laggy mouse that your workstation never has.

      AFAICT, there's no real operational difference between "Server" and "Workstation" at least for NT4 and 5, although at least W2K Server has some sane non-workstation default settings. The kernel thread/registry entries are for licencing purposes only.

    11. Re:It doesn't improve performance. by zenyu · · Score: 2

      The only situation, where your description applies is when the thread would be polling some variable. In that case, the programmer is plain stupid, (or he better really knows, what he is doing) and no modification of the OS could improve the performance.

      Like he knows that he needs to poll to get decent response times when there are no interrupts to wake the process up a quarter timeslice from now.

      There are examples that are silly, like the lvcool user space idle loop I run on my AMD laptop cuz the kernel doesn't halt the processor in low power mode. I need to kill it to play a DVD, but if I don't run lvcool the fan runs constantly and the CPU still gets very hot. This should be in the kernel (and I've read is in 2.5 now as part of ACPI).

      Then there are examples that aren't going to change unless the time slices get >muchdo exist even in some well written code. And sometimes it will spin until another thread gets control and releases some resource.

    12. Re:It doesn't improve performance. by pthisis · · Score: 4, Informative

      From the look of the article, under Linux, select actually does some sort of polling at or related to HZ. It may be on some sort of almost-run queue: a selecting process gets allocated timeslices; on its slice, it polls and either returns to userland or goes back onto the almost-run queue. I don't have time to verify that-- I don't know my way around the Linux kernel-- but it seems to be reasonable, based on the article. Can I get a Linux developer to confirm/deny my guess?

      Deny. It's actually the idle timeout that's affected by HZ. select() itself doesn't poll at all, and e.g. a select() call with an infinite timeout will be completely unaffected by HZ (select will wake up when the network gets an interrupt resulting in readable data/writeable buffer space).

      Example of the timeout effect: a game could have a select() loop that waits on user input, but also has a timeout argument so that it can go ahead and update the screen, do enemy AI, etc. The kernel, in absence of interrupts, schedules on HZ boundaries. Suppose that you as a programmer put a 1/60 second timeout argument in the select loop (intending to update the screen with a 60 Hz refresh and figure out where everything's moving). If you call select() right after a HZ boundary, you could find yourself waiting until 1/50 second passes even on an idle machine with HZ=100; after 1/100 sec, your timeout hasn't expired yet. Next chance to schedule is at 2/100 (1/50) sec.

      With HZ=1000, you'll schedule no more than 1/1000 sec after the 1/60 sec boundary (on an idle machine).
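
      The rounding effect can be sketched in a few lines of Python. This is a toy model of the example above (idle machine, wakeups only on tick boundaries), not the kernel's actual timeout code; `wakeup_delay_ms` and its parameters are hypothetical names.

```python
import math


def wakeup_delay_ms(timeout_ms, hz, offset_ms=0.0):
    """Time from the call until the first tick at or after the timeout expires.

    offset_ms is how long after the previous tick the call was made.
    Assumes nothing but the timer can wake the caller.
    """
    tick_ms = 1000.0 / hz
    expiry_ms = offset_ms + timeout_ms           # measured from the previous tick
    ticks_needed = math.ceil(expiry_ms / tick_ms)
    return ticks_needed * tick_ms - offset_ms


frame_ms = 1000.0 / 60                           # the 1/60 sec timeout
for hz in (100, 1000):
    delay = wakeup_delay_ms(frame_ms, hz)
    print(hz, delay, delay - frame_ms)           # HZ, actual delay, overshoot
```

      In this model the overshoot is always less than one tick period, which is why shrinking the tick from 10 ms to 1 ms matters for a 16.7 ms timeout.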

      This example is really simplified; a real-life app would adjust for scheduling creep by keeping track of wall-time. But the same concept, with more complicated apps, can cause faster HZ ticks to give you better CPU utilization (especially in e.g. video editing apps and such) because you get around to using the CPU closer to when you want it.
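
      A minimal sketch of the wall-time adjustment, in Python. It uses a plain sleep rather than select() to keep it short, and `run_frames` is an illustrative name, not part of any library. The clock and sleep functions are parameters only so the loop can be exercised with fakes.

```python
import time


def run_frames(n_frames, frame_s, step, clock=time.monotonic, sleep=time.sleep):
    """Call step() n_frames times, aiming at absolute deadlines.

    Every deadline is computed from the start time, so a late wakeup
    shortens the next sleep instead of delaying all later frames.
    """
    start = clock()
    for i in range(1, n_frames + 1):
        step()
        remaining = start + i * frame_s - clock()
        if remaining > 0:
            sleep(remaining)


if __name__ == "__main__":
    # Three frames at roughly 60 per second; step just prints a timestamp.
    run_frames(3, 1 / 60, lambda: print(time.monotonic()))
```

      With a fixed relative sleep, each oversleep would add to the total; with absolute deadlines the error stays bounded by a single oversleep.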

      The preempt kernel is an even better example of where decreasing latency can increase throughput, sometimes significantly. There you can really get around to dealing with I/O quickly, keeping CPU saturated (and saturated with cache-warm data) and benefiting things like heavily loaded web servers just as much as sound editing stations.

      Sumner

      --
      rage, rage against the dying of the light
    13. Re:It doesn't improve performance. by Piquan · · Score: 2

      Okay, that makes much more sense, thank you!

      I don't yet see why the timeout of select would be such a big deal (outside of a few specific cases), but I'll have to think about it, and your examples, more carefully.

      FWIW: BSD has the same select setup as you described.

      So here's my thought: how expensive is it to reprogram the timer chip? Would it be possible to adjust it dynamically to create perfect granularity in sleep/select?

    14. Re:It doesn't improve performance. by pthisis · · Score: 2

      FWIW: BSD has the same select setup as you described.

      Yeah, pretty much every Unix has interrupt-driven returns for the non-timeout case, anything else would be pretty bogus--though some systems (e.g. Linux 2.5.x) do interrupt mitigation under high load, but that's more of an "above and beyond" thing. The timeout case is handled differently on several Unices.

      So here's my thought: how expensive is it to reprogram the timer chip? Would it be possible to adjust it dynamically to create perfect granularity in sleep/select?

      There is a tickless Linux implementation.

      I can't find the home page at the moment, but see e.g.
      http://www.uwsg.iu.edu/hypermail/linux/kernel/0104.1/0137.html

      There are a lot of other ways of dealing with this, and tickless has some negative attributes I don't fully understand (among them is that it's not portable to older hardware, and there is some overhead to programming timer interrupts). I think the nanosecond kernel patches (which are starting to go into 2.5) address the select/sleep granularity issue in a different way but I'm really fuzzy on the details.

      Sumner

      --
      rage, rage against the dying of the light
    15. Re:It doesn't improve performance. by pthisis · · Score: 2

      There's a better link at LWN explaining the approach and drawbacks. It links to the high-resolution timers project (Anzinger's), which I believe is going into 2.5.

      Sumner

      --
      rage, rage against the dying of the light
    16. Re:It doesn't improve performance. by Piquan · · Score: 1

      Terrific, thanks! The IBM project it discusses sounds a lot like my half-verbalized idea. I'll have to delve deeper into this idea, and what they've done so far.

    17. Re:It doesn't improve performance. by pthisis · · Score: 2

      Note that while the list of drawbacks is only addressed briefly, increased schedule() overhead and increased system call overhead are potentially large drawbacks.

      Also, after further investigation the Anzinger solution is _not_ in 2.5.x yet; Linus has looked at the patch, asked for clarification, and Anzinger recently replied with an updated patch. Search linux-kernel archives for "high-res-timer" or "POSIX timer" patches for more info.

      Sumner

      --
      rage, rage against the dying of the light
    18. Re:It doesn't improve performance. by frozenray · · Score: 1

      Some information about the way NT handles timers can be found at Sysinternals, here and here (Quantums).

      --
      "There are already a million monkeys on a million typewriters, and Usenet is NOTHING like Shakespeare." - Blair Houghton
    19. Re:It doesn't improve performance. by 10Ghz · · Score: 2

      "Not trolling or anything, but it's been pointed out FreeBSD has this as a sysctl parameter. Hopefully Linux will offer this (correct me if I'm wrong!)."

      You can do that in Linux too.

      --
      Lesbian Nazi Hookers Abducted by UFOs and Forced Into Weight Loss Programs - -all next week on Town Talk.
  4. I think RedHat did this... by GreyWolf3000 · · Score: 5, Informative
    ...since one of the biggest criticisms of X is how choppy window movement is due to the networked architecture of X (a signal gets sent to the server, the server responds, etc). When the timeslices are reduced, the "lag" gets significantly decreased, since the signal gets processed sooner, the server gets the message sooner, the server can report back sooner, etc.

    I tried recompiling the stock RedHat kernel, and sure enough there was an option in there to increase the HZ for the internal timer.

    --
    Slashdot: Where people pretend to be twice as smart as they really are by behaving like children.
  5. Moore's law by tunah · · Score: 2

    Note that although that looks like a tenfold change, by the time 2.6 is released processing power will have doubled about twice since 2.4, so the change is equivalent to running a 2.4 system with HZ=250 instead of 100.

    --
    Free Java games for your phone: Tontie, Sokoban
    1. Re:Moore's law by WolfWithoutAClause · · Score: 4, Informative
      It's not quite that simple though. It's more tied to memory speed. The processors are improving at a faster rate than the memory is- and this clock tick is more related to memory speed.

      The reason is that across a scheduling tick the processor's cache gets flushed and reloaded. This means that you end up doing a burst of memory reads, and that will dominate if the clock tick is too short.

      --

      -WolfWithoutAClause

      "Gravity is only a theory, not a fact!"
    2. Re:Moore's law by ealar+dlanvuli · · Score: 3, Insightful

      Timeslices didn't decrease in said time, those have been pretty constant for a while.

      I seriously doubt we are going to be needing 1/10th second slices for quite a few years, and by that time I expect the kernel to run something in idle time to auto-tune the slices for my current workload average. Remember the higher HZ only improves "responsiveness", it actually decreases system performance computation wise. There is a specific number that is best for every system at any particular time, and going above or below that number hurts performance.

      --
      I live in a giant bucket.
    3. Re:Moore's law by Anonymous Coward · · Score: 0

      You've missed the point, the time-slices are generated by an external clock interrupt every 10ms. 10ms will still be 10ms regardless of processor speed. More instructions will be able to be processed in the same time slice, but "responsiveness" will stay the same because 10ms (or 1ms or 100ms) will have to pass before the next process gets control.

    4. Re:Moore's law by zenyu · · Score: 3, Informative

      The reason is that across a scheduling tick the processor's cache gets flushed and reloaded.

      Whoa! What architecture is that!

      That just doesn't sound right. The register files get flushed (well, swapped), but if that 2 meg cache got flushed on every context switch there wouldn't be much point in having it at all. You can get cache thrashing if too many cache hungry programs are running simultaneously, but that's why you get a bigger cache if you run lots of those programs: so that their working set is saved across context switches.

      Perhaps you mean the L1 caches? They can get tossed out cuz it can only hold a few inner loops and a few small working sets at a time anyway, but all that stuff should still be in the L2 cache and get loaded very quickly into those puny L1 caches, the L1 data cache is practically a register file anyway on P4's, 64 bit moves to/from them happen in a cycle...

      Those L2->L1 moves might start to affect you at 1,000,000 ticks per second, but no one is proposing that, right? Even so in a typical environment the other context is just the scheduler which I can't imagine filling the L1 cache... It's not that complicated on a mostly idle machine. (Quick & Dirty schedulers have been written, some which looked through the entire process list. Erm, but on my machine there are less than 100 processes right now, still not so bad for L1 ;)

      Anyway I think 1000 is just fine, if you're doing real-time music synthesis on lotsa channels a larger number might be better. Someone in Europe is working on a music distro, so maybe they will discover that 8000 is the magic number for 16 channels at 48 kHz on a P4 at 2 GHz.

      It would be neat if someone came up with metrics so that the tick was set so that 99.999% of the time the sound systems got their slices once every 500 usec but otherwise the timeslices were as large as possible. Then you could just tune that 500 usec thing, make it longer if you're on a 386, shorter if you really need more than half millisecond timings. I guess any program that needed frequent time slices could write to some proc file how much more often it should be called, or if it could afford to be called less often. For example 1.2 if it wants to be called more often, 0.8 if its time needs were met. The kernel would only have to ensure all the numbers it got were less than 1.0, and if the largest one were less than 0.95 it could even afford fewer time slices. The kernel might also want to ensure through process accounting that the time sensitive processes never got more than a certain percentage of the cycles available even if it meant they got called less often. This is to prevent a denial of service where you just always write 10 to that proc file whenever you get run so the time tick grows until you spend all your time in the scheduler. It might also want to set a floor, so that a human can interact with the machine. Ticks should never be less than say 10 for instance on a PC (or 250 if it's my machine). Though for some special purpose interstellar Linux probe you might want to sleep for a whole second at a time before checking your direction once on your way so a tick of 1 would be acceptable once out of your solar system. (You still want 64 bit uptimes for your interstellar probe; it would be so embarrassing if it arrived and the aliens were like, "Woah this species can't develop an operating system with more than 3 day uptime for a space probe that took like 40 years to get here, what l0s3rs!")
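
      One possible reading of that feedback idea, as a Python sketch. Everything here is hypothetical: no such proc file exists, `retune_hz` is an invented name, and the fixed ceiling is only a crude stand-in for the process accounting mentioned above.

```python
def retune_hz(hz, reports, floor=10, ceiling=10_000, slack=0.95):
    """One adjustment step for the tick rate.

    reports holds the factors written by time-sensitive processes:
    above 1.0 means "call me more often", below 1.0 means "my timing
    needs were met". The rate is scaled by the largest factor when
    someone is unhappy (>= 1.0) or when everyone has room to spare
    (largest < slack); otherwise it is left alone.
    """
    if reports:
        worst = max(reports)
        if worst >= 1.0 or worst < slack:
            hz = hz * worst
    # Clamp so a process that always asks for more cannot push the rate
    # without bound, and so the machine stays usable for a human.
    return round(min(max(hz, floor), ceiling))


print(retune_hz(1000, [1.2, 0.8]))   # one process wants more frequent slices
print(retune_hz(1000, [0.8, 0.7]))   # everyone satisfied with room to spare
print(retune_hz(1000, [0.97]))       # satisfied, but not enough slack to back off
```

      Repeating the step each time the reports are refreshed would let the rate settle where the most demanding process reports a factor just under 1.0.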

    5. Re:Moore's law by WolfWithoutAClause · · Score: 4, Interesting
      Whoa! What architecture is that!

      All of them AFAIK.

      That just doesn't sound right.

      Well, it is. Deal ;-)

      The problem occurs when the memory management unit gets modified to maintain the virtual memory 'illusion'. Then you have to flush the caches to maintain consistency. Of course it doesn't happen on every clock tick, you hope.

      That means that all the caches above the memory management unit need to get flushed. This includes the program cache; and any other data too.

      I did a quick check on the web for this, but I haven't managed to find a good reference to where the MMU is placed in the different architectures yet.

      Anyway, that's one of the main reasons the OS scheduling isn't shorter, but any decent OS has to do quite a bit of dorking around at that time.

      --

      -WolfWithoutAClause

      "Gravity is only a theory, not a fact!"
    6. Re:Moore's law by zenyu · · Score: 2

      any decent OS has to do quite a bit of dorking around at that time.

      Sure, but flush the whole cache? The virtual memory argument justifies flushing the TLB cache if there is an actual switch to another running process. If it's just to the scheduler and back doesn't that have a valid mapping in any process (that whole reserve 1G for the OS out of the 4GB directly addressable must be for this purpose right?) But while I'm not familiar with the actual implementation of these chips I can't see why the cache wouldn't just be addressed by physical locations in memory, hence no need to invalidate their data, just because you change their virtual addresses.

      I'm not an Intel expert, but I know they have a GDT and LDT. That is a Global Descriptor Table and a Local one so the scheduler should be able to use the global one while the application uses the local one. I actually have the manuals so I looked but it's a bit esoteric. What I found that supports the TLB flushing is that whenever you load a new LDT you invalidate all the local TLB entries. You can have over 8000 entries in a LDT, but the OS needs to use one for each user level process in order to protect an application's memory from other applications. So if you're Amiga OS you just use the GDT for the kernel and an LDT for your apps, but if you're Linux each app gets its own LDT. The Pentium 4 has a PGE flag that can be used to prevent flushing frequently used tables. So you could prevent flushing the entries if you had some user-level app that was run frequently enough to get special treatment.

      I'm still not convinced the actual L2 & L3 caches get flushed, esp since you can even avoid TLB flushes. The TLB is small, which is why you would want to flush it before running a different process, the caches are relatively big...

    7. Re:Moore's law by jrstewart · · Score: 1

      Nope. Recent processors cache physical memory so you don't flush the cache on a context switch.

    8. Re:Moore's law by jovlinger · · Score: 2

      Given the speed of processors, how likely is a line in the cache to survive a context switch, even if it weren't flushed explicitly? I would imagine fairly small. (I'm punting on which L cache I mean: too lazy to think hard about it.)

      However, I've always wondered if there was a performance win to multiple threads running in the same memory space as compared to multiple processes, for this very reason.

      Anecdotally no: I spoke to the BeOS guys at a conference back in the days of wanting every cycle you could get, and they didn't give threads from the same process as the previous thread any higher probability of running next, which would be the natural thing to do if it were a performance win.

    9. Re:Moore's law by WolfWithoutAClause · · Score: 2
      Given the speed of processors, how likely is a line in the cache to survive a context switch, even if it weren't flushed explicitly? I would imagine fairly small.

      It had better be astronomically small otherwise user programs will gradually screw up; and I don't think it is that small in fact.

      --

      -WolfWithoutAClause

      "Gravity is only a theory, not a fact!"
    10. Re:Moore's law by WolfWithoutAClause · · Score: 2

      Ok. My bad. Guess this enables the change.

      --

      -WolfWithoutAClause

      "Gravity is only a theory, not a fact!"
    11. Re:Moore's law by j3110 · · Score: 2

      I don't think the guy knows what he's talking about :) You never have to flush the TLB either, it just won't be very useful to the next program. I'm 99% sure that the cache is set associative on the physical address of RAM, not the virtual address. The TLB was invented (AFAIK) to help mostly with pre-cache translations, so that the processor isn't waiting on translations before it can get to the cache. This may not be accurate for L1 though.

      --
      Karma Clown
  6. Good for streaming media by Anonymous Coward · · Score: 1, Informative

    My first thought is, "It's about time..." FreeBSD has had this for ages, and it struck me as strange that Linux was nailed to HZ=100 when I started porting some apps over.

    Among other things, streaming media is an important beneficiary of this change. Let's say you have a medium-bitrate video stream (about 2.5 to 5 megabits). That means that your packets should be spaced about 2 to 4 milliseconds apart. This is easy to schedule when your system has a 1 millisecond granularity, but is a disaster when your clocks are 10 milliseconds apart -- your packets end up going out in clumps. Your 100bT network may not care either way, but if you are pushing video over ADSL, 802.11b, or ATM, you may find your packets getting lost along the way.
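
    A back-of-the-envelope sketch of that arithmetic, in Python. The 1250-byte packet size is an assumption (it is not stated above; it is simply the size that reproduces the 2 to 4 ms spacing), and the function names are made up for illustration.

```python
def packet_interval_ms(bitrate_bps, packet_bytes):
    """Time between packets, in ms, for an evenly paced stream."""
    return packet_bytes * 8 * 1000 / bitrate_bps


def packets_per_tick(bitrate_bps, packet_bytes, hz):
    """How many packets come due within one timer tick of 1000/hz ms."""
    tick_ms = 1000 / hz
    return tick_ms / packet_interval_ms(bitrate_bps, packet_bytes)


PACKET_BYTES = 1250  # assumed packet size, not taken from the post

for bitrate in (2_500_000, 5_000_000):
    for hz in (100, 1000):
        print(
            f"{bitrate / 1e6:.1f} Mbit/s, HZ={hz}: "
            f"interval {packet_interval_ms(bitrate, PACKET_BYTES):.1f} ms, "
            f"{packets_per_tick(bitrate, PACKET_BYTES, hz):.2f} packets due per tick"
        )
```

    Under those assumptions, a sender paced by kernel timers at HZ=100 has 2.5 to 5 packets coming due per tick, so they leave in bursts; at HZ=1000 fewer than one packet comes due per tick.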

    1. Re:Good for streaming media by DeadInSpace · · Score: 2, Insightful
      That means that your packets should be spaced about 2 to 4 milliseconds apart. This is easy to schedule when your system has a 1 millisecond granularity, but is a disaster when your clocks are 10 milliseconds apart -- your packets end up going out in clumps. Your 100bT network may not care either way, but if you are pushing video over ADSL, 802.11b, or ATM, you may find your packets getting lost along the way.
      That's not true.
      When an application sends data over the network, it does a send() (or possibly a write()) on a socket. These are system calls, so the CPU switches context to the kernel, and the data sent by the program is placed in the kernel network buffers. Note that this happens immediately, without waiting for another timeslice.
      Then the kernel sends as much as possible (depends on the buffer size on the network card itself) of the data to the network card (after slapping on IP and TCP headers), after which the kernel returns to the application.

      Now comes the difference: you suggest that when the network card is done sending the data, it'll have to wait for the next timeslice (because then a context switch to kernelspace occurs and the kernel can do some work), but this is not true!
      When the network card is done sending the data, it immediately generates an interrupt (what do you think IRQs are for?). On interrupt, the CPU switches context to the kernel, and the kernel (still having the data to be sent in the network buffers) can immediately replenish the buffer on the network card, allowing packets to follow very closely on each other, regardless of timer granularity.


      By the way, somewhat modern network cards can burst packets. That is, they can receive a whole batch of packets from the kernel, which they will then send at the appropriate speed of the medium, so that not every packet will generate an interrupt. And that's a good thing (tm), because high interrupt loads (think towards 100,000 interrupts/sec for gigabit - without jumbo frames and bursts) are performance killers.
  7. Never thought I'd say this... by Wonko42 · · Score: 1

    ...but Windows had this way back in '95. Ouch.

    1. Re:Never thought I'd say this... by Anonymous Coward · · Score: 0

      Can you prove it? As I recall, Windows would always say "you must run this program in MS-DOS mode" any time something tried to change the system timer tick rate from the default 18.2Hz.

    2. Re:Never thought I'd say this... by WolfWithoutAClause · · Score: 1
      Yeah, but we're talking about a real O/S here. One of the reasons '95 crashed so much was because it didn't have virtual memory. Therefore the OS didn't have to page out all the memory at the context switch tick and they could afford to tick more often, because the costs were lower.

      I think they more than made up for it in reboot time ;-)

      --

      -WolfWithoutAClause

      "Gravity is only a theory, not a fact!"
  8. because 1/50th second minimum select timeout sucks by truth_revealed · · Score: 1

    The only benefit of increasing HZ is latency.

    Presumably you meant "The only benefit of increasing HZ is decreasing latency" which is not a bad thing unto itself. Most people run interactive desktop applications, not scientific number crunching jobs for days at a time.

    Having a minimum granularity of 1/50th of a second for a select() when HZ=100 really sucks, quite frankly.
    Music players and animation programs have to resort to busy wait loops to get good response and tie up all CPU in the process. This is completely unnecessary in a modern OS.
    It's 1/50th, not 1/100th, of a second with HZ=100 because, given the way POSIX defines select(), you have to wait for two jiffies at a minimum, according to Linus.
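
    Taking the two-jiffy floor above at face value (it is the claim in this comment, not something verified here), the minimum timed wait at a given HZ works out as in this small Python sketch; the function names are illustrative only.

```python
def jiffy_ms(hz):
    """Length of one timer tick (jiffy) in milliseconds."""
    return 1000 / hz


def min_timed_wait_ms(hz, min_jiffies=2):
    """Shortest timed wait if the kernel always rounds up to min_jiffies ticks.

    min_jiffies=2 is the figure claimed in the comment, assumed here.
    """
    return min_jiffies * jiffy_ms(hz)


for hz in (100, 500, 1000):
    print(f"HZ={hz}: jiffy {jiffy_ms(hz):.0f} ms, floor {min_timed_wait_ms(hz):.0f} ms")
```

    At HZ=100 the floor is 20 ms, which is the 1/50th of a second mentioned above; at HZ=1000 it drops to 2 ms.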

    Anyway, HZ > 500 sure as hell is better than HZ=100.
    A HZ-less kernel with on-demand timer scheduling would be much better, though. IBM has such a kernel patch for their mainframe version of Linux to improve responsiveness when hundreds of Linux VMs are running concurrently.

    Pity about the USER_HZ = 100 thing to accommodate all the borken programs that pick up HZ from the linux kernel header file and assume it is a) constant, or worse yet b) 100.
    Had HZ been a proper syscall instead of a #define in the first place for user-land programs this would not have been a problem today.
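
    For comparison, POSIX does define a runtime query for the tick rate reported to userspace, sysconf() with _SC_CLK_TCK, which avoids baking a header constant into a binary. A minimal Python sketch of using it (the helper name is made up; on Linux the value is typically 100 regardless of the kernel's internal HZ):

```python
import os

# Ask the running system for its userspace clock-tick rate instead of
# assuming a compile-time constant such as HZ=100.
ticks_per_second = os.sysconf("SC_CLK_TCK")


def ticks_to_seconds(ticks):
    """Convert a tick count (e.g. CPU times read from /proc) to seconds."""
    return ticks / ticks_per_second


print(f"clock ticks per second reported to userspace: {ticks_per_second}")
print(f"250 ticks = {ticks_to_seconds(250):.2f} s")
```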

    Can someone do me a big favor and post RedHat 8.0's asm-i386/param.h file so I can see how they defined HZ, USER_HZ and friends? I'd like to see it without actually going to the trouble of installing RedHat 8.0.

  9. Wow. by ZigMonty · · Score: 2

    Therefore the OS didn't have to page out all the memory at the context switch tick and they could afford to tick more often, because the costs were lower.

    Wow. Are you saying that linux pages out the running process at every context switch? I think I might have found an explanation for X's choppiness.

    1. Re:Wow. by WolfWithoutAClause · · Score: 2
      Pretty much, although it's not paging to disk of course (unless you're really short of memory ;-) )

      It has to do stuff like that to keep the processes' address spaces separate; otherwise one rogue process would kill all the others, like in 95.

      --

      -WolfWithoutAClause

      "Gravity is only a theory, not a fact!"
  10. Wrong! Re:It doesn't improve performance. by WolfWithoutAClause · · Score: 2
    BUT once you have more than one program or thread running the situation is different.

    Yes. Say you have one thread running flat out and another that needs to do 100microseconds of work. With 100 ticks per second you will lose 5 usec to context switching and 9900 usec to waiting for the next context switch.

    No! The task does 100 microseconds of work and then calls the sleep command, or does I/O or whatever. This ultimately goes through the kernel and the kernel does an early context switch. It certainly doesn't waste the rest of the timeslice.

    Incidentally, the overhead of doing the context switch is much bigger than you say here- one of the things that the kernel has to do is flush the caches as it swaps the virtual memory in and out- that will slow the system for tens of thousands of instructions afterwards.

    Anyway, you're wrong about it not improving performance; it certainly can improve latency, which is very definitely a performance metric; but obviously you'll lose some cpu time due to the more frequent context switches that will occur.
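
    The trade-off can be put in rough numbers with a Python sketch. The 5 usec switch cost is the figure from the quoted post; the 50 usec figure is a purely hypothetical stand-in for a "much bigger" cost including cache effects, and the model assumes the worst case of one context switch on every tick.

```python
def tick_overhead_fraction(hz, switch_cost_us):
    """Fraction of CPU time lost if every tick causes one context switch."""
    return hz * switch_cost_us / 1_000_000


def worst_case_wait_us(hz):
    """Longest a runnable task waits for the next tick if nothing yields early."""
    return 1_000_000 / hz


for cost_us in (5, 50):  # 5 is from the quoted post, 50 is hypothetical
    for hz in (100, 1000):
        print(
            f"switch cost {cost_us} usec, HZ={hz}: "
            f"overhead {tick_overhead_fraction(hz, cost_us):.2%}, "
            f"worst-case wait {worst_case_wait_us(hz):.0f} usec"
        )
```

    Going from HZ=100 to HZ=1000 cuts the worst-case wait from 10000 usec to 1000 usec, while the switching overhead in this model rises tenfold but stays at or below 5% even with the pessimistic cost.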

    --

    -WolfWithoutAClause

    "Gravity is only a theory, not a fact!"
    1. Re:Wrong! Re:It doesn't improve performance. by Anonymous Coward · · Score: 0

      Thank you.

    2. Re:Wrong! Re:It doesn't improve performance. by zenyu · · Score: 2

      Incidentally, the overhead of doing the context switch is much bigger than you say here- one of the things that the kernel has to do is flush the caches as it swaps the virtual memory in and out- that will slow the system for tens of thousands of instructions afterwards.

      It is higher if you switch to another userland application. If you go to the scheduler and decide to keep running the same app the TLB does not have to be flushed. Even if you do switch to another app it's unlikely that it's going to thrash the cache. Those gnome-apps aren't so data intensive. It's even more unlikely that you will have to page in virtual memory. I don't think I've even bothered to allocate virtual memory lately, when 2-4 Gigs are cheap why bother? (On a P4 you can even tell the processor not to flush the local entries out of the TLB when you load a new LDT, and it never flushes the kernel's entries unless you change the GDT for some reason.) You can thrash the cache if you want to, just start a compile with -j# with # greater than the number of processors. But those little applications that need a small timeslice once in a while aren't gonna do it. There might be a security argument for flushing the cache, so that some app can't communicate with another by reading or not reading in a memory location into the cache. But at that level of paranoia I wouldn't be using Linux anyway.

      If you're swapping in virtual memory from a hard drive who cares what your timeslice is? It's going to take milliseconds just to get the page anyway! The only benefit of virtual memory is that it can swap out unused code so only the working set uses up RAM, in which case you still rarely actually swap things in since your working set is in RAM by definition. Overlays probably had a better granularity for that purpose. I'm always afraid virtual memory will be abandoned, even though it could be useful at some future date when you might have just 4GB of fast RAM, and 64TB of plain old DDR-RAM or something else the processor can't handle without OS help. (Yes I know there are machines that actually use virtual memory, but I'm not going to argue that they should have more ticks, they might benefit from fewer, in fact. I just haven't seen one of those machines in at least two years, so I think addressing the Athlons, and P3 & P4's of the world isn't such a bad idea.)

    3. Re:Wrong! Re:It doesn't improve performance. by WolfWithoutAClause · · Score: 2

      I think once you've switched on the memory management unit; you lose very little by using virtual memory. I don't think it's going to go away any time soon. And the cost of switching on the MMU is very small, given the relatively deep pipelines we have right now with these processors.

      --

      -WolfWithoutAClause

      "Gravity is only a theory, not a fact!"
    4. Re:Wrong! Re:It doesn't improve performance. by zenyu · · Score: 2

      you lose very little by using virtual memory

      True, but if people stop even creating swap partitions in large numbers who's going to want to maintain the code? Code dies when it's not maintained... I don't think this will happen to Linux or any of the free OS's anytime soon since they are still used on old hardware where it's hard to even find anyone selling compatible memory. And, as long as the embedded people don't all just switch over to uLinux/rtLinux it won't happen either. Something like PS2 Linux really needs swap files with just 32 megs of RAM. (The chip can support gigs of RAM, but the hack requires lots of soldering and expert knowledge of the MMU.)

      I think the chance of Linux abandoning the MMU is 0%, if only because you need memory protection on any general purpose machine. Also I think you'd be capped at 36bits of address space on i386, or only 64 Gigs of RAM, which sounds great now, but won't in 5 years.

    5. Re:Wrong! Re:It doesn't improve performance. by WolfWithoutAClause · · Score: 3, Informative
      No, look swapping is not the same as virtual memory. Virtual memory is useful even in the absence of any disk or swap space at all.

      The point is that virtual memory reduces the amount of real memory you need for each thread- each only takes what it really needs. Sure if memory is cheap, it may not matter so much. But even if it is cheap do you really want to give each process 1 gig of space on the off-chance that it might need it? I don't think so.

      Virtual memory is when a process thinks it has 1 gigabyte of memory, but it actually only has, say 128 megabytes. It can read or write to any bit of it, and the OS does what is necessary to ensure that it never notices the difference; obviously up to the actual system limits.
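
      A small Python sketch of the idea: reserve a large anonymous region and touch only two pages of it. The 256 MiB size is arbitrary, and the claim that untouched pages cost no physical memory assumes an OS that populates anonymous pages lazily, which is typical but not guaranteed.

```python
import mmap

SIZE = 256 * 1024 * 1024  # address space requested (arbitrary figure)
PAGE = mmap.PAGESIZE

# Anonymous mapping: fileno -1 means it is not backed by a named file.
region = mmap.mmap(-1, SIZE)

# Touch only the first and last page; the rest is never written.
region[0] = 0x41
region[SIZE - PAGE] = 0x42

touched_bytes = 2 * PAGE
print(f"mapped {len(region) // (1024 * 1024)} MiB, wrote to {touched_bytes} bytes' worth of pages")
print(f"an untouched page reads back as {region[PAGE]}")
```

      The process sees one contiguous 256 MiB region it can read or write anywhere; how much real memory sits behind it at any moment is the OS's business.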

      Virtual memory and swap space go together very nicely, but one does not imply the other. You can use virtual memory to implement garbage collection for example; with no backing store at all.

      I guess there are other ways to do similar things- for example, don't use virtual memory, use real memory and set up the MMU so that each thread can only see its own map. But there are issues with that, and it isn't necessarily faster.

      --

      -WolfWithoutAClause

      "Gravity is only a theory, not a fact!"
  11. Great - now binaries are broken. by ProtonMotiveForce · · Score: 1, Informative

    Way to go. Any binary that used the 'HZ' variable (a constant defined in a header file) will need to be recompiled for these new kernels. Way to go, Linux. Keep it up.

    1. Re:Great - now binaries are broken. by Dan+Aloni · · Score: 4, Informative

      That's not true. The kernel still reports HZ=100 to userspace, and as far as the jiffies calculations exposed to userspace are concerned, nothing has changed.

      --
      0x2b or not 0x2b, the answer is -1
    2. Re:Great - now binaries are broken. by Anonymous Coward · · Score: 0

      Not that any binary ever should.

  12. Huh? Win 95 had virtual memory... by Tom7 · · Score: 2

    Windows 95 absolutely does have virtual memory. (Are you thinking of Mac OS 9??) It's true that it crashed a lot, and that's because the protections afforded by a real OS were not in '95 (it was easy to turn off virtual memory protections and trample on the address space of another process). But each process definitely had its own virtual address space, and most of the things that a real OS does (page table, TLB, paging to disk, etc.) were in 95. I don't know what this business is about not having to page out all the memory -- I never saw the 95 source code but it probably does what any other real OS does: set the page table to the one of the process and flush the TLB.

  13. virtual memory by zenyu · · Score: 2

    No, look swapping is not the same as virtual memory. Virtual memory is useful even in the absence of any disk or swap space at all.

    I wasn't clear enough, I see 0% chance that virtual memory will disappear from Linux because it provides protection from one application playing with another's memory.

    I do fear for swapping, but only years from now when it's not so common. I do not fear for the loss of MMU support including virtual memory.

    It isn't clear this is what I'm saying from that post, but if you read what I said before I think it's clear. I was agreeing with you on the point of virtual memory not being a big deal, but adding that swapping was in dirge territory on the modern systems that will benefit from upping HZ. Your original comment on swapping is what inspired me to write the comment, cuz I thought you were making the point that it's not a performance loss to use virtual memory even if you never swap, while my point on swapping had nothing to do with performance, but code maintenance. If a signal never fires who cares how long it takes to handle it after all.

    If you have to do any swapping to disk I don't care how much you try to tune HZ, you need to buy more memory or run fewer apps to get a snappier system.

    But enough on this point, it's tangential and I think I agree with everything you said in this last comment without exception.

  14. Robert Love to talk about all this in LA by irabinovitch · · Score: 2, Interesting

    Robert Love will be giving a talk on 2.5 and the preemption patches at the Southern California Linux Expo.
    If you use the promo code: F633F you can get into the expo free.

  15. Solaris vs. Linux and Interrupt Clock Resolution by Anonymous Coward · · Score: 0

    I find this funny... Solaris defaults to 100 interrupts per second.

    As a matter of fact in the book "Solaris Internals" RMC and JM basically
    say "be *very* careful when increasing the
    interrupt rate, because this can reduce system
    performance dramatically" (pg. 56). On Solaris
    this is accomplished by adding the line:

    set hires_tick = 1

    to /etc/system. Setting hires_tick to true increases the programmable clock interrupt frequency to 1000 interrupts per second. Now I realize that comparing Solaris to Linux is
    like comparing apples to oranges, but I am
    pretty sure that the functions that take place
    on interrupt are pretty standard:

    - timeslicing
    - tracking of various resources (mem, cpu, etc)
    - checking/calculating of paging parameters

    and so on. I can understand that this would
    be good for real time systems, but for the
    average desktop or server, will this really
    increase performance? Are Solaris and Linux
    really that different on what would seem to
    be a rather fundamental issue?

    - Andrew

  16. Huh? Mac OS 7 had virtual memory... by yerricde · · Score: 1

    Windows 95 absolutely does have virtual memory. (Are you thinking of Mac OS 9??)

    Mac OS 7 had virtual memory. It just wasn't protected virtual memory until Mac OS X.

    --
    Will I retire or break 10K?
    1. Re:Huh? Mac OS 7 had virtual memory... by Tom7 · · Score: 3, Insightful

      I don't think this is true. What the classic Mac OSes called virtual memory wasn't really virtual memory like what I'm talking about. Yes, they had a menu item where you could make disk space into "virtual memory" (I'm not sure what this did, really), but processes still had one unified address space. (Why else did we have to set the amount of memory we wanted to allocate to each program?) It's not like they were using the MMU of the processor and actually doing virtual memory, but just had the protections turned off -- they were doing a software simulation of some aspects of VM (like they simulated multitasking, for instance). It wasn't really VM.

  17. If this helps, something is broken by Animats · · Score: 2
    If turning up the system tick rate helps, that's an indication of one of two problems.
    • Something is polling that should be event-driven. Some applications (Older versions of Netscape come to mind) like to do something on every tick. (For Netscape, that was a lousy architectural decision made so it would work on the classic MacOS and 16-bit Windows.) There are also some really crappy interprocess communication systems that are polled. Find and fix.
    • Thread scheduling priorities are wrong. This is a subtle issue, but basically, the threads that aren't CPU bound but have tight latency requirements have to have priority over the threads that are CPU bound and don't have tight latency requirements. Smarter schedulers try to achieve this automatically, but some of the guesses made in the UNIX world are tied, for historical reasons, to the TTY end of the system and are no longer appropriate.
    A useful exercise is to turn the tick rate way down (maybe 1HZ) and put a compute loop job in the background. Everything that's broken according to the above criteria will turn into a toad, which helps debug the problem.
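
    The first case above, polled versus event-driven waiting, can be sketched in Python with the standard threading module. All names, the 50 ms delay, and the 10 ms poll interval are illustrative assumptions, not taken from any real application.

```python
import threading
import time


def polled_wait(flag, tick_s):
    """Polling: wake up every tick_s seconds to check whether the flag is set."""
    wakeups = 0
    while not flag.is_set():
        time.sleep(tick_s)
        wakeups += 1
    return wakeups


def event_wait(flag):
    """Event-driven: block once and get woken only when the flag is set."""
    flag.wait()
    return 1


def demo(delay_s=0.05, tick_s=0.01):
    """Signal a flag after delay_s and count wakeups for each waiting style."""
    results = {}
    waiters = (
        ("polled", lambda f: polled_wait(f, tick_s)),
        ("event", event_wait),
    )
    for name, waiter in waiters:
        flag = threading.Event()
        threading.Timer(delay_s, flag.set).start()
        results[name] = waiter(flag)
    return results


if __name__ == "__main__":
    print(demo())
```

    The polled waiter's latency and wakeup count both depend on its tick interval, which is why such code appears to improve when the system tick rate goes up; the event-driven waiter wakes once, whenever the event fires.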