Slashdot Mirror


Hyper-Threading Speeds Linux

developerWorks writes "The Intel Xeon processor introduces a new technology called Hyper-Threading (HT) that makes a single processor behave like two logical processors. The technology allows the processor to execute multiple threads simultaneously, which can yield significant performance improvement. But exactly how much improvement can you expect to see? This article gives the results of the investigation into the effects of Hyper-Threading (HT) on the Linux SMP kernel. It compares the performance of a Linux SMP kernel that was aware of Hyper-Threading to one that was not." Ah, the joys of high performance.

14 of 239 comments

  1. What's really cool also by esconsult1 · · Score: 4, Interesting

    We've used XEON's on our DB server for a few months now. The performance has been outstanding. You also see 4 processors when you run top.

    At first we thought this was an error, and got in touch with Dell's tech support. But the geeks there said this is normal behavior.

    1. Re:What's really cool also by JCholewa · · Score: 2, Interesting

      > We've used XEON's on our DB server for a few months now. The performance
      > has been outstanding. You also see 4 processors when you run top.

      > At first we thought this was an error, and got in touch with Dell's tech support.
      > But the geeks there said this is normal behavior.

      Of course it's normal behavior. Windows is (well, basically) counting the number of threads that the system can simultaneously execute (that's probably not an entirely accurate depiction), not the number of physical processors. But this does not mean that you're getting the performance of four processors. You still only have the execution resources of two processors at your system's disposal. The best that simultaneous multithreading can do is make more efficient use of the existing execution units. This can result in very nice performance boosts, no performance boosts at all, and (in some rarer circumstances) performance penalties. But it is nowhere near like having that actual number of processors.

      Probably, a good rule of thumb would be "if it already stresses the execution units, then you won't see a boost, but if the code causes frequent thread stalls, then you'll probably see a nice jump".

      *EDIT* Crap, I didn't notice that you said top. Made the assumption about Windows. Sorry about that. My post more or less stands as the same, though. :)
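
The gap between what `top` reports and what is physically installed can be checked by hand. The following is a minimal sketch, assuming an x86 Linux kernel whose `/proc/cpuinfo` carries a `physical id` field for each `processor` entry (older kernels may not). The function name `count_cpus` and the sample text are illustrative, not taken from the article.

```python
def count_cpus(cpuinfo_text):
    """Return (logical_cpus, physical_packages) from /proc/cpuinfo-style text."""
    logical = 0
    packages = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key = key.strip()
        if key == "processor":
            logical += 1                 # one entry per logical CPU
        elif key == "physical id":
            packages.add(value.strip())  # same id means same physical package
    return logical, len(packages)


# Hypothetical excerpt for a dual-Xeon box with HT enabled (assumption).
SAMPLE = """\
processor : 0
physical id : 0

processor : 1
physical id : 0

processor : 2
physical id : 1

processor : 3
physical id : 1
"""

print(count_cpus(SAMPLE))  # (4, 2)
```

On a real Linux machine the same function could be fed `open("/proc/cpuinfo").read()`. If the kernel does not report `physical id`, the package count comes back as 0 and the result should not be trusted.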

    2. Re:What's really cool also by Richard_at_work · · Score: 4, Interesting
      Windows2000Pro only shows two, as that's all it can handle. It's part of the Windows2000 limitation:
      • Windows2000pro - 2 cpus
      • Windows2000server - 4 cpus
      • Windows2000AdvServer - 8 cpus

      We put win2kserver on a dual Xeon with HT, and it showed 4 cpus (this was when we realised we had HT capable Xeons! Sure enough, after checking, we were right).
  2. But the real question... by Jace+of+Fuse! · · Score: 3, Interesting

    Does SMP support automatically allow benefits from Hyperthreading, or does that require special support all its own?

    --

    "Everything you know is wrong. (And stupid.)"

    Moderation Totals: Wrong=2, Stupid=3, Total=5.
    1. Re:But the real question... by lederhosen · · Score: 3, Interesting

      Yes, as said in the other posts,
      BUT, you want to schedule the
      same process on the same CPU in
      order to not trash the cache.

      I.e. you can make a huge improvement
      by making the scheduler aware of
      processors *and* logical processors.
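
What "aware of processors and logical processors" might mean can be shown with a toy placement rule. This is only a sketch under assumptions: `pick_cpu`, the sibling map, and the order of preference are made up for illustration and are not how the Linux scheduler is actually written.

```python
def pick_cpu(last_cpu, busy, sibling_of):
    """Pick a logical CPU for a task that last ran on `last_cpu`.

    busy       -- set of logical CPUs currently running something
    sibling_of -- maps each logical CPU to its Hyper-Threading sibling
    Returns None when every logical CPU is busy.
    """
    idle = [c for c in sorted(sibling_of) if c not in busy]
    if not idle:
        return None
    # 1. Same logical CPU as last time: the cache is still warm.
    if last_cpu in idle:
        return last_cpu
    # 2. The HT sibling shares the physical processor's caches.
    if sibling_of.get(last_cpu) in idle:
        return sibling_of[last_cpu]
    # 3. Otherwise prefer a logical CPU whose sibling is idle too,
    #    so the task gets a whole physical processor to itself.
    for c in idle:
        if sibling_of[c] not in busy:
            return c
    # 4. Fall back to any idle logical CPU.
    return idle[0]


# Two physical processors, two logical CPUs each (assumed layout).
SIBLING = {0: 1, 1: 0, 2: 3, 3: 2}

print(pick_cpu(0, set(), SIBLING))   # 0
print(pick_cpu(None, {0}, SIBLING))  # 2
```

The second call is the HT-aware part: a new task is steered to logical CPU 2, on the idle physical processor, instead of logical CPU 1, which would share execution units with the busy CPU 0. Whether step 2 should outrank step 3 is a trade-off between cache warmth and contention; the order here is arbitrary.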

  3. What are you talking about? by pVoid · · Score: 2, Interesting
    The article clearly shows that syscalls and basically OS-dependent stuff rarely improve in performance, and in fact decrease in most spots.

    Of course multi-threaded applications are going to improve. What's your point?

    For those who didn't RTFA:

    Test                            Non-HT        HT  Speed-up
    Simple syscall                    1.10      1.10        0%
    Simple read                       1.49      1.49        0%
    Simple write                      1.40      1.40        0%
    Simple stat                       5.12      5.14        0%
    Simple fstat                      1.50      1.50        0%
    Simple open/close                 7.38      7.38        0%
    Select on 10 fd's                 5.41      5.41        0%
    Select on 10 tcp fd's             5.69      5.70        0%
    Signal handler installation       1.56      1.55        0%
    Signal handler overhead           4.29      4.27        0%
    Pipe latency                     11.16     11.31       -1%
    Process fork+exit               190.75    198.84       -4%
    Process fork+execve             581.55    617.11       -6%
    Process fork+/bin/sh -c        3051.28   3118.08       -2%

    is it just me? or does the linux kernel not perform so much better in SMP HT?
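
The percentage column in the figures quoted above can be reproduced from the two timing columns. A small sketch follows, assuming the first number in each row is the non-HT baseline, the second is the HT kernel, and lower is better. Truncating toward zero (rather than rounding) is an inference that happens to match every quoted row. The helper name `speedup_percent` is made up.

```python
def speedup_percent(baseline, ht):
    """Percent speed-up of `ht` over `baseline` for latency-style numbers
    (lower is better), truncated toward zero."""
    return int((baseline - ht) / baseline * 100)


# A few rows copied from the comment above.
ROWS = [
    ("Signal handler installation", 1.56, 1.55),
    ("Pipe latency", 11.16, 11.31),
    ("Process fork+exit", 190.75, 198.84),
    ("Process fork+execve", 581.55, 617.11),
    ("Process fork+/bin/sh -c", 3051.28, 3118.08),
]

for name, base, ht in ROWS:
    print(f"{name}: {speedup_percent(base, ht)}%")
```

A negative result means the operation took longer on the HT kernel, which is the case for the pipe and fork rows.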

    1. Re:What are you talking about? by pVoid · · Score: 5, Interesting
      You're assuming the kernel is one thread =)

      The kernel has lots of work to do when you call into it. Of course it wouldn't boost up if all kernel calls were like this:

      long myKernelFunc( long param ) { return param * 2 - 23; }

      But kernels do work.

      See, something that very few people know is that the NT kernel is fully pre-emptible, fully interruptible, and re-entrant by design, i.e. even for single processor systems, it's like that.

      The linux kernel is not.

      It's very hard to 'tack on' SMP features onto a system that wasn't made with that in mind in the first place.

      This has some advantages, and some drawbacks... NT kernel programming is *frigging* hard. HARD.

      But it also has the advantage that it makes *much* better use of SMP.

      Sad, but true.

  4. 51% speed-up! by core+plexus · · Score: 5, Interesting
    An excellent, detailed article. For those in a hurry:

    "Conclusion
    Intel Xeon Hyper-Threading is definitely having a positive impact on Linux kernel and multithreaded applications. The speed-up from Hyper-Threading could be as high as 30% in stock kernel 2.4.19, to 51% in kernel 2.5.32 due to drastic changes in the scheduler run queue's support and Hyper-Threading awareness."

    My questions: What's the downside? Is AMD doing anything similar?

    Fight with computer brings SWAT team

  5. HT hurt perf by steelerguy · · Score: 5, Interesting

    Tested HT running a couple of large jobs on a 2 CPU box, with each process using over a GB of RAM. Performance went down.

    Also HT can play havoc with an openMosix cluster, since processes can start being migrated around to CPUs that do not really exist and appear to have no load, yet the physical CPU may be 100% loaded in reality.

    It is not all peaches and cream.
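
The load-balancing trap described here can be made concrete with a toy calculation. This is a sketch only: the load figures, the `physical_of` map, and the 0.9 "busy" threshold are assumptions, and nothing here reflects how openMosix actually measures load.

```python
def physical_load(logical_load, physical_of):
    """Sum per-logical-CPU load (0.0 to 1.0) into per-physical-CPU totals."""
    totals = {}
    for cpu, load in logical_load.items():
        phys = physical_of[cpu]
        totals[phys] = totals.get(phys, 0.0) + load
    return totals


def idle_looking_but_shared(logical_load, physical_of, busy=0.9):
    """Logical CPUs that report no load while their physical CPU is busy."""
    totals = physical_load(logical_load, physical_of)
    return sorted(
        cpu
        for cpu, load in logical_load.items()
        if load == 0.0 and totals[physical_of[cpu]] >= busy
    )


# Dual-processor box with HT: logical CPUs 0,1 share one physical CPU, 2,3 the other.
PHYSICAL_OF = {0: 0, 1: 0, 2: 1, 3: 1}
LOAD = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0}

print(idle_looking_but_shared(LOAD, PHYSICAL_OF))  # [1]
```

Logical CPU 1 looks like a free processor to anything that only reads per-CPU load, but a job migrated there would compete with the job already saturating its physical CPU. Logical CPUs 2 and 3 are the genuinely free capacity.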

  6. Re:Also the Pentium 4 - 3 Ghz is hyperthreaded. by TokyoBoy · · Score: 2, Interesting

    Does anyone know if AMD will be doing something similar or if their current processors do something like this? I know that many High Performance Clusters use SMP machines and multi-threaded code and could take advantage of HT. Many clusters are made with AMD processors due to the fact that they are so much less expensive than Intel.

  7. Re:Fundamental mistake by JCholewa · · Score: 2, Interesting

    > Well, unless you have a computer that has multiple physical CPUs.

    You have a point, but he does, as well. SMT ("hyper-threading") should work automatically for multiprocessor systems. So if you have a dual processor, SMT-capable board in a system that's unaware of the SMT functionality, you should still get a boost from SMT. Unless Hyper-Threading is a really, really bizarre implementation of SMT. Reviewers should really compare against an SMP system that is incapable of doing SMT, because it'll do it automatically (or it should), even if you don't tell it to. Alternatively, you could approximate the same results by forcing the system to only use a number of threads equivalent to the number of processors. Not all programs can do this, though (compiling is the only thing that immediately comes to mind).

    Granted, I've been out of the loop a bit, so I might be making some really off the wall (and inaccurate) assumptions about Intel's SMT implementation.

  8. Re:Fundamental mistake by afidel · · Score: 5, Interesting

    Nope, you make incorrect assumptions: the hyperthreaded portion of the cpu shows up to software as a separate cpu. For this reason a win2k pro machine has to have hyperthreading disabled on a dual xeon machine, or else it will just use the first physical cpu and its child hyperthread. This is why artificial smp limitations suck. Also win2k server will only allow 4 cpus in standard edition, so it can only utilize two physical cpus and their hyperthreads. Windows Server 2003 ups the amount of cpus allowed for standard edition to 8 to account for this.

    --
    There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  9. Re:Technical Summary by cartman · · Score: 3, Interesting

    What you said was false.

    Take the example of database & OLTP applications. Database transactions are heavily dependent on repeated access to RAM. Virtually no database is small enough to fit into cache, and there is often little regularity in which data is accessed. A non-SMT processor is REQUIRED to wait IDLY on every memory access that misses the cache, which takes >100 processor cycles on a modern CPU. This has NOTHING to do with the P4 architecture or long pipelines.

    "But remember that HT requires OS support, may require application support..."

    HT does not require OS support as long as the OS is capable of recognizing more than 1 CPU. Any threaded app can benefit from HT.
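
The memory-latency argument can be put into rough numbers. This is an idealized sketch: the 100-cycle stall comes from the comment above, while the 50 cycles of useful work between misses and the function name `utilization` are assumptions. It ignores contention between threads for execution units and cache.

```python
def utilization(work_cycles, stall_cycles, threads):
    """Fraction of cycles the execution units are busy when each thread
    alternates `work_cycles` of compute with `stall_cycles` of waiting on
    memory, and any ready thread may run while the others wait."""
    per_thread = work_cycles / (work_cycles + stall_cycles)
    return min(1.0, threads * per_thread)


for n in (1, 2):
    print(n, round(utilization(50, 100, n), 2))
# 1 0.33
# 2 0.67
```

Under these assumptions a single thread keeps the processor busy a third of the time, and a second hardware thread fills part of the gap. Code that rarely misses the cache has little idle time to reclaim, which fits the earlier observation that workloads already stressing the execution units see no boost.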

  10. SMT will become increasingly important by cartman · · Score: 5, Interesting

    SMT (hyperthreading) will become increasingly important when processors are able to execute more than 2 threads simultaneously.

    This development is inevitable. Previously, each new processor generation was faster than the prior one at a given clock rate, because each new processor core had more execution units, and was therefore able to perform more work in parallel. This trend abruptly ended recently, for one reason: there is no more instruction-level parallelism (ILP) to exploit. It is impossible for a processor to look at a thread of execution and find more than a few instructions to execute in parallel.

    The only parallelism left to exploit is THREAD-LEVEL parallelism (TLP). Therefore the only way to continually increase performance is to increase the number of threads that a CPU can execute in parallel. This requires two modifications to CPU cores: first, increase the number of thread contexts per CPU, and second, increase the number of pipelines to which those threads can be dispatched.

    With the P4, it would be pointless to have more than 2 thread contexts, because there aren't enough CPU resources lying idle to execute more than 2 threads. But future CPUs could make use of more than 2 thread contexts by having enough CPU resources to execute all of them. Future CPUs could have 20 execution units or more, which would be enough to execute several threads. Remember that the number of transistors per CPU continues to increase exponentially.

    It's easy to foresee a time when processors have 20 execution units (10 integer, 10 fp) and 4 thread contexts, offering more than triple the performance of a non-SMT cpu. In the future, non-SMT CPUs will make as little sense as a non-superscalar CPU would today.
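
The "more than triple" estimate follows from a very simple bound, sketched below. The per-thread ILP figure of 5 is an assumption chosen for illustration, and the model ignores the split between integer and fp units, stalls, and cache effects. `issued_per_cycle` is a made-up name.

```python
def issued_per_cycle(execution_units, thread_contexts, ilp_per_thread):
    """Upper bound on instructions issued per cycle: limited either by the
    number of execution units or by the parallelism the threads can offer."""
    return min(execution_units, thread_contexts * ilp_per_thread)


UNITS = 20  # execution units, as in the comment above
ILP = 5     # instructions one thread can supply per cycle (assumption)

single = issued_per_cycle(UNITS, 1, ILP)
smt4 = issued_per_cycle(UNITS, 4, ILP)
print(single, smt4, smt4 / single)  # 5 20 4.0
```

With a single thread context, 15 of the 20 units have nothing to do; four contexts can in principle fill them. Past that point the bound is set by the unit count, so an eighth context would add nothing in this model.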