Slashdot Mirror


Hyperthreading Hurts Server Performance?

sebFlyte writes "ZDNet is reporting that enabling Intel's new Hyperthreading Technology on your servers could lead to markedly decreased performance, according to some developers who have been looking into problems that have been occurring since HT has been shipping automatically activated. One MS developer from the SQL server team put it simply: 'Our customers observed very interesting behaviour on high-end HT-enabled hardware. They noticed that in some cases when high load is applied SQL Server CPU usage increases significantly but SQL Server performance degrades.' Another developer, this time from Citrix, was just as blunt. 'It's ironic. Intel had sold hyperthreading as something that gave performance gains to heavily threaded software. SQL Server is very thread-intensive, but it suffers. In fact, I've never seen performance improvement on server software with hyperthreading enabled. We recommend customers disable it.'"

27 of 255 comments

  1. This is news? by Anonymous Coward · · Score: 5, Informative

    Anybody who understands HT has been saying this since chips supported it. I have it enabled because I find that, at typical loads, our DB servers' performance benefits from HT-aware scheduling. Welcome to 2002.

    1. Re:This is news? by magarity · · Score: 2, Informative

      You should have been taught that two 1GHz CPUs may or may not be slower than one 2GHz CPU, depending on what the server does for a living. The OS consideration is minuscule; cross-CPU communication is almost as fast as the internals of the CPUs until you get up to NUMA-type machines. Even then, no, a blanket statement such as the one you say you've been taught is incorrect for at least as many cases as it is correct.

  2. The code wasn't changed by ocelotbob · · Score: 5, Informative

    I read the Intel assembly guide section regarding hyperthreading, and it clearly states that performance will drop if you don't take the shared cache into consideration. The two logical threads contend for the cache, causing the performance problems that were described. In order for there to be a true benefit to hyperthreading, either the program, the OS or the compiler needs to determine that hyperthreading is enabled, and model the code to use less than half the cache. It's been known that way since the beginning, and frankly, it is silly that MS is scratching their heads wondering why this is. Lower the cache footprint, and I'll be willing to bet that performance rises dramatically.
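
    A minimal sketch of the sizing rule described above, assuming two logical threads share one cache. The function names, the 90% headroom figure and the 512 KiB cache size are illustrative assumptions, not values taken from the Intel guide.

```python
def per_thread_cache_budget(cache_bytes: int, threads_sharing: int = 2,
                            headroom_percent: int = 90) -> int:
    """Bytes of a shared cache that one thread should plan to occupy.

    With two logical threads and any headroom below 100%, the result is
    less than half the cache, as the comment above suggests.
    """
    if cache_bytes <= 0 or threads_sharing < 1:
        raise ValueError("cache_bytes must be positive and threads_sharing >= 1")
    if not 0 < headroom_percent <= 100:
        raise ValueError("headroom_percent must be in (0, 100]")
    return cache_bytes // threads_sharing * headroom_percent // 100


def block_elements(cache_bytes: int, element_bytes: int,
                   threads_sharing: int = 2) -> int:
    """Number of elements per working block that fits the per-thread budget."""
    budget = per_thread_cache_budget(cache_bytes, threads_sharing)
    return max(1, budget // element_bytes)


if __name__ == "__main__":
    l2 = 512 * 1024  # hypothetical 512 KiB shared cache
    print(per_thread_cache_budget(l2), block_elements(l2, element_bytes=8))
```

    A program that tiles its inner loops could use a figure like this as the block size when it detects that hyperthreading is enabled.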

    --

    Marxism is the opiate of dumbasses

    1. Re:The code wasn't changed by canavan · · Score: 4, Informative

      When optimizing code, the compiler should worry about cache size and cache footprint, so that it doesn't unroll inner loops too far or let the code size grow enough to cause thrashing. HT has just cut in half the maximum cache footprint at which increasing code size for possibly minor performance boosts may make sense. GCC has an option called --param max-unrolled-insns=VALUE, which controls just that. There are possibly others with similar effects, possibly also for other compilers. Additionally, it may make sense to have the compiler optimize for size instead of speed in some cases.

  3. sort of obvious by Vlad_the_Inhaler · · Score: 4, Informative

    If you have a system thread cleaning out blocks of disk cache memory then of course it is going to suffer. The whole point of hyperthreading was that one thread could run while another was waiting for I/O.

    The first tests on Linux when Hyperthreading came out were also pretty discouraging.

    --
    Mielipiteet omiani - Opinions personal, facts suspect.
    1. Re:sort of obvious by Mateorabi · · Score: 2, Informative
      Except that in multitasking, when a process blocks and swaps, you suffer hundreds to thousands of cycles while the OS swaps out process structs, rewrites VM tables, etc. This usually happens at the OS syscall level too.

      In hyperthreading, one thread simply stops contending for functional units for tens of cycles, letting the other, already loaded and running thread max out its ALU/FPU usage while the first waits for cache to get filled from DRAM. This is much finer granularity: the OS doesn't force a swap penalty for every single cache miss, because the act of swapping is way more expensive.

      The problem is that if both threads are simultaneously memory (vs CPU) intensive, you end up with two waiting threads and don't see a performance boost. Even worse, they both start fighting over the same cache lines. This is the HT process equivalent of virtual memory thrashing, only it's at the DRAM<->cache level instead of the disk<->DRAM level.

      --
      "You saved 1968." - Ms. Valerie Pringle to the crew of Apollo 8

  4. How is this news? by logicnazi · · Score: 3, Informative

    This sort of effect has been talked about for as long as I remember hearing about hyperthreading. It was common knowledge long before the chips came out that running two threads on the same cache can cause performance issues. One can see this with two chips sharing an L2 cache so why should it be a surprise here?

    The real question is whether this issue can be optimized for. If the developers design their code with HT in mind, will this still be a problem, since the other thread may belong to another process, or would properly optimized code be able to deal with this?

    Most importantly is this a rare effect or a common one? Would it be rare or common if you optimize your programs for an HT machine?

    --

    If you liked this thought maybe you would find my blog nice too:

  5. Re:Poor mans dual-core by Chewbacon · · Score: 2, Informative

    I have a dual core on my desktop at home and HT on my machine at work. I'll take the dual core over HT any day that ends with a Y. You can multi-task so well with the Pentium D it becomes blissful. Want to archive a DVD movie and put your favorite CD on your mp3 player? Set the two apps to run on different cores. On the other hand, my HT workstation goes nuts-slow if I try to do two intensive tasks at once.
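
    For what it's worth, "set the two apps to run on different cores" can be sketched like this on Linux. The job names and the helper functions are hypothetical; os.sched_setaffinity is a real call but is only available on some platforms (Linux among them), so treat this as a sketch under that assumption.

```python
import os
from typing import Dict, Iterable, List


def assign_cpus(jobs: List[str], cpus: Iterable[int]) -> Dict[str, int]:
    """Map each job name to a logical CPU, round-robin over the sorted CPU ids."""
    cpu_list = sorted(cpus)
    if not cpu_list:
        raise ValueError("no CPUs available")
    return {job: cpu_list[i % len(cpu_list)] for i, job in enumerate(jobs)}


def pin_process(pid: int, cpu: int) -> None:
    """Restrict process `pid` to one logical CPU (pid 0 means the caller)."""
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("os.sched_setaffinity is not available here")
    os.sched_setaffinity(pid, {cpu})


if __name__ == "__main__":
    # Hypothetical job names; a real script would pass the PIDs of the two apps
    # to pin_process() using the CPU ids chosen here.
    plan = assign_cpus(["dvd_archiver", "mp3_encoder"], [0, 1])
    print(plan)
```

    Note that on an HT machine two logical CPU ids may be siblings on the same physical core, in which case pinning to "different CPUs" still leaves both jobs sharing one core's cache and execution units.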

    --
    Chewbacon
    The Bible is like Wikipedia: written by a bunch of people and verifiable by questionable sources.
  6. HT problems with firebird database (slowdowns) by mAriuZ · · Score: 2, Informative

    The usual response is to disable it in the BIOS

    One possible solution (code patch)

    http://sourceforge.net/mailarchive/message.php?msg_id=12403341

    Other threads with hyperthreading problems (slowdowns)
    http://sourceforge.net/search/?forum_id=6330&group_id=9028&words=hyperthreading&type_of_search=mlists

    --
    developer http://flamerobin.org
  7. Not a developer from Citrix by Bluey · · Score: 2, Informative

    I know asking for them to research is a stretch, but the submitter should at least read the article before submitting it. The quote was from a Technical Director at a consulting company that sells Citrix software, not from a developer at Citrix. Hyperthreading can definitely help performance of Metaframe running under Windows 2003. Enabling it in the BIOS on a server running Windows 2000 was where the problem resided.

  8. shared cache versus local memory by erwincoumans · · Score: 2, Informative

    "I read the intel assembly guide section regarding hyperthreading, and it clearly states that performance will drop if you don't take the shared cache into consideration." This is a general problem. The Xbox 360 has similar issues, with 3 cores sharing the same cache. Having multiple independent CPUs, each with its own local memory (like a multiprocessor or the PS3's SPUs), doesn't suffer from these issues.

  9. Re:Poor mans dual-core by masklinn · · Score: 2, Informative
    In fact, with the dual channel memory model, dual core AMD systems might be a little better than generic dual CPU, since each processor has its "own" memory.

    Nope, both cores use the same bridge to access central memory so that point is moot. On the other hand, the cores of an Athlon X2 get to talk to one another through a special link, while regular multiprocessors have to use the FSB (or HyperTransport for AMD's Opterons) link, and therefore have to compete with every other device using said FSB/HT (on top of getting much higher latencies).

    --
    "The way we can tell it's C# instead of Haskell is because it's nine lines instead of two." -- wadler
  10. Of course it can hurt performance by Anonymous Coward · · Score: 2, Informative

    HT is a very simple concept: virtualize 2 CPUs by cutting all caches in half and allocating each half to one of the CPUs, and allow the ALUs to process data from either thread. This can give good performance, for instance when one thread has a cache miss and is waiting for data from main memory (or, god forbid, there is a fault and you need to read from the HDD). In normal single-CPU operation, this ties up resources, and that thread can't make any progress. With HT on, the second thread can continue processing data. Or even without a cache miss, there are 4 (or more) ALUs on the die, and only certain types of applications can effectively make use of them all simultaneously. Having HT allows a higher probability that all the resources on the chip are used. But the cost, as I said above, is cutting the cache sizes in half (effectively). And cache is king for some applications. There are many job types where doubling the cache gives much better performance than even doubling the CPU speed (well, that is probably pushing it, but certainly adding 10% more cache can be better than a 10% higher clock rate), as it means less time going to main memory.

    It isn't a foolproof technology, but it has its benefits. SQL can be very heavy on the cache, and I'm not surprised that it doesn't perform optimally without some tuning.
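
    To put rough numbers on the "less time going to main memory" argument, here is a toy average-access-time calculation. The formula (hit time plus miss rate times miss penalty) is the standard textbook one; the cycle counts and miss rates are made-up assumptions for illustration, not measurements of any real chip.

```python
def average_access_cycles(hit_cycles: float, miss_rate: float,
                          miss_penalty_cycles: float) -> float:
    """Average memory access time in cycles: hit time + miss rate * miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty_cycles


if __name__ == "__main__":
    # Hypothetical figures: halving the effective cache doubles the miss rate.
    full_cache = average_access_cycles(2, 0.02, 200)
    halved_cache = average_access_cycles(2, 0.04, 200)
    print(f"full cache: {full_cache:.1f} cycles, halved cache: {halved_cache:.1f} cycles")
```

    Whether the second thread's extra work makes up for the slower average access depends entirely on the workload, which is the point of the comment above.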

  11. Re:HT kills my ATI All in Wonder by laffer1 · · Score: 2, Informative

    I don't run with HTT on but I do have an SMP box (2 Xeon 2GHz). The ATI software works fine on my desktop, so it must be an issue specific to hyperthreading.

  12. HT on Linux by RAMMS+EIN · · Score: 3, Informative

    Hyperthreading Speeds Linux.

    In a nutshell:

      - hyperthreading decreases syscall speed by a few percent
      - on single-threaded workloads, the effect is often negligible, with occasional large improvements or degradations
      - on multithreaded workloads, around 30% improvement is common
      - Linux 2.5 (which introduced HT-awareness) performs significantly better than Linux 2.4

    So, from that benchmark (and others like it, just STFW) it appears that HT offers significant benefits; you need multithreading to take advantage of it, and having a HT-aware OS helps.
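
    If you want to check whether a Linux box actually has hyperthreading siblings before benchmarking, a sketch along these lines should work. It assumes the kernel exposes the sysfs file thread_siblings_list under /sys/devices/system/cpu/cpuN/topology/, which modern kernels do; the function names are my own.

```python
from pathlib import Path
from typing import List


def parse_cpu_list(text: str) -> List[int]:
    """Parse a kernel CPU list such as '0,4' or '0-3,8' into a list of ints."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def thread_siblings(cpu: int = 0) -> List[int]:
    """Logical CPUs sharing a physical core with `cpu` (Linux sysfs only)."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    return parse_cpu_list(path.read_text())


def smt_enabled(cpu: int = 0) -> bool:
    """True when `cpu` shares its core with at least one other logical CPU."""
    return len(thread_siblings(cpu)) > 1
```

    Calling smt_enabled() raises an OSError on systems without that sysfs file, so wrap it accordingly if you need portability.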

    --
    Please correct me if I got my facts wrong.
  13. Interesting Technical Analysis on the subject by morcego · · Score: 4, Informative
    You will find here a very interesting technical analysis on the subject, by Bryan J. Smith, on why Hyperthreading is crappy engineering. From the message:


    Since then, Intel has made a number of "hacks" to the i686 architecture.
    One is HyperThreading which tries to keep its pipes full by using its
    control units to virtualize two instruction schedulers, registers, etc...
    In a nutshell, it's a nice way to get "out-of-order and register
    renaming for almost free." Other than basic coherency checking as
    necessary in silicon, it "passes the buck" to the OS, leveraging its
    context switching (and associated overhead) to manage some details.

    That's why HyperThreading can actually be slower for some applications,
    because they do not thread, and the added overhead in _software_
    results in reduced processing time for the applications.
    --
    morcego
  14. Re:Poor mans dual-core by Malor · · Score: 4, Informative

    I think you're kind of saying this already, but I felt confused by your wording and thought I'd chime in. I'm a little blurry on a few of these details, and too lazy to go look things up, so pay attention to replies... don't treat this as gospel.

    As far as I know, all multi-cpu AMD packages use exactly the same method to talk amongst themselves, HyperTransport. They absolutely use a private, dedicated HT bus between cores. I *think* that when you run two single core Opterons, each has a link to main memory, and they also share a direct link. In the case of a 4-die system, I think the third and fourth CPUs 'piggyback' on the 1st and 2nd... they talk to processors 1 and 2, and each other. Processors 1 and 2 do main-memory fetches on their behalf. Each CPU has its own dedicated cache, and I think the cache ends up being semi-unified... so that if something is in processor 2's cache, when processor 4 requests the data, it comes from processor 2 instead of main memory. That's not quite as fast as direct cache, but it's a LOT faster than the DRAM.

    The X2 architecture is like half of a 4-way system. There's one link to main memory, and one internal link between the two CPUs... the second one is piggybacking, just like processors 3 and 4 do in the 4-way system. It's not quite as good as a dedicated bus per processor, but the AMD architecture isn't that bandwidth-starved, and a 1gb HT link is usually fine for keeping two processors fed. You do lose a little performance, but not that much.

    Intel dual cores share a single 800MHz bus, with no special link between the chips. And the Netburst architecture is extremely memory bandwidth hungry. Because of its enormous pipeline, a branch mispredict/pipeline stall hurts terribly. The RAM needs to be very very fast to refill the pipeline and get the processor moving again.

    So running two Netburst processors down a single, already-starved memory bus is just Not a Good Idea. It's a crummy, slapped-together answer to the much, much better design of the AMD chips. It's a desperate solution to avoid the worst of all possible fates... not being in a high-end market segment at all.

    Next year this could all be different again, but at the moment, AMD chips, particularly dual core, are a lot better from nearly every standpoint.

  15. Re:Poor mans dual-core by volsung · · Score: 4, Informative
    That's not quite true either. Each Opteron has a separate memory controller (dual-channel), which means that each CPU can have its own pipe to a bank of memory. So if the CPU needs to access memory in its banks, it will not have to contend with the other CPU over the HT link. A NUMA-aware OS will try to schedule processes on the same CPU which controls the process's allocated memory. If your programs can fit in one CPU's memory bank, then you can get bus contention down pretty low.

    This is why SMP makers are going nuts over the Opteron. Your effective memory bandwidth scales linearly with the number of processors, assuming your processes partition nicely.

  16. Re:Poor mans dual-core by InvalidError · · Score: 4, Informative

    AMD Opterons each have their own local RAM and can access each other's RAM over the HT links to form a cache-coherent non-uniform memory architecture - ccNUMA.

    Multi-core Opterons have a special internal crossbar switch that allows the cores to share the memory controller and HT links; they do not 'piggy back' on each other. This reduces latencies and increases bandwidth for communication between the two cores and gives both cores equal-opportunity access to the HT ports and the CPU's local RAM. With a NUMA-enabled OS, applications will run off the CPU's local RAM whenever possible to minimize bus contention, and this allows Opteron servers' overall bandwidth and processing power to scale up almost linearly with the number of CPUs.

    As for Intel's dual-cores, the P4 makes sub-optimal use of its very limited available bandwidth. Turning HT on in a quad-core setup where the FSB is already dry on bandwidth naturally only makes things worse by increasing bus contention. Netburst was a good idea but it was poorly executed and the shared FSB very much killed any potential for scalability. If Intel gave the P4 an integrated RAM controller and a true dual-core CPU (two cores connected through a crossbar switch to shared memory and bus controllers like AMD did for the X2s), things would look much better. I'm not buying Intel again until Intel gets this obvious bit of common sense. The CPU is the largest RAM bandwidth consumer in a system, it should have the most direct RAM access possible. Having to fill pipelines and hide latencies with distant RAM wastes many resources and a fair amount of performance - and to make this bad problem worse, Intel is doing this on a shared bus. Things will get a little better with the upcoming dual-bus chipsets with quad-channel FBDIMM but this will still put a hard limit on practical scalability thanks to the non-scalable RAM bandwidth.

    On modern high-performance CPUs, shared busses kill scalability. AMD moved towards independent CPU busses with the K7 and integrated RAM controllers with the K8 to swerve around the scalability brick wall Intel was about to crash into many years ago and has kept on ramming ever since. Right now, Intel's future dual-FSB chipset is nothing more than Intel finally catching up with last millennium's dual-processor K7 platforms, only with bigger bandwidth figures.

  17. Re:Figures by MourningBlade · · Score: 2, Informative

    Well, AFAIK, the HTT thing only allows for the processor to sort of split execution units (FPU, ALU, etc) so that one can work on one thread, the other on another one. If an application resorts heavily to one of those units -- and my somewhat uninformed feeling is that software like SQL probably works mostly on the ALU -- it can't possibly GAIN performance. On the other hand, I can see the effort of trying to pigeonhole the idle threads on the wrong execution unit (will it even try that?) completely borking performance. So yeah, no surprises here.

    Intel and AMD's Hyperthreading is one-at-a-time: the chip is either working as CPU 0 or it's working as CPU 1. I believe the newest Power5 is the only chip out there that can divvy up the internal units amongst the threads and work simultaneously.

    And they have noticed performance improvements, though (once again) you have the cache size issue. Then again, a Power5 has an enormous CPU cache compared to an x86 processor.

  18. Re:Intel's Hyperthreading vs Sun's Chip Multithread by Anonymous Coward · · Score: 1, Informative

    Not at all. One of the big problems with HyperThreading as Intel has designed it is that they did not provide sufficient memory bandwidth to be able to feed both threads. This problem also plagues Intel's "dual core" chips. Ultimately, it eliminates the supposed benefit of switching over to the other thread when the first is blocked on memory access, because as soon as the second thread needs information from memory it will actually slow things down. Also, even if there were sufficient memory bandwidth, the comparatively long fetch times would still mean that the CPU would be blocked part of the time waiting on memory because there are only two threads available. Finally, there is a cost to switching between the threads, so even if you had the memory bandwidth and enough threads to prevent idle time, it would still lose time because of the overhead in switching to another thread whenever the active one gets blocked.

    On the other hand, the new UltraSPARC T1 (aka, Niagara) has massive memory bandwidth, shorter fetch times, four threads per core rather than two, and zero penalty for switching between threads. The result is incredible throughput with a total of 32 hardware threads (8 cores with four threads per core) in a single chip. And by the way, that single chip draws a fraction of the power and generates much less heat than a single Intel HT processor (I swear it seems like the systems are blowing out cool air).

    Note, however, that the T1 chip may not be ideal for all workloads. It does have a relatively slow single-threaded performance, so it works best when running highly concurrent applications with minimal locking, or when running several applications concurrently. For some applications, it may be desirable to use processor sets to limit the set of threads that it can see and/or to run multiple copies concurrently and load balance across them. But for others that are designed to scale well (e.g., those that already run well on larger systems like the E6800 with 24 UltraSPARC-III or the E2900 with 12 dual-core UltraSPARC-IV chips), then they can take full advantage of the available processing power.

    For the tests that I've run with an application that does scale, a system with a single UltraSPARC T1 chip easily doubles up the performance of a system with two 3.2GHz HT Xeons (regardless of whether HT was enabled or disabled). Of course, I haven't been lucky enough to test with the officially shipping version of the T1 chip (the ones I've been able to use have been running at a slower clock rate, and some of them have had some of the cores disabled), so that performance gap may actually be larger than I have been able to measure so far.

  19. Mod Parent UP! by GroundBounce · · Score: 2, Informative

    The parent post is common sense, which seems infrequent. I have found the range to be quite wide: when rendering animations from Blender, I have found that hyperthreading results in nearly 70% faster throughput when turned on. For rendering MPEG2 using TMPGEnc (under Wine), I see around 40% improvement with HT on. Clearly, these two applications benefit quite a bit from HT due to small computational footprint and/or low cache contention, etc. On the other hand, on my system, on-screen 3D acceleration in the NVIDIA driver (under Linux at least) appears to suffer with HT, with frame rates that are around 10-20% slower than with HT disabled.

    So, I see improvements ranging from -20% to +70% depending on the application, with many applications seeing only small differences one way or the other. Like many things, this tends to turn into a religious debate when the fact is that it varies case-by-case.
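
    For anyone wanting to report their own case-by-case numbers the same way, the percentages above are just relative throughput changes. A small sketch; the function name and the sample figures in the usage block are hypothetical, not my measurements.

```python
def ht_gain_percent(throughput_ht_off: float, throughput_ht_on: float) -> float:
    """Relative throughput change from enabling HT, in percent.

    Positive means HT helped, negative means it hurt. Throughput can be any
    rate where higher is better (frames per second, jobs per hour, ...).
    """
    if throughput_ht_off <= 0:
        raise ValueError("baseline throughput must be positive")
    return (throughput_ht_on - throughput_ht_off) * 100.0 / throughput_ht_off


if __name__ == "__main__":
    # Hypothetical (HT off, HT on) throughput pairs for two workloads.
    samples = {"renderer": (10.0, 17.0), "3d_driver": (60.0, 51.0)}
    for name, (off, on) in samples.items():
        print(f"{name}: {ht_gain_percent(off, on):+.0f}%")
```

    Run each workload several times with HT off and on in the BIOS and feed the averages in; a single run is rarely enough to separate a small difference from noise.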

  20. wrong on both counts by r00t · · Score: 4, Informative

    First, it doesn't matter if the server uses threads or processes. Threads have a minor performance advantage for startup and context switching, and some disadvantages for memory allocation speed (finding VM space is a hashing problem) and some locking overhead. For the most part though, with tasks that just crunch numbers (including scanning memory) or make system calls, there isn't all that much difference.

    Running 2 threads per CPU is not cheating. It's normal to run 1 thread per CPU plus 1 thread per concurrent blocking IO operation. That could come out to be 2 threads per CPU.
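
    As a sketch of that sizing rule (1 thread per CPU plus 1 per concurrent blocking IO operation); the function name is my own and the IO count is something you have to estimate for your own server:

```python
import os
from typing import Optional


def worker_thread_count(concurrent_blocking_io: int,
                        cpus: Optional[int] = None) -> int:
    """One thread per CPU plus one per concurrent blocking IO operation."""
    if cpus is None:
        cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    if cpus < 1 or concurrent_blocking_io < 0:
        raise ValueError("cpus must be >= 1 and concurrent_blocking_io >= 0")
    return cpus + concurrent_blocking_io


if __name__ == "__main__":
    # A 4-CPU box with 4 blocking IO operations in flight gets 8 threads,
    # which works out to 2 threads per CPU.
    print(worker_thread_count(4, cpus=4))
```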

  21. Re:You'd think... by tomstdenis · · Score: 2, Informative

    Not really. You're trying to run two threads inside the L1 caches, the decoder bandwidth, etc. So the 16KB of cache turns back into [effectively] 8KB.

    The P4 can issue 3 uOPs per cycle but IIRC only from one thread. They alternate [stalls free up slots though]. Also the decoder can only decode ONE x86 opcode per cycle. Then you have expensive memory ports. Fetches to L2 [or alternatively to system memory] are done one at a time per thread [with deep queues which were lengthened in the Prescott core].

    Aside from the stronger ALU and the fact there are two of them, the AMDx2 also benefits from having dedicated caches per core and a dedicated HT bus between the cores that doesn't sit on the external bus. If you want SMP performance just get opterons or amdx2 cores. Really that simple.

    We all know HT was a kludge hack at the last minute to gain some marketing press. If Intel really wanted to boost the performance of the P4 they would strengthen the ALU [at a cost of clock frequency]. If a 2.2GHz AMD64 can ROUTINELY beat a 3.2GHz P4 then a 2.5GHz optimized P4 variant [with a stronger ALU] could hold its own.

    Oh wait, that already exists. It's called the Pentium M. :-)

    Tom

    --
    Someday, I'll have a real sig.
  22. Re:even dual core hurts by tomstdenis · · Score: 2, Informative

    The AMDx2 has separate L2 caches per core, they can communicate via a dedicated HT link [between 3.2 and 4 GiB/sec] and share one memory controller.

    So if one core requests cache line $x and the other core has it the data will be sent over the internal HT link and not even hit the memory bus. The memory controller is pipelined [I suspect] so even while the L2 fulfillment is going on the memory bus can be busy fetching another cache line.

    The HT cores have *one* L2 cache per physical core [so for instance, a dual-core HT processor has 4 "logical processors" but only 2 L2 caches]. The Prescott [and later] cores have deep memory read/write pipelines to queue up many memory operations at once. The dual-core P4s have their own cache per physical core, which communicates on a FSB that is shared between everything [including the memory bus].

    Though in reality the dual-core P4 [e.g. 8xx series] does achieve 2x performance on totally unrelated [and not memory bound] tasks. For instance, doing RSA computations with CRT in two threads gets you a result with half the latency. [same is true for the AMDx2].
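
    To show why RSA with CRT splits so cleanly across two cores, here is a toy sketch: the two half-size exponentiations share no data until the final recombination. The tiny textbook key (p=61, q=53, d=2753) is for illustration only, the function name is my own, and in CPython the two threads will not actually run in parallel because of the GIL; the point is only the independence of the two halves. It needs Python 3.8+ for the modular inverse via pow().

```python
from concurrent.futures import ThreadPoolExecutor


def rsa_crt_decrypt(c: int, p: int, q: int, d: int) -> int:
    """RSA private-key operation using the Chinese Remainder Theorem."""
    dp, dq = d % (p - 1), d % (q - 1)  # reduced private exponents
    q_inv = pow(q, -1, p)              # inverse of q modulo p

    # The two exponentiations are independent, so each can go to its own core.
    with ThreadPoolExecutor(max_workers=2) as pool:
        half_p = pool.submit(pow, c, dp, p)
        half_q = pool.submit(pow, c, dq, q)
        m1, m2 = half_p.result(), half_q.result()

    # Garner's recombination of the two halves.
    h = (q_inv * (m1 - m2)) % p
    return m2 + h * q


if __name__ == "__main__":
    p, q, d = 61, 53, 2753  # toy key, n = 3233
    print(rsa_crt_decrypt(2790, p, q, d) == pow(2790, d, p * q))
```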

    So the dual-core P4 isn't a bad buy if money is strapped. You'll get more performance from an AMDx2 though as the individual cores are so much faster.

    Tom

    --
    Someday, I'll have a real sig.
  23. Re:HT != SMP by TheRealFritz · · Score: 2, Informative

    Actually, the document you point out kind of describes the flaw in the MS scheduler. The MS scheduler only optimizes HT behavior when you have several physical HT enabled CPUs. So let's say you have 2 P4s with HT. Windows will see 4 CPUs. The article describes how Windows will try to schedule on a logical CPU that's part of a physical CPU which currently has nothing scheduled.

    So the support described by Microsoft completely fails to address single physical CPU scenarios, or multi-CPU scenarios under high thread load.

    What the scheduler needs to do in a single physical HT CPU scenario is only allow threads to execute on the second logical CPU that share resources with the thread executing on the first logical CPU in order to minimize resource contention.
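
    A toy sketch of the placement policy the article describes (prefer a logical CPU on a physical CPU that currently has nothing scheduled). This is an illustration of the idea only, not Microsoft's scheduler code; the function name and the data layout are assumptions.

```python
from typing import List, Optional, Set


def pick_logical_cpu(packages: List[List[int]], busy: Set[int]) -> Optional[int]:
    """Choose a logical CPU for a new thread.

    `packages` lists the logical CPU ids of each physical CPU; `busy` holds
    the logical CPUs that already run a thread. Returns None if all are busy.
    """
    # First choice: a physical CPU with nothing scheduled on any logical CPU.
    for logical_cpus in packages:
        if not any(cpu in busy for cpu in logical_cpus):
            return logical_cpus[0]
    # Fallback: any idle logical CPU, which shares a core with a busy sibling.
    for logical_cpus in packages:
        for cpu in logical_cpus:
            if cpu not in busy:
                return cpu
    return None


if __name__ == "__main__":
    two_p4s_with_ht = [[0, 1], [2, 3]]  # Windows would see 4 CPUs
    print(pick_logical_cpu(two_p4s_with_ht, busy={0}))
```

    With a single physical HT CPU, or once every physical CPU has one thread, the first loop never finds anything and the policy falls through to the sibling case, which is exactly the scenario the existing support does not address.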

  24. Re:You'd think... by tomstdenis · · Score: 2, Informative

    You'd be wrong. The two L2 caches on the AMDx2 allow them to have access in parallel. Had they been unified you'd either have to make it dual-ported [e.g. larger and possibly slower] or have access in sequence [e.g. double the latency].

    If you're running tasks that share [e.g. write to] the same small pocket of memory you're right. However, many tasks don't do that. Often a server will spawn an entire new thread [e.g. unique stack and heap objects] to handle connections.

    It also makes sense in the desktop scene. Why would X11 and XMMS be accessing the same code? One is a media player and the other is a window server. They have different data objects in their own respective process spaces. So a unified cache doesn't help. Also keep in mind that smart OSes keep tasks on a given CPU so the cache doesn't get killed as quickly.

    Tom

    --
    Someday, I'll have a real sig.