Slashdot Mirror


Hyperthreading Hurts Server Performance?

sebFlyte writes "ZDNet is reporting that enabling Intel's new Hyperthreading Technology on your servers could lead to markedly decreased performance, according to some developers who have been looking into problems that have been occurring since HT has been shipping automatically activated. One MS developer from the SQL server team put it simply: 'Our customers observed very interesting behaviour on high-end HT-enabled hardware. They noticed that in some cases when high load is applied SQL Server CPU usage increases significantly but SQL Server performance degrades.' Another developer, this time from Citrix, was just as blunt. 'It's ironic. Intel had sold hyperthreading as something that gave performance gains to heavily threaded software. SQL Server is very thread-intensive, but it suffers. In fact, I've never seen performance improvement on server software with hyperthreading enabled. We recommend customers disable it.'"

71 of 255 comments (clear)

  1. This is news? by Anonymous Coward · · Score: 5, Informative

    Anybody who understands HT has been saying this since chips supported it, I have it enabled because I find that at typical loads our DB servers performance benefits from HT aware scheduling. Welcome to 2002.

    1. Re:This is news? by dindi · · Score: 4, Interesting

      Mysql on linux with a 10gig DB for me definetely benefits my server's performance.
      In fact turning it off results in a 20+ percent query time, especially with multiple fulltime queries.

      Of course differently written queries and different systems/sql engines might behave differently.

      In fact I am so happy with HT, that I am going to change my desktop to one, as it is a linux machine with lots of running apps at the same time. Not mentioning that it is also a devel station with SQL+apache that benefited with HT according to my experience.

      (well it is time to upgrade anyway, and I choose HT over non HT).

    2. Re:This is news? by magarity · · Score: 5, Interesting

      Anybody who understands HT has been saying this since chips supported it
       
      People also have to trouble themselves to configure things properly which isn't the obvious or the default. HT pretends to Windows that its another processor but as you know it isn't. So you have to set SQL Server's '# of processors for parallel processing' setting to the number of real processors, not virtual. We changed ours to this spec and performance went up markedly. SQL Server defaults to what Win tells it the number of procs are and tries to run a full CPU's worth of load on the HT. Not gonna happen.

    3. Re:This is news? by magarity · · Score: 2, Informative

      You should have been taught that two 1Ghz CPUs may or may not be slower than one 2Ghz CPU depending on what the server does for a living. The OS consideration is miniscule; cross CPU communication is almost as fast as the internals of the CPUs until you get up to NUMA type machines. Even then, no, a blanket statement such as the one you say you've been taught is incorrect for at least as many cases as it is correct.

    4. Re:This is news? by Velox_SwiftFox · · Score: 2, Interesting

      Ah, but with Intel, now you have to choose between dual cores and HT (or pay a lot for the super gaming processor). And choose 2M over 1M cache over 2 processors with 1M each cache, et cetra. Even in the medium priced processors.

      Experience here shows the servers I deal with running Linux 2.6 kernel/Apache/MySQL and dual Xeons up to 6GB is that turning HT on as well reduces performance. When a CPU fan failed and one CPU had to be temporarily removed, however, there was a clear benefit turning it on with the single processor.

    5. Re:This is news? by dindi · · Score: 2, Interesting

      Hmm interesting.
      I was talking about a single proc and HT,
      I imagine that with dual + HT it is different. I do not see why it is happening. Actually if I bought an expensive server and experienced that, I might try to get some official explanation
      for the problem.

      I wonder If you tried BSD or Windows on the same or similar hardware, that might be some OS specific problem as well.

      Hmm, Google on it I will. :)

    6. Re:This is news? by dnoyeb · · Score: 2, Interesting

      Quite interesting. So SQL Server spawns processes as opposed to threads when it finds a second processor? I can't imagine thats true. What exactly do you mean by a 'full CPU's worth of load'?

      The only situation I can imagine is if SQLServer spawns say, 2 threads per CPU for performance. But this is a cheating way to get more CPU time and I wouldn't expect a _server_ class program to do such a thing when such a program would tend to expect its getting dedicated CPU anyway.

  2. It's all in the name by hjf · · Score: 3, Insightful

    Well, a technology with a name such as "HyperThreading" is targeted more to end users who don't know about processors, rather than SQL "Performance Tuners" who try to squeeze every cycle of processing power.
    HyperThreading might help poorly written thread management (independent audio and video subsystems for example), but not true multithreading, that's for sure.

  3. The code wasn't changed by ocelotbob · · Score: 5, Informative

    I read the intel assembly guide section regarding hyperthreading, and it clearly states that performance will drop if you don't take the shared cache into consideration. The two logical threads contend for the cache, causing the performance problems that were described. In order for there to be a true benefit to hyperthreading, either the program, the OS or the compiler needs to determine that hyperthreading is enabled, and model the code to only use less than half the cache. It's been known that way since the beginning, and frankly, is silly that MS is scratching their heads wondering why this is. Lower the cache footprint, and I'll be willing to bet that performance rises dramatically.

    --

    Marxism is the opiate of dumbasses

    1. Re:The code wasn't changed by springbox · · Score: 4, Insightful

      That's lame. It seems like an exteremely BAD idea to get programs to worry about the total cache usage on the CPU. If this is the case, then no wonder performance is suffering. There should be no reason for any programmer to write a threaded application so it's "hyperthreading optimized," especially since HT was seemingly created as a transparent mechanism to increase performance.

    2. Re:The code wasn't changed by drerwk · · Score: 4, Insightful

      It seems like an exteremely BAD idea to get programs to worry about the total cache usage on the CPU.
      If you want to maximize performance then you want the compiler to know as much as possible about the architecture. If you have no cache then loop unrolling is a good thing, if you have a small cache then loop unrolling can bust the cache. If you are doing large matrix manipulations, how you choose to stride the matrix, and possibly pad it is exactly dependent on the size of the cache. Now, it may be that having the applications programmer worry about it is too much to ask, but the compiler most certainly needs to worry about such detail.

    3. Re:The code wasn't changed by springbox · · Score: 3, Insightful

      It depends on what your goals are. I do realize that was a fairly general statement, and it does not apply to every application. For something like lets say MS SQL server without a compiler that does it automatically, it would be an unreasonable expectation. If someone was writing an application for an embedded system, however, it might make sense if they chose the HT enabled processor. Are there any compilers currently that will do HT optimizations? I was under the impression that most commercial apps were basically compiled for the lowest common denominator anyway.

    4. Re:The code wasn't changed by ochnap2 · · Score: 5, Insightful

      That's nonsense. Compilers routinely do loads of optimisations to better suit the underlying hardware. That's why any linux distro that ships binary packages has many flavors of each important or performance sensitive package (specially the kernel, in Debian you'll find images optimised for 386, 586, 686, k6, k7, etc). Is one of the reasons of the existence of Gentoo, also.

      So MS had to make a choise: ship a binary optimized for every possible mix of hw (being the processor the most important factor, but not the only one), which is impossible, or ship images compatible with any recent x86 processor/hw... without being specially optimised for any. That's why hyperthreading performance suffers.

      This is an important problem on Windows because most of the time you cannot simply recompile the un-optimised software to suit your hardware, as you can in Linux, etc.

      (sorry for my bad english)

    5. Re:The code wasn't changed by springbox · · Score: 4, Insightful
      I wasn't thinking of compilers. I was mostly talking about the people who have to write the software. Assuming there's no compiler that knows about HT, I stand by my assertion that it would generally be a bad pratice to get people to worry about it. Especailly these days. Another point that I was trying to make is that even if there were compilers who knew about the HT issues, I still think it's exceedingly stupid that Intel went ahead with HT despite the glaring problems that were mentioned. If people want multiple of threads of execution on the same processor then they should get one with two cores.

      Lots of programs are designed with the multiple thread model in mind. Programs should not be designed with the multiple thread model plus cache limitations in mind.

    6. Re:The code wasn't changed by canavan · · Score: 4, Informative

      When optimizing code, the compiler should worry about cache size and cache footprint, so that it doesn't unroll inner loops too far or cause the code size to increase enough as to cause thrashing. HT has just cut the maximum cache footprint where increasing size for possily minor performance boosts may make sense in half. GCC has an option called --param max-unrolled-insns=VALUE, which controls just that. There are possibly others with similar effects, possibly also for other compilers. Additionally, it may make sense to have the compiler optimize for size instead of speed in some cases.

    7. Re:The code wasn't changed by Tim+Browse · · Score: 4, Insightful
      It seems like an exteremely BAD idea to get programs to worry about the total cache usage on the CPU.

      For an application like SQL Server, I'd have to disagree. Are you saying there's no one on the MSSQL team who looks at cache usage? I'd hope there were a lot of resources devoted to some fairly in-depth analysis of how the code performs on different CPUs. After all, after correctness, performance is how SQL Server is going to be judged (and criticised).

      Given that a while back I watched a PDC presentation by Raymond Chen on how to avoid page faults etc in your Windows application (improving start-up times, etc), I'd say that Microsoft are no strangers to performance monitoring and analysis.

      For your average Windows desktop app, then yes, worrying about cache usage on HT CPUs is way over the top. For something like SQL Server? Hell, no.

    8. Re:The code wasn't changed by ElvenMonkey · · Score: 2, Interesting

      If people want multiple of threads of execution on the same processor then they should get one with two cores.

      If you read the article / summary you'd see what its talking about are servers that come with HT enabled by default. Thinking off the top of my head I can't come up with a single Intel processor still being sold and used in servers today that doesn't have HT technology built in. We're not talking about people specifically buying HT processors looking to get a performance boost, we're talking about every single individual who is buying or has bought an Intel server certainly within the last year or two. I've certainly noticed HT being enabled by default on our servers over the past couple of years. Novell has always strongly advised against it, so its always been turned off not long after they're powered up for the first time.

      From the way the article reads, the software companies are expressing concern that HT is being enabled by default on server, rather than that they're baffled why its causing slower performance. It even goes so far as to point out that the shared L1 and L2 cache is the problem.

      --
      "Joy is not in things; it is in us." Richard Wagner
  4. Poor mans dual-core by IdleTime · · Score: 4, Interesting

    indeed has once again proved it is expensive to be poor.

    Question I find more interesting: What is the performance gap between dual CPU vs Dual-core?

    --
    If you mod me down, I *will* introduce you to my sister!
    1. Re:Poor mans dual-core by Chewbacon · · Score: 2, Informative

      I have a dual core on my desktop at home and HT on my machine at work. I'll take the dual core over HT any day that ends with a Y. You can multi-task so well with the Pentium D it becomes blissful. Want to archive a DVD movie and put your favorite CD on your mp3 player? Set the two apps to run on different cores. On the other hand, my HT workstation goes nuts-slow if I try to do two intensive tasks at once.

      --
      Chewbacon
      The Bible is like Wikipedia: written by a bunch of people and verifiable by questionable sources.
    2. Re:Poor mans dual-core by dsci · · Score: 5, Insightful

      What is the performance gap between dual CPU vs Dual-core?

      It's the usual answer: it depends.

      We have to get rid of the notion that there is one overall system architecture that is "right" for all computing needs.

      For general, every-day desktop use, there should be little difference between a dual CPU SMP box and a dual core box.

      I have a small cluster consisting of AMD 64 X2 nodes, and the nodes use the FC4 SMP kernel just fine. All scheduling between CPU's is handled by the OS, and MPI/PVM apps run just as expected when using the configurations suggested for SMP nodes.

      In fact, with the dual channel memory model, dual core AMD systems might be a little better than generic dual CPU, since each processor has it's "own" memory.

      --
      Computational Chemistry products and services.
    3. Re:Poor mans dual-core by masklinn · · Score: 2, Informative
      In fact, with the dual channel memory model, dual core AMD systems might be a little better than generic dual CPU, since each processor has it's "own" memory.

      Nope, both cores use the same bridge to access central memory so that point is moot. On the other hand, the cores of an AthlonX2 get to discuss with one another through a special link while regular multiprocessor have to use the FSB (or HyperThreading for AMD's Opterons) link, and therefore have to compete with every other device using said FSB/HT (on top of getting much higher latencies)

      --
      "The way we can tell it's C# instead of Haskell is because it's nine lines instead of two." -- wadler
    4. Re:Poor mans dual-core by Malor · · Score: 4, Informative

      I think you're kind of saying this already, but I felt confused by your wording and thought I'd chime in. I'm a little blurry on a few of these details, and too lazy to go look things up, so pay attention to replies... don't treat this as gospel.

      As far as I know, all multi-cpu AMD packages use exactly the same method to talk amongst themselves, HyperTransport. They absolutely use a private, dedicated HT bus between cores. I *think* that when you run two single core Opterons, each has a link to main memory, and they also share a direct link. In the case of a 4-die system, I think the third and fourth CPUs 'piggyback' on the 1st and 2nd... they talk to processors 1 and 2, and each other. Processors 1 and 2 do main-memory fetches on their behalf. Each CPU has its own dedicated cache, and I think the cache ends up being semi-unified... so that if something is in processor 2's cache, when processor 4 requests the data, it comes from processor 2 instead of main memory. That's not quite as fast as direct cache, but it's a LOT faster than the DRAM.

      The X2 architecture is like half of a 4-way system. There's one link to main memory, and one internal link between the two CPUs... the second one is piggybacking, just like processors 3 and 4 do in a the 4-way system. It's not quite as good as a dedicated bus per processor, but the AMD architecture isn't that bandwidth-starved, and a 1gb HT link is usually fine for keeping two processors fed. You do lose a little performance, but not that much.

      Intel dual cores share a single 800mhz bus, with no special link between the chips. And the Netburst architecture is extremely memory bandwidth hungry. Because of its enormous pipeline, a branch mispredict/pipeline stall hurts terribly. The RAM needs to be very very fast to refill the pipeline and get the processor moving again.

      So running two Netburst processors down a single, already-starved memory bus is just Not a Good Idea. It's a crummy, slapped-together answer to the much, much better design of the AMD chips. It's a desperate solution to avoid the worst of all possible fates... not being in a high-end market segment at all.

      Next year this could all be different again, but at the moment, AMD chips, particularly dual core, are a lot better from nearly every standpoint.

    5. Re:Poor mans dual-core by volsung · · Score: 4, Informative
      That's not quite true either. Each Opteron has a separate memory controller (dual-channel), which means that each CPU can have its own pipe to a bank of memory. So if the CPU needs to access memory in its banks, it will not have to contend with the other CPU over the HT link. A NUMA-aware OS will try to schedule processes on the same CPU which controls the process's allocated memory. If your programs can fit in one CPU's memory bank, then you can get bus contention down pretty low.

      This is why SMP makers are going nuts over the Opteron. Your effective memory bandwidth scales linearly with the number of processors, assuming your processes partition nicely.

    6. Re:Poor mans dual-core by Anonymous Coward · · Score: 2, Interesting

      You're lying. All HT does is schedule a single processor's execution units so as to achieve more parallelism out of the code stream than can be obtained from instruction level parallelism using OoO execution. The only conceivable way for this to be doubling your performance, would be if your output code were so unbelievably awful as to merit some form of parade in its honor for sucking so completely. The only way for an actual SMP system (dual core or dual processor) to not improve upon this would be if your algorithm doesn't scale to more than two threads, since the Pentium D affords both two cores and HT. Or you could just disable HT and get a compiler that doesn't suck monkey ass so much that the P4 can't execute more instructions in parallel.

    7. Re:Poor mans dual-core by InvalidError · · Score: 4, Informative

      AMD Opterons each have their own local RAM and can access each other's RAM over the HT links to form a a cache-coherent non-uniform memory architecture - ccNUMA.

      Multi-core Opterons have a special internal crossbar switch that allow the cores to share the memory controller and HT links, they do not 'piggy back' on the other. This reduces latencies and increases bandwidth for communication between the two cores and gives both cores the equal-opportunity access to the HT ports and CPU's local RAM. With a NUMA-enabled OS, applications will run off the CPU's local RAM whenever possible to minimize bus contention and this allows Opteron servers' overall bandwidth and processing power to scale up almost linearly with the number of CPUs.

      As for Intel's dual-cores, the P4 makes sub-optimal use of its very limited available bandwidth. Turning HT on in a quad-core setup where the FSB is already dry on bandwidth naturally only makes things worse by increasing bus contention. Netburst was a good idea but it was poorly executed and the shared FSB very much killed any potential for scalability. If Intel gave the P4 an integrated RAM controller and a true dual-core CPU (two cores connected through a crossbar switch to shared memory and bus controllers like AMD did for the X2s), things would look much better. I'm not buying Intel again until Intel gets this obvious bit of common sense. The CPU is the largest RAM bandwidth consumer in a system, it should have the most direct RAM access possible. Having to fill pipelines and hide latencies with distant RAM wastes many resources and a fair amount of performance - and to make this bad problem worse, Intel is doing this on a shared bus. Things will get a little better with the upcoming dual-bus chipsets with quad-channel FBDIMM but this will still put a hard limit on practical scalability thanks to the non-scalable RAM bandwidth.

      On modern high-performance CPUs, shared busses kill scalability. AMD moved towards independant CPU busses with the K7 and integrated RAM controllers with the K8 to swerve around the scalability brick wall Intel was about to crash into many years ago and has kept on ramming ever since. Right now, Intel's future dual-FSB chipset is nothing more than Intel finally catching up with last millenia's dual-processor K7 platforms, only with bigger bandwidth figures.

  5. Behold! by alphapartic1e · · Score: 5, Funny

    Perhaps this ushers a new era of computing, where Intel chips underperform AMD ones.

    Oh, wait...

  6. sort of obvious by Vlad_the_Inhaler · · Score: 4, Informative

    If you have a system thread cleaning out blocks of disk cache memory then of course it is going to suffer. The whole point of hyperthreading was that one thread could run while another was waiting for I/O.

    The first tests on Linux when Hyperthreading came out were also pretty discouraging.

    --
    Mielipiteet omiani - Opinions personal, facts suspect.
    1. Re:sort of obvious by timeOday · · Score: 2, Insightful
      The whole point of hyperthreading was that one thread could run while another was waiting for I/O.
      Huh? You don't need hyperthreading for that, it's just normal multitasking.
    2. Re:sort of obvious by Mateorabi · · Score: 2, Informative
      Except that in multitasking, when a process blocks and swaps you suffer hundreds to thousands of cycles while the OS swaps out processes structs, rewrites VM tables, etc. This usualy happens at the os syscall level too.

      In hyperthreading, one thread simply stops contending for functional units for 10s of cycles letting the other, already loaded and running thread max out its ALU/FPU usage while the other waits for cache to get filled from DRAM. This is much higher granularity: the OS doesn't force a swap penalty for every single cache miss, because the act of swapping is way more expensive.

      The problem is if both threads are simultaneously memory (vs cpu) intensive then you end up with two waiting threads and don't see a performance boost. Even worse, they both start fighting over the same cache lines. This is the HT process equivalent of virtual memory thrashing, only its at the DRAMcache level instead of the diskDRAM level.

      --
      "You saved 1968." - Ms. Valerie Pringle to the crew of Apollo 8

  7. Marketing ploy. by Chickenofbristol55 · · Score: 2, Insightful

    I don't want to start a flamewar, but everytime I see an Intel commercial when the announcer says "pentium 4 with ht technology", it sounds like a stupid marketing ploy. It's suppose to offer better performance in heavily threaded apps, but apparently it doesn't. Also, in the commercials, it never explains to the customer what HT is, which just shows that if they had a great piece of technology, they would atleast take 10 seconds to explain the benefits, but they never do. They say a catch phrase, and that's really what it all seems to boil down to.

    --
    public class null extends java applet { System.out.print ("Tabula Rasa"); }
  8. Figures by xouumalperxe · · Score: 5, Interesting

    Well, AFAIK, the HTT thing only allows for the processor to sort of split execution units (FPU, ALU, etc) so that one can work on one thread, the other on another one. If an application resorts heavily to one of those units -- and my somewhat uninformed feeling is that software like SQL probably works mostly on the ALU, it, can't possibly GAIN performance. On the other hand, I can see the effort of thrying to pigeonhole the idle threads on the wrong execution unit (will it even try that?) completely borking performance. So yeah, no surprises here.

    1. Re:Figures by MourningBlade · · Score: 2, Informative

      Well, AFAIK, the HTT thing only allows for the processor to sort of split execution units (FPU, ALU, etc) so that one can work on one thread, the other on another one. If an application resorts heavily to one of those units -- and my somewhat uninformed feeling is that software like SQL probably works mostly on the ALU, it, can't possibly GAIN performance. On the other hand, I can see the effort of thrying to pigeonhole the idle threads on the wrong execution unit (will it even try that?) completely borking performance. So yeah, no surprises here.

      Intel and AMD's Hyperthreading is one-at-a-time: the chip is either working as CPU 0 or it's working as CPU 1. I believe the newest Power5 is the only chip out there that can divvy up the internal units amongst the threads and work simultaneously.

      And they have noticed performance improvements, though (once again) you have the cache size issue. Then again, a Power5 has an enormous CPU cache compared to an x86 processor.

  9. How is this news? by logicnazi · · Score: 3, Informative

    This sort of effect has been talked about for as long as I remember hearing about hyperthreading. It was common knowledge long before the chips came out that running two threads on the same cache can cause performance issues. One can see this with two chips sharing an L2 cache so why should it be a surprise here?

    The real question is whether this issue can be optimized for. If the developers design their code with HT in mind will this still be a problem since the other thread may belong to another process or would properly optimized code be able to deal with his?

    Most importantly is this a rare effect or a common one? Would it be rare or common if you optimize your programs for an HT machine?

    --

    If you liked this thought maybe you would find my blog nice too:

  10. So, what do we call this? by jcr · · Score: 5, Funny

    Hyperthrashing?

    -jcr

    --
    The only title of honor that a tyrant can grant is "Enemy of the State."
    1. Re:So, what do we call this? by porneL · · Score: 2, Interesting

      HyperThrottling

  11. Re:It's been that way since day one, desktop as we by logicnazi · · Score: 5, Interesting

    As someone who commented above pointed out intel openly acknowledges performance can be hurt. I don't know what you mean about not being acceptable to notice this as I've seen this sort of issue mentioned in pretty much every article I've read on HT starting quite far back.

    HT is just another chip technology like any other. It is only in the rarest circumstances that a new technology will be better/faster for everything. These things all have tradeoffs and the question is whether the benefits are enough to exceed the disadvantages.

    I really think you are being a little unfair to intel. If you had evidence that it decreased performance for most systems even when the software was compiled taking HT into account then you might have a point. However, as it is this is no different than IBM touting its RISC technology or AMD talking about their SIMD capabilities. For each of these technologies you could find some code which would actually run slower. If you happen to be running code which makes heavy use of some hardware optimized string instructions a RISC system can actually make things worse not to mention a whole other host of issues. The SIMD capabilities of most x86 processors required switching the FPU state which took time as well.

    It's only reasonable that companies want to publisize their newest fancy technology and they are hardly unsavory because they don't put the potential disadvantages centrally in their advertisements/PR material. When you go on a first date do you tell the girl about your loud snoring, how you cheated on your ex or other bad qualities about yourself. Of course not, one doesn't lie about these things but it is only natural to want to put the best face forward and it seems ridiculous to hold intel to a higher standard than an individual in these matters.

    --

    If you liked this thought maybe you would find my blog nice too:

  12. HT problems with firebird database (slowdowns) by mAriuZ · · Score: 2, Informative

    Usual response is to disable it from bios

    One possible solution (code patch)

    http://sourceforge.net/mailarchive/message.php?msg _id=12403341

    Other threads with hyperthreading problems (slowdowns)
    http://sourceforge.net/search/?forum_id=6330&group _id=9028&words=hyperthreading&type_of_search=mlist s

    --
    developer http://flamerobin.org
  13. Windows problem? by kasperd · · Score: 2, Insightful

    The article seems to focus only on Windows. To get good performance from hyperthreading, the scheduler has to be aware of situations that could lead to decreased performance and avoid them. So is this a problem with the Windows scheduler being unable to deal with hyperthreading or is hyperthreading really broken? How is hyperthreading performance on other operating systems?

    Another question one needs to ask is, how is performance on single and dual CPU systems? Getting good performance on a dual CPU HT system (which means four logical CPUs) is more complicated and thus requires more sophisticated algorithms in the scheduler.

    Applications are most likely not to be blamed for the decreased performance. Such hardware differences should be dealt with by the kernel. Occationally the scheduler should keep one thread idle whenever that leads to the best performance. Only when there is a performance benefit should both threads be used at the same time.

    --

    Do you care about the security of your wireless mouse?
  14. Time to Buy AMD? by olddotter · · Score: 4, Insightful
    Sounds like it might be time to buy more AMD stock.

    I second the person that said programmers shouldn't be writing code to the cache size on a processor. How well your code fits in cache is not something you can control at run time. Different releases of the CPU often have different cache sizes. And frankly developers should always try to achieve tight efficent code, not develope to a particular cache size.

  15. HT kills my ATI All in Wonder by puto · · Score: 4, Interesting

    I have had an ATI all in wonder 9800 for close to more than a year now. I never really used the tuner part until a few weeks a go when I took delivery of several new LCD's and decided that I could be watching a little tv on one while working.

    The 9800 sits on my XP box, which rarely gets rebooted. Games, browsing etc. My mac mini and linux boxes sit in their places with a KVM

    Well after using the tuner part, it looks great with my digital cable. But the box would lock, couldnt kill the process of the ATI software MMC. A few times an hour sometimes at least once a day. Well I was on the point of sticking an old haupage in there. Or using another MMC.

    Well after much digging I found a thread on how HT could cause issues with the software. I disabled it in the bios, do not really need it for anything. And ran the Tuner 48 hours solid without a lockup.

    Now perhaps ATI is at fault for the software, but then again HT caused the incompatibility in my book.

    Puto

    --
    The Revolution Will Not Be Televised
    1. Re:HT kills my ATI All in Wonder by sm8000 · · Score: 5, Funny

      You watched TV for 48 hours?

    2. Re:HT kills my ATI All in Wonder by puto · · Score: 4, Funny

      Just your mom on a Cinemax weekend hump-a-thon.

      I left it on for 48 hours unattended.

      Puto

      --
      The Revolution Will Not Be Televised
    3. Re:HT kills my ATI All in Wonder by laffer1 · · Score: 2, Informative

      I don't run with htt on but I do have an SMP box. (2 xeon 2ghz) The ati software works fine on my desktop, so it must be an issue specific to hyperthreading.

  16. Not a developer from Citrix by Bluey · · Score: 2, Informative

    I know asking for them to research is a stretch, but the submitter should at least read the acticle before submitting it. The quote was from a Technical Director at a consulting company that sells Citrix software, not from a developer at Citrix. Hyperthreading can definitely help performance of Metaframe running under Windows 2003. Enabling it in the bios on a server running Windows 2000 was where the problem resided.

  17. shared cache versus local memory by erwincoumans · · Score: 2, Informative

    "I read the intel assembly guide section regarding hyperthreading, and it clearly states that performance will drop if you don't take the shared cache into consideration." This is a general problem. XBox 360 has similar issues, 3 cores sharing the same cache. Having multiple independent cpu's with each its local memory (like multiprocessor or PS3 SPU's),doesn't suffer from these issues.

  18. Of course it can hurt performance by Anonymous Coward · · Score: 2, Informative

    HT is a very simple concept: Virtualize 2 CPUs by cutting all caches in half and allocating each half to one of the CPUs, and allow the ALUs to process data from either thread. Ths can give good performance, for instance when one thread has a cache miss and is waiting for data from main memory (or god forbid there is a fault and you need to read from the HDD). In a normal single CPU operation, this ties up resources, and that thread can't make any progress. with HT on, the second thread can continue processing data. Or even without a cache miss, there are 4 (or more) ALUs on the die, and only certain types of applications can effectively make use of them all simulatneously. Having HT allows a higher probability that all the resources on the chip are used. But the cost, as I said above, is cutting the cache sizes in half (effectively). And cache is king for some applications. there are many job types where doubling the cache gives much better performance than even doubling the CPU speed (well, that is probably pushing it, ut certainly adding 10% more cache can be better than 10% higher clock rate), as it means less time going to main memory.

    It isn't a foolproof technology, but it has it's benefits. SQL can be very heavy on the cache, and I'm not surprised that it doesn't perform optimally without some tuning.

  19. Dual Core performance... by Name+Anonymous · · Score: 3, Interesting
    As others have said, it depends...

    Is it two complete cores? Front Side Bus speed? Memroy Speed? etc.

    The IBM 970MP that Apple is using for the dual core PowerMacs was designed right. And due to the cache snooping (among other things), a dual core 970MP can be slightly faster than a dual processor setu at the same clock and bus speeds.

    Another multicore chip to look at for being done right is the Sun UltraSPARC T1 processor. Up to 8 cores with 4 threads per core. Sun's threading model in this processor doesn't have the faults that Intel's HyperThreading does.

    Intel HT technology seems as bad a patch on the architecture much like Microsoft's updates to Windows.

    1. Re:Dual Core performance... by InvalidError · · Score: 3, Insightful

      HT and Netburst were good ideas... but they were poorly executed.

      Part of the reason for this is that desktop CPUs mostly run desktop apps and most desktop apps are single-threaded so Intel and AMD could not afford to give up on single-threaded performance. This forced them to add heaps of logic to extract parallelism and Intel made many (IMO dumb) decisions in the process. The SPARC stuff is used for scientific apps which have a long history of multi-threading and distributed computing so Sun does not have to worry about single-threaded performance, allowing for much simpler, leaner and more efficient designs.

      Where I think Netburst is particularly bad is the execution engine... when I read Intel's improved hyper-threading patent, I was struck in disbelief: the execution pipelines are wrapped in a replay queue that blindly re-executes uOPs until they successfully execute and retire. Each instruction that fails to retire on the first pass enters the queue and has its execution latency increased by a dozen cycles until its next replay. Once the queue is full, no more uOPs can be issued so the CPU wastes power and cycles re-executing stale uOPs until they retire, causing execution to stall on all threads. Prescott added independant replay queues for each thread so one single thread would never be able to stall the whole CPU by filling the queue... this could have helped Northwood quite a bit but Prescott's extra latency killed any direct gains from it. Intel should roll back to the Northwood pipeline and re-apply the good Prescott stuff like dedicated integer multiplier and barrel shifter, HT2, SSE3 and a few other things, no miracle but it would be much better than the current Prescotts, though it certainly would not help the saturated FSB issue.

      With a pure TLP-oriented CPU, there is no need for deep out-of-order execution, no need for branch prediction and no need for speculative execution. Going for TLP throughput allows the CPU to freeze threads whenever there is no nearby code that can execute deterministically instead of doing desperate deep searches, guesses and speculative execution: more likely than not, the other threads will have enough ready-and-able uOPs to fill the gaps and keep all execution units busy producing useful results on nearly every tick. Stick those SPARC chips on a P4-style shared FSB/RAM platform and they would still choke about as bad as P4s do.

      The P4's greatest achile's heel is the shared FSB... it was not an issue back when Netburst was running at sub-2GHz speeds but it clearly is not suitable for multi-threading multi-core multi-processor setups. The shared FSB is clearly taking the 'r' out of Netburst. The single-threaded obsession is also costing AMD and Intel a lot of potential performance, complexity and power.

  20. Hyperthreading works best with "bad" code by cimetmc · · Score: 4, Insightful

    Beside the cachae considerations which were discussed by numerous people here, there is one aspect that hasn't been mentioned.
    The reason why hyperthreading was introduced in first place was to reduce the "idle" time of the processor. The Pentium 4 class processors have an extremely long pipeline and this often leads to pipeline stalls. E.g. the processing of an instruction cannot proceed because it depends on the result of a previous instruction. The idea of hyperthreading is that whenever there is a potential pipeline stall, the processor switches to the other thread which hopefully can continue its executon because it isn't stalled by some dependency. Now most pipeline stalls occur when the code being executed isn't optimized for Pentium 4 class processors. However the better Pentium 4 optimized your code is, the less pipeline stalls you have and the better your CPU utilisation is with a single thread.

    Marcel

  21. Not Intel's fault; Microsoft's fault. c.f. Linux. by Theovon · · Score: 5, Interesting

    I remember early discussions from LKML where developers realized that if you were to run a high-priority thread on one virtual processor and a low-priority thread on the other VP, you'd have a priority imbalance and a situation that you'd want to avoid. The developers solved the problem by adding a tunable parameter that indicated the assumed amount of "extra" performance you could get out of the CPU from HT. In other words, with 1 CPU, max load is 100%; with two physical CPU's, max load is 200%; with one HT CPU, max load would be set to something on the order of 115% to 130%. So, when your hi-pri thread is running and the lo-pri thread wants to run, we let the low-pri thread only run 15% of the time (or something like that), resulting in only a modest impact on the hi-pri thread but an improvement in over-all system throughput.

    That being said, I infer from the article that Windows does not do any such priority fairness checking. Consider the example they gave in the article. The DB is running, and then some disk-cache cleaner process comes along and competes for CPU cache. If the OS were SMART, it would recognize that the system task is of a MUCH lower priority and either not run it or only run it for a small portion of the time.

    As said by others commenting on this article, the complainers are being stupid for two reasons. One, Intel already admitted that there are lots of cases where HT can hurt performance, so shut up. And Two, there are ways to ameliorate the problem in the OS, but since Windows isn't doing it, they should be complaining to Microsoft, not misdirecting the blame at Intel, so shut up.

    (Note that I don't like Intel too terribly much either. Hey, we all hate Microsoft, but when someone is an idiot and blames them for something they're not responsible for, it doesn't help anyone.)

  22. Benchmark your own application by cryfreedomlove · · Score: 2, Insightful

    I never accept the assertions that a configuration option lile HyperThreading is always good or always bad. It's never black and white. The answer is always: it depends on the application. In my experience a busy linux java based web serving application that does a lot of context switching and a lot of IO to back end applications uses less CPU when hyperthreading is enabled. Collective wisdom aside, it works for my application so I am leaving it on.

  23. Breaking the EULA ? by DrSkwid · · Score: 4, Funny

    I thought you couldn't report any performance issues of MS SQL Server :)

    --
    There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
  24. HT on Linux by RAMMS+EIN · · Score: 3, Informative

    Hyperthreading Speeds Linux.

    In a nutshell:

      - hyperthreading decreases syscall speed by a few percent
      - on single-threaded workloads, the effect is often negligible, with occasional large improvements or degradations
      - on multithreaded workloads, around 30% improvement is common
      - Linux 2.5 (which introduced HT-awareness) performs significantly better than Linux 2.4

    So, from that benchmark (and others like it, just STFW) it appears that HT offers significant benefits; you need multithreading to take advantage of it, and having a HT-aware OS helps.

    --
    Please correct me if I got my facts wrong.
  25. Interesting Technical Analysis on the subject by morcego · · Score: 4, Informative
    You will find here a very interesting technical analysis on the subject, by Bryan J. Smith, on why Hyperthreading is crappy engeneering. From the message:


    Since then, Intel has made a number of "hacks" to the i686 architecture.
    One is HyperThreading which tries to keep its pipes full by using its
    control units to virtualize two instruction schedulers, registers, etc...
    In a nutshell, it's a nice way to get "out-of-order and register
    renaming for almost free." Other than basic coherency checking as
    necessary in silicon, it "passes the buck" to the OS, leveraging its
    context switching (and associated overhead) to manage some details.

    That's why HyperThreading can actually be slower for some applications,
    because they do not thread, and the added overhead in _software_
    results in reduced processing time for the applications.
    --
    morcego
  26. Inconclusive on Linux? by ndogg · · Score: 2, Interesting

    I don't have a HT-capable proc (AMD Athlon XP 1700), so I don't know anything from personal experience.

    I decided to check out how PostgreSQL did with HT.

    The first link (1) was suggesting to someone--who was having performance problems under FreeBSD--to turn off HT. Of course, that may not be related to PostgreSQL itself, but rather FreeBSD. I really don't know.

    The next thing I found showed some mixed results with ext2 under Linux (2). Somethings showed gain with HT, but not others.

    Another link (3) commented that HT with Java requires special consideration when coding.

    I didn't come up with anything useful under PostgreSQL, so I checked out Linux.

    According to Linux Electrons, Linux performance can drop without proper setup.

    --
    // file: mice.h
    #include "frickin_lasers.h"
  27. That's not all by koan · · Score: 2, Interesting

    I use Nuendo for professional music recording and even though their latest version says it's HT aware, the performance is poor. In fact in several instances it only takes a few instruments loaded for it to peak CPU, change it back to basic CPU with HT off and it works fine.
    MY understanding is it's this way with Cubase as well.

    --
    "If any question why we died, Tell them because our fathers lied."
  28. Intel's Hyperthreading vs Sun's Chip Mulithreading by Dopeskills · · Score: 2, Interesting

    Can anyone explain to me the exact difference between HT and CMT ? I'm wondering if these same issues would plague Sun's new Niagra prcessor.

  29. Re:Should we turn it off in PCs? by Duhavid · · Score: 2, Insightful

    Yes, definately.

    Along with the rest of the machine.

    --
    emt 377 emt 4
  30. Re:Not Intel's fault; Microsoft's fault. c.f. Linu by Knackered · · Score: 2, Insightful
    HT is a bandaid for poor compiler technology and a mediocre architecture.


    I may agree that HyperThreading as implemented in the x86 architecture is a hack, but I wouldn't dismiss the original idea of HT, as implemented in the Tera supercomputers. It was designed to have hundreds of thread contexts in hardware, so if it has to wait on memory, there will be some other thread available to run. There are enough threads available that it can do without a cache, while utilising the full memory bandwidth. This quite neatly avoids cache consistency problems that can kill massively parallel performance.
    --
    a.
  31. Re:The developers are not smart enough! by canuck57 · · Score: 2, Insightful

    Are you high, or are you just in the habit of randomly making up nonsensical stuff?

    No, not high. Just willing to take pro-M$ flame bait today.

    I guess I overestimated the intelligence of the /. readership, especially those from the PC world.

    The fact is, if you are writing software to be efficient on a single processor the architecture of the software will be much different than if you know you have 32 processors. And neither is best for the other.

    For single processor speed you don't want the overhead of interprocess commutations so you can skip it and sequentially do what you need to do without worry of what the other processes are doing. In fact, this is usually how most programs operate as coding is much more easy to do.

    For multiprocessor systems you want to distribute as evenly as possible the work across as many processors and I/O busses as you have. It is worth the effort of code, threads, interprocess communications layer with mutex, locks and individual disk writers. But this model would run slower on a single CPU.

    The HT model isn't dual CPU in performance but does allow for 2 threads on the system to be active at once, at the expense of individual thread performance. Do we want single process speed or throughput? Example:

    Classic seti 3.x on Fedora Linux.

    - with 1 seti running takes 4 hours

    - with 2 seti running each takes 5.2 hours

    So if I want the fastest seti I want to run 1. If I want the most seti I want to run 2 to keep the processor busy to maximum performance.

    And MS SQL, like it or not will have it's ups and downs depending how it was architected.

  32. The Fix is not in The Software by Nom+du+Keyboard · · Score: 2, Interesting
    Where multiple threads access different parts of memory but are simultaneously processed by the chip's Hyperthreading Technology, the shared cache cannot keep up with their alternate demands and performance falls dramatically,

    Software shouldn't be expected to handle hardware quirks. It's up to the hardware to run the software efficiently.

    Seems to me a hardware fix would be to partition the cache into two pieces when HT is enabled and running -- use the whole cache for the processor otherwise.

    With 2MB caches per processor now becoming available, would this be such a bad thing? IIRC once you're up to 256KB of cache you've already got a hit rate near 90%. That severely limits your possible improvement to less than 10% regardless of how much more cache you add. And yes I am aware that increasing the processor multiplier does make every cache miss worse in proportion, but still having HT run more efficiently in the bargain could make this tradeoff worth it. And that's even before you consider uneven partitioning if the OS can determine that one thread needs more cache than the other.

    --
    "It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
  33. Offloading the responsibility to the OS by msimm · · Score: 2, Interesting

    is still a kludge. HT was a cheap hack to get extra performance under certain scenarios. Looks like their getting called out for it. Dual-core is the right answer, HT wasn't.

    --
    Quack, quack.
  34. I'd like to move all our servers to dual-core Opt. by msimm · · Score: 2, Interesting

    (erons). But the price makes them a hard sell. I'll definately be keeping my eye on these things, as soon as the price points start to line up. I want to see AMD suceed in the server market, but for now (aside from Sun and a few HP systems) Xeon is still the dominant player.

    --
    Quack, quack.
  35. Re:Not Intel's fault; Microsoft's fault. c.f. Linu by Vladimir · · Score: 2, Interesting

    I have two identical high-end dual cpu desktops, both with HT enabled sitting on my desk. One runs win-xp, the other a 64 bit Linux. The thing I observe every day is how windows scheduler sucks. I don't know for how long marketing dept. of MSFT knows about HT, but their OS definitely doesn't know about it yet (start update in subversion or compilation in VC -- go to drink some coffee, as computer is unusable). On Linux, on the other hand, HT really improves both responsiveness and throughput. I'm waiting to test quad- dual-core box with HT enabled ;)

  36. Mod Parent UP! by GroundBounce · · Score: 2, Informative

    The parent post is common sense, which seems infrequent. I have found the range to be quite wide: When rednering animations from Blender, I have found that hyperthreading results in nearly 70% faster throughput when turned on. For rendering MPEG2 using Tmpgenc (under Wine), I see around 40% improvement with HT on. Clearly, these two applications benefit quite a bit from HT due to small computational footprint and/or low cache contention, etc. On the other hand, on my system, on-screen 3D acceleration in the NVIDIA driver (under Linux at least) appears to suffer with HT, with frame rates that are around 10-20% slower than with HT disabled.

    So, I see improvements ranging from -20% to +70% depending on the application, with many applications seeing only small differences one way or the other. Like many things, this tends to turn into a religious debate when the fact is that it varies case-by-case.

  37. wrong on both counts by r00t · · Score: 4, Informative

    First, it doesn't matter if the server uses threads or processes. Threads have a minor performance advantage for startup and context switching, and some disadvantages for memory allocation speed (finding VM space is a hashing problem) and some locking overhead. For the most part though, with tasks that just crunch numbers (including scanning memory) or make system calls, there isn't all that much difference.

    Running 2 threads per CPU is not cheating. It's normal to run 1 thread per CPU plus 1 thread per concurrent blocking IO operation. That could come out to be 2 threads per CPU.

  38. Re:You'd think... by tomstdenis · · Score: 2, Informative

    Not really. You're trying to run two threads inside the L1 caches, the decoder bandwidth, etc. So the 16KB of cache turns back into [effectively] 8KB.

    The P4 can issue 3 uOPs per cycle but IIRC only from one thread. They alternate [stalls free up slots though]. Also the decoder can only decode ONE x86 opcode per cycle. Then you have expensive memory ports. Fetches to L2 [or alternatively to system memory] are done one at a time per thread [with deep queues which were lengthened in the Prescott core].

    Aside from the stronger ALU and the fact there are two of them, the AMDx2 also benefits from having dedicated caches per core and a dedicated HT bus between the cores that doesn't sit on the external bus. If you want SMP performance just get opterons or amdx2 cores. Really that simple.

    We all know HT was a kludge hack at the last minute to gain some marketting press. If Intel really wanted to boost the performance of the P4 they would strengthen the ALU [at a cost of clock frequency]. If a 2.2Ghz AMD64 can ROUTINELY beat a 3.2Ghz P4 then a 2.5Ghz optimized P4 variant [with a stronger ALU] could hold it's own.

    Oh wait, that already exists. It's called the Pentium M. :-)

    Tom

    --
    Someday, I'll have a real sig.
  39. Re:even dual core hurts by tomstdenis · · Score: 2, Informative

    The AMDx2 has separate L2 caches per core, they can communicate via a dedicated HT link [between 3.2 and 4 GiB/sec] and share one memory controller.

    So if one core requests cache line $x and the other core has it the data will be sent over the internal HT link and not even hit the memory bus. The memory controller is pipelined [I suspect] so even while the L2 fulfillment is going on the memory bus can be busy fetching another cache line.

    The HT cores have *one* L2 cache per physical core [so for instance, a dual-core HT processor has 4 "logical processors" but only 2 L2 caches]. The prescott [and later] cores have a deep memory read/write pipelines to queue up many memory operations at once. The dual-core P4s have their own cache per physical core which communicates on a FSB that is shared between everything [including the memory bus].

    Though in reality the dual-core P4 [e.g. 8xx series] do achieve 2x performance on totally unrelated [and not memory bound] tasks. For instance, doing RSA computations with CRT in two threads gets you a result with half the latency. [same is true for the AMDx2].

    So the dual-core P4 isn't a bad buy if money is strapped. You'll get more performance from an AMDx2 though as the individual cores are so much faster.

    Tom

    --
    Someday, I'll have a real sig.
  40. Re:I'd like to move all our servers to dual-core O by tomstdenis · · Score: 2, Insightful

    Twice the ALU power and half the power.

    That's not a hard sell. If you're doing number crunching of any kind in a professional setting an AMDx2 or opt will pay for itself quickly.

    Oh that and you're not funding the never ending chain of stupidity that is the P4 design team ;-)

    Tom

    --
    Someday, I'll have a real sig.
  41. Re:HT != SMP by TheRealFritz · · Score: 2, Informative

    Actually, the document you point out kind of describes the flaw in the MS scheduler. The MS scheduler only optimizes HT behavior when you have several physical HT enabled CPUs. So let's say you have 2 P4s with HT. Windows will see 4 CPUs. The article describes how Windows will try to schedule on a logical CPU that's part of a physical CPU which currently has nothing scheduled.

    So the support described by Microsoft completely fails to address single physical CPU scenarios, or multi-CPU scenarios under high thread load.

    What the scheduler needs to do in a single physical HT CPU scenario is only allow threads to execute on the second logical CPU that share resources with the thread executing on the first logical CPU in order to minimize resource contention.

  42. Re:You'd think... by tomstdenis · · Score: 2, Informative

    You'd be wrong. The two L2 caches on the AMDx2 allow them to have access in parallel. Had they been unified you'd either have to make it dual-ported [e.g. larger and possibly slower] or have access in sequence [e.g. double the latency].

    If you're running tasks that share [e.g. write to] the same small pocket of memory you're right. However, many tasks don't do that. Often a server will spawn an entire new thread [e.g. unique stack and heap objects] to handle connections.

    It also makes sense in the desktop scene. Why would X11 and XMMS be accessing the same code? One is a media player and the other is a windows server. They have different data objects in their own respective process spaces. So a unified cache doesn't help. Also keep in mind that smart OSes keep tasks in a given CPU so the cache doesn't get killed as quickly.

    Tom

    --
    Someday, I'll have a real sig.