Hyperthreading Hurts Server Performance?
sebFlyte writes "ZDNet is reporting that enabling Intel's new Hyperthreading Technology on your servers could lead to markedly decreased performance, according to some developers who have been looking into problems that have been occurring since HT has been shipping automatically activated. One MS developer from the SQL server team put it simply: 'Our customers observed very interesting behaviour on high-end HT-enabled hardware. They noticed that in some cases when high load is applied SQL Server CPU usage increases significantly but SQL Server performance degrades.' Another developer, this time from Citrix, was just as blunt. 'It's ironic. Intel had sold hyperthreading as something that gave performance gains to heavily threaded software. SQL Server is very thread-intensive, but it suffers. In fact, I've never seen performance improvement on server software with hyperthreading enabled. We recommend customers disable it.'"
Well, a technology with a name such as "HyperThreading" is targeted more to end users who don't know about processors, rather than SQL "Performance Tuners" who try to squeeze every cycle of processing power.
HyperThreading might help poorly written thread management (independent audio and video subsystems for example), but not true multithreading, that's for sure.
That's lame. It seems like an exteremely BAD idea to get programs to worry about the total cache usage on the CPU. If this is the case, then no wonder performance is suffering. There should be no reason for any programmer to write a threaded application so it's "hyperthreading optimized," especially since HT was seemingly created as a transparent mechanism to increase performance.
It seems like an exteremely BAD idea to get programs to worry about the total cache usage on the CPU.
If you want to maximize performance then you want the compiler to know as much as possible about the architecture. If you have no cache then loop unrolling is a good thing, if you have a small cache then loop unrolling can bust the cache. If you are doing large matrix manipulations, how you choose to stride the matrix, and possibly pad it is exactly dependent on the size of the cache. Now, it may be that having the applications programmer worry about it is too much to ask, but the compiler most certainly needs to worry about such detail.
It depends on what your goals are. I do realize that was a fairly general statement, and it does not apply to every application. For something like lets say MS SQL server without a compiler that does it automatically, it would be an unreasonable expectation. If someone was writing an application for an embedded system, however, it might make sense if they chose the HT enabled processor. Are there any compilers currently that will do HT optimizations? I was under the impression that most commercial apps were basically compiled for the lowest common denominator anyway.
What is the performance gap between dual CPU vs Dual-core?
It's the usual answer: it depends.
We have to get rid of the notion that there is one overall system architecture that is "right" for all computing needs.
For general, every-day desktop use, there should be little difference between a dual CPU SMP box and a dual core box.
I have a small cluster consisting of AMD 64 X2 nodes, and the nodes use the FC4 SMP kernel just fine. All scheduling between CPU's is handled by the OS, and MPI/PVM apps run just as expected when using the configurations suggested for SMP nodes.
In fact, with the dual channel memory model, dual core AMD systems might be a little better than generic dual CPU, since each processor has it's "own" memory.
Computational Chemistry products and services.
I second the person that said programmers shouldn't be writing code to the cache size on a processor. How well your code fits in cache is not something you can control at run time. Different releases of the CPU often have different cache sizes. And frankly developers should always try to achieve tight efficent code, not develope to a particular cache size.
Think Deeply.
That's nonsense. Compilers routinely do loads of optimisations to better suit the underlying hardware. That's why any linux distro that ships binary packages has many flavors of each important or performance sensitive package (specially the kernel, in Debian you'll find images optimised for 386, 586, 686, k6, k7, etc). Is one of the reasons of the existence of Gentoo, also.
So MS had to make a choise: ship a binary optimized for every possible mix of hw (being the processor the most important factor, but not the only one), which is impossible, or ship images compatible with any recent x86 processor/hw... without being specially optimised for any. That's why hyperthreading performance suffers.
This is an important problem on Windows because most of the time you cannot simply recompile the un-optimised software to suit your hardware, as you can in Linux, etc.
(sorry for my bad english)
Beside the cachae considerations which were discussed by numerous people here, there is one aspect that hasn't been mentioned.
The reason why hyperthreading was introduced in first place was to reduce the "idle" time of the processor. The Pentium 4 class processors have an extremely long pipeline and this often leads to pipeline stalls. E.g. the processing of an instruction cannot proceed because it depends on the result of a previous instruction. The idea of hyperthreading is that whenever there is a potential pipeline stall, the processor switches to the other thread which hopefully can continue its executon because it isn't stalled by some dependency. Now most pipeline stalls occur when the code being executed isn't optimized for Pentium 4 class processors. However the better Pentium 4 optimized your code is, the less pipeline stalls you have and the better your CPU utilisation is with a single thread.
Marcel
Lots of programs are designed with the multiple thread model in mind. Programs should not be designed with the multiple thread model plus cache limitations in mind.
For an application like SQL Server, I'd have to disagree. Are you saying there's no one on the MSSQL team who looks at cache usage? I'd hope there were a lot of resources devoted to some fairly in-depth analysis of how the code performs on different CPUs. After all, after correctness, performance is how SQL Server is going to be judged (and criticised).
Given that a while back I watched a PDC presentation by Raymond Chen on how to avoid page faults etc in your Windows application (improving start-up times, etc), I'd say that Microsoft are no strangers to performance monitoring and analysis.
For your average Windows desktop app, then yes, worrying about cache usage on HT CPUs is way over the top. For something like SQL Server? Hell, no.
HT and Netburst were good ideas... but they were poorly executed.
Part of the reason for this is that desktop CPUs mostly run desktop apps and most desktop apps are single-threaded so Intel and AMD could not afford to give up on single-threaded performance. This forced them to add heaps of logic to extract parallelism and Intel made many (IMO dumb) decisions in the process. The SPARC stuff is used for scientific apps which have a long history of multi-threading and distributed computing so Sun does not have to worry about single-threaded performance, allowing for much simpler, leaner and more efficient designs.
Where I think Netburst is particularly bad is the execution engine... when I read Intel's improved hyper-threading patent, I was struck in disbelief: the execution pipelines are wrapped in a replay queue that blindly re-executes uOPs until they successfully execute and retire. Each instruction that fails to retire on the first pass enters the queue and has its execution latency increased by a dozen cycles until its next replay. Once the queue is full, no more uOPs can be issued so the CPU wastes power and cycles re-executing stale uOPs until they retire, causing execution to stall on all threads. Prescott added independant replay queues for each thread so one single thread would never be able to stall the whole CPU by filling the queue... this could have helped Northwood quite a bit but Prescott's extra latency killed any direct gains from it. Intel should roll back to the Northwood pipeline and re-apply the good Prescott stuff like dedicated integer multiplier and barrel shifter, HT2, SSE3 and a few other things, no miracle but it would be much better than the current Prescotts, though it certainly would not help the saturated FSB issue.
With a pure TLP-oriented CPU, there is no need for deep out-of-order execution, no need for branch prediction and no need for speculative execution. Going for TLP throughput allows the CPU to freeze threads whenever there is no nearby code that can execute deterministically instead of doing desperate deep searches, guesses and speculative execution: more likely than not, the other threads will have enough ready-and-able uOPs to fill the gaps and keep all execution units busy producing useful results on nearly every tick. Stick those SPARC chips on a P4-style shared FSB/RAM platform and they would still choke about as bad as P4s do.
The P4's greatest achile's heel is the shared FSB... it was not an issue back when Netburst was running at sub-2GHz speeds but it clearly is not suitable for multi-threading multi-core multi-processor setups. The shared FSB is clearly taking the 'r' out of Netburst. The single-threaded obsession is also costing AMD and Intel a lot of potential performance, complexity and power.