Hyperthreading Hurts Server Performance?

Posted by CowboyNeal on Saturday November 19, 2005 @02:16AM from the opposite-effects dept.

sebFlyte writes "ZDNet is reporting that enabling Intel's new Hyperthreading Technology on your servers could lead to markedly decreased performance, according to some developers who have been looking into problems that have occurred since HT began shipping enabled by default. One MS developer from the SQL Server team put it simply: 'Our customers observed very interesting behaviour on high-end HT-enabled hardware. They noticed that in some cases when high load is applied SQL Server CPU usage increases significantly but SQL Server performance degrades.' Another developer, this time from Citrix, was just as blunt. 'It's ironic. Intel had sold hyperthreading as something that gave performance gains to heavily threaded software. SQL Server is very thread-intensive, but it suffers. In fact, I've never seen performance improvement on server software with hyperthreading enabled. We recommend customers disable it.'"

11 of 255 comments

It's all in the name by hjf · 2005-11-19 02:22 · Score: 3, Insightful

Well, a technology with a name such as "HyperThreading" is targeted more at end users who don't know about processors than at SQL "Performance Tuners" who try to squeeze out every cycle of processing power.
HyperThreading might help poorly written thread management (independent audio and video subsystems for example), but not true multithreading, that's for sure.
Re:The code wasn't changed by springbox · 2005-11-19 02:27 · Score: 4, Insightful

That's lame. It seems like an extremely BAD idea to get programs to worry about the total cache usage on the CPU. If this is the case, then no wonder performance is suffering. There should be no reason for any programmer to write a threaded application so it's "hyperthreading optimized," especially since HT was seemingly created as a transparent mechanism to increase performance.
Re:The code wasn't changed by drerwk · 2005-11-19 02:35 · Score: 4, Insightful

It seems like an extremely BAD idea to get programs to worry about the total cache usage on the CPU.
If you want to maximize performance then you want the compiler to know as much as possible about the architecture. If you have no cache then loop unrolling is a good thing; if you have a small cache then loop unrolling can bust the cache. If you are doing large matrix manipulations, how you choose to stride the matrix, and whether you pad it, depends directly on the size of the cache. Now, it may be that having the applications programmer worry about it is too much to ask, but the compiler most certainly needs to worry about such detail.
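
The striding point can be sketched with cache blocking (tiling) on a matrix transpose. This is a minimal illustration only: the function names and the `block` parameter are hypothetical, the right tile size depends on the actual cache, and pure Python will not show the speedup a compiled language would. It shows only how the traversal order changes.

```python
def transpose_naive(a, n):
    """Transpose an n x n matrix stored row-major in a flat list.

    Reads walk along rows, but writes jump by n elements each step,
    so for large n every write may touch a different cache line.
    """
    out = [0] * (n * n)
    for i in range(n):
        for j in range(n):
            out[j * n + i] = a[i * n + j]
    return out


def transpose_blocked(a, n, block=4):
    """Same result, but processed one block x block tile at a time.

    `block` is an assumed tuning knob: the idea is to pick it so that a
    source tile and a destination tile fit in the cache together.
    """
    out = [0] * (n * n)
    for ii in range(0, n, block):          # tile row origin
        for jj in range(0, n, block):      # tile column origin
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    out[j * n + i] = a[i * n + j]
    return out
```

Both functions return the same matrix; only the order in which memory is touched differs, which is why the best `block` value changes from one cache size to another.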
Re:The code wasn't changed by springbox · 2005-11-19 02:40 · Score: 3, Insightful

It depends on what your goals are. I do realize that was a fairly general statement, and it does not apply to every application. For something like, let's say, MS SQL Server, without a compiler that does it automatically, it would be an unreasonable expectation. If someone was writing an application for an embedded system, however, it might make sense if they chose the HT-enabled processor. Are there any compilers currently that will do HT optimizations? I was under the impression that most commercial apps were basically compiled for the lowest common denominator anyway.
Re:Poor mans dual-core by dsci · 2005-11-19 02:47 · Score: 5, Insightful

What is the performance gap between dual CPU vs Dual-core?

It's the usual answer: it depends.

We have to get rid of the notion that there is one overall system architecture that is "right" for all computing needs.

For general, every-day desktop use, there should be little difference between a dual CPU SMP box and a dual core box.

I have a small cluster consisting of AMD 64 X2 nodes, and the nodes use the FC4 SMP kernel just fine. All scheduling between CPUs is handled by the OS, and MPI/PVM apps run just as expected when using the configurations suggested for SMP nodes.

In fact, with the dual channel memory model, dual core AMD systems might be a little better than generic dual CPU, since each processor has its "own" memory.

--
Computational Chemistry products and services.
Time to Buy AMD? by olddotter · 2005-11-19 02:55 · Score: 4, Insightful

Sounds like it might be time to buy more AMD stock.
I second the person who said programmers shouldn't be writing code to the cache size on a processor. How well your code fits in cache is not something you can control at run time. Different releases of the CPU often have different cache sizes. And frankly, developers should always try to achieve tight, efficient code, not develop to a particular cache size.

--
Think Deeply. ...
Re:The code wasn't changed by ochnap2 · 2005-11-19 03:00 · Score: 5, Insightful

That's nonsense. Compilers routinely do loads of optimisations to better suit the underlying hardware. That's why any Linux distro that ships binary packages has many flavors of each important or performance-sensitive package (especially the kernel; in Debian you'll find images optimised for 386, 586, 686, k6, k7, etc). It's also one of the reasons Gentoo exists.

So MS had to make a choice: ship a binary optimized for every possible mix of hardware (the processor being the most important factor, but not the only one), which is impossible, or ship images compatible with any recent x86 processor and hardware, without being specifically optimised for any. That's why hyperthreading performance suffers.

This is an important problem on Windows because most of the time you cannot simply recompile the un-optimised software to suit your hardware, as you can in Linux, etc.

(sorry for my bad English)
Hyperthreading works best with "bad" code by cimetmc · 2005-11-19 03:20 · Score: 4, Insightful

Besides the cache considerations which were discussed by numerous people here, there is one aspect that hasn't been mentioned.
The reason why hyperthreading was introduced in the first place was to reduce the "idle" time of the processor. The Pentium 4 class processors have an extremely long pipeline and this often leads to pipeline stalls, e.g. the processing of an instruction cannot proceed because it depends on the result of a previous instruction. The idea of hyperthreading is that whenever there is a potential pipeline stall, the processor switches to the other thread, which hopefully can continue its execution because it isn't stalled by some dependency. Now most pipeline stalls occur when the code being executed isn't optimized for Pentium 4 class processors. However, the better optimized your code is for the Pentium 4, the fewer pipeline stalls you have and the better your CPU utilisation is with a single thread, which leaves less idle time for a second thread to fill.

Marcel
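
The argument above can be sketched with a toy model. This is a deliberate simplification, not a description of the real Pentium 4: it assumes a hypothetical core that issues at most one instruction per cycle, and each number in a thread's list is the count of cycles that instruction must wait on its predecessor (a stand-in for a dependency stall). The function name `cycles_to_finish` is made up for this illustration.

```python
def cycles_to_finish(threads):
    """Count cycles for a toy one-issue-per-cycle core to drain all threads.

    threads: list of lists of ints. Each int is the number of stall cycles
    an instruction waits after its predecessor in the same thread issued.
    Each cycle the core issues from the first thread that is ready, so a
    second thread only gets slots the first one leaves idle.
    """
    pos = [0] * len(threads)  # index of the next instruction per thread
    # Earliest cycle at which each thread's next instruction may issue.
    ready_at = [t[0] if t else 0 for t in threads]
    cycle = 0
    while any(pos[t] < len(threads[t]) for t in range(len(threads))):
        for t in range(len(threads)):
            if pos[t] < len(threads[t]) and ready_at[t] <= cycle:
                pos[t] += 1  # issue one instruction from thread t
                if pos[t] < len(threads[t]):
                    ready_at[t] = cycle + 1 + threads[t][pos[t]]
                break  # only one issue slot per cycle
        cycle += 1
    return cycle


stall_heavy = [1] * 4  # every instruction waits one cycle on the previous one
tight = [0] * 4        # no stalls: already keeps the core busy

# Two stall-heavy threads: run back to back vs. interleaved on one core.
print(2 * cycles_to_finish([stall_heavy]), cycles_to_finish([stall_heavy, stall_heavy]))
# Two tight threads: interleaving has no idle slots to fill.
print(2 * cycles_to_finish([tight]), cycles_to_finish([tight, tight]))
```

In this model the stall-heavy pair finishes much sooner when interleaved, while the tight pair gains nothing, which is the shape of the claim being made. It says nothing about cache contention, which the other comments discuss.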
Re:The code wasn't changed by springbox · 2005-11-19 03:29 · Score: 4, Insightful

I wasn't thinking of compilers. I was mostly talking about the people who have to write the software. Assuming there's no compiler that knows about HT, I stand by my assertion that it would generally be a bad practice to get people to worry about it. Especially these days. Another point that I was trying to make is that even if there were compilers that knew about the HT issues, I still think it's exceedingly stupid that Intel went ahead with HT despite the glaring problems that were mentioned. If people want multiple threads of execution on the same processor then they should get one with two cores.
Lots of programs are designed with the multiple thread model in mind. Programs should not be designed with the multiple thread model plus cache limitations in mind.
Re:The code wasn't changed by Tim+Browse · 2005-11-19 04:13 · Score: 4, Insightful

It seems like an extremely BAD idea to get programs to worry about the total cache usage on the CPU.

For an application like SQL Server, I'd have to disagree. Are you saying there's no one on the MSSQL team who looks at cache usage? I'd hope there were a lot of resources devoted to some fairly in-depth analysis of how the code performs on different CPUs. After all, after correctness, performance is how SQL Server is going to be judged (and criticised).
Given that a while back I watched a PDC presentation by Raymond Chen on how to avoid page faults etc in your Windows application (improving start-up times, etc), I'd say that Microsoft are no strangers to performance monitoring and analysis.
For your average Windows desktop app, then yes, worrying about cache usage on HT CPUs is way over the top. For something like SQL Server? Hell, no.
Re:Dual Core performance... by InvalidError · 2005-11-19 07:48 · Score: 3, Insightful

HT and Netburst were good ideas... but they were poorly executed.

Part of the reason for this is that desktop CPUs mostly run desktop apps and most desktop apps are single-threaded so Intel and AMD could not afford to give up on single-threaded performance. This forced them to add heaps of logic to extract parallelism and Intel made many (IMO dumb) decisions in the process. The SPARC stuff is used for scientific apps which have a long history of multi-threading and distributed computing so Sun does not have to worry about single-threaded performance, allowing for much simpler, leaner and more efficient designs.

Where I think Netburst is particularly bad is the execution engine... when I read Intel's improved hyper-threading patent, I was struck with disbelief: the execution pipelines are wrapped in a replay queue that blindly re-executes uOPs until they successfully execute and retire. Each instruction that fails to retire on the first pass enters the queue and has its execution latency increased by a dozen cycles until its next replay. Once the queue is full, no more uOPs can be issued so the CPU wastes power and cycles re-executing stale uOPs until they retire, causing execution to stall on all threads. Prescott added independent replay queues for each thread so one single thread would never be able to stall the whole CPU by filling the queue... this could have helped Northwood quite a bit but Prescott's extra latency killed any direct gains from it. Intel should roll back to the Northwood pipeline and re-apply the good Prescott stuff like the dedicated integer multiplier and barrel shifter, HT2, SSE3 and a few other things. No miracle, but it would be much better than the current Prescotts, though it certainly would not help the saturated FSB issue.

With a pure TLP-oriented CPU, there is no need for deep out-of-order execution, no need for branch prediction and no need for speculative execution. Going for TLP throughput allows the CPU to freeze threads whenever there is no nearby code that can execute deterministically instead of doing desperate deep searches, guesses and speculative execution: more likely than not, the other threads will have enough ready-and-able uOPs to fill the gaps and keep all execution units busy producing useful results on nearly every tick. Stick those SPARC chips on a P4-style shared FSB/RAM platform and they would still choke about as bad as P4s do.

The P4's greatest Achilles' heel is the shared FSB... it was not an issue back when Netburst was running at sub-2GHz speeds but it clearly is not suitable for multi-threading multi-core multi-processor setups. The shared FSB is clearly taking the 'r' out of Netburst. The single-threaded obsession is also costing AMD and Intel a lot in potential performance, complexity and power.