Hyperthreading Hurts Server Performance?
sebFlyte writes "ZDNet is reporting that enabling Intel's new Hyperthreading Technology on your servers could lead to markedly decreased performance, according to some developers who have been looking into problems that have been occurring since HT has been shipping automatically activated. One MS developer from the SQL server team put it simply: 'Our customers observed very interesting behaviour on high-end HT-enabled hardware. They noticed that in some cases when high load is applied SQL Server CPU usage increases significantly but SQL Server performance degrades.' Another developer, this time from Citrix, was just as blunt. 'It's ironic. Intel had sold hyperthreading as something that gave performance gains to heavily threaded software. SQL Server is very thread-intensive, but it suffers. In fact, I've never seen performance improvement on server software with hyperthreading enabled. We recommend customers disable it.'"
Well, a technology with a name such as "HyperThreading" is targeted more to end users who don't know about processors, rather than SQL "Performance Tuners" who try to squeeze every cycle of processing power.
HyperThreading might help poorly written thread management (independent audio and video subsystems for example), but not true multithreading, that's for sure.
Those of us who care to measure for themselves rather than buy Intel's propaganda, have noticed this long ago. I bet the people quoted in the article noticed it long ago as well, but it has only recently become "politically correct" to share that knowledge.
I don't want to start a flamewar, but everytime I see an Intel commercial when the announcer says "pentium 4 with ht technology", it sounds like a stupid marketing ploy. It's suppose to offer better performance in heavily threaded apps, but apparently it doesn't. Also, in the commercials, it never explains to the customer what HT is, which just shows that if they had a great piece of technology, they would atleast take 10 seconds to explain the benefits, but they never do. They say a catch phrase, and that's really what it all seems to boil down to.
public class null extends java applet { System.out.print ("Tabula Rasa"); }
That's lame. It seems like an exteremely BAD idea to get programs to worry about the total cache usage on the CPU. If this is the case, then no wonder performance is suffering. There should be no reason for any programmer to write a threaded application so it's "hyperthreading optimized," especially since HT was seemingly created as a transparent mechanism to increase performance.
It seems like an exteremely BAD idea to get programs to worry about the total cache usage on the CPU.
If you want to maximize performance then you want the compiler to know as much as possible about the architecture. If you have no cache then loop unrolling is a good thing, if you have a small cache then loop unrolling can bust the cache. If you are doing large matrix manipulations, how you choose to stride the matrix, and possibly pad it is exactly dependent on the size of the cache. Now, it may be that having the applications programmer worry about it is too much to ask, but the compiler most certainly needs to worry about such detail.
It depends on what your goals are. I do realize that was a fairly general statement, and it does not apply to every application. For something like lets say MS SQL server without a compiler that does it automatically, it would be an unreasonable expectation. If someone was writing an application for an embedded system, however, it might make sense if they chose the HT enabled processor. Are there any compilers currently that will do HT optimizations? I was under the impression that most commercial apps were basically compiled for the lowest common denominator anyway.
What is the performance gap between dual CPU vs Dual-core?
It's the usual answer: it depends.
We have to get rid of the notion that there is one overall system architecture that is "right" for all computing needs.
For general, every-day desktop use, there should be little difference between a dual CPU SMP box and a dual core box.
I have a small cluster consisting of AMD 64 X2 nodes, and the nodes use the FC4 SMP kernel just fine. All scheduling between CPU's is handled by the OS, and MPI/PVM apps run just as expected when using the configurations suggested for SMP nodes.
In fact, with the dual channel memory model, dual core AMD systems might be a little better than generic dual CPU, since each processor has it's "own" memory.
Computational Chemistry products and services.
The article seems to focus only on Windows. To get good performance from hyperthreading, the scheduler has to be aware of situations that could lead to decreased performance and avoid them. So is this a problem with the Windows scheduler being unable to deal with hyperthreading or is hyperthreading really broken? How is hyperthreading performance on other operating systems?
Another question one needs to ask is, how is performance on single and dual CPU systems? Getting good performance on a dual CPU HT system (which means four logical CPUs) is more complicated and thus requires more sophisticated algorithms in the scheduler.
Applications are most likely not to be blamed for the decreased performance. Such hardware differences should be dealt with by the kernel. Occationally the scheduler should keep one thread idle whenever that leads to the best performance. Only when there is a performance benefit should both threads be used at the same time.
Do you care about the security of your wireless mouse?
I second the person that said programmers shouldn't be writing code to the cache size on a processor. How well your code fits in cache is not something you can control at run time. Different releases of the CPU often have different cache sizes. And frankly developers should always try to achieve tight efficent code, not develope to a particular cache size.
Think Deeply.
That's nonsense. Compilers routinely do loads of optimisations to better suit the underlying hardware. That's why any linux distro that ships binary packages has many flavors of each important or performance sensitive package (specially the kernel, in Debian you'll find images optimised for 386, 586, 686, k6, k7, etc). Is one of the reasons of the existence of Gentoo, also.
So MS had to make a choise: ship a binary optimized for every possible mix of hw (being the processor the most important factor, but not the only one), which is impossible, or ship images compatible with any recent x86 processor/hw... without being specially optimised for any. That's why hyperthreading performance suffers.
This is an important problem on Windows because most of the time you cannot simply recompile the un-optimised software to suit your hardware, as you can in Linux, etc.
(sorry for my bad english)
Beside the cachae considerations which were discussed by numerous people here, there is one aspect that hasn't been mentioned.
The reason why hyperthreading was introduced in first place was to reduce the "idle" time of the processor. The Pentium 4 class processors have an extremely long pipeline and this often leads to pipeline stalls. E.g. the processing of an instruction cannot proceed because it depends on the result of a previous instruction. The idea of hyperthreading is that whenever there is a potential pipeline stall, the processor switches to the other thread which hopefully can continue its executon because it isn't stalled by some dependency. Now most pipeline stalls occur when the code being executed isn't optimized for Pentium 4 class processors. However the better Pentium 4 optimized your code is, the less pipeline stalls you have and the better your CPU utilisation is with a single thread.
Marcel
I never accept the assertions that a configuration option lile HyperThreading is always good or always bad. It's never black and white. The answer is always: it depends on the application. In my experience a busy linux java based web serving application that does a lot of context switching and a lot of IO to back end applications uses less CPU when hyperthreading is enabled. Collective wisdom aside, it works for my application so I am leaving it on.
Lots of programs are designed with the multiple thread model in mind. Programs should not be designed with the multiple thread model plus cache limitations in mind.
For an application like SQL Server, I'd have to disagree. Are you saying there's no one on the MSSQL team who looks at cache usage? I'd hope there were a lot of resources devoted to some fairly in-depth analysis of how the code performs on different CPUs. After all, after correctness, performance is how SQL Server is going to be judged (and criticised).
Given that a while back I watched a PDC presentation by Raymond Chen on how to avoid page faults etc in your Windows application (improving start-up times, etc), I'd say that Microsoft are no strangers to performance monitoring and analysis.
For your average Windows desktop app, then yes, worrying about cache usage on HT CPUs is way over the top. For something like SQL Server? Hell, no.
No, if you care about fault-tolerance and extreme availability, you go for HP Non-Stop servers (Tandem/NSK)... /G
Cache memory coherency.
Dual CPU/SMP setups have their own caches on EACH cpu involved.
Dual Core CPU's have a shared cache between cores "on die" (same physical CPU)!
(Thus, theoretically (more than that, practically) don't have the same "issues" on checking for concurrent data accesses between CPU cores as Dual Core CPU's do, since their cores are physically separated... @ least, that's my understanding of it).
Plus, iirc, cache data access on a Dual Core CPU should be faster than that of Dual CPU/SMP setups... no overheads on 'sync checks' between them, as to the data they are working on, as well as communications between them (since they are on physically separated CPU's in a Dual CPU/SMP setup, whereas DualCore CPU's have BOTH PROCESSORS on 1 single die & a shared cache they both use, no "over the motherboard cpu cache checking synchronization required").
* If I am incorrect, please, anyone - feel free to "set me straight" here as to my response to this person's questions!
APK
Yes, definately.
Along with the rest of the machine.
emt 377 emt 4
I may agree that HyperThreading as implemented in the x86 architecture is a hack, but I wouldn't dismiss the original idea of HT, as implemented in the Tera supercomputers. It was designed to have hundreds of thread contexts in hardware, so if it has to wait on memory, there will be some other thread available to run. There are enough threads available that it can do without a cache, while utilising the full memory bandwidth. This quite neatly avoids cache consistency problems that can kill massively parallel performance.
a.
Are you high, or are you just in the habit of randomly making up nonsensical stuff?
No, not high. Just willing to take pro-M$ flame bait today.
I guess I overestimated the intelligence of the /. readership, especially those from the PC world.
The fact is, if you are writing software to be efficient on a single processor the architecture of the software will be much different than if you know you have 32 processors. And neither is best for the other.
For single processor speed you don't want the overhead of interprocess commutations so you can skip it and sequentially do what you need to do without worry of what the other processes are doing. In fact, this is usually how most programs operate as coding is much more easy to do.
For multiprocessor systems you want to distribute as evenly as possible the work across as many processors and I/O busses as you have. It is worth the effort of code, threads, interprocess communications layer with mutex, locks and individual disk writers. But this model would run slower on a single CPU.
The HT model isn't dual CPU in performance but does allow for 2 threads on the system to be active at once, at the expense of individual thread performance. Do we want single process speed or throughput? Example:
Classic seti 3.x on Fedora Linux.
- with 1 seti running takes 4 hours
- with 2 seti running each takes 5.2 hours
So if I want the fastest seti I want to run 1. If I want the most seti I want to run 2 to keep the processor busy to maximum performance.
And MS SQL, like it or not will have it's ups and downs depending how it was architected.
HT and Netburst were good ideas... but they were poorly executed.
Part of the reason for this is that desktop CPUs mostly run desktop apps and most desktop apps are single-threaded so Intel and AMD could not afford to give up on single-threaded performance. This forced them to add heaps of logic to extract parallelism and Intel made many (IMO dumb) decisions in the process. The SPARC stuff is used for scientific apps which have a long history of multi-threading and distributed computing so Sun does not have to worry about single-threaded performance, allowing for much simpler, leaner and more efficient designs.
Where I think Netburst is particularly bad is the execution engine... when I read Intel's improved hyper-threading patent, I was struck in disbelief: the execution pipelines are wrapped in a replay queue that blindly re-executes uOPs until they successfully execute and retire. Each instruction that fails to retire on the first pass enters the queue and has its execution latency increased by a dozen cycles until its next replay. Once the queue is full, no more uOPs can be issued so the CPU wastes power and cycles re-executing stale uOPs until they retire, causing execution to stall on all threads. Prescott added independant replay queues for each thread so one single thread would never be able to stall the whole CPU by filling the queue... this could have helped Northwood quite a bit but Prescott's extra latency killed any direct gains from it. Intel should roll back to the Northwood pipeline and re-apply the good Prescott stuff like dedicated integer multiplier and barrel shifter, HT2, SSE3 and a few other things, no miracle but it would be much better than the current Prescotts, though it certainly would not help the saturated FSB issue.
With a pure TLP-oriented CPU, there is no need for deep out-of-order execution, no need for branch prediction and no need for speculative execution. Going for TLP throughput allows the CPU to freeze threads whenever there is no nearby code that can execute deterministically instead of doing desperate deep searches, guesses and speculative execution: more likely than not, the other threads will have enough ready-and-able uOPs to fill the gaps and keep all execution units busy producing useful results on nearly every tick. Stick those SPARC chips on a P4-style shared FSB/RAM platform and they would still choke about as bad as P4s do.
The P4's greatest achile's heel is the shared FSB... it was not an issue back when Netburst was running at sub-2GHz speeds but it clearly is not suitable for multi-threading multi-core multi-processor setups. The shared FSB is clearly taking the 'r' out of Netburst. The single-threaded obsession is also costing AMD and Intel a lot of potential performance, complexity and power.
Twice the ALU power and half the power.
;-)
That's not a hard sell. If you're doing number crunching of any kind in a professional setting an AMDx2 or opt will pay for itself quickly.
Oh that and you're not funding the never ending chain of stupidity that is the P4 design team
Tom
Someday, I'll have a real sig.