Hyperthreading Hurts Server Performance?
sebFlyte writes "ZDNet is reporting that enabling Intel's new Hyperthreading Technology on your servers could lead to markedly decreased performance, according to some developers who have been looking into problems that have been occurring since HT has been shipping automatically activated. One MS developer from the SQL Server team put it simply: 'Our customers observed very interesting behaviour on high-end HT-enabled hardware. They noticed that in some cases when high load is applied SQL Server CPU usage increases significantly but SQL Server performance degrades.' Another developer, this time from Citrix, was just as blunt. 'It's ironic. Intel had sold hyperthreading as something that gave performance gains to heavily threaded software. SQL Server is very thread-intensive, but it suffers. In fact, I've never seen performance improvement on server software with hyperthreading enabled. We recommend customers disable it.'"
Anybody who understands HT has been saying this since chips first supported it. I have it enabled because I find that at typical loads our DB servers' performance benefits from HT-aware scheduling. Welcome to 2002.
Well, a technology with a name such as "HyperThreading" is targeted more at end users who don't know about processors than at SQL "Performance Tuners" who try to squeeze out every cycle of processing power.
HyperThreading might help poorly written thread management (independent audio and video subsystems for example), but not true multithreading, that's for sure.
I read the Intel assembly guide section regarding hyperthreading, and it clearly states that performance will drop if you don't take the shared cache into consideration. The two logical threads contend for the cache, causing the performance problems that were described. In order for there to be a true benefit to hyperthreading, either the program, the OS or the compiler needs to determine that hyperthreading is enabled, and model the code to use less than half the cache. It's been known since the beginning, and frankly, it is silly that MS is scratching their heads wondering why this is. Lower the cache footprint, and I'll be willing to bet that performance rises dramatically.
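A rough sketch of the "detect HT, then budget the cache" idea, assuming a Linux box (which exposes sibling hardware threads in sysfs). The function names and the equal-share budget are my own illustration, not anything from the Intel guide; the comment above asks for less than half, so treat the returned number as an upper bound.

```python
def parse_cpu_list(text):
    """Parse a sysfs CPU list such as "0,4" or "0-1" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def siblings_of(cpu=0):
    """Read the hardware threads sharing a core with `cpu` (Linux sysfs only)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as f:
        return parse_cpu_list(f.read())


def cache_budget_bytes(cache_bytes, siblings):
    """Hypothetical working-set ceiling: an equal share of the cache per sibling."""
    return cache_bytes // max(1, len(siblings))
```

With HT on, `siblings_of(0)` would typically return two entries, so a 512 KB cache gives a ceiling of 256 KB per thread; with HT off the whole cache is available.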
Marxism is the opiate of dumbasses
indeed has once again proved it is expensive to be poor.
A question I find more interesting: what is the performance gap between dual CPU and dual-core?
If you mod me down, I *will* introduce you to my sister!
Perhaps this ushers a new era of computing, where Intel chips underperform AMD ones.
Oh, wait...
If you have a system thread cleaning out blocks of disk cache memory then of course it is going to suffer. The whole point of hyperthreading was that one thread could run while another was waiting for I/O.
The first tests on Linux when Hyperthreading came out were also pretty discouraging.
Mielipiteet omiani - Opinions personal, facts suspect.
I don't want to start a flamewar, but every time I see an Intel commercial where the announcer says "pentium 4 with ht technology", it sounds like a stupid marketing ploy. It's supposed to offer better performance in heavily threaded apps, but apparently it doesn't. Also, the commercials never explain to the customer what HT is, which just shows that if they had a great piece of technology, they would at least take 10 seconds to explain the benefits, but they never do. They say a catch phrase, and that's really what it all seems to boil down to.
public class null extends java applet { System.out.print ("Tabula Rasa"); }
Well, AFAIK, the HTT thing only allows the processor to sort of split execution units (FPU, ALU, etc) so that one can work on one thread, the other on another one. If an application relies heavily on one of those units -- and my somewhat uninformed feeling is that software like SQL probably works mostly on the ALU -- it can't possibly GAIN performance. On the other hand, I can see the effort of trying to pigeonhole the idle threads on the wrong execution unit (will it even try that?) completely borking performance. So yeah, no surprises here.
This sort of effect has been talked about for as long as I remember hearing about hyperthreading. It was common knowledge long before the chips came out that running two threads on the same cache can cause performance issues. One can see this with two chips sharing an L2 cache so why should it be a surprise here?
The real question is whether this issue can be optimized for. If the developers design their code with HT in mind, will this still be a problem, since the other thread may belong to another process? Or would properly optimized code be able to deal with this?
Most importantly, is this a rare effect or a common one? Would it be rare or common if you optimize your programs for an HT machine?
If you liked this thought maybe you would find my blog nice too:
Hyperthrashing?
-jcr
The only title of honor that a tyrant can grant is "Enemy of the State."
As someone who commented above pointed out, Intel openly acknowledges performance can be hurt. I don't know what you mean about it not being acceptable to notice this, as I've seen this sort of issue mentioned in pretty much every article I've read on HT, starting quite far back.
HT is just another chip technology like any other. It is only in the rarest circumstances that a new technology will be better/faster for everything. These things all have tradeoffs and the question is whether the benefits are enough to exceed the disadvantages.
I really think you are being a little unfair to Intel. If you had evidence that it decreased performance for most systems even when the software was compiled taking HT into account then you might have a point. However, as it is, this is no different than IBM touting its RISC technology or AMD talking about their SIMD capabilities. For each of these technologies you could find some code which would actually run slower. If you happen to be running code which makes heavy use of some hardware-optimized string instructions, a RISC system can actually make things worse, not to mention a whole other host of issues. The SIMD capabilities of most x86 processors required switching the FPU state, which took time as well.
It's only reasonable that companies want to publicize their newest fancy technology, and they are hardly unsavory because they don't put the potential disadvantages centrally in their advertisements/PR material. When you go on a first date do you tell the girl about your loud snoring, how you cheated on your ex or other bad qualities about yourself? Of course not. One doesn't lie about these things, but it is only natural to want to put the best face forward, and it seems ridiculous to hold Intel to a higher standard than an individual in these matters.
If you liked this thought maybe you would find my blog nice too:
Usual response is to disable it from bios
One possible solution (code patch)
http://sourceforge.net/mailarchive/message.php?msg_id=12403341
Other threads with hyperthreading problems (slowdowns)
http://sourceforge.net/search/?forum_id=6330&group_id=9028&words=hyperthreading&type_of_search=mlists
developer http://flamerobin.org
The article seems to focus only on Windows. To get good performance from hyperthreading, the scheduler has to be aware of situations that could lead to decreased performance and avoid them. So is this a problem with the Windows scheduler being unable to deal with hyperthreading or is hyperthreading really broken? How is hyperthreading performance on other operating systems?
Another question one needs to ask is, how is performance on single and dual CPU systems? Getting good performance on a dual CPU HT system (which means four logical CPUs) is more complicated and thus requires more sophisticated algorithms in the scheduler.
Applications are most likely not to be blamed for the decreased performance. Such hardware differences should be dealt with by the kernel. Occasionally the scheduler should keep one thread idle whenever that leads to the best performance. Only when there is a performance benefit should both threads be used at the same time.
Do you care about the security of your wireless mouse?
I second the person who said programmers shouldn't be writing code to the cache size on a processor. How well your code fits in cache is not something you can control at run time. Different releases of the CPU often have different cache sizes. And frankly, developers should always try to achieve tight, efficient code, not develop to a particular cache size.
Think Deeply.
I have had an ATI All-in-Wonder 9800 for about a year now. I never really used the tuner part until a few weeks ago, when I took delivery of several new LCDs and decided that I could be watching a little TV on one while working.
The 9800 sits in my XP box, which rarely gets rebooted and is used for games, browsing, etc. My Mac mini and Linux boxes sit in their places with a KVM.
Well, after I started using the tuner part, it looked great with my digital cable. But the box would lock up, and I couldn't kill the ATI MMC software process. It happened anywhere from a few times an hour to at least once a day. I was on the point of sticking an old Hauppauge in there, or using another MMC.
Well, after much digging I found a thread on how HT could cause issues with the software. I disabled it in the BIOS, since I do not really need it for anything, and ran the tuner 48 hours solid without a lockup.
Now perhaps ATI is at fault for the software, but then again HT caused the incompatibility in my book.
Puto
The Revolution Will Not Be Televised
I know asking them to do research is a stretch, but the submitter should at least read the article before submitting it. The quote was from a Technical Director at a consulting company that sells Citrix software, not from a developer at Citrix. Hyperthreading can definitely help performance of Metaframe running under Windows 2003. Enabling it in the BIOS on a server running Windows 2000 was where the problem resided.
"I read the intel assembly guide section regarding hyperthreading, and it clearly states that performance will drop if you don't take the shared cache into consideration." This is a general problem. The Xbox 360 has similar issues, with 3 cores sharing the same cache. Having multiple independent CPUs, each with its own local memory (like a multiprocessor or the PS3's SPUs), doesn't suffer from these issues.
HT is a very simple concept: virtualize 2 CPUs by cutting all caches in half and allocating each half to one of the CPUs, and allow the ALUs to process data from either thread. This can give good performance, for instance when one thread has a cache miss and is waiting for data from main memory (or, god forbid, there is a fault and you need to read from the HDD). In normal single-CPU operation, this ties up resources, and that thread can't make any progress. With HT on, the second thread can continue processing data. Even without a cache miss, there are 4 (or more) ALUs on the die, and only certain types of applications can effectively make use of them all simultaneously. Having HT gives a higher probability that all the resources on the chip are used. But the cost, as I said above, is (effectively) cutting the cache sizes in half. And cache is king for some applications. There are many job types where doubling the cache gives much better performance than even doubling the CPU speed (well, that is probably pushing it, but certainly adding 10% more cache can be better than a 10% higher clock rate), as it means less time going to main memory.
It isn't a foolproof technology, but it has its benefits. SQL can be very heavy on the cache, and I'm not surprised that it doesn't perform optimally without some tuning.
Is it two complete cores? Front Side Bus speed? Memory speed? etc.
The IBM 970MP that Apple is using for the dual core PowerMacs was designed right. And due to the cache snooping (among other things), a dual core 970MP can be slightly faster than a dual processor setup at the same clock and bus speeds.
Another multicore chip to look at for being done right is the Sun UltraSPARC T1 processor. Up to 8 cores with 4 threads per core. Sun's threading model in this processor doesn't have the faults that Intel's HyperThreading does.
Intel's HT technology seems like a bad patch on the architecture, much like Microsoft's updates to Windows.
Besides the cache considerations which were discussed by numerous people here, there is one aspect that hasn't been mentioned.
The reason why hyperthreading was introduced in the first place was to reduce the "idle" time of the processor. The Pentium 4 class processors have an extremely long pipeline and this often leads to pipeline stalls, e.g. the processing of an instruction cannot proceed because it depends on the result of a previous instruction. The idea of hyperthreading is that whenever there is a potential pipeline stall, the processor switches to the other thread, which hopefully can continue its execution because it isn't stalled by some dependency. Now, most pipeline stalls occur when the code being executed isn't optimized for Pentium 4 class processors. However, the better optimized your code is for the Pentium 4, the fewer pipeline stalls you have and the better your CPU utilisation is with a single thread.
Marcel
I remember early discussions from LKML where developers realized that if you were to run a high-priority thread on one virtual processor and a low-priority thread on the other VP, you'd have a priority imbalance and a situation that you'd want to avoid. The developers solved the problem by adding a tunable parameter that indicated the assumed amount of "extra" performance you could get out of the CPU from HT. In other words, with 1 CPU, max load is 100%; with two physical CPUs, max load is 200%; with one HT CPU, max load would be set to something on the order of 115% to 130%. So, when your hi-pri thread is running and the lo-pri thread wants to run, we let the lo-pri thread run only 15% of the time (or something like that), resulting in only a modest impact on the hi-pri thread but an improvement in overall system throughput.
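To make the arithmetic concrete, here is a toy model of that tunable. It is only a sketch of the numbers quoted above, not the actual kernel code; the function names and the 15% default are assumptions for illustration.

```python
def capacity_percent(physical_cpus, ht_enabled, ht_bonus_percent=15):
    """Assumed maximum load: 100% per physical CPU, plus the HT bonus if enabled."""
    per_cpu = 100 + (ht_bonus_percent if ht_enabled else 0)
    return physical_cpus * per_cpu


def low_priority_share(ht_bonus_percent=15):
    """Fraction of time a lo-pri thread may occupy the sibling of a busy hi-pri thread."""
    return ht_bonus_percent / 100
```

So one HT CPU is treated as 115% of a CPU (130% with the tunable at 30), and the lo-pri thread gets roughly the "extra" slice rather than half the core.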
That being said, I infer from the article that Windows does not do any such priority fairness checking. Consider the example they gave in the article. The DB is running, and then some disk-cache cleaner process comes along and competes for CPU cache. If the OS were SMART, it would recognize that the system task is of a MUCH lower priority and either not run it or only run it for a small portion of the time.
As said by others commenting on this article, the complainers are being stupid for two reasons. One, Intel already admitted that there are lots of cases where HT can hurt performance, so shut up. And Two, there are ways to ameliorate the problem in the OS, but since Windows isn't doing it, they should be complaining to Microsoft, not misdirecting the blame at Intel, so shut up.
(Note that I don't like Intel too terribly much either. Hey, we all hate Microsoft, but when someone is an idiot and blames them for something they're not responsible for, it doesn't help anyone.)
I never accept the assertion that a configuration option like HyperThreading is always good or always bad. It's never black and white. The answer is always: it depends on the application. In my experience, a busy Linux Java-based web serving application that does a lot of context switching and a lot of IO to back end applications uses less CPU when hyperthreading is enabled. Collective wisdom aside, it works for my application so I am leaving it on.
I thought you couldn't report any performance issues of MS SQL Server :)
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
Hyperthreading Speeds Linux.
In a nutshell:
- hyperthreading decreases syscall speed by a few percent
- on single-threaded workloads, the effect is often negligible, with occasional large improvements or degradations
- on multithreaded workloads, around 30% improvement is common
- Linux 2.5 (which introduced HT-awareness) performs significantly better than Linux 2.4
So, from that benchmark (and others like it, just STFW) it appears that HT offers significant benefits; you need multithreading to take advantage of it, and having an HT-aware OS helps.
Please correct me if I got my facts wrong.
morcego
I don't have a HT-capable proc (AMD Athlon XP 1700), so I don't know anything from personal experience.
I decided to check out how PostgreSQL did with HT.
The first link (1) was suggesting to someone--who was having performance problems under FreeBSD--to turn off HT. Of course, that may not be related to PostgreSQL itself, but rather FreeBSD. I really don't know.
The next thing I found showed some mixed results with ext2 under Linux (2). Some things showed a gain with HT, but others did not.
Another link (3) commented that HT with Java requires special consideration when coding.
I didn't come up with anything useful under PostgreSQL, so I checked out Linux.
According to Linux Electrons, Linux performance can drop without proper setup.
// file: mice.h
#include "frickin_lasers.h"
I use Nuendo for professional music recording, and even though their latest version says it's HT aware, the performance is poor. In fact, in several instances it only takes a few instruments loaded for it to peak the CPU. Change it back to basic CPU with HT off and it works fine.
My understanding is it's this way with Cubase as well.
"If any question why we died, Tell them because our fathers lied."
Can anyone explain to me the exact difference between HT and CMT? I'm wondering if these same issues would plague Sun's new Niagara processor.
Yes, definitely.
Along with the rest of the machine.
emt 377 emt 4
I may agree that HyperThreading as implemented in the x86 architecture is a hack, but I wouldn't dismiss the original idea of HT, as implemented in the Tera supercomputers. It was designed to have hundreds of thread contexts in hardware, so if it has to wait on memory, there will be some other thread available to run. There are enough threads available that it can do without a cache, while utilising the full memory bandwidth. This quite neatly avoids cache consistency problems that can kill massively parallel performance.
a.
Are you high, or are you just in the habit of randomly making up nonsensical stuff?
No, not high. Just willing to take pro-M$ flame bait today.
I guess I overestimated the intelligence of the /. readership, especially those from the PC world.
The fact is, if you are writing software to be efficient on a single processor the architecture of the software will be much different than if you know you have 32 processors. And neither is best for the other.
For single processor speed you don't want the overhead of interprocess communications, so you can skip it and sequentially do what you need to do without worrying about what the other processes are doing. In fact, this is how most programs operate, as the coding is much easier to do.
For multiprocessor systems you want to distribute the work as evenly as possible across as many processors and I/O busses as you have. It is worth the effort of the extra code: threads, an interprocess communications layer with mutexes and locks, and individual disk writers. But this model would run slower on a single CPU.
The HT model isn't dual CPU in performance but does allow for 2 threads on the system to be active at once, at the expense of individual thread performance. Do we want single process speed or throughput? Example:
Classic seti 3.x on Fedora Linux.
- with 1 seti running takes 4 hours
- with 2 seti running each takes 5.2 hours
So if I want the fastest seti I want to run 1. If I want the most seti I want to run 2 to keep the processor busy to maximum performance.
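Putting numbers on the seti example (a quick sketch; the only inputs are the 4 and 5.2 hour figures above, and the helper name is made up):

```python
def throughput_per_hour(jobs_in_parallel, hours_per_job):
    """Work units finished per hour when running this many jobs side by side."""
    return jobs_in_parallel / hours_per_job


one = throughput_per_hour(1, 4.0)    # single seti instance
two = throughput_per_hour(2, 5.2)    # two instances sharing the HT CPU

throughput_gain = two / one - 1      # extra work units per hour with two instances
latency_penalty = 5.2 / 4.0 - 1      # extra wall-clock time per individual unit

print(f"throughput gain: {throughput_gain:.0%}, latency penalty: {latency_penalty:.0%}")
```

That works out to roughly 54% more units per hour, at the cost of each unit taking 30% longer, which is the speed-versus-throughput tradeoff in a nutshell.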
And MS SQL, like it or not, will have its ups and downs depending on how it was architected.
Software shouldn't be expected to handle hardware quirks. It's up to the hardware to run the software efficiently.
Seems to me a hardware fix would be to partition the cache into two pieces when HT is enabled and running -- use the whole cache for the processor otherwise.
With 2MB caches per processor now becoming available, would this be such a bad thing? IIRC once you're up to 256KB of cache you've already got a hit rate near 90%. That severely limits your possible improvement to less than 10% regardless of how much more cache you add. And yes I am aware that increasing the processor multiplier does make every cache miss worse in proportion, but still having HT run more efficiently in the bargain could make this tradeoff worth it. And that's even before you consider uneven partitioning if the OS can determine that one thread needs more cache than the other.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
is still a kludge. HT was a cheap hack to get extra performance under certain scenarios. Looks like they're getting called out for it. Dual-core is the right answer, HT wasn't.
Quack, quack.
(erons). But the price makes them a hard sell. I'll definitely be keeping my eye on these things, as soon as the price points start to line up. I want to see AMD succeed in the server market, but for now (aside from Sun and a few HP systems) Xeon is still the dominant player.
Quack, quack.
I have two identical high-end dual-CPU desktops, both with HT enabled, sitting on my desk. One runs Windows XP, the other a 64-bit Linux. The thing I observe every day is how much the Windows scheduler sucks. I don't know how long the marketing dept. of MSFT has known about HT, but their OS definitely doesn't know about it yet (start an update in Subversion or a compilation in VC -- go drink some coffee, as the computer is unusable). On Linux, on the other hand, HT really improves both responsiveness and throughput. I'm waiting to test a quad- dual-core box with HT enabled ;)
The parent post is common sense, which seems infrequent. I have found the range to be quite wide: when rendering animations from Blender, I have found that hyperthreading results in nearly 70% faster throughput when turned on. For rendering MPEG2 using Tmpgenc (under Wine), I see around 40% improvement with HT on. Clearly, these two applications benefit quite a bit from HT due to small computational footprint and/or low cache contention, etc. On the other hand, on my system, on-screen 3D acceleration in the NVIDIA driver (under Linux at least) appears to suffer with HT, with frame rates that are around 10-20% slower than with HT disabled.
So, I see improvements ranging from -20% to +70% depending on the application, with many applications seeing only small differences one way or the other. Like many things, this tends to turn into a religious debate when the fact is that it varies case-by-case.
First, it doesn't matter if the server uses threads or processes. Threads have a minor performance advantage for startup and context switching, and some disadvantages for memory allocation speed (finding VM space is a hashing problem) and some locking overhead. For the most part though, with tasks that just crunch numbers (including scanning memory) or make system calls, there isn't all that much difference.
Running 2 threads per CPU is not cheating. It's normal to run 1 thread per CPU plus 1 thread per concurrent blocking IO operation. That could come out to be 2 threads per CPU.
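As a back-of-the-envelope helper for that rule of thumb (the function name is invented for illustration; `os.cpu_count()` reports logical CPUs, so HT siblings are counted too):

```python
import os


def worker_threads(cpus, concurrent_blocking_io):
    """One thread per CPU plus one per blocking IO operation in flight."""
    return cpus + concurrent_blocking_io


# Example: size a pool for this machine, assuming one blocking IO per CPU.
cpus = os.cpu_count() or 1
pool_size = worker_threads(cpus, concurrent_blocking_io=cpus)
```

With one blocking IO operation outstanding per CPU, that comes out to exactly 2 threads per CPU.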
Not really. You're trying to run two threads inside the L1 caches, the decoder bandwidth, etc. So the 16KB of cache turns back into [effectively] 8KB.
:-)
The P4 can issue 3 uOPs per cycle but IIRC only from one thread. They alternate [stalls free up slots though]. Also the decoder can only decode ONE x86 opcode per cycle. Then you have expensive memory ports. Fetches to L2 [or alternatively to system memory] are done one at a time per thread [with deep queues which were lengthened in the Prescott core].
Aside from the stronger ALU and the fact there are two of them, the AMDx2 also benefits from having dedicated caches per core and a dedicated HT bus between the cores that doesn't sit on the external bus. If you want SMP performance just get opterons or amdx2 cores. Really that simple.
We all know HT was a kludge hacked in at the last minute to gain some marketing press. If Intel really wanted to boost the performance of the P4 they would strengthen the ALU [at a cost of clock frequency]. If a 2.2GHz AMD64 can ROUTINELY beat a 3.2GHz P4 then a 2.5GHz optimized P4 variant [with a stronger ALU] could hold its own.
Oh wait, that already exists. It's called the Pentium M.
Tom
Someday, I'll have a real sig.
The AMDx2 has separate L2 caches per core, they can communicate via a dedicated HT link [between 3.2 and 4 GiB/sec] and share one memory controller.
So if one core requests cache line $x and the other core has it the data will be sent over the internal HT link and not even hit the memory bus. The memory controller is pipelined [I suspect] so even while the L2 fulfillment is going on the memory bus can be busy fetching another cache line.
The HT cores have *one* L2 cache per physical core [so for instance, a dual-core HT processor has 4 "logical processors" but only 2 L2 caches]. The Prescott [and later] cores have deep memory read/write pipelines to queue up many memory operations at once. The dual-core P4s have their own cache per physical core, which communicates on an FSB that is shared between everything [including the memory bus].
Though in reality the dual-core P4s [e.g. 8xx series] do achieve 2x performance on totally unrelated [and not memory bound] tasks. For instance, doing RSA computations with CRT in two threads gets you a result with half the latency. [same is true for the AMDx2].
So the dual-core P4 isn't a bad buy if money is tight. You'll get more performance from an AMDx2 though, as the individual cores are so much faster.
Tom
Someday, I'll have a real sig.
Twice the ALU power and half the power.
;-)
That's not a hard sell. If you're doing number crunching of any kind in a professional setting an AMDx2 or opt will pay for itself quickly.
Oh that and you're not funding the never ending chain of stupidity that is the P4 design team
Tom
Someday, I'll have a real sig.
Actually, the document you point out kind of describes the flaw in the MS scheduler. The MS scheduler only optimizes HT behavior when you have several physical HT enabled CPUs. So let's say you have 2 P4s with HT. Windows will see 4 CPUs. The article describes how Windows will try to schedule on a logical CPU that's part of a physical CPU which currently has nothing scheduled.
So the support described by Microsoft completely fails to address single physical CPU scenarios, or multi-CPU scenarios under high thread load.
What the scheduler needs to do in a single physical HT CPU scenario is only allow threads to execute on the second logical CPU that share resources with the thread executing on the first logical CPU in order to minimize resource contention.
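For anyone who wants the placement policy spelled out, here is a toy version of "prefer a logical CPU on a physical CPU with nothing scheduled". This is purely a sketch of the behavior described above, not Windows' actual scheduler; the function name and the topology dict layout are assumptions.

```python
def pick_logical_cpu(topology, busy):
    """Pick an idle logical CPU, preferring one whose physical CPU is fully idle.

    topology maps logical CPU id -> physical CPU id; busy is a set of logical ids.
    Returns None when every logical CPU is busy.
    """
    busy_physical = {topology[cpu] for cpu in busy}
    idle = [cpu for cpu in sorted(topology) if cpu not in busy]
    for cpu in idle:
        if topology[cpu] not in busy_physical:
            return cpu  # physical CPU has nothing scheduled: no cache contention
    # Fallback: share a physical CPU with whatever already runs there.
    return idle[0] if idle else None


two_p4s_with_ht = {0: 0, 1: 0, 2: 1, 3: 1}
print(pick_logical_cpu(two_p4s_with_ht, {0}))
```

Note that on a single physical HT CPU (`{0: 0, 1: 0}`) the preference can never apply once one thread is running, so every second thread lands on the fallback path. That is exactly the gap described above.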
You'd be wrong. The two L2 caches on the AMDx2 allow them to have access in parallel. Had they been unified you'd either have to make it dual-ported [e.g. larger and possibly slower] or have access in sequence [e.g. double the latency].
If you're running tasks that share [e.g. write to] the same small pocket of memory you're right. However, many tasks don't do that. Often a server will spawn an entire new thread [e.g. unique stack and heap objects] to handle connections.
It also makes sense in the desktop scene. Why would X11 and XMMS be accessing the same code? One is a media player and the other is a window server. They have different data objects in their own respective process spaces. So a unified cache doesn't help. Also keep in mind that smart OSes keep tasks on a given CPU so the cache doesn't get killed as quickly.
Tom
Someday, I'll have a real sig.