Hyperthreading Hurts Server Performance?
sebFlyte writes "ZDNet is reporting that enabling Intel's new Hyperthreading Technology on your servers could lead to markedly decreased performance, according to some developers who have been looking into problems that have been occurring since HT has been shipping automatically activated. One MS developer from the SQL server team put it simply: 'Our customers observed very interesting behaviour on high-end HT-enabled hardware. They noticed that in some cases when high load is applied SQL Server CPU usage increases significantly but SQL Server performance degrades.' Another developer, this time from Citrix, was just as blunt. 'It's ironic. Intel had sold hyperthreading as something that gave performance gains to heavily threaded software. SQL Server is very thread-intensive, but it suffers. In fact, I've never seen performance improvement on server software with hyperthreading enabled. We recommend customers disable it.'"
Anybody who understands HT has been saying this since chips first supported it. I have it enabled because I find that at typical loads our DB servers' performance benefits from HT-aware scheduling. Welcome to 2002.
I read the section of the Intel assembly guide on hyperthreading, and it clearly states that performance will drop if you don't take the shared cache into consideration. The two logical threads contend for the cache, causing the performance problems that were described. For there to be a true benefit from hyperthreading, the program, the OS or the compiler needs to detect that hyperthreading is enabled and shape the code to use less than half the cache. That has been known since the beginning, and frankly, it is silly that MS is scratching its head wondering why this is. Lower the cache footprint, and I'll be willing to bet that performance rises dramatically.
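To make the "use less than half the cache" rule concrete, here is a rough sketch of how a program might size its working set. The 512 KB cache size and the 90% headroom figure are assumptions picked for illustration; neither comes from the Intel guide.

```python
def per_thread_cache_budget(cache_bytes, threads_per_core=2, headroom_percent=90):
    """Bytes of shared cache one thread should try to stay within.

    cache_bytes: size of the cache shared by the logical processors.
    threads_per_core: logical processors sharing it (2 with HT enabled).
    headroom_percent: illustrative safety margin so the budget lands
        below an even split rather than exactly on it.
    """
    if threads_per_core < 1:
        raise ValueError("threads_per_core must be at least 1")
    if not 0 < headroom_percent <= 100:
        raise ValueError("headroom_percent must be in (0, 100]")
    # Integer arithmetic keeps the result exact and reproducible.
    return cache_bytes // threads_per_core * headroom_percent // 100


def block_elements(cache_bytes, element_size, threads_per_core=2):
    """How many elements of element_size fit in one thread's budget."""
    return per_thread_cache_budget(cache_bytes, threads_per_core) // element_size


# Hypothetical 512 KB shared cache, 8-byte elements, HT on.
budget = per_thread_cache_budget(512 * 1024)
elements = block_elements(512 * 1024, 8)
```

A loop that processes its data in blocks of `elements` items would then stay under half the shared cache, leaving room for whatever runs on the other logical processor.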
Marxism is the opiate of dumbasses
If you have a system thread cleaning out blocks of disk cache memory then of course it is going to suffer. The whole point of hyperthreading was that one thread could run while another was waiting for I/O.
The first tests on Linux when Hyperthreading came out were also pretty discouraging.
Mielipiteet omiani - Opinions personal, facts suspect.
This sort of effect has been talked about for as long as I remember hearing about hyperthreading. It was common knowledge long before the chips came out that running two threads on the same cache can cause performance issues. One can see this with two chips sharing an L2 cache so why should it be a surprise here?
The real question is whether this issue can be optimized for. If developers design their code with HT in mind, will this still be a problem because the other thread may belong to another process, or would properly optimized code be able to deal with this?
Most importantly, is this a rare effect or a common one? Would it be rare or common if you optimize your programs for an HT machine?
If you liked this thought maybe you would find my blog nice too:
I have a dual core on my desktop at home and HT on my machine at work. I'll take the dual core over HT any day that ends with a Y. You can multi-task so well with the Pentium D it becomes blissful. Want to archive a DVD movie and put your favorite CD on your mp3 player? Set the two apps to run on different cores. On the other hand, my HT workstation goes nuts-slow if I try to do two intensive tasks at once.
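For anyone who wants to do the "run the two apps on different cores" part from a script, here is a minimal sketch. It assumes a Linux box, where Python exposes `os.sched_setaffinity`; that is an assumption about the platform, not what the poster was using, and the helper name is made up for the example.

```python
import os


def pin_current_process(cpu):
    """Restrict the calling process to a single logical CPU.

    Returns True if the affinity was set, or False when the platform
    does not provide os.sched_setaffinity (it exists on Linux but not
    on every operating system).
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, {cpu})  # pid 0 means "the calling process"
    return True
```

Each of the two programs would call `pin_current_process` with a different CPU number at startup. On an HT machine, two logical CPU numbers can belong to the same physical core, so picking "different numbers" is not enough on its own to get two separate cores.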
Chewbacon
The Bible is like Wikipedia: written by a bunch of people and verifiable by questionable sources.
Usual response is to disable it from the BIOS.
One possible solution (code patch):
http://sourceforge.net/mailarchive/message.php?msg_id=12403341
Other threads with hyperthreading problems (slowdowns):
http://sourceforge.net/search/?forum_id=6330&group_id=9028&words=hyperthreading&type_of_search=mlists
developer http://flamerobin.org
I know asking them to do research is a stretch, but the submitter should at least read the article before submitting it. The quote was from a Technical Director at a consulting company that sells Citrix software, not from a developer at Citrix. Hyperthreading can definitely help the performance of Metaframe running under Windows 2003. Enabling it in the BIOS on a server running Windows 2000 was where the problem resided.
"I read the intel assembly guide section regarding hyperthreading, and it clearly states that performance will drop if you don't take the shared cache into consideration." This is a general problem. The Xbox 360 has similar issues, with 3 cores sharing the same cache. Having multiple independent CPUs, each with its own local memory (like a multiprocessor or the PS3's SPUs), doesn't suffer from these issues.
Nope, both cores use the same bridge to access central memory, so that point is moot. On the other hand, the cores of an Athlon X2 get to talk to one another through a special link, while regular multiprocessors have to use the FSB (or HyperTransport for AMD's Opterons) and therefore have to compete with every other device using said FSB/HT (on top of getting much higher latencies).
"The way we can tell it's C# instead of Haskell is because it's nine lines instead of two." -- wadler
HT is a very simple concept: virtualize 2 CPUs by cutting all caches in half and allocating each half to one of the CPUs, and allow the ALUs to process data from either thread. This can give good performance, for instance when one thread has a cache miss and is waiting for data from main memory (or, god forbid, there is a fault and you need to read from the HDD). In normal single-CPU operation, this ties up resources, and that thread can't make any progress. With HT on, the second thread can continue processing data. Even without a cache miss, there are 4 (or more) ALUs on the die, and only certain types of applications can effectively make use of them all simultaneously. Having HT raises the probability that all the resources on the chip are used. But the cost, as I said above, is (effectively) cutting the cache sizes in half. And cache is king for some applications. There are many job types where doubling the cache gives much better performance than even doubling the CPU speed (well, that is probably pushing it, but certainly adding 10% more cache can be better than a 10% higher clock rate), as it means less time going to main memory.
It isn't a foolproof technology, but it has its benefits. SQL can be very heavy on the cache, and I'm not surprised that it doesn't perform optimally without some tuning.
I don't run with HTT on, but I do have an SMP box (2 Xeons at 2GHz). The ATI software works fine on my desktop, so it must be an issue specific to hyperthreading.
MidnightBSD: The BSD for Everyone
Hyperthreading Speeds Linux.
In a nutshell:
- hyperthreading decreases syscall speed by a few percent
- on single-threaded workloads, the effect is often negligible, with occasional large improvements or degradations
- on multithreaded workloads, around 30% improvement is common
- Linux 2.5 (which introduced HT-awareness) performs significantly better than Linux 2.4
So, from that benchmark (and others like it, just STFW) it appears that HT offers significant benefits; you need multithreading to take advantage of it, and having an HT-aware OS helps.
Please correct me if I got my facts wrong.
morcego
I think you're kind of saying this already, but I felt confused by your wording and thought I'd chime in. I'm a little blurry on a few of these details, and too lazy to go look things up, so pay attention to replies... don't treat this as gospel.
As far as I know, all multi-cpu AMD packages use exactly the same method to talk amongst themselves, HyperTransport. They absolutely use a private, dedicated HT bus between cores. I *think* that when you run two single core Opterons, each has a link to main memory, and they also share a direct link. In the case of a 4-die system, I think the third and fourth CPUs 'piggyback' on the 1st and 2nd... they talk to processors 1 and 2, and each other. Processors 1 and 2 do main-memory fetches on their behalf. Each CPU has its own dedicated cache, and I think the cache ends up being semi-unified... so that if something is in processor 2's cache, when processor 4 requests the data, it comes from processor 2 instead of main memory. That's not quite as fast as direct cache, but it's a LOT faster than the DRAM.
The X2 architecture is like half of a 4-way system. There's one link to main memory, and one internal link between the two CPUs... the second one is piggybacking, just like processors 3 and 4 do in the 4-way system. It's not quite as good as a dedicated bus per processor, but the AMD architecture isn't that bandwidth-starved, and a 1gb HT link is usually fine for keeping two processors fed. You do lose a little performance, but not that much.
Intel dual cores share a single 800MHz bus, with no special link between the chips. And the Netburst architecture is extremely hungry for memory bandwidth. Because of its enormous pipeline, a branch mispredict/pipeline stall hurts terribly. The RAM needs to be very, very fast to refill the pipeline and get the processor moving again.
So running two Netburst processors down a single, already-starved memory bus is just Not a Good Idea. It's a crummy, slapped-together answer to the much, much better design of the AMD chips. It's a desperate solution to avoid the worst of all possible fates... not being in a high-end market segment at all.
Next year this could all be different again, but at the moment, AMD chips, particularly dual core, are a lot better from nearly every standpoint.
This is why SMP makers are going nuts over the Opteron. Your effective memory bandwidth scales linearly with the number of processors, assuming your processes partition nicely.
AMD Opterons each have their own local RAM and can access each other's RAM over the HT links to form a cache-coherent non-uniform memory architecture (ccNUMA).
Multi-core Opterons have a special internal crossbar switch that allows the cores to share the memory controller and HT links; they do not 'piggyback' on each other. This reduces latencies and increases bandwidth for communication between the two cores, and it gives both cores equal-opportunity access to the HT ports and the CPU's local RAM. With a NUMA-enabled OS, applications will run off the CPU's local RAM whenever possible to minimize bus contention, and this allows Opteron servers' overall bandwidth and processing power to scale up almost linearly with the number of CPUs.
As for Intel's dual-cores, the P4 makes sub-optimal use of its very limited available bandwidth. Turning HT on in a quad-core setup where the FSB is already dry on bandwidth naturally only makes things worse by increasing bus contention. Netburst was a good idea but it was poorly executed and the shared FSB very much killed any potential for scalability. If Intel gave the P4 an integrated RAM controller and a true dual-core CPU (two cores connected through a crossbar switch to shared memory and bus controllers like AMD did for the X2s), things would look much better. I'm not buying Intel again until Intel gets this obvious bit of common sense. The CPU is the largest RAM bandwidth consumer in a system, it should have the most direct RAM access possible. Having to fill pipelines and hide latencies with distant RAM wastes many resources and a fair amount of performance - and to make this bad problem worse, Intel is doing this on a shared bus. Things will get a little better with the upcoming dual-bus chipsets with quad-channel FBDIMM but this will still put a hard limit on practical scalability thanks to the non-scalable RAM bandwidth.
On modern high-performance CPUs, shared buses kill scalability. AMD moved towards independent CPU buses with the K7 and integrated RAM controllers with the K8 to swerve around the scalability brick wall that Intel was about to crash into many years ago and has kept on ramming ever since. Right now, Intel's future dual-FSB chipset is nothing more than Intel finally catching up with last millennium's dual-processor K7 platforms, only with bigger bandwidth figures.
Well, AFAIK, the HTT thing only allows the processor to sort of split execution units (FPU, ALU, etc.) so that one can work on one thread while another works on a different one. If an application leans heavily on one of those units (and my somewhat uninformed feeling is that software like SQL probably works mostly on the ALU), it can't possibly GAIN performance. On the other hand, I can see the effort of trying to pigeonhole the idle threads onto the wrong execution unit (will it even try that?) completely borking performance. So yeah, no surprises here.
Intel and AMD's Hyperthreading is one-at-a-time: the chip is either working as CPU 0 or it's working as CPU 1. I believe the newest Power5 is the only chip out there that can divvy up the internal units amongst the threads and work simultaneously.
And they have noticed performance improvements, though (once again) you have the cache size issue. Then again, a Power5 has an enormous CPU cache compared to an x86 processor.
Not at all. One of the big problems with HyperThreading as Intel has designed it is that they did not provide sufficient memory bandwidth to be able to feed both threads. This problem also plagues Intel's "dual core" chips. Ultimately, it eliminates the supposed benefit of switching over to the other thread when the first is blocked on memory access because as soon as the second thread needs information from memory it will actually slow things down. Also, even if there were sufficient memory bandwidth, the comparatively long fetch times would still mean that the CPU would be blocked parts of the time waiting on memory because there are only two threads available. Finally, there is a cost to switching between the threads, so even if you had the memory bandwidth and enough threads to prevent idle time it would still lose time because of the overhead in switching to another thread whenever the active one gets blocked.
On the other hand, the new UltraSPARC T1 (aka, Niagara) has massive memory bandwidth, shorter fetch times, four threads per core rather than two, and zero penalty for switching between threads. The result is incredible throughput with a total of 32 hardware threads (8 cores with four threads per core) in a single chip. And by the way, that single chip draws a fraction of the power and generates much less heat than a single Intel HT processor (I swear it seems like the systems are blowing out cool air).
Note, however, that the T1 chip may not be ideal for all workloads. It does have a relatively slow single-threaded performance, so it works best when running highly concurrent applications with minimal locking, or when running several applications concurrently. For some applications, it may be desirable to use processor sets to limit the set of threads that it can see and/or to run multiple copies concurrently and load balance across them. But for others that are designed to scale well (e.g., those that already run well on larger systems like the E6800 with 24 UltraSPARC-III or the E2900 with 12 dual-core UltraSPARC-IV chips), then they can take full advantage of the available processing power.
For the tests that I've run with an application that does scale, a system with a single UltraSPARC T1 chip easily doubles up the performance of a system with two 3.2GHz HT Xeons (regardless of whether HT was enabled or disabled). Of course, I haven't been lucky enough to test with the officially shipping version of the T1 chip (the ones I've been able to use have been running at a slower clock rate, and some of them have had some of the cores disabled), so that performance gap may actually be larger than I have been able to measure so far.
The parent post is common sense, which seems infrequent. I have found the range to be quite wide: when rendering animations from Blender, I have found that hyperthreading results in nearly 70% faster throughput when turned on. For rendering MPEG2 using Tmpgenc (under Wine), I see around 40% improvement with HT on. Clearly, these two applications benefit quite a bit from HT due to a small computational footprint and/or low cache contention, etc. On the other hand, on my system, on-screen 3D acceleration in the NVIDIA driver (under Linux at least) appears to suffer with HT, with frame rates that are around 10-20% slower than with HT disabled.
So, I see improvements ranging from -20% to +70% depending on the application, with many applications seeing only small differences one way or the other. Like many things, this tends to turn into a religious debate when the fact is that it varies case-by-case.
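If you want to put your own application somewhere on that -20% to +70% scale, the arithmetic is just a relative change in throughput. A tiny sketch follows; the function name is made up here, and the numbers in the usage lines are placeholders chosen to match the two ends of the range above, not real measurements.

```python
def ht_change_percent(throughput_off, throughput_on):
    """Percent change in throughput from enabling HT (negative = slower).

    Both arguments use the same unit, e.g. frames per second or jobs per
    hour, measured on the same machine with HT disabled and enabled.
    """
    if throughput_off <= 0:
        raise ValueError("throughput_off must be positive")
    return (throughput_on - throughput_off) * 100.0 / throughput_off


# Placeholder figures only: 100 units with HT off in both cases.
render_gain = ht_change_percent(100, 170)   # a render-style workload
driver_loss = ht_change_percent(100, 80)    # a driver-style workload
```

Run the same job with HT off and on in the BIOS, plug in the two throughput figures, and the sign tells you which camp your workload is in.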
First, it doesn't matter if the server uses threads or processes. Threads have a minor performance advantage for startup and context switching, and some disadvantages for memory allocation speed (finding VM space is a hashing problem) and some locking overhead. For the most part though, with tasks that just crunch numbers (including scanning memory) or make system calls, there isn't all that much difference.
Running 2 threads per CPU is not cheating. It's normal to run 1 thread per CPU plus 1 thread per concurrent blocking IO operation. That could come out to be 2 threads per CPU.
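The sizing rule in that last paragraph can be written down directly. This is only a sketch of the rule of thumb as stated; the function name and the example figures are invented for illustration.

```python
import os


def suggested_thread_count(concurrent_blocking_io, cpus=None):
    """One thread per CPU plus one per concurrent blocking I/O operation.

    concurrent_blocking_io: how many I/O operations are expected to be
        blocked at the same time (an estimate the caller must supply).
    cpus: number of CPUs; defaults to what the OS reports.
    """
    if cpus is None:
        cpus = os.cpu_count() or 1  # cpu_count() may return None
    if cpus < 1 or concurrent_blocking_io < 0:
        raise ValueError("need cpus >= 1 and concurrent_blocking_io >= 0")
    return cpus + concurrent_blocking_io


# Example: 2 CPUs with 2 I/O operations typically in flight
# gives 4 threads, i.e. 2 threads per CPU.
threads = suggested_thread_count(concurrent_blocking_io=2, cpus=2)
```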
Not really. You're trying to run two threads inside the L1 caches, the decoder bandwidth, etc. So the 16KB of cache turns back into [effectively] 8KB.
:-)
The P4 can issue 3 uOPs per cycle but IIRC only from one thread. They alternate [stalls free up slots though]. Also the decoder can only decode ONE x86 opcode per cycle. Then you have expensive memory ports. Fetches to L2 [or alternatively to system memory] are done one at a time per thread [with deep queues which were lengthened in the Prescott core].
Aside from the stronger ALU and the fact there are two of them, the AMDx2 also benefits from having dedicated caches per core and a dedicated HT bus between the cores that doesn't sit on the external bus. If you want SMP performance just get opterons or amdx2 cores. Really that simple.
We all know HT was a kludge hacked in at the last minute to gain some marketing press. If Intel really wanted to boost the performance of the P4 they would strengthen the ALU [at a cost of clock frequency]. If a 2.2GHz AMD64 can ROUTINELY beat a 3.2GHz P4 then a 2.5GHz optimized P4 variant [with a stronger ALU] could hold its own.
Oh wait, that already exists. It's called the Pentium M.
Tom
Someday, I'll have a real sig.
The AMDx2 has separate L2 caches per core, they can communicate via a dedicated HT link [between 3.2 and 4 GiB/sec] and share one memory controller.
So if one core requests cache line $x and the other core has it the data will be sent over the internal HT link and not even hit the memory bus. The memory controller is pipelined [I suspect] so even while the L2 fulfillment is going on the memory bus can be busy fetching another cache line.
The HT cores have *one* L2 cache per physical core [so for instance, a dual-core HT processor has 4 "logical processors" but only 2 L2 caches]. The Prescott [and later] cores have deep memory read/write pipelines to queue up many memory operations at once. The dual-core P4s have their own cache per physical core, which communicates on an FSB that is shared between everything [including the memory bus].
Though in reality the dual-core P4 [e.g. 8xx series] do achieve 2x performance on totally unrelated [and not memory bound] tasks. For instance, doing RSA computations with CRT in two threads gets you a result with half the latency. [same is true for the AMDx2].
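One reading of the RSA example is that the two half-size exponentiations of the CRT method go to separate threads. Here is a sketch of that split. Note the assumption: in CPython the global interpreter lock means these two `pow` calls will not actually run in parallel, so this shows the structure of the split, not the latency win. The toy key (p=61, q=53, d=2753) is a textbook-sized example and useless for real cryptography.

```python
from concurrent.futures import ThreadPoolExecutor


def rsa_crt_private_op(c, p, q, d):
    """Compute c**d mod (p*q) from two independent half-size exponentiations."""
    dp = d % (p - 1)          # exponent reduced for the mod-p half
    dq = d % (q - 1)          # exponent reduced for the mod-q half
    q_inv = pow(q, -1, p)     # modular inverse of q mod p (Python 3.8+)

    # The two halves share no data, so each can go to its own worker.
    with ThreadPoolExecutor(max_workers=2) as pool:
        half_p = pool.submit(pow, c % p, dp, p)
        half_q = pool.submit(pow, c % q, dq, q)
        m_p, m_q = half_p.result(), half_q.result()

    # Garner's recombination of the two residues.
    h = (q_inv * (m_p - m_q)) % p
    return m_q + h * q


# Toy key: n = 61 * 53 = 3233, e = 17, d = 2753.
ciphertext = pow(65, 17, 3233)
plaintext = rsa_crt_private_op(ciphertext, 61, 53, 2753)
```

In a language with truly parallel threads the same structure lets each core take one half, which is the case the post is describing.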
So the dual-core P4 isn't a bad buy if money is strapped. You'll get more performance from an AMDx2 though as the individual cores are so much faster.
Tom
Someday, I'll have a real sig.
Actually, the document you point out kind of describes the flaw in the MS scheduler. The MS scheduler only optimizes HT behavior when you have several physical HT enabled CPUs. So let's say you have 2 P4s with HT. Windows will see 4 CPUs. The article describes how Windows will try to schedule on a logical CPU that's part of a physical CPU which currently has nothing scheduled.
So the support described by Microsoft completely fails to address single physical CPU scenarios, or multi-CPU scenarios under high thread load.
What the scheduler needs to do in a single physical HT CPU scenario is only allow threads to execute on the second logical CPU that share resources with the thread executing on the first logical CPU in order to minimize resource contention.
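The placement rule described earlier in this comment (prefer a logical CPU whose physical CPU has nothing scheduled) can be modelled in a few lines. This is a toy model for discussion, not Microsoft's scheduler code; the data structures and names are invented.

```python
def pick_logical_cpu(core_of, busy):
    """Choose a logical CPU for a new thread.

    core_of: dict mapping logical CPU id -> physical CPU id.
    busy: set of logical CPU ids that already have a thread running.

    Prefers a free logical CPU whose physical CPU is entirely idle,
    falls back to any free logical CPU, and returns None when every
    logical CPU is busy.
    """
    busy_physical = {core_of[cpu] for cpu in busy}
    free = sorted(cpu for cpu in core_of if cpu not in busy)
    for cpu in free:
        if core_of[cpu] not in busy_physical:
            return cpu
    return free[0] if free else None


# Two physical P4s with HT: logical 0,1 on physical 0; logical 2,3 on physical 1.
two_p4s = {0: 0, 1: 0, 2: 1, 3: 1}
choice = pick_logical_cpu(two_p4s, busy={0})   # picks 2, the idle physical CPU

# One physical P4 with HT: the preference can never apply once a thread runs.
one_p4 = {0: 0, 1: 0}
fallback = pick_logical_cpu(one_p4, busy={0})  # picks 1, sharing the busy CPU
```

The single-CPU case shows the gap: the rule always lands in the fallback branch, which takes no account of what the two threads have in common.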
You'd be wrong. The two L2 caches on the AMDx2 allow them to have access in parallel. Had they been unified you'd either have to make it dual-ported [e.g. larger and possibly slower] or have access in sequence [e.g. double the latency].
If you're running tasks that share [e.g. write to] the same small pocket of memory you're right. However, many tasks don't do that. Often a server will spawn an entire new thread [e.g. unique stack and heap objects] to handle connections.
It also makes sense in the desktop scene. Why would X11 and XMMS be accessing the same code? One is a media player and the other is a window server. They have different data objects in their own respective process spaces. So a unified cache doesn't help. Also keep in mind that smart OSes keep tasks on a given CPU so the cache doesn't get killed as quickly.
Tom
Someday, I'll have a real sig.