Hyperthreading Hurts Server Performance?

Posted by CowboyNeal on Saturday November 19, 2005 @02:16AM from the opposite-effects dept.

sebFlyte writes "ZDNet is reporting that enabling Intel's new Hyperthreading Technology on your servers could lead to markedly decreased performance, according to some developers who have been looking into problems that have been occurring since HT has been shipping automatically activated. One MS developer from the SQL server team put it simply: 'Our customers observed very interesting behaviour on high-end HT-enabled hardware. They noticed that in some cases when high load is applied SQL Server CPU usage increases significantly but SQL Server performance degrades.' Another developer, this time from Citrix, was just as blunt. 'It's ironic. Intel had sold hyperthreading as something that gave performance gains to heavily threaded software. SQL Server is very thread-intensive, but it suffers. In fact, I've never seen performance improvement on server software with hyperthreading enabled. We recommend customers disable it.'"

This is news? by Anonymous Coward · 2005-11-19 02:21 · Score: 5, Informative

Anybody who understands HT has been saying this since chips first supported it. I keep it enabled because, at typical loads, our DB servers' performance benefits from HT-aware scheduling. Welcome to 2002.
The code wasn't changed by ocelotbob · 2005-11-19 02:22 · Score: 5, Informative

I read the Intel assembly guide section on hyperthreading, and it clearly states that performance will drop if you don't take the shared cache into consideration. The two logical threads contend for the cache, causing the performance problems that were described. For hyperthreading to be a true benefit, either the program, the OS or the compiler needs to determine that hyperthreading is enabled and shape the code to use less than half the cache. It's been known to work that way since the beginning, and frankly, it is silly that MS is scratching their heads wondering why this is. Lower the cache footprint, and I'll be willing to bet that performance rises dramatically.
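
A rough sketch of the "use less than half the cache" rule, for illustration only. The function names, the 90% headroom figure and the 1 MiB cache size are all assumptions made up for this example, not anything taken from the Intel guide.

```python
def per_thread_cache_budget(cache_bytes, threads_sharing_cache=2, headroom_percent=90):
    """Bytes of a shared cache that one logical thread should plan to use.

    A headroom_percent below 100 keeps the budget strictly under an even
    split, leaving room for code, stack and the OS. Defaults are illustrative.
    """
    if cache_bytes <= 0 or threads_sharing_cache < 1:
        raise ValueError("cache size and thread count must be positive")
    if not 0 < headroom_percent <= 100:
        raise ValueError("headroom_percent must be in (0, 100]")
    # Integer arithmetic avoids floating-point rounding surprises.
    return cache_bytes * headroom_percent // (100 * threads_sharing_cache)


def block_length(cache_bytes, element_size, threads_sharing_cache=2):
    """Number of elements to process per block so one block fits the budget."""
    budget = per_thread_cache_budget(cache_bytes, threads_sharing_cache)
    return max(1, budget // element_size)


if __name__ == "__main__":
    l2 = 1024 * 1024  # hypothetical 1 MiB cache shared by two logical threads
    print(per_thread_cache_budget(l2))
    print(block_length(l2, element_size=8))
```

With HT off, the same code would be called with threads_sharing_cache=1 and could use nearly the whole cache, which is the difference the guide is warning about.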

--
Marxism is the opiate of dumbasses
1. Re:The code wasn't changed by canavan · 2005-11-19 04:03 · Score: 4, Informative
  
  When optimizing code, the compiler should worry about cache size and cache footprint, so that it doesn't unroll inner loops too far or increase the code size enough to cause thrashing. HT has just cut in half the maximum cache footprint at which increasing code size for possibly minor performance boosts still makes sense. GCC has an option called --param max-unrolled-insns=VALUE, which controls just that. There are possibly others with similar effects, possibly also for other compilers. Additionally, it may make sense to have the compiler optimize for size instead of speed in some cases.
sort of obvious by Vlad_the_Inhaler · 2005-11-19 02:24 · Score: 4, Informative

If you have a system thread cleaning out blocks of disk cache memory then of course it is going to suffer. The whole point of hyperthreading was that one thread could run while another was waiting for I/O.

The first tests on Linux when Hyperthreading came out were also pretty discouraging.

--
Mielipiteet omiani - Opinions personal, facts suspect.
Interesting Technical Analysis on the subject by morcego · 2005-11-19 03:39 · Score: 4, Informative

You will find here a very interesting technical analysis on the subject, by Bryan J. Smith, on why Hyperthreading is crappy engineering. From the message:

Since then, Intel has made a number of "hacks" to the i686 architecture.
One is HyperThreading which tries to keep its pipes full by using its
control units to virtualize two instruction schedulers, registers, etc...
In a nutshell, it's a nice way to get "out-of-order and register
renaming for almost free." Other than basic coherency checking as
necessary in silicon, it "passes the buck" to the OS, leveraging its
context switching (and associated overhead) to manage some details.

That's why HyperThreading can actually be slower for some applications,
because they do not thread, and the added overhead in _software_
results in reduced processing time for the applications.

--
morcego
Re:Poor mans dual-core by Malor · 2005-11-19 03:50 · Score: 4, Informative

I think you're kind of saying this already, but I felt confused by your wording and thought I'd chime in. I'm a little blurry on a few of these details, and too lazy to go look things up, so pay attention to replies... don't treat this as gospel.

As far as I know, all multi-cpu AMD packages use exactly the same method to talk amongst themselves, HyperTransport. They absolutely use a private, dedicated HT bus between cores. I *think* that when you run two single core Opterons, each has a link to main memory, and they also share a direct link. In the case of a 4-die system, I think the third and fourth CPUs 'piggyback' on the 1st and 2nd... they talk to processors 1 and 2, and each other. Processors 1 and 2 do main-memory fetches on their behalf. Each CPU has its own dedicated cache, and I think the cache ends up being semi-unified... so that if something is in processor 2's cache, when processor 4 requests the data, it comes from processor 2 instead of main memory. That's not quite as fast as direct cache, but it's a LOT faster than the DRAM.

The X2 architecture is like half of a 4-way system. There's one link to main memory, and one internal link between the two CPUs... the second one is piggybacking, just like processors 3 and 4 do in the 4-way system. It's not quite as good as a dedicated bus per processor, but the AMD architecture isn't that bandwidth-starved, and a 1gb HT link is usually fine for keeping two processors fed. You do lose a little performance, but not that much.

Intel dual cores share a single 800MHz bus, with no special link between the chips. And the Netburst architecture is extremely memory-bandwidth hungry. Because of its enormous pipeline, a branch mispredict/pipeline stall hurts terribly. The RAM needs to be very, very fast to refill the pipeline and get the processor moving again.

So running two Netburst processors down a single, already-starved memory bus is just Not a Good Idea. It's a crummy, slapped-together answer to the much, much better design of the AMD chips. It's a desperate solution to avoid the worst of all possible fates... not being in a high-end market segment at all.

Next year this could all be different again, but at the moment, AMD chips, particularly dual core, are a lot better from nearly every standpoint.
Re:Poor mans dual-core by volsung · 2005-11-19 04:03 · Score: 4, Informative

That's not quite true either. Each Opteron has a separate memory controller (dual-channel), which means that each CPU can have its own pipe to a bank of memory. So if the CPU needs to access memory in its banks, it will not have to contend with the other CPU over the HT link. A NUMA-aware OS will try to schedule processes on the same CPU which controls the process's allocated memory. If your programs can fit in one CPU's memory bank, then you can get bus contention down pretty low.
This is why SMP makers are going nuts over the Opteron. Your effective memory bandwidth scales linearly with the number of processors, assuming your processes partition nicely.
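
A back-of-the-envelope model of that scaling claim, as a sketch only. The 6.4 GB/s figure is a placeholder, and the model assumes the best case described above, where every process stays inside its own CPU's memory bank; real systems will land somewhere below it.

```python
def shared_bus_per_cpu(bus_gb_per_s, cpus):
    """Bandwidth each CPU gets when all CPUs contend for one shared bus."""
    if cpus < 1:
        raise ValueError("need at least one CPU")
    return bus_gb_per_s / cpus


def numa_aggregate(per_cpu_gb_per_s, cpus):
    """Total bandwidth when each CPU has its own memory controller and
    every access is local (the 'partitions nicely' assumption)."""
    if cpus < 1:
        raise ValueError("need at least one CPU")
    return per_cpu_gb_per_s * cpus


if __name__ == "__main__":
    bw = 6.4  # illustrative GB/s figure, not a measurement
    for n in (1, 2, 4):
        print(n, "CPUs:",
              "shared bus per CPU =", round(shared_bus_per_cpu(bw, n), 2),
              "| NUMA aggregate =", round(numa_aggregate(bw, n), 2))
```

The shared bus total stays flat while each CPU's slice shrinks; the per-CPU controller total grows with the CPU count, which is the linear scaling in question.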
Re:Poor mans dual-core by InvalidError · 2005-11-19 05:37 · Score: 4, Informative

AMD Opterons each have their own local RAM and can access each other's RAM over the HT links to form a cache-coherent non-uniform memory architecture - ccNUMA.

Multi-core Opterons have a special internal crossbar switch that allows the cores to share the memory controller and HT links; they do not 'piggyback' on each other. This reduces latencies and increases bandwidth for communication between the two cores and gives both cores equal-opportunity access to the HT ports and the CPU's local RAM. With a NUMA-enabled OS, applications will run off the CPU's local RAM whenever possible to minimize bus contention, and this allows Opteron servers' overall bandwidth and processing power to scale up almost linearly with the number of CPUs.

As for Intel's dual-cores, the P4 makes sub-optimal use of its very limited available bandwidth. Turning HT on in a quad-core setup where the FSB is already dry on bandwidth naturally only makes things worse by increasing bus contention. Netburst was a good idea but it was poorly executed and the shared FSB very much killed any potential for scalability. If Intel gave the P4 an integrated RAM controller and a true dual-core CPU (two cores connected through a crossbar switch to shared memory and bus controllers like AMD did for the X2s), things would look much better. I'm not buying Intel again until Intel gets this obvious bit of common sense. The CPU is the largest RAM bandwidth consumer in a system, it should have the most direct RAM access possible. Having to fill pipelines and hide latencies with distant RAM wastes many resources and a fair amount of performance - and to make this bad problem worse, Intel is doing this on a shared bus. Things will get a little better with the upcoming dual-bus chipsets with quad-channel FBDIMM but this will still put a hard limit on practical scalability thanks to the non-scalable RAM bandwidth.

On modern high-performance CPUs, shared busses kill scalability. AMD moved towards independent CPU busses with the K7 and integrated RAM controllers with the K8 to swerve around the scalability brick wall Intel was about to crash into many years ago and has kept on ramming ever since. Right now, Intel's future dual-FSB chipset is nothing more than Intel finally catching up with last millennium's dual-processor K7 platforms, only with bigger bandwidth figures.
wrong on both counts by r00t · 2005-11-19 08:31 · Score: 4, Informative

First, it doesn't matter if the server uses threads or processes. Threads have a minor performance advantage for startup and context switching, and some disadvantages for memory allocation speed (finding VM space is a hashing problem) and some locking overhead. For the most part though, with tasks that just crunch numbers (including scanning memory) or make system calls, there isn't all that much difference.

Running 2 threads per CPU is not cheating. It's normal to run 1 thread per CPU plus 1 thread per concurrent blocking IO operation. That could come out to be 2 threads per CPU.
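
The sizing rule above can be written down as a small sketch. The function name is made up for illustration, and the count of concurrent blocking I/O operations is something you would have to measure or estimate for your own server.

```python
import os


def worker_thread_count(concurrent_blocking_io, cpus=None):
    """One thread per CPU plus one per concurrent blocking I/O operation."""
    if cpus is None:
        # os.cpu_count() can return None; fall back to a single CPU.
        cpus = os.cpu_count() or 1
    if cpus < 1 or concurrent_blocking_io < 0:
        raise ValueError("cpus must be >= 1 and I/O count must be >= 0")
    return cpus + concurrent_blocking_io


if __name__ == "__main__":
    # Four CPUs with four blocking I/O operations in flight gives
    # eight threads, i.e. the "2 threads per CPU" case.
    print(worker_thread_count(4, cpus=4))
```

Note that os.cpu_count() reports logical CPUs, so on an HT-enabled box it already counts each physical core twice.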