Hyperthreading Hurts Server Performance?

This is news? by Anonymous Coward · 2005-11-19 02:21 · Score: 5, Informative

Anybody who understands HT has been saying this since chips supported it. I have it enabled because I find that at typical loads our DB servers' performance benefits from HT-aware scheduling. Welcome to 2002.

Re:This is news? by GIL_Dude · 2005-11-19 02:34 · Score: 1

Right, that's true. I find that on my desktop machine, having it enabled lets me run VMware virtual machines more seamlessly. One can be spinning one virtual processor fairly well and not freeze my host machine's apps. Of course this only goes so far; you do still feel it, but you aren't totally processor starved. So far on my development desktops it has been a good thing to have HT.
Re:This is news? by dindi · 2005-11-19 03:47 · Score: 4, Interesting

MySQL on Linux with a 10 GB DB definitely benefits from it on my server.
In fact, turning it off results in 20+ percent longer query times, especially with multiple fulltime queries.

Of course differently written queries and different systems/sql engines might behave differently.

In fact I am so happy with HT that I am going to change my desktop to one, as it is a Linux machine with lots of apps running at the same time. Not to mention that it is also a devel station with SQL+Apache, which benefited from HT in my experience.

(Well, it is time to upgrade anyway, and I choose HT over non-HT.)
Re:This is news? by magarity · 2005-11-19 03:49 · Score: 5, Interesting

Anybody who understands HT has been saying this since chips supported it

People also have to trouble themselves to configure things properly, which isn't obvious and isn't the default. HT presents itself to Windows as another processor, but as you know it isn't one. So you have to set SQL Server's '# of processors for parallel processing' setting to the number of real processors, not virtual ones. We changed ours to this spec and performance went up markedly. SQL Server defaults to whatever number of procs Windows reports and tries to run a full CPU's worth of load on the HT logical processor. Not gonna happen.
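
For anyone who wants to work out the "real processors" number rather than eyeball it, here is a rough Python sketch. It only does the counting, not the SQL Server change, and it assumes the x86 Linux /proc/cpuinfo layout with "physical id" and "core id" fields; the function name and the sample text are made up for illustration.

```python
def count_cpus(cpuinfo_text):
    """Return (logical, physical) CPU counts from /proc/cpuinfo-style text.

    Assumes the x86 Linux layout: one blank-line-separated stanza per
    logical CPU, each carrying "physical id" and "core id" fields.
    """
    logical = 0
    cores = set()
    for stanza in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        # HT siblings report the same (package, core) pair.
        cores.add((fields.get("physical id"), fields.get("core id")))
    return logical, len(cores)


# Made-up sample: one single-core package with HT on, seen as two CPUs.
SAMPLE = """processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 0
"""

logical, physical = count_cpus(SAMPLE)
print(logical, physical)
```

The second number is the one you would feed into a parallelism setting like the one described above.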
Re:This is news? by tylernt · 2005-11-19 03:53 · Score: 1

I was always taught that two 1GHz CPUs are slower than one 2GHz CPU, because of the extra overhead of the OS managing the 2nd CPU. On all the servers at work, HT CPUs show up as two virtual CPUs to Windows... so yeah, I would fully expect HT to be slower on heavily loaded systems -- no surprises there.

--
DRM 'manages access' in the same way that a prison 'manages freedom'
Re:This is news? by magarity · 2005-11-19 03:54 · Score: 1

PS - Comparing this article about HT issues with the article a couple of weeks back about the new Xeon dual cores and their memory architecture, I think I'd rather just stick with single-core HT models.
Re:This is news? by magarity · 2005-11-19 04:00 · Score: 2, Informative

You should have been taught that two 1GHz CPUs may or may not be slower than one 2GHz CPU depending on what the server does for a living. The OS consideration is minuscule; cross-CPU communication is almost as fast as the internals of the CPUs until you get up to NUMA-type machines. Even then, no, a blanket statement such as the one you say you've been taught is incorrect for at least as many cases as it is correct.
Re:This is news? by Velox_SwiftFox · 2005-11-19 04:24 · Score: 2, Interesting

Ah, but with Intel, now you have to choose between dual cores and HT (or pay a lot for the super gaming processor). And choose 2M over 1M cache over 2 processors with 1M each cache, et cetera. Even in the medium priced processors.

Experience here with the servers I deal with, running Linux 2.6 kernel/Apache/MySQL on dual Xeons with up to 6GB, is that turning HT on as well reduces performance. When a CPU fan failed and one CPU had to be temporarily removed, however, there was a clear benefit to turning it on with the single processor.
Re:This is news? by dindi · 2005-11-19 04:41 · Score: 2, Interesting

Hmm interesting.
I was talking about a single proc and HT,
I imagine that with dual + HT it is different. I do not see why it is happening. Actually, if I bought an expensive server and experienced that, I might try to get some official explanation for the problem.

I wonder if you tried BSD or Windows on the same or similar hardware; it might be an OS-specific problem as well.

Hmm, Google on it I will. :)
Re:This is news? by tylernt · 2005-11-19 04:47 · Score: 1

You have a point there... also, I was told this a number of years ago when clock speeds were measured in MHz rather than GHz, so things were an order of magnitude slower "back then".

--
DRM 'manages access' in the same way that a prison 'manages freedom'
Re:This is news? by Glasswire · 2005-11-19 05:17 · Score: 1

Sort of correct.
Nobody (who knew what they were talking about) ever said HT always gave a speed improvement - but databases generally do benefit from it. It would be interesting to do a rigorous analysis of the writer's situation. Since Hyperthreading is hardly "new" (Intel has been shipping it on desktop and server chips for about 3 years, as the post suggests), one wonders what else the writer is clueless about.
Re:This is news? by dnoyeb · 2005-11-19 05:20 · Score: 2, Interesting

Quite interesting. So SQL Server spawns processes as opposed to threads when it finds a second processor? I can't imagine that's true. What exactly do you mean by a 'full CPU's worth of load'?

The only situation I can imagine is if SQL Server spawns, say, 2 threads per CPU for performance. But this is a cheating way to get more CPU time, and I wouldn't expect a _server_ class program to do such a thing when such a program would tend to expect it's getting a dedicated CPU anyway.
Re:This is news? by Glasswire · 2005-11-19 05:23 · Score: 1, Interesting

Ah, but with Intel, now you have to choose between dual cores and HT (or pay a lot for the super gaming processor). And choose 2M over 1M cache over 2 processors with 1M each cache, et cetera. Even in the medium priced processors.
The above (as you actually imply) is about single socket uniproc DESKTOP systems, not the servers (generally at least 2, 4+ socket) and server apps we're talking about.
As a matter of fact, both Intel's current dual socket, dual core cpu (Paxville DP) and the follow-on dual core Dempsey HAVE Hyperthreading. Please don't contaminate a discussion about servers with irrelevant desktop technology observations.
Re:This is news? by dnoyeb · 2005-11-19 05:24 · Score: 1

Depends on the architecture. It's the sharing of memory that slows them down. This is why CPUs intended for multi setups have additional cache on-board. If there is no memory contention then the dual can appear more responsive, but I would never call it 'faster.'
Re:This is news? by geopsychic · 2005-11-19 07:39 · Score: 1

Hyperthreading performance depends on how bright the scheduler is as well as the process mix. Last year a cluster vendor asked my company to run benchmarks on their hardware. This was dual CPU Linux boxes. They gave us four to test on. All had hyperthreading turned on.

Our code for the test is CPU bound. Seismic prestack time migration, which may not mean much to most people. We were running two threads per machine. The jobs typically run for hours between I/O operations. The entire process on a large survey can take weeks on a large cluster.

The timing results we got were very inconsistent. In tracking the problem down, I logged into the cluster machines and ran top. Roughly half of the time the scheduler had both threads running on the same CPU. Turning hyperthreading off increased throughput significantly and gave reproducible run times.
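
One alternative to switching HT off would be to pin the work so that no two compute threads share a physical core. Below is a minimal Python sketch of that idea, not what we actually ran: `one_cpu_per_core`, `pin_to` and the `TOPOLOGY` map are hypothetical names, the map itself is invented, and whether pinning would have cured the scheduler behaviour described above is an untested assumption. `os.sched_setaffinity` is a real call but is only available on Linux.

```python
import os


def one_cpu_per_core(topology):
    """Pick the lowest-numbered logical CPU from each physical core.

    `topology` maps logical CPU number -> (package id, core id); how that
    map is obtained is left out and is an assumption of this sketch.
    """
    chosen = {}
    for cpu in sorted(topology):
        chosen.setdefault(topology[cpu], cpu)
    return sorted(chosen.values())


def pin_to(cpus):
    """Restrict the caller to the given logical CPUs (Linux only)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cpus)


# Hypothetical dual-Xeon box with HT on: CPUs 0/2 share one core, 1/3 the other.
TOPOLOGY = {0: (0, 0), 1: (1, 0), 2: (0, 0), 3: (1, 0)}
print(one_cpu_per_core(TOPOLOGY))
```

With two worker processes, each would call `pin_to` with a single, different entry from that list.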

We tell our clients to turn hyperthreading off.
Re:This is news? by sapgau · 2005-11-19 09:24 · Score: 1

How is that "cheating"?
No matter how many processors you have on your server, there will always be more threads than CPUs on modern OSes.
Re:This is news? by Velox_SwiftFox · 2005-11-19 09:56 · Score: 1

I don't know either; AFAIK the situation might resolve, or reverse, with the next release of the Linux kernel, or with a recompile of the kernel and server apps for the actual hardware. Or different apps, YMMV.

I suspect it is a case of "Data, data, who's got the data? Oh, the chip with the other two virtual CPUs. Which is competing with me for RAM bus bandwidth because I have the data it needs in my cache. Where it shoved out the data I need now." In any case, I have the misfortune of having to experiment on some SQL 2000 servers now because of the results the article linked to.
Re:This is news? by i.r.id10t · 2005-11-19 12:06 · Score: 1

Would you rather do several things at once at a speed that is probably "fast enough", or one thing at a time really fast? Would you like to be able to run some massive CPU hog of an app and still have a responsive system for doing other things while you wait for it to do its thing (compile, render, whatever)? In cases like this, disk access quickly becomes your bottleneck.

--
Don't blame me, I voted for Kodos
Re:This is news? by bennini · 2005-11-19 15:08 · Score: 1

I think he means "cheating" in a different form... He thinks that a process will "get more CPU time", i.e. be allocated a longer time slice (or be given time slices more often), if it has two internal threads as opposed to one. This is, of course, not true.
Re:This is news? by sbohmann · 2005-11-20 00:22 · Score: 1

Hmm... The Windows NT (and 2000, XP) scheduler has long been known to perform abysmally in cases like this. As soon as there is a thread using the CPU to the max, that is, a process that doesn't yield or wait or make system calls or do IO at least a couple of times a second, the priority of all other processes seems to be set to zero, for some reason I don't understand. You will hardly find similar behaviour on machines running ANY other contemporary OS. A Linux machine won't stop everything else simply because there is a process performing a non-IO-intensive computation for more than a couple of seconds. And a Windows machine shouldn't either.

That's the reason I know people who were using dual-processor machines seven years ago under NT 4: simply so they could read/write e-mail or browse the web or play solitaire while compiling on the same machine. And now Intel comes out with a nasty little trick that makes Windows usable in such situations at a more reasonable price. At least as long as you don't start a second compiler or renderer at the same time ;-)

IMHO, the scheduler should do its job right, so we wouldn't be discussing dual core CPUs for desktop machines, except where necessary in order to increase pure number crunching power. A single, fast CPU is typically the faster alternative on systems with a REAL scheduler. A scheduler that needs a second CPU for the event queue is not one. sigh...
Re:This is news? by Luminous+Coward · 2005-11-20 00:54 · Score: 1

In fact I am so happy with HT that I am going to change my desktop to one, as it is a Linux machine with lots of apps running at the same time. Not to mention that it is also a devel station with SQL+Apache, which benefited from HT in my experience.

(Well, it is time to upgrade anyway, and I choose HT over non-HT.)
Have you looked into socket-939 dual-core Athlons? Specifically, the Athlon64 X2 3800+ (2 GHz, 512 KB L2 cache per core) is available for $320.
Re:This is news? by dindi · 2005-11-20 03:51 · Score: 1

Nope. I am in Costa Rica. You do not want to know how much that $320 proc costs here. And I already have an HT-capable mobo, so I would just need a proc and do not necessarily want to replace half the machine.

Especially because my local store staff just stares at me when I ask about 64bit processors....

Maybe on my next shopping trip to Panama or Miami.
Price compare on that; I will.

It's all in the name by hjf · 2005-11-19 02:22 · Score: 3, Insightful

Well, a technology with a name such as "HyperThreading" is targeted more to end users who don't know about processors, rather than SQL "Performance Tuners" who try to squeeze every cycle of processing power.
HyperThreading might help poorly written thread management (independent audio and video subsystems for example), but not true multithreading, that's for sure.

Re:It's all in the name by prockcore · 2005-11-19 16:02 · Score: 1

Well, a technology with a name such as "HyperThreading" is targeted more to end users who don't know about processors,

Which is odd, since most things that begin with hyper aren't good things.

Hyperventilation, hypertension, and now hyperthreading.
Re:It's all in the name by sbohmann · 2005-11-20 00:28 · Score: 1

It allows one thread / process to overtake another in the execution pipe.
So two of them may be executed in parallel BETWEEN two task switches.
When task switches are rare because of bad scheduling, it thus may improve the multitasking experience.
So HT is a half-hearted replacement for a reasonable scheduler...

The code wasn't changed by ocelotbob · 2005-11-19 02:22 · Score: 5, Informative

I read the Intel assembly guide section regarding hyperthreading, and it clearly states that performance will drop if you don't take the shared cache into consideration. The two logical threads contend for the cache, causing the performance problems that were described. For there to be a true benefit from hyperthreading, either the program, the OS or the compiler needs to determine that hyperthreading is enabled and shape the code so that it uses less than half the cache. It's been known to work that way since the beginning, and frankly, it is silly that MS is scratching their heads wondering why this is. Lower the cache footprint, and I'll be willing to bet that performance rises dramatically.
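
To make "use less than half the cache" concrete, here is a small Python sketch of the arithmetic. Everything in it is illustrative: the 1 MB cache size, the 0.8 headroom factor and the function names are assumptions, and Python itself gives you no real control over cache residency, so treat it as pseudocode that happens to run.

```python
from array import array


def per_thread_budget(cache_bytes, threads_sharing, fraction=0.8):
    """Bytes of shared cache one thread should plan on using.

    `fraction` leaves headroom for code, stack and OS data; 0.8 is an
    arbitrary illustrative value, not a measured one.
    """
    return int(cache_bytes * fraction) // threads_sharing


def chunks(data, budget_bytes):
    """Yield slices of `data` (an array.array) that fit in `budget_bytes`."""
    step = max(1, budget_bytes // data.itemsize)
    for start in range(0, len(data), step):
        yield data[start:start + step]


# Illustrative numbers: 1 MB of L2 shared by two logical CPUs.
budget = per_thread_budget(1024 * 1024, threads_sharing=2)
data = array("d", range(200_000))
total = sum(sum(x * x for x in chunk) for chunk in chunks(data, budget))
print(budget, len(list(chunks(data, budget))))
```

The point is only that the working-set size becomes a parameter derived from the cache and the number of logical CPUs sharing it, instead of a constant baked in for a non-HT part.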

--

Marxism is the opiate of dumbasses

Re:The code wasn't changed by springbox · 2005-11-19 02:27 · Score: 4, Insightful

That's lame. It seems like an extremely BAD idea to get programs to worry about the total cache usage on the CPU. If this is the case, then no wonder performance is suffering. There should be no reason for any programmer to write a threaded application so it's "hyperthreading optimized," especially since HT was seemingly created as a transparent mechanism to increase performance.
Re:The code wasn't changed by baldass_newbie · 2005-11-19 02:33 · Score: 1

You could almost use this to make the argument that only by compiling your OS for your given architecture will you get the performance best suited to your hardware.

(This coming from someone smug enough to have installed Gentoo.)

But seriously, it's tough enough to compile a kernel, say, for a given hardware architecture (not impossible, obviously, just not something one does every day). Now that there is Windows XP and Windows XP x64, you wonder if there won't be additional hardware/architecture-specific distros of Windows.

One of the advantages Mac OS X has is that Apple knows what hardware they're building for, while MS is building Windows for all comers.

--
The opposite of progress is congress
Re:The code wasn't changed by drerwk · 2005-11-19 02:35 · Score: 4, Insightful

It seems like an extremely BAD idea to get programs to worry about the total cache usage on the CPU.
If you want to maximize performance then you want the compiler to know as much as possible about the architecture. If you have no cache then loop unrolling is a good thing, if you have a small cache then loop unrolling can bust the cache. If you are doing large matrix manipulations, how you choose to stride the matrix, and possibly pad it is exactly dependent on the size of the cache. Now, it may be that having the applications programmer worry about it is too much to ask, but the compiler most certainly needs to worry about such detail.
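
As a sketch of the matrix case: a tiled multiply where the tile edge is derived from a cache budget. This is a textbook blocking scheme, not anything a particular compiler emits; the 16 KB figure and the rule of keeping three tiles resident are illustrative assumptions, and pure Python will not show the speed difference, only the structure.

```python
import math


def tile_size(cache_bytes, elem_bytes=8):
    """Largest tile edge t such that three t-by-t tiles fit in cache_bytes."""
    return max(1, math.isqrt(cache_bytes // (3 * elem_bytes)))


def matmul_tiled(a, b, t):
    """Multiply square matrices (lists of lists) in t-by-t tiles."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, t):
        for k0 in range(0, n, t):
            for j0 in range(0, n, t):
                # The three inner loops touch one tile each of a, b and c.
                for i in range(i0, min(i0 + t, n)):
                    for k in range(k0, min(k0 + t, n)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + t, n)):
                            c[i][j] += aik * b[k][j]
    return c


# Tile edge for a whole 16 KB data cache versus half of it under HT.
print(tile_size(16 * 1024), tile_size(8 * 1024))  # prints "26 18"
```

Halving the budget shrinks the tile edge from 26 to 18 doubles, which is exactly the kind of parameter a compiler or library would need to know about.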
Re:The code wasn't changed by springbox · 2005-11-19 02:40 · Score: 3, Insightful

It depends on what your goals are. I do realize that was a fairly general statement, and it does not apply to every application. For something like, let's say, MS SQL Server without a compiler that does it automatically, it would be an unreasonable expectation. If someone was writing an application for an embedded system, however, it might make sense if they chose the HT-enabled processor. Are there any compilers currently that will do HT optimizations? I was under the impression that most commercial apps were basically compiled for the lowest common denominator anyway.
Re:The code wasn't changed by Sique · 2005-11-19 02:48 · Score: 1

Normally, the cache lines a piece of data can occupy are tied to its physical memory page: you can always tell from the address of the memory page which cache lines the data will be loaded into when accessed. If your memory management hands different threads carefully chosen memory pages, two threads will never thrash each other's cache. In this case every thread will only see half of the cache, because its memory pages never get loaded into the other half, without the programmer needing to take care of this. Of course, the programmer then has to make sure that two concurrently running threads always use disjoint cache halves (basically, have 'even' and 'odd' threads and make sure that two of the same type never run in parallel).
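
What is described here is usually called page coloring. A minimal Python sketch of the address arithmetic follows, assuming a physically indexed, set-associative cache; the 1 MB / 8-way / 64-byte-line / 4 KB-page numbers are made up for illustration and are not the parameters of any particular Xeon.

```python
def cache_set(addr, cache_bytes, ways, line_bytes=64):
    """Cache set that physical address `addr` maps to."""
    sets = cache_bytes // (ways * line_bytes)
    return (addr // line_bytes) % sets


def page_color(addr, cache_bytes, ways, page_bytes=4096):
    """Color of the page holding `addr`.

    Pages of different colors map to disjoint groups of cache sets, so
    they cannot evict each other from a physically indexed cache.
    """
    colors = cache_bytes // (ways * page_bytes)
    return (addr // page_bytes) % colors


CACHE = 1024 * 1024  # illustrative 1 MB, 8-way cache -> 32 page colors
for page in (0, 1, 32):
    print(page, page_color(page * 4096, CACHE, ways=8))
```

An allocator that gave 'even' threads only even-colored pages and 'odd' threads only odd-colored ones would produce the two disjoint cache halves described above.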

--
.sig: Sique *sigh*
Re:The code wasn't changed by ochnap2 · 2005-11-19 03:00 · Score: 5, Insightful

That's nonsense. Compilers routinely do loads of optimisations to better suit the underlying hardware. That's why any Linux distro that ships binary packages has many flavors of each important or performance-sensitive package (especially the kernel; in Debian you'll find images optimised for 386, 586, 686, k6, k7, etc). It is also one of the reasons Gentoo exists.

So MS had to make a choise: ship a binary optimized for every possible mix of hw (being the processor the most important factor, but not the only one), which is impossible, or ship images compatible with any recent x86 processor/hw... without being specially optimised for any. That's why hyperthreading performance suffers.

This is an important problem on Windows because most of the time you cannot simply recompile the un-optimised software to suit your hardware, as you can in Linux, etc.

(sorry for my bad english)
Re:The code wasn't changed by Taladar · 2005-11-19 03:12 · Score: 1

Bullshit, threads use the same memory by definition. If you want distinct memory pages you need processes.

--
Linux is not Windows
Re:The code wasn't changed by springbox · 2005-11-19 03:29 · Score: 4, Insightful

I wasn't thinking of compilers. I was mostly talking about the people who have to write the software. Assuming there's no compiler that knows about HT, I stand by my assertion that it would generally be a bad practice to get people to worry about it. Especially these days. Another point that I was trying to make is that even if there were compilers that knew about the HT issues, I still think it's exceedingly stupid that Intel went ahead with HT despite the glaring problems that were mentioned. If people want multiple threads of execution on the same processor then they should get one with two cores.
Lots of programs are designed with the multiple thread model in mind. Programs should not be designed with the multiple thread model plus cache limitations in mind.
Re:The code wasn't changed by Jugalator · 2005-11-19 03:33 · Score: 1

This is an important problem on Windows

And, due to enormous MS dominance, for P4 HT processors as well.

--
Beware: In C++, your friends can see your privates!
Re:The code wasn't changed by DavidTC · 2005-11-19 03:48 · Score: 1

I think you need to go look at the definition of 'thread' the person you responded to was using, and I think you need to realize that hyperthreading is an instance of stealing an already existing term to sound cool.
Absolutely nothing requires both 'halves' of a hyperthreaded CPU to be executing 'threads' that belong to the same process. Which, in fact, makes those not threads, but that is what they are called.

--
If corporations are people, aren't stockholders guilty of slavery?
Re:The code wasn't changed by canavan · 2005-11-19 04:03 · Score: 4, Informative

When optimizing code, the compiler should worry about cache size and cache footprint, so that it doesn't unroll inner loops too far or let the code size grow enough to cause thrashing. HT has just cut in half the maximum cache footprint up to which increasing code size for a possibly minor performance boost makes sense. GCC has an option called --param max-unrolled-insns=VALUE, which controls just that. There are possibly others with similar effects, possibly also for other compilers. Additionally, it may make sense to have the compiler optimize for size instead of speed in some cases.
Re:The code wasn't changed by level_headed_midwest · 2005-11-19 04:06 · Score: 1

Recompiling a kernel in Linux for your particular CPU type is pretty simple.

1. Download the kernel source from your distribution's package manager.
2. Go to /usr/src and extract the source's compressed file if it is not automatically done.
3. Make a link from /usr/src/your_kernel's_extracted_files'_folder to /usr/src/linux.
4. Copy the kernel .config file from /usr/src to /usr/src/linux.
5. Open the kernel config screen by typing in make gconfig (Gnome) or make xconfig (KDE).
6. Go to "processor type" and select your processor type. Exit the config utility, saving changes.
7. Type "make bzImage modules modules_install" and wait about 30 to 90 minutes.
8. Update your bootloader by typing in update-grub or /sbin/lilo.
9. Reboot and you're good to go.

There are a bunch of kernel config howto's online. Find one for your distro and it will be even simpler.

--
Just "gittin-r-done," day after day.
Re:The code wasn't changed by Tim+Browse · 2005-11-19 04:13 · Score: 4, Insightful

It seems like an extremely BAD idea to get programs to worry about the total cache usage on the CPU.

For an application like SQL Server, I'd have to disagree. Are you saying there's no one on the MSSQL team who looks at cache usage? I'd hope there were a lot of resources devoted to some fairly in-depth analysis of how the code performs on different CPUs. After all, after correctness, performance is how SQL Server is going to be judged (and criticised).
Given that a while back I watched a PDC presentation by Raymond Chen on how to avoid page faults etc in your Windows application (improving start-up times, etc), I'd say that Microsoft are no strangers to performance monitoring and analysis.
For your average Windows desktop app, then yes, worrying about cache usage on HT CPUs is way over the top. For something like SQL Server? Hell, no.
Re:The code wasn't changed by ultranova · 2005-11-19 04:38 · Score: 1

Absolutely nothing requires both 'halves' of a hyperthreaded CPU to be executing 'threads' that belong to the same process. Which, in fact, makes those not threads, but that is what they are called.

Actually, they are threads, but they aren't necessarily threads from the same process. Each running process has at least one thread, after all; otherwise there wouldn't be anything running.

Remember that a thread is just a list of instructions to be executed sequentially. Everything running in a computer is a thread.

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:The code wasn't changed by ElvenMonkey · 2005-11-19 04:47 · Score: 2, Interesting

If people want multiple threads of execution on the same processor then they should get one with two cores.

If you read the article / summary you'd see that what it's talking about are servers that come with HT enabled by default. Thinking off the top of my head, I can't come up with a single Intel processor still being sold and used in servers today that doesn't have HT technology built in. We're not talking about people specifically buying HT processors looking to get a performance boost, we're talking about every single individual who is buying or has bought an Intel server within the last year or two. I've certainly noticed HT being enabled by default on our servers over the past couple of years. Novell has always strongly advised against it, so it's always been turned off not long after they're powered up for the first time.

From the way the article reads, the software companies are expressing concern that HT is being enabled by default on servers, rather than that they're baffled why it's causing slower performance. It even goes so far as to point out that the shared L1 and L2 cache is the problem.

--
"Joy is not in things; it is in us." Richard Wagner
Re:The code wasn't changed by quantum+bit · 2005-11-19 04:54 · Score: 1

Absolutely nothing requires both 'halves' of a hyperthreaded CPU to be executing 'threads' that belong to the same process. Which, in fact, makes those not threads, but that is what they are called.

Yes, but the idea of HT is to run threads from the same process on the logical cores. Otherwise you run into exactly the cache problem that is being discussed.

While the scheduler can run different processes on the logical cores, if it does performance will suffer compared to a non-HT system. The way to get any gains out of it is to only use the logical CPU for threading purposes, hence the name.
Re:The code wasn't changed by quantum+bit · 2005-11-19 04:56 · Score: 1

This is an important problem on Windows because most of the time you cannot simply recompile the un-optimised software to suit your hardware, as you can in Linux, etc.

Which is probably why MS is so gung-ho about machine-independent bytecode (.NET) and JIT compiling these days...

Unfortunately you pay huge costs in startup time and memory usage for that.
Re:The code wasn't changed by ocelotbob · 2005-11-19 04:58 · Score: 1

This isn't an issue about overall cache size, this is an issue about cache contention. You'd still have a contention problem to a degree if the cache was 8MB. The ways to do a multithreading processor are: implement a shared cache like Intel did and expect programs not to use all the cache; use a partitioned cache, which would lower performance on multithreaded apps that share memory resources and on unequally sized threads; or add cache partition control instructions, which would further increase CPU complexity and code size. Regardless of how you look at it, multithreading has a tendency to step on toes and requires one to program differently than in a single-thread design. Whatever the approach, you're not going to achieve optimal performance unless you're willing to modify your code to suit the uniqueness of the hyperthreading model.

--
Marxism is the opiate of dumbasses
Re:The code wasn't changed by Captain+McCrank · 2005-11-19 05:29 · Score: 1

So MS had to make a choise:
Knock it off you knuckle head! *bops curly on head*
Re:The code wasn't changed by great_snoopy · 2005-11-19 05:32 · Score: 1

Not true. There are some _very_ sensitive areas where cache-sensitive coding can improve performance _a_lot_. For example, making a certain loop or a certain intensively used data structure fit into the processor's cache will greatly improve performance, because the cache is much faster than general purpose memory. The benefits of this awareness are much greater than those of using processor-specific instructions (586, 686...). Just take your usual benchmark and compare the cache's bandwidth with the RAM bandwidth, then consider a loop or a data structure that will be run/accessed _a_lot_ of times (a sort operation, a matrix operation and a lot of other things) and you will get an idea of the magnitude of the benefits.

I am not dismissing the benefits of compiling for CPU-specific instructions (though the actual speed improvements obtained with this technique, widely used for Gentoo for example, are not so great in real life). What I want to point out is that smart programming (hard to find today...), even using plain old 386 instructions, can give much better performance improvements than using a few processor-specific instructions. Professional game programmers usually employ this "smart programming" in order to optimize certain known critical operations.

Of course, there are a lot of gray tones between the two extremes (coding in assembler and arranging instructions/structure sizes by hand on one side, relying only on compiler optimizations on the other), and the best thing is to place yourself on a gray tone that balances development time against performance.
Re:The code wasn't changed by DavidTC · 2005-11-19 06:06 · Score: 1

No, you run into this cache problem anyway. You have this problem if the application has been optimized to use the cache, because the cache, mysteriously, has other threads of execution, either within the same program, or another program, using it and undoing all that careful work of optimizing the code so it fits.
Granted, it may be possible for you to go and fix your other threads so they don't do this on hyperthreading, although I don't know how. But you certainly can't stop other people's code from running. (And, hell, they have as much right to the cache as you.)
Now, this problem has always existed, but it used to happen at process switch, and thus was under the control of the OS, so it could be tuned. Now the damn processor says 'Hey, I've got an idle second while doing memory access, let me run this other code', thus knocking a tiny bit of your process out of cache, so when your process restarts, it has to do another memory access, at least to L2 cache.
Basically, this is the 'timeslice question'...how long should we let each process control the CPU. Except, you know, removed from the control of the OS and applications. I don't think hyperthreading is a good idea at all.
Now, if they are different processes, however, you can run into a whole nother set of problems. Like those driver blue-screens other people were talking about. I'm sure the driver is doing something it shouldn't do, but I have a feeling the crash has something to do with hyperthreading and ring 0 access.

--
If corporations are people, aren't stockholders guilty of slavery?
Re:The code wasn't changed by great_snoopy · 2005-11-19 06:43 · Score: 1

Hidden was the initial sense of the thing. However, being aware of the cache and using this information (without even altering subplatform compatibility, as using specific processor instructions would!) can provide quite significant improvements at no other cost than using cpuid once. As for benefits, certain repetitive operations can benefit a lot. For example, array operations that fit into the cache can run significantly faster (you can test this yourself). These kinds of operations occur a lot when dealing with image manipulation. Of course this won't matter for your usual desktop application, but it does matter in a few particular applications, where it will be of great benefit. I agree with you that cache is always a good thing, and I agree that it was initially supposed to be transparent; however, knowing about it can in SOME cases (a few, to be correct) give a much better improvement than some people would think.
Re:The code wasn't changed by tjma2001 · 2005-11-19 06:55 · Score: 1

No Program Should EVER have to interfere with memory management ....that is the job of the OS and it should stay that way. Sure, you can squeeze more out of your processor by taking a shortcut here and there if you are a programming guru. But gurus are scarce, and producing faultless code that deals directly with memory management is nearly impossible. If you want better performance for your SQL then I suggest you invest in a nice heavily multithreaded OS. It's no secret that Windows is really bad at multiprocessing. They only introduced multithreading post Windows 93, whereas these Unix gurus have been doing it since the 1970s. At the moment that part of Windows is really bad. With luck there will be improvement in the future.
Re:The code wasn't changed by nigelo · 2005-11-19 07:19 · Score: 1

"No Program Should EVER have to interfere with memory management"

I can't say that I agree with the implied "don't worry about memory access strategies": there are many opportunities for application programmers to thrash a system's memory needlessly by flushing the various caches due to poor program design.

--
*Still* negative function...
Re:The code wasn't changed by TimboJones · 2005-11-19 09:59 · Score: 1

(HT) is not equal to Multiple Cores

That's the point. People who buy multi-core servers are looking for a perf improvement over single-core. People who buy hyperthreaded servers are just buying servers, and these servers default to non-optimal settings.
Re:The code wasn't changed by arodland · 2005-11-19 13:10 · Score: 1

Maybe you need to spend just a little more time in the real world. "Don't screw around with all of that stuff, it's taken care of for you at a lower level" is a wonderful philosophy, and it's useful most of the time, but when you're working on something big and important where performance sells, then mucking around with every last underlying piece of the system, whether it's polite or not, is just what you do.

Yes, that was a horrible sentence, but fortunately for me slashdot doesn't take away points for comma splices.
Re:The code wasn't changed by drsmithy · 2005-11-19 13:59 · Score: 1

It's no secret that windows is really bad at multiprocessing.
This sounds like typical anti-Microsoft FUD.
They only introduced multithreading post windows 93 whereas these unix gurus have been doing it since the 1970's. At the moment that part of windows is really bad. With luck in the future there will be improvement.
Right, because BeOS (that only "introduced multithreading" when it was first written in the early 90s) was renowned for being "really bad at multiprocessing". Clearly not having been around since the 70s was a major problem for them.
Windows NT was built from the start to be heavily multithreaded and work well in multiprocessing environments. It was designed by the same people who built VMS. You'd think they might have a rough idea of how to handle "multiprocessing" reasonably well.
Re:The code wasn't changed by drsmithy · 2005-11-19 16:19 · Score: 1

That's why any linux distro that ships binary packages has many flavors of each important or performance sensitive package (specially the kernel, in Debian you'll find images optimised for 386, 586, 686, k6, k7, etc). Is one of the reasons of the existence of Gentoo, also.
Do you have any benchmarks that show CPU-specific compiler optimisations, for OS level code, give any meaningful general performance improvement ?
So MS had to make a choice: ship a binary optimized for every possible mix of hw (being the processor the most important factor, but not the only one), which is impossible, or ship images compatible with any recent x86 processor/hw... without being specially optimised for any. That's why hyperthreading performance suffers.
HT performance is an OS scheduler issue, it has nothing to do with "special" compilation optimisations.
Re:The code wasn't changed by B1 · 2005-11-19 20:44 · Score: 1

Maybe you need to spend just a little more time in the real world. "Don't screw around with all of that stuff, it's taken care of for you at a lower level" is a wonderful philosophy, and it's useful most of the time, but when you're working on something big and important where performance sells, then mucking around with every last underlying piece of the system, whether it's polite or not, is just what you do.

How much performance gain are you going to realize? 100%? 50%? 5%? 0.5%?

One of the nice things about abstractions is that they make it easier for you to write maintainable code. In the real world, delivery dates count, and abstractions help you meet deadlines.

It's not worth spending months on hand-tuning a rendering engine if it's more cost-effective to simply throw more expensive/faster hardware at the problem. If your rendering engine is going to take an extra three months to hit the market because you've decided to hand code parts in assembly, your customers may have already walked by the time you finish--especially if they need to start rendering today. In the mean time, the three months you spend hand-tuning your rendering engine might be better spent on adding features for the next version.

Specializing your codebase for a particular platform (OS/hardware) makes sense if you need to squeeze out every last ounce of performance, but in the real world it doesn't happen in a vacuum. Specialize too much for a platform, and you might quickly find yourself locked in to using an obsolete platform. There's a tradeoff between long-term maintainability/portability and short-term performance. The more dependent your code is on a particular platform today, the more it's going to cost you when your customer wants to change platforms down the road.

Platform changes do happen. If your customers want to migrate to a new platform and you can't migrate with them, they're going to look for alternatives.

For example, if you find that you get performance boosts by using undocumented function calls in a particular version of Windows, you run the risk that those functions will change or disappear between versions--thereby breaking your application. If your TurboWord2000 doesn't run on Windows XP because of such a change, your customers aren't going to wait around for you to fix your program.

Suppose you have a CPU intensive server application written in C / C++. For performance, you decide to rewrite some of the most CPU intensive code with hand-coded inline x86 assembly. Because instruction ordering can play a big part in software performance, your hand-coded assembly might run very well on an Athlon, but then run very poorly on a Pentium4 if your instruction ordering results in a high number of pipeline flushes / stalls. Some compilers generate several CPU-specific branches of code because of this--are you willing to go to this effort? And needless to say, your code won't run on anything other than x86 hardware (unless you're willing to maintain parallel versions).

Tuning your product to work well on a particular platform makes sense, but after a point, the performance gains you'll realize are likely minimal given the tuning effort on your part. After a point, it's not worth the cost, time, or platform-dependency risks. Your time is precious, and life is short. As much as it pains me to say this, it's often more cost-effective to just throw hardware/money at the problem.
Re:The code wasn't changed by oztiks · 2005-11-19 23:07 · Score: 1

Not knowing very much about HT in particular, I was wondering if you could post a URL about this. I know that the L2 cache stores both data segments and code segments, but when you're discussing this particular problem, I'd like to know whether you mean the performance lag comes from the time it takes the system to switch execution states by using the cache, or from actual data cache usage?
Re:The code wasn't changed by ocelotbob · 2005-11-20 00:09 · Score: 1

The actual cache usage is the performance hit here. The big issue with hyperthreading is that it was designed a bit on the cheap, so certain control structures dealing with controlling cache filling aren't in place - the two logical threads fight over who fills the cache. As a result, there's a greater chance of cache misses, causing the CPU to have to fetch data from the significantly slower main memory. Intel has an excellent basic article describing the issues regarding various code/data block sizes and cache sizes, explaining the difference between how to code for hyperthreading processors and how to code for non-hyperthreading processors. What it boils down to is you've got to design your code to use less than half the cache size if you want hyperthreading to be effective.
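
The "less than half the cache" rule of thumb above works out to simple arithmetic. A minimal sketch, assuming the cache is split evenly between the logical threads and that a little headroom is kept for other data; the sizes, the headroom fraction, and the function names are illustrative assumptions, not figures from Intel's article.

```python
def per_thread_budget(cache_bytes, logical_threads=2, headroom=0.25):
    """Bytes each logical thread can use if the threads share one cache evenly.

    headroom is the fraction of each share left free for stack, code and
    other incidental data (an assumed value; tune it by measurement).
    """
    share = cache_bytes // logical_threads
    return int(share * (1 - headroom))


def max_block_elems(cache_bytes, elem_size, logical_threads=2, headroom=0.25):
    """Largest block, in elements, that stays inside one thread's budget."""
    return per_thread_budget(cache_bytes, logical_threads, headroom) // elem_size
```

With a 1 MiB L2 and two logical threads, this gives 384 KiB per thread, or 49,152 eight-byte values per block.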

--
Marxism is the opiate of dumbasses
Re:The code wasn't changed by kimvette · 2005-11-20 01:02 · Score: 1

The optimal solution to this is intelligent process and thread scheduling in the Windows kernel.

(the above should also apply to Linux, *BSD, etc.)

Another solution (possibly a better-performing one) is to distribute multiple binaries and have the installer pick the appropriate one, optimized for the system's processor, but then the release engineer's life will be hell because the release engineer will have to run multiple builds per project, the installer would be more complex, plus QA's test matrix would be a hell of a lot larger. Of course that would drive the cost of software way up. For some situations it would work and is already being done (e.g., multiple kernels shipping with an OS, multiple HALs shipping with Windows) but for your average $30 to $70 game "it just ain't gonna happen".

--
The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50
Re:The code wasn't changed by Fartacus · 2005-11-20 02:23 · Score: 1

Yeah it is a bad idea to get programs to worry about total cache usage, or to even detect and optimize usage for hyperthreading. But I thought that hyperthreading was actually created as a band-aid for Intel's broken NetBurst architecture. The deeply pipelined architecture had so many pipeline bubbles and stalls that Intel decided to fill those bubbles and stalls (i.e. use the idle hardware) by adding support for an additional thread.
Re:The code wasn't changed by Sique · 2005-11-21 04:20 · Score: 1

So to optimize code for HT we should make sure that different threads of the same process are running at the same small code and data fragments, so the code and data caches are not flushed due to thread switches. Process (and thus context) switches are bad enough with having to reload the whole cache.
There are not many problems where this is the case (ray tracing comes to mind, where different rays run on the same vector set and use the same code to get evaluated).

If different threads have to run at different data or code fragments, then THOSE should be optimized in RAM in a way that they don't purge each other out of the cache. This can be achieved with sophisticated memory mapping, but I guess it's hell of an optimizing problem.

--
.sig: Sique *sigh*

Poor mans dual-core by IdleTime · 2005-11-19 02:22 · Score: 4, Interesting

indeed has once again proved it is expensive to be poor.

Question I find more interesting: What is the performance gap between dual CPU vs Dual-core?

--
If you mod me down, I *will* introduce you to my sister!

Re:Poor mans dual-core by Chewbacon · 2005-11-19 02:36 · Score: 2, Informative

I have a dual core on my desktop at home and HT on my machine at work. I'll take the dual core over HT any day that ends with a Y. You can multi-task so well with the Pentium D it becomes blissful. Want to archive a DVD movie and put your favorite CD on your mp3 player? Set the two apps to run on different cores. On the other hand, my HT workstation goes nuts-slow if I try to do two intensive tasks at once.

--
Chewbacon
The Bible is like Wikipedia: written by a bunch of people and verifiable by questionable sources.
Re:Poor mans dual-core by dsci · 2005-11-19 02:47 · Score: 5, Insightful

What is the performance gap between dual CPU vs Dual-core?

It's the usual answer: it depends.

We have to get rid of the notion that there is one overall system architecture that is "right" for all computing needs.

For general, every-day desktop use, there should be little difference between a dual CPU SMP box and a dual core box.

I have a small cluster consisting of AMD 64 X2 nodes, and the nodes use the FC4 SMP kernel just fine. All scheduling between CPU's is handled by the OS, and MPI/PVM apps run just as expected when using the configurations suggested for SMP nodes.

In fact, with the dual channel memory model, dual core AMD systems might be a little better than generic dual CPU, since each processor has its "own" memory.

--
Computational Chemistry products and services.
Re:Poor mans dual-core by raynet · 2005-11-19 02:53 · Score: 1

Even though multi cpu systems often are more responsive, I've rarely had any problems running multiple programs with single cpu, even with Windows. Quite often I burn 2 DVDs simultaneously (and the dvd-drives are even connected to the same IDE cable) and still can use my computer for surfing the web, chatting in irc etc. Even rendering something in the background doesn't affect the foreground applications as long as you remember to set the rendering process to a lower priority.
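
On a Unix-like system the same trick (lowering the priority of the background job) can be scripted. A minimal sketch using Python's os.nice; the increment of 10 is an arbitrary choice and the helper name is made up for this example. On Windows the equivalent is changing the process's priority class, which this sketch does not cover.

```python
import os


def lower_priority(increment=10):
    """Raise this process's niceness (lower its priority); return the new value.

    An unprivileged process may always raise its niceness, but normally
    cannot lower it again afterwards. Child processes started later
    inherit the lowered priority.
    """
    return os.nice(increment)
```

Calling lower_priority() at the top of a render script, before the heavy work starts, is usually enough to keep foreground applications responsive.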

--
- Raynet --> .
Re:Poor mans dual-core by Soybean47 · 2005-11-19 02:58 · Score: 1

Well, it depends on the application. In my case, I'm writing a clustered simulation program, and HT does miraculous things. It's so good, that I find it implausible that a dual-core system will do better. In my tests, a computer can literally do twice as much work in a given amount of time if it has HT turned on. In theory, dual core shouldn't more than double performance, so, for my application, I doubt there will be a performance gap at all. I'll know for sure in a couple of weeks; I've got some low-end Pentium D systems on their way.
Re:Poor mans dual-core by masklinn · 2005-11-19 03:12 · Score: 2, Informative

In fact, with the dual channel memory model, dual core AMD systems might be a little better than generic dual CPU, since each processor has it's "own" memory.

Nope, both cores use the same bridge to access central memory so that point is moot. On the other hand, the cores of an AthlonX2 get to discuss with one another through a special link while regular multiprocessor have to use the FSB (or HyperThreading for AMD's Opterons) link, and therefore have to compete with every other device using said FSB/HT (on top of getting much higher latencies)

--
"The way we can tell it's C# instead of Haskell is because it's nine lines instead of two." -- wadler
Re:Poor mans dual-core by Malor · 2005-11-19 03:50 · Score: 4, Informative

I think you're kind of saying this already, but I felt confused by your wording and thought I'd chime in. I'm a little blurry on a few of these details, and too lazy to go look things up, so pay attention to replies... don't treat this as gospel.

As far as I know, all multi-cpu AMD packages use exactly the same method to talk amongst themselves, HyperTransport. They absolutely use a private, dedicated HT bus between cores. I *think* that when you run two single core Opterons, each has a link to main memory, and they also share a direct link. In the case of a 4-die system, I think the third and fourth CPUs 'piggyback' on the 1st and 2nd... they talk to processors 1 and 2, and each other. Processors 1 and 2 do main-memory fetches on their behalf. Each CPU has its own dedicated cache, and I think the cache ends up being semi-unified... so that if something is in processor 2's cache, when processor 4 requests the data, it comes from processor 2 instead of main memory. That's not quite as fast as direct cache, but it's a LOT faster than the DRAM.

The X2 architecture is like half of a 4-way system. There's one link to main memory, and one internal link between the two CPUs... the second one is piggybacking, just like processors 3 and 4 do in the 4-way system. It's not quite as good as a dedicated bus per processor, but the AMD architecture isn't that bandwidth-starved, and a 1gb HT link is usually fine for keeping two processors fed. You do lose a little performance, but not that much.

Intel dual cores share a single 800mhz bus, with no special link between the chips. And the Netburst architecture is extremely memory bandwidth hungry. Because of its enormous pipeline, a branch mispredict/pipeline stall hurts terribly. The RAM needs to be very very fast to refill the pipeline and get the processor moving again.

So running two Netburst processors down a single, already-starved memory bus is just Not a Good Idea. It's a crummy, slapped-together answer to the much, much better design of the AMD chips. It's a desperate solution to avoid the worst of all possible fates... not being in a high-end market segment at all.

Next year this could all be different again, but at the moment, AMD chips, particularly dual core, are a lot better from nearly every standpoint.
Re:Poor mans dual-core by volsung · 2005-11-19 04:03 · Score: 4, Informative

That's not quite true either. Each Opteron has a separate memory controller (dual-channel), which means that each CPU can have its own pipe to a bank of memory. So if the CPU needs to access memory in its banks, it will not have to contend with the other CPU over the HT link. A NUMA-aware OS will try to schedule processes on the same CPU which controls the process's allocated memory. If your programs can fit in one CPU's memory bank, then you can get bus contention down pretty low.
This is why SMP makers are going nuts over the Opteron. Your effective memory bandwidth scales linearly with the number of processors, assuming your processes partition nicely.
Re:Poor mans dual-core by Anonymous Coward · 2005-11-19 04:31 · Score: 1, Insightful

Cache memory coherency.

Dual CPU/SMP setups have their own caches on EACH cpu involved.

Dual Core CPU's have a shared cache between cores "on die" (same physical CPU)!

(Thus, theoretically (more than that, practically) don't have the same "issues" on checking for concurrent data accesses between CPU cores as Dual Core CPU's do, since their cores are physically separated... @ least, that's my understanding of it).

Plus, iirc, cache data access on a Dual Core CPU should be faster than that of Dual CPU/SMP setups... no overheads on 'sync checks' between them, as to the data they are working on, as well as communications between them (since they are on physically separated CPU's in a Dual CPU/SMP setup, whereas DualCore CPU's have BOTH PROCESSORS on 1 single die & a shared cache they both use, no "over the motherboard cpu cache checking synchronization required").

* If I am incorrect, please, anyone - feel free to "set me straight" here as to my response to this person's questions!

APK
Re:Poor mans dual-core by Anonymous Coward · 2005-11-19 04:34 · Score: 2, Interesting

You're lying. All HT does is schedule a single processor's execution units so as to achieve more parallelism out of the code stream than can be obtained from instruction level parallelism using OoO execution. The only conceivable way for this to be doubling your performance, would be if your output code were so unbelievably awful as to merit some form of parade in its honor for sucking so completely. The only way for an actual SMP system (dual core or dual processor) to not improve upon this would be if your algorithm doesn't scale to more than two threads, since the Pentium D affords both two cores and HT. Or you could just disable HT and get a compiler that doesn't suck monkey ass so much that the P4 can't execute more instructions in parallel.
Re:Poor mans dual-core by SuperQ · 2005-11-19 05:36 · Score: 1

But the parents were not talking about Opteron, they were talking about Athlon X2, which does not have multiple memory controllers, it simply has a crossbar switch between the cores, and a single memory controller.
Re:Poor mans dual-core by InvalidError · 2005-11-19 05:37 · Score: 4, Informative

AMD Opterons each have their own local RAM and can access each other's RAM over the HT links to form a cache-coherent non-uniform memory architecture - ccNUMA.

Multi-core Opterons have a special internal crossbar switch that allow the cores to share the memory controller and HT links, they do not 'piggy back' on the other. This reduces latencies and increases bandwidth for communication between the two cores and gives both cores the equal-opportunity access to the HT ports and CPU's local RAM. With a NUMA-enabled OS, applications will run off the CPU's local RAM whenever possible to minimize bus contention and this allows Opteron servers' overall bandwidth and processing power to scale up almost linearly with the number of CPUs.

As for Intel's dual-cores, the P4 makes sub-optimal use of its very limited available bandwidth. Turning HT on in a quad-core setup where the FSB is already dry on bandwidth naturally only makes things worse by increasing bus contention. Netburst was a good idea but it was poorly executed and the shared FSB very much killed any potential for scalability. If Intel gave the P4 an integrated RAM controller and a true dual-core CPU (two cores connected through a crossbar switch to shared memory and bus controllers like AMD did for the X2s), things would look much better. I'm not buying Intel again until Intel gets this obvious bit of common sense. The CPU is the largest RAM bandwidth consumer in a system, it should have the most direct RAM access possible. Having to fill pipelines and hide latencies with distant RAM wastes many resources and a fair amount of performance - and to make this bad problem worse, Intel is doing this on a shared bus. Things will get a little better with the upcoming dual-bus chipsets with quad-channel FBDIMM but this will still put a hard limit on practical scalability thanks to the non-scalable RAM bandwidth.

On modern high-performance CPUs, shared busses kill scalability. AMD moved towards independent CPU busses with the K7 and integrated RAM controllers with the K8 to swerve around the scalability brick wall Intel was about to crash into many years ago and has kept on ramming ever since. Right now, Intel's future dual-FSB chipset is nothing more than Intel finally catching up with last millennium's dual-processor K7 platforms, only with bigger bandwidth figures.
Re:Poor mans dual-core by Malor · 2005-11-19 07:28 · Score: 1

That's multi-core Opterons, though, not the X2? The X2, and the multi-die single-core Opterons do work how I thought?
Re:Poor mans dual-core by Bishop · 2005-11-19 08:32 · Score: 1

Have a look at the X2 block diagram. Both cores have direct access to the crossbar switch. The Opteron is the same but with more HT lanes. The situation you describe where one core or cpu accesses ram through another core is a mainboard implementation issue. Ideally each processor (of one or more cores) should have local ram. However not all boards are wired up that way. Most are.

The key here is not that there can be shared ram buses. The key is the crossbar switch versus the shared bus. The switch allows core_A to access ram while core_B accesses the HT.
Re:Poor mans dual-core by SageMusings · 2005-11-19 08:45 · Score: 1

Chewbacon,

How do you "set" an app to target a particular core?

--
-- Posted from my parent's basement
Re:Poor mans dual-core by Chewbacon · 2005-11-19 08:46 · Score: 1

Setting priority is one thing I don't have to do with the Pentium D. The HT I have, priorities are my friend.

--
Chewbacon
The Bible is like Wikipedia: written by a bunch of people and verifiable by questionable sources.
Re:Poor mans dual-core by InvalidError · 2005-11-19 09:28 · Score: 1

What are the differences between X2 and Opteron?
1) Athlons only have one usable HT link while Opterons have three
2) Opterons require buffered DIMMs
3) Opterons can use ECC DIMMs

Other than that, the two are fundamentally the same. The first A64s (S940) were Opterons with defective HT links or surplus Opteron production with disabled HT links, just like Celerons are Pentiums with slow/defective cache blocks or surplus production lobotomized to avoid flooding the market with high-end parts that would kill ASPs. Pentiums magically disappear from the retail channel at the $178 mark thanks to this artificial scarcity. This is why Intel loves unified cores so much - and probably part of their motive to divorce their dual-core chips: this also spares them the trouble/embarrassment of introducing dual-core Celerons to liquidate their surplus Pentium-D cores.

If you compared Athlon 64 with Opteron (and their dual-core equivalents), I would expect the only observable technical differences to boil down to the two extra HT links and buffered ECC DIMM support circuitry in the Opteron.
Re:Poor mans dual-core by PygmySurfer · 2005-11-19 11:47 · Score: 1

On the other hand, the cores of an AthlonX2 get to discuss with one another through a special link while regular multiprocessor have to use the FSB (or HyperThreading for AMD's Opterons) link

That's HyperTransport, not HyperThreading (I'm sure that's what you meant) :)
Re:Poor mans dual-core by Chewbacon · 2005-11-19 12:25 · Score: 1

Set the processor affinity just as if you have multiple CPUs.
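
On Windows this is done from Task Manager ("Set Affinity"); on Linux, taskset or a system call does the same. A minimal Python sketch for Linux, using os.sched_setaffinity (which is not available on Windows or macOS); the helper name is made up for this example.

```python
import os


def pin_to_cpu(cpu, pid=0):
    """Restrict a process (0 = the caller) to a single logical CPU.

    Linux-only: os.sched_setaffinity is missing on Windows and macOS.
    Returns the affinity set reported by the kernel after the change.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity is not exposed on this platform")
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)
```

Note that on a Hyper-Threading machine two logical CPUs may be siblings on one physical core, so pinning two heavy jobs to different logical CPUs does not by itself guarantee they land on separate cores.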

--
Chewbacon
The Bible is like Wikipedia: written by a bunch of people and verifiable by questionable sources.
Re:Poor mans dual-core by PygmySurfer · 2005-11-19 12:52 · Score: 1

Each core on an Athlon 64 X2 has its own dedicated L2 cache. Pretty graph here.
Re:Poor mans dual-core by drsmithy · 2005-11-19 14:03 · Score: 1

Want to archive a DVD movie and put your favorite CD on your mp3 player? Set the two apps to run on different cores.
You really shouldn't mess around with (relatively) low level scheduling behaviour like this. The OS will schedule appropriately and is far more capable of responding to changing system load dynamically to maximise performance than you are.
Re:Poor mans dual-core by drsmithy · 2005-11-19 14:06 · Score: 1

AMD moved towards independant CPU busses with the K7 [...]
AFAIK the Athlon [MP] didn't have "independent CPU busses".
Re:Poor mans dual-core by InvalidError · 2005-11-19 18:00 · Score: 1

http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_809_4368,00.html

"Smart MP technology for smarter multiprocessing:
* Dual point-to-point, high-speed system buses
* Innovative bus snooping capability
* Optimized MOESI cache coherency protocol"

The K7 had point-to-point CPU-to-chipset bus and (IIRC) an auxiliary CPU-to-CPU snooping bus.
Re:Poor mans dual-core by PygmySurfer · 2005-11-19 18:54 · Score: 1

I would assume Intel's dual cores are the same, though I haven't looked into them much. Indeed, it looks like they use 1 MB of L2 per processor (found here).
Re:Poor mans dual-core by volsung · 2005-11-20 03:18 · Score: 1

From original post:
...while regular multiprocessor have to use the FSB (or HyperThreading for AMD's Opterons) link, and therefore have to compete with every other device using said FSB/HT (on top of getting much higher latencies)

It's been that way since day one, desktop as well. by Circuit+Breaker · 2005-11-19 02:22 · Score: 1, Insightful

Those of us who care to measure for ourselves rather than buy Intel's propaganda have noticed this long ago. I bet the people quoted in the article noticed it long ago as well, but it has only recently become "politically correct" to share that knowledge.

Behold! by alphapartic1e · 2005-11-19 02:23 · Score: 5, Funny

Perhaps this ushers a new era of computing, where Intel chips underperform AMD ones.

Oh, wait...

sort of obvious by Vlad_the_Inhaler · 2005-11-19 02:24 · Score: 4, Informative

If you have a system thread cleaning out blocks of disk cache memory then of course it is going to suffer. The whole point of hyperthreading was that one thread could run while another was waiting for I/O.

The first tests on Linux when Hyperthreading came out were also pretty discouraging.

--
Mielipiteet omiani - Opinions personal, facts suspect.

Re:sort of obvious by gtoomey · 2005-11-19 02:38 · Score: 1

It's obviously L2 cache, not disk cache.
Re:sort of obvious by Vlad_the_Inhaler · 2005-11-19 02:53 · Score: 1

I work on a mainframe.

Altered data is written back to disc pretty quickly but left in cache as long as possible for obvious reasons. Clearing stuff out of cache is basically a process of deciding which data pages have overstayed their welcome. I/O does not take place.

The whole idea of this is that a SW/HW stop should not cause data loss. All updates are also written to a separate Audit device as well.

--
Mielipiteet omiani - Opinions personal, facts suspect.
Re:sort of obvious by timeOday · 2005-11-19 05:05 · Score: 2, Insightful

The whole point of hyperthreading was that one thread could run while another was waiting for I/O.
Huh? You don't need hyperthreading for that, it's just normal multitasking.
Re:sort of obvious by Mateorabi · 2005-11-19 06:08 · Score: 2, Informative

Except that in multitasking, when a process blocks and swaps you suffer hundreds to thousands of cycles while the OS swaps out process structs, rewrites VM tables, etc. This usually happens at the OS syscall level too.
In hyperthreading, one thread simply stops contending for functional units for 10s of cycles, letting the other, already loaded and running thread max out its ALU/FPU usage while the first waits for cache to get filled from DRAM. This is much finer granularity: the OS doesn't force a swap penalty for every single cache miss, because the act of swapping is way more expensive.
The problem is that if both threads are simultaneously memory (vs CPU) intensive then you end up with two waiting threads and don't see a performance boost. Even worse, they both start fighting over the same cache lines. This is the HT equivalent of virtual memory thrashing, only it's at the DRAM/cache level instead of the disk/DRAM level.
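
The thrashing described above can be seen in a toy model. This is a minimal sketch, assuming a fully associative LRU cache and two threads whose accesses alternate strictly; real caches are set-associative and real interleaving is messier, so treat the numbers as an illustration only. All names here are made up for the example.

```python
from collections import OrderedDict
from itertools import chain


def hit_rate(accesses, cache_lines):
    """Fraction of accesses that hit in a fully associative LRU cache."""
    cache = OrderedDict()
    hits = total = 0
    for line in accesses:
        total += 1
        if line in cache:
            hits += 1
            cache.move_to_end(line)          # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)    # evict the least recently used line
    return hits / total if total else 0.0


def loop_over(tag, n_lines, passes):
    """One thread repeatedly sweeping its own working set of n_lines lines."""
    return [(tag, i) for _ in range(passes) for i in range(n_lines)]


def interleave(a, b):
    """Alternate accesses from two threads, as if they shared the core."""
    return list(chain.from_iterable(zip(a, b)))
```

With a 100-line cache, one thread sweeping 75 lines ten times hits 90% of the time on its own. Two such threads interleaved never hit at all, because together they cycle through 150 lines and LRU evicts each line before it is reused. Shrink each working set to 50 lines (half the cache) and the interleaved hit rate is back to 90%.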

--
"You saved 1968." - Ms. Valerie Pringle to the crew of Apollo 8
Re:sort of obvious by timeOday · 2005-11-19 06:21 · Score: 1

Fair enough. I'm not used to thinking of main memory access as "I/O" but maybe that's just me.

Marketing ploy. by Chickenofbristol55 · 2005-11-19 02:25 · Score: 2, Insightful

I don't want to start a flamewar, but every time I see an Intel commercial where the announcer says "pentium 4 with ht technology", it sounds like a stupid marketing ploy. It's supposed to offer better performance in heavily threaded apps, but apparently it doesn't. Also, the commercials never explain to the customer what HT is; if they had a great piece of technology, they would at least take 10 seconds to explain the benefits, but they never do. They say a catch phrase, and that's really what it all seems to boil down to.

--
public class null extends java applet { System.out.print ("Tabula Rasa"); }

Re:Marketing ploy. by gbjbaanb · 2005-11-19 02:33 · Score: 1

However, in the adverts that are targeted at consumers, with their desktop applications, the (relatively simple) threading benefits of HT will make their computer seem more responsive at the very least.

I doubt many desktop apps use lots of CPU running lots of similar threads like SQLServer does (and other high-load applications like MySQL and Apache that also do not perform as well with HT turned on).

In an advert, the bing-bong-bung-bong jingle takes longer than any explanation anyway - you surely didn't expect them to explain *anything* except 'buy one of these now'.
Re:Marketing ploy. by dnoyeb · 2005-11-19 05:57 · Score: 1

Intel has been marketing this way for as long as I can remember starting with the built in co-processor. It has worked well for them too.

It is a marketing ploy. In multiprocessing world performance and responsiveness mean distinct things. In the marketing world I guess intel felt it was ok to substitute the word 'performance' where responsiveness should have been.
Re:Marketing ploy. by AvitarX · 2005-11-20 04:46 · Score: 1

Well they don't need to tell us. We already know that Pentium technology "makes the internet faster". It would be redundant to tell us again.

--
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg

Figures by xouumalperxe · 2005-11-19 02:25 · Score: 5, Interesting

Well, AFAIK, the HTT thing only allows for the processor to sort of split execution units (FPU, ALU, etc) so that one can work on one thread, the other on another one. If an application resorts heavily to one of those units -- and my somewhat uninformed feeling is that software like SQL probably works mostly on the ALU -- it can't possibly GAIN performance. On the other hand, I can see the effort of trying to pigeonhole the idle threads on the wrong execution unit (will it even try that?) completely borking performance. So yeah, no surprises here.

Re:Figures by MourningBlade · 2005-11-19 05:45 · Score: 2, Informative

Well, AFAIK, the HTT thing only allows for the processor to sort of split execution units (FPU, ALU, etc) so that one can work on one thread, the other on another one. If an application resorts heavily to one of those units -- and my somewhat uninformed feeling is that software like SQL probably works mostly on the ALU -- it can't possibly GAIN performance. On the other hand, I can see the effort of trying to pigeonhole the idle threads on the wrong execution unit (will it even try that?) completely borking performance. So yeah, no surprises here.
Intel and AMD's Hyperthreading is one-at-a-time: the chip is either working as CPU 0 or it's working as CPU 1. I believe the newest Power5 is the only chip out there that can divvy up the internal units amongst the threads and work simultaneously.
And they have noticed performance improvements, though (once again) you have the cache size issue. Then again, a Power5 has an enormous CPU cache compared to an x86 processor.
Re:Figures by iabervon · 2005-11-19 07:29 · Score: 1

Actually, software like databases is probably limited by memory bandwidth, not any execution unit in the processor. So hyperthreading won't help the database any, because it doesn't add any memory bandwidth. On the other hand, it probably does let other code running on the machine get more done while the database threads are waiting for data, which in this case gives more time to random Windows cruft at the expense of useful work (that is, other processes get more quickly to the point of making memory accesses, and use more of the cache, since they can run while the database is waiting for data).

How is this news? by logicnazi · 2005-11-19 02:30 · Score: 3, Informative

This sort of effect has been talked about for as long as I remember hearing about hyperthreading. It was common knowledge long before the chips came out that running two threads on the same cache can cause performance issues. One can see this with two chips sharing an L2 cache so why should it be a surprise here?

The real question is whether this issue can be optimized for. If the developers design their code with HT in mind, will this still be a problem, since the other thread may belong to another process, or would properly optimized code be able to deal with this?

Most importantly, is this a rare effect or a common one? Would it be rare or common if you optimize your programs for an HT machine?

--

If you liked this thought maybe you would find my blog nice too:

So, what do we call this? by jcr · 2005-11-19 02:34 · Score: 5, Funny

Hyperthrashing?

-jcr

--
The only title of honor that a tyrant can grant is "Enemy of the State."

Re:So, what do we call this? by porneL · 2005-11-19 02:52 · Score: 2, Interesting

HyperThrottling

Re:The developers are not smart enough! by RedLaggedTeut · 2005-11-19 02:36 · Score: 1

Probably the developers of sql server didn't understand how to get the best from a hyperthreading architecture. There's a big difference between 'real' threads and 'pseudo' (time-sliced) threads. I'm betting it's the software that's at fault here and not Intel's architecture.

Maybe start by betting your Karma, Mr. AC?

AFAIK intel-HT is intended to improve the felt performance of users e.g. in front of a GUI by reducing response time. There has to be a catch, because if it was so easy, everyone would have done it before instead of painstakingly optimizing the CPU.

--
I'm still trying to figure out what people mean by 'social skills' here.

Re:It's been that way since day one, desktop as we by logicnazi · 2005-11-19 02:46 · Score: 5, Interesting

As someone who commented above pointed out, Intel openly acknowledges that performance can be hurt. I don't know what you mean about it not being acceptable to notice this, as I've seen this sort of issue mentioned in pretty much every article I've read on HT, starting quite far back.

HT is just another chip technology like any other. It is only in the rarest circumstances that a new technology will be better/faster for everything. These things all have tradeoffs and the question is whether the benefits are enough to exceed the disadvantages.

I really think you are being a little unfair to Intel. If you had evidence that it decreased performance for most systems even when the software was compiled taking HT into account, then you might have a point. However, as it is, this is no different than IBM touting its RISC technology or AMD talking about their SIMD capabilities. For each of these technologies you could find some code which would actually run slower. If you happen to be running code which makes heavy use of some hardware-optimized string instructions, a RISC system can actually make things worse, not to mention a whole host of other issues. The SIMD capabilities of most x86 processors required switching the FPU state, which took time as well.

It's only reasonable that companies want to publicize their newest fancy technology, and they are hardly unsavory because they don't put the potential disadvantages centrally in their advertisements/PR material. When you go on a first date, do you tell the girl about your loud snoring, how you cheated on your ex, or other bad qualities about yourself? Of course not. One doesn't lie about these things, but it is only natural to want to put the best face forward, and it seems ridiculous to hold Intel to a higher standard than an individual in these matters.

--

If you liked this thought maybe you would find my blog nice too:

HT problems with firebird database (slowdowns) by mAriuZ · 2005-11-19 02:50 · Score: 2, Informative

The usual response is to disable it in the BIOS.

One possible solution (code patch)

http://sourceforge.net/mailarchive/message.php?msg_id=12403341

Other threads with hyperthreading problems (slowdowns)
http://sourceforge.net/search/?forum_id=6330&group_id=9028&words=hyperthreading&type_of_search=mlists

--
developer http://flamerobin.org

Windows problem? by kasperd · 2005-11-19 02:50 · Score: 2, Insightful

The article seems to focus only on Windows. To get good performance from hyperthreading, the scheduler has to be aware of situations that could lead to decreased performance and avoid them. So is this a problem with the Windows scheduler being unable to deal with hyperthreading or is hyperthreading really broken? How is hyperthreading performance on other operating systems?

Another question one needs to ask is, how is performance on single and dual CPU systems? Getting good performance on a dual CPU HT system (which means four logical CPUs) is more complicated and thus requires more sophisticated algorithms in the scheduler.

Applications are most likely not to be blamed for the decreased performance. Such hardware differences should be dealt with by the kernel. Occasionally the scheduler should keep one thread idle whenever that leads to the best performance. Only when there is a performance benefit should both threads be used at the same time.

--

Do you care about the security of your wireless mouse?

Re:Windows problem? by galaxyboy · 2005-11-19 03:43 · Score: 1

Applications are most likely not to be blamed for the decreased performance.
What? The kernel can't possibly deal optimally with all circumstances. It simply doesn't know what a given application will be doing. Only the programmer knows how data will be accessed. A multithreaded application will always perform better if the programmer takes into account the underlying hardware. Every level from the programmer all the way down to the hardware is extremely important to obtain optimal performance.
Hyperthreading can help if it is used how it is intended to be used. Unfortunately, the programmer or the compiler has to know what they are doing. It is doubtful to me that the OS can accurately predict what schedule will be best. That is why OS's export the ability to attach threads to processors and other scheduling primitives.
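As a rough illustration of the "attach threads to processors" primitive mentioned above, here is a minimal sketch using Python's `os.sched_setaffinity`, which is only available on Linux. The helper name `pin_current_process` is made up for this example, and the sketch makes no attempt to work out which logical CPU numbers are HT siblings of the same physical core; that mapping is system-specific and is assumed to be known by the caller.

```python
import os


def pin_current_process(cpus):
    """Restrict the calling process to the given logical CPUs.

    Returns the resulting affinity set, or None where the platform
    does not expose os.sched_setaffinity (it is Linux-specific).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # 0 means "the calling process"
    target = set(cpus) & allowed        # never request a CPU we may not use
    if target:
        os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)
```

A database worker could, for example, call `pin_current_process([0])` to stay on one logical CPU and leave its sibling to other work. Whether that helps is exactly the kind of thing that has to be measured per application.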
Re:Windows problem? by PepeGSay · 2005-11-19 04:48 · Score: 1

All the articles and information on HT have said "performance will decrease in IO intensive applications." SQL server seems to fit that bill. This whole article is a non-starter.
Re:Windows problem? by 10101001+10101001 · 2005-11-19 06:14 · Score: 1

Partially true and partially false. While the OS can't accurately know what a thread is about to do, it can make predictions about it. Tasks that are heavily I/O based will block before their quantum of processor time is up, while CPU-heavy tasks will use up their entire quantum. Given that I/O and CPU tasks are generally orthogonal and HT relies heavily on orthogonality for performance improvements, an HT-aware scheduler simply loads I/O-intensive threads on one virtual CPU and CPU-intensive threads on another virtual CPU. So, simply tracking past quantum usage should be a basis to make relatively intelligent decisions on scheduling.
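The heuristic described above can be sketched in a few lines. This is a toy model and not how any real kernel does it: the 10 ms quantum, the smoothing factor, the 0.5 threshold and every name below are assumptions made up for illustration.

```python
QUANTUM_MS = 10.0  # hypothetical quantum length


class QuantumTracker:
    """Tracks what fraction of its quantum each thread tends to use."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha   # weight given to the most recent observation
        self.usage = {}      # thread id -> smoothed fraction of quantum used

    def record(self, tid, ran_ms):
        """Record how long a thread ran before blocking or being preempted."""
        frac = min(ran_ms / QUANTUM_MS, 1.0)
        prev = self.usage.get(tid, frac)
        self.usage[tid] = self.alpha * frac + (1 - self.alpha) * prev

    def classify(self, tid, threshold=0.5):
        """Threads that mostly burn their whole quantum count as CPU-bound."""
        return "cpu" if self.usage.get(tid, 0.0) >= threshold else "io"


def assign_siblings(tracker, tids):
    """Put I/O-heavy threads on logical CPU 0, CPU-heavy ones on logical CPU 1."""
    placement = {0: [], 1: []}
    for tid in tids:
        placement[1 if tracker.classify(tid) == "cpu" else 0].append(tid)
    return placement
```

Feeding it a thread that blocks after 1 ms each time and another that always runs the full 10 ms places the first on logical CPU 0 and the second on logical CPU 1.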

--
Eurohacker European paranoia, gun rights, and h

Time to Buy AMD? by olddotter · 2005-11-19 02:55 · Score: 4, Insightful

Sounds like it might be time to buy more AMD stock.

I second the person that said programmers shouldn't be writing code to the cache size on a processor. How well your code fits in cache is not something you can control at run time. Different releases of the CPU often have different cache sizes. And frankly developers should always try to achieve tight, efficient code, not develop to a particular cache size.

--
Think Deeply. ...

HT kills my ATI All in Wonder by puto · 2005-11-19 02:55 · Score: 4, Interesting

I have had an ATI All-in-Wonder 9800 for about a year now. I never really used the tuner part until a few weeks ago, when I took delivery of several new LCDs and decided that I could be watching a little TV on one while working.

The 9800 sits in my XP box, which rarely gets rebooted. Games, browsing, etc. My Mac mini and Linux boxes sit in their places with a KVM.

Well, after using the tuner part, it looks great with my digital cable. But the box would lock up, and I couldn't kill the process of the ATI software MMC. Sometimes a few times an hour, sometimes at least once a day. Well, I was on the point of sticking an old Hauppauge in there, or using another MMC.

Well after much digging I found a thread on how HT could cause issues with the software. I disabled it in the bios, do not really need it for anything. And ran the Tuner 48 hours solid without a lockup.

Now perhaps ATI is at fault for the software, but then again HT caused the incompatibility in my book.

Puto

--
The Revolution Will Not Be Televised

Re:HT kills my ATI All in Wonder by sm8000 · 2005-11-19 03:02 · Score: 5, Funny

You watched TV for 48 hours?
Re:HT kills my ATI All in Wonder by puto · 2005-11-19 03:06 · Score: 4, Funny

Just your mom on a Cinemax weekend hump-a-thon.

I left it on for 48 hours unattended.

Puto

--
The Revolution Will Not Be Televised
Re:HT kills my ATI All in Wonder by laffer1 · 2005-11-19 03:25 · Score: 2, Informative

I don't run with HTT on, but I do have an SMP box (2 Xeons at 2GHz). The ATI software works fine on my desktop, so it must be an issue specific to hyperthreading.

--
MidnightBSD: The BSD for Everyone
Re:HT kills my ATI All in Wonder by magarity · 2005-11-19 04:04 · Score: 1

You watched TV for 48 hours?

This is slashdot after all, must have been a Dr Who marathon.
Re:HT kills my ATI All in Wonder by Tim+Doran · 2005-11-19 05:15 · Score: 1

I sure could go for a hundred tacos right about now...

*salivate
Re:HT kills my ATI All in Wonder by this+great+guy · 2005-11-19 10:52 · Score: 1

Well after much digging I found a thread on how HT could cause issues with the software. I disabled it in the bios, do not really need it for anything. And ran the Tuner 48 hours solid without a lockup.

This does not prove that Hyper Threading is the one to blame. On the contrary, this kind of problem sounds typically like there are deadlock bugs in ATI's software that are only triggered by HT. A deadlock situation occurs when multiple threads need to acquire a common set of software locks (spinlocks, mutexes, etc), but each of them is waiting for other threads to release some locks that have already been acquired. Deadlocks are usually easier to trigger when running multiple logical CPUs (HT, or SMP, or dual-core, etc) because the threads actually run concurrently on the CPUs, increasing the chances of having them acquire the locks at the same time. Of course, since deadlock issues are actually a subclass of race condition issues, all of this is extremely time sensitive, and the problem may disappear/reappear when changing the hardware or software even just a little bit. This may explain why laffer1 (who replied to your post) doesn't experience your issue with his different config (not the same software versions, and SMP instead of HT).

I am not a big Intel fan, but in this particular case it looks like Intel's HT is not the one to blame. Instead you should redirect your anger toward ATI :)
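For readers who have not run into this, the deadlock pattern described above is the classic one where thread 1 takes lock A then B while thread 2 takes B then A. The sketch below shows the usual remedy, acquiring locks in one global order, rather than the bug itself (which would simply hang). It says nothing about what ATI's software actually does; the names and the use of `id()` as the ordering key are illustrative assumptions.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
counter = 0


def acquire_in_order(*locks):
    """Acquire locks in a fixed global order, whatever order the caller names them in."""
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    return ordered


def release_all(ordered):
    """Release in the reverse of acquisition order."""
    for lock in reversed(ordered):
        lock.release()


def worker(first, second, iterations):
    global counter
    for _ in range(iterations):
        held = acquire_in_order(first, second)
        try:
            counter += 1          # protected by both locks
        finally:
            release_all(held)


def run(iterations=1000):
    """Two threads name the locks in opposite order, yet cannot deadlock."""
    global counter
    counter = 0
    t1 = threading.Thread(target=worker, args=(lock_a, lock_b, iterations))
    t2 = threading.Thread(target=worker, args=(lock_b, lock_a, iterations))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return counter
```

If `worker` instead acquired `first` and then `second` directly, the two threads could each hold one lock and wait forever on the other, and truly concurrent execution on two logical CPUs makes that window much easier to hit.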

Not a developer from Citrix by Bluey · 2005-11-19 02:58 · Score: 2, Informative

I know asking for them to research is a stretch, but the submitter should at least read the article before submitting it. The quote was from a Technical Director at a consulting company that sells Citrix software, not from a developer at Citrix. Hyperthreading can definitely help performance of Metaframe running under Windows 2003. Enabling it in the BIOS on a server running Windows 2000 was where the problem resided.

Linux Server? by slashkitty · 2005-11-19 02:58 · Score: 1

I don't know about you guys, but I run many Linux servers. I have a mix of CPUs, and the HT servers seem to perform better than non-HT servers. Is Linux better optimized for HT?

--
-- these are only opinions and they might not be mine.

Re:Linux Server? by TallMatthew · 2005-11-19 03:31 · Score: 1

It shouldn't make a difference unless Linux naturally hits the CPU cache more than Windows. AFAIK, the OS doesn't have any insight into whether pages are cached on the CPU or within main memory.
It makes sense that an app that consumes memory, I/O and CPU cycles like a database server would miss L1/L2 caches more often than not. The implication then is that it consumes more cycles for these "hyperthreads" to wait on a fetch than threads running on the main CPU. Is that right?
Unless I'm missing something, which is entirely possible, software and kernel developers are completely innocent here as hardware makes the choice whether an instruction is executed within a CPU's main core or hyperthreaded core.

HT for DISsimilar tasks by Baddas · 2005-11-19 03:00 · Score: 1

As far as I can tell, everything that hyperthreading was designed around was the idea that two dissimilar threads would run at the same time, for example, an I/O bound thread with a FPU-bound thread, or the like.

Running two identical threads on the same processor intuitively seems like it would result in a slowdown, as you've got more overhead than the thread running alone, with the same tasks being executed.

Kinda like trying to toast bread by putting one piece in, then rapidly taking it out and putting a different one in, repeat as needed, vs having two separate slots, or just toasting one at a time.

HT is a kludge by Urusai · 2005-11-19 03:05 · Score: 1

Hyperthreading is a gimmick to keep Intel's overly long pipeline busy. At one point the wisdom was for processors to have long instruction pipelines. The problem arises when branch prediction fails and trashes your pipeline. AMD saw that the long pipelines were harmful and shortened them on the Athlon line. The rest is history.

As far as I'm concerned, the fiasco of P4 being far worse than P3, and the apparent inability to do a turnabout, means Intel is a broken company. They should have just tacked the new P4 instructions on a P-M core and called it the P5. Oh wait...

shared cache versus local memory by erwincoumans · 2005-11-19 03:11 · Score: 2, Informative

"I read the intel assembly guide section regarding hyperthreading, and it clearly states that performance will drop if you don't take the shared cache into consideration." This is a general problem. The Xbox 360 has similar issues, with 3 cores sharing the same cache. Multiple independent CPUs, each with its own local memory (like multiprocessor systems or the PS3's SPUs), don't suffer from these issues.

Of course it can hurt performance by Anonymous Coward · 2005-11-19 03:12 · Score: 2, Informative

HT is a very simple concept: virtualize 2 CPUs by cutting all caches in half and allocating each half to one of the CPUs, and allow the ALUs to process data from either thread. This can give good performance, for instance when one thread has a cache miss and is waiting for data from main memory (or, god forbid, there is a fault and you need to read from the HDD). In normal single-CPU operation, this ties up resources, and that thread can't make any progress. With HT on, the second thread can continue processing data. Or even without a cache miss, there are 4 (or more) ALUs on the die, and only certain types of applications can effectively make use of them all simultaneously. Having HT allows a higher probability that all the resources on the chip are used. But the cost, as I said above, is cutting the cache sizes in half (effectively). And cache is king for some applications. There are many job types where doubling the cache gives much better performance than even doubling the CPU speed (well, that is probably pushing it, but certainly adding 10% more cache can be better than a 10% higher clock rate), as it means less time going to main memory.

It isn't a foolproof technology, but it has its benefits. SQL can be very heavy on the cache, and I'm not surprised that it doesn't perform optimally without some tuning.
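To make the "cache is king" point concrete, here is a deliberately crude simulation. It assumes a fully associative LRU cache and threads that sweep their working sets in a fixed order, which real hardware and real databases do not do; the class, the function and the sizes are all invented for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)          # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)    # evict least recently used


def hit_rate(capacity, working_sets, passes=10):
    """Interleave accesses from each thread's working set and report the hit rate."""
    cache = LRUCache(capacity)
    longest = max(len(ws) for ws in working_sets)
    for _ in range(passes):
        for i in range(longest):
            for ws in working_sets:
                if i < len(ws):
                    cache.access(ws[i])
    return cache.hits / (cache.hits + cache.misses)


thread_a = [("a", i) for i in range(48)]
thread_b = [("b", i) for i in range(48)]

alone = hit_rate(64, [thread_a])              # 48 lines fit in a 64-line cache
shared = hit_rate(64, [thread_a, thread_b])   # 96 lines compete for 64
halved = hit_rate(32, [thread_a])             # same thread with half the cache
```

In this model `alone` comes out at 0.9 (only the first pass misses), while `shared` and `halved` both drop to 0.0 because a cyclic sweep larger than an LRU cache evicts every line before it is reused. Real workloads degrade far more gracefully, but the direction of the effect is the one described above.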

Re:The developers are not smart enough! by jfroebe · 2005-11-19 03:13 · Score: 1

"MS SQL was designed and likely largely tested in a single processor system and multiprocessor or HT support is somewhat less than optimal. So MS SQL is likely best tuned to single processor."

Where did you get this wallop of information? It is not true. MS SQL Server performs very well in multiprocessor environments (not using Hyperthreading). Check out the TPC benchmarks if you don't believe me: http://www.tpc.org/

--
No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil

Re:Should we turn it off in PCs? by masklinn · 2005-11-19 03:14 · Score: 1

Usually not, no.

Best would - of course - be to perform your own test, but enabling HT on desktops usually improves the multi-app flow and reduces the cases of boxes "locking" with one application eating all the resources.

--
"The way we can tell it's C# instead of Haskell is because it's nine lines instead of two." -- wadler

Re:Should we turn it off in PCs? by DrSkwid · 2005-11-19 03:16 · Score: 1

What was the difference when you tried ?

--
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter

Dual Core performance... by Name+Anonymous · 2005-11-19 03:16 · Score: 3, Interesting

As others have said, it depends...

Is it two complete cores? Front Side Bus speed? Memory speed? etc.

The IBM 970MP that Apple is using for the dual core PowerMacs was designed right. And due to the cache snooping (among other things), a dual core 970MP can be slightly faster than a dual processor setup at the same clock and bus speeds.

Another multicore chip to look at for being done right is the Sun UltraSPARC T1 processor. Up to 8 cores with 4 threads per core. Sun's threading model in this processor doesn't have the faults that Intel's HyperThreading does.

Intel's HT technology seems like a bad patch on the architecture, much like Microsoft's updates to Windows.

Re:Dual Core performance... by InvalidError · 2005-11-19 07:48 · Score: 3, Insightful

HT and Netburst were good ideas... but they were poorly executed.

Part of the reason for this is that desktop CPUs mostly run desktop apps and most desktop apps are single-threaded so Intel and AMD could not afford to give up on single-threaded performance. This forced them to add heaps of logic to extract parallelism and Intel made many (IMO dumb) decisions in the process. The SPARC stuff is used for scientific apps which have a long history of multi-threading and distributed computing so Sun does not have to worry about single-threaded performance, allowing for much simpler, leaner and more efficient designs.

Where I think Netburst is particularly bad is the execution engine... when I read Intel's improved hyper-threading patent, I was struck in disbelief: the execution pipelines are wrapped in a replay queue that blindly re-executes uOPs until they successfully execute and retire. Each instruction that fails to retire on the first pass enters the queue and has its execution latency increased by a dozen cycles until its next replay. Once the queue is full, no more uOPs can be issued so the CPU wastes power and cycles re-executing stale uOPs until they retire, causing execution to stall on all threads. Prescott added independent replay queues for each thread so one single thread would never be able to stall the whole CPU by filling the queue... this could have helped Northwood quite a bit but Prescott's extra latency killed any direct gains from it. Intel should roll back to the Northwood pipeline and re-apply the good Prescott stuff like dedicated integer multiplier and barrel shifter, HT2, SSE3 and a few other things, no miracle but it would be much better than the current Prescotts, though it certainly would not help the saturated FSB issue.

With a pure TLP-oriented CPU, there is no need for deep out-of-order execution, no need for branch prediction and no need for speculative execution. Going for TLP throughput allows the CPU to freeze threads whenever there is no nearby code that can execute deterministically instead of doing desperate deep searches, guesses and speculative execution: more likely than not, the other threads will have enough ready-and-able uOPs to fill the gaps and keep all execution units busy producing useful results on nearly every tick. Stick those SPARC chips on a P4-style shared FSB/RAM platform and they would still choke about as bad as P4s do.

The P4's greatest Achilles' heel is the shared FSB... it was not an issue back when Netburst was running at sub-2GHz speeds but it clearly is not suitable for multi-threading multi-core multi-processor setups. The shared FSB is clearly taking the 'r' out of Netburst. The single-threaded obsession is also costing AMD and Intel a lot of potential performance, complexity and power.
Re:Dual Core performance... by stevesliva · 2005-11-19 13:10 · Score: 1

I always find it amazing how much armchair-quarterbacking goes on with regards to Intel architectures. There are just way too many folks out there that have facility with all the code names like Prescott and Northwood and the vagaries of their pipelines and caches and whatever. Normally with Slashdot criticism of Intel, I'd give Intel's architects the benefit of the doubt, but not these days, not with them releasing crappier processors than their competition despite having the best manufacturing capabilities on the planet.

--
Who do you get to be an expert to tell you something's not obvious? The least insightful person you can find? -J Roberts
Re:Dual Core performance... by InvalidError · 2005-11-19 17:52 · Score: 1

There is no giving Intel's architects the benefit of the doubt. The problem is that Intel's Marketing and Engineering departments were not expecting process advancements to stall under 4GHz. They banked on ever increasing GHz figures that never materialized. Remember that Netburst was born in the pursuit of GHz infinity; the replay engine was probably selected for its simplicity and ability to scale with frequency, not for its operational and electrical (in-)efficiency. This is the result of letting Marketeers dictate engineering objectives to perpetuate the "Megahurtz Myth". Intel's chip architects were simply doing what the marketing department told the bosses to tell the architects to do. I wonder how many watts and cycles the replay queue and associated discarded computations cost under load.

Northwood and Prescott are the two most openly documented/reviewed/inspected/analyzed Intel cores out there, in large part because Willamette shocked so many with its awful clock-to-clock performance disadvantage over PPro/P2/P3 cores. This caused all the major sites to do thorough exploration of subsequent desktop chips like Northwood, which set a new pipeline depth record, and Prescott, which was even more controversial for shattering that record.

If Intel restructured Netburst to get rid of the replay queue, they could still turn Netburst into a winner... by simply adding condition flags to execution queue entries, they could postpone issuing uOPs until dependencies are guaranteed to be resolved just-in-time or earlier. That way, they would not need a replay queue of any sort and no uOPs would ever waste execution slots multiple times to produce discarded invalid results while waiting for dependencies.

- A former Intel fanboy and wannabe chip architect.
(I am currently scheduled to start my new job as an ASIC validation specialist next January.)

Hyperthreading works best with "bad" code by cimetmc · 2005-11-19 03:20 · Score: 4, Insightful

Besides the cache considerations which were discussed by numerous people here, there is one aspect that hasn't been mentioned.
The reason why hyperthreading was introduced in the first place was to reduce the "idle" time of the processor. The Pentium 4 class processors have an extremely long pipeline and this often leads to pipeline stalls, e.g. the processing of an instruction cannot proceed because it depends on the result of a previous instruction. The idea of hyperthreading is that whenever there is a potential pipeline stall, the processor switches to the other thread, which hopefully can continue its execution because it isn't stalled by some dependency. Now most pipeline stalls occur when the code being executed isn't optimized for Pentium 4 class processors. However, the better optimized for the Pentium 4 your code is, the fewer pipeline stalls you have and the better your CPU utilisation is with a single thread.

Marcel
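The argument above can be checked with a very small model. It assumes a core that issues at most one instruction per cycle and threads described only by how many stall cycles follow each instruction. That is nothing like a real Pentium 4 pipeline, and the function name and numbers are invented for the sketch.

```python
def cycles_to_finish(threads):
    """Cycles needed to issue every instruction of every thread.

    Each thread is a list of stall counts: entry k means "instruction k
    is followed by that many cycles in which this thread cannot issue".
    The core issues at most one instruction per cycle, from the first
    thread that is ready.
    """
    pos = [0] * len(threads)        # next instruction index per thread
    ready_at = [0] * len(threads)   # first cycle each thread may issue again
    cycle = 0
    while any(pos[i] < len(t) for i, t in enumerate(threads)):
        for i, t in enumerate(threads):
            if pos[i] < len(t) and ready_at[i] <= cycle:
                ready_at[i] = cycle + 1 + t[pos[i]]
                pos[i] += 1
                break               # only one issue slot per cycle
        cycle += 1
    return cycle


stally = [1, 1, 1, 1]   # "bad" code: a stall after every instruction
tight = [0, 0, 0, 0]    # well-scheduled code: no stalls

bad_alone = cycles_to_finish([stally])            # one stalling thread
bad_pair = cycles_to_finish([stally, stally])     # two sharing the core
good_alone = cycles_to_finish([tight])
good_pair = cycles_to_finish([tight, tight])
```

Here one stalling thread needs 7 cycles, so running two back to back would take 14, yet sharing the core they finish in 8 because each fills the other's stall slots. The tight threads take 4 cycles each and 8 together, so the second hardware thread gains nothing, which is the point being made.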

Re:"High End"! - LOL @ U by ZorinLynx · 2005-11-19 03:20 · Score: 1

Actually, the big hitters today typically use IBM mainframe technology. Machines so fault-tolerant that they can lose CPU and memory cards and keep right on running, and end up with uptimes measured in years.

Sun equipment is a bad joke compared to IBM iron. Some banks and big firms have been using the same software for decades; once you get something debugged to the point that it never crashes, and your needs don't vary too much (finance is a pretty well-understood field), you just want it to work. Period.

-Z

Not Intel's fault; Microsoft's fault. c.f. Linux. by Theovon · 2005-11-19 03:22 · Score: 5, Interesting

I remember early discussions from LKML where developers realized that if you were to run a high-priority thread on one virtual processor and a low-priority thread on the other VP, you'd have a priority imbalance and a situation that you'd want to avoid. The developers solved the problem by adding a tunable parameter that indicated the assumed amount of "extra" performance you could get out of the CPU from HT. In other words, with 1 CPU, max load is 100%; with two physical CPU's, max load is 200%; with one HT CPU, max load would be set to something on the order of 115% to 130%. So, when your hi-pri thread is running and the lo-pri thread wants to run, we let the low-pri thread only run 15% of the time (or something like that), resulting in only a modest impact on the hi-pri thread but an improvement in over-all system throughput.
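The arithmetic in that description is simple enough to write down. This is only a sketch of the idea as recounted above, not the actual Linux scheduler code; the function names, the 100-tick window and the percentages are illustrative assumptions.

```python
def low_priority_budget(max_load_percent, physical_cpus=1):
    """Percent of time the low-priority sibling may run.

    max_load_percent is the assumed total capacity with HT enabled,
    e.g. 115 for a single HT CPU that is taken to be worth 1.15 CPUs.
    """
    full = 100 * physical_cpus
    if not full <= max_load_percent <= 2 * full:
        raise ValueError("HT capacity must lie between 1x and 2x the physical CPUs")
    return max_load_percent - full


def may_run_low_priority(tick, budget_percent):
    """Crude duty cycle: allow the low-priority thread on budget_percent of every 100 ticks."""
    return (tick % 100) < budget_percent


budget = low_priority_budget(115)   # the 115% example from the comment
allowed_ticks = sum(may_run_low_priority(t, budget) for t in range(1000))
```

With the 115% figure the budget is 15, so over 1000 ticks the low-priority thread is allowed on 150 of them while the high-priority thread keeps its logical CPU the whole time.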

That being said, I infer from the article that Windows does not do any such priority fairness checking. Consider the example they gave in the article. The DB is running, and then some disk-cache cleaner process comes along and competes for CPU cache. If the OS were SMART, it would recognize that the system task is of a MUCH lower priority and either not run it or only run it for a small portion of the time.

As said by others commenting on this article, the complainers are being stupid for two reasons. One, Intel already admitted that there are lots of cases where HT can hurt performance, so shut up. And Two, there are ways to ameliorate the problem in the OS, but since Windows isn't doing it, they should be complaining to Microsoft, not misdirecting the blame at Intel, so shut up.

(Note that I don't like Intel too terribly much either. Hey, we all hate Microsoft, but when someone is an idiot and blames them for something they're not responsible for, it doesn't help anyone.)

Benchmark your own application by cryfreedomlove · 2005-11-19 03:27 · Score: 2, Insightful

I never accept the assertions that a configuration option like HyperThreading is always good or always bad. It's never black and white. The answer is always: it depends on the application. In my experience, a busy Linux Java-based web serving application that does a lot of context switching and a lot of IO to back end applications uses less CPU when hyperthreading is enabled. Collective wisdom aside, it works for my application so I am leaving it on.

Re:The developers are not smart enough! by Kjella · 2005-11-19 03:27 · Score: 1

While HT degrades faster than two CPU systems for reason of contention of more components than just I/O and memory, if properly programmed it will add to throughput.

Given that MS SQL isn't exactly a rare piece of software, what fraction of software will actually take advantage of the hyperthreading? It's sort of the Itanium argument all over again: who cares how wonderful the architecture is if no software is able to use it well? If I was building server software, my primary performance metrics would be single-core (no HT/dual-core) and multi-CPU (SMP) benchmarks, and HT/double-core performance would be mostly what it would be using the SMP code. Dual-core seems to handle SMP code quite well, so specifically targeting HT CPUs seems like a really niche target, considering the limited gain it has even under the best of conditions.

--
Live today, because you never know what tomorrow brings

Breaking the EULA ? by DrSkwid · 2005-11-19 03:28 · Score: 4, Funny

I thought you couldn't report any performance issues of MS SQL Server :)

--
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter

I've seen this problem by Explodo · 2005-11-19 03:30 · Score: 1

Certain applications take a big hit in performance with HT turned on. It's not just server apps. I don't know the specific class of problems, but some of our software has been benchmarked running faster with HT off.

Re:Hyperthreading... by jfroebe · 2005-11-19 03:31 · Score: 1

"Of course a database server isn't going to take advantage of a hyperthreaded CPU. It doesn't do any FPU at all."

Actually MS SQL Server and Sybase ASE do use the FPU. I'm not sure about Oracle though.

Jason L. Froebe
Team Sybase

--
No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil

HT on Linux by RAMMS+EIN · 2005-11-19 03:34 · Score: 3, Informative

Hyperthreading Speeds Linux.

In a nutshell:

- hyperthreading decreases syscall speed by a few percent
- on single-threaded workloads, the effect is often negligible, with occasional large improvements or degradations
- on multithreaded workloads, around 30% improvement is common
- Linux 2.5 (which introduced HT-awareness) performs significantly better than Linux 2.4

So, from that benchmark (and others like it, just STFW) it appears that HT offers significant benefits; you need multithreading to take advantage of it, and having a HT-aware OS helps.

--
Please correct me if I got my facts wrong.

Common practice for Oracle hosted servers by tigerknight · 2005-11-19 03:37 · Score: 1

We run RH AS2.1 on most machines right now and hyperthreading is disabled (under any kernel) because of this performance hit; it can grind a heavily active database into a big backlog.

So far it looks like AS3 and a newer kernel resolve the issue, but we don't have a big spread of those servers in the DC just yet, so it may not be a good sampling of HT-enabled instances.

Re:The developers are not smart enough! by ocbwilg · 2005-11-19 03:38 · Score: 1

MS SQL was designed and likely largely tested in a single processor system and multiprocessor or HT support is somewhat less than optimal. So MS SQL is likely best tuned to single processor.

Are you high, or are you just in the habit of randomly making up nonsensical stuff? While we're at it, which morons modded that post to +4 Insightful? Do you really think that Microsoft would design and target their database server platform for use in only single CPU servers? Database applications are always CPU intensive, and Microsoft, Oracle, and other vendors of database software spend ridiculous amounts of time optimizing their software to be heavily multi-threaded exactly so that it will perform well on multiple CPU systems.

Every couple of months either there's a new press release from Microsoft or Oracle indicating that hardware vendor X has set a new record for the highest TPC marks on database processing by using some new multi-CPU configuration and their software. Do you really think that Microsoft could compete in those conditions if they only wrote SQL server for single CPU configurations?

Re:Not Intel's fault; Microsoft's fault. c.f. Linu by DrSkwid · 2005-11-19 03:38 · Score: 1

you do know that windows has priority levels too ?

--
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter

Interesting Technical Analysis on the subject by morcego · 2005-11-19 03:39 · Score: 4, Informative

You will find here a very interesting technical analysis on the subject, by Bryan J. Smith, on why Hyperthreading is crappy engineering. From the message:

Since then, Intel has made a number of "hacks" to the i686 architecture.
One is HyperThreading which tries to keep its pipes full by using its
control units to virtualize two instruction schedulers, registers, etc...
In a nutshell, it's a nice way to get "out-of-order and register
renaming for almost free." Other than basic coherency checking as
necessary in silicon, it "passes the buck" to the OS, leveraging its
context switching (and associated overhead) to manage some details.

That's why HyperThreading can actually be slower for some applications,
because they do not thread, and the added overhead in _software_
results in reduced processing time for the applications.

--
morcego

Re:Interesting Technical Analysis on the subject by mojotooth · 2005-11-19 09:41 · Score: 1

In a nutshell, it's a nice way to get "out-of-order and register renaming for almost free."

Anybody who has studied the p6 architecture knows that it has always featured OOO execution and register renaming. Mod parent down, the linked forum post is completely bunk. "Interesting" isn't nearly as useful as "factual."

--
-- Mojo Tooth : exploring our world as only an idiot can.

Re:The developers are not smart enough! by Jugalator · 2005-11-19 03:41 · Score: 1

MS SQL was designed and likely largely tested in a single processor system and multiprocessor or HT support is somewhat less than optimal.

lol, with an outrageous claim like this for dedicated server software, you really need to provide an unbiased source. ;-)

--
Beware: In C++, your friends can see your privates!

Run-Time Cache Size Optimization by RAMMS+EIN · 2005-11-19 03:43 · Score: 1

``How well your code fits in cache is not something you can control at run time.''

You most certainly can, and the speed gains can be significant. One way to do it:

- write a version of your code optimized for 256 KB cache
- write a version of your code optimized for 512 KB cache

Use the contents of /proc/cpuinfo to see how much cache you really have, and choose the version of your code to run based on that.

I'm sure there are better ways, but this is just proof that it's possible. Whether or not it's advisable depends on the situation.
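
A minimal sketch of that idea in Python, assuming a Linux x86 machine whose /proc/cpuinfo contains a line such as "cache size : 512 KB". The variant names are hypothetical stand-ins for the two tuned versions of the code, and the 512 KB threshold simply mirrors the two sizes listed above.

```python
import re


def cache_size_kb(cpuinfo_text):
    """Return the first 'cache size' value in KB, or None if the field is absent."""
    match = re.search(r"^cache size\s*:\s*(\d+)\s*KB", cpuinfo_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def pick_variant(size_kb):
    """Pick a (hypothetical) code variant name for the detected cache size."""
    if size_kb is not None and size_kb >= 512:
        return "tuned_for_512kb"
    # Conservative default when the cache is small or the size is unknown.
    return "tuned_for_256kb"


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the raw cpuinfo text (Linux only)."""
    with open(path) as f:
        return f.read()
```

On a Linux box, `pick_variant(cache_size_kb(read_cpuinfo()))` would make the choice once at startup. Note that with HT enabled the two logical CPUs share this cache, so the reported size may overstate what each thread effectively gets.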

--
Please correct me if I got my facts wrong.

Inconclusive on Linux? by ndogg · 2005-11-19 03:48 · Score: 2, Interesting

I don't have a HT-capable proc (AMD Athlon XP 1700), so I don't know anything from personal experience.

I decided to check out how PostgreSQL did with HT.

The first link (1) was suggesting to someone--who was having performance problems under FreeBSD--to turn off HT. Of course, that may not be related to PostgreSQL itself, but rather FreeBSD. I really don't know.

The next thing I found showed some mixed results with ext2 under Linux (2). Some things showed gains with HT, but not others.

Another link (3) commented that HT with Java requires special consideration when coding.

I didn't come up with anything useful under PostgreSQL, so I checked out Linux.

According to Linux Electrons, Linux performance can drop without proper setup.

--
// file: mice.h
#include "frickin_lasers.h"

Re:Inconclusive on Linux? by jazir1979 · 2005-11-19 08:51 · Score: 1

Link 3 does not suggest that HT with Java requires special consideration when coding. It just implies that the application this person had written had synchronisation issues that would only surface on a multi-CPU box, or a single CPU with HT enabled.

This just means that their application is incorrect and not at all thread safe, but you can "get away with it" on a single CPU where the particular race condition either never occurs, or is much much less likely to occur.
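
To illustrate the kind of bug being described (this is not the code from link 3, which was a Java application; Python is used here only for brevity), here is a sketch of a shared counter whose read-modify-write is protected by a lock. The class and function names are made up for the example. Without the lock, updates can be lost when two threads really do run at the same time, which is exactly the situation a second CPU or HT makes more likely.

```python
import threading


class SafeCounter:
    """Counter whose increment is protected by a lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The read-modify-write happens under the lock so no update is lost.
        with self._lock:
            self._value += 1

    def value(self):
        with self._lock:
            return self._value


def hammer(counter, workers=4, per_worker=10_000):
    """Increment the counter from several threads and return the final value."""
    def work():
        for _ in range(per_worker):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value()
```

With the lock in place the final count always equals workers times per_worker, regardless of how many CPUs (real or logical) the threads land on.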

--
What's your GCNSEQNO?

That's not all by koan · 2005-11-19 03:49 · Score: 2, Interesting

I use Nuendo for professional music recording and even though their latest version says it's HT aware, the performance is poor. In fact in several instances it only takes a few instruments loaded for it to peak CPU, change it back to basic CPU with HT off and it works fine.
My understanding is it's this way with Cubase as well.

--
"If any question why we died, Tell them because our fathers lied."

Re:That's not all by slackmaster2000 · 2005-11-19 08:26 · Score: 1

Audio hardware and software always seem to have issues whenever new underlying technology is introduced, and results also seem to depend heavily on the chipset. Are you sure your CPU is maxing out, or is the meter just insane? Either case wouldn't surprise me.

This reminds me vaguely of about five years ago when ACPI started to become standard, and we all went about installing Win2K with the standard HAL to improve performance and solve conflicts. Of course at that time the standard support response from any soundcard manufacturer was "the soundcard must be on its own IRQ", regardless of the problem. Prior to that I remember all the problems with several lines of soundcards on non-Intel chipsets, about the time when the Athlon started looking like a great processor choice. Or how about all the audio programs and drivers that didn't support NT and didn't get on the ball to support Win2K until 2001-2002?

Anyhow, I don't get a warm fuzzy feeling when I think about audio developers as a whole, although things have improved. Whether HT is problematic on its own or this is just the result of years of goofball hacking in the audio industry I dunno. But of course what's important to you and I is that we do whatever it takes to get a stable system, no matter what we end up having to disable to get there. I suppose this is true in every case. If SQL Server performs like crap with HT enabled, then disable it. No biggie.

You on crack? by Just+Some+Guy · 2005-11-19 03:55 · Score: 1

HyperThreading was never meant for server computers.

So all these Xeons around the farm are laptop CPUs or something?

--
Dewey, what part of this looks like authorities should be involved?

Re:The developers are not smart enough! by DavidTC · 2005-11-19 03:58 · Score: 1

Hell, even if he were right, he'd still be wrong.

Why? Because CPU-intensive applications can't help but work better under dual-core systems. Even if MSSQL server was inexplicably designed for a single CPU, you'd end up with it running on one CPU and the OS and everything else on the other, and it would, indeed, be faster.

Under hyperthreading, of course, that doesn't work at all.

I always thought the idea of hyperthreading was a little dodgy, but I didn't know enough about CPU design to prove it. I'm glad other people are saying the same thing. It always seemed like it would be faster to introduce a new instruction that's basically 'save and restore context', and then rewrite the process scheduler to use it instead of doing that manually, than to do it how Intel did it, which is to switch back and forth between two specific contexts outside of software control.

--
If corporations are people, aren't stockholders guilty of slavery?

Who buys Intel chips anymore anyway? by Hellraisr · 2005-11-19 03:58 · Score: 1

AMD is the way to go.

Re:"High End"! - LOL @ U by Anonymous Coward · 2005-11-19 04:26 · Score: 1, Insightful

No, if you care about fault-tolerance and extreme availability, you go for HP Non-Stop servers (Tandem/NSK)... /G

Re:HyperThreading is not for servers by LurkerXXX · 2005-11-19 04:30 · Score: 1

Really? I guess some AC knows more than the Intel reps that were at last week's SQL 2005 launch. Every other word out of the reps' mouths was about hyper-threading, and they were talking to SQL users.

Re:The developers are not smart enough! by cciRRus · 2005-11-19 04:37 · Score: 1

It's sort of the Itanium argument all over again, who cares how wonderful the architecture is if no software is able to use it well?

A very insightful comment! I'd mod you +1 Insightful if I had the points.

--
w00t

Intel's Hyperthreading vs Sun's Chip Mulithreading by Dopeskills · 2005-11-19 04:40 · Score: 2, Interesting

Can anyone explain to me the exact difference between HT and CMT? I'm wondering if these same issues would plague Sun's new Niagara processor.

Re:Stupid car analogy by slyborg · 2005-11-19 04:49 · Score: 1

What's a carburetor? Or a Yugo, for that matter?

-1 for old and moldy as well as nonsensical.

Re:Should we turn it off in PCs? by Duhavid · 2005-11-19 05:02 · Score: 2, Insightful

Yes, definitely.

Along with the rest of the machine.

--
emt 377 emt 4

Re:The developers are not smart enough! by canuck57 · 2005-11-19 05:07 · Score: 1

Where did you get this wallop of information? It is not true, MS SQL Server performs very well in multiprocessor environments (not using Hyperthreading). Checkout the TPC benchmarks if you don't believe me: http://www.tpc.org/

Wow, this post sure attracted a lot of flame bait from M$ 'n FUD crew.

Read the original post, "and likely largely tested in a single processor system".

I don't think Microsoft gave its developers a $5.8M USD machine in the #4 www.tpc.org spot that you can't even buy yet to develop MS SQL. It was more likely developed on a single-processor PC and subsequently tested on the bigger iron.

Instead of looking at the www.tpc.org site where vendors post systems you can buy, why not look at what organizations are really buying?

http://www.top500.org/lists/2005/11/

There must be some reason that Microsoft consistently is excluded completely from the top 10 by *real* world purchases. I didn't check to see how far down the list you have to go to see a Microsoft product. I guess those Dells run Linux nicely.

Go ahead M$ pundits, mod this down too. After all it is the M$ way. You don't like the facts so you FUD it and mod it down.

Re:HyperThreading is not for servers by canuck57 · 2005-11-19 05:23 · Score: 1

Anyone have any links to any test reports?

Not terribly scientific, but when I ran seti 3.x on my HT box with Linux I got the following results:

1 seti at a time ran in about 4 hours for 6 units per day.

2 instances of seti at a time was about 5.2 hours per unit at 9.2 seti units per day.

So I ran 2 seti instances to get the throughput as I was after the work unit count.
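
The arithmetic behind those figures, as a small Python sketch (the function name is made up; the only inputs are the hours quoted above):

```python
def units_per_day(instances, hours_per_unit):
    """Work units finished per 24 hours when `instances` run side by side."""
    return instances * 24.0 / hours_per_unit


one_at_a_time = units_per_day(1, 4.0)   # 6.0 units/day
two_at_a_time = units_per_day(2, 5.2)   # about 9.2 units/day

# Each unit takes 30% longer, yet total throughput rises by roughly 54%.
per_unit_slowdown = 5.2 / 4.0 - 1.0
throughput_gain = two_at_a_time / one_at_a_time - 1.0
```

So the same measurement supports both camps: HT makes each individual job slower and the machine as a whole faster.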

HT != SMP by TheRealFritz · 2005-11-19 05:27 · Score: 1

Unfortunately Windows looks at an HT CPU as if it had multiple cores (true SMP). If Microsoft would change the Windows Scheduler to properly treat an HT CPU by adjusting the way it distributes threads and processes to the two virtual CPUs, then there should be a performance gain and no penalty.

--
http://www.gloryhoundz.com/

Re:HT != SMP by Dahan · 2005-11-19 06:43 · Score: 1

If Microsoft would change the Windows Scheduler to properly treat an HT CPU by adjusting the way it distributes threads and processes to the two virtual CPUs,
It does. See section 5. (Google HTML conversion of original Word document).
then there should be ... no penalty.
But there is. So what's your next suggestion for Microsoft?
Re:HT != SMP by TheRealFritz · 2005-11-19 10:58 · Score: 2, Informative

Actually, the document you point out kind of describes the flaw in the MS scheduler. The MS scheduler only optimizes HT behavior when you have several physical HT enabled CPUs. So let's say you have 2 P4s with HT. Windows will see 4 CPUs. The article describes how Windows will try to schedule on a logical CPU that's part of a physical CPU which currently has nothing scheduled.

So the support described by Microsoft completely fails to address single physical CPU scenarios, or multi-CPU scenarios under high thread load.

What the scheduler needs to do in a single physical HT CPU scenario is only allow threads to execute on the second logical CPU that share resources with the thread executing on the first logical CPU in order to minimize resource contention.
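
Whatever policy the scheduler uses, it first has to know which logical CPUs are siblings on the same physical chip. As an illustration only (on Linux rather than Windows, and not how either kernel actually does it), the sketch below groups logical CPUs by package using the "processor" and "physical id" fields that x86 Linux exposes in /proc/cpuinfo. The function name is hypothetical.

```python
from collections import defaultdict


def logical_cpus_by_package(cpuinfo_text):
    """Map each 'physical id' to the logical 'processor' numbers that live on it."""
    packages = defaultdict(list)
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processor entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "physical id" and current is not None:
            packages[int(value)].append(current)
    return dict(packages)
```

For the two-P4 example above this would return two packages with two logical CPUs each. On chips that combine several cores with HT, the "core id" field would also be needed to tell HT siblings from real cores.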

Persistent JIT by CustomDesigned · 2005-11-19 05:32 · Score: 1

JIT code can be cached persistently so that startup costs are only paid once. AS400 does this sweetly. And JIT doesn't add significantly to memory footprint (there is a fixed overhead - think about Transmeta), but certain types of garbage collection do - the fast ones (e.g. generational). When it comes to memory management, you can make it small, fast, automatic - pick any two.

Re:Not Intel's fault; Microsoft's fault. c.f. Linu by Knackered · 2005-11-19 05:33 · Score: 2, Insightful

HT is a bandaid for poor compiler technology and a mediocre architecture.

I may agree that HyperThreading as implemented in the x86 architecture is a hack, but I wouldn't dismiss the original idea of HT, as implemented in the Tera supercomputers. It was designed to have hundreds of thread contexts in hardware, so if it has to wait on memory, there will be some other thread available to run. There are enough threads available that it can do without a cache, while utilising the full memory bandwidth. This quite neatly avoids cache consistency problems that can kill massively parallel performance.

--
a.

Re:The developers are not smart enough! by canuck57 · 2005-11-19 05:47 · Score: 2, Insightful

Are you high, or are you just in the habit of randomly making up nonsensical stuff?

No, not high. Just willing to take pro-M$ flame bait today.

I guess I overestimated the intelligence of the /. readership, especially those from the PC world.

The fact is, if you are writing software to be efficient on a single processor the architecture of the software will be much different than if you know you have 32 processors. And neither is best for the other.

For single processor speed you don't want the overhead of interprocess communication, so you can skip it and sequentially do what you need to do without worrying about what the other processes are doing. In fact, this is how most programs operate, as the coding is much easier to do.

For multiprocessor systems you want to distribute as evenly as possible the work across as many processors and I/O busses as you have. It is worth the effort of code, threads, interprocess communications layer with mutex, locks and individual disk writers. But this model would run slower on a single CPU.

The HT model isn't dual CPU in performance but does allow for 2 threads on the system to be active at once, at the expense of individual thread performance. Do we want single process speed or throughput? Example:

Classic seti 3.x on Fedora Linux.

- with 1 seti running takes 4 hours

- with 2 seti running each takes 5.2 hours

So if I want the fastest seti I want to run 1. If I want the most seti I want to run 2 to keep the processor busy to maximum performance.

And MS SQL, like it or not, will have its ups and downs depending on how it was architected.

Re:It's been that way since day one, desktop as we by dnoyeb · 2005-11-19 05:51 · Score: 1

I don't entirely agree. AMD's multi-core architecture was targeted from the start toward servers, and it's quite fair to say servers benefit from it. With HT, servers are not benefiting from it in their 'server' capacity. That is, if the server is loaded (which is the job of a server: to be loaded) then the HT benefit is reduced.

Basically HT provides greater responsiveness rather than performance, so it should be targeted toward desktops, not servers.

It's OK to have drawbacks, but they should not be in the thing you are designing for, or rather advertising for.

even dual core hurts by poeidon1 · 2005-11-19 06:03 · Score: 1

Why only Hyperthreading? Performance can drop even on new dual-core processors, as they share L2 cache. Dual core cannot increase single-thread performance if the thread is memory bound (though it suffers less severely than HT). Hyperthreading was meant to increase processor throughput, but it will work only if the program has a decent cache footprint.

--
They called me mad, and I called them mad, and damn them, they outvoted me. -Nathaniel Lee

Re:even dual core hurts by Hymer · 2005-11-19 07:01 · Score: 1

"performance can drop even on new dual core processors as they share L2 cache"
Does this apply to AMD dual core or Intel dual core or both ??
Re:even dual core hurts by tomstdenis · 2005-11-19 09:10 · Score: 2, Informative

The AMDx2 has separate L2 caches per core; the cores can communicate via a dedicated HT [HyperTransport, not Hyperthreading] link [between 3.2 and 4 GiB/sec] and share one memory controller.

So if one core requests cache line $x and the other core has it the data will be sent over the internal HT link and not even hit the memory bus. The memory controller is pipelined [I suspect] so even while the L2 fulfillment is going on the memory bus can be busy fetching another cache line.

The HT cores have *one* L2 cache per physical core [so for instance, a dual-core HT processor has 4 "logical processors" but only 2 L2 caches]. The Prescott [and later] cores have deep memory read/write pipelines to queue up many memory operations at once. The dual-core P4s have their own cache per physical core which communicates on a FSB that is shared between everything [including the memory bus].

Though in reality the dual-core P4 [e.g. 8xx series] do achieve 2x performance on totally unrelated [and not memory bound] tasks. For instance, doing RSA computations with CRT in two threads gets you a result with half the latency. [same is true for the AMDx2].

So the dual-core P4 isn't a bad buy if money is strapped. You'll get more performance from an AMDx2 though as the individual cores are so much faster.

Tom

--
Someday, I'll have a real sig.

Re:The developers are not smart enough! by BanzaiBill · 2005-11-19 06:03 · Score: 1

MS SQL was designed and likely largely tested in a single processor system and multiprocessor or HT support is somewhat less than optimal. So MS SQL is likely best tuned to single processor.

That is the most uninformed and dumbass thing I've seen written on Slashdot in a while, and that's saying something.

http://www.tpc.org/tpcc/results/tpcc_perf_results.asp
Check the www.tpc.org top 10 list. At #8, from way back in 2003, is a 64 way HP Itanium system running SQL Server. Everything else in the top 10 is more recent.

I really wish slashdot had a -1 (Idiot) mod.

--
- Think of it as evolution in action -

Re:Intel's Hyperthreading vs Sun's Chip Mulithread by Anonymous Coward · 2005-11-19 06:27 · Score: 1, Informative

Not at all. One of the big problems with HyperThreading as Intel has designed it is that they did not provide sufficient memory bandwidth to be able to feed both threads. This problem also plagues Intel's "dual core" chips. Ultimately, it eliminates the supposed benefit of switching over to the other thread when the first is blocked on memory access because as soon as the second thread needs information from memory it will actually slow things down. Also, even if there were sufficient memory bandwidth, the comparatively long fetch times would still mean that the CPU would be blocked parts of the time waiting on memory because there are only two threads available. Finally, there is a cost to switching between the threads, so even if you had the memory bandwidth and enough threads to prevent idle time it would still lose time because of the overhead in switching to another thread whenever the active one gets blocked.

On the other hand, the new UltraSPARC T1 (aka, Niagara) has massive memory bandwidth, shorter fetch times, four threads per core rather than two, and zero penalty for switching between threads. The result is incredible throughput with a total of 32 hardware threads (8 cores with four threads per core) in a single chip. And by the way, that single chip draws a fraction of the power and generates much less heat than a single Intel HT processor (I swear it seems like the systems are blowing out cool air).

Note, however, that the T1 chip may not be ideal for all workloads. It does have a relatively slow single-threaded performance, so it works best when running highly concurrent applications with minimal locking, or when running several applications concurrently. For some applications, it may be desirable to use processor sets to limit the set of threads that it can see and/or to run multiple copies concurrently and load balance across them. But for others that are designed to scale well (e.g., those that already run well on larger systems like the E6800 with 24 UltraSPARC-III or the E2900 with 12 dual-core UltraSPARC-IV chips), then they can take full advantage of the available processing power.

For the tests that I've run with an application that does scale, a system with a single UltraSPARC T1 chip easily doubles up the performance of a system with two 3.2GHz HT Xeons (regardless of whether HT was enabled or disabled). Of course, I haven't been lucky enough to test with the officially shipping version of the T1 chip (the ones I've been able to use have been running at a slower clock rate, and some of them have had some of the cores disabled), so that performance gap may actually be larger than I have been able to measure so far.

Yeah, Einstein by melted · 2005-11-19 06:34 · Score: 1

>> CPU usage increases significantly but SQL Server performance degrades

That's called "saturation". Happens to every piece of server software. There is ALWAYS a point where "requests per second" start going down and latencies begin to go up. And from there it usually goes WAY downhill unless you take the load off or reduce it significantly to let the software catch up and recover.

God, I hate when developers are allowed to do perf testing. They test a simple scenario without full understanding of what's going on and make wild conclusions from it to get "visibility" which at large companies like MSFT often leads to promotion. Then they go ahead and solve a "problem" which doesn't exist.

This is not to say that HT doesn't degrade performance. I've heard that from Intel folks themselves that in some scenarios it does. But when a "developer" does perf testing, I take that with a three pound grain of salt.

question by josepha48 · 2005-11-19 06:49 · Score: 1

is this just windows or did they test other x86 oses as well? IE: Could it be a problem with the OS itself and not hyperthreading?

--

Only 'flamers' flame!
Does slashdot hate my posts?

Re:Intel's Hyperthreading vs Sun's Chip Mulithread by Serpent+Mage · 2005-11-19 07:01 · Score: 1

Roughly the same thing in theory. Difference is that sun applications are typically compiled using sun compilers and that sun hardware doesn't suck and the sun compiler actually knows when and how to make use of the threading benefits.

So no, you shouldn't have problems at all with the Niagara proc unless you do something stupid like shove linux on it ;)

Not on Linux by alexborges · 2005-11-19 07:34 · Score: 1

Well.

All I've seen is better performance with HT and kernel 2.6.

Maybe that's because 2.6 is so much better, maybe it's ALSO because of the HT. We will never know.

--
NO SIG

Re:Not on Linux by tomstdenis · 2005-11-19 09:18 · Score: 1

On my prescott box I've noticed marginal improvements in latency on things like compiles. A build of LTC takes ~37 seconds and around 34 with HT turned on. It takes about 16 with my AMDx2 and about 19 with the Intel 820.

[Both the prescott and AMDx2 have PC-3200 CAS3 memory, the 820 has 533Mhz DDR II].

But then you get into the camp of single task things [e.g. bulk encryption] and the performance hit is there.

You're better off getting at least a dual-core P4 [e.g. 820] rather than an HT enabled P4.

Tom

--
Someday, I'll have a real sig.

You'd think... by Nom+du+Keyboard · 2005-11-19 07:48 · Score: 1

You'd think, wouldn't you, that HT would cut in half or more the very expensive (in cycles) context switching involved in moving to a new thread or handling an interrupt. This is in addition to giving the processor something to do while the other thread is stalled on latency to main memory. Strange to see it go the other way instead.

--
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."

Re:You'd think... by tomstdenis · 2005-11-19 09:04 · Score: 2, Informative

Not really. You're trying to run two threads inside the L1 caches, the decoder bandwidth, etc. So the 16KB of cache turns back into [effectively] 8KB.

The P4 can issue 3 uOPs per cycle but IIRC only from one thread. They alternate [stalls free up slots though]. Also the decoder can only decode ONE x86 opcode per cycle. Then you have expensive memory ports. Fetches to L2 [or alternatively to system memory] are done one at a time per thread [with deep queues which were lengthened in the Prescott core].

Aside from the stronger ALU and the fact there are two of them, the AMDx2 also benefits from having dedicated caches per core and a dedicated HT bus between the cores that doesn't sit on the external bus. If you want SMP performance just get opterons or amdx2 cores. Really that simple.

We all know HT was a kludge hacked in at the last minute to gain some marketing press. If Intel really wanted to boost the performance of the P4 they would strengthen the ALU [at a cost of clock frequency]. If a 2.2GHz AMD64 can ROUTINELY beat a 3.2GHz P4 then a 2.5GHz optimized P4 variant [with a stronger ALU] could hold its own.

Oh wait, that already exists. It's called the Pentium M. :-)

Tom

--
Someday, I'll have a real sig.
Re:You'd think... by Nom+du+Keyboard · 2005-11-19 11:08 · Score: 1

the AMDx2 also benefits from having dedicated caches per core
If you're speaking of a dedicated L1 cache per core then I'll likely agree with you. However it is my understanding that unified L2 cache for a dual processor system is better than separate L2 caches for each processor. And that we haven't fully arrived at -- at least with Intel.

--
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
Re:You'd think... by tomstdenis · 2005-11-19 11:26 · Score: 2, Informative

You'd be wrong. The two L2 caches on the AMDx2 allow them to have access in parallel. Had they been unified you'd either have to make it dual-ported [e.g. larger and possibly slower] or have access in sequence [e.g. double the latency].

If you're running tasks that share [e.g. write to] the same small pocket of memory you're right. However, many tasks don't do that. Often a server will spawn an entire new thread [e.g. unique stack and heap objects] to handle connections.

It also makes sense in the desktop scene. Why would X11 and XMMS be accessing the same code? One is a media player and the other is a window server. They have different data objects in their own respective process spaces. So a unified cache doesn't help. Also keep in mind that smart OSes keep tasks on a given CPU so the cache doesn't get killed as quickly.

Tom

--
Someday, I'll have a real sig.

The Fix is not in The Software by Nom+du+Keyboard · 2005-11-19 08:01 · Score: 2, Interesting

Where multiple threads access different parts of memory but are simultaneously processed by the chip's Hyperthreading Technology, the shared cache cannot keep up with their alternate demands and performance falls dramatically,

Software shouldn't be expected to handle hardware quirks. It's up to the hardware to run the software efficiently.

Seems to me a hardware fix would be to partition the cache into two pieces when HT is enabled and running -- use the whole cache for the processor otherwise.

With 2MB caches per processor now becoming available, would this be such a bad thing? IIRC once you're up to 256KB of cache you've already got a hit rate near 90%. That severely limits your possible improvement to less than 10% regardless of how much more cache you add. And yes I am aware that increasing the processor multiplier does make every cache miss worse in proportion, but still having HT run more efficiently in the bargain could make this tradeoff worth it. And that's even before you consider uneven partitioning if the OS can determine that one thread needs more cache than the other.
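
One way to put numbers on that tradeoff is the textbook average-memory-access-time formula, hit time plus miss rate times miss penalty. The sketch below is only an illustration: the 10-cycle hit time and 200-cycle miss penalty are made-up assumptions, not measurements of any P4.

```python
def amat(hit_time, hit_rate, miss_penalty):
    """Average memory access time in cycles: hit time plus the expected miss cost."""
    return hit_time + (1.0 - hit_rate) * miss_penalty


# Hypothetical latencies, chosen only to show the shape of the tradeoff.
HIT_CYCLES = 10
MISS_CYCLES = 200

at_90_percent = amat(HIT_CYCLES, 0.90, MISS_CYCLES)   # about 30 cycles
at_95_percent = amat(HIT_CYCLES, 0.95, MISS_CYCLES)   # about 20 cycles
```

Under those assumed latencies a few points of hit rate still move the average noticeably, because each miss is so expensive. That is the same caveat about the processor multiplier made above, and it is why the partitioning tradeoff would need measuring rather than guessing.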

--
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."

Offloading the responsibility to the OS by msimm · 2005-11-19 08:08 · Score: 2, Interesting

is still a kludge. HT was a cheap hack to get extra performance under certain scenarios. Looks like they're getting called out for it. Dual-core is the right answer, HT wasn't.

--
Quack, quack.

I'd like to move all our servers to dual-core Opt. by msimm · 2005-11-19 08:17 · Score: 2, Interesting

(erons). But the price makes them a hard sell. I'll definitely be keeping my eye on these things, as soon as the price points start to line up. I want to see AMD succeed in the server market, but for now (aside from Sun and a few HP systems) Xeon is still the dominant player.

--
Quack, quack.

Re:Not Intel's fault; Microsoft's fault. c.f. Linu by Vladimir · 2005-11-19 08:25 · Score: 2, Interesting

I have two identical high-end dual cpu desktops, both with HT enabled sitting on my desk. One runs win-xp, the other a 64 bit Linux. The thing I observe every day is how windows scheduler sucks. I don't know for how long marketing dept. of MSFT knows about HT, but their OS definitely doesn't know about it yet (start update in subversion or compilation in VC -- go to drink some coffee, as computer is unusable). On Linux, on the other hand, HT really improves both responsiveness and throughput. I'm waiting to test quad- dual-core box with HT enabled ;)

Mod Parent UP! by GroundBounce · 2005-11-19 08:30 · Score: 2, Informative

The parent post is common sense, which seems infrequent. I have found the range to be quite wide: When rendering animations from Blender, I have found that hyperthreading results in nearly 70% faster throughput when turned on. For rendering MPEG2 using Tmpgenc (under Wine), I see around 40% improvement with HT on. Clearly, these two applications benefit quite a bit from HT due to small computational footprint and/or low cache contention, etc. On the other hand, on my system, on-screen 3D acceleration in the NVIDIA driver (under Linux at least) appears to suffer with HT, with frame rates that are around 10-20% slower than with HT disabled.

So, I see improvements ranging from -20% to +70% depending on the application, with many applications seeing only small differences one way or the other. Like many things, this tends to turn into a religious debate when the fact is that it varies case-by-case.

Re:The developers are not smart enough! by jfroebe · 2005-11-19 08:31 · Score: 1

If you just make up crap, why don't you even make it believable? You sir are simply a troll.

--
No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil

wrong on both counts by r00t · 2005-11-19 08:31 · Score: 4, Informative

First, it doesn't matter if the server uses threads or processes. Threads have a minor performance advantage for startup and context switching, and some disadvantages for memory allocation speed (finding VM space is a hashing problem) and some locking overhead. For the most part though, with tasks that just crunch numbers (including scanning memory) or make system calls, there isn't all that much difference.

Running 2 threads per CPU is not cheating. It's normal to run 1 thread per CPU plus 1 thread per concurrent blocking IO operation. That could come out to be 2 threads per CPU.
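The sizing rule above can be sketched in a couple of lines of shell. The function name pool_size and its two arguments are made up for illustration; this is just the arithmetic, not anything a real server does for you:

```shell
# Hypothetical helper: worker threads = CPUs + concurrent blocking I/O operations.
pool_size() {
    cpus=$1          # number of CPUs available to the server
    blocking_io=$2   # expected number of simultaneously blocked I/O calls
    echo $((cpus + blocking_io))
}

pool_size 2 2   # 2 CPUs, 2 blocking I/O operations in flight
```

With 2 CPUs and 2 blocking operations in flight that comes out to 4 threads, i.e. 2 per CPU.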

Re:wrong on both counts by r00t · 2005-11-20 05:31 · Score: 1

We call the cheap thread-like things "fibers" now.

For threads and processes:

The OS isn't going to matter much, given that you don't do something really stupid like have lots of short-lived threads.

Re:wrong on both counts by dnoyeb · 2005-11-28 02:47 · Score: 1

You can add 100 threads if you have a valid reason, and that is not cheating. But when you are adding threads for speed and not for responsiveness, threads beyond the number of available processors are indeed cheating. They will only benefit you if you are sharing CPU time with other programs. If the machine is dedicated to your program, they will even tend to slow you down.

Re:I'd like to move all our servers to dual-core O by tomstdenis · 2005-11-19 09:15 · Score: 2, Insightful

Twice the ALU power and half the power draw.

That's not a hard sell. If you're doing number crunching of any kind in a professional setting, an AMD X2 or Opteron will pay for itself quickly.

Oh that and you're not funding the never ending chain of stupidity that is the P4 design team ;-)

Tom

--
Someday, I'll have a real sig.

Multitasking vs Hyperthreading by Sloppy · 2005-11-19 09:33 · Score: 1

The whole point of hyperthreading was that one thread could run while another was waiting for I/O.
Huh? You don't need hyperthreading for that, it's just normal multitasking.

I think he means processor I/O, as in fetching memory. One thread runs (using data from cache or already in registers) while the other thread waits for the RAM to cough up something it needs.

When you get down to that level, even accessing a variable counts as "I/O". ;-)

--
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.

Re:I'd like to move all our servers to dual-core O by msimm · 2005-11-19 09:34 · Score: 1

Of course I agree. Unfortunately the person ultimately responsible for deciding (in my case in favor of the cheaper server line that will get the job done) has to weigh the pros and cons within the broader picture. This year will probably be the first year we end with a profit; the right hardware wouldn't have made that possible.

Still your long-term argument holds, but try to explain that to your investors and you can see how it starts to get a little thornier.

--
Quack, quack.

This "information" is very software specific by Sloppy · 2005-11-19 09:39 · Score: 1

BFD if this particular software load happens to be slower. That's just one example. Overall, HT is still a good idea. (Although I happen to recommend AMD over Intel right now. But still, if AMD cores had hyperthreading, I would be delighted.)

Also, I think it's amusing that they talk about performance degradation like it's some awful thing, considering their example software is all blue-light-special junk. Complaining about HT slowing down this particular server is like complaining that a certain brand of gasoline makes your Yugo run slower. If you care about speed, you probably don't drive a Yugo, and if you care about computers, you probably don't use SQL Server or Citrix.

--
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.

Ah, the SQL team by TarrySingh · 2005-11-19 10:28 · Score: 1

just needs an excuse for that sorry-ass software (which, like any other touted MS product, they bought, way back in '91)!

--
Scott McNealy to Michael: "Suck my Sun!" Michael Dell to Scott : "Lick my Dell!"

HT multiplies performances by 2 by this+great+guy · 2005-11-19 11:59 · Score: 1

To all those people thinking that HT multiplies CPU performance by 2, here is a very simple experiment that will prove you wrong. Define these functions in your shell (BSD, Linux, Cygwin, whatever):

$ dosha1() { dd if=/dev/zero bs=128k count=4096 | openssl sha1; }
$ p1() { dosha1; }
$ p2() { dosha1 & dosha1; wait; }

Then, on an HT-enabled box, benchmark the function running a single CPU-intensive process:

$ time p1
[...]
real 0m5.165s

It took 5.2 seconds. Now do it with the function running two CPU-intensive processes:

$ time p2
[...]
real 0m10.390s

It takes twice the time (10.4 secs). If HT really offered twice the performance, it would have taken the same amount of time (5.2 secs), because HT would have run the 2 processes on the 2 "independent" CPUs, but as you can see this is not the case. The explanation is that the execution units are shared between the 2 logical CPUs. By contrast, on the dual Opteron 244 box I happen to have in front of me, both benchmarks give nearly the same numbers: p1 = 4.4 secs and p2 = 4.5 secs, because on a true SMP (or dual-core) box the 2 CPUs are obviously independent and don't share their execution units. As other people have correctly pointed out, HT is a way to reduce the impact of pipeline stalls and under-utilisation of the execution units; it is not a way to magically multiply raw CPU performance by 2.
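To put one number on each box, here is a rough sketch that turns the two wall-clock times into a throughput scaling factor (2.00 would mean the second process came for free, 1.00 means no gain at all). The function name scaling is made up for this example, and it assumes awk is installed:

```shell
# Hypothetical helper: throughput scaling when going from one process to two.
# $1 = wall time for one process, $2 = wall time for two processes run concurrently.
scaling() {
    awk -v t1="$1" -v t2="$2" 'BEGIN { printf "%.2f\n", 2 * t1 / t2 }'
}

scaling 5.165 10.390   # HT box, timings from the runs above
scaling 4.4 4.5        # dual Opteron 244 box
```

With the timings quoted above this prints 0.99 for the HT box and 1.96 for the dual Opteron. That is for this particular sha1 workload only; other workloads will scale differently.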

Novell NetWare 6.5 and HT by Jeddawg · 2005-11-19 15:18 · Score: 1

I've personally seen that HT technology can kill performance on Novell NetWare 6.5 on very high-end servers. (Performance increased more than tenfold when HT was disabled.)

Hyperthreading and performance are not always rela by unclocked · 2005-11-19 16:06 · Score: 1

It is incorrect, IMHO, to assume that Hyperthreading and increased performance are related, at least on server platforms. Unless the server applications were conceived with multithreading in their design, there is no reason to expect Intel HT to somehow figure out the assembly code and improve performance.

Re:Not Intel's fault; Microsoft's fault. c.f. Linu by Theovon · 2005-11-21 07:50 · Score: 1

Of course I know that. It just seemed to me that if Windows had a way to decide not to use both virtual processors because the only two runnable threads were of different priorities (or run the lower-pri thread only part of the time), then the people complaining would never have noticed a problem.
