Hyper-Threading Explained And Benchmarked
John Martin writes "2CPU.com has posted an updated article about Hyper-threading performance. They discuss the technology behind it, provide benchmarks, and make observations on what the future holds for hyper-threading. It's actually an easy, interesting read.
Of note, they'll be publishing Part II in the near future which will detail hyper-threading performance under Linux 2.6. Hardware geeks will probably appreciate this."
There was an interesting discussion on the Plan9 newsgroup about hyperthreading recently, read here
For those more technically inclined I would suggest reading Intel's Hyper-Threading Technology Architecture and Microarchitecture whitepaper instead.
If you are really interested in the how and why of hypertreading in suggest you read trough the lecture notes of Computer System Architecture at MIT OpenCourseWare. This gives you enough background to race trough all the articles at Ars Techica et al.
karma police: arrest this man, he talks in maths; he buzzes like a fridge, he's like a detuned radio. [radiohead]
Well Intel is already encountering heat problems which limit how fast they can crank the clockspeed. Hyperthreading is a moderately successful attempt to make use of the available execution units on the chip which would otherwise sit idle. It's also not so new and untested, it has been implemented but not enabled on earlier P4 steppings.
Athlon and Athlon64 are generally better able to make use of their execution units, and wouldn't benefit from HT as much as P4/Xeon.
- FDIV error: yes, it was division, not addition. However, conditions ware far less specific as Intel would have liked us to believe...
- CISC vs RISC: you correctly pointed out that Pentiums still are CISC (even though they nowadays have a RISC core)
And you've missed the following hooks:Note to moderators: mod grand-parent down. It is obviously a troll (albeit a rather well written troll!). If you absolutely must mod it up, at least use Funny rather than Interesting
So it is, and it's not all that fast either. Then again, you shouldn't believe all that you read on the Intarweb.
Stick Men
Are you kidding?? This review linked to from /. a few weeks ago shows that a 1.8ghz Athlon XP easily beats the 2.6ghz Celeron in the DivX encoding test. With their 128kb L2 cache (384kb less than a P4) the Celerons just can't keep up with the P4. And the lower end P4s can't keep up with the Athlon XPs. Celerons are a complete waste of money, IMO.
... I learned from this article.
A short but informative article about SMT is on Wikipedia
If X is the lower number and Y is the higher number, he's figuring his percentage increases as (Y-X)/Y instead of (Y-X)/X .
Or is this some kind of "New New Math" that they started teaching in the 10 years since I graduated?
3000+ comments meta-modded. 0 mod points awarded.
Lesson for other meta-suckers: Don't believe the hype!
Excepty that Celery 2.6 gets his ass kicked pretty badly by a 1.6 Duron
See benchmark at Anandtech Budget Shootout
how long until
Cost, cost cost. Cost cost cost cost cost, cost cost cost cost cost cost. Cost cost cost--cost! Cost cost cost, cost cost cost cost cost cost...cost cost. Cost cost "cost" cost cost cost cost cost cost cost cost. Cost cost cost cost cost COST cost cost.....
The lameness filter blows. Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Gates' Law: Every 18 months, the speed of software halves.
When a process blocks because it is trying to access memory that is not loaded into the cache, it sits idle while the data is retrieved from the much-slower main memory. If you can store two process contexts on the CPU instead of just one, whenever one process blocks to read from memory, the operating system can quickly switch the CPU to the other context which is waiting to run.
I can't remember the name of the machine, but one parallel shared-memory machine used this exclusively. The CPU had 128 process contexts and would switch through them in order. The time between subsequent activations of each context was great enough that data could be fetched from main memory and loaded into a register. This eliminated cache coherency problems (no cache!) and all delays related to memory fetching.
A P4 with hyperthreading is a simplified and much more practical version of that machine.
The only way "better caches" will improve SMT is if you had one cache for each thread, however with that kind of configuration you basically end up with two cores on one chip.
The original thinking behind SMT was that with cache and branch prediction misses staring to have very large penalties, switching to an alternate thread would result in significant performance increase.
It turns out however that doing context switching at this ultra-fine granularity causes the cache miss rate to go up as each thread fights for the cache.
To get the best out of it the second thread would have to either "lock down" some cache lines and be doing either mainly ALU intensive operations or using streamed memory that would not be cached. This however end up limiting SMT to some pretty special case programming situations.
How about two people in moderate shape being able to push wood through a single wood chipper than a single person who is in great shape (assuming the wood is piled up 18 feet away = cache miss).
The single wood chipper being analogous to the actual processing part of the core, is only going to be able to shred so much wood - but if two people fetching wood from the woodpile can keep it running at 100% capacity they will shred more wood than a single guy running back and forth to the wood pile by himself.
Glonoinha the MebiByte Slayer
i'm not 100% sure bout this but i just got da ;) ) that can compete in
fishy feeling that hyper threading really is just
to make life easier for novice/beginner programmer
to write programs in "high" level languages (say
Vbasic, or just basic
performance to programs writen by cracks, say in
assember or C / C++.
Good programmers don't write programs in assembly. They pick good compilers and know the correct optimizations. Even if they could beat the best compilers, the code wouldn't be portable. That is bad programming except in very rare cases. Good compilers exist for languages like C/C++ to take advantage of multithreading. (icc)
In the old single-processor days, your calc thread could do a Wait(0) -- according to the Windows docs, this yields all of the calc thread's remaining time slice to blocked threads, like the GUI thread holding WM_PAINT in its queue. In these modern hyperthreaded times (I imagine true SMP works the same way), Wait(0) does nothing because the calc thread does not block when the GUI thread is on another virtual or real processor, and the screen updates gum up and get all blocky.
The solution I use is that when the GUI thread services a PostMessage from the Calc thread, it runs the message pump to check for and dispatch WM_PAINTs -- a kludge to give the PostMessage from the calc thread lower priority than WM_PAINT. But in the mean time I am cursing a blue streak that MSDN cannot document that Wait(0) is essentially meaningless with more than one processor and I have spend two weeks tearing my hair out about what is going on.
I have some insight into this technology as I was part of a research group researching SMT. It is a really cool technology that exposes Instruction level parellelism (ILP) and increases performance. The basic HT technology for the processor however distributes the resources. The details of Intel HT are available here at http://www.intel.com/technology/hyperthread/ You can also find whitepapers associated with this. Now the catch is application should be multi threaded. You just can't buy a HT processors and run single thread application and expect to improve performance. The performance benefits lie if optimal number of threads are used. If too less it will be unnecessary wastage of resources. If too high they will queue up and cause bottlenecks. The other thing that can affect performance is unbalanced workload and can cause threads which cannot exploit the parallelism. This is a new technology and lot of research is going on in this area and it looks really promising.
That is totally true. Processor-specific microcode optimizations are definitly the compilers job. But you have to conceed the fact that the compiler can only do so much. If the programmer doesn't choose a good method or solving the problem at hand there isn't much a good compiler can do to optimize the code, especially if the problem being solved is complex.
Compilers simply can't be asked to pick up the slack for programs written with a poor logical flow. They can't be ask to figure out a completely different and improved algorithm for solving a complex problem they don't completely understand the parameters for.
You are thinking of James T Kirk... See this is James R Kirk. :)
I've had enough abrasive sigs. Kittens are cute and fuzzy.
AnandTech did an excellent article on hyper threading a while back. Well written and worth reading.
What you wrote here is almost verbatim what Michael Abrash said in his book "Zen of Code Optimization". Dr. Dobbs Journal actually offered it up for free in PDF format at one point, I can only hope to find it amongst my mass of CD's.
Smart code will do more for you than hand optimized assembler, unless you already have written smart code.
Slashdot is proof that Sturgeon's Law applies to mankind.
IBM will have SMT in the Power5. Their approach looks even better than Intel's, but part of that is the Power architecture and part of that is IBM learning from what Intel did. SMT is really the best way to get past the limiting reagents of modern processors : bandwidth.
Also, make sure to set the vm's to low priority when you are not in the window, it makes a huge difference in system response, even without Ht.
-Mike
#6495ED - cornflower blue
If you want to benchmark a hyper-threaded machine, a useful exercise is to run two different benchmarks simultaneously. Running the same one is the best case for cache performance; one copy of the benchmark in cache is serving both execution engines. Running different ones lets you see if cache thrashing is occuring. Or try something like compressing two different video files simultaneously.
If you're seeing significant performance with real-world applications using a a "hyper-threaded" CPU, that's a sign that the operating system's dispatcher is broken. And, of course, hyper-threading dumps more work on the scheduler. There's more stuff to worry about in CPU dispatching now.
Intel seems to be desperate for a new technology that will make people buy new CPUs. The Inanium bombed. The Pentium 4 clock speed hack (faster clock, less performance per clock) has gone as far as it can go. The Pentium 5 seems to be on hold. Intel doesn't still have a good response to AMD's 64-bit CPUs.
Remember what happened with the Itanium, Intel's last architectural innovation. Intel's plan was to convert the industry over to a technology that couldn't be cloned. This would allow Intel to push CPU price margins back up to their pre-AMD levels. For a few years, Intel had been able to push the price of CPU chips to nearly $1000, and achieved huge margins and profits. Then came the clones.
Intel has many patents on the innovative technologies of the Itanium. Itanium architecture is different, all right, but not, it's clear by now, better. It's certainly far worse in price/performance. Hyperthreading isn't quite that bad an idea, but it's up there.
From a consumer perspective, it's like four-valve per cylinder auto engines. The performance increase is marginal and it adds some headaches, but it's cool.
Not to be specific about SMT. Assembly too hard? You people haven't heard of Forth, right? Just use ficl, or some other embeddable forth instead of assembler, will save you lots of time. Better debugging too, since forth is interactive.
Dhrystone and Whetstone should show almost no difference in performance w/ w/0 Hyperthreading. The HT just allows the Superscalar superpipelined processor to stick multiple threads on the same processor at the same time.
So what may be interesting would be to run both dhrystone and whetsone at the same time. Seeing as then you'd be using the ALU and floating point unit. That should show a large difference in the performance w/ w/o HT.
i could not think of anything clever.
Not true at all! RISC refers to the instruction set, not the internal architecture. Even the earliest RISC processors to carry that name included pipeline interlocks -- it was the simplicity of RISC that made such techniques feasible, especially at the chip densities of the 80's.
There's a lot of confusion about what RISC means. Look up a computer architecture textbook. RISC is somewhat fuzzy, and most chips bend the edges of the definitions in places. The general operating principle is "reduced," and herein lies the ambiguity, since this is relative to the technology of the day. (A "RISC" Alpha made in the 90's has more opcodes than a "CISC" 8086 made in 1978.) But RISC processors typically have the following properties:
CISC used to mean that many or most instructions were implemented in microcode on the processor.
Again, no. CISC means supporting many different kinds of operations directly in hardware. This was especially appealing in the days when back-end compiler code generation wasn't very good, so CISC means often a simple 1-for-1 translation from high-level constructs to machine opcodes. The ISA complexity usually meant microcode was the best approach, but this was not part of the definition.
Background: I've used single CPU systems, HT systems, and SMP systems. I've taken courses on OS design and even in the process of writing my own. I'm quite familiar with the 80x86 32-bit instruction set and aware of the new 64-bit design as planned by AMD.
My $0.02 (this GREATLY SIMPLIFIED)
In the beginning there were CPUs. And CPUs were good.
Soon we realized the limitations and said.. Hey! Why not add another CPU and SMP was born.
SMP was good as well, however the additional cost was something of a deterrent for all but the power-users (and commercial applications of course).
Then Intel tried to develop a middle-ground, HyperThreading. It was a decent idea, however did not work quite as well as originally expected. AMD does not use it for a reason
From my experience I see HT as a hack developed by Intel, trying to duplicate true SMP. Might work sometimes and in certain environments but it's been show to actually slow execution in some situations (cache thrashing). In addition, SMP systems have much better responsiveness than HT ones under a high CPU load.
Which is why AMD is working on multi-core CPUs. This is the *correct* way (at least in my opinion) to tackle the problem, asides from getting true multiple CPUs. More can be read about it here. This combined with the new 64-bit instruction set (read more about that at the above link) will truly create a new era of CPUs.
# fuser -v
#