Hyper-Threading Explained And Benchmarked
John Martin writes "2CPU.com has posted an updated article about Hyper-threading performance. They discuss the technology behind it, provide benchmarks, and make observations on what the future holds for hyper-threading. It's actually an easy, interesting read.
Of note, they'll be publishing Part II in the near future which will detail hyper-threading performance under Linux 2.6. Hardware geeks will probably appreciate this."
Simultaneous Multithreading (SMT) is not a new idea, although no one to my knowledge has implemented it yet. Intel just calls it "Hyperthreading"...it is essentially SMT.
And yes, this is a very good idea. A modern superscalar out-of-order processor, like the Athlon and Pentium Pro (and later), can issue and retire multiple instructions per clock cycle. However, it can *only* do this if there is enough instruction-level parallelism (ILP). Turns out, there is not enough ILP in current programs to take full advantage of the chip's processing capabilities. Issue slots and function units go unused due to dependencies in the program and cache misses that stall the processing. A typical processor can only look at about 32 instructions at a time. This is not a large enough window to execute future instructions out-of-order when such a stall occurs.
However, 2 threads of execution will likely fill all of the issue slots. They are also independent threads of execution, so dependencies don't exist between them. This means that when the pipeline stalls due to a cache miss, the other thread can keep on retiring instructions.
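To make the issue-slot argument concrete, here is a toy sketch in Python. It is not a model of any real core: the issue width of 4 and the per-cycle "ready instruction" counts for the two threads are made-up numbers, and a 0 stands for a cycle lost to a cache miss. It only shows how a second independent thread can fill slots the first one leaves empty.

```python
def slot_utilization(threads, width=4):
    """Fraction of issue slots filled when the given threads share one core.

    Each thread is a list of hypothetical per-cycle counts of instructions
    that are ready to issue (0 models a stall, e.g. a cache miss).
    """
    cycles = max(len(ready) for ready in threads)
    issued = 0
    for cycle in range(cycles):
        free = width
        for ready in threads:
            wanted = ready[cycle] if cycle < len(ready) else 0
            taken = min(free, wanted)   # a thread can only use slots still free
            issued += taken
            free -= taken
    return issued / (width * cycles)


# Hypothetical workloads: neither thread has enough ILP to fill 4 slots.
thread_a = [2, 1, 0, 0, 3, 2, 0, 1]
thread_b = [1, 2, 2, 1, 0, 0, 3, 2]

if __name__ == "__main__":
    print(slot_utilization([thread_a]))            # 0.28125
    print(slot_utilization([thread_b]))            # 0.34375
    print(slot_utilization([thread_a, thread_b]))  # 0.625
```

With these invented numbers the two threads never compete for a slot, which is the best case. Real workloads that both want the same execution units at the same time would see less benefit.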
To all those saying that this is dumb, I suggest you study some modern architecture (I'm not talking about your undergrad architecture course either). A paper I read recently studied the effects of SMT on a simulated Alpha processor. The results were astounding with very few changes to the processor core. I heard that the next Alpha was slated to include SMT before Intel killed it.
There was an interesting discussion on the Plan9 newsgroup about hyperthreading recently, read here
Hyperthreading helps increase efficiency when applications are coded for it and it is enabled. As better caches and busses get built into future CPUs, hyperthreading will also get better.
For those more technically inclined I would suggest reading Intel's Hyper-Threading Technology Architecture and Microarchitecture whitepaper instead.
I hate to say it, but your logic is flawed.
To put hyperthreading into your car analogy:
Hyperthreading is like a car that has power assisted steering. If you want, you can switch it off; you'll likely have a slightly smoother time with it on. But if you want the control (or don't trust it) then you can switch it off.
For the geek who reads posts as a stack of strings delimited by <br>, Nobody's forcing you to use hyperthreading. Use it, don't use it. Don't complain that it's a Bad Thing[tm] simply because you're being given the choice
Global symbol "$deity" requires explicit package name at line 2. - If only $scripture started "use strict;"
that Cinebench performance evaluation is wacked, looks like he interpreted his own graphs wrong.
What a fool.
Please, don't mod up this idiot. It only encourages him. Check his name and then check his previous posts for some other inane comments. The day that he actually has something valuable to say will be the day that hell freezes over.
"they'll be publishing Part II in the near future"
Part II should've been published concurrently, using idle time... tch!
If you are really interested in the how and why of hyperthreading, I suggest you read through the lecture notes of Computer System Architecture at MIT OpenCourseWare. This gives you enough background to race through all the articles at Ars Technica et al.
karma police: arrest this man, he talks in maths; he buzzes like a fridge, he's like a detuned radio. [radiohead]
You're right that it's nothing new, BUT for the vast majority of users (with the right consumer price, of course), this is the first time that 64-bit is available on the desktop, whether it's AMD or Apple.
We played dungeons and dragons for 3 hours.....then i was slain by an elf
I meant to say the 0xF00F bug which freezes the Pentium.
The 0xCAFEBABE bug just slows it down to a crawl.
I have been pwned because my
Perhaps I'm feeding a troll here, but....
64 bits, while not interesting in and of itself, is interesting in AMD's implementation. I have an UltraSparc sitting on my desk at work, and I assure you it's one of the most boring machines in the world. Why is AMD interesting? In the Opteron/Athlon 64 they've fixed some of the shortcomings of the x86 architecture. More registers. Access to more than 4GB of RAM without minutiae (like Intel uses). Things that were expensive in a register-starved 32 bit processor aren't on an Athlon64.
No, it's not innovative, not by a longshot. It's the same damn thing Intel did when they introduced the 80386. But it continues the line unbroken, and that's why the processor is important.
Hyperthreading is interesting, I agree, but I'd much prefer more affordable dual processor machines. Why in the world do Intel, AMD, and Microsoft go out of their way to keep SMP machines off the desktop? Apple certainly is going in the opposite direction.
and it is _not_ unreliable. no way. no how. I am very impressed with Intel's chips. I have HT turned on, and again, I experience zero crashes. But some RISC processors are very neat. Never managed to get my hands on any of them, though.
That's pretty cool, but if your primary concern is encoding, then there are some things to keep in mind. A Celeron is much cheaper than a P4 with the hyperthreading ($90 for a 2.6GHz Celeron, and $170 for a P4 2.6C). And if the app you're using doesn't support HT, then a Celery will likely encode faster than a P4 with HT on. HT can also reveal nasty bugs in some drivers (my HDTV card is an example). So unless you're playing games, the P4 is just added expense.
Whether it's something obvious like the Pentium off by 1+1=1.9999943 error
The Pentium math bug was with division, not addition, and it only occurred in very specific circumstances. So while it supports your general point that complicated systems are more difficult to debug, that wasn't a very good example of an "obvious" bug. Careless, yes.
One thing that was good for the industry was to move away from the complex instruction set (CISC) towards a reduced set of instructions (RISC), and we have seen the speed improvements as well as a general reduction in hardware bugs since that time.
You do realize that Intel x86 processors are still CISC, right? (OK, actually internally they do execute things very much like a RISC chip, but the instruction set is still CISC, and modern x86 processors are certainly not any _simpler_ for having some RISC-like elements to them.)
Besides, RISC chips don't actually have fewer instructions. Most of them these days have more. The difference between CISC and RISC is that RISC chips don't have certain complicated, slow instructions, but rather break these up into smaller pieces. For example, CISC processors usually have an instruction to move memory-to-memory while RISC only moves memory-to-register and register-to-memory. Also, CISC processors often have a division instruction while many RISC processors instead just have a multiplicative inverse instruction (so to compute a/b you instead compute a*inv(b)).
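As an illustration of the a*inv(b) idea, here is a minimal Python sketch that computes a quotient using only multiplies and subtracts inside the loop. It uses Newton-Raphson refinement of the reciprocal, which is one well-known way to do this; it is not a claim about how any particular RISC chip implements its inverse instruction. The function names, the iteration count, and the starting estimate are choices made for this sketch.

```python
import math


def reciprocal(b, iterations=5):
    """Approximate 1/b with Newton-Raphson: x <- x * (2 - m * x)."""
    if b == 0:
        raise ZeroDivisionError("reciprocal of zero")
    # Scale |b| to a mantissa m in [0.5, 1) so one starting guess fits all inputs.
    mantissa, exponent = math.frexp(abs(b))
    # Linear first estimate of 1/m; the constants are 48/17 and 32/17.
    x = 2.823529411764706 - 1.8823529411764706 * mantissa
    for _ in range(iterations):
        x = x * (2.0 - mantissa * x)    # each step roughly squares the error
    result = math.ldexp(x, -exponent)   # undo the scaling: 1/b = (1/m) * 2**-exponent
    return -result if b < 0 else result


def divide(a, b):
    """Compute a/b as a * inv(b), without using the division operator."""
    return a * reciprocal(b)


if __name__ == "__main__":
    print(divide(7.0, 3.0), 7.0 / 3.0)
```

The result can differ from true division in the last bit or so, which is one reason hardware that takes this route needs extra care to deliver correctly rounded quotients.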
But to add Hyperthreading, an untested and unproven technology which can guarantee no more than a 12% speed improvement, is folly. Better to amp the CPU clock and deal with a known like heat than to risk your company's livelihood on letting the CPU figure out which thread is which. That is something an OS is much more reliable in handling.
Now that's just ridiculous. Hyperthreading is not untested or unproven. Similar ideas have been discussed in academic papers for years; Intel was just the first to put it into a modern CPU. It's hardly untested, either - Intel started seeding the first Hyperthreading-capable processors what, two years ago now? At that point I wouldn't have suggested running a mission-critical application on a machine with Hyperthreading enabled, but now? You'd be crazy not to if it actually speeds up the application you need to run.
The reality is that in order to advance the speed of computer processors, it's necessary to make them more complicated.
... but the moderators still don't recognize this for the troll it is!
Well Intel is already encountering heat problems which limit how fast they can crank the clockspeed. Hyperthreading is a moderately successful attempt to make use of the available execution units on the chip which would otherwise sit idle. It's also not so new and untested, it has been implemented but not enabled on earlier P4 steppings.
Athlon and Athlon64 are generally better able to make use of their execution units, and wouldn't benefit from HT as much as P4/Xeon.
I think they made a mistake here.
From the article:
"Sandra's CPU benchmark is obviously quite optimized for hyperthreading at this point, and the numbers certainly show that. We see an average improvement of ~39% when hyper-threading is enabled on the P4 ..."
The numbers are:
4328 without HT
7125 with HT
You could say that disabling HT makes this benchmark 39% slower. But the increase by turning HT on is
7125/4328-1 = 1.646 - 1 = 0.646 = 64.6 %
Hrmpf.
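The two ways of reading the same pair of scores can be checked with a few lines of Python. The scores are the ones quoted above; the function names are just labels for this sketch.

```python
def percent_increase(old, new):
    """Gain relative to the starting value: (new - old) / old."""
    return (new - old) / old * 100


def percent_decrease(old, new):
    """Loss relative to the starting value: (old - new) / old."""
    return (old - new) / old * 100


without_ht = 4328
with_ht = 7125

if __name__ == "__main__":
    # Turning HT on: measured against the score without HT.
    print(round(percent_increase(without_ht, with_ht), 1))   # 64.6
    # Turning HT off: measured against the score with HT.
    print(round(percent_decrease(with_ht, without_ht), 1))   # 39.3
```

Both figures describe the same two numbers. The only difference is which score goes in the denominator.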
I do believe that HT does have a future, perhaps not in its present form, but still.
:)
I do remember when there was that RISC vs CISC thing in the 80s, people were saying that CISC was obsolete, RISC being the future and so on. What we see today is not pure RISC processors but something in between. -- It's just that the answer was not as pure or clean as people thought at first.
A few years ago there was the BeBox and its BeOS. Well, BeOS had the philosophy of a machine having not a single super-powerful-burning-hot processor but, instead, several low-power ones combined.
Well, Hyper-Threading may push distributed processing technology to the desktop, to the masses, so we might have interesting changes in software and hardware philosophy in the future.
Sort of romantic thinking... But one can dream.
- FDIV error: yes, it was division, not addition. However, conditions were far less specific than Intel would have liked us to believe...
- CISC vs RISC: you correctly pointed out that Pentiums still are CISC (even though they nowadays have a RISC core)
And you've missed the following hooks:
Note to moderators: mod grand-parent down. It is obviously a troll (albeit a rather well written troll!). If you absolutely must mod it up, at least use Funny rather than Interesting.
I grant you that it's better than what Intel was doing with 64-bits, but it was nothing more than the next logical step on the x86 CPU line.
an extra frame or two for Doom3!
Do any modern chips support per-process cache reservation? That would alleviate some of the problems reported in the article.
Mea navis aericumbens anguillis abundat
All things being equal, RISC gives you more bang for your buck. The difference is that Intel has pushed CISC, or specifically the x86 architecture, as fast or faster than RISC by using more bucks. The amount of R&D dollars poured into x86 vs the amount poured into PowerPC or Alpha is overwhelming.
When I was at Apple our processor architect, Phil Koch, gave a talk in, I think, 1997, where he said that the PowerPC consortium had essentially optimized for power consumption and dollars spent on R&D. What was amazing at that time was that PowerPC was competitive with Intel given much lower power consumption and much lower investment of R&D dollars. However, no one really cared about lower power consumption so it didn't translate into any real advantage. Without the R&D dollar leverage given by RISC, however, the PowerPC would not have been able to compete at all. Pushing the 68K architecture to be competitive with Intel with the same R&D dollars as PowerPC would have been impossible.
The F00F bug was on the Pentium I'm certain. I have an old P-100 sitting in my closet that is affected. I don't know if it ever existed on the Pentium 2s - can anyone confirm? Just curious...
This could be analogous to two people in moderate shape being able to pile more wood in total, than a single person who's in great shape. :)
hmm... in 6 years of architecture research i have never heard anyone talk about SMT like that. it's not even analogous
Clark Kent is Superman's critique on the human race.
... I learned from this article.
When did Kirk start benchmarking processors? One would think he would be too busy getting his crew killed and shagging green alien women...
Why would you want to have a virtual double processor when... you can actually get a second one? Both changes require that you change your motherboard (one for HT, one for dual sockets). Dual Celerons sounds like a good cheap buy, or even dual Athlons. Why bother with this? Except for the coolness factor of having your POST screen littered with "Hyperthreading Enabled", and in most cases it's not even called that; I forgot what they really write on the screen. Seriously, I wouldn't bet my money that HT will even be copied by other manufacturers any time soon, unlike SSE or MMX.
Trolls dont like to be Flamebait, because they burn so well. Protect our Troll heritage!
Current situation:
Xeon and P4 CPUs have too small caches and too slow busses.
Let's watch this technology develop and come back in, let's say, 6 months ;-).
Many thanks,
M
To really exploit this, you'd need gang scheduling in the operating system. But it's unlikely that SMT would remain around long enough for any efforts to exploit it to be feasible. CMP with separate cache would likely take over before then since it would behave more like separate cpu's from a performance standpoint and thus offer more consistent behavior.
"Why in the world do Intel, AMD, and Microsoft go out of their way to keep SMP machines off the desktop?"
Actually, Windows XP Home will quite happily cope with a dual CPU machine. While I initially wondered if this was specifically to support hyperthreading (which appears as 2 CPUs in Windows), it does actually work with dual chip machines.
I guess the real answer is: they want us to buy workstations. Their argument is that if you need that much power, buy a workstation. Intel and AMD don't subscribe to the ethic of giving you what you need, but what they want to sell you. Microsoft though, have gone some way towards promoting SMP on the desktop, probably thanks to them already having the technology in NT workstation versions, which is part of XP heritage.
No, they aren't. The Apple "common desktop" oriented machines - the eMac, iMac and perhaps at a stretch the 1.6GHz G5 - are all single CPU machines and are likely to remain so now the G5 has finally appeared (price alone, without going into other aspects, puts the dual G5s into workstation/high-end enthusiast desktop territory).
Apple briefly flirted with putting dual CPUs into their nearly-home-desktop machines, but this was driven by the massive speed deficit at the time of G4 CPUs - they *had* to have dual CPUs to be even remotely competitive. No matter what else Apple's marketing department might have tried to say.
If you could option a dual CPU onto an eMac, and all the iMacs were dual CPU, then your comment would be accurate. Two high-end machines out of a base range of seven (and that's ignoring the laptops) is not a paradigm shift. By that measure, just about any major manufacturer is "going in the opposite direction".
I'm not 100% sure about this, but I just got the fishy feeling that hyper threading really is just to make life easier for novice/beginner programmers, letting them write programs in "high" level languages (say VBasic, or just BASIC ;) ) that can compete in performance with programs written by cracks, say in assembler or C / C++.
I believe CPU manufacturers shouldn't care about this but should cater to the cracks, not the beginners.(*)
Looking at what programs are written in and then adapting the CPU to this isn't really the way to go /methinks. Especially if what I'm guessing should turn out to be true, it would be terrible for a MAINSTREAM processor to make these bold claims.
I mean, it would be okay to market "hyperthreading" as an optimizing CPU for high-level languages or something, but making the claim that it also speeds up execution times for an assembler program that has been optimized on paper by the programmer is ... wrong.
(*) Of course the market goes where the money is, but at least label the product correctly.
P.S. Anyone noticed how long "calc.exe" takes to load on AMD Athlons?
Unfortunately, historically CPU speed has increased faster than memory bandwidth. That's why we've had ever more layers of cache added to our systems, to make up for the relative deficiency.
Unless things change, a technology that works better with a higher ratio of memory bandwidth / CPU speed is likely to become progressively less, not more, effective.
Of course, there's always the argument that marketing reasons have pushed CPU clockspeed faster than memory bandwidth, and that Intel et al will just shift their focus more towards memory in future. But defying the tide of 'what people think they want' is usually risky.
I wouldn't say that Intel and AMD are against dual CPU machines on the desktop exactly, it's just that they cost too much for most users, and most of the time money is better spent on a high end single processor machine than a dual processor one. Of course that is mostly to do with the fact that most SMP systems available up until now haven't scaled very well, not least because with Athlon MPs and Xeons the second CPU has to share the available bandwidth with the first. Now though there is the Opteron dual processor system, and for the first time low end SMP systems scale memory bandwidth linearly with the number of CPUs, so a system with 2 CPUs operates almost twice as fast as a single CPU machine, whereas before you'd be lucky to get a 50% improvement. What will be interesting to see in 2005 will be the dual core Athlon FX type chips. These will basically be 2 of the current Athlon 64 (754 pin) CPUs on a single die, each with its own single channel memory controller. The question is, what are they going to call these chips? They'll have a PR rating of about 6800, just using 2 of the currently available cores!!
You can't win Darth. If you mod me down, I shall become more powerful than you could possibly imagine
You can access *way* more than 4GB in 32-bit Windows: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/memory/base/physical_address_extension.asp
I have found HyperThreading a real boost for developing operator training simulators (think giant custom computer game for process plant operators [eg: Oil refineries, gas plants, chemicals, etc...]) where a single thread will totally consume the resources of a single CPU (we call it "no-wait" where the simulation calculates what happens in the next 2 seconds and then immediately jumps to the next timestep, thus fast forwarding through slow parts of a process start-up such as warming a reactor).
An issue we encounter is the DCS (Distributed Control System) interface (the bit that links the PC to the fancy membrane keyboards, touch screens, alarm annunciators that the operator uses on the real plant [to maximise training benefit]). Although the interface typically only uses 0.5 to 2% of the CPU, when the simulation goes flat out, there is a noticeable impact on other threads to the point where there are timeouts on data requests from the operator console.
In summary, if you have a system where some threads are IO bound (in our case, processing requests coming across via ethernet) and other threads are CPU intensive (high end numerical calculations) you will see a definite benefit. It allows us to give every team member a machine fit for the job at approximately 1/3 the cost (those of you who wish to argue that SMP machines are cheaper, we are bound by corporate purchasing agreements where SMP falls into the "Workstation" category while a uni-processor HT machine falls into the far cheaper "Desktop" category).
If you are performing just purely calculations and need to run two parallel threads, I would recommend a SMP or similar machine.
As always, your mileage may vary.
ZombieEngineer
If X is the lower number and Y is the higher number, he's figuring his percentage increases as (Y-X)/Y instead of (Y-X)/X .
Or is this some kind of "New New Math" that they started teaching in the 10 years since I graduated?
3000+ comments meta-modded. 0 mod points awarded.
Lesson for other meta-suckers: Don't believe the hype!
In the app we develop here at work, we are highly conscious of performance and scalability. Simply put - the more transactions we can process, the bigger and happier the customers. And more money in our pockets.
With Xeon with HT, our performance has increased quite dramatically. We use Perl, so we simply fork off the jobs that do the processing. The result is that we fill all the four virtual processors in Linux if we have a sufficient number of jobs running.
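For readers who want to see the fork-per-job pattern, here is a rough Python analogue (the poster's code is Perl and is not shown, so this is only a sketch of the same idea). It is POSIX-only because it relies on `os.fork`. The names `run_forked` and `sum_squares` are made up for the example, and each child hands its result back through a pipe as text.

```python
import os


def run_forked(jobs, worker):
    """Fork one child process per job and collect each child's result as a string."""
    children = []
    for job in jobs:
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: compute, send the result to the parent, and exit immediately.
            os.close(read_fd)
            try:
                os.write(write_fd, str(worker(job)).encode())
            finally:
                os._exit(0)
        # Parent: keep only the read end and move on to start the next job.
        os.close(write_fd)
        children.append((pid, read_fd))

    results = []
    for pid, read_fd in children:
        with os.fdopen(read_fd) as pipe:
            results.append(pipe.read())   # returns at EOF, i.e. when the child is done
        os.waitpid(pid, 0)                # reap the child so it does not linger as a zombie
    return results


def sum_squares(n):
    """Stand-in for a CPU-bound processing job."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    # The OS reports HT siblings as separate logical processors.
    print("logical processors:", os.cpu_count())
    print(run_forked([10, 100], sum_squares))
```

Because every job is a separate process, the kernel scheduler is free to place them on all logical processors, which matches the "fill all four virtual processors" observation above as long as enough jobs are runnable at once.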
Stop the brainwash
I have a computer with dual Xeon 1.7GHz. Those apparently have HT capability built in, but it's not enabled in the BIOS. Anyone know a way to circumvent this to enable HT on these?
Cost, cost cost. Cost cost cost cost cost, cost cost cost cost cost cost. Cost cost cost--cost! Cost cost cost, cost cost cost cost cost cost...cost cost. Cost cost "cost" cost cost cost cost cost cost cost cost. Cost cost cost cost cost COST cost cost.....
The lameness filter blows. Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Gates' Law: Every 18 months, the speed of software halves.
Cache hits are what you want. It's cache misses that kill performance.
Greg
(Inside a nuclear plant)
Aaaarrrggh! Run! The canary has mutated!
When a process blocks because it is trying to access memory that is not loaded into the cache, it sits idle while the data is retrieved from the much-slower main memory. If you can store two process contexts on the CPU instead of just one, whenever one process blocks to read from memory, the operating system can quickly switch the CPU to the other context which is waiting to run.
I can't remember the name of the machine, but one parallel shared-memory machine used this exclusively. The CPU had 128 process contexts and would switch through them in order. The time between subsequent activations of each context was great enough that data could be fetched from main memory and loaded into a register. This eliminated cache coherency problems (no cache!) and all delays related to memory fetching.
A P4 with hyperthreading is a simplified and much more practical version of that machine.
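The round-robin machine described above can be sketched as a small Python simulation. Only the figure of 128 contexts comes from the comment; the 100-cycle memory latency and the assumption that every instruction is a load are invented here to keep the model tiny.

```python
def barrel_utilization(contexts, latency, cycles=10_000):
    """Fraction of cycles in which a strictly round-robin core issues an instruction.

    Every instruction is modelled as a memory load that takes `latency` cycles,
    and a context may only issue again once its previous load has returned.
    """
    ready_at = [0] * contexts          # cycle at which each context can issue next
    issued = 0
    for cycle in range(cycles):
        ctx = cycle % contexts         # visit the contexts in fixed order
        if ready_at[ctx] <= cycle:
            issued += 1
            ready_at[ctx] = cycle + latency
        # otherwise this cycle is a bubble: the context is still waiting on memory
    return issued / cycles


if __name__ == "__main__":
    print(barrel_utilization(1, 100))    # 0.01: one context mostly waits
    print(barrel_utilization(2, 100))    # 0.02: a second context helps a little
    print(barrel_utilization(128, 100))  # 1.0: enough contexts to hide the latency
```

In this model the latency is fully hidden once the number of contexts is at least the latency in cycles. Two contexts on the same model recover very little, which is why a two-thread design still leans on caches.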
I hate it when inarticulate, uneducated people try to pretend to be eloquent by substituting "fancy" words for what they meant to say. This article is a perfect example of that. Because they almost always don't know exactly what the fancy words mean: they just found them in a thesaurus and never checked the definition to see whether they are accurate or really even appropriate.
"Hyperthreading is not untested or unproven"
This comment used RISC type language, and in the process, a logical error was accidentally introduced... the correct programmatic statement would be:
"Hyperthreading is not untested _nor_ unproven"
CISC has its advantage in the way the intended statement would be encoded:
"Hyperthreading is better"
This is a complex statement succinctly written with fewer keywords and fewer potential (epistemological) errors.
http://pcblues.com - Digits and Wood
But if you want the control (or don't trust it) then you can switch it off.
That is not a good analogy. Sure you can choose not to use HT, it will give you the same control over the system as you would have on a computer without HT. But there is no way you could utilize the full power of the CPU without HT.
Do you care about the security of your wireless mouse?
If you refer back to Marc Tremblay's CMT Article, you'll see that one of the approaches is to run one thread until it blocks on a memory read, then run another until it blocks and so on, repeating for as many threads as it takes to soak up all the wasted time waiting for the memory fetches.
The Sun paper on their plans for it is here. Have a look at page 5 for the diagram.
--dave (biased, you understand) c-b
davecb@spamcop.net
I did comp sci (undergrad) in the days when we used unix/VMS to learn and so I have a pretty good understanding of architecture and the basics of threads and processes. The one thing that never sat well with me was that as processor speed "exploded" in the last 5 years, I was under the impression that a "lot" of the performance increase was achieved by parallelising stuff in the execution core. (You can see that my knowledge is _limited_) So as a result, unless your applications could somehow take advantage of this parallelism, a given bit of code would never really get the full benefit of today's uber processors. So all the speed gains were only really marginal improvements.
I think the advent of SMT confirms that it is indeed the case that a given process cannot of itself (unless it is _real_ special) take full advantage of a modern processor and so SMT is a way of reducing the problem by assuming that whilst one process aint enough to take full advantage, two processes are able to make more advantage. It sure makes sense to me.
But it also presents the very interesting question of the marginal benefit of execution pipelines compared to complexity in the front end to allow SMT. What I mean is, what are the trade offs between having a "virtual" (for want of a better word) processor for each execution pipeline rather than using them to out of order execute parts of a single stream of instructions. Is it simply a question of the nature of the work being undertaken by the machine? Ie a processor with 8 pipelines serving 20 users doing stuff, would it be better doing 1 bit of work from each of 8 users or maybe 2-4 bits of stuff from 4-2 users. And can we answer that question heuristically to allow the front end to make good use of each pipeline with a variable profile over the changing use of the machine. Fascinating (well to me anyway).
"The first thing to do when you find yourself in a hole is stop digging."
Could be, but isn't. A better analogy would be two people using the same narrow corridor to chop and pile wood. If one piles wood whilst the other chops, then they perform better than one person. If they both chop wood, and then both pile wood, then they waste lots of time trying to squeeze past each other and accidentally hitting each other with axes.
Okay, so it's not that much better an analogy. But at least it bears some relevance to HyperThreading.
I use VMware workstation extensively... and HT rocks. Ever have a virtual machine go to 100% CPU utilization, and your machine slow down to a crawl? With the extra 20% of CPU available, your system can still function and be responsive, and allow you to deal with whatever is going on. Or I can run two VMs and get much better performance out of them and the system as a whole.
In the old single-processor days, your calc thread could do a Wait(0) -- according to the Windows docs, this yields all of the calc thread's remaining time slice to blocked threads, like the GUI thread holding WM_PAINT in its queue. In these modern hyperthreaded times (I imagine true SMP works the same way), Wait(0) does nothing because the calc thread does not block when the GUI thread is on another virtual or real processor, and the screen updates gum up and get all blocky.
The solution I use is that when the GUI thread services a PostMessage from the Calc thread, it runs the message pump to check for and dispatch WM_PAINTs -- a kludge to give the PostMessage from the calc thread lower priority than WM_PAINT. But in the meantime I am cursing a blue streak that MSDN cannot document that Wait(0) is essentially meaningless with more than one processor, and I have spent two weeks tearing my hair out about what is going on.
I have some insight into this technology as I was part of a research group researching SMT. It is a really cool technology that exposes instruction-level parallelism (ILP) and increases performance. The basic HT technology for the processor, however, distributes the resources. The details of Intel HT are available here at http://www.intel.com/technology/hyperthread/ You can also find whitepapers associated with this. Now the catch is that the application should be multithreaded. You can't just buy an HT processor, run a single-threaded application, and expect performance to improve. The performance benefits come when an optimal number of threads is used. With too few, resources are wasted unnecessarily. With too many, they will queue up and cause bottlenecks. The other thing that can affect performance is an unbalanced workload, which can leave threads that cannot exploit the parallelism. This is a new technology and a lot of research is going on in this area, and it looks really promising.
With HT enabled I can run 2 copies of Folding@Home.
This is a significant boost in production over a non-HT processor because these programs.
I would assume this would also help other DC projects like Seti@Home.
http://www.kubuntu.org/
I could get a dual athlon system, but then I wouldnt be able to hear the dog barking
http://rareformnewmedia.com/
More correct:
We start with one wood chipper, one wood chipper operator and a pile of wood. We can chip (whatever) per unit time.
We make the chipper faster, and can do more (increase clock speed of processor), but at some point the operator can't bring us the wood. So, we use a wheelbarrow to transport more wood in a go, and we keep the stack next to the chipper (a cache).
Now, there's plenty of wood, so we get a SECOND chipper. The operator can stick wood into whatever chipper is free (multiple ALU units, out of order execution).
Add a third chipper, and a separate wheel-barrow operator.
This is what we have (pre-"hyperthreading").
Add a second wood chipper operator. If one of them gets tired, the second can take over stuffing the chippers (hyperthreading).
Is that a bit clearer?
Ratboy
Just another "Cubible(sic) Joe" 2 17 3061
Space heaters!
(Nononononono, I'm an AMD fan, but I couldn't resist)
To folks considering buying HT-enabled processors, be warned that not everything will work when HT is enabled!
For one, burst!, my BitTorrent client simply crashes on start-up. I've been in contact with Intel about the issue, and after some initial jerking me around, I seem to have finally found a tech who's looking into the issue. Probably has something to do with my compiler (the crash offset is within the Delphi RTL).
My app is not alone, as others in this thread pointed out, hyperthreading can also trigger bugs in drivers..
DJ kRYPT's Free MP3s!
AnandTech did an excellent article on hyper threading a while back. Well written and worth reading.
When parallelism is introduced you run the risk of "process inversion". If the system runs high enough all of your execution units are working as fast as the slowest process no matter how fast the execution units can run.
The key to this effect is that the slowest execution unit is taking the most time forcing all other execution to wait on it. Other faster execution units must wait for one reason or another so they all appear to be as slow as the slowest.
In software you can try to soften the blow by bumping up the priority on the slower threads as they cross any critical sections with faster threads. In hardware the beast is a lot different. Doing a pure register calculation is fast. Loading a register from a cache is slower. Incurring a cache miss is even worse. If your system is running fast enough to incur many cache misses then it doesn't matter how fast your register operations are or how many CPUs are operating: they will all start to look as if they are missing the cache.
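To put rough numbers on the cache-miss point: a common back-of-the-envelope model is average access cost = hit rate x hit cost + miss rate x miss cost. The sketch below is Python, and the cycle counts (2 and 200) are made-up illustrative values, not measurements from any real CPU.

```python
def average_access_cycles(hit_rate, hit_cycles, miss_cycles):
    """Weighted average cost of one memory access, in cycles."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


# Hypothetical costs: 2 cycles for a cache hit, 200 for a trip to main memory.
for rate in (1.0, 0.99, 0.90, 0.50):
    print(rate, average_access_cycles(rate, 2, 200))
```

With these assumed costs, even a 90% hit rate gives an average of roughly eleven times the hit cost, so the misses dominate no matter how fast the register operations are.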
CPUs are plenty fast these days. The future problems all seem to be around I/O. There can be N execution cores in your system, but if there is only one "slow" memory bus then your system is going to be restricted hard. Looking into ways to speed up the memory-CPU bus cheaply would be of great use to any parallel system, far better than figuring out how to cram more and faster units into the box.
IBM will have SMT in the Power5. Their approach looks even better than Intel's, but part of that is the Power architecture and part of that is IBM learning from what Intel did. SMT is really the best way to get past the limiting reagent of modern processors: bandwidth.
Yeah, but PAE is an ugly hack. If you happen to have a Linux kernel source at hand, read what the help says about enabling it.
cheers.
``If a program can't rewrite its own code, what good is it?'' - Mel
and that, of course, is why the dishonest Zealots at that fruity computer company DISABLED hyperthreading when testing the latest Pentium against the G5. Of course, the single processor benchmarks--even WITHOUT HT--beat the G5 on integer performance, even with Apple's flawed benchmarks. Imagine what would have happened if they used the manufacturer's recommended compiler (Intel) and OS (XP) when they did their benchmarks!
...is that the used-car salesman knows when he's lying.
There's a really interesting philosophical point here, BTW. If you are chartered to know (or are pretending to know) something that you don't really understand, can you really claim that you didn't lie (because you didn't realize what you said was false), or do you have a responsibility to be correct if you offer yourself as an authority on a subject?
If you want to benchmark a hyper-threaded machine, a useful exercise is to run two different benchmarks simultaneously. Running the same one is the best case for cache performance; one copy of the benchmark in cache is serving both execution engines. Running different ones lets you see if cache thrashing is occurring. Or try something like compressing two different video files simultaneously.
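A rough way to try the "two different benchmarks at once" idea is sketched below in Python. Everything here is an assumption for illustration: the two workloads are toy loops, not real benchmarks, and the helper names (wall_time, compare) are made up. Interpreter overhead dominates these loops, so for real measurements you would launch actual benchmark binaries the same way.

```python
import subprocess
import sys
import time

# Two deliberately different toy workloads, passed to "python -c".
# They are stand-ins (assumptions), not real benchmarks.
INT_WORK = "x = 0\nfor i in range(2000000):\n    x = (x * 31 + i) % 1000003\n"
FLOAT_WORK = "x = 0.5\nfor i in range(2000000):\n    x = x * 0.999 + 0.25\n"


def wall_time(snippets):
    """Start one Python process per snippet at once; return seconds until all finish."""
    start = time.perf_counter()
    procs = [subprocess.Popen([sys.executable, "-c", s]) for s in snippets]
    for p in procs:
        if p.wait() != 0:
            raise RuntimeError("workload exited with an error")
    return time.perf_counter() - start


def compare(first, second):
    """Time each workload alone, then both running side by side."""
    return {
        "first_alone": wall_time([first]),
        "second_alone": wall_time([second]),
        "together": wall_time([first, second]),
    }
```

Try compare(INT_WORK, FLOAT_WORK) against compare(INT_WORK, INT_WORK). If "together" stays close to the larger of the two solo times, the pair overlapped well; if it creeps toward their sum, they are fighting over something. Note that the OS decides which logical CPUs the processes land on, so this says nothing by itself about where they ran.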
If you're seeing significant performance gains with real-world applications using a "hyper-threaded" CPU, that's a sign that the operating system's dispatcher is broken. And, of course, hyper-threading dumps more work on the scheduler. There's more stuff to worry about in CPU dispatching now.
Intel seems to be desperate for a new technology that will make people buy new CPUs. The Itanium bombed. The Pentium 4 clock speed hack (faster clock, less performance per clock) has gone as far as it can go. The Pentium 5 seems to be on hold. Intel still doesn't have a good response to AMD's 64-bit CPUs.
Remember what happened with the Itanium, Intel's last architectural innovation. Intel's plan was to convert the industry over to a technology that couldn't be cloned. This would allow Intel to push CPU price margins back up to their pre-AMD levels. For a few years, Intel had been able to push the price of CPU chips to nearly $1000, and achieved huge margins and profits. Then came the clones.
Intel has many patents on the innovative technologies of the Itanium. Itanium architecture is different, all right, but not, it's clear by now, better. It's certainly far worse in price/performance. Hyperthreading isn't quite that bad an idea, but it's up there.
From a consumer perspective, it's like four-valve per cylinder auto engines. The performance increase is marginal and it adds some headaches, but it's cool.
The Xeon has a slower clock, and yet outperforms the higher clock P4C. This is further evidence that MHz isn't everything.
The P4C has higher memory bandwidth (the FSB) yet slower performance. This shows that on-chip cache can be king over memory bandwidth too.
Some of my historic benchmarks fit completely in the 486's cache, so not all applications will benefit from more. Alien searches (SETI@Home) appear to benefit from large on-chip caches up to its resident set size (about 13 MB). The more the better. My current favorite production application has a resident set size of about 200 MB. It isn't clear that on-chip cache size makes much difference. It is clear that FSB bandwidth makes all the difference.
As always, the best benchmark is your application. Unfortunately, most of us can't run our favorite application on a variety of machines before buying one. I know I end up buying something that appears cost effective. This favors the low end processors, which at the moment favors AMD in the x86 world. I've been particularly impressed with the Athlon's memory bandwidth performance. My Athlon 1800+ (1.3 GHz) performs better than the 1.5 GHz P4s at work, primarily due to having more than double the memory bandwidth. It was also considerably cheaper. I feel as if I got a good deal. I personally have shown no brand loyalty, purchasing a chip from a different vendor each time.
-- Stephen.
Not to be specific about SMT. Assembly too hard? You people haven't heard of Forth, right? Just use ficl, or some other embeddable Forth, instead of assembler; it will save you lots of time. Better debugging too, since Forth is interactive.
I have a dual xeon 2.4GHz at home and I don't see any performance change when hyperthreading is enabled on WinXP. Hyperthreading is really a liability rather than an asset if a program tries to use only a single processor in hyperthreading mode rather than using the second processor. I can say that when I installed FreeBSD 5.1 with hyperthreading disabled it was significantly faster than when I installed it with hyperthreading enabled. I'd say that hyperthreading is more or less worthless at this point in time for multi-processor systems (at least until software properly recognizes it), though admittedly it does have its advantages in single processor configuration.
Hyperthreading is interesting, I agree, but I'd much prefer more affordable dual processor machines
Hyperthreading is not another form of dual processing. It gives you a nice boost in performance in many circumstances with very little hardware cost. The idea is that since a single thread of execution hardly ever makes full use of all the multiple instruction execution units, registers for renaming, and other CPU resources that you already have in your single (modern, superscalar) processor, you might as well make use of those resources by executing some of a second thread.
SMP, on the other hand, costs you an entire extra copy of every single component of the processor, and is thus more expensive. There's additional design complexity in the memory architecture you have to pay for. Not to mention integration issues such as board complexity, heat, manufacturing, cost. A dual processor box is going to cost more than a single processor box, all else being equal. The interesting question becomes at what point two older, simpler, cheaper processors, plus the cost of SMP, become more cost-effective than one newer, complex, faster, expensive processor.
Once you have an SMP box, you then have the problem of how to schedule threads on your processors. One of the interesting things about processor development over the past ten years is that the hardware guys pretty much bypassed all those people doing work on trying to build parallelizing compilers that could find parallelism within a traditional sequential program in order to generate faster code for an SMP setup. A lot of the real estate on the chip these days goes for hardware that automatically seeks out parallelism in the program and dispatches instructions accordingly. SMP with simpler processors is in some ways a step back, as you revert to having software -- or the system designer -- having to notice the parallelism and design for a particular system, rather than having the hardware find the parallelism for itself. There's a huge area for research awaiting methods for discovering parallelism at different levels of abstraction within a system, and developing hardware and software that can best take advantage of that parallelism.
HT is an idea that sits nicely between a single processor system and a full-up SMP arrangement. It's not a replacement for SMP, as one HT processor doesn't have a complete duplicate set of resources. But it does make better use of what you have, and runs (most) code faster -- which is the point, right?
Dhrystone and Whetstone should show almost no difference in performance with or without Hyperthreading. HT just allows the superscalar, superpipelined processor to stick multiple threads on the same processor at the same time.
So what may be interesting would be to run both Dhrystone and Whetstone at the same time, seeing as then you'd be using both the ALU and the floating point unit. That should show a large difference in performance with and without HT.
i could not think of anything clever.
This brings back memories of Sega's "blast processing" and Nintendo's "FX chip". Just a bunch of marketing, and a little smoke and mirrors added in for variety. Sort of like Intel's P4 "Expensive Edition" that does diddly squat in terms of performance gains.
I hate sigs.
If you have two threads that want to run together -- say your program and the OS itself -- time-slicing between them so that both get service involves a context switch each time, which is an expensive, time-consuming operation on x86 processors.
But with HT/SMT running, each thread can operate on one logical processor much longer without interruption. Given the multi-threaded nature of many OS's today, this alone should be a significant advantage that never seems to get mentioned in articles on HT/SMT.
It's that other coolness factor of only running 1 CPU, with its associated lower power dissipation and cooling requirements.
Excuse me, BUT...that's one P4C against two SMT/HT Xeons. That's what the (x2) means after the Xeon name.
How much wood can a woodchuck chuck if a woodchuck could chuck wood?
Life is not for the lazy.
I first heard of SMT/HT in the mid 80s in a machine called the HEP (Heterogeneous Element Processor) designed by Burton Smith. I think this was at a company called Denelcor. Smith has been working with this concept since then and later on founded Tera Computer. Tera finally bought into the remains of Cray and is now one of the companies that either calls itself Cray or has a product named after Cray. (I remember taking naps on the seat of a Cray XMP. The power supplies were in the seat and it was a warm place in the computer room.)
An unbalanced workload is bad? That doesn't seem right to me.
With nothing but a quick impression, it seems that HT might be better at an unbalanced workload than an SMP machine. This is because with SMP, everything on the underutilized processor sits idle.
It would seem that HT would end up dedicating all functional units (outdated terminology?) to the thread that has the heavy load. Thus you can get better use of the functional units by moving them back and forth between threads as needed, at least until you have a cache miss in the busy thread.
plus-good, double-plus-good
The problem with emulation is that everything being emulated is processed on the main CPU. For example, the SNES has dedicated processors for logic, video, and audio. But when you emulate the SNES on a computer, the main CPU is emulating everything. Then, the respective results are exported to the video card and audio. So in a nutshell, the video card in your PC is not doing any kind of work directly relating to the SNES video (just acting as a frame buffer really).
But, I do find it interesting that the N64 was not only emulated, but if you had a 3D video card, there was a glide wrapper that actually accelerated emulated 3D functions of the N64. It's almost like distributed emulation as far as the hardware is concerned in your PC.
Life is not for the lazy.
Not true at all! RISC refers to the instruction set, not the internal architecture. Even the earliest RISC processors to carry that name included pipeline interlocks -- it was the simplicity of RISC that made such techniques feasible, especially at the chip densities of the 80's.
There's a lot of confusion about what RISC means. Look up a computer architecture textbook. RISC is somewhat fuzzy, and most chips bend the edges of the definitions in places. The general operating principle is "reduced," and herein lies the ambiguity, since this is relative to the technology of the day. (A "RISC" Alpha made in the 90's has more opcodes than a "CISC" 8086 made in 1978.) But RISC processors typically have the following properties:
CISC used to mean that many or most instructions were implemented in microcode on the processor.
Again, no. CISC means supporting many different kinds of operations directly in hardware. This was especially appealing in the days when back-end compiler code generation wasn't very good, so CISC often meant a simple 1-for-1 translation from high-level constructs to machine opcodes. The ISA complexity usually meant microcode was the best approach, but this was not part of the definition.
CPU instruction sets are always designed around the software that will run on them. CISC instruction sets were popular because it made assembly programming possible; RISC only gained in popularity when compilers got good enough to produce optimal code.
Just my 2c, but the guy doesn't really have a clue how HT works, and it's reflected all the way through the article. His choice of benchmarks is one of the most obvious flaws, like comparing AMD vs Intel on an AMD-optimized codebase.
GPLv2: I want my rights, I want my phone call! DRM: What use is a phone call, if you are unable to speak?
I've had no problems with any programs including BitTorrent since buying my P4C 3.0GHz. And I do microcontroller programming, which can cause timing related problems when using the COM and LPT ports for programming/communication.
The main cause of HT and driver bugs is sloppy programming. An assumption was made that turns out not to be true, typically a timing problem. This is not the fault of the hardware, but rather of programmers being lazy, or not knowing better.
All bugs that I have read about being fixed, were due to the programmer not following best practice as laid out by Intel whitepapers.
[see title]
GPLv2: I want my rights, I want my phone call! DRM: What use is a phone call, if you are unable to speak?
Maybe my terminology confused you. With a "different" workload we can have very good parallelism and can exploit it, since we can assume independent threads will have independent instructions. Intel's HT works by dividing up resources between threads, so let's assume we have 4 threads (call them t1, t2, t3 and t4) and that t1 is very busy while the rest are not. The resources are split 4 ways, but t1 will be queuing up (because it is busy) while the resources given to t2, t3 and t4 sit idle. This will also cause a lot of cache misses and impact the other threads. The catch is that we now have to think up thread selection policies, which can be based on numerous criteria like count, cache miss ratio, branch prediction count, etc. This is the area where I think the research is now concentrated. I guess you will have a better idea once you read the white papers on Intel's HT.
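One way to picture a "thread selection policy" is a function that looks at per-thread counters each cycle and picks which thread gets to fetch next. The sketch below is a toy in Python, not how Intel's hardware does it. The ThreadState fields and the "fewest instructions in flight, then fewest outstanding misses" rule are assumptions chosen for illustration (the first rule resembles the ICOUNT idea from the SMT research literature).

```python
from dataclasses import dataclass


@dataclass
class ThreadState:
    # All fields are hypothetical counters for this toy model.
    thread_id: int
    in_flight: int           # instructions fetched but not yet retired
    outstanding_misses: int  # cache misses still waiting on memory
    ready: bool = True       # False if the thread has nothing to fetch


def pick_thread(threads):
    """Return the id of the thread to fetch from next, or None if none is ready."""
    candidates = [t for t in threads if t.ready]
    if not candidates:
        return None
    # Prefer the least backed-up thread; break ties on misses, then on id.
    best = min(
        candidates,
        key=lambda t: (t.in_flight, t.outstanding_misses, t.thread_id),
    )
    return best.thread_id
```

With t1 backed up (many instructions in flight), this rule steers fetch slots toward the lighter threads so the busy one cannot clog the shared queues. A real design would weigh other signals too.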
If you invoke a synchronization primitive that blocks, of course you are going to yield to another thread, whether it is on another processor or not. If you do Sleep(1) (I misspoke calling it Wait() -- it is called Sleep()), that blocks because you have to wait 1 ms. If you do Sleep(0), that is supposed to block for the remainder of the time slice, but it doesn't on SMP/hyperthreaded systems, and I wonder how many people out there have also lost hours of sleep because MS wouldn't document this.
Why in god's name would you link that picture? That is just utterly hideous and disgusting. You have a demented sense of humor..then again my stupidity for not noticing Tiny URL after the link. But then again it was very misleading. ugggg...I will have nightmares for weeks.
The strange thing is that the resulting assembly code doesn't seem to be much different or particularly inefficient --- both gcc's and icc's code are a long stream of addps, mulps and movaps instructions, and since the evaluation order is made explicit in the C code, dependency should not be much of a problem. The working set fits comfortably inside the L2 cache, but L1 cache is expected to thrash a little. I can't see why this code can be that inefficient.
Similar things happened when I was hand-optimizing an IIR filter for icc8. The speed is quite decent (about 7Gflops in the inner loop), but after I changed "a=b+c+d" to "a=d+b+c" (since d is calculated first, I think this should at least not hurt), speed mysteriously halved. The assembly code doesn't look much different at a glance, either.
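One thing worth keeping in mind with reorderings like this: in C, a=b+c+d means (b+c)+d and a=d+b+c means (d+b)+c, and floating-point addition is not associative, so without a fast-math style option the compiler has to treat them as different computations. That does not by itself explain a 2x slowdown, but it is why the two forms need not compile to the same instruction schedule. A quick demonstration with IEEE doubles (Python here just for brevity; the values are picked to make the rounding visible):

```python
b, c, d = 1e16, -1e16, 1.0

left = b + c + d   # evaluated as (b + c) + d
right = d + b + c  # evaluated as (d + b) + c; d is rounded away when added to 1e16

print(left, right)  # prints 1.0 0.0
```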
The last two cases look similar. I guess the P4 may have much degraded performance when the reorder buffer fills up or something. Anyway, this at least shows that even icc (as of now) does not give reliable performance. If you want the absolute highest performance, make sure you always keep an eye on the benchmark results.
Of course, icc has automatic vectorization while gcc doesn't, and this is the most important reason why icc often beats gcc 2:1 in some floating-point benchmarks. However, in my case the most time-consuming loops are invariably too complicated for icc to parallelize automatically (one for a custom DCT algorithm, one for a 4th order IIR filter), so I still have to vectorize them by hand.
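For anyone wondering why an IIR filter resists automatic vectorization: each output sample depends on previous outputs, so the loop carries a dependency from one iteration to the next. A plain scalar direct-form sketch in Python (not the poster's code; the function name and coefficient layout, with a[0] assumed to be 1, are my own choices for illustration):

```python
def iir_filter(b, a, x):
    """Direct-form IIR: y[n] = sum(b[k]*x[n-k]) - sum(a[k]*y[n-k] for k >= 1).

    Assumes a[0] == 1 (normalized coefficients).
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        # Feed-forward part: depends only on inputs.
        for k, bk in enumerate(b):
            if n - k >= 0:
                acc += bk * x[n - k]
        # Feedback part: needs earlier outputs (loop-carried dependency).
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


# First-order example: y[n] = x[n] + 0.5 * y[n-1], fed a unit impulse.
print(iir_filter([1.0], [1.0, -0.5], [1.0, 0.0, 0.0, 0.0]))
```

The feed-forward part only reads inputs, so it maps onto SIMD easily; the feedback part is what usually forces the hand work.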
But the parent poster's 50:1 ratio does seem strange.