Hyper-Threading Explained And Benchmarked

Interesting. by Anonymous Coward · 2004-01-06 20:58 · Score: 5, Informative

There was an interesting discussion on the Plan9 newsgroup about hyperthreading recently, read here

Intel's Whitepaper by Cebu · 2004-01-06 21:02 · Score: 5, Informative

For those more technically inclined I would suggest reading Intel's Hyper-Threading Technology Architecture and Microarchitecture whitepaper instead.

Re:Intel's Whitepaper by arkanes · 2004-01-07 00:45 · Score: 4, Informative

Ars Technica has one also - less technical than the Intel paper but very accessible and with pretty colored diagrams.

For the real technical details by photonic · 2004-01-06 21:14 · Score: 5, Informative

The article claims to talk about the technical details of hyperthreading. At first glance, however, it seems more like yet another article in the series "Athlon beats Pentium at Doom by 1/2 frame per second".

If you are really interested in the how and why of hyperthreading, I suggest you read through the lecture notes of Computer System Architecture at MIT OpenCourseWare. This gives you enough background to race through all the articles at Ars Technica et al.

--
karma police: arrest this man, he talks in maths; he buzzes like a fridge, he's like a detuned radio. [radiohead]

Re:Ever buy a car with auto-everything? by BlueBiker · 2004-01-06 21:28 · Score: 5, Informative

Well, Intel is already encountering heat problems which limit how fast they can crank the clock speed. Hyperthreading is a moderately successful attempt to make use of the available execution units on the chip which would otherwise sit idle. It's also not so new and untested: it was implemented but not enabled on earlier P4 steppings.

Athlon and Athlon64 are generally better able to make use of their execution units, and wouldn't benefit from HT as much as P4/Xeon.

YHBT HAND! by TheMidget · 2004-01-06 21:38 · Score: 4, Informative

Indeed, you've bitten on the following hooks:

FDIV error: yes, it was division, not addition. However, conditions were far less specific than Intel would have liked us to believe...
CISC vs RISC: you correctly pointed out that Pentiums still are CISC (even though they nowadays have a RISC core)

And you've missed the following hooks:

CAFEBABE: that's Java's magic number. The code that used to lock up Pentium II's was F00FC7C8
Hyperthreading and the OS's job: no, hyperthreading does not do something which the OS normally would do. It just pretends that there is a second processor. The OS is still responsible for assigning threads to both virtual processors, just like it would do with two real processors!
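
As a rough illustration of the "pretends that there is a second processor" point, here is a minimal Python sketch. It assumes a Python 3 interpreter; the helper name `logical_cpus` is made up for this example. `os.cpu_count()` reports logical processors, so each Hyper-Threading sibling is counted as its own CPU, and the OS scheduler treats it as one.

```python
import os

def logical_cpus() -> int:
    """Number of logical processors the OS exposes (1 if it cannot tell)."""
    # os.cpu_count() counts logical CPUs, so a Hyper-Threading sibling
    # shows up here exactly like a second real processor would.
    return os.cpu_count() or 1

if __name__ == "__main__":
    print(f"The OS schedules threads onto {logical_cpus()} logical CPU(s)")
```

On a single HT-enabled P4 this would typically print 2, even though there is only one physical core.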

Note to moderators: mod grand-parent down. It is obviously a troll (albeit a rather well-written troll!). If you absolutely must mod it up, at least use Funny rather than Interesting.

Re:Celery by turgid · 2004-01-06 22:09 · Score: 4, Informative

A Celeron is much cheaper than a P4 with the hyperthreading

So it is, and it's not all that fast either. Then again, you shouldn't believe all that you read on the Intarweb.

--
Stick Men

Re:Celery by Anonymous Coward · 2004-01-06 22:09 · Score: 1, Informative

Are you kidding?? This review linked to from /. a few weeks ago shows that a 1.8 GHz Athlon XP easily beats the 2.6 GHz Celeron in the DivX encoding test. With their 128 KB L2 cache (384 KB less than a P4) the Celerons just can't keep up with the P4. And the lower-end P4s can't keep up with the Athlon XPs. Celerons are a complete waste of money, IMO.

Everything I know about Hyperthreading... by obergeist666 · 2004-01-06 22:27 · Score: 5, Informative

... I learned from this article.

Re:SMT by at_18 · 2004-01-06 22:57 · Score: 1, Informative

A short but informative article about SMT is on Wikipedia

Yup, all over the place... by DerProfi · 2004-01-06 23:37 · Score: 2, Informative

This guy can't even calculate his percentages correctly, so I wonder what else might be screwed up in his analysis?

If X is the lower number and Y is the higher number, he's figuring his percentage increases as (Y-X)/Y instead of (Y-X)/X .

Or is this some kind of "New New Math" that they started teaching in the 10 years since I graduated?
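
To make the difference between the two formulas concrete, here is a small Python sketch. The scores 80 and 100 are hypothetical numbers picked for easy arithmetic, not figures from the article, and both function names are made up for this example.

```python
def percent_increase(low: float, high: float) -> float:
    """Increase from low to high, relative to the starting value low."""
    return 100.0 * (high - low) / low

def percent_of_higher(low: float, high: float) -> float:
    """The mistaken version: the same difference divided by the higher value."""
    return 100.0 * (high - low) / high

# Hypothetical benchmark scores: 80 without HT, 100 with HT.
print(percent_increase(80, 100))   # 25.0
print(percent_of_higher(80, 100))  # 20.0
```

Dividing by the higher number always understates the gain, and the gap grows with the size of the improvement.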

--

3000+ comments meta-modded. 0 mod points awarded.
Lesson for other meta-suckers: Don't believe the hype!

Re:Celery by JamesP · 2004-01-07 00:31 · Score: 1, Informative

Except that the Celery 2.6 gets its ass kicked pretty badly by a 1.6 Duron

See benchmark at Anandtech Budget Shootout

--
how long until /. fixes commenting on Chrome?

Cost by Imperator · 2004-01-07 01:04 · Score: 1, Informative

Cost, cost cost. Cost cost cost cost cost, cost cost cost cost cost cost. Cost cost cost--cost! Cost cost cost, cost cost cost cost cost cost...cost cost. Cost cost "cost" cost cost cost cost cost cost cost cost. Cost cost cost cost cost COST cost cost.....

The lameness filter blows. Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

--

Gates' Law: Every 18 months, the speed of software halves.

Hyper-threading explained in 300 words or less. by Anonymous Coward · 2004-01-07 01:09 · Score: 4, Informative

When a process stalls because it is trying to access memory that is not loaded into the cache, it sits idle while the data is retrieved from the much slower main memory. If you can store two process contexts on the CPU instead of just one, whenever one process stalls on a memory read, the CPU can quickly switch to the other context, which is waiting to run.

I can't remember the name of the machine, but one parallel shared-memory machine used this exclusively. The CPU had 128 process contexts and would switch through them in order. The time between subsequent activations of each context was great enough that data could be fetched from main memory and loaded into a register. This eliminated cache coherency problems (no cache!) and all delays related to memory fetching.

A P4 with hyperthreading is a simplified and much more practical version of that machine.
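
The switch-on-stall idea above can be sketched as a toy cycle-by-cycle simulation in Python. This is only a model under made-up assumptions: every context does `work` cycles of computation, then waits `miss_latency` cycles for memory, and switching contexts is free. It is not a description of how the P4 actually issues instructions, and the function name and numbers are invented for this example.

```python
def utilization(contexts: int, work: int, miss_latency: int,
                cycles: int = 10_000) -> float:
    """Fraction of cycles in which the execution unit did useful work."""
    remaining = [work] * contexts  # work cycles left before the next miss
    ready_at = [0] * contexts      # cycle at which each context can run again
    busy = 0
    for now in range(cycles):
        # Pick the first context that is not waiting on memory this cycle.
        for ctx in range(contexts):
            if ready_at[ctx] <= now:
                busy += 1
                remaining[ctx] -= 1
                if remaining[ctx] == 0:
                    # Cache miss: this context sleeps until the data arrives.
                    remaining[ctx] = work
                    ready_at[ctx] = now + 1 + miss_latency
                break
    return busy / cycles

if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "context(s):", utilization(n, work=4, miss_latency=12))
```

With 4 cycles of work per 12-cycle miss, one context keeps the unit busy a quarter of the time, two contexts half the time, and four contexts hide the latency completely. That last case is the trick the 128-context machine relied on.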

Re:Capsule summary. by msgmonkey · 2004-01-07 01:26 · Score: 2, Informative

The only way "better caches" will improve SMT is if you had one cache for each thread; however, with that kind of configuration you basically end up with two cores on one chip.

The original thinking behind SMT was that with cache and branch prediction misses starting to have very large penalties, switching to an alternate thread would result in a significant performance increase.

It turns out, however, that doing context switching at this ultra-fine granularity causes the cache miss rate to go up as each thread fights for the cache.

To get the best out of it the second thread would have to either "lock down" some cache lines and be doing either mainly ALU-intensive operations or using streamed memory that would not be cached. This, however, ends up limiting SMT to some pretty special-case programming situations.
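
The "threads fight for the cache" effect can be shown with a toy model in Python. The assumptions are loud ones: a single fully associative LRU cache of 8 lines, two threads that each loop over 6 lines of their own, and strict access-by-access interleaving. Real caches are set-associative and real access patterns are messier; the function name and sizes are invented for this sketch.

```python
from collections import OrderedDict
from typing import Sequence

def miss_rate(streams: Sequence[Sequence[int]], capacity: int,
              rounds: int = 100) -> float:
    """Miss rate of one shared LRU cache with the streams interleaved."""
    cache = OrderedDict()
    misses = accesses = 0
    for _ in range(rounds):
        # zip interleaves the threads: one access from each, in turn.
        for group in zip(*streams):
            for line in group:
                accesses += 1
                if line in cache:
                    cache.move_to_end(line)        # mark as most recently used
                else:
                    misses += 1
                    cache[line] = None
                    if len(cache) > capacity:
                        cache.popitem(last=False)  # evict least recently used
    return misses / accesses

thread_a = list(range(0, 6))      # six cache lines
thread_b = list(range(100, 106))  # six different cache lines

if __name__ == "__main__":
    print("A alone:     ", miss_rate([thread_a], capacity=8))
    print("A and B mixed:", miss_rate([thread_a, thread_b], capacity=8))
```

Alone, either working set fits and only the first pass misses. Interleaved, the combined twelve lines no longer fit in eight, and under LRU every single access evicts a line the other thread is about to need.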

Re:From the article: by Glonoinha · 2004-01-07 02:38 · Score: 4, Informative

How about two people in moderate shape being able to push more wood through a single wood chipper than a single person who is in great shape (assuming the wood is piled up 18 feet away = cache miss).

The single wood chipper, being analogous to the actual processing part of the core, is only going to be able to shred so much wood, but if two people fetching wood from the woodpile can keep it running at 100% capacity they will shred more wood than a single guy running back and forth to the wood pile by himself.

--
Glonoinha the MebiByte Slayer

Re:bad programming ... by Anonymous Coward · 2004-01-07 03:07 · Score: 1, Informative

i'm not 100% sure bout this but i just got da fishy feeling that hyper threading really is just to make life easier for novice/beginner programmer to write programs in "high" level languages (say Vbasic, or just basic ;) ) that can compete in performance to programs written by cracks, say in assembler or C / C++.

Good programmers don't write programs in assembly. They pick good compilers and know the correct optimizations. Even if they could beat the best compilers, the code wouldn't be portable. That is bad programming except in very rare cases. Good compilers exist for languages like C/C++ to take advantage of multithreading. (icc)

The sound of software breaking by Latent+Heat · 2004-01-07 03:08 · Score: 2, Informative

OK, you are doing all this calculation in another thread, but you have to somehow synchronize with the GUI thread (PostMessage under Windows). If your calculation thread were to run faster than your GUI thread (GUI doing a lot of screen updating), you would get these PostMessages clogging up your GUI thread message queue because WM_PAINT is of very low priority (so frequent paints don't lock out key and mouse clicks).

In the old single-processor days, your calc thread could do a Sleep(0) -- according to the Windows docs, this yields all of the calc thread's remaining time slice to blocked threads, like the GUI thread holding WM_PAINT in its queue. In these modern hyperthreaded times (I imagine true SMP works the same way), Sleep(0) does nothing because the calc thread does not block when the GUI thread is on another virtual or real processor, and the screen updates gum up and get all blocky.

The solution I use is that when the GUI thread services a PostMessage from the calc thread, it runs the message pump to check for and dispatch WM_PAINTs -- a kludge to give the PostMessage from the calc thread lower priority than WM_PAINT. But in the meantime I am cursing a blue streak that MSDN cannot document that Sleep(0) is essentially meaningless with more than one processor, and I have spent two weeks tearing my hair out about what is going on.
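
A different way to keep a fast calc thread from flooding the GUI queue is to coalesce updates. To be clear, this is not the message-pump fix described above but a swapped-in technique, sketched in Python rather than against the Win32 API. The class name `LatestValue` and its methods are hypothetical: the calc thread overwrites a single slot and only needs to notify the GUI when no notification is already outstanding.

```python
import threading
from typing import Optional

class LatestValue:
    """Single-slot mailbox: a new post overwrites any update not yet consumed."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._value: Optional[object] = None
        self._pending = False

    def post(self, value: object) -> bool:
        """Called by the calc thread. True means the GUI needs a wake-up."""
        with self._lock:
            was_pending = self._pending
            self._value = value
            self._pending = True
            return not was_pending

    def take(self) -> Optional[object]:
        """Called by the GUI thread. Newest update, or None if there is none."""
        with self._lock:
            value, self._value = self._value, None
            self._pending = False
            return value

if __name__ == "__main__":
    box = LatestValue()
    # A calc thread racing ahead while the GUI is busy painting:
    wakeups = sum(box.post(i) for i in range(1000))
    print(wakeups, box.take(), box.take())
```

In a Windows program, the place where `post` returns True is where a single PostMessage would go, so at most one calc message sits in the GUI queue no matter how far ahead the calc thread runs.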

HT Technology by sameerdesai · 2004-01-07 03:23 · Score: 3, Informative

I have some insight into this technology as I was part of a research group researching SMT. It is a really cool technology that exposes instruction-level parallelism (ILP) and increases performance. The basic HT technology for the processor, however, distributes the resources. The details of Intel HT are available at http://www.intel.com/technology/hyperthread/ and you can also find the associated whitepapers there. Now the catch is that the application has to be multithreaded. You can't just buy an HT processor, run a single-threaded application, and expect performance to improve. The performance benefits come when the optimal number of threads is used. With too few threads, resources are wasted. With too many, they queue up and cause bottlenecks. The other thing that can affect performance is an unbalanced workload, which can leave threads unable to exploit the parallelism. This is a new technology, a lot of research is going on in this area, and it looks really promising.

Re:SMT by jtshaw · 2004-01-07 04:12 · Score: 3, Informative

That is totally true. Processor-specific microcode optimizations are definitely the compiler's job. But you have to concede the fact that the compiler can only do so much. If the programmer doesn't choose a good method of solving the problem at hand there isn't much a good compiler can do to optimize the code, especially if the problem being solved is complex.

Compilers simply can't be asked to pick up the slack for programs written with a poor logical flow. They can't be asked to figure out a completely different and improved algorithm for solving a complex problem they don't completely understand the parameters for.
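
A small Python illustration of the point, with a made-up task (checking a list for duplicates) and made-up function names. Both functions give the same answer, but no optimizer is going to turn the first into the second, because that requires knowing what the code is for.

```python
from typing import Sequence

def has_duplicates_quadratic(items: Sequence[int]) -> bool:
    """Compare every pair: about n*n/2 comparisons in the worst case."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicates_linear(items: Sequence[int]) -> bool:
    """Remember what has been seen: one set lookup per element."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False

if __name__ == "__main__":
    data = list(range(2000)) + [1999]
    print(has_duplicates_quadratic(data), has_duplicates_linear(data))
```

For a few thousand elements the first version already does millions of comparisons while the second does a few thousand lookups, which is a bigger win than any instruction-level tuning of the inner loop.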

Re:Jim Kirk by GigsVT · 2004-01-07 04:18 · Score: 2, Informative

You are thinking of James T Kirk... See this is James R Kirk. :)

--
I've had enough abrasive sigs. Kittens are cute and fuzzy.

AnandTech on Hyperthreading by glinden · 2004-01-07 04:46 · Score: 3, Informative

AnandTech did an excellent article on hyper threading a while back. Well written and worth reading.

Re:SMT by John+Courtland · 2004-01-07 04:56 · Score: 2, Informative

What you wrote here is almost verbatim what Michael Abrash said in his book "Zen of Code Optimization". Dr. Dobb's Journal actually offered it up for free in PDF format at one point; I can only hope to find it amongst my mass of CDs.

Smart code will do more for you than hand optimized assembler, unless you already have written smart code.

--
Slashdot is proof that Sturgeon's Law applies to mankind.

IBM Will Do SMT Right by fupeg · 2004-01-07 05:06 · Score: 3, Informative

IBM will have SMT in the Power5. Their approach looks even better than Intel's, but part of that is the Power architecture and part of that is IBM learning from what Intel did. SMT is really the best way to get past the limiting reagent of modern processors: bandwidth.

Re:HT and VMWare: perfect together! by mixmasta · 2004-01-07 05:28 · Score: 2, Informative

Also, make sure to set the VMs to low priority when you are not in the window; it makes a huge difference in system response, even without HT.

-Mike

--
#6495ED - cornflower blue

"hyper-threading" vs. cache size by Animats · 2004-01-07 06:10 · Score: 4, Informative

The basic problem with hyperthreading is, of course, memory bandwidth. CPUs today are memory-bandwidth starved. 30 years ago, CPUs got about one memory cycle per instruction cycle. Since then, CPUs have sped up by a factor of about 1000, but memory has only sped up by a factor of 30 or so. The difference has been papered over, very successfully, with cache. The cache designers have accomplished more than seems possible. Compare paging to disk, which is a form of caching that hasn't improved much in decades.

If you want to benchmark a hyper-threaded machine, a useful exercise is to run two different benchmarks simultaneously. Running the same one is the best case for cache performance; one copy of the benchmark in cache is serving both execution engines. Running different ones lets you see if cache thrashing is occurring. Or try something like compressing two different video files simultaneously.
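
A minimal harness for that experiment, sketched in Python. The two workloads are stand-in one-liners chosen for this example, not real benchmarks, and `run_together` is a made-up helper; for serious numbers you would substitute actual benchmark binaries and repeat the runs.

```python
import subprocess
import sys
import time
from typing import Sequence

# Stand-in workloads (assumptions for illustration only).
INTEGER_JOB = "sum(i * i for i in range(1_000_000))"
FLOAT_JOB = "import math; sum(math.sqrt(i) for i in range(1_000_000))"

def run_together(jobs: Sequence[str]) -> float:
    """Start one Python process per job, wait for all, return wall seconds."""
    start = time.perf_counter()
    procs = [subprocess.Popen([sys.executable, "-c", job]) for job in jobs]
    for proc in procs:
        if proc.wait() != 0:
            raise RuntimeError("benchmark process failed")
    return time.perf_counter() - start

if __name__ == "__main__":
    serial = run_together([INTEGER_JOB]) + run_together([FLOAT_JOB])
    mixed = run_together([INTEGER_JOB, FLOAT_JOB])
    same = run_together([INTEGER_JOB, INTEGER_JOB])
    print(f"back to back: {serial:.2f}s  mixed pair: {mixed:.2f}s  "
          f"same pair: {same:.2f}s")
```

Comparing the mixed pair against the back-to-back time shows how much the two logical processors really overlap; comparing it against the same-job pair hints at how much sharing one copy of the code in cache was helping.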

If you're seeing significant performance gains with real-world applications using a "hyper-threaded" CPU, that's a sign that the operating system's dispatcher is broken. And, of course, hyper-threading dumps more work on the scheduler. There's more stuff to worry about in CPU dispatching now.

Intel seems to be desperate for a new technology that will make people buy new CPUs. The Itanium bombed. The Pentium 4 clock speed hack (faster clock, less performance per clock) has gone as far as it can go. The Pentium 5 seems to be on hold. Intel still doesn't have a good response to AMD's 64-bit CPUs.

Remember what happened with the Itanium, Intel's last architectural innovation. Intel's plan was to convert the industry over to a technology that couldn't be cloned. This would allow Intel to push CPU price margins back up to their pre-AMD levels. For a few years, Intel had been able to push the price of CPU chips to nearly $1000, and achieved huge margins and profits. Then came the clones.

Intel has many patents on the innovative technologies of the Itanium. Itanium architecture is different, all right, but not, it's clear by now, better. It's certainly far worse in price/performance. Hyperthreading isn't quite that bad an idea, but it's up there.

From a consumer perspective, it's like four-valve per cylinder auto engines. The performance increase is marginal and it adds some headaches, but it's cool.

Re:"hyper-threading" vs. cache size by Brandybuck · 2004-01-07 06:47 · Score: 4, Informative

If you're seeing significant performance gains with real-world applications using a "hyper-threaded" CPU, that's a sign that the operating system's dispatcher is broken. And, of course, hyper-threading dumps more work on the scheduler. There's more stuff to worry about in CPU dispatching now.

That was my suspicion. Hyperthreading can't be much more efficient than threading via the OS, unless the software is specifically compiled for it, or you use a scheduler specific to hyperthreading. Scheduling work STILL has to be performed, and hyperthreading STILL isn't parallel processing. So where are these performance improvements people are seeing coming from?

I'm not using Linux, but FreeBSD. When I got my new HT P4, I considered turning it on. Then I read the hardware notes. Since FreeBSD does not use a scheduler specific for hyperthreading, it can't take full advantage of it. In some cases it might even result in sub-optimal performance. Just like logic would lead you to think.

The OS cannot treat hyperthreading the same as SMP, because they are two different beasts.

--
Don't blame me, I didn't vote for either of them!

Assembly sucks? by dmelomed · 2004-01-07 07:14 · Score: 2, Informative

Not to be specific about SMT. Assembly too hard? You people haven't heard of Forth, right? Just use ficl, or some other embeddable Forth, instead of assembler; it will save you lots of time. Better debugging too, since Forth is interactive.

Synthetic Benchmarks and HT by OppressiveGiant · 2004-01-07 07:41 · Score: 2, Informative

Dhrystone and Whetstone should show almost no difference in performance w/ or w/o Hyperthreading. HT just allows the superscalar, superpipelined processor to stick multiple threads on the same processor at the same time.

So what may be interesting would be to run both Dhrystone and Whetstone at the same time, seeing as you'd then be using both the ALU and the floating point unit. That should show a large difference in performance w/ vs. w/o HT.

--
i could not think of anything clever.

Please get your terms straight! by Prof.+Pi · 2004-01-07 10:35 · Score: 2, Informative

The RISC concept, implemented in CPUs like the MIPS R3000, originally meant very simple hardware without pipeline interlocks, instruction schedulers, or more than an absolute bare-bones set of instructions.

Not true at all! RISC refers to the instruction set, not the internal architecture. Even the earliest RISC processors to carry that name included pipeline interlocks -- it was the simplicity of RISC that made such techniques feasible, especially at the chip densities of the 80's.

There's a lot of confusion about what RISC means. Look up a computer architecture textbook. RISC is somewhat fuzzy, and most chips bend the edges of the definitions in places. The general operating principle is "reduced," and herein lies the ambiguity, since this is relative to the technology of the day. (A "RISC" Alpha made in the 90's has more opcodes than a "CISC" 8086 made in 1978.) But RISC processors typically have the following properties:

Limited addressing modes (typically register-register, loads and stores only, maybe with some variants like autoincrement)
Relatively simple instruction formats (often all instructions are the same size)
Emphasis on general instructions rather than specialized instructions with limited applicability (such as string ops)

CISC used to mean that many or most instructions were implemented in microcode on the processor.

Again, no. CISC means supporting many different kinds of operations directly in hardware. This was especially appealing in the days when back-end compiler code generation wasn't very good, so CISC often meant a simple 1-for-1 translation from high-level constructs to machine opcodes. The ISA complexity usually meant microcode was the best approach, but this was not part of the definition.

Re:Celery by ktulu1115 · 2004-01-08 05:44 · Score: 2, Informative

Background: I've used single CPU systems, HT systems, and SMP systems. I've taken courses on OS design and am even in the process of writing my own OS. I'm quite familiar with the 80x86 32-bit instruction set and aware of the new 64-bit design as planned by AMD.

My $0.02 (this is GREATLY SIMPLIFIED)

In the beginning there were CPUs. And CPUs were good.
Soon we realized the limitations and said... Hey! Why not add another CPU? And SMP was born.

SMP was good as well, however the additional cost was something of a deterrent for all but the power-users (and commercial applications of course).

Then Intel tried to develop a middle ground, HyperThreading. It was a decent idea, but it did not work quite as well as originally expected. AMD does not use it for a reason.

From my experience I see HT as a hack developed by Intel, trying to duplicate true SMP. It might work sometimes and in certain environments, but it's been shown to actually slow execution in some situations (cache thrashing). In addition, SMP systems have much better responsiveness than HT ones under a high CPU load.

Which is why AMD is working on multi-core CPUs. This is the *correct* way (at least in my opinion) to tackle the problem, aside from getting true multiple CPUs. More can be read about it here. This combined with the new 64-bit instruction set (read more about that at the above link) will truly create a new era of CPUs.

--
# fuser -v /dev/attention | grep work
#

Slashdot Mirror
