Hyper-Threading Explained And Benchmarked

SMT by Gary+Whittles · 2004-01-06 20:58 · Score: 2, Troll

Simultaneous Multithreading (SMT) is not a new idea, although no one to my knowledge has implemented it yet. Intel just calls it "Hyperthreading"...it is essentially SMT.

And yes, this is a very good idea. A modern superscalar out-of-order processor, like the Athlon and Pentium Pro (and later), can issue and retire multiple instructions per clock cycle. However, it can *only* do this if there is enough instruction-level parallelism (ILP). Turns out, there is not enough ILP in current programs to take full advantage of the chip's processing capabilities. Issue slots and function units go unused due to dependencies in the program and cache misses that stall the processing. A typical processor can only look at about 32 instructions at a time. This is not a large enough window to execute future instructions out-of-order when such a stall occurs.

However, 2 threads of execution will likely fill all of the issue slots. They are also independent threads of execution, so dependencies don't exist between them. This means that when the pipeline stalls due to a cache miss, the other thread can keep on retiring instructions.
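The slot-filling argument can be illustrated with a toy model. This is only a sketch under loud assumptions: the issue width of 4 and the per-cycle "ready instruction" counts are made up for illustration, unissued instructions are not carried over to later cycles, and real SMT hardware arbitrates between threads in far more complicated ways.

```python
# Toy model of issue-slot utilisation. All numbers are hypothetical.
ISSUE_WIDTH = 4  # assumed number of issue slots per cycle


def issued_single(ready_per_cycle, width=ISSUE_WIDTH):
    """Instructions issued when one thread owns the whole core."""
    return sum(min(ready, width) for ready in ready_per_cycle)


def issued_smt(ready_a, ready_b, width=ISSUE_WIDTH):
    """Instructions issued when two threads share the slots each cycle."""
    total = 0
    for a, b in zip(ready_a, ready_b):
        used_a = min(a, width)           # thread A takes what it can use
        used_b = min(b, width - used_a)  # thread B fills the leftover slots
        total += used_a + used_b
    return total


# Independent instructions each thread could issue per cycle;
# 0 means the thread is stalled (for example on a cache miss).
thread_a = [2, 1, 0, 0, 3, 1, 0, 2]
thread_b = [1, 2, 2, 1, 2, 0, 3, 1]

slots = len(thread_a) * ISSUE_WIDTH
single = issued_single(thread_a)
smt = issued_smt(thread_a, thread_b)
print(f"one thread:  {single}/{slots} slots used")
print(f"two threads: {smt}/{slots} slots used")
```

In this made-up trace the second thread issues during the cycles where the first is stalled, which is the effect described above; the shared slots only become a limit in the one cycle where both threads together want more than four.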

To all those saying that this is dumb, I suggest you study some modern architecture (I'm not talking about your undergrad architecture course either). A paper I read recently studied the effects of SMT on a simulated Alpha processor. The results were astounding, with very few changes to the processor core. I heard that the next Alpha was slated to include SMT before Intel killed it.

--

Re:SMT by John+Courtland · 2004-01-06 21:28 · Score: 3, Interesting

Yeah, this is the idea behind the new Cell architecture in the PS3. Dumping the old ideas of having a single threaded model and doing everything in multiple threads where global data can be dynamic with each thread containing its own local storage. Done properly, it's blazingly fast. Done poorly, and you end up with race conditions, blocking semaphores, and generally poor code and poor performance. The only problem is that using the paradigms we have today, very few are capable of programming this style right now. The closest people I can think of are the Michael Abrashes, optimization zealots (not saying it's a bad thing), who know their processor upside and down and are not afraid of assembler, or rescheduling instructions to get the most power out of each cycle, instead of letting an optimizing compiler do it for them.

--
Slashdot is proof that Sturgeon's Law applies to mankind.
Re:SMT by Anonymous Coward · 2004-01-06 22:43 · Score: 0

Big problem being of course that without better caches, and some care on the part of the programmer/OS you can seriously degrade performance if you are blowing your shared I$, D$ and TLB.
Re:SMT by at_18 · 2004-01-06 22:57 · Score: 1, Informative

A short but informative article about SMT is on Wikipedia
Re:SMT by freidog · 2004-01-06 23:03 · Score: 1

The IBM Power 5 will have SMT (along with CMP)

with some very welcome additions, like the ability for thread prioritization.
So a low priority thread won't (in the long run) have the same number of CPU resources as a high priority one.

And the ability to switch between SMT and single threaded execution on the fly
Re:SMT by jtshaw · 2004-01-07 01:50 · Score: 3, Interesting

You're right, very few people can code a program that works well on an SMT processor. It is a lot to keep track of and, quite honestly, most of the code I have seen churned out at software companies was done in such a rush because of deadlines that the programmers didn't have time to optimize their code.

However, there is no reason why you can't take two single-threaded processes and use one to fill the holes in the pipeline left by the other, so SMT should still have a decent benefit if the kernel scheduler is prepared for this.
Re:SMT by nikh · 2004-01-07 02:24 · Score: 5, Interesting

Just to clarify here, this is not the same idea as the Cell architecture.

The Cell architecture (which may or may not be used for the PS3) is a multi-processor system designed for scalability; it really does have several processors running at the same time. In contrast, 'Hyperthreading' runs multiple threads on a single processor's core.

They both require multi-threaded code to achieve performance improvements, but fundamentally they're really quite different, and yield quite different price / performance trade-offs.
Re:SMT by sql*kitten · 2004-01-07 03:10 · Score: 2, Insightful

most of the code I have seen churned out at software companies was done in such a rush because of deadlines the programmers didn't have time to optimize their code.

I would argue that in the vast majority of cases, processor-specific microcode (as opposed to language and algorithmic) optimizations aren't the programmer's job - that's what a compiler is for. A professional-grade compiler like MIPSpro or ICC can generate code over twice as fast as GCC on the same processor, because it's smarter about processor-specifics. It's the same as on processors with OOOE and the like; the onus is on the compiler writers working with the hardware designers. On an older architecture like VAX, there was less need for that because the instruction set was so rich, but a more modern architecture like MIPS really needs it.
Re:SMT by Radius9 · 2004-01-07 03:14 · Score: 5, Interesting

Being a console programmer, and having done quite a bit of work on the PS2, there is something in your comment that is a common misperception. You say that hyperthreading works great when you have people who know their processor upside and down and are not afraid of assembler, well, I am not afraid of assembler, and have done quite a bit of it. The problem is that writing in assembler tends to be slow, especially when trying to do heavy optimization. This takes time, a luxury generally not available to those of us in video games who tend to have hard christmas deadlines to ship our product. For Sony to assume that people are going to learn how to program in assembly is a mistake, as learning assembly isn't the issue, having the time to optimize the code in assembly is the issue. This isn't helped by the fact that most of the tools made available to us are piss poor, which makes working on the code much more difficult. For example, the PS2 has the vector units that are generally programmed in assembly. Not only do you need to make sure that the processing done by the vector units synchronizes with your main CPU, but you don't have ANY sort of debugging capability on these. Because of this, programming vector unit code is incredibly slow.

In addition, video games are things that don't always lend themselves particularly well to running in multiple threads. I have my artificial intelligence code, collision & physics code, and my rendering code. These 3 parts are the main parts of the code that take roughly 90-95% of the total CPU time available to me. I can't run collisions and physics until after the AI has run, and I can't run my rendering until the collision & physics have been run. I can multi-thread individual game objects, but even these constantly interact with each other. This isn't normally a problem if you double buffer it in a way that, for example, after the AI has run, I keep the current frame's AI output around somewhere while I run the next frame, but this requires additional memory, another resource that is scarce on consoles.
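The double-buffering idea in the paragraph above can be sketched as a small simulation. This is an illustration only: the stage functions `ai`, `physics` and `render` are hypothetical stand-ins that just build strings, and the "parallel" stages are run one after another here, since the point is only to show which buffer each stage reads.

```python
# Hypothetical stand-ins for the three stages; each just tags its input.
def ai(frame):
    return f"ai({frame})"


def physics(ai_out):
    return f"physics({ai_out})"


def render(physics_out):
    return f"render({physics_out})"


def run_pipelined(num_frames):
    """Pipeline the stages across frames using held-over buffers.

    On every tick each stage reads only what its upstream stage wrote on
    the previous tick, so the three stages do not depend on each other
    within a tick and could, in principle, run on separate threads.
    """
    ai_buf = None       # AI output kept from the previous tick
    physics_buf = None  # physics output kept from the previous tick
    rendered = []
    for tick in range(num_frames + 2):  # two extra ticks drain the pipeline
        # Read phase: consume last tick's buffers only.
        new_render = render(physics_buf) if physics_buf is not None else None
        new_physics = physics(ai_buf) if ai_buf is not None else None
        new_ai = ai(tick) if tick < num_frames else None
        # Swap phase: publish this tick's results for the next tick.
        if new_render is not None:
            rendered.append(new_render)
        ai_buf, physics_buf = new_ai, new_physics
    return rendered


frames = run_pipelined(3)
for line in frames:
    print(line)
```

The cost is visible in the sketch: two extra buffers have to stay alive between ticks, and a frame is rendered two ticks after its AI step ran, which is the memory trade-off the comment describes.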
Re:SMT by Anonymous Coward · 2004-01-07 03:17 · Score: 0

The deeper the pipelines, the more hyperthreading makes sense!
Re:SMT by jtshaw · 2004-01-07 04:12 · Score: 3, Informative

That is totally true. Processor-specific microcode optimizations are definitely the compiler's job. But you have to concede the fact that the compiler can only do so much. If the programmer doesn't choose a good method of solving the problem at hand there isn't much a good compiler can do to optimize the code, especially if the problem being solved is complex.

Compilers simply can't be asked to pick up the slack for programs written with a poor logical flow. They can't be asked to figure out a completely different and improved algorithm for solving a complex problem they don't completely understand the parameters for.
Re:SMT by zenyu · 2004-01-07 04:30 · Score: 1

A professional-grade compiler like MIPSpro or ICC can generate code over twice as fast as GCC on the same processor, because it's smarter about processor-specifics.

gcc 3.3.2 beats the pants off icc 8.0 on my SSE2 code. Up to a 50:1 ratio on speed tests, 4:1 on average. With earlier revisions of gcc and icc the ratio was 2:1 with icc being faster. This code is written with explicit parallelism so all the fancy loop unrolling icc does doesn't help, and the register allocation algorithm in gcc seems to be the thing giving it the advantage, icc spills to memory on simple matrix vector multiplies. I think it just validates the old maxim about optimization, trust nothing, test everything.
Re:SMT by John+Courtland · 2004-01-07 04:56 · Score: 2, Informative

What you wrote here is almost verbatim what Michael Abrash said in his book "Zen of Code Optimization". Dr. Dobb's Journal actually offered it up for free in PDF format at one point, I can only hope to find it amongst my mass of CDs.

Smart code will do more for you than hand optimized assembler, unless you already have written smart code.

--
Slashdot is proof that Sturgeon's Law applies to mankind.
Re:SMT by John+Courtland · 2004-01-07 05:01 · Score: 1

So then they are going to use separate silicon for each? I guess that would be better, if one unit fails you don't lose the computer. I'm sorry to not have made that distinction, but as you note, the programmatic method is the same or at least similar for both. You must compartmentalize your code into various non-blocking threads to yield a good amount of explicit parallelism to really see any benefit.

--
Slashdot is proof that Sturgeon's Law applies to mankind.
Re:SMT by John+Courtland · 2004-01-07 05:13 · Score: 1

Not saying assembler is the end-all be-all, but I don't know of another programmatic model that really does a good job of encompassing the scheduling necessary to program for a simultaneous multithreaded processor.

I understand the need for single threaded performance, it does seem hard to break a game down into enough parts to really benefit from massively multithreaded architectures. I mean, all you really have is input, video, sound, physics, AI and rules (I separate physics from rules because physics are much more difficult to handle than simpler rules). And since most of those are tightly linked to certain conditions, they cannot be left alone to do their own thing. I guess you could make message pumps for each thread, but then you can't guarantee that the sound and video would synchronize, or that the AI would complete a certain task in a given time slice. At least, unless the computer was very fast and had an asynchronous bus.

I also understand your plight with programming on a very limited resource machine, and I'm sorry if you think I slighted your profession. Hell, I would love to program video games, but, alas, it seems it's not meant to be...

--
Slashdot is proof that Sturgeon's Law applies to mankind.
Re:SMT by Anonymous Coward · 2004-01-07 05:22 · Score: 0

Compare it to the current versions of icc, butt monkey. icc still beats gcc by 2:1, if not more. Your comparison is meaningless. It's like comparing a 2004 Chevy to a Ford Model T and saying, "See, Chevy really is better than Ford!"

Moron.
Re:SMT by zenyu · 2004-01-07 05:59 · Score: 1

Compare it to the current versions of icc, butt monkey. icc still beats gcc by 2:1, if not more. Your comparison is meaningless. It's like comparing a 2004 Chevy to a Ford Model T and saying, "See, Chevy really is better than Ford!"

Umm, yeah, well icc 8.0 is the newest release. And I checked against several versions of gcc 3.2.3, 3.3.2 and the latest 3.4 from CVS. The 3.3.2 seemed to have the best performance overall, with 3.4 a close contender. Except the 3.4 had worse performance on a couple benchmarks, not unexpected for a pre-beta. icc was left far far behind, if you look at my posting history you will see I heaped praise on them in the 6.0 period, said nothing after the 7.0 release, and now let people know that with certain code gcc outperforms icc. If you have a legacy app maybe icc will do better, I don't know, with my application it is significantly slower.

icc also won't use my default libstdc++ which was a bit of a pain (limits used gcc builtins which icc doesn't have). I used the Dinkumware standard library for the benchmarks, but if I could have used the same system libraries as gcc I might have had the best of both worlds...
Re:SMT by Anonymous Coward · 2004-01-07 06:23 · Score: 0

Multithreading at Cray has been around a few years. It's actually a Tera Computer development, but they bought Cray and took the name. The MTA-1 was released a few years ago, and they are now selling the MTA-2. The multithreading is far beyond what Intel is doing: 128 threads per processor, interleaved so that memory is always running at full bandwidth. Looks like an interesting architecture, if you already have parallelized code.
Re:SMT by Analog+Squirrel · 2004-01-07 09:05 · Score: 1

What about the Cray(formerly Tera) MTA system? These were hitting the supercomputing community at least 5 years ago...

--
I'd rather be flying
Re:SMT by Moeses · 2004-01-07 09:54 · Score: 1

While I've never written a professional quality game I am a professional programmer (writing business apps like most of us that are still employed) and I agree with what you say. Obviously you've encountered a specific situation where multi-threading wouldn't help much. I think you hinted at a more profound and general rule: multi-threading makes for difficult-to-fix bugs.

When the same data is touched by multiple threads, timing issues lead to nondeterministic bugs, which can be the hardest problems to solve. You know, the kind that show up once in a while but then disappear when debugging code is turned on, etc. (Sometimes referred to as Heisenbugs.)
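A minimal sketch of the kind of bug meant here, using Python's standard `threading` module. The `run` helper, the worker count and the increment count are all assumptions made for illustration. The unlocked variant does a separate read and write on shared data, so another thread may interleave between the two; whether any update is actually lost on a given run depends on timing, which is exactly what makes such bugs hard to reproduce.

```python
import threading


def run(workers, increments, use_lock):
    """Increment a shared counter from several threads; return the final value."""
    counter = {"value": 0}
    lock = threading.Lock()

    def work():
        for _ in range(increments):
            if use_lock:
                # The lock makes the read-modify-write one indivisible step.
                with lock:
                    counter["value"] += 1
            else:
                # Unprotected read-modify-write: another thread may run
                # between the read and the write, and its update is lost.
                current = counter["value"]
                counter["value"] = current + 1

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


expected = 4 * 20000
locked = run(4, 20000, use_lock=True)
unlocked = run(4, 20000, use_lock=False)
print("expected:", expected)
print("with lock:", locked)
print("without lock:", unlocked, "(may or may not match; it is timing dependent)")
```

The locked result always equals the expected count. The unlocked result can never exceed it, but may fall short on some runs and not others.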

Where I see hyperthreading being more useful is improving the performance of systems that have many tasks (separate programs) running that aren't coupled any closer than, say, both connecting to the same DB.

Also, there are kinds of apps that really HAVE to be multithreaded, such as DBs, and hyperthreading will help give more bang for the buck per CPU. I wonder how the corps that do pricing based off CPU type and number are going to handle this? I foresee Oracle coming up with a new way to hold their customers up by their ankles while shaking.

Interesting. by Anonymous Coward · 2004-01-06 20:58 · Score: 5, Informative

There was an interesting discussion on the Plan9 newsgroup about hyperthreading recently, read here

Re:Interesting. by Gleng · 2004-01-06 21:32 · Score: 4, Funny

Cool, that explains it a little.

I was actually trying to explain hyperthreading to someone today. I got about three minutes into the discussion and realised that I had absolutely no idea what I was talking about.

The discussion arose because we were talking about stupid salesmen. I saw a salesman in a shop the other week, trying to explain hyperthreading to a lady with a glazed expression on her face.

He was saying that hyperthreading makes it easier to use two monitors on your PC.

--
"Proudly Posting Without Reading The Article"

Capsule summary. by Anonymous Coward · 2004-01-06 21:01 · Score: 1, Insightful

Hyperthreading helps increase efficiency when applications are coded for it and it is enabled. As better caches and busses get built into future CPUs, hyperthreading will also get better.

Re:Capsule summary. by msgmonkey · 2004-01-07 01:26 · Score: 2, Informative

The only way "better caches" will improve SMT is if you had one cache for each thread, however with that kind of configuration you basically end up with two cores on one chip.

The original thinking behind SMT was that with cache and branch prediction misses starting to have very large penalties, switching to an alternate thread would result in a significant performance increase.

It turns out however that doing context switching at this ultra-fine granularity causes the cache miss rate to go up as each thread fights for the cache.

To get the best out of it the second thread would have to either "lock down" some cache lines and be doing either mainly ALU intensive operations or using streamed memory that would not be cached. This however ends up limiting SMT to some pretty special case programming situations.
Re:Capsule summary. by AlecC · 2004-01-07 05:16 · Score: 1

Not entirely so. If each process can get its core program into cache separately, and each occupies less than half the cache for that core functionality (i.e. not for the large-scale data being processed), then they will not fight for use of the cache. It also depends upon the associativity of the cache. Each program will have a small number of cache "hotspots", at which it is intensively using the data. The more threads there are running, the more chance there is in an N-way associative cache that several will overlap and cause thrashing.
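The associativity point can be made concrete with a toy set-associative cache using LRU replacement. Everything here is an assumption for illustration: the `SetAssociativeCache` class, the 64 sets, the three hot lines per thread, and the deliberately worst-case address layout in which every hot line of every thread maps to the same set. Real caches and real access patterns are messier.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy N-way set-associative cache with LRU replacement (line addresses only)."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.misses = 0

    def access(self, line_addr):
        cache_set = self.sets[line_addr % self.num_sets]
        if line_addr in cache_set:
            cache_set.move_to_end(line_addr)  # mark as most recently used
            return
        self.misses += 1
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)     # evict the least recently used line
        cache_set[line_addr] = True


def count_misses(ways, hot_lines_per_thread, rounds=100, num_sets=64):
    """Interleave threads that keep re-reading their hot lines; return total misses."""
    cache = SetAssociativeCache(num_sets, ways)
    # Worst case by construction: every address is a multiple of num_sets,
    # so all hot lines of all threads land in set 0.
    threads = [
        [(t * 1000 + i) * num_sets for i in range(count)]
        for t, count in enumerate(hot_lines_per_thread)
    ]
    for _ in range(rounds):
        for hot_lines in threads:
            for line in hot_lines:
                cache.access(line)
    return cache.misses


print("4-way, one thread :", count_misses(4, [3]))
print("4-way, two threads:", count_misses(4, [3, 3]))
print("8-way, two threads:", count_misses(8, [3, 3]))
```

With one thread, or with enough ways to hold both threads' hot lines, only the first touches miss. When the two threads together need more lines in one set than it has ways, LRU evicts each line just before it is needed again and every access misses.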

--
Consciousness is an illusion caused by an excess of self consciousness.
Re:Capsule summary. by Glasswire · 2004-01-07 06:05 · Score: 1

However, HT as Intel has implemented it will reduce overall latency when running more than one single-threaded app, not just multi-threaded apps. Who has a system with ONLY ONE THREAD executing?

Intel's Whitepaper by Cebu · 2004-01-06 21:02 · Score: 5, Informative

For those more technically inclined I would suggest reading Intel's Hyper-Threading Technology Architecture and Microarchitecture whitepaper instead.

Re:Intel's Whitepaper by arkanes · 2004-01-07 00:45 · Score: 4, Informative

Ars Technica has one also - less technical than the Intel paper but very accessible and with pretty colored diagrams.

Re:Ever buy a car with auto-everything? by pdbaby · 2004-01-06 21:04 · Score: 5, Interesting

I hate to say it, but your logic is flawed.

To put hyperthreading into your car analogy:
Hyperthreading is like a car that has power assisted steering. If you want, you can switch it off; you'll likely have a slightly smoother time with it on. But if you want the control (or don't trust it) then you can switch it off.

For the geek who reads posts as a stack of strings delimited by <br>, Nobody's forcing you to use hyperthreading. Use it, don't use it. Don't complain that it's a Bad Thing[tm] simply because you're being given the choice

--
Global symbol "$deity" requires explicit package name at line 2. - If only $scripture started "use strict;"

Cinebench by Anonymous Coward · 2004-01-06 21:06 · Score: 1, Funny

that Cinebench performance evaluation is wacked, looks like he interpreted his own graphs wrong.

What a fool.

Re:Cinebench by iansmith · 2004-01-07 06:13 · Score: 1

What is wrong with it? There is a line in the graph missing, but he explains that.

"Obviously when hyper-threading was disabled on my P4 test system, I was unable to run the Multiple CPU portion of Cinebench's rendering benchmark."

Re:Ever buy a car with auto-everything? by Anonymous Coward · 2004-01-06 21:08 · Score: 1, Funny

Please, don't mod up this idiot. It only encourages him. Check his name and then check his previous posts for some other inane comments. The day that he actually has something valuable to say will be the day that hell freezes over.

Call that hyperthreading? by Anonymous Coward · 2004-01-06 21:10 · Score: 5, Funny

"they'll be publishing Part II in the near future"

Part II should've been published concurrently, using idle time... tch!

For the real technical details by photonic · 2004-01-06 21:14 · Score: 5, Informative

The article claims to talk about the technical details of hypertreading. At first glance, however, it seems more like yet another article in the series "Athlon beats Pentium at Doom by 1/2 frame per second".

If you are really interested in the how and why of hypertreading I suggest you read through the lecture notes of Computer System Architecture at MIT OpenCourseWare. This gives you enough background to race through all the articles at Ars Technica et al.

--
karma police: arrest this man, he talks in maths; he buzzes like a fridge, he's like a detuned radio. [radiohead]

Re:For the real technical details by Anonymous Coward · 2004-01-06 23:09 · Score: 0

"The article claims to talk about the technical details of hypertreading"

This is the wrong article...... "Hypertreading" is the new technology being introduced by Nike this year....

Re:Just Marketing BS by Intel to get suckers to bu by narkotix · 2004-01-06 21:15 · Score: 0, Offtopic

You're right there that it's nothing new, BUT for the vast majority of users (with the right consumer price, of course), this is the first time that 64-bit is available on the desktop - whether it's AMD or Apple.

--
We played dungeons and dragons for 3 hours.....then i was slain by an elf

Bug fixing my post by ObviousGuy · 2004-01-06 21:18 · Score: 2, Funny

I meant to say the 0xF00F bug which freezes the Pentium.

The 0xCAFEBABE bug just slows it down to a crawl.

--
I have been pwned because my /. password was too easy to guess.

Re:Bug fixing my post by BiggerIsBetter · 2004-01-06 21:26 · Score: 1

Freudian slip perhaps?

--
Forget thrust, drag, lift and weight. Airplanes fly because of money.
Re:Bug fixing my post by Anonymous Coward · 2004-01-06 21:35 · Score: 0

Too much caffeine, I bet.
Re:Bug fixing my post by TheMidget · 2004-01-06 21:41 · Score: 1

The 0xCAFEBABE bug just slows it down to a crawl.
And your post is just a trawl!
Btw, at least the 0xCAFEBABE bug doesn't open up the barn door for all viruses and trojans to come in and have a jolly good time in your computer, unlike that infamous ActiveX bug! And with 1.4, performance is not that bad either.
Re:Bug fixing my post by JPriest · 2004-01-06 21:48 · Score: 1

Cafe' Babe? I think all geeks slow down for them. Pr0n: Blue men take on Cafe' Babe in a battle for Big 0^H endian.

--
Saying Java is nice because it works on all OS's is like saying that anal sex is nice because it works on all genders.
Re:Bug fixing my post by Anonymous Coward · 2004-01-06 22:18 · Score: 0

Was the hook too obviously showing? Too late, the fish have already swam away!

Re:Just Marketing BS by Intel to get suckers to bu by idiotnot · 2004-01-06 21:19 · Score: 5, Interesting

Perhaps I'm feeding a troll here, but....

64 bits, while not interesting in and of itself, is interesting in AMD's implementation. I have an UltraSparc sitting on my desk at work, and I assure you it's one of the most boring machines in the world. Why is AMD interesting? In the Opteron/Athlon 64 they've fixed some of the shortcomings of the x86 architecture. More registers. Access to more than 4GB of RAM without minutiae (like Intel uses). Things that were expensive in a register-starved 32 bit processor aren't on an Athlon64.

No, it's not innovative, not by a longshot. It's the same damn thing Intel did when they introduced the 80386. But it continues the line unbroken, and that's why the processor is important.

Hyperthreading is interesting, I agree, but I'd much prefer more affordable dual processor machines. Why in the world do Intel, AMD, and Microsoft go out of their way to keep SMP machines off the desktop? Apple certainly is going in the opposite direction.

Well my computer has HT by Rick+and+Roll · 2004-01-06 21:20 · Score: 1

and it is _not_ unreliable. no way. no how. I am very impressed with Intel's chips. I have HT turned on, and again, I experience zero crashes. But some RISC processors are very neat. Never managed to get my hands on any of them, though.

Celery by Chris+Siegler · 2004-01-06 21:25 · Score: 4, Insightful

We saw a whopping 30% decrease in encoding time with HT enabled on the 3.2GHz P4C. We were using an application that is certainly multi-threaded in TMPGEnc, so each logical processor had plenty of work to do and they both had plenty of bandwidth available to share.

That's pretty cool, but if your primary concern is encoding, then there are some things to keep in mind. A Celeron is much cheaper than a P4 with the hyperthreading ($90 for a 2.6GHz Celeron, and $170 for a P4 2.6C). And if the app you're using doesn't support HT, then a Celery will likely encode faster than a P4 with HT on. HT can also reveal nasty bugs in some drivers (my HDTV card is an example). So unless you're playing games, the P4 is just added expense.

Re:Celery by Anonymous Coward · 2004-01-06 21:28 · Score: 0

The increase of L2 cache on the P4 compared to the Celeron is worth the extra money in most cases.
Re:Celery by Anonymous Coward · 2004-01-06 21:31 · Score: 0

Depends on what you're doing. Same could be said of Xeon vs. non-Xeon, but the price jump is enormous.
Re:Celery by turgid · 2004-01-06 22:09 · Score: 4, Informative

A Celeron is much cheaper than a P4 with the hyperthreading
So it is, and it's not all that fast either. Then again, you shouldn't believe all that you read on the Intarweb.

--
Stick Men
Re:Celery by Anonymous Coward · 2004-01-06 22:09 · Score: 1, Informative

Are you kidding?? This review linked to from /. a few weeks ago shows that a 1.8ghz Athlon XP easily beats the 2.6ghz Celeron in the DivX encoding test. With their 128kb L2 cache (384kb less than a P4) the Celerons just can't keep up with the P4. And the lower end P4s can't keep up with the Athlon XPs. Celerons are a complete waste of money, IMO.
Re:Celery by JamesP · 2004-01-07 00:31 · Score: 1, Informative

Except that Celery 2.6 gets his ass kicked pretty badly by a 1.6 Duron

See benchmark at Anandtech Budget Shootout

--
how long until /. fixes commenting on Chrome?
Re:Celery by Jeff+DeMaagd · 2004-01-07 01:39 · Score: 1

I think the logic is wrong here. Even if HT is enabled with a program that doesn't take advantage of it, usually it isn't a noticeable liability.

One can still turn off the HT. With only a 128k cache, IMO, it is too much of a performance liability to make it worth the lower cost.

I just leave it on because the system seems to respond a little better under heavy load.
Re:Celery by Anonymous Coward · 2004-01-07 01:51 · Score: 0

Amusingly enough, having played Star Wars Galaxies, I can tell you nowadays CPU AND Video card are running slightly behind memory for most important speed booster. Get a 2ghz P4 and a celeron, and see how much of a difference they make when running SDRAM or DDR2100, then try them on a mobo running DDR3200, and watch the performance leap :)

-- vranash
Re:Celery by UU7 · 2004-01-07 02:29 · Score: 1

So you're saying a Celeron will encode FASTER than a p4 of equivalent Mhz with HT on ?
Care to show me a site that shows anything close to that ?
Re:Celery by Glonoinha · 2004-01-07 02:55 · Score: 2, Insightful

$80 difference on a $700 machine (assumes a usable amount of RAM, a real video card, a usable performance hard drive, and a legit copy of XP Pro (XP Pro gives you the best performance on the SMT chips, I have seen roughly 5%-10% gains)) means that for every 8 P4 2.6GHz HT machines you were going to buy, you can buy 9 Celeron 2.6GHz machines. Even if you go display-less (no monitors) and use a free OS (Linux or recycled Win2000Pro CDs) you are talking $500 absolute minimum, you are talking 7 Celeron boxes for the same price as 6 P4 boxes. I don't think my honey is going to fall for the 'but I need another 7 computers' line again this year.

At $80 difference, I don't see the price difference being worth it. Particularly given a two year lifespan wherein apps will be developed to get that 30% performance boost we see in a few of the charts (ie, the programs that are multithreaded, and SMT friendly.)

Then again if we applied the $80 towards another half gig of memory, tested same price boxes but the Celeron had another 512M of RAM ... I can see the Celeron simply dominating the P4.

--
Glonoinha the MebiByte Slayer
Re:Celery by ktulu1115 · 2004-01-08 05:44 · Score: 2, Informative

Background: I've used single CPU systems, HT systems, and SMP systems. I've taken courses on OS design and even in the process of writing my own. I'm quite familiar with the 80x86 32-bit instruction set and aware of the new 64-bit design as planned by AMD.

My $0.02 (this is GREATLY SIMPLIFIED)

In the beginning there were CPUs. And CPUs were good.
Soon we realized the limitations and said.. Hey! Why not add another CPU and SMP was born.

SMP was good as well, however the additional cost was something of a deterrent for all but the power-users (and commercial applications of course).

Then Intel tried to develop a middle ground, HyperThreading. It was a decent idea, but it did not work quite as well as originally expected. AMD does not use it for a reason.

From my experience I see HT as a hack developed by Intel, trying to duplicate true SMP. Might work sometimes and in certain environments but it's been shown to actually slow execution in some situations (cache thrashing). In addition, SMP systems have much better responsiveness than HT ones under a high CPU load.

Which is why AMD is working on multi-core CPUs. This is the *correct* way (at least in my opinion) to tackle the problem, aside from getting true multiple CPUs. More can be read about it here. This combined with the new 64-bit instruction set (read more about that at the above link) will truly create a new era of CPUs.

--
# fuser -v /dev/attention | grep work
#

Re:Ever buy a car with auto-everything? by Dominic_Mazzoni · 2004-01-06 21:26 · Score: 5, Insightful

Whether it's something obvious like the Pentium off by 1+1=1.9999943 error

The Pentium math bug was with division, not addition, and it only occurred in very specific circumstances. So while it supports your general point that complicated systems are more difficult to debug, that wasn't a very good example of an "obvious" bug. Careless, yes.

One thing that was good for the industry was to move away from the complex instruction set (CISC) towards a reduced set of instructions (RISC), and we have seen the speed improvements as well as a general reduction in hardware bugs since that time.

You do realize that Intel x86 processors are still CISC, right? (OK, actually internally they do execute things very much like a RISC chip, but the instruction set is still CISC, and modern x86 processors are certainly not any _simpler_ for having some RISC-like elements to them.)

Besides, RISC chips don't actually have fewer instructions. Most of them these days have more. The difference between CISC and RISC is that RISC chips don't have certain complicated, slow instructions, but rather break these up into smaller pieces. For example, CISC processors usually have an instruction to move memory-to-memory while RISC only moves memory-to-register and register-to-memory. Also, CISC processors often have a division instruction while many RISC processors instead just have a multiplicative inverse instruction (so to compute a/b you instead compute a*inv(b)).
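As a software illustration of the a*inv(b) idea, here is a sketch that computes a reciprocal with Newton-Raphson iteration and then multiplies. This is not a claim about how any particular chip implements it; the function names, the iteration count and the choice of initial guess (a common linear estimate for a mantissa in [0.5, 1)) are assumptions made for the example.

```python
import math


def reciprocal(b, iterations=5):
    """Approximate 1/b using only multiply and subtract (Newton-Raphson)."""
    if b == 0:
        raise ZeroDivisionError("reciprocal of zero")
    # Scale so that abs(b) == mantissa * 2**exponent with mantissa in [0.5, 1).
    mantissa, exponent = math.frexp(abs(b))
    # Linear initial guess for 1/mantissa on [0.5, 1).
    x = 48 / 17 - 32 / 17 * mantissa
    for _ in range(iterations):
        # Each step roughly squares the relative error.
        x = x * (2 - mantissa * x)
    # Undo the scaling and restore the sign of b.
    return math.copysign(math.ldexp(x, -exponent), b)


def divide(a, b):
    """Compute a/b as a * inv(b)."""
    return a * reciprocal(b)


print(divide(355, 113), 355 / 113)
print(divide(-1, 4), -1 / 4)
```

One caveat with this approach: the product a*inv(b) is not guaranteed to be the correctly rounded quotient, so the last bit can differ from a true division.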

But to add Hyperthreading, an untested and unproven technology which can guarantee no more than a 12% speed improvement, is folly. Better to amp the CPU clock and deal with a known like heat than to risk your company's livelihood on letting the CPU figure out which thread is which. That is something an OS is much more reliable in handling.

Now that's just ridiculous. Hyperthreading is not untested or unproven. Similar ideas have been discussed in academic papers for years; Intel was just the first to put it into a modern CPU. It's hardly untested, either - Intel started seeding the first Hyperthreading-capable processors what, two years ago now? At that point I wouldn't have suggested running a mission-critical application on a machine with Hyperthreading enabled, but now? You'd be crazy not to if it actually speeds up the application you need to run.

The reality is that in order to advance the speed of computer processors, it's necessary to make them more complicated.

So many hooks, ... by Anonymous Coward · 2004-01-06 21:28 · Score: 0

... but the moderators still don't recognize this for the trawl it is!

Re:Ever buy a car with auto-everything? by BlueBiker · 2004-01-06 21:28 · Score: 5, Informative

Well, Intel is already encountering heat problems which limit how fast they can crank the clockspeed. Hyperthreading is a moderately successful attempt to make use of the available execution units on the chip which would otherwise sit idle. It's also not so new and untested: it was implemented but not enabled on earlier P4 steppings.

Athlon and Athlon64 are generally better able to make use of their execution units, and wouldn't benefit from HT as much as P4/Xeon.

Wrong percentages? by OMG · 2004-01-06 21:29 · Score: 5, Interesting

I think they made a mistake here.
From the article:
"Sandra's CPU benchmark is obviously quite optimized for hyperthreading at this point, and the numbers certainly show that. We see an average improvement of ~39% when hyper-threading is enabled on the P4 ..."

The numbers are:
4328 without HT
7125 with HT

You could say that disabling HT makes this benchmark 39% slower. But the increase by turning HT on is
7125/4328-1 = 1.646 - 1 = 0.646 = 64.6 %

Hrmpf.
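
The asymmetry between "39% slower" and "64.6% faster" is easy to check. A quick Python sketch using the two scores quoted above (the helper name is made up for illustration):

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

without_ht = 4328
with_ht = 7125

gain_enabling = percent_change(without_ht, with_ht)    # baseline: HT off
loss_disabling = percent_change(with_ht, without_ht)   # baseline: HT on

print(f"turning HT on:  {gain_enabling:+.1f}%")
print(f"turning HT off: {loss_disabling:+.1f}%")
```

Same two numbers, different baseline: about +64.6% one way and about -39.3% the other.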

Re:Wrong percentages? by AndIWonderIfIWonder · 2004-01-07 02:33 · Score: 1

You could say that disabling HT makes this benchmark 39% slower. But the increase by turning HT on is
That's true, until you look at the fact there were 3 different tests, you looked at one of them, and they quoted an average.
The 3 percentages are about 64.6%, 71.2% and 22.5%, which is 52.8% on average, which means them saying ~52.5% is actually right.
Re:Wrong percentages? by AndIWonderIfIWonder · 2004-01-07 02:46 · Score: 1

Missed the full quote before, oops! So here it is with my point again (but I'm probably wrong)
You could say that disabling HT makes this benchmark 39% slower. But the increase by turning HT on is
7125/4328-1 = 1.646 - 1 = 0.646 = 64.6 %
That's true, until you look at the fact there were 3 different tests, you looked at one of them, and they quoted an average.
The 3 percentages are about 64.6%, 71.2% and 22.5%, which is 52.8% on average, which means them saying ~52.5% is actually right.
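
For anyone who wants to redo the averaging, here is a small Python sketch using the three percentages above. The geometric mean is added purely as a point of comparison; it is a common way to summarize speedup ratios, not something the article claims to have used.

```python
gains_percent = [64.6, 71.2, 22.5]

# Plain arithmetic mean of the three percentage gains
arithmetic_mean = sum(gains_percent) / len(gains_percent)

# Geometric mean of the speedup ratios (1.646, 1.712, 1.225)
product = 1.0
for gain in gains_percent:
    product *= 1.0 + gain / 100.0
geometric_mean = (product ** (1.0 / len(gains_percent)) - 1.0) * 100.0

print(f"arithmetic mean: {arithmetic_mean:.1f}%")
print(f"geometric mean:  {geometric_mean:.1f}%")
```

Either way the summary lands in the low 50s, nowhere near 39%.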
Re:Wrong percentages? by Glonoinha · 2004-01-07 03:02 · Score: 3, Interesting

Crap you are right - just by turning on HT on the same box he saw a 65% boost in performance.

I think it was a case of -wanting- to see a specific number and juggling things in his head until he got the number he wanted. Intel touts the 30% range and if he initially got the 65% number he probably discarded it and kept juggling the books to get the number in the 30's that he wanted.

As someone that has a P4 2.4 (not HT) box sitting right next to a P4 2.4 (HT) box I will assure you that in real life you are not going to see a 65% sustained boost in performance in day to day use. Not 30% sustained boost either, unless you are only running apps that are heavily optimized and multithreaded.

--
Glonoinha the MebiByte Slayer
Re:Wrong percentages? by fitten · 2004-01-07 04:03 · Score: 1

unless you are only running apps that are heavily optimized and multithreaded.
...or running two apps at the same time that both like to hog the CPU. What matters is that you have two threads (and they don't necessarily have to be in the same process) that need CPU resources. Even outside of this case, you always have the kernel that will need to run, various services/daemons that will need to wake up from time to time and look around for something to do, and interrupts to process, so the machine should feel more responsive and you may see small (single-digit %) increases in many things.
Re:Wrong percentages? by budgenator · 2004-01-07 06:33 · Score: 1

The 3 percentages are about 64.6%, 71.2% and 22.5%,
so if i take
50% of a head of lettuce,
100% of an orange,
75% of a banana and
25% of a cup of Miracle Whip,
I get 62.5% of a fruit salad?

I guess everybody knows why I flunked calculas now!

--
Apocalypse Cancelled, Sorry, No Ticket Refunds

Being philosophical on this... by keeboo · 2004-01-06 21:33 · Score: 5, Interesting

I do believe that HT does have a future, perhaps not in its present form, but still.

I do remember when there was that RISC vs CISC thing in the 80s, people were saying that CISC was obsolete, RISC being the future and so on. What we see today is not pure RISC processors but something in between. -- It's just that the answer was not as pure or clean as people thought at first.

A few years ago there was the BeBox and its BeOS. Well, BeOS had the philosophy of a machine having not a single super-powerful-burning-hot processor but, instead, several low-power ones combined.
Well, Hyper-Threading may push distributed processing technology to the desktop, to the masses, so we might have interesting changes in software and hardware philosophy in the future.

Sort of romantic thinking... But one can dream. :)

Re:Being philosophical on this... by jhines · 2004-01-07 03:04 · Score: 1

I remember the period when Digital was developing the Alpha to replace the CISC cpus in Vaxen.

Nice chip, but relegated to the history books now.
Re:Being philosophical on this... by Greyfox · 2004-01-07 03:23 · Score: 1

RISC was another buzzword, like microkernels, XML, Java, XP, etc. It was the wave of the future -- the magic bullet that would let you get "Mainframe performance on a desktop computer." Just like the 386 was going to finally give you "Mainframe performance on a desktop computer," and the 486 was going to give you "mainframe performance on a desktop computer." I have to wonder how many IT departments bought the hype and made a switch, only to discover that they weren't really running all that much faster.
Now Hyperthreading will give you Mainframe Performance on a Desktop Computer, and if you believe that enabling one extra feature on your processor will make it all that much faster in most circumstances, I've got a bridge in New York I'd like to sell you.
That being said, I wouldn't mind having a dual P4 box with an assload of RAM sitting under my desk...

--
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
Re:Being philosophical on this... by Anonymous Coward · 2004-01-07 03:45 · Score: 0

The main reason CISC still has a significant presence is the fact that the Wintel dominance continued and backward compatibility matters.

Had the cleanest architecture "won" - and it could've, in a world where backward compatibility, existing applications and marketing prowess didn't matter - we probably would mostly be using Alphas, the 32/64-bit transition would've gone cleanly and without kludgy backward compatibility requirements.

Even in this world, had DEC not failed miserably in marketing it, the Alpha could now be the dominant (and still the fastest, like it used to be) server CPU.

YHBT HAND! by TheMidget · 2004-01-06 21:38 · Score: 4, Informative

Indeed, you've bitten on the following hooks:

FDIV error: yes, it was division, not addition. However, conditions were far less specific than Intel would have liked us to believe...
CISC vs RISC: you correctly pointed out that Pentiums still are CISC (even though they nowadays have a RISC core)

And you've missed the following hooks:

CAFEBABE: that's Java's magic number. The code that used to lock up Pentium II's was F00FC7C8
Hyperthreading and the OS's job: no, hyperthreading does not do something which the OS normally would do. It just pretends that there is a second processor. The OS is still responsible for assigning threads to both virtual processors, just like it would do with two real processors!
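
You can see the "pretends that there is a second processor" part from user space. A tiny Python sketch: `os.cpu_count()` reports the number of logical processors the OS exposes, so on a single hyperthreaded P4 with HT enabled one would expect it to report 2 (that expectation is an assumption about such a box, not output captured from one).

```python
import os

# os.cpu_count() returns the number of logical CPUs, or None if unknown.
logical_cpus = os.cpu_count() or 1

print(f"The OS schedules threads across {logical_cpus} logical processor(s)")
```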

Note to moderators: mod grand-parent down. It is obviously a troll (albeit a rather well written troll!). If you absolutely must mod it up, at least use Funny rather than Interesting.

Re:Just Marketing BS by Intel to get suckers to bu by Anonymous Coward · 2004-01-06 21:39 · Score: 0

I grant you that it's better than what Intel was doing with 64-bits, but it was nothing more than the next logical step on the x86 CPU line.

oh goodie! by Anonymous Coward · 2004-01-06 21:51 · Score: 2, Funny

an extra frame or two for Doom3!

Re:oh goodie! by Mipsalawishus · 2004-01-07 01:45 · Score: 1

Isn't Doom3 slated to have a capped fps?
Re:oh goodie! by DomCurtis187 · 2004-01-07 03:18 · Score: 0

that was true with the first betas... who knows about retail though. i don't see why they'd need it -- if a 3D card will do 60 bazillion FPS, why not let it?
Re:oh goodie! by Hudjakov · 2004-01-07 04:27 · Score: 0

My current laptop does only one fps.
Re:oh goodie! by Anonymous Coward · 2004-01-07 04:29 · Score: 0

Had something to do with the timing (tick rate?) in quake allowing you to do certain things (ie, jump further) at a certain frame rate. By limiting the frame rate, they don't have to worry about that exploit.
Re:oh goodie! by GiMP · 2004-01-07 04:57 · Score: 1

Yes, but how many systems will reach the magic 60fps ? It might bring the system from 30fps to 32fps.
Re:oh goodie! by Anonymous Coward · 2004-01-07 12:22 · Score: 0

Yeah, but a 2 fps increase on a 10 fps game is a 64% increase!

Cache Contention by Detritus · 2004-01-06 21:59 · Score: 3, Interesting

Do any modern chips support per-process cache reservation? That would alleviate some of the problems reported in the article.

--
Mea navis aericumbens anguillis abundat

Re:Cache Contention by Anonymous Coward · 2004-01-07 00:44 · Score: 0

Not any on the desktop. A few embedded CPUs/DSPs can do it though. Have a look at Imagination Technology's Metagence processor...
Re:Cache Contention by Codifex+Maximus · 2004-01-07 11:55 · Score: 1

> Do any modern chips support per-process cache
> reservation? That would alleviate some of the
> problems reported in the article.

Wouldn't that defeat some of the benefits of SMT? I mean... if you have two threads executing much the same code... they can SHARE the contents of the cache. Seems to me that increasing the size of the cache would be of more benefit.

I do see where you are going with your idea though...

I have an idea. (Someone else has probably had it too). Why not build a processor that can accept a parent process? This processor could then, also, have multiple logical processors to execute threads in a massive way. Map the address space of the process right onto the processor RAM and all. Add more processors for more processes. Specify one processor specific to the OS so it doesn't have to task switch a lot and can perform housekeeping while the other processors run the programs. So, rather than massive context switching... the OS could just map available processors to processes waiting to run (these would be waiting in regular RAM), control the flow of data between processors to achieve compartmentalization, and preemptively schedule tasks to achieve a smooth overall system.

Maybe the OS could have a priority high enough so that the massively threaded process could be dedicated to a processor and not be preempted - a process that has nearly the same top priority as the OS.

--
Codifex Maximus ~ In search of... a shorter sig.
Re:Cache Contention by Detritus · 2004-01-07 15:01 · Score: 1

One of my favorite computer architectures is the CDC 6600 PPU (Peripheral Processing Unit), an I/O processor for the CDC 6600. It had 10 complete sets of CPU registers and 10 banks of memory, one for each register set. It would execute one instruction in register set #0, then one instruction in register set #1, then one instruction in register set #2, etc. By continually cycling through the register sets, it behaved like 10 independent processors that could simultaneously execute 10 programs, although at 1/10 the speed of the hardware cycle time. It was a clever way to get 10 independent processors without having all of the hardware that would be needed by the conventional approach.
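
Roughly, the round-robin scheme described above looks like this toy Python sketch. The "instruction set" here (each instruction just adds a number to an accumulator) and all the names are made up for illustration; nothing below is the real PPU instruction set.

```python
from dataclasses import dataclass

@dataclass
class RegisterSet:
    program: list   # toy program: a list of integers to add
    pc: int = 0     # program counter
    acc: int = 0    # accumulator

def run_barrel(programs, cycles):
    """Execute one instruction per cycle, rotating through the register sets."""
    contexts = [RegisterSet(list(p)) for p in programs]
    for cycle in range(cycles):
        ctx = contexts[cycle % len(contexts)]   # set #0, #1, ..., then wrap
        if ctx.pc < len(ctx.program):
            ctx.acc += ctx.program[ctx.pc]
            ctx.pc += 1
    return contexts

# Ten register sets, each with a three-instruction program
programs = [[n, n, n] for n in range(10)]
after_20 = run_barrel(programs, 20)   # every program has run 2 instructions
after_30 = run_barrel(programs, 30)   # every program has finished
```

Each program advances once every ten cycles, which is the "1/10 the speed" part.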

--
Mea navis aericumbens anguillis abundat

RISC gives you more bang for your buck by putaro · 2004-01-06 22:04 · Score: 4, Interesting

All things being equal, RISC gives you more bang for your buck. The difference is that Intel has pushed CISC, or specifically the x86 architecture, as fast or faster than RISC by using more bucks. The amount of R&D dollars poured into x86 vs the amount poured into PowerPC or Alpha is overwhelming.

When I was at Apple our processor architect, Phil Koch, gave a talk in, I think, 1997, where he said that the PowerPC consortium had essentially optimized for power consumption and dollars spent on R&D. What was amazing at that time was that PowerPC was competitive with Intel given much lower power consumption and much lower investment of R&D dollars. However, no one really cared about lower power consumption so it didn't translate into any real advantage. Without the R&D dollar leverage given by RISC, however, the PowerPC would not have been able to compete at all. Pushing the 68K architecture to be competitive with Intel with the same R&D dollars as PowerPC would have been impossible.

Re:RISC gives you more bang for your buck by imsabbel · 2004-01-06 23:57 · Score: 2, Insightful

And nowadays it becomes more and more clear that there isn't much of an advantage anymore.
All "Cisc" chips are risc cores with a decoder frontend, and the "cheaply developed" Power PCs before the G5 were slaughtered by X86 in any bench but photoshop gaussian blur.

And the G5 is only a side product of IBM's Power4 program, which can't really be described as "low R&D expenses".

--
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
Re:RISC gives you more bang for your buck by Waffle+Iron · 2004-01-07 03:19 · Score: 4, Interesting

All things being equal, RISC gives you more bang for your buck.
Maybe, maybe not. However, it's hard to tell because nobody makes RISC or CISC processors anymore. The RISC concept, implemented in CPUs like the MIPS R3000, originally meant very simple hardware without pipeline interlocks, instruction schedulers, or more than an absolute bare-bones set of instructions. The current Power PC does not match this at all; it is closer to the current X86.
By the same token, CISC used to mean that many or most instructions were implemented in microcode on the processor. Once again, that's no longer the case. All X86s now have a RISC-like core and resemble the Power PC far more than the 80286.
Pure RISC designs and pure CISC designs have both been superseded by a hybrid approach, and neither one would be competitive today outside the embedded device market.
Basically, you were being fed a line of company FUD to get you all excited about their choice of CPU. Today, cache memory dominates the chip real estate, and CPU performance and power consumption are dictated almost exclusively by cache size and silicon process technology rather than these surface architectural details.
Re:RISC gives you more bang for your buck by Jay+Carlson · 2004-01-07 03:26 · Score: 1

CISC processors tend to have smaller code size, even if the execution units are similar. You can think of this as CISC having a decompression engine between icache and the execution units. When main memory is slow and far away, reducing the amount of memory needed for code can be helpful, especially in modern (bloated?) systems with zillions of bytes of shared libraries loaded.

Here's a ref to a discussion of RISC's response to this problem.
Re:RISC gives you more bang for your buck by Anonymous Coward · 2004-01-07 05:55 · Score: 0

By the same token, CISC used to mean that many or most instructions were implemented in microcode on the processor.
Actually, current "CISC" processors can be seen as having all of their instructions implemented in microcode because they're all translated into the internal RISC-like subops.
Re:RISC gives you more bang for your buck by Waffle+Iron · 2004-01-07 06:51 · Score: 1

Actually, current "CISC" processors can be seen as having all of their instructions implemented in microcode because they're all translated into the internal RISC-like subops.
However, there are two major differences between traditional microcode ops and RISC-like subops. First, traditional microcode opcodes were usually very wide, with enough dedicated bits to simultaneously control all of the ALU parameters, address calculations and data path multiplexers in the processor.
Second, microcode worked like a little subroutine, serially stepping through a fixed sequence of steps in order over several clocks to perform one program opcode. RISC-like subops, OTOH, can usually be split out and issued independently to the various execution units in the CPU and execute in an arbitrary order, in parallel with each other or with subops from other assembly opcodes.
Re:RISC gives you more bang for your buck by Anonymous Coward · 2004-01-07 09:19 · Score: 0

That wasn't true back in 1997 however. Before the AIM alliance fell apart, PowerPC was consistently ahead of Intel with less investment.

However, once it became clear that PPC was going to be a "Mac Only" desktop CPU, the investment in PPC development slowed to a trickle.
Re:RISC gives you more bang for your buck by putaro · 2004-01-07 13:08 · Score: 1

Actually, no, I was being told by the processor architect in a small talk for the OS team that we just barely escaped by the skin of our teeth, sort of, and that they had not anticipated the huge amount of money that Intel was going to be able to throw at the x86 architecture.

Re: f00f by Anonymous Coward · 2004-01-06 22:16 · Score: 0

The F00F bug was on the Pentium I'm certain. I have an old P-100 sitting in my closet that is affected. I don't know if it ever existed on the Pentium 2s - can anyone confirm? Just curious...

From the article: by intermediate_represe · 2004-01-06 22:20 · Score: 2, Funny

This could be analogous to two people in moderate shape being able to pile more wood in total, than a single person who's in great shape.
hmm... in 6 years of architecture research i have never heard anyone talk about SMT like that. it's not even analogous :)

--
Clark Kent is Superman's critique on the human race.

Re:From the article: by Anonymous Coward · 2004-01-07 00:20 · Score: 0

More like two cartoonists each trying to do their own thing in one studio with one pencil set, one drawing pad but an enhanced page-flipper and high capacity coffee machine? (only one coffee cup, though)
Re:From the article: by Glonoinha · 2004-01-07 02:38 · Score: 4, Informative

How about two people in moderate shape being able to push more wood through a single wood chipper than a single person who is in great shape (assuming the wood is piled up 18 feet away = cache miss).

The single wood chipper being analogous to the actual processing part of the core, is only going to be able to shred so much wood - but if two people fetching wood from the woodpile can keep it running at 100% capacity they will shred more wood than a single guy running back and forth to the wood pile by himself.
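
The chipper analogy can be turned into a toy Python simulation. Everything here is a simplifying assumption: one execution unit (the chipper), each thread is a string of steps where 'C' needs the chipper for a cycle and 'M' is a cycle spent walking to the woodpile (a cache miss), and the lower-numbered thread always gets first pick. It illustrates the idea only and is not a model of the P4.

```python
def run(threads):
    """Return (cycles, busy_cycles) needed to finish every thread."""
    pos = [0] * len(threads)
    cycles = 0
    busy = 0
    while any(p < len(t) for p, t in zip(pos, threads)):
        unit_taken = False
        for i, steps in enumerate(threads):
            if pos[i] >= len(steps):
                continue                  # this thread is done
            if steps[pos[i]] == 'M':
                pos[i] += 1               # waiting on memory needs no unit
            elif not unit_taken:
                pos[i] += 1               # this thread gets the unit
                unit_taken = True
                busy += 1
            # otherwise the unit is taken and the thread waits a cycle
        cycles += 1
    return cycles, busy

stally = "CMMM" * 4                  # one cycle of work, three of waiting
alone = run([stally])                # one thread by itself
together = run([stally, stally])     # two such threads sharing the unit
no_stalls = run(["CCCC", "CCCC"])    # nothing to overlap: no gain
```

With the stall-heavy workload, two threads finish twice the work in 17 cycles instead of the 32 it would take back to back. With no stalls at all, sharing the unit buys nothing, which is the other half of the argument in this thread.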

--
Glonoinha the MebiByte Slayer
Re:From the article: by Ignominious+Cow+Herd · 2004-01-07 03:19 · Score: 1

No, it is more like two teams of two people. Each team has one cutting trees into logs, the other feeding the chipper. The chipper can run as fast as either person can feed the chipper, but the log cutter may be slow (needs to find, cut and deliver logs). While one team is stalled looking for trees, the other team can continue feeding the chipper. However, in some cases both teams may be out looking for trees and the chipper is still idle.

Geez I can't believe we're taking this analogy this far. :)

--
Lump lingered last in line for brains, and the ones she got were sorta rotten and insane.
Re:From the article: by Anonymous Coward · 2004-01-07 03:41 · Score: 0

The next generation of hyperthreading will have 4 threads, followed by an 8 thread version in a few years. That's a lot of wood chips.
Re:From the article: by ArsonSmith · 2004-01-07 05:19 · Score: 4, Funny

I have this other idea where we make a large wooden badger...

--
Paying taxes to buy civilization is like paying a hooker to buy love.
Re:From the article: by SpaceLifeForm · 2004-01-07 07:21 · Score: 2, Funny

No, we don't need no stink'n badger.
What we need is *two* Woodchucks.

--
You are being MICROattacked, from various angles, in a SOFT manner.
Re:From the article: by Anonymous Coward · 2004-01-07 10:42 · Score: 1, Funny

Two teams of two woodchucks.
Re:From the article: by Anonymous Coward · 2004-01-07 14:12 · Score: 0

would the woodchuck in this analogy be an UltraSparc 4?

Everything I know about Hyperthreading... by obergeist666 · 2004-01-06 22:27 · Score: 5, Informative

... I learned from this article.

Jim Kirk by Anonymous Coward · 2004-01-06 22:28 · Score: 1, Funny

When did Kirk start benchmarking processors? One would think he would be too busy getting his crew killed and shagging green alien women...

Re:Jim Kirk by GigsVT · 2004-01-07 04:18 · Score: 2, Informative

You are thinking of James T Kirk... See this is James R Kirk. :)

--
I've had enough abrasive sigs. Kittens are cute and fuzzy.
Re:Jim Kirk by Anonymous Coward · 2004-01-07 05:15 · Score: 0

James Roy Kirk - inventor of the probe that became NOMAD?

Quick Q by AvengerXP · 2004-01-06 22:33 · Score: 5, Interesting

Why would you want to have a virtual double processor when... you can actually get a second one? Both changes require that you change your motherboard (One for HT, one for Dual Sockets). Dual Celerons sounds like a good cheap buy, or even Dual Athlons. Why bother with this? Except for the coolness factor of having your POST screen littered with "Hyperthreading Enabled", and in most cases it's not even called that, I forget what they really write on the screen. Seriously, I wouldn't put my money on HT being copied by other manufacturers any time soon, unlike SSE or MMX.

--
Trolls dont like to be Flamebait, because they burn so well. Protect our Troll heritage!

Re:Quick Q by Anonymous Coward · 2004-01-06 22:38 · Score: 0

I guess it's "If you have it, use it" type of thing. More of the Intel CPUs are coming with Hyperthreading.
Re:Quick Q by Betcour · 2004-01-06 23:06 · Score: 1

Dual Celeron hasn't been possible for a VERY long time. Dual Athlon is usually more expensive than a single powerful P4 with HT.
Re:Quick Q by ThaReetLad · 2004-01-06 23:16 · Score: 1

no no no. if you must have 2 CPUs go for 2 low end opteron chips. They'll scale almost linearly compared to roughly 50% for dual celerons, Xeons, Athlon XPs etc.

--
You can't win Darth. If you mod me down, I shall become more powerful than you could possibly imagine
Re:Quick Q by renoX · 2004-01-07 00:36 · Score: 4, Insightful

> Why would you want to have a virtual double processor when... you can actually get a second one?

Because it is cheaper?
SMT increases the size of the CPU very little and can give some good improvements (depending on the application, and the OS as said in the article).

SMT can work in the same motherboard as a single CPU, contrary to what you said..

And for the same price, the single CPU performance of your dual-CPU setup will be lower..
Re:Quick Q by Ramze · 2004-01-07 01:26 · Score: 2, Interesting

I believe AMD has plans to incorporate more than one CPU on-die in the future. First 2, then 4, etc.
It'll be interesting to see what happens to "hyperthreading" when dual and quad processors come standard on desktop systems for home users.
I look at Hyperthreading as a quick hack to improve response times on a few things. It's a minor speed boost as well, but I think it has enough drawbacks to merit it as only a minor improvement which may not always be a good idea to have enabled. I doubt it will stick around once true dual-processor systems are in the majority, though that's not going to be anytime soon.
In any case, Intel knows it's not a major marketing point or they'd be screaming "Hyperthreading is what you have to have!" like they did w/ MMX.
My response is more like "Hyperthreading... woohoo... call me when you come up with something more interesting."
Re:Quick Q by Jeff+DeMaagd · 2004-01-07 03:52 · Score: 1

AMD claims to have the same idea in the works for the next Athlon 64.

It was supposed to be put into the Alpha processor too, a lot of HT research was done on it and was transferred to Intel.

Most of the CPU players are toying with dual full CPU on-die as well, but keep in mind that HT accounts for under 5% of the die, rather than just requiring a second die.

So you _can_ also have two real processors and two more processors in virtual mode. If you know the Xeon line, the Xeon DP allows two real processors, plus two more virtual processors. A Xeon MP system can have four real processors and four virtual processors. I think HT was found to be very beneficial for high load web serving tasks even on a multi real CPU system.
Re:Quick Q by mengel · 2004-01-07 03:55 · Score: 1

It's more obvious if you scale it up to more CPU's.
It actually makes more sense to build one chip that's, say, 8 logical processors and give it several execution units of each type (i.e. 6 integer math units, 4 floating point units, etc.) depending on instruction mix. Of course, that eats chip real estate, but if you have a multithreaded system to run, it will scream.
If you put in 8 distinct processors, that's 8 integer math units, 8 floating point math units, etc. some percentage of which are idle most of the time. So arguably some of those math units are just plain waste -- I mean if your floating point unit is idle half the time, you could get by with half the floating point units, and just schedule the threads on the right CPU's, right?
With 2 logical processors, it's more of a percentage thing, and harder to see.
I'll be impressed when they add, say, enough extra processing units that it actually performs more like a dual CPU (if you give it enough cache, anyway).

--
- "History shows again and again how nature points out the folly of men" -- Blue Oyster Cult, 'Godzilla'
Re:Quick Q by cowbutt · 2004-01-07 04:45 · Score: 1

Nope - I have a Gigabyte GA-8PE667 Ultra which can use a 3.06GHz HT P4, a 1.7GHz Celeron or anything in between.[*]
Also, SMP boards seem to be 2-3x the cost of UP boards before the cost of the CPUs.
[*] FSB speeds permitting. It does 400MHz and 533MHz FSB speeds, but not 800MHz.
--
Re:Quick Q by iconian · 2004-01-07 05:15 · Score: 2, Interesting

It's not that simple. I believe the cheapest HT processor from Intel is the P4 2.4 Ghz, priced at $161. You can buy one Athlon XP 2400+ for $75. A dual processor Athlon motherboard probably costs more than a single processor Pentium 4 motherboard and you will probably have to pay for a bigger power supply unit. However, I don't think dual logical processors in a single Pentium 4 can beat 2 real Athlon XP 2400+ processors performance wise and in performance-price ratio. (Note: I do not work for NewEgg.)
Re:Quick Q by Anonymous Coward · 2004-01-07 05:18 · Score: 0

CMT and SMT are not mutually exclusive technologies. The largest benefit in performance terms of SMT comes when augmented into a CMT design.
For the uninformed, CMT is chip multithreading (normally implemented via multiple cores per CPU package). SMT is simultaneous multithreading, where one core acts as multiple virtual cores; Intel markets this as hyperthreading.
The idea of SMT is that most CPUs spend the majority of time idle whilst waiting for data/instructions. In SMT you have a single resource pool which multiple execution streams can use; this should give a higher overall resource utilisation.
CMT, on the other hand, if badly implemented, can result in a lower throughput per core as the front end resources such as cache and memory interconnect become shared by each core.
SMT in best case optimised binary should give a performance improvement of 20%, worst case would be -10% due to thread "thrashing". CMT in the best case will give 50%, worst case as discussed by IBM in context of Power4 is in the region of -15% due to resource contention issues.
The ideal situation is that you implement both SMT and CMT, as will be the case with IBM Power5 and future Intel Xeon. This in the worst case would (based on IBM numbers) give -5% and a best case of +80%. You'll note that CMT+SMT gives a larger percentage gain than either technology does individually, which is why IBM plan on adding SMT to the next version of Power and why, when Intel take Xeon dual core, SMT support will remain. Sun with future UltraSparc also plan on doing a CMT+SMT implementation; in all likelihood AMD K9 will also be CMT+SMT.
The AMD K8 CMT implementation wouldn't have given a significant performance improvement due to the small shared L2 cache and no L3 cache.
To get significant performance improvements out of CMT, IBM had to implement a large (32MB) L3 cache shared by both cores and a banked 1.5MB L2 cache. The significance is that the L2 is divided into 3 512KB banks, each separately connected to a crossbar which is in turn connected to each of the cores. This gives you an 8-way x3 L2 architecture.
The AMD K8 CMT implementation was based around having a single unified 1MB L2 cache shared by both cores. The reality is this would have given around +30% as the best case improvement for x2 the cost which is why AMD dropped plans for a second core with K8.
Re:Quick Q by Anonymous Coward · 2004-01-07 07:19 · Score: 0

Why would you want to have a virtual double processor when... you can actually get a second one?
Others have already made many valid points, but one more thing is that it's much easier to create efficient core-to-core than cpu-to-cpu connections. At current high bus/clock speeds connecting separate CPUs is a pain (plus slow). This translates to cheaper per-core price as well as better performance with comparable cores.
Same goes for managing memory units; it's easier to share/synchronize on-chip caches, similarly, external cache handling can be simplified a lot.
And finally, what really makes sense is to stack both approaches, multi-CPU system with multi-core CPUs. And surprise, surprise, if you add one more layer (distributed systems), you get nice computing power, with triple-layering of parallelism.
Re:Quick Q by Anonymous Coward · 2004-01-07 07:42 · Score: 0

So could I design a dual-cpu mobo w/ hyperthreading and make the OS think I have 4 total cpus? If AMD implemented HT in the x86-64 cpus, in 32 bit mode would the two 32 bit cores each act like 2 cpus? That would be sweet . . .
Re:Quick Q by renoX · 2004-01-07 08:11 · Score: 1

>A dual processor Athlon motherboard probably costs more than a single processor Pentium 4 motherboard.

Yes, dual processor Athlon motherboards must be quite rare, especially since the AthlonXP does not work in SMP configuration.. ;-)

That said, a motherboard for dual P4 costs over 200 in France, so this must be taken into account in a performance/price comparison..
Re:Quick Q by Stormie · 2004-01-07 10:26 · Score: 1

It'll be interesting to see what happens to "hyperthreading" when dual and quad processors come standard on desktop systems for home users.

Well, if you look at the benchmarks that guy did, almost all were won by the dual Xeons with HT enabled, i.e. 2 physical CPUs, 4 logical CPUs. HT is by no means an alternative to multiple CPUs, it's just one of many techniques to get better performance out of a CPU, regardless of whether it is running alone or as part of a multi-CPU system.

So if dual and quad processor boxes become standard for home users, and if Intel is still dominating the market at that time, I daresay you will see a lot of home boxes with 4 or 8 logical CPUs thanks to their 2 or 4 hyperthreaded Pentium 6's (or whatever).
Re:Quick Q by Ramze · 2004-01-07 11:33 · Score: 1

hmm... perhaps so :-)
I hadn't thought of the fact that one could (or would want to) do both... have multiple processors WITH hyperthreading enabled. I guess I should check out the benchmarks 'n see... most I've seen in the past showed that hyperthreading occasionally had a 5% increase and often had a 5% decrease in performance... it all seemed to even out to me, so I didn't get what the fuss was about.
Perhaps things have improved enough through better processor design and compiling w/ hyperthreading in mind since last I checked.
Re:Quick Q by Ramze · 2004-01-07 11:44 · Score: 1

Wow!
That was incredibly informative. The previous benchmarks I'd seen with hyperthreading showed a boost in some areas and a loss in others to where it seemed to almost even out on benchmarks -- with only a slight improvement due to hyperthreading overall, but that was when hyperthreading first came out and I don't recall the benchmarks covering dual processors at the time. I kept thinking... why don't they just boost the clock speed? It's a bigger improvement than this.
I had no idea that multiple cores with hyperthreading could provide such a boost. wow... so much to look forward to, it makes me feel like my next system will be obsolete before I even buy it online next week.. lol.
Re:Quick Q by iconian · 2004-01-07 20:17 · Score: 1

There are reports of people using Athlon XPs in dual configurations.

To conclude this pickle... by Anonymous Coward · 2004-01-06 22:33 · Score: 0

Current situation:

Xeon and P4 CPUs have caches that are too small and buses that are too slow.

Let's watch this technology develop and come back in, let's say, 6 months ;-).

Many thanks,
M

Cache contention with Hyperthreading by xyote · 2004-01-06 22:36 · Score: 4, Interesting

Threads using hyperthreading or SMT share the cache. This can be a problem if the threads are from different processes and not sharing memory. Your cache is effectively halved (with 2 hyperthreads). On the other hand, it could be a real benefit if your threads were from the same process sharing the same memory. You don't have the cache thrashing which could occur on a multi-cpu system. Since cache hits can really kill performance, this could be quite a performance boost.

To really exploit this, you'd need gang scheduling in the operating system. But it's unlikely that SMT would remain around long enough for any efforts to exploit it to be feasible. CMP with separate cache would likely take over before then since it would behave more like separate cpu's from a performance standpoint and thus offer more consistent behavior.

Re:Cache contention with Hyperthreading by fitten · 2004-01-07 04:15 · Score: 1

It's not like you can't have two CPUs and still have cache issues with shared data if you wrote your app poorly. A datum heavily accessed and modified by both threads running on two separate processors would have a lot of snoop logic firing between the two CPUs, which would slow things down dramatically.
Re:Cache contention with Hyperthreading by Keeper · 2004-01-07 06:31 · Score: 1

The whole point of hyperthreading is that a second thread can run when the first thread stalls (ie: needs to load some data that isn't in the cache) -- instead of stalling for cache data, the cpu switches over to the other thread.
Re:Cache contention with Hyperthreading by Anonymous Coward · 2004-01-07 08:32 · Score: 0

Right, but what if that first thread was only stalled for a couple of clock cycles and during that time the second thread had a couple of cache misses, so the cache was updated with thread 2's data. Then, thread 1 comes back and sees all its data gone, so it has to reload its data back into the cache.

With more threads, you greatly increase both cache and register pressure. If either of these two pressures gets too high, then it starts hurting performance.

In some architectures, compilers don't do certain optimizations like loop unrolling, or even redundant expression elimination because they increase register pressure so much that you actually slow down execution.

Re:Just Marketing BS by Intel to get suckers to bu by Anonymous Coward · 2004-01-06 22:37 · Score: 0

"Why in the world do Intel, AMD, and Microsoft go out of their way to keep SMP machines off the desktop?"

Actually, Windows XP Home will quite happily cope with a dual CPU machine. While I initially wondered if this was specifically to support hyperthreading (which appears as 2 CPUs in Windows), it does actually work with dual chip machines.

I guess the real answer is; they want us to buy workstations. Their argument is that if you need that much power, buy a workstation. Intel and AMD don't subscribe to the ethic of giving you what you need, but what they want to sell you. Microsoft though, have gone some way towards promoting SMP on the desktop, probably thanks to them already having the technology in NT workstation versions, which is part of XP heritage.

Re:Just Marketing BS by Intel to get suckers to bu by drsmithy · 2004-01-06 22:38 · Score: 4, Interesting

Why in the world do Intel, AMD, and Microsoft go out of their way to keep SMP machines off the desktop? Apple certainly is going in the opposite direction.

No, they aren't. The Apple "common desktop" oriented machines - the eMac, iMac and perhaps at a stretch the 1.6Ghz G5 - are all single CPU machines and are likely to remain so now the G5 has finally appeared (price alone, without going into other aspects, puts the dual G5s into workstation/high-end enthusiast desktop territory).

Apple briefly flirted with putting dual CPUs into their nearly-home-desktop machines, but this was driven by the massive speed deficit at the time of G4 CPUs - they *had* to have dual CPUs to be even remotely competitive. No matter what else Apple's marketing department might have tried to say.

If you could option a dual CPU onto an eMac, and all the iMacs were dual CPU, then your comment would be accurate. Two high-end machines out of a base range of seven (and that's ignoring the laptops) is not a paradigm shift. By that measure, just about any major manufacturer is "going in the opposite direction".

bad programming ... by Anonymous Coward · 2004-01-06 22:42 · Score: 1, Interesting

i'm not 100% sure bout this but i just got da
fishy feeling that hyper threading really is just
to make life easier for novice/beginner programmer
to write programs in "high" level languages (say
Vbasic, or just basic ;) ) that can compete in
performance to programs written by cracks, say in
assembler or C / C++.

i believe CPU manufacturers shouldn't care about
this but should cater to the cracks, not the
beginners.(*)

looking at what programs are written in and then
adapting the CPU to this isn't really the
way to go /methinks. especially if what i'm
guessing should turn out to be true it would be
terrible for a MAINSTREAM processor to make these
bold claims.

i mean it would be okay to market a
"hyperthreading" as an optimizing CPU for
high-level languages or something but making the
claim that it also speeds up execution times for an
assembler program that has been optimized on paper
by the programmer is ... wrong.

(*) of course the market goes where the money is
but at least label the product correctly ...

p.s. anyone noticed how long "calc.exe" takes to
load on AMD Athlons?

Re:bad programming ... by Anonymous Coward · 2004-01-06 23:43 · Score: 0

having a big difference between CPU speed and
clock speed gives me the impression of killing off
multiwindow multimedia paradigm. having this
big difference of CPU being 13x or more times
faster than the bus/RAM gives me the impression
of loading a WORLD into RAM in which the CPU and
its cache operate. un-loading this world then
can take a long time. this really destroys
interactivity.

it seems the problem hasn't really been solved
then.
the problems are the same as 20 years ago. just
now the graphics/sounds have gotten much better,
but it's still this CPU in MEMORY prison.

one should also note that most CPUs are produced
in america while most RAM is produced in ASIA :P

maybe it is time to figure out this problem.
the cpu depends on good ram and a fast bus. this
goes out to AMD and INTEL engineers. we all know
you're THE cutting edge but just imprisoning
your CPUs in memory prison and adding a ton of
cache won't solve the problem.

we want eye candy AND we want interactivity!
i don't want to wait 5 min for the data in memory
to unload itself. if i close a program i want
it unloaded instantly. i just can't believe i
have to wait for a program to CLOSE!!!
Re:bad programming ... by kasperd · 2004-01-07 01:01 · Score: 1

feeling that hyper threading really is just to make life easier for novice/beginner programmer to write programs in "high" level languages

That is completely wrong. HT has nothing to do with programming language. Good compilers for high level languages will make a best effort to arrange instructions such that pipelining will be as efficient as possible. But it simply isn't possible to fully utilize the execution units with a single thread of execution. HT is a good idea no matter which language you are using. But of course it does have a lot of pitfalls to avoid, which completely contradicts your claim about it being intended for novices. Can the kernel itself ensure all potential performance problems of HT are avoided, without any help from user mode applications?

--

Do you care about the security of your wireless mouse?
Re:bad programming ... by Anonymous Coward · 2004-01-07 03:07 · Score: 1, Informative

i'm not 100% sure bout this but i just got da
fishy feeling that hyper threading really is just
to make life easier for novice/beginner programmer
to write programs in "high" level languages (say
Vbasic, or just basic ;) ) that can compete in
performance to programs written by cracks, say in
assembler or C / C++.

Good programmers don't write programs in assembly. They pick good compilers and know the correct optimizations. Even if they could beat the best compilers, the code wouldn't be portable. That is bad programming except in very rare cases. Good compilers exist for languages like C/C++ to take advantage of multithreading. (icc)
Re:bad programming ... by raodin · 2004-01-07 04:45 · Score: 1

Calc.exe comes up nearly instantly on my Athlon, since you asked.

Future prognosis for HT by sam0ht · 2004-01-06 22:48 · Score: 5, Interesting

From the article: "As bus speeds increase, and more cache becomes available on die, hyper-threading is going to be more and more efficient. It appears to be somewhat of an engineering symbiotic relationship."

Unfortunately, historically CPU speed has increased faster than memory bandwidth. That's why we've had ever more layers of cache added to our systems, to make up for the relative deficiency.

Unless things change, a technology that works better with a higher ratio of memory bandwidth to CPU speed is likely to become progressively less effective, not more.

Of course, there's always the argument that marketing reasons have pushed CPU clockspeed faster than memory bandwidth, and that Intel et al will just shift their focus more towards memory in future. But defying the tide of 'what people think they want' is usually risky.

Re:Future prognosis for HT by sql*kitten · 2004-01-06 23:42 · Score: 4, Insightful

Unfortunately, historically CPU speed has increased faster than memory bandwidth. That's why we've had ever more layers of cache added to our systems, to make up for the relative deficiency.

Aye. Sun has big plans for CMT, which one of their sales reps was quick to tell us all about, up to 32 SPARC cores on one chip. That'll work well in the lots-of-small-tasks model where you can take advantage of direct access (say between disk cache and network card) on FirePlane with very simple code (like a webserver) that can execute out of the processor's cache. But we're heavy database users, and the first question he got asked was, are you seriously telling us Sun is about to make its memory bandwidth an order of magnitude greater? He couldn't answer that question. Now, that means either he was clueless, or Sun is jumping on the Intel benchmark bandwagon.
Re:Future prognosis for HT by Jeff+DeMaagd · 2004-01-07 05:24 · Score: 1

Another way to increase apparent memory speed is a wider effective memory bus. It was, IIRC, 16 bits on the 286 and before, 32 bits with the 386, 64 bits with the Pentium, and with selected PIII, PIV, Athlon & A64 boards, dual channel 64 bits, making it 128 bits.

I think an Alpha board or two went as high as 512 bits wide.

Now, the wider memory bus doesn't help x86 or A64 as much as one would think, but with hyperthreading, it might.
Re:Future prognosis for HT by davecb · 2004-01-07 09:23 · Score: 1

He should say "no, we're trying to add enough processors to use up the bus bandwidth we have".
Joking aside, this is best for very ordinary code which doesn't have special fast interfaces or run out of the cache. It's fine for the query optimizer of a DBMS, as that's something that scales well but could bottleneck on cache-line fetches.
Code that doesn't scale to multiprocessors, that has "special deals" or is hand-tuned to run out of cache doesn't benefit as much as normal code. Samba would be a good thing to run on one of these chips. It's good, fairly plain code which scales well.
--dave (Samba bigot, too (:-)) c-b

--
davecb@spamcop.net

Re:Just Marketing BS by Intel to get suckers to bu by ThaReetLad · 2004-01-06 22:59 · Score: 3, Interesting

I wouldn't say that Intel and AMD are against dual CPU machines on the desktop exactly, it's just that they cost too much for most users, and most of the time money is better spent on a high end single processor machine than a dual processor one. Of course that is mostly to do with the fact that most SMP systems available up until now haven't scaled very well, not least because with Athlon MPs and Xeons the second CPU has to share the available bandwidth with the first. Now though there is the Opteron dual processor system, and for the first time low end SMP systems scale memory bandwidth linearly with the number of CPUs, so a system with 2 CPUs operates almost twice as fast as a single CPU machine, whereas before you'd be lucky to get a 50% improvement. What will be interesting to see in 2005 will be the dual core Athlon FX type chips. These will basically be 2 of the current Athlon 64 (754 pin) CPUs on a single die, each with its own single channel memory controller. The question is, what are they going to call these chips? They'll have a PR rating of about 6800, just using 2 of the currently available cores!!

--
You can't win Darth. If you mod me down, I shall become more powerful than you could possibly imagine

Re:Just Marketing BS by Intel to get suckers to bu by Anonymous Coward · 2004-01-06 23:14 · Score: 0

You can access *way* more than 4GB in 32-bit Windows: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/memory/base/physical_address_extension.asp

Situations where HT really becomes useful by ZombieEngineer · 2004-01-06 23:36 · Score: 5, Interesting

I have found HyperThreading a real boost for developing operator training simulators (think giant custom computer game for process plant operators [eg: Oil refineries, gas plants, chemicals, etc...]) where a single thread will totally consume the resources of a single CPU (we call it "no-wait", where the simulation calculates what happens in the next 2 seconds and then immediately jumps to the next timestep, thus fast forwarding through slow parts of a process start-up such as warming a reactor).

An issue we encounter is the DCS (Distributed Control System) interface (the bit that links the PC to the fancy membrane keyboards, touch screens and alarm annunciators that the operator uses on the real plant [to maximise training benefit]). Although the interface typically only uses 0.5 to 2% of the CPU, when the simulation goes flat out, there is a noticeable impact on other threads, to the point where there are timeouts on data requests from the operator console.

In summary, if you have a system where some threads are IO bound (in our case, processing requests coming across via ethernet) and other threads are CPU intensive (high end numerical calculations) you will see a definite benefit. It allows us to give every team member a machine fit for the job at approximately 1/3 the cost (those of you who wish to argue that SMP machines are cheaper, we are bound by corporate purchasing agreements where SMP falls into the "Workstation" category while a uni-processor HT machine falls into the far cheaper "Desktop" category).

If you are performing just purely calculations and need to run two parallel threads, I would recommend a SMP or similar machine.

As always, your mileage may vary.

ZombieEngineer

Yup, all over the place... by DerProfi · 2004-01-06 23:37 · Score: 2, Informative

This guy can't even calculate his percentages correctly, so I wonder what else might be screwed up in his analysis?

If X is the lower number and Y is the higher number, he's figuring his percentage increases as (Y-X)/Y instead of (Y-X)/X .

Or is this some kind of "New New Math" that they started teaching in the 10 years since I graduated?
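
To make the difference concrete, here is a small Python sketch of the two formulas. The function names and the sample scores (80 and 100) are made up for illustration; they are not figures from the article.

```python
def percent_increase(low, high):
    """Gain relative to the lower (baseline) score: (Y - X) / X."""
    return 100 * (high - low) / low


def percent_of_higher(low, high):
    """What the article appears to compute instead: (Y - X) / Y."""
    return 100 * (high - low) / high


# Hypothetical benchmark scores: 80 without HT, 100 with HT.
print(percent_increase(80, 100))   # 25.0
print(percent_of_higher(80, 100))  # 20.0
```

Dividing by the higher number always understates the gain, and the gap between the two results widens as the scores move further apart.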

--

3000+ comments meta-modded. 0 mod points awarded.
Lesson for other meta-suckers: Don't believe the hype!

HT is awesome by Jeppe+Salvesen · 2004-01-07 00:25 · Score: 4, Interesting

In the app we develop here at work, we are highly conscious of performance and scalability. Simply put - the more transactions we can process, the bigger and happier the customers. And more money in our pockets.

With Xeon with HT, our performance has increased quite dramatically. We use Perl, so we simply fork off the jobs that do the processing. The result is that we fill all the four virtual processors in Linux if we have a sufficient number of jobs running.
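
A rough sketch of the fork-per-job pattern described above, written in Python rather than Perl. It is only an illustration, not the poster's code: process_job and run_forked are placeholder names, the job body is a stand-in, and it assumes a Unix-like system where os.fork() is available.

```python
import os


def process_job(job):
    """Placeholder for the real per-job work; returns an exit status."""
    total = sum(i * i for i in range(job))  # stand-in CPU-bound work
    return 0 if total >= 0 else 1


def run_forked(jobs):
    """Fork one child per job, then wait for all of them.

    The kernel scheduler decides which logical CPU each child runs on,
    so with enough jobs every logical processor can be kept busy.
    """
    pids = []
    for job in jobs:
        pid = os.fork()
        if pid == 0:
            # Child: do the work and exit without returning to the caller.
            try:
                status = process_job(job)
            except Exception:
                status = 1
            os._exit(status)
        pids.append(pid)  # Parent: remember the child and keep forking.

    # Collect each child's exit status, in the order the children were started.
    return [os.WEXITSTATUS(os.waitpid(pid, 0)[1]) for pid in pids]


if __name__ == "__main__":
    logical_cpus = os.cpu_count() or 1  # HT siblings count as logical CPUs
    print(run_forked([100_000] * logical_cpus))
```

On a dual Xeon with HT enabled, os.cpu_count() would be expected to report four logical processors, which matches the "four virtual processors" Linux shows.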

--

Stop the brainwash

Re:HT is awesome by platypus · 2004-01-07 05:05 · Score: 1

In the app we develop here at work, we are highly conscious of performance and scalability. [...] We use Perl [...]

Huh? This is not meant as an offense, or a troll, but that really, really doesn't fit together. Have you considered using something faster (no, not C)? This should have a much bigger effect than a HT proc.
Re:HT is awesome by Jeppe+Salvesen · 2004-01-07 08:38 · Score: 2, Insightful

Absolutely. But Perl means we can produce more software with fewer manhours and fewer lines of code! Compared to our java-based competitors, we kick butt, both in terms of development team size and in terms of performance and TCO.

We have profiled our code and optimized the code where we spend most of our time. On those critical sections, we use most of the tricks in the book - dynamically created code, extensive use of hashes, etc. We can even write functions in C using XS if we want to!

Basically, Perl is about freedom. You get a high-level language with a lot of freedom to both do genius and very dumb things. And then you can write (or have someone write) C code for those truly performance-critical functions.

Perl looks ugly and looks hacky. I'll be the first to admit it. But once you figure it out, it's pretty damned powerful.

Anyhow - would you have learned this if you didn't ask? Keep attempting to offend, man :)

--
Stop the brainwash
Re:HT is awesome by platypus · 2004-01-07 09:20 · Score: 1

Thanks for the answer. I have coded in perl, that's why I asked, and I really didn't mean it as flamebait ;).

Anyhow - would you have learned this if you didn't ask? Keep attempting to offend, man :)

Heh, it served a purpose. Maybe I should've mixed in some python evangelism for better balance - which is btw. all you praise about perl, without the ugly and hacky parts - but I digress.

Seriously, interesting to know that java seems even to be inferior to perl in most aspects.
LOL, see I can even offend two camps of language followers in one sentence ;).

how to enable for older processors? by Pivot · 2004-01-07 00:38 · Score: 2, Interesting

I have a computer with dual Xeon 1.7GHz. Those apparently have HT capability built in, but it's not enabled in the BIOS. Anyone know a way to circumvent this to enable HT on these?

Re:how to enable for older processors? by L10N · 2004-01-07 00:51 · Score: 1

I do not know how but I have a quick suggestion in case you haven't tried:

Check the vendor's web site for a BIOS update that may make this possible.

If not, get in touch with vendor's tech supp and see if they can help you.

Hoping this helps...

--
"What we do in life echoes in eternity." Maximus Decimus Meridius
Re:how to enable for older processors? by Anonymous Coward · 2004-01-07 03:22 · Score: 0

Those apparently have HT capability built in, but it's not enabled in the BIOS.

Are you sure it is not enabled in the BIOS? Maybe the operating system isn't recognizing two processors. For example, if you are using linux you want to use the smp kernel.
Re:how to enable for older processors? by Anonymous Coward · 2004-01-07 08:33 · Score: 0

It is true that some older P4 processors had all the circuitry to perform HT, but it was not enabled. I don't think there is anything you can do; I think it's more of something Intel has to do directly on the CPU.

Cost by Imperator · 2004-01-07 01:04 · Score: 1, Informative

Cost, cost cost. Cost cost cost cost cost, cost cost cost cost cost cost. Cost cost cost--cost! Cost cost cost, cost cost cost cost cost cost...cost cost. Cost cost "cost" cost cost cost cost cost cost cost cost. Cost cost cost cost cost COST cost cost.....

The lameness filter blows. Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

--

Gates' Law: Every 18 months, the speed of software halves.

Re:Cost by Anonymous Coward · 2004-01-07 19:10 · Score: 0

Cost, cost cost. [etc.]
The lameness filter blows.

I dunno, sounds like it's working perfectly for a change...

Nitpick by GregWebb · 2004-01-07 01:06 · Score: 1

Cache hits are what you want. It's cache misses that kill performance.

--

Greg

(Inside a nuclear plant)
Aaaarrrggh! Run! The canary has mutated!

Hyper-threading explained in 300 words or less. by Anonymous Coward · 2004-01-07 01:09 · Score: 4, Informative

When a process blocks because it is trying to access memory that is not loaded into the cache, it sits idle while the data is retrieved from the much-slower main memory. If you can store two process contexts on the CPU instead of just one, whenever one process blocks to read from memory, the operating system can quickly switch the CPU to the other context which is waiting to run.

I can't remember the name of the machine, but one parallel shared-memory machine used this exclusively. The CPU had 128 process contexts and would switch through them in order. The time between subsequent activations of each context was great enough that data could be fetched from main memory and loaded into a register. This eliminated cache coherency problems (no cache!) and all delays related to memory fetching.

A P4 with hyperthreading is a simplified and much more practical version of that machine.
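
A toy model of that round-robin machine, sketched in Python. The 100-cycle memory latency is an assumption picked for illustration, not a figure for the real hardware, and the model pretends every instruction is a load whose result the same context needs on its next turn.

```python
def barrel_utilization(contexts, memory_latency, cycles=10_000):
    """Fraction of cycles in which the CPU issues an instruction.

    Contexts are visited strictly in order, one per cycle. A context can
    issue only if the load it started on its previous turn has completed;
    otherwise its slot is wasted.
    """
    ready_at = [0] * contexts  # cycle at which each context's pending load completes
    issued = 0
    for cycle in range(cycles):
        ctx = cycle % contexts
        if ready_at[ctx] <= cycle:
            issued += 1
            ready_at[ctx] = cycle + memory_latency
    return issued / cycles


print(barrel_utilization(128, 100))  # 1.0  (latency fully hidden)
print(barrel_utilization(1, 100))    # 0.01 (one context stalls on every load)
```

Once the number of contexts is at least the memory latency in cycles, each context's data has arrived by the time its turn comes around again, which is why such a design can do without a cache.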

Re:Hyper-threading explained in 300 words or less. by Anonymous Coward · 2004-01-07 01:36 · Score: 0

Nitpick: What you're describing is a plain multithreading processor, but hyperthreading is a marketing name for simultaneous multithreading which means that instructions from the two process contexts can be executed simultaneously, on the same clock cycles. This is also why hyperthreading cannot really be described as simpler than plain multithreading as implemented in the Tera MTA (which is the name of the supercomputer you referred to).
Re:Hyper-threading explained in 300 words or less. by akuma(x86) · 2004-01-07 10:18 · Score: 1

You're describing SOEMT (switch on event multithreading). When the process blocks on an "event" like a cache miss, switch to another context.

The machine you were thinking about was the TERA machine which switched on cache misses.

This is NOT hyperthreading. Hyperthreading is SMT. The switching and interleaving is much more fine-grained. Thread switch heuristics and resource sharing algorithms exist in many places in the pipeline.

Badly written by plinius · 2004-01-07 01:09 · Score: 0, Flamebait

I hate it when inarticulate, uneducated people try to pretend to be eloquent by substituting "fancy" words for what they meant to say. This article is a perfect example of that. Because they almost always don't know exactly what the fancy words mean: they just found them in a thesaurus and never checked the definition to see whether they are accurate or really even appropriate.

Linguistic CISC vs RISC by marko123 · 2004-01-07 01:21 · Score: 1

"Hyperthreading is not untested or unproven"

This comment used RISC type language, and in the process, a logical error was accidentally introduced... the correct programmatic statement would be:

"Hyperthreading is not untested _nor_ unproven"

CISC has its advantage in the way the intended statement would be encoded:

"Hyperthreading is better"

This is a complex statement succinctly written with fewer keywords and fewer potential (epistemological) errors.

--
http://pcblues.com - Digits and Wood

Re:Ever buy a car with auto-everything? by kasperd · 2004-01-07 01:45 · Score: 1

But if you want the control (or don't trust it) then you can switch it off.

That is not a good analogy. Sure you can choose not to use HT, it will give you the same control over the system as you would have on a computer without HT. But there is no way you could utilize the full power of the CPU without HT.

--

Do you care about the security of your wireless mouse?

Memory bottleneck (was: Future prognosis for HT) by davecb · 2004-01-07 01:53 · Score: 4, Interesting

One of the reasons for hyperthreading (aka chip multithreading) is the slowness of memory and cache.

If you refer back to Marc Tremblay's CMT Article, you'll see that one of the approaches is to run one thread until it blocks on a memory read, then run another until it blocks and so on, repeating for as many threads as it takes to soak up all the wasted time waiting for the memory fetches.

The Sun paper on their plans for it is here. Have a look at page 5 for the diagram.
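
The "as many threads as it takes" part can be put into a back-of-the-envelope formula. This is a sketch under simple assumptions, not something taken from the Sun paper: every thread runs for run_cycles, then waits miss_latency cycles on memory, and switching threads costs nothing. The 20 and 100 cycle figures are invented for the example.

```python
import math


def utilization(threads, run_cycles, miss_latency):
    """Fraction of time the core is busy with `threads` hardware threads."""
    return min(1.0, threads * run_cycles / (run_cycles + miss_latency))


def threads_to_hide_latency(run_cycles, miss_latency):
    """Smallest thread count that keeps the core busy all the time."""
    return math.ceil((run_cycles + miss_latency) / run_cycles)


# Hypothetical workload: 20 cycles of work between 100-cycle memory stalls.
print(threads_to_hide_latency(20, 100))  # 6
print(utilization(6, 20, 100))           # 1.0
```

With a single thread the same workload keeps the core busy only one sixth of the time, which is the wasted time the extra threads soak up.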

--dave (biased, you understand) c-b

--
davecb@spamcop.net

The thing that got me about CPU performance by awol · 2004-01-07 01:53 · Score: 4, Insightful

I did comp sci (undergrad) in the days when we used unix/VMS to learn and so I have a pretty good understanding of architecture and the basics of threads and processes. The one thing that never sat well with me was that as processor speed "exploded" in the last 5 years, I was under the impression that a "lot" of the performance increase was achieved by parallelising stuff in the execution core. (You can see that my knowledge is _limited_) So as a result, unless your applications could somehow take advantage of this parallelism, a given bit of code would never really get the full benefit of today's uber processors. So all the speed gains were only really marginal improvements.

I think the advent of SMT confirms that it is indeed the case that a given process cannot of itself (unless it is _real_ special) take full advantage of a modern processor and so SMT is a way of reducing the problem by assuming that whilst one process aint enough to take full advantage, two processes are able to make more advantage. It sure makes sense to me.

But it also presents the very interesting question of the marginal benefit of execution pipelines compared to complexity in the front end to allow SMT. What I mean is, what are the trade offs between having a "virtual" (for want of a better word) processor for each execution pipeline rather than using them to out of order execute parts of a single stream of instructions. Is it simply a question of the nature of the work being undertaken by the machine? Ie a processor with 8 pipelines serving 20 users doing stuff, would it be better doing 1 bit of work from each of 8 users or maybe 2-4 bits of stuff from 4-2 users. And can we answer that question heuristically to allow the front end to make good use of each pipeline with a variable profile over the changing use of the machine. Fascinating (well to me anyway).

--
"The first thing to do when you find yourself in a hole is stop digging."

Analogy by attonitus · 2004-01-07 02:15 · Score: 4, Interesting

This could be analogous to two people in moderate shape being able to pile more wood in total, than a single person who's in great shape

Could be, but isn't. A better analogy would be two people using the same narrow corridor to chop and pile wood. If one piles wood, whilst the other chops, then they perform better than one person. If they both chop wood, and then both pile wood then they waste lots of time trying to squeeze past each other and accidentally hitting each other with axes.

Okay, so it's not that much better an analogy. But at least it bears some relevance to HyperThreading.

Re:Analogy by cant_get_a_good_nick · 2004-01-07 03:31 · Score: 2, Funny

Could be true. Not sure if many slashdot geeks can understand "being in shape" and "physical labor"

***ducks***
Re:Analogy by Anonymous Coward · 2004-01-07 03:49 · Score: 0

How many people chop and pile wood indoors? Most people do it outside, since that is where the trees are.

HT and VMWare: perfect together! by pw700z · 2004-01-07 02:30 · Score: 3, Interesting

I use VMware workstation extensively... and HT rocks. Ever have a virtual machine go to 100% CPU utilization, and your machine slow down to a crawl? With the extra 20% of cpu available, your system can still function and be responsive, and allow you to deal with whatever is going on. Or I can run two VMs and get much better performance out of them and the system as a whole.

Re:HT and VMWare: perfect together! by mixmasta · 2004-01-07 05:28 · Score: 2, Informative

Also, make sure to set the VMs to low priority when you are not in the window; it makes a huge difference in system response, even without HT.

-Mike

--
#6495ED - cornflower blue
Re:HT and VMWare: perfect together! by ssstraub · 2004-01-07 07:00 · Score: 1

Do you do this manually, or is there some setting in VMware to make it automatically switch to low priority when the window isn't in focus?

If you have to do it manually, then I'd rather not bother, but if there's a setting for it, I'd like to know about it.
Re:HT and VMWare: perfect together! by mixmasta · 2004-01-07 07:35 · Score: 1

It's in the Edit/Application Settings menu in vmware 4, can't remember where in vmware 3.

It is a global setting... you can also specify it on each vm, though I can't think of a good reason to do so.

--
#6495ED - cornflower blue
Re:HT and VMWare: perfect together! by pw700z · 2004-01-07 07:38 · Score: 1

This is a 'rtfm' - but in case you don't have tfm, it is:

edit->application settings
select the priority tab

This setting does help make a difference, but HT really is cool, especially if you don't want to make your VMs a lower priority.

The sound of software breaking by Latent+Heat · 2004-01-07 03:08 · Score: 2, Informative

OK, you are doing all this calculation in another thread, but you have to somehow synchronize with the GUI thread (PostMessage under Windows). If your calculation thread were to run faster than your GUI thread (GUI doing a lot of screen updating), you would get these PostMessages clogging up your GUI thread message queue because WM_PAINT is of very low priority (so frequent paints don't lock out key and mouse clicks).

In the old single-processor days, your calc thread could do a Wait(0) -- according to the Windows docs, this yields all of the calc thread's remaining time slice to blocked threads, like the GUI thread holding WM_PAINT in its queue. In these modern hyperthreaded times (I imagine true SMP works the same way), Wait(0) does nothing because the calc thread does not block when the GUI thread is on another virtual or real processor, and the screen updates gum up and get all blocky.

The solution I use is that when the GUI thread services a PostMessage from the calc thread, it runs the message pump to check for and dispatch WM_PAINTs -- a kludge to give the PostMessage from the calc thread lower priority than WM_PAINT. But in the meantime I am cursing a blue streak that MSDN cannot document that Wait(0) is essentially meaningless with more than one processor, and I have spent two weeks tearing my hair out about what is going on.
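The kludge above can be sketched as a toy model. This is plain Python, not the Win32 API: the function name `gui_order` and the "calc"/"paint" message labels are made up for illustration, and the model only assumes what the post says, namely that paints are normally serviced after posted messages.

```python
from collections import deque

def gui_order(messages, pump_paints_first):
    """Toy model of a GUI message loop (hypothetical, not Win32).

    Paints are low priority: they are only handled when no posted
    message is waiting, unless the kludge from the post is enabled.
    """
    posted = deque(m for m in messages if m != "paint")
    paints = deque(m for m in messages if m == "paint")
    handled = []
    while posted or paints:
        if posted:
            if pump_paints_first:
                # The kludge: before servicing a calc message,
                # dispatch every paint that is already pending.
                handled.extend(paints)
                paints.clear()
            handled.append(posted.popleft())
        else:
            handled.append(paints.popleft())
    return handled

pending = ["calc", "paint", "calc", "paint", "calc"]
print(gui_order(pending, pump_paints_first=False))  # paints starve until the end
print(gui_order(pending, pump_paints_first=True))   # paints go first
```

With a calc thread that keeps posting faster than the GUI drains, the first ordering is the "blocky" screen; the second is what pumping paints inside the handler buys.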

Re:The sound of software breaking by ZombieEngineer · 2004-01-07 15:13 · Score: 1

The comments on Wait(0) are interesting and valid (I have made the same observations).

The main calculation thread is a monolithic chunk of FORTRAN code (the code base harks back to the days of VAX machines with limited CPU and memory). Mixing FORTRAN and Win32 API calls is not a job for the faint hearted.

It is not the GUI that starts to "flake out" but rather the response on the communications thread, which responds to a packet coming in on a TCP/IP port. The packet processing for the DCS interface is typically around 20 to 100 packets/second.

What actually happens is that the interface will run for a fraction of a timeslice, then the scheduler runs the main calc thread for the remainder of the timeslice plus the next timeslice. Therefore the maximum request rate is approximately 50 packets/second; if the load from the DCS is higher than this, a backlog builds up to the point where a three-second timeout period is exceeded. I guess the issue here is the granularity of the timeslice size when working with real-time or near real-time systems.
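The arithmetic behind the 50 packets/second figure can be written out as a rough sketch. The 10 ms timeslice is an assumption (the post does not state it; it is simply the value that makes two timeslices per packet come out at 50/second), and the backlog formula assumes a steady arrival rate and first-in-first-out service. The function names are made up for illustration.

```python
TIMESLICE_MS = 10        # assumed scheduler timeslice; not stated in the post
SLICES_PER_PACKET = 2    # interface thread gets one turn per two timeslices
TIMEOUT_S = 3.0          # DCS timeout quoted in the post

def max_packets_per_second(timeslice_ms=TIMESLICE_MS, slices=SLICES_PER_PACKET):
    """One packet is serviced per `slices` timeslices."""
    return 1000 / (timeslice_ms * slices)

def seconds_until_timeout(arrival_rate, service_rate, timeout_s=TIMEOUT_S):
    """Time until a newly arriving packet has to wait longer than the timeout.

    The backlog grows by (arrival - service) packets per second, and a new
    packet waits backlog / service seconds. Returns None if the interface
    keeps up with the load.
    """
    if arrival_rate <= service_rate:
        return None
    return timeout_s * service_rate / (arrival_rate - service_rate)

limit = max_packets_per_second()
print(limit)                              # 50.0
print(seconds_until_timeout(100, limit))  # 3.0
print(seconds_until_timeout(40, limit))   # None
```

Under these assumptions a DCS sending 100 packets/second pushes the interface past its three-second timeout after about three seconds of sustained load.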

ZombieEngineer

HT Technology by sameerdesai · 2004-01-07 03:23 · Score: 3, Informative

I have some insight into this technology, as I was part of a research group researching SMT. It is a really cool technology that exposes more instruction-level parallelism (ILP) and increases performance. The basic HT technology, however, divides the processor's resources between the threads. The details of Intel HT are available at http://www.intel.com/technology/hyperthread/ and you can also find associated whitepapers there. Now the catch is that the application should be multi-threaded. You can't just buy an HT processor, run a single-threaded application, and expect performance to improve. The performance benefits come when an optimal number of threads is used. With too few, resources are wasted unnecessarily. With too many, the threads queue up and cause bottlenecks. The other thing that can affect performance is an unbalanced workload, which can leave threads unable to exploit the parallelism. This is a new technology, a lot of research is going on in this area, and it looks really promising.
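The "too few or too many threads" point can be illustrated with a deliberately crude model. It assumes equal-length tasks, one thread per logical CPU at a time, and no sharing effects inside the core, none of which comes from the post; `utilization` is a made-up helper, while `os.cpu_count()` is the standard-library call that reports the number of logical CPUs.

```python
import math
import os

def utilization(threads, logical_cpus):
    """Fraction of logical-CPU slots doing work, for equal-length tasks.

    Threads run in rounds of at most `logical_cpus` at a time, so an
    awkward thread count leaves slots empty in the last round.
    """
    if threads <= 0:
        return 0.0
    rounds = math.ceil(threads / logical_cpus)
    return threads / (rounds * logical_cpus)

# os.cpu_count() may return None, so fall back to 1.
logical_cpus = os.cpu_count() or 1
for threads in (1, logical_cpus, logical_cpus + 1):
    print(threads, "threads:", round(utilization(threads, logical_cpus), 2))
```

On a single HT processor (two logical CPUs) this gives 0.5 for one thread, 1.0 for two, and 0.75 for three, which is the shape of the argument above, though real workloads will not be this tidy.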

Distributed Computing by DeadBugs · 2004-01-07 03:37 · Score: 1

With HT enabled I can run 2 copies of Folding@Home.

This is a significant boost in production over a non-HT processor because these programs.

I would assume this would also help other DC projects like Seti@Home.

--
http://www.kubuntu.org/

quieter by joss · 2004-01-07 04:01 · Score: 1

I could get a dual Athlon system, but then I wouldn't be able to hear the dog barking

--
http://rareformnewmedia.com/

What it is, really by ratboy666 · 2004-01-07 04:04 · Score: 1

More correct:

We start with one wood chipper, one wood chipper operator and a pile of wood. We can chip (whatever) per unit time.

We make the chipper faster, and can do more (increase clock speed of processor), but at some point the operator can't bring us the wood. So, we use a wheelbarrow to transport more wood in a go, and we keep the stack next to the chipper (a cache).

Now, there's plenty of wood, so we get a SECOND chipper. The operator can stick wood into whatever chipper is free (multiple ALU units, out of order execution).

Add a third chipper, and a separate wheel-barrow operator.

This is what we have (pre-"hyperthreading").

Add a second wood chipper operator. If one of them gets tired, the second can take over stuffing the chippers (hyperthreading).

Is that a bit clearer?

Ratboy

--
Just another "Cubible(sic) Joe" 2 17 3061

Re:What it is, really by Dun+Malg · 2004-01-07 04:16 · Score: 0, Flamebait

The Sierra Club should be calling for a boycott on Intel. All this tree cutting just to feed a wood chipper...

--
If a job's not worth doing, it's not worth doing right.
Re:What it is, really by Jeremy+Erwin · 2004-01-07 13:58 · Score: 1

and if one of the wood chip operators gets really cheesed off at the other wood chip operator, one of the wood chip operators can be forced into the wood chipper.

Dual core Athlons will be called... by waferhead · 2004-01-07 04:12 · Score: 1

Space heaters!

(Nononononono, I'm an AMD fan, but I couldn't resist)

Beware of HT! by kryptkpr · 2004-01-07 04:43 · Score: 1

To folks considering buying HT-enabled processors, be warned that not everything will work when HT is enabled!

For one, burst!, my BitTorrent client, simply crashes on start-up. I've been in contact with Intel about the issue, and after some initial jerking me around, I seem to have finally found a tech who's looking into it. It probably has something to do with my compiler (the crash offset is within the Delphi RTL).

My app is not alone: as others in this thread pointed out, hyperthreading can also trigger bugs in drivers.

--
DJ kRYPT's Free MP3s!

Re:Beware of HT! by 2short · 2004-01-08 08:04 · Score: 1

Generally speaking, it's not HT causing these bugs. The programs have timing/thread safety bugs, which will cause problems eventually even on a single non-HT processor. But these bugs will cause problems a lot faster on HT processors or multi-processor boxes.
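A minimal sketch of the kind of thread-safety bug being described, assuming nothing about the actual programs mentioned in this thread: a read-modify-write on shared state with and without a lock. The class and function names are invented for illustration.

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add_unsafe(self):
        # Read-modify-write with no lock: another thread can run between
        # the read and the write, and one of the updates is then lost.
        current = self.value
        self.value = current + 1

    def add_safe(self):
        # The lock makes the read-modify-write atomic with respect to
        # every other thread that takes the same lock.
        with self._lock:
            self.value += 1

def run(method_name, threads=4, increments=10000):
    counter = Counter()
    method = getattr(counter, method_name)

    def work():
        for _ in range(increments):
            method()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value

print("safe:  ", run("add_safe"))    # always threads * increments
print("unsafe:", run("add_unsafe"))  # may come up short, depending on timing
```

The unsafe version can pass for a long time on one processor and fail quickly when two threads really do run at once, which is the behaviour described above.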

AnandTech on Hyperthreading by glinden · 2004-01-07 04:46 · Score: 3, Informative

AnandTech did an excellent article on hyper threading a while back. Well written and worth reading.

I/O Bottleneck by EXTomar · 2004-01-07 05:03 · Score: 1

When parallelism is introduced you run the risk of "process inversion". If the system runs high enough all of your execution units are working as fast as the slowest process no matter how fast the execution units can run.

The key to this effect is that the slowest execution unit is taking the most time forcing all other execution to wait on it. Other faster execution units must wait for one reason or another so they all appear to be as slow as the slowest.

In software you can try to soften the blow by bumping up the priority of the slower threads as they cross any critical sections with faster threads. In hardware the beast is a lot different. Doing a pure register calculation is fast. Loading a register from a cache is slower. Incurring a cache miss is even worse. If your system is running fast enough to incur many cache misses then it doesn't matter how fast your register operations are or how many CPUs are operating: they will start to appear as if they are all operating like they are missing the cache.

CPUs are plenty fast these days. The future problems all seem to be around I/O. There can be N execution cores in your system, but if there is only 1 "slow" memory bus then your system is going to be restricted hard. Looking into ways to speed up the memory-CPU bus cheaply would be of great use to any parallel system. Far better than figuring out how to cram more, faster units into the box.

IBM Will Do SMT Right by fupeg · 2004-01-07 05:06 · Score: 3, Informative

IBM will have SMT in the Power5. Their approach looks even better than Intel's, but part of that is the Power architecture and part of that is IBM learning from what Intel did. SMT is really the best way to get past the limiting reagents of modern processors : bandwidth.

Re:IBM Will Do SMT Right by forkazoo · 2004-01-07 18:32 · Score: 1

SMT isn't about bandwidth. It's mostly about latency. While far from ideal, memory bandwidth on modern processors is pretty impressive. But when a program needs to load data from memory that isn't in cache, the CPU may have to stall for *hundreds* of clock cycles, trying to get that memory into cache. While a normal CPU would be stalled, an SMT CPU can just set the waiting thread aside for a few hundred cycles and work on another thread which already has data in cache. Thus, each thread sees effective latency to memory as only slightly worse than a run to cache.

OTOH, if you have two bandwidth limited threads, they won't be helped at all. Each will chug through data as fast as the memory system feeds it.
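The latency argument can be sketched with a tiny cycle-by-cycle model. Everything numeric here is an assumption for illustration: each thread does 100 cycles of useful work and then stalls 300 cycles on a miss, and the core runs at most one ready thread per cycle. It does not model memory bandwidth limits, so it says nothing about the bandwidth-bound case.

```python
def simulate(num_threads, work=100, miss=300, cycles=100_000):
    """Return the fraction of cycles in which the core did useful work."""
    remaining_work = [work] * num_threads  # compute cycles left before the next miss
    stalled_until = [0] * num_threads      # cycle at which each thread's miss returns
    busy = 0
    for now in range(cycles):
        for t in range(num_threads):
            if stalled_until[t] <= now:
                # Run the first thread that is not waiting on memory.
                remaining_work[t] -= 1
                busy += 1
                if remaining_work[t] == 0:
                    # The thread misses the cache and is set aside.
                    stalled_until[t] = now + 1 + miss
                    remaining_work[t] = work
                break
    return busy / cycles

for n in (1, 2, 4):
    print(n, "thread(s):", simulate(n))
```

With these made-up numbers one thread keeps the core busy a quarter of the time, two threads half the time, and four threads all the time, because some other thread always has work ready while the rest wait on memory.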

(beginquote)
>>IBM will have SMT in the Power5. Their approach looks even better than Intel's, but part of that is the Power architecture and part of that is IBM learning from what Intel did. SMT is really the best way to get past the limiting reagents of modern processors : bandwidth.

(endquote)

Re:Just Marketing BS by Intel to get suckers to bu by Sunda666 · 2004-01-07 05:15 · Score: 1

yeah, but PAE is an ugly hack. if you happen to have a linux kernel source at hand, read what the help says about enabling it.

cheers.

--

``If a program can't rewrite its own code, what good is it?'' - Mel

HT does work! by Anonymous Coward · 2004-01-07 05:22 · Score: 0

and that, of course, is why the dishonest Zealots at that fruity computer company DISABLED hyperthreading when testing the latest Pentium against the G5. Of course, the single processor benchmarks--even WITHOUT HT--beat the G5 on integer performance, even with Apple's flawed benchmarks. Imagine what would have happened if they used the manufacturer's recommended compiler (Intel) and OS (XP) when they did their benchmarks!

The diff between a used-car salesman and a PC one- by Glasswire · 2004-01-07 06:01 · Score: 1

...is that the used-car salesman knows when he's lying.

There's a really interesting philosophical point here, BTW. If you are chartered to know (or are pretending to know) something that you don't really understand, can you really claim that you didn't lie (because you didn't realize what you said was false), or do you have a responsibility to be correct if you offer yourself as an authority on a subject?

"hyper-threading" vs. cache size by Animats · 2004-01-07 06:10 · Score: 4, Informative

The basic problem with hyperthreading is, of course, memory bandwidth. CPUs today are memory-bandwidth starved. 30 years ago, CPUs got about one memory cycle per instruction cycle. Since then, CPUs have sped up by a factor of about 1000, but memory has only sped up by a factor of 30 or so. The difference has been papered over, very successfully, with cache. The cache designers have accomplished more than seems possible. Compare paging to disk, which is a form of caching that hasn't improved much in decades.

If you want to benchmark a hyper-threaded machine, a useful exercise is to run two different benchmarks simultaneously. Running the same one is the best case for cache performance; one copy of the benchmark in cache is serving both execution engines. Running different ones lets you see if cache thrashing is occurring. Or try something like compressing two different video files simultaneously.
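A rough harness for the "two different benchmarks at once" exercise might look like the sketch below. The two workloads are stand-ins invented for illustration (one integer-heavy, one floating-point-heavy), they are far too small to thrash a cache, and nothing here pins the processes to the two logical CPUs of one core, so the OS decides where they run. Treat it as a skeleton to drop real benchmarks into.

```python
import time
from multiprocessing import Process

def integer_work(n=1_000_000):
    """Stand-in integer workload (hypothetical)."""
    total = 0
    for i in range(n):
        total += i * i % 7
    return total

def float_work(n=1_000_000):
    """Stand-in floating-point workload (hypothetical)."""
    x = 0.0
    for i in range(1, n):
        x += 1.0 / i
    return x

def timed(targets):
    """Run each target in its own process at the same time; return wall-clock seconds."""
    procs = [Process(target=t) for t in targets]
    start = time.perf_counter()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    one_after_another = timed([integer_work]) + timed([float_work])
    together = timed([integer_work, float_work])
    print(f"one after another: {one_after_another:.2f}s")
    print(f"together:          {together:.2f}s")
```

Comparing the two timings with HT switched on and off in the BIOS, and with identical versus different workloads, is the experiment the paragraph above proposes.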

If you're seeing significant performance with real-world applications using a "hyper-threaded" CPU, that's a sign that the operating system's dispatcher is broken. And, of course, hyper-threading dumps more work on the scheduler. There's more stuff to worry about in CPU dispatching now.

Intel seems to be desperate for a new technology that will make people buy new CPUs. The Itanium bombed. The Pentium 4 clock speed hack (faster clock, less performance per clock) has gone as far as it can go. The Pentium 5 seems to be on hold. Intel still doesn't have a good response to AMD's 64-bit CPUs.

Remember what happened with the Itanium, Intel's last architectural innovation. Intel's plan was to convert the industry over to a technology that couldn't be cloned. This would allow Intel to push CPU price margins back up to their pre-AMD levels. For a few years, Intel had been able to push the price of CPU chips to nearly $1000, and achieved huge margins and profits. Then came the clones.

Intel has many patents on the innovative technologies of the Itanium. Itanium architecture is different, all right, but not, it's clear by now, better. It's certainly far worse in price/performance. Hyperthreading isn't quite that bad an idea, but it's up there.

From a consumer perspective, it's like four-valve per cylinder auto engines. The performance increase is marginal and it adds some headaches, but it's cool.

Re:"hyper-threading" vs. cache size by Brandybuck · 2004-01-07 06:47 · Score: 4, Informative

If you're seeing significant performance with real-world applications using a "hyper-threaded" CPU, that's a sign that the operating system's dispatcher is broken. And, of course, hyper-threading dumps more work on the scheduler. There's more stuff to worry about in CPU dispatching now.

That was my suspicion. Hyperthreading can't be much more efficient than threading via the OS, unless the software is specifically compiled for it, or you use a scheduler specific to hyperthreading. Scheduling work STILL has to be performed, and hyperthreading STILL isn't parallel processing. So where are these performance improvements people are seeing coming from?

I'm not using Linux, but FreeBSD. When I got my new HT P4, I considered turning it on. Then I read the hardware notes. Since FreeBSD does not use a scheduler specific for hyperthreading, it can't take full advantage of it. In some cases it might even result in sub-optimal performance. Just like logic would lead you to think.

The OS cannot treat hyperthreading the same as SMP, because they are two different beasts.

--
Don't blame me, I didn't vote for either of them!
Re:"hyper-threading" vs. cache size by davecb · 2004-01-07 09:30 · Score: 1

I very much disagree: what you've argued is true only if the original article is true, and it in fact misses the point of hyperthreading/CMT. Jump back to my posting "Memory bottleneck (was: Future prognosis for HT)" for pointers to the relevant articles.
--dave

--
davecb@spamcop.net
Re:"hyper-threading" vs. cache size by fupeg · 2004-01-07 13:46 · Score: 1

You're way off, on several counts. It's always fun to try and blame everything on corporate greed, but sometimes the facts just don't support it. Other posters have pointed out how Sun plans to use similar technology and how IBM plans to implement it. Do you think that they are just copying Intel blindly? They are all attacking the same problem: increasing throughput for fast CPUs. As you yourself pointed out, memory speed cannot come close to keeping up with modern CPUs. Increasing cache size is one way to combat this, but it is a very brute force way to do it. You increase the size, cost, and power consumption of the CPU when doing this. SMT and multi-core SMP systems allow for work to get done while waiting for memory to catch up. HT is just the tip of the iceberg; the stuff that Sun and IBM are working on is pretty amazing.

As for your knowledge of Intel, it is humorous at best. The Itanium was never meant to be the CPU of the future. It was never meant to be in home systems. It was designed to give Intel a way to compete with 64-bit servers that were all the rage in the late 90's. That's why Intel was willing to completely break with x86 instruction sets. If they were planning on trying to transition from PentiumXYZ to Itanium, then they would have never done that. Plus, for all its initial problems, the modern Itanium has put up some impressive numbers.
Re:"hyper-threading" vs. cache size by node159 · 2004-01-07 16:18 · Score: 1

HT is just another way to solve the very complex issue of CPU architecture and design.

HT is by no means a silver bullet, but rather a simpler way of optimizing superscalar pipelined processors.

Basically, HT allows for more optimal use of the different subcomponents of the CPU when wastage is built into the design.

An analogy of how it works:
Think of it as trying to do two big math equations. You have 10 people to give bits of the equations to work on, and the equations contain some conditions where you need to wait for results before you can carry on.
You start off by giving out as much as you can of the first equation, and then you give any people left over bits of the second equation. As you work through each equation you may find you need to wait on someone before you can give out more bits, and so more of the other equation is given out.

This is the basic concept of HT. To treat everything as subunits and optimize allocation.

Overall, the two equations together take less time, but each equation takes more time than if it were worked on on its own. It also means that a lower-priority thread may eat into a higher-priority thread (if the OS scheduler is not HT-optimized [true of all OSes to date]). It also causes problems at the caching level, which is much more of a performance bottleneck than the CPU ever will be.
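The ten-people analogy can be turned into a small sketch. The representation is an assumption made for illustration: each equation is a list of steps, each step is a number of independent pieces, and a step must finish before the next one starts. Who gets first pick of the workers alternates every time unit, which is one possible policy and not a claim about how Intel's hardware arbitrates.

```python
def run_alone(steps, workers=10):
    """Time units to finish one equation with all workers to itself."""
    # -(-a // b) is ceiling division: a step with more pieces than
    # workers needs more than one time unit.
    return sum(-(-pieces // workers) for pieces in steps)

def run_shared(first, second, workers=10):
    """Work on both equations at once; return their finish times."""
    remaining = [list(first), list(second)]
    finish = [None, None]
    now = 0
    while any(remaining):
        now += 1
        free = workers
        # Alternate which equation gets first pick of the workers.
        order = (0, 1) if now % 2 == 1 else (1, 0)
        for idx in order:
            if remaining[idx]:
                take = min(free, remaining[idx][0])
                remaining[idx][0] -= take
                free -= take
                if remaining[idx][0] == 0:
                    # Step done; the next step has to wait for its
                    # results, so it starts in the next time unit.
                    remaining[idx].pop(0)
                    if not remaining[idx]:
                        finish[idx] = now
    return tuple(finish)

equation = [8, 8, 8]  # three dependent steps of eight pieces each
print(run_alone(equation))             # 3
print(run_shared(equation, equation))  # (4, 5)
```

Run back to back, the two equations would take 6 time units; shared, both are done after 5, but each one individually takes longer than the 3 it needs alone.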

HT is a heavily marketed term for just one of many ways (and by no means the best way) to optimize a CPU architecture. Its simplicity in CPU design makes it very appealing to integrated circuit architects, as it allows the sections of the CPU to be simplified at the cost of a more complex control unit.

--
GPLv2: I want my rights, I want my phone call! DRM: What use is a phone call, if you are unable to speak?
Re:"hyper-threading" vs. cache size by Animats · 2004-01-07 18:30 · Score: 1

The Itanium was never meant to be the CPU of the future. It was never meant to be in home systems.
Intel originally announced the 386 as "intended for servers". They did the same thing with the Pentium Pro, their first real superscalar.
For a while, Dell and HP sold Itanium desktops, but nobody bought.

Other Conclusions by suitti · 2004-01-07 07:11 · Score: 1

The author benchmarks a 2.8GHz Xeon with a 533MHz FSB and 1MB of L3 cache, and a 3.2GHz P4C with an 800MHz FSB and 0.5MB of L3 cache. He claims he doesn't want to compare the two, but he does. Here are some other conclusions.

The Xeon has a slower clock, and yet outperforms the higher clock P4C. This is further evidence that MHz isn't everything.

The P4C has higher memory bandwidth (the FSB) yet slower performance. This shows that on-chip cache can be king over memory bandwidth too.

Some of my historic benchmarks fit completely in the 486's cache, so not all applications will benefit from more. Alien searches (SETI@Home) appear to benefit from large on-chip caches up to its resident set size (about 13 MB). The more the better. My current favorite production application has a resident set size of about 200 MB. It isn't clear that on-chip cache size makes much difference. It is clear that FSB bandwidth makes all the difference.

As always, the best benchmark is your application. Unfortunately, most of us can't run our favorite application on a variety of machines before buying one. I know I end up buying something that appears cost effective. This favors the low end processors, which at the moment favors AMD in the x86 world. I've been particularly impressed with the Athlon's memory bandwidth performance. My Athlon 1800+ (1.3 GHz) performs better than the 1.5 GHz P4s at work, primarily due to having more than double the memory bandwidth. It was also considerably cheaper. I feel as if I got a good deal. I personally have shown no brand loyalty, purchasing a chip from a different vendor each time.

--
-- Stephen.

Assembly sucks? by dmelomed · 2004-01-07 07:14 · Score: 2, Informative

Not to be specific about SMT. Assembly too hard? You people haven't heard of Forth, right? Just use ficl, or some other embeddable Forth, instead of assembler; it will save you lots of time. Better debugging too, since Forth is interactive.

Hyperthreading in dual processor systems by mj2k · 2004-01-07 07:28 · Score: 1

I have a dual xeon 2.4GHz at home and I don't see any performance change when hyperthreading is enabled on WinXP. Hyperthreading is really a liability rather than an asset if a program tries to use only a single processor in hyperthreading mode rather than using the second processor. I can say that when I installed FreeBSD 5.1 with hyperthreading disabled it was significantly faster than when I installed it with hyperthreading enabled. I'd say that hyperthreading is more or less worthless at this point in time for multi-processor systems (at least until software properly recognizes it), though admittedly it does have its advantages in single processor configuration.

Re:Just Marketing BS by Intel to get suckers to bu by Anonymous Coward · 2004-01-07 07:37 · Score: 0

Hyperthreading is interesting, I agree, but I'd much prefer more affordable dual processor machines

Hyperthreading is not another form of dual processing. It gives you a nice boost in performance in many circumstances with very little hardware cost. The idea is that since a single thread of execution hardly ever makes full use of all the multiple instruction execution units, registers for renaming, and other CPU resources that you already have in your single (modern, superscalar) processor, you might as well make use of those resources by executing some of a second thread.

SMP, on the other hand, costs you an entire extra copy of every single component of the processor, and is thus more expensive. There's additional design complexity in the memory architecture you have to pay for. Not to mention integration issues such as board complexity, heat, manufacturing, cost. A dual processor box is going to cost more than a single processor box, all else being equal. The interesting question becomes at what point two older, simpler, cheaper processors, plus the cost of SMP, become more cost-effective than one newer, complex, faster, expensive processor.

Once you have an SMP box, you then have the problem of how to schedule threads on your processors. One of the interesting things about processor development over the past ten years is that the hardware guys pretty much bypassed all those people doing work on trying to build parallelizing compilers that could find parallelism within a traditional sequential program in order to generate faster code for an SMP setup. A lot of the real estate on the chip these days goes for hardware that automatically seeks out parallelism in the program and dispatches instructions accordingly. SMP with simpler processors is in some ways a step back, as you revert to having software -- or the system designer -- having to notice the parallelism and design for a particular system, rather than having the hardware find the parallelism for itself. There's a huge area for research awaiting methods for discovering parallelism at different levels of abstraction within a system, and developing hardware and software that can best take advantage of that parallelism.

HT is an idea that sits nicely between a single processor system and a full-up SMP arrangement. It's not a replacement for SMP, as one HT processor doesn't have a complete duplicate set of resources. But it does make better use of what you have, and runs (most) code faster -- which is the point, right?

Synthetic Benchmarks and HT by OppressiveGiant · 2004-01-07 07:41 · Score: 2, Informative

Dhrystone and Whetstone should show almost no difference in performance with or without Hyperthreading. HT just allows the superscalar, superpipelined processor to run multiple threads on the same processor at the same time.

So what may be interesting would be to run both Dhrystone and Whetstone at the same time, seeing as you'd then be using both the ALU and the floating point unit. That should show a large difference in performance with and without HT.

--
i could not think of anything clever.

Memories. by Raven42rac · 2004-01-07 07:49 · Score: 1

This brings back memories of Sega's "blast processing" and Nintendo's "FX chip". Just a bunch of marketing, and a little smoke and mirrors added in for variety. Sort of like Intel's P4 "Expensive Edition" that does diddly squat in terms of performance gains.

--
I hate sigs.

Re:Memories. by DigiShaman · 2004-01-07 09:08 · Score: 1

The "Super FX Chip" was for real. Basically, it was a RISC FPU co-processor that ran at 10.5 MHz inside the cartridge. While the SNES main CPU ran at 3.58 MHz, the ability to crunch floating point data much faster allowed the SNES to render 3D games such as Starfox. There were also games in the works for using the Super FX2 Chip. But it never surfaced. The FX2 system had two FX chips in a cartridge running side by side for twice the performance.

As for Sega's "blast processing", I've heard that its implementation was nothing more than pre-processing data, then having the results pulled from memory at a later time. So basically, you had buffered data that acted as a processing template for future use. But I personally cannot confirm this.

--
Life is not for the lazy.
Re:Memories. by Raven42rac · 2004-01-07 09:48 · Score: 1

Yeah, but they never used it was the thing; it was just hype. I know the FX chip had huge potential; Starfox was awesome graphically for its time. By the time that the FX and FX2 were coming to fruition, the N64 was well on its way to market. Speaking of clock speeds for older consoles, isn't it amazing that an emulator runs very choppily on a gigahertz machine, while the original game runs fine on a 3.58 MHz chip?

--
I hate sigs.

No One Ever Seems To Mention... by Anonymous Coward · 2004-01-07 07:57 · Score: 0

No one ever seems to mention that a great advantage of SMT/HT ought to be reducing the number of context switches necessary. Why is this?

If you have two threads that want to run together -- say your program and the OS itself -- to time-slice between them efficiently so that both get service involves context switching each time, which is an expensive, time-consuming operation for x86 processors.

But with HT/SMT running, each thread can operate on one logical processor much longer without interruption. Given the multi-threaded nature of many OS's today, this alone should be a significant advantage that never seems to get mentioned in articles on HT/SMT.

Re:No One Ever Seems To Mention... by node159 · 2004-01-07 16:32 · Score: 1

Context switches are not much of a performance slowdown, since they mainly hit the lowest-level cache and never get past that.

Also, by increasing the number of threads running you also increase the number of context switches, so you really don't gain anything.

If context switches were a bottleneck, then increasing the time between switches would be the ideal solution, not increasing the number of threads (and with it the number of context switches). As for this on an HT machine, you would have one CPU trying to do the work of two in regard to context switches.

A misconception that many undergrad students have is that context switches are a significant slowdown. This is not the case, and that is difficult to grasp.

Think before you post please

--
GPLv2: I want my rights, I want my phone call! DRM: What use is a phone call, if you are unable to speak?

Re:Quick Q -- More Coolness by Anonymous Coward · 2004-01-07 08:11 · Score: 0

Except for the coolness factor of having your POST screen littered with "Hyperthreading Enabled"

It's that other coolness factor of only running 1 CPU, with its associated lower power dissipation and cooling requirements.

Re:Other Conclusions -- Excuse Me! by Anonymous Coward · 2004-01-07 08:22 · Score: 0

The Xeon has a slower clock, and yet outperforms the higher clock P4C.

Excuse me, BUT...that's one P4C against two SMT/HT Xeons. That's what the (x2) means after the Xeon name.

Woodchuck this by DigiShaman · 2004-01-07 08:41 · Score: 1

How much wood can a woodchuck chuck if a woodchuck could chuck wood?

--
Life is not for the lazy.

SMT idea is not new. by Anonymous Coward · 2004-01-07 09:14 · Score: 0

I first heard of SMT/HT in the mid 80s in a machine called the HEP (Heterogeneous Element Processor) designed by Burton Smith. I think this was at a company called Denelcor. Smith has been working with this concept since then and later on founded Tera Computer. Tera finally bought into the remains of Cray and is now one of the companies that either calls itself Cray or has a product named after Cray. (I remember taking naps on the seat of a Cray X-MP. The power supplies were in the seat and it was a warm place in the computer room.)

Re:HT Technology - unbalanced workload is bad? by DonGar · 2004-01-07 09:59 · Score: 1

An unbalanced workload is bad? That doesn't seem right to me.

With nothing but a quick impression, it seems that HT might be better at an unbalanced workload than an SMP machine. This is because with SMP, everything on the underutilized processor sits idle.

It would seem that HT would end up dedicating all functional units (outdated terminology?) to the thread that has the heavy load. Thus you can get better use of the functional units by moving them back and forth between threads as needed, at least until you have a cache miss in the busy thread.

--
plus-good, double-plus-good

Emulation by DigiShaman · 2004-01-07 10:06 · Score: 1

The problem with emulation is that everything being emulated is processed on the main CPU. For example, the SNES has dedicated processors for logic, video, and audio. But when you emulate the SNES on a computer, the main CPU is emulating everything. Then the respective results are exported to the video card and audio. So in a nutshell, the video card in your PC is not doing any kind of work directly relating to the SNES video (just acting as a frame buffer really).

But, I do find it interesting that the N64 was not only emulated, but if you had a 3D video card, there was a glide wrapper that actually accelerated emulated 3D functions of the N64. It's almost like distributed emulation as far as the hardware is concerned in your PC.

--
Life is not for the lazy.

Please get your terms straight! by Prof.+Pi · 2004-01-07 10:35 · Score: 2, Informative

The RISC concept, implemented in CPUs like the MIPS R3000, originally meant very simple hardware without pipeline interlocks, instruction schedulers, or more than an absolute bare-bones set of instructions.

Not true at all! RISC refers to the instruction set, not the internal architecture. Even the earliest RISC processors to carry that name included pipeline interlocks -- it was the simplicity of RISC that made such techniques feasible, especially at the chip densities of the 80's.

There's a lot of confusion about what RISC means. Look up a computer architecture textbook. RISC is somewhat fuzzy, and most chips bend the edges of the definitions in places. The general operating principle is "reduced," and herein lies the ambiguity, since this is relative to the technology of the day. (A "RISC" Alpha made in the 90's has more opcodes than a "CISC" 8086 made in 1978.) But RISC processors typically have the following properties:

Limited addressing modes (typically register-register, loads and stores only, maybe with some variants like autoincrement)
Relatively simple instruction formats (often all instructions are the same size)
Emphasis on general instructions rather than specialized instructions with limited applicability (such as string ops)

CISC used to mean that many or most instructions were implemented in microcode on the processor.

Again, no. CISC means supporting many different kinds of operations directly in hardware. This was especially appealing in the days when back-end compiler code generation wasn't very good, so CISC often meant a simple 1-for-1 translation from high-level constructs to machine opcodes. The ISA complexity usually meant microcode was the best approach, but this was not part of the definition.

another wrong point... by GunFodder · 2004-01-07 11:30 · Score: 1

CPU instruction sets are always designed around the software that will run on them. CISC instruction sets were popular because it made assembly programming possible; RISC only gained in popularity when compilers got good enough to produce optimal code.

Not a very good article by node159 · 2004-01-07 14:17 · Score: 1

Just my 2c, but the guy doesn't really have a clue how HT works, and it's reflected all the way through the article. His choice of benchmarks is one of the most obvious flaws, like comparing AMD vs. Intel on an AMD-optimized codebase.

--
GPLv2: I want my rights, I want my phone call! DRM: What use is a phone call, if you are unable to speak?

Re:Beware of Bad programming! by Anonymous Coward · 2004-01-07 15:05 · Score: 0

I've had no problems with any programs, including BitTorrent, since buying my P4C 3.0GHz. And I do microcontroller programming, which can cause timing-related problems when using the COM and LPT ports for programming/communication.

The main cause of HT and driver bugs is sloppy programming. An assumption was made that turns out not to be true, typically about timing. This is not the fault of the hardware, but rather of programmers being lazy or not knowing better.

All the bugs that I have read about being fixed were due to the programmer not following best practice as laid out in Intel's white papers.

Hype-r-threading says it all by node159 · 2004-01-07 16:52 · Score: 1

[see title]

--
GPLv2: I want my rights, I want my phone call! DRM: What use is a phone call, if you are unable to speak?

Re:HT Technology - unbalanced workload is bad? by sameerdesai · 2004-01-08 03:11 · Score: 1

Maybe my terminology confused you. With a "different" workload we can have very good parallelism and can exploit it, since we can assume independent threads will have independent instructions. Intel's HT works by dividing up resources between threads, so let's assume we have 4 threads (call them t1, t2, t3 and t4) and that t1 is very busy while the rest are not. The resources are then split 4 ways, but t1 will be queuing up (because it is busy) while the resources given to t2, t3 and t4 sit idle. This will also cause a lot of cache misses and impact the other threads. The catch is that we now have to think up thread selection policies, which can be based on any number of criteria such as instruction count, cache miss ratio, branch prediction count, etc. This is the area where I think the research is now concentrated. I guess you will have a better idea once you read the white papers on Intel's HT.
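
To make the t1..t4 scenario concrete, here is a rough toy model in Python. It is only a sketch under loud assumptions: the "slots per cycle" abstraction, the function names and the workload numbers are all made up for illustration, and none of this reflects how Intel's hardware actually partitions anything. It just compares a strict even split against a policy that hands idle slots to whoever still has work.

```python
# Toy model (an assumption, not Intel's actual design): each cycle the core
# has a fixed number of execution slots, and each unit of work needs one slot.

def cycles_static(work, total_slots):
    """Slots are split evenly; a thread can never use another thread's share."""
    share = total_slots // len(work)
    # Each thread drains its own queue at `share` units per cycle,
    # so the busiest thread determines the finish time.
    return max(-(-w // share) for w in work)  # ceiling division

def cycles_shared(work, total_slots):
    """Idle slots are handed to whichever threads still have work queued."""
    remaining = list(work)
    cycles = 0
    while any(remaining):
        slots = total_slots
        # Round-robin one slot at a time so no thread is starved.
        while slots and any(remaining):
            for i, w in enumerate(remaining):
                if w and slots:
                    remaining[i] -= 1
                    slots -= 1
        cycles += 1
    return cycles

work = [100, 4, 4, 4]   # t1 is busy; t2..t4 are nearly idle (made-up numbers)
static = cycles_static(work, total_slots=8)
shared = cycles_shared(work, total_slots=8)
print(static, shared)
```

With these made-up numbers the even split is bounded by t1 draining its queue through a quarter of the slots, which is the queuing effect described above. A real selection policy would weigh things like cache misses and branch behaviour, which this model ignores entirely.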

Granularity of the timeslice by Latent+Heat · 2004-01-08 03:21 · Score: 1

That's the point -- the time slice on Windows can be quite coarse grained (tens of ms), and if you depend on preemption, you won't get smooth screen updates, avoid dropping packets, or whatever else is required. Even though it is a preemptive multi-tasking system, you end up using it as a cooperative multi-tasking system where threads yield to other threads -- you use synchronization primitives as an efficient way of dispatching among multiple threads in coroutine fashion.
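
A minimal sketch of that coroutine-style hand-off, written in Python rather than against the Win32 API. The semaphores here stand in for whatever blocking primitive you would actually use on Windows; the worker function and the names are hypothetical, not anything from a real codebase.

```python
import threading

# Two worker threads hand control back and forth explicitly, coroutine-style.
# Each semaphore acts as a "your turn" token; blocking on it gives up the CPU
# right away instead of waiting for the scheduler to preempt the thread.
log = []
turn_a = threading.Semaphore(1)   # A goes first
turn_b = threading.Semaphore(0)

def worker(name, my_turn, next_turn, steps):
    for i in range(steps):
        my_turn.acquire()          # block until it is our turn
        log.append(f"{name}{i}")   # the short slice of "real" work
        next_turn.release()        # hand off to the other thread

a = threading.Thread(target=worker, args=("A", turn_a, turn_b, 3))
b = threading.Thread(target=worker, args=("B", turn_b, turn_a, 3))
a.start()
b.start()
a.join()
b.join()
print(log)
```

The order of the work items is fixed by the hand-offs, not by the length of the scheduler's time slice, which is the whole reason for doing it this way.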

If you invoke a synchronization primitive that blocks, of course you are going to yield to another thread, whether it is on another processor or not. If you do Sleep(1) (I misspoke calling it Wait() -- it is called Sleep()), that blocks because you have to wait 1 ms. If you do Sleep(0), that is supposed to block for the remainder of the time slice, but it doesn't on SMP/hyperthreaded systems, and I wonder how many people out there have also lost hours of sleep because MS wouldn't document this.

Re:Off Topic - Your Tiny URL by don_s · 2004-01-08 04:39 · Score: 1

Why in god's name would you link that picture? That is just utterly hideous and disgusting. You have a demented sense of humor... then again, it was my own stupidity for not noticing the Tiny URL after the link. But then again, it was very misleading. ugggg... I will have nightmares for weeks.

I second that by r6144 · 2004-01-08 23:03 · Score: 1

In a DCT algorithm I wrote using SSE intrinsics (mainly _mm_addps and _mm_mulps), I tried really hard to optimize the code for icc8 (which I used by default during the optimization), but the resulting code runs at only 2.4Gflops on a 2.4GHz Pentium 4 (which is pretty low efficiency for 4:1 vector code). gcc 3.2 generates 10% faster code without much hand-tweaking.
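
For a sense of scale, the figures above work out as follows. This is back-of-the-envelope arithmetic only; the "ideal" of one 4-wide packed operation per cycle is an assumption for illustration, not a claim about what the P4 can actually sustain.

```python
# Figures quoted in the post above.
measured_flops = 2.4e9    # 2.4 Gflops
clock_hz = 2.4e9          # 2.4 GHz Pentium 4
lanes = 4                 # single-precision floats per SSE register

flops_per_cycle = measured_flops / clock_hz
# Assumption for illustration only: an idealized core retiring one 4-wide
# packed operation every cycle. Real P4 throughput limits are not modeled.
fraction_of_ideal = flops_per_cycle / lanes
print(flops_per_cycle, fraction_of_ideal)
```

So the code is averaging about one floating-point operation per clock, a quarter of that idealized 4-wide rate.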

The strange thing is that the resulting assembly code doesn't seem to be much different or particularly inefficient --- both gcc's and icc's output is a long stream of addps, mulps and movaps instructions, and since the evaluation order is made explicit in the C code, dependencies should not be much of a problem. The working set fits comfortably inside the L2 cache, but the L1 cache is expected to thrash a little. I can't see why this code is that inefficient.

Similar things happened when I was hand-optimizing an IIR filter for icc8. The speed is quite decent (about 7Gflops in the inner loop), but after I changed "a=b+c+d" to "a=d+b+c" (since d is calculated first, I think this should at least not hurt), speed mysteriously halved. The assembly code doesn't look much different at a glance, either.
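
One thing worth keeping in mind about that reordering: in C, "b+c+d" is evaluated as (b+c)+d and "d+b+c" as (d+b)+c, and floating-point addition is not associative, so a compiler without relaxed-math options has to treat them as two different computations with different dependency chains. The Python sketch below only demonstrates the non-associativity, using doubles and deliberately extreme made-up values. It does not explain why the speed halved.

```python
# Deliberately extreme values so the rounding difference is visible.
b, c, d = 1e16, -1e16, 1.0

first = (b + c) + d     # "a = b + c + d": b and c cancel exactly, then d is added
second = (d + b) + c    # "a = d + b + c": d is lost when added to the huge b

print(first, second)
```

The two orderings give different answers here, which is why the compiler is not free to pick whichever order schedules best on its own.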

The last two cases look similar. I guess the P4 may have much degraded performance when the reorder buffer fills up or something. Anyway, this at least shows that even icc (as of now) does not give reliable performance. If you want the absolute highest performance, make sure you always keep an eye on the benchmark results.

Of course, icc has automatic vectorization while gcc doesn't, and this is the most important reason why icc often beats gcc 2:1 in some floating-point benchmarks. However, in my case the most time-consuming loops are invariably too complicated for icc to parallelize automatically (one for a custom DCT algorithm, one for a 4th order IIR filter), so I still have to vectorize them by hand.

But the parent poster's 50:1 ratio does seem strange.
