Linux 2.6 And Hyper-Threading

Posted by timothy on Monday February 23, 2004 @12:37PM from the bits-and-pieces dept.

David Peters writes "2CPU.com has posted an article on Hyper-Threading performance in Linux. They use Gentoo 1.4 and kernel 2.6.2 and run through several server-oriented benchmarks like Apache, MySQL and even Java server performance with Blackdown 1.4. The hardware they use in the tests is border-line ridiculous (3.2GHz Xeons, 3.2GHz P4 and P4 Prescott) and the results are actually quite interesting. It's a good read as he even takes the time to detail his system configuration all the way down to the CFLAGS used while compiling the software."

51 comments

License software based on # of CPUs by fredrikr · 2004-02-23 12:51 · Score: 2, Insightful

Has anybody run into a problem with Hyper-Threading and per-CPU licensing?
1. Re:License software based on # of CPUs by MerlynEmrys67 · 2004-02-23 12:57 · Score: 3, Interesting
  
  I haven't - but then I don't run Oracle and many of the "per CPU" server applications.
  I'm really waiting to see what these vendors will do when true Multicore CPUs are popular with the unwashed masses.
  Especially when there are 4-16 cores per CPU
  
  --
  I have mod points and I am not afraid to use them
2. Re:License software based on # of CPUs by Anonymous Coward · 2004-02-23 13:14 · Score: 2, Informative
  
  This is, I think, going to be more of a problem in the Windows world than the Linux world.
  
  There should only be the problem of under-utilization if the software doesn't support multiple processors. The software should not (if it's correctly designed) cease to function if it suddenly detects more than the number of processors it's licenced for - it should simply run on however many processors it was expecting.
  
  A possible workaround (if some multithreaded software fails in a multiprocessor environment): in Windows NT / 2000 and XP Pro you can set a CPU affinity mask (either manually through Task Manager or, if you launch the process yourself, through a particular API call). This is a 32-bit wide bitfield that indicates which CPUs the task is permitted to access (bit 0 = cpu 0, bit 1 = cpu 1, etc).
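
The bitfield described above can be sketched in a few lines. This only illustrates the bit layout (bit n = CPU n); the helper names `affinity_mask` and `cpus_in_mask` are made up for this example and are not part of any Windows API.

```python
def affinity_mask(cpus):
    """Build an affinity bitmask from CPU indices (bit n set = CPU n allowed)."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError("CPU index must fit in a 32-bit mask")
        mask |= 1 << cpu
    return mask


def cpus_in_mask(mask):
    """List the CPU indices that a mask permits."""
    return [bit for bit in range(32) if mask & (1 << bit)]


# Restrict a task to the first logical CPU only.
print(hex(affinity_mask([0])))        # 0x1
# Allow both logical CPUs of a single hyperthreaded chip.
print(hex(affinity_mask([0, 1])))     # 0x3
print(cpus_in_mask(0b0101))           # [0, 2]
```

For comparison, Python's `os.sched_setaffinity` on Linux takes a set of CPU indices rather than a raw mask, so the list form returned by `cpus_in_mask` is the more convenient one there.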
  
  Windows XP Home would be the most likely problem that anyone runs into as it only has single processor support. But it should still work. And nobody except mom & dad should be running XP Home anyway.
  
  FYI, Windows 2000 Pro supports 2 processors and works well with the hyperthreaded CPUs straight out of the box with no problem (this is what I'm running on the machine I'm typing on). However - if the number of CPUs changes (say, swapping out a single threaded CPU for a hyperthreading one), I believe you need to reinstall the operating system (because you'll otherwise be running the single-CPU kernel files).
3. Re:License software based on # of CPUs by saden1 · 2004-02-23 13:27 · Score: 2, Insightful
  
  If it is on one chip, it is one CPU. I'll be damned if I'm going to pay more.
  
  --
  
  -----
  One is born into aristocracy, but mediocrity can only be achieved through hard work.
4. Re:License software based on # of CPUs by kayen_telva · 2004-02-23 13:38 · Score: 5, Informative
  
  actually, Intel recommends against using HyperThreading with Win2K (all flavors)
  
  Intel.com
  
  it will run but performance sucks
5. Re:License software based on # of CPUs by Anonymous Coward · 2004-02-23 14:35 · Score: 0
  
  > However - if the number of CPUs changes, I believe you need to reinstall the operating system
  
  No, on any sort of modern motherboard you can do it right from the device manager. See Q234558.
6. Re:License software based on # of CPUs by dJCL · 2004-02-24 03:35 · Score: 1
  
  Well, even though it's modded troll...
  
  Sure it's got problems, but that's where my paycheque comes from. I'd rather support linux.
  
  And I made the comment to highlight that windows sees it as two logical processors, not one.
  
  --
  On Arrakis: early worm gets the bird. Magister mundi sum!
7. Re:License software based on # of CPUs by KarmaPolice · 2004-02-24 07:23 · Score: 1
  
  Has anybody run into a problem with Hyper-Threading and per-CPU licensing?
  Are you talking about SCO?
8. Re:License software based on # of CPUs by Anonymous Coward · 2004-02-24 07:27 · Score: 0
  
  ..people actually use licenses? Wow.
  
  Anyway, just put in 2 CPUs if it bitches/complains. If anyone audits you, you'll be good. If they don't already know, explain to them what hyperthreading is and that it's ONE CPU.
Says who? by Anonymous Coward · 2004-02-23 12:51 · Score: 5, Interesting

The hardware they use in the tests is border-line ridiculous

I'm typing this on a 3.0 GHz Pentium 4 that has hyperthreading. The entire system cost me $1200 to build just before Christmas - including 1GB of RAM, a Radeon 9800 Pro video card and a 120GB SATA hard drive. Dell and IBM sell 3GHz notebooks now for a similar price.

My point is that a 3.2GHz CPU is not ridiculous in an age where 2.66GHz processors are considered entry-level (FYI, Dell is currently selling a 2.66GHz desktop for $499).

What are you still running on? A 486?
1. Re:Says who? by MikeCapone · 2004-02-23 15:32 · Score: 4, Funny
  
  What are you still running on? A 486?
  
  I will *not* answer that question!
  
  *door slams*
  
  --
  Treehugger? Treehugger... Treehugger!
2. Re:Says who? by Anonymous Coward · 2004-02-24 02:18 · Score: 0
  
  You got to be kidding me; this guy is trying to compare the price of a P4 with Xeon.
3. Re:Says who? by Slashcrunch · 2004-02-24 15:31 · Score: 1
  
  Yes I am, you insensitive clod!
Redo. by BrookHarty · 2004-02-23 12:54 · Score: 4, Funny

Ok, Time to redo the benchmarks, Kernel 2.6.3 is out.
[joking]

Be nice when we see some nice Opteron benchmarks vs the new Xeons.

-
"But Calvin is no kind and loving god! He's one of the _old_ gods! He demands sacrifice!"
1. Re:Redo. by BuckaBooBob · 2004-02-25 05:56 · Score: 1
  
  Why compare a 64 bit CPU to a 32 bit one....
  
  Why not compare a Chevy Sprint to a Top Fuel Dragster while you're at it... Those would be just as interesting :)
  
  --
  Who needs WiFi when we can have Packet Over Sheep! http://datacomm.org/PoS-InternetDraft.txt
2. Re:Redo. by Anonymous Coward · 2004-02-25 16:13 · Score: 0
  
  They are 32 bit with 64 bit extensions. They are the same. ;)
Cute comment on compiling by MerlynEmrys67 · 2004-02-23 12:55 · Score: 3, Informative

Of course my opinion is: why not use as large a -j as you can, and distribute the problem? Take a server farm and turn your compile over to ccache and distcc (look up both projects on samba.org).
The first one performs semi-miracles on repetitive build times where you aren't doing "incremental" builds. The second lets you distribute your compile to multiple build servers on the network (beware - there be daemons here).
Build times went from hours to minutes - it was great.

--
I have mod points and I am not afraid to use them
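
A rough sketch of the ccache-plus-distcc setup described in the comment above. It assumes the real `DISTCC_HOSTS` and `CCACHE_PREFIX` environment variables are how the two tools are wired together; the host names, the `distributed_make_command` helper, and the "two jobs per remote CPU" rule of thumb are illustrative assumptions, not anything the poster specified.

```python
import os
import shlex


def distributed_make_command(hosts, cpus_per_host=2):
    """Return (argv, env) for a make run that caches with ccache and
    farms cache misses out to distcc build servers."""
    # Assumed rule of thumb: about two jobs per CPU across the farm.
    jobs = 2 * cpus_per_host * len(hosts)
    env = dict(os.environ)
    env["DISTCC_HOSTS"] = " ".join(hosts)   # servers distcc may use
    env["CCACHE_PREFIX"] = "distcc"         # ccache hands misses to distcc
    argv = ["make", f"-j{jobs}", "CC=ccache gcc"]
    return argv, env


# Hypothetical host names, purely for illustration.
argv, env = distributed_make_command(["buildbox1", "buildbox2"])
print(shlex.join(argv))
print(env["DISTCC_HOSTS"])
```

The returned pair can be passed to `subprocess.run(argv, env=env)` from the top of a source tree; nothing is executed by the sketch itself.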
1. Re:Cute comment on compiling by addaon · 2004-02-23 13:29 · Score: 2, Interesting
  
  Built into ProjectBuilder, using Rendezvous, on all current macs.
  
  --
  
  I've had this sig for three days.
2. Re:Cute comment on compiling by Anonymous Coward · 2004-02-23 13:53 · Score: 0
  
  Where can I get ProjectBuilder for Linux?
3. Re:Cute comment on compiling by addaon · 2004-02-23 19:52 · Score: 2, Informative
  
  Here, if you have decent hardware.
  
  --
  
  I've had this sig for three days.
Tantalizing . . . by Mysteray · 2004-02-23 13:09 · Score: 4, Interesting

Those sure are some interesting numbers. On the order of a 49% increase or 35% decrease in performance depending on the application. I always figured those high-GHz CPUs would be completely IO-bound. I guess this sometimes allows threads to run with what they've got in the on-chip cache.
Makes you wonder if a kernel could detect if it was helping or not and selectively enable it.
I did some informal testing between VC++ native and C# to .Net bytecode. I had a little loop calculating primes. The native C++ kept everything in registers, while the CLR made everything relative memory accesses to BP. I figured that would devastate performance, but on the Pentium 4, it was only 5% slower! It seems to have an L1 cache that's as fast as the registers. That will certainly make it easier on the compiler writers.
Sort of off topic, did anyone else see that article in MSDN about using .Net for serious number crunching? The author seemed to write the whole article as if he thought it was a good idea. Not that there wouldn't be some advantages to doing that (such as the possibility of tuning for the processor at runtime), but the one graph he showed comparing with native code had .Net running 50% to 33% slower!
1. Re:Tantalizing . . . by metalix · 2004-02-23 13:26 · Score: 5, Funny
  
  I did some informal testing between VC++ native and C# to .Net bytecode. I had a little loop calculating primes. The native C++ kept everything in registers, while the CLR made everything relative memory accesses to BP. I figured that would devastate performance, but on the Pentium 4, it was only 5% slower! It seems to have an L1 cache that's as fast as the registers. That will certainly make it easier on the compiler writers.
  
  oops you just violated the VS.NET EULA by posting a performance benchmark. shame on you!
2. Re:Tantalizing . . . by Mysteray · 2004-02-23 13:43 · Score: 1
  
  Doh!
  I'll go shave my head now so the electrodes will make better contact. I do live in Florida, you know.
3. Re:Tantalizing . . . by Anonymous Coward · 2004-02-23 23:06 · Score: 0
  
  Not that there wouldn't be some advantages to doing that (such as the possibility of tuning for the processor at runtime), but the one graph he showed comparing with native code had .Net running 50% to 33% slower!
  Well 50% slower isn't really a problem -- you could just wait a year or two and run it on newer hardware at the same speed as the C++ program would have run.
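
The "wait a year or two" quip can be turned into rough arithmetic. The sketch below assumes that "50% slower" means running at half the native speed and "33% slower" at about two thirds of it, and that hardware speed doubles every 18 months; both are assumptions made for illustration, not figures from the MSDN article, and `months_to_catch_up` is a hypothetical helper.

```python
import math


def months_to_catch_up(relative_speed, doubling_months=18.0):
    """Months of hardware progress needed to offset running at
    `relative_speed` (0.5 = half the native speed), assuming speed
    doubles every `doubling_months` months."""
    if not 0 < relative_speed <= 1:
        raise ValueError("relative_speed must be in (0, 1]")
    return doubling_months * math.log2(1 / relative_speed)


for label, speed in [("50% slower", 0.50), ("33% slower", 0.67)]:
    print(f"{label}: about {months_to_catch_up(speed):.1f} months")
```

Under those assumptions the half-speed case needs exactly one doubling period, which is where "a year or two" comes from.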
4. Re:Tantalizing . . . by Anonymous Coward · 2004-02-24 10:13 · Score: 0
  
  I had a little loop calculating primes. The native C++ kept everything in registers, while the CLR made everything relative memory accesses to BP. I figured that would devastate performance, but on the Pentium 4, it was only 5% slower! It seems to have an L1 cache that's as fast as the registers. That will certainly make it easier on the compiler writers.
  I'm pretty sure L1 runs at core speed on P4, I think L2 does as well. Also, there are many optimization options that are possible in this situation... why not try your code out again with Intel's compiler, or write hand-optimized assembly and see how much of a performance boost you can get?
5. Re:Tantalizing . . . by Mysteray · 2004-02-24 11:29 · Score: 2, Insightful
  Well 50% slower isn't really a problem -- you could just wait a year or two and run it on newer hardware at the same speed as the C++ program would have run.
  
  Or just write it in C++ in the first place and:
  
  - have the results of your computation a year or two sooner
  - have a product that's not half as fast as your competitors'
  - have a product that runs faster on newer hardware instead of one that performs like it's on yesterday's hardware
  - save your customers' money on hardware and claim some of that back on your sale
  - have a product that has a chance at portability
  - have a product that is suitable for people who have computers today, instead of the (much smaller) market segment of people with computers from the future
6. Re:Tantalizing . . . by be-fan · 2004-02-24 17:40 · Score: 2, Interesting
  
  The P4 seems to handle indirect accesses extremely well. They did a benchmark of bcc a while ago. Bcc is a version of GCC that does bounds checking. Now, bounds-checking in C sucks because you have arbitrary pointer arithmetic. So a pointer balloons from a 4-byte word that fits in a register, to a 12-byte structure that must be accessed indirectly. On a P3 and an Itanium, the penalty was huge, reaching 117% for the P3. However, the penalty on the P4 was only 34%.
  
  --
  A deep unwavering belief is a sure sign you're missing something...
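
The size blow-up described in the comment above can be illustrated with a small sketch. It assumes a 32-bit target and a "fat pointer" made of three 4-byte fields (current address, base, and limit); that layout is an assumption for illustration, not necessarily what bcc does internally, and `FatPointer` and `deref_ok` are hypothetical names.

```python
import struct
from dataclasses import dataclass

PLAIN_POINTER = "<I"      # one 32-bit address
FAT_POINTER = "<III"      # address, base, limit (assumed layout)


@dataclass
class FatPointer:
    addr: int    # where the pointer currently points
    base: int    # first valid byte of the object
    limit: int   # one past the last valid byte

    def offset(self, n):
        """Pointer arithmetic: move the address, keep the bounds."""
        return FatPointer(self.addr + n, self.base, self.limit)

    def deref_ok(self, size=1):
        """Bounds check performed before every access."""
        return self.base <= self.addr and self.addr + size <= self.limit


print(struct.calcsize(PLAIN_POINTER))   # 4
print(struct.calcsize(FAT_POINTER))     # 12

buf = FatPointer(addr=0x1000, base=0x1000, limit=0x1010)  # 16-byte object
print(buf.offset(15).deref_ok())    # True: last valid byte
print(buf.offset(16).deref_ok())    # False: one past the end
```

Every dereference now costs two comparisons plus loads of the extra fields, which is the kind of indirect traffic the quoted penalties are measuring.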
7. Re:Tantalizing . . . by Anonymous Coward · 2004-02-25 05:31 · Score: 0
  
  L1 is 2 cycles on an original P4 and 4 cycles on the newer P4.
Money? by dJCL · 2004-02-23 13:20 · Score: 0, Redundant

OK, he cannot afford to buy a benchmark, but he has a trio of top of the line Intel systems to play with! WTF? Either he has a weird idea of money well spent, or someone has a lucrative agreement with the hardware vendors. I'm guessing the latter, and really wish I could write well enough to sucker them into sending me cool hardware to play with.

I'll live with my 2800+ (2.133GHz) AMD MP (only one for now, I'll upgrade when I need it). I'm running Seti, playing music, encoding DVDs and sometimes messing with the UT2004 demo and not even noticing it...

On another note, when do the PCI Express test systems come out? I'd love to see some benchmarks on those, as the PCI performance for my secondary and tertiary video cards is below par. PCI Express is supposed to allow multiple high speed video cards.

Anyway.

--
On Arrakis: early worm gets the bird. Magister mundi sum!
1. Re:Money? by Anonymous Coward · 2004-02-23 13:54 · Score: 3, Insightful
  
  Well the hardware is provided by the manufacturers for review (it is a hardware site after all). SPEC doesn't just go around handing out copies of their (very expensive) benchmarking applications.
They need -mm by keesh · 2004-02-23 13:22 · Score: 0, Informative

-mm kernels include fixes for the HT screwiness. Well, not fixes per se, but hacks that make the scheduler a bit smarter. Problem is, Linux still sees a single HT CPU as two discrete CPUs, so there's a performance hit because of the way registers are handled.
1. Re:They need -mm by Anonymous Coward · 2004-02-23 14:57 · Score: 5, Insightful
  
  You are an idiot. To start with, a CPU with HT has two discrete visible register sets. If you are so smart, how would you fix this imaginary performance hit by "handling" registers better?
  
  Second, the SMT scheduler in -mm kernels isn't a hack. It is a general and extensible topology description that the scheduler uses to achieve exactly the behaviour it needs.
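
One way to see the logical-versus-physical distinction being argued about here is to count both from /proc/cpuinfo. The sketch below assumes an x86 Linux system whose /proc/cpuinfo exposes `processor` and `physical id` fields; the sample text is hand-written for illustration, not captured from the benchmark machines, and `count_cpus` is a hypothetical helper.

```python
def count_cpus(cpuinfo_text):
    """Return (logical, physical) CPU counts from /proc/cpuinfo-style text."""
    logical = 0
    packages = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            packages.add(value)
    # Kernels that omit "physical id" give no package info; fall back
    # to treating every logical CPU as its own package.
    return logical, len(packages) or logical


# Hand-written sample: one hyperthreaded package seen as two logical CPUs.
SAMPLE = """\
processor\t: 0
physical id\t: 0
siblings\t: 2

processor\t: 1
physical id\t: 0
siblings\t: 2
"""

print(count_cpus(SAMPLE))   # (2, 1)
```

On a real machine the same function can be fed `open("/proc/cpuinfo").read()`.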
2. Re:They need -mm by XaXXon · 2004-02-23 15:33 · Score: 1
  
  I'd just like to say that this post is exactly correct, and the original poster has no idea what's going on and should be modded down as such.
3. Re:They need -mm by HRbnjR · 2004-02-23 17:37 · Score: 4, Interesting
  
  For the record though, the important point was that the stock 2.6 kernels do not yet handle HT in an ideal manner. The article doesn't mention if the Gentoo kernel used for the benchmarks is HT patched or not.
  
  And with special thanks to Zack Brown, those interested can read summaries of HT issues here:
  
  http://www.kerneltraffic.org/kernel-traffic/topics/Hyperthreading.html
4. Re:They need -mm by Anonymous Coward · 2004-02-23 23:47 · Score: 0
  
  Unfortunately there isn't a "-1, Ignorant" option. Maybe we should replace "Overrated" with "Ignorant" and subject it to M2? That would kill two birds with one stone.
5. Re:They need -mm by Anonymous Coward · 2004-02-24 07:30 · Score: 0
  
  Isn't it Con Kolivas's new patchset that has fixes for HT?
  
  http://members.optusnet.com.au/ckolivas/kernel/
  
  AFAIK they aren't included in -mm.
Whats the big deal? by Anonymous Coward · 2004-02-23 17:52 · Score: 2, Funny

My entire lab at school is filled with Dual 3.2GHz Xeons with Quadro fx 1000 cards. People have those types of machines... or 100 of them.
VolanoMark by Anonymous Coward · 2004-02-23 21:31 · Score: 0

I'm guessing that the VolanoMark results are what happens when you have a lot of synchronization. When you go from one thread to 2 threads, it costs you a lot to set up the synchronization stuff. When you go from 2 to 4 threads, performance increases again because the cost of synchronization will be less per thread of execution.
Re:i've never seen by dalutong · 2004-02-24 05:22 · Score: 0, Offtopic

I wasn't trying to impress anyone. I was just surprised. I've never attempted a first post before, so I don't follow the different types.

I posted my comment simply because I thought it was odd.

--

What comes first, finding a teacher or becoming a student?
You Know What Would Be Real Funny? by severoon · 2004-02-24 09:05 · Score: 1

What if they discovered they could shrink down an entire 8086 processor to Truly Ridiculous Proportions (that's a technical term) and pile like a thousand or a million of them into the space of a single modern day chip? Ok, since we're a 32-bit world now maybe we'd need to go to bunches of 386's instead. But the point remains--I wonder what kind of modifications to current software would have to be made to exploit this, or if it could all be done in hardware.

It'd be massively parallel computing. Like a human brain. Slow at methodical linear tasks like adding a list of numbers, fast at intuitive tasks like modding this post down to -1.

sev

--
but have you considered the following argument: shut up.
1. Re:You Know What Would Be Real Funny? by homer_ca · 2004-02-24 15:05 · Score: 1
  
  An 8086 topped out at about 10MHz and contained 29,000 transistors. How many of them were you planning to put on a die? 10x10? That's 2.9 million transistors, which is about the same number as a Pentium MMX. Assuming perfect scaling and no overhead for interconnecting those mini 8086 cores, 100x10MHz is 1GHz, but you won't get close to that in real life, so you probably won't be far off from the 233MHz top speed of the P-MMX. I'm not even going to guess what it takes to interconnect those 8086 cores and build a memory controller to keep them fed with data.
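
The back-of-the-envelope numbers above are easy to check. The sketch uses only the figures quoted in the comment (29,000 transistors and roughly 10 MHz per 8086); `grid_estimate` is a hypothetical helper, and the "aggregate clock" is the same idealized perfect-scaling figure the comment itself warns against taking literally.

```python
# Figures quoted in the comment: 29,000 transistors and ~10 MHz per 8086.
TRANSISTORS_PER_8086 = 29_000
CLOCK_MHZ_PER_8086 = 10


def grid_estimate(rows, cols):
    """Total transistors and ideal aggregate clock (MHz) for a rows x cols grid."""
    cores = rows * cols
    return cores * TRANSISTORS_PER_8086, cores * CLOCK_MHZ_PER_8086


transistors, aggregate_mhz = grid_estimate(10, 10)
print(transistors)      # 2900000
print(aggregate_mhz)    # 1000, i.e. 1 GHz with perfect scaling
```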
  
  I think what you're really suggesting is a neural net processor. It should be good at doing human-like tasks like pattern recognition, but practical applications of neural nets are few and far between.
2. Re: You Know What Would Be Real Funny? by atcurtis · 2004-02-24 16:03 · Score: 1
  
  Interesting point...
  
  I wonder how hard it would be to cram 64 200MHz 486 class CPUs onto a single die. It would give a theoretical max 'speed' of about 12.8GHz. Maybe give it a nice wide 128bit planar bus and clock it at the same speed.
  
  Have to tune the OS to handle that many CPUs efficiently but it should still be a pretty nimble (and relatively low power) computer.
  
  Reminds me of an April Fools article from several years ago, I think in PCW magazine, where someone made a computer out of a couple of hundred Z80 class CPUs, each clocked at 100MHz, and claimed supercomputer performance figures.
  
  --
  -- The universe began. Life started on a billion worlds...
  -- Except on one where stupidity was there first.
3. Re: You Know What Would Be Real Funny? by MikeCapone · 2004-02-27 05:37 · Score: 1
  
  I wonder how hard it would be to cram 64 200MHz 486 class CPUs onto a single die.
  
  Not saying it's not a good plan, but I don't think that 486s went up to 200MHz.
  
  --
  Treehugger? Treehugger... Treehugger!
4. Re: You Know What Would Be Real Funny? by atcurtis · 2004-02-27 08:14 · Score: 1
  
  The 486DX4 was clock-tripled (the DX2 was clock-doubled), and there were 133MHz and 150MHz processors (with 33MHz and 50MHz system buses respectively).
  
  I don't think it would be too much of a stretch for that little extra...
  
  --
  -- The universe began. Life started on a billion worlds...
  -- Except on one where stupidity was there first.