Linux on an Intel PIII vs. G4?
An anonymous submitter sent in: "I'm currently looking into purchasing a new laptop. This machine will run SuSE Linux and I will be developing some pretty processor-intensive applications (genetic algorithms, mathematical simulations, etc.), so raw speed is the major factor. I've been searching for information on the relative speeds of an 850MHz P3 vs. a 500MHz G4, but all tests I've seen are on the 'native' OS (OS 9/X vs. WinMe/2000). Has anyone out there done some tests running the same OS (Linux/OpenBSD)?"
There are many factors in the equation of a system's computational speed.
in this discourse by alpha i mean 21264 and will make distinctions between
p3/p4 and k7 where applicable, i am uncertain on most of the numbers for
sparc (UltraSPARC III) chips.
processor frequency:
x86's strongest point, followed by alpha, then sparc, then g4
(well, that might be a little out of order, and don't put too much stock in
just the frequency anyway, it's simply one component of the system speed)
system bus width:
most processors share this bus with the memory bus but not with the cache
bus. It is usually 64bit wide but at differing frequencies on different archs.
The p3 and the g4 have a 100Mhz bus, the K7 has a 133Mhz DDR(266 effective),
the alpha has a 333Mhz bus, and i can't find relevant literature of the
UltraSPARC III.
to the best of my knowledge all of these chips have a 64bit system bus
the system bus is where disk drive controllers and pci/agp etc reside.
memory bus width:
P3/P4/K7, G4, and alpha share this bus with the system bus, the sparc chips,
i believe, do not. one thing of note about the alpha, it has 4 separate
memory controllers that talk down the same bus, so even though it uses
100 MHz SDRAM, it can completely fill the 333Mhz bus.
a lot of crazy stuff comes in to play in the memory bus if you have an
excessively SMP machine, sparcs have on chip memory controllers and can
access the memory easily, and the chips with bigger cache size don't need
to read as often from the main memory. the cache size makes a staggering
difference since it is often at the same frequency as the CPU.
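the alpha claim above is easy to sanity-check with back-of-the-envelope arithmetic. here is a rough sketch in C, assuming (my assumption, not something from a datasheet) that each of the 4 controllers feeds a 64bit path at 100MHz and that the shared bus is 64bit at 333MHz. peak_mb_per_s is just a name made up for this example, and the numbers are theoretical peaks that ignore latency and protocol overhead.

```c
#include <assert.h>

/* Theoretical peak transfer rate in MB/s (1 MB = 10^6 bytes).
 * bus_bits: data path width in bits (assumed to be a multiple of 8)
 * mhz:      effective transfer clock in MHz
 * channels: number of independent controllers working in parallel
 * This is a ceiling only; real throughput is lower. */
unsigned long peak_mb_per_s(unsigned bus_bits, unsigned mhz, unsigned channels)
{
    assert(bus_bits % 8 == 0);
    return (unsigned long)(bus_bits / 8) * mhz * channels;
}
```

with those assumed figures, peak_mb_per_s(64, 100, 4) gives 3200 for the four controllers combined and peak_mb_per_s(64, 333, 1) gives 2664 for the bus, so the controllers together can supply more than the bus can carry. a single 64bit 100MHz path tops out at 800.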
cache bus width:
Everything but the alpha has a 64bit cache bus, the alpha's is 128bit and
error checking to boot!
cache frequency:
Most chips have 2 separate caches, most pc cpus have them on the same
die as the CPU and running at full speed. The 'L1' cache is usually only
about 8k-64k and is always at full speed. The 'L2' cache is usually much
bigger, although the P4's is a modest 256k (its L1 data cache is a tiny 8k). The speed of the L2
is as follows:
P3/P4/K7(thunderbird) full speed, G4 200Mhz-350Mhz, Alpha 333Mhz
dunno on the sparc.
the frequency is not only a contributor to the cache bandwidth, but also the
cache latency. if your cache is half speed you'll have to wait another cycle
to pull data from it.
cache size:
k7 512kb (256kb on-die for thunderbird), p3 256kb, p4 256kb, p3 xeon 512-2048kb, alpha up to 8MB, g4 512kb
memory latency:
memory subsystems are another level of wait on the data you're after in the
cpu. it usually takes a few cycles to get data from memory, how long
is determined by CAS and RAS latencies, usually between 2 and 3 on each.
memory frequency:
RDRAM (some p3 and all p4) has 400-800Mhz.
athlon has 133Mhz ddr (266Mhz effective)
g4 has 100Mhz
alpha has 100Mhz but 4 controllers
memory bandwidth:
64bits, the alphas have 4 simultaneous memory controllers, the HeSL P3 chipset
has 2. i think sparcs have it controlled on a per chip basis. all others
have 1 64bit path.
well folks, there are some numbers that have nothing to do with the way the
cpu works or the benefits of multiple instructions per clock, but the system
architecture surrounding the chip is at least as important to the
system's performance as the operation of the chip itself.
CPU architecture:
ok, here's where my (half-hearted) research breaks down,
branch prediction, pipeline length, concurrent instructions/instructions per
cycle, fetches per cycle, and a bunch of other factors come in to play with
assessing the CPU architecture efficiency.
The g4 really stands out because of its super short pipeline on the 500Mhz
and lower models at only 4 stages, the p4 on the other hand is at a
staggeringly high 20+ stages. the shorter the pipeline the shorter cache
and memory delays are, and the smaller the misprediction penalty is. on the
down side, a short pipeline usually makes it hard to reach high clock speeds.
most chips are in the 9-15 stage range.
concurrent instruction (SIMD, one instruction working on several data
elements at once) is the realm of MMX, 3dnow, SSE/SSE2, and altivec.
the g4's altivec unit gives the largest improvement, but the use of
concurrent instructions is mostly useful in the context of 3d graphics, and
much of that work is now being offloaded to the graphics chips.
but back to the question, for a laptop, p3 is your only real option. even
though its only real strong point is its clock frequency, its clock *is* nearly
twice as high as any of your other options, which is certainly enough to make it
the notebook cpu champ. maybe, just maybe, if your specific applications lend
themselves to optimization for the altivec unit the g4 500 would dethrone the p3.
if i were you, i would lie to myself and say the g4 was my best bet and then
i would have a great excuse to pick up a titanium powerbook.
If you can get a G4 at 850mhz in a laptop, it probably is the fastest. The 1ghz Intels likely cannot really run at that speed. Also, the mac has a higher max ram (1GB or better) than the PC (ok, I could be wrong).
/., how about getting that kind of question for the next time someone asks what kind of system they should buy?
So, is your data int or float, 8, 16, 32, or 64 bit, and can you work on several chunks at a time? If it is in 32 bit or smaller chunks, and you can do several at once, the mac is likely to rule supreme. It has 32 vs 8 128bit registers, and can do 2 instructions per clock tick vs 1 every other for the P3, for 4 times the speed, and better ops to boot.
Once again, what exactly are you doing?
Hey
Plato seems wrong to me today
Now I find myself wondering about a few things.
The fact that SGI and Compaq (Digital) have such good compilers may be explained by their machines being used in scientific establishments where CPU performance is key, while Sun's machines are the favourites of dotcom farmers requiring massive amounts of IO (databases, etc). When a uni needs a new supercomputer they'll look to SGI, Compaq (Alpha), Intel (they've got very good compilers) and maybe even IBM (SP2). But I've never heard of a uni using a Sun for a supercomputer (cluster of UE10000's anyone?)
SARA, a Dutch institution that maintains and houses several of Holland's supercomputers, houses mostly SGI/Cray, Alpha and IBM hardware (and even some Beowulf clusters). They do have a lot of Sun hardware, but most of it is being used as web or database servers.
My point? Well, maybe compiler (gcc and vendor) performance is influenced by heritage. In a scientific setting people will use the vendor supplied compiler, demanding and paying for premium performance. They don't really feel the need to contribute a very good code optimizer to the gcc project. However, in the dotcom world everything must be done as cheap as possible with maximum (ahem) performance. Hence, there are a lot of people tinkering with gcc for Intel (and maybe even SPARC).
Whatever the case may be, the day gcc generates working 64-bit code I'll drink a few beers for the guys working on gcc. As it stands now, gcc can't generate a decent (maybe I should say working) 64-bit binary for either the SGI or the SPARC platform :( (I haven't tried it on an Alpha yet.)
And yes, I'm one of those CS drop-outs (web farmer) being forced to accept a fairly large amount of cash for trivial work while I would prefer doing research work for a minimum wage. Oh well, we can't all be brilliant.
We've got a couple of Dell Inspiron laptops that do about 280 MB/sec of memory bandwidth (according to SiSoft Sandra 2001se), while we've also got some noname laptops that only do about 160-170 MB/sec. The Dells have a 500 MHz Pentium III (100 MHz bus), the noname laptops a 500 MHz Celeron (66 MHz bus). rc5des runs at about the same speed on both types of laptops, but seti@home is quite a bit faster on the Dell (seti@home is much more memory intensive than rc5des). This speed difference can be explained by the fact that the Dell uses a 100 MHz bus and faster RAM.
My noname desktop (Athlon 650 MHz) does about 420 MB/sec and runs rc5des and seti@home about 60-80% faster.
Just some useless numbers...
Aye. I know you want SuSE, but I'd recommend at least benchmarking your code with Watcom C/C++ compiler on Windows NT or 2000. Great numerical code generation, and this really can make a big difference.
This sure ain't getting marked as +1 Informative, but had you considered checking with the SuSE teams? As one of very few distros that are processor-agnostic, I bet they've done some tests of their own.
FWIW, OS X server on a PPC outperformed Linux on an Intel 450 PII by 23%, according to osOpinion. (YMMV, read the fine print, etc., etc.)
-Waldo
Your main problem if you're looking for a speed boost for applications won't be the processor - it'll be the algorithms you use and the compiler.
For the algorithm:
One word. Cache.
Main memory is up to an order of magnitude slower than the cache. Make your algorithms cache-friendly. This means optimizing row vs. column accesses and doing checkerboarding for things like matrices, and other optimizations for vectors. For things like linked lists and trees, try to keep nodes contiguous with other nodes in memory where possible (or even just the key and linkage pointers, since that's all you'll be accessing most of the time when doing a search).
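To make the row vs. column and checkerboarding advice concrete, here is a minimal sketch in C. It assumes row-major storage (the C default), and the tile edge BLOCK is a guess that you would tune to your cache size. The function names are made up for this example. The first function walks memory in storage order, and the second transposes a matrix one tile at a time so that both the source and the destination tile stay cache-resident while they are being touched.

```c
#include <stddef.h>

#define BLOCK 32  /* tile edge in elements; assumed value, tune per machine */

/* Sum a rows x cols row-major matrix in storage order: the inner loop
 * moves through adjacent addresses, so every cache line fetched is
 * fully used before the next one is needed. */
double sum_row_major(const double *a, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            s += a[i * cols + j];
    return s;
}

/* Blocked ("checkerboarded") transpose.
 * src is rows x cols, dst is cols x rows, both row-major.
 * Work proceeds tile by tile instead of striding through all of dst. */
void transpose_blocked(const double *src, double *dst,
                       size_t rows, size_t cols)
{
    for (size_t ii = 0; ii < rows; ii += BLOCK) {
        for (size_t jj = 0; jj < cols; jj += BLOCK) {
            /* Clamp the tile at the matrix edges. */
            size_t i_end = (ii + BLOCK < rows) ? ii + BLOCK : rows;
            size_t j_end = (jj + BLOCK < cols) ? jj + BLOCK : cols;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

Swapping the two loops in sum_row_major gives the same answer with a stride of cols elements between accesses, which is the slow pattern to avoid on large matrices. How much either change buys you depends on matrix size and the machine, so time it on your own data.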
It takes a while to fully zen into this, but it will pay off in spades.
For the compiler:
The following applies to the gcc C/C++ compiler. I'm assuming that you'll get similar performance results for the g77 Fortran compiler. You're on your own for hand-optimizing Fortran (I don't know the language).
Gcc is a nice tool; it's free, and it works well. Unfortunately, even with -O3 -funroll-loops, it can't optimize for beans. I had to study this in detail as a project for one of my grad courses, and I was appalled when I found out just how many potential optimizations it wouldn't catch.
If you're at the point where you're ready to optimize core algorithm code without worrying about it staying simple, then either replace it with inline assembly or (for better portability) write "pseudo-assembly" C code, with temp variables with the "register" keyword instead of registers, and statements only performing operations that can be easily mapped to machine code. Hand-unrolling and hand-software-pipelining worked wonders. Gcc will do the unrolling for you, but not the pipelining (I think) and it won't move even obvious candidate variables to registers.
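As an illustration of the "pseudo-assembly" style, here is a sketch of a dot product unrolled by hand four ways, next to the plain version. This is my own toy example, not the code from the course project, and the function names are made up. Note that "register" is only a hint that a compiler is free to ignore, and that the four separate accumulators change the order of the floating-point additions, so results can differ from the simple loop in the last bits.

```c
#include <stddef.h>

/* Plain version, for comparison. */
double dot_simple(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Hand-unrolled by 4. Each statement maps to roughly one
 * multiply-add, and the four accumulators are independent of each
 * other, so the additions do not form one long dependency chain. */
double dot_unrolled(const double *x, const double *y, size_t n)
{
    register double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    register size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        s0 += x[i] * y[i];

    return (s0 + s1) + (s2 + s3);
}
```

Whether this beats what the compiler produces on its own depends on the compiler version, the flags, and the chip, so benchmark both versions before committing to the uglier one.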
Using a chip with a large register set (like the PPC) makes this a bit more scalable, but it still works well on x86 chips (to a point). I tested on x86 and Sparc architectures.
Lastly, bear in mind that you might, if you're lucky, get a factor of 10 out of all of this. Make sure that your algorithm is of a well-behaved order, and consider using a cluster of PCs for anything really power-hungry (though that involves optimizing communications, too).
As someone else pointed out, gcc is a great general-purpose compiler but it doesn't do a good job of optimizing for specialized instruction sets like the G4's AltiVec (or, for that matter, the Pentium's MMX). I'd go with a G4 and then get CodeWarrior; the folks who write CW have, for obvious reasons, more experience than anyone else in creating a compiler that can optimize G4 code. (Er, I'm assuming CW for Linux is available for the PowerPC -- I'd be very surprised if it weren't. But I've been surprised before.) As a Mac guy, I can tell you that CW-compiled apps on a G4 absolutely scream. If that option is available, I think it's far and away the best.
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
As for the majority of apps, how many of them actually use the massive caching abilities of an Alpha (or UltraSparc, which you neglected to mention)? That's why they are used on database servers, development machines (code compilers), and video systems (UltraSparcs + Sun graphics cards = playing several videos on several screens with real-time decoding of compressed and uncompressed video).
Anything further that you'd like to add?
I can't be karma whoring - I've already hit 50!
SIG: HUP
The UltraSparc(III) info can of course be found somewhere on Sun's website (www.sun.com). Keep in mind, however, that UltraSparc II, IIe, and others are in full force still. Also, the key area that gets a G4 considered a 64 bit chip and a Pentium a 32 bit one is that while both access the PCI bus and RAM at 64 bits, only the G4 does internal calculations at 64 bits.
Also, for DNA modeling, etc., you'll be able to use larger data sets on the G4 than on the other laptop available chips. And most important: the Titanium laptops look pretty damned cool!
Overall, go for an Alpha first, then the UltraSparc (interchangeable). Obviously you can't really use these in a laptop, but they are there. Next shoot for a G4. You get more for your money at the lower speeds. Athlons are next. They ARE hard to find in laptops, but worth it (I think). Else, get a PIII.
I can almost bet that any benchmarks you do will follow my suggestions.