Linux on an Intel PIII vs. G4?

Re:Processor features by vipw · 2001-03-24 11:11 · Score: 2

There are many factors in the equation of a system's computational speed.

In this discourse, by Alpha I mean the 21264, and I will make distinctions between
the P3/P4 and the K7 where applicable. I am uncertain of most of the numbers for
the SPARC (UltraSPARC III) chips.

Processor frequency:
x86's strongest point, followed by Alpha, then SPARC, then G4
(well, that might be a little out of order, and don't put too much stock in
frequency alone anyway; it's only one component of system speed)

System bus width:
Most processors share this bus with the memory bus but not with the cache
bus. It is usually 64 bits wide but runs at different frequencies on different architectures.
The P3 and the G4 have a 100 MHz bus, the K7 has a 133 MHz DDR bus (266 MHz effective),
the Alpha has a 333 MHz bus, and I can't find relevant literature on the
UltraSPARC III.

To the best of my knowledge, all of these chips have a 64-bit system bus.
The system bus is where disk drive controllers, PCI/AGP, etc. reside.

Memory bus width:
The P3/P4/K7, G4, and Alpha share this bus with the system bus; the SPARC chips,
I believe, do not. One thing of note about the Alpha: it has 4 separate
memory controllers that talk down the same bus, so even though it uses
100 MHz SDRAM, it can completely fill the 333 MHz bus.

A lot of crazy stuff comes into play on the memory bus if you have an
excessively SMP machine. SPARCs have on-chip memory controllers and can
access memory easily, and chips with bigger caches don't need
to read from main memory as often. Cache size makes a staggering
difference, since the cache often runs at the same frequency as the CPU.

Cache bus width:
Everything but the Alpha has a 64-bit cache bus; the Alpha's is 128-bit, with
error checking to boot!

Cache frequency:
Most chips have 2 separate caches, and most PC CPUs have both on the same
die as the CPU, running at full speed. The 'L1' cache is usually only
about 8k-64k and always runs at full speed. The 'L2' cache is usually much
bigger, although the P4 has a very small (64k) one. The speed of the L2
is as follows:
P3/P4/K7 (Thunderbird): full speed; G4: 200-350 MHz; Alpha: 333 MHz;
I don't know about the SPARC.

The frequency contributes not only to cache bandwidth but also to cache
latency: if your cache runs at half speed, each cache clock costs two CPU
clocks, so you'll wait an extra cycle for every cache cycle when pulling data from it.

Cache size:
K7 512 KB, P3 256 KB, P4 64 KB, P3 Xeon 512-2048 KB, Alpha up to 8 MB, G4 512 KB

Memory latency:
The memory subsystem is another level of waiting for the data you're after in the
CPU. It usually takes a few cycles to get data from memory; how many
is determined by the CAS and RAS latencies, usually 2 or 3 memory-bus cycles each.

Memory frequency:
RDRAM (some P3 and all P4 systems) runs at 400-800 MHz.
The Athlon has 133 MHz DDR (266 MHz effective).
The G4 has 100 MHz.
The Alpha has 100 MHz, but 4 controllers.

Memory bandwidth:
Each path is 64 bits; the Alphas have 4 simultaneous memory controllers, and the HeSL P3 chipset
has 2. I think the SPARCs control memory on a per-chip basis. All others
have a single 64-bit path.

Well, folks, those are some numbers that have nothing to do with the way the
CPU works or the benefits of multiple instructions per clock, but the system
architecture surrounding the chip is just as important to the
system's performance as the operation of the chip itself, if not more so.

CPU architecture:
OK, here's where my (half-hearted) research breaks down.
Branch prediction, pipeline length, concurrent instructions/instructions per
cycle, fetches per cycle, and a bunch of other factors come into play when
assessing the efficiency of a CPU architecture.

The G4 really stands out because of its super-short pipeline on the 500 MHz
and lower models, at something like 5(?) stages; the P4, on the other hand, has a
staggeringly long 20+ stage pipeline. The shorter the pipeline, the shorter the cache
and memory delays, and the smaller the misprediction penalty. On the
downside, a short pipeline usually makes it hard to reach high clock speeds. Most chips are in
the 9-15 stage range.

Concurrent instructions are the realm of MMX, 3DNow!, SSE2, and AltiVec.
The G4's AltiVec unit gives the largest improvement, but concurrent
instructions are mostly useful for 3D graphics, and
much of that work is now being offloaded to graphics chips.

But back to the question: for a laptop, the P3 is your only real option. Even
though its only real strong point is its clock frequency, its clock *is* twice as
high as any of your other options, which is certainly enough to make it the notebook
CPU champ. Maybe, just maybe, if your specific applications lend themselves
to optimization for the AltiVec unit, the G4 500 would dethrone the P3.
If I were you, I would lie to myself, say the G4 was my best bet, and then
I would have a great excuse to pick up a Titanium PowerBook.

Laptop screamers by Chaostrophy · 2001-03-25 10:16 · Score: 2

If you can get a G4 at 850 MHz in a laptop, it probably is the fastest. The 1 GHz Intels likely cannot really run at that speed. Also, the Mac has a higher maximum RAM (1 GB or better) than the PC (OK, I could be wrong).

So, is your data int or float, 8, 16, 32, or 64 bit, and can you work on several chunks at a time? If it comes in chunks of 32 bits or smaller, and you can do several at once, the Mac is likely to rule supreme. It has 32 vs. 8 128-bit registers, and can do 2 instructions per clock tick vs. 1 every other tick for the P3, for 4 times the speed, and better ops to boot.

Once again, what exactly are you doing?

Hey /., how about asking that kind of question the next time someone asks what kind of system they should buy?

--
Plato seems wrong to me today

Re:For raw speed, ditch gcc. by shrike · 2001-03-24 09:29 · Score: 2

I did some testing on my own SGI Indigo2 R10K and my Sun Ultra 10 workstation at work. When I use the SGI compiler (MIPSpro 7.1 in my case), the compiled code is at least 30-50% faster than the gcc 2.95.2 compiled code. On the SPARC, however, gcc generated much faster code than Sun's own compiler (Forte 4, I think). Another example is Compaq; their ccc compiler generates much faster code than gcc and, to boot (unlike MIPSpro), it's free!

Now I find myself wondering about a few things.

Why is gcc so poorly optimized for 'exotic' architectures?
Does anyone really use C for computational work, knowing that the GNU Fortran compiler performs even worse?
Why is Sun selling such a crappy compiler?

The fact that SGI and Compaq (Digital) have such good compilers may be explained by their machines being used in scientific establishments where CPU performance is key, while Sun's machines are the favourites of dotcom farmers requiring massive amounts of IO (databases, etc.). When a uni needs a new supercomputer they'll look to SGI, Compaq (Alpha), Intel (they've got very good compilers) and maybe even IBM (SP2). But I've never heard of a uni using a Sun as a supercomputer (cluster of UE10000s, anyone?)

SARA, a Dutch institution that maintains and houses several of Holland's supercomputers, houses mostly SGI/Cray, Alpha and IBM hardware (and even some Beowulf clusters). They do have a lot of Sun hardware, but most of it is used as web or database servers.

My point? Well, maybe compiler (gcc and vendor) performance is influenced by heritage. In a scientific setting people will use the vendor-supplied compiler, demanding and paying for premium performance. They don't really feel the need to contribute a very good code optimizer to the gcc project. However, in the dotcom world everything must be done as cheaply as possible with maximum (ahem) performance. Hence, there are a lot of people tinkering with gcc for Intel (and maybe even SPARC).

Whatever the case may be, the day gcc generates working 64-bit code I'll drink a few beers for the guys working on gcc. As it stands now, gcc can't generate a decent (maybe I should say working) 64-bit binary for either the SGI or the SPARC platform :( (I haven't tried it on an Alpha yet.)

And yes, I'm one of those CS drop-outs (web farmer) being forced to accept a fairly large amount of cash for trivial work while I would prefer doing research work for a minimum wage. Oh well, we can't all be brilliant.

Memory speeds by shrike · 2001-03-24 00:44 · Score: 3

I don't know how memory intensive your application will be, but you should be aware of the fact that memory access on a laptop is usually quite a bit slower than memory access on a desktop/server. Also, the memory speed can vary wildly between different brands and configurations.

We've got a couple of Dell Inspiron laptops that do about 280 MB/sec (according to SiSoft Sandra 2001se), while we've also got some noname laptops that only do about 160-170 MB/sec. The Dells have a 500 MHz Pentium III (100 MHz bus), the noname laptops a 500 MHz Celeron (66 MHz bus). rc5des runs at about the same speed on both types of laptops, but seti@home is quite a bit faster on the Dell (seti@home is much more memory intensive than rc5des). This speed difference can be explained by the fact that the Dell uses a 100 MHz bus and faster RAM.

My noname desktop (Athlon 650 MHz) does about 420 MB/sec and runs rc5des and seti@home about 60-80% faster.

Just some useless numbers...

Re:For raw speed, ditch gcc. by sql*kitten · 2001-03-25 20:59 · Score: 2

Gcc is a nice tool; it's free, and it works well. Unfortunately, even with -O3 -funroll-loops, it can't optimize for beans.

Aye. I know you want SuSE, but I'd recommend at least benchmarking your code with Watcom C/C++ compiler on Windows NT or 2000. Great numerical code generation, and this really can make a big difference.

Ask SuSE Folks by waldoj · 2001-03-24 01:25 · Score: 4

This sure ain't getting marked as +1 Informative, but had you considered checking with the SuSE teams? SuSE is one of very few distros that are processor-agnostic, so I bet they've done some tests of their own.

FWIW, OS X Server on a PPC outperformed Linux on a 450 MHz Intel PII by 23%, according to osOpinion. (YMMV, read the fine print, etc., etc.)

-Waldo

For raw speed, ditch gcc. by Christopher+Thomas · 2001-03-24 01:51 · Score: 4

Your main problem if you're looking for a speed boost for applications won't be the processor - it'll be the algorithms you use and the compiler.

For the algorithm:

One word. Cache.

Main memory is up to an order of magnitude slower than the cache. Make your algorithms cache-friendly. This means optimizing row vs. column accesses and doing checkerboarding for things like matrices, and other optimizations for vectors. For things like linked lists and trees, try to keep nodes contiguous with other nodes in memory where possible (or even just the key and linkage pointers, since that's all you'll be accessing most of the time when doing a search).

It takes a while to fully zen into this, but it will pay off in spades.

For the compiler:

The following applies to the gcc C/C++ compiler. I'm assuming that you'll get similar performance results for the g77 Fortran compiler. You're on your own for hand-optimizing Fortran (I don't know the language).

Gcc is a nice tool; it's free, and it works well. Unfortunately, even with -O3 -funroll-loops, it can't optimize for beans. I had to study this in detail as a project for one of my grad courses, and I was appalled when I found out just how many potential optimizations it wouldn't catch.

If you're at the point where you're ready to optimize core algorithm code without worrying about keeping it simple, then either replace it with inline assembly or (for better portability) write "pseudo-assembly" C code: use temporary variables declared with the "register" keyword in place of registers, and write statements that each perform only an operation that maps easily to machine code. Hand-unrolling and hand-software-pipelining worked wonders. Gcc will do the unrolling for you, but not the pipelining (I think), and it won't move even obvious candidate variables to registers.

Using a chip with a large register set (like the PPC) makes this a bit more scalable, but it still works well on x86 chips (to a point). I tested on x86 and Sparc architectures.

Lastly, bear in mind that you might, if you're lucky, get a factor of 10 out of all of this. Make sure that your algorithm is of a well-behaved order, and consider using a cluster of PCs for anything really power-hungry (though that involves optimizing communications, too).

Re:For raw speed, ditch gcc. by Chris+Pimlott · 2001-03-24 14:53 · Score: 2

Why is gcc so poorly optimized for 'exotic' architectures?

I think you overlook a more obvious answer. 'Exotic' architectures have fewer users, and therefore fewer developers who are knowledgeable enough to contribute to gcc so that it optimizes better. It's commonly accepted that gcc is best on x86, which is unsurprising considering how widely used the platform is.

Cost is certainly a factor in this too; after all, it was because x86 hardware was cheap and Minix expensive that Torvalds created Linux.

Re:For raw speed, ditch gcc. by gbnewby · 2001-03-25 00:56 · Score: 2

I didn't see people mentioning that there is a lot of optimization code for Pentium-type CPUs (how well it works, I can't attest to). There is. Someone mentioned the optimization also works well for Sparcs. The reason, of course, is this is what more of the gcc developers have access to.

As to Alphas and 64-bit code: like another poster, I have had great success with gcc/g++ on my Alpha DP264 system. gcc has NO problem with 64-bit code. One of the problems that is probably confounded with perceptions of gcc performance is that the Linux kernel has only recently (e.g., 2.2.18 and beyond) been reasonably bug-free with 2GB+ of memory. Again, it's a byproduct of more developers (and end-users/testers) having access to 64-bit CPUs on large-memory machines.

Re:For raw speed, ditch gcc. by randombit · 2001-03-24 23:41 · Score: 2

Gcc is a nice tool; it's free, and it works well. Unfortunately, even with -O3 -funroll-loops, it can't optimize for beans. I had to study this in detail as a project for one of my grad courses, and I was appalled when I found out just how many potential optimizations it wouldn't catch.

So true. I've benchmarked crypto code [code that can take great advantage of pipelining and good register allocation] that I've written, with gcc and a few commercial compilers (all running on Linux on the same system), and in some cases I would see a 3x-4x speed increase. And if you have gcc dump the asm, you'll see many silly things, even with full optimizations. This is totally from memory, but it gives you an idea of what I'm talking about:

add esp,-4
[some instructions that just use registers, don't read or write to esp]
add esp,-8

I hand-optimized the code (removing the second instruction and changing the first to add esp,-12), and it worked fine. This is of course a trivial example (yeah, I saved one cycle!), but in a large program things like this could add up to tens of millions of cycles (think inner loops of long-running programs).

If your algorithm is already in pipeline-friendly form, you'll generally be OK, but AFAIK, you're right about GCC not rearranging instructions to handle pipelining (but I haven't looked into this too carefully).

Of course, I figure that for heavy numerical work a G4 will kick an x86's butt, just on the basis that a G4 has a reasonable number of registers. I'm amazed that Intel hasn't added a new extension like MMX or SSE that gives programs a few more GPRs; it would really be useful to a wide variety of programs.

Re:For raw speed, ditch gcc. by autocracy · 2001-03-24 11:43 · Score: 2

Sun supercomputers? Hell yea! Check out the 64-way STARFIRE machine. A little bit bigger than a fridge, this baby rocks. Each individual board takes 4 processors and 4 GB of RAM. And each board can be serviced without bringing down the machine: press a button, wait while information is flushed from the RAM and the system takes the processors out of the loop, and then pop the board out when the light says it's OK. You'll find it fast in the Sun Store. IBM must have been pissed when Sun rolled two of those sweet babies into USAA's data center in San Antonio!

I can't be karma whoring - I've already hit 50!

--
SIG: HUP

Compilers matter most for G4's by Daniel+Dvorkin · 2001-03-24 04:34 · Score: 2

As someone else pointed out, gcc is a great general-purpose compiler, but it doesn't do a good job of optimizing for specialized instruction sets like the G4's AltiVec (or, for that matter, the Pentium's MMX). I'd go with a G4 and then get CodeWarrior; the folks who write CW have, for obvious reasons, more experience than anyone else in creating a compiler that can optimize G4 code. (Er, I'm assuming CW for Linux is available for the PowerPC; I'd be very surprised if it weren't. But I've been surprised before.) As a Mac guy, I can tell you that CW-compiled apps on a G4 absolutely scream. If that option is available, I think it's far and away the best.

--
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.

Re:Processor features by autocracy · 2001-03-24 10:21 · Score: 2

Excuse me, but the G4 IS a 64 bit chip. And an AMD would be my choice before a G4, except he's LOOKING FOR A LAPTOP.

As for the majority of apps, how many of them actually use the massive caching abilities of an Alpha (or UltraSparc, which you neglected to mention)? That's why those chips are used in database servers, development machines (code compilers), and video systems (UltraSparcs + Sun graphics cards = playing several videos on several screens with real-time decoding of compressed and uncompressed video).

Anything further that you'd like to add?

I can't be karma whoring - I've already hit 50!

--
SIG: HUP

Re:Processor features by autocracy · 2001-03-24 11:36 · Score: 2

Well hell, that says it all: you have too much time on your hands :)

The UltraSparc (III) info can of course be found somewhere on Sun's website (www.sun.com). Keep in mind, however, that the UltraSparc II, IIe, and others are still in full force. Also, the key thing that makes a G4 count as a 64-bit chip and a Pentium as a 32-bit one is that while both access the PCI bus and RAM at 64 bits, only the G4 does internal calculations at 64 bits.

Also, for DNA modeling, etc., you'll be able to use larger data sets on the G4 than on the other laptop available chips. And most important: the Titanium laptops look pretty damned cool!

I can't be karma whoring - I've already hit 50!

--
SIG: HUP

Processor features by autocracy · 2001-03-24 02:25 · Score: 5

Here are the key processor features:

Pure speed: MHz is the definition used here. The higher the number, the more cycles you get every second.
Bandwidth: Measured in bits. Currently, Alpha, UltraSparc, and PowerPC G4 chips have a 64-bit setup. All modern Intels only reach 32 bits (forget IA-64, it's not really "out there").
Cache: Alpha and UltraSparcs carry a hefty 8 MB of processor cache. Xeon chips carry 1 and 2 MB caches. Pentiums usually have around 1/2 MB, and I believe the G4 does too.
Coding: Not really measurable, but here is what the processors excel at:
- Alpha and UltraSparc: The big boys; these chips can handle anything. Usually used in database servers because of their massive caches.
- G4: Graphics and heavy math. The 64 bit data path allows much more information to travel through the chip per cycle than Intels. Any parallel data will go much faster here than on Intel.
- Xeon: Honestly, it's over-priced crap. Go buy an Alpha or an UltraSparc.
- Pentium: They do something? Wow! Seriously, the P4s aren't much help unless you can optimize for them. PIIIs can hold their own, but being consistently beaten by Athlons running at lower speeds is shaming them. The Pentium's only true strength is that it is the most common chip, and therefore has more options (mobos, SMP, etc.).
- Athlon: Their overclockability is the shining point. If you don't mind screwing around with your box, go buy a water cooler and an Athlon 1.33 GHz and pair it up with PC2100 (?) DDR RAM. You'll get a 266 MHz transfer rate on data from the RAM. And it all costs less than an off-the-shelf Pentium.

Overall, go for an Alpha first, then the UltraSparc (interchangeable). Obviously you can't really use these in a laptop, but they are there. Next, shoot for a G4; you get more for your money at the lower speeds. Athlons are next. They ARE hard to find in laptops, but worth it (I think). Otherwise, get a PIII.

I can almost bet that any benchmarks you do will follow my suggestions.

I can't be karma whoring - I've already hit 50!

--
SIG: HUP

Slashdot Mirror
