The whole notion of CPU benchmarks is to try to isolate the CPU as much as possible. You can choose to ignore the principle, but it invalidates the notion of doing the benchmark at all.
So, if I'm a physicist doing SPECfp type computations, I can't go to spec.org and compare an IBM system to a PC because they use different compilers?
Under your constraints, it's pointless to compare systems with different OSes and compilers so why does spec.org even bother? The comparison will be invalid under your constraints.
Your solution is to normalize the compiler. I would say that this is invalid. Use the best compiler that you can for that platform. It will showcase the potential of the CPU. You simply can't isolate the CPU from the compiler. It's just as important as the CPU for performance.
Let's say I wanted to compare Photoshop performance on a Mac vs. a PC. We're stuck with 2 binaries, each built with a different compiler - but the binaries offer the same functionality. Is it impossible to compare performance? I'm arguing no. It's a valid comparison.
Most Linux kernels are compiled with gcc. Most Apache servers are compiled with gcc...
Gcc is great, I use it all the time, but for performance, I'll use Intel's icc. The reason the Intel compiler isn't used to compile the Linux kernel is that the Intel compiler for Linux is an immature product under that OS. It was only released on Linux recently as a non-beta product - it does not support the many GNU-extensions that the kernel uses. Another strike against it in the Linux community is that it is closed-source and you have to pay for it. Another reason to not use Intel is that it's not going to work with GDB and many Linux debugging tools are built around it. Sun and Adobe may use gcc to compile their Linux products, but that doesn't change the fact that if they used Intel, their apps would be faster.
BTW, I believe that the optimizations you mention Sun doing are not legal for the SPEC_*_base benchmarks, which were again expressly introduced to prevent clever compiler optimizations from producing misleading results.
Actually, the "art" loop-interchange optimization shows up in their base results. It's the only way Sun can get their Specfp_base scores to be even halfway respectable. No other vendor does this optimization because it's so lame.
The idea behind the tests was to, as much as possible, compare apples to apples (pun not really intended). The idea was to compare OS X + PPC 970 vs. Linux + a top of the line Dell, controlling as many of the other factors as possible. This meant trying to keep the other elements as similar as possible.
I disagree that you should keep elements similar. You should use what is readily available for purchase in the marketplace. Nobody compiles performance-sensitive code with gcc on an x86; they use either Microsoft's compiler or Intel's compiler, both of which have several hundred man-years of engineering put into generating efficient x86 code. If Apple does not have a similarly powerful compiler, well that's too bad for Apple, because their apps will run less efficient code. Having optimized Photoshop code myself, I can assure you that the developers at Adobe do not use gcc. Neither do people at Microsoft. I'll bet Apple developers don't use gcc either (it's been a long time since I've programmed a Mac, and at that time I used Metrowerks). It's stupid to normalize the compiler. Use the best compiler for both platforms (i.e., the ones that the developers actually use). If the compiler for Apple is not as good as the x86 compilers out there, well that's Apple's fault - not the user's.
I am suggesting that Apple's use of gcc is for the sole purpose of handicapping x86 systems for comparison purposes.
If you go look at the SPEC scores for the P4, you'll notice that some are Dell systems that you can go out and buy - as configured. Yes, they use MicroQuill's "SmartHeap", but so does Apple, as you said. They also happen to use Intel's compiler, which is a production, commercial-ready compiler that many performance-sensitive developers use.
As for using RAM-disks and other things like that - you'd be surprised at the lengths some vendors will go to boost SPEC scores. Believe me, if a RAM-disk would help performance, they sure as hell would put it in the SPEC system. As it turns out, SPEC is reasonably insensitive to disk performance.
The POWER4 from IBM (which the PPC970 is a derivative of, incidentally) has a 128 MB L3 cache. Sun uses an auto-parallelizing compiler to claim higher SPECfp scores using 2 processors (for a single thread). Sun also does a little cheating in the compiler by doing not-too-safe loop interchanges to speed up the "art" benchmark - because, how can you sell an expensive workstation if a PC is kicking your ass :) Intel auto-vectorizes loops in FP code to make use of SSE2.
Well, I'm not saying the PowerPC backend is better or worse than the x86 back end, but I am saying that it's ridiculous to imply that Apple was able to "soup up" gcc for their benchmark, while the Intel compiler sucked. The Intel crew have had more than their fair shot at optimising the compiler's output.
I'm saying that the gcc backend sucks for both PowerPC and x86. I also did not want to imply that Apple improved gcc for PPC while inhibiting x86. I'm saying that the Intel compiler is far superior to gcc in generating optimized x86 code. The PPC compilers need to catch up, and until they do, the G5 apps will be handicapped and *may* not be taking full advantage of the G5 microarchitecture.
Your point on hyperthreading is correct. There is a very large OS component that needs to be accounted for.
As for the performance of the Athlon 3200+, what you are quoting was on a very different system.
Performance is system dependent. You can't get around that. SPEC is supposed to normalize the 'task' that you want to complete. For example, one of the tasks is to compute the optimal placement-and-routing of some logic circuits (the benchmark twolf).
Using whatever means possible (CPU, OS, Compiler, Memory subsystem, etc...), make this task go as fast as possible. That is to say, no matter what system you run it on, it should give you the same answer. The one that gives you the answer the fastest is the highest performer. This is after all what the user cares about. The combination of CPU, OS, compiler etc... for the P4 system is better than anything else available at this time.
I would say that's a point in Intel's favor that it has the fastest desktop computer, with AMD not too far behind it.
If the Apple compiler or OS or other non-CPU components are not up to the task, then that's a point against Apple systems. Users that care about performance care about completing a task in the shortest time possible.
This is why SPEC is respected amongst industry insiders as a decent measure of computer systems performance. The benchmarks are chosen by an industry-wide committee and come with full source code. It's not perfect, but what do you want from a performance metric that is supposed to compare such a wide array of disparate computer systems?
Intel and various other parties have spent a lot of time trying to optimize gcc's performance for x86. They completely dwarf efforts made by Apple and IBM to optimize PowerPC performance.
So you're saying that apps compiled on Intel machines will inherently be faster because of superior compiler technology? Score one for Intel then - better compiler = faster apps.
Actually, it's entirely likely that hyperthreading would have actually made the systems perform more slowly, particularly since the stock kernel they were using was not hyperthreading aware. That being said it was interesting that they turned on hyperthreading for the single CPU tests.
Indeed. Turning on hyperthreading can hurt performance because now both threads have to compete for shared resources such as the L2 cache. It can only hurt performance for single thread operation which is why they turn it on for single thread spec2k.
I haven't seen any compelling evidence yet as to whether Athlons or Opterons score better on SPEC benchmarks. It'll be interesting to run these benchmarks again once the G5s are actually out.
How about www.spec.org for starters. The Athlon 3200+ scores 1044 on specint which is well above 800. If the G5 doesn't have a competitive compiler, then you might reasonably expect that apps compiled for G5 won't be competitive either.
On the other hand, you could claim that Apple chose GCC on the Intel platforms to make them look bad in this comparison...
I don't expect much out of the Apple marketing department. These are the same people that claimed that a G4 was a supercomputer.
It's obvious that gcc was chosen to make x86 CPUs look bad. Gcc sucks compared to the Intel compiler for x86. Even AMD uses the Intel compiler to report SPEC scores. The discrepancy is huge. The Intel compiler has global whole program optimization as well as a ton of other features that gcc won't have for years.
IBM could have given Apple a decent compiler with these types of optimization features, but then the performance would look bad against Intel.
Even with this huge gcc-handicap, the 3GHz P4 still beats out the 2GHz machine by more than 10% on specint (889 vs. 800) - and by the way, did you notice that you can now buy 3.2GHz machines for less than the G5 machine which won't be available until August?
Then there's the bogus specrate comparison of 2 processor systems. Why didn't they turn on hyperthreading? Could it be that the dual Xeons would then stomp the dual G5s?
Sure the P4 is higher frequency, but it computes the answer faster than the G5, and that's all that should matter if you care about absolute performance.
Athlons are faster too. So are Opterons. Athlon64 will no doubt be faster as well.
Now there is something to be said for Macs. They have a superior OS and wonderful apps and will give Microsoft some good competition, but to claim the "fastest desktop computer" crown is pure bullshit.
You're assuming that the machines both have storage that runs at the same speed as the CPUs. If Machine-A and Machine-B have the same speed RAM, then machine-A's penalty is much worse if the branch predictor screws up badly; if we have to get data from RAM, and the RAM is the same speed on both machines, then machine-A takes twice as long to refill its pipeline as machine-B.
No. You're confusing things. The branch predictor penalty is independent of DRAM speed. What you're talking about is the penalty of a cache miss that goes out to DRAM.
If they have the same latency to RAM, then they'll both wait just as long (in absolute time) waiting for their DRAM request to come back. Sure, the 2GHz machine will wait for twice as many cycles, but again, those cycles are twice as fast as the 1GHz machine, so that they both wait the same amount of time.
Performance is measured in time, not clock cycles.
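To put numbers on the point above, here is a minimal sketch. The 100 ns DRAM latency and the function names are illustrative assumptions, not measurements of any real machine.

```python
def stall_cycles(latency_ns: float, freq_ghz: float) -> float:
    """Cycles a core sits idle while a memory request of latency_ns is outstanding."""
    return latency_ns * freq_ghz  # GHz is cycles per nanosecond


def stall_time_ns(cycles: float, freq_ghz: float) -> float:
    """Convert a cycle count back into wall-clock nanoseconds."""
    return cycles / freq_ghz


DRAM_LATENCY_NS = 100.0  # assumed value, identical on both machines

for freq_ghz in (1.0, 2.0):
    cycles = stall_cycles(DRAM_LATENCY_NS, freq_ghz)
    wait_ns = stall_time_ns(cycles, freq_ghz)
    print(f"{freq_ghz} GHz: waits {cycles:.0f} cycles = {wait_ns:.0f} ns")
```

The faster clock counts twice as many cycles for the same miss, but the wall-clock wait is identical.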
Actually, I predict that this application should do quite well running on an HT machine.
There are a lot of conditional branches in this kind of application that are frequently mispredicted. This means that the pipeline gets flushed often, which means that the other thread can take advantage of those cleared out resources.
Performance, as I have it defined in my head, is simply the time it takes to complete a task. The lower the better. For this guy's chess application, I think he is more concerned with absolute performance irrespective of total energy consumed.
Your definition includes power efficiency as a consideration. This is a worthy metric that is not lost on the engineers at Intel and AMD - let me assure you. There's a very talented team of engineers in Haifa, Israel building very power efficient designs. Banias is only a taste of what's to come.
If you look at SPECint2000, you will find an integer benchmark called 'crafty'. This is a chess simulator with code sequences that are probably similar to what this guy used.
ASUS A7N8X Motherboard rev. 2.0, AMD Athlon (TM) XP 3200+ scores 1324
You'll find that P6 derivatives (Banias, Athlon, Opteron etc...) do better on this benchmark. There are lots of unpredictable conditional branches in this application, so the incidence of mispredictions is higher than normal. You would think that this is the main contributor to poor P4 performance, but actually that is a second-order effect, because the predictor on the P4 is far better than on other machines. The real problem is that the code will not fit inside the trace cache, but will fit nicely within Athlon's 64KB I-Cache.
I don't think that it is in dispute that Intel went for low IPC/high clock at least partly because it was seen as good for PR -- with the MHz-race and MHz-myth and all.
There absolutely is no truth to the rumor that Intel went for high-frequency/low-ipc designs for marketing reasons. Don't believe me? Look at what an AMD employee says (he used to work for Intel on the P4).
Does this actually surprise anyone? The P4 was only an exercise in marketing by Intel - redesign the chipset so it can be clocked nice and high (so it appeals to the average consumer) and to hell with the performance...
Let me use the converse of your argument. AMD redesigned their chipset to make their IPC too high and to hell with performance.
Why do people insist that high frequency automatically means low performance? I'd say the P4 is pretty damn fast.
It does not matter if the frequency is high or low. If you get the performance, who cares if the frequency is 1GHz or 4GHz? There are lots of ways to go for performance - 2 extremes are "narrow-and-fast" and "wide-and-slow".
Nobody complained when Alpha went for low-ipc/high-frequency designs. Students of computer architecture will remember the days in the early 90s when there was a contest between the "speed-demons" and the "brainiacs". HP built the 'brainiac' machine (which was lower in frequency but had a wider issue) and DEC (Alpha) went for the 'speed-demon' (faster clock, lower-ipc). History shows that Alpha won that particular battle (performance-wise, not market-wise).
Getting higher IPC is hard. In fact, making a superscalar, out-of-order machine wider is really hard. The hardware cost and power grow as the square of the width. Getting higher frequency is hard too, but some believe it is not as hard as getting higher IPC. The cost of the hardware and power of a higher frequency machine grows linearly with frequency.
Yes, the P4 is designed to clock higher than an Athlon. They use fewer gates-per-clock and therefore, necessarily do less work per clock. Unfortunately, performance is not measured in work-done-per-clock. It's measured in absolute time. So if you can get the same amount of work done in the same amount of time, but use more clocks to do it, why should you as a user care? You still got the performance.
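The argument above is the usual relation time = instructions / (IPC x frequency). A minimal sketch, where the instruction count and both machine configurations are made-up numbers chosen only to illustrate the trade-off (they are not P4 or Athlon figures):

```python
def run_time_s(instructions: float, ipc: float, freq_hz: float) -> float:
    """Execution time when the machine retires ipc instructions per cycle on average."""
    return instructions / (ipc * freq_hz)


INSTRUCTIONS = 6e9  # hypothetical program length

# Hypothetical designs: one trades IPC for clock rate, the other does the opposite.
narrow_and_fast = run_time_s(INSTRUCTIONS, ipc=1.0, freq_hz=3e9)
wide_and_slow = run_time_s(INSTRUCTIONS, ipc=1.5, freq_hz=2e9)

print(f"narrow-and-fast: {narrow_and_fast:.2f} s")
print(f"wide-and-slow:   {wide_and_slow:.2f} s")
```

Both products of IPC and frequency come to 3e9 instructions per second, so the user sees the same run time from either design.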
SPEC is largely OS independent. You spend almost 0% of your time in the actual OS when running the apps.
You use a geometric mean for SPEC. There is no argument. The SPEC number is a speedup (a ratio). Everyone knows that you use geometric means for ratios. In fact, SPEC requires that you do so. If you use a simple arithmetic mean, you would get a skewed result depending on if you measure ratio or 1/ratio.
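A small sketch of why the direction of the ratio matters. The two per-benchmark ratios are invented for illustration, and the helper names are my own:

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # exp of the mean of the logs; every value must be positive
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical speedups of machine A over machine B on two benchmarks:
# A is 2x faster on the first and 2x slower on the second.
a_over_b = [2.0, 0.5]
b_over_a = [1.0 / r for r in a_over_b]

print("arithmetic A/B:", arithmetic_mean(a_over_b))  # says A is ahead
print("arithmetic B/A:", arithmetic_mean(b_over_a))  # says B is ahead
print("geometric  A/B:", geometric_mean(a_over_b))
print("geometric  B/A:", geometric_mean(b_over_a))
```

The arithmetic mean declares each machine 25% faster than the other depending on which way you divide. The geometric mean gives 1.0 both ways, and in general geometric_mean(1/r) is exactly 1/geometric_mean(r).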
The PPC970 with its Power4 core, clocked at 1.6GHz, completely trashes a 3GHz P4. Faster bus, faster integer, and completely outclasses the P4 for FPU and SIMD.
Yawn.
The IBM PPC 970 1.8GHz SPECint 2000 score quoted at Microprocessor Forum last year was 937 (source). The current 3.0GHz P4 with Canterwood chipset already sports a SPECint 2000 score of 1164 (source).
Wow...20% slower. As for floating point, the ratio is even worse.
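For anyone checking the arithmetic, a quick sketch using only the two scores quoted above (the function names are my own):

```python
def percent_slower(slow_score: float, fast_score: float) -> float:
    """How far the lower score falls short of the higher one, in percent."""
    return (1.0 - slow_score / fast_score) * 100.0


def percent_faster(fast_score: float, slow_score: float) -> float:
    """How far the higher score exceeds the lower one, in percent."""
    return (fast_score / slow_score - 1.0) * 100.0


PPC970_SPECINT = 937  # score quoted in the post
P4_SPECINT = 1164     # score quoted in the post

print(f"PPC 970 is {percent_slower(PPC970_SPECINT, P4_SPECINT):.1f}% slower")
print(f"P4 is {percent_faster(P4_SPECINT, PPC970_SPECINT):.1f}% faster")
```

Note the asymmetry: "about 20% slower" and "about 24% faster" describe the same pair of numbers, depending on which score is the baseline.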
Good call. The benchmark in question is "art", and they tweaked their compiler to do a loop interchange to speed it up significantly.
I'm calling shenanigans on Sun again for doing the same thing with the "swim" benchmark. Their latest machines have peak scores for this benchmark that are 4-5x faster than the base scores. What they have done here is to get the compiler to skew the data accesses such that they fit better into the cache.
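For readers who haven't seen a loop interchange, here is a minimal sketch of the transformation itself. This is not Sun's compiler output or the actual "art" source, just an assumed toy example; in Python the cache effect is mostly hidden by interpreter overhead, so read it as an illustration of what gets rewritten.

```python
def column_major_sum(grid):
    """Original nest: inner loop walks down a column, striding across rows."""
    rows, cols = len(grid), len(grid[0])
    total = 0
    for j in range(cols):
        for i in range(rows):
            total += grid[i][j]
    return total


def row_major_sum(grid):
    """Interchanged nest: inner loop walks along a row, touching adjacent elements."""
    rows, cols = len(grid), len(grid[0])
    total = 0
    for i in range(rows):
        for j in range(cols):
            total += grid[i][j]
    return total


# Small integer grid so both orders give exactly the same answer.
grid = [[i * 10 + j for j in range(4)] for i in range(3)]
print(column_major_sum(grid), row_major_sum(grid))
```

The interchange is only legal when swapping the loops does not reorder dependent iterations. Here each iteration just adds an integer, so the result is unchanged; with floating-point sums or loop-carried dependences that guarantee can fail, which is where "not-too-safe" comes in.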
Spec scores from Sun are laughable. Industry insiders all know that SPARCs are slow.
It's an old debate in computer architecture. What is the better architecture? Wide and slow or narrow and fast? I believe performance speaks for itself, independent of marketing.
Do a Google search on "speed demons vs. brainiacs". The folks at DEC Alpha chose to design a high frequency, low-ipc machine and the folks at HP designed a low frequency, higher-ipc machine. History has shown that frequency always wins because it is easier to get performance from frequency than to get it from IPC.
The hardware power, complexity, area etc... grow with the square of the width of a superscalar out-of-order processor. Frequency grows the hardware linearly. Let's see who runs out of scalability first.
I think you'd be surprised at how far Intel can take frequency scaling.
Why do you think that Ultrasparc III, Alpha and Itanium don't run at 2GHz? Because the designers felt like leaving performance off the table?
Think again. These processors won't run at anywhere near 2GHz due to a minor thing called "the laws of physics". They aren't designed to run that high and you could never make them go that high without re-designing the whole architecture.
The P4 was designed to run at high frequencies. They do less work per clock (fewer gates per cycle), but make up for it in frequency.
I'm not even going to comment on the stupidity of saying a 2GHz G5 is like a 6-10 GHz pentium 4. That's just ridiculous.
I'm afraid this might add to the confusion about serial interfaces being 'faster' than parallel. While it is true that you don't have to worry about data/clock skew when using serial interfaces, enabling you to clock them faster, a parallel interface running at the same clock speed as a serial interface will always be faster in terms of data throughput. The reason for this is simple: serial = 1 bit per clock, parallel = more than 1 bit per clock.
That's the whole point isn't it? Wide and slow or narrow and fast, you still get the same throughput. You can't clock a parallel interface as fast as a serial one, so you shouldn't compare them with the same clock speed.
I am reminded of the whole P4 vs Athlon debates. It's stupid to compare P4 and Athlon at the same frequency or use the stupid "but Athlon does more per clock" arguments. The P4 is designed to run at a higher clock, but can do less in parallel (IPC), but makes up for that with the higher clock.
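For the interface case, throughput is just width times clock. A minimal sketch with made-up widths and clock rates (these do not describe any particular bus):

```python
def throughput_mbit_s(width_bits: int, clock_mhz: float) -> float:
    """Peak throughput of a link that moves width_bits per clock tick."""
    return width_bits * clock_mhz


# Same clock: the parallel link wins, as the quoted post says.
print(throughput_mbit_s(16, 100.0), "vs", throughput_mbit_s(1, 100.0))

# Realistic comparison: the serial link avoids skew and can be clocked far higher.
parallel = throughput_mbit_s(width_bits=16, clock_mhz=100.0)   # hypothetical parallel bus
serial = throughput_mbit_s(width_bits=1, clock_mhz=1600.0)     # hypothetical serial link
print(parallel, "vs", serial)
```

With these assumed numbers both links peak at 1600 Mbit/s, which is why comparing them at the same clock tells you very little.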
Reliability isn't free and people won't pay extra for it. It's just the economics of the situation.
If people were willing to pay extra for reliability, then you'd see it. As it stands, the market seems to favor feature sets over reliability. Software designers invariably trade reliability for more features - and they should because this is what the market demands.
There's this thing called innovation. It is the primary driver of the economy. Forget about interest rates, the dollar vs. the euro, trade deficits and national debt... These are important and have measurable effects, but they should be treated as mere distractions.
To stay on top of things and become a global leader you need to do something that no one else can do because it has never been done before.
Innovate my friends. Think out of the box. Take a risk. Silicon valley was built on people tinkering in their garages. It's what has kept America on top for so long - and will continue to do so IMHO.
That large cache for your UltraSparc-III is off chip. This is important to note because the bandwidth you get from the off-chip caches is much lower than the bandwidth of an on-chip cache. In fact, if you read the Ultrasparc-III systems specs you'll find that the bandwidth to the cache is comparable to the bandwidth to DRAM on a relatively cheap PC (the new Intel Canterwood/Springdale chipsets have a peak DRAM bandwidth of 6.4GB/sec)
If you need a 2MB cache you should consider a Xeon-MP which has just that. Couple this with a reasonably fast core and you should see some good performance for your application. Most x86 processors will have at least 1MB of cache by the end of the year (Hammer, Prescott, Banias).
As you might imagine, on-chip caches are expensive. As a rule of thumb, the closer the memory is to the processor core, the more expensive it will be.
Your argument that SPARC is superior to x86 is weak. I've designed both kinds of processors and everything these days is basically RISC-like. The x86 code is translated into micro-ops that look like RISC. SPARC also has some stupid instructions and idioms. For example, register windows may seem like a good idea, but they really grow your register file and limit your frequency. Also, delayed branches are stupid and limit many things you can do. If I had to do another SPARC chip, I'd do some translation of my own into more efficient hardware-friendly micro-ops.
SPARC systems are nowhere near as competitive as x86 systems. Their last niche of superiority with server workloads will disappear with the proliferation of Opteron systems.
I tried Walmart's 1 month free DVD rental service and it sucks compared with Netflix. The titles are nowhere near as available as they are on Netflix. Trying to rent just about any decent movie resulted in a "short wait". The same movies were readily available on Netflix.
In a capitalist society, we call that delta a profit. If you have the billions of dollars to invest in Fabs and the army of super-smart Phd types it takes to build a state of the art x86 processor, why don't you do it yourself? It will only cost you $438.67 per cpu and you can save yourself $20.86 - This is of course ridiculous.
In reality, it's a win-win situation. Intel gets the profit from the consumer and the consumer gets a product that he/she couldn't possibly get by other means (AMD notwithstanding - but I hear they too make a profit). There is 0 incentive to sell a product for equal-to or less than the cost of production.
Also, your analysis is flawed and absurd.
- CPU unit volume over time is not exponential. You're confusing unit volume with Moore's Law.
- Market capitalization is correlated to FUTURE earnings potential (not past earnings as your analysis would indicate).
- Intel does not spend 53.6% of the MARKET-CAP on CPU development. It spends 53.6% of ANNUAL SALES.
Actually, if it's a very tight loop (which it is) and the branch in the loop is highly predictable (which it is), and there are no data or instruction cache misses (there aren't), then this program will spend very little time stalling the pipeline! Which means that it does not matter how many pipeline stages you have if you never stall - and this means that this application will scale 100% with frequency. In this case the pipeline depth is totally irrelevant. What is now relevant is fetch and execution bandwidth of the instructions on the critical path of the program.
As I mentioned in another post, you can further improve the performance of this app by increasing the execution bandwidth of rotate instructions (which the G4 does with its data-parallel rotate instruction).
If you have the means, run this benchmark on a system at different frequencies, and you will notice that the score scales linearly 1:1 with frequency.
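A toy model of that claim, under the assumption that run time splits into core cycles (which shrink with frequency) and a fixed memory-stall time (which does not). The cycle count and stall times are invented for illustration:

```python
def run_time_s(core_cycles: float, freq_hz: float, memory_stall_s: float = 0.0) -> float:
    """Core work scales with the clock; time spent waiting on DRAM does not."""
    return core_cycles / freq_hz + memory_stall_s


def speedup_from_doubling(core_cycles: float, freq_hz: float, memory_stall_s: float) -> float:
    """Speedup obtained by running the same work at twice the frequency."""
    slow = run_time_s(core_cycles, freq_hz, memory_stall_s)
    fast = run_time_s(core_cycles, 2 * freq_hz, memory_stall_s)
    return slow / fast


CORE_CYCLES = 1e9  # hypothetical tight loop

# No cache misses: the score scales 1:1 with frequency.
print(speedup_from_doubling(CORE_CYCLES, 1e9, memory_stall_s=0.0))

# Add half a second of DRAM stalls and the scaling falls off.
print(speedup_from_doubling(CORE_CYCLES, 1e9, memory_stall_s=0.5))
```

With no stalls, doubling the clock doubles the score. With the assumed 0.5 s of stalls, the same doubling only yields 1.5x.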
Gcc may be great for PPC, but it sucks for x86, thus making the comparison handicapped. Use the best compiler you can find for both systems.
The idea behind the tests was to, as much as possible, compare apples to apples (pun not really intended). The idea was to compare OS X + PPC 970 vs. Linux + a top of the line Dell, controlling as many of the other factors as possible. This meant trying to keep the other elements as similar as possible.
:) Intel, auto-vectorizes loops in FP code to make use of SSE2.
I disagree that you should keep elements similar. You should use what is readily available for purchase in the marketplace. Nobody compiles performance sensitive code with gcc on an x86, they either use Microsoft's compiler or Intel's compiler which have several hundred man-years of engineering put into making efficient x86 code. If Apple does not have a similarly powerful compiler, well that's too bad for Apple, because their apps will run not-as-efficient code. Having optimized photoshop code myself, I can assure you that the developers at Adobe do not use gcc. Neither do people at Microsoft. I'll bet Apple developers don't use gcc either (It's been a long time since I've programmed a Mac, and at that time I used Metrowerks). It's stupid to normalize the compiler. Use the best compiler for both platforms (ie - the ones that the developers actually use). If the compiler for Apple is not as good as the x86 compilers out there, well that's Apple's fault - not the user's.
I am suggesting that Apple's use of gcc is for the sole purpose of handicapping x86 systems for comparison purposes.
If you go look at the spec scores for the P4, you'll notice that some are Dell systems that you can go out and buy - as configured. Yes, they use microquill's "smartheap", but so does Apple as you said. They also happen to use Intel's compiler which is a production, commercial ready compiler that many performance sensitive developers use.
As for using RAM-disks and other things like that - you'd be suprised as to the lengths some vendors will go to boost SPEC scores. Believe me, if a RAM-disk would help performance, they sure as hell would put it in the SPEC system. As it turns out, SPEC is reasonably insensitive to disk performance.
The POWER-4 from IBM has a 128 MB L3 cache (which the PPC970 is a derivative of incidentally). Sun uses an auto-parallelizing compiler to claim higher SPECfp scores using 2 processors (for a single thread). Sun also does a little cheating in the compiler by doing not-too-safe loop interchanges to speedup the "art" benchmark - because, how can you sell an expensive workstation if a PC is kicking your ass
Well, I'm not saying the PowerPC backend is better or worse than the x86 back end, but I am saying that it's rediculous to imply that Apple was able to "soup up" gcc for their benchmark, while the Intel compiler sucked. The Intel crew have had more than their fair shot at optimising the compiler's output.
I'm saying that the gcc backend sucks for both PowerPC and x86. I also did not want to imply that Apple improved gcc for PPC while inhibiting x86. I'm saying that the Intel compiler is far superior to gcc in generating optimized x86 code. The PPC compilers need to catch up, and until they do, the G5 apps will be handicapped and *may* not be taking full advantage of the G5 microarchitecture.
Your point on hyperthreading is correct. There is a very large OS component that needs to be accounted for.
As for the performance of the Athlon 3200+ you are quoting was on a very different system
Performance is system dependent. You can't get around that. SPEC is supposed to normalize the 'task' that you want to complete. For example, one of the tasks is to compute the optimal placement-and-routing of some logic circuits (the benchmark twolf).
Using whatever means possible (CPU, OS, Compiler, Memory subsystem, etc...), make this task go as fast as possible. That is to say, no matter what system you run it on, it should give you the same answer. The one that gives you the answer the fastest is the highest performer. This is after all what the user cares about. The combination of CPU, OS, compiler etc... for the P4 system is better than anything else available at this time.
I would say that's a point in Intel's favor that it has the fastest desktop computer, with AMD not too far behind it.
If the Apple compiler or OS or other non-CPU components are not up to the task, then that's a point against Apple systems. Users that care about performance care about completing a task in the shortest time possible.
This is why SPEC is respected amongst industry insiders as a decent measure of computer systems performance. The benchmarks are chosen by an industry wide committee with full source code. It's not perfect, but what do want for a performance metric that is supposed to compare such a wide array of dispirate computer systems?
Intel and various other parties have spent a lot of time trying to optimize gcc's performance for x86. They completely dwarf efforts made by Apple & IBM's to optimize PowerPC performance.
So you're saying that apps compiled on Intel machines will inherently be faster because of superior compiler technology? Score one for Intel then - better compiler = faster apps.
Actually, it's entirely likely that hyperthreading would have actually made the systems perform more slowly, particularly since the stock kernel they were using was not hyperthreading aware. That being said it was interesting that they turned on hyperthreading for the single CPU tests.
Indeed. Turning on hyperthreading can hurt performance because now both threads have to compete for shared resources such as the L2 cache. It can only hurt performance for single thread operation which is why they turn it on for single thread spec2k.
I haven't seen any compelling evidence yet as to whether Athlons or Opterons score better on Spec benchmarks. It'll be intersting to run these benchmarks again once the G5's are actually out.
How about www.spec.org for starters. The Athlon 3200+ scores 1044 on specint which is well above 800. If the G5 doesn't have a competitive compiler, then you might reasonably expect that apps compiled for G5 won't be competitive either.
On the other hand, you could claim that Apple chose GCC on the Intel platforms to make them look bad in this comparison...
I don't expect much out of the Apple marketing department. These are the same people that claimed that a G4 was a supercomputer.
It's obvious that gcc was chosen to make x86 CPUs look bad. Gcc sucks compared to the Intel compiler for x86. Even AMD uses the Intel compiler to report SPEC scores. The discrepancy is huge. The Intel compiler has global whole program optimization as well as a ton of other features that gcc won't have for years.
IBM could have given Apple a decent compiler with these types of optimization features, but then the performance would look bad against Intel.
Even with this huge gcc-handicap, the 3GHz P4 still beats out the 2GHz machine by more than 10% on specint (889 vs. 800) - and by the way, did you notice that you can now buy 3.2GHz machines for less than the G5 machine which won't be available until August?
Then there's the bogus specrate comparison of 2 processor systems. Why didn't they turn on hyperthreading? Could it be that the dual Xeons would then stomp the dual G5s?
Sure the P4 is higher frequency, but it computes the answer faster than the G5, and that's all that should matter if you care about absolute performance.
Athlons are faster too.
So are Opterons.
Athlon64 will no doubt be faster as well.
Now there is something to be said for Macs. They have a superior OS and wonderful apps and will give Microsoft some good competition, but to claim the "fastest desktop computer" crown is pure bullshit.
You're assuming that the machines both have storage that runs at the same speed as the CPUs. If Machine-A and Machine-B have the same speed RAM, then machine-A's penalty is much worse if the branch predictor screws up badly; if we have to get data from RAM, and the RAM is the same speed on both machines, then machine-A takes twice as long to refill its pipeline as machine-B.
No. You're confusing things. The branch predictor penalty is independent of DRAM speed. What you're talking about is the penalty of a cache miss that goes out to DRAM.
If they have the same latency to RAM, then they'll both wait just as long (in absolute time) waiting for their DRAM request to come back. Sure, the 2GHz machine will wait for twice as many cycles, but again, those cycles are twice as fast as the 1GHz machine, so that they both wait the same amount of time.
Performance is measured in time, not clock cycles.
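To put numbers on the cache-miss point, here is a minimal sketch. The 100 ns DRAM latency is an assumed figure for illustration only, not a measurement of any real system, and `miss_penalty` is a hypothetical helper.

```python
# Hypothetical DRAM round-trip latency; 100 ns is an assumed figure for illustration.
DRAM_LATENCY_NS = 100.0

def miss_penalty(freq_ghz, latency_ns=DRAM_LATENCY_NS):
    """Return (stall cycles, stall time in ns) for one cache miss that goes to DRAM."""
    cycles = latency_ns * freq_ghz   # 1 GHz == 1 cycle per nanosecond
    time_ns = cycles / freq_ghz      # convert cycles back to wall-clock time
    return cycles, time_ns

for name, ghz in (("1GHz machine", 1.0), ("2GHz machine", 2.0)):
    cycles, time_ns = miss_penalty(ghz)
    print(f"{name}: {cycles:.0f} cycles, {time_ns:.0f} ns")
```

The faster machine burns twice as many cycles on the miss, but both machines stall for the same wall-clock time.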
Actually, I predict that this application should do quite well running on an HT machine.
There are a lot of conditional branches in this kind of application that are frequently mispredicted. This means that the pipeline gets flushed often, which means that the other thread can take advantage of those cleared out resources.
It does not necessarily follow that you need a better predictor if you add pipeline stages.
Let's say I had a 20 stage pipeline running at 2GHz on machine-A.
Let's say I had a 10 stage pipeline running at 1GHz on machine-B.
What machine has a longer pipeline flush penalty?
The answer is that they have the same penalty. Machine-A has 2x more stages to fill, but can re-fill them at 2x the rate that Machine-B can.
If Machine-A had a better predictor algorithm than Machine-B it would perform even better due to fewer pipeline flushes.
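The Machine-A/Machine-B arithmetic can be sketched as follows. This is a toy model that assumes the refill cost is exactly one clock per pipeline stage, which real designs only approximate; `flush_penalty_ns` is a hypothetical helper.

```python
def flush_penalty_ns(stages, freq_ghz):
    """Time to refill a flushed pipeline, assuming one stage is refilled per clock."""
    return stages / freq_ghz   # cycles divided by cycles-per-nanosecond

machine_a = flush_penalty_ns(stages=20, freq_ghz=2.0)  # 20 stages at 2GHz
machine_b = flush_penalty_ns(stages=10, freq_ghz=1.0)  # 10 stages at 1GHz
print(f"Machine-A: {machine_a} ns, Machine-B: {machine_b} ns")
```

Under that assumption both machines pay the same flush penalty in absolute time, so the deeper pipeline only loses out if it also mispredicts more often.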
Ah, yes. You bring up a great point.
Performance, as I have it defined in my head, is simply the time it takes to complete a task. The lower the better. For this guy's chess application, I think he is more concerned with absolute performance irrespective of total energy consumed.
Your definition includes power efficiency as a consideration. This is a worthy metric that is not lost on the engineers at Intel and AMD - let me assure you. There's a very talented team of engineers in Haifa, Israel building very power efficient designs. Banias is only a taste of what's to come.
If you look at SPECint2000, you will find an integer benchmark called 'crafty'. This is a chess simulator with code sequences that are probably similar to what this guy used.
Intel D875PBZ motherboard (3.0 GHz, Pentium 4 processor with HT Technology) scores 1137
ASUS A7N8X Motherboard rev. 2.0, AMD Athlon (TM) XP 3200+ scores 1324
You'll find that P6 derivatives (Banias, Athlon, Opteron etc...) do better on this benchmark. There are lots of unpredictable conditional branches in this application, so the incidence of mispredictions is higher than normal. You would think that this is the main contributor to poor P4 performance, but actually that is a second order effect, because the predictor on the P4 is far better than on other machines. The bigger problem is that the code will not fit inside the trace cache, but will fit nicely within Athlon's 64KB I-Cache.
I don't think that it is in dispute that Intel went for low IPC/high clock at least partly because it was seen as good for PR -- with the MHz-race and MHz-myth and all.
There absolutely is no truth to the rumor that Intel went for high-frequency/low-ipc designs for marketing reasons. Don't believe me? Look at what an AMD employee says (he used to work for Intel on the P4).
Does this actually surprise anyone? The P4 was only an exercise in marketing by Intel - redesign the chipset so it can be clocked nice and high (so it appeals to the average consumer) and to hell with the performance...
Let me use the converse of your argument. AMD redesigned their chipset to make their IPC too high and to hell with performance.
Why do people insist that high frequency automatically means low performance? I'd say the P4 is pretty damn fast.
It does not matter if the frequency is high or low. If you get the performance, who cares if the frequency is 1GHz or 4GHz? There are lots of ways to go for performance - 2 extremes are "narrow-and-fast" and "wide-and-slow".
Nobody complained when Alpha went for low-ipc/high-frequency designs. Students of computer architecture will remember the days in the early 90s when there was a contest between the "speed-demons" and the "brainiacs". HP built the 'brainiac' machine (which was lower in frequency but had a wider issue) and DEC (Alpha) went for the 'speed-demon' (faster clock, lower-ipc). History shows that Alpha won that particular battle (performance-wise, not market-wise).
Getting higher IPC is hard. In fact, making a superscalar, out-of-order machine wider is really hard. The hardware cost and power grow as the square of the width. Getting higher frequency is hard too, but some believe it is not as hard as getting higher IPC. The cost of the hardware and power of a higher frequency machine grows linearly with frequency.
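As a rough illustration of that scaling argument, here is a toy cost model. It takes the quadratic-in-width and linear-in-frequency claims above at face value and generously assumes performance scales perfectly with either knob; both function names are hypothetical.

```python
def cost_of_width(width_factor):
    """Relative hardware/power cost when issue width grows by width_factor (quadratic)."""
    return width_factor ** 2

def cost_of_frequency(freq_factor):
    """Relative hardware/power cost when clock frequency grows by freq_factor (linear)."""
    return freq_factor

for target in (2, 4):
    print(f"{target}x performance: "
          f"{cost_of_width(target)}x cost via width, "
          f"{cost_of_frequency(target)}x cost via frequency")
```

In this model doubling performance by going wider costs 4x, while doubling it through frequency costs 2x, and the gap widens with each further step.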
Yes, the P4 is designed to clock higher than an Athlon. They use fewer gates-per-clock and therefore, necessarily do less work per clock. Unfortunately, performance is not measured in work-done-per-clock. It's measured in absolute time. So if you can get the same amount of work done in the same amount of time, but use more clocks to do it, why should you as a user care? You still got the performance.
SPEC is largely OS independent. You spend almost 0% of your time in the actual OS when running the apps.
You use a geometric mean for SPEC. There is no argument. The SPEC number is a speedup (a ratio). Everyone knows that you use geometric means for ratios. In fact, SPEC requires that you do so. If you use a simple arithmetic mean, you would get a skewed result depending on if you measure ratio or 1/ratio.
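A small sketch of why the arithmetic mean misleads for ratios. The two speedups below (2.0 and 0.5) are made-up numbers chosen so the machines are exactly even overall; they are not SPEC results.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # exp of the mean of logs; equivalent to the n-th root of the product
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical speedups of machine A over machine B on two benchmarks.
a_over_b = [2.0, 0.5]
# The same data expressed the other way round (B over A).
b_over_a = [1.0 / r for r in a_over_b]

print("arithmetic:", arithmetic_mean(a_over_b), arithmetic_mean(b_over_a))
print("geometric: ", geometric_mean(a_over_b), geometric_mean(b_over_a))
```

The arithmetic mean reports each machine as 1.25x faster than the other, which is nonsense. The geometric mean gives 1.0 in both directions, so the answer does not depend on whether you measure ratio or 1/ratio.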
The PPC970 with its Power4 core, clocked at 1.6GHz completely trashes a 3GHz P4. Faster bus, faster integer, and completely outclasses the P4 for FPU and SIMD.
Yawn.
The IBM PPC 970 1.8GHz Specint 2000 score quoted at Microprocessor Forum last year was 937 (source)
The current 3.0GHz P4 with Canterwood chipset already sports a Specint 2000 score of 1164 (source)
Wow...20% slower.
As for floating point, the ratio is even worse.
Move along...nothing to see here...
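For anyone checking the arithmetic, the "20% slower" figure follows directly from the two scores quoted above. `percent_slower` is just a hypothetical helper for the calculation.

```python
def percent_slower(score, reference):
    """How much lower `score` is than `reference`, as a percentage of `reference`."""
    return (1.0 - score / reference) * 100.0

ppc970_specint = 937   # 1.8GHz PPC 970 score quoted above
p4_specint = 1164      # 3.0GHz P4 score quoted above

print(f"PPC 970 is {percent_slower(ppc970_specint, p4_specint):.1f}% slower")
```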
Good call. The benchmark in question is "art", and they tweaked their compiler to do a loop interchange to speed it up significantly.
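For readers who haven't met the term, loop interchange just swaps the nesting order of two loops so the inner loop walks memory contiguously. The sketch below is a generic illustration on a tiny made-up grid, not the actual "art" source, and Python itself won't show the cache effect; it only shows that the transformation preserves the result.

```python
ROWS, COLS = 4, 3
# Small made-up grid holding 0..11 in row-major order.
grid = [[r * COLS + c for c in range(COLS)] for r in range(ROWS)]

def sum_column_first(a):
    """Inner loop walks down a column: consecutive accesses touch different rows."""
    total = 0
    for c in range(len(a[0])):
        for r in range(len(a)):
            total += a[r][c]
    return total

def sum_row_first(a):
    """Loops interchanged: inner loop walks along one row, contiguous in a row-major layout."""
    total = 0
    for r in range(len(a)):
        for c in range(len(a[0])):
            total += a[r][c]
    return total

print(sum_column_first(grid), sum_row_first(grid))
```

A compiler doing this on a large array in C or Fortran gets the same answer with far fewer cache misses, which is where the big jump in the score comes from.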
I'm calling shenanigans on Sun again for doing the same thing with the "swim" benchmark. Their latest machines have peak scores for this benchmark that are 4-5x faster than the base scores. What they have done here is to get the compiler to skew the data accesses such that they fit better into the cache.
Spec scores from Sun are laughable. Industry insiders all know that SPARCs are slow.
It's an old debate in computer architecture. What is the better architecture? Wide and slow or narrow and fast? I believe performance speaks for itself, independent of marketing.
Do a google search on "speed demons vs. brainiacs". The folks at DEC Alpha chose to design a high frequency, low-ipc machine and the folks at HP designed a low frequency, higher-ipc machine. History has shown that frequency always wins because it is easier to get performance from frequency than to get it from IPC.
The hardware power, complexity, area etc... grow with the square of the width of a superscalar out-of-order processor. Frequency grows the hardware linearly. Let's see who runs out of scalability first.
I think you'd be surprised at how far Intel can take frequency scaling.
Why do you think that Ultrasparc III, Alpha and Itanium don't run at 2GHz? Because the designers felt like leaving performance off the table?
Think again. These processors won't run at anywhere near 2GHz due to a minor thing called "the laws of physics". They aren't designed to run that high and you could never make them go that high without re-designing the whole architecture.
The P4 was designed to run at high frequencies. They do less work per clock (fewer gates per cycle), but make up for it in frequency.
I'm not even going to comment on the stupidity of saying a 2GHz G5 is like a 6-10 GHz pentium 4. That's just ridiculous.
I'm afraid this might add to the confusion about serial interfaces being 'faster' than parallel. While it is true that you don't have to worry about data/clock skew when using serial interfaces, enabling you to clock them faster, a parallel interface running at the same clock speed as a serial interface will always be faster in terms of data throughput. The reason for this is simple: serial == 1 bit per clock, parallel == more than 1 bit per clock.
That's the whole point isn't it? Wide and slow or narrow and fast, you still get the same throughput. You can't clock a parallel interface as fast as a serial one, so you shouldn't compare them with the same clock speed.
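The underlying arithmetic is just width times clock. In the sketch below the bus widths and clock rates are hypothetical round numbers picked to make the point, not the specs of any real interface.

```python
def throughput_bits_per_s(width_bits, clock_hz):
    """Peak throughput of an interface that moves width_bits every clock."""
    return width_bits * clock_hz

# Hypothetical wide-and-slow bus: 16 bits at 100 MHz.
parallel = throughput_bits_per_s(width_bits=16, clock_hz=100e6)
# Hypothetical narrow-and-fast link: 1 bit at 1.6 GHz.
serial = throughput_bits_per_s(width_bits=1, clock_hz=1.6e9)

print(f"parallel: {parallel / 1e9} Gbit/s, serial: {serial / 1e9} Gbit/s")
```

Both come out at 1.6 Gbit/s, so comparing them at the same clock speed tells you nothing about which one is actually faster.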
I am reminded of the whole P4 vs Athlon debates. It's stupid to compare P4 and Athlon at the same frequency or use the stupid "but Athlon does more per clock" arguments. The P4 is designed to run at a higher clock, but can do less in parallel (IPC), but makes up for that with the higher clock.
Reliability isn't free and people won't pay extra for it. It's just the economics of the situation.
If people were willing to pay extra for reliability, then you'd see it. As it stands, the market seems to favor feature sets over reliability. Software designers invariably trade reliability for more features - and they should because this is what the market demands.
There's this thing called innovation. It is the primary driver of the economy. Forget about interest rates, the dollar vs. the euro, trade deficits and national debt... These are important and have measurable effects, but they should be treated as mere distractions.
To stay on top of things and become a global leader you need to do something that no one else can do because it has never been done before.
Innovate my friends. Think out of the box. Take a risk. Silicon valley was built on people tinkering in their garages. It's what has kept America on top for so long - and will continue to do so IMHO.
That large cache for your UltraSparc-III is off chip. This is important to note because the bandwidth you get from the off-chip caches is much lower than the bandwidth of an on-chip cache. In fact, if you read the Ultrasparc-III systems specs you'll find that the bandwidth to the cache is comparable to the bandwidth to DRAM on a relatively cheap PC (the new Intel Canterwood/Springdale chipsets have a peak DRAM bandwidth of 6.4GB/sec)
If you need a 2MB cache you should consider a Xeon-MP which has just that. Couple this with a reasonably fast core and you should see some good performance for your application. Most x86 processors will have at least 1MB of cache by the end of the year (Hammer, Prescott, Banias).
As you might imagine, on-chip caches are expensive. As a rule of thumb, the closer the memory is to the processor core, the more expensive it will be.
Your argument that SPARC is superior to x86 is weak. I've designed both kinds of processors and everything these days is basically RISC-like. The x86 code is translated into micro-ops that look like RISC. SPARC also has some stupid instructions and idioms. For example, register windows may seem like a good idea, but they really grow your register file and limit your frequency. Also, delayed branches are stupid and limit many things you can do. If I had to do another SPARC chip, I'd do some translation of my own into more efficient hardware-friendly micro-ops.
SPARC systems are nowhere near as competitive as x86 systems. Their last niche of superiority with server workloads will disappear with the proliferation of Opteron systems.
I tried Walmart's 1 month free DVD rental service and it sucks compared with Netflix. The titles are nowhere near as available as they are on Netflix. Trying to rent just about any decent movie resulted in a "short wait". The same movies were readily available on Netflix.
How is that "ripping you off"?
In a capitalist society, we call that delta a profit. If you have the billions of dollars to invest in Fabs and the army of super-smart Phd types it takes to build a state of the art x86 processor, why don't you do it yourself? It will only cost you $438.67 per cpu and you can save yourself $20.86 - This is of course ridiculous.
In reality, it's a win-win situation. Intel gets the profit from the consumer and the consumer gets a product that he/she couldn't possibly get by other means (AMD notwithstanding - but I hear they too make a profit). There is 0 incentive to sell a product for equal-to or less than the cost of production.
Also, your analysis is flawed and absurd.
- CPU unit volume over time is not exponential. You're confusing unit volume with Moore's Law.
- Market capitalization is correlated to FUTURE earnings potential (not past earnings as your analysis would indicate)
- Intel does not spend 53.6% of the MARKET-CAP on CPU development. It spends 53.6% of ANNUAL SALES.
Actually, if it's a very tight loop (which it is) and the branch in the loop is highly predictable (which it is), and there are no data or instruction cache misses (there aren't), then this program will spend very little time stalling the pipeline! Which means that it does not matter how many pipeline stages you have if you never stall - and this means that this application will scale 100% with frequency. In this case the pipeline depth is totally irrelevant. What is now relevant is fetch and execution bandwidth of the instructions on the critical path of the program.
As I mentioned in another post, you can further improve the performance of this app by increasing the execution bandwidth of rotate instructions (which the G4 does with its data-parallel rotate instruction).
If you have the means, run this benchmark on a system at different frequencies, and you will notice that the score scales linearly 1:1 with frequency.
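A toy model of that scaling claim, assuming run time is simply instruction count divided by (IPC x frequency) plus any fixed stall time that does not shrink with the clock (DRAM waits, for example). The instruction count, IPC and stall figures are all made up for illustration.

```python
def run_time_s(instructions, ipc, freq_hz, stall_s=0.0):
    """Core execution time plus stall time that is independent of clock frequency."""
    return instructions / (ipc * freq_hz) + stall_s

def speedup(freq_lo_hz, freq_hi_hz, stall_s=0.0, instructions=1e9, ipc=1.0):
    """How much faster the same program finishes at the higher clock."""
    slow = run_time_s(instructions, ipc, freq_lo_hz, stall_s)
    fast = run_time_s(instructions, ipc, freq_hi_hz, stall_s)
    return slow / fast

print("no stalls:        ", speedup(1e9, 2e9))               # tight loop, everything in cache
print("0.5 s of stalling:", speedup(1e9, 2e9, stall_s=0.5))  # memory-bound variant
```

With zero stall time, doubling the clock doubles the score, which is the 1:1 scaling described above. Add frequency-independent stalls and the same clock bump buys noticeably less.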