Benchmark Program Rewritten to Favor Intel?

Posted by ryuzaki0 on Saturday August 24, 2002 @02:02AM from the if-all-else-fails-change-the-requirements dept.

BrookHarty writes "Interesting article over at Van's Hardware reporting that BAPCo, the maker of the SysMark benchmarking program, has rewritten its SysMark 2002 benchmark in favor of Intel's P4. AMD joined BAPCo in order to "correct" these "broken" results. AMD reports that BAPCo's SysMark 2002 (written by Intel engineers) is a collection of tasks meant to summarize "Real World" performance. Interestingly, these tasks were selected to favor Intel's performance, while certain tasks that favor AMD were removed. Van's Hardware has additional information on BAPCo's shady history."

12 of 228 comments

Re:Big deal by GigsVT · 2002-08-24 02:17 · Score: 3, Informative

I don't think there is much motivation on the part of compiler writers to optimize for this particular implementation of the x86-32 ISA. This isn't like previous chips, where new cache handling opcodes were added, which compilers could use if available. I've talked to people much better versed in compiler writing than myself, and they all seem to agree, when it comes to "optimizing for P4", their answer is going to be "don't hold your breath".

--
I've had enough abrasive sigs. Kittens are cute and fuzzy.
You probably have a P4, right? by Anonymous Coward · 2002-08-24 02:24 · Score: 1, Informative

They're not just keeping up with Intel, they are significantly faster in most real-world applications. Over 90% of systems dedicated to rendering and scientific computing sold in the last 2 years were Athlons (they're actually replacing SGI and Sun in these markets because they are so much cheaper and actually offer better floating-point performance).

All the P4 is good at is moving memory around. And for this it needs RDRAM (i.e., very fast memory). Replace that with DDR and the P4 turns into a very expensive snail.

Even with RDRAM, the P4 is still slower than the Athlons on most real-world tasks. IMO, the only valid reason to buy a P4 is Quake.

I have several Athlon XPs and P4s and the P4s just drag themselves compared to the XPs. In fact, even the PIIIs feel faster than the P4s. And I'm not even talking about the Athlon MPs (that simply wipe the floor with the P4).
Re:Big deal by njdj · 2002-08-24 02:27 · Score: 2, Informative

it's plausible that the Athlon's performance is going to lag quite a bit more than it already does.

It doesn't. The fastest X86 processor is currently made by AMD. See a non-BAPCo performance comparison.

AMD has always been the value chip company. You can't expect them to keep up with Intel forever

Are you an Intel employee? Intel isn't keeping up with AMD. The P4 is underperforming, as well as overpriced. Take a look at the web page referenced in the lead story.
Re:Big deal by l33t-gu3lph1t3 · 2002-08-24 02:27 · Score: 2, Informative

For a year and a half AMD managed to blow Intel out of the water in the performance arena, from the time they both reached 1.0GHz with the P3 and Athlon, to the time Intel started releasing Northwood core P4s with the extra 256KB L2 cache. The longer pipeline of the P4 has nothing to do with performance. All it does is enable the processor to ramp in clock frequencies easily. A general rule of thumb is the longer the pipeline, the lower the IPC (instructions per clock/cycle) and the LOWER the actual number-crunching performance.

--
------- "From bored to fanboy in 3.8 asian girls" ----------
Not another benchmark... by mustprotectdata · 2002-08-24 02:40 · Score: 3, Informative

Coming from the Unix world, I'm used to comparing machines based on their SPECint and SPECfp performance...

In general the SPEC people have done a better job being platform agnostic than some of the "miscellaneous" PC benchmarks.

Current benchmarks for Intel http://www.spec.org/osg/cpu2000/results/res2002q2/cpu2000-20020506-01357.html

and AMD http://www.spec.org/osg/cpu2000/results/res2002q3/cpu2000-20020701-01441.html

Keep in mind that results for more recent AMD CPUs are not shown. If you compare the AMD 2200 with a 2.2G P4 you'll have 734 v's 784, which gives some credence to AMD's claimed rating.

html4me!
1. Re:Not another benchmark... by mustprotectdata · 2002-08-24 02:52 · Score: 3, Informative
  
  That's actually AMD = 764 vs. Intel = 784, so it's even closer than stated above, i.e. within 3%.
  
  Like anyone would be able to tell :-).
  
  And my poor little Sunblade 100 is only 174. No wonder Solaris seems slower than linux.
  
  html 3
Not to defend Intel but.... by DeadBugs · 2002-08-24 03:26 · Score: 3, Informative

HardOCP notes that Van's got their info from AMD, so it may be a bit biased. A quote from HardOCP:

" AMD has verified to me this morning that all of the graphed and tabled data shown on the VansHardware report is data that has been mined by AMD"

"AMD is not going to supply VansHardware with information that makes Intel look good. VansHardware represents to me, nothing more than an AMD fansite that takes shots at Intel every chance they get. I think they are far from what anyone could consider objective journalist and reporters."

--
http://www.kubuntu.org/
Re:Big deal by Sivar · 2002-08-24 05:24 · Score: 5, Informative

Obviously you flunked your freshman-level computer architecture course. The P4 8K L1's 2-cycle load-use latency is 50% better than Athlon 128k L1's 3-cycle load-use latency (not even accounting for P4's clock speed advantage).
Obviously you are imagining things, as I never said that was not the case. Latency is important, but it doesn't matter if the cache size isn't large enough to fit enough code in to enjoy the low latency.
The difference in hit rate between 8k and 128k is only about 5% meaning that it is substantially faster to go with the small/fast cache than the big/slow cache.
Really? That's interesting, and here's me wondering why both AMD and, other than in the P4, Intel have wasted so much money adding more cache memory.

Since you seem to be such an expert, why don't you go ahead and list a few common programs for me that have a working set of less than 8K--the size that will fit into the tiny L1 cache. Can't find any? Gee, I guess that makes the size of the cache pretty important then. When a program's working set has to be swapped in and out between L1 and L2 cache, suddenly that latency doesn't much matter. Of course, you may feel free to prove to me that the P4 can run addition loops faster. Those will fit into about 8k.

Do the math - even an infinitely large 3-cycle load-use cache is slower than an 8k 2-cycle load-use cache.
Who was it again that flunked their freshman computer architecture course? You're saying that if the Athlon had 512MB of L1 cache, the system would be slower than the P4 and its 8K of lower latency cache?
What math is it that I should do? Do you know what the working set of a program is?
Having a tiny amount of cache is analogous to having a tiny amount of RAM. Put 32MB of low-latency RAM in your system. Overclock some DDR SDRAM to 200MHz (AKA "400MHz" by people that don't understand clock speeds) and set it to CAS2. Tell me how your system performs. Just as your system will have to swap just about all running code to disk, the Pentium IV will not be able to contain the core loops of the various running programs in L1 cache. The vast majority will have to be dropped to L2, which is significantly slower and higher latency, kinda defeating the purpose of that 8k of fast memory, no?
Working sets that cannot be fit into the P4's 256k or 512k of L2 will then be relegated to main memory and moved to L2 then L1 when they are needed, and anything that won't fit in main memory (which very rarely includes the working set of a program) will be swapped to disk if the platform supports virtual memory.

In closing, your comment was surprisingly brash and conceited, not to mention rude and totally inaccurate. Thank you.

--
Computer Science is no more about computers than astronomy is about telescopes. --E. W. Dijkstra
Re:Big deal by VAXman · 2002-08-24 06:09 · Score: 3, Informative

According to Hennessy & Patterson, 2nd Edition, page 391, the total miss rate (for SPEC92) of an 8k 4-way set associative cache (like the P4's) is 2.9%. The miss rate of a 128k 4-way set associative cache (like Athlon's) is 0.6%.

The hit time for P4 is 2 cycles, and for Athlon it's 3 cycles. The L2 hit / L1 miss is ~10 cycles for both. Everything further out is approximately the same so we can ignore it for simplicity.

So, the average memory access time for P4 is (0.971 * 2) + (0.029 * 10) = about 2.2 cycles. The average memory access time for Athlon is (0.994 * 3) + (0.006 * 10) = a little over 3 cycles.

Suppose Athlon had an infinite size L1 cache (or 512 MB if you like to use numbers). The highest hit rate it could ever achieve is 100% (actually slightly less, since you cannot eliminate compulsory misses). The average memory access time would then be 3 cycles - which is higher than P4's 2.2 cycles!
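
The arithmetic above can be checked with a minimal sketch. It uses the same weighted-average formula and the same figures quoted in this comment (miss rates of 2.9% and 0.6%, hit times of 2 and 3 cycles, roughly 10 cycles for an L1 miss that hits in L2). The function name and variable names are made up for illustration, and everything beyond L2 is ignored, as the comment does.

```python
def average_access_cycles(l1_hit_rate, l1_hit_cycles, l1_miss_cycles):
    """Weighted average load latency in cycles, as computed in the comment above."""
    return l1_hit_rate * l1_hit_cycles + (1.0 - l1_hit_rate) * l1_miss_cycles


# Figures quoted in the comment; ~10 cycles for an L1 miss / L2 hit is its approximation.
p4 = average_access_cycles(1.0 - 0.029, 2, 10)
athlon = average_access_cycles(1.0 - 0.006, 3, 10)
# Hypothetical 3-cycle L1 that never misses (the "infinite cache" case).
ideal_3_cycle = average_access_cycles(1.0, 3, 10)

print(f"P4, 8k 2-cycle L1:        {p4:.3f} cycles")
print(f"Athlon, 128k 3-cycle L1:  {athlon:.3f} cycles")
print(f"Perfect 3-cycle L1:       {ideal_3_cycle:.3f} cycles")
```

Under these assumptions the P4 figure comes out near 2.2 cycles and both 3-cycle cases at 3 or just above. The result is in cycles only, and it depends entirely on the SPEC92 miss rates quoted; a workload with a much higher 8k miss rate would shift the comparison.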

BTW, Paul DeMone wrote a pretty good article about P4's L1 cache.
Re:Not Quite So by Hoser+McMoose · 2002-08-24 06:26 · Score: 2, Informative

"impeccable"?! I dunno about that!

SPEC CFP2000 is by far the most widely supported benchmark, but it has its share of flaws; just look at Sun's latest SPEC CFP2000 scores and you might notice one of them. For the Sun Blade 1900 (900MHz USIII), Sun had a rather abysmal CFP score; with the Sun Blade 1900 Cu (900MHz USIII), they were suddenly quite competitive. Hugely improved compiler, right?! Wrong. Actually they only improved by 0-15% on most benchmarks (about what could be expected with a slightly better compiler, i.e. what you could expect in the real world), yet their CFP score improved by over 50%. Why is that? Well, ONE single sub-bench of CFP (179.art) increased by a remarkable 560%! The end result is that the USIII looks like a rather fast chip at floating point, when it's actually butt-slow (WAY slower than either the Athlon or the P4) at everything except for one particular application. This isn't to say that Sun's score is invalid, simply that it doesn't take much to really skew the scores of ANY benchmark, including SPEC CFP.
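
To illustrate how one sub-benchmark can move a composite score, here is a small sketch. SPEC CPU2000 composites are geometric means of per-benchmark ratios, and CFP2000 has 14 components. The individual speedups below are invented to match the rough shape described above (13 tests gaining 0-15%, one gaining 560%); they are not Sun's actual results.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of a list of positive ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical new/old speedups for 13 sub-benchmarks, each between 0% and 15%.
typical = [1.00, 1.02, 1.03, 1.05, 1.06, 1.07, 1.08,
           1.09, 1.10, 1.11, 1.12, 1.14, 1.15]
outlier = 6.60  # a 560% increase is a 6.6x ratio

without_outlier = geometric_mean(typical)
with_outlier = geometric_mean(typical + [outlier])

print(f"composite gain, 13 typical tests only: {(without_outlier - 1) * 100:.1f}%")
print(f"composite gain, outlier included:      {(with_outlier - 1) * 100:.1f}%")
```

With these made-up inputs the composite gain ends up larger than the gain on any of the 13 typical tests, which is the skew being described. The exact size of the jump depends on the real per-benchmark numbers.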

I've also criticized SPEC in the past for being a bit too focused on large datasets. For example, the 171.swim benchmark is essentially a memory bandwidth test taken from some shallow water modeling code. Now, that isn't to say that memory bandwidth isn't important for floating point and scientific applications (if it weren't, IBM wouldn't have thrown a whole boatload of bandwidth at their Power4), simply that SPEC seems to have over-emphasized this aspect. I think the reason for this is that SPEC CPU95 was just the opposite, and was often criticized for emphasizing SMALL datasets too much, therefore becoming almost completely a test of the cache architecture of the processor in many cases. For CPU2000, it seems to me like they over-corrected.

Long story short, if your floating point work involves small datasets, the Athlon's MUCH larger L1 cache and 3 floating point execution units will crunch the code much faster. On large datasets, the P4's higher bus and memory bandwidth and larger L2 cache will crunch the code faster.

You're quite correct that saying the P4 has a weak FPU is definitely wrong. The chip actually is VERY good at floating point calculations, better than just about anything out there. Only the Power4 and the Itanium2 are decidedly faster FP number crunchers than the P4.
Re:Big deal by Hoser+McMoose · 2002-08-24 06:41 · Score: 2, Informative

Actually AMD had the performance lead for longer than that. They took the performance crown away from Intel the day that the Athlon was first released, since it came out at 650MHz when the fastest PIII was only 600MHz, and the Athlon was, at that time, just slightly faster clock for clock than the PIII. AMD kept the clock speed lead and increased their clock-for-clock performance over the PIII for the next few years. Intel only just started to catch up with the P4 2.0GHz (which was released just a little bit before the AthlonXP, i.e. when the fastest Athlon was only at 1.4GHz).

Basically for the last 3 years (since the release of the Athlon), AMD has had the fastest x86 chips for about 2 years. Intel has had the fastest x86 chips for about 6 months, and for the remaining 6 months it's been too close to tell which was faster.

As for the P4's long pipeline, I'd say that it WAS largely responsible for increasing the performance because it allowed Intel to clock the chips so damn high. They clocked the P4 up to 2.0GHz easily on a 180nm fab process. Compare this to the PIII, which they struggled to get up to 1.13GHz on the exact same 180nm process (and that took them until 1 year after their first attempt failed miserably and had to be recalled completely). AMD did slightly better with their Athlon design, but it still was only able to clock up to 1.73GHz on a 180nm process, and they had a more advanced process than Intel did in some ways (i.e. they were using copper interconnects).

Long story short, performance is determined (in an overly simplified way) by IPC * clock speed. With the P4, Intel looked to sacrifice IPC slightly to dramatically increase the clock speed with the goal of overall faster performance. When compared to the PIII at least, they definitely succeeded. Compare the fastest 180nm process PIII (1.13GHz) to the fastest 180nm process P4 (2.0GHz) and which do you think is faster?
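
The IPC * clock speed trade-off can be put into numbers with a rough sketch. The clock speeds are the ones given above; the IPC values are hypothetical, expressed relative to a PIII IPC of 1.0, since no measured IPC figures appear in this thread.

```python
def relative_performance(ipc, clock_ghz):
    """Overly simplified performance model from the comment: IPC * clock speed."""
    return ipc * clock_ghz


piii_clock_ghz = 1.13  # fastest 180nm PIII, per the comment
p4_clock_ghz = 2.0     # fastest 180nm P4, per the comment

# Fraction of the PIII's IPC at which the P4 exactly ties it under this model.
break_even_ipc = piii_clock_ghz / p4_clock_ghz

piii_score = relative_performance(1.0, piii_clock_ghz)
for p4_ipc in (0.5, 0.7, 0.9):  # hypothetical P4 IPC relative to the PIII
    p4_score = relative_performance(p4_ipc, p4_clock_ghz)
    verdict = "faster" if p4_score > piii_score else "slower"
    print(f"P4 at {p4_ipc:.0%} of PIII IPC: {p4_score / piii_score:.2f}x, {verdict}")

print(f"break-even: P4 needs more than {break_even_ipc:.1%} of the PIII's IPC")
```

Under this model the 2.0GHz part comes out ahead as long as its IPC stays above roughly 56% of the PIII's, which shows how much IPC could be given up before the higher clock stops paying off.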
corrections by RelliK · 2002-08-24 15:52 · Score: 3, Informative

The Pentium IV's really looong pipeline does allow the P4 to run at higher clockspeeds, but the branch misprediction you mentioned is instant death.... a single branch misprediction requires up to 20 full clock cycles of work to be discarded.
The situation is not quite as dire due to P4's trace cache (you actually addressed that later in your post). Nevertheless, your point stands.
On Intel SMP setups, even on P4 Xeons (which, IMO, are inferior to P3 Tualatin chips by the same company), when one CPU accesses main memory, it locks main memory for the other CPUs. All other CPUs have to sit and twiddle their transistors while the main memory is in use by only one CPU. On AMD SMP setups, ALL processors can access memory simultaneously, merely sharing the bandwidth. So, if one CPU is only using 100MB of memory bandwidth, the rest can be used by other CPUs at that time.
P4 Xeons (as well as P3s) have a shared memory bus. That is, multiple CPUs share the bandwidth of the 400MHz or 533MHz bus when accessing memory. However, Athlon has a point-to-point channel for each CPU. That is, each Athlon CPU has the full bandwidth of the 266MHz (soon to be 333MHz) memory bus, regardless of how many CPUs there are in the system. This means that beyond 2-way SMP systems, Athlon has a significant advantage in memory bandwidth over P4.
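
A back-of-the-envelope sketch of the shared versus point-to-point argument follows. It uses the effective bus rates named above as a stand-in for bandwidth, and it assumes equal bus widths and that the memory controller behind the buses is never the bottleneck. Both are simplifying assumptions, and the function name is made up for illustration.

```python
def per_cpu_bus_share(bus_mhz, cpus, shared):
    """Effective bus rate available to each CPU when all CPUs are busy.

    shared=True models one front-side bus split evenly among the CPUs;
    shared=False models a dedicated point-to-point link per CPU.
    """
    return bus_mhz / cpus if shared else float(bus_mhz)


for cpus in (1, 2, 4):
    xeon_400 = per_cpu_bus_share(400, cpus, shared=True)
    xeon_533 = per_cpu_bus_share(533, cpus, shared=True)
    athlon_266 = per_cpu_bus_share(266, cpus, shared=False)
    print(f"{cpus}-way: shared 400MHz -> {xeon_400:.1f}, "
          f"shared 533MHz -> {xeon_533:.1f}, "
          f"point-to-point 266MHz -> {athlon_266:.1f} (MHz-equivalent per CPU)")
```

In this simplified model the per-CPU share of a shared bus falls as CPUs are added while the point-to-point figure stays flat, which is the effect described above.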

--
___
If you think big enough, you'll never have to do it.