Benchmark Program Rewritten to Favor Intel?

Posted by ryuzaki0 on Saturday August 24, 2002 @02:02AM from the if-all-else-fails-change-the-requirements dept.

BrookHarty writes "Interesting article over at Van's Hardware: BAPCo, the maker of the SysMark benchmarking program, has rewritten its SysMark 2002 benchmark in favor of Intel's P4. AMD joined BAPCo in order to "correct" these "broken" results. AMD reports that BAPCo's SysMark 2002 (written by Intel engineers) is a collection of tasks meant to summarize "real world" performance. Interestingly, the tasks were selected to favor Intel's performance, while certain tasks that favor AMD were removed. Van's Hardware has additional information on BAPCo's shady history."

Re:Big deal by GigsVT · 2002-08-24 02:17 · Score: 3, Informative

I don't think there is much motivation on the part of compiler writers to optimize for this particular implementation of the x86-32 ISA. This isn't like previous chips, where new cache handling opcodes were added, which compilers could use if available. I've talked to people much better versed in compiler writing than myself, and they all seem to agree, when it comes to "optimizing for P4", their answer is going to be "don't hold your breath".

--
I've had enough abrasive sigs. Kittens are cute and fuzzy.
Not another benchmark... by mustprotectdata · 2002-08-24 02:40 · Score: 3, Informative

Coming from the Unix world, I'm used to comparing machines based on their SPECint and SPECfp performance...

In general the SPEC people have done a better job being platform agnostic than some of the "miscellaneous" PC benchmarks.

Current benchmarks for Intel: http://www.spec.org/osg/cpu2000/results/res2002q2/cpu2000-20020506-01357.html

and AMD: http://www.spec.org/osg/cpu2000/results/res2002q3/cpu2000-20020701-01441.html

Keep in mind that results for more recent AMD CPUs are not shown. If you compare the AMD 2200 with a 2.2GHz P4 you'll have 734 vs. 784, which gives some credence to AMD's claimed rating.

html4me!
1. Re:Not another benchmark... by mustprotectdata · 2002-08-24 02:52 · Score: 3, Informative
  
  That's actually AMD = 764 vs. Intel = 784, so it's even closer than stated above, i.e. within 3%.
  
  Like anyone would be able to tell :-).
  
  And my poor little Sun Blade 100 is only 174. No wonder Solaris seems slower than Linux.
  
  html 3
Not to defend Intel but.... by DeadBugs · 2002-08-24 03:26 · Score: 3, Informative

HardOCP notes that Van's got their info from AMD, so it may be a bit biased. A quote from HardOCP:

"AMD has verified to me this morning that all of the graphed and tabled data shown on the VansHardware report is data that has been mined by AMD."

"AMD is not going to supply VansHardware with information that makes Intel look good. VansHardware represents to me, nothing more than an AMD fansite that takes shots at Intel every chance they get. I think they are far from what anyone could consider objective journalist and reporters."

--
http://www.kubuntu.org/
Re:Big deal by Sivar · 2002-08-24 05:24 · Score: 5, Informative

"Obviously you flunked your freshman-level computer architecture course. The P4 8K L1's 2-cycle load-use latency is 50% better than Athlon 128k L1's 3-cycle load-use latency (not even accounting for P4's clock speed advantage)."

Obviously you are imagining things, as I never said that was not the case. Latency is important, but it doesn't matter if the cache size isn't large enough to fit enough code in to enjoy the low latency.

"The difference in hit rate between 8k and 128k is only about 5%, meaning that it is substantially faster to go with the small/fast cache than the big/slow cache."

Really? That's interesting, and here's me wondering why both AMD and, other than in the P4, Intel have wasted so much money adding more cache memory.

Since you seem to be such an expert, why don't you go ahead and list a few common programs for me that have a working set of less than 8K, the size that will fit into the tiny L1 cache? Can't find any? Gee, I guess that makes the size of the cache pretty important then. When a program's working set has to be swapped in and out between L1 and L2 cache, suddenly that latency doesn't much matter. Of course, you may feel free to prove to me that the P4 can run addition loops faster. Those will fit into about 8k.

"Do the math - even an infinitely large 3-cycle load-use cache is slower than an 8k 2-cycle load-use cache."

Who was it again who flunked their freshman computer architecture course? You're saying that if the Athlon had 512MB of L1 cache, the system would be slower than the P4 with its 8K of lower-latency cache?
What math is it that I should do? Do you know what the working set of a program is?

Having a tiny amount of cache is analogous to having a tiny amount of RAM. Put 32MB of low-latency RAM in your system. Overclock some DDR SDRAM to 200MHz (AKA "400MHz" by people that don't understand clock speeds) and set it to CAS2. Tell me how your system performs. Just as your system will have to swap just about all running code to disk, the Pentium IV will not be able to contain the core loops of the various running programs in L1 cache. The vast majority will have to be dropped to L2, which is significantly slower and higher latency, kinda defeating the purpose of that 8k of fast memory, no?

Working sets that cannot fit into the P4's 256k or 512k of L2 will then be relegated to main memory and moved to L2, then L1, when they are used, and anything that won't fit in main memory (which very rarely includes the working set of a program) will be swapped to disk if the platform supports virtual memory.
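
To make the working-set argument concrete, here is a minimal sketch in Python, not a model of either real chip. It assumes a fully associative LRU cache, 64-byte lines, and a program that sweeps its working set in a loop; all three are simplifying assumptions, and `lru_hit_rate` is a hypothetical helper name. A cyclic sweep is the worst case for LRU, so real programs with more locality will see far gentler numbers.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed cache line size
KB = 1024


def lru_hit_rate(cache_bytes, working_set_bytes, passes=10):
    """Hit rate of a fully associative LRU cache under a repeated linear sweep."""
    capacity = cache_bytes // LINE_BYTES      # lines the cache can hold
    lines = working_set_bytes // LINE_BYTES   # lines the program touches per pass
    cache = OrderedDict()                     # oldest entry first, newest last
    hits = accesses = 0
    for _ in range(passes):
        for line in range(lines):
            accesses += 1
            if line in cache:
                hits += 1
                cache.move_to_end(line)       # mark as most recently used
            else:
                if len(cache) >= capacity:
                    cache.popitem(last=False)  # evict least recently used line
                cache[line] = True
    return hits / accesses


for ws in (4 * KB, 32 * KB, 256 * KB):
    small = lru_hit_rate(8 * KB, ws)
    large = lru_hit_rate(128 * KB, ws)
    print(f"working set {ws // KB:>3} KB: 8 KB cache {small:.0%}, 128 KB cache {large:.0%}")
```

Under these assumptions a 32 KB working set gets no hits at all from the 8 KB cache but a 90% hit rate over ten passes from the 128 KB one; once the working set outgrows both caches, the extra size stops helping either.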

In closing, your comment was surprisingly brash and conceited, not to mention rude and totally inaccurate. Thank you.

--
Computer Science is no more about computers than astronomy is about telescopes. --E. W. Dijkstra
Re:Big deal by VAXman · 2002-08-24 06:09 · Score: 3, Informative

According to Hennessy & Patterson, 2nd Edition, page 391, the total miss rate (for SPEC92) of an 8k 4-way set associative cache (like the P4's) is 2.9%. The miss rate of a 128k 4-way set associative cache (like Athlon's) is 0.6%.

The hit time for P4 is 2 cycles, and for Athlon it's 3 cycles. The L2 hit / L1 miss is ~10 cycles for both. Everything further out is approximately the same so we can ignore it for simplicity.

So, the average memory access time for P4 is (0.971 * 2) + (0.029 * 10) = 2.2 Cycles. The average memory access time for Athlon is (0.994 * 3) + (0.006 * 10) = a little over 3 cycles.

Suppose Athlon had an infinite size L1 cache (or 512 MB if you like to use numbers). The highest hit rate it could ever achieve is 100% (actually slightly less, since you cannot eliminate compulsory misses). The average memory access time would then be 3 cycles - which is higher than P4's 2.2 cycles!
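
The arithmetic above can be reproduced with a short Python sketch. It uses only the figures quoted in this comment (miss rates of 2.9% and 0.6%, hit times of 2 and 3 cycles, roughly 10 cycles for an L1 miss that hits L2); the function name `avg_access_cycles` is made up for illustration, and the model counts cycles only, ignoring clock speed and anything beyond L2.

```python
def avg_access_cycles(l1_miss_rate, l1_hit_cycles, l1_miss_cycles=10):
    """Weighted average: hits cost l1_hit_cycles, misses cost l1_miss_cycles."""
    return (1 - l1_miss_rate) * l1_hit_cycles + l1_miss_rate * l1_miss_cycles


p4 = avg_access_cycles(0.029, 2)      # 8k L1, 2-cycle hit
athlon = avg_access_cycles(0.006, 3)  # 128k L1, 3-cycle hit
ideal = avg_access_cycles(0.0, 3)     # hypothetical 3-cycle L1 that never misses

print(f"P4: {p4:.3f}  Athlon: {athlon:.3f}  never-missing 3-cycle L1: {ideal:.3f}")

# Miss rate at which the 2-cycle cache stops beating a perfect 3-cycle cache:
# 2 + m * (10 - 2) = 3  =>  m = (3 - 2) / (10 - 2)
break_even = (3 - 2) / (10 - 2)
print(f"break-even miss rate for the 2-cycle L1: {break_even:.1%}")
```

With these inputs the averages come out to about 2.23 and 3.04 cycles, and the 2-cycle cache stays ahead of even a perfect 3-cycle cache until its miss rate passes 12.5%. The conclusion therefore depends on whether a workload's miss rate on the 8k cache really stays near the SPEC92 figure.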

BTW, Paul DeMone wrote a pretty good article about P4's L1 cache.
corrections by RelliK · 2002-08-24 15:52 · Score: 3, Informative

"The Pentium IV's really looong pipeline does allow the P4 to run at higher clock speeds, but the branch misprediction you mentioned is instant death... a single branch misprediction requires up to 20 full clock cycles of work to be discarded."

The situation is not quite as dire, thanks to the P4's trace cache (you actually addressed that later in your post). Nevertheless, your point stands.

"On Intel SMP setups, even on P4 Xeons (which, IMO, are inferior to P3 Tualatin chips by the same company), when one CPU accesses main memory, it locks main memory for the other CPUs. All other CPUs have to sit and twiddle their transistors while main memory is in use by only one CPU. On AMD SMP setups, ALL processors can access memory simultaneously, merely sharing the bandwidth. So, if one CPU is only using 100MB of memory bandwidth, the rest can be used by other CPUs at that time."

P4 Xeons (as well as P3s) have a shared memory bus. That is, multiple CPUs share the bandwidth of the 400MHz or 533MHz bus when accessing memory. However, Athlon has a point-to-point channel for each CPU. That is, each Athlon CPU has the full bandwidth of the 266MHz (soon to be 333MHz) memory bus, regardless of how many CPUs there are in the system. This means that beyond 2-way SMP systems, Athlon has a significant advantage in memory bandwidth over P4.
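
A rough Python sketch of the shared-bus versus point-to-point comparison, taking the description above at face value. It assumes both buses are the same width, so effective MHz can stand in for bandwidth, and it ignores any limit imposed by the memory behind the chipset; `per_cpu_bandwidth` is a hypothetical helper, not a measurement.

```python
def per_cpu_bandwidth(bus_mhz, cpus, shared):
    """Relative bandwidth available to each CPU, in effective bus MHz."""
    if shared:
        return bus_mhz / cpus   # all CPUs split one bus
    return float(bus_mhz)       # each CPU has its own point-to-point link


for cpus in (1, 2, 4):
    xeon_400 = per_cpu_bandwidth(400, cpus, shared=True)
    xeon_533 = per_cpu_bandwidth(533, cpus, shared=True)
    athlon = per_cpu_bandwidth(266, cpus, shared=False)
    print(f"{cpus} CPU(s): Xeon/400 {xeon_400:.1f}, Xeon/533 {xeon_533:.1f}, Athlon/266 {athlon:.1f}")
```

Under this model a 533MHz bus split two ways gives each Xeon 266.5, essentially level with the Athlon's 266, while at four CPUs the shared bus drops to about 133 per CPU and the point-to-point figure stays at 266.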

--
___
If you think big enough, you'll never have to do it.