As the author of the linked page, I would like to add one simple thought: if a multiply takes 8 cycles to execute, you should place up to 7 other independent instructions between issuing the multiply and using its result in a dependent instruction. Given the small number of registers available (only 8), it is quite difficult to interleave more than 2 or 3 independent threads of computation at the same time, so there will rarely be enough instructions to execute while the multiply is in flight.
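To put rough numbers on this latency argument, here is a minimal sketch of a toy scheduling model. It is not a model of the real P4 pipeline: it assumes one instruction issued per cycle, a fully pipelined multiplier with a fixed 8-cycle latency, and that each multiply depends only on the previous multiply of the same chain. The function name `cycles_to_finish` and its parameters are made up for illustration.

```python
def cycles_to_finish(chains, muls_per_chain, latency=8):
    """Toy model (assumption, not the real P4): one issue per cycle,
    every multiply waits for the previous multiply of its own chain."""
    ready_at = [0] * chains                 # cycle when each chain's last result is available
    remaining = [muls_per_chain] * chains   # multiplies still to issue per chain
    cycle = 0
    while any(remaining):
        for c in range(chains):
            if remaining[c] and ready_at[c] <= cycle:
                ready_at[c] = cycle + latency   # issue now, result ready `latency` cycles later
                remaining[c] -= 1
                break                           # at most one issue per cycle
        cycle += 1
    return max(ready_at)                    # cycle when the last result is available


if __name__ == "__main__":
    for chains in (1, 2, 3, 8):
        total = cycles_to_finish(chains, muls_per_chain=4)
        print(f"{chains} chain(s): {chains * 4} multiplies in {total} cycles")
```

Under these assumptions, one chain of 4 dependent multiplies takes 32 cycles (8 cycles per multiply), while 3 interleaved chains finish 12 multiplies in 34 cycles. Only with 8 independent chains is the multiplier kept busy every cycle, and with 8 registers that many chains are hard to keep live at once.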
Another weak point of the P4 core is that it has only one MMX execution unit, instead of the two in the P3 (even though those two have some usage limitations). I have been simulating some common code sequences on both the P3 and the P4, and the P4 is always at least 50% slower (i.e. it needs at least 50% more cycles to execute the same code).
But I am speaking about carefully optimized code that does not stall on cache misses (that is, code that uses optimal prefetching and streaming stores). When running not-so-optimized code that stalls often, instruction latency becomes a minor problem and the P4 might shine, thanks to its hardware prefetcher and huge memory bandwidth.
Best regards
Stefano Tommesani