Explaining Disappointing XScale Performance In Pocket PCs
JYD writes: "I found this new article on a Pocket PC web site where Microsoft talks about why XScale Pocket PCs aren't as fast as people thought they would be. Is it the OS? The CPU not supporting ARM4 properly? I wonder if the Linux port would run faster on 400 MHz ... or did Intel screw up the CPU?"
It's important to differentiate between architecture optimizations
and CPU-specific optimizations. The ARMv5 instruction set is a
relatively minor architectural tweak to the ARMv4 instruction set.
The names give you the impression that there's some grand change between
v4 and v5; if a technical guy had done the naming, they would be ARMv4 and
ARMv4.01. ARM is playing some games with architecture naming
to protect its business position with patents in a silly way.
ARMv5 adds a couple of new instructions over v4: an instruction to count
leading zeros in a register (which a compiler would likely never
use) and a better method of switching between the ARM instruction
set and the 16-bit Thumb instruction set. The latter isn't
relevant for PocketPC since Thumb mode isn't supported. I think
v5 might have a new debugging hook as well.
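For anyone wondering what the count-leading-zeros instruction buys you, here is a rough C sketch of the loop it replaces. The names clz32 and ilog2 are made up for illustration; nothing here comes from ARM or Microsoft sources.

```c
#include <assert.h>
#include <stdint.h>

/* Portable count-leading-zeros for a 32-bit value.
 * Returns 32 for an input of 0, which is also what the
 * ARMv5 CLZ instruction produces for a zero register. */
unsigned clz32(uint32_t x)
{
    unsigned n = 0;

    if (x == 0)
        return 32;
    /* Shift left until the top bit is set, counting the shifts. */
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Hypothetical use: integer log2, the sort of thing hand-written
 * fixed-point or codec code wants. Undefined for 0, so we assert. */
unsigned ilog2(uint32_t x)
{
    assert(x != 0);
    return 31u - clz32(x);
}
```

On v4 the loop can take up to 31 iterations; on v5 the body of clz32 collapses to a single instruction, but only if the compiler or a hand-written intrinsic recognizes it.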
The new XScale parts are ARMv5TE. The "T" is for the 16-bit Thumb
instruction set, which no one seems to care about. The "E" adds
some DSP-oriented instructions that are pretty interesting for
media codecs and such. They are the MMX equivalent for the ARM
world. They likely won't improve performance of the general-purpose
aspects of the platform.
I think it's a red herring to chase Microsoft for not optimizing for
ARMv5. The changes are really small and I don't see any
performance impact that would justify it, certainly not if you also
have to maintain a separate version for all of the StrongARM-based
products.
Now, as for CPU-specific optimizations for the PXA250 (XScale)
implementation of the ARM architecture: IMHO Intel chased
MHz and left behind a lot of good sense about system performance.
The high-order bit is bus performance, as others have already
pointed out.
In addition to the bus performance, Intel made many tradeoffs
to optimize for clock speed. The 7-stage pipe has a 4-clock penalty
for a mispredicted branch. Compare this to the circuit
design heroics in the StrongARM, which implements "all branches
are 2 cycles". The XScale approach is much more complicated and
probably doesn't perform any better, but you get a high clock speed.
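To put rough numbers on the branch tradeoff, here is a back-of-the-envelope model in C. It assumes a correctly predicted branch on the 7-stage pipe costs 1 cycle; that base cost is my assumption, not a figure given above, and the function name and misprediction rates are hypothetical.

```c
#include <assert.h>

/* Average cycles per branch on a pipeline with branch prediction.
 * base            - cycles for a correctly predicted branch (assumed)
 * penalty         - extra cycles paid on a misprediction
 * mispredict_rate - fraction of branches mispredicted, 0.0 to 1.0 */
double avg_branch_cycles(double base, double penalty, double mispredict_rate)
{
    assert(mispredict_rate >= 0.0 && mispredict_rate <= 1.0);
    return base + penalty * mispredict_rate;
}
```

Under that assumption, avg_branch_cycles(1.0, 4.0, 0.25) comes out to 2.0, the same as a flat 2-cycle branch. So in this simple model the predictor only wins on cycle count if it mispredicts fewer than one branch in four.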
Intel adds clock cycles to all load/store-multiple instructions
in XScale. This is a pretty big deal on ARM since they are
used in the entry and exit of most C functions, in memcpy(),
and any time you are moving chunks bigger than a register.
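As a small illustration of where load/store-multiple shows up, consider a word-copy loop like the C sketch below. The function name is hypothetical, and the instructions mentioned in the comments are what an ARM compiler typically emits; the exact register lists depend on the compiler and its settings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy `words` 32-bit words from src to dst, four at a time.
 *
 * On entry an ARM compiler typically saves callee-saved registers
 * with one store-multiple (something like stmfd sp!, {r4-r6, lr})
 * and restores them on exit with one load-multiple. */
void copy_words(uint32_t *dst, const uint32_t *src, size_t words)
{
    assert(dst != NULL && src != NULL);

    while (words >= 4) {
        /* Four loads followed by four stores: a compiler or a
         * hand-written memcpy() can turn this into one ldmia and
         * one stmia. */
        uint32_t a = src[0], b = src[1], c = src[2], d = src[3];
        dst[0] = a;
        dst[1] = b;
        dst[2] = c;
        dst[3] = d;
        src += 4;
        dst += 4;
        words -= 4;
    }
    /* Copy the remaining 0 to 3 words one at a time. */
    while (words > 0) {
        *dst++ = *src++;
        words--;
    }
}
```

Every call to a function like this pays for load/store-multiple at least twice (entry and exit) plus twice per loop iteration, which is why extra cycles on those instructions add up.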
The load-use penalty is bigger in XScale. This is a pretty big
deal on ARM. The ARM instruction set is pretty compact. It is a
RISC architecture, but the ability to combine shift operations
with ALU operations makes it possible for a good compiler
to generate reasonably compact code. As a result, there are fewer
independent instructions around, and it's harder for a compiler to
put instructions between a load and the instructions that use the
destination of the load. This is another trade-off
in XScale that allows a higher clock speed but hurts performance
otherwise.
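Here is a tiny C sketch of the kind of code where this bites. The function name is hypothetical, and the instruction sequence in the comments is what I would expect from a typical ARM compiler, not output I have checked against a specific one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Look up table[i], double it, and add it to acc.
 *
 * On ARM this can compile to roughly two instructions:
 *   ldr r3, [r0, r1, lsl #2]   ; load with the index scaled in the address
 *   add r0, r2, r3, lsl #1     ; add with the shift folded into the operand
 * The add needs r3 immediately, and the shifts have been folded away,
 * so there is nothing left to schedule between the load and its use. */
uint32_t scaled_sum(const uint32_t *table, uint32_t i, uint32_t acc)
{
    assert(table != NULL);
    uint32_t v = table[i];
    return acc + (v << 1);
}
```

On an architecture without shifted operands, the separate shift instructions would give the compiler something to slot in after the load. Here the code is denser, so the load-use stall is more likely to be paid in full.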
I've gone on too long, but the DEC-designed StrongARM used in the SA-1100
is a tour de force of clean implementation and balanced system
performance. It's amazing that the core was designed in 1993 (I think,
someone please correct me) and is still the leader for handheld
apps. The Intel guys went after clock speed at the expense of
everything else in XScale, and it will probably never optimize well
for a platform like PocketPC.
jeff