Slashdot Mirror


Explaining Disappointing XScale Performance In Pocket PCs

JYD writes: "I found this new article on a Pocket PC web site where Microsoft talks about why XScale Pocket PCs aren't as fast as people thought they would be. Is it the OS? The CPU not supporting ARM4 properly? I wonder if the Linux port would run faster on 400 MHz ... or did Intel screw up the CPU?"

1 of 133 comments

  1. ARMv5 versus ARMv4 and why Intel sucks by jeffmock · Score: 5, Insightful

    It's important to differentiate between architecture optimizations
    and CPU-specific optimizations. The ARMv5 instruction set is a
    relatively minor architectural tweak to the ARMv4 instruction set.
    The names give the impression of some grand change between
    v4 and v5; if a technical guy had done the naming, they would be
    ARMv4 and ARMv4.01. ARM is playing some games with architecture
    naming to protect its business position with patents in a silly way.

    ARMv5 adds a couple of new instructions over v4: an instruction to count
    leading zeros in a register (which a compiler would likely never
    use), and a better method of switching between the ARM instruction
    set and the 16-bit Thumb instruction set. The latter isn't
    relevant for PocketPC since Thumb mode isn't supported. I think
    v5 might have a new debugging hook as well.
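
    For anyone who hasn't met it, the count-leading-zeros instruction
    returns the number of zero bits above the highest set bit of a
    32-bit register. A minimal Python sketch of the same result done
    in software (the name clz32 is made up for illustration; returning
    32 for a zero input follows the ARM definition of CLZ):

```python
def clz32(x):
    """Count leading zero bits of x, treated as an unsigned 32-bit value."""
    x &= 0xFFFFFFFF            # keep only the low 32 bits, like a register
    count = 0
    mask = 0x80000000          # start at the most significant bit
    while mask and not (x & mask):
        count += 1             # this bit is zero, keep scanning downward
        mask >>= 1
    return count               # 32 when no bit is set at all


print(clz32(1), clz32(0x00010000), clz32(0x80000000))
```

    Without the instruction this takes a loop (or a table or binary
    search); with it, the whole thing is a single operation.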

    The new XScale parts are ARMv5TE. The "T" is for the 16-bit Thumb
    instruction set, which no one seems to care about. The "E" adds
    some DSP-oriented instructions that are pretty interesting for
    media codecs and such. They are the MMX equivalent for the ARM
    world. They likely won't improve performance of the general-purpose
    aspects of the platform.
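
    As one concrete illustration of "DSP oriented": the E extension
    includes saturating arithmetic such as QADD, which clamps a signed
    32-bit sum instead of letting it wrap around. A rough Python model
    (qadd32 and wrap_add32 are hypothetical helper names, and the
    sticky Q flag that the real instruction sets on saturation is
    left out):

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def qadd32(a, b):
    """Signed 32-bit add that clamps to the representable range."""
    return max(INT32_MIN, min(INT32_MAX, a + b))


def wrap_add32(a, b):
    """Ordinary 32-bit add for comparison: wraps around on overflow."""
    total = (a + b) & 0xFFFFFFFF
    # Reinterpret the 32-bit pattern as a signed value.
    return total - (1 << 32) if total & 0x80000000 else total


print(qadd32(INT32_MAX, 1), wrap_add32(INT32_MAX, 1))
```

    In an audio codec the clamped result is a clipped sample, while
    the wrapped one jumps from full positive to full negative, which
    is why codecs care and general-purpose code mostly doesn't.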

    I think it's a red herring to chase Microsoft for not optimizing for
    ARMv5. The changes are really small and I don't see any
    performance impact, certainly not if you have to maintain another
    version for all of the StrongARM-based products.

    Now, as for CPU-specific optimizations for the PXA250 (XScale)
    implementation of the ARM architecture: IMHO Intel chased
    MHz and left behind a lot of good sense about system performance.
    The high-order bit is bus performance, as others have already
    pointed out.

    In addition to the bus performance, Intel made many tradeoffs
    to optimize for clock speed. The 7-stage pipe has a 4-clock penalty
    for a mispredicted branch. Compare this to the circuit
    design heroics in the StrongARM, which implements "all branches
    are 2-cycles". The XScale approach is much more complicated. It
    probably doesn't perform any better, but you get a high clock speed.
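
    To put rough numbers on the branch trade-off, here is a
    back-of-envelope sketch. The 4-clock mispredict penalty and the
    flat 2-cycle StrongARM branch come from the paragraph above; the
    1-cycle cost of a correctly predicted branch and any misprediction
    rate you plug in are assumptions for illustration, not measured
    figures.

```python
STRONGARM_BRANCH_CYCLES = 2.0   # "all branches are 2-cycles"


def xscale_branch_cycles(mispredict_rate, predicted_cost=1.0, penalty=4.0):
    """Average cycles per branch: base cost plus the penalty on each miss.

    predicted_cost is an assumed value; penalty is the 4-clock figure.
    """
    return predicted_cost + mispredict_rate * penalty


def breakeven_mispredict_rate(predicted_cost=1.0, penalty=4.0):
    """Misprediction rate at which both parts spend equal cycles per branch."""
    return (STRONGARM_BRANCH_CYCLES - predicted_cost) / penalty


print(breakeven_mispredict_rate(), xscale_branch_cycles(0.5))
```

    This only counts cycles per branch under those assumptions. It
    says nothing about how well the predictor does on real code, or
    about the clock rate those cycles run at.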

    Intel adds clock cycles to all load/store-multiple instructions
    in XScale. This is a pretty big deal in ARM since they are
    used in the entry and exit of most C functions, in memcpy(),
    and any time you are moving chunks bigger than a register.

    The load-use penalty is bigger in XScale. This is a pretty big
    deal in ARM. The ARM instruction set is pretty compact. It is a
    RISC processor, but the ability to combine shift operations
    with ALU operations makes it possible for a good compiler
    to generate reasonably compact code. As a result, it's harder
    for a compiler to find instructions to put between a load and the
    instructions that use the destination of the load. This is another
    trade-off in XScale that allows a higher clock speed but hurts
    performance otherwise.
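
    The load-use effect can be sketched with a toy single-issue,
    in-order pipeline model. Everything in it is hypothetical: the
    instruction tuples, the register names, and the penalty values of
    1 and 3 cycles are illustrative, not PXA250 or SA1100 figures.

```python
def count_stalls(program, load_use_penalty):
    """Return stall cycles for a toy single-issue, in-order pipeline.

    program: list of (opcode, dest, sources) tuples.
    load_use_penalty: extra cycles before a loaded value can be used.
    """
    ready = {}     # register -> first cycle at which its value is usable
    cycle = 0      # cycle at which the next instruction could issue
    stalls = 0
    for opcode, dest, sources in program:
        # Wait until every source register is available.
        earliest = max([ready.get(reg, 0) for reg in sources] + [cycle])
        stalls += earliest - cycle
        latency = 1 + (load_use_penalty if opcode == "ldr" else 0)
        if dest is not None:
            ready[dest] = earliest + latency
        cycle = earliest + 1
    return stalls


# The same three instructions, with and without one independent
# instruction scheduled between the load and its use.
back_to_back = [("ldr", "r0", ["r1"]),
                ("add", "r2", ["r0", "r3"]),
                ("sub", "r4", ["r5", "r6"])]
scheduled = [("ldr", "r0", ["r1"]),
             ("sub", "r4", ["r5", "r6"]),
             ("add", "r2", ["r0", "r3"])]

for penalty in (1, 3):
    print(penalty, count_stalls(back_to_back, penalty),
          count_stalls(scheduled, penalty))
```

    With the small penalty, one independent instruction hides the
    load completely. With the larger one it doesn't, and dense code
    gives the compiler few spare instructions to fill the gap.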

    I go on too long, but the DEC-designed StrongARM used in the SA1100
    is a tour de force of clean implementation and balanced system
    performance. It's amazing that the core was designed in 1993 (I think,
    someone please correct me) and is still the leader for handheld
    apps. The Intel guys went after clock speed at the expense of
    everything else in XScale and it will probably never optimize well
    for a platform like PocketPC.

    jeff