Explaining Disappointing XScale Performance In Pocket PCs
JYD writes: "I found this new article on a Pocket PC web site where Microsoft talks about why XScale Pocket PCs aren't as fast as people thought they would be. Is it the OS? The CPU not supporting ARM4 properly? I wonder if the Linux port would run faster on 400 MHz ... or did Intel screw up the CPU?"
Actually, you're right. Sorta.
If you take Linux (source-based, optimized for the CPU) and a modern window manager like Enlightenment (if you think it's not modern, prove it in a reply) with a preemptive kernel, and put it on a ~500 MHz Celeron with 128 MB of PC100 SDRAM, it WILL BEAT Windows 98 in speed, although it is different.
I just don't see how people assume KDE and Gnome are "modern" because they resemble Windows. Is that the trend?
Pocket PCs aren't as fast as people thought they would be. Is it the OS?
It could be the OS, which is the obvious answer since it's a Microsoft OS, and this is Slashdot. But I don't know. I've never tried running anything other than PocketPC OS on the iPaq, and probably never will. (It's a work thing.)
How did Microsoft become so popular? It was DOS, wasn't it? The program that ran on any x86 computer. Well, Microsoft should take a page from their previous success and allow a little more flexibility in PocketPC design. The main gripe that I and everyone else have about these gizmos is that they're locked into a 240 by 320 by 16-bit color display. That's lame, especially if one of the highlights of PocketPC is how easy it is to port your Win32 app. If you have to redesign all the screens to fit in a tiny-ass space, it's easy on the coders but hell on the systems analysts.
It looks to me like Palm has a much more open approach. They are using the same tactic that established Microsoft's dominance with DOS back in the 80s. You can get that new Sony Clié with TWICE the screen real estate (as in pixels) of ANY PocketPC available. Kind of a no-brainer if you ask me.
Off to the solstice parade!
It's important to differentiate between architecture optimizations
and CPU-specific optimizations. The ARMv5 instruction set is a
relatively minor architectural tweak to the ARMv4 instruction set.
The names give you the impression that there's some grand change between
v4 and v5; if a technical guy had done the naming, it would be ARMv4 and
ARMv4.01. ARM is playing some games with architecture naming
to protect their business position with patents in a silly way.
ARMv5 adds a couple of new instructions over v4: an instruction to count
leading zeros in a register (which a compiler would likely never
use), and a better method of switching between the ARM instruction
set and the 16-bit Thumb instruction set. The latter isn't
relevant for PocketPC since Thumb mode isn't supported. I think
v5 might have a new debugging hook as well.
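
For anyone who hasn't met it, here is a rough sketch of what a
count-leading-zeros operation computes on a 32-bit register. It is
written in Python purely for illustration; the function names are
made up for this example and this is not how any compiler or OS
actually uses the instruction. The one hardware detail assumed is
that an input of 0 yields 32, which is how ARM defines CLZ.

```python
def clz32(x: int) -> int:
    """Count leading zero bits of x, treated as an unsigned 32-bit value."""
    x &= 0xFFFFFFFF              # keep only the low 32 bits, like a register
    return 32 - x.bit_length()   # bit_length() is 0 for x == 0, so clz32(0) == 32


def clz32_loop(x: int) -> int:
    """Same result computed bit by bit, roughly what software has to do
    when no single instruction is available (illustrative only)."""
    x &= 0xFFFFFFFF
    count = 0
    mask = 0x80000000            # start at the top bit and walk down
    while mask and not (x & mask):
        count += 1
        mask >>= 1
    return count
```

The typical customers for this are hand-written routines such as
normalization in software floating point or division, which fits the
point that a compiler would rarely emit it on its own.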
The new XScale parts are ARMv5TE. The "T" is for the 16-bit Thumb
instruction set, which no one seems to care about. The "E" adds
some DSP oriented instructions that are pretty interesting for
media codecs and such. They are the MMX equivalent for the ARM
world. They likely won't improve performance of the general
purpose aspects of the platform.
I think it's a red herring to chase Microsoft for not optimizing for
ARMv5. The changes are really small and I don't see any
performance impact, certainly not if you have to maintain another
version for all of the StrongARM-based products.
Now, as for CPU-specific optimizations for the PXA250 (XScale)
implementation of the ARM architecture: IMHO Intel chased
MHz and left behind a lot of good sense about system performance.
The high-order bit is bus performance, as others have already
pointed out.
In addition to the bus performance, Intel made many tradeoffs
to optimize for clock speed. The 7-stage pipe has a 4-clock penalty
for a mispredicted branch. Compare that to the circuit
design heroics in the StrongARM that implement "all branches
are 2 cycles". The XScale approach is much more complicated. It
probably doesn't perform any better, but you get a high clock speed.
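
A back-of-the-envelope way to look at the branch trade-off, sketched
in Python. Everything here is a simplifying assumption: a base of one
cycle per instruction, "2-cycle branches" read as one extra cycle on
every branch, and a correctly predicted branch costing nothing extra
on the predicted design. The branch fraction and mispredict rate are
inputs you would have to measure; none of the function names come
from any real tool.

```python
def cycles_per_instruction(branch_fraction, extra_cycles_per_branch, base_cpi=1.0):
    """Average cycles per instruction when only branches add overhead."""
    return base_cpi + branch_fraction * extra_cycles_per_branch


def fixed_cost_branch_cpi(branch_fraction):
    """Every branch takes 2 cycles, i.e. 1 extra cycle (assumed reading)."""
    return cycles_per_instruction(branch_fraction, 1.0)


def predicted_branch_cpi(branch_fraction, mispredict_rate, penalty=4):
    """Only mispredicted branches pay the penalty (4 clocks per the text)."""
    return cycles_per_instruction(branch_fraction, mispredict_rate * penalty)


def break_even_mispredict_rate(penalty=4):
    """Mispredict rate at which both schemes cost the same per clock."""
    return 1.0 / penalty
```

Under these assumptions the two schemes tie per clock when one branch
in four is mispredicted. Below that rate the predictor wins, above it
the fixed 2-cycle branch wins, and the clock speed difference comes
on top of either result.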
Intel adds clock cycles to all load/store-multiple instructions
in XScale. This is a pretty big deal in ARM since they are
used in the entry and exit of most C functions, in memcpy(),
and any time you are moving chunks bigger than a register.
The load-use penalty is bigger in XScale. This is a pretty big
deal in ARM. The ARM instruction set is pretty compact. It is a
RISC processor, but the ability to fold shifting operations
into ALU operations makes it possible for a good compiler
to generate reasonably compact code. As a result, it's harder
for a compiler to put instructions between a load and instructions
that use the destination of the load. This is another trade-off
in XScale that allows a higher clock speed but hurts performance
otherwise.
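
To make the load-use point concrete, here is a toy stall counter in
Python. It is a sketch, not a model of either chip: it assumes a
single-issue in-order pipe where ordinary results are forwarded to
the next instruction, and a load's result arrives some extra cycles
late. The Instr type, the register names, and the penalty values are
all invented for illustration.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str                 # register written ("" if none)
    srcs: Tuple[str, ...]     # registers read
    is_load: bool = False


def stall_cycles(program, load_use_penalty):
    """Count cycles lost waiting for load results in a toy in-order pipe."""
    ready = {}    # register -> earliest cycle at which a consumer may issue
    cycle = 0
    stalls = 0
    for ins in program:
        earliest = max([ready.get(r, 0) for r in ins.srcs], default=0)
        if earliest > cycle:              # operand not ready yet, so wait
            stalls += earliest - cycle
            cycle = earliest
        if ins.dest:
            extra = load_use_penalty if ins.is_load else 0
            ready[ins.dest] = cycle + 1 + extra
        cycle += 1
    return stalls


# Load immediately followed by its consumer.
back_to_back = [
    Instr("r0", ("r1",), is_load=True),
    Instr("r2", ("r0", "r3")),
]

# Same code with one independent instruction scheduled in between.
one_filler = [
    Instr("r0", ("r1",), is_load=True),
    Instr("r4", ("r5", "r6")),
    Instr("r2", ("r0", "r3")),
]
```

With a penalty of 1, a single filler instruction hides the load
completely. With a penalty of 2 it doesn't, and the compiler has to
find a second independent instruction, which is exactly what gets
harder when the code is already compact.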
I go on too long, but the DEC-designed StrongARM used in the SA-1100
is a tour de force of clean implementation and balanced system
performance. It's amazing that the core was designed in 1993 (I think,
someone please correct me) and is still the leader for handheld
apps. The Intel guys went after clock speed at the expense of
everything else in XScale, and it will probably never optimize well
for a platform like PocketPC.
jeff