Explaining Disappointing XScale Performance In Pocket PCs
JYD writes: "I found this new article on a Pocket PC web site where Microsoft talks about why XScale Pocket PCs aren't as fast as people thought they would be. Is it the OS? The CPU not supporting ARM4 properly? I wonder if the Linux port would run faster on 400 MHz ... or did Intel screw up the CPU?"
now that i can buy a pentium 4 laptop with a nice video card for a fairly reasonable price...
unzip; strip; touch; finger; mount; fsck; more; yes; unmount; sleep
My group has been working on a synthesizable secure G3 card CPU and it will probably be the slowest ARM ever made.
The CPU will be fully delay-insensitive and asynchronous to stop power and clock glitch attacks.
We are currently looking at 4 MHz on a 0.18 micron process.
Mouse powered Chips, Open source Processors and Lego
They would run far slower. Yes, I know, you could fit a little text-only distro onto a floppy, but if you compare KDE or Gnome to Windows 98 or XP (had all 4 on the same box, in 3-4 installs of Linux, maybe more :( ) it's much, much, slower. I'm sure you could use some ancient window manager that's "fine for you" but speed is terrible on any modern Linux DE.
Now, you can either A) mod me down because I'm dissing Linux, even though I'm a user, or B) maybe work on that. I know there are lots of contributors here, and I'm not a coder.
..answer !
But unfortunately SavaJe XE seems to be available only for regular ARM.
Anyway, it is still a cool OS: check it out!
A review I read showed a 400 MHz XScale performing at 50%-75% of the speed of a 206 MHz StrongARM chip. I would be really interested in some non-OS-specific tests that showed whether or not the XScale offers any performance benefit whatsoever. I know that it is supposed to scale to 1 GHz and has better battery life than the 206 MHz ARMs, but if it NEEDS to run at 800 MHz just to perform at the same level as its older sibling then it is a waste of space.
On the other hand, Intel often gives little thought to enhancing performance of old code on new processors. If memory serves me right, Intel's Pentium Pro ran 16-bit code embarrassingly slowly.
-jhp
/. -- the Free Republic of technology.
The Amulet group has been working for years to make low-power yet high-speed asynchronous ARM processors.
The Amulet 3 runs at 120 MHz and consumes very little power. Most of all, it's asynchronous, so when you don't have much processing to do it just sits there consuming "no" power.
They take a hell of a beating and still run. I connected one to a hamster wheel and you can see it here running despite the power fluctuating madly.
The only reason it only goes at 120 MHz is because the memory isn't fast enough.
It's a little strange that only three ARM production licenses were given out: one to Intel, one to Motorola and one to the Amulet group.
Intel's Xscale architecture -should- have impressive performance. However, it seems that circumstances have conspired to keep it from showing its potential. In order to retain binary backwards compatibility, Microsoft has kept their software compiler rather basic, ensuring that it will work for multiple architectures. This lack of optimization also means that any architectural improvements Xscale has over Arm V4 or whatever it's called will not mean anything. Hopefully the history of Xscale will work like netburst architecture's history, where about 1 year after inception, software that makes use of its architecture efficiently (like SSE2 with the P4) will start to appear.
------- "From bored to fanboy in 3.8 asian girls" ----------
Umm... right, that's why my PocketPC 2000 Cassiopeia E115 is now as useful as a doorstop as it has a MIPS chip in it.
When I got my PocketPC, MS touted that 'software matters' - even in their publicity. Suddenly, they ditch all the SH3 and MIPS users and just support ARM in PocketPC 2002. Not only that, but applications like Terminal Services and Messenger they won't release for the older machines. I see a lot of people saying that this is because PocketPC 2002 is based on CE.NET - that's not correct. PocketPC 2002 is just another revamp of PocketPC 2000, which are both based on CE 3.0. So when it all boils down, it's just Microsoft playing marketing tricks. Net result of their decision - my £450 PDA became obsolete in 18 months.
I now own a Palm.
Pocket PCs aren't as fast as people thought they would be. Is it the OS?
It could be the OS, which is the obvious answer since it's a Microsoft OS, and this is Slashdot. But I don't know. I've never tried running anything other than PocketPC OS on the iPaq, and probably never will. (It's a work thing.)
How did Microsoft become so popular? It was DOS, wasn't it? The program that ran on any x86 computer. Well, Microsoft should take a page from their previous success and allow a little more flexibility in PocketPC design. The main gripe that I and everyone else has about these gizmos is that they're locked into a 240 by 320 by 16-bit color display. That's lame, especially if one of the highlights of PocketPC is how easy it is to port your Win32 app. If you have to redesign all the screens to fit in a tiny-ass space, it's easy on the coders but hell on the systems analysts.
It looks to me like Palm have a much more open approach, they are using the same tactic that established Microsoft's dominance with DOS back in the 80s. You can get that new Sony Clie' with TWICE the screen real estate (as in pixels) of ANY PocketPC available. Kind of a no-brainer if you ask me.
Off to the solstice parade!
Q: What could possibly have gone wrong?
A: While we acknowledge that some peoples' perception is of something having gone wrong, we believe that any wrongness is unavoidable.
Q: Well, some analysts say it's intel's fault
A: We have implemented what we could implement, and don't believe there is any implementable implementation that would implement significant gains.
Q: Analysts also say it will be 2004 before the issue is fixed
A: It is too early to talk about 2004. That said, we are committed to delivering a good product.
Q: This is really bad news for the Pocket PC platform
A: Yes, it is. However, fortunately the issue is so small that this really isn't bad news for the Pocket PC platform.
Cheers
-b
Well, I haven't looked deeply into the Intel XScale, but the article says that although the clock rate has gone up from 206MHz to 400MHz, the bus speed has gone down from 103MHz to 100MHz. Unless they've significantly beefed up the amount of cache, this is going to hurt you a lot.
Additionally, I seem to remember the XScale was targeted at telecoms equipment more than PDAs. Can anyone comment on how these new ARMv5 instructions affect performance?
Running kde3 on an AMD K6-2 266MHz with 96MB EDO RAM and OpenBSD, it actually runs quite well.
Hail to the king, baby!
PocketPC is the new name for Windows CE. MS tricked me into buying these beta devices twice. Hopefully not again.
One of the responders hit it dead on. What is being sold are basically beta devices. My WorkPad z50 and Sharp Mobilon are both pieces of junk that get no use.
They've probably used aggressive power saving on the chip to save every electron but at the expense of performance.
That's not such a bad thing, most of these things run address books and sync to email. The battery is the real problem with them, not the fact it can't encode video streams!
Sure they'll get a few complaints, but nothing like the slating they've been getting for the battery life problem.
bus speed has gone down from 103MHz to 100MHz now unless they've significantly beefed up the amount of cache this is going to hurt you a lot
A 3% decrease in bus speed is not going to "hurt you a lot", especially not when the CPU clock's almost doubling.
Is "(Score:-1, Flamebait)" just another way of saying "OMG, he's right, but we don't want stuff like that getting out!"
Aw, fuck it. Let's go bowling. - The Big Lebowski
Well, that statement clearly deserves a +5 Funny.
MS admits in the linked article that the OS is not "optimized". It fails to use the new ARM instruction set, and worse, does not seem to use the power-management capabilities of the XScale. Supposedly the Xscale uses half the power of the StrongARM, but battery tests on the new PPCs do not show this savings. This fix will be a while coming, as the next version of the OS does not appear to be optimized either.
Interestingly, Asus in their upcoming Xscale PPC is coming up with workarounds, such as on the fly automatic clock and voltage throttling. So while the Xscale supports capabilities that MS is not using, the vendors are not waiting for next year for MS to get their act together.
Hopefully the vendors will also figure out a way to speed up the terrible benchmarks of the Xscale PPCs.
*Actual clock speed 400 mhz
I didn't mean the 3MHz drop hurts you a lot. I meant that if you compare the clock to bus speed ratio, you're looking at a 50% reduction in bus speed compared to CPU clock rate.
If there is not enough cache memory, increasing processor clock speed will not have a positive effect on performance, because the real effective clock rate will be bound by how fast the processor can fetch data from main memory.
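To put numbers on that ratio argument, here is a minimal sketch using the clock figures quoted in this thread (206/103 MHz for the StrongARM, 400/100 MHz for the XScale). The function name is just for illustration, and the model ignores caches entirely.

```python
def clock_to_bus_ratio(cpu_mhz, bus_mhz):
    """Return how many CPU cycles elapse per memory-bus cycle."""
    return cpu_mhz / bus_mhz

# Figures quoted in the thread, not measured values.
strongarm = clock_to_bus_ratio(206, 103)
xscale = clock_to_bus_ratio(400, 100)

# Bus speed relative to the core, XScale versus StrongARM.
relative_bus_speed = strongarm / xscale

print(f"StrongARM: {strongarm:.1f} CPU cycles per bus cycle")
print(f"XScale:    {xscale:.1f} CPU cycles per bus cycle")
print(f"Relative bus speed per CPU cycle: {relative_bus_speed:.0%}")
```

Seen from the core, each bus cycle costs twice as many CPU cycles on the XScale, which is the "50% reduction" meant above.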
Well, OK, the inflated MHz of the Pentium 4 does
actually pay off, I mean, the IPC (Instructions
per cycle) went downhill with the P4, so it
doesn't perform as well as you might expect from
a chip with that many MHz, but Intel managed to
get so many MHz out of the P4 that, in the end,
it is still much faster than a P3. Still, I
could imagine that in the case of XScale, the
IPC became so horrible that even having twice as
many MHz (as the StrongArm) doesn't save the
chip.
Mind you, that is just a wild, uninformed guess.
MS : "Moving to ARM V5 would break upgrade compatibility."
Translation : "We can't or won't write portable code."
There's absolutely no technical reason they can't take advantage of the V5 enhancements while still retaining support for ARM V4 and a common code base. This must have been a business decision, but I can't fathom the thought processes which led to it.
Here's the text:
Posted: Thu Jun 20, 2002 1:31 pm Post subject: XScale and the Pocket PC - what's going on?
With all the anticipation over the Intel XScale processor running at 400 MHz, expectations were high that any Pocket PC would double in speed. You double the MHz from 206 to 400 and the Pocket PC should get twice as fast, right? Not exactly. There are many issues that relate to overall device speed, not the least of which is software. Ed Suwanjindar from the Microsoft Mobile Devices group responded to my questions via email on this issue. I'll post my own thoughts on this issue under a separate entry. On to the Q&A!
THOUGHTS: Early reports based on those who own the Toshiba e740 Pocket PC 2002 device are telling us that XScale at 400 MHz performs slower than a StrongARM at 206 MHz on some tasks. This came as a surprise to many people.
SUWANJINDAR: "We are aware that PXA250 (XScale)-based devices are not demonstrating the huge performance gains that were anticipated. That said, Pocket PCs continue to offer the best performance and the richest functionality vs. other handhelds on the market today."
THOUGHTS: I've seen a few articles on line saying it's Microsoft's fault for not having an optimized OS in place for the XScale launch. What is Microsoft's response to this?
SUWANJINDAR: "Our software remains the same. This is the same Pocket PC 2002 software that performs fabulously across other ARM processors (StrongARM 1110, OMAP710, etc). We made a hard decision several years ago to move away from supporting several processor architectures and target a single core. This was a difficult decision that we think ultimately benefited our OEMs, developers and customers by unifying our platform around single processor architecture -- ARM V4. The PXA250 utilizes the ARM V5 instruction set with backwards compatibility for ARM V4. When we completed the Pocket PC 2002 software in June 2002, we optimized for the most broadly compatible processor core available at the time (ARM V4), which it still remains today. Choosing to support one processor core ensures we don't fragment our platform for developers and cause extra work for our ISVs to optimize their applications each time a new processor technology is released.
By staying with ARM V4 architecture we assure longer life spans for our customers existing hardware - for instance if we were to move to an ARM V5 architecture we would have to obsolete the all SA1110 iPAQ devices. Protecting the investments of our developers and customers is very important to us. To that end we've worked to make our devices upgradeable. Moving to ARM V5 would break upgrade compatibility. We're not prepared to strand an installed base of over 2 million iPAQ users."
THOUGHTS: Some industry analysts have said that Microsoft doesn't have any fix in place because Intel couldn't get the chips out in time.
SUWANJINDAR: "We have implemented and released specific software changes that our hardware partners are implementing without breaking compatibility for our OEMs and users. While we believe there may be incremental gains that could be had via small optimizations we are not convinced there are across the board improvements that would amount to any kind of dramatic system wide speed up. We have to develop software based on the processor architecture that offers the broadest compatibility for developers and when we shipped Pocket PC 2002 as it still is today, that was ARM V4."
THOUGHTS: Some of those same analysts have said it will be 2004 until there's an OS that can use the XScale CPU properly. Is that an accurate estimate?
SUWANJINDAR: "It's too early to talk about the next version of our software. That said, we're committed to delivering best-in-class functionality and performance while providing a foundation that enables our developer community to continue to innovate and build successful businesses on our platform.
Microsoft considers mobile devices a strategic business. We are committed to working closely with Intel and other silicon vendors on delivering future versions of our Pocket PC and Smartphone devices. We have released specific software modifications to our OEMs that in total are all of the optimizations we believe are possible to maximize PXA250 performance (without causing incompatibilities for our OEMs and developers)."
THOUGHTS: This isn't a good story for the Pocket PC and as more XScale devices hit the market, the issue will get more obvious and ultimately become more serious.
SUWANJINDAR: "Agreed, this isn't a good story. Very simply, we think this is one of those times when the technical reality didn't measure up to market expectations. That said for people who use these products, this isn't a big deal. I've used both of the new XScale products that are out there (new Toshiba and iPAQ). They offer the same type of performance that I've come to expect on a Pocket PC. I think the market expectation of what performance on a 400 MHz processor vs. 206 MHz processor has been unreasonable. In the mobile device space, we don't think that MHz is what ultimately matters to customers. What matters the most in this market is whether customers can do what they want to do with devices quickly and easily. With the richest set of software applications built into any PDA on the market, and the strong momentum that Pocket PC has with developers writing for our platform, we think that customers will be able to do the things they want to do with the performance they expect on devices using PXA250 processors."
This complaint was also based on the FIRST Xscale PDA to EVER be released. Sure there's GOING to be problems. The iPaq started off with similar issues, but you don't hear anyone talking about it now do ya? There are a lot of reasons that add up to create the total performance picture. Maybe Toshiba used cheaper internal RAM? Maybe they need more memory for video (I think it has like 256 K maybe?? I don't know but I know it has dedicated video RAM). The point is the performance on ONE Xscale based PocketPC does not make a prediction on how the others will perform. Also as these are flashable, we can expect even the Toshiba to get better performance as flash updates are made available.
Gorkman
This fits perfectly in Intel's "Megahertz sells" paradigm.
Just push clockspeed up at any cost - who cares about performance? It's already running windows - so what can you expect?!
In Intel's implementation of the
XScale architecture they have made
a great deal of changes. I suppose
it's Intel's way of forcing you to
go their way rather than with other
vendors, because to support the XScale
and get decent performance you need:
1) Rewrite your assembly code to handle
problems with their memory controller.
Code written for ARM920 cores perform
better than XScale using existing code.
This sucks for embedded software developers
because they have to branch their code
base to support the 'XScale', because
Intel has to be different.
2) Find a new compiler that will actually
optimize for the XScale. Pocket PC
development uses Microsoft's ARM compiler.
Palm OS development for OS 5.0 requires
ARM Ltd's ADS 1.X tool chain, which likewise
is not optimized for XScale.
So as a developer you are hit with two major
failings: (1) Branching code just to optimize
for XScale and (2) Lack of compilers to support
XScale.
Is it any wonder, embedded design houses are
switching to ARM9 cores by other companies?
Me.
This runs Quake2. Really. http://www.oqo.com/
Without a compiler that has optimizations for the XScale, you will still get poor performance. So all the tweaks in the world to your existing code base will be for nought without a corresponding change in the compiler which is targeted for ARM7/9 cores and has only basic support for XScale.
When shopping for a PDA I don't remember speed meaning anything but "costs too much".
People buy PDAs as cheap laptops or simple organisers. Neither needs speed.
Pocket PCs can get faster and faster while Palm OS PDAs outsell them.
The Palm OS devices are cheaper and use less power.
This is because they are slower.
It's not speed... it's memory....
Handspring Visors have memory cartridges and the Palm m500 uses media cards, so while Pocket PC devices play with added speed and don't get it, Palm OS devices get added memory.
These things are just portable databanks; they aren't for processing information, just storing it.
Want to play MP3s? Slap on an MP3 player... a sound chip that has an MP3 encoder built in and some added RAM.
Want to do presentations? Slap on a presentation device.
Go on the Internet? Snap on a wireless... (Unless it's built in)
Play Quake? Compile data?
Hotsync with desktop...
I'm looking for a keyboard and a wireless for my Visor (the i705 can't handle telnet) so I can use a shell account from my PDA...
I'm not going to have any real computing power on a PDA. That's not what a PDA is for.
I don't actually exist.
The Intel PXA250 has only 32K/32K of cache, which means that any real application will experience an extremely high cache miss rate. The memory bus is 16 or 32 bits, and has a maximum clock rate of 100 MHz. So, if you're running the maximum width bus at its maximum speed, you're likely to see an instruction dispatch rate of about 50~100 million ops/second. That's slow, and there's really nothing to be done short of adding much more cache.
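The 50~100 million ops/second figure can be reproduced with a back-of-the-envelope sketch. It assumes every instruction fetch misses the cache, one bus transfer per bus clock, and 32-bit ARM instructions; SDRAM latency and burst transfers are ignored, and the function name is made up for this illustration.

```python
ARM_INSTRUCTION_BITS = 32  # ARM-mode instructions are one 32-bit word

def bus_limited_fetch_rate(bus_width_bits, bus_mhz,
                           instruction_bits=ARM_INSTRUCTION_BITS):
    """Millions of instructions per second if every fetch goes to memory.

    Assumption: the bus delivers bus_width_bits on every bus clock.
    """
    return bus_mhz * bus_width_bits / instruction_bits

for width in (16, 32):
    rate = bus_limited_fetch_rate(width, 100)
    print(f"{width}-bit bus at 100 MHz: about {rate:.0f} million instructions/s")
```

This is a worst case in which the cache never hits, so it is a floor on fetch throughput under these assumptions and not a prediction of real application speed.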
If you are doing anything that requires performance, you shouldn't rely on the compiler. Intel has their "Integrated Performance Primitives" which give you a nice abstraction layer so you can wring the most power out of each processor without having to hand code asm for each one.
the next version of Pocket PC - that will surely have a .NET runtime in it. That will make all the problems go away, won't it? Just compile the apps to IL, target, and distribute.
Let's hope your skepticism is justified. Because if it isn't, Linux as a platform will be in very serious trouble.
Linux has no answer to cross-platform code, the one exception being Gnome with Mono. If that remains the only effort, and continues to attract hype and developer support, one day soon we'll wake up and find that the single viable open source platform to write to is under the technical direction of Microsoft.
However did this happen?
There are no new ARMv5 instructions that affect performance in any noticeable way for general-purpose computing (i.e. using an optimized C compiler with your old code).
The main new instructions are:
- a "find first one bit in word" instruction, which helps software division and Huffman encoding
- some DSP instructions like 16x16-bit multiplication/40-bit add for filters (audio encoding, etc.)
Both these enhancements more or less require assembly coding.
The other major architectural enhancements are branch prediction (offset by higher penalties on branch misses) and larger caches (32K dcache versus 8K and 32K icache vs 16K, if I remember correctly).
However, the cache latency has increased from 1 to 3 cycles.
It means that when you load a value from memory and hit the cache, the compiler needs to find 3 unrelated instructions you can execute before you can use the result in the fourth instruction after the load.
This is a severe blow if your compiler does not figure it in, and even if it tries, or if you use assembly, you often cannot find three such instructions (table walks, or under register pressure).
In the worst case (table walks, LUTs), this effectively halves your processor speed.
As far as I know, the bus interface has not improved from the SA1110, and this was not too efficient to start with (does not exploit accessing preloaded bank, cache line has to be completely filled before execution, etc.).
Apart from that, there are some issues in the PXA silicon, which I think force some time-consuming workarounds (extra cache flushes, writeback cache does not work, slow bus cycles). I would guess that these affect performance even more than the 100MHz SDRAM clock - after all that's about what you find in your 1GHz+ P-III design.
However, this is only what I gathered from the datasheets. I have not yet used a PXA system, as it does not yet seem to be an improvement over the SA1110 that justifies a new design.
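The "halves your processor speed" worst case can be illustrated with a toy pipeline model. Everything here is an assumption made for the sketch: a single-issue core, one instruction per cycle, and a load issued at cycle t delivering its result at cycle t + latency. Whether the gap to fill is two or three instruction slots depends on how the latency is counted, so treat the exact numbers loosely.

```python
def cycles_for_dependent_chain(pairs, load_latency, fillers_per_pair=0):
    """Cycles needed for `pairs` load->use pairs in one dependent chain.

    Toy model (assumption): the load takes one issue slot, each independent
    filler instruction takes one slot, and the dependent use cannot issue
    earlier than `load_latency` cycles after the load was issued.
    """
    cycles_per_pair = max(load_latency, 1 + fillers_per_pair) + 1
    return pairs * cycles_per_pair

# Table walk: every load depends on the previous use, nothing to fill with.
old_core = cycles_for_dependent_chain(1000, load_latency=1)
new_core = cycles_for_dependent_chain(1000, load_latency=3)
slowdown = new_core / old_core

print(f"latency 1: {old_core} cycles, latency 3: {new_core} cycles")
print(f"cycles per iteration grow by a factor of {slowdown:.1f}")
```

In this model, two independent fillers per pair bring the latency-3 case back to one instruction per cycle, which is why the outcome depends so much on whether the compiler or the assembly programmer can find such instructions.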
Before blaming Intel for going with an Arm 5 core with "slow" (slow being relative, as the benchmarks vary) Arm 4 emulation, remember that all they did was produce a CPU for the embedded market that can run on batteries. The MS Pocket PC market is just one market for these processors. They want them used in cell phones and powering all kinds of devices, just like the StrongArm did.
Obviously, they felt that the majority of their customers would want an Arm5 based device. Wait a few months, and you might see some pretty impressive cell phones or linux based devices that use Arm5.
The complaint against Intel is only legitimate if their Arm5 scores are terrible. Otherwise it is the fault of the device maker for using a chip that doesn't perform well for the task at hand, or MS for not optimising.
He said, "You'll be able to tell your grandchildren that you helped assemble the first NT supercomputer," and I cringed.
Marvin wrote:
> It's simply Intel moving to a new instruction set (ARM V5)
> and building a (slow) emulation of the old one (ARM V4),
> and Microsoft says it would be horribly difficult to
> support two different instruction sets, so the choice was to either
> live with the new CPU performing slower than the old one or
> cut off support for the old hardware.
This is not correct.
All ARMV4 instructions are implemented natively in the XSCALE core.
The XSCALE core, just as the SA1110, executes almost all ARMV4 instructions in one clock, and, as far as I remember, uses more clocks only for very few instructions:
- shift register by register (2 instead of 1)
- mul / mul-acc (extra latency cycle in some cases)
- branch miss in the added BPU
- maybe some coprocessor accesses
Except for an assembly rewrite of some inner loops in the kernel, there is not much MS can do about the Memory interface that hasn't scaled with the CPU clock.
I do not think that compiler tweaking will gain much more than 10% in performance.
but I haven't any points. I'll just throw out the Compaq Aero series as another MIPS arch that was very quickly made useless. Not to say it was ever very useful. We bought five but the sync process was so shitty that we never used them.
"I assumed blithely that there were no elves out there in the darkness"
Perhaps the wince programmers should stop programming it like it were a pentium...
It's important to differentiate between architecture optimizations
and CPU specific optimizations. The ARMv5 instruction set is a
relatively minor architectural tweak to the ARMv4 instruction set.
The names give you the impression that it's some grand change between
v4 and v5, if a technical guy did the naming it would be ARMv4 and
ARMv4.01. ARM is playing some games with architecture naming
to protect their business position with patents in a silly way.
ARMv5 adds a couple of new instructions over v4, an instruction to count
leading zeros in a register (which a compiler would likely never
use), and a better method of switching between the ARM instruction
set and the 16-bit Thumb instruction set. The latter isn't
relevant for PocketPC since Thumb mode isn't supported. I think
v5 might have a new debugging hook as well.
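As a rough illustration of where a count-leading-zeros instruction helps hand-written code (software division was mentioned elsewhere in this thread), here is a sketch in Python. `clz32` stands in for the hardware instruction, and `udiv32` is a hypothetical helper, not code from any real runtime library. Inputs are assumed to be unsigned 32-bit values.

```python
def clz32(x):
    """Count leading zeros of an unsigned 32-bit value (returns 32 for 0)."""
    return 32 - x.bit_length()

def udiv32(n, d):
    """Unsigned shift-and-subtract division returning (quotient, remainder).

    The difference of the two leading-zero counts gives the starting shift
    directly, so no loop is needed to line the divisor up with the dividend.
    """
    if d == 0:
        raise ZeroDivisionError("division by zero")
    q = 0
    if n >= d:
        shift = clz32(d) - clz32(n)  # align top bit of d with top bit of n
        for s in range(shift, -1, -1):
            if n >= (d << s):
                n -= d << s
                q |= 1 << s
    return q, n

print(udiv32(100, 7))
```

Without the instruction, the alignment step becomes its own loop of shifts and compares. A compiler would only benefit if its division helper were rewritten this way, which fits the point that ordinary compiled code rarely uses it.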
The new XScale parts are ARMv5te, the T is for the 16-bit Thumb
instruction set, which no one seems to care about. The "E" adds
some DSP oriented instructions that are pretty interesting for
media codecs and such. They are the MMX equivalent for the ARM
world. They likely won't improve performance of the general
purpose aspects of the platform.
I think it's a red herring to chase Microsoft for not optimizing for
the ARMv5, the changes are really small and I don't see any
performance impact, certainly not if you have to maintain another
version for all of the strongARM based products.
Now, as far as CPU specific optimizations for the PXA250 (XScale)
implementation of the ARM architecture. IMHO Intel chased
MHz and left behind a lot of good sense about system performance.
The high order bit is bus performance as others have already
pointed out.
In addition to the bus performance, Intel made many tradeoffs
to optimize for clock speed: The 7-stage pipe has a 4-clock penalty
for a mis-predicted branch. This is compared to the circuit
design heroics in the strongARM that implements "all branches
are 2-cycles". The Xscale approach is much more complicated, it
probably doesn't perform any better, but you get a high clock speed.
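To make the branch trade-off concrete, here is a sketch of the break-even prediction accuracy, counted in cycles only. The 4-clock mispredict penalty and the fixed 2-cycle StrongARM branch come from the text above; the 1-cycle cost of a correctly predicted branch is an assumption made for this illustration and not a datasheet figure.

```python
STRONGARM_BRANCH_CYCLES = 2  # "all branches are 2-cycles"

def expected_branch_cycles(accuracy, predicted_cost=1, mispredict_penalty=4):
    """Average cycles per branch for a predictor with the given hit rate.

    predicted_cost=1 is an assumed value for a correctly predicted branch.
    """
    return predicted_cost + (1 - accuracy) * mispredict_penalty

def break_even_accuracy(predicted_cost=1, mispredict_penalty=4,
                        fixed_cost=STRONGARM_BRANCH_CYCLES):
    """Accuracy at which the predicted design matches the fixed-cost one."""
    return 1 - (fixed_cost - predicted_cost) / mispredict_penalty

print(f"break-even accuracy: {break_even_accuracy():.0%}")
for acc in (0.60, 0.75, 0.90):
    print(f"accuracy {acc:.0%}: {expected_branch_cycles(acc):.2f} cycles per branch")
```

Under these assumptions the predictor has to be right about three times out of four just to match the older design per cycle. Whatever gain is left comes from the higher clock.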
Intel adds clock cycles to all load/store-multiple instructions
in Xscale. This is a pretty big deal in ARM since they are
used in the entry and exit of most C functions, in memcpy(),
and any time you are moving chunks bigger than a register.
The load-use penalty is bigger in Xscale. This is a pretty big
deal in ARM. The ARM instruction set is pretty compact. It is a
RISC processor, but the combination of shifting operations
combined with ALU operations makes it possible for a good compiler
to generate reasonably compact code. As a result, it's harder
for a compiler to put instructions between a load and instructions
that use the destination of the load. This is another trade-off
in Xscale that allows a higher clock speed but hurts performance
otherwise.
I go on too long, but the DEC designed strongARM used in the SA1100
is a tour-de-force of clean implementation and balanced system
performance. It's amazing that core was designed in 1993 (I think,
someone please correct me) and is still the leader for handheld
apps. The Intel guys went after clock speed at the expense of
everything else in Xscale and it will probably never optimize well
for a platform like PocketPC.
jeff
Simple arithmetic: if it was CPU-bound, halving the clockspeed should roughly halve the FPS. Suspect the graphics chip. BTW, having it beaten by an iPaq in a graphics benchmark sucks rocks. A friend of mine got a 50-fold speed improvement in iPaq graphics by rewriting the GDI driver. Either the gfx driver is broken or there's something badly wrong with the gfx hardware.
Got time? Spend some of it coding or testing
We can use MIPS portables here.
Imagine a beowulf cluster of these!
If you want to get real work done on either of these, you should look here . Too bad MS doesn't ship this software with their wince systems.
The word processor is awesome, and the spreadsheet and database can't be far behind. Woohoo! I dusted off my old Jornada 820 for that. First productive use of my Jornada after, what, three years...
Only little girls and girly-men keep diaries. I laugh at you. I unclog my nose in front of you. I fart in your general direction.
We are aware that PXA250 (XScale)-based devices are not demonstrating the huge performance gains that were anticipated. That said, Pocket PCs continue to offer the best performance and the richest functionality vs. other handhelds on the market today.
Translation: We know your new car only goes 40mph instead of the 65mph your old car did, but it beats a bicycle, doesn't it? (credits to Jim S for that one).
Even better:
I think the market expectation of what performance on a 400 MHz processor vs. 206 MHz processor has been unreasonable.
Not at all. The processor is almost twice as fast, so I don't think it is utterly unreasonable to expect the product to be at least one and a half times faster.
But my question is, how is the battery life on one of these things? If it really is the 12-16 hours instead of the 8 currently then the XScale is still a worthwhile bet.
Avantslash - View Slashdot cleanly on your mobile phone.
It's usually pretty hard to thrash a code cache to the point of it being the bottleneck. You pretty much have to deliberately write code to do that. For reference, an Athlon has a 64KB code cache, and that's running at a far higher speed than both of these ARM processors. Your figure of 50-100 million ops/second assumes an unreasonable 100% instruction cache miss rate. You'd have to have a program totally devoid of loops to achieve that. At a still unreasonable hit rate of 95% you'd still get 96% ((95*400+5*100)/40000) of full performance.
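The hit-rate arithmetic above can be checked with a short sketch. The first function reproduces the formula as written, which averages the two rates. The second is an alternative not in the original comment: it averages the time per instruction, which is how mixed fast and slow fetches usually combine. Both assume a hit runs at the 400 MHz core rate and a miss at the 100 MHz bus rate, which is the simplification used in the comment.

```python
def rate_weighted_fraction(hit_rate, core_mhz=400, miss_mhz=100):
    """The comment's arithmetic: average the two rates by hit/miss share."""
    return (hit_rate * core_mhz + (1 - hit_rate) * miss_mhz) / core_mhz

def time_weighted_fraction(hit_rate, core_mhz=400, miss_mhz=100):
    """Alternative: average the time per instruction, then invert."""
    avg_time = hit_rate / core_mhz + (1 - hit_rate) / miss_mhz
    return (1 / avg_time) / core_mhz

for hit in (0.95, 0.99):
    print(f"hit rate {hit:.0%}: "
          f"rate-weighted {rate_weighted_fraction(hit):.1%}, "
          f"time-weighted {time_weighted_fraction(hit):.1%}")
```

At a 95% hit rate the time-weighted estimate comes out noticeably lower than 96%, but either way it stays far above the 25% that a 100% miss rate would imply.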
My guess why the real world performance is so bad is probably Microsoft's lack of optimization specific to the processor. There's a few trade-offs Intel have made to get the clock higher, including:
I certainly wouldn't expect to be able to take code targeted for StrongARM and see all of the performance increase that 200->400 would indicate. I can imagine hand coding assembler to work around the latencies and getting near to 100% performance. It's not that hard for a modern compiler to work this out either - a current day x86 is far more difficult to target than an XScale.
If someone has the source code for PPC2k2 they could try some modifications if they knew what was necessary. I'm not a programmer, but I think it would be sweet if you guys beat Microsoft at their own game; other companies might try it too, etc.