Intel Lindenhurst Xeon DP Platform Discussion
Steve from Hexus writes "Hexus.net has an article looking at Intel's latest Xeon platform: Lindenhurst, discussing the Paxville dual-core processor, E7520 core-logic, where it could go right for Intel, and where it could all go wrong." From the article: "If you're I/O bound by your threads in any way, you can hit problems (all threads touch the MCH, then there's a 266MiB/sec bus link to the I/O processors to cross, then the data hits disks or network hardware). If you're memory subsystem bound in any way, especially on a majority of compute threads, performance is likely gone. There's just too much resource sharing for it to all conceivably work well, especially compared to Opteron. I can foresee many a scenario where dual-core Opteron will give Paxville Xeon DP a beating."
HEXUS have an article coming that evaluates the latest Intel Xeon DP platform, codenamed Lindenhurst. As you'll likely know, (current) Xeon is Intel's workstation and server processor based on many of the same technologies that define Pentium 4 in the desktop space. Lindenhurst (at its most basic definition) is the combination of the new Paxville Xeon processor in DP (dual processor) form (there's a multi processor version hosted by Truland), along with Intel E7520 core logic.
The Paxville generation of Xeon is dual-core and uses the latest generation of Netburst microarchitecture, making the DP version ostensibly a clone of the Pentium D 820, but with the ability to also turn on HyperThreading for both cores. The DP version of Paxville, at $1080 in volume, is only available in 2.8GHz form for the time being, with the MP variant available at up to 3GHz. It supports everything the dual-core Pentium D does, including SSE3 instructions, and rides the same 200MHz system bus (800MHz effective).
E7520 provides a single dual-channel DDR2-400 memory controller, and a shared bus for the CPUs to reach that memory controller. Other features like PCI Express, support for the Xeon CPU's execute disable bit and support for PCI-X via a mandatory 6700PXH segment bridge (2 PCI-X segments) mean that superficially it's a forward-thinking, modern workstation and server platform.
However, in advance of the Lindenhurst test platform arriving for evaluation, I've caught myself wondering just how it's supposed to work with any kind of real performance outside of a couple of scenarios. It's an issue of too many consumers sharing limited resources, mainly at the CPU and memory controller levels.
Not much food to go round
We've evaluated HyperThreading-capable processors many times in the past, ever since the technology launched with the 3.06GHz Pentium 4, and while there's opportunity for performance improvements with a single HyperThreaded processor, performance rarely doubles because HyperThreading works by having the two logical threads share the CPU's execution resources.
In an SMP scenario with Xeon, you've then got CPUs sharing a memory controller. When that memory controller only supports fairly slow DDR2-400, likely at higher latency and with a performance penalty compared to DDR-400 (even without ECC in the mix, which is almost mandatory for Xeon given the places it's implemented), there's a performance issue. When the CPU-to-memory bus is shared between the two CPUs, so that only one CPU can use the bus at a time and accesses have to be interleaved, performance can be limited by a CPU-to-memory bottleneck.
Add in dual-core and you've now got four cores sharing one memory controller over one bus link. See where I'm going with this? Add in HyperThreading and eight logical processors in two sockets have to share that one memory resource, on one bus.
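To put rough numbers on that sharing, here is a minimal sketch of the arithmetic. It assumes a 64-bit front-side bus moving data at the 800MHz effective rate mentioned above, and it assumes the bus is split evenly between consumers, which real workloads won't do. These are peak theoretical figures, not measurements, and the function names are made up for illustration.

```python
# Peak theoretical figures only; sustained bandwidth on real hardware is lower.
BUS_WIDTH_BYTES = 8             # assumption: 64-bit FSB data path
FSB_TRANSFERS_PER_SEC = 800e6   # 200MHz clock, 800MHz effective


def peak_bandwidth_gb_s(width_bytes, transfers_per_sec):
    """Peak bus bandwidth in decimal GB/s."""
    return width_bytes * transfers_per_sec / 1e9


def share_per_consumer(total_gb_s, consumers):
    """Even split of the bus between consumers (a simplifying assumption)."""
    return total_gb_s / consumers


fsb = peak_bandwidth_gb_s(BUS_WIDTH_BYTES, FSB_TRANSFERS_PER_SEC)

for label, count in [("sockets", 2), ("cores", 4), ("logical processors", 8)]:
    share = share_per_consumer(fsb, count)
    print(f"{count} {label}: {share:.1f} GB/s each out of {fsb:.1f} GB/s")
```

Under those assumptions, every step from sockets to cores to logical processors halves the slice of the one bus that each consumer can count on.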
The lack of dedicated CPU bus connections to the memory controller on SMP Intel systems historically is one of the reasons why Athlon MP was able to do fairly well on introduction against SMP Pentium IIIs, CPUs which still shared the bus back then. Each Athlon MP had a dedicated bus connection to the memory controller.
With the introduction of Opteron by AMD in recent years, each CPU has its own memory controller right there on the CPU die and HyperTransport to allow the CPUs to access each other's memory controller and other connected system resources on non-heavily shared (only between a pair of CPUs, or a CPU and devices) bus links. That kind of topology, where all bus and memory access traffic isn't confined to one set of bus paths is why Opteron generally beats on Xeon in modern performance testing.
So while dual-core Opteron processors have the cores share a memory controller and HyperTransport link, that's as far as the sharing goes for the most part. Intel's comparison platforms with Xeon are sat sharing resources like nobody's business.
Where it could go ri
I've thought the same. I have racks of single core 3.0 GHz Xeons that strain the memory bus to the limit. Adding more cores to that mix is a waste. So, the new cluster is dual-core AMDs. The Intel architecture is generally good for the codes that we run, but I couldn't justify not buying AMDs. Price, thermal footprint, and performance all went that way.
Protip to Intel: Stop trying to feed your users this crap.
They wanted to get their Netburst cores into the DP world as quickly as possible.
Where AMD uses the HT bus for their 754 and 939/940 parts, Intel was still using the good ole 64-bit FSB of yesteryear.
Most of what Intel does nowadays in the processor world is entirely market driven. Netburst is a good example: a high clock rate, low efficiency processor. Sounds good on paper but works poorly in practice. The EM64T extensions are another example. A lot of code on the P4 in 32-bit mode takes roughly the same number of cycles on the 64-bit P4s, with the notable exception being 64-bit math [e.g. additions and multiplies].
For example, most block ciphers are the same speed on both the 540J and 820D [in terms of clock cycles]. I think that's partially because they're just using rename registers for the additional GPRs. But compare the AthlonXP to the Athlon64 and there is a huge difference. The Athlon64 is an improvement over its 32-bit cousin. They didn't just slap 64 bits on the core; they actually made it better.
I refer to my nice chart again
Operations per second doing RSA-1024 decrypt
AMD64 = 2.2GHz
AMD32 = 1.8GHz
P4 = 3.2GHz
Nocona = 2.8GHz
On the 32-bit side of things the AMD32 can match or beat the P4 even though it's slower by 1.4GHz. On the 64-bit side there simply is no comparison. I mean the dual-core RSA on the Nocona can't even match the SINGLE-CORE RSA on the Athlon64.
How pathetic is that?
Ever since the 64 came out Intel has basically been a poser in the CPU world. The only really proud achievements [outside the pure sciences they do in the background] are the ARM and P6 core designs...
Tom
Someday, I'll have a real sig.
For a single CPU, the quad-pumped 200MHz FSB does fine. It can fully utilize two channels of DDR-400 RAM, which is the standard on the better desktop mainboards. A single AMD CPU does no better.
Things are different with multiprocessor setups:
Here each Opteron has its own memory interface, while the Xeons have to share one FSB. As a result, the total Opteron memory bandwidth is proportional to the number of sockets. Total Xeon bandwidth does not grow with more sockets.
This shows up heavily in reviews of 2-processor machines. Expect it to be worse in 4- and 8-way systems.
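A rough sketch of that scaling argument, for illustration only. It assumes each Opteron socket has dual-channel DDR-400 (2 channels x 8 bytes x 400 MT/s) and that all Xeons sit on a single 64-bit, 800 MT/s FSB no matter how many sockets there are. Both are peak theoretical numbers, the function names are made up, and the model ignores the cost of an Opteron reaching memory attached to another socket over HyperTransport.

```python
OPTERON_PER_SOCKET_GB_S = 2 * 8 * 400e6 / 1e9   # assumption: dual-channel DDR-400 per socket
XEON_SHARED_FSB_GB_S = 8 * 800e6 / 1e9          # assumption: one 64-bit, 800 MT/s bus in total


def opteron_total_gb_s(sockets):
    """Aggregate peak bandwidth grows with every socket added."""
    return sockets * OPTERON_PER_SOCKET_GB_S


def xeon_total_gb_s(sockets):
    """Aggregate peak bandwidth stays flat: every socket shares the same bus."""
    return XEON_SHARED_FSB_GB_S


for sockets in (1, 2, 4, 8):
    print(f"{sockets}-way: Opteron {opteron_total_gb_s(sockets):.1f} GB/s, "
          f"Xeon {xeon_total_gb_s(sockets):.1f} GB/s")
```

With one socket the two come out equal, which matches the single-CPU point above. The gap only opens up as sockets are added.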
C - the footgun of programming languages
Wow - that's a *LOT* of Tommy Lee Joneses and Will Smiths!
:-)
Looking past the joke, for anyone who may be wondering why that 'i' is there, they're just being accurate. "MiB" is the abbreviation for "mebibyte", which is 2^20 bytes. The more "common" notation, "MB", is the abbreviation for "megabyte", which is 10^6 bytes.
The terms "gibibyte", "mebibyte", "kibibyte", etc. were defined in 1998 by the IEC to disambiguate "megabyte", etc. The "giga", "mega", "kilo" prefixes from the SI units have always referred to powers of 10. With the advent of computers, it became convenient to use them to refer to powers of two that are close to powers of 10. So, "kilo" was used to mean 1024, "mega" was used to mean 1048576 and "giga" was used for 1073741824. The context was generally sufficient to disambiguate those usages from the standard powers-of-ten usages. Basically, everyone figured that if you were talking about computers, the prefixes referred to powers of two.
But there are plenty of computer-related contexts where the prefixes have their traditional meanings. Hard disk drive storage sizes, for example, are measured with powers of 10 by drive manufacturers, but file systems generally use binary prefixes. This is why your 80GB drive shows up as only 74.5GB "formatted". It's not that lots of space is wasted by the formatting; the issue is that 80*10^9/2^30=74.5. The two measurements are using different units. Data rates are also traditionally specified in powers of 10. RAM sizes are powers of two.
So, to disambiguate the prefixes while not disturbing the traditional meanings, the IEC coined a new set of binary prefixes, along with corresponding abbreviations. The new prefixes all end in "bi", for "binary".
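The conversions are easy to check for yourself. A minimal sketch follows; the helper names are just for illustration.

```python
GB = 10**9     # gigabyte: SI prefix, power of ten
GIB = 2**30    # gibibyte: IEC binary prefix
MB = 10**6     # megabyte
MIB = 2**20    # mebibyte


def gb_to_gib(gigabytes):
    """Convert a drive maker's decimal GB into binary GiB."""
    return gigabytes * GB / GIB


def mib_to_mb(mebibytes):
    """Convert binary MiB into decimal MB."""
    return mebibytes * MIB / MB


print(f"80 GB drive = {gb_to_gib(80):.1f} GiB")       # 74.5
print(f"266 MiB/sec = {mib_to_mb(266):.1f} MB/sec")   # 278.9
```

The second line is the 266MiB/sec link from the article summary expressed in decimal megabytes.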
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
Except the Alpha was a RISC processor (and a pretty clean one at that), so its short pipelines didn't lose as much performance to branch mispredictions as the P4/Netburst does. IIRC, both the P4 and Athlon CPUs had to get up to around 1.4-1.5GHz before they beat the performance of the 800MHz 21264, the last and fastest Alpha produced. *sigh*
Your Xeon system with the SCSI disks is hugely faster doing DBMS than the system with the SATA drive in large part (probably larger than the other reasons you've listed, although those do matter) because DBMSs tend to throw a heck of a lot of disk IO commands at the disk subsystem all at once. The SCSI disks and their controller are simply better able to handle the barrage. I'll bet that a test with the drive subsystems reversed shows that while the Xeons are still faster, the P4 is only somewhat behind, not waaaay behind.