Dual Pentium III Xeon Review
Sander Sassen writes: "Intel has recently released its new line of Pentium III Xeon CPUs, based on their new .18 micron process. HardwareCentral takes a look at its performance, utilizing a dual CPU configuration on an Intel i840 platform with 256 MB of Rambus memory as a testbed. This Dual Pentium III Xeon review has all the details of their findings."
"Multiprocessing won't be available until the AMD 760 chipset"
You're wrong: the 760 chipset only supports DDR RAM and a 266 MHz FSB (133 MHz DDR). It's the 770 chipset that's gonna support SMP...
Ya gotta love it. buy a $69k 8 proc xeon now, and they'll throw in a free palm pilot... !
That's why it's "good stuff".
- A.P.
--
"One World, one Web, one Program" - Microsoft promotional ad
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
Only one meeeeelyun dollars.
It's me, embobo. Please prevent the use (utilization) of the word "utilizing" for the rest of eternity. I'll give you a crucifix-shaped cookie if you do. Don't make me unleash my core competencies or my skill set upon you.
"Revenge of the year" award
For the last couple of years, AMD has been fighting Chipzilla's cheapest Celerons and eating the scraps left over from Intel's big Pentium III lunch. Well, today is the day for revenge.
The first trick was to drag Intel into a crazy race for the first GHz CPU. With its complicated two-year design cycle and expensive factory-replication pattern, Chipzilla is too big to move fast. Playing by AMD's rules, it lost much of its production capacity to bad yields. Having announced a 1 GHz chip, Intel is in practice shipping 850 MHz parts at best, and only 550-800 MHz parts in volume. So the whole 800 MHz to 1 GHz market is under AMD's control. And now comes AMD's next play: price reductions. So the most powerful Intel chip will compete... against a $324 800 MHz AMD part? The whole $430 billion company is moving into the sub-$350 market? Very bloody...
And what was Intel's game? All last year it was drunk-dancing with Rambus, screwing up one mobo chipset after another. Today's earnings report shows Intel's revenues falling. Maybe it's a good time to open another bottle?
Andrew
Thanx to Hardware Central for yet another breathless description of the latest, greatest data point corroborating a trend I have been tracking since the original Pentium: Performance tracks clock speed!
All of the extra transistors that Intel keeps packing onto the chip accomplish nothing more than to compensate for the various nonlinear elements in the system (eg, RAM and HD). The data I have collected over the past several years (back to the original Pentium) show that (as far as benchmarks are concerned) Intel's newest Pentium architecture is no more efficient (in CPI terms) than its oldest.
In fact, Hardware Central's own benchmark shows that the new PIII Xeon is the least efficient performer in their group, despite having the highest overall performance and the snazziest new architecture. Viz:
CPU -------- Perf rating points / aggregate MHz
PIII Xeon -- 2.6
PIII 500 --- 2.7
PPro 200 --- 2.7
Athlon ----- 2.8
Cel 366 ---- 2.7
They could achieve the same performance shown by this dual 666 PIII by pushing an original P60 up to 1332 MHz (assuming ideal scaling, which you essentially get with synthetic benchmarks).
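For anyone who wants to check the arithmetic, here is a rough sketch in Python. The points-per-MHz figures are copied from the table above; the function names are made up for illustration, and ideal linear scaling with clock is the assumption stated in the post, not a measured result.

```python
# Points-per-MHz figures copied from the table in the post.
POINTS_PER_MHZ = {
    "PIII Xeon": 2.6,
    "PIII 500": 2.7,
    "PPro 200": 2.7,
    "Athlon": 2.8,
    "Cel 366": 2.7,
}


def aggregate_mhz(cpus, clock_mhz):
    """Aggregate clock of a multiprocessor box: CPU count times clock."""
    return cpus * clock_mhz


def equivalent_clock_mhz(target_mhz, target_ppm, other_ppm):
    """Clock another CPU would need to match the target's rating,
    assuming performance scales linearly with clock (hypothetical)."""
    return target_mhz * target_ppm / other_ppm


dual_xeon_mhz = aggregate_mhz(2, 666)  # the "dual 666" from the post
print("aggregate MHz:", dual_xeon_mhz)

# With equal points per MHz, the equivalent clock is just the aggregate clock.
print("equal-efficiency clock:", equivalent_clock_mhz(dual_xeon_mhz, 2.6, 2.6))

# The entry with the lowest points per MHz in the table.
least_efficient = min(POINTS_PER_MHZ, key=POINTS_PER_MHZ.get)
print("least efficient:", least_efficient)
```

If the older chip is given the 2.7 points per MHz of the PPro or Celeron rows instead, the equivalent clock comes out slightly lower than the aggregate figure, which is the poster's point.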
Every time I read a review wherein the writer wets himself over the "blistering speed" of Intel's newest architecture, I plot another data point on my straight-line graph, shake my head, and mutter a curse about the quality of technical education in America.
Oh yes! There is something to look forward to - the return of EMM386! (groan).
But they will have to call it EMM_IA64 - the new, old way for applications and the OS to get past that 1MB^H^H 4GB barrier.
The more things change, the more they stay the same.
Torrey Hoffman (Azog)
"HTML needs a rant tag" - Alan Cox
This is the kind of processor everybody wants to have, but no one wants to pay for.
-henrik
The PowerPC 750 runs two 32 k 8-way set assoc L1 caches (data and instruction), and a 64 entry, 16 set, 4-way set assoc branch target instruction cache.
The L2 control provided by the 750 is implemented with an on-chip 2-way set assoc. tag memory.
Of course, you can always use a 740 and provide your own L2 cache control... and I don't know exactly what Apple does, but the L1 remains the same.
"It's tough to be bilingual when you get hit in the head."
Well, one should be able to glom some of the code from a dual-alpha implementation. The bus is of that design, and the arbitration (not really a bus, it's supposed to be a switch) is ~the same. Not an exact copy, of course, but it could be a good lead for some of the framework...
"It's tough to be bilingual when you get hit in the head."
Actually what 'nester' is saying is true: most modern 'RISC' machines have 'complex' operations as well, so it doesn't usually take more 'instructions' to perform the same operation. The x86, on the other hand, has single-byte complex instructions such as MOVSD, which does a memory-to-memory copy (there isn't a mov mem,mem) with the 1010010w opcode. The PPC also has string move instructions, although I don't have my PPC reference handy to list the opcode, but they take up 32 bits instead of 8. Actually the MOVSD takes up 2 bytes because you almost always want the REP prefix. This is one of the problems with decoding the stream: the instructions are variable length. On the other hand, the complex instructions in most 'RISC' architectures are going to cause those architectures to need translation layers to more simplified micro-RISC OOO engines for a large number of their complex instructions as their clock speeds scale.
The code size thing is just an obfuscation of the real issue, which is that the optimal trade-off between cache size, associativity, and latency is a function of the application being used. An application like Quake or Word, which can hold the vast majority of the active working set of its data structures and primary code paths in less than 256k, doesn't need more than that, so any speed improvements you get in cache access times pay off big. Note, on the other hand, that Intel went from a 512k two-way set associative cache to a 256k 4-way set associative one, versus maybe a 1 meg direct mapped. In theory all three should provide roughly the same hit rate. A workstation or server, on the other hand, has an entirely different cache footprint. Intel engineers are smart; Intel marketing is not. Remember the PPro vs PII issue? The PPro supported large memory sizes, fast caches, and >2-way SMP; the original PII did not. So until the Xeon came out there was a small group of high-end server/workstation customers who were pissed off because they couldn't get newer processors from Intel to replace their servers.
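The working-set argument can be played with in a toy simulator. This is only a sketch: the 32-byte line size, the LRU replacement policy, and the linear sweep access pattern are my assumptions for illustration, and real hit rates depend heavily on the actual access pattern.

```python
from collections import OrderedDict

KB = 1024


class SetAssociativeCache:
    """Toy LRU set-associative cache; it only counts hits and misses."""

    def __init__(self, size_bytes, ways, line_bytes=32):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        # One LRU-ordered dict of tags per set.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        line = address // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)         # mark as most recently used
            self.hits += 1
        else:
            if len(entries) >= self.ways:
                entries.popitem(last=False)  # evict least recently used
            entries[tag] = True
            self.misses += 1


def hit_rate(cache, working_set_bytes, passes=4):
    """Sweep a contiguous working set several times; return the hit rate."""
    for _ in range(passes):
        for addr in range(0, working_set_bytes, cache.line_bytes):
            cache.access(addr)
    return cache.hits / (cache.hits + cache.misses)


configs = {
    "512K 2-way": (512 * KB, 2),
    "256K 4-way": (256 * KB, 4),
    "1M direct": (1024 * KB, 1),
}
for name, (size, ways) in configs.items():
    small = hit_rate(SetAssociativeCache(size, ways), 192 * KB)
    large = hit_rate(SetAssociativeCache(size, ways), 2048 * KB)
    print(f"{name}: 192K working set {small:.2f}, 2M working set {large:.2f}")
```

With a working set that fits, all three configurations miss only on the first pass; with a cyclic sweep larger than the cache, LRU evicts every line before it is reused. A repeated linear sweep is the worst case for LRU, so treat the second number as a floor rather than a prediction.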
Motorola caches are a funny thing. They don't offer on-die L2 (that I've seen); instead they offer on-die tagging. The L2 tagging is 2-way set associative and supports 512k, 1M and 2M external caches. This actually sounds more like a case of Motorola trying to squeeze all the performance out of an old CPU core while still keeping it cheap, with the idea that all the performance numbers will be listed with 2 megs of external L2 (to make up for the fact that the latency is going to suck, being an external cache) while everyone will sell cheap little 512k versions.
The only way I can see a virus trying to use it is as a method of hiding the bulk of its code.
...
Reminds me of the first polymorphic pseudovirii with the code segments that appeared benign, but assembled into final executables all with different signatures to avoid detection.
The main problem with using the chip is that only some of the potential targets will be infectable. This might be cool if it's a viral spreader, which seeds target systems via HTTP or other requests. Use one method to deliver the code to the big servers, then use the servers to deliver the target virii to get all the victims. That way the server can avoid having the code that it delivers, since it just needs to pass on the request to deliver the final package. You could even set up a multi-tier distributed approach, with host virii, server virii, and delivery virii - all of which can adapt to different OS and anti-viral protections by using different delivery methods and different trigger events suited to the target OS.
One ring to rule them all
Three rings to serve them all
Five rings to infect the different OS
Seven rings to make Bill G's day
Just a thought
Will in Seattle
I have one word to add - Thunderbird.
I have one word to add - Vaporware.
Agreed. You could just go to the chip PR websites and get this information.
What would be news is if some superfast chip, or one modified to be fast, were produced cheaply. Like the earlier Celerons were.
The surprise isn't how often we make bad choices; the surprise is how seldom they defeat us.
I read somewhere before that a dual Athlon would be possible. Does anyone know whether any mainboards exist that will support this?
It's a learned skill, that unfortunately I lack. It would have been extremely useful yesterday when I installed a program that, when run, turns your numlock on and hitting the numlock key won't turn it off. I didn't realize it was the program's fault, and was screwing around trying to fix the problem. On the laptop, the number pad is on top of the regular letters. Needless to say, my ICQ friends got responses that made no sense, and if I knew how to write like a script kiddie, I would've been able to get my point across.
(obligatory slashdot commentary.)
This still reduces x86 code size. Compare the following memory-to-memory copy (pseudo-assembly: real x86 has no mov mem,mem form):
mov [eax], [ebx]
With:
mov ecx, [ebx]
mov [eax], ecx
The RISC-like (second set) code is twice as long as the CISC-like (first example) code. Granted, execution time is about the same (both processors do the same thing internally), but the code to represent that operation is smaller.
I'm not talking about the pros and cons of CISC vs. RISC internal to the processor (RISC wins hands down), I'm just saying that usually CISC requires fewer instructions to do the same thing. Although, you're right about variable length instructions helping code size and being a pain in the ass internal to the processor.
Dan
...I'll just have to settle for a dual-celeron...
BlackNova Traders
This is pseudo on-topic, but aren't we moving away from these huge monolithic boxes and into more distributed environments? I suppose that isn't good for Intel's business...
Trolls, it must be cool to be that bored.
Alphas do.
Look at their
It shows that the PIII-1000 is 1.5 times faster than the dual Xeon-667 on both CPU tests (CPU/FPU and multimedia), but loses on the memory speed test because of Rambu$.
Tigers respect lions, elephants and hippos. Maggots respect no one. (C) S. Dovlatov
--See now this is what I am talking about!
--I read this post and now my head hurts so much I want to get on a subway and kill people.
--I had a miron once I rubbed some Lambda oil on it and it went away
--OW! The pain, my eyes bleed...
This
How come the obligatory Beowulf Cluster remark isn't out yet?
give me all your garmonbozia
While a dual Xeon system does get closer to the price of SGI, Sun, etc., it is still significantly cheaper. A few months ago I priced Xeon systems vs SGIs for some engineering work. The Xeons came in much cheaper, although we still went with the SGI for better performance. It is still a price/performance game, and always will be.
Intel's web site is at www.intel.com
Late.
Soy el plátano! No tengo gusto de monos!
Just fleecing their corporate customers?
Yes. Typical decision maker has about 1% of the knowledge of microprocessor architecture that anyone posting on this thread has, and about 0.1% available time to think about their decision. So the way they solve the problem: big numbers = good. Note: I said typical, not every.
Don't post on slashdot. Get back to work.
I doubt you'll be able to execute code from the scratch area.
The only way I can see a virus trying to use it is as a method of hiding the bulk of its code. Just think of a little micro-virus that is even harder to detect than regular viruses (because it is so small it doesn't have much of a signature) and works by loading itself from the EEPROM when it is executed. This might be viable if the virus detection companies don't think to check accesses to memory the same way they check accesses to the hard drive(s). Of course it would be rather difficult to spread this virus, as everyone would need a brand new ultra-expensive processor in their computers, and the people who tend to buy these things tend to know how to avoid getting viruses.
I read the internet for the articles.
You think I'm kidding...:-)
At a guess, anyone that needs to use a Xeon as opposed to a regular PIII.
"The invisible and the non-existent look very much alike." -- Delos B. McKown
If you read the full article, it says the processor is only $50 to $100 more than its Slot 1/Socket 370 counterpart. The big difference is the management functions that are part of the processor housing, such as temperature monitoring, 2 EEPROMs, and so on... Given the size of the cartridge and the metal back plate, I would imagine that it also cools better, and for a server that is good.
Another issue is the whole Slot 2 thing. A lot of the i840 motherboards that are in production or planned are Slot 2, making this processor necessary if you don't want to use a 550 MHz processor.
"... That probably would have sounded more commanding if I wasn't wearing my yummy sushi pajamas..."
-Buffy Summers
Goodbye Iowa
Well, maybe EMS wasn't as ugly as some other hacks, but it messed up the programs. I never used it on anything less than a 386, and when I got a compiler that could do protected-mode DOS programs, I promptly forgot all about it. And then I installed Linux instead, and got a flat address space (no need to keep the structures below 64K in size - Yay!).
And yes, XMS was hard to use, but when that's what you had... Again, flat address space rocks :)
And I can also fill you in on the HMA stuff. The extra memory space comes from the segmented memory addressing, with a 16-bit segment and 16-bit offset. The true address is calculated by (segment << 4) + offset, which in general creates a 20-bit address, capable of addressing one meg. However, note the overlapping parts of the segment and offset - if you put the value 0xffff in the segment, and anything greater than 0x000f in the offset, you will overflow the address space on a 20-bit bus, and wrap back to low addresses. What they did was to disable the address wrapping / aliasing, and instead merrily continue up into high memory. Wrap or no wrap was selectable by the "A20 enable" thingy (don't remember exactly how it worked).
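The address calculation is easy to try out. Here is a minimal Python sketch of the formula described above; the function name and the a20_enabled flag are just illustrative, and the wrap is modelled as a simple 20-bit mask.

```python
def real_mode_address(segment, offset, a20_enabled):
    """Physical address for a real-mode segment:offset pair."""
    linear = (segment << 4) + offset   # can be up to 21 bits wide
    if not a20_enabled:
        linear &= 0xFFFFF              # only 20 address lines: wrap past 1 MB
    return linear


# Highest address reachable without overflow: the last byte below 1 MB.
print(hex(real_mode_address(0xFFFF, 0x000F, a20_enabled=False)))

# One byte further wraps around to address 0 when A20 is off...
print(hex(real_mode_address(0xFFFF, 0x0010, a20_enabled=False)))

# ...but lands on the first byte above 1 MB (start of the HMA) when A20 is on.
print(hex(real_mode_address(0xFFFF, 0x0010, a20_enabled=True)))

# Top of the HMA, and the resulting real-mode address space in KB.
top = real_mode_address(0xFFFF, 0xFFFF, a20_enabled=True)
print(hex(top), (top + 1) / 1024)
```

The last line is where the "1088k" figure comes from: the reachable space ends 16 bytes short of 1 MB + 64 KB.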
CISC instructions (fwik) rarely do more than comparable RISC ops. example: x86 mov will do register load/stores and mem->mem copies, but it can't do all at once. the reason x86 code is smaller is due to its variable opcode length. the fact that x86 ops can each do more than one job is just namespace (opcode) compression, and it necessitates microcode and complicates pipelining.
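To put numbers on the variable-length point, here is a small Python sketch that just counts encoded bytes. The x86 byte values are hand-assembled 32-bit encodings; the 4-byte width stands in for a generic fixed-width RISC such as PowerPC, and the helper names are made up for illustration.

```python
# Hand-assembled 32-bit x86 encodings for a load/store copy through ECX.
X86_COPY = [
    ("mov ecx, [ebx]", bytes([0x8B, 0x0B])),  # opcode 8B /r, ModRM 0x0B
    ("mov [eax], ecx", bytes([0x89, 0x08])),  # opcode 89 /r, ModRM 0x08
]

# String copy: REP prefix (F3) plus MOVSD (A5, the 1010010w pattern with w=1).
X86_STRING_COPY = [("rep movsd", bytes([0xF3, 0xA5]))]

FIXED_RISC_WIDTH = 4  # bytes per instruction on a fixed 32-bit ISA


def code_size(sequence):
    """Total encoded size in bytes of a list of (mnemonic, bytes) pairs."""
    return sum(len(encoding) for _, encoding in sequence)


def fixed_width_size(instruction_count):
    """Size of the same number of instructions on a fixed-width ISA."""
    return instruction_count * FIXED_RISC_WIDTH


print("x86 load + store:", code_size(X86_COPY), "bytes")
print("fixed-width load + store:", fixed_width_size(len(X86_COPY)), "bytes")
print("x86 rep movsd:", code_size(X86_STRING_COPY), "bytes")
```

Same instruction count for the load/store pair, half the bytes: the saving comes from the encoding, not from the instructions doing more work.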
With 256KB RAM, this is clearly intended for the Workstation Market.
:p
Geez, nice workstation!
My nintendo has more than that!
What else are they going to do? They are working in other areas: advanced architecture (Willamette, Itanium) and MHz (0.18 process). If increasing the cache also gets them a speed boost, however small, they'll do that too. You will always have a set of customers that are screaming for any amount of speed, regardless of the cost. Xeon is for them.
In addition, I'm sure it doesn't hurt them when comparing against Sparc, Alpha, PowerPC, etc. These all have a ton of cache (8MB!! in the case of high-end Sparcs) and, as we have discussed, more cache implies more speed to most people.
once you look at xeons you get into the price range of the good stuff (IBM, Alpha, SUN etc.)
While true, the "good stuff" doesn't run NT.
Warning: long technical post ahead.
It wasn't an ugly hack at all. It was a way of sending 64k "frames" back/forth over the ISA bus to an add-on memory board. It was SLOW, but not ugly. It supported 32 MB (an arbitrary number IIRC) of RAM on a machine with a 20-bit address space (8088/8086).
Anyway, 80286 had 24-bit addressing (16 MB) out of the box but few early motherboards supported it (although my PS/2 model 30 is expandable to 16 MB), and besides the cards were a lot cheaper than SIMMs. Plus you could of course only access the full 16 MB from the 286's broken "protected mode".
EMS was also easy to use from Real Mode. Of course XMS could be used too but it was a lot less clean than the EMS API (software int 0x67, IIRC).
Do not confuse EMS with EMM386. EMM386 emulated "hardware" EMS by providing a layer over top of XMS. It wasn't nearly as slow, and used the same clean API.
However, circa DOS 5.0, people stopped using EMM386 for EMS. Since EMM386 needed to fake its own adapter space (the original adapter boards used an address like 0xE0000 for the 64k "frames"), it contained code to support UMBs (Upper Memory Blocks). You could use UMBs to allocate unused memory in the adapter space between 640k and 1024k in Real Mode.
I would consider UMBs an ugly hack (remember MemMaker? LoadHigh? DeviceHigh? InstallHigh?). They're still in use by Win95/98/ME to hold things like the Real Mode mouse driver etc.
HMA was the ultimate ugly hack. That upped the Real Mode address space from 1024k to 1088k on the 80286 and higher. I have no idea how it works, but you may notice that your Win9x virus scanner looks at a full 1088k of "conventional memory" during boot.
Hands in my pocket
Intel has Physical Address Extensions for 36-bit addressing for a limit of 64GB. Win2k supports this in their Server version. I believe Linux support is done or coming soon. I also think SCO supports it. I have no idea about Solaris and various BSDs. Over at Unisys we have a monster system called ES7000 that supports 64GB. It also supports 32 processors and 96 PCI slots.
IA-64, AKA Merced, AKA Itanium supports full 64-bit addressing for a whopping 16 EB (exabytes!). Microsoft currently claims that Win64 will only support up to 64 TB, although that may only be in Data Center Server. Anybody know what the other IA-64 projects will support?
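The limits quoted above fall straight out of the address widths. A quick Python sketch (the helper name and the labels are mine, for illustration; these are ceilings from the bit width only, not what any particular OS or chipset actually supports):

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]


def addressable(bits):
    """Human-readable size of a flat address space with the given width."""
    size = 1 << bits              # number of addressable bytes
    unit = 0
    while size >= 1024 and unit < len(UNITS) - 1:
        size //= 1024             # exact here: every size is a power of two
        unit += 1
    return f"{size} {UNITS[unit]}"


for name, bits in [
    ("8086 real mode", 20),
    ("80286", 24),
    ("32-bit flat", 32),
    ("PAE", 36),
    ("64-bit flat", 64),
]:
    print(f"{name:15} {bits:2}-bit -> {addressable(bits)}")
```

By the same arithmetic, the 64 TB figure corresponds to 46 address bits.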
-- soldack
Yea, but I doubt management features will carry the load of a large web server. I'm saying that if they intend to put this in the normal Xeon market, then they need more cache, or else people are going to continue to use the older proc, or they will get whopped by AMD and its 8meg cache Athlon.
A deep unwavering belief is a sure sign you're missing something...
This is probably because the PIII/Aluminummine processors use a fully associative cache, instead of a 4-way or an 8-way associative cache. This basically means that the cache is very good at optimizing itself for the most commonly accessed data, but also means that the cache doesn't scale as well to large amounts. I don't know what the cache association is on the G4 (I could go look, but I'm lazy, so I'll leave that for whoever isn't) but that probably has something to do with this issue.
I have one word to add - Thunderbird. Full speed cache in good amounts, for about 10% of the price of a Xeon. Also, faster clockspeeds (somewhat relevant in this case) - the Thunderbird 1250 should be out in the next couple of months. Multiprocessing won't be available until the AMD 760 chipset comes out later this year, but if you can wait, I think the AMD Thunderbird is going to be a good choice over the Xeon.
--- "So THAT's what an invisible barrier looks like!" - Time Bandits
Re:Two Words For Ya! (Score:3)
by garver on 10:16 AM April 18th, 2000 EST (#36)
As always, when looking at cache, you compare bang for buck. Adding cache costs money, lots of money sometimes. Some processor architectures get more mileage out of added cache than others.
For example, the G4 seems to love cache and screams faster and faster as you add it. Apple/Motorola have found the 1MB cache level to be their sweetspot, most bang for buck. On the other hand, the PIII is not as cache loving. Giving it another 0.75MB doesn't do it all that much good, so why waste the money? Their sweetspot seems to be 0.25MB.
Then why dump so much more onto their high end systems if the performance increase is negligible?
Just fleecing their corporate customers?
I still feel that 1 MB of cache would do a bit towards increasing the performance of x86 processor-based machines.
Kintanon
Check out JoshJitsu.info for Brazilian Ji
Not impressed.
Yippy skip, for 6K$ extra I can drop another 1.75 megs of fullspeed cache on a processor. Gee, big surprise, that increases the performance, whoulda thunkit?!
What I want is for the X86 processor makers to catch up with Motorola and put 1m of full speed cache on their regular processors. I have a hard time finding a processor with 512K of cache, WTF is the problem here?
This is just IBM slapping the market around to try and increase their profits without actually giving us anything new.
Kintanon
Check out JoshJitsu.info for Brazilian Ji
Way to read the article you're commenting on.
"The Iwill DCA-200 motherboard is a pinnacle of stability and performance, but doesn't come cheap. The same applies to the 256 MB of ECC PC800 RDRAM; it is fast, very fast even, we've never seen memory scores this high, but will set you back considerably."
AND
"RDRAM finally showed some of its muscle here, with the highest memory throughput we've ever seen on any memory architecture. The dual RDRAM channels on the i840 chipset really show off its benefits and low latency."
Say it with the group "low latency"
Oh, and about Tom - when you have cancer, do you go to a systems engineer? Then why do you go to an MD for your tech information? Pabst has had some good info over the past several years, but he's also had some very questionable conclusions, and he has been getting more, shall we say, touchy, since the video benchmark fiasco.
THE YEAR WAS 2081, and everybody was finally equal...
I'll agree that it isn't the most useful set of benchmarks, but I disagree with your server only comment.
The Xeon has been marketed as a Workstation/Server chip, and has seen its way into the SGI NT workstations, etc. With 256KB RAM, this is clearly intended for the Workstation Market. The wider cache bus and the new motherboard are nice additions for the workstation market, but I think that the server market would give up some MHz for the larger caches of the "old" Xeons or get the "new" Xeon in the 512KB, 1M, or 2M (if available?) versions. I mean, 256KB of L2 cache is going to be useless in a large database server, as you'll never make a cache hit, while a larger cache is useful if most of the accesses are within a general range.
However, I agree that this was mostly a stupid review. Testing it against obviously inferior hardware wasn't interesting. I mean, testing it against dual 800 MHz P3s or 1GHz P3s would give an understanding as to what the new cache system does. Testing it against processors from the same family at 2/3s the speed and shouting, wow, it's fast, is kinda silly.
Alex
You are right, 256 MB is a little weak. My personal computer has 384 MB of RAM... The motherboard they used for the test was an Iwill DCA200. This board will support up to 2 GB of RAM. I think the reason that they only used 256 MB was because that much RDRAM memory runs about $1,100. Penguin Computing has an 8-way Xeon system that will support up to 32 GB of ECC SDRAM memory. I am sure there are other x86 based machines like this, but I don't know of any offhand.
.18u my rosy red arse! For those of you not in the know, the .18u measure is the smallest feature measure, or the Lambda of the chip. Every other dimension on the chip is a multiple of that number. It is the distance across the gate of a transistor from source to drain. Now, when they bake the chips, that distance shortens by a few mirons. Unfortunately, the marketing dept. got wind of this and took off with it. Now, they measure the shortest distance from source to drain right near the gate, because the further from the gate the measurement is taken, the wider the gap is. (Sort of a curve...) So in reality, those .18u chips are actually .20u or .21u. It doesn't sound like much, but when you're talking about millions and millions of transistors, that's a lot of space. (But probably still no more than the head of a pin.)
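Taking the post's figures at face value, the "lot of space" can be roughed out in Python. This sketch assumes, as the post says, that every other dimension scales with the feature size, so area goes with the square of the ratio; the function name is made up for illustration.

```python
def area_growth(nominal_um, actual_um):
    """Factor by which die area grows if every dimension scales with
    the feature size (the post's assumption, not a measured figure)."""
    return (actual_um / nominal_um) ** 2


for actual in (0.20, 0.21):
    growth = area_growth(0.18, actual)
    print(f"0.18u that is really {actual:.2f}u: {100 * (growth - 1):.0f}% more area")
```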
"I threw up my hands in disgust and wondered if it had been such a good idea to have eaten my hands in the first place."
--I was going to go for the quad setup but I found that two asbestos leg protectors was cost prohibitive.
--these are the same a Celerons right?
This
They just want to give an idea of raw processor performance. What you are asking for (and I agree with you that an Oracle benchmark would be much more significant to most of us) is a benchmark measuring overall system performance and no longer just CPU performance. So it may not be possible to claim significant performance improvements from such a benchmark, since the result will not depend on CPU performance alone, but rather on the complete disk subsystem performance, memory performance, database tuning, etc.
Bottom line: You are always on your own when the time comes to figure out performance in real life situations.
who gives a shit about Dhry stone and Whet stone? i want the Q3Arena benchmarks. mp3s prOn and Q3 are the only thing i use a computer for...
That said, there is a great site that compares the servers and databases you mention, and will likely give you the stats you are looking for. It's www.tpc.org.
No, Thursday's out. How about never - is never good for you?
Anyone know how much RAM you can put into one of these? They tested the system with 256 MB, which is a spit in the ocean for high-end systems nowadays (well, I might be exaggerating a bit...). I think it might be possible to use more than 4GB physical mem by some page table magic, but the per-process limit might be restricted to 4GB... Wait, maybe not - anyone remember LIM EMS? (*) Although, that is very ugly indeed.
As I see it, this is what they have to solve, and solve it pretty quick, if they want to continue selling 32-bit processors. Today, there are lots of people running their programs on supercomputers, only because of the large memory, not because they need the processing power. It would be possible to save millions if the high-end PC class desktop systems could be fitted with, say, 24GB mem.
But the built-in EEPROM was cool, I wonder if you can trick it into using that for booting, a la Sun's OpenBoot prom...? One can always dream.
(*) To those who are too young to remember, EMS was an ugly hack by Lotus, Intel and Microsoft to be able to use more than 1 MB of memory on the 8086 / 80286.
From the article:
Has anyone considered that this could be used to store virii? It'd be a pain - but if manufacturers can use it to keep info about usage data, no doubt it's re-writable.
Just a tiny thought.
Whatever you do... don't read this.
So while the test may have been somewhat entertaining, it is completely useless. The benchmark isn't anything I recognise as an accurate simulation of a server environment and there are no real life tests. Show me a test comparing this to a Sun box running Oracle and 500GB of data and I might be interested.
Is it just me, or does Intel's new "use one die" for everything seem to have gotten them into a little trouble? I read the article, look for how exactly the new Xeon is different from a Coppermine PIII. Isn't the whole point of a Xeon the large full speed L2 cache? With the PIII having a 256K full speed cache, isn't a 256K Xeon, well, redundant? I do hope there are 2 meg integrated Xeons coming soon, because otherwise, you pay more for almost exactly the same processor.
A deep unwavering belief is a sure sign you're missing something...
Tom's Hardware Guide just had an article which convinced me to stick with SDRAM for quite some time to come. Maybe for highly memory intensive long processes RDRAM is worth it, but how many of us will find that worthwhile?
The power of accurate observation is commonly called cynicism by those who have not got it. - G.B. Shaw