Please people.
...
Stop using "Distributed.net" to compare microprocessor performance. It's a highly skewed benchmark that really only tests the speed of the "Rotate" instruction (which is on the critical path of the program).
AltiVec supplies a data-parallel version of the Rotate instruction, so processors with AltiVec can do many rotates in parallel, which is why a G4 will beat anything else. No other processors have this data-parallel instruction because it is completely useless, with the rare exception of this app. That is to say, the designers of the other ISAs (x86, SPARC, MIPS, POWER, etc.) evidently felt that adding it would be a complete waste of die area and power, since none of them supports it.
Distributed.net:
1) Does not test branch predictors, because it's a simple loop that is easily predicted by even the most trivial predictors.
2) Does not test the internal L1/L2 cache hierarchy, because all of the data fits in the L1 of most processors.
3) Does not test the memory system (DRAM/front-side bus/memory controller) because, as mentioned in #2, all of the data fits in the L1 cache.
4) Does not test instruction cache performance, because all of the code fits in the L1 instruction cache.
Stop using it to compare general-purpose computer performance. It is only important if the only app you care about is distributed.net.
Your Athlon 1600 will spank the G4 at most everything else.
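To make the "rotate is on the critical path" point concrete, here is a minimal sketch, assuming the workload in question is the RC5 key-search client. The round structure below is the standard RC5 one, but the subkey table is an unkeyed placeholder (a real client derives it from each candidate key), and the function names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (only the low 5 bits of n count)."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32

def rotr32(x, n):
    """Rotate a 32-bit word right by n bits."""
    n &= 31
    return ((x >> n) | (x << (32 - n))) & MASK32

def rc5_encrypt_block(a, b, s, rounds=12):
    """Encrypt one 64-bit block held as two 32-bit halves."""
    a = (a + s[0]) & MASK32
    b = (b + s[1]) & MASK32
    for i in range(1, rounds + 1):
        # Each rotate amount depends on the previous result, so the rotates
        # form one long serial dependency chain: the critical path.
        a = (rotl32(a ^ b, b) + s[2 * i]) & MASK32
        b = (rotl32(b ^ a, a) + s[2 * i + 1]) & MASK32
    return a, b

def rc5_decrypt_block(a, b, s, rounds=12):
    """Inverse of rc5_encrypt_block, used here only as a sanity check."""
    for i in range(rounds, 0, -1):
        b = rotr32((b - s[2 * i + 1]) & MASK32, a) ^ a
        a = rotr32((a - s[2 * i]) & MASK32, b) ^ b
    return (a - s[0]) & MASK32, (b - s[1]) & MASK32

# Placeholder subkeys: the RC5 magic constants before any key is mixed in.
# 26 words of 4 bytes each is about 100 bytes of data, nowhere near an L1 limit.
SUBKEYS = [(0xB7E15163 + i * 0x9E3779B9) & MASK32 for i in range(26)]

ciphertext = rc5_encrypt_block(0x01234567, 0x89ABCDEF, SUBKEYS)
plaintext = rc5_decrypt_block(ciphertext[0], ciphertext[1], SUBKEYS)
```

The loop has a fixed trip count, touches a tiny table, and does nothing but XOR, rotate and add, which is the sense in which it exercises neither the branch predictor nor the memory hierarchy.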
Multiprocessor bus speeds and CPU frequencies always lag behind those of uniprocessor systems. It takes much longer to validate a multiprocessor board than a uniprocessor one. This is because the number of things that can go wrong goes up exponentially with the number of CPUs on the board. The typical customers of multiprocessor systems value this sort of reliability even more than performance.
The only reason that the POWER4 outperforms the PPC970 on SPECfp is that it has 128MB of L3 cache.
If you put that cache on the 970, you'll likely get a higher SPECfp score.
The 970 CPU core itself is very similar to the POWER4.
I agree that this is a good way for Sun to make money in the computer business. Solaris is better than Linux at certain things and that will differentiate Sun enough for customers to go with them.
Beware the Linux distributions that come out of Sun. It is in their interests to make Linux look bad compared with Solaris. They tried the same thing with x86 Solaris: they made it crappy in order to convince customers to switch their hardware from x86 to SPARC.
Hardware (servers in particular) is becoming more and more commodity-like as standard components work their way up the enterprise stack. Sun can't play there. They're too inefficient compared with a company like Dell that has much lower overhead - Dell has minimal inventory and just about 0 R&D cost. In a commodity market the leanest players win and Sun is a big fat pig.
Go with software and services. It works for IBM.
Another big problem with GPUs trying to do general-purpose code is the lack of demand-paged virtual memory. There are specific hard-wired data paths to memory.
That 128 bits is really 4 x 32-bit floating point (data-parallel SIMD).
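A small sketch of what "4 x 32-bit" means in practice: four single-precision floats packed side by side fill exactly one 128-bit register, and one "instruction" operates on all four lanes. The `simd_add` helper is hypothetical and only emulates the lane-wise behaviour in software.

```python
import struct

LANES_FORMAT = "<4f"  # four little-endian 32-bit floats, no padding

def pack_lanes(values):
    """Pack four floats into the 16 bytes a 128-bit register would hold."""
    return struct.pack(LANES_FORMAT, *values)

def simd_add(x, y):
    """Emulate one data-parallel add: each lane is added independently."""
    xs = struct.unpack(LANES_FORMAT, x)
    ys = struct.unpack(LANES_FORMAT, y)
    return struct.pack(LANES_FORMAT, *(a + b for a, b in zip(xs, ys)))

register = pack_lanes((1.0, 2.0, 3.0, 4.0))
register_bits = len(register) * 8
doubled = struct.unpack(LANES_FORMAT, simd_add(register, register))
```

So the "128-bit" label describes the register width, not the precision of any single value in it.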
The Alpha build was probably a debug version, which typically runs a lot slower and uses more memory (it's not optimized).
What makes you think the P4 is overpipelined?
--Intel CPU Architect (not speaking for Intel)
I work for a very large Fortune 500 company as an engineer. I have a Windows box that is in fact administered remotely. I get patches installed, I get software installed. The admin can take control of my computer remotely - I can see the pointer wiggling around as he goes into my machine and configures it. I also have a Linux box at work, but we all know how easy those are to administer remotely.
Have a look here
Larry's network computer argument was predicated on the assumption that PCs are expensive. They're not. They're cheap as hell. Price out Sun's network terminal vs. a cheap PC. Larry and other tech leaders did not foresee the speed and magnitude of the price reduction in PCs. If you didn't notice, PCs are now networked and can be administered remotely.
It wasn't so long ago that a reasonable PC cost $1500. Look at where they are now.
I'm also probably showing my age:
Bard's Tale
Impossible Mission
Ultima I, II, III, IV
Doom
Civilization
Sun has a vertical business model. This means that they control the design and production of practically all of the components that go into their systems, like the CPU and the OS.
Dell's business model is horizontal. They have a standard system spec (the PC), and they buy components to build a system from the most efficient component makers to deliver the final product.
For many years, Sun's business model worked because no one else could deliver their total systems solution.
Here is the problem that Sun now faces. Standardization is moving up the enterprise stack. What's the difference between an enterprise server and a PC? Lots, but at the end of the day, they're both just boxes made up of components. When component makers have the means to duplicate whatever is in the high end server, then a horizontal company like Dell can cherry-pick the best components from the component makers and build a high end server.
What components of the high end server are getting standardized? I would say that the CPU (will be x86) and the OS (will be either Linux or Microsoft). The rest will follow.
The horizontal model is way more efficient than the vertical model. You get to leverage the R&D of *many* multi-billion dollar component makers. This is what Sun has to compete with. Sun needs to make the best CPU, the best OS, the best chipset, etc. They have to invest billions to achieve this. Dell only has to buy the components and build the system, and invests practically 0 in R&D. Their system will always be cheaper.
Sun's business model will not work in the future. They'll need to suck it up and abandon SPARC/Solaris and go with x86/Linux and/or Microsoft.
Here's an analogy:
DRAM is produced by a handful of large Asian semiconductor companies. Why doesn't Sun develop their own DRAM? It's too expensive and DRAM is standardized.
CPUs are developed by a small number of large American companies (Intel, AMD). Why doesn't Dell develop their own CPUs? It's too expensive and CPUs are standardized (read - x86).
By the way, horizontal businesses don't mean the end of innovation in technology. Just because Dell doesn't invest in R&D doesn't mean that Intel or IBM or others won't. After all, if they don't innovate and make the best components, Dell will buy from someone else.
Don't you love the free market?
But if you need a large memory bandwidth, I think SPARC probably still beats out Itanium, and definitely beats x86.
That's where you're wrong. x86 and Itanium both beat Sun on memory bandwidth - and not by just a little - by a LOT. Take a look at supercomputing applications which require tons of memory bandwidth. You don't see supercomputers getting built out of SPARCs. They're Itanium, x86 or Power architectures.
If you need a whole shitload of CPUs in one box, Sparc is also a better architecture - even if Itanium can scale up to hundreds of processors, there's no OS that runs on it which can properly handle that many.
Now that Opteron is out in the wild, x86 can scale just as well as SPARC can (thanks to HyperTransport). All we need now is a system builder to build it, spec it out, make it a standard, and let Dell, IBM, HP and whoever else wants to (white box servers...) build them up and compete, just like the PC market. Do you think that Microsoft or the Linux people (IBM mostly) are standing still? It's only a matter of time before the OS gets the features needed to scale up. There's simply too much money involved, and too many fierce competitors, to let Sun have the high end market.
Standardization is moving up the enterprise stack. Systems are getting spec'ed out to use standard CPUs, standard components and standard software. This will kill Sun's vertical business model.
Well, for one thing, DVD media is still around 10x more expensive than CD media. There are also major quality control issues with DVD media.
The XBox hack is something a little different. People hacked it to run unsigned/foreign code on proprietary closed hardware. What people would do in the Apple case is try to run Apple software on open x86 systems.
IBM was the first to face this with the original IBM-PC, when some people reverse-engineered the BIOS and created the 'clones'.
I'm sure that others could clone an Apple-based system, but they would do so on pretty weak legal grounds in these days of the DMCA.
Apple will never port their software to PC-based x86 systems. What they *may* do is build a closed Apple motherboard/system with an x86 CPU. Let's face it - Motorola and IBM have fallen way behind what AMD and Intel have to offer in terms of raw CPU performance. AMD and Intel are pretty reliable suppliers.
"What is your greatest weakness?"
My favorite answer to this question is "kryptonite"
AMD is not dead.
Sun on the other hand has terminal cancer.
Now that AMD has a scalable server processor solution, system vendors will be quick to integrate it into server systems. Sun will become irrelevant.
One thing that Tom does not do is compare the Opteron to a Xeon with 1MB of cache. I'll bet those server numbers look way different if you compare those two beasts.
After all, the Opteron has twice the cache of the Xeon it's compared against (1MB vs 512k).
Normalize the cache size and memory bandwidths and then do a comparison to see how well the "core" CPU performs.
How would you propose to measure the speed of "just the CPU"? Performance is dependent on many factors like memory bandwidth/latency and compiler optimization. You can't just isolate the CPU unless you choose your benchmarks to fit entirely in the caches, which is obviously not at all realistic.
SPEC has been a respected benchmark used for 14 years - probably more since I can only recall SPEC89.
You might also notice that Tom is using version 5.1 of 3DSMax and Ace's is using 4.26, so you can't really compare one review against another like that.
3DSMax 5.1 has been optimized for the P4. The performance discrepancy does not come entirely from bandwidth differences. The compiler is critical to achieving high performance.
Remember that x86 is a variable-length instruction set and that PowerPC is a fixed-length RISC instruction set. The instructions in 64-bit PowerPC are still 32 bits wide, so you would see no code expansion. The instructions in x86-64 are larger because many of them need an extra prefix byte to select 64-bit operands or the additional registers. Now, the code won't be tremendously larger, but it will grow.
The other thing to worry about with 64-bit code is that all of your pointers now take up twice as much space, which could hurt certain applications' memory footprints. For example, the SPEC benchmark "mcf" does a lot of pointer chasing, and the 64-bit version of the code has twice the memory footprint of the 32-bit version.
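A rough illustration of the pointer-size effect, using a hypothetical linked-list node (two pointers plus one 32-bit payload). This is not mcf's actual data layout, just a sketch of how a pointer-heavy structure grows when the pointers double in width.

```python
import struct

# "<" selects fixed standard sizes with no padding, so the numbers do not
# depend on the machine running the script.
NODE_32BIT = "<IIi"  # two 4-byte pointers + one int32 payload
NODE_64BIT = "<QQi"  # two 8-byte pointers + one int32 payload

def footprint_bytes(node_format, node_count):
    """Total bytes for node_count nodes laid out back to back."""
    return struct.calcsize(node_format) * node_count

size32 = struct.calcsize(NODE_32BIT)  # 12 bytes
size64 = struct.calcsize(NODE_64BIT)  # 20 bytes
growth = footprint_bytes(NODE_64BIT, 1_000_000) / footprint_bytes(NODE_32BIT, 1_000_000)
```

Without padding the node grows from 12 to 20 bytes. A compiler that pads the 64-bit node out to an 8-byte boundary would typically make it 24 bytes, which is the full doubling. Either way, fewer nodes fit in each cache line.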
This whole idea of comparing chips of equal frequency is STUPID. The engineers at Intel have obviously *designed* the P4 chip to clock at a higher frequency than an Athlon. That is to say that their design rules use fewer "gates per clock" and can therefore clock at a higher frequency. This is why you see 3.0GHz P4s and ~2.0GHz Athlons. You will never see a 3.0GHz AthlonXP microarchitecture in the same process technology and power dissipation as a 3.0GHz P4 because the laws of physics won't allow it. The Athlon has been designed from the get-go to clock lower.
Whether a high-frequency design is good or bad is up for debate. The P4 design is narrow and fast, whereas the Athlon design could be seen as wide and slow. Both designs have impressive performance when compared to modern RISC chips. Performance is measured in time, not frequency. I have a unit of work to do; how long did it take my chip to complete that work? That's the question you should ask when you want to know what the best performer is.
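The "time, not frequency" point is just the usual relation: execution time = instruction count / (instructions per clock x clock rate). A minimal sketch follows. The IPC figures are made-up illustrative numbers, not measurements of any real P4 or Athlon.

```python
import math

def execution_time_s(instructions, ipc, clock_hz):
    """Seconds needed to retire `instructions` at a given IPC and clock rate."""
    return instructions / (ipc * clock_hz)

WORK = 1_000_000_000  # one unit of work: a billion instructions

# Hypothetical designs: one narrow and fast, one wide and slow.
narrow_fast = execution_time_s(WORK, ipc=1.0, clock_hz=3.0e9)
wide_slow = execution_time_s(WORK, ipc=1.5, clock_hz=2.0e9)

same_performance = math.isclose(narrow_fast, wide_slow)
```

With these assumed numbers the 2.0GHz design finishes in exactly the same time as the 3.0GHz one, which is why comparing chips clock for clock says nothing about which one gets the work done sooner.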
There are other metrics that are important to people of course, like cost. Cost is a function of power, die size, pin count and a number of other things.
The engineers at Intel are trying to achieve performance through frequency - i.e. deeper pipelines. For more insight into the tradeoffs of pipeline depth and performance, I would recommend reading the ISCA paper by Sprangle and Carmean.
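For a feel of the tradeoff, here is a toy textbook-style model. It is not the model from the Sprangle and Carmean paper, and every parameter value is an assumption chosen only for illustration. Splitting the logic into more stages shortens the cycle, but each stage adds fixed latch overhead and every extra stage makes hazards (such as branch mispredicts) cost more cycles.

```python
import math

def time_per_instruction_ps(depth, t_logic_ps, t_latch_ps, hazard_per_stage):
    """Average time per instruction for a pipeline with `depth` stages."""
    cycle_ps = t_logic_ps / depth + t_latch_ps  # shorter cycle as depth grows
    cpi = 1.0 + hazard_per_stage * depth        # stall cycles grow with depth
    return cycle_ps * cpi

def analytic_optimum(t_logic_ps, t_latch_ps, hazard_per_stage):
    """Depth that minimizes the expression above (set its derivative to zero)."""
    return math.sqrt(t_logic_ps / (t_latch_ps * hazard_per_stage))

# Assumed values: 5000 ps of total logic, 100 ps of latch overhead per stage,
# and 0.05 extra stall cycles per instruction for every stage of depth.
PARAMS = dict(t_logic_ps=5000.0, t_latch_ps=100.0, hazard_per_stage=0.05)

best_depth = min(range(1, 201), key=lambda n: time_per_instruction_ps(n, **PARAMS))
predicted_depth = analytic_optimum(**PARAMS)
```

The point of the sketch is only that there is an optimum: past it, extra frequency is eaten by overhead and stalls, and where that optimum sits depends heavily on how well the design hides hazards.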
Apparently there is not much an SGI machine can do that a PC cannot do (or other unix machine) since SGI has not posted a yearly profit since 1997!
I'm amazed that they're still around...They offer no value added.
In fact Intel/AMD *IS* inherently faster. Just look at the latest SPECfp scores on www.spec.org and you'll see the top end x86 systems kill everything out of SGI. I mention SPECfp because that kind of workload corresponds well to the kinds of compute tasks that ILM needs to do. Linux is almost a non-issue. For these kinds of workloads the impact of the OS is invisible - it's all user-mode code crunching on inner floating point loops. The latest 3GHz Intel P4 has close to a 1200 SPECfp score. This is incredible for a machine with ONLY 512k of cache. Compare the cache sizes and memory bandwidths of the machines that beat x86 systems - hint - they're much, much larger (which is a major contributor to system cost), yet the x86 systems keep up.
There are large economies of scale at work. x86, as ugly as it is, is the fastest thing on the planet these days for compute heavy tasks.