You're comparing Apples and oranges here! (no pun intended... honest!:> ).
The Linpack code used in this test was really designed to demonstrate the memory subsystem characteristics of the P4 vs. the Athlon, not to crunch data. This should be blatantly obvious even to someone making a troll/flamebait/annoying Apple user post, since the benchmarks you quoted show 3.2GHz Xeon processors (nearly identical to the P4) crunching at up to 4.35GFlops.
A more accurate view of the P4's capabilities for scientific computing is listed later in the review. Specifically, the ScienceMark BLAS DGEMM test is a pretty close approximation to the Linpack results that Apple is reporting. These tests show the 3.4GHz P4 'E' maxing out at ~4.1GFlops. Not quite where the G5 is, but pretty close. The G5 ends up being a fair bit faster because this sort of matrix solving is nothing but double precision floating point multiplies and adds. The G5 has this nifty FP Multiply-Add instruction that does both of these operations at the same time, and the chip can do two such instructions per clock cycle. The P4 lacks this Multiply-Add instruction, so it needs to use two separate instructions, each with a 1 cycle throughput. The SSE2 unit allows each of those instructions to operate on two pieces of data at once, so the total works out to the P4 having half the theoretical GFlops per clock cycle that a G5 has. In practice the P4 does a little better than this because it comes closer to matching its theoretical peak than the G5 does (probably a memory subsystem issue).
Side note: Altivec could potentially offer very high performance for this test, but it does not support double precision floating point numbers, only single precision, so it won't cut it.
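To put rough numbers on the per-clock argument above, here is a minimal sketch. It takes the per-cycle figures from the post at face value (two multiply-adds per clock on the G5, half that rate on the P4); the function and constant names are just made up for illustration, and the 2.0GHz G5 clock is the speed mentioned elsewhere in this thread.

```python
# Theoretical peak double-precision GFLOPS, using the per-clock figures
# described in the post above (illustrative, not vendor-published numbers).

def peak_gflops(clock_ghz, flops_per_cycle):
    """Peak GFLOPS = clock rate in GHz x floating point ops per cycle."""
    return clock_ghz * flops_per_cycle

# G5: two multiply-add instructions per cycle, each counted as 2 flops.
G5_FLOPS_PER_CYCLE = 2 * 2
# P4: per the post, half the G5's theoretical per-clock rate.
P4_FLOPS_PER_CYCLE = G5_FLOPS_PER_CYCLE // 2

g5_peak = peak_gflops(2.0, G5_FLOPS_PER_CYCLE)
p4_peak = peak_gflops(3.4, P4_FLOPS_PER_CYCLE)

# Fraction of theoretical peak reached by the ~4.1 GFLOPS DGEMM result.
p4_efficiency = 4.1 / p4_peak

print(f"G5 2.0GHz theoretical peak: {g5_peak:.1f} GFLOPS")
print(f"P4 3.4GHz theoretical peak: {p4_peak:.1f} GFLOPS")
print(f"P4 DGEMM result as a fraction of peak: {p4_efficiency:.0%}")
```

Under those assumptions the P4 lands at roughly 60% of its theoretical peak in DGEMM, which is the "comes closer to its peak" point in concrete terms.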
How about you just use the fastest compiler for each chip? Or the compiler that will actually be used for running your code?
I know a lot of people like to complain about the effects of compilers in benchmarks (particularly SPEC CPU, the most widely used cross-platform benchmark), but compilers are a critical part of the equation for performance. So what if GCC is more optimized for Platform A vs. Platform B, the end result is that Platform A will run your code faster, and that's what really matters.
Just wondering: Is the described setup with the case open & lying on its side actually better or worse for cooling?
The only answer to that question is: It depends.
Ideally your case will be designed to have a decent flow of cool air through the case, and in particular the cool air will flow over the processor while the warm air will be sucked out the back. This is better than having the case lying open and on its side where the only flow of cool air will be from free convection (i.e. hot air rising).
In reality though, there are dozens of factors that can affect this. Certainly the number, speed and placement of fans in your system can change things a lot. Occasionally adding fans can actually make cooling WORSE! (don't tell the overclockers this though, they're doing a good job of keeping the economy going by buying 75+ fans for their PC). Also things inside a case like cables, add-in cards and hard drives can change the flow of air.
In the end, chances are that it's not going to make a big difference one way or the other. What might make a bigger difference is the method used to measure and report temperature, something that is notoriously inaccurate on most motherboards. Even with the on-die thermal diode of the P4 I wouldn't trust the temperature monitoring software to more than about 1 digit of precision.
Still, this chip is getting too hot. Intel states that the maximum temperature the chip should ever reach is only 73.2C.
Nice thought except that there's no such thing as a dual processor "Northwood" (or "Prescott" for that matter). The P4 is simply not capable of running in a multi-processor setup; you need a Xeon for that. Now, one could make a pretty valid argument for a system powered by a pair of 2.4GHz Xeon processors, as those chips are rather reasonably priced these days, but the total cost of the chips + motherboard isn't likely to save you any money over these top-end chips. Better system? Perhaps (I'd certainly rather have a pair of Xeons vs. one P4EE), but not cheaper.
The AthlonXP is a slightly different story. While it's not officially capable of running in a dual processor setup, it can be hacked to work. Alternatively you could get a pair of slightly more expensive AthlonMP chips. Unfortunately dual-processor AthlonMP motherboards are rather dated now, so unless you're on a real budget a dual Opteron setup would probably be better.
One, stuff is smaller. I prefer large cases with a lot of space
The ATX specification calls for a maximum motherboard size of 305mm x 244mm. The BTX specification calls for a maximum motherboard size of 325.12mm x 266.70mm. I have no idea where you got the idea that BTX is smaller, because it's actually bigger. Maybe you're thinking of Micro-BTX or Pico-BTX? Those are both designed for small-form-factor designs, much like Micro-ATX and Flex-ATX today (the latter isn't very popular among consumer boards, but used in some OEM systems).
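For anyone who wants the size difference as a number, here is a quick back-of-the-envelope sketch using only the maximum board dimensions quoted above (the variable names are mine):

```python
# Maximum motherboard footprints in mm, as quoted in the post above.
ATX_MAX_MM = (305.0, 244.0)
BTX_MAX_MM = (325.12, 266.70)

def board_area_mm2(dimensions):
    """Area of a rectangular board given (length, width) in mm."""
    length, width = dimensions
    return length * width

atx_area = board_area_mm2(ATX_MAX_MM)
btx_area = board_area_mm2(BTX_MAX_MM)
ratio = btx_area / atx_area

print(f"ATX max area: {atx_area:.0f} mm^2")
print(f"BTX max area: {btx_area:.0f} mm^2")
print(f"BTX is about {(ratio - 1) * 100:.0f}% larger by area")
```

So a full-size BTX board can be roughly a sixth larger in area than a full-size ATX board, not smaller.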
Of course, the size of the case does not necessarily have much to do with the size of the motherboard. As long as a case satisfies the requirements for the volumetric zones of BTX it can be any size the designer wants it to be. Full sized BTX cases will probably be pretty much the same size as full size ATX cases.
Two, they are droping PS/2 ports,
Again, this is just flat out wrong. The BTX specification doesn't list ANYTHING to do with I/O ports except for the size of the rear panel I/O cutout. This is slightly shorter in one dimension and longer in the other than ATX's, but not by a significant margin. It's still very possible to put PS/2 connectors on the back of a board, and I fully expect most motherboard vendors to do so. A few will opt to get rid of the PS/2 connectors in favor of more USB, but then again, Abit does that now with their "Max" line of ATX boards.
Three, everything is on the wrong side of the case
Err.. is that really that big of a problem?
Four, What is this PCI Express?
PCI-Express is a new point-to-point I/O protocol, allowing for a wide range of data rates. The main reason for PCI-Express is that it should eventually make everything cheaper by standardizing on a single bus instead of the multitude that we have now. Right now we've got PCI, PCI-X, AGP, AMR, ACR, CNR and CSA all being used on various PCs for a variety of uses. PCI-Express looks to improve on some of these and eventually to replace them all. Probably a good thing going forward, though not critical right off the bat.
Of course, PCI-Express has nothing to do with BTX. The first PCI-Express motherboards will all be ATX/Micro-ATX boards (maybe even a Mini-ITX board or two if we're lucky, the PCI-E 1x connector is very small making it a solution for tiny PCs). It's also quite possible to make a BTX board without any PCI-Express sockets, though that's a slightly pointless goal.
Why not just expand the existing ATX standerd to include some new tech
That's actually largely what BTX does. It's primarily just a redesign of the case to allow for more cooling for today's super-hot processors. ATX had the board sitting in the wrong place to do this, so a slight modification of ATX wasn't all that much of an option.
Even for the Stupid People (tm), I don't see this as being worse than it is now. Right now if someone goes out to buy a chip based on clock speed they might see an Intel Celeron 2.8GHz system for $700 and an otherwise identical P4 2.8C GHz system for $900. Which do you think they'll choose? Probably the Celeron, despite the fact that it's a SIGNIFICANTLY slower processor (much more than the $200 price difference would seem to indicate).
It's even worse with laptop chips, which is where the initiative to use a model number scheme started. Here Intel's 1.5GHz Pentium-M chip is a faster processor than their Mobile Celeron 2.5GHz processor, but most Stupid People would prefer to buy the Celeron because of its big clock speed number.
Hopefully the model numbers will be really arbitrary so that at least people will know that they're just bullshit numbers. Nobody would think that a Mercedes S600 is "20% better" than a Mercedes S500; everyone simply recognizes that they are model numbers.
The cost of software is a rather small part of the cost for a TPC score. Even on the "cheap" systems (the cheapest system on that top-10 list costs $32,772, and most cost about $50,000), hard disks are the dominant cost factor.
Perhaps an interesting flip-side to this argument is to look at the list of fastest systems overall.
Linux fanboys will be happy to know that their OS powers the most powerful system in this test (albeit through the use of a cluster, and a known weakness of the TPC-C test is that clusters can produce somewhat unrealistically good results), while MS only appears in 3 of the top-10 systems. IBM's AIX is the most common operating system (4 systems) while Oracle is the most common database (also 4 entries). Linux fanboys may actually have good reason to show off this first-place result though, because with a system cost of $6.5M, HP almost certainly wasn't using the free OS for any sort of price advantage. Rather it may offer a performance advantage over Microsoft or even HP's own HP-UX.
Gigaflops is only a tiny fraction more useful than GHz, if at all.
Gigaflop tests come in three basic varieties. First are ones that fit entirely into the L1 cache of a processor, making the memory subsystem totally irrelevant. This is no good since the memory subsystem plays an important role in performance. In this sort of test a 2.8GHz Celeron processor with 128K of L2 cache and a 400MT/s bus speed would get a score essentially identical to a 2.8GHz P4 with 512KB or 1MB of L2 cache and an 800MT/s bus speed. In 90% of real-world applications though even a much slower 2.0GHz P4 would beat the pants off a 2.8GHz Celeron (the current Celeron chips are absolutely abysmal performers).
The second type of gigaflops test has a slightly larger dataset, so performance is almost entirely determined by what level of cache it fits into. For example, if they used something like a 60K dataset, an AthlonXP or Athlon64 would blow the doors off any P4 because it would be running everything in L1 cache while the P4 would be running out of (the much slower) L2 cache. Clock for clock the AthlonXP chips could easily be twice as fast in such a test. Things would get even worse if your data set fit into the L2 cache of one chip but not another, i.e. if you had a 750K data set, a "Prescott" P4, with 1MB of L2 cache, could be HUGELY faster than a "Northwood" P4 with only 512KB of L2 cache, even though in reality their performance is fairly close (with the "Northwood" usually being slightly faster).
The third option would be to use a HUGE dataset, turning this entirely into a memory bandwidth test. Fine for what it's testing, but hardly an accurate picture of overall performance.
There are good reasons why the rather smart guys over at Ace's Hardware make use of Linpack (the basic Gigaflops test used by Top500.org) to show off the memory subsystem of a platform. By varying the size of your dataset it does a good job of illustrating the effects of cache and memory. However it doesn't tell you much else about processor performance.
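The dataset-size effect described above can be sketched in a few lines. The P4 L2 sizes are the ones given in the post; the L1 data cache sizes and the AthlonXP L2 size are my own assumptions (not stated in the original), and the lookup ignores details like the Athlon's exclusive L2, so treat it as a rough illustration only.

```python
# Which level of the memory hierarchy does a working set fit into?
# Sizes in KB. P4 L2 sizes come from the post; the L1 data cache sizes
# and the AthlonXP L2 size are assumptions added for illustration.
CHIPS = {
    "AthlonXP":     {"L1": 64, "L2": 512},
    "P4 Northwood": {"L1": 8,  "L2": 512},
    "P4 Prescott":  {"L1": 16, "L2": 1024},
}

def fastest_level(dataset_kb, caches):
    """Return the first (fastest) level large enough to hold the dataset."""
    for level in ("L1", "L2"):
        if dataset_kb <= caches[level]:
            return level
    return "RAM"

# Tiny, 60K, 750K and HUGE datasets, roughly matching the cases above.
for size_kb in (4, 60, 750, 100_000):
    placement = {chip: fastest_level(size_kb, caches)
                 for chip, caches in CHIPS.items()}
    print(f"{size_kb:>7} KB -> {placement}")
```

The 60K and 750K rows are the ones where the chips land in different levels, which is exactly where a single gigaflops number becomes misleading.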
I think that gigaflops would be a slightly worse metric for processor performance than MHz because it's FAR easier to abuse that test. The best thing for consumers is if the model numbers are really NOT meaningful at all. For example, look at video cards, where our top-dogs today are the ATI Radeon 9800 and the nVidia GeForce 5900. Nobody looks at those and says "Ohh, 9800 is bigger than 5900, therefore the ATI MUST be better". Everyone KNOWS that the model numbers here are meaningless, so if they want to know which is faster they ask a friend (or at least the salesperson) or do some research on their own. That is what I would like to see for processors as well. AMD's already got this with their Athlon64 FX line and Opteron line of processors. Hopefully Intel will do the same.
Why does this incorrect info keep getting posted (and modded as "informative" at that)? AMD stated several times quite publicly that their rating initially was meant to compare against the "Thunderbird" Athlon chips. More recently they've simply said that it's relative performance between the AthlonXP line and that it can "outperform its closest competitors". Here's a direct quote from AMD's AthlonXP FAQ:
Q: What does the 3200+ model mean?
A: This is a model number. AMD identifies the AMD Athlon XP processor using model numbers, as opposed to megahertz. Model numbers are designed to communicate the relative application performance among the various AMD Athlon XP processors. As additional evidence that performance is not based on megahertz alone: the AMD Athlon XP processor 3200+ operates at a frequency of 2.2GHz yet can outperform an Intel Pentium(R) 4 processor operating at 3.0GHz with an 800 FSB and HyperThreading on a broad array of real-world applications for office productivity, digital media and 3-D gaming.
AMD's model numbers not rated against Intel's P4 chips? You might want to tell AMD that!
The latency penalty of getting remote memory in an Opteron is a little bit less than the latency penalty of using an off-chip memory controller. The latency to remote memory may be higher than local memory, but it's not quite as high as the latency that the Apple G5 or Intel P4 would experience. This is probably a bit more of an issue for the P4 due to its higher clock speed (it sees the same latency in total time but much higher latency in terms of clock cycles sitting around waiting for data), hence part of the reason why Xeons are available with 2 and 4MB of cache vs. 512KB on the PPC 970 and 1MB on the Opteron.
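The "same time, more clock cycles" point is simple arithmetic, sketched below. The 100ns latency is a purely hypothetical figure picked to make the numbers round, not a measurement of any of these chips, and the clock speeds are just the ones mentioned elsewhere in this thread.

```python
def stall_cycles(latency_ns, clock_ghz):
    """Clock cycles spent waiting = latency in ns x cycles per ns (GHz)."""
    return latency_ns * clock_ghz

# Hypothetical memory latency, identical in wall-clock time for both chips.
LATENCY_NS = 100.0

for name, clock_ghz in (("PPC 970 @ 2.0GHz", 2.0), ("P4 @ 3.2GHz", 3.2)):
    cycles = stall_cycles(LATENCY_NS, clock_ghz)
    print(f"{name}: {cycles:.0f} cycles waiting on one memory access")
```

With the same wall-clock latency, the higher-clocked chip burns proportionally more cycles per miss, which is why it leans harder on big caches.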
It's tough to give any sort of concrete "Chip A is better for this type of task, chip B is better for that type of task" breakdown with these chips, because there are MANY factors that come into play here, not the least of which being the compiler used. For example, while everyone always points to media encoding as being the P4's real strong suit, and benchmarks using the DivX codec support this, when using the XVid codec the Athlon actually ends up being faster. The two codecs are very similar in their design and both can produce great looking video at relatively low bit rates, but they are obviously different enough that their performance varies a fair bit. When you throw the Apple G5 into the mix things get even trickier because you're talking about a different ISA, different OS, different compilers, etc. etc.
It depends on what floating point operations you're talking about. On some floating point ops, the PowerPC 970 can significantly outperform an Intel P4 (the much publicized/hyped Linpack test for Top500.org supercomputer list is a good example of this). On other FP ops, the P4 can significantly outperform the PPC 970. Overall though, they're fairly close, though I would tend to give the edge to the P4 if for no other reason than because there are better compilers for it. While lots of people like to argue that better compilers do not equal a better chip, in the real world the only thing that matters is the performance you get in the end, and compilers play a part in that performance.
"Stuck" as in dual channel DDR and now quad channel RDRAM solutions?
Who cares how much bandwidth you can get from memory to the memory controller, basically your CPU is the only consumer of that bandwidth (except the odd DMA transaction, but those are usually measured in the MB/s, not GB/s) and it's stuck getting that data at 533MT/s (4.2GB/s).
Xeons are still 533mhz because 800 on the desktop is basically marketing tripe - it really doesnt make your computer perform any better.
What the heck have you been smoking?! It sure as hell does help your computer perform better! There are DOZENS of tests out there to back this up! What's more, with Intel's slightly dated shared bus setup, running a dual-processor setup makes the slow bus even more of a bottleneck.
Fortunately Intel is well aware of this problem and they have full plans to upgrade their Xeon line to 800MT/s bus speeds with their next revision of the chips (Nocona, basically the Xeon version of a "Prescott" P4).
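To put the 533MT/s and 800MT/s figures side by side, here is a minimal sketch of the bandwidth arithmetic. The 64-bit bus width is my assumption (the post only gives the transfer rates and the ~4.2GB/s result), and the function name is made up.

```python
def bus_bandwidth_gb_s(mega_transfers_per_s, bus_width_bits=64):
    """Peak bandwidth in GB/s = transfers per second x bytes per transfer."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * bytes_per_transfer / 1000

for rate in (533, 800):
    print(f"{rate}MT/s bus: {bus_bandwidth_gb_s(rate):.2f} GB/s peak")

# On a shared bus, two processors split that peak between them.
per_cpu_shared = bus_bandwidth_gb_s(533) / 2
print(f"533MT/s shared by 2 CPUs: {per_cpu_shared:.2f} GB/s each at best")
```

However many memory channels sit behind the controller, the CPUs cannot pull data faster than the bus peak, and in a dual setup they have to share it.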
I don't know if you want to wait, but Intel is planning on moving their Xeon line to 800MT/s bus speeds later this year, probably mid-summer.
Of course, that being said, if you're looking at memory issues, the Opteron is definitely the way to go (except maybe for IBM's Power4 or Intel's Itanium, but they're both MUCH more expensive). Even though the Opteron and the G5 have the same theoretical memory bandwidth (6.4GB/s), the integrated memory controller of the Opteron will provide you with more real-world bandwidth. Add to that the NUMA design so that bandwidth scales with additional processors and it quickly gains a clear performance edge. Plus, to top it off, the integrated memory controller gives you SIGNIFICANTLY lower memory latency, something that is often even more important than high bandwidth.
Of course, price is a different matter, but they should be quite close. HP's new 2P Opteron servers are quite reasonably priced, shaving several hundred dollars off the price that IBM charges for their Opteron systems.
Also for math (especially floating point) calculations, the G5 (PPC970) is much superior to the Intel IA-32 (not really a big thing if all you do is run Word, of course).
That's a VERY broad statement there, and not really backed by much fact. For certain applications I'm quite certain that the PowerPC 970 is quite a bit faster than any x86 chips, but in other applications it's probably quite a bit slower, while overall they would seem to be fairly close.
Probably the most comprehensive cross-platform CPU benchmark we've got is SPEC CPU2000. It's far from perfect, but at least it's widely used. The best numbers I've seen for the PPC 970 are 937 CINT_base and 1051 CFP_base at 1.8GHz (numbers available in this product overview from IBM). Very respectable performance, and the 2.0GHz PPC 970 should be a bit higher, but it's not quite class-leading.
For comparison, a top-end Opteron system (Opteron 148, 2.2GHz) managed 1304 CINT and 1505 CFP. The Xeons are in the same basic range, with a score of 1532 CINT and 1338 CFP. And before anyone goes crying foul because of unfair compilers or anything like that, the Opteron numbers are achieved using GCC.
According to a talk by "Dr. BigMac" (from VA Tech) the only other high-volume CPU approaching it was the Intel Itanium, and here (quite an irony) Intel was under-clocked! (The G5, last year, was shipping at 2Gh, the Itanium less than that).
Ol' Dr. BigMac was basing his decision only on the specific performance tests he felt were important. In this case, that test was Linpack, where the PPC does very well. Linpack is certainly not the only measure of processor performance; it's actually a VERY limited test, albeit one that is applicable to many types of scientific computing.
As for the Itanium it's likely more an issue of price rather than clock speed. When you look at the real-world performance of the Itanium2 1.5GHz vs. PPC 970 2.0GHz in Linpack, they're pretty close (probably within 5%). However a "cheap" dual-Itanium node will set you back a cool $15,000 or so, while a similarly equipped dual-G5 system from Apple will only cost you about $5000.
As for Pixar themselves, it's quite possible that they went through some benchmarks and found that the PowerPC 970 offered better performance for their particular work than any x86 chips. As mentioned above, there are some areas where the PPC970 does excel. However, I suspect that there was a STRONG incentive to find the PPC970 fastest regardless of what the actual performance was.
Definitely not BS, though whether or not it's useful depends heavily on your application.
The idea behind hyperthreading is that the P4's long pipeline will often stall with only a single thread going through. With hyperthreading you run two threads at once, so when one thread stalls you just start up the other thread and go with that one for a while. In a way it's almost like a poor man's dual-processor system, giving you two logical processors on a single chip.
Now, obviously there are a few things to consider here. First off, if ALL of your processing is being done in a single thread then you aren't going to see any benefit to hyperthreading, and in fact the extra overhead might even make things a bit slower (usually only 1-2% slower).
Games almost always do all their major processing in a single thread. Even if they have extra threads hanging around, you almost always spend 99%+ of your time in a single thread. For this reason, games see virtually no benefit to hyperthreading (they don't see much/any benefit from dual-processor setups either).
On the other end of the spectrum, some applications see up to a 25% performance boost when hyperthreading is enabled. In the tests I've seen, the biggest improvements have been in things like Photoshop and rendering applications. Some server applications should benefit as well.
The other boost that hyperthreading gives you, like with a real dual-processor setup, is that it makes multitasking a bit "snappier". This is by no means a night-and-day difference here, but it is there.
Cooling is/was more important, especially for the older T-birds.
Cooling is VERY important for all current processors. It all becomes relative. When the Thunderbird was current, it used quite a bit more power than the PIII that it competed against, so cooling was very important for that chip. Now, the ~70W that the T-Bird used is not at all abnormal, this is the same basic power consumption of an AthlonXP (Barton or Thoroughbred), Athlon64/Opteron, P4 "Northwood" or even an IBM PowerPC 970 (aka Apple G5). The Intel P4 "Prescott" chips are a bit hotter, so for the moment people are talking a lot about how hot those chips get, but give it another year or two and 100W TDP probably will not be at all abnormal for processors.
You have to be more careful not to crack the chip when putting the heatsink on.
I've always wondered how in the hell anyone managed to crack their chips putting heatsinks on. I've put heatsinks on a fair number of AthlonXP chips and never even seen a possible way to crack the core. Do people usually install heatsinks using a hammer or something?! While I do like the newer retention mechanism used on Intel P4 chips or AMD Athlon64/Opteron chips, it's really NOT difficult at all to put a heatsink on an AthlonXP processor.
AMD has had software issues. For example, Win2K had to be patched to SP1 because AMD messed up AGP coherence. AMD also never told the Linux developers about this same problem causing numerous people to have system crashes when using nVidia cards/drivers under linux. You'd think they would have sent a fricken e-mail to the kernel-dev lit.
You need to do a bit more reading on that problem; it was actually caused by nVidia's drivers doing some really stupid things (ATI does/did the same stupid things, as did most other video card vendors). It was only a matter of sheer luck that the problem DIDN'T affect Intel chips in the same way.
The difference isn't all that large actually, though ICC is generally still a bit faster than GCC on Athlon64/Opteron chips. If you look at the SPEC CPU2000 results you can find a bunch of different Opteron results. Some of these results can be used to show a direct comparison between ICC and GCC, and they're often within about 10% of one another. In fact, when you start dealing with 64-bit Linux and 64-bit GCC code it often ends up being faster than 32-bit ICC code.
As for the compiler used in the article, they were using Microsoft SQL Server as best as I can tell, so I think that you can be fairly certain that the compiler used was MS Visual Studio.Net.
You do seem to be correct that companies don't want to emphasize NUMA too much from the software side of things. AMD has also been known to say that their Opterons don't need any NUMA optimizations.
They actually used the term "SUMO" (complete with a drawing of a sumo wrestler in one of their presentations!:) ), for Sufficiently Uniform Memory Organization.
Regardless of what the marketing department says and to whom, it's still NUMA. A test on Ace's Hardware a while back showed that a 2P Opteron can be up to 20% faster with NUMA optimizations, and presumably a 4P Opteron will see an even bigger boost.
I would guess that Anand's test linked in this article had those NUMA optimizations enabled already, though he doesn't specify any BIOS options. He did use Win2K3 which is NUMA aware.
I think that a large part of the story was simply that the new Xeons are no longer getting hugely beaten by the Opteron like they were in the previous test performed by the same site. While the Opterons still held a noticeable lead in 4P setups, the 2P systems were all pretty close.
In Anand's previous test comparing the two chips the Opteron came out WELL ahead of the Xeon. Of course, the tests were somewhat different, and it's been demonstrated several times that the Opteron is the chip to beat in web serving (subject of the first test) but things are much closer in more pure-database tests (second article).
As for other platforms, I do wish they had been able to throw an Itanium2 setup into the mix. In fact, what would be even more interesting is if/when they get Win2k3 64-bit edition running on the Opteron (and maybe Xeons as well) and THEN compare it to the Itanium2.
They WILL standardize on a socket, it's just that the socket will be Socket 939 and not the current one.
It's pretty much the same story with SlotA/SocketA. They had an initial design that was quickly replaced. The second socket then stuck it out for the duration.
Intel did pretty much the same thing with their P4, initially releasing it on socket 423 and then quickly moving to socket 478 which has lasted for several years now (though it too will soon be replaced).
Markets change, technology changes, and sometimes sockets need to change with them. Remember that the specification for Socket 754 and Socket 940 for current Athlon64 chips was set in stone about 3 years ago (before the first beta chips taped out), and a lot has changed since then. AMD has gone to great lengths to minimize socket changes, but there's only so much that they can do.
I highly suspect that they actually are running off a variant of WinCE, seeing as they already have a current version of that OS supporting PowerPC (not the PPC 970, but it shouldn't take much to get that supported).
By "NT kernel" I suspect that they are using the term VERY loosely. Microsoft seems to use the same basic kernel concept for all their operating systems, all based off the old WinNT microkernel.
Remember, this news came from The Inquirer, so don't expect too many technical details (or accuracy for that matter).
Don't hold your breath. XBox2 games will be primarily Win32 and DirectX API games, neither of which run on OS X. Just because the hardware is similar doesn't mean that the systems are all that closely related.
Be VERY careful with IBM's power specs, they don't tell a very complete picture.
For example, IBM lists the power consumption of the PPC 970 as being something like 48W "typical" power consumption at 1.8GHz. Unfortunately the maximum real-world power consumption is a LOT higher than that, and when you start comparing a 2.0GHz PPC 970 (aka G5) to a modern x86 chip from AMD (2.0GHz Athlon64) or Intel (3.2GHz P4), you end up with pretty darn similar power consumption figures.
Besides, it's not like Intel doesn't already have a low-powered design in their Pentium-M. Intel has even hinted that they might make a dual-core version of that processor sometime late next year.
In the end, I'm sure that the MAIN deciding factor here is cost, nothing more, nothing less.
As for an MS supported OS on PPC, that already exists. Current versions of WinCE run on PPC chips.
You're comparing Apples and oranges here! (no pun intended... honest! :> ).
The Linpack code used in this test was really designed to demonstrate the memory subsystem characteristics of P4 vs. the Athlon, not to crunch data. This should be blatently obvious even to someone making a troll/flamebait/annoying Apple user post, since the benchmarks you quoted show 3.2GHz Xeon processors (nearly identical to the P4) crunching at up to 4.35GFlops.
A more accurate view of the P4's capabilities for scientific computing is listed later in the review. Specifically the ScienceMark BLAS DGEMM tests is a pretty close approximation to the Linpack results that Apple is reporting. These tests show the 3.4GHz P4 'E' maxing out at ~4.1GFlops. Not quite where the G5 is, but pretty close. The G5 ends up being a fair bit faster because this sort of matrix solving is nothing but double precision floating point multiplies and adds. The G5 has this nifty FP Multiply-Add instruction that does both of these instructions at the same time, and the chip can do two such instructions per clock cycle. The P4 lacks this Multiply-Add instruction, so it needs to use two separate instructions, each with a 1 cycle throughput. The SSE2 unit allows each of those instructions to operate on two bits of data at once, so the total works out to the P4 having half the theoretical GFlops per clock cycle that a G5 has. In practice the P4 does a little better than this because it comes closer to matching it's theoretical peak than the G5 does (probably a memory subsystem issue).
side note: Altivec could potentially offer very high performance for this test, but it does not support double precision floating point numbers, only single percision, so it won't cut it.
How about you just use the fastest compiler for each chip? Or the compiler that will actually be used for running your code?
I know a lot of people like to complain about the effects of compilers in benchmarks (particularly SPEC CPU, the most widely used cross-platform benchmark), but compilers are a critical part of the equation for performance. So what if GCC is more optimized for Platform A vs. Platform B, the end result is that Platform A will run your code faster, and that's what really matters.
Just wondering: Is the described setup with the case open & lying on its side actually better or worse for cooling?
The only answer to that question is: It depends.
Ideally your case will be designed to have a decent flow of cool air through the case, and in particular the cool air will flow over the processor while the warm air will be sucked out the back. This is better than having the case lying open and on it's side where the only flow of cool air will be from free convection (ie hot air rising).
In reality though, there are dozens of factors that can affect this. Certainly the number, speed and placement of fans in your system can change things a lot. Occassionally adding fans can actually make cooling WORSE! (don't tell the overclockers this though, they're doing a good job of keeping the economy going by buying 75+ fans for their PC). Also things inside a case like cables, add-in cards and hard drives can change the flow of air.
In the end, chances are that it's not going to make a big difference one way or the other. What might make a bigger difference is the method used to measure and report temperature, something that is notoriously inaccurate on most motherboards. Even with the on-die thermal diode of the P4 I wouldn't trust the temperature monitoring software to more than about 1 digit of percision.
Still, this chip is getting too hot. Intel states that the maximum temperature the chip should ever reach is only 73.2C.
Nice thought except that there's no such thing as a dual processor "Northwood" (or "Prescott" for that matter). The P4 is simply not capable of running in a multi-processor setup, you need a Xeon for that. Now, one could make a pretty valid argument for a system powered by a pair of 2.4GHz Xeon processors, as those chips are rather reasonably priced these days, but the total cost of the chips + motherboard isn't likely to save you any money over these top-end chips. Better system? Perhaps (I'd certainly rather a pair of Xeons vs. one P4EE), but not cheaper.
The AthlonXP is a slightly different story. While it's not officially capable of running in a dual processor setup, it can be hacked to work. Alternatively you could get a pair of slightly more expensive AthlonMP chips. Unfortunately dual-processor AthlonMP motherboards are rather dated now, so unless you're on a real budget a dual Opteron setup would probably be better.
One, stuff is smaller. I prefer large cases with a lot of space
The ATX specification calls for a maximum motherboard size of 305mm x 244mm. The BTX specification calls for a maximum motherboard size of 325.12mm x 266.70mm. I have no idea where you got the idea that BTX is smaller, because it's actually bigger. Maybe you're thinking of Micro-BTX or Pico-BTX? Those are both designed for small-form-factor designs, much like Micro-ATX and Flex-ATX today (the latter isn't very popular among consumer boards, but used in some OEM systems).
Of course, the size of the case does not necessarily have much to do with the size of the motherboard. As long as a case satisfies the requirements for the volumetric zones of BTX it can be any size the designer wants it to be. Full sized BTX cases will probably be pretty much the same size as full size ATX cases.
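To put a number on "it's actually bigger", here is a quick sketch that just multiplies out the two maximum board sizes quoted above. The only inputs are the dimensions from the specs as I quoted them; the function and variable names are made up for illustration.

```python
# Maximum board dimensions in millimetres, as quoted in the post above.
ATX_MAX_MM = (305.0, 244.0)
BTX_MAX_MM = (325.12, 266.70)


def board_area_mm2(dims):
    """Return the footprint in mm^2 of a board given its two side lengths in mm."""
    side_a, side_b = dims
    return side_a * side_b


atx_area = board_area_mm2(ATX_MAX_MM)
btx_area = board_area_mm2(BTX_MAX_MM)
# How much larger the BTX footprint is, as a percentage of the ATX footprint.
extra_percent = (btx_area / atx_area - 1.0) * 100.0

print(f"ATX max footprint: {atx_area:.0f} mm^2")
print(f"BTX max footprint: {btx_area:.0f} mm^2")
print(f"BTX is about {extra_percent:.1f}% larger by area")
```

This only compares the largest boards each spec allows. It says nothing about the case itself, which, as noted, can be whatever size the designer likes.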
Two, they are droping PS/2 ports,
Again, this is just flat out wrong. The BTX specification doesn't list ANYTHING to do with I/O ports except for the size of the rear panel I/O cutout. This is slightly shorter and longer than ATX, but not by a significant margin. It's still very possible to put PS/2 connectors on the back of a board, and I fully expect most motherboard vendors to do so. A few will opt to get rid of the PS/2 connectors in favor of more USB, but then again, Abit does that now with their "Max" line of ATX boards.
Three, everything is on the wrong side of the case
Err.. is that really that big of a problem?
Four, What is this PCI Express?
PCI-Express is a new point-to-point I/O protocol, allowing for a wide range of data rates. The main reason for PCI-Express is that it should eventually make everything cheaper by standardizing on a single bus instead of the multitude that we have now. Right now we've got PCI, PCI-X, AGP, AMR, ACR, CNR and CSA all being used on various PCs for a variety of uses. PCI-Express looks to improve on some of these and eventually to replace them all. Probably a good thing going forward, though not critical right off the bat.
Of course, PCI-Express has nothing to do with BTX. The first PCI-Express motherboards will all be ATX/Micro-ATX boards (maybe even a Mini-ITX board or two if we're lucky, the PCI-E 1x connector is very small making it a solution for tiny PCs). It's also quite possible to make a BTX board without any PCI-Express sockets, though that's a slightly pointless goal.
Why not just expand the existing ATX standerd to include some new tech
That's actually largely what BTX does. It's primarily just a redesign of the case to allow for more cooling for today's super-hot processors. ATX had the board sitting in the wrong place to do this, so a slight modification of ATX wasn't all that much of an option.
Even for the Stupid People (tm), I don't see this as being worse than it is now. Right now if someone goes out to buy a chip based on clock speed they might see an Intel Celeron 2.8GHz system for $700 and an otherwise identical P4 2.8C GHz system for $900. Which do you think they'll choose? Probably the Celeron, despite the fact that it's a SIGNIFICANTLY slower processor (much more than the $200 price difference would seem to indicate).
It's even worse with laptop chips, which is where the initiative to use a model number scheme started. Here Intel's 1.5GHz Pentium-M chip is a faster processor than their Mobile Celeron 2.5GHz processor, but most Stupid People would prefer to buy the Celeron because of its big clock speed number.
Hopefully the model numbers will be really arbitrary so that at least people will know that they're just bullshit numbers. Nobody would think that a Mercedes S600 is "20% better" than a Mercedes S500, everyone simply recognizes that they are model numbers.
The cost of software is a rather small part of the cost for a TPC score. Even on the "cheap" systems (the cheapest system on that top-10 list costs $32,772, and most cost about $50,000), hard disks are the dominant cost factor.
Perhaps an interesting flip-side to this argument is to look at the list of fastest systems overall.
Linux fanboys will be happy to know that their OS powers the most powerful system in this test (albeit through the use of a cluster, and a known weakness of the TPC-C test is that clusters can produce somewhat unrealistically good results), while MS only appears in 3 of the top-10 systems. IBM's AIX is the most common operating system (4 systems) while Oracle is the most common database (also 4 entries). Linux fanboys may actually have good reason to show off this first-place result though, because with a system cost of $6.5M, HP almost certainly wasn't using the free OS for any sort of price advantage. Rather it may offer a performance advantage over Microsoft or even HP's own HP-UX.
Gigaflops is only a tiny fraction more useful than GHz, if at all.
Gigaflop tests come in three basic varieties. First are ones that fit entirely into the L1 cache of a processor, making the memory subsystem totally irrelevant. This is no good since the memory subsystem plays an important role in performance. In this sort of test a 2.8GHz Celeron processor with 128KB of L2 cache and a 400MT/s bus speed would get a score essentially identical to a 2.8GHz P4 with 512KB or 1MB of L2 cache and an 800MT/s bus speed. In 90% of real-world applications though even a much slower 2.0GHz P4 would beat the pants off a 2.8GHz Celeron (the current Celeron chips are absolutely abysmal performers).
The second type of gigaflops test has a slightly larger dataset, so performance is almost entirely determined by what level of cache it fits into. For example, if they used something like a 60K dataset, an AthlonXP or Athlon64 would blow the doors off any P4 because it would be running everything in L1 cache while the P4 would be running out of (the much slower) L2 cache. Clock for clock the AthlonXP chips could easily be twice as fast in such a test. Things would get even worse if your data set fit into the L2 cache of one chip but not another, i.e. if you had a 750K data set, a "Prescott" P4, with 1MB of L2 cache, could be HUGELY faster than a "Northwood" P4 with only 512KB of L2 cache, even though in reality their performance is fairly close (with the "Northwood" usually being slightly faster).
The third option would be to use a HUGE dataset, turning this entirely into memory bandwidth test. Fine for what it's testing, but hardly an accurate picture of overall performance.
There are good reasons why the rather smart guys over at Ace's Hardware make use of Linpack (the basic Gigaflops test used by Top500.org) to show off the memory subsystem of a platform. By varying the size of your dataset it does a good job of illustrating the effects of cache and memory. However it doesn't tell you much else about processor performance.
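The three varieties above boil down to one question: what is the fastest level of the memory hierarchy that can hold the whole dataset? Here is a rough sketch of that. The L2 sizes for the two P4 cores are the ones mentioned above; the L1 data cache sizes and the AthlonXP "Barton" L2 size are my own assumptions for illustration, and the model ignores things like code sharing the cache, exclusive vs. inclusive caches, and associativity.

```python
# Cache sizes in KB, listed fastest level first.
# The P4 L2 sizes come from the post; the L1 data cache sizes and the
# AthlonXP L2 size are assumptions added for illustration.
CHIPS = {
    "AthlonXP (Barton)": [("L1", 64), ("L2", 512)],
    "P4 Northwood": [("L1", 8), ("L2", 512)],
    "P4 Prescott": [("L1", 16), ("L2", 1024)],
}


def level_serving(dataset_kb, levels):
    """Return the name of the first (fastest) level that holds the dataset."""
    for name, size_kb in levels:
        if dataset_kb <= size_kb:
            return name
    # Nothing on the chip is big enough, so the test is bound by main memory.
    return "RAM"


# Tiny, L1-vs-L2, L2-vs-RAM and "HUGE" datasets, in KB.
for dataset_kb in (4, 60, 750, 100_000):
    for chip, levels in CHIPS.items():
        print(f"{dataset_kb:>7} KB on {chip}: {level_serving(dataset_kb, levels)}")
```

Run it and you can see why a single dataset size makes a lousy benchmark: the 60K and 750K cases land in different levels on different chips, so the score mostly tells you where the dataset happened to fit.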
I think that gigaflops would be a slightly worse metric for processor performance than MHz because it's FAR easier to abuse that test. The best thing for consumers is if the model numbers are really NOT meaningful at all. For example, look at video cards, where our top-dogs today are the ATI Radeon 9800 and the nVidia GeForce 5900. Nobody looks at those and says "Ohh, 9800 is bigger than 5900, therefore the ATI MUST be better". Everyone KNOWS that the model numbers here are meaningless, so if they want to know which is faster they ask a friend (or at least the salesperson) or do some research on their own. That is what I would like to see for processors as well. AMD's already got this with their Athlon64 FX line and Opteron line of processors. Hopefully Intel will do the same.
Why does this incorrect info keep getting posted (and modded as "informative" at that)? AMD stated several times quite publicly that their rating initially was meant to compare against the "Thunderbird" Athlon chips. More recently they've simply said that it's relative performance between the AthlonXP line and that it can "outperform its closest competitors". Here's a direct quote from AMD's AthlonXP FAQ:
Q: What does the 3200+ model mean?
A: This is a model number. AMD identifies the AMD Athlon XP processor using model numbers, as opposed to megahertz. Model numbers are designed to communicate the relative application performance among the various AMD Athlon XP processors. As additional evidence that performance is not based on megahertz alone: the AMD Athlon XP processor 3200+ operates at a frequency of 2.2GHz yet can outperform an Intel Pentium(R) 4 processor operating at 3.0GHz with an 800 FSB and HyperThreading on a broad array of real-world applications for office productivity, digital media and 3-D gaming.
AMD's model numbers not rated against Intel's P4 chips? You might want to tell AMD that!
The latency penalty of getting remote memory in an Opteron is a little bit less than the latency penalty of using an off-chip memory controller. The latency to remote memory may be higher than local memory, but it's not quite as high as the latency that the Apple G5 or Intel P4 would experience. This is probably a bit more of an issue for the P4 due to its higher clock speed (it sees the same latency in total time but much higher latency in terms of clock cycles sitting around waiting for data), hence part of the reason why Xeons are available with 2 and 4MB of cache vs. 512KB on the PPC 970 and 1MB on the Opteron.
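The "same latency in time, more latency in clock cycles" point is just a multiplication, sketched below. The 100ns figure is a made-up round number, NOT a measured latency for any of these chips; the clock speeds are the 2.0GHz PPC 970 and 3.2GHz P4 parts discussed in this thread.

```python
def stall_cycles(latency_ns, clock_ghz):
    """Clock cycles a core sits waiting for one memory access of the given latency."""
    # 1 GHz means one cycle per nanosecond, so cycles = nanoseconds * GHz.
    return latency_ns * clock_ghz


ASSUMED_LATENCY_NS = 100.0  # hypothetical round number, not a measurement

for chip, clock_ghz in (("PPC 970", 2.0), ("P4", 3.2)):
    cycles = stall_cycles(ASSUMED_LATENCY_NS, clock_ghz)
    print(f"{chip} at {clock_ghz} GHz: about {cycles:.0f} cycles per trip to memory")
```

Whatever the real latency is, the higher-clocked chip burns proportionally more cycles on each trip to memory, which is why a bigger cache pays off more for it.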
It's tough to give any sort of concrete "Chip A is better for this type of tasks, chip B is better for that type of task" breakdown with these chips, because there are MANY factors that come into play here, not the least of which being the compiler used. For example, while everyone always points to media encoding as being the P4's real strong suit, and benchmarks using the DivX codec support this, when using the XVid codec the Athlon actually ends up being faster. The two codecs are very similar in their design and both can produce great looking video at relatively low bit rates, but they are obviously different enough that their performance varies a fair bit. When you throw the Apple G5 into the mix things get even trickier because you're talking about a different ISA, different OS, different compilers, etc. etc.
It depends on what floating point operations you're talking about. On some floating point ops, the PowerPC 970 can significantly outperform an Intel P4 (the much publicized/hyped Linpack test for Top500.org supercomputer list is a good example of this). On other FP ops, the P4 can significantly outperform the PPC 970. Overall though, they're fairly close, though I would tend to give the edge to the P4 if for no other reason than because there are better compilers for it. While lots of people like to argue that better compilers do not equal a better chip, in the real world the only thing that matters is the performance you get in the end, and compilers play a part in that performance.
"Stuck" as in dual channel DDR and now quad channel RDRAM solutions?
Who cares how much bandwidth you can get from memory to the memory controller, basically your CPU is the only consumer of that bandwidth (except the odd DMA transaction, but those are usually measured in the MB/s, not GB/s) and it's stuck getting that data at 533MT/s (4.2GB/s).
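For anyone wondering where the GB/s number comes from, it's just transfers per second times bus width. A quick sketch, assuming a 64-bit (8 byte) data bus and decimal gigabytes; the function name is made up for illustration.

```python
BUS_WIDTH_BYTES = 8  # assumption: 64-bit wide data bus


def bus_bandwidth_gb_s(transfers_mt_s, width_bytes=BUS_WIDTH_BYTES):
    """Peak bandwidth in GB/s (decimal) for a bus doing N million transfers per second."""
    # MT/s * bytes per transfer = MB/s; divide by 1000 for GB/s.
    return transfers_mt_s * width_bytes / 1000.0


for rate in (533, 800):
    print(f"{rate} MT/s -> {bus_bandwidth_gb_s(rate):.2f} GB/s peak")
```

These are theoretical peaks. The 533MT/s case comes out between 4.2 and 4.3GB/s depending on how you round, and the 800MT/s case at 6.4GB/s.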
Xeons are still 533mhz because 800 on the desktop is basically marketing tripe - it really doesnt make your computer perform any better.
What the heck have you been smoking?! It sure as hell does help your computer perform better! There are DOZENS of tests out there to back this up! What's more, with Intel's slightly dated shared bus setup, running a dual-processor setup makes the slow bus even more of a bottleneck.
Fortunately Intel is well aware of this problem and they have full plans to upgrade their Xeon line to 800MT/s bus speeds with their next revision of the chips (Nocona, basically the Xeon version of a "Prescott" P4).
I don't know if you want to wait, but Intel is planning on moving their Xeon line to 800MT/s bus speeds later this year, probably mid-summer.
Of course, that being said, if you're looking at memory issues, the Opteron is definitely the way to go (except maybe for IBM's Power4 or Intel's Itanium, but they're both MUCH more expensive). Even though the Opteron and the G5 have the same theoretical memory bandwidth (6.4GB/s), the integrated memory controller of the Opteron will provide you with more real-world bandwidth. Add to that the NUMA design so that bandwidth scales with additional processors and it quickly gains a clear performance edge. Plus, to top it off, the integrated memory controller gives you SIGNIFICANTLY lower memory latency, something that is often even more important than high bandwidth.
Of course, price is a different matter, but they should be quite close. HP's new 2P Opteron servers are quite reasonably priced, shaving several hundred dollars off the price that IBM charges for their Opteron systems.
Also for math (especially floating point) calculations, the G5 (PPC970) is much superior to the Intel IA-32 (not really a big thing if all you do is run Word, of course).
That's a VERY broad statement there, and not really backed by much fact. For certain applications I'm quite certain that the PowerPC 970 is quite a bit faster than any x86 chips, but in other applications it's probably quite a bit slower, while overall they would seem to be fairly close.
Probably the most comprehensive cross-platform CPU benchmark we've got is SPEC CPU2000. It's far from perfect, but at least it's widely used. The best numbers I've seen for the PPC 970 are 937 CINT_base and 1051 CFP_base at 1.8GHz (numbers available in this product overview from IBM). Very respectable performance, and the 2.0GHz PPC 970 should be a bit higher, but it's not quite class-leading.
For comparison, a top-end Opteron system (Opteron 148, 2.2GHz) managed 1304 CINT and 1505 CFP. The Xeons are in the same basic range with a score of 1532 CINT and 1338 CFP. And before anyone goes crying foul because of unfair compilers or anything like that, the Opteron numbers are achieved using GCC.
According to a talk by "Dr. BigMac" (from VA Tech) the only other high-volume CPU approaching it was the Intel Itanium, and here (quite an irony) Intel was under-clocked! (The G5, last year, was shipping at 2Gh, the Itanium less than that).
Ol' Dr. BigMac was basing his decision only on the specific performance tests he felt were important. In this case, that test was Linpack, where the PPC does very well. Linpack is certainly not the only measure of processor performance; it's actually a VERY limited test, albeit one that is applicable to many types of scientific computing.
As for the Itanium it's likely more an issue of price rather than clock speed. When you look at the real-world performance of the Itanium2 1.5GHz vs. PPC 970 2.0GHz in Linpack, they're pretty close (probably within 5%). However a "cheap" dual-Itanium node will set you back a cool $15,000 or so, while a similarly equipped dual-G5 system from Apple will only cost you about $5000.
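To make the price argument concrete, here's a back-of-the-envelope sketch of Linpack performance per dollar using the rough node prices above. Since I only said the two are "probably within 5%", it tries the performance ratio at 0.95, 1.00 and 1.05 rather than pretending to know the exact figure. The function name is made up for illustration.

```python
def relative_value(perf_ratio, price_a, price_b):
    """How many times more performance per dollar system B gives than system A.

    perf_ratio is B's performance divided by A's performance.
    """
    return perf_ratio * price_a / price_b


ITANIUM_NODE_USD = 15_000  # rough dual-Itanium node price from the post
G5_NODE_USD = 5_000        # rough dual-G5 price from the post

# Sweep the G5/Itanium Linpack ratio across the "within 5%" window.
for perf_ratio in (0.95, 1.00, 1.05):
    value = relative_value(perf_ratio, ITANIUM_NODE_USD, G5_NODE_USD)
    print(f"G5/Itanium perf ratio {perf_ratio:.2f}: {value:.2f}x the Linpack per dollar")
```

Anywhere in that window the G5 node comes out at roughly three times the Linpack per dollar, so the 5% either way hardly matters next to the price gap.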
As for Pixar themselves, it's quite possible that they went through some benchmarks and found that the PowerPC 970 offered better performance for their particular work than any x86 chips. As mentioned above, there are some areas where the PPC970 does excel. However, I suspect that there was a STRONG incentive to find the PPC970 fastest regardless of what the actual performance was.
Definitely not BS, though whether or not it's useful depends heavily on your application.
The idea behind hyperthreading is that the P4's long pipeline will often stall with only a single thread going through. With hyperthreading you run two threads at once, so when one thread stalls you just start up the other thread and go with that one for a while. In a way it's almost like a poor man's dual-processor system, giving you two logical processors on a single chip.
Now, obviously there are a few things to consider here. First off, if ALL of your processing is being done in a single thread then you aren't going to see any benefit to hyperthreading, and in fact the extra overhead might even make things a bit slower (usually only 1-2% slower).
Games almost always do all their major processing in a single thread. Even if they have extra threads hanging around, you almost always spend 99%+ of your time in a single thread. For this reason, games see virtually no benefit to hyperthreading (they don't see much/any benefit from dual-processor setups either).
On the other end of the spectrum, some applications see up to a 25% performance boost when hyperthreading is enabled. In the tests I've seen, the biggest improvements have been in things like Photoshop and rendering applications. Some server applications should benefit as well.
The other boost that hyperthreading gives you, like with a real dual-processor setup, is that it makes multitasking a bit "snappier". This is by no means a night-and-day difference here, but it is there.
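A crude way to see why games get nothing and renderers get a lot is an Amdahl's-law style estimate: only the part of the run time that can actually use both logical processors speeds up. This is my own simplified model, not anything from Intel, and it ignores the 1-2% overhead mentioned above. The 25% figure is the best case quoted above; the 1% threaded share for games follows from the "99%+ of your time in a single thread" remark.

```python
def overall_speedup(threaded_fraction, threaded_gain):
    """Amdahl-style estimate of whole-program speedup.

    threaded_fraction: share of run time spent in code that can use both
        logical processors (0.0 to 1.0).
    threaded_gain: throughput multiplier for that share (1.25 = 25% faster).
    """
    serial_fraction = 1.0 - threaded_fraction
    return 1.0 / (serial_fraction + threaded_fraction / threaded_gain)


BEST_CASE_GAIN = 1.25  # the "up to 25%" boost quoted above

# Workload names and threaded shares are illustrative guesses.
for name, fraction in (("game", 0.01), ("half threaded", 0.5), ("renderer", 1.0)):
    speedup = overall_speedup(fraction, BEST_CASE_GAIN)
    print(f"{name}: {(speedup - 1.0) * 100.0:.1f}% faster overall")
```

With 99% of the time in one thread the gain rounds to basically zero, which matches what the game benchmarks show.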
Perhaps a better test would be for them to figure out what people actually do with their servers and try to duplicate it as closely as possible.
Ohh wait, that's exactly what they did.
Yes. It has a narrower bus to the processor when compared to L2 cache and has higher latency, but it's still on-die and running at full core speed.
Cooling is/was more important, especially for the older T-birds.
Cooling is VERY important for all current processors. It all becomes relative. When the Thunderbird was current, it used quite a bit more power than the PIII that it competed against, so cooling was very important for that chip. Now, the ~70W that the T-Bird used is not at all abnormal, this is the same basic power consumption of an AthlonXP (Barton or Thoroughbred), Athlon64/Opteron, P4 "Northwood" or even an IBM PowerPC 970 (aka Apple G5). The Intel P4 "Prescott" chips are a bit hotter, so for the moment people are talking a lot about how hot those chips get, but give it another year or two and 100W TDP probably will not be at all abnormal for processors.
You have to be more careful not to crack the chip when putting the heatsink on.
I've always wondered how in the hell anyone managed to crack their chips putting heatsinks on. I've put heatsinks on a fair number of AthlonXP chips and never even seen a possible way to crack the core. Do people usually install heatsinks using a hammer or something?! While I do like the newer retention mechanism used on Intel P4 chips or AMD Athlon64/Opteron chips, it's really NOT difficult at all to put a heatsink on an AthlonXP processor.
AMD has had software issues. For example, Win2K had to be patched to SP1 because AMD messed up AGP coherence. AMD also never told the Linux developers about this same problem causing numerous people to have system crashes when using nVidia cards/drivers under linux. You'd think they would have sent a fricken e-mail to the kernel-dev lit.
You need to do a bit more reading on that problem, it was actually caused by nVidia's drivers doing some really stupid things (ATI does/did the same stupid things, as did most other video card vendors). It was only a matter of sheer luck that the problem DIDN'T affect Intel chips in the same way.
The difference isn't all that large actually, though ICC is generally still a bit faster than GCC on Athlon64/Opteron chips. If you look at the SPEC CPU2000 results you can find a bunch of different Opteron results. Some of these results can be used to show a direct comparison between ICC and GCC, and they're often within about 10% of one another. In fact, when you start dealing with 64-bit Linux and 64-bit GCC code it often ends up being faster than 32-bit ICC code.
As for the compiler used in the article, they were using Microsoft SQL Server as best as I can tell, so I think that you can be fairly certain that the compiler used was MS Visual Studio.Net.
You do seem to be correct that companies don't want to emphasize NUMA too much from the software side of things. AMD has also been known to say that their Opterons don't need any NUMA optimizations.
They actually used the term "SUMO" (complete with a drawing of a sumo wrestler in one of their presentations! :) ), for Sufficiently Uniform Memory Organization.
Regardless of what the marketing department says and to whom, it's still NUMA. A test on Ace's Hardware a while back showed that a 2P Opteron can be up to 20% faster with NUMA optimizations, and presumably a 4P Opteron will see an even bigger boost.
I would guess that Anand's test linked in this article had those NUMA optimizations enabled already, though he doesn't specify any BIOS options. He did use Win2K3 which is NUMA aware.
I think that a large part of the story was simply that the new Xeons are no longer getting hugely beaten by the Opteron like they were in the previous test performed by the same site. While the Opterons still held a noticeable lead in 4P setups, the 2P systems were all pretty close.
In Anand's previous test comparing the two chips the Opteron came out WELL ahead of the Xeon. Of course, the tests were somewhat different, and it's been demonstrated several times that the Opteron is the chip to beat in web serving (subject of the first test) but things are much closer in more pure-database tests (second article).
As for other platforms, I do wish they had been able to throw an Itanium2 setup into the mix. In fact, what would be even more interesting is if/when they get Win2k3 64-bit edition running on the Opteron (and maybe Xeons as well) and THEN compare it to the Itanium2.
They WILL standardize on a socket, it's just that the socket will be Socket 939 and not the current one.
It's pretty much the same story with SlotA/SocketA. They had an initial design that was quickly replaced. The second socket then stuck it out for the duration.
Intel did pretty much the same thing with their P4, initially releasing it on socket 423 and then quickly moving to socket 478 which has lasted for several years now (though it too will soon be replaced).
Markets change, technology changes, and sometimes sockets need to change with them. Remember that the specification for Socket 754 and Socket 940 for current Athlon64 chips was set in stone about 3 years ago (before the first beta chips taped out), and a lot has changed since then. AMD has gone to great lengths to minimize socket changes, but there's only so much that they can do.
I highly suspect that they actually are running off a variant of WinCE, seeing as they already have a current version of that OS supporting PowerPC (not the PPC 970, but it shouldn't take much to get that supported).
By "NT kernel" I suspect that they are using the term VERY loosely. Microsoft seems to use the same basic kernel concept for all their operating systems, all based off the old WinNT microkernel.
Remember, this news came from The Inquirer, so don't expect too many technical details (or accuracy for that matter).
Don't hold your breath. XBox2 games will be primarily Win32 and DirectX API games, neither of which run on OS X. Just because the hardware is similar doesn't mean that the systems are all that closely related.
Be VERY careful with IBM's power specs, they don't tell a very complete picture.
For example, IBM lists the power consumption of the PPC 970 as being something like 48W "typical" power consumption at 1.8GHz. Unfortunately the maximum real-world power consumption is a LOT higher than that, and when you start comparing a 2.0GHz PPC 970 (aka G5) to a modern x86 chip from AMD (2.0GHz Athlon64) or Intel (3.2GHz P4), you end up with pretty darn similar power consumption figures.
Besides, it's not like Intel doesn't already have a low-powered design in their Pentium-M. Intel has even hinted that they might make a dual-core version of that processor sometime late next year.
In the end, I'm sure that the MAIN deciding factor here is cost, nothing more, nothing less.
As for an MS supported OS on PPC, that already exists. Current versions of WinCE run on PPC chips.