Should have waited for the new 15W TDP Pentium, essentially an i3 without built-in GPU.
I'm beating myself up for not waiting for it >:/
Fairly good summary of the situation, but I think you can cut it even shorter:
People chose Cray for the I/O systems and the expertise available. The I/O just happens to be built around Opterons, since that's what it was first designed for, back when Opterons kicked Xeons' ass.
The SI prefixes are specifically base-10 units, and have been since the 1800s, first with the metric system and later adopted into the SI system. The fact that computer scientists and programmers misused the units and disregarded an established standard of communications and data encapsulation, and the fact that people STILL do it, is what's vexing, not the fact that the storage manufacturers have taken to using the proper approach.
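To put numbers on the gap, here is a minimal Python sketch comparing the decimal (SI) prefixes with the binary (IEC) ones for the same byte count. The `to_unit` helper and the "1 TB" drive are made up for illustration, not taken from any particular tool.

```python
# Decimal (SI) vs binary (IEC) prefixes applied to the same byte count.
SI = {"kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
IEC = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def to_unit(n_bytes, unit):
    """Convert a byte count to the given SI or IEC unit (hypothetical helper)."""
    factor = SI[unit] if unit in SI else IEC[unit]
    return n_bytes / factor


drive = 10**12  # hypothetical drive sold as "1 TB" (SI, base 10)
print(f"{to_unit(drive, 'TB'):.3f} TB")    # what the label says
print(f"{to_unit(drive, 'TiB'):.3f} TiB")  # what a base-2 tool reports
print(f"{to_unit(drive, 'GiB'):.1f} GiB")
```

The drive holds exactly the number of bytes the label promises; the "missing" space only shows up when a tool counts in powers of two but prints the SI prefix.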
Your application is one in which GPUs normally excel, so I have to say that yours must be very badly written.
By treating it as streams of texture sets, rather than just working chunk by chunk, you improve performance. That way, you can just set up different Streaming Processors in a chain to perform the various steps. When programmed in that way, my dual quad-core Xeon is outperformed by an old GTS 250.
When you program a GPU, even with CUDA or OpenCL, a DSP programming mindset is more appropriate than a general-purpose CPU programming mindset.
Actually, right now, 2 years behind the curve uses more electricity than being on the edge
Look at how he responded to one of my posts.
So that's AMD's marketing approach these days: "Anyone who doesn't buy our systems is worthless, a fanboy and we'll deride them, and hope they buy our systems when we've insulted them enough"? And in regards to overclockers, do you really think they are the only hobbyists? Seriously? And even then, many overclockers who know what they are doing will buy them and make use of them.
For his work, the 3930 will beat dual Opterons, because the computational tasks are not that easily parallelized, so you need strong per-core performance. Which these have... and which AMD's current and last two generations haven't had.
Personally, I moved away from Opterons in my workstation, because they simply couldn't keep up with Xeons. When my current workstation gets a bit too slow for what I'm doing, I'll carefully look at both options available, but in the current state, going with AMD would be unjustifiable and thus very unprofessional.
Ignore Unity100, he's astroturfing for AMD's marketing department
Oh, you're astroturfing for AMD's marketing department again.
Many hobbyists will easily spend $1k on this, it's a fact. Hell, averaged over a lifetime, computing is still a cheap hobby, compared to things like flying, amateur motor racing etc.
As for the double cards, you may very well need the CPU to generate the data that you want to display over all those cards.
My brother, for example, would love this CPU for the CAD work in his hobby (he designs and builds stuff), and all the CUDA/OpenCL modules have had serious deficiencies.
They sent it out to some reviewers. Sweclockers has it, for example. Google Translate it or something if you don't know Swedish: http://www.sweclockers.com/recension/14699-intel-core-i7-3930k-och-3960x-sandy-bridge-e
I'm disappointed with its power consumption, same as with the Bulldozer. However, unlike with the Bulldozer, you actually get computational performance to match the power consumption.
Well, they want to see your raw abilities, not your development environment's abilities.
"That's kind of the definition of parallelizable and is the ideal case. Actually, it's the case I have. It means that I pay a hefty premium for the fast HT links and large system image, but it's still the cheapest way of getting high density computing."
No, the definition of embarrassingly parallel, aka the ideal case, is when a single task can easily be spread over multiple processors with little performance loss due to overhead/locks/stalls or simply waiting for other processors to finish their job.
Raytracing is trivial to parallelize. CFD, not so much...
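One rough way to quantify this is Amdahl's law, which I'm bringing in here as an illustration (it's not something from the posts above). The sketch below assumes the only loss is a fixed serial fraction and ignores locks, stalls and communication, so real numbers will be worse. The parallel fractions are made-up examples, not measurements of raytracing or CFD.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Ideal speedup on n_cores when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


# Hypothetical workloads; the fractions are illustrative assumptions only.
workloads = [
    ("nearly ideal", 0.999),
    ("mixed", 0.90),
    ("mostly serial", 0.50),
]

for label, p in workloads:
    print(f"{label} (p={p}): {amdahl_speedup(p, 48):.1f}x on 48 cores")
```

Even a 10% serial part caps 48 cores at well under 10x, which is why per-core performance still matters for anything that isn't close to the ideal case.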
"The cluster I use spends a good fraction of the time maxed out."
Once again, your field is not the same as all fields. In quite a few fields you have peak demands, followed by periods when the systems are idle. Massive real-time statistics systems with hundreds of users for example, with very few working graveyard shifts... That means they need massive amounts of computational resources during office hours, but the system is idle during the night for example.
No, I specifically compared Bulldozer with Sandy Bridge.
If we add in the x6 (Thuban core), it just gets MORE embarrassing for Bulldozer, considering that Thuban is on the 45nm process while Bulldozer is on the 32nm process, yet Thuban has lower power consumption and overall comparable performance to the 8150.
The direct reason the Jaguar/Titan is upgraded with Interlagos is that it only requires a major redesign of cooling. Everything else, including the internode memory controllers etc., doesn't require much in that way, making it mostly (keyword: mostly) a drop-in replacement. I'm not saying Interlagos is bad or anything, I'm just pointing out that in the case of Jaguar/Titan, it's for reasons of economy and engineering simplicity, on the hardware side at least.
The software side is going to be "somewhat" more tricky.....
"Yes, but if you're running a cluster, you are by definition running problems that parallelize well. If your workload isn't parallelizable, then the best you can do is run the single thread on the most overclocked, most expensice i7 you can get your hands on."
This is not true these days, since many use clusters even for tasks that are not easily parallelizable, simply because that's what's available.
Also, the 12-core 6100 is Magny-Cours which is not based on Bulldozer. Bulldozer-based Opterons are under the name Interlagos.
I specifically stated that the Bulldozers run really hot, I said nothing about other AMD chips.
Also, in your calculations, add in the additional space required for the extra cooling equipment etc. And renting that is rather expensive in many places.
This is entire system power use, minus monitor. The systems also used the same PSUs, to remove that factor from the comparison.
The Sandy Bridge chipset and CPU revisions really cut down power consumption.
The i7-2600K put under heavy load sucked down 164 watts, the FX-8150 sucked down 243 watts. The i5-2500K sucked down 148 watts under the same heavy load. Another interesting comparison is the A8-3850, which sucked down 165 watts under the same heavy load.
Or you may just as likely have 48 very CPU-intensive non-parallelizable tasks.
Or most likely of all, looking over all the different fields, you have a mix of tasks that utilize CPUs differently, and find that at peak use, you need 48 cores.
The problem for AMD is, over the lifespan of a cluster/supercomputer/data center, the major cost isn't manpower, it's floorspace, power and cooling. These Bulldozer cores use drastically more power and run MUCH hotter than the Intel parts that are 10 months older. Also, not all workloads (even in science) are easily parallelized, so the overall balance of performance advantage leans towards Intel.
Using the same memory, SSDs, GPU and such, the FX-8150 gurgles down 79 watts more under heavy load than the i7-2600K.
Bulldozer=SUV of CPUs....
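For anyone who wants to redo the arithmetic, here is a small Python sketch using the whole-system figures quoted above. The electricity price and the `duty_cycle` parameter are assumptions for illustration, and cooling overhead is left out, so treat the result as a lower bound rather than a real cost model.

```python
# Whole-system draw under heavy load, in watts, as quoted in this thread.
load_watts = {"i5-2500K": 148, "i7-2600K": 164, "A8-3850": 165, "FX-8150": 243}

HOURS_PER_YEAR = 24 * 365
PRICE_PER_KWH = 0.15  # hypothetical electricity price; adjust for your site


def extra_kwh_per_year(system, baseline, duty_cycle=1.0):
    """Extra energy per year used by `system` over `baseline`.

    duty_cycle is the assumed fraction of the year spent under heavy load.
    """
    delta_w = load_watts[system] - load_watts[baseline]
    return delta_w * HOURS_PER_YEAR * duty_cycle / 1000.0


delta = load_watts["FX-8150"] - load_watts["i7-2600K"]
kwh = extra_kwh_per_year("FX-8150", "i7-2600K")
print(f"{delta} W more under load")
print(f"{kwh:.2f} kWh extra per year at full load")
print(f"about {kwh * PRICE_PER_KWH:.0f} per year per box at the assumed price")
```

Multiply that by the number of nodes and add the cooling needed to remove the extra heat, and the gap grows quickly.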
That should have said "Antarctica's having a mild winter by Antarctica standards"
So Antarctica is having a rather mild winter then... -45 Celsius is what we had for a while during Arctic Survival courses I took in the military, up in northern Lapland.
And before that there was Menachem "everything the Palestinians learned about terrorism, they learned from me and my friends" Begin.
And before the apologists start, keep in mind that even many Jews condemned Begin (and also Shamir, Stern etc.)
I'd go so far as to say that Irix didn't just have the best WM, it had the best X implementation too.
Another awesome feature was the scaling function for icons etc., which made it very fast and easy to adapt things to your preferences.
"Hardware support for BCD or decimal FP? Because x86 has had hardware BCD support since the 8086, and now you can do BCD with SSE. How may digits are your BCD values?"
Use of both. And the "hardware support" on x86 for BCD is... slow, and takes way more cycles than should be needed. And they are using 8-byte packed BCD.
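For readers who haven't dealt with packed BCD, here is a minimal Python sketch of what an 8-byte value looks like: two decimal digits per byte, one per nibble. It assumes an unsigned 16-digit layout purely for illustration; IBM's packed decimal format keeps a sign in the last nibble, and I'm not claiming this is the exact layout the client uses. `pack_bcd` and `unpack_bcd` are hypothetical helper names.

```python
def pack_bcd(value, n_bytes=8):
    """Pack a non-negative integer into n_bytes of unsigned packed BCD."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    digits = str(value).zfill(2 * n_bytes)
    if len(digits) > 2 * n_bytes:
        raise ValueError("value does not fit in the requested width")
    # High nibble holds the first digit of each pair, low nibble the second.
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )


def unpack_bcd(data):
    """Decode unsigned packed BCD bytes back into an integer."""
    value = 0
    for byte in data:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError("invalid BCD nibble")
        value = value * 100 + high * 10 + low
    return value


packed = pack_bcd(1234)
print(packed.hex())
print(unpack_bcd(packed))
```

The per-nibble loop is the kind of work software ends up doing when the hardware doesn't handle the whole field natively.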
Note, I was brought on for a specific niche here, tweaking and tuning the Infiniband setup.
As for GiB/s numbers, it depends on the time of day and time of year; 25-30 GiB/s to and from the storage array is not unusual. When the project was about halfway deployed, we managed to saturate 8 of the 12 Infiniband links to the storage array during a peak demand, though some of the most intense users had already been connected by then. The storage array has a pair of RAMSAN 630 devices as a buffer for recent/frequently requested data.
More interesting to mention is the fact that the whole setup serves about 15000 concurrent "terminals" (read: workstations/desktops) nationwide, spread over hundreds of offices, some with gigabit access, some with 100 megabit access, working with statistical data, payroll/budget processing, analysis, forecasting etc., with strict separation of users/privileges, audit trails etc. And of course everything is encrypted by default.
What I mean by lackluster on x86 etc. is that I/O is still limited by a sequential bus, and even with DMA etc., the CPU STILL has to do some of the I/O shuffling gruntwork. On the mainframe, you have channels that can be individual or bonded as per your needs. The mainframe processor just tells a channel processor "here, job to do" and then proceeds with the next bit of processing it has to do.
That also has benefits if you move on to virtualization.
Actually, it is the numbers we generated on our own that I'm running. For the project I worked on, a single loaded mainframe outperformed the Altix, the off-the-shelf Dell cluster and a couple of other solutions the client looked at. The reasons: hardware support for BCD and the massive external I/O.
As for partitioning, in secure environments, the low overhead and the ease with which you can do it on IBM's mainframe reduces the operational costs.
The biggest operational cost over the years is floorspace+cooling+power, and that's where the real gain is, and that's where my clients really learned the difference. The primary and the backup system, complete with their storage arrays, cost just slightly more than just the primary off-the-shelf Dell system when factoring in the number of spares that have to be running just to keep the primary system operational in case of failures. Add to that the state of immaturity of reliable failover systems in the Linux world and the operational costs skyrocket.
As for Westmere, it has nice performance for FP math or non-BCD integer math, and it has nice I/O to RAM/local devices, but external I/O is... lackluster compared to what a z10 can do.
My personal workstation is a dual quad-core Xeon with a crapload of RAM, because it fits the tasks I personally work with better than a z10 would, but if I were to actually work fulltime with the sort of stuff my last client uses their systems for, it'd be mainframes all over, because the performance and reliability for those tasks is just unparalleled by anything x86-based.
You're just showing how little you know.
When it comes to, for example, IBM's mainframes: for the jobs where they are used, they massively outperform any Intel/AMD cluster, both in raw performance and in operational costs over the years.