IEEE Says Multicore is Bad News For Supercomputers
Richard Kelleher writes "It seems the current design of multi-core processors is not good for the design of supercomputers. According to IEEE: 'Engineers at Sandia National Laboratories, in New Mexico, have simulated future high-performance computers containing the 8-core, 16-core, and 32-core microprocessors that chip makers say are the future of the industry. The results are distressing. Because of limited memory bandwidth and memory-management schemes that are poorly suited to supercomputers, the performance of these machines would level off or even decline with more cores.'"
Sounds like it's time for supercomputers to go their own way again. I'd love to see some new technologies.
If you run a simulation like that while keeping the memory interface constant, then of course you'll see diminishing returns. That's why we're no longer running plain old FSBs: AMD has HyperTransport, Intel has QPI, the AMD Horus system expands it up to 32 sockets / 128 cores, and I'm sure something similar can and will be built as a supercomputer backplane. The headline is more than a little sensationalist...
Live today, because you never know what tomorrow brings
Well, we are talking about CPU to RAM, not the hard drive. But it is a similar process: the RAM is orders of magnitude slower than the CPU. When the CPU talks to the RAM, the request goes over the bus to the RAM and the data comes back through the bus to the CPU. With single-core fast CPUs you can have a bus for each core, which is like adding more lanes to a highway: it allows more traffic. The CPU may still be waiting for the RAM, but it will be faster because you are not waiting for your bits while another core has requested some other bits.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Once we get to 32 or 64 core cpus that cost less than $100 (say, five years), I'd HATE to have a beowulf cluster of those!
I hold very few opinions. I hold information based on observation and fact. If you wish to disagree, please use facts.
>>>"After about 8 cores, there's no improvement," says James Peery, director of computation, computers, information, and mathematics at Sandia. "At 16 cores, it looks like 2 cores."
>>>
That's interesting but how does it affect us, the users of "personal computers"? Can we extrapolate that buying a CPU larger than 8 cores is a waste of dollars, because it will actually run slower?
FOX NEWS.com should be BANNED from television and internet. Have the Congress take it over and give us Truespeak.
That to remove the 'memory wall', main memory and CPU will have to be integrated.
I mean, look at general-purpose computing systems past & present: there is a somewhat constant relation between CPU speed and memory size. Ever seen a 1 MHz. system with a GB. RAM? Ever seen a GHz. CPU coupled with a single KB. of RAM? Why not? Because with very few exceptions, heavier compute loads also require more memory space.
Just like the line between GPU and CPU is slowly blurring, it's just obvious that the parts with the most intensive communication should be the parts closest together. Instead of doubling the number of cores from 8 to 16, why not use those extra transistors to stack main memory directly on top of the CPU core(s)? Main memory would then be split up into little sections, with each section on top of a particular CPU core. I read sometime that semiconductor processes that are suitable for CPUs aren't that good for memory chips (and vice versa). I don't know if that's true, but if so, let the engineers figure that out.
Of course things are different with supercomputers. If you have 1000 'processing units', where each PU consists of, say, 32 cores and some GBs of RAM on a single die, that would create a memory wall between 'local' and 'remote' memory. The on-die section of main memory would be accessible at near CPU speed; main memory that is part of other PUs would be 'remote', and slow. Hey wait, sounds like a compute cluster of some kind... (so scientists already know how to deal with it).
Perhaps the trick would be to make access to memory found on one of the other PU's transparent, so that programming-wise there's no visible distinction between 'local' and 'remote' memory. With some intelligent routing to migrate blocks of data closer towards the core(s) that access it? Maybe that could be done in hardware, maybe that's better done on a software level. Either way: the technology isn't the problem, it's an architectural / software problem.
So you're saying that next-generation processors need a gig of cache, plus 4 gigs of RAM.
I think what is really needed is new OS designs. Something that is no longer tied quite as closely to the hardware, so that new hardware ideas can be tried.
i thought once I was found, but it was only a dream.
I once heard someone define a supercomputer as a $10 million memory system with a CPU thrown in for free. One of the interesting CPU benchmarks is to see how much data it can move when the cache is blown out.
Mea navis aericumbens anguillis abundat
This doesn't quite make sense to me. You wouldn't replace a 64 CPU supercomputer with a single 64 core CPU, but would instead use 64 multicore CPUs. As production switches to multicore, the cost of producing multiple cores will be about the same as the single core CPUs of old. So eventually you'll get 4 cores from the price of 2, then get 8 cores from the price of 4, then 16 for the price of 8, etc. So the extra cores in the CPUs of a supercomputer are like a bonus, and if software can be written to utilize those extra cores in some way that benefits performance, then that's a good thing.
Better known as 318230.
Maybe they should do something like we did back when the Paragon (yes, that far back) had multiple CPUs on a node and the memory bandwidth wasn't enough to support them all simultaneously... Don't use some of the CPUs on the card (leave some idle) so that all the bandwidth is available to the one, or few, cores that need it. Alternatively, figure out a way (algorithms) to make sure that no more than one core is memory intensive at a time... take turns being bandwidth intensive. Or just realize, as has always been the case, that some solutions/algorithms just aren't optimal on commodity hardware.
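To make the "take turns being bandwidth intensive" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration (the names `memory_turn`, `worker` and `run` are made up, and the list copy merely stands in for a bandwidth-heavy phase). Python threads will not show real bandwidth effects; the point is only the structure: a semaphore serializes the memory-bound phase while the compute phase stays parallel.

```python
import threading

# Hypothetical policy: at most one worker may be in its
# bandwidth-heavy phase at any moment.
memory_turn = threading.Semaphore(1)

def worker(data, results, index):
    # Memory-bound phase: hold the turn while streaming through the data.
    with memory_turn:
        local = list(data)  # stands in for a bandwidth-heavy copy
    # Compute-bound phase: runs without holding the turn.
    results[index] = sum(x * x for x in local)

def run(num_workers=4, n=1000):
    data = range(n)
    results = [None] * num_workers
    threads = [
        threading.Thread(target=worker, args=(data, results, i))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Raising the semaphore count to 2 or more would correspond to letting a few cores share the bandwidth instead of strictly one.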
Well in a way it could be. I'd read the spectrum article some time back, but since I work in the field I can give some insight.
RAM latencies are a huge hit for applications that are based on random access. DRAMs etc. don't actually do random access the way you'd want: they take a long time to access one memory location, and then provide faster access to some successive elements. New processor architectures based on smart caches and intelligent memories could be a lot more useful, but basically a rethinking of processor architecture is involved. In the end, electrical and computer engineering is still just that: engineering. There will always be tradeoffs.
blog plug -> The Darker Side of Light
And do you need a supercomputer to run a spellchecker ?
I'd give my right arm to be ambidextrous.
Isn't this already a problem in today's computers? The CPU isn't the bottleneck, the HDD is.
Generally this isn't true if you're talking about a supercomputer because of the tasks they'll be performing. You don't build supercomputers to be file servers (or even database servers, which can still use a lot of CPU)
For a given node count, we've seen increases in performance. The claimed problem is that, for the workloads that concern these researchers, they don't see anyone projecting significant enhancements to the fundamental memory architecture that would keep pace with the scaling of multi-core systems. So you buy a 16-core chip system to upgrade your quad-core based system and hypothetically gain little despite the expense. Power efficiencies drop and getting more performance requires more nodes. Additionally, who is to say that clock speeds won't drop if programming models in the mass market change such that distributed workloads are common and single-core performance isn't all that impressive?
All that said, talk beyond 6-core/8-core is mostly grandstanding at this time. As memory architecture for the mass market is not considered as intrinsically exciting, I would wager there will be advancements that no one speaks to. For example, Nehalem leapfrogs AMD memory bandwidth by a large margin (like by a factor of 2). It means if Shanghai parts are considered satisfactory today to get respectable yield memory wise to support four cores, Nehalem, by a particular metric, supports 8 equally satisfactorily. The whole picture is a tad more complicated (i.e. latency, numbers I don't know off hand), but the one metric is a highly important one in the supercomputer field.
For all the worry over memory bandwidth though, it hasn't stopped supercomputer purchasers from buying into Core2 all this time. Despite improvements in their chipset, Intel Core2 still doesn't reach AMD performance. Despite that, people spending money to get into the Top500 still chose to put their money on Core2 in general. Sure, Cray and IBM supercomputers in the Top2 used AMD, but from the time of its release, Core2 has decimated AMD supercomputer market share despite an inferior memory architecture.
XML is like violence. If it doesn't solve the problem, use more.
You might see a supercomputer designed around other RISC processors such as ARM. A supercomputer using ARM perhaps takes more chips, but the power savings are substantial compared to x86. Furthermore, companies like Nvidia with their Tesla platform are pushing into the supercomputing space with specialized chips that are purposefully designed to deal with large linear problem solving. Interestingly, the Tesla chip is a multicore chip as well. http://www.nvidia.com/object/product_tesla_s1070_us.html
Except that will obviously be slower due to the overhead of abstracting from the hardware. And that is already what most OSes out there do. Linux on RISC does exist, and so does Linux on *. But there is a drawback to all that.
Someone who has already thought of this... http://www.nytimes.com/2008/03/24/technology/24wafer.html?_r=1&ref=technology
Don't feed the penguins
Would optical get around such a barrier?
Physical space seems to be one of the major hurdles in CPU design today, due to leakage with the ever shrinking processes.
And i think it is about damn time that new silicon laser receiver thing (forgot the details) was put into implementation and testing.
IBM is already working on it. Stay tuned.
Only for Office 2007.
http://rocknerd.co.uk
So they can play GTA IV in their time off.
http://rocknerd.co.uk
Like on the 386 0 wait computers.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
The issue is with a single processor that has multiple cores.
There's no real way to split the banks for each core, so the net effect is that you have 4-32 cores sharing the same lanes for memory.
No, sorry. That's how Phenom processors are *already* working.
Each physical CPU package has two 64-bit memory controllers, each controlling a separate bank of 64-bit DDR2 memory chips (one for each of the two banks on a dual-channel motherboard).
Phenom has two modes of operation:
- Ganged: both memory controllers work in parallel, as if they were one huge 128-bit memory connection. That's how dual channel has worked since it was invented.
That's good for systems running a few very bandwidth-hungry applications (for example: benchmarks).
- Unganged: each memory controller works on its own. Thus you have two completely separate 64-bit memory channels accessible at the same time. By correctly laying out the applications in memory thanks to a NUMA-aware OS (anything better than Windows Vista), two separate applications can each access their own memory at the exact same moment, although at only half the bandwidth *per process* (but still the same total bandwidth for all processes running at the same time on a multi-core chip).
This is perfect for systems running lots of tasks in parallel, and is the default mode on most BIOSes I've seen.
This gives a tremendous boost to heavily multi-tasked applications (a busy database server, for example), and it's what TFA's authors are looking for.
At some point in the future, Intel will probably follow the same trend with its QPI processors.
Also, the future trend is to multiply the memory channels on the CPU: Intel has already planned triple-channel DDR3 for their high-end server Xeons (the first crop of QPI chips). AMD has announced 4 memory channels for their future 6- and 12-core chips targeting the G34 socket.
So the net effect of unganged dual channel is that today you already have 4 cores with 2 sets of memory lanes to choose among, and within 1 year you'll have 6 to 12 cores sharing 4 sets of memory lanes.
By the time you reach 32 cores per CPU, almost each slot will probably have its own dedicated memory channel (probably with the help of some technology which communicates serially over fewer lines, like FB-DIMM). Or even weirder memory interfaces (who knows? maybe DDR-6 will be able to give several simultaneous accesses to the same memory module).
So, well, once again, it proves that running stupid simulations without taking into account that other technologies will improve besides the number of cores* yields stupid, unrealistic results.
Shame on TFA's authors, because the trend toward increased bandwidth has already started. A little bit more background research would have avoided this kind of stupidity.
But on the other hand, they would have missed the opportunity to publish an alarmist article with an eye-catching title.
--
*: Although, yes, the number of cores you can slap inside the same package seems to be the "new megahertz" in the manufacturers' race, with some like Intel trying to increase this number faster without putting as much effort into the rest.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
CUDA has zero benefit for supercomputing projects that cannot be broken into tiny bits and spread across multiple cores.
It's not just about memory, or clock speed.
"A supercomputer is a device for turning compute-bound problems into I/O-bound problems."
-Ken Batcher
What's distressing here? That they have to keep building supercomputers the same way they always have? I worked with an ex-IBMer from their supercomputing algorithms department; he and I BSed about future chip performance a lot in the late 2006 to early 2007 timeframe. We were both convinced that the current approaches to CPU design were going to top out in usefulness at 8 to maybe 16 cores due to memory bandwidth.
I guess the guys at Sandia had to do a little more than BS about it before they published, but c'mon guys, this has been obvious for a while. And, if it's obvious to all of us out here, don't you think that Intel knew about it during their 2002 roadmap meetings?
Because of limited memory bandwidth and memory-management schemes that are poorly suited to supercomputers, the performance of these machines would level off or even decline with more cores.
So increase the bandwidth on the memory to something more suited to supercomputers then. Design and make a supercomputer for supercomputer purposes. You are scientists using supercomputers, not kids begging mom for a new laptop on christmas. Make it happen.
It's hardly any secret that CPU speed, even for single-core processors, has been running ahead of memory bandwidth gains for years. That's why we have cache, and ever-increasing amounts of it. It's also hardly any revelation to realize that if you're sharing your memory bandwidth between multiple cores, then the bandwidth available per core is less than if you weren't sharing. Obviously you need to keep the amount of cache per core and the number of cores per machine (or, more precisely, per unit of memory subsystem bandwidth) within reasonable bounds to keep it usable for general-purpose applications, else you'll end up in GPU-CPU (e.g. CUDA) territory where you're totally memory constrained and applicability is much less universal.
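A back-of-the-envelope sketch of that sharing argument, in Python. This is a simple roofline-style bound, not the Sandia simulation, and every number and name in it (`per_core_bandwidth`, `attainable_gflops`, the 20 GB/s figure) is a made-up assumption for illustration.

```python
def per_core_bandwidth(total_gb_s, cores):
    """Bandwidth each core gets if the total is shared evenly."""
    return total_gb_s / cores

def attainable_gflops(cores, peak_gflops_per_core, total_gb_s, bytes_per_flop):
    """Upper bound on throughput: the lesser of compute and memory limits."""
    compute_bound = cores * peak_gflops_per_core
    memory_bound = total_gb_s / bytes_per_flop
    return min(compute_bound, memory_bound)

# Hypothetical chip: 10 GFLOPS per core, 20 GB/s of shared memory bandwidth,
# and a workload that needs 1 byte of memory traffic per floating-point op.
for cores in (1, 2, 4, 8, 16):
    print(cores,
          per_core_bandwidth(20.0, cores),
          attainable_gflops(cores, 10.0, 20.0, 1.0))
```

With these assumed numbers the bound stops growing at 2 cores. Note that this model can only level off; it does not capture the contention effects that would make performance actually decline with more cores.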
For cluster-based ("supercomputer") applications, partitioning between nodes is always going to be an issue in optimizing performance for a given architecture, and available memory bandwidth per node and per core is obviously a part of that equation. Moreover, even if CPU designers do add more cores per processor than is useful for some applications, no-one is forcing you to use them. The cost per CPU is going to remain approximately fixed, so extra cores per CPU essentially come for free. A library like pthreads, and different implementations of it (coroutine vs LWP based), gives you the flexibility over the mapping of threads to cores, and your overall across-node application partitioning gives you control over how much memory bandwidth per node you need.
The other thing I'd point out is that your analogy to "balanced" general purpose computing systems can (and should) fail for supercomputers... there is no rational reason to continue scaling in a linear fashion.
Then again, this is seriously old news. Trying to optimize a supercomputer to get anywhere close to 100% CPU utilization is known to be a problem. Others have already pointed out IBM's Blue Gene, and there's a reason it's a good example.
From Marc Snir, et al. at IBM, September, 2001 in a file called BlueGenePublic.pdf, which discussed the design philosophy for the Blue Gene supercomputer:
"Standard microprocessors are optimized for running as fast as possible one instruction stream...
Standard nodes suffer from 'von Neumann' bottleneck: computation speed increases much faster than memory access speed"
"Let's think from scratch.... in order to build general purpose systems that overcome constraints of conventional architectures.
Let's accept that significant improvements in cost/performance can be achieved by building an 'unbalanced' system" (emphasis added)
"CPU is a vanishingly small fraction of total system, in silicon area, or power
It is rational to build systems with a surfeit of compute power, so as to reduce memory requirements and reduce the need to move data around
It is cost effective to have a low CPU utilization"
Use the same trick that RAID does: multiple memory modules in parallel. The SUN Niagara 2 processors have 4 memory controllers to feed the 4-6-8 cores (32-48-64 threads). The Tilera TILE64 processors also have 4 memory controllers to feed the 64 cores.
The phrase "By logical extension" is just another way of saying "This is a straw man argument"
I believe that the point he was making was not that it's pointless to go beyond X86 hardware, but that it's more cost-effective to use consumer hardware. Consumer hardware is not necessarily X86 hardware. See IBM's Roadrunner, presently the fastest supercomputer in the world, which uses an advanced version of the PS3's processor (the PowerXCell 8i).
In time, we'll probably see demand in consumer hardware for breaking past the boundaries and bottlenecks of multi-core processing, and so supercomputers will follow.
mysql> SELECT * FROM `places` WHERE `place` LIKE 'home`; Empty set (0.00 sec)
NEC still makes the SX-9 vector system, and Cray still sells X2 blades that can be installed into their XT5 super. So vector processors are available; they just aren't very popular, mostly due to cost/flop.
A vector processor implements an instruction set that is slightly better than a scalar processor at doing math, considerably worse than a scalar processor at branch-heavy code, but orders of magnitude better in terms of memory bandwidth. The X2, for example, has four 25-gflop cores per node, which share 64 channels of DDR2 memory. Compare that to the newest Xeons, where six 12-gflop processors share 3 channels of DDR3 memory. While the vector instruction set is well suited to using this memory bandwidth, a massively multi-core scalar processor could also make use of a 64-channel memory controller.
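The comparison above works out as follows, using only the figures quoted in this comment. The helper name `gflops_per_channel` is made up for illustration, and a DDR2 channel and a DDR3 channel do not carry the same bandwidth, so treat this as a rough ratio rather than a benchmark.

```python
def gflops_per_channel(cores, gflops_per_core, channels):
    """Peak compute that each memory channel has to feed."""
    return cores * gflops_per_core / channels

x2 = gflops_per_channel(4, 25, 64)    # Cray X2 node figures from above
xeon = gflops_per_channel(6, 12, 3)   # Xeon figures from above

print(f"X2:   {x2:.2f} gflops per channel")
print(f"Xeon: {xeon:.2f} gflops per channel")
print(f"ratio: {xeon / x2:.2f}x")
```

So on these numbers each Xeon memory channel has to feed roughly 15 times as much peak compute as an X2 channel does.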
The problem is about money. These multicore processors are coming from the server industry. Web-hosting, database-serving, and middleware-crunching jobs tend to be very cache-friendly. Occasionally they benefit from more bandwidth to real memory, but usually they just want a larger L3 cache. Cache is much less useful to supercomputing tasks, which have really large data sets. The server-processor makers aren't going to add a 64-channel memory controller to server processors; it wouldn't do any good for their primary market, and it would cost a lot.
Of course, you could just buy real vector processors, right? Not exactly. Many supercomputing tasks work acceptably on quad-core processors with 2 memory channels. It's not ideal, but they get along. This has put a lot of negative market pressure on the vector machines, and they are dying away again. It's not clear if Cray will make a successor to the X2, and NEC has priced itself into a tiny niche market in weather forecasting that is unapproachable by other supercomputer users, for price reasons.
The continued increase in multicore processing power is doomed unless a solution to the memory bottleneck is found. We need a memory system that obviates the need for caching by completely eliminating bus contention in shared memory. This should be one of the primary research areas for companies like Intel and for government-funded research labs. We should pump billions of dollars into finding a solution to this problem over the next five or ten years. I suspect that optical memory or quantum tunneling are promising areas of inquiry. This is what physicists should be focusing their efforts on instead of pursuing pipe dreams like quantum computing.
The number of cores per megabyte should double every 18 months so as to pursue the hypothetical ideal of one processor per byte. At that point we will have reached the end of the performance curve.
It's worth noting that multicore CPUs are just a plan B technology. What the market really wants is faster CPUs, but the current old technology can't deliver them, so CPU makers are trying to convince people that multicore is a good idea.
Even if this is the case, which sounds plausible... So what?
Somehow I have a sneaking suspicion that if multi-core has less performance in super computing than single-core... Companies will continue to manufacture specialized single-core processors for supercomputing.
This just in:
* Intel sucks at making zillion-dollar computers
* AMD sucks at everything
* Supercomputer engineers are worried for their jobs
I realize these people have a legitimate complaint, but quite frankly if you're worried about a certain processor affecting your code, maybe you suck at programming ?! So what if the internal bandwidth is ho-hum ? These old dogs need to stop complaining and learn to adapt, else their overpaid jobs will be given to others who can.
-Billco, Fnarg.com
Actually, my thinking is that rather than just tossing more cores at the problems, we should be looking at making the hardware adapt itself to the problem to be solved. IE: instead of just crunching "instructions" on data, we need hardware that effectively rewires itself to the problem at hand.
Something like an FPGA integrated into the architecture with huge gate/interconnect counts plus some "normal" cores may be a better approach. Done well, loops can be unrolled and executed in one clock cycle, entire memories can be created on-chip and then destroyed when no longer needed, etc.
Of course, this would require changing some programming techniques around, and require compilers far more advanced than we currently have available. Still, it should be at least a semi-achievable technology.
Supercomputing is mostly a Government-funding boondoggle. The private sector buys few if any supercomputers.
Most of the US government applications are either related to nuclear weapons, or are busywork for underutilized nuclear weapons labs. Sandia, Los Alamos, and Livermore, lacking bombs to design, are looking for something else to justify their continued existence. To some extent, they're senior activity centers for old physicists. There's also "stockpile stewardship", which is an activity center for younger physicists. The idea is to keep some people around who can build an H-bomb if necessary, so that the technology isn't lost as people die off. Since that crowd isn't allowed to actually do much of anything, they want to simulate a lot. It's really a political problem. If the US and Russia allowed each other one underground bang each year, there would be less need for all this iffy simulation.
So don't worry too much about whining from Sandia about supercomputers. When Google or Amazon start complaining that multicore machines are choking in their server farms, it's time to listen.
Rather than laptops with zillions of CPU cores, we're probably going to see CPU chip real estate used for more cache, and maybe even main memory. The near future is the one-chip laptop that sells for $100 or so.
Good call, I didn't see the Direct Connect(tm) stuff. I should try to keep more abreast of such things. Nice design too.
...so nothing ties the application to the hardware? You need to have SOMETHING there. That's basically the whole point of an OS... to tie itself to the hardware so the applications don't have to. If you want to try new hardware ideas, you need to write a new OS. There's no way around it. How in the hell did this get modded up?
My blog. Good stuff (when I remember to update it). Read it.
What the article points out is that while the number of ALUs per chip has increased, memory-to-processor throughput has not. If you are working with large amounts of data (i.e. not factoring numbers), the processor is unable to keep the cores fed. Most supercomputer applications today involve large data sets. In one situation examined, an 8-core CPU performed about the same as a dual core, and with more cores (16 to 64) performance degraded quickly to less than that of a dual core.
This is the memory bottleneck and is likely to be the case for database systems and other systems processing large data sets. The bottleneck needs a name. Any ideas?
In any case, the article only refers to data mining. Perhaps these questions are better answered by Oracle, DB2, or other TPC score winners.
There is less need for such a scheme today because sequential addresses are not sent across the memory bus. Instead, burst mode is used. A burst is specified by sending a start address and a size. This is necessary because the memory bus latency may be hundreds of clock cycles; bursts are the only way to achieve reasonable bandwidth in such conditions.
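A rough sketch of why bursts matter when latency is hundreds of cycles. The function name and all the numbers below are assumptions for illustration; real DRAM timing is more involved than a single fixed latency per burst.

```python
def effective_bandwidth(burst_bytes, latency_cycles, bytes_per_cycle):
    """Bytes delivered per cycle when each burst pays the full latency once."""
    transfer_cycles = burst_bytes / bytes_per_cycle
    return burst_bytes / (latency_cycles + transfer_cycles)

# Hypothetical bus: 200 cycles of latency, 8 bytes per cycle once data flows.
for burst in (8, 64, 512, 1600):
    print(burst, round(effective_bandwidth(burst, 200, 8), 3))
```

With these assumed figures, a single 8-byte transfer achieves well under 1% of the bus's peak rate, while a 1600-byte burst reaches half of it, since the one-time latency is amortized over more data.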
The tao of democracy: the government you can vote for is not the real government.
Sun's Niagara processors are not particularly suitable for supercomputers, but they have some innovative useful features for getting around the memory bandwidth problem.
Their truly innovative processor that should be superb for supercomputing is ROCK, if it ever sees the light of day.
As well as multiple cores, multiple threads per core (i.e. "contexts"), powerful floating-point cores and SIMD units, its killer feature is what sounds like a very clever kind of automatic speculative pre-fetching from main memory into cache.
Intel's chips always make me laugh a little. All that processing power and no memory or I/O bandwidth. They've only just caught up with AMD in that respect and now they're planning 80 cores with very little improvement in memory bandwidth...
Stick Men
There's a very simple solution for the memory and interconnect bandwidth bottlenecks, and that is to widen the channels. If you look at Intel's Nehalem roadmap, they're planning to move from triple-channel to quad-channel on the really monstrous chips coming out down the line. Likewise, AMD has been doing a lot of work on making HyperTransport channels more configurable, so you could allocate less bandwidth to an I/O bridge and more to the interconnect, if you're building a system where that's what's important.
If you're just adding more execution cores without changing what's around them, then this criticism holds, but the long-term significance of multi-core design is about VLSI, which lowers the latency between components. Latency is critical in supercomputing applications, so any time you can squeeze those gigaflops and their attached memory closer together, performance improves.
There's no failure quite as dissatisfying as a complete and total solution to the wrong problem.
*applause*
http://rocknerd.co.uk
A previous Slashdot article included an nVidia executive saying Intel has been wrong on cpu design for a long time - that the critical design feature needs to be memory bandwidth, not cpu ticks or speed or any of the numbers they've so far focused on.
But I think this just shows supercomputer designers need to stop thinking about CPUs and start thinking about GPUs. Multicore is here and commoditized already, and if you can do your work on shaders then you're looking at not 8, 16, or 32 cores but 640 or 1280 cores to do your work, all with bus designs that put memory first.
yea but the OS is so tied into the hardware that you can't port apps out of the OS, hardware combo. Applications shouldn't care what they are running on, hardware or software.
The only system to even begin to accomplish that is Inferno.
i thought once I was found, but it was only a dream.
GPUs don't do error detection/correction. Not a desirable feature for scientific models.
.
bravo.
I was just reading the AMD documentation on the Phenom. The part I found interesting was that Phenom uses the same 144-bit ECC code in both ganged and unganged mode. In the latter case, the ECC code is used across two 72-bit transfers from the same channel, which optionally can be bitwise interleaved. 4-bit chipkill correction is lost in this case, but detection still works.
do you mean like for multiple processor single box supercomputers or do you mean for clusters? For clusters they already do, and for supercomputers-in-a-box they would have to.
In re: clusters, start here http://en.wikipedia.org/wiki/Message_Passing_Interface or http://en.wikipedia.org/wiki/OpenMP for more info.
2^3 * 31 * 647
erm, don't you still only have one RAID controller per system? Don't I/O requests still bottleneck at the DMA interface if they're faster ($DEITY forbid that happening) because there is only one bus? So how does your RAI[R|M] facilitate this system? You've still got to coordinate all that memory. Unless....
What if each RAM chip is assigned a write bit to one DMA style controller, but multiple or various other controllers have READ ONLY access to the RAM? I don't know if this is even feasible, but it's a thought...
2^3 * 31 * 647
I bought a Mac Pro 8-core machine to learn multi-core programming, then I discovered NVIDIA CUDA programming and I am looking at buying a C1080 240-core GPU to learn to program that. The industry is manufacturing lots of multi-core devices, but programming (parallel) hasn't adapted to this new paradigm and provided the right tools to leverage off these new technologies.
vector processing in commodity designs isn't enough. Of course we are going to see it, at this point it's not very expensive to add. Adding vector processing for increased flops is easy. The hard part is the bandwidth. One of the reasons the X1 processors were expensive, was that they were custom, but so are the network chips in commodity-cpu supers, and they only add $1000/node. The real cost of X1-style memory is that you have 64 channels of memory, which is a lot of wires, dimms, memory parts, etc. There's a very real cost to all the memory components needed to get the kind of bandwidth you need to support a high-throughput vector pipeline.
The commodity processor vendors aren't going to do this sort of thing, as it adds to the cost of the chip, but provides nothing to the bulk of their customers who are running mysql, apache, or halflife.
The one hope I have is something like the core2 architecture, where ddr3 is used for desktop processors, and fbdimm is used for server parts. The two components share a lot of architecture, and only a few of the asic cells are different. If a cpu vendor were interested in the HPC market, they could design a cpu to use a standard memory channel for desktop/low-end server parts, and something more expensive, but higher bandwidth for the HPC space. It would mean HPC specific processors, but sharing most of the engineering with the commodity part. Maybe Cray could license them the design for their weaver memory controller in the X2. It's kind of like the AMB on a FB-DIMM, but it includes 4 channels of DDR2 on each stick of memory.
I'd love to see each core on a massively multicore design get its own memory controller. I'm not holding my breath, however. If you think of a 32-core CPU, it's pretty unlikely that most supercomputer or cluster vendors are going to pay for 32 dimms for each cpu socket. So then you're talking about multiple memory channels per memory stick. You can still get ECC using 5 memory chips per channel, so you can imagine 4 channels fitting on a memory riser. Cray does this on the X2. Then 32 channels would only require 8 dimms, which is reasonable. Then what do you do for 64-core CPUs?
It's tricky, and the problem for the market is that it's expensive. Can you get the commodity CPU vendors interested in such a thing, given that most of their addressable market is not in the supercomputing space?
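The DIMM-count arithmetic above, as a tiny Python sketch. It assumes one memory channel per core and 4 channels per memory stick, as described in this comment; the function name `dimms_needed` is made up for illustration.

```python
import math

def dimms_needed(cores, channels_per_dimm=4):
    """DIMMs per socket if every core gets its own memory channel."""
    return math.ceil(cores / channels_per_dimm)

for cores in (32, 64):
    print(cores, "cores ->", dimms_needed(cores), "DIMMs per socket")
```

That gives 8 DIMMs per socket at 32 cores, which matches the figure above, and 16 at 64 cores, which is where the cost question starts to bite.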
I think we're gonna see more cores in a CPU than there's bandwidth to use. They might increase the bandwidth a bit, but probably just enough to get good Linpack numbers.
Note the comment from Steve Conway from IDC:
Steve Conway, senior analyst with IDC for high performance computing issues, said this problem has been around for a while, and multi-core is only exacerbating it. "x86 processors were never designed for HPC," he told InternetNews.com. "Those processors were not designed to communicate with each other at a high speed. With these big systems, you have to move data over large territories.