IEEE Says Multicore is Bad News For Supercomputers
Richard Kelleher writes "It seems the current design of multi-core processors is not good for the design of supercomputers. According to IEEE: 'Engineers at Sandia National Laboratories, in New Mexico, have simulated future high-performance computers containing the 8-core, 16-core, and 32-core microprocessors that chip makers say are the future of the industry. The results are distressing. Because of limited memory bandwidth and memory-management schemes that are poorly suited to supercomputers, the performance of these machines would level off or even decline with more cores.'"
Sounds like it's time for supercomputers to go their own way again. I'd love to see some new technologies.
If you run a simulation like that while keeping the memory interface constant, then of course you'll see diminishing returns. That's why we're no longer running plain old FSBs: AMD has HyperTransport, Intel has QPI, the AMD Horus system expands it up to 32 sockets / 128 cores, and I'm sure something similar can and will be built as a supercomputer backplane. The headline is more than a little sensationalist...
Live today, because you never know what tomorrow brings
Figure it out.
Gotta love job security.
I am very small, utmostly microscopic.
Isn't this already a problem in today's computers? The CPU isn't the bottleneck, the HDD is.
Once we get to 32 or 64 core cpus that cost less than $100 (say, five years), I'd HATE to have a beowulf cluster of those!
I hold very few opinions. I hold information based on observation and fact. If you wish to disagree, please use facts.
>>>"After about 8 cores, there's no improvement," says James Peery, director of computation, computers, information, and mathematics at Sandia. "At 16 cores, it looks like 2 cores."
>>>
That's interesting but how does it affect us, the users of "personal computers"? Can we extrapolate that buying a CPU larger than 8 cores is a waste of dollars, because it will actually run slower?
FOX NEWS.com should be BANNED from television and internet. Have the Congress take it over and give us Truespeak.
That to remove the 'memory wall', main memory and CPU will have to be integrated.
I mean, look at general-purpose computing systems past and present: there is a somewhat constant relation between CPU speed and memory size. Ever seen a 1 MHz system with a GB of RAM? Ever seen a GHz CPU coupled with a single KB of RAM? Why not? Because with very few exceptions, heavier compute loads also require more memory space.
Just like the line between GPU and CPU is slowly blurring, it's just obvious that the parts with the most intensive communication should be the parts closest together. Instead of doubling the number of cores from 8 to 16, why not use those extra transistors to stack main memory directly on top of the CPU core(s)? Main memory would then be split up into little sections, with each section on top of a particular CPU core. I read sometime that semiconductor processes that are suitable for CPUs aren't that good for memory chips (and vice versa). I don't know if that's true, but if so, let the engineers figure that out.
Of course things are different with supercomputers. If you have 1000 'processing units', where each PU consists of, say, 32 cores and some GBs of RAM on a single die, that would create a memory wall between 'local' and 'remote' memory. The on-die section of main memory would be accessible at near CPU speed; main memory that is part of other PUs would be 'remote', and slow. Hey wait, sounds like a compute cluster of some kind... (so scientists already know how to deal with it).
Perhaps the trick would be to make access to memory found on one of the other PUs transparent, so that programming-wise there's no visible distinction between 'local' and 'remote' memory, with some intelligent routing to migrate blocks of data closer to the core(s) that access them. Maybe that could be done in hardware, maybe that's better done at the software level. Either way: the technology isn't the problem, it's an architectural / software problem.
Would optical get around such a barrier?
Physical space seems to be one of the major hurdles in CPU design today, due to leakage with the ever shrinking processes.
And I think it is about damn time that new silicon laser receiver thing (forgot the details) was put into implementation and testing.
I once heard someone define a supercomputer as a $10 million memory system with a CPU thrown in for free. One of the interesting CPU benchmarks is to see how much data it can move when the cache is blown out.
Mea navis aericumbens anguillis abundat
This doesn't quite make sense to me. You wouldn't replace a 64 CPU supercomputer with a single 64 core CPU, but would instead use 64 multicore CPUs. As production switches to multicore, the cost of producing multiple cores will be about the same as the single core CPUs of old. So eventually you'll get 4 cores from the price of 2, then get 8 cores from the price of 4, then 16 for the price of 8, etc. So the extra cores in the CPUs of a supercomputer are like a bonus, and if software can be written to utilize those extra cores in some way that benefits performance, then that's a good thing.
Better known as 318230.
From the article: "'The key to solving this bottleneck is tighter, and maybe smarter, integration of memory and processors,' says Peery. For its part, Sandia is exploring the impact of stacking memory chips atop processors to improve memory bandwidth." ... breaking news, cache is good. Hasn't getting data to the processor quickly without wasting cycles always been a critical bottleneck in servers?
"this year the U.S. Department of Energy formed the Institute for Advanced Architectures and Algorithms. Located at Sandia and at Oak Ridge National Laboratory, in Tennessee, the institute's work will be to figure out what high-performance computer architectures will be needed five to 10 years from now and help steer the industry in that direction."
While I very much like moving technology forward as quickly as possible and understand that much of the U.S.'s power relies on being ahead technologically, I question whether having groups guess theoretically what architectures will be best in the future is the best use of tax dollars in our debt-heavy government. Especially since, by the nature of competition, Intel and AMD are already doing this. Isn't it about time the g-man got forced to declare bankruptcy and shed unnecessary assets and debt holes?
-Brian
Maybe they should do something like we did back when the Paragon (yes, that far back) had multiple CPUs on a node and the memory bandwidth wasn't enough to support them all simultaneously... Don't use some of the CPUs on the card (leave some idle) so that all the bandwidth is available to the one, or few, cores that need it. Alternatively, figure out a way (algorithms) to make sure that no more than one core is memory intensive at a time... take turns being bandwidth intensive. Or just realize, as has always been the case, that some solutions/algorithms just aren't optimal on commodity hardware.
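The "take turns being bandwidth intensive" idea can be sketched with a counting semaphore. This is only an illustration, assuming each worker has a distinct memory-heavy phase; `run_workers`, `max_in_memory_phase`, and the workload sizes are made-up names and numbers (nothing here comes from the Paragon software), and Python threads merely stand in for cores.

```python
import threading

def run_workers(num_workers: int, max_in_memory_phase: int = 1) -> int:
    """Run workers whose memory-heavy phase is gated by a semaphore.

    Returns the highest number of workers seen inside that phase at once.
    """
    gate = threading.Semaphore(max_in_memory_phase)
    lock = threading.Lock()
    data = list(range(100_000))  # stand-in for a large array
    active = 0
    peak = 0

    def worker() -> None:
        nonlocal active, peak
        sum(range(10_000))        # compute-heavy phase, not gated
        with gate:                # memory-heavy phase, gated
            with lock:
                active += 1
                peak = max(peak, active)
            sum(data)             # stand-in for streaming through memory
            with lock:
                active -= 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak

if __name__ == "__main__":
    print(run_workers(8, max_in_memory_phase=1))
```

Raising `max_in_memory_phase` corresponds to letting "one, or few" cores use the bandwidth at the same time.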
For a given node count, we've seen increases in performance. The claimed problem is that, for the workloads that concern these researchers, nobody is projecting enhancements to the fundamental memory architecture that keep pace with the rate at which core counts scale. So you buy a 16-core system to upgrade your quad-core based system and hypothetically gain little despite the expense. Power efficiencies drop, and getting more performance requires more nodes. Additionally, who is to say that clock speeds won't drop if programming models in the mass market change such that distributed workloads are common and single-core performance isn't all that impressive?
All that said, talk beyond 6-core/8-core is mostly grandstanding at this time. As memory architecture for the mass market is not considered as intrinsically exciting, I would wager there will be advancements that no one speaks to. For example, Nehalem leapfrogs AMD memory bandwidth by a large margin (like by a factor of 2). It means if Shanghai parts are considered satisfactory today to get respectable yield memory wise to support four cores, Nehalem, by a particular metric, supports 8 equally satisfactorily. The whole picture is a tad more complicated (i.e. latency, numbers I don't know off hand), but the one metric is a highly important one in the supercomputer field.
For all the worry over memory bandwidth though, it hasn't stopped supercomputer purchasers from buying into Core2 all this time. Despite improvements in their chipset, Intel Core2 still doesn't reach AMD performance. Despite that, people spending money to get into the Top500 still chose to put their money on Core2 in general. Sure, Cray and IBM supercomputers in the Top2 used AMD, but from the time of its release, Core2 has decimated AMD supercomputer market share despite an inferior memory architecture.
XML is like violence. If it doesn't solve the problem, use more.
I mean, really, this sounds like a poorly designed experiment. NVidia GPU's can have hundreds of cores and they just get faster. Memory management is different on GPUs than standard CPUs/chipsets use for that very reason. Hopefully our tax dollars didn't pay these geniuses for this crap.
You might see a supercomputer designed around other RISC processors such as the ARM. A supercomputer using the ARM takes more chips, perhaps, but the power savings are substantial compared to x86. Furthermore, companies like Nvidia with their Tesla platform are pushing into the supercomputing space with specialized chips that are purposefully designed to deal with large linear problem solving. Interestingly, the Tesla chip is a multicore chip as well. http://www.nvidia.com/object/product_tesla_s1070_us.html
Any idea why the current top of the supercomputer pack uses Cell processors? Besides having mad vector processing skillz with their SPUs the memory bandwidth is fairly large.
Seriously. While the algorithms/code they execute may run out of memory bandwidth when spread across 16 cores (doubtful) the bottleneck is more likely to come from the external interfaces long before the cpu runs out of bandwidth to access that memory. Current motherboards from Tyan and the like hold up to 128GB of ram, which, when consumed by the CPU at its THEORETICAL 64GB/s in triple channel mode means that it will simply run out of data in 2 seconds. Even if you are backfilling the entire time (reducing overall memory throughput) you will run out of data a short time later (dictated by the network connections backfill rate). The problem is one of being able to keep main memory full. Stacking memory on the chip will still have its limitations and really doesn't make sense when you consider the commodity pricing of these chips, the complexity of stacking, and the limitations of external data access. Buy more computers, build a smarter cluster and write smarter software. 16 cores looking like 2... really.. seriously.. wtf ever. Sandia (and I used to live down the street so I have met many people there) makes its living by raising government money. Nothing like coming up with a scare tactic to open up the coffers of government spending.
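The "run out of data in 2 seconds" figure is easy to check. The sketch below uses the 128 GB and 64 GB/s numbers from the comment; the 1.25 GB/s backfill rate (roughly a 10 Gbit/s link) is a hypothetical value chosen for illustration, and the function names are made up.

```python
def drain_time_seconds(memory_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to read every byte of main memory once at a sustained rate."""
    return memory_gb / bandwidth_gb_per_s

def time_until_starved(memory_gb: float,
                       consume_gb_per_s: float,
                       refill_gb_per_s: float) -> float:
    """Time until fresh data runs out when memory is backfilled while
    it is being consumed. Returns infinity if the refill keeps up."""
    net = consume_gb_per_s - refill_gb_per_s
    if net <= 0:
        return float("inf")
    return memory_gb / net

if __name__ == "__main__":
    # Figures from the comment: 128 GB of RAM, 64 GB/s theoretical bandwidth.
    print(drain_time_seconds(128, 64))
    # Hypothetical backfill over a ~10 Gbit/s network link (1.25 GB/s).
    print(time_until_starved(128, 64, 1.25))
```

Under these assumptions the backfill only postpones starvation by a few hundredths of a second, which is the comment's point about keeping main memory full.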
... what makes anyone think more cpus can be done?
Honestly, using a Core 2 Duo at work, and after 3 weeks of swapping out every part of the system including the CPU.... sent back to HP, where they finally figured out it was the CPU.... oh wait, it was swapped out too... so.... it was both CPUs.... What are the odds of that?
And now on another system, also a core 2 duo, I get application hangs and taskmanager saying no cpu is being used. Hmmm, must be some sort of fancy dancy hibernation power saving mode thingy that puts the user in a wait to get heat for not getting the job done..... piss on you....
So who is god? those who create the hardware or the software creators? because even at work, I'm assumed to be at fault as a user first.,....
And I'm fucking god damn sick of this shit from the computer industry arrogance. Don't they teach how to get things right before moving to the next step in the god training of CS?
My biggest pet peeve is this button with the symbols "0" & "1" on it ... and everyone knows it means "on" and "off" as it's used on all sorts of consumer and industrial electronic equipment.
But on a computer, these fucking GODS damn its gotta mean something more and hidden so the morons can think they know something more than their consumer slaves... so they can masterbait their arrogance and stroke their egos.
Hey I got this really neat idea and I have all sorts of sheep skins from a universatays that says I smarts more then you.... so lets do this and sell it and if it doesn't work then I still get a pay check so who the fucjk cares.... Lets abstract stuff so much that we are only experts at a small part and nobody is an expert of the whole... and dat way we can blame the other guy...
Maybe the computer industry needs a bailout too?
Here... have a holy bucket... start bailing... you have plenty experience at that.
Bailing on teh end users....
End users are stupid..... we can blame them.....
Couldn't they just break up their programs into threads? Obviously, this wouldn't work for real-time applications, but modeling and other asynchronous programs could definitely be split and coprocessed.
It's all fun and games till someone divides by 0. Then it's hilarious.
The issue is with a single processor that has multiple cores.
There's no real way to split the banks for each core, so the net effect is that you have 4-32 cores sharing the same lanes for memory.
No, sorry. That's how Phenom processors are *already* working.
Each physical CPU package has two 64-bit memory controllers, each controlling a separate bank of 64-bit DDR-2 memory chips (one for each of the two banks on a dual-channel motherboard).
Phenom has two modes of operation:
- Ganged: both memory controllers work in parallel, as if they were one huge 128-bit memory connection. That's how dual channel has worked since it was invented.
That's good for systems running a few very bandwidth-hungry applications (for example, benchmarks).
- Unganged: each memory controller works on its own. Thus you have two completely separate 64-bit memory channels accessible at the same time. By correctly laying out the applications in memory thanks to a NUMA-aware OS (anything better than Windows Vista), two separate applications can each access their own memory at the exact same moment, although at only half the bandwidth *per process* (but still the same total bandwidth for all processes running at the same time on a multi-core chip).
This is perfect for systems running lots of tasks in parallel, and is the default mode on most BIOSes I've seen.
This gives a tremendous boost to heavily multi-tasked applications (a busy database server, for example), and it's what TFA's authors are looking for.
At some point in the future, Intel will probably follow the same trend with its QPI processors.
Also, the future trend is to multiply the memory channels on the CPU: Intel has already planned Triple Channel DDR-3 for their high-end server Xeons (the first crop of QPI chips). AMD has announced 4 memory channels for their future 6- and 12- core chips targeting the G34 socket.
So the net effect of Unganged Dual Channel is that today you already have 4 cores choosing among 2 sets of memory lanes, and within 1 year you'll have 6 to 12 cores sharing 4 sets of memory lanes.
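The cores-per-channel ratios in that summary can be tabulated quickly. The core and channel counts are the ones quoted in this comment; the labels and the helper name `cores_per_channel` are made up for illustration, and the sketch ignores differences in per-channel bandwidth between memory generations.

```python
from fractions import Fraction

def cores_per_channel(cores: int, channels: int) -> Fraction:
    """How many cores have to share each independent memory channel."""
    return Fraction(cores, channels)

# (label, cores, independent memory channels), as quoted in the comment.
CONFIGS = [
    ("quad-core, unganged dual channel", 4, 2),
    ("6-core, four channels (G34)", 6, 4),
    ("12-core, four channels (G34)", 12, 4),
]

if __name__ == "__main__":
    for label, cores, channels in CONFIGS:
        ratio = cores_per_channel(cores, channels)
        print(f"{label}: {float(ratio):.2f} cores per channel")
```

On this crude measure the 6-core part is better off than today's quad-core, while the 12-core part has slightly more cores per channel, so the rest of the picture depends on how much faster each channel gets.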
By the time you reach 32 cores on a CPU, probably almost every slot will have its own dedicated memory channel (probably with the help of some technology that communicates serially over fewer lines, like FB-DIMM). Or even weirder memory interfaces (who knows? maybe DDR-6 will be able to give several simultaneous accesses to the same memory module).
So, well, once again, it proves that running stupid simulations without taking into account that technologies other than the number of cores* will improve yields stupid, unrealistic results.
Shame on TFA's author, because the trend toward increased bandwidth has already started. A little bit more background research would have avoided this kind of stupidity.
But on the other hand, they would have missed the opportunity to publish an alarmist article with an eye catching title.
--
*: Although, yes, the number of cores you can slap inside the same package seems to be the "new megahertz" in the manufacturers' race, with some like Intel trying to increase this number faster without putting as much effort into the rest.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
CUDA has zero benefit for supercomputing projects that cannot be broken into tiny bits and spread across multiple cores.
It's not just about memory, or clock speed.
"A supercomputer is a device for turning compute-bound problems into I/O-bound problems."
-Ken Batcher
What's distressing here? That they have to keep building supercomputers the same way they always have? I worked with an ex-IBMer from their supercomputing algorithms department; he and I BSed about future chip performance a lot in the late 2006 to early 2007 timeframe. We were both convinced that the current approaches to CPU design were going to top out in usefulness at 8 to maybe 16 cores due to memory bandwidth.
I guess the guys at Sandia had to do a little more than BS about it before they published, but c'mon guys, this has been obvious for a while. And, if it's obvious to all of us out here, don't you think that Intel knew about it during their 2002 roadmap meetings?
Because of limited memory bandwidth and memory-management schemes that are poorly suited to supercomputers, the performance of these machines would level off or even decline with more cores.
So increase the bandwidth on the memory to something more suited to supercomputers then. Design and make a supercomputer for supercomputer purposes. You are scientists using supercomputers, not kids begging mom for a new laptop on christmas. Make it happen.
It's hardly any secret that CPU speed, even for single core processors, has been running ahead of memory bandwidth gains for years - that's why we have cache, and ever increasing amounts of it. It's also hardly any revelation to realize that if you're sharing your memory bandwidth between multiple cores then the bandwidth available per core is less than if you weren't sharing. Obviously you need to keep the amount of cache per core and the number of cores per machine (or, more precisely, per unit of memory subsystem bandwidth) within reasonable bounds to keep it usable for general purpose applications, else you'll end up in GPU-CPU (e.g. CUDA) territory where you're totally memory constrained and applicability is much less universal.
For cluster-based ("supercomputer") applications, partitioning between nodes is always going to be an issue in optimizing performance for a given architecture, and available memory bandwidth per node and per core is obviously a part of that equation. Moreover, even if CPU designers do add more cores per processor than is useful for some applications, no-one is forcing you to use them. The cost per CPU is going to remain approximately fixed, so extra cores per CPU essentially come for free. A library like pthreads, and different implementations of it (coroutine vs LWP based), gives you the flexibility over the mapping of threads to cores, and your overall across-node application partitioning gives you control over how much memory bandwidth per node you need.
By logical extension, if commodity X86 is all the world needs, why is anyone in school learning to design chips?
Just give up. If you aren't planning to be hired (in THIS bad, layoff-prone, and outsourcing-prone economy!) by Intel or AMD to replace retirees, it's hopeless. You can pretty much just kill off the whole chip design field-- and lay off professors while you're at it.
After all, if we drink the Intel Atom kool-aid, even ARM's days are numbered, right? The prescribed solution to everything is an X86 based PC top to bottom--- from embedded to set-top-box to supercomputer.
The other thing I'd point out is that your analogy to "balanced" general purpose computing systems can (and should) fail for supercomputers... there is no rational reason to continue scaling in a linear fashion.
Then again, this is seriously old news. Trying to optimize a supercomputer to get anywhere close to 100% CPU utilization is known to be a problem. Others have already pointed out IBM's Blue Gene, and there's a reason it's a good example.
From Marc Snir, et al. at IBM, September, 2001 in a file called BlueGenePublic.pdf, which discussed the design philosophy for the Blue Gene supercomputer:
"Standard microprocessors are optimized for running as fast as possible one instruction stream...
Standard nodes suffer from 'von Neumann' bottleneck: computation speed increases much faster than memory access speed"
"Let's think from scratch.... in order to build general purpose systems that overcome constraints of conventional architectures.
Let's accept that significant improvements in cost/performance can be achieved by building an 'unbalanced' system" (emphasis added)
"CPU is a vanishingly small fraction of total system, in silicon area, or power
It is rational to build systems with a surfeit of compute power, so as to reduce memory requirements and reduce the need to move data around
It is cost effective to have a low CPU utilization"
Use the same trick that RAID does: multiple memory modules in parallel. The SUN Niagara 2 processors have 4 memory controllers to feed the 4-6-8 cores (32-48-64 threads). The Tilera TILE64 processors also have 4 memory controllers to feed the 64 cores.
Yeah, and how many programs can you actually run in parallel, as opposed to tasks that must be run serially? Some tasks just can't be broken into bits. Besides, most programming is still done in serial steps.
NEC still makes the SX9 vector system, and Cray still sells X2 blades that can be installed into their XT5 super. So vector processors are available; they just aren't very popular, mostly due to cost/flop.
A vector processor implements an instruction set that is slightly better than a scalar processor at doing math, considerably worse than a scalar processor at branch-heavy code, but orders of magnitude better in terms of memory bandwidth. The X2, for example, has four 25-gflop cores per node, which share 64 channels of DDR2 memory. Compare that to the newest Xeons, where six 12-gflop processors share 3 channels of DDR3 memory. While the vector instruction set is well suited to using this memory bandwidth, a massively multi-core scalar processor could also make use of a 64-channel memory controller.
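To put the X2 versus Xeon comparison in one number, here is a small sketch that computes memory channels per gflop from the figures quoted above. It counts channels only and treats a DDR2 channel and a DDR3 channel as equal, which is a simplifying assumption; `channels_per_gflop` is a made-up helper name.

```python
from fractions import Fraction

def channels_per_gflop(cores: int, gflops_per_core: int,
                       memory_channels: int) -> Fraction:
    """Memory channels available per gflop of peak compute."""
    return Fraction(memory_channels, cores * gflops_per_core)

# Figures quoted in the comment.
x2 = channels_per_gflop(cores=4, gflops_per_core=25, memory_channels=64)
xeon = channels_per_gflop(cores=6, gflops_per_core=12, memory_channels=3)

# How many times more channels per gflop the X2 node has.
ratio = x2 / xeon

if __name__ == "__main__":
    print(f"X2:   {float(x2):.3f} channels per gflop")
    print(f"Xeon: {float(xeon):.3f} channels per gflop")
    print(f"ratio: {float(ratio):.2f}x")
```

By this count the X2 node has roughly 15 times as many channels per gflop, which is where the "orders of magnitude" gap in memory bandwidth comes from once per-channel differences are set aside.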
The problem is about money. These multicore processors are coming from the server industry. Web-hosting, database-serving, and middleware-crunching jobs tend to be very cache-friendly. Occasionally they benefit from more bandwidth to real memory, but usually they just want a larger L3 cache. Cache is much less useful to supercomputing tasks, which have really large data sets. The server-processor makers aren't going to add a 64-channel memory controller to server processors; it wouldn't do any good for their primary market, and it would cost a lot.
Of course, you could just buy real vector processors, right? Not exactly. Many supercomputing tasks work acceptably on quad-core processors with 2 memory channels. It's not ideal, but they get along. This has put a lot of negative market pressure on the vector machines, and they are dying away again. It's not clear if Cray will make a successor to the X2, and NEC has priced itself into a tiny niche market in weather forecasting that is unapproachable by other supercomputer users, for price reasons.
The continued increase in multicore processing power is doomed unless a solution to the memory bottleneck is found. We need a memory system that obviates the need for caching by completely eliminating bus contention in shared memory. This should be one of the primary research areas for companies like Intel and for government-funded research labs. We should pump billions of dollars into finding a solution to this problem over the next five or ten years. I suspect that optical memory or quantum tunneling are promising areas of inquiry. This is what physicists should be focusing their efforts on instead of pursuing pipe dreams like quantum computing.
The number of cores per megabyte should double every 18 months so as to pursue the hypothetical ideal of one processor per byte. At that point we will have reached the end of the performance curve.
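For a sense of scale, the doubling schedule above can be turned into a timeline. The comment gives no starting point, so both starting densities below (one core per megabyte and one core per gigabyte, with binary units) are hypothetical assumptions, as is the function name.

```python
import math

def years_to_one_core_per_byte(bytes_per_core_now: int,
                               months_per_doubling: int = 18) -> float:
    """Years until core density reaches one core per byte, if the number
    of cores per unit of memory doubles every `months_per_doubling`."""
    doublings = math.log2(bytes_per_core_now)
    return doublings * months_per_doubling / 12

if __name__ == "__main__":
    # Hypothetical starting points; the comment does not state one.
    print(years_to_one_core_per_byte(2**20))  # one core per MiB today
    print(years_to_one_core_per_byte(2**30))  # one core per GiB today
```

Starting from one core per megabyte takes 20 doublings, or 30 years; starting from one core per gigabyte takes 30 doublings, or 45 years.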
disable those you don't! Why make it a big deal?
It's worth noting that multicore CPUs are just a plan B technology. What the market really wants is faster CPUs, but the current old technology can't deliver them, so CPU makers are trying to convince people that multicore is a good idea.
Even if this is the case, which sounds plausible... So what?
Somehow I have a sneaking suspicion that if multi-core has less performance in supercomputing than single-core... companies will continue to manufacture specialized single-core processors for supercomputing.
This just in:
* Intel sucks at making zillion-dollar computers
* AMD sucks at everything
* Supercomputer engineers are worried for their jobs
I realize these people have a legitimate complaint, but quite frankly if you're worried about a certain processor affecting your code, maybe you suck at programming ?! So what if the internal bandwidth is ho-hum ? These old dogs need to stop complaining and learn to adapt, else their overpaid jobs will be given to others who can.
-Billco, Fnarg.com
I have a simple question. Do modern computers use Gray code on their address bus, or are they still sticking with straight binary? With a 64-bit program whose memory accesses are mostly sequential, using a Gray code could significantly reduce power consumption.
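The intuition behind the question can be checked by counting bus-line transitions for a sequential address sweep. This is a toy sketch: it uses the standard reflected binary Gray code, assumes every address in the range is visited in order, and says nothing about whether real address buses work this way or how much power a transition costs. The helper names are made up.

```python
def gray(n: int) -> int:
    """Reflected binary Gray code of n."""
    return n ^ (n >> 1)

def bit_flips(sequence: list[int]) -> int:
    """Total number of bit positions that change between consecutive values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))

# Sequential sweep over a hypothetical 10-bit address range.
addresses = list(range(1024))
binary_flips = bit_flips(addresses)
gray_flips = bit_flips([gray(a) for a in addresses])

if __name__ == "__main__":
    print("binary:", binary_flips)
    print("gray:  ", gray_flips)
```

Gray code changes exactly one line per step, while plain binary averages close to two, so for a purely sequential sweep the number of transitions is roughly halved.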
Actually, my thinking is that rather than just tossing more cores at the problems, we should be looking at making the hardware adapt itself to the problem to be solved, i.e., instead of just crunching "instructions" on data, we need hardware that effectively rewires itself to the problem at hand.
Something like an FPGA integrated into the architecture with huge gate/interconnect counts plus some "normal" cores may be a better approach. Done well, loops can be unrolled and executed in one clock cycle, entire memories can be created on-chip and then destroyed when no longer needed, etc.
Of course, this would require changing some programming techniques around, and require compilers far more advanced than we currently have available. Still, it should be at least a semi-achievable technology.
Supercomputing is mostly a Government-funding boondoggle. The private sector buys few if any supercomputers.
Most of the US government applications are either related to nuclear weapons, or are busywork for underutilized nuclear weapons labs. Sandia, Los Alamos, and Livermore, lacking bombs to design, are looking for something else to justify their continued existence. To some extent, they're senior activity centers for old physicists. There's also "stockpile stewardship", which is an activity center for younger physicists. The idea is to keep some people around who can build an H-bomb if necessary, so that the technology isn't lost as people die off. Since that crowd isn't allowed to actually do much of anything, they want to simulate a lot. It's really a political problem. If the US and Russia allowed each other one underground bang each year, there would be less need for all this iffy simulation.
So don't worry too much about whining from Sandia about supercomputers. When Google or Amazon start complaining that multicore machines are choking in their server farms, it's time to listen.
Rather than laptops with zillions of CPU cores, we're probably going to see CPU chip real estate used for more cache, and maybe even main memory. The near future is the one-chip laptop that sells for $100 or so.
Good call, I didn't see the Direct Connect(tm) stuff. I should try to keep more abreast of such things. Nice design too.
Did a paper on this one in 1995 for a previous MS degree.
Amdahl's law says that the performance in the IEEE article will happen....
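For reference, Amdahl's law gives the speedup on n cores as 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel. A minimal sketch follows; the 90% parallel fraction is an arbitrary example value, not a figure from the article.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

if __name__ == "__main__":
    # Example only: assume 90% of the work parallelizes.
    for n in (2, 4, 8, 16, 32):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

Note that the formula on its own only flattens out (here it can never exceed 10x); it never goes down as cores are added, so the decline reported in the article needs an extra cost such as contention for memory bandwidth.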
I eagerly await the Slashdot story about someone freezing their junk off with their laptop.
What the article points out is that while the number of ALUs per chip has increased, memory-to-processor throughput has not. If you are working with large amounts of data (i.e., not factoring numbers), the processor is unable to keep the cores fed. Most supercomputer applications today involve large data sets. In one case examined, an 8-core CPU performed about the same as a dual core, and with 16 to 64 cores performance degraded quickly to less than that of a dual core.
This is the memory bottleneck and is likely to be the case for database systems and other systems processing large data sets. The bottleneck needs a name. Any ideas?
In any case, the article only refers to data mining. Perhaps these questions are better answered by Oracle, DB2, or other TPC score winners.
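The "unable to keep the cores fed" effect can be illustrated with a simple bound: throughput is the smaller of what the cores can compute and what the memory system can deliver. This is a toy model, not the Sandia simulation; the per-core rate, the bandwidth, and the bytes needed per flop are hypothetical numbers, and the model only shows the leveling off, not the decline the article reports.

```python
def attainable_gflops(cores: int,
                      gflops_per_core: float,
                      memory_gb_per_s: float,
                      bytes_per_flop: float) -> float:
    """Upper bound on throughput: compute-limited or bandwidth-limited,
    whichever is lower."""
    compute_limit = cores * gflops_per_core
    bandwidth_limit = memory_gb_per_s / bytes_per_flop
    return min(compute_limit, bandwidth_limit)

if __name__ == "__main__":
    # Hypothetical chip: 10 gflops per core, 32 GB/s of memory bandwidth,
    # and a data-heavy workload needing 1 byte from memory per flop.
    for n in (1, 2, 4, 8, 16, 32):
        print(n, attainable_gflops(n, 10.0, 32.0, 1.0))
```

With these made-up numbers the chip is bandwidth-bound from 4 cores onward, so every additional core adds nothing.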
Sun's Niagara processors are not particularly suitable for supercomputers, but they have some innovative useful features for getting around the memory bandwidth problem.
Their truly innovative processor that should be superb for supercomputing is ROCK, if it ever sees the light of day.
As well as multiple cores, multiple threads per core (i.e. "contexts"), powerful floating-point cores and SIMD units, its killer feature is what sounds like a very clever kind of automatic speculative pre-fetching from main memory into cache.
Intel's chips always make me laugh a little. All that processing power and no memory or I/O bandwidth. They've only just caught up with AMD in that respect and now they're planning 80 cores with very little improvement in memory bandwidth...
Stick Men
There's a very simple solution for the memory and interconnect bandwidth bottlenecks, and that is to widen the channels. If you look at Intel's Nehalem roadmap, they're planning to move from triple-channel to quad-channel on the really monstrous chips coming out down the line. Likewise, AMD has been doing a lot of work on making HyperTransport channels more configurable, so you could allocate less bandwidth to an I/O bridge and more to the interconnect, if you're building a system where that's what's important.
If you're just adding more execution cores without changing what's around them, then this criticism holds, but the long-term significance of multi-core design is about VLSI, which lowers the latency between components. Latency is critical in supercomputing applications, so any time you can squeeze those gigaflops and their attached memory closer together, performance improves.
There's no failure quite as dissatisfying as a complete and total solution to the wrong problem.
*applause*
http://rocknerd.co.uk
A previous Slashdot article included an nVidia executive saying Intel has been wrong on cpu design for a long time - that the critical design feature needs to be memory bandwidth, not cpu ticks or speed or any of the numbers they've so far focused on.
But I think this just shows supercomputer designers need to stop thinking about CPUs and start thinking about GPUs. Multicore is here and commoditized already, and if you can do your work on shaders then you're looking at not 8, 16, or 32 cores but 640 or 1280 cores to do your work, all with bus designs that put memory first.
bravo.
I was just reading the AMD documentation on the Phenom. The part I found interesting was that Phenom uses the same 144-bit ECC code in both ganged and unganged mode. In the latter case, the ECC code is used across two 72-bit transfers from the same channel, which optionally can be bitwise interleaved. 4-bit chipkill correction is lost in this case but detection still works.
erm, don't you still only have one RAID controller per system? Don't I/O requests still bottleneck at the DMA interface if they're faster ($DEITY forbid that happening) because there is only one bus? So how does your RAI[R|M] facilitate this system? You've still got to coordinate all that memory. Unless....
What if each RAM chip is assigned a write bit to one DMA style controller, but multiple or various other controllers have READ ONLY access to the RAM? I don't know if this is even feasible, but it's a thought...
2^3 * 31 * 647
I bought a Mac Pro 8-core machine to learn multi-core programming, then I discovered NVIDIA CUDA programming and I am looking at buying a C1080 240-core GPU to learn to program that. The industry is manufacturing lots of multi-core devices, but parallel programming hasn't adapted to this new paradigm or provided the right tools to take advantage of these new technologies.
Vector processing in commodity designs isn't enough. Of course we are going to see it; at this point it's not very expensive to add. Adding vector processing for increased flops is easy. The hard part is the bandwidth. One of the reasons the X1 processors were expensive was that they were custom, but so are the network chips in commodity-CPU supers, and they only add $1000/node. The real cost of X1-style memory is that you have 64 channels of memory, which is a lot of wires, DIMMs, memory parts, etc. There's a very real cost to all the memory components needed to get the kind of bandwidth you need to support a high-throughput vector pipeline.
The commodity processor vendors aren't going to do this sort of thing, as it adds to the cost of the chip, but provides nothing to the bulk of their customers who are running mysql, apache, or halflife.
The one hope I have is something like the core2 architecture, where ddr3 is used for desktop processors, and fbdimm is used for server parts. The two components share a lot of architecture, and only a few of the asic cells are different. If a cpu vendor were interested in the HPC market, they could design a cpu to use a standard memory channel for desktop/low-end server parts, and something more expensive, but higher bandwidth for the HPC space. It would mean HPC specific processors, but sharing most of the engineering with the commodity part. Maybe Cray could license them the design for their weaver memory controller in the X2. It's kind of like the AMB on a FB-DIMM, but it includes 4 channels of DDR2 on each stick of memory.
I'd love to see each core on a massively multicore design get its own memory controller. I'm not holding my breath, however. If you think of a 32-core CPU, it's pretty unlikely that most supercomputer or cluster vendors are going to pay for 32 dimms for each cpu socket. So then you're talking about multiple memory channels per memory stick. You can still get ECC using 5 memory chips per channel, so you can imagine 4 channels fitting on a memory riser. Cray does this on the X2. Then 32 channels would only require 8 dimms, which is reasonable. Then what do you do for 64-core CPUs?
It's tricky, and the problem for the market is that it's expensive. Can you get the commodity CPU vendors interested in such a thing, given that most of their addressable market is not in the supercomputing space?
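The DIMM counts in the comment above are easy to check. A back-of-the-envelope sketch in Python, assuming one memory channel per core and 4 channels per riser as described; the function name `dimms_needed` is made up for illustration.

```python
import math

def dimms_needed(cores, channels_per_dimm=4, channels_per_core=1):
    """Memory sticks per socket if every core gets its own channel(s)."""
    channels = cores * channels_per_core
    # Round up: a partly used riser still occupies a slot.
    return math.ceil(channels / channels_per_dimm)

for cores in (8, 16, 32, 64):
    print(cores, "cores ->", dimms_needed(cores), "DIMMs per socket")
```

With 4 channels per stick, 32 cores need 8 sticks and 64 cores need 16, so the slot count doubles with every doubling of cores unless the channels per stick go up too.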
I think we're gonna see more cores in a CPU than there's bandwidth to use. They might increase the bandwidth a bit, but probably just enough to get good Linpack numbers.
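A crude way to see what "more cores than bandwidth" does is a roofline-style bound: attainable throughput is the smaller of what the cores can compute and what the memory system can feed them. The sketch below is Python with made-up numbers (10 GFLOPS per core, 25 GB/s per socket, 0.5 flops per byte of memory traffic); it is not the Sandia simulation, and it only shows the leveling off, not the decline they report, which would need a contention model on top.

```python
def attainable_gflops(cores, gflops_per_core, mem_bw_gbs, flops_per_byte):
    """Roofline-style bound: min(compute ceiling, memory ceiling)."""
    compute_bound = cores * gflops_per_core      # grows with core count
    memory_bound = mem_bw_gbs * flops_per_byte   # fixed by the memory system
    return min(compute_bound, memory_bound)

# All figures below are illustrative assumptions, not measurements.
for cores in (1, 2, 4, 8, 16, 32):
    print(cores, "cores:", attainable_gflops(cores, 10.0, 25.0, 0.5), "GFLOPS")
```

With these assumed figures the memory ceiling is reached at two cores; every core after that adds nothing unless the bandwidth or the flops-per-byte of the code goes up.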
Note the comment from Steve Conway from IDC:
Steve Conway, senior analyst with IDC for high performance computing issues, said this problem has been around for a while, and multi-core is only exacerbating it. "x86 processors were never designed for HPC," he told InternetNews.com. "Those processors were not designed to communicate with each other at a high speed. With these big systems, you have to move data over large territories.