Supercomputer Advancement Slows?
kgeiger writes "In the Feb. 2011 issue of IEEE Spectrum online, Peter Kogge, an IEEE Fellow and professor of computer science and engineering at the University of Notre Dame, outlines why we won't see exaflops computers soon. To start with, consuming 67 MW (an optimistic estimate) is going to make a lot of heat. He concludes, 'So don't expect to see a supercomputer capable of a quintillion operations per second appear anytime soon. But don't give up hope, either. [...] As long as the problem at hand can be split up into separate parts that can be solved independently, a colossal amount of computing power could be assembled similar to how cloud computing works now. Such a strategy could allow a virtual exaflops supercomputer to emerge. It wouldn't be what DARPA asked for in 2007, but for some tasks, it could serve just fine.'"
Then I could play Duke Nukem Forever properly!
...of all the existing supercomputers.
Is that Crysis 2 isn't out yet. When it is, people will be all going out to buy their own supercomputer to run the game.
In the past, a lot of applications needed a true supercomputer to be built to solve them, be it basic weather modeling, rendering for ray tracing, etc.
Now, most applications can be done on COTS hardware. Because of this, there isn't much of a push to keep building faster and faster computers.
So, other than the guys who need the top of the line CPU cycles for very detailed models, such as the modelling used to simulate nuclear testing, there isn't really as big a push for supercomputing as there was in the past.
In the computer industry, the most impressive displays of stupidity generally result from linear extrapolation.
There will be 'exascale' computers, they just won't look like scaled up versions of today's computers, where the bulk of the die area, power, and complexity is spent on making something which is programmable by a horde of low-skill programmers and performs equally poorly for thousands of different applications. The whole notion of general-purpose for the supercomputer industry is kinda silly, considering that there are only a handful of applications which justify the expense of building such a computer.
Just design a specialized machine for each of those applications and one can get a factor of a hundred improvement.
...but doesn't history show us that most things stall before the next large wave of advances?
Joe Investor
Current supercomputers are limited by consumer technology. Adding cores is already running out of steam on the desktop. On servers it works well because we are using them mainly for virtualization. Eight- and sixteen-core CPUs will border on useless on the desktop unless some significant change takes place in software to use them.
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
I would be very willing to run something akin to Folding@Home where I get paid for my idle computing power. Why build a supercomputing cluster when, for some applications, the idle CPU power of ten million consumer machines is perfectly adequate? Yes, there needs to be some way to verify the work, otherwise you could have cheating or people trolling the system, but it can't be too hard a problem to solve.
Love sees no species.
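One common way to approach the verification problem raised above is redundancy: hand the same work unit to several machines and accept a result only when enough of them agree. The following is a minimal sketch, assuming results are directly comparable values; `accept_result` and `quorum` are hypothetical names for illustration, not any project's actual API.

```python
from collections import Counter


def accept_result(reported, quorum=2):
    """Return the value reported by at least `quorum` machines, else None.

    Assumption: results can be compared for exact equality.
    """
    if not reported:
        return None
    value, votes = Counter(reported).most_common(1)[0]
    return value if votes >= quorum else None


# Three volunteers ran the same work unit; one of them returned junk.
print(accept_result([42, 42, 7]))
```

The price is that every work unit is computed more than once, so part of the "free" idle capacity goes to cross-checking.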
A little bird informs the world that the US has a supercomputer already running on them, somewhere between 100 GHz and 1 THz per processor. Looks like exaflop has come and gone.
In the past, a lot of applications needed a true supercomputer to be built to solve them, be it basic weather modeling, rendering for ray tracing, etc.
Now, most applications can be done on COTS hardware.
It's true, many applications that needed supercomputers in the past can be done by COTS hardware today. But this does not mean there are no applications for bigger computers. As each generation of computers assumes the tasks done by the former supercomputers, new applications appear for the next supercomputer.
Take weather modeling, for instance. Today we still can't predict rain accurately. That's not because the modeling itself is not accurate, but because the spatial resolution needed to predict rainfall is beyond our computers. Engineers still use wind tunnels and still have tanks to test ship models; there are many situations where the most powerful computers today cannot perform calculations at the same level of precision one gets from scale models.
And then there are entirely new applications that are way beyond the capacity of our current computers. Drug design is one example: a computer capable of accurately calculating the shape a protein molecule will have given its sequence of amino acids is still a dream.
How long until "the cloud" becomes Skynet?
These modern machines, which consist of zillions of cores attached over very low bandwidth and high latency links, are really not supercomputers for a huge class of applications, unless your application exhibits extreme memory locality, needs hardly any interconnect bandwidth, and can tolerate long latencies.
The current crop of machines is driven mostly by marketing folks and not by people who really want to improve the core physics like Cray used to.
BANDWIDTH COSTS MONEY, LATENCY IS FOREVER
Take any of these zillion dollar piles of CPUs and just try doing this:
for ( x = 0; x < bounds; ++x )
{
    humungousMemoryStructure [ x ] = humungousMemoryStructure1 [ x ] * humungousMemoryStructure2 [ randomAddress ] + humungousMemoryStructure3 [ anotherMostlyRandomAddress ] ;
}
It'll suck eggs. You'd be better off with a single liquid nitrogen cooled GaAs/ECL processor surrounded by the fastest memory you can get your hands on all packed into the smallest place you can and cooled with LN or LHe.
Half the problem is that everyone measures performance for publicity with LINPACK MFLOPS. It's a horrible metric.
If you really want to build a great new supercomputer, get a (smallish) bunch of smart people together like Cray did, and focus on improving the core issues. Instead of spending all your efforts on hiding latency, tackle it head on. Figure out how to build a fast processor and cool it. Figure out how to surround it with memory.
Yes,
Customers will still use commodity MPP machines for the stuff that parallelizes.
Customers will still hire mathematicians, and have them look at ways to Map things that seem inherently non local into spaces that are local.
Customers who have money and the mathematicians couldn't help will need your company and your GaAs/ECL or LHe cooled fastest SCALAR / Short Vector box in the world.
That little Cray thing looks really nice. Nice work, whoever did it. Reminds me of '90s side-scrolling games for some reason.
"When information is power, privacy is freedom" - Jah-Wren Ryel
I read what I thought were the relevant sections of the big PDF file that went along with the article. They know that the actual RAM cell power use would only be 200 KW for an exabyte, but the killer comes when you address it in rows, columns, etc... then it goes to 800KW, and then when you start moving it off chip, etc... it gets to the point where it just can't scale without running a generating station just to supply power.
What if instead of trying to address everything that way, they break up the computing and move it to the data... so that RAM is tied directly to the logic that would use it... it would waste some logic gates, but the power savings would be more than worth it.
Instead of having 8 kbit rows... just a 16x4 bit lookup table would be the basic unit of computation. Globally read/writable at setup time, but otherwise only accessed via single bit connections to neighboring cells. Each cell would be capable of computing 4 single bit operations simultaneously on the 4 bits of input, and passing them to its neighbors.
This bit processor grid (bitgrid) is turing complete, and should be scalable to the exaflop scale, unless I've really missed something. I'm guessing somewhere around 20 megawatts for first generation silicon, then more like 1 megawatt after a few generations.
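To make the cell described above concrete, here is a minimal sketch of one lookup-table cell: 4 input bits select one of 16 table entries, and the 4 bits of that entry are the outputs. The class name `BitGridCell`, the bit ordering, and the idea that each output bit goes to a different neighbor are assumptions for illustration; the post does not specify them.

```python
class BitGridCell:
    """One cell: a 16-entry lookup table of 4-bit values (16x4 bits)."""

    def __init__(self, lut):
        if len(lut) != 16 or any(not 0 <= v <= 15 for v in lut):
            raise ValueError("lut must hold 16 values in the range 0..15")
        self.lut = list(lut)

    def step(self, inputs):
        """Map 4 input bits to 4 output bits (assumed: one per neighbor)."""
        if len(inputs) != 4:
            raise ValueError("a cell takes exactly 4 input bits")
        index = 0
        for bit in inputs:  # pack the bits, first input is the MSB (assumed)
            index = (index << 1) | (bit & 1)
        word = self.lut[index]
        return [(word >> shift) & 1 for shift in (3, 2, 1, 0)]


# Example table: every output carries the parity (XOR) of the 4 inputs.
parity_lut = [0b1111 if bin(i).count("1") % 2 else 0b0000 for i in range(16)]
cell = BitGridCell(parity_lut)
print(cell.step([1, 0, 0, 0]))
```

Since each output bit has its own column in the table, one cell evaluates four independent 4-input boolean functions per step, which matches the "4 single bit operations simultaneously" in the description.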
A little bird informs the world that the US has a supercomputer already running on them, somewhere between 100 GHz and 1 THz per processor
Unlikely. If you do the calculations, you'll find that the current 3GHz limit is about as fast as you can get data from other chips on a circuit board. 3GHz is 0.33 nanoseconds period, the time it takes for light to travel ten centimeters in a vacuum. A faster CPU will stay idle most of the time, waiting for the data it requested from other chips to arrive at the speed of light.
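The arithmetic behind that figure can be checked with a short sketch. It assumes signals travel at the vacuum speed of light, which is an upper bound; real signals on a circuit board are slower. The function name is hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # in a vacuum


def distance_per_cycle_cm(clock_hz):
    """Distance light covers in one clock period, in centimeters."""
    period_s = 1.0 / clock_hz
    return SPEED_OF_LIGHT_M_PER_S * period_s * 100.0


for ghz in (3, 100, 1000):
    print(f"{ghz:>5} GHz: {distance_per_cycle_cm(ghz * 1e9):.3f} cm per cycle")
```

At 3 GHz this comes out just under 10 cm per cycle, and at 100 GHz it is roughly 3 mm, less than the width of a typical chip package.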
I guess Peter Kogge doesn't keep up with current events in the tech industry like this one: http://www.thinq.co.uk/2011/1/24/new-nanotape-tech-promises-cooler-chips/
I liked this computer before, when it was called a Beowulf cluster.
Well.. maybe. Or Maybe not. But Definitely not sort of.
Well, yeah, if you deliberately design a program to not take advantage of the architecture it's running on, then it won't take advantage of the architecture it's running on. (This, btw, is one of the great things about Linux, but that's not really what we're talking about.)
One mistake you're making is in assuming only one kind of general computing improvement can be occurring at a time (and there is some good, quality irony in that *grin*). Cray (and others) can continue to experiment on the edge of the thermal/lightspeed barrier, while others can simultaneously increase computing resources by running clusters.
There's nothing inherently detrimental about clusters, and almost all tasks can be parallelized to some degree. There are, of course, always latency and bandwidth issues in distributed memory systems, but InfiniBand (which is what serious clusters use) has latency in the 1 microsecond range, which is only about five times memory latency itself, and a bandwidth that you'll never saturate with current data distribution models (we're working on that).
It's true that typically performance to processor growth is sublinear, but it's also nonzero. For even moderately parallel tasks, a $300k cluster can outperform a $3M supermachine, and the failure of that to extend to the very boundary is more an effect of our inability to grok the parallel coding model (parallel code is hard); as we understand more and more about concurrent programming you will see the oft-cursed 'penalty' of cluster computing dwindle more and more towards nothing.
That doesn't seem like a show stopper. In the 1950s, the US Air Force built over 50 vacuum tube SAGE computers for air defense. Each one used up to 3 MW of power and probably wasn't much faster than an 80286. They didn't unplug the last one until the 1980s.
If they get their electricity wholesale at 5 cents/kWh, 67 MW would cost about $30,000,000 per year. That's steep, but probably less than the cost to build and staff the installation.
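That estimate is easy to reproduce. A minimal sketch, assuming the machine draws the full 67 MW around the clock and the price is a flat 5 cents/kWh, as in the comment above (the function name is hypothetical):

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annual_energy_cost_usd(power_mw, usd_per_kwh):
    """Yearly electricity bill for a constant load."""
    power_kw = power_mw * 1000
    return power_kw * HOURS_PER_YEAR * usd_per_kwh


cost = annual_energy_cost_usd(67, 0.05)
print(f"${cost:,.0f} per year")
```

This gives a bit over $29 million, consistent with the "about $30,000,000" figure.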
In spite of some of the benefits of cloud computing or cluster computing, supercomputers are still needed and we need to go way beyond where people are currently thinking. The integration of Seebeck effect materials into the electron paths of processors, as well as integrated layers of Seebeck effect materials inside of processors, can reduce the internal core temperature of processors. In addition, advanced manufacturing can be used to integrate fluid-based meso and micro cooling channels that, when combined with nucleated phase change, can super-cool supercomputers as well as superconductors. Lots of supers there, but the future is full of super ideas and this professor is just out of the loop.
And while giant high end supercomputers may progress more slowly, we're slowly seeing a revolution in personal supercomputing, where everyone can have a share of the pie. Witness CUDA, OpenCL, and projects like GPU.NET (.NET for the GPU, and apparently easy to use, though expensive for now).
Add advancements such as multitasking in the next generation of GPUs (yes, they can't actually multitask yet, but when they do it'll be killer for a few reasons) and memory shared with the CPU (by combining both the GPU and CPU into one die, which goes into the CPU socket), and I think the future will be very interesting.
Why OpalCalc is the best Windows calc
His statements are both true and false. It's true that exaflops is a big challenge; however, research on supercomputers has not stopped. But there are other areas which are being looked at too. For example, algorithms. Whenever a new supercomputer is developed, parallel programmers try to modify or come up with new algorithms that take advantage of the architectures/network speeds to make things faster. Heck, there are some applications that have started looking at avoiding huge computations and instead going for incorrect/approximate ones and converging on the solution techniques. There are many apps that are OK with approximate results. Also, new denser and less power hungry memory technologies have come up, and some of those links have already been posted here so I am not going to post them. Besides, there are a bunch of better cooling techniques proposed by IBM, and the use of optical interconnects can help organize things in a more distributed way and provide easier heat dissipation. All in all, someone is going to come up with exaflop computing soon :)
I definitely agree with DARPA's conclusion that we cannot build an exascale computer. Thankfully, I think the Chinese might sell us some time on theirs, as long as we promise not to use it for nuclear-weapons simulations. With the money we save by outsourcing that job, we could probably end the estate tax once and for all.
Ray Kurzweil says so!
I've read the article (the WHOLE article) and the exaflop issue is generally posed in terms of power requirements in reference to current silicon technology and its most strictly related future advancements. The caveat of that is that not even IBM thinks exaflop computing can be achieved with current technology, that's why they are deeply involved with photonic CMOS, of which they have already made the first working prototype. Research into exaflop computing in IBM is largely based on that. You can't achieve the necessary power requirements without moving (at least in part) from electronic to photonic. This will decrease power requirements (and cooling requirements) by a large factor.
http://www.genomeweb.com/blog/anton-sets-protein-folding-record
Take a small team and build a high-performance, extremely power-efficient machine targeting a single application--accelerate said application by 100x. For the $1 Billion that the government is likely to waste on the exascale nonsense we could probably build five different special-purpose machines each targeting a vital area of high performance computing (fluid dynamics, oil exploration, weather prediction, etc.) and literally own those industries in the same way that the Anton supercomputer now owns molecular dynamics.
The machines could share a lot of technology if it's done right.
I hear you, but sure, you can harness a million chickens over slow links and reinvent the transputer or Illiac IV, but you're then constraining yourself to problems where someone can actually get a handle on the locality. If they can't, you're *screwed* if you want to actually really improve your ability to answer hard problems in a fixed amount of time.
You can even just take your problem, and brute force parallelize it and say 'wow lets run it for time steps 1..1000' and farm that out to your MPP or your clusters, and then wait a good long amount of time till you get your answers... but it doesn't help you one IOTA if after seeing the first five steps you realize that you really should have asked a different question.
I'm not against MPP or Multithreading (I liked the Tera-MTA), Transputers, Dataflow, CUDA, Millions of School children with Abacus (ii?), or any of that stuff, but there are real science questions that can't be answered by them. There really are people out there who do stuff where they really do need to be able to say
a[i] = b[j]*c[k]+d[l]; ... and where *nobody* (with lots of smart nobodys) has managed to figure a decent way to partition the problem.
You do realize that if you go off-node on your cluster even over infiniband the 1uS is about equal to a late 1960's core memory access time right?
Check out the Limulus Project
HPC for Primates. Read Cluster Monkey
Well said, sir.
Jim
Why has nobody tried this before? They could easily plow through the data from SETI, fold proteins, or even have a platform for creating and distributing cloud based computing turnkey computing solutions! It's too bad that the cloud was not invented until a year or two ago, this stuff could have probably started out in 1999 if the cloud existed back then.
What about a super computer made out of FPGAs ?
Just put a bunch of FPGAs in a board, and write real hardcore tasks in hardware language like vhdl...
You do realize that if you go off-node on your cluster even over infiniband the 1uS is about equal to a late 1960's core memory access time right?
Sure, but having 1960 mag core access to entirely different systems is pretty good, I'd say. And it will only improve.
It's a false dichotomy. There are some problems that clusters are bad at. That is true. The balancing factor that you are missing is that there are problems that single-proc machines are bad at, also. For every highly sequential problem we know, we also know of a very highly parallel one. There are questions that cannot be efficiently answered in a many-node environment, but there are also questions that can be answered far faster. What you're saying is basically that a desktop computer running modern games doesn't need a GPU, it just needs a faster CPU. It's failing to optimize for the problem; driving the screw with a hammer, if you will.
There is no such thing as effective "brute force parallelize". Effective concurrent computation requires thought. Sequential programming is MUCH easier to design and optimize for; that's the draw of single-proc systems. There is nothing inherently non-partitionable about "a[i] = b[j] * c[k] + d[l]"; indeed, I would submit that there are very easy threading optimizations that can be done for any combinations of i,j,k,l; a single thread of execution is GUARANTEED to be slower than even the most trivially optimized multithreaded case. Any multithreaded case benefits from additional execution units up to the number of threads as long as the computation bandwidth (NOT the interprocess bandwidth!) increases enough to account for the increase in computation latency (which is linked to, but not identical to, interprocess latency).
Sequential is dying, because it's suboptimal. Trying to burn the gas faster and faster doesn't get around the fact that the engine just isn't that efficient.
I disclosed this sort-of-cogeneration idea before on the open manufacturing list so that no one could patent it, but for years I've been thinking that the electric heaters in my home should be supercomputer nodes (or doing other industrial process work), controlled by thermostats (or controlled by some algorithm related to expectations of heat needs).
When we want heat, the processors click on and do some computing and we get the waste heat to heat our home. When the house is warm enough, they shut down. They would use the network to talk to the rest of the nodes in neighbor's homes, or homes across the globe, to form a supercomputing cloud. Basically, any place in the country that has an electric heater (or similar thing) could have a processing node instead (this includes water heating, too, and even things like kilns). (Hydroponic agriculture would be another example use as well instead of computing, growing plants in winter where the grow lights were controlled by thermostats or heating algorithms or timers.)
For reference, for those who don't know much physics, essentially all use of electricity produces waste heat eventually, so if you run a computer that takes 100 watts, it heats the room as much as running a 100 watt heater. The same goes for a 100 watt incandescent lightbulb, which also doubles as a 100 watt heater. For those who live in homes in cold climates (heating somewhat most of the time) and who do not have very well insulated homes, paying more for energy efficient appliances may not pay well, because your electric heaters just have to pick up the slack left by the more efficient lights.
I don't know the industrial figures, but for residential electric heating use in 2001:
http://www.eia.doe.gov/emeu/reps/enduse/er01_us.html
"Electric space heating accounted for an additional 116 billion kWh (10 percent of the total)... Electric water heating accounted for over 100 billion kWh (9 percent) in 2001."
So that is about 200 billion kWh per year, or about 23 gigawatts continuously. It is "free" power to use for computing in a sense. (I know, it would need to be networked -- maybe with integrated wireless of some sort?) So, that would be enough power for about 46 of the 500 MW computers they mention in the article. The cost savings would be (at US$0.10 per kWh) 20 billion dollars a year in energy costs. Looking around, commercial buildings use about the same amount of electric heating. Electric use has increased over the past decade, as well. So potentially 100 or so of these exaflops machines could be powered by residential and commercial heating needs alone.
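The numbers in this paragraph follow from a few lines of arithmetic. A minimal sketch, assuming the rounded 200 billion kWh per year, the 500 MW per machine cited from the article, and US$0.10 per kWh as stated above (function and variable names are hypothetical):

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def average_power_gw(energy_billion_kwh_per_year):
    """Convert a yearly energy total into an equivalent continuous load."""
    kwh = energy_billion_kwh_per_year * 1e9
    kw = kwh / HOURS_PER_YEAR
    return kw / 1e6  # kW -> GW


heating_gw = average_power_gw(200)
machines = heating_gw / 0.5          # 500 MW = 0.5 GW per machine
savings_usd = 200e9 * 0.10           # kWh per year times price per kWh

print(f"{heating_gw:.1f} GW continuous, about {machines:.0f} machines")
print(f"${savings_usd / 1e9:.0f} billion per year")
```

Note that this treats heating demand as a constant load; in practice it is seasonal, so the available power would swing well above and below the average.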
I don't know what the figure would be for industrial process heat. If we shift away from fossil fuels and towards more energy from PV, wind, nuclear, and cold fusion, there might be terawatts of power available to use for computing in this way, where the waste heat (on demand) then drove industrial processes like making plastic or refining ore. Waste heat could also drive heat engines for mechanical action. So, industrial processes might be able to power (for "free") thousands of these supercomputers.
Large datacenters could also be located in places that wanted the heat, like near big buildings. Power plants sometimes have industrial plants near them that want their waste heat already, so this would be a similar thing. The datacenter waste heat could also be concentrated by heat-pumps and used for industrial processes (like melting silicon to make solar cells or IC chips).
I guess with cold fusion in the air (with the Italy demo claim) I should disclose the idea of integrating cold fusion power production (such as without limitation nickel/hydrogen fusion) directly into, or adjacent to, computing nodes that somehow directly use the energy, either electricity generated someway or even running directly off any generated radiation. These too could also be thermostat controlled (or controlled by some algorithm related to exp
A 21st century issue: the irony of technologies of abundance in the hands of those still thinking in terms of scarcity.
a single thread of execution is GUARANTEED to be slower than even the most trivially optimized multithreaded case.
That is true if and only if the cost of multithreading doesn't include greatly increased latency or contention. Those are the real killers. Even in SMP there are cases where you get eaten alive with cache ping ponging. The degree to which the cache, memory latency, and lock contention matter is directly controlled by the locality of the data.
For an example, let's look at this very simple loop:
FOR i = 1 TO 100
    A[i] = B[i] + c[i] + A[i-1]
You might be tempted to pre-compute B[i]+c[i] in one thread and add in a[i-1] in another, but you then have 2 problems. First, if you aren't doing a barrier sync in the loop the second thread might pass the first and the result is junk, but if you are, you're burning more time in the sync than you saved. Next, the time spent in the second thread loading the intermediate value cold from either RAM or L1 cache into a register will exceed the time it would take to perform the addition.
Given some time, I can easily come up with far more perverse cases that come up in the real world.
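For reference, the loop above is a running sum: the B[i]+c[i] terms are independent of each other, while the A[i-1] term chains every iteration to the previous one. The sketch below only shows that structure; whether splitting it across threads pays off on real hardware is exactly what is disputed in this thread, and the sketch does not settle it. Function names are hypothetical, and `accumulate(..., initial=...)` needs Python 3.8 or later.

```python
from itertools import accumulate


def sequential(a0, b, c):
    """The loop as written: each A[i] needs A[i-1] first."""
    a = [a0]
    for i in range(1, len(b)):
        a.append(b[i] + c[i] + a[i - 1])
    return a


def as_prefix_sum(a0, b, c):
    """Same result: independent sums first, then one running total."""
    partial = [b[i] + c[i] for i in range(1, len(b))]  # no dependence between i
    return list(accumulate(partial, initial=a0))       # the dependent chain


b = list(range(101))
c = [2 * i for i in range(101)]
print(sequential(5, b, c) == as_prefix_sum(5, b, c))
```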
You might be tempted to pre-compute B[i]+c[i] in one thread and add in a[i-1] in another, but you then have 2 problems. First, if you aren't doing a barrier sync in the loop the second thread might pass the first and the result is junk, but if you are, you're burning more time in the sync than you saved. Next, the time spent in the second thread loading the intermediate value cold from either RAM or L1 cache into a register will exceed the time it would take to perform the addition.
Given some time, I can easily come up with far more perverse cases that come up in the real world.
...of course there's going to be some kind of synchronization. The suggestion otherwise implies a lack of experience in the field; failure to plan sync before anything else is an undergrad mistake.
I fail to see how the sync burns more time than you save by threading the computation. It seems to me that doing operation a and operation b in sequence will almost always be slower than doing them simultaneously with one joining the other at the end (or, better and a little trickier, a max-reference count for the dependent thread).
Re: cache, you are making strong assumptions that you're going to get cache hits for equidistant indices across three arrays. In the kind of real-world computations that need this kind of hardware, you're not going to look at three flat 100 element arrays. You're going to be looking at 10k x 10k x 10k arrays, and the number of cache hits you get is not going to decrease by parallelizing the work -- indeed, since you're getting more cache for every node, the cache hits are going to increase with any degree of optimization. This is true in all cases.
There is nothing 'perverse' about the case you have presented. It's a trivial optimization problem that doesn't require more than a basic knowledge of cache and some hint of the target architecture. The more perverse cases are the ones that get worked on at the highest level, and I have seen no proof of any computational problem that is sequential in the fastest case.
...of course there's going to be some kind of synchronization. The suggestion otherwise implies a lack of experience in the field; failure to plan sync before anything else is an undergrad mistake.
As is not realizing that synchronization costs. How fortunate that I committed none of those errors! Synchronization requires atomic operations. On theoretical (read cannot be built) machines in CS, that may be a free operation. On real hardware, it costs extra cycles.
As for cache assumptions, I am assuming that linear access to linear memory will result in cache hits. That's hardly a stretch given the way memory and cache are laid out these days.
If you are suggesting handing off those subtotals and doing barrier sync through main memory (not sharing a data cache), you either have no idea what modern register vs cache vs main memory timings are like or you're using a fairly exotic and expensive system where RAM is one big cache. But if the latter were true, you would have pointed out that cache hits are irrelevant, so...
Here's a thought, give it a try! See how it goes.
I never suggested that synchronization is free. However, a CAS or other (x86-supported!) atomic instruction would suffice, so you are talking about one extra cycle and a cache read (in the worst case) for the benefit of working (at least) twice as fast; you will benefit from extra cores almost linearly until you've got the entire thing in cache.
The cache stuff is pretty straightforward. More CPUs = more cache = more cache hits. Making the assumption that a[], b[], and c[] are contiguous in memory only increases this effect -- in your scenario, there is only one cache, and you'll have at most * 3 of the values local; whereas for every cpu you add in distribution the value increases linearly for quite some time.
This is ignoring the trivially shallow dependence of the originally proposed computation (there's a simple loop invariant) and making the assumption that a difficult computation is being done.
This is ignoring the trivially shallow dependence of the originally proposed computation (there's a simple loop invariant) and making the assumption that a difficult computation is being done.
I put the dependence there because it reflects the real world. For example, any iterative simulation. I could prove a lot of things if I get to start with ignoring reality. You asserted that there existed no case where a single thread performs as well as multiple threads, a most extraordinary claim. It's particularly extraordinary given that it actually claims that all problems are infinitely scalable with only trivial optimization.
CAS is indeed an atomic operation that could be used (I would have used a semaphore though), but to claim 1 cycle is far from the whole story. The instruction may take 1 cycle to execute (7 in an x86 actually) but the necessary cache invalidation is what kills you. A cache miss can cost 1000 cycles. If you go off node, it will cost many thousand even with a specialized backplane. An add costs 5. MOV from cache to register costs 1. So, the question comes down to this: would you rather take a guaranteed 1000 cycle hit for each iteration or would you prefer to spend 7 cycles doing the add yourself?
So, I say again, give it a try! Actually try to get a multi-threaded implementation of the loop I specified to outrun a single threaded version. You claimed it is trivial, so it shouldn't take you long.
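The trade-off in that question can be written down as a back-of-the-envelope model. This is a sketch using the cycle counts claimed in the comment above (about 7 cycles of local work per iteration, about 1000 cycles per cache miss); they are the poster's figures, not measurements. The model itself, with one handoff per iteration and the local work halved at best, is an assumption, and the function names are hypothetical.

```python
def single_thread_cycles(iterations, work_cycles=7):
    """One thread does all the work locally."""
    return iterations * work_cycles


def two_thread_cycles(iterations, work_cycles=7, handoff_cycles=1000):
    """Assumed model: the split halves the local work at best, but every
    iteration pays one cross-thread handoff (a cache miss)."""
    return iterations * (work_cycles / 2 + handoff_cycles)


def break_even_work_cycles(handoff_cycles=1000):
    """Per-iteration work at which the two-thread version ties in this model."""
    return 2 * handoff_cycles


print(single_thread_cycles(100), two_thread_cycles(100))
print(break_even_work_cycles())
```

Under these assumptions the split only wins once each iteration carries more than about 2000 cycles of work, which is why the granularity of the work per synchronization matters so much in this argument.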
Exactly!
It's all easy if you ignore:
Cache-misses.
Pipeline stalls
Dynamic clock throttling on cores
Interconnect delays
Timing skews
It's the same problems as the async CPU people go through, except everyone is wearing rose-colored spectacles and acting like they're still playing with nice synchronous clocking.
The semantics become horrible once you start stringing together bazillions of commodity CPUs. Guaranteeing the dependencies are satisfied becomes non-trivial, like you say, even for a single multi-core x86 processor. You end up either with heavy duty synchronization that is reliable and slow, or risk the chance of garbage and all kinds of synchronization hell. 1uS infiniband messages are several thousand clock cycles...
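The "several thousand clock cycles" figure is a one-line conversion. A minimal sketch, assuming a 3 GHz core clock (the ceiling mentioned elsewhere in this thread) and the 1 microsecond message latency quoted above; the function name is hypothetical.

```python
def cycles_per_message(latency_s, clock_hz):
    """Core clock cycles that elapse while one message is in flight."""
    return latency_s * clock_hz


cycles = cycles_per_message(1e-6, 3e9)
print(f"about {cycles:.0f} cycles per 1 microsecond message")
```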
The only caveat I would add is if people really got honest and just built big piles of transfer-triggered architecture stuff then many of the timing problems would be solved - but they don't...
What bugs me is that there are plenty of scientists that hate the MPP boxes because their codes that do real stuff simply don't parallelize well, yet every time someone claims fastest on the TOP500 and manages to string even more commodity CPUs together with even worse latency, the press says 'new breakthrough'.
Parallel computers are great, and parallel algorithms research is fine, and it should continue to be worked on, but someone needs to be figuring out how to build a damn fast GaAs box (or something else with better electron mobility than silicon) that has scalar performance that kicks serious ass. Hell, even if someone was doing what the ETA people tried to do, using commodity CMOS and seriously cooling it to get a speedup, that would be a start.
Agreed. I'm really glad MPP machines are out there, there is a wide class of jobs that they do handle decently well for a tiny fraction of the cost. In fact, I've been specifying those for years (mostly a matter of figuring out where the budget is best spent given the expected workload and estimating the scalability) but as you say, it is also important to keep in mind that there is a significant class of problem they can't even touch. Meanwhile, the x86 line seems to have hit the wall at a bit over 3GHz clock and the various RISC processors have been focusing on low power embedded applications. The DDR signaling already bears more resemblance to QAM than a logic line, so the only way to get any more out of it is faster switching/higher frequency.
Some of the new ARM cores are getting interesting. I do wonder how much market share ARM will win from x86. You're right about the DDR specs smelling like QAM. They are doing a great job at getting more bandwidth, but the latency sucks worse than ever. When it gets too much we will finally see processors distributed in memory, and Cray 3/SSS here we come...
I keep thinking more and more often that Amdahl's 'wafer scale' processor needs to be revisited. If you could build, say, a 3 centimeter square LN2-cooled CPU that had a few hundred MB of on-chip SRAM, that would be an improvement. You wouldn't need to have signal pins, just power pins. Replace the memory pins with interconnect pins - you might as well - and then cram a bunch of them into a cube, Cray 3 style.
I'm sure there would be a market for even a single chip like that built around a fairly typical current x86, let alone a bunch of them. It should still be an order of magnitude lower latency than external stuff. It wouldn't solve the problem, but after you've built the 3 cm square chip, figure out how to do it with a 4 cm square chip that's LN2 cooled.
If you didn't want to do x86, you could tailor the instruction set to add a bunch of useful synchronization stuff, and most critically you'd want the processor to be deterministic with regard to timing, so the compilers could know that accessing the memory at the edges of the chip is slower. If you got rid of all the cache you could even try explicitly handling dependencies with predicate registers. I'd have to think more about whether predicate registers versus cache makes sense, but it seems like it would make things way easier for the compiler.
You could say it would be a design meant to remove all the latencies that come from trying to drive all those memory and address lines fast - and kind of like a transputer...
The key part there is getting the memory up to the CPU speed. On-die SRAM is a good way to do that. It's way too expensive for a general purpose machine, but this is a specialized application. A few hundred MB would go a long way, particularly if either a DMA engine or separate general purpose CPU was handling transfers to a larger but higher latency memory concurrently. By making the local memory large enough and manually manageable with concurrent DMA, it could actually hide the latency of DDR SDRAM.
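The latency-hiding argument here is basically double buffering: while the CPU works on one block in local SRAM, the DMA engine fills the other. A toy timing model, assuming two buffers, one DMA engine, and fixed per-block times, shows the effect. All the numbers and function names are invented for illustration, not measurements of any real hardware.

```python
def serial_time(n_blocks: int, transfer: float, compute: float) -> float:
    """No overlap: the CPU stalls for every transfer before computing."""
    return n_blocks * (transfer + compute)


def double_buffered_time(n_blocks: int, transfer: float, compute: float) -> float:
    """Two local buffers: block i+1 is transferred while block i is computed."""
    if n_blocks == 0:
        return 0.0
    # The first transfer cannot be hidden and the last compute has nothing
    # left to overlap with; every step in between costs the slower of the two.
    return transfer + (n_blocks - 1) * max(transfer, compute) + compute


# Made-up example: 1 time unit to fetch a block, 3 units to process it.
blocks, t_xfer, t_comp = 1000, 1.0, 3.0
print("stalling on every fetch:", serial_time(blocks, t_xfer, t_comp))
print("with concurrent DMA:    ", double_buffered_time(blocks, t_xfer, t_comp))
```

As long as processing a block takes at least as long as fetching the next one, the external memory latency only shows up once, at the very start. That is the case for making the local memory big enough that each block carries plenty of work.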
For applications where parallelization was possible, additional CPUs could be interconnected through fast links as you suggest.
On the lower end, just adding a fast common scratch space to multi-core CPUs so they could actually do locking in just a few cycles or even do very low latency message passing would be a big help. On the OS side, the ability to dedicate a core exclusively to a task could be useful.
I thought about this some more and came to the same conclusion re external memory. I was trying to weigh the relative merit of very fast, very small (say 4K instructions) channel processors that can stream memory into the larger SRAM banks. The idea would be DMA on steroids. If you're going to build a DMA controller and have the transistor budget, then replacing a DMA unit with a simple, fast in-order core might be a win, especially if it was fast enough that you could do bit vector stuff / record packing and unpacking down in the channels. The caveat is that you need to keep the timing consistent, so if that became difficult I would go back to just straight DMA.
The downside of adding lots of channels with the necessary drivers for memory is that you'll end up needing more pins and then driving those pins, so power distribution becomes a factor. I think there's real benefit in surfacing the timing details of the channel/memory processors to compilers, so that compilers can schedule things taking into account the performance of the larger memory.
I really like your idea of a common fast scratch space with deterministic timing for locking. I think that's a great idea and would hugely improve things on commodity CPUs.
Looking at the above, you end up with a picture of basically a Crayesque machine. It occurs to me that if you added FP to the channel processors then you could start thinking about them like vector pipelines. I'm not sure that makes sense; it seems like doing DMA first, then maybe a simple in-order integer processor, and then maybe FP would be a growth path for channels.
I wish I knew more about the limitations of putting SRAM on a die. Doing some searching, it looks like Tukwila has a 24 MB L3 cache (http://forums.anandtech.com/showthread.php?t=2093673). I think that's at 1.6 GHz, so 25 ns. If you could cool it at least as well as the ETA guys and get, say, 4x, you're down to ~6 ns for that 24 MB.
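Spelling out that arithmetic as a quick sketch: 25 ns at 1.6 GHz works out to a 40-cycle access, which is inferred from the two numbers quoted above rather than taken from a datasheet. The 4x figure is the hoped-for cooling speedup, and the sketch assumes it carries straight over to the SRAM access time, which is optimistic.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles / clock_ghz


L3_CYCLES = 40          # inferred from 25 ns x 1.6 GHz, not a datasheet value
CLOCK_GHZ = 1.6         # clock quoted in the comment above
COOLING_SPEEDUP = 4.0   # hoped-for gain from aggressive cooling (assumption)

base_ns = cycles_to_ns(L3_CYCLES, CLOCK_GHZ)
cooled_ns = base_ns / COOLING_SPEEDUP
print(f"as-is: about {base_ns:.2f} ns, cooled: about {cooled_ns:.2f} ns")
```

That is where the "~6 ns" comes from: a quarter of 25 ns.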
One real trade-off question to me would be at that point, how long would it take to do RDMA / use a channel processor to fetch from a different CPU versus talk to a bunch of external DRAM? At which point your deterministic timing goes out the window so maybe DRAM is the way to go...
The channel controllers are a good idea. One benefit is that there need be no real distinction between accessing another CPU's memory and an external RAM other than the speed/latency. So long as all off-chip access is left to the channel controllers, with the CPU only accessing its on-die memory, variable timing off chip wouldn't be such a big problem. Only the channel controller would need to know.
The SDRAM memory controller itself and all the pins necessary to talk to SDRAM modules can be external to the CPU if necessary. It would be just one more thing a channel controller might talk to. If the inter-chip communication is HyperTransport-like with smarter endpoints, the pin count can be kept under control. The current 8- and 12-core Opteron chips support 4 memory channels and (I think) 4 HyperTransport channels. Drop the on-board memory channels and it starts looking quite reasonable.
If channel controllers also handle disk access, it looks like a mainframe. If they map the disk into the address space like very slow RAM, it looks like an AS/400. I suppose the supercomputer/mainframe/mini distinction would be a matter of how many of what are included in the system, and perhaps which OS is loaded.