500 Billion Very Specialized FLOPs
sheckard writes: "ABC News is reporting about the world's fastest 'supercomputer,' but the catch is that it doesn't do much by itself. The GRAPE 6 supercomputer computes gravitational force, but needs to be hooked up to a normal PC. The PC does the accounting work, while the GRAPE 6 does the crunching." The giant pendulum of full-steam-ahead specialization vs. all-purpose flexibility knocks down another one of those tiny red pins ...
So along come some doods who said, why don't we recursively stick the particles into boxes and then calculate the attraction between the boxes instead, and it should be a lot faster. So they tried it and it seemed to work great: it only takes more like 10,000 calculations to do 1000 particles, instead of the million or so you need when every particle pulls on every other one (the boring way).
Anyway along came some other guys and they were a bit suspicious. They showed that some galaxies fell apart under some conditions with the recursive boxes method, when like they shouldn't. Back to the drawing board.
There are some fixes for this now; they run more slowly, but still a lot faster than the boring way. Still, it's better than the end of the universe. Even if it is only a toy universe.
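For anyone who wants to see the numbers, here's a minimal sketch of the boring all-pairs way in Python, plus a back-of-envelope comparison against an N*log2(N) estimate for the recursive boxes. The function names, the G=1 units and the optional softening term eps are my own assumptions for illustration; this is not the GRAPE's actual pipeline, and the log estimate ignores the constant factors a real tree code has.

```python
import math

def direct_accelerations(pos, mass, G=1.0, eps=0.0):
    """All-pairs gravitational accelerations (the 'boring way').

    pos  : list of [x, y, z] positions
    mass : list of masses, same length as pos
    eps  : softening length (assumption); with eps=0, coincident
           particles would cause a division by zero
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Vector from particle i to particle j
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0] ** 2 + d[1] ** 2 + d[2] ** 2 + eps ** 2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * d[k] * inv_r3
    return acc

def pair_evaluations(n):
    """Force evaluations the double loop above performs."""
    return n * (n - 1)

def tree_estimate(n):
    """Rough N*log2(N) cost estimate for a recursive-boxes method."""
    return n * math.log2(n)

n = 1000
print(pair_evaluations(n))      # 999000
print(round(tree_estimate(n)))  # 9966
```

So for 1000 particles that's about a million evaluations versus about ten thousand, which is where the factor of 100 comes from.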
For descriptions of loadsa algorithms, including 'symplectics', which are able to predict the future of the solar system to 1 part in 10^13 ten million years in the future, check out this link:
-WolfWithoutAClause
"Gravity is only a theory, not a fact!"If you re-read the article, you'll see that 500 billion is just ONE OF THE BOARDS in the GRAPE. There are going to be 200 boards in this puppy, making for a machine that's getting 100 petaflops.
Damn fast!
Installing Grape 6
Processor of gravity
Quake sure feels real now
Processors with embedded RAM have been under research for some time. Check out the IRAM project at Berkeley and the PIM project at University of Michigan and elsewhere. Despite all of the research, though, processor-in-memory hasn't made it into general use yet.
There are many problems with implementing a system like this in practice. The fabrication process used for DRAM is completely different from that used for logic. In general, for DRAM you want a *high* capacitance process so that the wells holding your bits don't discharge very quickly -- that way you can refresh less often. In logic you want *low* capacitance so that your gates can switch quickly (high capacitance -> high RC time constant -> slow rise/fall time on gates -> slow clock speed).
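To make the RC chain concrete, here's a small sketch that treats a gate output as a single first-order RC node. For that idealized case the 10%-90% rise time is RC * ln(9), about 2.2 RC. The resistance and capacitance values are made up for illustration and don't describe any real process.

```python
import math

def rise_time_10_90(r_ohms, c_farads):
    """10%-90% rise time of an ideal first-order RC node: RC * ln(9)."""
    return r_ohms * c_farads * math.log(9.0)

# Hypothetical numbers: same 1 kOhm driver, two different load capacitances
low_c = rise_time_10_90(1000.0, 10e-15)   # 10 fF, logic-style load
high_c = rise_time_10_90(1000.0, 50e-15)  # 50 fF, higher-capacitance load

print(f"low-C rise time:  {low_c * 1e12:.1f} ps")
print(f"high-C rise time: {high_c * 1e12:.1f} ps")
```

Rise time scales linearly with C, so five times the capacitance means five times slower edges on the same driver.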
Fabricating both with the same set of masks doesn't work particularly well, so you really have to compromise -- you'll basically be making a processor with a RAM process, or vice-versa. Alternately, you could use SRAM, which is nice and fast and is built with a logic process, but is 1/6th the storage density of DRAM. This is why SRAM is used for caches and DRAM is used for main memory.
Having the memory on the same die as the processor definitely gives a bandwidth and latency advantage. For instance, when you are on the same die, you can essentially lay as many data lines as you like so that you can make your memory interface as wide as you like.
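As a rough illustration of why width matters, peak bandwidth is just bus width times transfer rate. The sketch below assumes one transfer per clock, and the 64-bit and 1024-bit widths and the 100 MHz clock are hypothetical figures, not specs of any particular part.

```python
def peak_bandwidth_bytes_per_s(width_bits, clock_hz):
    """Peak bandwidth, assuming one transfer per clock cycle."""
    return width_bits / 8 * clock_hz

# Hypothetical external bus vs. hypothetical wide on-die interface
external = peak_bandwidth_bytes_per_s(64, 100e6)
on_die = peak_bandwidth_bytes_per_s(1024, 100e6)

print(f"64-bit external bus: {external / 1e6:.0f} MB/s")
print(f"1024-bit on-die bus: {on_die / 1e9:.1f} GB/s")
```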
But another large advantage is the power savings. Processors consume a great deal of their power in the buffers driving external signals. Basically, driving signals to external devices going through etch is power-expensive, and introduces capacitances that kill some of your speed. If you keep things on die, no such buffers are needed, and a great deal of power is saved.
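The usual first-order estimate for switching power is P = activity * C * V^2 * f. Here's a sketch comparing one off-chip line to one on-die wire; the capacitances, voltage, clock and activity factor are all assumed values picked only to show the scaling.

```python
def dynamic_power(c_farads, v_volts, f_hz, activity=1.0):
    """First-order switching power: activity * C * V^2 * f."""
    return activity * c_farads * v_volts ** 2 * f_hz

# Hypothetical loads: an off-chip pad plus board trace vs. a short on-die wire
off_chip = dynamic_power(10e-12, 3.3, 100e6)
on_die = dynamic_power(0.1e-12, 3.3, 100e6)

print(f"per off-chip line: {off_chip * 1e3:.2f} mW")
print(f"per on-die line:   {on_die * 1e3:.4f} mW")
```

Multiply the per-line difference by a few hundred data and address lines and the savings add up quickly.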
The first commercial application of the processor-in-memory concept that I am aware of is NeoMagic's video chips. They went with PIM not for bandwidth, but for power conservation and a reduced chip count. These characteristics are extremely appealing for portable computing, and thus NeoMagic now pretty much owns the laptop graphics market.
In a limited application, such as a 2D graphics card, this is feasible because the card only needs perhaps 4 MB of memory. Placing an entire workstation's main memory (say, 128 MB) on a single die *with* a processor would lead to a ridiculously massive die. Big dies are expensive, lead to low yield and increase design problems with clock skew. Thus, having 128 MB of DRAM slapped onto the same die as your 21264 isn't going to happen in the near future.
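To put a rough number on the yield problem, here's a sketch using the simple Poisson defect model, Y = exp(-A * D). The choice of model, the defect density and the die areas are all my assumptions for illustration, not figures from any real fab.

```python
import math

def poisson_yield(area_cm2, defects_per_cm2):
    """Fraction of good dies under the simple Poisson defect model."""
    return math.exp(-area_cm2 * defects_per_cm2)

# Hypothetical defect density and die sizes
d0 = 0.5  # defects per cm^2
for area in (1.0, 2.0, 4.0):
    print(f"{area:.0f} cm^2 die: {poisson_yield(area, d0) * 100:.0f}% yield")
```

In this model yield falls off exponentially with area, and you also fit fewer big dies on a wafer, so the cost per good die climbs even faster than the yield drops.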
Placing a small (4-8 MB) amount of memory on-die, and leaving the rest external is possible, but leads to non-uniform memory access (NUMA), which complicates software optimization and general performance tuning greatly. It is generally considered undesirable.
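A toy model of why this makes tuning painful: average access time depends heavily on what fraction of accesses hit the fast on-die memory. The latencies below are hypothetical, and the model ignores caches, contention and everything else a real machine does.

```python
def average_latency_ns(local_fraction, local_ns, remote_ns):
    """Weighted average access latency for a two-level memory."""
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns

# Hypothetical latencies: 10 ns on-die, 100 ns external
for frac in (0.5, 0.9, 0.99):
    avg = average_latency_ns(frac, 10.0, 100.0)
    print(f"{frac:.0%} on-die accesses -> {avg:.1f} ns average")
```

Performance swings a lot depending on data placement, and that placement is exactly what the programmer or OS now has to manage.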
Another approach is to build systems around interconnected collections of little processors, each with modest computing power and a small amount (say 8 MB) of memory. Thus, you are essentially building a mini-cluster, where each node is a single chip. This, too, leads to a NUMA situation, but it is more interesting, and many people are pushing it.
PIMs are going to be used more and more, and the massive hunger for bandwidth in 3D-gaming cards may very well drive them to market acceptance. The power consumption advantages will continue to appeal to portable and embedded markets as well. However, general purpose processors based on this design are unlikely in the near future. This style of design doesn't mesh well with current workstation-type architectures.
A bit of a tangent, but I hope it was informative...
--Lenny
What, me worry?
The problem with anything based on a microprocessor is the pathetic main memory bandwidth. If your program blows out the cache, the performance goes to hell.
A vector supercomputer is designed to have massive memory bandwidth, enough to keep the vector processing units operating at high efficiency. No cache or VM to slow things down. An engineer once told me that a Cray was a multimillion dollar memory system with a CPU bolted on the side.
See the STREAM benchmark web page for some measurements of sustained memory bandwidth. This separates the real computers from the toys.
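If you just want a feel for what that kind of measurement looks like, here's a crude sketch loosely in the spirit of STREAM's Copy kernel. It is not the STREAM benchmark itself: it copies a byte buffer with a Python slice assignment, counts bytes read plus bytes written, and reports the best of a few runs. The function name, buffer size and repeat count are my own choices, and results will depend on the interpreter as much as on the memory system.

```python
import time

def copy_bandwidth(n_bytes=32 * 1024 * 1024, repeats=5):
    """Estimate sustained copy bandwidth in bytes/s (best of `repeats`)."""
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # bulk copy of the whole buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)  # guard against a zero timer reading
    return 2 * n_bytes / best  # count bytes read plus bytes written

if __name__ == "__main__":
    print(f"copy bandwidth: {copy_bandwidth() / 1e9:.2f} GB/s")
```

Make the buffers much bigger than your cache, otherwise you're measuring the cache and not main memory.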
Mea navis aericumbens anguillis abundat