Startup Combines CPU and DRAM

← Back to Stories (view on slashdot.org)

Startup Combines CPU and DRAM

Posted by samzenpus on Sunday January 22, 2012 @10:27PM from the two-in-one dept.

MojoKid writes "CPU design firm Venray Technology announced a new product design this week that it claims can deliver enormous performance benefits by combining CPU and DRAM on to a single piece of silicon. Venray's TOMI (Thread Optimized Multiprocessor) attempts to redefine the problem by building a very different type of microprocessor. The TOMI Borealis is built using the same transistor structures as conventional DRAM; the chip trades clock speed and performance for ultra-low low leakage. Its design is, by necessity, extremely simple. Not counting the cache, TOMI is a 22,000 transistor design. Instead of surrounding a CPU core with L2 and L3 cache, Venray inserted a CPU core directly into a DRAM design. A TOMI Borealis core connects eight TOMI cores to a 1Gbit DRAM with a total of 16 ICs per 2GB DIMM. This works out to a total of 128 processor cores per DIMM. That said, when your CPU has fewer transistors than an architecture that debuted in 1986, there is a good chance that you left a few things out--like an FPU, branch prediction, pipelining, or any form of speculative execution. Venray may have created a chip with power consumption an order of magnitude lower than anything ARM builds and more memory bandwidth than Intel's highest-end Xeons, but it's an ultra-specialized, ultra-lightweight core that trades 25 years of flexibility and performance for scads of memory bandwidth."

2 of 211 comments (clear)

Min score:

Reason:

Sort:

Either/or? by Gwala · 2012-01-22 22:35 · Score: 5, Insightful

Does it have to be a either-or suggestion?
I could see this being useful as an accelerator - in the same way that GPUs can accellerate vector operations. E.g. memory that can calculate a hash table index by itself. Stuffed in as a component of a larger system it could be a really clever breakthrough for incremental performance improvements.

--
#!/bin/csh cat $0
Re:Map Reduce? by lkcl · 2012-01-23 00:09 · Score: 5, Insightful

Aspex Semiconductors took this a lot further. they did content-addressable-memory. ok, they did a hell of a lot more than that. they created a massively-parallel deep SIMD architecture with a 2-bit CPU (early versions were 1 bit), with each CPU having something like 256 bits of memory to play with. ok, early versions had 128-bits of "straight" RAM and 256 bits of content-addressable RAM. when i was working for them they were planning the VASP-G architecture which would have 65536 such 2-bit CPUs on a single die. it was the 10th largest CPU being designed, in the world, at the time.
programming such CPUs was - is - a complete f*****g nightmare. you not only have the parallelism of the CPU to deal with but you have the I/O handling to deal with. do you try to fit the data 1-bit-wide per CPU and process it serially? or... do you try to fit the data across 32 CPUs and process it in parallel? (each CPU was connected to its 2 neighbours so you could do this sort of thing). or... do you do anything in between, because if you have only 1-bit-wide that means that the I/O is held up, but if you do 32-bits across 32 CPUs you process it so quick that you're now I/O bound.
much of the work in fitting algorithms onto ASPs involved having to write bloody spreadsheets in Excel to analyse whether it was best to use 1, 2, 4 .... 32 CPUs just to process the bloody data! 6 weeks of analysis to write 30 lines of code for god's sake!
it gets worse: you can't even go read a book on algorithms for hardware because that doesn't apply; you can't go read a book on algorithms for software because _that_ doesn't apply. working out how to fit AES onto the Aspex Semi CPU took me about... i think it was 6 weeks, to even _remotely_ make it optimal. i had to read up on the design of the 2-bit Galois Field theory behind the S-Boxes, because although you could do 8-bit S-Box substitution by running 256 "compare" instructions, one per substitution, in parallel across all 4096 CPUs, it turned out that if you actually implemented the *original* 2-bit Galois Field mathematical operations in each of the 2-bit CPUs you could get it down to 40 instructions, not 256.
and that was just for _one_ part of the Rijndael algorithm: i had to do a comprehensive detailed analysis of _every_ aspect of the algorithm.
in other words, everything that you _think_ you know about optimising software and algorithm design for either hardware or for software is completely and utterly wrong, for these types of massively-parallel map-reduce and content-addressable-memory CPUs.
that leaves them somewhere in the very very specialist dept, and even there, they have problems, because it takes so long to verify and design a new CPU. when the Aspex VASP-F architecture was being planned, it was AMAZING! wow! 100x faster than the best Pentium-III processor! of course, within 18 months it was only 20x better than the top-of-the-line Pentium that was available, and by the time it _actually_ came out, it was only 5x better than a bunch of x86 CPUs, which are a hell of a lot easier to program.
it was the same story for the next version of the CPU, even though that promised to have 64k processing elements...