Slashdot Mirror


Startup Combines CPU and DRAM

MojoKid writes "CPU design firm Venray Technology announced a new product design this week that it claims can deliver enormous performance benefits by combining CPU and DRAM onto a single piece of silicon. Venray's TOMI (Thread Optimized Multiprocessor) attempts to redefine the problem by building a very different type of microprocessor. The TOMI Borealis is built using the same transistor structures as conventional DRAM; the chip trades clock speed and performance for ultra-low leakage. Its design is, by necessity, extremely simple. Not counting the cache, TOMI is a 22,000 transistor design. Instead of surrounding a CPU core with L2 and L3 cache, Venray inserted a CPU core directly into a DRAM design. A TOMI Borealis IC connects eight TOMI cores to a 1Gbit DRAM with a total of 16 ICs per 2GB DIMM. This works out to a total of 128 processor cores per DIMM. That said, when your CPU has fewer transistors than an architecture that debuted in 1986, there is a good chance that you left a few things out--like an FPU, branch prediction, pipelining, or any form of speculative execution. Venray may have created a chip with power consumption an order of magnitude lower than anything ARM builds and more memory bandwidth than Intel's highest-end Xeons, but it's an ultra-specialized, ultra-lightweight core that trades 25 years of flexibility and performance for scads of memory bandwidth."
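
The per-DIMM figures in the summary can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted above (eight cores and 1 Gbit of DRAM per IC, 16 ICs per DIMM); the constant names are made up for illustration, and the DRAM share per core is a derived figure that Venray is not quoted as stating.

```python
# Figures quoted in the summary above.
CORES_PER_IC = 8        # TOMI cores attached to each 1 Gbit DRAM
DRAM_GBIT_PER_IC = 1    # DRAM capacity per IC, in gigabits
ICS_PER_DIMM = 16

cores_per_dimm = CORES_PER_IC * ICS_PER_DIMM
dimm_gbytes = DRAM_GBIT_PER_IC * ICS_PER_DIMM / 8   # 8 bits per byte

# Derived, not quoted: an even split of the DIMM's DRAM across all cores.
dram_mbytes_per_core = dimm_gbytes * 1024 / cores_per_dimm

print(cores_per_dimm, dimm_gbytes, dram_mbytes_per_core)
```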

15 of 211 comments

  1. but... by Anonymous Coward · · Score: 5, Funny

    does it run GNU/Linux?

    1. Re:but... by robthebloke · · Score: 5, Funny

      Multiple, low power, semi useless processor cores? Sounds like sony has just found the silicon to power the Playstation 4! :p

  2. Either/or? by Gwala · · Score: 5, Insightful

    Does it have to be an either-or suggestion?

    I could see this being useful as an accelerator - in the same way that GPUs can accelerate vector operations. E.g. memory that can calculate a hash table index by itself. Stuffed in as a component of a larger system it could be a really clever breakthrough for incremental performance improvements.

    --
    #!/bin/csh cat $0
  3. Processing In Memory by Anonymous Coward · · Score: 5, Interesting

    This isn't new. The MIT Terasys platform did the same in 1995, and many have since. Nobody has yet come up with a viable programming model for such processors.

    I'm expecting AMD's Fusion platform to move in the same direction (interleaved memory and shader banks), and they already have a usable MIMD model (basically OpenCL).

  4. Why not a hexagonal design? by G3ckoG33k · · Score: 4, Interesting

    Speaking of unconventional design, why don't we see hexagonal or triangular CPU-designs? All I have seen are the Manhattan-like designs. Are these really the best? Embedding the CPU inside a hexagonal/triangular DRAM design should be possible too. What would be the trade-offs?

  5. synthesis by lkcl · · Score: 4, Informative

    there's a problem with doing designs like this. the tooling for CPUs is very very specific: 28nm, 32nm, 45nm - all those companies that do the simulations where they charge something like $USD 250,000 per week to license their tools like mentor do - have written the tools SPECIFICALLY for those geometries.

    if you wander randomly outside of those geometries you are either on your own or you are into some unbelievably-high development costs.

    why is this relevant?

    it's because the DRAM manufacturers do *not* stick to the well-known geometries: they vary the geometry in order to get the absolute best performance because the cell layout is absolutely identical for DRAM ICs. and, because those cells _are_ identical, the verification process is much simpler than is required for a complex CPU.

    in other words, this company is trying to mix-and-match two wildly different approaches. in other words, what he's doing is either incredibly expensive or is sub-optimal. which begs the question: what's it _for_?

  6. Re:All on one chip by stevelinton · · Score: 4, Interesting

    There are basically two problems:

    1. The external connectivity -- SATA, USB, ethernet, etc. needs too much power to easily move or handle on a chip (and the radio stuff needs radio power). You can do the protocol work on the main chip if you like, but you'll need amplifiers, and possibly sensors off chip.

    2. DRAM and CPUs are made in quite different processes, optimised for different purposes. Cache is memory made using CPU processes (so it's expensive and not very dense). These guys are trying to make CPUs using DRAM processes, which are slow.

  7. Don't count this out yet by fyngyrz · · Score: 5, Interesting

    Useless? My key question would be does it have decent speed integer multiply and perhaps even divide instructions. A whole heck of a lot can be achieved if you have, say, the basic instruction set of a 6809, but fast and wide (and it didn't even have a divide... so we built multiply-by-reciprocal macros to substitute, that works too.)
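
    For readers who have not met the multiply-by-reciprocal trick, here is a minimal sketch of the general technique, not the 6809 macros mentioned above. It assumes unsigned 16-bit operands (the width `N` and both function names are illustrative) and replaces a divide with one multiply and one shift against a reciprocal computed once per divisor.

```python
N = 16  # assumed operand width in bits (illustrative, not from the comment)

def reciprocal(d):
    """Precompute a scaled reciprocal of d; done once per divisor."""
    if not 1 <= d < (1 << N):
        raise ValueError("divisor out of range")
    # floor(2^(2N) / d) + 1 rounds the reciprocal up slightly, which keeps
    # the truncated product exact for every n below 2^N.
    return (1 << (2 * N)) // d + 1

def div_by_reciprocal(n, m):
    """Unsigned n // d using a multiply and a shift, where m = reciprocal(d)."""
    return (n * m) >> (2 * N)

m10 = reciprocal(10)
print(div_by_reciprocal(12345, m10))  # 1234
```

    The cost moves into computing the reciprocal, so this pays off when the same divisor is reused many times, as with a fixed scale factor in an inner loop.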

    I know everyone's used to having FP right at hand, but I'm telling you, fast integer code and table tricks can cover a lot more bases than one might initially think. A lot of my high performance stuff -- which is primarily image processing and software defined radio -- is currently limited considerably more by how fast I can move data in and out of main memory than it is by actually needing FP operations. On a dual 4-core machine, I can saturate the memory bus without half trying with code that would otherwise be considerably more efficient, if it could actually get to the memory when it needs to.

    Another thing... when you're coding with C, for instance, the various FP ops can just as easily be buried in a library, then who cares why or how they get done anyway, as long as they are? With lots-o-RAM, you can write whatever you need to and it'd be the same code you'd write for another platform. Just mostly faster, because for many things, FP just isn't required, or critical. Fixed point isn't very hard to build either and can cover a wide range of needs (and then there's BCD code... better than FP for accounting, for instance.)
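
    As a rough illustration of how little fixed point needs, here is a sketch assuming a Q16.16 layout (16 integer bits, 16 fraction bits). The format and the helper names are assumptions chosen for the example, not something the comment specifies. Only integer add, multiply and shift are used once values are converted.

```python
FRAC_BITS = 16            # assumed Q16.16 format
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0

def to_fixed(x):
    """Convert a float to fixed point (conversion only; not needed at run time)."""
    return int(round(x * ONE))

def from_fixed(f):
    """Convert fixed point back to a float for display."""
    return f / ONE

def fx_add(a, b):
    """Addition is plain integer addition."""
    return a + b

def fx_mul(a, b):
    """The product carries 2 * FRAC_BITS fraction bits, so shift back down."""
    return (a * b) >> FRAC_BITS

a = to_fixed(1.5)
b = to_fixed(2.25)
print(from_fixed(fx_add(a, b)), from_fixed(fx_mul(a, b)))  # 3.75 3.375
```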

    Signed, old assembly language programmer guy who actually admits he likes asm...

    --
    I've fallen off your lawn, and I can't get up.
    1. Re:Don't count this out yet by walshy007 · · Score: 4, Interesting

      Exactly. ARM2 didn't have FP, people still wrote some extremely good stuff for it.

      Nintendo DS doesn't have an fpu on either cpu.

    2. Re:Don't count this out yet by tibit · · Score: 4, Interesting

      Agreed. I'm working on a digital oscilloscope display system and that thing might be very useful in this application -- where you need lots of bandwidth, but also plenty of storage. Say, zooming, filtering, scaling of a one-second-long acquisition done at 2 GS/s, using a 12-bit digitizer. You tweak the knobs, it updates, all in real time. In the worst case, you need about 120 Gbytes/s memory bandwidth to make it real time on a 30FPS display. And that's assuming the filter coefficients don't take up any bandwidth, because if they do you've just upped the bandwidth to terabytes/s.
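
      The 120 Gbytes/s figure can be reproduced with a short calculation. This sketch assumes each 12-bit sample is stored in 16 bits and that the whole acquisition is read once per displayed frame; neither assumption is stated in the comment, but together they match its number. Tightly packed 12-bit samples are shown for comparison.

```python
SAMPLE_RATE = 2_000_000_000   # 2 GS/s, from the comment
ACQUISITION_SECONDS = 1       # one-second acquisition, from the comment
FRAMES_PER_SECOND = 30        # display refresh rate, from the comment
BYTES_PER_SAMPLE = 2          # assumption: 12-bit samples padded to 16 bits

samples = SAMPLE_RATE * ACQUISITION_SECONDS

# Assumption: every frame re-reads the full acquisition buffer once.
padded_gb_per_s = samples * BYTES_PER_SAMPLE * FRAMES_PER_SECOND / 1e9

# Same workload if the 12-bit samples were packed with no padding.
packed_gb_per_s = (samples * 12 // 8) * FRAMES_PER_SECOND / 1e9

print(padded_gb_per_s, packed_gb_per_s)  # 120.0 90.0
```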

      --
      A successful API design takes a mixture of software design and pedagogy.
    3. Re:Don't count this out yet by TeknoHog · · Score: 4, Funny

      Signed, old assembly language programmer guy

      I see what you did there.

      --
      Escher was the first MC and Giger invented the HR department.
    4. Re:Don't count this out yet by fyngyrz · · Score: 5, Funny

      That's a bit shifty, don't you think? I don't mean to negate your point, but too, it's beyond my power to complement you -- I'm somewhat over a barrel. Perhaps if you add one to your argument, we'd have something else. Logically speaking. HCF.

      --
      I've fallen off your lawn, and I can't get up.
  8. Re:Just a first step... by lkcl · · Score: 4, Informative

    the cache is there because the speed of DRAM, regardless of how fast you can communicate with it, still has latency issues on addressing.

    to do the "routing" to address a 4-bit bus, you need 1/2 the number of transistors than if you addressed a 2-bit bus. each time you add another bit to the address range, you have increased the latency of access.

    if you were to provide entirely random-access to an entire 32-bit range you would absolutely kill performance. so, what RAM IC designers do is they go "ok, you're not going to get 32-bit addressing, you're going to get 14-bit addressing, you're going to have to read an entire page of 1k or 2kbits, and you're going to have to have parallel ICs, the first IC does bits 0 to 1 of the data, the second IC does bits 2 and 3 etc."
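
    The parallel-IC layout described above can be sketched as bit slicing: each IC stores a 2-bit slice of every data word, and the word is rebuilt by reading all ICs at the same address. The 2-bit slice width comes from the comment; the count of eight ICs (giving a 16-bit word) and the function names are assumptions for illustration only.

```python
BITS_PER_IC = 2   # from the comment: first IC holds bits 0-1, second bits 2-3
NUM_ICS = 8       # assumption: 8 ICs x 2 bits = one 16-bit data word

def split_word(word):
    """Return the 2-bit slice each IC would store, lowest bits first."""
    mask = (1 << BITS_PER_IC) - 1
    return [(word >> (i * BITS_PER_IC)) & mask for i in range(NUM_ICS)]

def join_slices(slices):
    """Rebuild the data word from the slices read back from every IC."""
    word = 0
    for i, s in enumerate(slices):
        word |= s << (i * BITS_PER_IC)
    return word

slices = split_word(0xBEEF)
print(slices)                    # [3, 3, 2, 3, 2, 3, 3, 2]
print(hex(join_slices(slices)))  # 0xbeef
```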

    this relies on the design of the processor having a VM architecture - paging.

    but the same principle applies *inside* the processor: even just decoding the addressing, in the MMU, it's *still* too much latency involved.

    so this is why you end up with hierarchical caching - 1st level is tiny, 2nd level is huge.

    even with RISC designs you _still_ have to have 1st and 2nd level caches in order to remain competitive. if you've ever seen a picture of a RISC CPU, it's astounding: the actual CPU is like 1% of the total area; caches are huuuge by comparison, crossbar routing takes up 50% of the chip and the I/O pads, required to be massive in order to handle the current, can take up something like 5% of the chip (guessing here, it's been a while since i looked at an annotated example CPU).

  9. Re:Map Reduce? by lkcl · · Score: 5, Insightful

    Aspex Semiconductors took this a lot further. they did content-addressable-memory. ok, they did a hell of a lot more than that. they created a massively-parallel deep SIMD architecture with a 2-bit CPU (early versions were 1 bit), with each CPU having something like 256 bits of memory to play with. ok, early versions had 128-bits of "straight" RAM and 256 bits of content-addressable RAM. when i was working for them they were planning the VASP-G architecture which would have 65536 such 2-bit CPUs on a single die. it was the 10th largest CPU being designed, in the world, at the time.

    programming such CPUs was - is - a complete f*****g nightmare. you not only have the parallelism of the CPU to deal with but you have the I/O handling to deal with. do you try to fit the data 1-bit-wide per CPU and process it serially? or... do you try to fit the data across 32 CPUs and process it in parallel? (each CPU was connected to its 2 neighbours so you could do this sort of thing). or... do you do anything in between, because if you have only 1-bit-wide that means that the I/O is held up, but if you do 32-bits across 32 CPUs you process it so quick that you're now I/O bound.

    much of the work in fitting algorithms onto ASPs involved having to write bloody spreadsheets in Excel to analyse whether it was best to use 1, 2, 4 .... 32 CPUs just to process the bloody data! 6 weeks of analysis to write 30 lines of code for god's sake!

    it gets worse: you can't even go read a book on algorithms for hardware because that doesn't apply; you can't go read a book on algorithms for software because _that_ doesn't apply. working out how to fit AES onto the Aspex Semi CPU took me about... i think it was 6 weeks, to even _remotely_ make it optimal. i had to read up on the design of the 2-bit Galois Field theory behind the S-Boxes, because although you could do 8-bit S-Box substitution by running 256 "compare" instructions, one per substitution, in parallel across all 4096 CPUs, it turned out that if you actually implemented the *original* 2-bit Galois Field mathematical operations in each of the 2-bit CPUs you could get it down to 40 instructions, not 256.
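
    To illustrate the "compute it instead of looking it up" idea, here is a sketch of the AES S-box derived from its standard definition: a multiplicative inverse in GF(2^8) followed by a fixed affine transform. This is not the Aspex implementation and not the small-field decomposition described above; it uses ordinary byte-wide GF(2^8) arithmetic, and the function names are illustrative.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B    # reduce by the AES polynomial
        b >>= 1
    return result

def gf_inv(a):
    """Multiplicative inverse as a^254; AES maps 0 to 0 by convention."""
    result, power, exp = 1, a, 254
    while exp:
        if exp & 1:
            result = gf_mul(result, power)
        power = gf_mul(power, power)
        exp >>= 1
    return result if a else 0

def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF

def sbox(x):
    """AES S-box: field inverse, then the affine transform with constant 0x63."""
    b = gf_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63

print(hex(sbox(0x00)), hex(sbox(0x01)), hex(sbox(0x53)))  # 0x63 0x7c 0xed
```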

    and that was just for _one_ part of the Rijndael algorithm: i had to do a comprehensive detailed analysis of _every_ aspect of the algorithm.

    in other words, everything that you _think_ you know about optimising software and algorithm design for either hardware or for software is completely and utterly wrong, for these types of massively-parallel map-reduce and content-addressable-memory CPUs.

    that leaves them somewhere in the very very specialist dept, and even there, they have problems, because it takes so long to verify and design a new CPU. when the Aspex VASP-F architecture was being planned, it was AMAZING! wow! 100x faster than the best Pentium-III processor! of course, within 18 months it was only 20x better than the top-of-the-line Pentium that was available, and by the time it _actually_ came out, it was only 5x better than a bunch of x86 CPUs, which are a hell of a lot easier to program.

    it was the same story for the next version of the CPU, even though that promised to have 64k processing elements...

  10. Re:Map Reduce? by Pieroxy · · Score: 4, Funny

    Why in the world are people always saying the word Map Reduce nowadays, I hear it every week at least. Like it would be the solution to world war 3.

    Since WWIII hasn't happened yet, you cannot rule out the fact that it *might* be the solution.