Slashdot Mirror


Startup Combines CPU and DRAM

MojoKid writes "CPU design firm Venray Technology announced a new product design this week that it claims can deliver enormous performance benefits by combining CPU and DRAM onto a single piece of silicon. Venray's TOMI (Thread Optimized Multiprocessor) attempts to redefine the problem by building a very different type of microprocessor. The TOMI Borealis is built using the same transistor structures as conventional DRAM; the chip trades clock speed and performance for ultra-low leakage. Its design is, by necessity, extremely simple. Not counting the cache, TOMI is a 22,000 transistor design. Instead of surrounding a CPU core with L2 and L3 cache, Venray inserted a CPU core directly into a DRAM design. A TOMI Borealis IC connects eight TOMI cores to a 1Gbit DRAM, with a total of 16 ICs per 2GB DIMM. This works out to a total of 128 processor cores per DIMM. That said, when your CPU has fewer transistors than an architecture that debuted in 1986, there is a good chance that you left a few things out--like an FPU, branch prediction, pipelining, or any form of speculative execution. Venray may have created a chip with power consumption an order of magnitude lower than anything ARM builds and more memory bandwidth than Intel's highest-end Xeons, but it's an ultra-specialized, ultra-lightweight core that trades 25 years of flexibility and performance for scads of memory bandwidth."
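
The core count in the summary follows from simple multiplication. A quick check, using only the figures quoted above (the function and parameter names are made up for illustration):

```python
def dimm_totals(cores_per_ic=8, dram_per_ic_gbit=1, ics_per_dimm=16):
    """Return (total cores, total capacity in gigabytes) for one DIMM."""
    cores = cores_per_ic * ics_per_dimm
    # 8 gigabits make one gigabyte.
    capacity_gbyte = dram_per_ic_gbit * ics_per_dimm / 8
    return cores, capacity_gbyte


cores, capacity = dimm_totals()
print(cores, "cores,", capacity, "GB per DIMM")
```

With the quoted figures this gives 128 cores and 2 GB, matching the summary.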

211 comments

  1. but... by Anonymous Coward · · Score: 5, Funny

    does it run GNU/Linux?

    1. Re:but... by robthebloke · · Score: 5, Funny

      Multiple, low power, semi useless processor cores? Sounds like Sony has just found the silicon to power the PlayStation 4! :p

    2. Re:but... by ByOhTek · · Score: 1

      Not yet, but I'm sure it'll run a herd of HURD!

      --
      Self proclaimed typo king, and inventor of the bear destroying coffee table (patent not pending).
    3. Re:but... by sourcerror · · Score: 1

      I can suggest an application off the top of my head: spatial configuration tables for experimental/reconfigurable robotic arms.

    4. Re:but... by Anonymous Coward · · Score: 1

      It's a *design*. So it runs a GNU/Linux OS on paper.

        (English majors: the company is now hiring kernel bug copy editors to ensure that the end-reader experience is flawless)

    5. Re:but... by Anonymous Coward · · Score: 0

      At $99 each, it will remain in experimental robotics for some time. However, once this hits mass production, it would drop in price and then turn up in regular robotics.

    6. Re:but... by Anonymous Coward · · Score: 0

      It's a *design*. So it runs a GNU/Linux OS on paper.

      Now that's an achievement. Imagine, you no longer have to buy an expensive CPU, just use a piece of paper to run your software on!

      BTW, does recycled paper work as well, or does it have to be new paper? And what about toilet paper?

    7. Re:but... by DaVince21 · · Score: 1

      Sure it does. Toilet paper storage can't be written to as easily, though, and breaks more quickly.

      --
      I am not devoid of humor.
  2. Fantastic by Anonymous Coward · · Score: 1, Funny

    I'd love to see a beowulf cluster of these things...

    Oh, wait..

  3. Either/or? by Gwala · · Score: 5, Insightful

    Does it have to be an either-or proposition?

    I could see this being useful as an accelerator - in the same way that GPUs can accelerate vector operations. E.g. memory that can calculate a hash table index by itself. Stuffed in as a component of a larger system it could be a really clever breakthrough for incremental performance improvements.
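
    A rough software picture of that idea, purely as a sketch: the class below stands in for a hypothetical memory device that hashes keys on its own side, so the host only ever sends a key and never computes a bucket index. The class name, the FNV-1a hash choice and the bucket count are all assumptions for illustration, not anything Venray has described.

```python
class HashingMemory:
    """Toy stand-in for memory that computes its own hash-table index."""

    def __init__(self, buckets=64):
        self._buckets = [[] for _ in range(buckets)]

    def _index(self, key: bytes) -> int:
        # 32-bit FNV-1a, chosen only because it is tiny; any hash would do.
        h = 2166136261
        for byte in key:
            h ^= byte
            h = (h * 16777619) & 0xFFFFFFFF
        return h % len(self._buckets)

    def put(self, key: bytes, value):
        bucket = self._buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))

    def get(self, key: bytes, default=None):
        for existing_key, value in self._buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default


mem = HashingMemory()
mem.put(b"cores", 128)
print(mem.get(b"cores"))
```

    The point of the shape is that `_index` runs "inside" the memory: in a real accelerator the host would issue put/get commands and the indexing work would never cross the memory bus.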

    --
    #!/bin/csh cat $0
    1. Re:Either/or? by Anonymous Coward · · Score: 3, Interesting

      Like Mitsubishi 3D RAM

      They put the logic ops and blend on the RAM

      The 3D-RAM is based on the Mitsubishi Cache DRAM (CDRAM) architecture, a high-speed DRAM design that integrates DRAM memory and a small SRAM cache on a single chip. The CDRAM was then optimized for 3-D graphics rendering and further enhanced by adding an on-chip arithmetic logic unit (ALU) and video buffer.

    2. Re:Either/or? by Anonymous Coward · · Score: 0

      Exactly, in a heterogeneous computing future, something like this may find a place.

  4. So, is it a CAM or a DRPU? by Anonymous Coward · · Score: 0

    And how do I add more RAM to my system?

    1. Re:So, is it a CAM or a DRPU? by Sulphur · · Score: 1

      And how do I add more RAM to my system?

      It's Computer Assisted Memory. Check the NUMA box. /Ducks and Covers

    2. Re:So, is it a CAM or a DRPU? by somersault · · Score: 1

      I think UM-CRAP, or RAD-MC-PU would be more catchy.

      --
      which is totally what she said
    3. Re:So, is it a CAM or a DRPU? by somersault · · Score: 3, Funny

      Missed a D - better make that DUM-CRAP*.

      I wonder, how much DUM-CRAP could we fit into a single PC?

      * this name is by no means a reflection on what I think of the tech - it sounds like a pretty cool idea.

      --
      which is totally what she said
    4. Re:So, is it a CAM or a DRPU? by Anonymous Coward · · Score: 2, Insightful

      The idea is not new, and lots of products with a CPU on the RAM die exist. Sun had this on graphics cards, for example.
      The missing FP is not a big deal, since FP can be calculated with ints if needed - but it should get an FPU in the follow-up products to stop the rants.
      The dirty secret of the computer industry is that the CPU has to wait "lots" of cycles to go to memory and back, since CPUs are clocked much higher than RAM, plus there are other chips in between...
      RAM can be added in system designs here - but you simply get more CPUs as well ;-)
      I guess market acceptance is always a matter of integration effort, so it should run Linux and have broad I/O chipsets available.
      Well done - hope it succeeds

    5. Re:So, is it a CAM or a DRPU? by hitmark · · Score: 1

      Yeah, my understanding is that in a modern CPU a cache miss is more costly than poorly optimized code.

      --
      comment first, facts later. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
    6. Re:So, is it a CAM or a DRPU? by hechacker1 · · Score: 1

      This is true, but modern branch predictors are pretty good. Sandy Bridge reportedly is correct more than 80% of the time. So it really depends if having fast on chip DRAM (but a smaller amount) is more valuable than having L1, L2, L3, and RAM caches at differing speeds with their own predictors.

      I'm guessing it depends on the application.

  5. performance vs. memory bandwidth by Anonymous Coward · · Score: 2, Insightful

    > "that trades 25 years of flexibility and performance for scads of memory bandwidth"

    Right... because memory bandwidth isn't one of the greatest bottlenecks in current designs...

    1. Re:performance vs. memory bandwidth by hattig · · Score: 3, Interesting

      And how much performance per clock are you going to get out of a 22,000 transistor chip, with what looks like 3 registers (and 3 shadow registers)?

      One of the issues they had to deal with was that DRAM is usually made on a 3 metal layer process, whereas CPUs usually take a lot more layers due to their complexity.

      This will have to compete with TSV connected DRAM, which will be a major bandwidth and power aid to conventional SoCs.

    2. Re:performance vs. memory bandwidth by ByOhTek · · Score: 1

      for a limited subset of tasks, very high performance.

      If the chip can achieve either (a) higher clock speed or (b) fewer cycles for the same op, or even both - then there can easily be some operations that are faster. For tasks focused on those operations, the chip will be faster. The memory improvements won't hurt things either.

      CPUs and GPUs are rarely the same speed and transistor count, but we use both. GPUs excel for floating point and rapid fire streams of the same op against an array of data. CPUs are better at integer processing, and rapidly changing operations on individual values.

      The idea of something like this isn't to replace the CPU I suspect, except in extremely low power/size situations, but rather, to add another unit to offload difficult tasks. Also, it could easily be a proof of concept, in which case, a more powerful built-in CPU could be in the pipeline.

      --
      Self proclaimed typo king, and inventor of the bear destroying coffee table (patent not pending).
    3. Re:performance vs. memory bandwidth by tibit · · Score: 3, Insightful

      Perfect for networking -- switching, routing, ... Think of content addressable memory, etc.

      --
      A successful API design takes a mixture of software design and pedagogy.
    4. Re:performance vs. memory bandwidth by K. S. Kyosuke · · Score: 3, Interesting

      And how much performance per clock are you going to get out of a 22,000 transistor chip, with what looks like 3 registers (and 3 shadow registers)?

      Quite a lot, I would guess. A stack-based design would give you 1 instruction per cycle with a compact opcode format capable of packing multiple instructions into a single machine word, which means a single instruction fetch for multiple actual instructions executed. Oh, and make it word addressed, that simplifies things a bit as well. In the end, you'll have a core that does perhaps 50%-100% as many clock cycles per second on a given manufacturing technology level (say, 60 nm), with just a single thread of execution, but with a negligible transistor budget and power consumption. The resulting effective computational performance per energy consumed will be at least one OOM better than the current offerings by Intel and AMD, although you first have to learn how to program it.
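
      To make the packed-opcode point concrete, here is a minimal sketch of a stack machine that fetches one word and executes several instructions from it. Everything specific here is an assumption for illustration: the 5-bit opcode width, the six slots per 32-bit word and the five opcodes are invented, and nothing says TOMI works this way.

```python
OP_BITS = 5   # hypothetical opcode width
SLOTS = 6     # 6 * 5 = 30 bits, which fits in a 32-bit word
OP_MASK = (1 << OP_BITS) - 1

NOP, ONE, DUP, ADD, MUL = range(5)  # invented opcodes


def pack(ops):
    """Pack up to SLOTS opcodes into one machine word, first op in the low bits."""
    if len(ops) > SLOTS:
        raise ValueError("too many opcodes for one word")
    word = 0
    for slot, op in enumerate(ops):
        word |= (op & OP_MASK) << (slot * OP_BITS)
    return word  # unused slots stay 0, which decodes as NOP


def unpack(word):
    return [(word >> (slot * OP_BITS)) & OP_MASK for slot in range(SLOTS)]


def run(words):
    """Execute packed words; return the final stack and the number of fetches."""
    stack, fetches = [], 0
    for word in words:
        fetches += 1  # one memory fetch feeds up to SLOTS instructions
        for op in unpack(word):
            if op == ONE:
                stack.append(1)
            elif op == DUP:
                stack.append(stack[-1])
            elif op == ADD:
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == MUL:
                b, a = stack.pop(), stack.pop()
                stack.append(a * b)
            # NOP: nothing to do
    return stack, fetches


program = [pack([ONE, DUP, ADD, DUP, MUL])]  # computes (1 + 1) * (1 + 1)
print(run(program))
```

      Five instructions run from a single fetch here, which is the bandwidth saving the comment is describing; stack operands need no register fields, which is what keeps the opcodes this small.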

      --
      Ezekiel 23:20
    5. Re:performance vs. memory bandwidth by Anonymous Coward · · Score: 0

      You have here a uproc with a pretty unheard of bandwidth to RAM and you complain about only three registers? If you want 256, 1K, 256 million registers, feel free with this design. It's likely very little penalty to use RAM like registers if there are any sort of useful addressing modes to the opcodes.

    6. Re:performance vs. memory bandwidth by Anonymous Coward · · Score: 0

      And how much performance per clock are you going to get out of a 22,000 transistor chip, with what looks like 3 registers (and 3 shadow registers)?

      Quite a lot, I would guess. A stack-based design would give you 1 instruction per cycle with a compact opcode format capable of packing multiple instructions into a single machine word, which means a single instruction fetch for multiple actual instructions executed. Oh, and make it word addressed, that simplifies things a bit as well. In the end, you'll have a core that does perhaps 50%-100% as many clock cycles per second on a given manufacturing technology level (say, 60 nm), with just a single thread of execution, but with a negligible transistor budget and power consumption. The resulting effective computational performance per energy consumed will be at least one OOM better than the current offerings by Intel and AMD, although you first have to learn how to program it.

      Wow, that's a whole lotta handwaving there.

      If they don't have much cache (and at 22K transistors, they cannot), having lots of cycles per second is meaningless, even with the ultra wide connection to DRAM.

      "50%-100%" clock frequency is nonsense. You can always get frequency up, at the cost of adding more pipeline stages to decrease the delay between stages. Since they're manufacturing on a DRAM process, they will have to accept very substandard (by microprocessor standards) basic logic gate and routing performance, no matter what node they're on. This means very short pipeline stages if they want to hit even 50% of the clock frequency of a fast x86 CPU.

      Short stages have undesirable performance and power consequences, but those aren't even worth getting into because, once again, 22K transistors. That number rules out the possibility of crazy deep pipelines to get frequency. (More stages equals lots more transistors spent on flipflops, and 22K is already a small number just for implementing a basic 32-bit CPU with very simple ALU features and a minimal cache.)

      Finally, it's easy to make something which has "one OOM better" perf/W than a general purpose microprocessor like Intel's and AMD's CPUs. Trivial, even. There are plenty of existing, shipping, non pie in the sky products which do that well, or better. But guess what? They aren't nearly as general purpose as those x86 CPUs. This thing won't be either. Which is why this is not a serious threat to Intel or AMD, or even ARM.

    7. Re:performance vs. memory bandwidth by Anonymous Coward · · Score: 0

      So this is how Google do it?
      And before them ...

  6. Map Reduce? by complete loony · · Score: 3, Interesting

    So you could implement some simple map reduce operations and run them directly in RAM?

    --
    09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
    1. Re:Map Reduce? by Anonymous Coward · · Score: 1

      Why in the world are people always saying the word Map Reduce nowadays? I hear it at least every week. Like it would be the solution to World War 3.

    2. Re:Map Reduce? by lkcl · · Score: 5, Insightful

      Aspex Semiconductors took this a lot further. they did content-addressable-memory. ok, they did a hell of a lot more than that. they created a massively-parallel deep SIMD architecture with a 2-bit CPU (early versions were 1 bit), with each CPU having something like 256 bits of memory to play with. ok, early versions had 128-bits of "straight" RAM and 256 bits of content-addressable RAM. when i was working for them they were planning the VASP-G architecture which would have 65536 such 2-bit CPUs on a single die. it was the 10th largest CPU being designed, in the world, at the time.

      programming such CPUs was - is - a complete f*****g nightmare. you not only have the parallelism of the CPU to deal with but you have the I/O handling to deal with. do you try to fit the data 1-bit-wide per CPU and process it serially? or... do you try to fit the data across 32 CPUs and process it in parallel? (each CPU was connected to its 2 neighbours so you could do this sort of thing). or... do you do anything in between, because if you have only 1-bit-wide that means that the I/O is held up, but if you do 32-bits across 32 CPUs you process it so quick that you're now I/O bound.

      much of the work in fitting algorithms onto ASPs involved having to write bloody spreadsheets in Excel to analyse whether it was best to use 1, 2, 4 .... 32 CPUs just to process the bloody data! 6 weeks of analysis to write 30 lines of code for god's sake!

      it gets worse: you can't even go read a book on algorithms for hardware because that doesn't apply; you can't go read a book on algorithms for software because _that_ doesn't apply. working out how to fit AES onto the Aspex Semi CPU took me about... i think it was 6 weeks, to even _remotely_ make it optimal. i had to read up on the design of the 2-bit Galois Field theory behind the S-Boxes, because although you could do 8-bit S-Box substitution by running 256 "compare" instructions, one per substitution, in parallel across all 4096 CPUs, it turned out that if you actually implemented the *original* 2-bit Galois Field mathematical operations in each of the 2-bit CPUs you could get it down to 40 instructions, not 256.

      and that was just for _one_ part of the Rijndael algorithm: i had to do a comprehensive detailed analysis of _every_ aspect of the algorithm.
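
      For readers wondering why the S-box can be computed rather than looked up: it is defined algebraically, as a multiplicative inverse in GF(2^8) followed by an affine transform. The sketch below builds the table from that definition. It is not the 2-bit composite-field decomposition described above (that construction is not given in this thread); it only shows the underlying structure that makes such decompositions possible.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B  # reduce by the AES field polynomial
    return result


def gf_inv(a):
    """Multiplicative inverse via a^254; AES defines the inverse of 0 as 0."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x, k):
    return ((x << k) | (x >> (8 - k))) & 0xFF


def sbox_entry(x):
    """AES S-box: field inverse, then the fixed affine transform."""
    inv = gf_inv(x)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]
print(hex(SBOX[0x53]))
```

      The table version costs one lookup per byte on a conventional CPU; on a machine with no cheap lookup, evaluating the field arithmetic directly can be the better trade, which is the kind of inversion of intuition the comment is complaining about.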

      in other words, everything that you _think_ you know about optimising software and algorithm design for either hardware or for software is completely and utterly wrong, for these types of massively-parallel map-reduce and content-addressable-memory CPUs.

      that leaves them somewhere in the very very specialist dept, and even there, they have problems, because it takes so long to verify and design a new CPU. when the Aspex VASP-F architecture was being planned, it was AMAZING! wow! 100x faster than the best Pentium-III processor! of course, within 18 months it was only 20x better than the top-of-the-line Pentium that was available, and by the time it _actually_ came out, it was only 5x better than a bunch of x86 CPUs, which are a hell of a lot easier to program.

      it was the same story for the next version of the CPU, even though that promised to have 64k processing elements...

    3. Re:Map Reduce? by Pieroxy · · Score: 4, Funny

      Why in the world are people always saying the word Map Reduce nowadays? I hear it at least every week. Like it would be the solution to World War 3.

      Since WWIII hasn't happened yet, you cannot rule out the fact that it *might* be the solution.

    4. Re:Map Reduce? by Andy_R · · Score: 2

      I know this is probably going to sound flippant, but I'm sure there is a genuine reason and I'd be interested to hear it... Why not just write it both ways and test?

      Better yet, why not get the compiler to try different parallelisations and use a genetic algorithm to optimise automatically?

      --
      A pizza of radius z and thickness a has a volume of pi z z a
    5. Re:Map Reduce? by postbigbang · · Score: 1

      Meh.

      From a theorist's standpoint, it's classical. You get a classical von Neumann state machine. There's the problem of heat and die size, and buses are absolutely custom if you use them, although someone will put together a nice chipset to deal with the timing.

      Multiple cores still have the same problem in terms of cache state, fetch state, and synch, so no real benefit there. Add in memory protection and this has become more wicked still. Fast, but wicked difficult from an OS maker's standpoint. Not that it's easy now.

      --
      ---- Teach Peace. It's Cheaper Than War.
    6. Re:Map Reduce? by Anonymous Coward · · Score: 0

      Why in the world are people always saying the word Map Reduce nowadays? I hear it at least every week. Like it would be the solution to World War 3.

      Since WWIII hasn't happened yet, you cannot rule out the fact that it *might* be the solution.

      Why thank you, sweety!

    7. Re:Map Reduce? by PiMuNu · · Score: 2

      Why in the world are people always saying the word Map Reduce nowadays?

      Distributed computing...

    8. Re:Map Reduce? by K. S. Kyosuke · · Score: 0

      Why in the world are people always saying the word Map Reduce nowadays? I hear it at least every week. Like it would be the solution to World War 3.

      Since WWIII hasn't happened yet, you cannot rule out the fact that it *might* be the solution.

      I thought that reducing the map (of world) was supposed to be the outcome of WWIII, not a solution to it.

      --
      Ezekiel 23:20
    9. Re:Map Reduce? by DrSkwid · · Score: 1

      You want "Content Addressable Parallel Processors".

      http://en.wikipedia.org/wiki/Content_Addressable_Parallel_Processor

      STARAN was one such beast. It had a PDP-11 as the control unit and ran queries in parallel, for instance for air traffic control. Load the memory up with your planes (that's what the PDP is for), then perform operations in parallel on all memory units at once, and query for anything you need to know.

      Very interesting devices.
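
      A toy model of the query style, as a sketch only: every stored word is compared against a masked key, and each cell reports whether it matched. Real hardware does the compare in all cells at once; the loop here just simulates that. The record layout (flight level in the high byte, plane id in the low byte) is invented for the example and has nothing to do with STARAN's actual data format.

```python
class AssociativeMemory:
    """Simulated content-addressable memory with masked matching."""

    def __init__(self, words):
        self.words = list(words)

    def match(self, key, mask):
        """Return one response bit per cell: True where the masked bits equal the key's."""
        return [(word & mask) == (key & mask) for word in self.words]


def record(flight_level, plane_id):
    # Hypothetical 16-bit layout: high byte = flight level, low byte = plane id.
    return (flight_level << 8) | plane_id


planes = AssociativeMemory([record(35, 1), record(28, 2), record(35, 3)])

# "Which planes are at flight level 35?" Only the high byte takes part in the compare.
responders = planes.match(key=record(35, 0), mask=0xFF00)
print(responders)
```

      Note that the query never names an address: the data is found by what it contains, and the cost does not grow with the number of cells on hardware that compares them all simultaneously.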

      See Also :
      Foster, Caxton C (1976), Content Addressable Parallel Processors, Van Nostrand Reinhold

      I can confirm it is a good read.

      --
      There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
    10. Re:Map Reduce? by Anonymous Coward · · Score: 1

      There's a compiler?

    11. Re:Map Reduce? by Anonymous Coward · · Score: 0

      Why in the world are people always saying the word Map Reduce nowadays? I hear it at least every week. Like it would be the solution to World War 3.

      No, "map reduce" will be the result of World War III.

    12. Re:Map Reduce? by Anonymous Coward · · Score: 0

      It might even be THE FINAL SOLUTION....(Hitler reference lol ) to avoid anonymous coward username shallowthought i welcome the hate

    13. Re:Map Reduce? by lkcl · · Score: 1

      I know this is probably going to sound flippant, but I'm sure there is a genuine reason and I'd be interested to hear it... Why not just write it both ways and test?

      Better yet, why not get the compiler to try different parallelisations and use a genetic algorithm to optimise automatically?

      *sigh*. because it took literally days to write the assembly code and get the I/O routines right. the code rate was measured in DAYS per instruction, not the other way round. you couldn't reuse the functions because they were all completely different.

      it also didn't help that the compiler was actually a pre-processor, which took out "ASP" instructions that were inserted with asp { .... } and replaced them with c code where the assembly instructions were encoded as 32-bit memory-writes to the FIFO's address! the reason for that was that the CPUs were actually on a memory-mapped PCI card.

      plus, there were several hardware engineers in the company: they didn't really have time or resources for doing things like optimise the software, despite that being far more important than the actual hardware!

    14. Re:Map Reduce? by Anonymous Coward · · Score: 0

      WWIII will reduce the map. Several hundred gigatonnes of nuclear explosion do that sort of thing.

  7. Processing In Memory by Anonymous Coward · · Score: 5, Interesting

    This isn't new. The MIT Terasys platform did the same in 1995, and many have since. Nobody has yet come up with a viable programming model for such processors.

    I'm expecting AMD's Fusion platform to move in the same direction (interleaved memory and shader banks), and they already have a usable MIMD model (basically OpenCL).

    1. Re:Processing In Memory by Omnifarious · · Score: 2

      *nod* I'm not surprised it isn't new.

      I've noticed that locality has become more and more important as speeds have gone up. I kind of wonder if something like this isn't the future.

      I'm noticing, for example, that programming models involving channels and lots of threads have shown up and seem like a viable model for something like this. Erlang and Go are the two languages that do this that I can think of right offhand.

    2. Re:Processing In Memory by Anonymous Coward · · Score: 0

      Uuum, look up Haskell thread sparks.
      Combine that with vector processing.

      Looks very viable to me.

    3. Re:Processing In Memory by master_p · · Score: 2

      Nobody has yet come up with a viable programming model for such processors.

      The Actor model fits well to this type of CPU. Each CPU could be considered an actor.

      A Jump instruction to a memory bank of another CPU could be translated as an Actor call to be executed in parallel with the caller.

      A call instruction to a memory bank of another CPU could be translated as a parallel call and wait instruction.

      A load/store instruction could be translated as a queued request for retrieving/updating data.

      This provides a very natural multitasking solution, that provides good expandability both in memory and processing power by adding more memory/CPU chips.

      An object-oriented programming language could spread out parallel objects into as many CPUs as possible.
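
      A minimal sketch of the load/store part of that mapping, with a thread standing in for one in-memory core. Each actor owns its bank of memory outright, and other code can only reach it through queued messages. The class name, message format and bank size are assumptions for illustration, not a proposal for how such a chip would really be programmed.

```python
import queue
import threading


class BankActor(threading.Thread):
    """One simulated core that exclusively owns a bank of memory."""

    def __init__(self, size):
        super().__init__(daemon=True)
        self.memory = [0] * size
        self.mailbox = queue.Queue()

    def run(self):
        while True:
            message = self.mailbox.get()
            if message is None:  # shutdown request
                break
            kind, address, value, reply = message
            if kind == "store":
                self.memory[address] = value
                reply.put(None)
            elif kind == "load":
                reply.put(self.memory[address])

    def store(self, address, value):
        """Queue a store; the returned queue receives an acknowledgement."""
        reply = queue.Queue(maxsize=1)
        self.mailbox.put(("store", address, value, reply))
        return reply

    def load(self, address):
        """Queue a load; the returned queue receives the value."""
        reply = queue.Queue(maxsize=1)
        self.mailbox.put(("load", address, None, reply))
        return reply


bank = BankActor(size=16)
bank.start()
bank.store(3, 42).get(timeout=1)       # wait for the store to be applied
print(bank.load(3).get(timeout=1))     # the caller blocks only when it needs the value
bank.mailbox.put(None)
bank.join(timeout=1)
```

      Because requests are serviced one at a time from the mailbox, the bank needs no locks; the caller decides whether to wait on the reply immediately (a "call") or carry on and collect it later (a "jump" that runs in parallel).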

    4. Re:Processing In Memory by Attila the Bun · · Score: 1

      This isn't new. The MIT Terasys platform did the same in 1995, and many have since. Nobody has yet come up with a viable programming model for such processors.

      Indeed, but PC architecture is going in this direction. The powerful and flexible main CPU will remain, but there are more and more devices with their own specialised processors and memory. First graphics cards, then HDDs and other devices followed suit, and now we think nothing of putting microcontrollers in mice, keyboards, even speakers. Perhaps in the future I/O could be handled entirely by the in-memory processors. The more work the CPU can outsource to specialised processors, the faster it's going to get done.

    5. Re:Processing In Memory by hackertourist · · Score: 1

      What about the programming model that was used for every processor that had a 1:1 clock relationship with its memory, i.e. everything before the 80386?

    6. Re:Processing In Memory by Omnifarious · · Score: 1

      That's not the issue. It's the massive parallelism that's the issue. And most models for getting a grip on that tacitly assume symmetric access to all memory by all CPUs. It's just now that C++ is getting the atomic operations that have as an implicit assumption that perhaps some memory is seen differently by one thread vs. another.

    7. Re:Processing In Memory by Anonymous Coward · · Score: 0

      The DAP http://en.wikipedia.org/wiki/Distributed_Array_Processor was a similar idea... SIMD programming model. Worked really well for certain kinds of applications.

    8. Re:Processing In Memory by Anonymous Coward · · Score: 0

      ObWikipediaLink: Processor-in-memory.

    9. Re:Processing In Memory by Anne Thwacks · · Score: 3, Interesting

      C++ is your problem. Algol68 dealt with these issues over 40 years ago. There were two problems with Algol68:

      It was not American (NIH)

      The best training manual for it was called "Algol68 with fewer tears"

      Other than that, it was able to handle parallelism, and most everything else, in a relatively painless manner.

      For those who actually LIKE pain, there is always Occam.

      --
      Sent from my ASR33 using ASCII
    10. Re:Processing In Memory by Entrope · · Score: 1

      The programming challenge with these architectures is not how to write applications for them. It's how to write efficient, correct applications reasonably quickly. In practice, the processors quickly become special-purpose rather than general-purpose as a result of their programming frameworks focusing on particular problems that the architecture is good at. (Not to mention Amdahl's law kicks in pretty quickly.)
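
      For anyone who wants the Amdahl's law remark in numbers, here is the standard formula as a small sketch. The 90% parallel fraction is an arbitrary example value, and 128 is just the per-DIMM core count from the summary.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: overall speedup when only part of the work can be parallelised."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


for cores in (1, 8, 128, 1_000_000):
    print(cores, round(amdahl_speedup(0.90, cores), 2))
```

      With 10% of the work serial, the speedup can never exceed 10x no matter how many cores are added, and 128 cores already get most of the way there.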

    11. Re:Processing In Memory by RoccamOccam · · Score: 1

      For those who actually LIKE pain, there is always Occam.

      Obviously, I'm going to have to exception here. In my opinion, Occam handled parallelism beautifully. A practical implementation of CSP.

    12. Re:Processing In Memory by RoccamOccam · · Score: 1

      Ack. That should be "take exception here".

    13. Re:Processing In Memory by GrumpySteen · · Score: 1

      Considering the topic, it might be more suitable to say you meant "throw an exception here"

    14. Re:Processing In Memory by Lawrence_Bird · · Score: 1

      I get a little giddy whenever someone brings back memories of Algol. But why stop reinventing the wheel every five years with a new greatest bestest programming language... just think of all the lost revenue to publishers and software companies, among others.

    15. Re:Processing In Memory by rrohbeck · · Score: 1

      Locality is nice if you stay within one memory chip. As soon as you go off-chip, you lose all the performance. And the premise is always that you gain high overall performance by ganging together several or many of these.
      Even simple, logical things like a memcpy or bitblt engine inside the RAM fall apart when you consider multiple modules with multiple chips each.

    16. Re:Processing In Memory by rrohbeck · · Score: 1

      We had a very nice Algol68 text in university from the UK Royal Radar Establishment. It was very gentle, just like K&R later. I can't remember the title - it's in a box in a closet somewhere and I'm not digging it up.
      Algol68 on ICL190x 24-bit machines, yay! I helped upgrade it from 8M words core to (gasp) all solid state memory.

    17. Re:Processing In Memory by Omnifarious · · Score: 1

      Yes, I agree. You have to take a very different approach to the design of software and algorithms to minimize communication with 'far away' entities. The complex cache system in most CPUs is to this as CISC and modern RISC is to WISC architectures. It's basically an attempt to manage with complex specialized hardware something that you might be able to better manage in software.

      Whether or not you actually can of course is the question.

    18. Re:Processing In Memory by master_p · · Score: 1

      The Actor model helps write correct multi-threaded applications reasonably quickly.

      From experience, I'd say that this model allows not only for 'reasonably quickly', but it is on par with single-threaded programming.

  8. Just a first step... by bradley13 · · Score: 3, Interesting

    Really, this was inevitable, and this first implementation is just a first step. Future versions will undoubtedly include more functionality.

    Current processors are ridiculously complicated. If you can knock out the entire cache with all of its logic, give the processor direct access to memory, and stick to a RISC design, you can get a very nice processor in under a million transistors.

    --
    Enjoy life! This is not a dress rehearsal.
    1. Re:Just a first step... by Anonymous Coward · · Score: 0

      Or if we could allocate a part of L1 cache in x86 for processes for direct memory access, that alone would result in a considerable speedup.

    2. Re:Just a first step... by lkcl · · Score: 4, Informative

      the cache is there because the speed of DRAM, regardless of how fast you can communicate with it, still has latency issues on addressing.

      to do the "routing" to address a 4-bit bus, you need 1/2 the number of transistors than if you addressed a 2-bit bus. each time you add another bit to the address range, you have increased the latency of access.

      if you were to provide entirely random-access to an entire 32-bit range you would absolutely kill performance. so, what RAM IC designers do is they go "ok, you're not going to get 32-bit addressing, you're going to get 14-bit addressing, you're going to have to read an entire page of 1k or 2kbits, and you're going to have to have parallel ICs, the first IC does bits 0 to 1 of the data, the second IC does bits 2 and 3 etc."

      this relies on the design of the processor having a VM architecture - paging.

      but the same principle applies *inside* the processor: even just decoding the addressing, in the MMU, it's *still* too much latency involved.

      so this is why you end up with hierarchical caching - 1st level is tiny, 2nd level is huge.

      even with RISC designs you _still_ have to have 1st and 2nd level caches in order to remain competitive. if you've ever seen a picture of a RISC CPU, it's astounding: the actual CPU is like 1% of the total area; caches are huuuge by comparison, crossbar routing takes up 50% of the chip and the I/O pads, required to be massive in order to handle the current, can take up something like 5% of the chip (guessing here, it's been a while since i looked at an annotated example CPU).

    3. Re:Just a first step... by K. S. Kyosuke · · Score: 1

      even with RISC designs you _still_ have to have 1st and 2nd level caches in order to remain competitive. if you've ever seen a picture of a RISC CPU, it's astounding: the actual CPU is like 1% of the total area; caches are huuuge by comparison,

      Don't do caches, do scratchpad memories and minimal instruction formats that require minimum bandwidth per opcode performed. And write a reasonable compiler. I've already seen books on it (static/automatic allocation of storage for scratchpad-equipped CPUs), I think CRC published a chapter on it in one of their recent compiler construction handbooks.

      --
      Ezekiel 23:20
    4. Re:Just a first step... by tibit · · Score: 1

      A lot of digital signal processing, that those chips would seem useful for, requires sequential access at very high bandwidths. When used that way, modern DRAM has no latency to speak of.

      --
      A successful API design takes a mixture of software design and pedagogy.
    5. Re:Just a first step... by dpilot · · Score: 1

      First step??

      Been there, done that, decades ago. Agreed it was in a much cruder, simpler technology, and sometimes size does matter. But US Patents 5278800, 5508968, 5519664, 5555528. There's more, but not relevant to the current topic.

      --
      The living have better things to do than to continue hating the dead.
    6. Re:Just a first step... by Forever Wondering · · Score: 1

      the cache is there because the speed of DRAM, regardless of how fast you can communicate with it, still has latency issues on addressing.

      Combining this with memristor memory would solve this, which is what HP is doing. From their roadmap, they're going to roll out memristor replacement for flash this year. Next round is to replace DRAM. Then SoC combined CPU and memristor. Memristor memory is as fast as cache.

      This architecture has promise as a replacement for FPGA/ASIC designs for realtime video encoding.

      --
      Like a good neighbor, fsck is there ...
    7. Re:Just a first step... by lkcl · · Score: 1

      yes. that's what i was referring to about the paging. but this is for a massively-simplified CPU, with "no cacheing". so you'd have to feed the DRAM very very precisely. can you imagine, however, explaining to application writers that they have to create the output / input at *exactly* the rate at which the DDR RAM accepts it? .... :)

    8. Re:Just a first step... by tibit · · Score: 1

      You don't do it for general applications, you do it for very specific niche things where if you want decently performing code you better be on top of the architecture you code for. I have a scope display application that is exactly like you say: it's timed to utilize every clock cycle for DDR3 access. The DRAM runs at 100% theoretically possible utilization, and the code is written around it.

      --
      A successful API design takes a mixture of software design and pedagogy.
  9. Why not a hexagonal design? by G3ckoG33k · · Score: 4, Interesting

    Speaking of unconventional design, why don't we see hexagonal or triangular CPU-designs? All I have seen are the Manhattan-like designs. Are these really the best? Embedding the CPU inside a hexagonal/triangular DRAM design should be possible too. What would be the trade-offs?

    1. Re:Why not a hexagonal design? by Anonymous Coward · · Score: 1

      Problematic edges.

    2. Re:Why not a hexagonal design? by PSVMOrnot · · Score: 2

      It probably boils down to ease and efficiency of manufacture. Certainly for the core of the CPU I would imagine it's because squares tessellate nicely on the silicon wafer.

    3. Re:Why not a hexagonal design? by ByOhTek · · Score: 1

      hexagons would probably tessellate even better, with less waste.

      Ease of manufacture is still the case though. Cutting them out would be a bitch though.

      --
      Self proclaimed typo king, and inventor of the bear destroying coffee table (patent not pending).
    4. Re:Why not a hexagonal design? by mdenham · · Score: 2

      Ease, yes. Efficiency, especially the amount of the wafer that's wasted due to it being circular initially, not so much.

      A triangular layout probably would be ideal - you could (or at least, should be able to) adapt existing equipment (instead of two cuts at right angles, you make three cuts at 60-degree angles - hexagonal layouts would require either piecing together triangles produced this way, casting much smaller ingots such that it's one chip per wafer, or stamping out the hexagons), and you reduce waste somewhat (depending on the size of the chip and the size of the initial wafer).

      However, the problem with triangular layouts (and hexagonal, for that matter) is that it involves cutting along planes the original ingot (which is a single huge silicon crystal) won't naturally fracture in.

      So... unfortunately, we're stuck with square chips (if we start building up 3D chips, we'll have the option of cubes and octahedra) because of nature.

    5. Re:Why not a hexagonal design? by dkf · · Score: 2

      hexagons would probably tessellate even better, with less waste.

      Ease of manufacture is still the case though. Cutting them out would be a bitch though.

      That's why triangles would be good; they can act as parts of hexagons, and yet you can cut them out with straight cuts. OTOH, you'll have to deal with acute angles in the result, which might have its own set of problems. Squares are likely a reasonable compromise, all things considered.

      --
      "Little does he know, but there is no 'I' in 'Idiot'!"
    6. Re:Why not a hexagonal design? by TechnoCore · · Score: 2, Interesting

      Guess it is because the silicon wafers that CPUs are made from must be cut along the atomic layers of silicon. Silicon in solid form at room temperature crystallizes into a diamond cubic crystal structure. It is very strong, but also very brittle. It is easy to cut along straight lines, following the faces of an octahedron. To cut at any other angle would probably be very difficult and risky. Maybe it would therefore be hard to cut a wafer into triangular shapes?

    7. Re:Why not a hexagonal design? by ByOhTek · · Score: 1

      acute angles are definitely bad in electronics like that, also, if you tried to use them as hexagons, you'd have to merge six, which would have a whole extra set of complexities, and room for error. Yes, square is going to be the best option.

      --
      Self proclaimed typo king, and inventor of the bear destroying coffee table (patent not pending).
    8. Re:Why not a hexagonal design? by Anonymous Coward · · Score: 0

      Speaking of unconventional design, why don't we see hexagonal or triangular CPU-designs? All I have seen are the Manhattan-like designs. Are these really the best? Embedding the CPU inside a hexagonal/triangular DRAM design should be possible too. What would be the trade-offs?

      1. The wiring, transistors, and so forth which make up a CPU are laid out on a Manhattan grid. The natural shape of large assemblies of these small building blocks is also rectangular.

      2. DRAM (and memories in general) are 2D arrays of bit cells, select transistors, and the wires which connect them to strips of circuits on the edges of the array. The only space-efficient, high performance layout for a 2D array is a rectangle.

      3. The only real benefit to non rectangular chip layouts would be packing more die into a single circular wafer. An isosceles triangle would probably pack better than anything else. However, the benefit might not be as good as you'd think.

    9. Re:Why not a hexagonal design? by Anonymous Coward · · Score: 0

      Layout efficiency is lost due to silicon patterning constraints...right angles only. Even then, the industry is having a hard time with 2 directions; up down, left right.

  10. Their money isn't old enough. by Anonymous Coward · · Score: 2, Interesting

    They're innovating?

    T-minus two days until they've been hit with 13 different patent lawsuits by companies that don't even produce anything similar.

    Sorry about your luck!

    1. Re:Their money isn't old enough. by Anonymous Coward · · Score: 0

      I was thinking the same. The last thing the establishment wants is competition, even if it isn't direct competition. Players like this, if successful, could lead to less market control. OTOH, this CPU sounds pretty specialized, I kinda doubt it'll cut it for most of my general computing needs.
      Best case scenario for them: the big players think it's better to buy them up, rather than beat them back. Although this bullet:

      TOMI Technology will be built on flash memories creating the elemental unit of a learning machine... the machines will be able to self organize, build robust communicating structures, and collaborate to perform tasks.

      Seems to coincide well with the industry oligopoly's intention to embed DRM into common storage devices.

  11. Sorry , I don't believe it by Viol8 · · Score: 1

    Memory bottlenecks might be an issue but cache generally solves a lot of them. Binning just about every advance in processor design since the Z80 simply to speed up memory access is farcical. I'm afraid this is going to sink without trace since if you need low power you can just use ARM anyway which incidentally will have a shed load more performance.

    1. Re:Sorry , I don't believe it by mdenham · · Score: 1

      Think of this as a proof-of-concept work, which can rapidly (relative to the initial rate of progress from the 22k transistor range) be pushed to something closer to present-day processor strength. So figure sometime around 2020-2025 for them to have caught up to present-day transistor counts, with a system that'd have higher performance than anything we can get right now (without overclocking).

      Granted, that 10-12 years figure is a gigantic ass pull on my part, but it shouldn't be too much slower than that. They'll catch up eventually, and to low-power chips faster than everything else.

    2. Re:Sorry , I don't believe it by Anonymous Coward · · Score: 0

      I'm not an engineer of any sort, but based on my readings, this is the general direction I would see these systems going in

      What I'm envisioning is a 64-bit, in-order, fairly simple CPU with a chunk of memory attached. Latency is mostly hidden by extra registers. Then a bunch (think hundreds) on a single die. Each CPU has its own kernel and it works cooperatively with the other CPUs to keep related threads "near" each other.

      You would effectively treat this system like a bunch of single CPU nodes in a grid with different latencies to other nodes, nearest being the fastest. You would need some way to describe related work loads. They would use message passing, so all messages transferred would be immutable structs instead of passing pointers and accessing relatively high-latency remote memory.

    3. Re:Sorry , I don't believe it by tibit · · Score: 1

      There's plenty of perhaps specialized but still fairly common digital signal processing that doesn't care at all about those "advances" in processor design. All it needs to do is plenty of multiplies-and-adds, saturated operations, test-and-modifies, etc. It doesn't require branch predictors, virtual memory, memory protection, layered caches, cache coherency, speculative execution, and plenty of other stuff that's needed to make x86 perform decently. The x86 instruction set is just very bad at extracting useful performance from CPU hardware without a lot of pre-processing. Even in the 80s you had DSP processors that had memory addressing units that would do circular buffers and strided access, as well as hardware loop generators. That way operands and data were always fetched from where they were supposed to come from, there was no speculation, loops had no overhead, and interleaving or parallelizing execution and fetches was easy to do, with no worry about data dependencies (whoever wrote the software would have already taken care of it), etc.
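      A minimal sketch, in plain C, of the kind of kernel being described: a circular sample buffer, a multiply-accumulate loop, and a saturating result. On the DSPs mentioned, the index wrap and the loop counter would be handled by dedicated hardware; here they are ordinary code. The tap count, the Q15 coefficient format, and every name are assumptions made for illustration.

```c
#include <stdint.h>

#define TAPS 4u /* power of two, so indices wrap with a mask */

typedef struct {
    int16_t  history[TAPS]; /* circular buffer of recent samples */
    unsigned head;          /* index of the newest sample */
} Fir;

/* Clamp a wide accumulator into 16 bits instead of letting it wrap. */
static int16_t saturate16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Push one sample and return one filtered output.
   coeff[] is Q15 fixed point: 16384 means 0.5. */
int16_t fir_step(Fir *f, const int16_t coeff[TAPS], int16_t sample)
{
    int64_t acc = 0;
    unsigned i;

    f->head = (f->head + 1u) & (TAPS - 1u);
    f->history[f->head] = sample;

    for (i = 0; i < TAPS; i++) {
        unsigned idx = (f->head - i) & (TAPS - 1u); /* newest first */
        acc += (int64_t)coeff[i] * f->history[idx]; /* multiply-accumulate */
    }
    return saturate16(acc / 32768); /* drop the 15 fractional bits */
}
```

      Nothing in the loop depends on the data values: the access pattern and trip count are known in advance, which is the property the comment argues makes prediction and speculation unnecessary for this class of code.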

      --
      A successful API design takes a mixture of software design and pedagogy.
  12. All on one chip by jaweekes · · Score: 3, Interesting

    I'm just wondering and maybe it exists already, but why not make everything on one chip? The CPU, memory, GPU, etc? Most people don't mess with the insides of their computer, and I'm guessing that it will speed up the computer as a whole. You won't even need to make it high-performance. Just do an i3 core with the associated chipset (or equivalent), maybe 4GB of RAM, some connectivity (USB 2, DVI, SATA, Wi-Fi and 1000Base-T) and you have it all. The power savings should be huge as everything internally should be low voltage. The die will be huge but we are heading that way anyway.
    Am I talking bollocks?

    1. Re:All on one chip by Anonymous Coward · · Score: 0

      Yes, you are.

    2. Re:All on one chip by BeardedChimp · · Score: 2

      You are basically describing a system on chip. You have one in your phone.

    3. Re:All on one chip by jaweekes · · Score: 1

      I didn't think about that! Now why can't they do that for a desktop or laptop? Is the ARM system just that much smaller than the Intel desktop chips?

    4. Re:All on one chip by stevelinton · · Score: 4, Interesting

      There are basically two problems:

      1. The external connectivity -- SATA, USB, ethernet, etc. needs too much power to easily move or handle on a chip (and the radio stuff needs radio power). You can do the protocol work on the main chip if you like, but you'll need amplifiers, and possibly sensors off chip.

      2. DRAM and CPUs are made in quite different processes, optimised for different purposes. Cache is memory made using CPU processes (so it's expensive and not very dense). These guys are trying to make CPUs using DRAM processes, which are slow.

    5. Re:All on one chip by TheDarkMaster · · Score: 1

      I thought the same thing. The question would be: is it possible to make an SoC with the performance of a desktop PC? And I do not know if doing so would be justified, given that the main advantage of a desktop PC is being able to make hardware upgrades.

      --
      Religion: The greatest weapon of mass destruction of all time
    6. Re:All on one chip by fuzzyfuzzyfungus · · Score: 1

      The trouble is that not only will the die be huge (which is an issue because it increases the odds that you'll have to throw the whole thing away because of a defect in some vital bit of it), but the entire die will have to be produced on a single process, presumably the one used by the most demanding of the parts.

      That doesn't make it impossible, but it would very likely make it extraordinarily expensive. If you totted up the total die area of a contemporary PC (CPU, RAM, GPU, assorted peripherals and interconnect stuff), it is already very large; but it is very large spread out over a number of processes, and a lot of dice that can be tested (and if necessary trashed) individually during production. Requesting the same functions, on a single die, from the fancy cutting-edge process probably used to make the CPU, would be ruinously expensive...
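      The defect argument can be put into numbers with a deliberately simplified model: assume every unit of die area independently carries a fatal defect with some small probability, so a die is good only if all of its units are clean. The model, the function name, and the example figures are assumptions for illustration, not data from any real process.

```c
/* Probability that a die made of `area_units` units has no fatal defect,
   assuming each unit fails independently with probability `defect_prob`.
   This is (1 - defect_prob) raised to the power area_units. */
double die_yield(unsigned area_units, double defect_prob)
{
    double yield = 1.0;
    unsigned i;

    for (i = 0; i < area_units; i++)
        yield *= 1.0 - defect_prob;
    return yield;
}
```

      With a made-up defect probability of 0.005 per unit, a 100-unit die comes out good roughly 60% of the time, while a single 400-unit die holding the same four functions comes out good only about 13% of the time, because all four regions have to be clean at once. Four separate dice can be tested and discarded individually.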

    7. Re:All on one chip by Anonymous Coward · · Score: 0

      This is what the Parallax Propeller (2006) does. Also, it's an 8-core design and costs ten bucks.

    8. Re:All on one chip by Anonymous Coward · · Score: 0

      BeagleBoard is similar to what you described. It has the CPU and GPU on one IC, then uses PoP to mount the memory directly on top of the CPU.

    9. Re:All on one chip by Tim4444 · · Score: 1

      Well, Raspberry Pi could be described as a proof of concept for the whole SoC as a PC substitute idea. At least for the Windows world, the popular software is only offered as precompiled binaries for x86 based platforms. It may be a while before there's a critical mass of ARM based offerings to attract serious commercial attention. Windows 8 may change this but I think it's still too early to tell.

      I think upgradability is possibly not the main advantage of desktops though it's certainly a key factor for many people. I'd argue that a sizeable number of PC's, if not the majority, will never get an upgrade that requires opening the case (so, I'm excluding new peripherals). That's why there's a market for things like onboard (ie. on the mobo) audio, NIC, and others including sometimes GPU.

    10. Re:All on one chip by Anonymous Coward · · Score: 1

      The cost and schedule of the chip grow disproportionately quickly as the size of the die is increased. It would be something like making a large monolithic luxurious pre-fab, it would probably not match everyone's tastes and would be ridiculously expensive to ship, defeating the gains from building the whole thing beforehand in a factory.

    11. Re:All on one chip by Anonymous Coward · · Score: 0

      What you are talking about is called system on a chip. There are several ARM variants out there that do this, such as the one used in the BeagleBoard and many Droid phones. Don't think they have the memory quite in there yet, as the embedded memory on the chip is fairly small at this point. Probably 2-3 more process generations and you will see the chips you are talking about. Pretty much everything else is in there, though.

      The limiting factor with x86 is the instruction set. The decoder takes up a good amount of room. Then the secondary RISC CPU and cache take up the rest.

      http://en.wikipedia.org/wiki/OMAP
      http://en.wikipedia.org/wiki/Qualcomm_Snapdragon (this one has a phone modem built in but that is its target audience)
      http://en.wikipedia.org/wiki/System_on_a_chip

    12. Re:All on one chip by tibit · · Score: 1

      Intel instruction set architecture requires a lot of hardware to execute efficiently. That's the price we all pay for using an instruction set that is 3 decades behind the hardware it runs on.

      --
      A successful API design takes a mixture of software design and pedagogy.
    13. Re:All on one chip by epine · · Score: 1

      Intel instruction set architecture requires a lot of hardware to execute efficiently. That's the price we all pay for using an instruction set that is 3 decades behind the hardware it runs on.

      You'd think three decades would be enough to dispel an inaccurate meme. The grotesque legacy instructions are mostly handled by microcode. Have you looked at how much die area that occupies relative to the rest of the CPU? The 286 was capable of executing the majority of the crappiest legacy instructions. That had 134,000 transistors total. A quad core i7 has 731,000,000 transistors.

      I could go down the list. The big problem with the Intel instruction set is that far too many transistors are active on every clock cycle, not that there is a lot of wasted die area.

    14. Re:All on one chip by petermgreen · · Score: 1

      You have one in your phone.

      They may call it a "system on chip", but while there is greater integration than with PC processors, many components such as power management, main memory, ethernet PHY (and sometimes MAC too), cellphone and/or wifi, serial level shift and so on are nearly always on separate chips.

      The problem is different chips need different process compromises, DRAM needs capacitors with low leakage, processors need fast switching and lots of interconnect. Ethernet needs relatively high currents to drive a couple of volts into a low impedance transmission line. cellphone and wifi need usable analog behaviour at GHz frequencies and so-on.

      --
      note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
    15. Re:All on one chip by tibit · · Score: 2

      It's not about how much die area the microcode takes, it's about how much die area everything else needed to run this microcode efficiently is taking! Properly designed opcodes would obviate trace generator, branch predictor, register reallocator, parts of northbridge, etc. Now that takes a lot of space. In case of x86 ISA it's not about legacy opcodes really, it's about all the missing opcodes (and registers) that a well performing architecture should have. That's what I mean by it being 30 years behind. Even fairly aged DSPs like 2100 architecture from Analog Devices have dedicated address generators and loop generators that are fully controlled by the code -- this alone means you don't need prediction logic, and this could be exploited by the northbridge (were it in a general purpose CPU) to optimally schedule DRAM cycles.

      --
      A successful API design takes a mixture of software design and pedagogy.
    16. Re:All on one chip by Anonymous Coward · · Score: 0

      Welcome to the modern cellphone

    17. Re:All on one chip by DarkOx · · Score: 1

      Because the manufacturing process is not perfect. The "bigger" the chip, the greater the chance something will be wrong with it. For most applications, if there were a defect in any part you'd have to toss the entire thing. That is a lot of very expensive Si wafer to throw away. That, and it would make bench tests very complicated, as you might not have direct access to the subsystems.

      What you are talking about is called a "system on chip" or SoC, and Intel actually is building one, for tablets and smart phones. There are lots of ARM SoCs from a variety of vendors.

      Don't expect to see "high performance" stuff in SoC configurations. The yields on those parts are typically even lower, so the entire thing would be just too cost prohibitive.

      --
      Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
    18. Re:All on one chip by petermgreen · · Score: 1

      Well, Raspberry Pi [raspberrypi.org] could be described as a proof of concept for the whole SoC as a PC substitute idea.

      The Pi is a mobile device "SoC" on a small board, fine for low demand tasks, but don't expect it to compare to a modern laptop or desktop on either processing grunt or (perhaps more importantly) RAM. If you are prepared to pay more you can get better SoCs with more RAM on similar sized boards, and you can get them right now from well known suppliers like Mouser (unlike the Pi, which should hopefully be coming out in early February if there are no further delays). Examples include the PandaBoard and the i.MX53 Quick Start.

      Hardware wise the only thing that makes the pi notable is the pricetag.

      I think upgradability is possibly not the main advantage of desktops though

      Desktops have a number of advantages over laptops

      1: better bang per buck
      2: higher specs available
      3: better ergonomics (unless
      4: upgradable.
      5: easier to secure than laptops

      --
      note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
    19. Re:All on one chip by Anonymous Coward · · Score: 0

      That's called SoC - System on a Chip. Europe is doing a lot of research in that area - that's why they have all these mobile phone companies like Nokia, Siemens etc...

    20. Re:All on one chip by Anonymous Coward · · Score: 0

      It's not about how much die area the microcode takes, it's about how much die area everything else needed to run this microcode efficiently is taking! Properly designed opcodes would obviate trace generator,

      Only the Pentium IV has one. It wasn't a consequence of it being x86, it was a consequence of one of several interesting if weird ideas Intel tried out in P4. Not all of them worked out so well.

      branch predictor,

      what

      You cannot redesign opcodes to make the branch predictor go away. You can throw away the branch predictor and get sucky performance no matter what your opcode design might be, though.

      register reallocator,

      Imagine me repeating what I just said about branch prediction! You can't get good performance out of an out-of-order CPU core if every write-after-read hazard forces the writing instruction to stall until the reading instruction is done.

      parts of northbridge, etc.

      The northbridge has no idea what instruction set the CPU might happen to be running. It's just glue logic which accepts memory read/write commands from master devices (CPUs and bus master peripherals), and executes them.

      Now that takes a lot of space. In case of x86 ISA it's not about legacy opcodes really, it's about all the missing opcodes (and registers) that a well performing architecture should have.

      Can you name a single opcode you think x86 should have, and doesn't?

      That's what I mean by it being 30 years behind. Even fairly aged DSPs like 2100 architecture from Analog Devices have dedicated address generators and loop generators that are fully controlled by the code -- this alone means you don't need prediction logic, and this could be exploited by the northbridge (were it in a general purpose CPU) to optimally schedule DRAM cycles.

      Translation: "I twiddle DSP code for a living, never touch anything else, and have no clue whatsoever about how and why fast general purpose CPUs must differ from DSP architectures."

      Here's clue #1 for you: Those dedicated address and loop features are awesome for number crunching on nice linear arrays of data, or arrays with stride. They do absolutely nothing for the morass of convoluted pointer-chasing, branch-for-reasons-other-than-looping GUI code which general purpose CPUs are expected to run fast.

      Clue #2: A branch predictor good enough to perform 99% as well as a prefab loop-through-array feature costs approximately no hardware resources. Seriously, that level of prediction isn't rocket science: support only one branch (the last one encountered), and one bit of history for that branch (was it taken or not taken last time through?). Loop through 2000 data elements? Okay, you mispredict zero or one branches on the first pass through, and one on the last. The other 1998 out of 2000 times, you're golden, a 99.9% prediction rate.

      Clue #3: I had to know, so I had a brief look at some googled reference documentation for that DSP. I laughed quite a bit. Oh my sir, you are so clueless. It uses dedicated input and output registers for the ALU!!! That's probably how it can even have programmable address/loop generation features in the first place. You have no idea how inappropriate this design would be for a general purpose instruction set. It would mean, just for example, that you're locked into providing exactly as many ALUs as the original instruction set was designed to handle. It also means that for ANY pattern other than stepping through nice, clean arrays, you lose. Badly. Because your instruction set is only designed to do math on nice, clean arrays. It's super efficient at that, horrible at everything else.

      Which leads us to clue #4: General purpose ISAs designed long after that Analog Devices DSP architecture had none of those features you're claiming to be necessary for any modern ISA. There is a reason for that.
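      The arithmetic in clue #2 can be checked with a small simulation of the predictor it describes: one branch, one bit of history, predict whatever happened last time. The function name and the modelling of the loop's backward branch (taken on every pass except the last) are assumptions made for this sketch.

```c
#include <stdbool.h>

/* Count mispredictions of a one-bit, last-outcome predictor on the
   backward branch of a loop that runs `iterations` times. */
unsigned loop_mispredictions(unsigned iterations, bool initial_prediction)
{
    bool predict_taken = initial_prediction;
    unsigned misses = 0;
    unsigned i;

    for (i = 0; i < iterations; i++) {
        /* The branch jumps back to the loop top except on the final pass. */
        bool taken = (i + 1u < iterations);

        if (taken != predict_taken)
            misses++;
        predict_taken = taken; /* remember only the most recent outcome */
    }
    return misses;
}
```

      For 2000 iterations this gives two mispredictions when the predictor starts out guessing "not taken" and one when it starts out guessing "taken", matching the "zero or one on the first pass, one on the last" count in the comment.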

  13. eDRAM by Anonymous Coward · · Score: 0

    How does this compare with embedded dram, which caused a lot of hype few years ago?

  14. Support Chip? by TaoPhoenix · · Score: 1

    I'm no techie, but I'm just wondering if this isn't more of a support chip that works the other way, if it's like a "smart cache" where the main CPU can offload something memory intensive and repetitive to keep it out of the way of the fancy thread calculations.

    --
    My first Journal Entry ever, in 8 years! http://slashdot.org/journal/365947/aphelion-scifi-fantasy-horror-poetry-webzine
  15. synthesis by lkcl · · Score: 4, Informative

    there's a problem with doing designs like this. the tooling for CPUs is very very specific: 28nm, 32nm, 45nm - all those companies that do the simulations (Mentor, for example), which charge something like USD 250,000 per week to license their tools, have written the tools SPECIFICALLY for those geometries.

    if you wander randomly outside of those geometries you are either on your own or you are into some unbelievably-high development costs.

    why is this relevant?

    it's because the DRAM manufacturers do *not* stick to the well-known geometries: they vary the geometry in order to get the absolute best performance because the cell layout is absolutely identical for DRAM ICs. and, because those cells _are_ identical, the verification process is much simpler than is required for a complex CPU.

    in other words, this company is trying to mix-and-match two wildly different approaches. in other words, what he's doing is either incredibly expensive or is sub-optimal. which begs the question: what's it _for_?

    1. Re:synthesis by Anonymous Coward · · Score: 0

      What task can you do that requires nothing more than a few specific operations, but ones that need to be repeated so fast because there are a tremendous number of operations to do?

      Breaking encryption.

    2. Re:synthesis by Anonymous Coward · · Score: 0

      in other words, this company is trying to mix-and-match two wildly different approaches. in other words, what he's doing is either incredibly expensive or is sub-optimal. which begs the question: what's it _for_?

      Because your processor doesn't need to be "optimal" if you're running memory-bandwidth-intensive operations and your choice of how to build it allows you to have a 4kbit wide per-core memory interface. This device basically has about 16x the memory bandwidth of anything else out there, which means for the right application it runs 16 times faster.

      (captcha: "defended". heh.)

    3. Re:synthesis by K. S. Kyosuke · · Score: 1

      there's a problem with doing designs like this. the tooling for CPUs is very very specific: 28nm, 32nm, 45nm - all those companies that do the simulations where they charge something like $USD 250,000 per week to license their tools like mentor do - have written the tools SPECIFICALLY for those geometries.

      Or you can do it the way Chuck Moore does and write your own OKAD, simpler, faster and better. :)

      --
      Ezekiel 23:20
    4. Re:synthesis by GrumpySteen · · Score: 1

      which begs the question: what's it _for_?

      "begging the question" doesn't mean what you think it means.

      Aside from that, the device is a building block for massively parallel computers with extremely high memory bandwidth for the processors. The tasks it would be used for are the same tasks that other massively parallel supercomputers are used for today; simulating complex systems, graphics rendering, etc.

    5. Re:synthesis by Anonymous Coward · · Score: 0

      See this. I became a little suspicious after reading about the dependency between the value of the invention and the exclusivity of the license or sale.

  16. Don't count this out yet by fyngyrz · · Score: 5, Interesting

    Useless? My key question would be does it have decent speed integer multiply and perhaps even divide instructions. A whole heck of a lot can be achieved if you have, say, the basic instruction set of a 6809, but fast and wide (and it didn't even have a divide... so we built multiply-by-reciprocal macros to substitute, that works too.)
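    The multiply-by-reciprocal trick mentioned above can be sketched in C for one fixed divisor. This is not the 6809 macro itself, which is not shown here; it is the same idea on a wider machine, assuming a 32-bit input and a 64-bit intermediate product. The constant 0xCCCCCCCD is 2^35 / 10 rounded up, and the function name is made up for the example.

```c
#include <stdint.h>

/* Divide an unsigned 32-bit value by 10 using one multiply and one shift.
   0xCCCCCCCD == ceil(2^35 / 10); the result is exact for every 32-bit input. */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

    Each divisor needs its own precomputed constant and shift, which is why this is usually wrapped in macros or generated by a compiler rather than written by hand at every call site.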

    I know everyone's used to having FP right at hand, but I'm telling you, fast integer code and table tricks can cover a lot more bases than one might initially think. A lot of my high performance stuff -- which is primarily image processing and software defined radio -- is currently limited considerably more by how fast I can move data in and out of main memory than it is by actually needing FP operations. On a dual 4-core machine, I can saturate the memory bus without half trying with code that would otherwise be considerably more efficient, if it could actually get to the memory when it needs to.

    Another thing... when you're coding with C, for instance, the various FP ops can just as easily be buried in a library, then who cares why or how they get done anyway, as long as they are? With lots-o-RAM, you can write whatever you need to and it'd be the same code you'd write for another platform. Just mostly faster, because for many things, FP just isn't required, or critical. Fixed point isn't very hard to build either and can cover a wide range of needs (and then there's BCD code... better than FP for accounting, for instance.)

    Signed, old assembly language programmer guy who actually admits he likes asm...

    --
    I've fallen off your lawn, and I can't get up.
    1. Re:Don't count this out yet by ByOhTek · · Score: 3, Insightful

      pfft. floating point sucks anyway.

      typedef struct FRACTION_STRUCT
      { //numerator/denominator * 10^exponent
          int numerator;
          unsigned int denominator;
          int exponent;
      } Fraction;

      --
      Self proclaimed typo king, and inventor of the bear destroying coffee table (patent not pending).
    2. Re:Don't count this out yet by buglista · · Score: 2

      Exactly. ARM2 didn't have FP, people still wrote some extremely good stuff for it. (You can always approximate it by pretending that the last 10 bits are 2^-1, 2^-2, ... 2^-10 and multiplying out - in fact shifting - when you need the answer. I've written graphics demos like that.)
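
      A minimal sketch of the fixed-point trick described above, assuming a 32-bit word with 10 fraction bits. The names (fix_t, fix_mul and so on) are made up for illustration and are not from any ARM2 code. Scaling is written as multiply/divide by 2^10, which a compiler can turn into shifts for non-negative values.

```c
#include <assert.h>
#include <stdint.h>

/* Q22.10 fixed point: the low 10 bits hold the fraction, so 1.0 == 1 << 10. */
#define FRAC_BITS 10
#define FIX_ONE   (1 << FRAC_BITS)

typedef int32_t fix_t;

/* Convert a whole number to fixed point. */
static fix_t fix_from_int(int32_t n) { return n * FIX_ONE; }

/* Addition needs no adjustment: both operands share the same scale. */
static fix_t fix_add(fix_t a, fix_t b) { return a + b; }

/* The raw product carries 20 fraction bits; divide by 2^10 to get back to 10.
   A 64-bit intermediate avoids overflow. Division truncates toward zero. */
static fix_t fix_mul(fix_t a, fix_t b)
{
    return (fix_t)(((int64_t)a * b) / FIX_ONE);
}

/* Drop the fraction when an integer answer (say, a pixel coordinate) is needed. */
static int32_t fix_to_int(fix_t a) { return a / FIX_ONE; }
```

      For example, fix_mul(fix_from_int(3), FIX_ONE / 2) gives 1536, which is 1.5 in this format.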

    3. Re:Don't count this out yet by Anonymous Coward · · Score: 0

      If floating point sucks, why did you implement it in your fraction struct? ;)

    4. Re:Don't count this out yet by smi.james.th · · Score: 2

      You're assuming a rational number there.

      Wait. Hang on. Forget that I pointed that out... :P

      --
      One thing I know, and that is that I am ignorant...
    5. Re:Don't count this out yet by walshy007 · · Score: 4, Interesting

      Exactly. ARM2 didn't have FP, people still wrote some extremely good stuff for it.

      Nintendo DS doesn't have an fpu on either cpu.

    6. Re:Don't count this out yet by ByOhTek · · Score: 1

      Floating point is based on an integer base (rather than a fractional one).

      Or maybe, I should say, floating point, as I have seen it implemented up to this point.

      --
      Self proclaimed typo king, and inventor of the bear destroying coffee table (patent not pending).
    7. Re:Don't count this out yet by ByOhTek · · Score: 2

      it still manages a few cases classic floating point would miss. However, if you give me an infinitely wide register, I would happily work on something that would handle irrational numbers :-)

      --
      Self proclaimed typo king, and inventor of the bear destroying coffee table (patent not pending).
    8. Re:Don't count this out yet by tibit · · Score: 4, Interesting

      Agreed. I'm working on a digital oscilloscope display system and that thing might be very useful in this application -- where you need lots of bandwidth, but also plenty of storage. Say, zooming, filtering, scaling of one second long acquisition done at 2Gs/s, using a 12 bit digitizer. You tweak the knobs, it updates, all in real time. In the worst case, you need about 120 Gbytes/s memory bandwidth to make it real time on a 30FPS display. And that's assuming the filter coefficients don't take up any bandwidth, because if they do you've just upped the bandwidth to terabytes/s.
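
      One way to arrive at the 120 Gbytes/s figure, as a sketch: it assumes each 12-bit sample is stored in 2 bytes and that every displayed frame reads the whole one-second capture exactly once. Both are assumptions, and redraw_bandwidth is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second needed if every displayed frame re-reads the whole capture.
   Assumption: each sample is padded to a whole number of bytes. */
static uint64_t redraw_bandwidth(uint64_t samples, unsigned bits_per_sample,
                                 unsigned frames_per_second)
{
    assert(bits_per_sample > 0);
    uint64_t bytes_per_sample = (bits_per_sample + 7) / 8;  /* 12 bits -> 2 bytes */
    return samples * bytes_per_sample * frames_per_second;
}
```

      With 2,000,000,000 samples, 12 bits and 30 frames per second this comes to 120,000,000,000 bytes per second.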

      --
      A successful API design takes a mixture of software design and pedagogy.
    9. Re:Don't count this out yet by TeknoHog · · Score: 4, Funny

      Signed, old assembly language programmer guy

      I see what you did there.

      --
      Escher was the first MC and Giger invented the HR department.
    10. Re:Don't count this out yet by poetmatt · · Score: 1

      I agree with you, this is substantial. The ability to have no megabytes of "cache" but instead gigabytes, depending on how it's used, could be very very substantial.

      I could see this going many ways, including basically the equivalent of having a "Socket type" for the ram - drop in and upgrade as necessary. Even with the significant latency differences versus various levels of on-die cache this can be very significant.

    11. Re:Don't count this out yet by fyngyrz · · Score: 5, Funny

      That's a bit shifty, don't you think? I don't mean to negate your point, but too, it's beyond my power to complement you -- I'm somewhat over a barrel. Perhaps if you add one to your argument, we'd have something else. Logically speaking. HCF.

      --
      I've fallen off your lawn, and I can't get up.
    12. Re:Don't count this out yet by lkcl · · Score: 1, Interesting

      the DEC alpha didn't have a floating-point unit: it had primitives that could be used to emulate floating-point almost as quickly as having a dedicated FPU. i spoke to a friend several years back who knows about these things: he said that a 1s complement add goes a long way towards speeding up floating-point using integer operations. i have _no_ idea what he meant but it was kinda interesting to hear from someone who'd really thought about this stuff.

    13. Re:Don't count this out yet by Anonymous Coward · · Score: 0

      Well, some might say that floating point is defined by having a "floating point" :) You have a floating point numerator, implied by the exponent field.

    14. Re:Don't count this out yet by tomhath · · Score: 1
      Would this thing be able to run a JVM? With 128 cores on a 16GB chip it might actually make Eclipse fairly responsive.

      I could see it as part of a database machine too. Somewhat limited by the amount of memory available on each chip, but an architecture similar to a Netezza appliance where the processing is pushed out closer to the data might be interesting.

    15. Re:Don't count this out yet by Anonymous Coward · · Score: 0

      "didn't even have a divide... so we built multiply-by-reciprocal macros to substitute, that works too."

      How can you do a reciprocal without a divide? Unless you had predetermined "divisors"?

    16. Re:Don't count this out yet by Anonymous Coward · · Score: 0

      That's a bit shifty, don't you think? I don't mean to negate your point, but too, it's beyond my power to complement you -- I'm somewhat over a barrel. Perhaps if you add one to your argument, we'd have something else. Logically speaking. HCF.

      My brain just went like your userid.

    17. Re:Don't count this out yet by TarMil · · Score: 1

      Or maybe, I should say, floating point, as I have seen it implemented up to this point.

      If you define numbers differently, you can't call them "floating" anymore. Integer base + exponent is the very definition of "floating point".

    18. Re:Don't count this out yet by wik · · Score: 2

      You must be thinking of some other processor. The first released Alpha silicon, Alpha 21064, had a pipelined FPU for adds/subtracts/multiplies and a non-pipelined floating-point divide unit.

      --
      / \
      \ / ASCII ribbon campaign for peace
      x
      / \
    19. Re:Don't count this out yet by gbjbaanb · · Score: 1

      for most applications you don't need a FPU, or floating point numbers at all.

      After all, you just redefine where the decimal point is and you have perfect-accuracy floats, it's how you create decimal type for currency arithmetic, no reason why you can't use it for 3d graphics too.

      On a dual 4-core machine, I can saturate the memory bus without half trying with code

      you code in .NET too huh? :)

    20. Re:Don't count this out yet by inhuman_4 · · Score: 1

      Quite right. I do research on AI for embedded systems, specifically Integer Neural Networks (is it a shameless plug if I don't profit from it? creative commons book chapter.). By cutting out FP you can make all those low cost (sub $1) microcontrollers pretty powerful. Neural Networks cut out a lot of the processing by just making good guesses, then cutting out FP makes an implementation very light on resources.

    21. Re:Don't count this out yet by Anonymous Coward · · Score: 0

      hurrrr durrrr floating point sucks watch me invent a lame 96-bit floating point type

    22. Re:Don't count this out yet by TerranFury · · Score: 1

      Since one way to define a real number is as an (equivalence class of) sequences of rationals, I suppose any object that (1) stores a rational number, and (2) can advance to a "next" rational number, could be called a real number. (That's so long as the sequences it generates are Cauchy.)

      I guess I'm saying you just need to make all your evaluation really, really lazy, and you can work with arbitrary precision. :-)
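
      A rough sketch of such an object, assuming the target is sqrt(2) and the "next" step is a Newton iteration on x^2 - 2. LazyReal and lazy_next are invented names. Fixed-width integers run out after a handful of steps, so real arbitrary precision would need a bignum type.

```c
#include <assert.h>
#include <stdint.h>

/* A "real number" as a generator of rationals p/q; here the sequence
   converges to sqrt(2) via Newton's method on x^2 - 2. */
typedef struct {
    uint64_t p;  /* numerator   */
    uint64_t q;  /* denominator */
} LazyReal;

static LazyReal sqrt2_start(void)
{
    LazyReal r = { 1, 1 };
    return r;
}

/* Advance to the next rational: x' = (x + 2/x) / 2 = (p^2 + 2q^2) / (2pq).
   64-bit integers only have room for five steps starting from 1/1. */
static void lazy_next(LazyReal *r)
{
    assert(r->p < (1ULL << 31) && r->q < (1ULL << 31));  /* keep products in range */
    uint64_t p = r->p * r->p + 2 * r->q * r->q;
    uint64_t q = 2 * r->p * r->q;
    r->p = p;
    r->q = q;
}
```

      Starting from 1/1, successive calls give 3/2, 17/12, 577/408, 665857/470832, each one closer to sqrt(2).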

    23. Re:Don't count this out yet by raynet · · Score: 2

      I raise my carry to you, sir.

      --
      - Raynet --> .
    24. Re:Don't count this out yet by Kagato · · Score: 1

      This could be interesting datastorage. Recent articles about the NoSQL implementations at FB and other new media companies have indicated that 80+% of the data is actually stored in memory (with a write behind to disk). This could end up being a specialized server product.

    25. Re:Don't count this out yet by hairyfeet · · Score: 1

      Uhhh...why give up anything? Who says we can't have this AND an ARM or x86/64 CPU/APU? Hell we already have specialized chips like DSPs for sound, maybe having "Smart RAM" that can say do its own error correction or compression or encryption wouldn't give us a nice speed boost without making the power requirements jump?

      --
      ACs don't waste your time replying, your posts are never seen by me.
    26. Re:Don't count this out yet by Teancum · · Score: 1

      How can you do a reciprocal without a divide?

      ROM look-up tables can do wonders if you aren't begging for high accuracy (or even if you are) and are screaming fast (one extra fetch cycle). Trig functions are often done that way in FPUs as well, so it isn't without precedent. The hard part is simply silicon real estate being traded for circuit complexity.

      To explain, a ROM table simply maps one to one for every value (or high order bits if you are looking for less accuracy but smaller ROM real estate) that you want to change over to another value. ROM is also something that can be built quite efficiently, so it isn't all that big of a deal either. With a single extra clock cycle as the "penalty" for culling out the dividing circuitry, it is a good way to significantly simplify the CPU design and still give you most of the performance that you need.
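
      A software sketch of the table idea, assuming 8-bit operands and a 16-bit reciprocal scale. The table here is filled at startup with ordinary division as a stand-in for a ROM generated offline; recip_table and div_u8 are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define RECIP_SHIFT 16

/* recip_table[d] holds ceil(2^16 / d). In hardware this would be a ROM
   generated offline; here it is filled once at startup as a stand-in. */
static uint32_t recip_table[256];

static void recip_table_init(void)
{
    recip_table[0] = 0;  /* division by zero is not meaningful; entry unused */
    for (uint32_t d = 1; d < 256; d++)
        recip_table[d] = ((1u << RECIP_SHIFT) + d - 1) / d;
}

/* n / d as one table fetch, one multiply and one shift. */
static uint8_t div_u8(uint8_t n, uint8_t d)
{
    assert(d != 0);
    return (uint8_t)(((uint32_t)n * recip_table[d]) >> RECIP_SHIFT);
}
```

      Rounding the table entries up rather than down matters: with truncated entries, 100 / 10 would come out as 9. Wider operands need a wider scale or a correction step.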

    27. Re:Don't count this out yet by Sun · · Score: 1

      Shouldn't dirty talk be done in private?

      Shachar

    28. Re:Don't count this out yet by rubycodez · · Score: 1

      no, that base has to have base digits of one half raised to increasing integer powers. it is NOT an integer

    29. Re:Don't count this out yet by Anonymous Coward · · Score: 0

      FP is always rational. For numerical work, it's meaningless since the rationals are dense on R.

    30. Re:Don't count this out yet by fyngyrz · · Score: 1

      .NET? No, I generally code in straight C, and stay away from system libraries and OS dependencies and more specialized languages and GPL as much I can. Generally I write under OS X, but am almost always no more than a recompile away from working on some other platform. Consequently I have a very large library of my own code I can call upon, and rarely suffer from Other People's Bugs. My bugs, of course, are another matter -- but at least I can generally fix them.

      My memory bandwidth problems typically arise from doing things to data elements that (a) can be spread across multiple cores and (b) don't take many CPU cycles per data element and (c) where the data elements come from, and must be restored to, data element arrays that are far too large for cache to do them any significant good.

      --
      I've fallen off your lawn, and I can't get up.
    31. Re:Don't count this out yet by fyngyrz · · Score: 1

      Well, because no one's offering that yet? :)

      --
      I've fallen off your lawn, and I can't get up.
    32. Re:Don't count this out yet by Shifty0x88 · · Score: 1

      that's because a ds wouldn't benefit from FP math, it is just displaying on an integer-based display (LCD), so the FP math would just lead to rounding errors and graphics not matching up.

      Graphics can be implemented as FP math, but really if I am going to be drawing at point ( 5, 25) why should I need the resolution of point ( 5.25678, 25.9635)????

      In my opinion graphics like that should always be integer based math, as it would be faster and as you pointed out @walshy007, you wouldn't need an FPU

    33. Re:Don't count this out yet by walshy007 · · Score: 1

      3d tends to require fractions. While the nintendo ds does not have an fpu, the gpu in it uses fixed point arithmetic (in 1.4.11 format I think? it has been a while) for the projection/model view matrices.

      floating point is far better for these uses, not having an fpu imposes serious limitations and work arounds on 3d work

      If you honestly don't think 3d is better served by an fpu, I'd heavily encourage you to try writing a software renderer that does not use them (all pc graphics cards these days are essentially a bunch of _hundreds_ of floating point units)

    34. Re:Don't count this out yet by ChrisMaple · · Score: 1

      Actually, "floating point" is an example of a definition that has changed over the years. Originally, floating point was a form of binary-coded-decimal that in each nibble could hold 0-9, minus, or a decimal point. To represent varying degrees of small numbers, the decimal point showed up at different places in the representation, like 123.45 or 1.2345, thus it "floated". What we now call "floating point" is more properly referred to as "scientific notation".

      --
      Contribute to civilization: ari.aynrand.org/donate
    35. Re:Don't count this out yet by Anonymous Coward · · Score: 0

      If you could quit playing with your bit for a few minutes you might see the light

    36. Re:Don't count this out yet by Anonymous Coward · · Score: 0

      Now multiply two of those in one clock cycle.

    37. Re:Don't count this out yet by gbjbaanb · · Score: 1

      oh sigh.

      I was taking the piss out of how most .net code I've seen uses RAM like it was neverending, mostly due to people creating objects all over the place and not even trying to code in an efficient manner. (it is designed for 'programmer productivity' so I guess that's by design)

    38. Re:Don't count this out yet by Shifty0x88 · · Score: 0

      I wasn't saying that 3D is better served by an FPU, but we are talking about the DS so it's all just "fake" 3D and when you are dealing with an LCD panel, I cannot light half a pixel differently from the other half, so what is the point in having the greater precision if you just throw it all out when you actually display the new frame??

      I was just saying it is a waste of CPU cycles to do all the fancy math and just throw it out at the end anyways. Why not just stick with integer-based math?

      Plus I was limiting my discussion on the DS, NOT desktop computers in general, so graphics cards are out

    39. Re:Don't count this out yet by walshy007 · · Score: 1

      I cannot light half a pixel differently from the other half, so what is the point in having the greater precision if you just throw it all out when you actually display the new frame??

      There is a lot of math between "draw z/y/v position with this projection matrix" etc to "we have our 2d plane to project on to the screen". THAT is the math that is better suited by an fpu, which is the same on a DS or a desktop pc doing graphics.

      Try doing some 3d programming.. on anything, and you will see what I mean.

    40. Re:Don't count this out yet by sjames · · Score: 1

      Floats and doubles can only approximate an irrational value, so I don't see where they win over the fraction struct. The only precise way to deal with irrationals is symbolic manipulation. With any luck, they'll cancel by the end.

    41. Re:Don't count this out yet by smi.james.th · · Score: 1

      Was trying to be funny...

      Woosh...

      --
      One thing I know, and that is that I am ignorant...
    42. Re:Don't count this out yet by White+Flame · · Score: 1

      It has a hardware 16*16=32bit integer multiply, two separate instructions to easily support 32-bit multiplication, given that the register sizes are 32-bits. It doesn't look like there are any additional cycles for multiplication vs any other operation, just that you have to run 2 instructions to perform it in 32-bit. No division instruction.

      The addressing modes are kind of funky, and there's a single accumulator register that all math seems to pipe through. So this is a very non-CISC architecture in terms of requiring multiple instructions to shuffle data around between registers/stack plus performing ALU ops. Though I suspect compilers could get pretty good at keeping the code tight.
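
      A generic sketch of how a 32-bit product can be assembled from 16*16=32 multiplies. This is the schoolbook decomposition, not a description of TOMI's actual instruction pair (which is not spelled out here); mul16x16 is a stand-in for the hardware multiply.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a hardware 16 x 16 -> 32 bit multiply. */
static uint32_t mul16x16(uint16_t a, uint16_t b)
{
    return (uint32_t)a * b;
}

/* Low 32 bits of a 32 x 32 product, built from 16-bit partial products.
   The high*high term only affects bits 32 and up, so it is skipped. */
static uint32_t mul32_lo(uint32_t a, uint32_t b)
{
    uint16_t a_lo = (uint16_t)(a & 0xFFFFu), a_hi = (uint16_t)(a >> 16);
    uint16_t b_lo = (uint16_t)(b & 0xFFFFu), b_hi = (uint16_t)(b >> 16);

    uint32_t low   = mul16x16(a_lo, b_lo);
    uint32_t cross = mul16x16(a_hi, b_lo) + mul16x16(a_lo, b_hi);  /* wraps mod 2^32 */
    return low + (cross << 16);
}
```

      When one operand fits in 16 bits, one of the cross terms is zero and the whole thing collapses to two multiplies plus a shift and an add.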

    43. Re:Don't count this out yet by White+Flame · · Score: 1

      That's not a truncated floating point number, it's a rational number.

    44. Re:Don't count this out yet by White+Flame · · Score: 1

      What does LCD have to do with anything? You're still dealing with discrete pixels on a CRT or what have you.

      And having fractional resolution of pixel points helps a ton for doing subpixel scanline rendering, ie smooth motion when drawn segment endpoints are moving less than a pixel.

    45. Re:Don't count this out yet by TerranFury · · Score: 1

      FP is always rational.

      The floating point numbers (say, IEEE whatever-whatever) are a subset of the rationals, sure.

      For numerical work, it's meaningless since the rationals are dense on R.

      Sure. I think my point wasn't as immediately practical as that. It was more philosophical. Some revered mathematician was quoted as saying something to the effect of "progress in mathematics happens by calling things the same that are different." E.g., saying that a real number IS a Cauchy sequence of rationals. In the same way, you can say that some C++ object that stores a rational (and some state), and can advance to a next rational IS the sequence it generates, and IS a real number -- (if it is Cauchy, which, if the object is a black box, you cannot tell from the outside.) However, I'd neglected something obvious:

      That same code is represented on the computer by a finite-length sequence of symbols from a finite alphabet, so it can also be interpreted as being a natural number. Since we have a bijection from the naturals to the sequence-generating programs, and we know that the set of reals is larger than the set of naturals, the map from sequence-generating programs to the reals must not be onto.

      So I guess there are real numbers that it is impossible to write programs to converge to (even with arbitrary-precision rationals). Huh.

      (All this must be extremely introductory computational theory... It's got to be...)

    46. Re:Don't count this out yet by tolkienfan · · Score: 1

      They can only approximate some rationals too. So a fraction class can be useful for exact calculation.
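
      A small sketch of exact fraction arithmetic, assuming a plain numerator/denominator pair (a simplified cousin of the Fraction struct earlier in the thread, without the exponent field). Rational, rat_make and rat_add are illustrative names, and there are no overflow checks.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int64_t num;
    int64_t den;  /* always > 0 after rat_make */
} Rational;

static int64_t gcd64(int64_t a, int64_t b)
{
    if (a < 0) a = -a;
    if (b < 0) b = -b;
    while (b != 0) {
        int64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Build a fraction in lowest terms with a positive denominator. */
static Rational rat_make(int64_t num, int64_t den)
{
    assert(den != 0);
    if (den < 0) { num = -num; den = -den; }
    int64_t g = gcd64(num, den);  /* g >= 1 because den != 0 */
    Rational r = { num / g, den / g };
    return r;
}

/* a/b + c/d = (ad + cb) / bd. No overflow checks in this sketch. */
static Rational rat_add(Rational a, Rational b)
{
    return rat_make(a.num * b.den + b.num * a.den, a.den * b.den);
}

static int rat_equal(Rational a, Rational b)
{
    return a.num == b.num && a.den == b.den;
}
```

      Here 1/10 + 2/10 compares equal to 3/10, whereas in IEEE 754 double precision 0.1 + 0.2 does not compare equal to 0.3.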

    47. Re:Don't count this out yet by tolkienfan · · Score: 1

      You can do reciprocals to any desired accuracy with a Taylor expansion.
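
      A sketch of that in Q16.16 fixed point, assuming the input has already been scaled into 0 < x < 2 so the series converges. recip_q16 is a hypothetical name; the series used is the Taylor expansion of 1/x around x = 1.

```c
#include <assert.h>
#include <stdint.h>

#define Q16_ONE 65536  /* 1.0 in Q16.16 */

/* 1/x for 0 < x < 2, using the Taylor series of 1/x around x = 1:
   1/x = 1 + y + y^2 + y^3 + ...  with  y = 1 - x.
   Only multiplies and adds are needed; the division by Q16_ONE is a
   power-of-two rescale. More terms give more accuracy, and the series
   converges fastest when x is close to 1. */
static int32_t recip_q16(int32_t x, int terms)
{
    assert(x > 0 && x < 2 * Q16_ONE);
    assert(terms > 0 && terms <= 1024);  /* keeps the running sum within int32_t */
    int32_t y = Q16_ONE - x;
    int32_t term = Q16_ONE;  /* y^0 */
    int32_t sum = 0;
    for (int k = 0; k < terms && term != 0; k++) {
        sum += term;
        term = (int32_t)(((int64_t)term * y) / Q16_ONE);
    }
    return sum;
}
```

      For x = 0.75 (49152 in Q16.16) sixteen terms give 87381, against an exact value of about 87381.33. Inputs far from 1 need many more terms, which is why range reduction usually comes first.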

    48. Re:Don't count this out yet by sjames · · Score: 1

      Absolutely agreed. I would like to see programming in general move to more abstract numerical operations and reserve IEEE floats and fixed width ints for the most performance critical operations. I would especially like to see calculators move away from decimal only representation. It's subtly distorting everything we do that even touches on math. It was understandable back in the days when a calculator cost over $100 and just barely had enough computational power to get the job done. In these days where a decent low spec CPU is tiny, dirt cheap, and requires no heat sink, there's no reason we can't do it.

      A while back, just to see, I wrote an RSA function in python taking advantage of the unlimited precision of longs. The code was dead simple and MUCH more understandable than any C implementation I have seen. While it won't break any speed records, it was more than fast enough for any human time scale operation (most uses of RSA, that is) when run on a modern PC. Yes, underneath that simple and readable python code was a lot of hairy implementation in C, but it's a very well exercised hairy implementation.

      Even in the case of iterative HPC simulation and modeling code where hundreds of modern CPUs can still take a month to crunch a simulation, I wonder if a slower validation implementation would still be a win so we can be sure that the fast (and often loose) production code is actually producing valid results.
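
      For contrast with the Python version described above, a toy sketch of the core RSA operation in C with fixed-width integers. It assumes the modulus fits in 32 bits, which is far too small for real use, and the key in the usage note (n = 3233, e = 17, d = 2753) is a textbook toy key, not anything from the original code.

```c
#include <assert.h>
#include <stdint.h>

/* (base ^ exp) mod m by square-and-multiply. m must fit in 32 bits so that
   every intermediate product fits in 64 bits. */
static uint64_t mod_pow(uint64_t base, uint64_t exp, uint64_t m)
{
    assert(m > 0 && m <= 0xFFFFFFFFu);
    uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % m;
        base = (base * base) % m;
        exp >>= 1;
    }
    return result;
}

/* Textbook RSA with no padding: the same operation encrypts and decrypts. */
static uint64_t rsa_apply(uint64_t msg, uint64_t key_exp, uint64_t n)
{
    return mod_pow(msg, key_exp, n);
}
```

      With the toy key, rsa_apply(65, 17, 3233) gives 2790 and rsa_apply(2790, 2753, 3233) gives 65 back. Real key sizes need multi-word arithmetic, which is exactly the hairy part Python's longs hide.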

  17. Way to miss the point by Anonymous Coward · · Score: 0

    Er, from that link:

    the term SoC is typically used with more powerful processors, ..., which need external memory chips (flash, RAM) to be useful

    And it also explains why the different processes used to make memory and CPU means these are usually separated.

  18. Yo dawg! by jimmydigital · · Score: 1, Funny

    I heard you like to reduce maps, so I put a CPU in your RAM so you can hash while you map.

    --
    Every normal man must be tempted, at times, to spit on his hands, hoist the black flag, and begin slitting throats. -HLM
  19. Humm, IBM did it first. by JimCanuck · · Score: 2, Interesting

    IBM has sold CPUs with DRAM onboard for quite a while. IBM developed it, patented it, and sells it as "eDRAM" aka "embedded DRAM".

    I guess IBM's POWER7 processor family powering such things like, Sony's PlayStation 2, Sony's PlayStation Portable, Nintendo's GameCube, Nintendo's Wii, and Microsoft's Xbox 360. All have eDRAM.

    Maybe news articles should be checked to see if they are really news or not before posting?

    1. Re:Humm, IBM did it first. by vlm · · Score: 1

      In the field, a microcontroller is just a processor with onboard usually static memory, so this thing is pretty close to a microcontroller.
      Anyone know of any other microcontroller type chipsets that use dynamic ram?

      --
      "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
    2. Re:Humm, IBM did it first. by Anonymous Coward · · Score: 0

      No, none of these game consoles have eDRAM in it. It is only used on Power7 and Z Series machines made in 45nm SOI technology (I know because I work at IBM Hopewell Junction where these were initially manufactured).

    3. Re:Humm, IBM did it first. by Anonymous Coward · · Score: 0

      No, none of these game consoles have eDRAM in it. It is only used on Power7 and Z Series machines made in 45nm SOI technology (I know because I work at IBM Hopewell Junction where these were initially manufactured).

      I think the other guy overstated the number of consoles which have eDRAM, but the Xbox360 definitely does have it:

      http://msdn.microsoft.com/en-us/library/bb464139.aspx

    4. Re:Humm, IBM did it first. by JimCanuck · · Score: 1

      You're right, my mistake, however IBM did put DRAM onto the Xbox as AC says.

  20. Caches by unixisc · · Score: 2

    Normally, in any CPU, you have 1, 2 or even 3 levels of cache - level one being the fastest accessed from the CPU, and higher numbers involving more latency. The whole idea being that data that is frequently accessed should be either within the CPU's register files, or within the level 1 cache. Failing that, the level 2 cache, failing that, level 3 cache or main memory. So for this CPU, the DRAM can be considered an L4 cache?

    Incidentally, is it an SoC? Does all the support circuitry - to the South Bridge, PCIx, USB, 802.11 and other peripheral interfaces - get included here? And can someone attach a few extra GB externally to give what's effectively an L5 cache?

    I can't say I like this approach - I'd prefer it if the CPU and interface logic was on 1 chip, and the memory on another.

    1. Re:Caches by cbhacking · · Score: 2

      Umm... no. You've apparently completely failed to notice the part where this CPU *has* no cache, at least certainly no L2 or L3. Instead, it talks directly to main memory (which it's embedded in, at least in a portion of, and has extremely fast access to). More accurately, any given gigabit (128MB of RAM) is the cache for one of these CPUs.

      I don't know how quickly they can communicate across the DIMM (each 2GB has 16 CPUs, so some intercommunication is critical) - maybe that's more akin to traditional memory access speed - but it's still a ludicrous amount of "cache" and eliminating the multiple levels of caching greatly simplifies the memory controller logic.

      That said, I wonder how useful a CPU core with so few transistors (and apparently a low clock speed) will be. It's certainly not going to have all the peripheral interfaces you mention - not even close.

      --
      There's no place I could be, since I've found Serenity...
    2. Re:Caches by White+Flame · · Score: 1

      The CPU does explicitly have cache: It has a 512-byte cache line for each of its 3 pointer registers, and one for its instruction cache. It can apparently read or write this width in a single cycle to the DRAMs, because of the 4096-bit bus (barring contention or who knows what else).

  21. cray3/super scalar system by Rachael · · Score: 2

    Cray were heading that way also in the '90s with their SSS system, they were just adding many 2048 cpus per block.

    http://en.wikipedia.org/wiki/Cray-3/SSS
    http://www.thefreelibrary.com/CRAY+COMPUTER+CORP.+COMPLETES+INITIAL+DEMONSTRATION+OF+THE+CRAY-3...-a016628331

  22. cpu and memory already atomic by CBravo · · Score: 1

    You model separate cpu and memory as two processors: one with only a little memory and a lot of processing power, the second with a lot of memory and no processing power (theoretically speaking).

    --
    nosig today
    1. Re:cpu and memory already atomic by Anonymous Coward · · Score: 0

      A processor with no processing power? I always thought processing power was the defining characteristic of a processor.

  23. Reinvention from 1984 by AlecC · · Score: 2

    Looks like they have reinvented the inmos Transputer, from about 1984. http://en.wikipedia.org/wiki/Transputer . They always intended to take that multicore, but never got that far. But it looks remarkably similar in intention.

    --
    Consciousness is an illusion caused by an excess of self consciousness.
    1. Re:Reinvention from 1984 by tibit · · Score: 1

      The ideas from inmos are alive and well at XMOS. I use their two core chip and I'm fairly happy -- it's plenty fast for what I use it for (industrial data collection). If only they documented the darn thing better.

      --
      A successful API design takes a mixture of software design and pedagogy.
  24. IBM already has CPU and eDRAM in their chips by Anonymous Coward · · Score: 0

    IBM already offers embedded DRAM option to go with logic to enable high density cache in microprocessors. Power7 already uses this feature. How is this new ? You can use their foundry service to use the technology.

    http://www-03.ibm.com/press/us/en/pressrelease/32970.wss

  25. How curious by vikingpower · · Score: 1

    "Venray" is a boring little town in the Netherlands. Both it and neighbouring Venlo are known for the tough crime scene. Link ?

    --
    Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
  26. Hmm. by theswimmingbird · · Score: 1

    I used to be a CPU like you until I took some DRAM to the knee.

  27. More than just embedded DRAM by sulimma · · Score: 1

    This is not just about putting DRAM and a CPU on the same chip while keeping the architecture of both unchanged.
    This is about how computer architecture is affected by the possibility of implementing both on the same chip.

    Dave Patterson noted in the nineties that the number of DRAM chips per computer went down with time. He predicted that DRAM
    would soon become large enough that at least the memory for a single process fits into one chip. At that point it is unnecessarily
    slow and power consuming to move the data to the CPU and back for every computation (or alternatively to spend 90% of the CPU chip
    area on cache to reduce the number of transports).

    When you do put CPU and DRAM on the same chip the cost functions change and different architectures become optimal.
    Patterson noted that when you have a CPU and DRAM on the same chip the relative architectural cost functions will be similar to the
    technologies of the '70s, just a few orders of magnitude smaller. Therefore he revisited architectures of that time and suggested to
    put a vector computer on a DRAM chip called the IRAM.
    http://www.cs.berkeley.edu/~pattrsn/talks/iram.html

    Vector computers do not benefit much from cache. Latency is not a big issue for vector computers but they really benefit from bandwidth.
    On chip you can connect the DRAM to the CPU with 2048 bits bus width or more. (And the latency would be much smaller than the latency
    of a CPU going through a big cache hierarchy and an external bus to the RAM)

    If more memory is needed than fits on one chip he suggested to minimize data transports between chips. Instead the register state of the
    process would be migrated to the DRAM where the desired data resides.

  28. Maybe we're thinking about this ALL wrong... by PortHaven · · Score: 1

    Let's say, instead of looking at it as a substitute for a main processor, we look at a much more distributed system.

    8GB of RAM with (what I'll call) an Inline RAM processor.

    So it doesn't do a lot of FP. That's fine, most portables and handhelds already have a GPU. GPUs love FP. Then let's add (if necessary) a simple CPU that essentially controls drive & I/O access.

    Now, I'm not saying this will replace current processors or platforms. But there might be uses. Heck, I don't know. But what if this type of RAM replaced the GPU memory on a video card. Allowing the memory itself to do some post-processing. Could we improve aliasing & refine video output even further.

    I don't know. But I think it's wrong to knock a technology that's in its infancy. There were enough naysayers to the automobile. They all turned out to be wrong. Granted, the steam powered vehicle didn't make it through history.

    But let's at least give it a chance to be planted and see what sprouts.

  29. GreenArrays does something similar by Anonymous Coward · · Score: 0

    http://www.greenarraychips.com/

    Their GA144 chip has 144 complete computers on one die, and the power requirements are extremely low. The F18A CPU's are completely asynchronous, requiring no clock. While their target market is mostly embedded systems, there's no reason why they couldn't be used elsewhere.

    1. Re:GreenArrays does something similar by tibit · · Score: 1

      It seems to be fairly useless for things other than very limited specialized tasks. 128 words of memory per core? What the heck? Even with 512 words per core on Parallax Propeller, people are fighting to get things going without having to access the shared and slow hub memory. Fitting everything in 512 words (2 kbytes) is hard, 128 words makes it more than 4 times harder. Their architecture basically implements a very simple dialect of FORTH in hardware. It's very hard to design systems of any considerable complexity using such simple CPUs without proper support from software development tools. It seems to me that this is the eventual undoing of all small-but-potentially-fast architectures: the development tools suck and do not help you in partitioning, simulating and validating your problem. XMOS seems to be going the right way about it -- they at least provide decent (high bandwidth) realtime debugging and they have a static timing verifier that's a must in real-time applications those chips are good for. I'd love to use F18 if only the damn thing had modern tools to support development on it, not something that feels like a trip back to the 70s. So far, it seems like a decent, perhaps more flexible, low-power replacement for small programmable logic, but it requires a lot of effort to get anything done on it. Their hardware, albeit fairly pretty, is useless without decent tools. Their disdain for common terms and introduction of their own terminology in their literature doesn't help. If they don't get their act together with a proper development environment, they will sink, they must understand that.

      --
      A successful API design takes a mixture of software design and pedagogy.
  30. Pipelines by Alioth · · Score: 1

    Of course you can get a pipeline in a CPU with ~22000 transistors: the original ARM had, IIRC, about 28000 transistors, and it has a pipeline. I'm guessing that this chip isn't x86. The x86 is far less economical with transistors; just the part that works out how long the next instruction is on x86 is larger than an entire ARM core. With simple fixed-length instructions and a simple ALU, you can get a chip that'll have pretty decent instruction throughput.

    I somehow doubt this chip is designed to take over from x86, in reality it's likely targeted at special purposes where being on the same die as the DRAM is important.

  31. Old idea, but better than you expect by Theovon · · Score: 3, Informative

    My research area is computer architecture.

    This idea of moving compute into the RAM has been around a long time. Papers have proposed everything from adding simple ALUs to the DRAMs to fully functional microprocessors. Most assume that these are "accelerators" for common vector operations and such, while the heavy lifting is done by beefier cores, but the idea of doing all the compute embedded in a DRAM has been proposed and evaluated before.

    One thing we've learned in the past few decades is that modern processors are limited by memory latency and bandwidth. A Sun engineer (talking about Rock) pointed out that a modern out-of-order processor performs a race between last-level cache misses. When you have to go out to DRAM, the CPU instruction window fills up with as much dependent work as possible, before it completely stalls because everything is dependent on that one miss. When that data finally arrives, the CPU blasts through that work really fast, and then soon stalls out again on another miss. OOO processors resolve this (somewhat) by the instruction window, while Rock solved it by speculative execution. One of the reasons for Sandy Bridge's excellent performance is the very large instruction window that can absorb more of the LLC miss stall time.

    And so, although these processors have other advantages, OOO processors dedicate a huge amount of logic just to dealing with the cache miss latency. If there were no such latency, then they could get the same performance with a hell of a lot less hardware. Although I haven't seen the figures, my suspicion is that for general computation, TOMI will blow the doors off of whatever else we've got in both performance AND energy efficiency. Only when you have a specialized compute kernel whose working data fits in the cache can you comparatively benefit from something like Sandy Bridge. (I realize that's an overly strong statement, because lots of general purpose workloads have good locality, but nevertheless main memory is a major bottleneck for most workloads.)

  32. Hmmmm, I may have been looking for this... by meburke · · Score: 1

    Just as I was thinking that this might be the start of a good FORTH machine, I find out that Fish used to work with Chuck Moore. What a coinkydink.

    --
    "The mind works quicker than you think!"
  33. +FPGA FTW by Doc+Ruby · · Score: 2

    left a few things out--like an FPU, branch prediction, pipelining, or any form of speculative execution [...] it's an ultra-specialized, ultra-lightweight core that trades 25 years of flexibility and performance for scads of memory bandwidth.

    Embed a fat FPGA in this chip well-interconnected to DRAM and CPU, and you get all those things. You might even replace the current chip's buses with FPGA for both data distribution and inline logic. Or make a discrete (but well-interconnected) onchip FPGA able to power down when not in use, and keep the low power consumption except when it's necessary. Turn on the FPGA for speed, or when the FPGA logic is so efficient that it's lower power than doing it in the CPU.

    For somewhat lower power consumption, and better performance in many tasks, but less flexibility, embed a DSP in the chip instead of the FPGA.

    Or both: DSP as ALU, FPGA as CLU (and flexible ALU, and beyond), on the chip with a simple processor to run the OS and main app threads. Bringing all the ports and buses to RAM all on the chip makes it all wicked fast. De/selecting these modules for power on demand (or in thread init) saves energy.

    --
    make install -not war

    1. Re:+FPGA FTW by Anonymous Coward · · Score: 1

      Embedding an FPGA has already been done, an example would be the Xilinx Zynq.

      This is a dual Cortex A9 + FPGA: http://www.xilinx.com/products/silicon-devices/epp/zynq-7000/index.htm

    2. Re:+FPGA FTW by Doc+Ruby · · Score: 2

      Yes, I'm very excited about the Zynq, but it doesn't embed the RAM onchip, which is what's interesting about this new processor. The dual Cortex A9 is far more complex than the CPU on this new chip, in part to speed execution that's slowed by offchip RAM latency.

      But indeed an FPGA and the AMBA bus into a simple chip with onboard RAM is interesting. Though for Xilinx's market for FPGA apps they'd be better with much more than 2GB onchip RAM; more like 128GB or a TB. And onboard optical Gbps ethernet, as long as I'm wishing for Xilinx.

      --
      make install -not war

  34. Some sort of coprocessor by xluap · · Score: 1

    128 processors on a DIMM... Not every program is suitable for parallel execution.
    http://en.wikipedia.org/wiki/Amdahl_law

    So this might only be useful for tasks that can be parallelized. Then it will be a parallel coprocessor.

    This might be exactly suitable to speed up things like the integer fractal program fractint, but what else can benefit?
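
    To put rough numbers on the Amdahl's law point, here is a minimal sketch. The function name and the sample parallel fractions are made up for illustration; only the 128-core figure comes from the summary.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the work can run in parallel.

    parallel_fraction: share of the runtime that parallelizes (0.0 to 1.0).
    cores: number of processors working on the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time; the parallel part is split evenly.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative fractions only; real workloads have to be measured.
    for fraction in (0.50, 0.90, 0.99):
        print(f"{fraction:.0%} parallel on 128 cores: "
              f"{amdahl_speedup(fraction, 128):.1f}x")
```

    A program that is 90% parallel can never get past 10x however many cores the DIMM carries, so the serial part decides whether 128 cores are worth having.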

  35. Old idea, but better than you expect by Anonymous Coward · · Score: 0

    F21 Microprocessor, 500 MIPs in 1997, ( ref: http://www.ultratechnology.com/f21.html ) --> .2 sq mm asynchronous microcomputer core, 60,000 Mips, ( ref: http://www.forthfreak.net/misc/25x.html ) -->
    variable length instruction word symmetric multi processing of multiple parallel processors ( VLIW SMP MPP )

  36. Same old story... by Baldrson · · Score: 2
    Ever since Illiac IV it's been the same story:

    Can't really solve the mutex problem so pretend it doesn't exist and screw the programmers by pretending to solve the main memory latency problem with CPU-local memory.

    The "innovation" over Illiac IV is to call it "multicore".

    PS: There is a solution but since I can't afford the patent fees, it's not going anywhere.

  37. no pipeline? by mattack2 · · Score: 1

    That said, when your CPU has fewer transistors than an architecture that debuted in 1986, there is a good chance that you left a few things out--like an FPU, branch prediction, pipelining

    The 6502 had a pipeline of sorts 10 years before that.

  38. Doing division with multiplication by fyngyrz · · Score: 1

    How can you do a reciprocal without a divide? Unless you had predetermined "divisors"?

    Ok, two approaches. First, if you know the divisor, but not the dividend, then in your assembly code you can write, conceptually:

    dividend * (1.0 / divisor)

    Since the right side of that contains only known values, there's only a multiply to be done at run time; the assembler can prepare the rest at assembly time.

    Second, if you don't know the divisor or the dividend, you can prepare a table, conceptually like this:


    1/1

    1/2

    1/3

    1/4
    ....

    1/N

    Then, when the time comes for division, you do it like this:

    x = y * table[z]

    It's a hair slower than just multiplication, because it includes a table lookup.

    There are some technical things involved in this, like accounting for width of the various inputs and where the binary point lands within your results, but these are really implementation details and don't change the overall idea.
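
    A rough sketch of the table approach for unsigned integers, written in Python for readability rather than assembly. The 16-bit operand width, the 32 fractional bits and all the names are assumptions chosen for illustration, not anything from the TOMI design.

```python
FRACTION_BITS = 32          # where the binary point sits in each table entry
MAX_OPERAND = 1 << 16       # assumption: operands are unsigned 16-bit values


def build_reciprocal_table(size=MAX_OPERAND):
    """Precompute ceil(2**FRACTION_BITS / z) for every divisor z below size."""
    table = [0] * size      # entry 0 is unused; division by zero stays an error
    for z in range(1, size):
        # -(-a // b) is integer ceiling division.
        table[z] = -(-(1 << FRACTION_BITS) // z)
    return table


RECIPROCALS = build_reciprocal_table()


def divide(y, z):
    """Integer y // z done as one table lookup, one multiply and one shift."""
    if z == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    return (y * RECIPROCALS[z]) >> FRACTION_BITS


if __name__ == "__main__":
    print(divide(1000, 7), 1000 // 7)
```

    Rounding the reciprocal up rather than down is the "where the binary point lands" detail: with these widths the rounding error stays too small to push the quotient over to the next integer, but narrower fraction fields would need checking.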

    I should also point out that using tables, you can pre-compute results for many arbitrary inputs, so that execution time is essentially the table lookup, sometimes with an interpolation stage. You can do sines, cosines, logs, etc. this way. If memory is cheap and fast, then the need for an FPU may simply go away. Simple example: If you have a 16 bit float, then a 64k entry table can contain the sine value for every possible floating point input. Zero compute time at run time. You want fast, there it is.
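
    The 64k-entry sine table can be sketched as below. This assumes IEEE 754 half precision for the "16 bit float" and uses Python's struct module ('e' format) to convert bit patterns; the names are hypothetical, and NaN and infinity inputs simply map to NaN.

```python
import math
import struct


def bits_to_float(bits):
    """Interpret a 16-bit pattern as an IEEE 754 half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def float_to_bits(value):
    """Round a float to half precision and return its 16-bit pattern.

    Raises OverflowError for finite values outside the half-precision range.
    """
    return struct.unpack("<H", struct.pack("<e", value))[0]


def build_sine_table():
    """One entry per possible 16-bit input: 65536 precomputed sines."""
    table = [0] * 65536
    for bits in range(65536):
        x = bits_to_float(bits)
        if math.isnan(x) or math.isinf(x):
            result = float("nan")      # no meaningful sine for these inputs
        else:
            result = math.sin(x)
        table[bits] = float_to_bits(result)
    return table


SINE_TABLE = build_sine_table()


def fast_sin(x):
    """Sine by pure table lookup; the work was all done when the table was built."""
    return bits_to_float(SINE_TABLE[float_to_bits(x)])


if __name__ == "__main__":
    print(fast_sin(1.0), math.sin(1.0))
```

    At two bytes per entry the table is 128 KB, which is the catch on a conventional CPU and the attraction when the memory sits right next to the core.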

    But as it turns out, a lot of times, the need for floating point isn't what we think it is. Let's say you want to rotate an image, so you scan through it pixel by pixel. Naively, rotation is:

    newx = (x * cos(theta)) - (y * sin(theta));

    newy = (x * sin(theta)) + (y * cos(theta));

    So, you look at that and you think, wow, two FP sines and two FP cosines and four FP multiplies and an FP add and an FP subtract PER PIXEL!

    But, work it out, and this is what you want to actually execute per pixel:


    newx = xxc - yys;
    newy = xxs + yyc;

    xxs += s;

    xxc += c;

    That's three FP adds and one FP subtract per pixel. All of the sine, cosine and multiply stuff is gone. There are also two more FP adds per line. It's hella fast, and 100% accurate.

    How one gets from the naive approach to the fast one, I leave as an exercise for the reader. Unless someone is really curious -- in which case, I'll blog it and point you to it, it's a little esoteric, even for this place.
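
    For the curious, one possible reading of that scheme is sketched below. It assumes the standard rotation convention (newx = x*cos - y*sin, newy = x*sin + y*cos) and a scan from (0, 0) in unit steps; the function names are invented for the comparison.

```python
import math


def rotate_naive(width, height, theta):
    """Rotate every pixel coordinate with trig calls and multiplies per pixel."""
    out = []
    for y in range(height):
        for x in range(width):
            out.append((x * math.cos(theta) - y * math.sin(theta),
                        x * math.sin(theta) + y * math.cos(theta)))
    return out


def rotate_incremental(width, height, theta):
    """Same coordinates using only adds and subtracts inside the loops."""
    s, c = math.sin(theta), math.cos(theta)   # computed once per image
    out = []
    yys = yyc = 0.0                 # running y*sin(theta) and y*cos(theta)
    for _ in range(height):
        xxs = xxc = 0.0             # running x*sin(theta) and x*cos(theta)
        for _ in range(width):
            out.append((xxc - yys, xxs + yyc))
            xxs += s                # stepping x by 1 adds sin and cos
            xxc += c
        yys += s                    # the two extra adds per line
        yyc += c
    return out


if __name__ == "__main__":
    a = rotate_naive(64, 48, 0.3)
    b = rotate_incremental(64, 48, 0.3)
    worst = max(max(abs(p[0] - q[0]), abs(p[1] - q[1])) for p, q in zip(a, b))
    print("largest difference:", worst)
```

    The trick is just that x*sin(theta) grows by sin(theta) each time x steps by one, so a running sum replaces the multiply. In floating point the running sums can drift by a few rounding errors over a long line, so the two versions agree to within a small tolerance rather than bit for bit.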

    --
    I've fallen off your lawn, and I can't get up.
    1. Re:Doing division with multiplication by Anonymous Coward · · Score: 0

      I should also point out that using tables, you can pre-compute results for many arbitrary inputs, so that execution time is essentially the table lookup, sometimes with an interpolation stage. You can do sines, cosines, logs, etc. this way. If memory is cheap and fast, then the need for an FPU may simply go away.

      The problem is that "cheap and fast" does not accurately describe memory today. Both terms are valid in some places, but not universally. "Cheap" applies to one part of the memory hierarchy (DRAM), "fast" applies to a different part (caches).

      Lookup tables were really great back in the era when DRAM was both cheap and fast, but that's just not true now. Which is why software lookup tables have been relegated to a niche performance optimization technique -- there aren't too many situations left where they enhance performance. Instead, they often reduce performance because caches necessarily have limited space (else they wouldn't be fast), and lookup tables tend to chew up lots of cache (especially if you need much precision).

      Fast memory is so important that today, it's often better to recompute seldom used temporary values from source operands every time the temp is used, rather than compute once, store to memory. ALU ops are cheap, cache space isn't.

    2. Re:Doing division with multiplication by fyngyrz · · Score: 1

      Dude, whoosh. Look at what the story is about here -- a CPU with lots of very fast ram -- and no cache.

      --
      I've fallen off your lawn, and I can't get up.
    3. Re:Doing division with multiplication by Anonymous Coward · · Score: 0

      Dude, whoosh. Look at what the story is about here -- a CPU with lots of very fast ram -- and no cache.

      Dude, whoosh right back. RTFA and look at the block diagram of the CPU. It has a cache, but you assumed it didn't, probably because you skimmed the summary and caught the bit about "not surrounding the processor with L2 and L3", but didn't notice the implication that it has L1. Which it does. But not very much of it, because you don't get a hell of a lot of cache in a 22K transistor CPU.

      The reason why it does have cache is that 'very fast DRAM' is an incomplete description. The integration permits a very wide bus between the DRAM and the processor, so it's got tons of bandwidth. Unfortunately, high bandwidth does not imply low latency. Some components of DRAM latency which would be present in a more conventional design are gone, but the long pole -- the time it takes to actually read or write DRAM bit cells -- is still there.

      To give you an idea, I just looked at a datasheet for a 2Gb DDR3 SDRAM IC from a major mfr., and the tRCD timing parameter even for fast speed grades is 13ns. tRCD is the time you're required to wait after sending an ACTIVATE command to open a row; after the wait, you can begin to actually read or write inside that row. Even at 500 MHz that's over 6 cycles, and tRCD is only part of the sequence needed to perform a DRAM memory access. Guess what any ultra simple core can't handle without performance going in the toilet? More than a cycle or two of memory latency.

      So, it has cache, but because it's so simple, it can't have very much of it, and therefore what it has is very precious indeed. There might be some circumstances where table lookup would be a valid technique to use with this CPU, but it would require a lot of care.
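
      The cycle count above is just latency times clock rate. A tiny sketch of the arithmetic, with a made-up helper name and the 13 ns and 500 MHz figures taken from the paragraph above:

```python
import math


def stall_cycles(latency_ns, clock_mhz):
    """Clock cycles that elapse while waiting latency_ns at clock_mhz."""
    # MHz is cycles per microsecond, so divide by 1000 to get cycles per ns.
    return latency_ns * clock_mhz / 1000.0


if __name__ == "__main__":
    cycles = stall_cycles(13, 500)
    # A core cannot resume mid-cycle, so round up for the real stall length.
    print(f"tRCD alone: {cycles} cycles, i.e. {math.ceil(cycles)} whole cycles")
```

      That covers only tRCD; the rest of the access sequence adds more cycles on top.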

    4. Re:Doing division with multiplication by fyngyrz · · Score: 1

      Ok, fair enough. It's still cache-based. I did read it wrong. Too bad. I hope someday we get ram up to CPU speed. Lots of things in the way, at least as long as we use electrons for signaling. Oh well, was nice to think about, even if wrong. :)

      --
      I've fallen off your lawn, and I can't get up.
  39. Combined CPU and DRAM by rhfish · · Score: 2

    Wow, we're on Slashdot......almost like being On The Cover of the Rolling Stone.

    Answers to various questions and comments:
    - We support the Linux toolchain; compilers, debuggers, etc., fortunate to have some of the original gcc team. Ported pieces of various kernels to TOMI Aurora to make certain we had not left anything out and to test the memory manager. Aurora was for use in a tablet type device.
    - TOMI Borealis was optimized for Big Data and unstructured data apps like MapReduce that choke at the Memory Wall. Linux could probably be ported without too much difficulty. Most massively parallel installations will use something really light weight instead.
    - Potential users said give them more integer cores instead of adding FPU. We gladly cede the FP world to Itanium.
    - For raw FP horsepower within a reasonable power budget, it's tough to beat Nvidia's GPU approach. That is probably why 3 of the top 10 supercomputers are GPU accelerated. http://www.top500.org/ GPU-type architectures will likely be the future of scientific computing. Venray is focused on Memory Wall limited areas such as Big Data.
    - From the computer architecture perspective, the distinction between Big Data and Small Data is whether the datasets will primarily fit within the onboard caches. Video compression, graphics acceleration, encryption, and much of LINPACK (http://en.wikipedia.org/wiki/LINPACK) would be classed as Small Data since most of the computing can be done without leaving the caches (high locality). Legacy architectures choke on Big Data since the datasets overflow the caches and there is much much less data reuse.
    - MapReduce is important because it is currently the most visible Big Data application thanks to Google. http://research.google.com/archive/mapreduce.html
    - Venray believes Big Data applications are the future of computing. So does McKinsey Consulting. http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation We leave it to others to accelerate MS Office and Call of Duty.
    - The future of Big Data appears to be RAM resident, not disk, not even flash. (See Fred Ho's work at IBM.) https://www.ibm.com/developerworks/mydeveloperworks/blogs/fredho66/?lang=en_us
    - re: Mitsubishi 3DRAM and other similar ventures, iRAM, Exacute, Gilgamesh, etc....they embedded DRAM into logic. Contrast with TOMI that embeds CPU cores into DRAMs.....our benefits are performance and particularly cost: http://www.edn.com/photo/294/294788-microprocessor_vs_memory_transistors_graph.jpg
    - We chose a modified RISC architecture rather than a special purpose one such as Gilgamesh in order to make programming simpler with well understood Linux tools such as gcc. Submit your gcc C, C++, or Fortran to http://www.venraytechnology.com./ Statistics are returned in standard dGen format.
    - TSV (through silicon vias) and HMC (hybrid memory cube) are valid attempts to push back the Memory Wall. Discussed in Part 1 for EDN. http://www.edn.com/article/520059-The_future_of_computers_Part_1_Multicore_and_the_Memory_Wall.php Decision may be determined by cost.
    - Would love to dispense with caches because they add transistors. 4K data and 4K instruction caches sped us up about 10x. Unlike legacy architectures, TOMI cache lines load in a single DRAM cycle.
    - Yes love Raspberry Pi. http://www.raspberrypi.org/
    - Quad-

    1. Re:Combined CPU and DRAM by Anonymous Coward · · Score: 0

      Isn't 1G/8 = 128M (not 128k per core)?

      And is this 1Gbit or 1GByte total DRAM per?

    2. Re:Combined CPU and DRAM by rhfish · · Score: 1

      1Gbit see line 4 http://www.dramexchange.com/
      128M....thx :-)
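
      For anyone else checking the numbers, a quick sketch of the arithmetic using the figures from the summary (1Gbit per IC, 8 cores per IC, 16 ICs per DIMM). It assumes binary units, so 1Gbit is taken as 2**30 bits.

```python
GBIT = 2 ** 30              # assumption: binary gigabit
BITS_PER_BYTE = 8

BITS_PER_IC = 1 * GBIT      # 1Gbit DRAM per IC (from the summary)
CORES_PER_IC = 8            # eight TOMI cores per IC (from the summary)
ICS_PER_DIMM = 16           # sixteen ICs per DIMM (from the summary)


def dimm_summary():
    """Return (cores per DIMM, bytes per DIMM, bits of DRAM per core)."""
    cores = CORES_PER_IC * ICS_PER_DIMM
    dimm_bytes = BITS_PER_IC * ICS_PER_DIMM // BITS_PER_BYTE
    bits_per_core = BITS_PER_IC // CORES_PER_IC
    return cores, dimm_bytes, bits_per_core


if __name__ == "__main__":
    cores, dimm_bytes, bits_per_core = dimm_summary()
    print(f"{cores} cores, {dimm_bytes // 2**30} GB per DIMM, "
          f"{bits_per_core // 2**20} Mbit "
          f"({bits_per_core // BITS_PER_BYTE // 2**20} MB) per core")
```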

  40. All i got to say is... by Skal+Tura · · Score: 1

    WOW! Simply awesome, amazing! :D

    Sure it's specialized, but it's very neat innovation.
    Now if they can scale it up to say 200k transistors and implement best optimizations, or make a "fore" chip managing all the different cores which puts some of the optimizations there.

    Floating point ops can be done in a GPU at immense speeds in any case ;)

  41. Hmmm. by randyleepublic · · Score: 1

    Sounds to me like exactly what is needed as a building block with which to build a self-aware AI.

    --
    Social Credit would solve everything...