Startup Combines CPU and DRAM
MojoKid writes "CPU design firm Venray Technology announced a new product design this week that it claims can deliver enormous performance benefits by combining CPU and DRAM onto a single piece of silicon. Venray's TOMI (Thread Optimized Multiprocessor) attempts to redefine the problem by building a very different type of microprocessor. The TOMI Borealis is built using the same transistor structures as conventional DRAM; the chip trades clock speed and performance for ultra-low leakage. Its design is, by necessity, extremely simple. Not counting the cache, TOMI is a 22,000 transistor design. Instead of surrounding a CPU core with L2 and L3 cache, Venray inserted a CPU core directly into a DRAM design. A TOMI Borealis IC connects eight TOMI cores to a 1Gbit DRAM with a total of 16 ICs per 2GB DIMM. This works out to a total of 128 processor cores per DIMM. That said, when your CPU has fewer transistors than an architecture that debuted in 1986, there is a good chance that you left a few things out--like an FPU, branch prediction, pipelining, or any form of speculative execution. Venray may have created a chip with power consumption an order of magnitude lower than anything ARM builds and more memory bandwidth than Intel's highest-end Xeons, but it's an ultra-specialized, ultra-lightweight core that trades 25 years of flexibility and performance for scads of memory bandwidth."
does it run GNU/Linux?
I'd love to see a beowulf cluster of these things...
Oh, wait..
Does it have to be an either-or suggestion?
I could see this being useful as an accelerator - in the same way that GPUs can accelerate vector operations. E.g. memory that can calculate a hash table index by itself. Stuffed in as a component of a larger system it could be a really clever breakthrough for incremental performance improvements.
#!/bin/csh cat $0
And how do I add more RAM to my system?
> "that trades 25 years of flexibility and performance for scads of memory bandwidth"
Right... because memory bandwidth isn't one of the greatest bottlenecks in current designs...
So you could implement some simple map reduce operations and run them directly in RAM?
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
This isn't new. The MIT Terasys platform did the same in 1995, and many have since. Nobody has yet come up with a viable programming model for such processors.
I'm expecting AMD's Fusion platform to move in the same direction (interleaved memory and shader banks), and they already have a usable MIMD model (basically OpenCL).
Really, this was inevitable, and this first implementation is just a first step. Future versions will undoubtedly include more functionality.
Current processors are ridiculously complicated. If you can knock out the entire cache with all of its logic, give the processor direct access to memory, and stick to a RISC design, you can get a very nice processor in under a million transistors.
Enjoy life! This is not a dress rehearsal.
Speaking of unconventional design, why don't we see hexagonal or triangular CPU-designs? All I have seen are the Manhattan-like designs. Are these really the best? Embedding the CPU inside a hexagonal/triangular DRAM design should be possible too. What would be the trade-offs?
They're innovating?
T-minus two days until they've been hit with 13 different patent lawsuits by companies that don't even produce anything similar.
Sorry about your luck!
Memory bottlenecks might be an issue but cache generally solves a lot of them. Binning just about every advance in processor design since the Z80 simply to speed up memory access is farcical. I'm afraid this is going to sink without trace since if you need low power you can just use ARM anyway which incidentally will have a shed load more performance.
I'm just wondering and maybe it exists already, but why not make everything on one chip? The CPU, memory, GPU, etc? Most people don't mess with the insides of their computer, and I'm guessing that it will speed up the computer as a whole. You won't even need to make it high-performance. Just do an i3 core with the associated chipset (or equivalent), maybe 4GB of RAM, some connectivity (USB 2, DVI, SATA, Wi-Fi and 1000Base-T) and you have it all. The power savings should be huge as everything internally should be low voltage. The die will be huge but we are heading that way anyway.
Am I talking bollocks?
How does this compare with embedded DRAM, which caused a lot of hype a few years ago?
I'm no techie, but I'm just wondering if this isn't more of a support chip that works the other way, if it's like a "smart cache" where the main CPU can offload something memory intensive and repetitive to keep it out of the way of the fancy thread calculations.
My first Journal Entry ever, in 8 years! http://slashdot.org/journal/365947/aphelion-scifi-fantasy-horror-poetry-webzine
there's a problem with doing designs like this. the tooling for CPUs is very very specific: 28nm, 32nm, 45nm - all those companies that do the simulations where they charge something like $USD 250,000 per week to license their tools like mentor do - have written the tools SPECIFICALLY for those geometries.
if you wander randomly outside of those geometries you are either on your own or you are into some unbelievably-high development costs.
why is this relevant?
it's because the DRAM manufacturers do *not* stick to the well-known geometries: they vary the geometry in order to get the absolute best performance because the cell layout is absolutely identical for DRAM ICs. and, because those cells _are_ identical, the verification process is much simpler than is required for a complex CPU.
in other words, this company is trying to mix-and-match two wildly different approaches. in other words, what he's doing is either incredibly expensive or is sub-optimal. which begs the question: what's it _for_?
Useless? My key question would be does it have decent speed integer multiply and perhaps even divide instructions. A whole heck of a lot can be achieved if you have, say, the basic instruction set of a 6809, but fast and wide (and it didn't even have a divide... so we built multiply-by-reciprocal macros to substitute, that works too.)
I know everyone's used to having FP right at hand, but I'm telling you, fast integer code and table tricks can cover a lot more bases than one might initially think. A lot of my high performance stuff -- which is primarily image processing and software defined radio -- is currently limited considerably more by how fast I can move data in and out of main memory than it is by actually needing FP operations. On a dual 4-core machine, I can saturate the memory bus without half trying with code that would otherwise be considerably more efficient, if it could actually get to the memory when it needs to.
Another thing... when you're coding with C, for instance, the various FP ops can just as easily be buried in a library, then who cares why or how they get done anyway, as long as they are? With lots-o-RAM, you can write whatever you need to and it'd be the same code you'd write for another platform. Just mostly faster, because for many things, FP just isn't required, or critical. Fixed point isn't very hard to build either and can cover a wide range of needs (and then there's BCD code... better than FP for accounting, for instance.)
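To make the fixed-point remark concrete, here is a minimal sketch of 16.16 fixed-point arithmetic using nothing but integer operations. The format (16 fractional bits) and every helper name are illustrative assumptions, not anything from the comment or from TOMI.

```python
# Minimal 16.16 fixed-point sketch: values are plain integers scaled by 2**16.
# FRAC_BITS and all function names are hypothetical, chosen for illustration.

FRAC_BITS = 16
ONE = 1 << FRAC_BITS  # the fixed-point representation of 1.0


def to_fixed(x: float) -> int:
    """Convert a float to fixed point (only needed at the edges of a program)."""
    return round(x * ONE)


def to_float(f: int) -> float:
    """Convert a fixed-point value back to a float for display."""
    return f / ONE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values: integer multiply, then shift back down."""
    return (a * b) >> FRAC_BITS


def fixed_div(a: int, b: int) -> int:
    """Divide two fixed-point values: pre-shift the dividend to keep precision."""
    return (a << FRAC_BITS) // b


# Addition and subtraction are ordinary integer add/subtract.
total = to_fixed(0.5) + to_fixed(0.25)
product = fixed_mul(to_fixed(1.5), to_fixed(2.25))
print(to_float(total), to_float(product))
```

Only the conversions at the edges touch floats; everything in between is integer add, multiply and shift, which is the kind of work a core without an FPU can still do quickly.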
Signed, old assembly language programmer guy who actually admits he likes asm...
I've fallen off your lawn, and I can't get up.
Er, from that link:
the term SoC is typically used with more powerful processors, ..., which need external memory chips (flash, RAM) to be useful
And it also explains why the different processes used to make memory and CPU means these are usually separated.
I heard you like to reduce maps, so I put a CPU in your RAM so you can hash while you map.
Every normal man must be tempted, at times, to spit on his hands, hoist the black flag, and begin slitting throats. -HLM
IBM has been selling CPUs with DRAM onboard for quite a while. IBM developed it, patented it, and sells it as "eDRAM", aka "embedded DRAM".
I guess IBM's POWER7 processor family, and the chips in things like Sony's PlayStation 2, Sony's PlayStation Portable, Nintendo's GameCube, Nintendo's Wii, and Microsoft's Xbox 360, all have eDRAM.
Maybe news articles should be checked to see if they are really news or not before posting?
Normally, in any CPU, you have 1, 2 or even 3 levels of cache - level one being the fastest accessed from the CPU, and higher numbers involving more latency. The whole idea being that data that is frequently accessed should be either within the CPU's register files, or within the level 1 cache. Failing that, the level 2 cache, failing that, level 3 cache or main memory. So for this CPU, the DRAM can be considered an L4 cache?
Incidentally, is it an SoC? Does all the support circuitry - to the South Bridge, PCIx, USB, 802.11 and other peripheral interfaces - get included here? And can someone attach a few extra GB externally to give what's effectively an L5 cache?
I can't say I like this approach - I'd prefer it if the CPU and interface logic was on 1 chip, and the memory on another.
Cray was heading that way too in the '90s with their SSS system; they were just adding many CPUs, 2048 per block.
http://en.wikipedia.org/wiki/Cray-3/SSS
http://www.thefreelibrary.com/CRAY+COMPUTER+CORP.+COMPLETES+INITIAL+DEMONSTRATION+OF+THE+CRAY-3...-a016628331
You model separate CPU and memory as two processors: one with only a little memory and a lot of processing power, the second with a lot of memory and no processing power (theoretically speaking).
nosig today
Looks like they have reinvented the Inmos Transputer, from about 1984. http://en.wikipedia.org/wiki/Transputer . They always intended to take that multicore, but never got that far. But it looks remarkably similar in intention.
Consciousness is an illusion caused by an excess of self consciousness.
IBM already offers an embedded DRAM option to go with logic, enabling high-density cache in microprocessors. POWER7 already uses this feature. How is this new? You can use their foundry service to use the technology.
http://www-03.ibm.com/press/us/en/pressrelease/32970.wss
"Venray" is a boring little town in the Netherlands. Both it and neighbouring Venlo are known for the tough crime scene. Link ?
Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
I used to be a CPU like you until I took some DRAM to the knee.
This is not just about putting DRAM and a CPU on the same chip while keeping the architecture of both unchanged.
This is about how computer architecture is affected by the possibility of implementing both on the same chip.
Dave Patterson noted in the nineties that the number of DRAM chips per computer went down with time. He predicted that DRAM would soon become large enough that at least the memory for a single process would fit into one chip. At that point it is unnecessarily slow and power-consuming to move the data to the CPU and back for every computation (or, alternatively, to spend 90% of the CPU chip area on cache to reduce the number of transports).
When you do put CPU and DRAM on the same chip the cost functions change and different architectures become optimal.
Patterson noted that when you have a CPU and DRAM on the same chip, the relative architectural cost functions will be similar to the technologies of the '70s, just a few orders of magnitude smaller. He therefore revisited architectures of that time and suggested putting a vector computer on a DRAM chip, called the IRAM.
http://www.cs.berkeley.edu/~pattrsn/talks/iram.html
Vector computers do not benefit much from cache. Latency is not a big issue for vector computers but they really benefit from bandwidth.
On chip you can connect the DRAM to the CPU with a bus width of 2048 bits or more. (And the latency would be much smaller than that of a CPU going through a big cache hierarchy and an external bus to the RAM.)
If more memory is needed than fits on one chip, he suggested minimizing data transports between chips. Instead, the register state of the process would be migrated to the DRAM where the desired data resides.
Let's say, instead of looking at it as a substitute for a main processor, we look at a much more distributed system.
8GB of RAM with (what I'll call) an Inline RAM processor.
So it doesn't do a lot of FP. That's fine, most portables and handhelds already have a GPU. GPUs love FP. Then let's add (if necessary) a simple CPU that essentially controls drive & I/O access.
Now, I'm not saying this will replace current processors or platforms. But there might be uses. Heck, I don't know. But what if this type of RAM replaced the GPU memory on a video card, allowing the memory itself to do some post-processing? Could we improve anti-aliasing & refine video output even further?
I don't know. But I think it's wrong to knock a technology that's in its infancy. There were enough naysayers to the automobile. They all turned out to be wrong. Granted, the steam powered vehicle didn't make it through history.
But let's at least give it a chance to be planted and see what sprouts.
http://www.greenarraychips.com/
Their GA144 chip has 144 complete computers on one die, and the power requirements are extremely low. The F18A CPUs are completely asynchronous, requiring no clock. While their target market is mostly embedded systems, there's no reason why they couldn't be used elsewhere.
Of course you can get a pipeline in a CPU with ~22000 transistors, the original ARM had IIRC about ~28000 transistors, and has a pipeline. I'm guessing that this chip isn't x86. The x86 is far less economical with transistors, just the part that works out how long the next instruction is for x86 is larger than an entire ARM core. With simple fixed length instructions, and with a simple ALU you can get a chip that'll have pretty decent instruction throughput.
I somehow doubt this chip is designed to take over from x86, in reality it's likely targeted at special purposes where being on the same die as the DRAM is important.
Oolite: Elite-like game. For Mac, Linux and Windows
My research area is computer architecture.
This idea of moving compute into the RAM has been around a long time. Papers have proposed everything from adding simple ALUs to the DRAMs to fully functional microprocessors. Most assume that these are "accelerators" for common vector operations and such, while the heavy lifting is done by beefier cores, but the idea of doing all the compute embedded in a DRAM has been proposed and evaluated before.
One thing we've learned in the past few decades is that modern processors are limited by memory latency and bandwidth. A Sun engineer (talking about Rock) pointed out that a modern out-of-order processor performs a race between last-level cache misses. When you have to go out to DRAM, the CPU instruction window fills up with as much dependent work as possible, before it completely stalls because everything is dependent on that one miss. When that data finally arrives, the CPU blasts through that work really fast, and then soon stalls out again on another miss. OOO processors resolve this (somewhat) by the instruction window, while Rock solved it by speculative execution. One of the reasons for Sandy Bridge's excellent performance is the very large instruction window that can absorb more of the LLC miss stall time.
And so, although these processors have other advantages, OOO processors dedicate a huge amount of logic just to dealing with the cache miss latency. If there were no such latency, then they could get the same performance with a hell of a lot less hardware. Although I haven't seen the figures, my suspicion is that for general computation, TOMI will blow the doors off of whatever else we've got in both performance AND energy efficiency. Only when you have a specialized compute kernel whose working data fits in the cache can you comparatively benefit from something like Sandy Bridge. (I realize that's an overly strong statement, because lots of general purpose workloads have good locality, but nevertheless main memory is a major bottleneck for most workloads.)
Just as I was thinking that this might be the start of a good FORTH machine, I find out that Fish used to work with Chuck Moore. What a coinkydink.
"The mind works quicker than you think!"
Embed a fat FPGA in this chip well-interconnected to DRAM and CPU, and you get all those things. You might even replace the current chip's buses with FPGA for both data distribution and inline logic. Or make a discrete (but well-interconnected) onchip FPGA able to power down when not in use, and keep the low power consumption except when it's necessary. Turn on the FPGA for speed, or when the FPGA logic is so efficient that it's lower power than doing it in the CPU.
For somewhat lower power consumption, and better performance in many tasks, but less flexibility, embed a DSP in the chip instead of the FPGA.
Or both: DSP as ALU, FPGA as CLU (and flexible ALU, and beyond), on the chip with a simple processor to run the OS and main app threads. Bringing all the ports and buses to RAM all on the chip makes it all wicked fast. De/selecting these modules for power on demand (or in thread init) saves energy.
--
make install -not war
128 processors on a DIMM... Not every program is suitable for parallel execution.
http://en.wikipedia.org/wiki/Amdahl_law
So this might only be useful for tasks that can be parallelized. Then it will be a parallel coprocessor.
This might be exactly suitable to speed up things like the integer fractal program fractint, but what else can benefit?
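For a feel of how hard Amdahl's law bites at 128 cores, here is a small sketch of the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the core count. The function name and the sample fractions are illustrative assumptions, not measurements of TOMI.

```python
# Amdahl's law: the serial fraction of a program caps the achievable speedup.
# The sample parallel fractions below are made up for illustration.


def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the work scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


for p in (0.5, 0.9, 0.99, 1.0):
    print(f"parallel fraction {p:.2f}: {amdahl_speedup(p, 128):.1f}x on 128 cores")
```

Under this model a program that is 90% parallel tops out at a little over 9x on a 128-core DIMM, so the cores only pay off for workloads that are almost entirely parallel.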
F21 Microprocessor, 500 MIPs in 1997, ( ref: http://www.ultratechnology.com/f21.html ) --> .2 sq mm asynchronous microcomputer core, 60,000 Mips, ( ref: http://www.forthfreak.net/misc/25x.html ) -->
variable length instruction word symmetric multi processing of multiple parallel processors ( VLIW SMP MPP )
Can't really solve the mutex problem so pretend it doesn't exist and screw the programmers by pretending to solve the main memory latency problem with CPU-local memory.
The "innovation" over Illiac IV is to call it "multicore".
PS: There is a solution but since I can't afford the patent fees, it's not going anywhere.
Seastead this.
The 6502 had a pipeline of sorts 10 years before that.
Ok, two approaches. First, if you know the divisor, but not the dividend, then in your assembly code you can write, conceptually:
dividend * (1.0 / divisor)
Since the right side of that contains only known values, there's only a multiply to be done at run time; the assembler can prepare the rest at assembly time.
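On a core without an FPU, the "1.0 / divisor" constant would itself be stored in fixed point. The sketch below shows one common way to do that: precompute a scaled reciprocal once, then divide with a multiply and a shift. SHIFT, the rounding choice, and the function names are assumptions for illustration. With a 32-bit scale and the reciprocal rounded up, the result matches true integer division for unsigned 16-bit dividends and divisors.

```python
# Division by a known constant using a precomputed fixed-point reciprocal.
# SHIFT and the helper names are illustrative, not from the original comment.

SHIFT = 32  # number of fractional bits in the stored reciprocal


def make_reciprocal(divisor: int) -> int:
    """'Assembly time' step: ceil(2**SHIFT / divisor), computed once."""
    return -(-(1 << SHIFT) // divisor)


def divide_by_constant(dividend: int, reciprocal: int) -> int:
    """Run-time step: one multiply and one shift, no divide instruction."""
    return (dividend * reciprocal) >> SHIFT


RECIP_10 = make_reciprocal(10)
print(divide_by_constant(12345, RECIP_10))  # prints 1234
```

Rounding the reciprocal up rather than down is what keeps the truncated product from landing one below the true quotient.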
Second, if you don't know the divisor or the dividend, you can prepare a table, conceptually like this:
1/1
1/2
1/3
1/4
....
1/N
Then, when the time comes for division, you do it like this:
x = y * table[z]
It's a hair slower than just multiplication, because it includes a table lookup.
There are some technical things involved in this, like accounting for width of the various inputs and where the binary point lands within your results, but these are really implementation details and don't change the overall idea.
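The table version can be sketched the same way. This assumes integer inputs, a 32-bit fixed-point scale, and a hypothetical table size of 1024 entries; the "where the binary point lands" detail shows up as the final shift.

```python
# Division via a table of fixed-point reciprocals: x = y * table[z].
# SHIFT, MAX_DIVISOR and the function name are illustrative assumptions.

SHIFT = 32          # fractional bits in each table entry
MAX_DIVISOR = 1024  # N, the largest divisor the table covers

# RECIP_TABLE[z] holds ceil(2**SHIFT / z); index 0 is unused (1/0 is undefined).
RECIP_TABLE = [0] + [-(-(1 << SHIFT) // z) for z in range(1, MAX_DIVISOR + 1)]


def table_divide(y: int, z: int) -> int:
    """Look up the reciprocal, multiply, then shift out the fractional bits."""
    if not 1 <= z <= MAX_DIVISOR:
        raise ValueError("divisor outside the range covered by the table")
    return (y * RECIP_TABLE[z]) >> SHIFT


print(table_divide(1000, 7))  # prints 142
```

With these widths the result equals y // z for any unsigned 16-bit y; wider inputs would need more fractional bits in the table.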
I should also point out that using tables, you can pre-compute results for many arbitrary inputs, so that execution time is essentially the table lookup, sometimes with an interpolation stage. You can do sines, cosines, logs, etc. this way. If memory is cheap and fast, then the need for an FPU may simply go away. Simple example: If you have a 16 bit float, then a 64k entry table can contain the sine value for every possible floating point input. Zero compute time at run time. You want fast, there it is.
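The 16-bit float example can be tried directly, since Python's struct module can encode and decode IEEE half precision with the "e" format. The sketch below is only an illustration: it stores each result as a Python float, where a real table would presumably hold 16-bit outputs, and the helper names are made up.

```python
import math
import struct

# A 64k-entry sine table indexed by the raw bit pattern of a 16-bit float.
# Helper names are hypothetical; results are stored as Python floats for clarity.


def half_bits_to_float(bits: int) -> float:
    """Decode a 16-bit pattern as an IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def float_to_half_bits(x: float) -> int:
    """Round x to half precision and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def _sin_or_nan(x: float) -> float:
    """sin(x) for finite inputs; NaN for the infinity and NaN patterns."""
    return math.sin(x) if math.isfinite(x) else math.nan


# Built once, up front: one entry for every possible 16-bit input.
SINE_TABLE = [_sin_or_nan(half_bits_to_float(b)) for b in range(1 << 16)]


def table_sin(x: float) -> float:
    """Run-time sine: a single table lookup, no trig evaluation."""
    return SINE_TABLE[float_to_half_bits(x)]


print(table_sin(0.5), math.sin(0.5))
```

The lookup is only as precise as the 16-bit input allows: an argument like 0.1 is first rounded to the nearest half-precision value, and the table returns the sine of that rounded value.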
But as it turns out, a lot of times, the need for floating point isn't what we think it is. Let's say you want to rotate an image, so you scan through it pixel by pixel. Naively, rotation is:
newx = (x * cos(theta)) - (y * sin(theta));
newy = (x * sin(theta)) + (y * cos(theta));
So, you look at that and you think, wow, two FP sines and two FP cosines and four FP multiplies and an FP add and an FP subtract PER PIXEL!
But, work it out, and this is what you want to actually execute per pixel:
newx = xxc - yys;
newy = xxs + yyc;
xxs += s;
xxc += c;
That's three FP adds and one FP subtract per pixel. All of the sine, cosine and multiply stuff is gone. There are also two more FP adds per line. It's hella fast, and 100% accurate.
How one gets from the naive approach to the fast one, I leave as an exercise for the reader. Unless someone is really curious -- in which case, I'll blog it and point you to it, it's a little esoteric, even for this place.
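For the curious, one way to get there: compute s = sin(theta) and c = cos(theta) once, then notice that x*s and x*c each grow by a constant as x steps by one pixel, so the multiplies turn into running sums. The sketch below uses the usual rotation convention (cosine on the x term of newx) and checks the incremental loop against the naive one. The function names and image size are illustrative assumptions.

```python
import math

# Naive versus incremental rotation of every pixel coordinate in an image.
# Variable names xxs, xxc, yys, yyc follow the comment; the rest is illustrative.


def rotate_naive(width: int, height: int, theta: float) -> list:
    """Trig calls and multiplies for every pixel."""
    out = []
    for y in range(height):
        for x in range(width):
            out.append((x * math.cos(theta) - y * math.sin(theta),
                        x * math.sin(theta) + y * math.cos(theta)))
    return out


def rotate_incremental(width: int, height: int, theta: float) -> list:
    """Same coordinates using only adds and subtracts inside the loops."""
    s, c = math.sin(theta), math.cos(theta)  # computed once per image
    out = []
    yys = yyc = 0.0                # running y*s and y*c
    for _ in range(height):
        xxs = xxc = 0.0            # running x*s and x*c, reset on each line
        for _ in range(width):
            out.append((xxc - yys, xxs + yyc))
            xxs += s               # x advances by 1, so x*s advances by s
            xxc += c
        yys += s                   # the two extra adds per line
        yyc += c
    return out


naive = rotate_naive(64, 48, 0.3)
fast = rotate_incremental(64, 48, 0.3)
print(max(abs(a - b) for p, q in zip(naive, fast) for a, b in zip(p, q)))
```

With floats the two versions agree to within accumulated rounding error. If s and c are held as fixed-point integers instead, the running sums are exact integer additions.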
I've fallen off your lawn, and I can't get up.
Wow, we're on Slashdot......almost like being On The Cover of the Rolling Stone.
Answers to various questions and comments:
- We support the Linux toolchain; compilers, debuggers, etc., fortunate to have some of the original gcc team. Ported pieces of various kernels to TOMI Aurora to make certain we had not left anything out and to test the memory manager. Aurora was for use in a tablet type device.
- TOMI Borealis was optimized for Big Data and unstructured data apps like MapReduce that choke at the Memory Wall. Linux could probably be ported without too much difficulty. Most massively parallel installations will use something really light weight instead.
- Potential users said give them more integer cores instead of adding FPU. We gladly cede the FP world to Itanium.
- For raw FP horsepower within a reasonable power budget, it's tough to beat Nvidia's GPU approach. That is probably why 3 of the top 10 supercomputers are GPU accelerated. http://www.top500.org/ GPU-type architectures will likely be the future of scientific computing. Venray is focused on Memory Wall limited areas such as Big Data.
- From the computer architecture perspective, the distinction between Big Data and Small Data is whether the datasets will primarily fit within the onboard caches. Video compression, graphics acceleration, encryption, and much of LINPACK (http://en.wikipedia.org/wiki/LINPACK) would be classed as Small Data since most of the computing can be done without leaving the caches (high locality). Legacy architectures choke on Big Data since the datasets overflow the caches and there is much, much less data reuse.
- MapReduce is important because it is currently the most visible Big Data application thanks to Google. http://research.google.com/archive/mapreduce.html
- Venray believes Big Data applications are the future of computing. So does McKinsey Consulting. http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation We leave it to others to accelerate MS Office and Call of Duty.
- The future of Big Data appears to be RAM resident, not disk, not even flash. (See Fred Ho's work at IBM.) https://www.ibm.com/developerworks/mydeveloperworks/blogs/fredho66/?lang=en_us
- re: Mitsubishi 3DRAM and other similar ventures, iRAM, Exacute, Gilgamesh, etc....they embedded DRAM into logic. Contrast with TOMI that embeds CPU cores into DRAMs.....our benefits are performance and particularly cost: http://www.edn.com/photo/294/294788-microprocessor_vs_memory_transistors_graph.jpg
- We chose a modified RISC architecture rather than a special purpose one such as Gilgamesh in order to make programming simpler with well understood Linux tools such as gcc. Submit your gcc C, C++, or Fortran to http://www.venraytechnology.com./ Statistics are returned in standard dGen format.
- TSV (through silicon vias) and HMC (hybrid memory cube) are valid attempts to push back the Memory Wall. Discussed in Part 1 for EDN. http://www.edn.com/article/520059-The_future_of_computers_Part_1_Multicore_and_the_Memory_Wall.php Decision may be determined by cost.
- Would love to dispense with caches because they add transistors. 4K data and 4K instruction caches sped us up about 10x. Unlike legacy architectures, TOMI cache lines load in a single DRAM cycle.
- Yes love Raspberry Pi. http://www.raspberrypi.org/
- Quad-
WOW! Simply awesome, amazing! :D
Sure it's specialized, but it's very neat innovation.
Now if they can scale it up to, say, 200k transistors and implement the best optimizations, or make a "fore" chip that manages all the different cores and puts some of the optimizations there.
Floating point ops can be done in a GPU at immense speeds in any case ;)
Pulsed Media Seedboxes
Sounds to me like exactly what is needed as a building block with which to build a self-aware AI.
Social Credit would solve everything...