Flex Logix Says It's Solved Deep Learning's DRAM Problem (ieee.org)
An anonymous reader quotes a report from IEEE Spectrum: Deep learning has a DRAM problem. Systems designed to do difficult things in real time, such as telling a cat from a kid in a car's backup camera video stream, are continuously shuttling the data that makes up the neural network's guts from memory to the processor. The problem, according to startup Flex Logix, isn't a lack of storage for that data; it's a lack of bandwidth between the processor and memory. Some systems need four or even eight DRAM chips to sling the 100s of gigabits to the processor, which adds a lot of space and consumes considerable power. Flex Logix says that the interconnect technology and tile-based architecture it developed for reconfigurable chips will lead to AI systems that need the bandwidth of only a single DRAM chip and consume one-tenth the power.
Mountain View-based Flex Logix had started to commercialize a new architecture for embedded field programmable gate arrays (eFPGAs). But after some exploration, one of the founders, Cheng C. Wang, realized the technology could speed neural networks. A neural network is made up of connections and "weights" that denote how strong those connections are. A good AI chip needs two things, explains the other founder Geoff Tate. One is a lot of circuits that do the critical "inferencing" computation, called multiply and accumulate. "But what's even harder is that you have to be very good at bringing in all these weights, so that the multipliers always have the data they need in order to do the math that's required. [Wang] realized that the technology that we have in the interconnect of our FPGA, he could adapt to make an architecture that was extremely good at loading weights rapidly and efficiently, giving high performance and low power."
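To make the "multiply and accumulate" point concrete, here is a minimal sketch in Python. It is not Flex Logix's design; the function names and the model size, weight width, and frame rate in the usage note are hypothetical, and it assumes every weight is re-read from DRAM on every inference (nothing stays in on-chip memory).

```python
def mac_layer(inputs, weights):
    """One fully connected layer: each output is a running multiply-accumulate.

    Returns the outputs and the number of weight values that had to be fetched.
    """
    outputs = []
    weight_reads = 0
    for row in weights:
        acc = 0
        for x, w in zip(inputs, row):
            acc += x * w          # the multiply-accumulate (MAC) step
            weight_reads += 1     # every MAC consumes one weight
        outputs.append(acc)
    return outputs, weight_reads


def weight_bandwidth_gbit_per_s(num_weights, bytes_per_weight, inferences_per_s):
    """Bandwidth needed if all weights are streamed from DRAM once per inference."""
    bits_per_inference = num_weights * bytes_per_weight * 8
    return bits_per_inference * inferences_per_s / 1e9


if __name__ == "__main__":
    print(mac_layer([1, 2, 3], [[1, 0, 0], [1, 1, 1]]))
    # Hypothetical workload: 50 million 8-bit weights at 30 frames per second.
    print(weight_bandwidth_gbit_per_s(50_000_000, 1, 30), "Gbit/s")
```

Under those assumptions the weight traffic scales linearly with model size and frame rate, which is why keeping the multipliers fed can be harder than building the multipliers themselves.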
designed to do difficult things in real time, such as telling a cat from a kid in a car's backup camera video stream
So now I'm really curious which one they think it's OK to back over.
"There is more worth loving than we have strength to love." - Brian Jay Stanley
It appears they need to constantly change the data going through the matrix computations. That requires significant memory bandwidth, especially if the data set is too large to fit in cache. On top of the bandwidth there is latency too. It's pointless doing a MAC in one cycle on your 5GHz (0.2ns cycle) processor if it takes 40 cycles to address a new column in your DRAM (first word on DDR4-4800 is 8ns).
The latency hasn't got any faster since DDR2-1066 CL4, which is 7.5ns. It gets much worse when you need to address a new row instead of just changing the column.
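The arithmetic behind the "40 cycles" figure can be sketched as below. This is a worst-case illustration only: it assumes every fetch is fully serialized with the compute (no prefetching, no overlapping requests), and the helper names are made up for this example.

```python
def stalled_cycles(latency_ns, cpu_ghz):
    """CPU cycles that elapse while waiting for one DRAM access."""
    return latency_ns * cpu_ghz


def mac_utilization(latency_ns, cpu_ghz, macs_per_fetch=1):
    """Fraction of cycles spent doing MACs if each fetch stalls the pipeline.

    Assumes one MAC per cycle and no overlap between fetch and compute.
    """
    stall = stalled_cycles(latency_ns, cpu_ghz)
    return macs_per_fetch / (macs_per_fetch + stall)


if __name__ == "__main__":
    # Figures quoted in the comment above: 8 ns first word, 5 GHz core.
    print(stalled_cycles(8.0, 5.0))
    print(mac_utilization(8.0, 5.0))
    # Reusing each fetched word for 8 MACs (hypothetical) improves matters.
    print(mac_utilization(8.0, 5.0, macs_per_fetch=8))
```

Real hardware hides much of this with caches and pipelined requests, so treat the utilization number as an upper bound on the damage rather than a measurement.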
Indeed.
Considering they use the phrase ... critical "inferencing" computation, called multiply and accumulate ... instead of the terms fused MAC or FMAC, I'm wondering why they aren't just using NVIDIA's Tensor Cores?
The issue is that data movement is now the bottleneck, not the actual math itself.
The architecture they propose allows high bandwidth reads from DRAM over the data set using an FPGA tile to do the flexible data routing while being tied to the pins of a single DRAM chip rather than traditional CPU read/write centralized busses that generally have higher latencies and limited bandwidth.
It's essentially a better memory controller architecture that emphasizes embarrassingly parallel data access that needs both high bandwidth and low latency but little in the way of random access.
CPUs traditionally optimize for latency but not bandwidth. GPUs optimize for bandwidth but not latency.
CPUs traditionally optimize for latency but not bandwidth.
In my experience, memory latency has essentially not improved since the days of SDRAM. As each DDR generation doubled the throughput, it also doubled the latency as measured in clock cycles (e.g. CL2.5 to CL5 to CL10), meaning the actual latency stayed the same.
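A quick way to check that claim is to convert CAS latency from clock cycles to nanoseconds. The sketch below pairs the CL values mentioned above with DDR-400, DDR2-800 and DDR3-1600; that pairing is an assumption made for illustration, not something stated in the comment.

```python
def cas_latency_ns(data_rate_mts, cas_cycles):
    """Convert CAS latency in clock cycles to nanoseconds.

    DDR transfers twice per clock, so the clock in MHz is half the data rate.
    """
    clock_mhz = data_rate_mts / 2
    return cas_cycles / clock_mhz * 1000


# (name, data rate in MT/s, CAS latency in cycles); pairings are assumed.
GRADES = [
    ("DDR-400", 400, 2.5),
    ("DDR2-800", 800, 5),
    ("DDR3-1600", 1600, 10),
]

if __name__ == "__main__":
    for name, rate, cl in GRADES:
        print(f"{name:10s} CL{cl:<4} {cas_latency_ns(rate, cl):.2f} ns")
```

With these pairings all three grades come out at about 12.5 ns: the cycle count doubles each generation, but so does the clock, so the wall-clock latency does not move.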
The CPUs themselves have gotten longer pipelines and wider SIMD units to improve throughput at the expense of latency (in clock cycles). While the clock speeds have also increased from the days of SDRAM, the design as a whole doesn't exactly seem to optimize for latency.
Escher was the first MC and Giger invented the HR department.
That I can buy right now? If not, then they haven't "solved the problem", but, "think they have a possible way of solving the problem that is yet to be proven to work".
These do not do the inferior "deep" learning. They do proper learning where the neural network is designed for the task. Of course they perform better.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
They are in the business of blinding you with bullshit.
It's perfectly true there is a problem of bandwidth between memory and the CPU, and that the current popular solution is to have a nice wide path between the memory chips and the cache. One transfer accounts for a whole bunch of bytes, and DRAM usually does four or more data transfers for each address transfer.
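As a rough illustration of why burst transfers matter, the sketch below computes how many bytes arrive per address sent. The 64-bit bus width and burst lengths of 4 and 8 are assumptions typical of commodity DDR modules, not figures from the article.

```python
def bytes_per_burst(bus_width_bits, burst_length):
    """Bytes delivered for a single column address on a burst-mode DRAM bus."""
    return bus_width_bits // 8 * burst_length


if __name__ == "__main__":
    # Assumed: a 64-bit module data bus with burst lengths of 4 and 8.
    for burst in (4, 8):
        print(f"burst length {burst}: {bytes_per_burst(64, burst)} bytes")
```

With a burst length of 8 on a 64-bit bus, one address yields 64 bytes, which matches a common CPU cache line size.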
This is based on optimising for what today's hardware does well.
As for "more memory chips means more power consumption" - power consumption depends on the number of cells being wiggled and the frequency at which you wiggle them, not the number of packages. Quiescent dissipation of a RAM chip is insignificant. And it's DRAM - so when you address some of it, you refresh a whole bunch more - which you would have to do anyway. OTOH if you design the system so accesses are highly localised, you get row-hammer problems (or complete meltdown). Spreading dissipation over multiple chips by interleaving them keeps the heat dissipation problems down.
DO NOT let these guys design your hardware.
Sent from my ASR33 using ASCII