Slashdot Mirror


Flex Logix Says It's Solved Deep Learning's DRAM Problem (ieee.org)

An anonymous reader quotes a report from IEEE Spectrum: Deep learning has a DRAM problem. Systems designed to do difficult things in real time, such as telling a cat from a kid in a car's backup camera video stream, are continuously shuttling the data that makes up the neural network's guts from memory to the processor. The problem, according to startup Flex Logix, isn't a lack of storage for that data; it's a lack of bandwidth between the processor and memory. Some systems need four or even eight DRAM chips to sling the 100s of gigabits to the processor, which adds a lot of space and consumes considerable power. Flex Logix says that the interconnect technology and tile-based architecture it developed for reconfigurable chips will lead to AI systems that need the bandwidth of only a single DRAM chip and consume one-tenth the power.

Mountain View-based Flex Logix had started to commercialize a new architecture for embedded field programmable gate arrays (eFPGAs). But after some exploration, one of the founders, Cheng C. Wang, realized the technology could speed neural networks. A neural network is made up of connections and "weights" that denote how strong those connections are. A good AI chip needs two things, explains the other founder Geoff Tate. One is a lot of circuits that do the critical "inferencing" computation, called multiply and accumulate. "But what's even harder is that you have to be very good at bringing in all these weights, so that the multipliers always have the data they need in order to do the math that's required. [Wang] realized that the technology that we have in the interconnect of our FPGA, he could adapt to make an architecture that was extremely good at loading weights rapidly and efficiently, giving high performance and low power."

40 comments

  1. Wait you don't car about what now??? by SuperKendall · · Score: 3, Funny

    designed to do difficult things in real time, such as telling a cat from a kid in a car's backup camera video stream

    So now I'm really curious which one they think it's OK to back over.

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
    1. Re:Wait you don't car about what now??? by Anonymous Coward · · Score: 0

      I would hope it isn't okay with backing over anything. Doesn't take deep learning to realize the author is a moron.

    2. Re: Wait you don't car about what now??? by Anonymous Coward · · Score: 0

      I'm ok with backing up over that Satan's spawn of a cat that won't lift a whisker to let me pass in the hallway.

    3. Re:Wait you don't car about what now??? by FilmedInNoir · · Score: 1

      As an AI myself I would back over children obviously. Cats kill vermin, whereas children consume resources and provide no tangible benefits.

      --
      Sig. Sig. Sputnik
  2. That's cool by Anonymous Coward · · Score: 0

    Seems that's where logic design is going: distribute processing logic around, closer to storage.
    Kind of like HBM with Ultrascale Virtex, or Nvidia's Tesla.

  3. Matrix Multiplication? by jasnw · · Score: 1

    From what little I remember from when neural networks made their first buzzword splash back in the 1990s I think all the buzzwords in the summary are basically saying that they need an architecture that is really fast at doing multiplication of large matrices. Yes? If so, this really is not in any way a new problem - fast matrix math has been a staple of high performance computing since day 1, and these guys are just saying (I think) they want to build a processor designed just for that purpose. Or am I missing something, blinded by the sheer wonderfulness of their choice of buzz-ness?

    1. Re:Matrix Multiplication? by viperidaenz · · Score: 4, Interesting

      It appears they need to constantly change the data going through the matrix computations. That requires significant memory bandwidth, especially if the data set is too large to fit in cache. On top of the bandwidth there is latency too. It's pointless doing a MAC in one cycle in your 5GHz (0.2ns cycle) processor if it takes 40 cycles to address a new column in your DRAM (first word on DDR4-4800 is 8ns).

      The latency hasn't got any faster since DDR2-1066 CL4, which is 7.5ns. It gets much worse when you need to address a new row instead of just changing the column.
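
      A rough sketch of the arithmetic above, in Python. The helper names are made up for illustration, and the one assumption beyond the comment's own numbers is that a DDR part's I/O clock runs at half its transfer rate.

```python
def cas_latency_ns(transfer_rate_mts: float, cas_cycles: float) -> float:
    """Time to first word in ns, assuming the I/O clock is half the MT/s rate."""
    io_clock_mhz = transfer_rate_mts / 2.0
    return cas_cycles / io_clock_mhz * 1000.0


def stall_cycles(latency_ns: float, cpu_ghz: float) -> float:
    """CPU cycles that elapse while waiting latency_ns for memory."""
    return latency_ns * cpu_ghz


# DDR2-1066 at CL4, as quoted in the comment (about 7.5 ns).
print(f"DDR2-1066 CL4: {cas_latency_ns(1066, 4):.2f} ns")

# An 8 ns first-word latency seen from a 5 GHz core.
print(f"Stall at 5 GHz: {stall_cycles(8.0, 5.0):.0f} cycles")
```

      If every MAC really had to wait on a fresh column access, the multiplier would be busy for only one cycle out of every 41.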

    2. Re:Matrix Multiplication? by UnknownSoldier · · Score: 2

      Indeed.

      Considering they use the phrase ... critical "inferencing" computation, called multiply and accumulate ... instead of the terms fused MAC or FMAC, I'm wondering why they aren't just using nVidia's Tensor Cores?

    3. Re:Matrix Multiplication? by imgod2u · · Score: 4, Interesting

      The issue is that data movement is now the bottleneck, not the actual math itself.

      The architecture they propose allows high bandwidth reads from DRAM over the data set using an FPGA tile to do the flexible data routing while being tied to the pins of a single DRAM chip rather than traditional CPU read/write centralized busses that generally have higher latencies and limited bandwidth.

      It's essentially a better memory controller architecture that emphasizes embarrassingly parallel data access that needs both high bandwidth and low latency but little in the way of random access.

      CPUs traditionally optimize for latency but not bandwidth. GPUs optimize for bandwidth but not latency.

    4. Re:Matrix Multiplication? by TeknoHog · · Score: 2

      CPUs traditionally optimize for latency but not bandwidth.

      In my experience, memory latency hasn't essentially improved since the days of SDRAM. As each DDR generation doubled the throughput, it also doubled the latency as measured in clock cycles (e.g. CL2.5 to CL5 to CL10) meaning the actual latency stayed the same.
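
      A small sketch of that claim. The CL values are the ones quoted above; pairing them with DDR-400, DDR2-800 and DDR3-1600 is an assumption made here for illustration, as is the helper name.

```python
# (label, transfer rate in MT/s, CAS latency in clock cycles)
# The speed grades are assumed for illustration; only the CL values come from the comment.
GENERATIONS = [
    ("DDR-400", 400, 2.5),
    ("DDR2-800", 800, 5),
    ("DDR3-1600", 1600, 10),
]


def first_word_ns(transfer_rate_mts: float, cas_cycles: float) -> float:
    """CAS latency in ns; the I/O clock is half the transfer rate."""
    return cas_cycles / (transfer_rate_mts / 2.0) * 1000.0


for label, rate, cl in GENERATIONS:
    print(f"{label} CL{cl}: {first_word_ns(rate, cl):.1f} ns")
```

      With these pairings the cycle count doubles each generation while the wall-clock latency comes out the same for all three.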

      The CPUs themselves have gotten longer pipelines and wider SIMD units to improve throughput at the expense of latency (in clock cycles). While the clock speeds have also increased from the days of SDRAM, the design as a whole doesn't exactly seem to optimize for latency.

      --
      Escher was the first MC and Giger invented the HR department.
    5. Re:Matrix Multiplication? by Anonymous Coward · · Score: 1

      This sounds like a cross-bar memory tech. Should fit fairly nicely for this sort of thing. Esp if you put processors on each xbar node.

    6. Re:Matrix Multiplication? by Anonymous Coward · · Score: 0

      Latency? But if you're talking about multiplying huge matrices, you know exactly which memory locations you'll need to access in what order. Shouldn't it "just" be a matter of looking ahead and requesting each element 40 cycles ahead of time?

      (Disclaimer: I know nothing about modern RAM technology)
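
      The look-ahead idea can be sketched with a toy timing model. This is an illustration only: `total_cycles`, its parameters, and the one-request-per-cycle, one-MAC-per-cycle rules are invented here and do not describe any real memory controller.

```python
def total_cycles(n: int, latency: int, depth: int) -> int:
    """Cycles to run n dependent-free MACs when each operand takes `latency`
    cycles to arrive and at most `depth` requests may be outstanding."""
    if n == 0:
        return 0
    issue = [0] * n   # cycle in which request i is issued
    done = [0] * n    # cycle in which MAC i executes
    for i in range(n):
        earliest = issue[i - 1] + 1 if i > 0 else 0
        if i >= depth:
            # A request slot frees up only after MAC i-depth has executed.
            earliest = max(earliest, done[i - depth] + 1)
        issue[i] = earliest
        arrival = earliest + latency
        after_prev = done[i - 1] + 1 if i > 0 else 0
        done[i] = max(arrival, after_prev)
    return done[-1] + 1


LATENCY = 40
print("demand load:", total_cycles(1000, LATENCY, depth=1))
print("look-ahead: ", total_cycles(1000, LATENCY, depth=LATENCY + 1))
```

      In this model, requesting far enough ahead hides the latency almost entirely, which leaves bandwidth (how many requests can be in flight and served per cycle) as the remaining limit.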

    7. Re:Matrix Multiplication? by Anonymous Coward · · Score: 0

      Build the neural network in hardware, not software. Works fabulously for real brains. Using 20W, we do all sorts of object recognition, learning and thinking.

      Microprocessors are good for doing simple tasks quickly on a large scale. Neural networks are not their strong point at all - don't go there.

    8. Re:Matrix Multiplication? by Anne+Thwacks · · Score: 2

      Neural nets were "a thing" in the 1950s.

      They are in the business of blinding you with bullshit.

      It's perfectly true there is a problem of bandwidth between memory and the CPU, and that the current popular solution is to have a nice wide path between the memory chips and the cache. One transfer accounts for a whole bunch of bytes, and DRAM usually does four or more data transfers for each address transfer.

      This is based on optimising for what today's hardware does well.

      As for "more memory chips means more power consumption" - power consumption depends on the number of cells being wiggled, and the frequency you wiggle them. Not the number of packages you wiggle. Quiescent dissipation of a RAM chip is insignificant. And it's DRAM - so when you address some of it, you refresh a whole bunch more - which you would have to do anyway. OTOH if you design the system so accesses are highly localised, you get row-hammer problems (or complete meltdown). Spreading dissipation over multiple chips by interleaving them keeps the heat dissipation problems down.

      DO NOT let these guys design your hardware.

      --
      Sent from my ASR33 using ASCII
  4. No it's about where to aim by Anonymous Coward · · Score: 0

    It's just that cars are harder to hit. You need to get them with the tire, whereas with the kid the middle of the bumper is better.

    1. Re:No it's about where to aim by viperidaenz · · Score: 1

      Cars are pretty easy to hit, much easier than hitting a cat.

  5. Re: THERE WILL BE CONSEQUENCES, NAZI FAGGOT KEN DO by Anonymous Coward · · Score: 0

    get a load of this retard

  6. The actual reality here by gweihir · · Score: 1

    Is that it is not any "bandwidth problem". It is that deep learning is actually pretty bad at solving classification problems. These are just some more people trying to get rich before the ugly truth becomes impossible to ignore.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    1. Re:The actual reality here by Blame+The+Network · · Score: 1

      Ha, how do you explain "hotdog" / "not hotdog"? Checkmate!

    2. Re:The actual reality here by Anonymous Coward · · Score: 1

      It is that deep learning is actually pretty bad at solving classification problems.

      It is faster and cheaper than humans at QC on some of the production lines where I work. So as we expand we don't have to hire more QC engineers, just better direct the ones we have at the fewer, harder tasks. Our R&D department uses it now for jump-starting some optimization problems, which previously took a lot of human tuning to keep from getting stuck at local minima. So considering that is just two examples of deep learning actually being used internally, having already outperformed and replaced older methods, who is getting rich and running, and what ugly truth is being ignored?

      Surely there must be plenty of other small companies with operations like this going on that don't make the news, in addition to things like translation software and classification software already in use that does make the news.

  7. Needs more blockchain by Anonymous Coward · · Score: 0

    Work on your buzzword density.

  8. IEEE Spectrum by sexconker · · Score: 0

    I remember when I used to get IEEE Spectrum mailed to me.

    I don't know why they started doing that or how they got my address. I don't know why they stopped. All I remember is getting it and tossing it in the trash every single time.

  9. Re:THERE WILL BE CONSEQUENCES, NAZI FAGGOT KEN DOL by sexconker · · Score: 1

    Can you please explain? You keep posting this shit. I want to know why.

  10. Re:The actual reality here/classic ant versus rock by AndrewFlagg · · Score: 1

    don't need a startup to know about deep learning and neural networks. 30 years ago we knew it was about classification. just a lot of comparison crunching... just think about how we learned as humans what a rock was, a ball, a tree... how one tree was like another tree, and so on... whereas a bush was brush or was it a tree? hmmmm.. now the plot thickens... the FPGA solution is classic as well.. use assembly on hardware to get things done. *** really fast.. ***

  11. I assume their solution goes up to 11 DRAM chips? by Anonymous Coward · · Score: 0

    I mean, the competitors go up to 8. But theirs goes to 11, it slings more bits right?

    Geez that article was painful to read.

  12. Re: THERE WILL BE CONSEQUENCES, NAZI FAGGOT KEN DO by Anonymous Coward · · Score: 0

    Yes, he's a fat virgin living in his mom's basement and so he has nothing better to do than this.

  13. Caches are the answer here. by Anonymous Coward · · Score: 0

    Weights are always needed in an order known well in advance, and are typically reused a lot of times.

    The solution is to have tiny caches inside or very near to multipliers, so that rather than saying A*B, you instead say A*next_cached_value.

    For many use cases, even caching a single value would dramatically speed things up.
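
    A minimal sketch of the single-value cache idea, counting memory reads rather than modelling real hardware. `CountingMemory` and both functions are hypothetical names; the point is only that holding one weight next to the multiplier lets it be reused across every input before the next weight is fetched.

```python
class CountingMemory:
    """Stand-in for DRAM that counts how many weight reads it serves."""

    def __init__(self, weights):
        self._weights = list(weights)
        self.reads = 0

    def load(self, index):
        self.reads += 1
        return self._weights[index]


def dot_all_naive(mem, inputs):
    """Every MAC fetches its weight from memory again."""
    return [sum(mem.load(i) * x[i] for i in range(len(x))) for x in inputs]


def dot_all_cached(mem, inputs):
    """Fetch each weight once, keep it in a local register, reuse it for all inputs."""
    acc = [0.0] * len(inputs)
    for i in range(len(inputs[0])):
        w = mem.load(i)  # the single cached value
        for k, x in enumerate(inputs):
            acc[k] += w * x[i]
    return acc


weights = [1.0, 2.0, 3.0]
inputs = [[1.0, 1.0, 1.0], [2.0, 0.0, 1.0]]

naive_mem = CountingMemory(weights)
cached_mem = CountingMemory(weights)
print(dot_all_naive(naive_mem, inputs), "reads:", naive_mem.reads)
print(dot_all_cached(cached_mem, inputs), "reads:", cached_mem.reads)
```

    Both loops compute the same dot products; the cached version cuts weight reads by a factor equal to the number of inputs sharing those weights.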

  14. Is there a product on the market I can buy? by gerald.edward.butler · · Score: 3, Insightful

    That I can buy right now? If not, then they haven't "solved the problem", but, "think they have a possible way of solving the problem that is yet to be proven to work".

    1. Re:Is there a product on the market I can buy? by religionofpeas · · Score: 2

      Flex Logix counts Boeing among its customers for its high-throughput embedded FPGA product.

      Apparently, you can buy it right now. Did you try contacting them ?

    2. Re:Is there a product on the market I can buy? by gtall · · Score: 2

      You have to understand what you are buying. Embedded FPGA means you have an SoC design in hand (say, your favorite FrankenARM SoC which you licensed all the IP for and have a foundry lined up). You now get to license their FPGA IP and put that in your SoC as well...after a suitable redesign of your SoC because what was previously off-board is now on-board. It would probably be a significant engineering effort to integrate the FPGA IP with your own designs.

  15. hw/sw complexities. by Anonymous Coward · · Score: 0

    A possible solution is Quantum Machine Learning.

    Quantum Neural Training could accelerate itself.

    All that is needed is using operators for Hadamard matrices.

    These neural weights could be slightly imprecise but acceptable.

    FPGAs could be useful for implementing these operators with matrix multipliers/adders.

    The problem is how much accuracy is required for representing a floating-point number of 1 / sqrt(2) [ better sqrt(2) / 2 ].
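
    One way to get a feel for that accuracy question is to round 1 / sqrt(2) to common floating-point widths and see how far H*H drifts from the identity. For the 2x2 Hadamard matrix with entries of magnitude h, H*H equals 2*h^2 times the identity, so the drift is |2*h^2 - 1|. This is a sketch using Python's `struct` module to do the rounding; the function names are illustrative.

```python
import math
import struct


def round_to_format(fmt: str, value: float) -> float:
    """Round value to the IEEE 754 width named by a struct format code."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def hadamard_norm_error(fmt: str) -> float:
    """|2*h^2 - 1| where h is 1/sqrt(2) rounded to the given width."""
    h = round_to_format(fmt, math.sqrt(0.5))
    return abs(2.0 * h * h - 1.0)


for name, fmt in (("float16", "<e"), ("float32", "<f"), ("float64", "<d")):
    print(f"{name}: h = {round_to_format(fmt, math.sqrt(0.5))!r}, "
          f"error = {hadamard_norm_error(fmt):.3e}")
```

    Whether a given error is acceptable depends on how many such operators are chained, which this sketch does not model.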

  16. Re:The actual reality here/classic ant versus rock by Anonymous Coward · · Score: 0

    The brains of a bird, fish, crayfish, fruit fly or even a mosquito defy so-called AI brute force solutions. Less is more is the solution.

  17. Cat Killer by dohzer · · Score: 1

    cat from a kid

    Yeah, because killing cats instead of kids is a great goal to have. How about just avoiding killing ANYTHING?!
    Oh yeah, because to do that you need to go slowly, and rich people with self-driving cars want to go fast.

  18. Re:The actual reality here/classic ant versus rock by gweihir · · Score: 2

    These do not do the inferior "deep" learning. They do proper learning where the neural network is designed for the task. Of course they perform better.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  19. Re:THERE WILL BE CONSEQUENCES, NAZI FAGGOT KEN DOL by religionofpeas · · Score: 1

    Can you please explain? You keep posting this shit. I want to know why.

    Just a mental disease. Ignore, unless you're his physician.

  20. FPGA startups by bill_mcgonigle · · Score: 1

    When your problem calls for an expensive fab that you don't have funding for, FPGA seems like the solution. Again and again.

    --
    My God, it's Full of Source!
    OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
  21. anorexic matrix nervosa by epine · · Score: 1

    It's pointless doing a MAC in one cycle in your 5 GHz (0.2 ns cycle) processor if it takes 40 cycles to address a new column in your DRAM (first word on DDR4-4800 is 8 ns).

    Good algorithms haven't been doing serialized demand load since the first CPU with sixteen whole lines of cache memory was attached to a split-transaction memory bus.

    The first documented use of a data cache was on the IBM System/360 Model 85, introduced in 1969.

    For the record, that was also one of the first microcoded CPUs.

    With proper data orchestration, matrix multiply is far more of an aggregate bandwidth problem than a latency problem.

    A pair of 64×64 matrices fit into 64 kB of L1 cache. That's a good 250,000 MACs, right there (by the simple N^3 algorithm).

    Suppose your 5 GHz core performs 4 double-precision MACs per clock cycle (40 GFLOPs).

    2^18 / (40e9 Hz) = 6.55 microseconds

    I don't regard streaming out your 32 kB answer to main memory (bypassing cache) in 6 us as straining at the latency bit.
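
    The figures above can be reproduced in a few lines. This is a sketch with illustrative variable names. Note that dividing 2^18 MACs by 40e9 treats the 40 GFLOPs figure as a MAC rate; if a MAC is counted as two FLOPs, 4 MACs per cycle at 5 GHz is 20e9 MACs per second and the time doubles. Either way the result is on the order of microseconds.

```python
N = 64                      # matrix dimension
BYTES_PER_DOUBLE = 8
CLOCK_HZ = 5e9
MACS_PER_CYCLE = 4

macs = N ** 3                                  # simple N^3 algorithm
inputs_bytes = 2 * N * N * BYTES_PER_DOUBLE    # the pair of input matrices
result_bytes = N * N * BYTES_PER_DOUBLE        # the product matrix

time_as_quoted = macs / 40e9                            # rate used in the comment
time_at_mac_rate = macs / (CLOCK_HZ * MACS_PER_CYCLE)   # 20e9 MACs per second

print(f"MACs:            {macs} (2^18 = {2 ** 18})")
print(f"input footprint: {inputs_bytes // 1024} kB")
print(f"result size:     {result_bytes // 1024} kB")
print(f"time as quoted:  {time_as_quoted * 1e6:.2f} us")
print(f"time at 20e9/s:  {time_at_mac_rate * 1e6:.2f} us")
```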

    Large, square matrices are rarely even a bandwidth problem.

    The 1xN * Nx1 case (for large N) is a bandwidth problem, however. For this case you require two 8-byte memory bus reads and one 8-byte memory bus write per MAC. Probably not gonna happen at 20 GHz (though it might get close, on your single core i7 with three memory channels, running yesterday's AVX).
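
    To put a number on the skinny case: at three 8-byte bus transfers per MAC, the required memory bandwidth scales directly with the MAC rate. The sketch below reads the "20 GHz" above as 20e9 MACs per second (5 GHz times 4 MACs per cycle), which is an interpretation, and the function names are illustrative.

```python
BYTES_PER_DOUBLE = 8


def streaming_bytes_per_mac() -> int:
    """Two 8-byte reads plus one 8-byte write for every MAC."""
    return 3 * BYTES_PER_DOUBLE


def square_bytes_per_mac(n: int) -> float:
    """Read two n-by-n matrices and write one, amortised over n^3 MACs."""
    return 3 * n * n * BYTES_PER_DOUBLE / n ** 3


def required_bandwidth(bytes_per_mac: float, macs_per_second: float) -> float:
    """Memory bandwidth in bytes per second needed to keep the MACs fed."""
    return bytes_per_mac * macs_per_second


MAC_RATE = 20e9  # assumed reading of "20 GHz"

skinny = required_bandwidth(streaming_bytes_per_mac(), MAC_RATE)
square = required_bandwidth(square_bytes_per_mac(64), MAC_RATE)
print(f"skinny: {skinny / 1e9:.0f} GB/s")
print(f"64x64:  {square / 1e9:.1f} GB/s")
```

    The gap between the two figures is the data reuse that blocking a square matrix into cache buys, and that a skinny product cannot get.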

    For skinny matrices, you need to keep your servers blade thin.

  22. nerd threshold: double FAIL by epine · · Score: 1

    The first phrase extracted from the article for the blurb on a real nerd site would have been "folded Benes network".

    Also, on a real nerd site, it would have rendered the S-with-caron properly, as well.

    Why can't we have nice things?