19-Year-Old's Supercomputer Chip Startup Gets DARPA Contract, Funding
An anonymous reader writes: 19-year-old Thomas Sohmers, who launched his own supercomputer chip startup back in March, has won a DARPA contract and funding for his company. Rex Computing is currently finishing up the architecture of its final verified RTL, which is expected to be completed by the end of the year. The new Neo chips will be sampled next year, before moving into full production in mid-2017. The Platform reports: "In addition to the young company’s first round of financing, Rex Computing has also secured close to $100,000 in DARPA funds. The full description can be found midway down this DARPA document under 'Programming New Computers,' and has, according to Sohmers, been instrumental as they start down the verification and early tape out process for the Neo chips. The funding is designed to target the automatic scratch pad memory tools, which, according to Sohmers is the 'difficult part and where this approach might succeed where others have failed is the static compilation analysis technology at runtime.'"
mean it.
Not sure what's more impressive, the fact that a 19-year-old is able to get DARPA funding or the fact that a 19-year-old (and his team, presumably) is about to go into mass production with a fairly fancy looking custom microprocessor on a 28nm fab process.
That doesn't go very far in the microprocessor world. I worked for Cisco back in the early 00's, and even back then tape out costs were approaching $1M for a 5 layer mask. Today, with sub-wavelength masks and chips using 12+ layers, it must be tremendously expensive to spin a chip.
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
We actually have very good reasons to say why this is a very different kind of VLIW, and have found the reason why other VLIW chips have had such static scheduling issues. Hope we can convince you and everyone else soon enough.
Uhm, it ranges, but I'd say I can get a snickers bar for around a buck in most vending machines. And there are also plenty of people smarter than me, even in this very small niche that I am in.
When I was 19, my main achievement was building a bong out of a milk jug.
You are welcome on my lawn.
The biggest thing is what we have tried to emphasize, which is the fact that we have an entirely different memory system that does away with the hardware managed cache hierarchy. The rest of the really interesting stuff we have not publicly disclosed (yet), but I can tell you that it is very different from both Kalray and Tilera.
An ENTIRE sleepless night? Wow. Sounds TOUGH. —said no MIT grad ever.
Please explain to me simply how you get 10x in compute efficiency over GPUs--these chips are already fairly optimal at general purpose flops per watt because they run at low voltage and fill up the die with arithmetic.
GPUs have excellent memory bandwidth to their video RAM (GDDR*), but they have poor IO latency & bandwidth (PCIe limited), which is the main reason they don't scale well.
We've heard the VLIW "we just need better compilers" line several times before.
Thus far this sounds like a truly excellent high school science fair project, or a slightly above average college engineering project. It is miles away from passing an industrial smell test.
Cue this old joke...
...
- How many hardware engineers does it take to change a light bulb?
- None, we'll fix it in software.
Doing stuff in software to make hardware easier has been tried before (and before this kid was born, perhaps why he thinks this is new). It failed. Transputer, i960, i432, Itanium, MTA, Cell, a slew of others I don't remember...
As for the grid, nice, but not exactly new. Tilera, Adapteva, KalRay,
Prefetching in the general case is non-computable, but a lot of accesses are predictable. If the stack is in the scratchpad, then you're really only looking at heap accesses and globals for prefetching. Globals are easy to statically hint and heap variables are accessed by pointers that are reachable. It's fairly easy for each function that you might call to emit a prefetch version that doesn't do any calculation and just loads the data, then insert a call to that earlier. You don't have to get it right all of the time, you just have to get it right often enough that it's a benefit.
For prefetching vs eviction, it's a question of window size. Even with no prefetching, most programs exhibit a lot of locality of reference and so caches work pretty well without prefetching - it doesn't matter that you take a miss on the first access, because you hit on the next few dozen (and in a multithreaded chip, you just let another thread run while you wait), but if you're evicting data too early then it's a problem. A combination of LRU / LFU works well, though all of the good algorithms in this space are patented. Although issuing prefetch hints is fairly easy, the reason that most compilers don't is that there's a good chance of accidentally pushing something else out of the cache. That said, if they're targeting HPC workloads, then just running them in a trace and then using that for hinting would probably be enough for a lot of things.
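To make the locality and early-eviction points concrete, here is a minimal sketch of a plain LRU cache run over toy access traces. The traces, the capacities, and the function name `lru_hit_rate` are made up for illustration; this is not a model of any real chip or of the patented replacement policies mentioned above.

```python
from collections import OrderedDict

def lru_hit_rate(accesses, capacity):
    """Fraction of accesses that hit in a fully associative LRU cache."""
    cache = OrderedDict()  # keys are cache-line ids, ordered oldest first
    hits = 0
    for line in accesses:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return hits / len(accesses)

# Strong locality: each of 8 lines is touched 16 times in a row.
# One miss per line, then hits, even though only 4 lines fit.
local_trace = [line for line in range(8) for _ in range(16)]
print(lru_hit_rate(local_trace, capacity=4))   # 0.9375

# Evicting too early: cycling over 5 lines with room for 4 means LRU
# always throws out the line that is needed next.
cyclic_trace = list(range(5)) * 10
print(lru_hit_rate(cyclic_trace, capacity=4))  # 0.0
print(lru_hit_rate(cyclic_trace, capacity=5))  # 0.9
```

The first trace shows why missing on the first access barely matters when the next accesses hit; the cyclic trace shows how a window that is one line too small turns every access into a miss.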
I heard a nice anecdote from some friends at Apple a while ago. They found that one of their core frameworks was getting a significant slowdown on their newer chip. The eventual cause was quite surprising. In the old version, they had a branch being mispredicted, and a load speculatively executed. The correct branch target was identified quite early, so they only had a few cancelled instructions in the pipeline. About a hundred cycles later, they hit the same instruction and this time ran it correctly. With the new CPU, the initial branch was correctly predicted. This time, when they hit the load for real, it hadn't been speculatively executed and so they had to wait for a cache miss.
Also, if you're trying to create a parallel system with manual caches... good luck. Cache coherency is a pain to get right, but it's then fundamental to most modern parallel software. Implementing the shootdowns in software is going to give you a programming model that's horrible.
And finally there's the problem that doing it in software makes it serial. The main reason that we use hardware page-table walkers in modern CPUs is not that they're much better than a software TLB fill, it's that it's much easier to make them run completely asynchronously with the main pipeline. The same applies to caches.
I am TheRaven on Soylent News
One of the things that doesn't seem to be getting through in most of the media articles is how our memory system is actually set up. I'll try to describe it briefly here, starting from the single core.
:)
At a single core, we have a 128KB multibanked scratchpad memory, which you can think of as just like an L1 cache but smaller and lower latency. We have one cycle latency for a load/store from your registers to or from the scratchpad, and the same latency from the scratchpad to/from the core's router. That scratchpad is physically addressed, and does not have a bunch of extra (and in our opinion, wasted) logic to handle address translations, which just takes up a lot of area and power, especially once you multiply it over hundreds of cores and large SRAMs. (Most people think the TLB logic is a fixed size for any size SRAM, but it is not, and it gets significantly worse if you add coherency.) Remember, even if you have an L1 cache hit on an Intel chip (typically 16 to 32KB, tops), it still takes 4 whole cycles.
Once we get to having a 16x16 grid (256 cores) as part of our Network on Chip, we have a total of 32MB of on-chip, 1 cycle latency scratchpad. We have arranged that as a global flat address space, with all of the addresses being physically mapped. What I mean by this is that Core 0's scratchpad is the first 128K of the address space, and the address space continues on seamlessly to core 1, core 2, and all the way to core 255. If the address requested by a core is not in its own scratchpad's range, it goes to the router and hops on the NoC until it gets there... with a one cycle latency per hop. We have 32GB/s in each cardinal direction per router, giving a total on-chip bandwidth of 8TB/s. Since it is all statically routed (which is a *very* important part of our entire design, which I am not revealing the full implications of just yet), we have guaranteed 1 cycle per hop latency between each router on the NoC. So even if you are going from one corner to another (core 0 to core 255) it is still a max latency of 32 cycles... still less than the latency to the L3 cache on an Intel chip.
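The flat address layout described above can be sketched in a few lines. The 128KB-per-core and 16x16 figures come from the comment; the row-major core placement, the Manhattan-distance hop count, and the function names are assumptions made for illustration, since the actual routing scheme has not been disclosed.

```python
# Illustrative sketch only. Core placement and routing are ASSUMED,
# not Rex Computing's disclosed design.

SCRATCHPAD_BYTES = 128 * 1024        # per-core scratchpad (from the comment)
GRID_SIDE = 16                       # 16x16 grid (from the comment)
NUM_CORES = GRID_SIDE * GRID_SIDE    # 256 cores
TOTAL_BYTES = NUM_CORES * SCRATCHPAD_BYTES

def owner_core(addr):
    """Core whose scratchpad holds this flat physical address."""
    if not 0 <= addr < TOTAL_BYTES:
        raise ValueError("address outside the on-chip scratchpad space")
    return addr // SCRATCHPAD_BYTES

def core_xy(core):
    """Assumed row-major placement: core 0 and core 255 sit in opposite corners."""
    return core % GRID_SIDE, core // GRID_SIDE

def hops(src_core, addr):
    """Router hops from the requesting core to the owner (assumed mesh, shortest path)."""
    sx, sy = core_xy(src_core)
    dx, dy = core_xy(owner_core(addr))
    return abs(sx - dx) + abs(sy - dy)

print(TOTAL_BYTES // (1024 * 1024))      # 32 (MB of scratchpad in total)
print(owner_core(SCRATCHPAD_BYTES))      # 1  (first byte past core 0's range)
print(hops(0, TOTAL_BYTES - 1))          # 30 (core 0 to core 255)
```

Under these assumptions the corner-to-corner path is 30 router hops, which sits under the 32-cycle maximum quoted above.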
This gets to the chip-to-chip interconnect, which we have not been very public about, but I can say it is VERY high bandwidth (48GB/s in each direction, on all four sides of the chip, so an aggregate bandwidth of 384GB/s... compare that to 16GB/s of PCIe or even NVIDIA's 2018/2019 80GB/s plans with NVLINK). There are a lot of very cool things in that design, but I can't go into them publicly quite yet. We sacrifice distance and interoperability to get those numbers, but we think it is a worthy tradeoff for insane speed and efficiency. The other interesting thing that we are looking at (and haven't fully explored the tradeoffs of) is being able to extend our flat address space across multiple chips in a larger grid.
To wrap up, most of the problems you mentioned here and in other comments are not totally valid, as we are not trying to replicate the inefficient protocols implemented super inefficiently in hardware today. We want to eventually be able to provide the same user experience and convenience that hardware caching provides, but keeping it abstracted away from the user. Hopefully you can understand I can't go into full details of this, and you have every reason to be skeptical, but that does not mean we are not going to try to do it anyways.
Also, cool Apple story. Thanks
Happy to answer any other questions
1. We have already run through synthesis of a version of our core (and rough version of our chip)... There's a lot of work to be done, especially as we are in the last steps of locking down the RTL, but we are not worried about timing... we are being very conservative.
2. Already have standard cells and memory compilers. We are not amateurs.
3. We actually have solid state physics and fabrication experience, and understand the physical constraints of wire and gate delays, leakage, etc. All of those played a very large part in our architectural design, specifically so that timing closure doesn't turn into a huge clusterfuck.