19-Year-Old's Supercomputer Chip Startup Gets DARPA Contract, Funding
An anonymous reader writes: 19-year-old Thomas Sohmers, who launched his own supercomputer chip startup back in March, has won a DARPA contract and funding for his company. Rex Computing is currently finishing up the architecture of its final verified RTL, which is expected to be completed by the end of the year. The new Neo chips will be sampled next year, before moving into full production in mid-2017. The Platform reports: "In addition to the young company’s first round of financing, Rex Computing has also secured close to $100,000 in DARPA funds. The full description can be found midway down this DARPA document under 'Programming New Computers,' and has, according to Sohmers, been instrumental as they start down the verification and early tape out process for the Neo chips. The funding is designed to target the automatic scratch pad memory tools, which, according to Sohmers is the 'difficult part and where this approach might succeed where others have failed is the static compilation analysis technology at runtime.'"
mean it.
Not sure what's more impressive, the fact that a 19 year old is able to get DARPA funding or the fact that a 19 year old (and his team presumably) is about to go into mass production with a fairly fancy looking custom microprocessor on a 28nm fab process.
I'd say that's a fair sign of success. Despite the sense of jealousy, nobody can think of anything bad to say
That doesn't go very far in the microprocessor world. I worked for Cisco back in the early 00's and even back then tape out costs were approaching $1M for a 5 layer mask, today with sub-wavelength masks and chips using 12+ layers it must be tremendously expensive to spin a chip.
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
We actually have very good reasons to say why this is a very different kind of VLIW, and have found the reason why other VLIW chips have had such static scheduling issues. Hope we can convince you and everyone else soon enough.
The final project of this VLSI elective course I took required each team to build three logical modules that would work together. I was responsible for the control and integration portion bringing together all the logical modules. I spent an entire sleepless night sorting out the issues. Our team was the only one that had a functioning chip (simulated) in the end. The lecturer wasn't surprised - most chips of any reasonable complexity require A LOT of painstaking (e.g. efficient routing, interference) work to get them working - often requiring certain modules to be pulled apart (or redesigned) so they integrate better with others.
Uhm, it ranges, but I'd say I can get a snickers bar for around a buck in most vending machines. And there are also plenty of people smarter than me, even in this very small niche that I am in.
When I was 19, my main achievement was building a bong out of a milk jug.
You are welcome on my lawn.
From what I've been able to read it doesn't look that different from other projects like Tilera or Kalray MPPA.
The biggest thing is what we have tried to emphasize, which is the fact that we have an entirely different memory system that does away with the hardware managed cache hierarchy. The rest of the really interesting stuff we have not publicly disclosed (yet), but I can tell you that it is very different from both Kalray and Tilera.
Parallella...
Please explain to me simply how you get 10x in compute efficiency over GPUs--these chips are already fairly optimal at general purpose flops per watt because they run at low voltage and fill up the die with arithmetic.
GPUs have excellent memory bandwidth to their video RAM (GDDR*), they have poor IO latency & bandwidth (PCIe limited) which is the main reason they don't scale well.
We've heard the VLIW "we just need better compilers" line several times before.
Thus far this sounds like a truly excellent high school science fair project, or a slightly above average college engineering project. It is miles away from passing an industrial smell test.
"Virtual Memory translation and paging are two of the worst decisions in computing history"
"Introduction of hardware managed caching is what I consider 'The beginning of the end'"
---
These comments betray a fairly child-like understanding of computer architecture.
He's young, and he displays much more talent than people twice his age. What's your problem anyway?
I'm a minority race. Save your vitriol for white people.
The biggest thing is what we have tried to emphasize, which is the fact that we have an entirely different memory system that does away with the hardware managed cache hierarchy. The rest of the really interesting stuff we have not publicly disclosed (yet), but I can tell you that it is very different from both Kalray and Tilera.
You might have answered this already but I'm not very good at reading walls-o-text, so apologies if this is a repeat: The hardware managed cache design for chips is popular for a reason - it provides a speed boost. If you remove this what kind of process do you propose to replace it with? (Unless you have a design that makes a hardware managed cache redundant. What do you do then? Have software manage the cache?)
I'm a minority race. Save your vitriol for white people.
"Why is there yogurt in this hat?" "I can explain that. It used to be milk, and well, time makes fools of us all."
I truly hope this approach pans out and advances chip design, but if it doesn't, it will be another publicly available learning tool for the next small team to learn from. It's easy to say that it won't work and that it is going down the same path as previous attempts, but they might have something that does work and is worth a lot of money. If you don't like it, don't invest. If you think it has potential then pony up your own $100k and see where this goes. Either way a group of really smart people get to do some really cool shit, and as long as they don't get burned out or jaded by the online community, they all will be able to either continue on a successful project or regroup and tackle a new one. The whole world needs as many intelligent, ambitious, dreamers as possible...no matter what their inferred promiscuity / penis size is.
I, for one, assume any 19 year old willing to risk $1.25 mil can probably also pull a sizeable dong out of his pants during a funding presentation if needed.
Godspeed You! Black Emperor.
-jeff-
Talent is not the same thing as experience. Being able to do something does not mean it is a good idea to do it. So many people have tried this approach to improving efficiency (MIT RAW, Stanford Imagine, Stanford Smart Memories) and have run into such serious problems (compilers, libraries, eco system, system-level support) that unless he has solutions for those problems, starting it again is not a smart idea.
As somebody in the VLSI field, I am happy that somebody broke out of the monopoly/duopoly of the established players. We are moving towards "single/double" vendor for everything from mobiles to laptop processors to desktop processors. Having little choice also harms progress.
The other thing which excites me is that you are going towards a completely new architecture. This is what innovation is about!
Hopefully, your success will inspire others also to take the plunge.
My Aurora : http://www.youtube.com/watch?v=o91ZsGwJYyg
FB : https://www.facebook.com/TanveersPhotography
Cue this old joke...
...
- How many hardware engineers does it take to change a light bulb?
- None, we'll fix it in software.
Doing stuff in software to make hardware easier has been tried before (and before this kid was born, perhaps why he thinks this is new). It failed. Transputer, i960, i432, Itanium, MTA, Cell, a slew of others I don't remember...
As for the grid, nice, but not exactly new. Tilera, Adapteva, KalRay,
Talent is not the same thing as experience.
I'm in agreement - experience counts for a lot when doing something new.
Being able to do something does not mean it is a good idea to do it.
I'm in agreement with this as well.
So many people have tried this approach to improving efficiency (MIT RAW, Stanford Imagine, Stanford Smart Memories) and have run into such serious problems (compilers, libraries, eco system, system-level support) that unless he has solutions for those problems, starting it again is not a smart idea.
It is highly unlikely that this will go anywhere (so, yeah - agreement again)... BUT... he is displaying a great deal of talent for his age. The lessons he learns from this failure[1] will be more valuable than the lessons learned in succeeding at a less difficult task.
As I understand it, he proposes removing the hardware cache and instead using the compiler to prefetch values from memory. He says the hardware cache logic gates add 40% overhead to every memory fetch. Whether he can actually produce a compiler that will insert the necessary memory fetch instructions at compile time in an efficient manner remains to be seen, but it is still a worthwhile endeavour for a 19 year old.
[1] Worst case scenario. He might succeed after all.
I'm a minority race. Save your vitriol for white people.
You may be underestimating the level of effort to which some people will go to get at the performance. Right now they're running nVidia cards, for pete's sake. Show a way and they will come. After all, that "full ecosystem of tools and vendors" can be ultimately achieved with an open development model.
Ezekiel 23:20
"Virtual Memory translation and paging are two of the worst decisions in computing history"
In the old days and even with current CPUs, one CPU can run multiple processes. But if CPUs were small enough and cheap enough, one program would run on multiple CPUs. Why would you need memory protection (virtual memory translation) if only a small portion of one program is running on one CPU? Answer: you don't.
So TL;DR, he could be right, but only for systems with huge number of weak/limited CPUs.
"Virtual Memory translation and paging are two of the worst decisions in computing history"
He's not completely wrong there. Paging is nice for operating systems isolating processes and for enabling swapping, but it's horrible to implement in hardware and it's not very useful for userland software. Conflating translation with protection means that the OS has to be on the fast path for any userland changes and means that the protection granule and translation granule have to be the same size. The TLB needs to be an associative structure that can return results in a single cycle, which makes it hard to scale. Larger pages help (though then you make the protection granule even larger), but the amount of physical memory that the TLB can cover has dropped with each successive generation since paging was first introduced into microprocessors.
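To put a number on the TLB coverage point, here is a rough sketch of "TLB reach" (entries times page size) as a fraction of physical memory. The entry counts and memory sizes are made-up illustrative figures, not measurements of any real chip.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def tlb_reach_bytes(entries, page_bytes):
    """Memory a fully populated TLB can map: one page per entry."""
    return entries * page_bytes

def coverage_fraction(entries, page_bytes, physical_bytes):
    """Share of physical memory reachable without a TLB miss."""
    return tlb_reach_bytes(entries, page_bytes) / physical_bytes

# Hypothetical "older" machine: 64 entries, 4 KiB pages, 16 MiB of RAM.
old = coverage_fraction(64, 4 * KIB, 16 * MIB)

# Hypothetical "newer" machine: 1536 entries, 4 KiB pages, 16 GiB of RAM.
new = coverage_fraction(1536, 4 * KIB, 16 * GIB)

# Same newer machine using 2 MiB pages: more reach, coarser protection granule.
new_large_pages = coverage_fraction(1536, 2 * MIB, 16 * GIB)

print(f"old: {old:.4%}  new: {new:.4%}  new with 2 MiB pages: {new_large_pages:.4%}")
```

With these assumed numbers the TLB grows 24x but memory grows 1024x, so the covered fraction shrinks unless the page size goes up.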
"Introduction of hardware managed caching is what I consider 'The beginning of the end'"
I don't completely agree with this, but given the amount of effort that people writing high-performance code (and compilers) have to spend understanding the hardware caching policy and working around it, I'm not completely convinced that it's a win in the HPC arena - you end up spending almost as much time fighting the cache as you would working with a hardware scratchpad. I'm still a fan of single-level stores as a programmer abstraction though.
I am TheRaven on Soylent News
Whether he can actually produce a compiler that will insert the necessary memory fetch instructions at compile time in an efficient manner remains to be seen
That's not the hard bit of the problem. Compiler-aided prefetching is fairly well understood. The problem is the eviction. Having a good policy for when data won't be referenced in the future is hard. A simple round-robin policy on cache lines works okay, but part of the reason that modern caches are complex is that they try to have more clever eviction strategies. Even then, most of the die usage by caches is the SRAM cells - the controller logic is tiny in comparison.
I am TheRaven on Soylent News
Whether he can actually produce a compiler that will insert the necessary memory fetch instructions at compile time in an efficient manner remains to be seen
That's not the hard bit of the problem. Compiler-aided prefetching is fairly well understood.
I honestly thought that was the difficult part; it's halting-problem hard, if I understand correctly. If you cannot predict whether a program will ever reach the end-state, then you cannot predict if it will ever reach *any* particular state. To know whether to prefetch something requires you to have knowledge about the program's future state.
To my knowledge prediction of program state only works if you're predicting a *very* short time in the future (say, no more than a hundred instructions). If you're limited to that then the best you can do is branch prediction or similar (only a few hundred instructions?). This is why the cache helps - if you use something then the probability is high you will use it again soon. Compilers can then take limited advantage of this by locality of variables/instructions.
The problem is the eviction. Having a good policy for when data won't be referenced in the future is hard.
It's the negation of the problem of deciding what *will* be needed in some future state. This makes it equally hard (halting-problem hard) to deciding what to prefetch. For both problems it appears to me that computer science has already settled on "no solution" as the answer to the question "can we predict the program's future state?". NP-hard is NP-hard, no matter how much engineering talent is thrown at it; it remains mathematically impossible. Hence, I figure that what this kid has got is some great new mitigation scheme for program state prediction. That, or maybe he skipped the automata theory classes (I see that a lot with engineers-turned-programmers).
(I think - feel free to correct my understanding).
I'm a minority race. Save your vitriol for white people.
His comments indicate vision. Decades ago it was necessary to have caching and virtual memory, but with modern chip design he sees that it's no longer needed; instead of trying to fix yesterday's problem with yesterday's solution let's move on to solving the problem as if there was never a need for caching and virtual memory in the first place.
Prefetching in the general case is non-computable, but a lot of accesses are predictable. If the stack is in the scratchpad, then you're really only looking at heap accesses and globals for prefetching. Globals are easy to statically hint and heap variables are accessed by pointers that are reachable. It's fairly easy for each function that you might call to emit a prefetch version that doesn't do any calculation and just loads the data, then insert a call to that earlier. You don't have to get it right all of the time, you just have to get it right often enough that it's a benefit.
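A toy illustration of the "prefetch twin" idea from the paragraph above: the twin does the same pointer walk as the real function but no arithmetic, and gets called earlier. The `Scratchpad` class, the linked-list layout and the function names are all invented for this sketch. Prefetches are treated as free here, whereas real hardware would have to overlap them with other work.

```python
class Scratchpad:
    """Toy software-managed store in front of a slow 'main memory' dict."""

    def __init__(self, memory):
        self.memory = memory
        self.local = {}
        self.stalls = 0  # demand loads that had to wait on main memory

    def prefetch(self, addr):
        # Assumed non-blocking on real hardware; here it just copies the value in.
        if addr not in self.local:
            self.local[addr] = self.memory[addr]

    def load(self, addr):
        if addr not in self.local:
            self.stalls += 1
            self.local[addr] = self.memory[addr]
        return self.local[addr]


def sum_list(sp, head):
    """The 'real' function: walk a linked list and add up the values."""
    total, addr = 0, head
    while addr is not None:
        value, addr = sp.load(addr)
        total += value
    return total


def sum_list_prefetch(sp, head):
    """Prefetch twin: same traversal, no calculation, only pulls data in."""
    addr = head
    while addr is not None:
        sp.prefetch(addr)
        _, addr = sp.local[addr]


def run(use_prefetch):
    # Each node is stored as address -> (value, next_address).
    memory = {0: (10, 8), 8: (20, 16), 16: (30, None)}
    sp = Scratchpad(memory)
    if use_prefetch:
        sum_list_prefetch(sp, 0)  # the call the compiler would hoist earlier
    return sum_list(sp, 0), sp.stalls


print(run(use_prefetch=False))
print(run(use_prefetch=True))
```

Both runs compute the same sum; only the stall count differs. Whether the hoisted call finishes in time on real hardware is exactly the part that is hard to guarantee.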
For prefetching vs eviction, it's a question of window size. Even with no prefetching, most programs exhibit a lot of locality of reference and so caches work pretty well without prefetching - it doesn't matter that you take a miss on the first access, because you hit on the next few dozen (and in a multithreaded chip, you just let another thread run while you wait), but if you're evicting data too early then it's a problem. A combination of LRU / LFU works well, though all of the good algorithms in this space are patented. Although issuing prefetch hints is fairly easy, the reason that most compilers don't is that there's a good chance of accidentally pushing something else out of the cache. That said, if they're targeting HPC workloads, then just running them in a trace and then using that for hinting would probably be enough for a lot of things.
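To make the eviction point concrete, here is a small simulation comparing plain LRU against round-robin (FIFO) replacement on a synthetic trace where one hot line is reused between streaming accesses. It is plain textbook LRU, not one of the cleverer LRU/LFU hybrids mentioned above, and the trace and capacity are invented for the example.

```python
from collections import OrderedDict, deque


def misses_lru(trace, capacity):
    """Count misses with least-recently-used replacement."""
    cache = OrderedDict()
    misses = 0
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)        # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[addr] = True
    return misses


def misses_round_robin(trace, capacity):
    """Count misses when lines are evicted in the order they were filled."""
    resident, order = set(), deque()
    misses = 0
    for addr in trace:
        if addr not in resident:
            misses += 1
            if len(resident) >= capacity:
                resident.discard(order.popleft())  # evict the oldest fill
            resident.add(addr)
            order.append(addr)
    return misses


# Hot line "A" touched between six one-off streaming accesses.
trace = []
for i in range(6):
    trace += ["A", i]

print("LRU misses:", misses_lru(trace, capacity=2))
print("round-robin misses:", misses_round_robin(trace, capacity=2))
```

Round-robin keeps throwing the hot line out because it ignores reuse, which is the "evicting data too early" problem in miniature.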
I heard a nice anecdote from some friends at Apple a while ago. They found that one of their core frameworks was getting a significant slowdown on their newer chip. The eventual cause was quite surprising. In the old version, they had a branch being mispredicted, and a load speculatively executed. The correct branch target was identified quite early, so they only had a few cancelled instructions in the pipeline. About a hundred cycles later, they hit the same instruction and this time ran it correctly. With the new CPU, the initial branch was correctly predicted. This time, when they hit the load for real, it hadn't been speculatively executed and so they had to wait for a cache miss.
Also, if you're trying to create a parallel system with manual caches... good luck. Cache coherency is a pain to get right, but it's then fundamental to most modern parallel software. Implementing the shootdowns in software is going to give you a programming model that's horrible.
And finally there's the problem that doing it in software makes it serial. The main reason that we use hardware page-table walkers in modern CPUs is not that they're much better than a software TLB fill, it's that it's much easier to make them run completely asynchronously with the main pipeline. The same applies to caches.
I am TheRaven on Soylent News
One of the things that doesn't seem to be getting through in most of the media articles is how our memory system is actually set up. I'll try to describe it briefly here, starting from the single core.
:)
At a single core, we have a 128KB multibanked scratchpad memory, which you can think of as just like an L1 cache but smaller and lower latency. We have one cycle latency for a load/store from your registers to or from the scratchpad, and the same latency from the scratchpad to/from the core's router. That scratchpad is physically addressed, and does not have a bunch of extra (and in our opinion, wasted) logic to handle address translations, which just take up a lot of area and power (especially once you multiply it over hundreds of cores and large SRAMs. Most people think the TLB logic is a fixed size for any size SRAM, but it is not, and it gets significantly worse if you add coherency). Remember, even if you have a L1 cache (Typically 16 to 32KB, tops) hit on an Intel chip, it still takes 4 whole cycles.
Once we get to having a 16x16 grid (256 cores) as part of our Network on Chip, we have a total of 32MBs of on chip 1 cycle latency scratchpad. How we have arranged that is as a global flat address space, with all of the addresses being physically mapped. What I mean by this is that Core 0's scratchpad is the first 128K of the address space, and the address space continues on seamlessly to core 1, core 2, and all the way to core 255. If the address requested by a core is not in its own scratchpad's range, it goes to the router and hops on the NoC until it gets there... with a one cycle latency per hop. We have 32GB/s in each cardinal direction per router, giving a total on chip bandwidth of 8TB/s. Since it is all statically routed (which is a *very* important part of our entire design, which I am not revealing the full implications of just yet), we have guaranteed 1 cycle per hop latency between each router on the NoC. So even if you are going from one corner to another (core 0 to core 255) it is still a max latency of 32 cycles... still less than the latency to the L3 cache on an Intel chip.
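For readers who want to play with the numbers, a quick sketch of the flat address map described above. The 128KB, 16x16 and 32GB/s figures come from the post. The row-major core numbering and the Manhattan-distance hop count (as you would get with simple dimension-ordered routing) are my assumptions, not something Rex has confirmed.

```python
SCRATCHPAD_BYTES = 128 * 1024  # per-core scratchpad, from the post
GRID = 16                      # 16x16 grid of cores, from the post
LINK_GB_PER_S = 32             # per router, per cardinal direction, from the post


def owner_core(addr):
    """Core whose scratchpad holds this flat physical address."""
    return addr // SCRATCHPAD_BYTES


def core_xy(core):
    """Grid position of a core. Assumption: cores are numbered row-major."""
    row, col = divmod(core, GRID)
    return col, row


def hops(src_core, dst_core):
    """Router-to-router hops. Assumption: shortest (Manhattan) path."""
    sx, sy = core_xy(src_core)
    dx, dy = core_xy(dst_core)
    return abs(sx - dx) + abs(sy - dy)


total_bytes = GRID * GRID * SCRATCHPAD_BYTES
print("on-chip scratchpad:", total_bytes // (1024 * 1024), "MB")
print("address 0x20000 lives on core", owner_core(0x20000))
print("corner to corner:", hops(0, GRID * GRID - 1), "hops")

# One reading of the quoted 8TB/s figure: 256 routers x 32GB/s each.
print("aggregate:", GRID * GRID * LINK_GB_PER_S, "GB/s")
```

Under these assumptions corner to corner is 30 hops, which sits under the 32-cycle maximum quoted above.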
This gets to the chip to chip interconnect, which we have not been very public about, but I can say it is VERY high bandwidth (48GB/s in each direction, on all four sides of the chip, so an aggregate bandwidth of 384GB/s... compare that to 16GB/s of PCIe or even NVIDIA's 2018/2019 80GB/s plans with NVLINK). There are a lot of very cool things in that design, but I can't go into them publicly quite yet. We sacrifice distance and interoperability to get those numbers, but we think it is a worthy tradeoff for insane speed and efficiency. The other interesting thing that we are looking at (and haven't fully explored the full tradeoffs) is being able to extend our flat address space across multiple chips in a larger grid.
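The aggregate figure above is just 48GB/s times two directions times four sides. A short check, using only the numbers quoted in the post (the function name is made up):

```python
def aggregate_gb_per_s(per_direction, directions=2, sides=4):
    """Total off-chip bandwidth if every side runs both directions flat out."""
    return per_direction * directions * sides


rex = aggregate_gb_per_s(48)
print("aggregate:", rex, "GB/s")
print("vs 16GB/s PCIe:", rex / 16, "x")
print("vs 80GB/s NVLINK plan:", rex / 80, "x")
```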
To wrap up, most of the problems you mentioned here and in other comments are not totally valid, as we are not trying to replicate the inefficient protocols implemented super inefficiently in hardware today. We want to eventually be able to provide the same user experience and convenience that hardware caching provides, but keeping it abstracted away from the user. Hopefully you can understand I can't go into full details of this, and you have every reason to be skeptical, but that does not mean we are not going to try to do it anyways.
Also, cool Apple story. Thanks
Happy to answer any other questions
Most 19 year olds' idea of achievement is not puking up on the front doorstep after a particularly brutal night out boozing. For all you doubters: can we see how this chip performs in the wild before making judgement, please? To Thomas: will the chip ever see a retail shelf in say a personal supercomputer like the NVidia Tesla?
Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
1. We have already run through synthesis of a version of our core (and rough version of our chip)... There's a lot of work to be done, especially as we are in the last steps of locking down the RTL, but we are not worried about timing... we are being very conservative.
2. Already have standard cells and memory compilers. We are not amateurs.
3. We actually have solid state physics and fabrication experience, and understand the physical constraints of wire and gate delays, leakage, etc. All of those played a very large part in our architectural design, specifically so we don't have timing closure being a huge clusterfuck.
Take a look at my comment here: http://news.slashdot.org/comme...
The primary benefit of caches for HPC applications is *bandwidth filtering*. You can have much higher bandwidth to your cache (TB/s, pretty easily) than you can ever get to off-chip--and it is substantially lower power. It requires blocking your application to have a working set that fits in cache.
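As a sketch of what "blocking your application" means in practice: pick a tile size so that the tiles touched in the inner loops fit in the cache (or scratchpad), then iterate tile by tile. The sizing rule of three square tiles of 8-byte elements is a common rule of thumb for matrix multiply and is my assumption here, as are the function names.

```python
import math


def largest_tile(cache_bytes, elem_bytes=8, arrays=3):
    """Largest T such that `arrays` T x T tiles fit in `cache_bytes`."""
    return math.isqrt(cache_bytes // (arrays * elem_bytes))


def matmul_blocked(a, b, n, tile):
    """n x n matrix multiply that works on tile x tile blocks at a time."""
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Everything touched below stays inside three small tiles.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c


print("tile for a 32 KiB L1:", largest_tile(32 * 1024))
print("tile for a 128 KiB scratchpad:", largest_tile(128 * 1024))
```

The loop structure is the same whether the fast memory is a hardware cache or a software-managed scratchpad. The difference is who moves the tiles in and out.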
He's pulling out quotes from Cray (I used to work there) about how caches just get in the way--and they did, 30 years ago when there were very few HPC applications whose working set could fit in cache. It's a very different world nowadays.
Sometimes skipping college doesn't make you a genius, sometimes it just means you are doomed to repeat 50 years' worth of mistakes in a well developed field.
If you never reinvent the wheel, you'll never invent the tire. I say we let him down the rabbit hole and see if he comes back with anything new.
At a single core, we have a 128KB multibanked scratchpad memory, which you can think of as just like an L1 cache but smaller and lower latency. We have one cycle latency for a load/store from your registers to or from the scratchpad
Note that a single-cycle latency for L1 is not that uncommon in in-order pipelines - the Cortex A7, for example, has single-cycle access to L1.
That scratchpad is physically addressed, and does not have a bunch of extra (and in our opinion, wasted) logic to handle address translations,
The usual trick for this is to arrange your cache lines such that your L1 is virtually indexed and physically tagged, which means that you only need the TLB lookup (which can come from a micro-TLB) on the response. If you look at the cache design on the Cortex A72, it does a few more tricks that let you get roughly the same power as a direct-mapped L1 (which has very similar power to a scratchpad) from an associative L1.
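For anyone unfamiliar with the virtually-indexed, physically-tagged trick: it works cleanly when the bits used to pick the cache set all fall inside the page offset, so they are the same in the virtual and physical address. That boils down to "bytes per way must not exceed the page size". A small check of that rule follows; the cache geometries are generic examples, not the actual Cortex A72 configuration.

```python
def bytes_per_way(cache_bytes, ways):
    """Size of one way: sets * line size."""
    return cache_bytes // ways


def vipt_alias_free(cache_bytes, ways, page_bytes=4096):
    """True when index + line-offset bits fit inside the page offset."""
    return bytes_per_way(cache_bytes, ways) <= page_bytes


def min_ways(cache_bytes, page_bytes=4096):
    """Smallest associativity that keeps the index inside the page offset."""
    return max(1, cache_bytes // page_bytes)


print(vipt_alias_free(32 * 1024, ways=8))  # 4 KiB per way
print(vipt_alias_free(64 * 1024, ways=4))  # 16 KiB per way
print(min_ways(64 * 1024))
```

This is also why L1 sizes tend to grow by adding ways rather than sets when the page size stays at 4KB.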
If the address requested by a core is not in its own scratchpad's range, it goes to the router and hops on the NoC until it gets there... with a one cycle latency per hop
To get that latency, it sounds like you're using the NoC topology that some MIT folks presented at ISCA last year. I seem to remember that it was pretty easy to come up with cases that would overload their network (propagating wavefronts of messages) and end up breaking the latency guarantees. It also sounds like you're requiring physical layout awareness from your jobs, bringing NUMA scheduling problems from the OS (where they're hard) into the compiler (where they're harder).
Building a compiler for this sounds like a fun set of research problems (if you're looking for consultants, my rates are very reasonable! Though I have a different research architecture that presents interesting compiler problems to occupy most of my time).
Oh, one more quick question: Have you looked at Loki? The lowRISC project is likely to include an implementation of those ideas and it sounds as if they have a lot in common with your design (though also a number of significant differences).
I am TheRaven on Soylent News
Proof-of-concept doesn't have to be on the latest technology, which is undeniably expensive. Do a shared-wafer (https://www.mosis.com/) on some near-obsolete technology, and when the bugs are worked out it's time for scaling.
Contribute to civilization: ari.aynrand.org/donate
Darned overloaded abbreviations. RTL has priority, means Resistor-Transistor Logic.
Contribute to civilization: ari.aynrand.org/donate
Congratulations for tricking someone into giving you money. Good luck with your impending disaster.