Slashdot Mirror


19-Year-Old's Supercomputer Chip Startup Gets DARPA Contract, Funding

An anonymous reader writes: 19-year-old Thomas Sohmers, who launched his own supercomputer chip startup back in March, has won a DARPA contract and funding for his company. Rex Computing is currently finishing up the architecture of its final verified RTL, which is expected to be completed by the end of the year. The new Neo chips will be sampled next year, before moving into full production in mid-2017. The Platform reports: "In addition to the young company’s first round of financing, Rex Computing has also secured close to $100,000 in DARPA funds. The full description can be found midway down this DARPA document under 'Programming New Computers,' and has, according to Sohmers, been instrumental as they start down the verification and early tape out process for the Neo chips. The funding is designed to target the automatic scratch pad memory tools, which, according to Sohmers is the 'difficult part and where this approach might succeed where others have failed is the static compilation analysis technology at runtime.'"

30 of 150 comments

  1. good for him by turkeydance · · Score: 4, Insightful

    mean it.

  2. Not sure whats more impressive... by jonwil · · Score: 4, Insightful

    Not sure what's more impressive: the fact that a 19-year-old is able to get DARPA funding, or the fact that a 19-year-old (and his team, presumably) is about to go into mass production with a fairly fancy-looking custom microprocessor on a 28nm fab process.

    1. Re:Not sure whats more impressive... by alvinrod · · Score: 5, Informative

      I was a little curious about that as well, and one of the linked articles from TFA says that this kid was at MIT at 13. I'll go ahead and guess that he's really into and good at microprocessor design. The article I've linked also talks about some of the design decisions for the chip he's making, on which I'd be interested in hearing from someone with a background in the field.

    2. Re:Not sure whats more impressive... by trsohmers · · Score: 5, Informative

      This is the founder of the startup in the article. We have actually just raised $1.25 in venture funding, which is mentioned in the article. Thanks, and I hope we will be bringing more news soon.

    3. Re:Not sure whats more impressive... by avgjoe62 · · Score: 4, Funny

      $1.25? Contact me and I'll double, hell why not triple your funding! :)

      --

      How come Slashdot never gets Slashdotted?

    4. Re:Not sure whats more impressive... by trsohmers · · Score: 5, Informative

      I'm hugely biased as I am the founder of the referenced startup, but I figured I would point out a few key things:

      1. When it comes to FLOPs per watt, we are actually aiming at a 10 to 25x increase over existing systems... The best GPUs (before you account for the power usage of the CPU required to operate it) get almost 6 double precision GFLOPs per watt, while our chip is aiming for 64.
      2. When it comes to being better than a GPU for applications, you have to remember GPUs have abysmal memory bandwidth (due to being limited by PCIe's 16GB/s to the CPU) after you run out of data in the relatively small memory on the GPU. Due to this, GPUs are really only good at level 3 BLAS applications (matrix-matrix based... basically things you would imagine GPUs were designed for, which are related to image/video processing). It just so happened that when the GPGPU craze started ~6/7 years ago, they had enough of an advantage over CPUs that they made sense for some other applications, but in actuality, GPUs do so much worse on level 1 and level 2 BLAS apps compared to the latest CPUs that GPUs are really starting to lose their advantage (and I think will be dying out when it comes to anything other than what they were originally designed for plus some limited heavy matrix workloads... but then again, I'm biased).
      3. Programming is the biggest difficulty, and will make or break our company and processor. The DARPA grant is specifically for continued research and work on our development tools, which are intended to automate the unique features of our memory system. We have some papers in the works and will be talking publicly about our *very* cool software in the next couple of months.
      4. Your mention of the Mill and running existing code well, I had a pretty good laugh. Let me preface this by saying that I find stack machines academically interesting and fun to think about, and I don't discredit the Mill team entirely, and think it is a good thing they exist. With that being said, they have had barely functioning compilers for years (which they refuse to release publicly), and stack machines are notorious for having HORRIBLE support for languages like C. The fact that Out Of The Box Computing (the creators of the Mill) have been around for over 10 years and have given nothing but talks with powerpoints (though they clearly are very intelligent and have an interesting architecture) says a lot about their future viability. I hate to be a downer like that, especially since I have found Ivan's talks interesting and that he is a nice and down to earth guy, but I highly doubt they will ever have a chip come out. I'll restate my obvious biases for the previous statement.

      Feel free to ask any other questions.

    5. Re:Not sure whats more impressive... by metlin · · Score: 2

      He's a Thiel Fellow, and clearly, that model is working for kids like him who are super gifted for whom the current college education model would be absurd.

      This 17-Year-Old Dropped Out Of High School For Peter Thiel And Built A Game-Changing New Kind Of Computer

      Pretty awesome, if you ask me!

    6. Re:Not sure whats more impressive... by PopeRatzo · · Score: 2, Funny

      I'm hugely biased as I am the founder of the referenced startup, but I figured I would point out a few key things:

      Thomas, you are awesome.

      Enjoy your success. I see from your bio that in your "free time" you like to play guitar. I hope you've bought yourself a good one (or six).

      --
      You are welcome on my lawn.
    7. Re:Not sure whats more impressive... by Anonymous Coward · · Score: 3, Interesting

      Thanks for the response!

      I should have noticed your numbers were for double precision flops, so my numbers were way off. Thanks for the correction. I bet you are IEEE compliant too (Darn GPUs...).

      Your design is intended specifically for parallel workloads with localized or clustered data access, correct? (I realize this includes most supercomputer jobs.) It sounds like constraints similar to those you have with GPUs, but if met properly, the performance should be much better/more efficient and more scalable. And you expect your compilers to be able to meet these needs and statically schedule all the memory movement, which is where you get massive gains. Is that a reasonable assessment?

      Your designs don't have anything to offer for old straight line single threaded programs, correct? It will also not work well if you can't schedule the DMA actions well enough: pointer heavy random access code wouldn't run faster on your system than a gpu, but it won't run fast anywhere. Is that about right?

      I'm looking forward to your papers on the compiler side; it sounds very interesting. If you get something working in that area, it could be a big deal to the supercomputer guys (that's not me though).

      Personally I'm mostly interested in single threaded throughput, process isolation, and security, which is why the Mill interests me a lot. As for their stuff taking a long time: your rate of progress and schedule is just amazing, it's not that others are slow...

    8. Re:Not sure whats more impressive... by LetterRip · · Score: 2

      The important question is - how does it perform for the Cycles (Blender's render engine) benchmarks :)

      http://blenderartists.org/foru...

      https://www.blender.org/downlo...

    9. Re:Not sure whats more impressive... by captnjohnny1618 · · Score: 4, Interesting

      I'm burning some mod points to post this under my username, but it's totally worth it. THIS is the kind of article that should be on Slashdot!

      Can you elaborate on the programming structure/API you guys are envisioning for this? (it's cool if you can't, I'd understand :-D). Also, what particular types of problems are you guys targeting your chips to solve or to what areas do you envision your chips being especially well suited? Also, who do you think has done the best nitty-gritty write up about the project so far? I'd love to hear what you think is the best technical description publicly available. Can't wait to learn more as the project grows.

      Although I'm not a programmer or CS person by training, I do GPGPU programming (although not BLAS-based stuff) almost exclusively for my research and enjoy it because once you understand the differences between the GPU and CPU it just becomes a question of how to best parallelize your algorithm. It'd be AMAZING to see the memory bandwidth and power usage specs you guys are working towards under a programming structure similar to what we currently see with something like CUDA or OpenCL. Any plans for something like that or am I betraying my hobbyist computing status?

      Finally, if you ever need any applications testing, specifically in the medical imaging field, feel free to get in touch. ;-)

    10. Re:Not sure whats more impressive... by godrik · · Score: 3, Interesting

      I like the idea of "reinventing the computer for performance". Trying to get rid of overhead caused by virtual memory has attracted quite a bit of attention recently, so the idea is definitely sound.
      A few questions:
      -Are there any more details I can read anywhere? I could not really see any details past the "slightly technical PR" on http://www.rexcomputing.com/in...
      -Do you plan on presenting your work at SuperComputing?
      -You mention BLAS3 kernels, so I assume you mean dense BLAS3 kernels. In what I see, people are no longer really interested in dense linear algebra. Most of the applications I see nowadays are sparse. Can your architecture deal with that?
      -The chip and architecture seem to essentially be based on a 2D mesh network; can it be extended to more dimensions? I was under the impression that it would cause high latency in physical simulation, because you cannot easily project a 3D space into a 2D space without introducing large distance discrepancies. (Which is why BG/Q uses a 5D torus network.)
      Keep us apprised!
      Cheers

    11. Re:Not sure whats more impressive... by trsohmers · · Score: 4, Informative

      This is a bit old and has some inaccuracies, so I hesitate to share it, but since you can find it if you dig deep enough... here it is: http://rexcomputing.com/REX_OC...
      Couple quick things: Our instruction encoding is a bit different than what it has on the slide, we've brought it down to 128 bit VLIW (32 bits per functional unit operation), and there are some pipeline particulars we are not talking about publicly yet. We have also moved all of our compiler and toolchain development to be based on LLVM (and thus the really dense slides in there talking about GCC are mostly irrelevant).
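
      To make the "128 bit VLIW (32 bits per functional unit operation)" figure concrete, here is a minimal sketch of packing four 32-bit operation words into one 128-bit bundle. The slot order and the contents of each word are assumptions made purely for illustration; the real encoding has not been published.

```python
SLOT_BITS = 32          # bits per functional-unit operation (from the post)
SLOTS_PER_BUNDLE = 4    # 128 bits / 32 bits per operation
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(ops):
    """Pack four 32-bit operation words into one 128-bit integer.

    Slot 0 occupies the least significant bits; this ordering is a
    hypothetical choice, not the actual instruction format.
    """
    if len(ops) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly four operations")
    bundle = 0
    for slot, op in enumerate(ops):
        if not 0 <= op <= SLOT_MASK:
            raise ValueError("each operation must fit in 32 bits")
        bundle |= op << (slot * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into its four 32-bit operation words."""
    return [(bundle >> (slot * SLOT_BITS)) & SLOT_MASK
            for slot in range(SLOTS_PER_BUNDLE)]
```
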
      As mentioned in the presentation, we have some ideas of expanding the 2D mesh on the chip, including having it become a 2D torus... our chip-to-chip interconnect allows a lot more interesting geometries, and we are working on one with a university research lab that features a special 50-node configuration with max 2 hops between nodes. Our 96GB/s chip-to-chip bandwidth per side is also a big thing differentiating us from other chips (with the big sacrifice being the very short distance we need to have between chips and having a lot of constraints in packaging and the motherboard). We'll have more news on this in the future.
      When it comes to sparse and dense computations, we are mostly focusing on the dense ones to start (FFTs are beautiful on our architecture), but we are capable of doing well with sparse workloads, and while those developments are in the pipeline, it will take a lot more compiler development effort.
      We actually had a booth in the emerging technologies exhibition at Supercomputing Conference 2014, and hope to have a presence again this year.

    12. Re:Not sure whats more impressive... by trsohmers · · Score: 5, Funny

      While this is obvious troll bait, I can't resist the opportunity to just say that yes, I have kissed multiple girls.

    13. Re:Not sure whats more impressive... by trsohmers · · Score: 5, Informative

      1. We are IEEE compliant, but I'm not a fan of it TBH, as it has a ridiculous number of flaws... Check out Unum and the new book "The End Of Error" by John Gustafson (and also search Gustafson's Law, the counterargument to the more famous Amdahl's law), which goes over all of them and proposes a superior floating point format in *every* measure.
      2. The first thing we get around primarily by having ridiculous bandwidth (288 to 384GB/s aggregate chip-to-chip bandwidth)... we'll have more info out on that in the coming months. When it comes to memory movement, that's the big difficulty and what a big portion of our DARPA work is focused on, but a number of unique features of our network on chip (statically routed, non blocking, single cycle latency between routers, etc) help a lot with allowing the compiler to *know* that it can push things around in a given time, and having to put in a minimal number of NOPs. There is a lot of work, and it will not be perfect with our first iteration, but the initial customers we are working with do not require perfect auto-optimization to begin with.
      3. If you think of each core as being a quad issue in order RISC core (think on the performance order of an ARM Cortex A9 or potentially A15, but using a lot less power and being 64 bit), you can have one fully independent and isolated application on each core. That's one of the very nice things about a true MIMD/MPMD architecture. So we do fantastic with things that parallelize well, but you can also use our cores to run a lot of independent programs decently well.
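
      For readers who have not run into the two laws named in point 1, here is a small sketch of the standard textbook formulas. The parallel fraction of 0.95 is an assumed value for illustration only and says nothing about any real workload on this chip.

```python
def amdahl_speedup(parallel_fraction, n):
    """Amdahl's law: fixed problem size, so the serial part caps the speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


def gustafson_speedup(parallel_fraction, n):
    """Gustafson's law: the problem grows with n, giving a scaled speedup.

    Here parallel_fraction is the share of time spent in parallel work
    as measured on the n-processor system.
    """
    return (1.0 - parallel_fraction) + parallel_fraction * n


if __name__ == "__main__":
    p = 0.95  # assumed parallel fraction, illustration only
    for n in (16, 256, 4096):
        print(n, round(amdahl_speedup(p, n), 1), round(gustafson_speedup(p, n), 1))
```

      Under Amdahl's law the speedup for p = 0.95 can never exceed 20 however many cores are added, while the scaled speedup under Gustafson's law keeps growing with the core count, which is why the second is usually cited in favor of very wide machines.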

    14. Re:Not sure whats more impressive... by trsohmers · · Score: 5, Interesting

      1. My personal favorite programming models for our sort of architecture would be PGAS/SPMD style, with the latter being the basis for OpenMP. PGAS gives a lot more power in describing and efficiently having shared memory in an application with multiple memory regions. Since every one of our cores has 128KB of our scratchpad memory, and all of those memories are part of a global flat address space, every core can access any other core's memory as if it is part of one giant contiguous memory region. That does cause some issues with memory protection, but that is a sacrifice you make for this sort of efficiency and power (but we have some plans on how to address that with software... more news on that will be in the future). The other nice programming model we see is the Actor model... so think Erlang, but potentially also some CSP-like stuff with Go in the future (and yes, I do realize they are competing models).
      If you want to get the latest info as it comes out, sign up for our mailing list on our website!

    15. Re:Not sure whats more impressive... by __rze__ · · Score: 3, Interesting

      Hi Thomas,

      I found this extremely intriguing, as I am currently writing up my dissertation on high-GFLOPS/W 3-D layered reconfigurable architectures. I am also of the opinion that memory handling is the key, as it is the only way to resolve the von Neumann bottle-neck problem. Many processing elements with no means to feed them are useless. In my design I am using reconfigurability and flexibility to gain energy efficiency (my architectural range allows 111GFLOPs/W in some configurations).

      I am also concentrating on dense linalg kernels, as they are a perfect challenge in variable computation:data ratio, varied and complex memory access patterns and regularity.

      In my approach, I am of the opinion that forcing an application mapping to a given architecture via a compiler is inefficient. Instead, I am exploiting architectural flexibility gained from coarse-grained reconfigurable structures to adapt the architecture to an optimal ASAP/ALAP scheduling, thus constructing the perfect architecture to match an optimal mapping. Basically, keeping all processing elements busy all the time is the goal, leading to huge energy gains.

      The way this is done is a bit weird, as my architecture has a function set as opposed to an instruction set, which is custom-definable and run-time reconfigurable to suit an application. The construction of the function set is done by composing elementary hardware functions based on meaning, a concept close to functional programming concepts from John Backus. Programming is meaning-based, efficiently constructing required functions and bringing them out to assembly.

      Several kernels have been done this way, and programming stays easy via this functional reconfiguration (so far the longest being TRSM with 112 assembly lines). I reached 21-25GFLOPs/W on 65nm tech pre-layout for 10 BLAS1-3 kernels.

      I am now finishing up a 3D VIA-last physical layout in 40nm tech which already doubled my energy efficiency. (Why 3D? That's another story -- I think that division of computation, memory access and communication(intra-kernel data movement, sharing, broadcasting) needs custom hardware structures optimized for these tasks, which can be parallelized. Which is then native for 3D silicon -- each class on its own die). I will be reading your papers ASAP to see how you deal with the von Neumann bottle-neck :)

      Cheers, Zoltan

    16. Re:Not sure whats more impressive... by K.+S.+Kyosuke · · Score: 4, Funny

      In a SIMD or a MIMD fashion?

      --
      Ezekiel 23:20
    17. Re:Not sure whats more impressive... by TheRaven64 · · Score: 2

      When it comes to being better than a GPU for applications, you have to remember GPUs have abysmal memory bandwidth (due to being limited by PCIe's 16GB/s to the CPU)

      That's a somewhat odd claim. One of the reasons that computations on GPUs are fast is that they have high memory bandwidth. Being hampered by using the same DRAM as the CPU is one of the reasons that integrated GPUs perform worse. If you're writing GPU code that's doing anything other than initial setup over PCIe, then you're doing it badly wrong.

      That said, GPU memory controllers tend to be highly specialised. The nVidia ones have around 20 different streaming modes for different access patterns (I think the new version has a programmable prefetcher - Intel is also adding one), but if your memory access patterns are data dependent then GPUs can suck.

      after you run out of data in the relatively small memory on the GPU

      Not really. If you're doing big workloads on a GPU, your overflow isn't main memory over PCIe, it's the next GPU along over a much faster interconnect. And even with PCIe, most of the latency comes from the protocol and not the physical interconnect - you can get a lot more speed out of the PCIe hardware if you don't need all of the features of the PCIe bus.

      The DARPA grant is specifically for continued research and work on our development tools, which are intended to automate the unique features of our memory system. We have some papers in the works and will be talking publicly about our *very* cool software in the next couple of months.

      Where have you sent them? I'll keep an eye out.

      Your mention of the Mill and running existing code well, I had a pretty good laugh

      You certainly wouldn't be alone there.

      stack machines are notorious for having HORRIBLE support for languages like C

      That's not really true (not sure what the relevance to The Mill is though - it's not a stack machine). Algol support for stack machines became pretty good (C wasn't really popular until stack machines had largely died out, but the back end of a C compiler is not that different from the back end of an Algol compiler). The reason that stack machines died is that it's basically impossible for the hardware to extract ILP from a stack ISA. That's less of an issue if your throughput comes from thread-level parallelism. There are some experimental architectures floating around that get very good i-cache usage and solid performance from a stack-based ISA and a massive number of hardware threads.

      --
      I am TheRaven on Soylent News
  3. Only $100k? by afidel · · Score: 4, Informative

    That doesn't go very far in the microprocessor world. I worked for Cisco back in the early 00's and even back then tape out costs were approaching $1M for a 5 layer mask, today with sub-wavelength masks and chips using 12+ layers it must be tremendously expensive to spin a chip.

    --
    There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  4. Re:By Neruos by trsohmers · · Score: 4, Interesting

    We actually have very good reasons to say why this is a very different kind of VLIW, and have found the reason why other VLIW chips have had such static scheduling issues. Hope we can convince you and everyone else soon enough.

  5. Re:Half an hour, two comments by trsohmers · · Score: 5, Informative

    Uhm, it ranges, but I'd say I can get a snickers bar for around a buck in most vending machines. And there are also plenty of people smarter than me, even in this very small niche that I am in.

  6. 19 by PopeRatzo · · Score: 4, Funny

    When I was 19, my main achievement was building a bong out of a milk jug.

    --
    You are welcome on my lawn.
  7. Re: By Neruos by trsohmers · · Score: 5, Informative

    The biggest thing is what we have tried to emphasize, which is the fact that we have an entirely different memory system that does away with the hardware managed cache hierarchy. The rest of the really interesting stuff we have not publicly disclosed (yet), but I can tell you that it is very different from both Kalray and Tilera.

  8. Re:VLSI is hard by Anonymous Coward · · Score: 2, Funny

    An ENTIRE sleepless night? Wow. Sounds TOUGH. —said no MIT grad ever.

  9. I'm a pro in the field. This doesn't scan. by Brannon · · Score: 4, Interesting

    Please explain to me simply how you get 10x in compute efficiency over GPUs--these chips are already fairly optimal at general purpose flops per watt because they run at low voltage and fill up the die with arithmetic.

    GPUs have excellent memory bandwidth to their video RAM (GDDR*), they have poor IO latency & bandwidth (PCIe limited) which is the main reason they don't scale well.

    We've heard the VLIW "we just need better compilers" line several times before.

    Thus far this sounds like a truly excellent high school science fair project, or a slightly above average college engineering project. It is miles away from passing an industrial smell test.

  10. (old fart)been tried before(/old fart) by Melkhior · · Score: 4, Insightful

    Cue this old joke...
    - How many hardware engineers does it take to change a light bulb?
    - None, we'll fix it in software.

    Doing stuff in software to make hardware easier has been tried before (and before this kid was born, perhaps why he thinks this is new). It failed. Transputer, i960, i432, Itanium, MTA, Cell, a slew of others I don't remember...

    As for the grid, nice, but not exactly new. Tilera, Adapteva, KalRay, ...

  11. Re:The 19 year old is a lunatic by TheRaven64 · · Score: 2

    Prefetching in the general case is non-computable, but a lot of accesses are predictable. If the stack is in the scratchpad, then you're really only looking at heap accesses and globals for prefetching. Globals are easy to statically hint and heap variables are accessed by pointers that are reachable. It's fairly easy for each function that you might call to emit a prefetch version that doesn't do any calculation and just loads the data, then insert a call to that earlier. You don't have to get it right all of the time, you just have to get it right often enough that it's a benefit.

    For prefetching vs eviction, it's a question of window size. Even with no prefetching, most programs exhibit a lot of locality of reference and so caches work pretty well without prefetching - it doesn't matter that you take a miss on the first access, because you hit on the next few dozen (and in a multithreaded chip, you just let another thread run while you wait), but if you're evicting data too early then it's a problem. A combination of LRU / LFU works well, though all of the good algorithms in this space are patented. Although issuing prefetch hints is fairly easy, the reason that most compilers don't is that there's a good chance of accidentally pushing something else out of the cache. That said, if they're targeting HPC workloads, then just running them in a trace and then using that for hinting would probably be enough for a lot of things.
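
    To make the eviction side of that concrete, here is a minimal sketch of a plain LRU policy counting hits and misses over an access trace. It is only the simplest member of the family: it is not the LRU / LFU combination mentioned above, and the class and method names are made up for this example.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with least-recently-used eviction."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.lines = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Touch an address; return True on a hit, False on a miss."""
        if address in self.lines:
            self.lines.move_to_end(address)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.lines[address] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used
        return False
```

    Replaying a recorded trace through a model like this with different capacities is one rough way to see the window-size effect: too small a window and data gets evicted just before it is reused.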

    I heard a nice anecdote from some friends at Apple a while ago. They found that one of their core frameworks was getting a significant slowdown on their newer chip. The eventual cause was quite surprising. In the old version, they had a branch being mispredicted, and a load speculatively executed. The correct branch target was identified quite early, so they only had a few cancelled instructions in the pipeline. About a hundred cycles later, they hit the same instruction and this time ran it correctly. With the new CPU, the initial branch was correctly predicted. This time, when they hit the load for real, it hadn't been speculatively executed and so they had to wait for a cache miss.

    Also, if you're trying to create a parallel system with manual caches... good luck. Cache coherency is a pain to get right, but it's then fundamental to most modern parallel software. Implementing the shootdowns in software is going to give you a programming model that's horrible.

    And finally there's the problem that doing it in software makes it serial. The main reason that we use hardware page-table walkers in modern CPUs is not that they're much better than a software TLB fill, it's that it's much easier to make them run completely asynchronously with the main pipeline. The same applies to caches.

    --
    I am TheRaven on Soylent News
  12. Re:The 19 year old is a lunatic by trsohmers · · Score: 2

    One of the things that doesn't seem to be getting through in most of the media articles is how our memory system is actually set up. I'll try to describe it briefly here, starting from the single core.

    At a single core, we have a 128KB multibanked scratchpad memory, which you can think of as just like an L1 cache but smaller and lower latency. We have one cycle latency for a load/store from your registers to or from the scratchpad, and the same latency from the scratchpad to/from the core's router. That scratchpad is physically addressed, and does not have a bunch of extra (and in our opinion, wasted) logic to handle address translations, which just take up a lot of area and power (especially once you multiply it over hundreds of cores and large SRAMs. Most people think the TLB logic is a fixed size for any size SRAM, but it is not, and it gets significantly worse if you add coherency). Remember, even if you have an L1 cache (typically 16 to 32KB, tops) hit on an Intel chip, it still takes 4 whole cycles.

    Once we get to having a 16x16 grid (256 cores) as part of our Network on Chip, we have a total of 32MB of on chip 1 cycle latency scratchpad. How we have arranged that is as a global flat address space, with all of the addresses being physically mapped. What I mean by this is that Core 0's scratchpad is the first 128K of the address space, and the address space continues on seamlessly to core 1, core 2, and all the way to core 255. If the address requested by a core is not in its own scratchpad's range, it goes to the router and hops on the NoC until it gets there... with a one cycle latency per hop. We have 32GB/s in each cardinal direction per router, giving a total on chip bandwidth of 8TB/s. Since it is all statically routed (which is a *very* important part of our entire design, which I am not revealing the full implications of just yet), we have guaranteed 1 cycle per hop latency between each router on the NoC. So even if you are going from one corner to another (core 0 to core 255) it is still a max latency of 32 cycles... still less than the latency to the L3 cache on an Intel chip.
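
    As a rough illustration of the flat address space described above, the sketch below maps a physical address to the core that owns it and counts mesh hops between two cores. The row-major numbering of cores on the 16x16 grid and the shortest-path (Manhattan) routing are assumptions made for this example; the post only gives the sizes and the one-cycle-per-hop figure.

```python
SCRATCHPAD_BYTES = 128 * 1024   # per-core scratchpad (from the post)
GRID_WIDTH = 16                 # 16x16 grid (from the post)
NUM_CORES = GRID_WIDTH * GRID_WIDTH


def owner_core(address):
    """Return the core whose scratchpad contains this physical address."""
    if not 0 <= address < NUM_CORES * SCRATCHPAD_BYTES:
        raise ValueError("address outside the on-chip scratchpad range")
    return address // SCRATCHPAD_BYTES


def core_xy(core):
    """Grid position of a core, assuming row-major numbering (hypothetical)."""
    return core % GRID_WIDTH, core // GRID_WIDTH


def hops(src_core, dst_core):
    """Mesh hops between two cores, assuming shortest-path routing."""
    sx, sy = core_xy(src_core)
    dx, dy = core_xy(dst_core)
    return abs(sx - dx) + abs(sy - dy)
```

    With these assumptions, 256 cores times 128KB gives the 32MB total, and the corner-to-corner trip from core 0 to core 255 is 30 hops, which fits under the 32-cycle maximum quoted above.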

    This gets to the chip to chip interconnect, which we have not been very public about, but I can say it is VERY high bandwidth (48GB/s in each direction, on all four sides of the chip, so an aggregate bandwidth of 384GB/s... compare that to 16GB/s of PCIe or even NVIDIA's 2018/2019 80GB/s plans with NVLink). There are a lot of very cool things in that design, but I can't go into them publicly quite yet. We sacrifice distance and interoperability to get those numbers, but we think it is a worthy tradeoff for insane speed and efficiency. The other interesting thing that we are looking at (and haven't fully explored the tradeoffs of) is being able to extend our flat address space across multiple chips in a larger grid.
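
    The bandwidth arithmetic in that paragraph can be checked in a few lines. Reading "each direction" as inbound plus outbound per side is an assumption, though it matches the 96GB/s per side figure given earlier in the thread; the PCIe and NVLink numbers are simply the ones quoted above.

```python
PER_DIRECTION_GB_S = 48   # per direction, per side (from the post)
DIRECTIONS_PER_SIDE = 2   # assumed: one inbound plus one outbound link
SIDES = 4

PCIE_GB_S = 16            # figure quoted in the post
NVLINK_GB_S = 80          # planned figure quoted in the post

per_side = PER_DIRECTION_GB_S * DIRECTIONS_PER_SIDE
aggregate = per_side * SIDES

if __name__ == "__main__":
    print("per side:", per_side, "GB/s")
    print("aggregate:", aggregate, "GB/s")
    print("vs PCIe:", aggregate / PCIE_GB_S)
    print("vs NVLink:", aggregate / NVLINK_GB_S)
```
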

    To wrap up, most of the problems you mentioned here and in other comments are not totally valid, as we are not trying to replicate the inefficient protocols implemented super inefficiently in hardware today. We want to eventually be able to provide the same user experience and convenience that hardware caching provides, but keeping it abstracted away from the user. Hopefully you can understand I can't go into full details of this, and you have every reason to be skeptical, but that does not mean we are not going to try to do it anyways.

    Also, cool Apple story. Thanks :)
    Happy to answer any other questions.

  13. Re:But it works in RTL Simulation by trsohmers · · Score: 2

    1. We have already run through synthesis of a version of our core (and rough version of our chip)... There's a lot of work to be done, especially as we are in the last steps of locking down the RTL, but we are not worried about timing... we are being very conservative.

    2. Already have standard cells and memory compilers. We are not amateurs.

    3. We actually have solid state physics and fabrication experience, and understand the physical constraints of wire and gate delays, leakage, etc. All of those played a very large part in our architectural design, specifically so we don't have timing closure being a huge clusterfuck.