19-Year-Old's Supercomputer Chip Startup Gets DARPA Contract, Funding

Posted by samzenpus on Wednesday July 22, 2015 @12:17PM from the show-me-the-money dept.

An anonymous reader writes: 19-year-old Thomas Sohmers, who launched his own supercomputer chip startup back in March, has won a DARPA contract and funding for his company. Rex Computing is currently finishing up the architecture of its final verified RTL, which is expected to be completed by the end of the year. The new Neo chips will be sampled next year, before moving into full production in mid-2017. The Platform reports: "In addition to the young company’s first round of financing, Rex Computing has also secured close to $100,000 in DARPA funds. The full description can be found midway down this DARPA document under 'Programming New Computers,' and has, according to Sohmers, been instrumental as they start down the verification and early tape out process for the Neo chips. The funding is designed to target the automatic scratch pad memory tools, which, according to Sohmers is the 'difficult part and where this approach might succeed where others have failed is the static compilation analysis technology at runtime.'"

good for him by turkeydance · 2015-07-22 12:24 · Score: 4, Insightful

mean it.
Not sure whats more impressive... by jonwil · 2015-07-22 12:33 · Score: 4, Insightful

Not sure what's more impressive, the fact that a 19-year-old is able to get DARPA funding or the fact that a 19-year-old (and his team, presumably) is about to go into mass production with a fairly fancy-looking custom microprocessor on a 28nm fab process.
1. Re:Not sure whats more impressive... by alvinrod · 2015-07-22 12:55 · Score: 5, Informative
  
  I was a little curious about that as well, and one of the linked articles from TFA says that this kid was at MIT at 13. I'll go ahead and guess that he's really into and good at microprocessor design. The article I've linked also talks about some of the design decisions for the chip he's making, on which I'd be interested in hearing from someone with a background in the field.
2. Re:Not sure whats more impressive... by Dutch+Gun · 2015-07-22 13:51 · Score: 1
  
  Both are pretty damn impressive, although for chip manufacturing, $100,000 isn't exactly a lot of money. Consider what a single engineer earns per year. Apparently, they have six people, and hiring a seventh. I guess that means they must have some other funding coming in from somewhere, as he talked about how "supercomputing" is a poisoned word among Silicon Valley VC firms.
  I hope to hear interesting things about this young man and his company in the future.
  
  --
  Irony: Agile development has too much intertia to be abandoned now.
3. Re:Not sure whats more impressive... by Anonymous Coward · 2015-07-22 13:54 · Score: 1
  
  I don't have a background in microprocessor design: I've only designed a very simple one as an assignment, but I've been following the industry pretty closely.
  From what I can tell, his design looks like it might be comparable to GPUs in flops per watt, but with different memory abstractions that result in similar limitations. I suspect that if you write custom code for it, and have the right kind of problem, it will do significantly better than available options, but in the general case and/or with non-specialized code it won't do much better than a GPU, though it may be competitive.
  I don't see the design as revolutionary: basically everyone wants to make a grid of DSPs because that's the most efficient thing you can do and we all know it. Also, everyone wants to have core-local memory with explicit DMA between them because we know that's the most efficient, but it sucks to use. Look how much people enjoyed writing for the PS3... (It had an IBM Cell processor with explicit DMA.) It's an idealized mix of the ideas from DSPs and GPUs, and might feel a bit like a monster IBM Cell processor. The real question is what it's like to code for.
  If you want an interesting general purpose processor to look into (something that will run existing code well) I recommend the mill processor. The videos on that site provide hours of interesting insights into CPU design.
4. Re:Not sure whats more impressive... by trsohmers · 2015-07-22 14:08 · Score: 5, Informative
  
  This is the founder of the startup in the article. We have actually just raised $1.25 in venture funding, which is mentioned in the article. Thanks, and I hope we will be bringing more news soon.
5. Re:Not sure whats more impressive... by avgjoe62 · 2015-07-22 14:12 · Score: 4, Funny
  
  $1.25? Contact me and I'll double, hell why not triple your funding! :)
  
  --
  How come Slashdot never gets Slashdotted?
6. Re:Not sure whats more impressive... by trsohmers · 2015-07-22 14:25 · Score: 5, Informative
  
  I'm hugely biased as I am the founder of the referenced startup, but I figured I would point out a few key things:
  1. When it comes to FLOPs per watt, we are actually aiming at a 10 to 25x increase over existing systems... The best GPUs (before you account for the power usage of the CPU required to operate it) get almost 6 double precision GFLOPs per watt, while our chip is aiming for 64.
  2. When it comes to being better than a GPU for applications, you have to remember GPUs have abysmal memory bandwidth (due to being limited by PCIe's 16GB/s to the CPU) after you run out of data in the relatively small memory on the GPU. Due to this, GPUs are really only good at level 3 BLAS applications (matrix-matrix based... basically things you would imagine GPUs were designed for, which are related to image/video processing). It just so happened that when the GPGPU craze started ~6/7 years ago, they had enough of an advantage over CPUs that they made sense for some other applications, but in actuality, GPUs do so much worse on level 1 and level 2 BLAS apps compared to the latest CPUs that GPUs are really starting to lose their advantage (and I think will be dying out when it comes to anything other than what they were originally designed for plus some limited heavy matrix workloads... but then again, I'm biased).
  3. Programming is the biggest difficulty, and will make or break our company and processor. The DARPA grant is specifically for continued research and work on our development tools, which are intended to automate the unique features of our memory system. We have some papers in the works and will be talking publicly about our *very* cool software in the next couple of months.
  4. Your mention of the Mill and running existing code well, I had a pretty good laugh. Let me preface this by saying that I find stack machines academically interesting and fun to think about, and I don't discredit the Mill team entirely, and think it is a good thing they exist. With that being said they have had barely functioning compilers for years (which they refuse to release publicly), and stack machines are notorious for having HORRIBLE support for languages like C. The fact that Out Of The Box Computing (the creators of the Mill) have been around for over 10 years and have given nothing but talks with powerpoints (though they clearly are very intelligent and have an interesting architecture) says a lot about their future viability. I hate to be a downer like that, especially since I have found Ivan's talks interesting and that he is a nice and down to earth guy, but I highly doubt they will ever have a chip come out. I'll restate my obvious biases for the previous statement.
  Feel free to ask any other questions.
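For readers who don't live in BLAS terms, the level 1 / level 2 / level 3 distinction in point 2 mostly comes down to arithmetic intensity: how many floating-point operations a kernel performs per byte it has to move. The sketch below is a back-of-the-envelope model, not a benchmark. It assumes 8-byte doubles, square n-by-n matrices, and that every operand is moved exactly once; the function names are made up for illustration.

```python
# Idealized arithmetic intensity (FLOPs per byte moved) for one representative
# kernel per BLAS level. Assumes 8-byte doubles and that each operand is read
# or written exactly once (no cache effects, no reuse beyond that).

BYTES_PER_DOUBLE = 8


def axpy_intensity(n: int) -> float:
    """Level 1: y = a*x + y. 2n FLOPs; read x, read y, write y."""
    flops = 2 * n
    bytes_moved = 3 * n * BYTES_PER_DOUBLE
    return flops / bytes_moved


def gemv_intensity(n: int) -> float:
    """Level 2: y = A*x + y. 2n^2 FLOPs; traffic dominated by the n*n matrix."""
    flops = 2 * n * n
    bytes_moved = (n * n + 3 * n) * BYTES_PER_DOUBLE
    return flops / bytes_moved


def gemm_intensity(n: int) -> float:
    """Level 3: C = A*B + C. 2n^3 FLOPs; read A, B, C and write C (4n^2 doubles)."""
    flops = 2 * n ** 3
    bytes_moved = 4 * n * n * BYTES_PER_DOUBLE
    return flops / bytes_moved


n = 1000
level1 = axpy_intensity(n)  # constant 1/12, regardless of n
level2 = gemv_intensity(n)  # approaches 1/4 from below as n grows
level3 = gemm_intensity(n)  # grows linearly with n (n/16)
```

Under this model, level 1 and level 2 kernels stay bandwidth-bound no matter how large the problem gets, while level 3 does more work per byte as n grows. That is why memory bandwidth keeps coming up in this thread.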
7. Re:Not sure whats more impressive... by metlin · 2015-07-22 14:32 · Score: 2
  
  He's a Thiel Fellow, and clearly, that model is working for kids like him who are super gifted for whom the current college education model would be absurd.
  This 17-Year-Old Dropped Out Of High School For Peter Thiel And Built A Game-Changing New Kind Of Computer
  Pretty awesome, if you ask me!
8. Re:Not sure whats more impressive... by ArcadeMan · 2015-07-22 14:45 · Score: 1
  
  Hell, at that price I'd be able to fund the whole project in Dogecoins!
  
  --
  Get free satoshi (Bitcoin) and Dogecoins
9. Re:Not sure whats more impressive... by PopeRatzo · 2015-07-22 14:48 · Score: 2, Funny
  
  I'm hugely biased as I am the founder of the referenced startup, but I figured I would point out a few key things:
  Thomas, you are awesome.
  Enjoy your success. I see from your bio that in your "free time" you like to play guitar. I hope you've bought yourself a good one (or six).
  
  --
  You are welcome on my lawn.
10. Re:Not sure whats more impressive... by Dutch+Gun · 2015-07-22 15:21 · Score: 1
  
  Ah, I missed it right in the first paragraph. Good work getting the initial funding and your company off the ground. Mine isn't nearly so ambitious, but I can sympathize with the headaches of getting a new business started.
  I'm a software guy, and know only the theoretical basics about the hardware I program for, but the notion of putting more of the complexity into the compiler instead of the chip is interesting. I wonder if this technology requires new approaches to languages and compilers, or whether it can be adapted to existing infrastructure (existing compilers like GCC or LLVM). Hopefully the latter, as it would be a bigger barrier to adoption if a whole new toolchain needed to be adopted along with the chip.
  In either case, it sounds like a hell of a challenge, as (if I understand correctly) you'd presumably need to pre-evaluate logic flow and track how resources are accessed in order to embed the proper data cache hints. However, those sort of access patterns can change depending on the state of program data, even within the same code. Or, I suppose you could "tune" the program for optimal execution through multiple pass evaluation runs, embedding data access hints in a second pass after seeing the results of a "profiling" run.
  If you can't reveal that sort of secret sauce yet, don't worry about it, as it's fun to speculate as well. Thanks for checking in with us. Interesting stuff!
  
  --
  Irony: Agile development has too much intertia to be abandoned now.
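The two-pass idea speculated about above (profile first, then embed placement hints) can be sketched in a few lines. This is only an illustration of the general profile-guided approach, not anything Rex has described: the trace, buffer names, sizes, and the greedy policy are all hypothetical.

```python
from collections import Counter


def profile(trace):
    """Pass 1: count how often each buffer is touched in a recorded access trace."""
    return Counter(trace)


def place(counts, sizes, capacity):
    """Pass 2: greedily pin the most frequently accessed buffers that still fit
    into a fixed-size local memory; everything else stays in main memory."""
    pinned = []
    free = capacity
    for name, _ in counts.most_common():
        if sizes[name] <= free:
            pinned.append(name)
            free -= sizes[name]
    return pinned


# Hypothetical access trace and buffer sizes in bytes, for illustration only.
trace = ["a", "b", "a", "c", "a", "b", "d"]
sizes = {"a": 64, "b": 96, "c": 32, "d": 16}
hot = place(profile(trace), sizes, capacity=128)
```

Here `a` is pinned first, `b` no longer fits, and the smaller `c` and `d` fill the remaining space. The weakness is the one already raised in the comment: a placement derived from one profiling run is only as good as that run's input data.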
11. Re:Not sure whats more impressive... by Anonymous Coward · 2015-07-22 15:26 · Score: 3, Interesting
  
  Thanks for the response!
  I should have noticed your numbers were for double-precision flops, so my numbers were way off. Thanks for the correction. I bet you are IEEE compliant too (Darn GPUs...).
  Your design is intended specifically for parallel workloads with localized or clustered data access, correct? (I realize this includes most supercomputer jobs.) It sounds like constraints similar to those you have with GPUs, but if met properly, the performance should be much better/more efficient and more scalable. And you expect your compilers to be able to meet these needs and statically schedule all the memory movement, which is where you get massive gains. Is that a reasonable assessment?
  Your designs don't have anything to offer for old straight-line single-threaded programs, correct? It will also not work well if you can't schedule the DMA actions well enough: pointer-heavy random access code wouldn't run faster on your system than on a GPU, but it won't run fast anywhere. Is that about right?
  I'm looking forward to your papers on the compiler side; it sounds very interesting. If you get something working in that area, it could be a big deal to the supercomputer guys (that's not me though).
  Personally I'm mostly interested in single-threaded throughput, process isolation, and security, which is why the Mill interests me a lot. As for their stuff taking a long time: your rate of progress and schedule is just amazing, it's not that others are slow...
12. Re:Not sure whats more impressive... by LetterRip · 2015-07-22 15:32 · Score: 2
  
  The important question is - how does it perform for the Cycles (Blender's render engine) benchmarks :)
  http://blenderartists.org/foru...
  https://www.blender.org/downlo...
13. Re:Not sure whats more impressive... by captnjohnny1618 · 2015-07-22 16:00 · Score: 4, Interesting
  
  I'm burning some mod points to post this under my username, but it's totally worth it. THIS is the kind of article that should be on Slashdot!
  
  Can you elaborate on the programming structure/API you guys are envisioning for this? (it's cool if you can't, I'd understand :-D). Also, what particular types of problems are you guys targeting your chips to solve or to what areas do you envision your chips being especially well suited? Also, who do you think has done the best nitty-gritty write up about the project so far? I'd love to hear what you think is the best technical description publicly available. Can't wait to learn more as the project grows.
  
  Although I'm not a programmer or CS person by training, I do GPGPU programming (although not BLAS-based stuff) almost exclusively for my research and enjoy it because once you understand the differences between the GPU and CPU it just becomes a question of how to best parallelize your algorithm. It'd be AMAZING to see the memory bandwidth and power usage specs you guys are working towards under a programming structure similar to what we currently see with something like CUDA or OpenCL. Any plans for something like that or am I betraying my hobbyist computing status?
  
  Finally, if you ever need any applications testing, specifically in the medical imaging field, feel free to get in touch. ;-)
14. Re:Not sure whats more impressive... by Anonymous Coward · 2015-07-22 16:00 · Score: 1
  
  "We have actually just raised $1.25 in venture funding"
  Looks like you shouldn't use your CPU design to run your finances there, Sparky.
15. Re:Not sure whats more impressive... by godrik · 2015-07-22 16:29 · Score: 3, Interesting
  
  I like the idea of "reinventing the computer for performance". Trying to get rid of overhead caused by virtual memory has attracted quite a bit of attention recently, so the idea is definitely sound.
  A few questions:
  -Are there any more details I can read anywhere? I could not really see any details past the "slightly technical PR" on http://www.rexcomputing.com/in...
  -Do you plan on presenting your work at SuperComputing?
  -You mention BLAS3 kernels, so I assume you mean dense BLAS3 kernels. From what I see, people are no longer really interested in dense linear algebra. Most of the applications I see nowadays are sparse. Can your architecture deal with that?
  -The chip and architecture seem to essentially be based on a 2D mesh network; can it be extended to more dimensions? I was under the impression that it would cause high latency in physical simulation, because you cannot easily project a 3D space into a 2D space without introducing large distance discrepancies. (Which is why BG/Q uses a 5D torus network.)
  Keep us apprised!
  Cheers
16. Re:Not sure whats more impressive... by trsohmers · 2015-07-22 16:53 · Score: 4, Informative
  
  This is a bit old and has some inaccuracies, so I hesitate to share it, but since you can find it if you dig deep enough... here it is: http://rexcomputing.com/REX_OC...
  Couple quick things: Our instruction encoding is a bit different than what it has on the slide, we've brought it down to 128 bit VLIW (32 bits per functional unit operation), and there are some pipeline particulars we are not talking about publicly yet. We have also moved all of our compiler and toolchain development to be based on LLVM (and thus the really dense slides in there talking about GCC are mostly irrelevant).
  As mentioned in the presentation, we have some ideas for expanding the 2D mesh on the chip, including having it become a 2D torus... our chip-to-chip interconnect allows a lot more interesting geometries, and we are working on one with a university research lab that features a special 50-node configuration with max 2 hops between nodes. Our 96GB/s chip-to-chip bandwidth per side is also a big thing differentiating us from other chips (with the big sacrifice being the very short distance we need to have between chips and having a lot of constraints in packaging and the motherboard). We'll have more news on this in the future.
  When it comes to sparse and dense computations, we are mostly focusing on the dense ones to start (FFT's are beautiful on our architecture), but we are capable of doing well with sparse workloads, and while those developments are in the pipeline, it will take a lot more compiler development effort.
  We actually had a booth in the emerging technologies exhibition at Supercomputing Conference 2014, and hope to have a presence again this year.
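To put a number on the mesh-versus-torus point, here is a small sketch that brute-forces the worst-case hop count (the network diameter) for both topologies. The 8x8 grid size is a hypothetical choice for illustration, not a figure from Rex, and the model ignores routing policy and link contention.

```python
def mesh_hops(a, b):
    """Hop count between (x, y) nodes on a 2D mesh: plain Manhattan distance."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def torus_hops(a, b, width, height):
    """Hop count on a 2D torus: each axis wraps around, so take the shorter way."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, width - dx) + min(dy, height - dy)


def diameter(width, height, hops):
    """Worst-case hop count over all pairs of nodes in a width-by-height grid."""
    nodes = [(x, y) for x in range(width) for y in range(height)]
    return max(hops(a, b) for a in nodes for b in nodes)


W = H = 8  # hypothetical grid size, chosen only for illustration
mesh_diameter = diameter(W, H, mesh_hops)
torus_diameter = diameter(W, H, lambda a, b: torus_hops(a, b, W, H))
```

For this grid the wraparound links cut the worst case from 14 hops to 8. Higher-dimensional tori shrink the diameter further, which is the usual argument for them in large machines.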
17. Re:Not sure whats more impressive... by trsohmers · 2015-07-22 16:55 · Score: 5, Funny
  
  While this is obvious troll bait, I can't resist the opportunity to just say that yes, I have kissed multiple girls.
18. Re:Not sure whats more impressive... by trsohmers · 2015-07-22 17:06 · Score: 5, Informative
  
  1. We are IEEE compliant, but I'm not a fan of it TBH, as it has a ridiculous number of flaws... Check out Unum and the new book "The End Of Error" by John Gustafson (and also search Gustafson's Law, the counterargument to the more famous Amdahl's law), which goes over all of them and proposes a superior floating point format in *every* measure.
  2. First thing we get around primarily by having ridiculous bandwidth (288 to 384GB/s aggregate chip-to-chip bandwidth)... we'll have more info out on that in the coming months. When it comes to memory movement, that's the big difficulty and what a big portion of our DARPA work is focused on, but a number of unique features of our network on chip (statically routed, non blocking, single cycle latency between routers, etc) help a lot with allowing the compiler to *know* that it can push things around in a given time, and having to insert only a minimal number of NOPs. There is a lot of work, and it will not be perfect with our first iteration, but the initial customers we are working with do not require perfect auto-optimization to begin with.
  3. If you think of it as each core as being a quad issue in order RISC core (think on the performance order of a ARM Cortex A9 or potentially A15, but using a lot less power and being 64 bit), you can have one fully independent and isolated application on each core. That's one of the very nice things about a true MIMD/MPMD architecture. So we do fantastic with things that parallelize well, but you can also use our cores to run a lot of independent programs decently well.
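Since Gustafson's law and Amdahl's law both come up above, here are the two textbook formulas side by side. One caveat: `p` means slightly different things in each. In Amdahl's law it is the parallelizable fraction of the original serial run; in Gustafson's law it is the parallel fraction of the time spent on the parallel machine. The 95% fraction and 256 processors are hypothetical numbers picked for illustration.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Fixed problem size: speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)


def gustafson_speedup(p: float, n: int) -> float:
    """Problem size grows with the machine: scaled speedup = (1 - p) + p * n."""
    return (1.0 - p) + p * n


# Hypothetical example: 95% parallel work on 256 processors.
fixed_size = amdahl_speedup(0.95, 256)       # bounded above by 1 / (1 - p) = 20
scaled_size = gustafson_speedup(0.95, 256)   # keeps growing with n
```

With a fixed problem, the serial 5% caps the speedup below 20x however many cores are added. If the problem is allowed to grow with the machine, the same fraction gives a scaled speedup of roughly 243x. That second case is the one supercomputing workloads usually fall into.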
19. Re:Not sure whats more impressive... by trsohmers · 2015-07-22 17:12 · Score: 5, Interesting
  
  1. My personal favorite programming models for our sort of architecture would be PGAS/SPMD style, with the latter being the basis for OpenMP. PGAS gives a lot more power in describing and efficiently having shared memory in an application with multiple memory regions. Since every one of our cores has 128KB of our scratchpad memory, and all of those memories are part of a global flat address space, every core can access any other core's memory as if it is part of one giant contiguous memory region. That does cause some issues with memory protection, but that is a sacrifice you make for this sort of efficiency and power (but we have some plans on how to address that with software... more news on that will be in the future). The other nice programming model we see is the Actor model... so think Erlang, but potentially also some CSP-like stuff with Go in the future (and yes, I do realize they are competing models).
  If you want to get the latest info as it comes out, sign up for our mailing list on our website!
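One way to picture the "global flat address space" of per-core scratchpads described above is as a simple address split. The sketch below assumes the scratchpads are laid out back to back in core order. That layout is an assumption made for illustration; the comment does not describe the actual address map, and the function names are hypothetical.

```python
from typing import Tuple

SCRATCHPAD_BYTES = 128 * 1024  # per-core scratchpad size quoted in the comment


def to_global(core: int, offset: int) -> int:
    """Map (core index, offset within that core's scratchpad) to a flat address.

    Assumes scratchpads are contiguous in core order (hypothetical layout).
    """
    if core < 0:
        raise ValueError("core index must be non-negative")
    if not 0 <= offset < SCRATCHPAD_BYTES:
        raise ValueError("offset outside the core's scratchpad")
    return core * SCRATCHPAD_BYTES + offset


def to_local(address: int) -> Tuple[int, int]:
    """Inverse mapping: flat address back to (owning core, local offset)."""
    if address < 0:
        raise ValueError("address must be non-negative")
    return divmod(address, SCRATCHPAD_BYTES)


addr = to_global(core=3, offset=16)
owner, local_offset = to_local(addr)
```

Under a scheme like this, any core can compute which neighbour owns a given address with one division. It also shows why memory protection becomes a software problem, as the comment notes: nothing in the address itself stops core 0 from naming core 3's memory.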
20. Re:Not sure whats more impressive... by goose-incarnated · 2015-07-22 17:52 · Score: 1
  
  Whatever. You ever kiss a girl? You've tossed your youth away to build next year's landfill. Good for you.
  I'd imagine that a 19 year old as impressive as he is pretty much has the sexual choice of anyone (male or female) that he wants.
  
  --
  I'm a minority race. Save your vitriol for white people.
21. Re: Not sure whats more impressive... by robi5 · 2015-07-22 18:50 · Score: 1
  
  Interesting you mentioned CSP. When I read up on your architecture, its close relative, Functional Reactive Programming (well... inspired by FRP...) came to mind. Leads to easy programming and relatively straightforward, direct mapping of the FRP nodes to cores, and event streams to communication among cores. Very good isolation.
22. Re:Not sure whats more impressive... by robi5 · 2015-07-22 19:07 · Score: 1
  
  Maybe HotSpot / V8 type of optimizations would work well, as in running code, the actual patterns emerge. This is a great talk on the cost of virtual memory, the future of JS and more :-) https://www.destroyallsoftware...
23. Re:Not sure whats more impressive... by Arkh89 · 2015-07-22 19:15 · Score: 1
  
  When it comes to being better than a GPU for applications, you have to remember GPUs have abysmal memory bandwidth (due to being limited by PCIe's 16GB/s to the CPU) after you run out of data in the relatively small memory on the GPU. Due to this, GPUs are really only good at level 3 BLAS applications (matrix-matrix based... basically things you would imagine GPUs were designed for, which are related to image/video processing).
  This is only true if your problem does not fit in the VRAM (which is getting over 10GB nowadays). If it does, you'll be 8x to 12x faster than any brand new CPU for any element-wise operation. Also, it is much more common to find an easy way to cut the problem nicely than not.
  That being said, do you know how much embedded RAM your architecture will offer (even a rough projection)?
24. Re:Not sure whats more impressive... by __rze__ · 2015-07-22 19:17 · Score: 3, Interesting
  
  Hi Thomas,
  I found this extremely intriguing, as I am currently writing up my dissertation on high-GFLOPS/W 3-D layered reconfigurable architectures. I am also of the opinion that memory handling is the key, as it is the only way to resolve the von Neumann bottle-neck problem. Many processing elements with no means to feed them are useless. In my design I am using reconfigurability and flexibility to gain energy efficiency (my architectural range allows 111GFLOPs/W in some configurations).
  I am also concentrating on dense linalg kernels, as they are a perfect challenge in variable computation:data ratio, varied and complex memory access patterns and regularity.
  In my approach, I am of the opinion that forcing an application mapping to a given architecture via a compiler is inefficient. Instead, I am exploiting architectural flexibility gained from coarse-grained reconfigurable structures to adapt the architecture to an optimal ASAP/ALAP scheduling, thus constructing the perfect architecture to match an optimal mapping. Basically, keeping all processing elements busy all the time is the goal, leading to huge energy gains.
  The way this is done is a bit weird, as my architecture has a function set as opposed to an instruction set, which is custom-definable and run-time reconfigurable to suit an application. The construction of the function set is done by composing elementary hardware functions based on meaning, a concept close to functional programming concepts from John Backus. Programming is meaning-based, efficiently constructing required functions and bringing them out to assembly.
  Several kernels have been done this way, and programming stays easy via this functional reconfiguration (the longest so far being TRSM, with 112 assembly lines). I reached 21-25GFLOPs/W on 65nm tech pre-layout for 10 BLAS1-3 kernels.
  I am now finishing up a 3D VIA-last physical layout in 40nm tech which has already doubled my energy efficiency. (Why 3D? That's another story -- I think that division of computation, memory access and communication (intra-kernel data movement, sharing, broadcasting) needs custom hardware structures optimized for these tasks, which can be parallelized. That is then native for 3D silicon -- each class on its own die.) I will be reading your papers ASAP to see how you deal with the von Neumann bottle-neck :)
  Cheers, Zoltan
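For anyone unfamiliar with the ASAP/ALAP scheduling mentioned above, this is a minimal sketch of the textbook version: schedule each operation as soon as possible, then as late as possible, and look at the difference. It assumes every operation takes one time step and unlimited functional units, and the dependency graph is a made-up example. It says nothing about how the reconfigurable architecture described in the comment actually performs its mapping.

```python
def asap(deps):
    """Earliest step per op: one after its latest-finishing predecessor."""
    step = {}

    def visit(op):
        if op not in step:
            step[op] = 1 + max((visit(d) for d in deps[op]), default=-1)
        return step[op]

    for op in deps:
        visit(op)
    return step


def alap(deps, length):
    """Latest step per op that still lets all successors finish by step length - 1."""
    succs = {op: [] for op in deps}
    for op, preds in deps.items():
        for pred in preds:
            succs[pred].append(op)
    step = {}

    def visit(op):
        if op not in step:
            step[op] = min((visit(s) for s in succs[op]), default=length) - 1
        return step[op]

    for op in deps:
        visit(op)
    return step


# Hypothetical dataflow graph for (a * b) + c; each op lists its predecessors.
deps = {
    "load_a": [],
    "load_b": [],
    "mul": ["load_a", "load_b"],
    "load_c": [],
    "add": ["mul", "load_c"],
}
early = asap(deps)
late = alap(deps, length=max(early.values()) + 1)
slack = {op: late[op] - early[op] for op in deps}
```

Operations with zero slack sit on the critical path. Here only `load_c` can be moved without lengthening the schedule, and that freedom is what a scheduler (or a reconfigurable fabric) has to play with when trying to keep every processing element busy.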
25. Re:Not sure whats more impressive... by K.+S.+Kyosuke · 2015-07-22 20:45 · Score: 1
  
  From what I can tell, his design looks like it might be comparable to GPUs in flops per watt, but with different memory abstractions that result in similar limitations. I suspect that if you write custom code for it, and have the right kind of problem, it will do significantly better than available options, but in the general case and/or with non-specialized code it won't do much better than a GPU, though it may be competitive.
  In other words, very much like Chuck Moore's Forth cores. ;-) Which is fine, there's quite a range of applications for hardware like that. Especially in the military.
  
  --
  Ezekiel 23:20
26. Re:Not sure whats more impressive... by K.+S.+Kyosuke · 2015-07-22 20:48 · Score: 1, Funny
  
  and stack machines are notorious for having HORRIBLE support for languages like C
  Which is what makes them so awesome. It's like a door that filters out undesirable drunken retards before they even enter your house.
  
  --
  Ezekiel 23:20
27. Re:Not sure whats more impressive... by K.+S.+Kyosuke · 2015-07-22 20:49 · Score: 4, Funny
  
  In a SIMD or a MIMD fashion?
  
  --
  Ezekiel 23:20
28. Re:Not sure whats more impressive... by K.+S.+Kyosuke · 2015-07-22 20:53 · Score: 1
  
  This man asks the right questions!
  
  --
  Ezekiel 23:20
29. Re:Not sure whats more impressive... by TheRaven64 · 2015-07-22 20:58 · Score: 2
  
  When it comes to being better than a GPU for applications, you have to remember GPUs have abysmal memory bandwidth (due to being limited by PCIe's 16GB/s to the CPU)
  That's a somewhat odd claim. One of the reasons that computations on GPUs are fast is that they have high memory bandwidth. Being hampered by using the same DRAM as the CPU is one of the reasons that integrated GPUs perform worse. If you're writing GPU code that's doing anything other than initial setup over PCIe, then you're doing it badly wrong.
  That said, GPU memory controllers tend to be highly specialised. The nVidia ones have around 20 different streaming modes for different access patterns (I think the new version has a programmable prefetcher - Intel is also adding one), but if your memory access patterns are data dependent then GPUs can suck.
  
  after you run out of data in the relatively small memory on the GPU
  Not really. If you're doing big workloads on a GPU, your overflow isn't main memory over PCIe, it's the next GPU along over a much faster interconnect. And even with PCIe, most of the latency comes from the protocol and not the physical interconnect - you can get a lot more speed out of the PCIe hardware if you don't need all of the features of the PCIe bus.
  
  The DARPA grant is specifically for continued research and work on our development tools, which are intended to automate the unique features of our memory system. We have some papers in the works and will be talking publicly about our *very* cool software in the next couple of months.
  
  Where have you sent them? I'll keep an eye out.
  
  Your mention of the Mill and running existing code well, I had a pretty good laugh
  You certainly wouldn't be alone there.
  
  stack machines are notorious for having HORRIBLE support for languages like C
  That's not really true (not sure what the relevance to The Mill is though - it's not a stack machine). Algol support for stack machines became pretty good (C wasn't really popular until stack machines had largely died out, but the back end of a C compiler is not that different from the back end of an Algol compiler). The reason that stack machines died is that it's basically impossible for the hardware to extract ILP from a stack ISA. That's less of an issue if your throughput comes from thread-level parallelism. There are some experimental architectures floating around that get very good i-cache usage and solid performance from a stack-based ISA and a massive number of hardware threads.
  
  --
  I am TheRaven on Soylent News
30. Re:Not sure whats more impressive... by K.+S.+Kyosuke · 2015-07-22 20:59 · Score: 1
  
  In either case, it sounds like a hell of a challenge, as (if I understand correctly) you'd presumably need to pre-evaluate logic flow and track how resources are accessed in order to embed the proper data cache hints. However, those sort of access patterns can change depending on the state of program data, even within the same code. Or, I suppose you could "tune" the program for optimal execution through multiple pass evaluation runs, embedding data access hints in a second pass after seeing the results of a "profiling" run.
  Sounds like LuaJIT on steroids.
  
  --
  Ezekiel 23:20
31. Re:Not sure whats more impressive... by TheRaven64 · 2015-07-22 21:07 · Score: 1
  
  I'm hoping that there's a million missing there. Are you just planning on selling IP cores? When I talked to a former Intel Chief Architect a few years ago (hmm, about 10 years ago now), he was looking at creating a startup and figured that $60m was about the absolute minimum to bring something to market. From talking to colleagues on the lowRISC project and at ARM, $1-2m is just enough to produce a prototype on a modern process, but won't get you close to mass production. Do you plan on raising more money or partnering with someone else for production?
  
  --
  I am TheRaven on Soylent News
32. Re:Not sure whats more impressive... by BitZtream · 2015-07-22 22:33 · Score: 1
  
  you have to remember GPUs have abysmal memory bandwidth (due to being limited by PCIe's 16GB/s to the CPU)
  Uhm, no. GPUs have massive bandwidth to THEIR memory. You're talking about lower speeds to memory of A DIFFERENT PROCESSOR, so essentially you're trying to compare using the PCI bus as a network and your direct memory access. These are two different things. GPUs can have far more memory than the systems they are attached to and nVidia has certainly used this as a selling point for their GPGPU stuff. If your GPU is using system memory over the PCI bus, you fucked up your hardware purchase. When you think of the PCI bus as a network bus, which is what it's used for in the instance you're referring to, then it's about as fast as you can find within several orders of magnitude of cost. The problem with GPUs is branching, not memory.
  
  3. Programming is the biggest difficulty, and will make or break our company and processor. The DARPA grant is specifically for continued research and work on our development tools, which are intended to automate the unique features of our memory system. We have some papers in the works and will be talking publicly about our *very* cool software in the next couple of months.
  
  You've never heard of Itanium, have you? Unless you can change this so that your model fits the existing developer base, you're screwed. Seriously, go read up on Itanium, trust me, you'll realize that your best bet is to figure out how to blow as much of the money as you can on vacations and fun things before it disappears.
  
  and stack machines are notorious for having HORRIBLE support for languages like C.
  So basically everything anyone that matters understands about software dev ... won't work on your machine. Well, that's handy, at least you'll have an excuse as to why no one can make useful software for your system.
  
  have been around for over 10 years and have given nothing but talks with powerpoints (though they clearly are very intelligent and have an interesting architecture) says a lot about their future viability. I hate to be a downer like that
  That's okay, you're just too young and inexperienced to realize you'll be lucky if you last that long.
  
  --
  Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
33. Re:Not sure whats more impressive... by K.+S.+Kyosuke · 2015-07-23 00:15 · Score: 1
  
  That's less of an issue if your throughput comes from thread-level parallelism. There are some experimental architectures floating around that get very good i-cache usage and solid performance from a stack-based ISA and a massive number of hardware threads.
  The other day, I had the mildly insane idea that perhaps our abilities to explore the architectural space are limited by all existing architectures having been painstakingly handcrafted. Thus, if it were possible somehow to parametrically generate an architecture, and then synthesize a code generator backend for a compiler and a suitable hardware implementation, we might be able to discover some hidden gems in the largely unexplored universe of machine architectures. But it sounds like a pipe dream to me...
  
  --
  Ezekiel 23:20
34. Re:Not sure whats more impressive... by trsohmers · 2015-07-23 02:10 · Score: 1
  
  Tensilica (now owned by Cadence) does a shitty version of this already, which generates (restricted in scope) RTL and a compiler backend from a description of what you want. Synopsys also has a smaller version that does that in a larger scope than Tensilica, using some high level synthesis (which I think is basically pseudoscience when it comes to hardware design) along with SystemVerilog stuff. We actually prototyped the hardware generation part of what you are saying, and it works pretty decently without the same limitations imposed by Tensilica, but we have shelved the backend generation part of it to focus on our "real" work.
35. Re:Not sure whats more impressive... by tehcyder · 2015-07-23 03:26 · Score: 1
  
  I'll go ahead guess that he's really into and good at microprocessor design
  What, a 19 year old who's designed his own supercomputer chip and received $100K DARPA funding? You're really going out on a limb there.
  
  --
  To have a right to do a thing is not at all the same as to be right in doing it
36. Re:Not sure whats more impressive... by interval1066 · 2015-07-23 04:26 · Score: 1
  
  Thomas; your web site mentions LLVM as your development language environment. Can I assume C/C++ is the main language? Is your tool chain highly customized? Would seem to be necessary to support a highly parallel machine with a new architecture. Are there any details on your tool chain you might be willing to share, or do I need to apply, get hired, and then find out?
  
  --
  Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
37. Re:Not sure whats more impressive... by trsohmers · 2015-07-23 04:40 · Score: 1
  
  It is a lot easier to talk in person or phone/skype/email over the specifics, but what I can say right now is that you are correct in your base assumption, but I would not say it is highly customized... The base of it borrows a number of LLVM improvements targeted for VLIW systems over the past couple of years (which work even better for us as we are a much more simplified/relaxed VLIW), and extends functionality, but most of our custom work is meant to extend beyond the base backend. Technically, if you have a program that compiles with Clang, it will run fine on a single one of our cores... to be able to do the more fancy stuff, you'll need to use our additional tools (which we hope to contribute back to the community in some form in the future). Shoot me an email (thomas at rexcomputing.com), and I'd be happy to answer any questions, even before a real job application.
38. Re:Not sure whats more impressive... by ExekielS · 2015-07-23 05:02 · Score: 1
  
  Since your chips are so small, have you considered moving to BJT TTL circuits to ramp up frequency to the 50-100 GHz range while using lower power to improve your GFLOP/W rating?
  
  --
  ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn
39. Re:Not sure whats more impressive... by KGIII · 2015-07-23 09:16 · Score: 1
  
  You may be too busy but are you accepting private VC funding and, if so, whom would one contact to do so? I might have a soft-spot for MIT alumni.
  
  --
  "So long and thanks for all the fish."
40. Re:Not sure whats more impressive... by samwichse · 2015-07-24 01:49 · Score: 1
  
  This is probably the best reply I've ever read by the subject of an article posted to Slashdot.
Re:Half an hour, two comments by Narcocide · 2015-07-22 13:05 · Score: 1

I'd say that's a fair sign of success. Despite the sense of jealousy, nobody can think of anything bad to say
Only $100k? by afidel · 2015-07-22 13:18 · Score: 4, Informative

That doesn't go very far in the microprocessor world. I worked for Cisco back in the early 00's and even back then tape out costs were approaching $1M for a 5 layer mask, today with sub-wavelength masks and chips using 12+ layers it must be tremendously expensive to spin a chip.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
1. Re:Only $100k? by Anonymous Coward · 2015-07-22 13:32 · Score: 1
  
  If you read the article, it says they raised $1.25 million in addition to the DARPA SBIR, which is just for software, and goes into the costs involved. For getting prototypes, it says they only need a couple hundred thousand dollars. I bet they are going to raise their next round after they have prototypes.
2. Re:Only $100k? by TwentyCharsIsNotEnou · 2015-07-22 18:51 · Score: 1
  
  You can reduce the cost dramatically for a prototype - it's called a "shuttle run", where you share the mask costs with a group of other companies who put their chips on the same wafers. You can't go into mass production with this of course.
  
  Plus, they have a lot more than $100k in total.
3. Re:Only $100k? by TheRaven64 · 2015-07-22 21:31 · Score: 1
  
  A couple of hundred thousand dollars will get you a prototype, not prototypes - experienced chip designers sometimes get something that works first time (and are deservedly incredibly smug about it). More commonly, you go through at least 2-3 iterations.
  
  --
  I am TheRaven on Soylent News
4. Re:Only $100k? by Fnord666 · 2015-07-23 04:41 · Score: 1
  
  That doesn't go very far in the microprocessor world. I worked for Cisco back in the early 00's and even back then tape out costs were approaching $1M for a 5 layer mask, today with sub-wavelength masks and chips using 12+ layers it must be tremendously expensive to spin a chip.
  That's just DARPA's award. He mentioned another $1.2M or so in VC funding in a different comment.
  
  --
  'The tyrant will always find pretext for his tyranny.' - Aesop's Fables
Re:By Neruos by trsohmers · 2015-07-22 14:29 · Score: 4, Interesting

We actually have very good reasons to say why this is a very different kind of VLIW, and have found the reason why other VLIW chips have had such static scheduling issues. Hope we can convince you and everyone else soon enough.
VLSI is hard by hlee · 2015-07-22 14:33 · Score: 1

The final project of this VLSI elective course I took required each team to build three logical modules that would work together. I was responsible for the control and integration portion bringing together all the logical modules. I spent an entire sleepless night sorting out the issues. Our team was the only one that had a functioning chip (simulated) in the end. The lecturer wasn't surprised - most chips of any reasonable complexity require A LOT of painstaking (e.g. efficient routing, interference) work to get them working - often requiring certain modules to be pulled apart (or redesigned) so they integrate better with others.
1. Re:VLSI is hard by Anonymous Coward · 2015-07-22 16:04 · Score: 2, Funny
  
  An ENTIRE sleepless night? Wow. Sounds TOUGH. —said no MIT grad ever.
Re:Half an hour, two comments by trsohmers · 2015-07-22 14:43 · Score: 5, Informative

Uhm, it ranges, but I'd say I can get a snickers bar for around a buck in most vending machines. And there are also plenty of people smarter than me, even in this very small niche that I am in.
19 by PopeRatzo · 2015-07-22 14:51 · Score: 4, Funny

When I was 19, my main achievement was building a bong out of a milk jug.

--
You are welcome on my lawn.
1. Re:19 by Anonymous Coward · 2015-07-22 15:23 · Score: 1
  
  Which is less economically wasteful than flushing $1.25 million down the drain on yet another VLIW chip that is claimed to change the world, etc.
2. Re:19 by LordWabbit2 · 2015-07-22 20:28 · Score: 1
  
  But I bet it was a nice bong, that brought joy to many.
  
  --
  There are three kinds of falsehood: the first is a 'fib,' the second is a downright lie, and the third is statistics.
3. Re:19 by maestroX · 2015-07-22 21:53 · Score: 1
  
  obviously, with mindblowing results
4. Re:19 by PopeRatzo · 2015-07-22 23:11 · Score: 1
  
  But I bet it was a nice bong, that brought joy to many.
  Oh, it was a sweet bong. I got a contract from DARPA to make more, but I encountered some problems in the manufacturing stage because I was too busy watching Cartoon Network.
  
  --
  You are welcome on my lawn.
5. Re:19 by david_thornley · 2015-07-23 06:06 · Score: 1
  
  True. I can get plenty of chips for not much money at the grocery store, and I'd guess they're tastier than the ones trsohmers is working on.
  
  --
  "When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
Re: By Neruos by loufoque · 2015-07-22 15:30 · Score: 1

From what I've been able to read it doesn't look that different from other projects like Tilera or Kalray MPPA.
Re: By Neruos by trsohmers · 2015-07-22 15:54 · Score: 5, Informative

The biggest thing is what we have tried to emphasize, which is the fact that we have an entirely different memory system that does away with the hardware managed cache hierarchy. The rest of the really interesting stuff we have not publicly disclosed (yet), but I can tell you that it is very different from both Kalray and Tilera.
Parallella... by Dr+Black+Adder · 2015-07-22 16:20 · Score: 1

Parallella...
I'm a pro in the field. This doesn't scan. by Brannon · 2015-07-22 16:34 · Score: 4, Interesting

Please explain to me simply how you get 10x in compute efficiency over GPUs--these chips are already fairly optimal at general purpose flops per watt because they run at low voltage and fill up the die with arithmetic.
GPUs have excellent memory bandwidth to their video RAM (GDDR*), they have poor IO latency & bandwidth (PCIe limited) which is the main reason they don't scale well.
We've heard the VLIW "we just need better compilers" line several times before.
Thus far this sounds like a truly excellent high school science fair project, or a slightly above average college engineering project. It is miles away from passing an industrial smell test.
1. Re:I'm a pro in the field. This doesn't scan. by Anonymous Coward · 2015-07-22 18:47 · Score: 1
  
  It's not hard to get 10x if you make something that is essentially unprogrammable and incompatible with all existing software. (Think DSPs without the vendor libraries, stream computing, FPGAs, etc.) Since so much of the cost of these systems is in software development incompatibility is pretty much a dead-end for anything that isn't military-specific.
Re:The 19 year old is a lunatic by goose-incarnated · 2015-07-22 17:59 · Score: 1

"Virtual Memory translation and paging are two of the worst decisions in computing history"
"Introduction of hardware managed caching is what I consider 'The beginning of the end'"
---
These comments belie a fairly child-like understanding of computer architecture.
He's young, and he displays much more talent than people twice his age. What's your problem anyway?

--
I'm a minority race. Save your vitriol for white people.
Re: By Neruos by goose-incarnated · 2015-07-22 18:21 · Score: 1

The biggest thing is what we have tried to emphasize, which is the fact that we have an entirely different memory system that does away with the hardware managed cache hierarchy. The rest of the really interesting stuff we have not publicly disclosed (yet), but I can tell you that it is very different from both Kalray and Tilera.
You might have answered this already but I'm not very good at reading walls-o-text, so apologies if this is a repeat: The hardware managed cache design for chips is popular for a reason - it provides a speed boost. If you remove this what kind of process do you propose to replace it with? (Unless you have a design that makes a hardware managed cache redundant. What do you do then? Have software manage the cache?)

--
I'm a minority race. Save your vitriol for white people.
Futurama quote of the day. by Anonymous Coward · 2015-07-22 18:31 · Score: 1

"Why is there yogurt in this hat?" "I can explain that. It used to be milk, and well, time makes fools of us all."
I truly hope this approach pans out and advances chip design, but if it doesn't, it will be another publicly available learning tool for the next small team to learn from. It's easy to say that it won't work and that it is going down the same path as previous attempts, but they might have something that does work and is worth a lot of money. If you don't like it, don't invest. If you think it has potential then pony up your own $100k and see where this goes. Either way a group of really smart people get to do some really cool shit, and as long as they don't get burned out or jaded by the online community, they all will be able to either continue on a successful project or regroup and tackle a new one. The whole world needs as many intelligent, ambitious, dreamers as possible...no matter what their inferred promiscuity / penis size is.
I, for one, assume any 19 year old willing to risk $1.25 mil can probably also pull a sizeable dong out of his pants during a funding presentation if needed.
Godspeed You! Black Emperor.
-jeff-
Re:The 19 year old is a lunatic by Anonymous Coward · 2015-07-22 18:53 · Score: 1

Talent is not the same thing as experience. Being able to do something does not mean it is a good idea to do it. So many people have tried this approach to improving efficiency (MIT RAW, Stanford Imagine, Stanford Smart Memories) and have run into such serious problems (compilers, libraries, eco system, system-level support) that unless he has solutions for those problems, starting it again is not a smart idea.
All the best by tanveer1979 · 2015-07-22 18:55 · Score: 1

As somebody in the VLSI field, I am happy that somebody broke out of the monopoly/duopoly of the established players. We are moving towards "single/double" vendor for everything from mobiles to laptop processors to desktop processors. Having little choice also harms progress.
The other thing which excites me is that you are going towards a completely new architecture. This is what innovation is about!
Hopefully, your success will inspire others also to take the plunge.

--
My Aurora : http://www.youtube.com/watch?v=o91ZsGwJYyg
FB : https://www.facebook.com/TanveersPhotography
(old fart)been tried before(/old fart) by Melkhior · 2015-07-22 19:01 · Score: 4, Insightful

Cue this old joke...
- How many hardware engineers does it take to change a light bulb?
- None, we'll fix it in software.

Doing stuff in software to make hardware easier has been tried before (and before this kid was born, perhaps why he thinks this is new). It failed. Transputer, i960, i432, Itanium, MTA, Cell, a slew of others I don't remember...

As for the grid, nice, but not exactly new. Tilera, Adapteva, KalRay, ...
1. Re:(old fart)been tried before(/old fart) by thinkwaitfast · 2015-07-23 04:26 · Score: 1
  
  Yeah, yawn. Just last week I was bored and looked up transputer on ebay. You can still buy them. I think their programming language was OCCAM, no? I wish I had all the tools available to kids these days, but a 16k computer took two years of savings.
Re:The 19 year old is a lunatic by goose-incarnated · 2015-07-22 20:07 · Score: 1

Talent is not the same thing as experience.
I'm in agreement - experience counts for a lot when doing something new.

Being able to do something does not mean it is a good idea to do it.
I'm in agreement with this as well.

So many people have tried this approach to improving efficiency (MIT RAW, Stanford Imagine, Stanford Smart Memories) and have run into such serious problems (compilers, libraries, eco system, system-level support) that unless he has solutions for those problems, starting it again is not a smart idea.
It is highly unlikely that this will go anywhere (so, yeah - agreement again)... BUT... he is displaying a great deal of talent for his age. The lessons he learns from this failure[1] will be more valuable than the lessons learned in succeeding at a less difficult task.
As I understand it, he proposes removing the hardware cache and instead using the compiler to prefetch values from memory. He says the hardware cache logic gates add 40% overhead to every memory fetch. Whether he can actually produce a compiler that will insert the necessary memory fetch instructions at compile time in an efficient manner remains to be seen, but it is still a worthwhile endeavour for a 19 year old.
[1] Worst case scenario. He might succeed after all.

--
I'm a minority race. Save your vitriol for white people.
Re:The 19 year old is a lunatic by K.+S.+Kyosuke · 2015-07-22 21:07 · Score: 1

You may be underestimating the level of effort to which some people will go to get at the performance. Right now they're running nVidia cards, for pete's sake. Show a way and they will come. After all, that "full ecosystem of tools and vendors" can be ultimately achieved with an open development model.

--
Ezekiel 23:20
Re:The 19 year old is a lunatic by gnupun · 2015-07-22 21:08 · Score: 1

"Virtual Memory translation and paging are two of the worst decisions in computing history"
In the old days and even with current CPUs, one CPU can run multiple processes. But if CPUs were small enough and cheap enough, one program would run on multiple CPUs. Why would you need memory protection (virtual memory translation) if only a small portion of one program is running on one CPU? Answer: you don't.
So TL;DR, he could be right, but only for systems with a huge number of weak/limited CPUs.
Re:The 19 year old is a lunatic by TheRaven64 · 2015-07-22 21:28 · Score: 1

"Virtual Memory translation and paging are two of the worst decisions in computing history"
He's not completely wrong there. Paging is nice for operating systems isolating processes and for enabling swapping, but it's horrible to implement in hardware and it's not very useful for userland software. Conflating translation with protection means that the OS has to be on the fast path for any userland changes and means that the protection granule and translation granule have to be the same size. The TLB needs to be an associative structure that can return results in a single cycle, which makes it hard to scale. Larger pages help (though then you make the protection granule even larger), but the amount of physical memory that the TLB can cover has dropped with each successive generation since paging was first introduced into microprocessors.

"Introduction of hardware managed caching is what I consider 'The beginning of the end'"
I don't completely agree with this, but given the amount of effort that people writing high-performance code (and compilers) have to spend understanding the hardware caching policy and working around it, I'm not completely convinced that it's a win in the HPC arena - you end up spending almost as much time fighting the cache as you would working with a hardware scratchpad. I'm still a fan of single-level stores as a programmer abstraction though.

--
I am TheRaven on Soylent News
Re:The 19 year old is a lunatic by TheRaven64 · 2015-07-22 21:30 · Score: 1

Whether he can actually produce a compiler that will insert the necessary memory fetch instructions at compile time in an efficient manner remains to be seen
That's not the hard bit of the problem. Compiler-aided prefetching is fairly well understood. The problem is the eviction. Having a good policy for when data won't be referenced in the future is hard. A simple round-robin policy on cache lines works okay, but part of the reason that modern caches are complex is that they try to have more clever eviction strategies. Even then, most of the die usage by caches is the SRAM cells - the controller logic is tiny in comparison.
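
(To make the eviction point concrete, here is a minimal sketch comparing a round-robin policy with LRU. It assumes a tiny fully associative cache, where round-robin replacement reduces to evicting lines in fill order; the function names and the access trace are made up for illustration and do not model any real cache.)

```python
from collections import OrderedDict, deque

def hits_round_robin(trace, capacity):
    """Count hits when lines are evicted in the order they were filled."""
    resident, fill_order, hits = set(), deque(), 0
    for line in trace:
        if line in resident:
            hits += 1
            continue
        if len(resident) == capacity:
            resident.discard(fill_order.popleft())  # evict the oldest fill
        resident.add(line)
        fill_order.append(line)
    return hits

def hits_lru(trace, capacity):
    """Count hits when the least recently used line is evicted."""
    resident, hits = OrderedDict(), 0
    for line in trace:
        if line in resident:
            hits += 1
            resident.move_to_end(line)        # mark as most recently used
        else:
            if len(resident) == capacity:
                resident.popitem(last=False)  # evict the least recently used
            resident[line] = True
    return hits

# Hypothetical trace: line "a" is hot, the others stream through once.
trace = ["a", "b", "a", "c", "a", "d", "a", "e"]
rr_hits = hits_round_robin(trace, capacity=2)
lru_hits = hits_lru(trace, capacity=2)
```

On this trace the round-robin policy throws the hot line out while it is still in use and LRU keeps it, which is the kind of gap the cleverer policies are chasing.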

--
I am TheRaven on Soylent News
Re:The 19 year old is a lunatic by goose-incarnated · 2015-07-22 22:10 · Score: 1

Whether he can actually produce a compiler that will insert the necessary memory fetch instructions at compile time in an efficient manner remains to be seen
That's not the hard bit of the problem. Compiler-aided prefetching is fairly well understood.
I honestly thought that was the difficult part; it's halting-problem hard, if I understand correctly. If you cannot predict whether a program will ever reach the end-state, then you cannot predict if it will ever reach *any* particular state. To know whether to prefetch something requires you to have knowledge about the program's future state.
To my knowledge prediction of program state only works if you're predicting a *very* short time in the future (say, no more than a hundred instructions). If you're limited to that then the best you can do is branch prediction or similar (only a few hundred instructions?). This is why the cache helps - if you use something then the probability is high you will use it again soon. Compilers can then take limited advantage of this by locality of variables/instructions.

The problem is the eviction. Having a good policy for when data won't be referenced in the future is hard.
It's the negation of the problem of deciding what *will* be needed in some future state. This makes it just as hard (halting-problem hard) as deciding what to prefetch. For both problems it appears to me that computer science has already settled on "no solution" as the answer to the question "can we predict the program's future state?". Undecidable is undecidable, no matter how much engineering talent is thrown at it; an exact general solution remains mathematically impossible. Hence, I figure that what this kid has got is some great new mitigation scheme for program state prediction. That, or maybe he skipped the automata theory classes (I see that a lot with engineers-turned-programmers).
(I think - feel free to correct my understanding).

--
I'm a minority race. Save your vitriol for white people.
Re:The 19 year old is a lunatic by tomhath · 2015-07-22 23:18 · Score: 1

His comments indicate vision. Decades ago it was necessary to have caching and virtual memory, but with modern chip design he sees that it's no longer needed; instead of trying to fix yesterday's problem with yesterday's solution let's move on to solving the problem as if there was never a need for caching and virtual memory in the first place.
Re:The 19 year old is a lunatic by TheRaven64 · 2015-07-23 00:09 · Score: 2

Prefetching in the general case is non-computable, but a lot of accesses are predictable. If the stack is in the scratchpad, then you're really only looking at heap accesses and globals for prefetching. Globals are easy to statically hint and heap variables are accessed by pointers that are reachable. It's fairly easy for each function that you might call to emit a prefetch version that doesn't do any calculation and just loads the data, then insert a call to that earlier. You don't have to get it right all of the time, you just have to get it right often enough that it's a benefit.
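
(A toy sketch of the "prefetch version of a function" idea from the paragraph above. `ToyMemory`, `sum_list` and `sum_list_prefetch` are hypothetical names, and the memory model is an assumption: it only tracks which addresses are resident and counts demand misses. It says nothing about timing, and nothing about how a real compiler would generate or schedule the twin.)

```python
class ToyMemory:
    """Hypothetical memory: maps addr -> (value, next_addr) and counts demand misses."""
    def __init__(self, data):
        self.data = data
        self.resident = set()
        self.demand_misses = 0

    def prefetch(self, addr):
        # Real hardware would do this asynchronously; here it just marks residency.
        self.resident.add(addr)
        return self.data[addr]

    def load(self, addr):
        if addr not in self.resident:
            self.demand_misses += 1
            self.resident.add(addr)
        return self.data[addr]

def sum_list(mem, head):
    """The 'real' function: walks a linked list and adds up the values."""
    total, addr = 0, head
    while addr is not None:
        value, addr = mem.load(addr)
        total += value
    return total

def sum_list_prefetch(mem, head):
    """Prefetch twin: same pointer walk, no calculation, just touches the data."""
    addr = head
    while addr is not None:
        _, addr = mem.prefetch(addr)

# Hypothetical three-node list at addresses 0, 8 and 16.
nodes = {0: (1, 8), 8: (2, 16), 16: (3, None)}

cold = ToyMemory(nodes)
cold_total = sum_list(cold, 0)

warm = ToyMemory(nodes)
sum_list_prefetch(warm, 0)       # the call inserted "earlier"
warm_total = sum_list(warm, 0)
```

Both runs compute the same sum; the difference is that the warmed run takes no demand misses in this toy model, which is all the hint has to achieve often enough to pay for itself.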
For prefetching vs eviction, it's a question of window size. Even with no prefetching, most programs exhibit a lot of locality of reference and so caches work pretty well without prefetching - it doesn't matter that you take a miss on the first access, because you hit on the next few dozen (and in a multithreaded chip, you just let another thread run while you wait), but if you're evicting data too early then it's a problem. A combination of LRU / LFU works well, though all of the good algorithms in this space are patented. Although issuing prefetch hints is fairly easy, the reason that most compilers don't is that there's a good chance of accidentally pushing something else out of the cache. That said, if they're targeting HPC workloads, then just running them in a trace and then using that for hinting would probably be enough for a lot of things.
I heard a nice anecdote from some friends at Apple a while ago. They found that one of their core frameworks was getting a significant slowdown on their newer chip. The eventual cause was quite surprising. In the old version, they had a branch being mispredicted, and a load speculatively executed. The correct branch target was identified quite early, so they only had a few cancelled instructions in the pipeline. About a hundred cycles later, they hit the same instruction and this time ran it correctly. With the new CPU, the initial branch was correctly predicted. This time, when they hit the load for real, it hadn't been speculatively executed and so they had to wait for a cache miss.
Also, if you're trying to create a parallel system with manual caches... good luck. Cache coherency is a pain to get right, but it's then fundamental to most modern parallel software. Implementing the shootdowns in software is going to give you a programming model that's horrible.
And finally there's the problem that doing it in software makes it serial. The main reason that we use hardware page-table walkers in modern CPUs is not that they're much better than a software TLB fill, it's that it's much easier to make them run completely asynchronously with the main pipeline. The same applies to caches.

--
I am TheRaven on Soylent News
Re:The 19 year old is a lunatic by trsohmers · 2015-07-23 01:21 · Score: 2

One of the things that doesn't seem to be getting through in most of the media articles is how our memory system is actually set up. I'll try to describe it briefly here, starting from the single core.

At a single core, we have a 128KB multibanked scratchpad memory, which you can think of as just like an L1 cache but smaller and lower latency. We have one cycle latency for a load/store from your registers to or from the scratchpad, and the same latency from the scratchpad to/from the core's router. That scratchpad is physically addressed, and does not have a bunch of extra (and in our opinion, wasted) logic to handle address translations, which just take up a lot of area and power (especially once you multiply it over hundreds of cores and large SRAMs. Most people think the TLB logic is a fixed size for any size SRAM, but it is not, and it gets significantly worse if you add coherency). Remember, even if you have a L1 cache (Typically 16 to 32KB, tops) hit on an Intel chip, it still takes 4 whole cycles.

Once we get to having a 16x16 grid (256 cores) as part of our Network on Chip, we have a total of 32MB of on chip 1 cycle latency scratchpad. How we have arranged that is as a global flat address space, with all of the addresses being physically mapped. What I mean by this is that Core 0's scratchpad is the first 128K of the address space, and the address space continues on seamlessly to core 1, core 2, and all the way to core 255. If the address requested by a core is not in its own scratchpad's range, it goes to the router and hops on the NoC until it gets there... with a one cycle latency per hop. We have 32GB/s in each cardinal direction per router, giving a total on chip bandwidth of 8TB/s. Since it is all statically routed (which is a *very* important part of our entire design, which I am not revealing the full implications of just yet), we have guaranteed 1 cycle per hop latency between each router on the NoC. So even if you are going from one corner to another (core 0 to core 255) it is still a max latency of 32 cycles... still less than the latency to the L3 cache on an Intel chip.
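
(A minimal sketch of the address map described above, not Rex's actual design. The 128KB scratchpad size and the 16x16 grid come from the comment; row-major core numbering and dimension-ordered XY routing are assumptions made here so that a hop count can be computed, and the function names are made up.)

```python
SCRATCHPAD_BYTES = 128 * 1024   # per-core scratchpad size stated in the comment
GRID_DIM = 16                   # 16x16 grid, 256 cores

def owner_core(addr):
    """Core whose scratchpad holds a given flat physical address."""
    core = addr // SCRATCHPAD_BYTES
    if not 0 <= core < GRID_DIM * GRID_DIM:
        raise ValueError("address is outside the on-chip scratchpad space")
    return core

def hops(src_core, dst_core):
    """Router-to-router hops, ASSUMING row-major numbering and XY routing."""
    src_x, src_y = src_core % GRID_DIM, src_core // GRID_DIM
    dst_x, dst_y = dst_core % GRID_DIM, dst_core // GRID_DIM
    return abs(src_x - dst_x) + abs(src_y - dst_y)

total_bytes = SCRATCHPAD_BYTES * GRID_DIM * GRID_DIM   # all scratchpads together
corner_to_corner = hops(0, 255)                        # core 0 to core 255
```

Under these assumptions the scratchpads add up to 32MB and the corner-to-corner trip is 30 router hops, which is in the same range as the 32 cycle worst case quoted above once a cycle or so is allowed at each end.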

This gets to the chip to chip interconnect, which we have not been very public about, but I can say it is VERY high bandwidth (48GB/s in each direction, on all four sides of the chip, so an aggregate bandwidth of 384GB/s... compare that to 16GB/s of PCIe or even NVIDIA's 2018/2019 80GB/s plans with NVLINK). There are a lot of very cool things in that design, but I can't go into them publicly quite yet. We sacrifice distance and interoperability to get those numbers, but we think it is a worthy tradeoff for insane speed and efficiency. The other interesting thing that we are looking at (and haven't fully explored the full tradeoffs) is being able to extend our flat address space across multiple chips in a larger grid.

To wrap up, most of the problems you mentioned here and in other comments are not totally valid, as we are not trying to replicate the inefficient protocols implemented super inefficiently in hardware today. We want to eventually be able to provide the same user experience and convenience that hardware caching provides, but keeping it abstracted away from the user. Hopefully you can understand I can't go into full details of this, and you have every reason to be skeptical, but that does not mean we are not going to try to do it anyways.

Also, cool Apple story. Thanks :)
Happy to answer any other questions
Colour me impressed by ihtoit · 2015-07-23 01:34 · Score: 1

Most 19 year olds' idea of achievement is not puking up on the front doorstep after a particularly brutal night out boozing. For all you doubters: can we see how this chip performs in the wild before making judgement, please? To Thomas: will the chip ever see a retail shelf in say a personal supercomputer like the NVidia Tesla?

--
Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
Re:But it works in RTL Simulation by trsohmers · 2015-07-23 01:55 · Score: 2

1. We have already run through synthesis of a version of our core (and rough version of our chip)... There's a lot of work to be done, especially as we are in the last steps of locking down the RTL, but we are not worried about timing... we are being very conservative.

2. Already have standard cells and memory compilers. We are not amateurs.

3. We actually have solid state physics and fabrication experience, and understand the physical constraints of wire and gate delays, leakage, etc. All of those played a very large part in our architectural design, specifically so that timing closure doesn't become a huge clusterfuck.
Re: By Neruos by trsohmers · 2015-07-23 02:02 · Score: 1

Take a look at my comment here: http://news.slashdot.org/comme...
His entire premise is wrong. by Brannon · 2015-07-23 02:06 · Score: 1

The primary benefit of caches for HPC applications is *bandwidth filtering*. You can have much higher bandwidth to your cache (TB/s, pretty easily) than you can ever get to off-chip--and it is substantially lower power. It requires blocking your application to have a working set that fits in cache.
He's pulling out quotes from Cray (I used to work there) about how caches just get in the way--and they did, 30 years ago when there were very few HPC applications whose working set could fit in cache. It's a very different world nowadays.
Sometimes skipping college doesn't make you a genius, sometimes it just means you are doomed to repeat 50 years worth of mistakes in a well developed field.
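
(The "blocking" mentioned in this comment is usually done by tiling loops so that each tile is reused many times while it is still in fast memory. A minimal sketch with a tiled matrix multiply follows; the function name and the tile size are illustrative assumptions, and pure Python will not show the speedup, only the loop structure.)

```python
def matmul_blocked(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists), one tile at a time."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Work on one tile of a, b and c; with a suitable tile size
                # all three fit in cache (or a scratchpad) and get reused.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

The same restructuring is what software has to do explicitly for a scratchpad, with the tile copies issued by the program or compiler instead of by the cache hardware.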
1. Re:His entire premise is wrong. by trsohmers · 2015-07-23 02:17 · Score: 1
  
  Except we do have "caches".... just not hardware managed ones, and so we call them scratchpads. Our "L1" equivalent is 128KB per core (2 to 4x more than Intel), 1 cycle latency (compared to 4 cycle latency for Intel), and lower power (As little as 2/5ths the power usage). We do have that TB/s for every core to its local scratchpad, and our total aggregate bandwidth on the network on chip (Cores to cores) is 8TB/s.
  
  I'll throw a Seymour Cray quote here for fun... "You can't fake memory bandwidth that isn't there"... Faking memory bandwidth is the whole point of hardware managed caches. Software managed memory/scratchpads are augmenting memory bandwidth.
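
  To make the hardware-managed versus software-managed distinction concrete, here is a rough sketch of a program staging data through a fixed-size local buffer by hand. It is not Rex Computing's toolchain or API; the function name, the 8-byte element size, and the use of a Python list slice to stand in for an explicit copy are all assumptions for illustration. Only the 128KB figure comes from the comment above.

```python
SCRATCHPAD_BYTES = 128 * 1024  # per-core scratchpad size quoted in the thread
ELEM_BYTES = 8                 # assumption: 64-bit elements


def sum_of_squares_staged(dram, capacity_bytes=SCRATCHPAD_BYTES):
    """Sum x*x over `dram`, explicitly staging one chunk at a time.

    Returns (total, transfers). Each slice below stands in for a
    software-issued copy from off-chip memory into the scratchpad.
    """
    chunk = capacity_bytes // ELEM_BYTES  # elements that fit at once
    total = 0
    transfers = 0
    for start in range(0, len(dram), chunk):
        scratch = dram[start:start + chunk]  # explicit copy-in, chosen by software
        transfers += 1
        total += sum(x * x for x in scratch)
    return total, transfers
```

  With a hardware cache, the copy-in step is implicit and decided by the replacement policy; here the program (or, in the design discussed, the compiler) decides what is resident and when.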
Re:The 19 year old is a lunatic by Anonymous Coward · 2015-07-23 04:00 · Score: 1

If you never reinvent the wheel, you'll never invent the tire. I say we let him down the rabbit hole and see if he comes back with anything new.
Re:The 19 year old is a lunatic by TheRaven64 · 2015-07-23 04:20 · Score: 1

At a single core, we have a 128KB multibanked scratchpad memory, which you can think of as just like an L1 cache but smaller and lower latency. We have one cycle latency for a load/store from your registers to or from the scratchpad
Note that a single-cycle latency for L1 is not that uncommon in in-order pipelines - the Cortex A7, for example, has single-cycle access to L1.

That scratchpad is physically addressed, and does not have a bunch of extra (and in our opinion, wasted) logic to handle address translations,
The usual trick for this is to arrange your cache lines such that your L1 is virtually indexed and physically tagged, which means that you only need the TLB lookup (which can come from a micro-TLB) on the response. If you look at the cache design on the Cortex A72, it does a few more tricks that let you get roughly the same power as a direct-mapped L1 (which has very similar power to a scratchpad) from an associative L1.
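
A small sketch of why the virtually-indexed, physically-tagged arrangement works: if each cache way is no larger than a page, the set index is taken entirely from page-offset bits, which are identical in the virtual and physical address. The 4KB page, 64-byte line, and the 32KB 8-way geometry below are assumptions for illustration, not figures for any particular core.

```python
def vipt_alias_free(cache_bytes, ways, page_bytes=4096):
    """True if one way fits in a page, so index bits lie within the page offset."""
    return cache_bytes // ways <= page_bytes


def set_index(addr, cache_bytes, ways, line_bytes=64):
    """Set index selected by an address for the given cache geometry."""
    sets = cache_bytes // (ways * line_bytes)
    return (addr // line_bytes) % sets


# A virtual and a physical address that share the same low 12 bits (0x345)
# select the same set in a 32KB, 8-way cache, so indexing can start before
# the TLB has produced the physical address needed for the tag compare.
virtual_addr = 0x12345
physical_addr = 0xABC345
same_set = (set_index(virtual_addr, 32 * 1024, 8)
            == set_index(physical_addr, 32 * 1024, 8))
```

Under these assumptions `vipt_alias_free(32 * 1024, 8)` holds, while a 128KB 8-way cache with 4KB pages would not satisfy the condition.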

If the address requested by a core is not in its own scratchpad's range, it goes to the router and hops on the NoC until it gets there... with a one cycle latency per hop
To get that latency, it sounds like you're using the NoC topology that some MIT folks presented at ISCA last year. I seem to remember that it was pretty easy to come up with cases that would overload their network (propagating wavefronts of messages) and end up breaking the latency guarantees. It also sounds like you're requiring physical layout awareness from your jobs, bringing NUMA scheduling problems from the OS (where they're hard) into the compiler (where they're harder).
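
To illustrate the layout-awareness point, here is a back-of-the-envelope model of remote scratchpad access. Everything about the topology is an assumption made for the sketch: a 2D mesh, row-major core numbering, dimension-ordered routing, one contiguous 128KB address range per core, and no contention. Only the 128KB size, the one-cycle local access, and the one-cycle-per-hop figure come from the quoted text.

```python
SCRATCHPAD_BYTES = 128 * 1024  # per-core scratchpad size quoted above


def owner_core(addr):
    """Core whose scratchpad holds `addr` (assumes contiguous per-core ranges)."""
    return addr // SCRATCHPAD_BYTES


def hops(src, dst, mesh_width):
    """Manhattan distance between two cores on an assumed 2D mesh."""
    sx, sy = src % mesh_width, src // mesh_width
    dx, dy = dst % mesh_width, dst // mesh_width
    return abs(sx - dx) + abs(sy - dy)


def access_latency_cycles(src_core, addr, mesh_width):
    """One-way, uncontended estimate: 1 cycle at the scratchpad plus 1 per hop."""
    return 1 + hops(src_core, owner_core(addr), mesh_width)
```

On a 4 x 4 mesh, core 0 reaches its own scratchpad in 1 cycle in this model but needs 7 cycles to reach an address owned by core 15, so where the compiler places data relative to the code that uses it changes latency by a large factor even before any congestion.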
Building a compiler for this sounds like a fun set of research problems (if you're looking for consultants, my rates are very reasonable! Though I have a different research architecture that presents interesting compiler problems to occupy most of my time).
Oh, one more quick question: Have you looked at Loki? The lowRISC project is likely to include an implementation of those ideas and it sounds as if they have a lot in common with your design (though also a number of significant differences).

--
I am TheRaven on Soylent News
Re:About $40M by ChrisMaple · 2015-07-23 07:29 · Score: 1

Proof-of-concept doesn't have to be on the latest technology, which is undeniably expensive. Do a shared-wafer (https://www.mosis.com/) on some near-obsolete technology, and when the bugs are worked out it's time for scaling.

--
Contribute to civilization: ari.aynrand.org/donate
Old-timer here by ChrisMaple · 2015-07-23 07:42 · Score: 1

Darned overloaded abbreviations. RTL has priority, means Resistor-Transistor Logic.

--
Contribute to civilization: ari.aynrand.org/donate
Pretty much everything you said is gibberish. by Brannon · 2015-07-23 14:33 · Score: 1

Congratulations for tricking someone into giving you money. Good luck with your impending disaster.