19-Year-Old's Supercomputer Chip Startup Gets DARPA Contract, Funding
An anonymous reader writes: 19-year-old Thomas Sohmers, who launched his own supercomputer chip startup back in March, has won a DARPA contract and funding for his company. Rex Computing is currently finishing up the architecture of its final verified RTL, which is expected to be completed by the end of the year. The new Neo chips will be sampled next year, before moving into full production in mid-2017. The Platform reports: "In addition to the young company’s first round of financing, Rex Computing has also secured close to $100,000 in DARPA funds. The full description can be found midway down this DARPA document under 'Programming New Computers,' and has, according to Sohmers, been instrumental as they start down the verification and early tape out process for the Neo chips. The funding is designed to target the automatic scratch pad memory tools, which, according to Sohmers is the 'difficult part and where this approach might succeed where others have failed is the static compilation analysis technology at runtime.'"
We actually have very good reasons to say why this is a very different kind of VLIW, and have found the reason why other VLIW chips have had such static scheduling issues. Hope we can convince you and everyone else soon enough.
Thanks for the response!
I should have noticed that your numbers were for double precision flops, so my numbers were way off. Thanks for the correction. I bet you are IEEE compliant too (Darn GPUs...).
Your design is intended specifically for parallel workloads with localized or clustered data access, correct? (I realize this includes most supercomputer jobs.) It sounds like you have constraints similar to those of GPUs, but if they are met properly, the performance should be much better, more efficient, and more scalable. And you expect your compilers to be able to meet these needs and statically schedule all the memory movement, which is where you get massive gains. Is that a reasonable assessment?
Your designs don't have anything to offer for old straight-line single-threaded programs, correct? It will also not work well if you can't schedule the DMA actions well enough: pointer-heavy random access code wouldn't run faster on your system than on a GPU, but it won't run fast anywhere. Is that about right?
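To make the "statically schedule all the memory movement" idea concrete, here is a toy sketch of a double-buffered DMA schedule that is fully known before the program runs. It is only an illustration of the general technique: the function names, the one-step timing model, and the tile sizes are all assumptions, and nothing here is taken from Rex Computing's actual compiler.

```python
def static_schedule(n_elems, tile):
    """Build the whole (step, op, tile_index, buffer) schedule ahead of time.

    This works because the access pattern is regular: every address is
    known from n_elems and tile alone, before any data is read.
    """
    n_tiles = -(-n_elems // tile)  # ceiling division
    sched = [(0, "dma_load", 0, 0)]  # prefetch the first tile
    for t in range(n_tiles):
        step = t + 1
        if t + 1 < n_tiles:
            # Fetch the next tile into the other buffer while tile t is computed.
            sched.append((step, "dma_load", t + 1, (t + 1) % 2))
        sched.append((step, "compute", t, t % 2))
    return sched


def run(schedule, data, tile):
    """Replay the schedule with two scratchpad buffers; sum the squares."""
    buffers = [None, None]
    total = 0
    for _step, op, t, buf in schedule:
        if op == "dma_load":
            buffers[buf] = data[t * tile:(t + 1) * tile]
        else:
            total += sum(v * v for v in buffers[buf])
    return total


data = list(range(10))
sched = static_schedule(len(data), tile=4)
result = run(sched, data, tile=4)
```

Pointer-chasing code cannot be handled this way, because the next address is only known after the previous load has returned, so there is nothing to schedule in advance.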
I'm looking forward to your papers on the compiler side; it sounds very interesting. If you get something working in that area, it could be a big deal to the supercomputer guys (that's not me though).
Personally I'm mostly interested in single-threaded throughput, process isolation, and security, which is why the Mill interests me a lot. As for their stuff taking a long time: your rate of progress and schedule are just amazing; it's not that others are slow...
I'm burning some mod points to post this under my username, but it's totally worth it. THIS is the kind of article that should be on Slashdot!
Can you elaborate on the programming structure/API you guys are envisioning for this? (It's cool if you can't, I'd understand :-D). Also, what particular types of problems are you guys targeting your chips to solve or to what areas do you envision your chips being especially well suited? Also, who do you think has done the best nitty-gritty write up about the project so far? I'd love to hear what you think is the best technical description publicly available. Can't wait to learn more as the project grows. ;-)
Although I'm not a programmer or CS person by training, I do GPGPU programming (although not BLAS-based stuff) almost exclusively for my research and enjoy it because once you understand the differences between the GPU and CPU it just becomes a question of how to best parallelize your algorithm. It'd be AMAZING to see the memory bandwidth and power usage specs you guys are working towards under a programming structure similar to what we currently see with something like CUDA or OpenCL. Any plans for something like that or am I betraying my hobbyist computing status?
Finally, if you ever need any applications testing, specifically in the medical imaging field, feel free to get in touch.
I like the idea of "reinventing the computer for performance". Trying to get rid of overhead caused by virtual memory has attracted quite a bit of attention recently, so the idea is definitely sound.
A few questions:
-Are there any more details I can read anywhere? I could not really see any details past the "slightly technical PR" on http://www.rexcomputing.com/in...
-Do you plan on presenting your work at SuperComputing?
-You mention BLAS3 kernels, so I assume you mean dense BLAS3 kernels. In what I see, people are no longer really interested in dense linear algebra. Most of the applications I see nowadays are sparse. Can your architecture deal with that?
-The chip and architecture seem to essentially be based on a 2D mesh network. Can it be extended to more dimensions? I was under the impression that it would cause high latency in physical simulation, because you cannot easily project a 3D space onto a 2D space without introducing large distance discrepancies. (Which is why BG/Q uses a 5D torus network.)
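The projection concern in the last question can be put in numbers with a small sketch. Everything below is a made-up illustration: the 16x16x16 grid, the 64x64 mesh, and the naive "slices side by side" layout are assumptions, not the topology of any real chip or of BG/Q.

```python
from itertools import product


def mesh_hops(a, b):
    """Hop count between two nodes on a mesh (no wraparound links)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def torus_hops(a, b, dims):
    """Hop count on a torus, where every dimension wraps around."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def project_3d_to_2d(x, y, z, n=16, tiles_per_row=4):
    """Lay the z-slices of an n*n*n grid side by side on a 2D mesh."""
    return (x + n * (z % tiles_per_row), y + n * (z // tiles_per_row))


def worst_neighbor_hops(n=16, tiles_per_row=4):
    """Largest 2D hop count between points that are adjacent in 3D."""
    worst = 0
    for x, y, z in product(range(n), repeat=3):
        for dx, dy, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
            nx, ny, nz = x + dx, y + dy, z + dz
            if nx < n and ny < n and nz < n:
                a = project_3d_to_2d(x, y, z, n, tiles_per_row)
                b = project_3d_to_2d(nx, ny, nz, n, tiles_per_row)
                worst = max(worst, mesh_hops(a, b))
    return worst


worst = worst_neighbor_hops()
wrap = torus_hops((0, 0, 0), (15, 0, 0), (16, 16, 16))
```

With these made-up sizes, points that are one hop apart in the 3D grid can land as far as 64 hops apart on the 64x64 mesh, while a 3D torus of the same node count keeps every such pair at one hop.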
Keep us apprised!
Cheers
Please explain to me simply how you get 10x in compute efficiency over GPUs--these chips are already fairly optimal at general purpose flops per watt because they run at low voltage and fill up the die with arithmetic.
GPUs have excellent memory bandwidth to their video RAM (GDDR*), but they have poor IO latency and bandwidth (PCIe limited), which is the main reason they don't scale well.
We've heard the VLIW "we just need better compilers" line several times before.
Thus far this sounds like a truly excellent high school science fair project, or a slightly above average college engineering project. It is miles away from passing an industrial smell test.
1. My personal favorite programming models for our sort of architecture would be PGAS/SPMD style, with the latter being the basis for OpenMP. PGAS gives a lot more power in describing and efficiently having shared memory in an application with multiple memory regions. Since every one of our cores has 128KB of our scratchpad memory, and all of those memories are part of a global flat address space, every core can access any other core's memory as if it is part of one giant contiguous memory region. That does cause some issues with memory protection, but that is a sacrifice you make for this sort of efficiency and power (but we have some plans on how to address that with software... more news on that will be in the future). The other nice programming model we see is the Actor model... so think Erlang, but potentially also some CSP-like stuff with Go in the future (and yes, I do realize they are competing models).
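For readers who want a picture of what "a global flat address space" over per-core scratchpads could look like, here is a minimal sketch. Only the 128KB figure comes from the comment above; the simple `core_id * size + offset` layout, the function names, and the `FlatMemory` class are hypothetical and say nothing about the real Neo address map.

```python
SCRATCHPAD_BYTES = 128 * 1024  # per-core scratchpad size quoted above


def to_global(core_id, offset):
    """Map (core, offset) to a flat address, assuming scratchpads are laid end to end."""
    if not 0 <= offset < SCRATCHPAD_BYTES:
        raise ValueError("offset outside the scratchpad")
    return core_id * SCRATCHPAD_BYTES + offset


def from_global(addr):
    """Split a flat address back into (core_id, offset)."""
    return divmod(addr, SCRATCHPAD_BYTES)


class FlatMemory:
    """Toy model: one bytearray per core, addressed through a single flat space."""

    def __init__(self, n_cores):
        self.pads = [bytearray(SCRATCHPAD_BYTES) for _ in range(n_cores)]

    def store(self, addr, value):
        core, off = from_global(addr)
        self.pads[core][off] = value  # no ownership check: any core may write here

    def load(self, addr):
        core, off = from_global(addr)
        return self.pads[core][off]


mem = FlatMemory(n_cores=4)
addr = to_global(3, 16)
mem.store(addr, 42)  # e.g. code running on core 0 writing into core 3's scratchpad
```

The missing ownership check in `store` is the toy version of the memory protection trade-off mentioned in the comment.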
If you want to get the latest info as it comes out, sign up for our mailing list on our website!
I found this extremely intriguing, as I am currently writing up my dissertation on high-GFLOPS/W 3-D layered reconfigurable architectures. I am also of the opinion that memory handling is the key, as it is the only way to resolve the von Neumann bottleneck problem. Many processing elements with no means to feed them are useless. In my design I am using reconfigurability and flexibility to gain energy efficiency (my architectural range allows 111GFLOPs/W in some configurations).
I am also concentrating on dense linalg kernels, as they are a perfect challenge in variable computation:data ratio, varied and complex memory access patterns and regularity.
In my approach, I am of the opinion that forcing an application mapping to a given architecture via a compiler is inefficient. Instead, I am exploiting architectural flexibility gained from coarse-grained reconfigurable structures to adapt the architecture to an optimal ASAP/ALAP scheduling, thus constructing the perfect architecture to match an optimal mapping. Basically, keeping all processing elements busy all the time is the goal, leading to huge energy gains.
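For anyone unfamiliar with the ASAP/ALAP terms, here is a minimal textbook-style sketch of both schedules on a tiny dataflow graph. It assumes every operation takes one step and uses a made-up graph; it illustrates the general scheduling idea only, not the reconfigurable architecture or tooling described in this comment.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def asap_alap(deps):
    """deps maps each node to the set of nodes it depends on.

    Returns (asap, alap) dicts of step numbers, assuming unit latency.
    """
    order = list(TopologicalSorter(deps).static_order())

    # ASAP: run each node one step after its latest predecessor.
    asap = {}
    for n in order:
        asap[n] = 1 + max((asap[p] for p in deps.get(n, ())), default=-1)
    last = max(asap.values())

    # Invert the dependency map to find successors.
    succs = {n: set() for n in order}
    for n, preds in deps.items():
        for p in preds:
            succs[p].add(n)

    # ALAP: run each node one step before its earliest successor.
    alap = {}
    for n in reversed(order):
        alap[n] = min((alap[s] for s in succs[n]), default=last + 1) - 1
    return asap, alap


# Hypothetical graph for (a*b + c*d) * k
deps = {
    "mul1": {"a", "b"},
    "mul2": {"c", "d"},
    "add": {"mul1", "mul2"},
    "scale": {"add", "k"},
}
asap, alap = asap_alap(deps)
slack = {n: alap[n] - asap[n] for n in asap}
```

Nodes with zero slack are on the critical path; here only the load of `k` has room to move, which is the kind of freedom a mapper can use to keep processing elements busy.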
The way this is done is a bit weird, as my architecture has a function set as opposed to an instruction set, which is custom-definable and run-time reconfigurable to suit an application. The construction of the function set is done by composing elementary hardware functions based on meaning, a concept close to functional programming concepts from John Backus. Programming is meaning-based, efficiently constructing required functions and bringing them out to assembly.
Several kernels have been done this way, and programming stays easy via this functional reconfiguration (the longest so far is TRSM, at 112 assembly lines). I reached 21-25GFLOPs/W on 65nm tech, pre-layout, for 10 BLAS1-3 kernels.
I am now finishing up a 3D VIA-last physical layout in 40nm tech which already doubled my energy efficiency. (Why 3D? That's another story -- I think that the division of computation, memory access and communication (intra-kernel data movement, sharing, broadcasting) needs custom hardware structures optimized for these tasks, which can be parallelized. That is then a natural fit for 3D silicon -- each class on its own die.) I will be reading your papers ASAP to see how you deal with the von Neumann bottleneck :)
Cheers, Zoltan