Princeton Researchers Announce Open Source 25-Core Processor (pcworld.com)

Posted by BeauHD on Thursday August 25, 2016 @10:00AM from the all-strung-out dept.

An anonymous reader writes: Researchers at Princeton announced at Hot Chips this week their 25-core Piton Processor. The processor was designed specifically to increase data center efficiency with novel architecture features enabling over 8,000 of these processors to be connected together to build a system with over 200,000 cores. Fabricated on IBM's 32nm process and with over 460 million transistors, Piton is one of the largest and most complex academic processors ever built. The Princeton team has opened their design up and released all of the chip source code, tests, and infrastructure as open source in the OpenPiton project, enabling others to build scalable, manycore processors with potentially thousands of cores.

Re: How does technology sanctions work with this? by johnsmithperson123 · 2016-08-25 10:11 · Score: 4, Insightful

Relax. Between the architectural basis and the relatively low performance, it's insignificant. A few hundred million transistors for a 25-core chip, in a day when your stock chip is multibillion in terms of transistor count.
Lots of cores doesn't mean shit by BitZtream · 2016-08-25 10:15 · Score: 2, Insightful

I've been hearing about massive number of cores for years ... the problem however is they are great for demonstrating that you can put a bunch of 'cores' on a chip ... not that they are actually useful for anything.
Connecting 8k of these things together? You've just proven you actually don't understand how the real world does things.
If you have 8 million cores that can add 20 super floating point numbers a second ... that's WORTHLESS because I need to do things other than add two numbers.
If you have 8k cores that can be interconnected ... that must be one awesome bus if those interconnects are useful because the congestion on that bus is going to be insane, oh ... you've got a solution to that problem? funny how that solution kills the theoretical performance
Sorry, but I've heard this stuff so many times over the years that I just get annoyed when some professor tells us about this super awesome CPU he has that is utterly fucking worthless outside of theoretical land.
And by the way, 25 cores is on the tiny side for these silly academic projects.
Blah blah blah I made this awesome processor but it only works for one tiny problem domain that can't even be used for that problem domain because of the constraints on it that allow you to make so many cores.
Not once has one of these things actually been useful in the real world, and I know that's not the point of research but the only reason you list something about so many cores is pure clickbait. No one with a clue believes you've built something useful when you make such ridiculous statements.
No, I didn't read the article. I don't have to. These papers are only about getting grant money by making ridiculous statements, not about producing anything useful and 9 times out of 8, it's done using methods that the real world (read people who actually get shit done) has already deemed don't actually work outside of academia and theory.
Yes, I'm bitter. I hate useless people wasting money that could be spent doing real things, not reiterating something Intel and AMD knew in the 80s.

--
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
1. Re:Lots of cores doesn't mean shit by rubycodez · 2016-08-25 12:55 · Score: 3, Interesting
  
  real computers solving real problems with large core counts exist, and they have non-bus architectures by the way.
  So according to you the cpus in the Sunway TaihuLight supercomputer with 256 cores per cpu don't really do anything?
  I think you don't have a background in the field to be making such pronouncements, you're spewing out of your ass
2. Re: Lots of cores doesn't mean shit by D.McG. · 2016-08-25 14:35 · Score: 3
  
  Nvidia has a wonderful 3840 core processor with a wonderful scheduler and interconnect. Two can be bridged for 7680 cores. Hmmm... Your argument of 8000 cores being a pipe dream is complete rubbish.
massive parallel processing=limited applications by wierd_w · 2016-08-25 10:16 · Score: 2

While being able to leverage that many compute units all at once is quite impressive, most tasks are still serial by nature. Computers are not clairvoyant, so they cannot know in advance what a branched logic chain will tell them to do for any arbitrary path depth, nor can they perform a computation on data that doesn't exist yet.
The benefits of more cores come from parallel execution, not from doing tasks faster. As such, most software is not going to benefit from having access to 8000 more threads.
For those that didn't read TFA, esp in regards to by Anonymous Coward · 2016-08-25 10:36 · Score: 5, Informative

the type of cores:
Some of OpenPiton® features are listed below:
OpenSPARC T1 Cores (SPARC V9)
Written in Verilog HDL
Scalable up to 1/2 Billion Cores
Large Test Suite (>8000 tests)
Single Tile FPGA (Xilinx ML605) Prototype
The bit that may put some people off:
This work was partially supported by the NSF under Grants No. CCF-1217553, CCF-1453112, and CCF-1438980, AFOSR under Grant No. FA9550-14-1-0148, and DARPA under Grants No. N66001-14-1-4040 and HR0011-13-2-0005. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.
So interesting and possibly FPGA synthesizable test processor it may be. Trustworthy computer core it may *NOT* be. (You would have to compare it to the original T1 cores, and have had those independently audited to ensure no nefarious timing attacks, etc were in place.)
Now, having said that, if this interconnect is even a fraction as good as they claim, it could make for an AWESOME libre SPARC implementation competitive with Intel/AMD for non-Wintel computing uses. Bonus for someone taping out an AM3+ socket chip (or AM4 if all the signed firmware is SoC-side and not motherboard/southbridge side) that can be initialized on a commercially available board with standard expansion hardware. AM3/3+ would offer both IGP and discrete graphics options if a chip could be spun out by middle of 2017, and if AMD was convinced to continue manufacturing their AM3 chipset lines we could have 'libreboot/os' systems for everything except certain hardware's initialization blobs. IOMMUv1 support on the 9x0(!960) chipsets could handle most of the untrustworthy hardware in a sandbox as well, although you would lose out on HSA/XeonPhi support due to the lack of 64 bit BARs and memory ranges.
Re: massive parallel processing=limited applicatio by BarbaraHudson · 2016-08-25 10:37 · Score: 2, Informative

Instead of branch prediction picking the most often used branch, and stalling when they get it wrong, just take all possible branches and toss out the ones that turned out to be wrong.
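To put a rough number on what "take all possible branches" costs, here is a minimal sketch. It assumes every branch is two-way and that none has resolved yet; the function names are made up for illustration and nothing here comes from the Piton design.

```python
def eager_paths(unresolved_branches: int) -> int:
    """Paths that must run at once if both sides of every branch are taken.

    Assumption: each branch is two-way and none has resolved yet.
    """
    if unresolved_branches < 0:
        raise ValueError("unresolved_branches must be non-negative")
    return 2 ** unresolved_branches


def useful_fraction(unresolved_branches: int) -> float:
    """Share of the executed paths that turns out to be the correct one."""
    return 1.0 / eager_paths(unresolved_branches)


if __name__ == "__main__":
    for depth in (1, 4, 8, 16):
        print(f"depth={depth:>2} paths={eager_paths(depth):>6} "
              f"useful={useful_fraction(depth):.6f}")
```

Under those assumptions the work doubles with every extra unresolved branch, so the approach only pays off at shallow depths or when spare cores would otherwise sit idle.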

--
"Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
Re:Richard Stallman by Hylandr · 2016-08-25 11:00 · Score: 2

*I* Just shit my pants...

--
~ People that think they are better than anyone else for any reason are the cause of all the strife in the world.
Re:massive parallel processing=limited application by goose-incarnated · 2016-08-25 11:33 · Score: 4, Interesting

With a multiuser, multitasking OS you can have 25 different unrelated processes running on something with 25 cores. Or you could have 25 threads in a dataflow arrangement where each is a consumer of what the last just produced. Or you could go over the members of an array or matrix 25 members at a time with the same transformation. Some things are serial, but there are plenty of ways more cores can actually be used.
Nope. You'll generally hit the wall with around 16-20 cores using shared memory. You need distinct processors with dedicated memory to make multi-processing scale beyond 20 or so processors. Those huge servers with 32 cores apiece have their point of diminishing returns/processor after around 20 cores.
First, for the reason you aren't going to be doing multithreading/shared-memory on any known computer architectures, read this.
Secondly, let's say you aren't multithreading so you don't run into the problems in the link I posted above. Let's assume you run 25 separate tasks. You still run into the same problem, but at a lower level. The shared-memory is the throttle, because the memory only has a single bus. So you have 1000 cores. Each time an instruction has to be fetched[1] for one of those processors it needs exclusive access to those address lines that go to the memory. The odds of a core getting access to memory are roughly 1/n (n=number of cores/processors).
On an 8-core machine, a processor will be placed into a wait queue roughly 7 out of 8 times that it needs access. Further, the expected length of time in the queue is (1-(1/8)). This is of course, for an 8-core system. Adding more cores results in the waiting time increasing asymptotically towards infinity.
So, no. More cores sharing the same memory is not the answer. More cores with private memory is the answer but we don't have any operating system that can actually take advantage of that.
A project that I am eyeing for next year is putting together a system that can effectively spread out the operating system over multiple physical memories. While I do not think that this is feasible, it's actually not a bad hobby to tinker with :-)
[1] Even though they'd be fetched in blocks, they still need to be fetched; a single incorrect speculative path will invalidate the entire cache.
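For anyone who wants to play with the 1/n argument, here is a toy model in Python. The assumptions are mine, not the poster's: every core requests the single bus on every cycle, one request is granted uniformly at random per cycle, and caches and multiple memory channels are ignored. All function names are hypothetical.

```python
import random


def grant_probability(cores: int) -> float:
    """Chance that one particular core wins the bus in a given cycle."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / cores


def expected_wait_cycles(cores: int) -> float:
    """Mean cycles until a given core is granted (geometric distribution, mean 1/p)."""
    return 1.0 / grant_probability(cores)


def simulate_mean_wait(cores: int, trials: int = 20_000, seed: int = 0) -> float:
    """Estimate the same mean by simulation: core 0 waits until the arbiter picks it."""
    rng = random.Random(seed)
    total_cycles = 0
    for _ in range(trials):
        cycles = 1
        while rng.randrange(cores) != 0:  # some other core won this cycle
            cycles += 1
        total_cycles += cycles
    return total_cycles / trials


if __name__ == "__main__":
    for n in (8, 25, 1000):
        print(f"cores={n:>4} p(grant)={grant_probability(n):.4f} "
              f"expected wait={expected_wait_cycles(n):.0f} cycles")
```

In this particular model the mean wait grows in step with the number of cores. How closely that matches real hardware depends heavily on cache hit rates and on how many independent memory channels exist, neither of which is modelled here.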

--
I'm a minority race. Save your vitriol for white people.
Re:massive parallel processing=limited application by Fwipp · 2016-08-25 11:54 · Score: 3, Insightful

Chances are you're not content to watch video in 240p anymore.
Hot Chips Conference by Areyoukiddingme · 2016-08-25 11:56 · Score: 2

Perhaps more interesting is the semi-detailed presentation about AMD's Zen. Other people have already pointed out that a paltry few hundred million transistors doesn't get you very far. What are the billions of transistors used for? The Zen presentation is quite informative. Loads of cache is a fair chunk of it. Überfancy predictive logic is another big chunk of it. The rest is absorbed by 4 completely parallel ALUs, two parallel AGUs, and a completely independent floating point section with two MUL and two ADD logics. And after all that, what you get is parity with Intel's Broadwell. Barely.
So for perspective, that took a decade of hard labor by quite well paid engineers, and there was no low-hanging fruit in the form of the register-starved x86 architecture for AMD to pluck this time. The difference between half a billion and two billion transistors is very very substantial.
Re:Richard Stallman by Anonymous Coward · 2016-08-25 12:32 · Score: 2, Funny

I just shit Richard Stallman's pants! (Maybe he shoulda used a password?)
Re:Richard Stallman by fyngyrz · 2016-08-25 13:02 · Score: 4, Funny

No, it's ok. You have to shit *and* piss his pants. It's two-factor authorization.

--
I've fallen off your lawn, and I can't get up.
Re:massive parallel processing=limited application by wierd_w · 2016-08-25 13:07 · Score: 2

It's an interesting idea, and one I have given a little thought to. (It would enable a very fault-tolerant computer architecture.) However, unless you implement highly redundant interconnects/busses, you still have the problem of N devices fighting for a shared resource.
If you make the assertion that all nodes have a private direct connection with all other nodes, and thus eliminate the bottleneck that way, you now have to gracefully decide how to handle a downed private link.
I suppose a hybrid might work. Fully dedicated links, and one shared bus. When dedicated link fails, communicate over the shared bus.
Scaling such a design would become prohibitively costly though. A 200 node design would have orders of magnitude more dedicated links.
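The scaling cost is easy to check. A short sketch, assuming a full mesh where every pair of nodes gets its own dedicated link (the function names are just for illustration):

```python
def full_mesh_links(nodes: int) -> int:
    """Dedicated point-to-point links needed so every pair of nodes is directly connected."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return nodes * (nodes - 1) // 2


def ports_per_node(nodes: int) -> int:
    """Ports each node needs for its dedicated links (one per peer)."""
    return max(nodes - 1, 0)


if __name__ == "__main__":
    for n in (5, 25, 200):
        print(f"nodes={n:>3} links={full_mesh_links(n):>5} ports/node={ports_per_node(n)}")
```

Five nodes need 10 links, while 200 nodes need 19,900 links and 199 ports per node, so the link count grows with the square of the node count.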
The idea I had for playing with this was to use some cheap wired home routers, set up private VLANs on the 5 or so Ethernet ports each has, then put private patch cables on each port, then put all the WAN ports on a dumb hub.
The local copies of Linux on each system can handle management of local device resources, and a daemon running on each node then handles listening/responding on each interface.
Just what such a thing would be good at doing escapes me though. To be really useful, you would need some way to have nodes specialize, then cooperate, without a central authority.
That way, should we decide to use this network to process live video, one node decodes the input stream, then dispatches portions of the decoded stream to peer devices, who then take the decoded stream and do whatever processing is requested, before sending the processed streams to yet another peer device which assembles the processed stream, then shuttles that to the endpoint node, which reencodes the stream and writes it to the output device. (Or some similarly cellular process)
I suppose this is kinda similar to how a neural column works, where locally interconnected nets are restricted in the number of true local peers they have, and then communicate collectively to other neural columns by dedicated interconnects. (Video input source in the above could be from a camera, but it could also be from another network's output stream.)
The major logical tasks are:
Role selection in the assigned task for each local node.
How to issue instructions to the mesh nodes in a decentralized manner
Depending on how far you wanted to extrapolate this, each mesh node could be treated as a logical unit, where each logical node then is part of another, higher level node of similar topology: each mesh has a direct connection to each other mesh inside its higher order node, and one communal link all nodes can talk on inside that node.
E.g., if I make five 5-node networks out of such routers, I need 7 ports on each router: 5 for direct local traffic, 1 for the local shared connect, and 1 for a direct connect to another 5-node group. Clever use of subnetting and routing on the shared net would enable there to be a dumb gateway device to allow the shared higher link to function. Each 5-node network is connected to every other 5-node network in the scaled up version.
Decisions on how to process incoming data might be tied to which interface received it, or any number of other methods.
Spying on the system state of the whole system should be possible through the shared link infrastructure, though ideally any node you interact with the system with should be a proper peer in it, and not something sitting on the shared net only.
The drawback of such a design will be signal propagation latency, and keeping all the subnodes, at all levels, synchronized. The human brain uses a support network of astrocytes and glial cells to guide dedicated link physical routing, and to tune propagation delay between neural columns through selective myelination of trunk bundles.
You could probably fake it with introduced waitstates.
At some point though, the behavior of the whole will revolve around the basic logic baked inside each physical compute unit. Ideall
Re:massive parallel processing=limited application by godrik · 2016-08-25 13:33 · Score: 3

Well, nothing will ever break Amdahl's law. But that is rarely the issue. The parallelism in many scientific problems is pretty vast. We run lots of simulations on 100K and more cores. Often the interconnect is the issue, and not the sequential part.
There is a real problem today in building an exaflops machine. One of the biggest problems is managing communications, because they are very power consuming. If that architecture can scale meaningful codes at 100K, it is interesting.
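For reference, Amdahl's law bounds the speedup on n cores at 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel. A quick sketch in Python; the core counts are taken from the story summary and the values of p are picked purely for illustration:

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative parallel fractions; real codes have to be measured.
    for p in (0.50, 0.95, 0.999):
        for n in (25, 8_000, 200_000):
            print(f"parallel={p:.3f} cores={n:>7} speedup<={amdahl_speedup(p, n):8.1f}")
```

The bound ignores communication entirely, so it is a best case. It never exceeds 1 / (1 - p), which is why a half-serial task cannot even double in speed, while a code that is 99.9% parallel still has room to use thousands of cores.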
Re:massive parallel processing=limited application by godrik · 2016-08-25 13:39 · Score: 3, Insightful

That is not really true. Most workloads can be executed in parallel. Pretty much all the fields of scientific computing (be that physics, chemistry, or biology) are typically quite parallel. If you are looking at databases and data analytics, they are very parallel as well; if you are building topic models of the web, or trying to find correlations in Twitter posts, these things are highly parallel.
Even on your machine, you are certainly using a fair amount of parallel computing; most likely video decompression is done in parallel (or it should be). It is the old argument that by decreasing frequency you can increase core count in the same power envelope while increasing performance.
For sure, some applications are sequential. Most likely, they are not the ones we really care about. Otherwise, hire me, and I'll write them in parallel :)