Princeton Researchers Announce Open Source 25-Core Processor (pcworld.com)

Richard Stallman by Anonymous Coward · 2016-08-25 10:03 · Score: 1

just shit his pants!

Re:Richard Stallman by Hylandr · 2016-08-25 11:00 · Score: 2

*I* Just shit my pants...

--
~ People that think they are better than anyone else for any reason are the cause of all the strife in the world.
Re:Richard Stallman by Anonymous Coward · 2016-08-25 12:32 · Score: 2, Funny

I just shit Richard Stallman's pants! (Maybe he shoulda used a password?)
Re:Richard Stallman by fyngyrz · 2016-08-25 13:02 · Score: 4, Funny

No, it's ok. You have to shit *and* piss his pants. It's two-factor authorization.

--
I've fallen off your lawn, and I can't get up.
Re:Richard Stallman by FatdogHaiku · 2016-08-25 14:17 · Score: 1

I just shit Richard Stallman's pants! (Maybe he shoulda used a password?)
He did, but who's not going to try "zipper"?

--
You have the right to remain sentient. If you give up the right to remain sentient, you will be elected to public office
Re:Richard Stallman by Anonymous Coward · 2016-08-25 16:22 · Score: 0

Bet you a shilling you wouldn't be able to tell the difference between shitted and unshitted Stallman pants using smell alone.
Re:Richard Stallman by Anonymous Coward · 2016-08-25 21:31 · Score: 0

Bet you a shilling you wouldn't be able to tell the difference between shitted and unshitted Stallman pants using smell alone.
You are shilling.

/. Editor Easiest job ever by Anonymous Coward · 2016-08-25 10:04 · Score: 0

" Piton is one of the largest and most complex academic processors every built"

Re:/. Editor Easiest job ever by fyngyrz · 2016-08-25 13:03 · Score: 1

You put an extra space in your quote. Editing fail.

--
I've fallen off your lawn, and I can't get up.

How does technology sanctions work with this? by Anonymous Coward · 2016-08-25 10:08 · Score: 1

Does it mean any sanctioned country can order their own processors from a generic manufacturer?

Re: How does technology sanctions work with this? by johnsmithperson123 · 2016-08-25 10:11 · Score: 4, Insightful

Relax. Between its architectural basis and its relatively low performance, it's insignificant. This is a few hundred million transistors for a 25-core chip, at a time when your stock chip runs to multiple billions of transistors.
Re:How does technology sanctions work with this? by AHuxley · 2016-08-25 19:58 · Score: 1

Any hardware they buy in to think about expanding their ability to build a super computer comes with free NSA and GCHQ hardware added during shipping. e.g.
people may recall DEITYBOUNCE, IRONCHEF, MONTANA, BULLDOZER, KONGUR, NIGHTSTAND.
So it then becomes a race to buy in safe top end consumer kit and fill a hall or older super computers without attracting the FBI, CIA, MI6 while exporting.
Nothing allowed to be floating in the educational or consumer realm will really help.

--
Domestic spying is now "Benign Information Gathering"
Re:How does technology sanctions work with this? by Anonymous Coward · 2016-08-25 21:34 · Score: 0

Any hardware they buy in to think about expanding their ability to build a super computer comes with free NSA and GCHQ hardware added during shipping. e.g.
people may recall DEITYBOUNCE, IRONCHEF, MONTANA, BULLDOZER, KONGUR, NIGHTSTAND.
So it then becomes a race to buy in safe top end consumer kit and fill a hall or older super computers without attracting the FBI, CIA, MI6 while exporting.
Nothing allowed to be floating in the educational or consumer realm will really help.
Except none of those malware packages whose names you drop like you know what you're talking about, will actually run on this processor.
tl;dr? Even your troll relatives want you to fuck off.
Re:How does technology sanctions work with this? by gtall · 2016-08-25 22:17 · Score: 1

You do know that everyone in the U.S. and Britain has a government "minder" assigned to them to watch their every move, yes?
Re:How does technology sanctions work with this? by K.+S.+Kyosuke · 2016-08-25 23:47 · Score: 1

What are you talking about? What "sanctioned countries"? Unless you're talking about US manufacturers, of course, however, the prominent one, meaning Intel, is not going to act as a fab for other random people anyway, so that's a moot point.

--
Ezekiel 23:20
Re:How does technology sanctions work with this? by Black.Shuck · 2016-08-26 01:22 · Score: 1

You do know that everyone in the U.S. and Britain has a government "minder" assigned to them to watch their every move, yes?
Yes, but they can't be trusted. That's why we have minder minders who keep an eye on the minders.
But who keeps an eye on them you ask? Fool! It's minders all the way down!
Re:How does technology sanctions work with this? by unixisc · 2016-08-26 01:22 · Score: 1

Uh, that's precisely what they do. They have their Custom Foundry, where one can have any agreement w/ them. And if it's a non-US company that needs them manufactured outside the US, they can have Intel make their parts in Israel
Re:How does technology sanctions work with this? by K.+S.+Kyosuke · 2016-08-26 04:39 · Score: 1

I said RANDOM people. Intel may choose to agree to manufacture something for you, but if it smells of competing with Intel, well, tough luck. And manufacturing CPUs replacing Intel models is definitely going to smell to Intel like competing with Intel.

--
Ezekiel 23:20
Re:How does technology sanctions work with this? by unixisc · 2016-08-26 05:24 · Score: 1

No, if they stand to get money from me by making something for me, they'll do it. They may refuse if I'm violating anyone's patents, or if I wanted to make an x64 after getting an AMD license, they may balk, but otherwise, they'd be just fine. The custom foundry business is an independent business unit.

Nice by Anonymous Coward · 2016-08-25 10:12 · Score: 0

Wish more universities would sponsor projects like this. Congratulations Princeton!

Lots of cores doesn't mean shit by BitZtream · 2016-08-25 10:15 · Score: 2, Insightful

I've been hearing about massive number of cores for years ... the problem however is they are great for demonstrating that you can put a bunch of 'cores' on a chip ... not that they are actually useful for anything.

Connecting 8k of these things together? You've just proven you actually don't understand how the real world does things.

If you have 8 million cores that can add 20 super floating point numbers a second ... that's WORTHLESS because I need to do things other than add two numbers.

If you have 8k cores that can be interconnected ... that must be one awesome bus if those interconnects are useful because the congestion on that bus is going to be insane, oh ... you've got a solution to that problem? funny how that solution kills the theoretical performance

Sorry, but I've heard this stuff so many times over the years that I just get annoyed when some professor tells us about this super awesome CPU he has that is utterly fucking worthless outside of theoretical land.

And by the way, 25 cores is on the tiny side for these silly academic projects.

Blah blah blah I made this awesome processor but it only works for one tiny problem domain that can't even be used for that problem domain because of the constraints on it that allow you to make so many cores.

Not once has one of these things actually been useful in the real world, and I know that's not the point of research but the only reason you list something about so many cores is pure clickbait. No one with a clue believes you've built something useful when you make such ridiculous statements.

No, I didn't read the article. I don't have to. These papers are only about getting grant money by making ridiculous statements, not about producing anything useful and 9 times out of 8, it's done using methods that the real world (read: people who actually get shit done) has already deemed don't actually work outside of academia and theory.

Yes, I'm bitter. I hate useless people wasting money that could be spent doing real things, not reiterating something intel and amd knew in the 80s.

--
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager

Re:Lots of cores doesn't mean shit by Anonymous Coward · 2016-08-25 11:57 · Score: 0

So, can I see your parallel processor implementation? You know, that little FPGA you have been working on for a few years now? Having a little trouble with the clock frequency, maybe dial it down another few MHz? You know, maybe actually make a board, and put an FPGA on it?
Where is your silicon? Can I see it?
Maybe you can show me that Beowulf cluster of Z80s you built?
Re:Lots of cores doesn't mean shit by Anonymous Coward · 2016-08-25 12:40 · Score: 0

Not once has one of these things actually been useful in the real world,
Intel would like to disagree.
Re:Lots of cores doesn't mean shit by rubycodez · 2016-08-25 12:55 · Score: 3, Interesting

real computers solving real problems with large core counts exist, and they have non-bus architectures by the way.
So according to you the cpus in the Sunway TaihuLight supercomputer with 256 cores per cpu don't really do anything?
I think you don't have a background in the field to be making such pronouncements, you're spewing out of your ass
Re: Lots of cores doesn't mean shit by Anonymous Coward · 2016-08-25 13:24 · Score: 0

So you never complain about your food in a restaurant that was cooked by a professional cook. K.
Re: Lots of cores doesn't mean shit by D.McG. · 2016-08-25 14:35 · Score: 3

Nvidia has a wonderful 3840 core processor with a wonderful scheduler and interconnect. Two can be bridged for 7680 cores. Hmmm... Your argument of 8000 cores being a pipe dream is complete rubbish.
Re:Lots of cores doesn't mean shit by AHuxley · 2016-08-25 20:04 · Score: 1

Wrap it up in great marketing and sell it back to a government for crypto? No need to buy in a big brand super computer, lots of our small CPUs can do that and be expanded on as needed. At a per core price that's some nice payback to a contractor.

--
Domestic spying is now "Benign Information Gathering"
Re:Lots of cores doesn't mean shit by terjeber · 2016-08-25 20:54 · Score: 1

funny how that solution kills the theoretical performance
Yeah, what are those dumbasses at NVidia thinking about?

Blah blah blah I made this awesome processor but it only works for one tiny problem domain
Yeah. These things don't work at all. Much like the brains of ignorant idiots posting this kind of drivel in /.
Re:Lots of cores doesn't mean shit by K.+S.+Kyosuke · 2016-08-25 23:52 · Score: 1

There's absolutely nothing wrong with designing chips with a larger number of smaller cores, especially if it removes a lot of the core-overengineering pressure present in the legacy x86 chip market and improves power efficiency for smaller applications, which are going to be more numerous in the future. The ability to customize ISAs for specific applications such as modern mobile robotics would also significantly improve the odds when competing against using generic large CPUs.

--
Ezekiel 23:20
Re:Lots of cores doesn't mean shit by Anonymous Coward · 2016-08-26 01:11 · Score: 0

... I need to do things other than add two numbers
Hate to shatter any illusions, but that's essentially all processors ever do.
All your computational problems can be reduced to adding together two numbers at unfathomable speed, so progress in that domain consists of doing exactly the same thing but even faster, or in conjunction with other things doing it too. Anything "real" that can have its variables reduced to sequences of two numbers will benefit by this, which happens to be a lot of things.
But the well of Good Will in the world is not drained by efforts in this direction. If you want to pump clean water for thirsty children in Africa, you can be on a plane tomorrow.
Re: Lots of cores doesn't mean shit by Anonymous Coward · 2016-08-26 02:21 · Score: 0

Nvidia has a wonderful 3840 core processor with a wonderful scheduler and interconnect. Two can be bridged for 7680 cores.
Hmmm... Your argument of 8000 cores being a pipe dream is complete rubbish.
We won't be impressed until they release something with at least 9001 cores.
Re:Lots of cores doesn't mean shit by Anonymous Coward · 2016-08-26 03:16 · Score: 0

You don't have to make brilliant hardware in order to call someone else out for making shit hardware.
Shit, I've never so much as touched a steamroller, but I can tell when a construction crew has done a total shit job on a section of road.
Re: Lots of cores doesn't mean shit by Anonymous Coward · 2016-08-26 03:21 · Score: 0

(A) Niche use case
(B) Compute doesn't run over SLI. You schedule work on each GPU independently.
(C) Those cores are far less capable than x86 or even ARM cores, as is the scheduler
(D) It's not a real interconnect. Nowhere near as capable, not like the glue between SMP cores in CPUs at all. It would have to be significantly redesigned to function at that level
Re: Lots of cores doesn't mean shit by Anonymous Coward · 2016-08-26 05:12 · Score: 0

I realize you can add nines faster than you can understand them.
Re: Lots of cores doesn't mean shit by Anonymous Coward · 2016-08-26 11:29 · Score: 0

To that interconnect thing:
http://www.nvidia.com/object/nvlink.html
https://blogs.nvidia.com/blog/2014/11/14/what-is-nvlink/
Re:Lots of cores doesn't mean shit by Anonymous Coward · 2016-08-27 08:11 · Score: 0

Maybe you should take a time out from being bitter and read the paper.
I've linked it below so you don't get even more grouchy having to look for it.
They've taken the OpenSparc T1 core as their base and built off of it, modifying the L1 cache and giving each core an FPU.
I'm well out of my depth here but it doesn't look like this is some pointless research-only project just to get grant money. They've released a full test suite, a Xilinx FPGA-ready implementation and it looks like there are also specs for a complete chipset.
https://parallel.princeton.edu...
Re: Lots of cores doesn't mean shit by Anonymous Coward · 2016-08-27 14:46 · Score: 0

Yes - but there I'm PAYING for the meal
Re:Lots of cores doesn't mean shit by Anonymous Coward · 2016-08-27 23:01 · Score: 0

Which means you went and LOOKED at what they'd done and didn't assume they'd done a terrible job.
This putz has openly stated that he will NOT even look at their paper.
And how does he know it's, as you say, "shit hardware"? The basic core is a proven design and the modifications *seem* sound.
Until it's been tested, no one can say for sure. Look, Linux just turned 25 and you could have said the same about it for more than a few of those years - and many people did. Yet it now runs on anything from watches to supercomputers.
Re:Lots of cores doesn't mean shit by Anonymous Coward · 2016-08-29 13:17 · Score: 0

It's bitztream, the autism-hating Slashdot troll!
Re:Lots of cores doesn't mean shit by Anonymous Coward · 2016-08-29 13:20 · Score: 0

It's bitztream, the moronic autism-hating Slashdot troll!

massive parallel processing=limited applications by wierd_w · 2016-08-25 10:16 · Score: 2

While being able to leverage that many compute units all at once is quite impressive, most tasks are still serial by nature. Computers are not clairvoyant, so they cannot know in advance what a branched logic chain will tell them to do for any arbitrary path depth, nor can they perform a computation on data that doesn't exist yet.

The benefits of more cores come from parallel execution, not from doing tasks faster. As such, most software is not going to benefit from having access to 8000 more threads.
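
One standard way to put numbers on this is Amdahl's law, which the post does not name. The sketch below is only an illustration, assuming a fixed fraction of the work (`parallel_fraction`, a made-up parameter) scales perfectly with core count while the rest stays serial.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the work scales with cores."""
    # The serial part takes the same time no matter how many cores exist.
    serial_fraction = 1.0 - parallel_fraction
    # The parallel part is assumed to divide evenly across all cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed workload: 90% parallel, 10% serial.
    for cores in (1, 25, 8000):
        print(cores, round(amdahl_speedup(0.90, cores), 2))
```

With 90% of the work parallel, the speedup stays below 10x no matter how many cores are added, because the serial 10% puts a floor on run time. Workloads with almost no serial part are a different story.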

Re:massive parallel processing=limited application by mr_mischief · 2016-08-25 10:23 · Score: 1

With a multiuser, multitasking OS you can have 25 different unrelated processes running on something with 25 cores. Or you could have 25 threads in a dataflow arrangement where each is a consumer of what the last just produced. Or you could go over the members of an array or matrix 25 members at a time with the same transformation. Some things are serial, but there are plenty of ways more cores can actually be used.

Re:massive parallel processing=limited application by knightghost · 2016-08-25 10:26 · Score: 1

Unless the computer is figuring out every possible combination 1, 2, or more steps ahead. That's how computers beat chess and could really improve in predictive modeling. Depth/Lookahead depends on how fast the logic flow branches. To think about it more, it'd be a great factor in human predictive analysis, from driving to combat.

Re:32nm is shit by Anonymous Coward · 2016-08-25 10:31 · Score: 0

lelelelelelelelelelelelel....GO TEAM GREEN!!!!!!!!11111111oneoneoneoneone

For those that didn't read TFA, esp in regards to by Anonymous Coward · 2016-08-25 10:36 · Score: 5, Informative

the type of cores:
Some of OpenPiton® features are listed below:

OpenSPARC T1 Cores (SPARC V9)
Written in Verilog HDL
Scalable up to 1/2 Billion Cores
Large Test Suite (>8000 tests)
Single Tile FPGA (Xilinx ML605) Prototype

The bit that may put some people off:
This work was partially supported by the NSF under Grants No. CCF-1217553, CCF-1453112, and CCF-1438980, AFOSR under Grant No. FA9550-14-1-0148, and DARPA under Grants No. N66001-14-1-4040 and HR0011-13-2-0005. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.

So interesting and possibly FPGA synthesizable test processor it may be. Trustworthy computer core it may *NOT* be. (You would have to compare it to the original T1 cores, and have had those independently audited to ensure no nefarious timing attacks, etc were in place.)

Now, having said that, if this interconnect is even a fraction as good as they claim, it could make for an AWESOME libre SPARC implementation competitive with Intel/AMD for non-Wintel computing uses. Bonus for someone taping out an AM3+ socket chip (or AM4 if all the signed firmware is SoC-side and not motherboard/southbridge side.) that can be initialized on a commercially available board with standard expansion hardware. AM3/3+ would offer both IGP and discrete graphics options if a chip could be spun out by middle of 2017, and if AMD was convinced to continue manufacturing their AM3 chipset lines we could have 'libreboot/os' systems for everything except certain hardwares initialization blobs. IOMMUv1 support on the 9x0(!960) chipsets could handle most of the untrustworthy hardware in a sandbox as well, although you would lose out on HSA/XeonPhi support due to the lack of 64 bit BARs and memory ranges.

Re: massive parallel processing=limited applicatio by BarbaraHudson · 2016-08-25 10:37 · Score: 2, Informative

Instead of branch prediction picking the most often used branch, and stalling when they get it wrong, just take all possible branches and toss out the ones that turned out to be wrong.

--
"Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.

Re:massive parallel processing=limited application by inode_buddha · 2016-08-25 10:43 · Score: 1

Or you can fake it with really good scheduling and context switching. Why is it that I was able to simultaneously watch a realplayer video in Netscape 4.x while editing a doc in OpenOffice with NO lag or skips or jitters, on a 200 MHz box with 1 gig of ram in 1998, but I can't do that now with 2.6 GHz and 4 gigs? This is starting to really bug me, like where the FUCV is all that horsepower going???
NOTE all of this is on Linux, various flavors. My Time-Warner cable is easily able to saturate the box during off-peak hours.

--
C|N>K

Re:massive parallel processing=limited application by Dadoo · 2016-08-25 10:46 · Score: 1

With a multiuser, multitasking OS you can have 25 different unrelated processes running on something with 25 cores.

In practice, most jobs running on a computer have some relation to each other, and the more jobs you have - and this CPU clearly expects to be able to run a lot of jobs - the more likely that will be. (Where I work, we actually have an application that gets slower when you add more cores.) Like most CPUs with high core counts, this one looks like it'll be great at compute-intensive tasks, but as soon as you try to do I/O, it'll slow to a crawl. Given the number of terabytes people are trying to process these days, I'm thinking this CPU's applications are limited

--
Sit, Ubuntu, sit. Good dog.

Re:For those that didn't read TFA, esp in regards by Anonymous Coward · 2016-08-25 11:29 · Score: 1

Now that was a good /. post, reminiscent of yesteryear.

Get off my lawn.

Re:massive parallel processing=limited application by goose-incarnated · 2016-08-25 11:33 · Score: 4, Interesting

With a multiuser, multitasking OS you can have 25 different unrelated processes running on something with 25 cores. Or you could have 25 threads in a dataflow arrangement where each is a consumer of what the last just produced. Or you could go over the members of an array or matrix 25 members at a time with the same transformation. Some things are serial, but there are plenty of ways more cores can actually be used.

Nope. You'll generally hit the wall with around 16-20 cores using shared memory. You need distinct processors with dedicated memory to make multi-processing scale beyond 20 or so processors. Those huge servers with 32 cores apiece hit their point of diminishing returns per processor after around 20 cores.

First, the reason you aren't going to be doing multithreading/shared-memory on any known computer architectures, read this.

Secondly, let's say you aren't multithreading so you don't run into the problems in the link I posted above. Let's assume you run 25 separate tasks. You still run into the same problem, but at a lower level. The shared-memory is the throttle, because the memory only has a single bus. So you have 1000 cores. Each time an instruction has to be fetched[1] for one of those processors it needs exclusive access to those address lines that go to the memory. The odds of a core getting access to memory is roughly 1/n (n=number of cores/processors).

On an 8-core machine, a processor will be placed into a wait queue roughly 7 out of 8 times that it needs access. Further, the expected length of time in the queue is (1-(1/8)). This is of course, for an 8-core system. Adding more cores results in the waiting time increasing asymptotically towards infinity.

So, no. More cores sharing the same memory is not the answer. More cores with private memory is the answer but we don't have any operating system that can actually take advantage of that.

A project that I am eyeing for next year is putting together a system that can effectively spread out the operating system over multiple physical memories. While I do not think that this is feasible, it's actually not a bad hobby to tinker with :-)

[1] Even though they'd be fetched in blocks, they still need to be fetched; a single incorrect speculative path will invalidate the entire cache.

--
I'm a minority race. Save your vitriol for white people.

Re:massive parallel processing=limited application by Fwipp · 2016-08-25 11:54 · Score: 3, Insightful

Chances are you're not content to watch video in 240p anymore.

Re:32nm is shit by Anonymous Coward · 2016-08-25 11:55 · Score: 0

So, can I see your 16nm Fab?

Hot Chips Conference by Areyoukiddingme · 2016-08-25 11:56 · Score: 2

Perhaps more interesting is the semi-detailed presentation about AMD's Zen. Other people have already pointed out that a paltry few hundred million transistors doesn't get you very far. What are the billions of transistors used for? The Zen presentation is quite informative. Loads of cache is a fair chunk of it. Überfancy predictive logic is another big chunk of it. The rest is absorbed by 4 completely parallel ALUs, two parallel AGUs, and a completely independent floating point section with two MUL and two ADD logics. And after all that, what you get is parity with Intel's Broadwell. Barely.

So for perspective, that took a decade of hard labor by quite well paid engineers, and there was no low-hanging fruit in the form of the register-starved x86 architecture for AMD to pluck this time. The difference between half a billion and two billion transistors is very very substantial.

Re:Hot Chips Conference by Megol · 2016-08-25 13:36 · Score: 1

Perhaps more interesting is the semi-detailed presentation about AMD's Zen. Other people have already pointed out that a paltry few hundred million transistors doesn't get you very far. What are the billions of transistors used for? The Zen presentation is quite informative. Loads of cache is a fair chunk of it. Überfancy predictive logic is another big chunk of it. The rest is absorbed by 4 completely parallel ALUs, two parallel AGUs, and a completely independent floating point section with two MUL and two ADD logics. And after all that, what you get is parity with Intel's Broadwell. Barely.
Intel Broadwell E. There's a big difference. And barely being in parity with one of the best performing processors in the world (classified by Intel as an "enthusiast" processor) is a good thing.

So for perspective, that took a decade of hard labor by quite well paid engineers, and there was no low-hanging fruit in the form of the register-starved x86 architecture for AMD to pluck this time. The difference between half a billion and two billion transistors is very very substantial.
Yes, it is a factor of 4. Given that Zen is (or is about to be) mass-produced on a small process, the price per chip is probably skewed strongly in AMD's favor. Performance is likely to be better for Zen on real-world code (read: not embarrassingly parallel).

Re:massive parallel processing=limited application by Anonymous Coward · 2016-08-25 12:04 · Score: 0

But the whole point of parallel execution is to reduce serial cycle times which means it is going to be faster overall. The clock speed race was ended years ago, we are multi-threading everything now, and understanding chips like these will help in our understanding of future computing.

Re:massive parallel processing=limited application by Anonymous Coward · 2016-08-25 12:11 · Score: 0

Predictive branching on 32 cores is useless. On 200,000 cores, it is trivial. Don't be so willfully dull.

Re:For those that didn't read TFA, esp in regards by wbr1 · 2016-08-25 12:16 · Score: 1

So the NSF is not to be trusted? Are they the sister of the NSA?

Tinfoil is apt in many circumstances, but geez keep it where it belongs.

--
Silence is a state of mime.

Re: massive parallel processing=limited applicatio by Anonymous Coward · 2016-08-25 12:16 · Score: 0

SLOWER with more cores? Sounds like your application was written by a bunch of fools. Worst case is that those cores should go unused.

Re:massive parallel processing=limited application by ClickOnThis · 2016-08-25 12:16 · Score: 1

On a 8-core machine, a processor will be placed into a wait queue roughly 7 out of 8 times that it needs access. Further, The expected length of time in the queue is (1-(1/8)). This is of course, for an 8-core system. Adding more cores results in the waiting time increasing asymptotically towards infinity.

Sorry, that doesn't sound right. The expected length of time in the queue should be on the order of nt, where n is the number of cores and t is the average time required to process a memory-request. (A better formula would use the average length of the queue instead of n but to first order it still would be roughly linear with n.) So, the time required would increase linearly with the number of cores.
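
A minimal sketch of that linear estimate, assuming the simplest possible model: all n cores issue a request at the same instant and a single memory port serves them one at a time, each request taking time t. The function name and the model are illustrative assumptions, not a description of any real memory controller.

```python
def mean_wait(cores: int, service_time: float) -> float:
    """Average wait when `cores` requests arrive together and are served one by one."""
    # The request at position i (0-based) waits for the i requests ahead of it.
    waits = [position * service_time for position in range(cores)]
    return sum(waits) / cores


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, mean_wait(n, 1.0))
```

In this model the average wait works out to (n-1)t/2, so doubling the core count roughly doubles the wait. That is bad, but it is linear growth, not a run towards infinity.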

--
If it weren't for deadlines, nothing would be late.

Re: massive parallel processing=limited applicatio by Anonymous Coward · 2016-08-25 12:22 · Score: 0

There is a class of problems that are, amusingly enough, called "embarrassingly parallel." Those tasks are perfect for this kind of thing. Think of things like ray-tracing or key-cutting (hello NSA!) where there are no dependencies between operations. Split those amongst 8k CPUs and you can get some serious speed out of an embarrassingly large number of cores.

https://en.m.wikipedia.org/wiki/Embarrassingly_parallel
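
To make "no dependencies between operations" concrete, here is a rough sketch of a brute-force key search split into independent chunks. Everything in it is made up for illustration: `toy_hash` is a stand-in for a real cipher, and `ThreadPoolExecutor` stands in for a pool of cores. In CPython, threads will not actually speed up a pure-Python loop like this because of the global interpreter lock, so treat it as a picture of how the work divides, not as a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor

MULTIPLIER = 2654435761  # arbitrary odd constant for the toy "cipher" (assumption)
MODULUS = 2 ** 32


def toy_hash(key: int) -> int:
    """Hypothetical stand-in for an expensive key check."""
    return (key * MULTIPLIER) % MODULUS


def scan_chunk(start: int, stop: int, target: int):
    """Scan one slice of the keyspace; needs nothing from any other slice."""
    for key in range(start, stop):
        if toy_hash(key) == target:
            return key
    return None


def parallel_search(keyspace: int, workers: int, target: int):
    """Split [0, keyspace) into one chunk per worker and scan the chunks independently."""
    chunk = -(-keyspace // workers)  # ceiling division
    ranges = [(s, min(s + chunk, keyspace)) for s in range(0, keyspace, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda r: scan_chunk(r[0], r[1], target), ranges)
        return [key for key in results if key is not None]


if __name__ == "__main__":
    print(parallel_search(100_000, 8, toy_hash(54321)))
```

The only coordination is handing out the ranges at the start and collecting the answers at the end, which is why this kind of job scales with core count where most software does not.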

Re: massive parallel processing=limited applicatio by Megol · 2016-08-25 12:56 · Score: 1

Just as there is the super-linear scaling effect (very uncommon), there's also the effect that a multiprocessor system, even if not loaded, can produce worse performance than the ideal _even_ when written competently. Systems have to use different algorithms for different kinds of system to get good scaling. If the system isn't loaded with the workload it was designed for, those algorithms are unlikely to be optimal and performance then scales worse than expected. That includes hardware algorithms like data/instruction prefetch, cache coherency, DRAM management etc.

But you talk like someone who has no freaking clue about real-world systems. Do you think synchronization and coherence (hardware _and_ software for distributed systems) are cheap operations? Nope. When the overheads of added cores exceed the extra performance they can provide, the system will run slower with more cores; it is that simple. Most interesting synchronization algorithms scale as O(n^2) or O(n*log n), do the math.
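
"Do the math" can be sketched with a toy cost model. The assumptions are loud ones: the useful work divides perfectly across cores, and synchronization costs grow as O(n^2) with a made-up constant `sync_cost`. Neither number comes from a real system.

```python
def run_time(cores: int, work: float, sync_cost: float) -> float:
    """Toy model: perfectly divisible work plus O(n^2) synchronization overhead."""
    return work / cores + sync_cost * cores ** 2


def best_core_count(work: float, sync_cost: float, max_cores: int) -> int:
    """Core count with the lowest modelled run time, found by brute force."""
    return min(range(1, max_cores + 1),
               key=lambda n: run_time(n, work, sync_cost))


if __name__ == "__main__":
    work, sync_cost = 1000.0, 0.01  # arbitrary illustrative values
    best = best_core_count(work, sync_cost, 200)
    print("best core count:", best)
    print("time at best:", run_time(best, work, sync_cost))
    print("time at 200 cores:", run_time(200, work, sync_cost))
```

Past the minimum, every added core costs more in synchronization than it returns in throughput, so the modelled run time climbs again. That is the "slower with more cores" case, with no fools required.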

Re:For those that didn't read TFA, esp in regards by 110010001000 · 2016-08-25 13:01 · Score: 1

I think he meant the AFOSR and DARPA involvement.

Re:For those that didn't read TFA, esp in regards by rthille · 2016-08-25 13:02 · Score: 1

I imagine the poster was referring to DARPA more than the NSF, but I imagine that any association with the US Govt. could engender distrust in such matters these (post Snowden) days.

--
Awesome furniture, accessories and cabinetry in Santa Rosa, CA: http://humanity-home.com/

Re: massive parallel processing=limited applicatio by Megol · 2016-08-25 13:04 · Score: 1

That's called eager execution (well, it has many names but that's the most common).

In general it is a dumb idea. It requires more instruction throughput, more execution units, larger caches etc. for a small gain which in the real world is probably negative. Doing more things means more switching, switching means more power consumption (and the added resources will add to the leakage current too -> more power) and this means lower effective clock frequency.

Even with the limited form of eager execution, where only branches that are considered unpredictable have both their paths executed, there isn't much gain and the overheads are most likely not worth it.

Re:massive parallel processing=limited application by wierd_w · 2016-08-25 13:07 · Score: 2

It's an interesting idea, and one I have given a little thought to (it would enable a very fault tolerant computer architecture). However, unless you implement highly redundant interconnects/busses, you still have the problem of N devices fighting for a shared resource.

If you make the assertion that all nodes have a private direct connection with all other nodes, and thus eliminate the bottleneck that way, you now have to gracefully decide how to handle a downed private link.

I suppose a hybrid might work. Fully dedicated links, and one shared bus. When dedicated link fails, communicate over the shared bus.

Scaling such a design would become prohibitively costly though. A 200 node design would have orders of magnitude more dedicated links.
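
For a sense of scale, a fully connected mesh needs one dedicated link per pair of nodes, which is n(n-1)/2 links. A quick sketch (the function name is just for illustration):

```python
def full_mesh_links(nodes: int) -> int:
    """Dedicated point-to-point links needed to connect every pair of nodes."""
    return nodes * (nodes - 1) // 2


if __name__ == "__main__":
    for n in (5, 25, 200):
        print(n, "nodes ->", full_mesh_links(n), "links")
```

That is 10 links for 5 nodes and 19,900 for 200 nodes, so 40 times the nodes costs roughly 2,000 times the links.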

The idea I had for playing with this idea, was to use some cheap wired home routers, set up private vlans on the 5 or so ethernet ports each has, then put private patch cables on each port, then put all the Wan ports on a dumb hub.

The local copies of Linux on each system can handle management of local device resources, and a daemon running on each node then handles listening/responding on each interface.

Just what such a thing would be good at doing escapes me though. To be really useful, you would need some way to have nodes specialize, then cooperate, without a central authority.

That way, should we decide to use this network to process live video, one node decodes the input stream, then dispatches portions of the decoded stream to peer devices, who then take the decoded stream and do whatever processing is requested, before sending the processed streams to yet another peer device which assembles the processed stream, then shuttles that to the endpoint node, which reencodes the stream and writes it to the output device. (Or some similarly cellular process)

I suppose this is kinda similar to how a neural column works, where locally interconnected nets are restricted in the number of true local peers they have, and then communicate collectively with other neural columns via dedicated interconnects. (The video input source in the above could be a camera, but it could also be another network's output stream.)

The major logical tasks are:
Role selection in the assigned task for each local node.
How to issue instructions to the mesh nodes in a decentralized manner

Depending on how far you wanted to extrapolate this, each mesh node could be treated as a logical unit, where each logical node then is part of another, higher level node of similar topology: each mesh has a direct connection to each other mesh inside its higher order node, and one communal link all nodes can talk on inside that node.

E.g., if I make five 5-node networks out of such routers, I need 7 ports on each router: 5 for direct local traffic, 1 for the local shared connection, and 1 for a direct connection to another 5-node group. Clever use of subnetting and routing on the shared net would allow a dumb gateway device to make the shared higher link function. Each 5-node network is connected to every other 5-node network in the scaled-up version.

Decisions on how to process incoming data might be tied to which interface received it, or any number of other methods.

Spying on the state of the whole system should be possible through the shared link infrastructure, though ideally any node you use to interact with the system should be a proper peer in it, and not something sitting on the shared net only.

The drawback of such a design will be signal propagation latency, and keeping all the subnodes, at all levels, synchronized. The human brain uses a support network of astrocytes and glial cells to guide dedicated link physical routing, and to tune propagation delay between neural columns through selective myelination of trunk bundles.

You could probably fake it with introduced waitstates.

At some point though, the behavior of the whole will revolve around the basic logic baked inside each physical compute unit. Ideall

Re: massive parallel processing=limited applicatio by Anonymous Coward · 2016-08-25 13:07 · Score: 1

I asked her to open her legs because I found them embarrassingly parallel. She told me I didn't have the appropriate permissions. So I just crashed there. Later, after I was re-booted, I left.

Re:massive parallel processing=limited application by Megol · 2016-08-25 13:11 · Score: 1

That isn't possible. First, the number of possibilities explodes fast, and second, we are already at a power wall. Modern processors already do speculative computation, but only in cases where it is likely the result is correct and needed. Just adding speculative execution will make the computer slower, partly due to extra data movement (caches etc.) and partly because it will consume more power on a chip that is already difficult to cool.

Branch predictors are doing most of the work already and doing it well.

Re:massive parallel processing=limited application by fyngyrz · 2016-08-25 13:13 · Score: 1

Also, there is caching, and also, some loads are heavy on longish FPU operations.

So... it doesn't quite work out that way. Also, multicore designs can have separate memory.

One example of multicore design that's both interesting and functional is the various vector processor graphics cores. Lots of 'em in there, and they get to do a lot of useful work you couldn't really do any other way with similar clock speeds and process tech.

--
I've fallen off your lawn, and I can't get up.

Re:massive parallel processing=limited application by Megol · 2016-08-25 13:14 · Score: 1

Parallel execution doesn't reduce serial sections - look at Amdahl's law.
Large parallel systems are a well-researched area, and I can't see anything this system will do to help improve the state of the art. Well, the state of the art of building processors with huge numbers of cores, perhaps.
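
For reference, Amdahl's law gives the speedup on n cores as 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel. A small Python sketch; the 95% parallel fraction below is an assumed figure for illustration, not a measurement of any real workload:

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is not sped up; only the parallel part is divided.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    p = 0.95  # assumed parallel fraction, for illustration only
    for n in (1, 8, 25, 1000):
        print(f"{n:5d} cores -> speedup {amdahl_speedup(p, n):.2f}")
```

With p = 0.95 the speedup can never exceed 1 / (1 - p) = 20, however many cores are added.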

Re:massive parallel processing=limited application by Anonymous Coward · 2016-08-25 13:15 · Score: 0

Or less. Remember, he said 1998.

Unsafe by Anonymous Coward · 2016-08-25 13:19 · Score: 0

With that complexity, how can you ever be sure it doesn't contain backdoors? You cannot trust DARPA.

Re: massive parallel processing=limited applicatio by Chaset · 2016-08-25 13:20 · Score: 1

That sounds like we're starting to re-invent the Itanium.

--
-- "This world is a comedy to those who think, a tragedy to those who feel."

Re: massive parallel processing=limited applicatio by Anonymous Coward · 2016-08-25 13:27 · Score: 0

In the multi user, multi tasking scenario, I guarantee those processes will never all be CPU bound unless one asshole is trying to make a point.

They'll be stalling for cache misses most of the time they even spend on the CPUs and then waiting on IO.

Re:massive parallel processing=limited application by godrik · 2016-08-25 13:33 · Score: 3

Well, nothing will ever break Amdahl's law. But that is rarely the issue. The parallelism in many scientific problems is pretty vast. We run lots of simulations on 100K and more cores. Often the interconnect is the issue, and not the sequential part.

There is a real problem today in building an exaflops machine. One of the biggest problems is managing communications, because they consume a lot of power. If that architecture can scale meaningful codes at 100K, it is interesting.

Re:massive parallel processing=limited application by godrik · 2016-08-25 13:39 · Score: 3, Insightful

That is not really true. Most workloads can be executed in parallel. Pretty much all the fields of scientific computing (be that physics, chemistry, or biology) are typically quite parallel. If you are looking at databases and data analytics, they are very parallel as well. If you are building topic models of the web, or trying to find correlations in Twitter posts, these things are highly parallel.

Even on your machine, you are certainly using a fair amount of parallel computing; most likely video decompression is done in parallel (or it should be). It is the old argument that by decreasing frequency you can increase core count in the same power envelope while increasing performance.

For sure, some applications are sequential. Most likely, they are not the ones we really care about. Otherwise, hire me, and I'll write them in parallel :)

Good news, bad news by AJWM · 2016-08-25 14:01 · Score: 1

The good news is that this thing uses an existing processor core, OpenSPARC T1 (SPARC V9), so there's plenty of software around for it. (Yes, it runs -- or I imagine it will soon -- Linux.)

The bad news is that this thing uses an existing processor core, instead of a more secure architecture (say, something segment based with tag bits, like the B6700 among others) which would render it much more resistant (dare I say immune?) to things like buffer overflows and such.

--
-- Alastair

Re:Good news, bad news by serviscope_minor · 2016-08-25 20:02 · Score: 1

The bad news is that this thing uses an existing processor core, instead of a more secure architecture (say, something segment based with tag bits, like the B6700 among others) which would render it much more resistant (dare I say immune?) to things like buffer overflows and such.
Sounds like it's intended for HPC though, so security's not much of an issue.

--
SJW n. One who posts facts.
Re:Good news, bad news by unixisc · 2016-08-26 01:39 · Score: 1

Plenty of software, yes - given that SunOS and Solaris were overwhelmingly the most popular UNIXes of the day. Linux? RedHat dropped support for it some versions ago, and I'm not sure whether Debian still does or not. I know that all 3 BSDs - OpenBSD, NetBSD and FreeBSD - do.

Great...where can you buy a working system by Anonymous Coward · 2016-08-25 14:16 · Score: 0

If there is not working systems - how does that help anyone

Re:Great...where can you buy a working system by Anonymous Coward · 2016-08-25 21:41 · Score: 0

If there is not working systems - how does that help me watch porn
TFTFY

Re:massive parallel processing=limited application by Anonymous Coward · 2016-08-25 15:21 · Score: 0

OpenOffice didn't exist in 1998.

good boy by Anonymous Coward · 2016-08-25 16:06 · Score: 0

Now you get tenure

Re:good boy by unixisc · 2016-08-26 01:40 · Score: 1

Just make the research into achieving 100% utilization of every CPU in the above jig, and you have enough to keep yourself busy for the rest of your guaranteed lifetime work

Re:massive parallel processing=limited application by goose-incarnated · 2016-08-25 17:17 · Score: 1

On a 8-core machine, a processor will be placed into a wait queue roughly 7 out of 8 times that it needs access. Further, The expected length of time in the queue is (1-(1/8)). This is of course, for an 8-core system. Adding more cores results in the waiting time increasing asymptotically towards infinity.

Sorry, that doesn't sound right. The expected length of time in the queue should be on the order of nt, where n is the number of cores and t is the average time required to process a memory-request. (A better formula would use the average length of the queue instead of n but to first order it still would be roughly linear with n.) So, the time required would increase linearly with the number of cores.

You're right, I worded it incorrectly (it's late, and I've been working 80hrs/week for the last year due to a private project. Forgive me). What I meant to say was "The expected delay when accessing memory is (1-(1/n))", but even that is off by an entire exponent.

The expected delay is (probability of queuing) X (probable length of queue). The probability of queuing is (1-(1/n)):

With 2 processors, you have a 1/2 chance of getting exclusive access, (1-(1/2)) of queuing.

With 3 processors, you have a 1/3 chance of getting exclusive access, (1-(1/3)) of queuing.

With 4 processors, you have a 1/4 chance of getting exclusive access, (1-(1/4)) of queuing.

With n processors, you have a 1/n chance of getting exclusive access, (1-(1/n)) of queuing.

The probable length of the queue is linearly proportional to n, so the expected delay is ((1-(1/n)) * n). In terms of performance this is O(n^2) - IOW it's piss-poor performance.

Or maybe I'm still doing the numbers wrong - feel free to derive a better statistic for predicting time-in-queue when processors are all using a single address bus. This is the one I got, and some trivial simulation does actually fit this profile.
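
The estimate above can be tabulated with a few lines of Python. This is only a sketch of the formula as stated, (1 - 1/n) * n, under the added assumptions that the queue length is exactly n and that each memory request takes one time unit; it is not a model of any real memory bus:

```python
def queue_probability(cores: int) -> float:
    """Chance that a core has to queue, per the 1 - 1/n estimate above."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 - 1.0 / cores


def expected_delay(cores: int, service_time: float = 1.0) -> float:
    """Expected wait per access: P(queue) times an assumed queue length of n."""
    return queue_probability(cores) * cores * service_time


def total_delay(cores: int, service_time: float = 1.0) -> float:
    """Wait summed over all cores when each issues one access."""
    return cores * expected_delay(cores, service_time)


if __name__ == "__main__":
    for n in (2, 4, 8, 16, 32):
        print(f"{n:2d} cores: per-access {expected_delay(n):6.2f}, "
              f"summed over cores {total_delay(n):8.2f}")
```

Note that (1 - 1/n) * n simplifies to n - 1, so the wait per access grows linearly with the core count, while the wait summed over all n cores is n(n - 1), which is the quadratic term.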

--
I'm a minority race. Save your vitriol for white people.

Re:For those that didn't read TFA, esp in regards by cpm99352 · 2016-08-25 17:51 · Score: 1

For those wondering why the distrust, here is a good article describing why the US govt is not to be trusted.

Re:massive parallel processing=limited application by Anonymous Coward · 2016-08-25 18:53 · Score: 0

Nope. You'll generally hit the wall with around 16-20 cores using shared memory. You need distinct processors with dedicated memory to make multi-processing scale beyond 20 or so processors. Those huge servers with 32 cores apiece hit their point of diminishing returns per processor after around 20 cores.

Doesn't NUMA handle this at least if your usage pattern allows the data to mostly stick near the associated processor? I suppose if you have that many cores in one processor it may be an issue...

Re:For those that didn't read TFA, esp in regards by serviscope_minor · 2016-08-25 19:58 · Score: 1

They're not the government, they're funded by government grants. As someone who has been funded by government grants, I can assure you it is completely different.

--
SJW n. One who posts facts.

Re:massive parallel processing=limited application by Anonymous Coward · 2016-08-25 21:36 · Score: 0

OpenOffice didn't exist in 1998.

Neither did the troll/charity shill you're responding to.

Re:For those that didn't read TFA, esp in regards by Anonymous Coward · 2016-08-25 21:38 · Score: 0

Sure it is, buddy.

Sure it.

Re:massive parallel processing=limited application by Anonymous Coward · 2016-08-25 21:44 · Score: 0

Your GPU would disagree.

The real question is: are there instances where a SPARC instruction set would be a better fit than the simple GPU concept of 'cores'.

Re:massive parallel processing=limited application by Anonymous Coward · 2016-08-25 22:42 · Score: 0

That is why you design multithreaded programs to avoid bottlenecks. A fair number of applications scale just fine to many hundreds of CPUs.

But yes, if you just aimlessly hack a single-threaded design to use random synchronization primitives, you will lose. And you will deserve to lose.

https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html

Re:massive parallel processing=limited application by K.+S.+Kyosuke · 2016-08-25 23:57 · Score: 1

So, no. More cores sharing the same memory is not the answer. More cores with private memory is the answer but we don't have any operating system that can actually take advantage of that. A project that I am eyeing for next year is putting together a system that can effectively spread out the operating system over multiple physical memories. While I do not think that this is feasible, it's actually not a bad hobby to tinker with :-)

I thought Plan 9 was actually doing this?

--
Ezekiel 23:20

Re:massive parallel processing=limited application by goose-incarnated · 2016-08-26 01:12 · Score: 1

That is why you design multithreaded programs to avoid bottlenecks. A fair number of applications scale just fine to many hundreds of CPUs.

But yes, if you just aimlessly hack a single-threaded design to use random synchronization primitives, you will lose.

Did you even read anything I wrote? Please tell me - exactly what multi-threaded design do you know of that can solve the contention for the 48-53 pieces of copper that go to the RAM? There's only one memory bus, and every instruction that must be executed travels on it before execution.

--
I'm a minority race. Save your vitriol for white people.

Re:32nm is shit by unixisc · 2016-08-26 01:19 · Score: 1

Actually, why are they going w/ IBM, who I thought had exited the semiconductor market? Why not go w/ the best of them - Intel? Speaking of which, I wonder - which CPU's ISA do they use?

Re:massive parallel processing=limited application by K.+S.+Kyosuke · 2016-08-26 01:22 · Score: 1

I would first start by making the main memory bus as wide as the L3/L4 cache line, and make *that* one as large as possible. But ultimately, all sharing works only to some extent, so thinking distributed wins big. There's still coherence in program executions. The mere ability to access a unit of data in a shared address space that is currently far away shouldn't mean that you'd be doing this at a high rate.

--
Ezekiel 23:20

Re: massive parallel processing=limited applicatio by K.+S.+Kyosuke · 2016-08-26 01:24 · Score: 1

Speculative execution? That's already happening, isn't it?

--
Ezekiel 23:20

Re: massive parallel processing=limited applicatio by unixisc · 2016-08-26 01:34 · Score: 1

That's what a number of RISC processors used to do - execute both paths of a branch, and flush the one that turned out wrong.

Re: massive parallel processing=limited applicatio by unixisc · 2016-08-26 01:35 · Score: 1

Uh, no, in the original Itanium, it was the compiler that was supposed to do the branch prediction.

Re:For those that didn't read TFA, esp in regards by unixisc · 2016-08-26 01:36 · Score: 1

Ok, so it's OpenSPARC based? That's cool. So what do they have running on it - Linux, BSD or Solaris?

Re:massive parallel processing=limited application by Anonymous Coward · 2016-08-26 01:52 · Score: 0

It was called StarOffice back in those days.

SPARC? Really? by emil · 2016-08-26 03:32 · Score: 1

Branch delay slots? Register windows? This is one of the first RISC architectures, and it has warts. Fujitsu just abandoned it.

Re:massive parallel processing=limited application by Anonymous Coward · 2016-08-26 04:00 · Score: 0

A project that I am eyeing for next year is putting together a system that can effectively spread out the operating system over multiple physical memories. While I do not think that this is feasible, it's actually not a bad hobby to tinker with :-)

SGI has done that since the late 90s, calling it ccNUMA (cache-coherent non-uniform memory access). NUMA is common on the many-core Core i and AMD systems, where each group of cores has "its own" memory bus.

Lots of cores FPGA style by Anonymous Coward · 2016-08-26 04:32 · Score: 0

Suppose you treat these identical processor cores like LUTs in an FPGA and perhaps group clusters of these cores together in hardware into some sort of logical device an OS can utilize. Each device has programmable interconnect between the cores - that's going to be a lot of switching fabric. Part of the software compilation process or even software installation process includes an analysis and synthesis stage to create the best interconnect for the application. The OS runs the software by programming the interconnect then loading the object code. This would all help with the versatility issue but may still be too expensive.

This is an interesting problem domain that's going to get more and more attention since it's easier to lay down more cores than to make those cores run at higher clock frequencies.

Re:massive parallel processing=limited application by epine · 2016-08-26 04:44 · Score: 1

On a 8-core machine, a processor will be placed into a wait queue roughly 7 out of 8 times that it needs access.

You just snuck into your analysis the assumption that every core is memory saturated, and I don't think that all the memory path aggregates in many designs until the L3 cache (usually L1 not shared, L2 perhaps shared a bit). The real bottleneck ends up being the cache coherency protocol, and cache snoop events, handled on a separate bus, which might even have concurrent request channels.

I think in Intel's Xeon E5 line-up there are single-ring and bridged double-ring SKUs for forwarding dirty cache lines from one cache to another (and perhaps all memory requests). This resource can also drown for many workloads.

In many systems, you have all these cores running tasks which are fairly well isolated (not much cache conflict), except they all want to be able to allocate as much memory as they need from a giant memory space (e.g. a TB of DRAM) so they fundamentally have to fall through to a shared memory allocation framework.

You can learn a lot about the challenges involved by following the winding path of something like jemalloc as increasing concurrency levels expose yet another degeneracy.

The real problem with this field is that there isn't a single, simple story like the one you tried to tell. There are usually dozens of ways to skin the cat, each with completely different scaling stories, with different sets of engineers who are good at tweaking or debugging those stories.

At this point, what you have is a fragile coordination problem between your solution space, your architecture, and the engineers you employ, forcing ambitious ventures to crack out the golden recipe: pour in seven cement mixers full of head hunters, one 55-gallon oil drum of exclamation marks, a metric butter tonne of job perks, and agitate appropriately.

Re:massive parallel processing=limited application by Anonymous Coward · 2016-08-26 13:20 · Score: 0

24min anime fit in 40MiB of space back then.

Re: massive parallel processing=limited applicatio by Bengie · 2016-08-26 13:37 · Score: 1

I've managed super-linear speedup a few times with multi-threading. It required good use of cache. If you can get the threads to be pseudo-synchronized without having to use any actual synchronization, the other threads can benefit from what the first thread reads from memory. This case only applies to cores that share the same cache. The "super-linear" part no longer applied when adding more sockets/CPUs, and adding more cores had diminishing returns, approaching a fixed percentage increase in performance over a single thread.

Then I tell people I code in C# and they don't understand how someone who writes in a high-level language knows how to think so low-level. Let's just say I'm the go-to guy when you can't empirically find why your code is slow. Many hard performance issues cannot be measured because measuring can change the outcome. At that point you need a good mental model of how CPUs, cache, memory, OSes, thread schedulers, I/O schedulers, hard drives, SSDs, and networks interact to produce strange slowness when no one part is the bottleneck. It's almost always an issue of latency vs throughput, and of different parts of the system having different throughput or latency characteristics.

Re:massive parallel processing=limited application by Bengie · 2016-08-26 13:56 · Score: 1

Also, multicore designs can have separate memory.

NUMA comes to mind but it has complexity issues added to the OS and application. Accessing another CPU's memory is expensive, so the OS needs to try to keep the threads close to the data. The applications need to try to do cross socket communication by using a system API, assuming it exists, to find out which socket the thread is on and trying best to limit to current socket threads. Cross socket communication is probably best done via passing messages instead of reference because copying and storing the data locally, even if duplicated, will be faster than constantly accessing the other socket's memory.

Then you have the issue of load balancing memory allocations. May not always be an issue but it can become an issue if you consume all of one socket's memory. There are other issues like one socket may have direct access to the NIC while the other socket has direct access to the harddrives. Topology is important.

As soon as you step out of a cache-coherent system, you run into even more fun problems. Stale data is a huge issue. You need to make sure you're always getting the current data and not some local copy. At the same time, without cache coherency, cross-core communication is very high latency. Most x86 CPUs can remain cache-coherent into single-cycle latencies. While copying the data may not be any faster, you know very, very quickly if the data changed. If the data is read a lot and rarely changed, then you have some nice guarantees about how quickly you know if the data changed, and you only incur the access cost when that event happens. Without coherency, you are forced to check out to memory every time, incurring high latency costs on every access.

With multi-core systems, cache coherency has an N^2 problem. I'm sure someone will come up with an idea of "channels" to facilitate low-latency inter-core communication while allowing normal memory access to be separate. Possibly even islands of cache coherency in an ocean of cores. Each island can be a small group. Some of the many-core designs with 80+ cores have heavy locality issues. Adjacent cores are fast to access and far away cores are very expensive. Pretty much think of each core as only able to talk to adjacent cores, with requests to far away cores needing to go many "hops". Even worse, cores physically nearer the memory controller have faster access to the memory. All memory requests have to go through these cores. Lots of fun issues that require custom OS designs.
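
The "hops" cost can be illustrated with a small Python sketch. It assumes the cores sit on a square 2D grid with links only between adjacent cores, so the hop count between two cores is their Manhattan distance. The grid layout and function names are assumptions made for illustration, not a description of any particular chip:

```python
from itertools import product


def hops(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Hop count between two grid positions when only neighbours are linked."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def average_hops(side: int) -> float:
    """Mean hop count over all ordered pairs of distinct cores on a side x side grid."""
    if side < 2:
        return 0.0  # a single core never needs to hop
    cores = list(product(range(side), repeat=2))
    total = sum(hops(a, b) for a in cores for b in cores if a != b)
    pairs = len(cores) * (len(cores) - 1)
    return total / pairs


def max_hops(side: int) -> int:
    """Worst case: corner to opposite corner."""
    return 2 * (side - 1) if side > 0 else 0


if __name__ == "__main__":
    # 5x5 = 25 cores, 9x9 = 81 cores ("80+"), 16x16 = 256 cores.
    for side in (5, 9, 16):
        print(f"{side * side:3d} cores: average {average_hops(side):.2f} hops, "
              f"worst case {max_hops(side)} hops")
```

Both the average and the worst-case distance grow with the side length of the grid, which is why placing threads close to their data matters more as the core count goes up.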

Re:massive parallel processing=limited application by Bengie · 2016-08-26 14:05 · Score: 1

That's why the many-core server CPUs have massive L3 caches and quad-channel memory. 24-core x86 CPU with around 60MiB of L3 cache? Why not? More memory channels allow more concurrency of access. Intel NICs support writing packets directly to L3 cache so as to skip memory writes. Large on-NIC buffers make better use of DMA collecting and reduce memory operations, transferring in larger chunks to make use of that high-bandwidth memory.

In case it's not clear, I'm not trying to say your point isn't valid, just saying your point explains a lot of current features in high end components.

Re: massive parallel processing=limited applicatio by Anonymous Coward · 2016-08-26 18:42 · Score: 0

Are most tasks serial? Brains have a really low clock rate, and a ludicrous number of parallel execution units. Humans definitely win the "variety of tasks" award over computers, though.

Multicore is fine. The software just sucks still.

Re:For those that didn't read TFA, esp in regards by RockDoctor · 2016-08-27 13:29 · Score: 1

I imagine that any association with the US Govt. could engender distrust in such matters these (post Snowden) days.

It might engender quantitatively more distrust than in pre-Snowden days, but probably not deeper distrust. The US government has been deeply distrusted by the rest of the world for a long time.

--
Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
