How We'll Program 1000 Cores - and Get Linus Ranting, Again

Posted by samzenpus on Thursday January 1, 2015 @07:00PM from the getting-a-good-start dept.

vikingpower writes For developers, 2015 got kick-started mentally by a Linus Torvalds rant about parallel computing being a bunch of crock. Although Linus' rants are deservedly famous for their political incorrectness and (often) for their insight, it may be that Linus has overlooked Gustafson's Law. Back in 2012, the High Scalability blog already ran a post pointing towards new ways to think about parallel computing, especially the ideas of David Ungar, who thinks in the direction of lock-less computing of intermediary, possibly faulty results that are updated often. At the end of this year, we may be thinking differently about parallel server-side computing than we do today.
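
For readers who have not met Gustafson's Law, here is a minimal sketch contrasting it with Amdahl's Law. The function names and the 5% serial fraction are assumptions chosen for illustration only; they do not come from the story or from Linus' post. Note also that the two laws define the serial fraction slightly differently (Amdahl measures it on the single-core run, Gustafson on the parallel run), so the comparison is indicative rather than exact.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Fixed problem size: speedup = 1 / (s + (1 - s) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def gustafson_speedup(serial_fraction: float, cores: int) -> float:
    """Problem size grows with the core count: speedup = N - s * (N - 1)."""
    return cores - serial_fraction * (cores - 1)


if __name__ == "__main__":
    s = 0.05  # assumed serial fraction, purely illustrative
    for n in (4, 64, 1000):
        print(n, round(amdahl_speedup(s, n), 1), round(gustafson_speedup(s, n), 1))
```

Under Amdahl's assumption the speedup can never exceed 1/s no matter how many cores are added, while under Gustafson's assumption it keeps growing almost linearly. Which assumption fits desktop workloads is exactly what the thread below argues about.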

Mutex lock by Anonymous Coward · 2015-01-01 19:05 · Score: 5, Funny

All the others ended up in a mutex lock situation, so I had a chance to do the first post
1. Re:Mutex lock by NoNonAlphaCharsHere · 2015-01-01 19:19 · Score: 4, Funny
  
  Thanks a lot asshole, a lot of were busy-waiting while you were typing.
2. Re:Mutex lock by NoNonAlphaCharsHere · 2015-01-01 19:32 · Score: 5, Funny
  
  I think I a word.
  
  A lot of US were busy-waiting.
3. Re:Mutex lock by TheRaven64 · 2015-01-01 21:16 · Score: 5, Funny
  
  That's what happens when you try to write without a lock.
  
  --
  I am TheRaven on Soylent News
Pullin' a Gates? by Tablizer · 2015-01-01 19:11 · Score: 4, Interesting

"4 cores should be enough for any workstation"
Perhaps it's an over-simplification, but if it turns out wrong, people will be quoting that for many decades like they do Gates' memory quote.

--
Table-ized A.I.
1. Re:Pullin' a Gates? by bruce_the_loon · 2015-01-01 19:42 · Score: 4, Interesting
  
  If you went and read Linus' rant, then you'll find you are actually reinforcing his argument. He says that except for a handful of edge use-cases, there will be no demand for massively parallel in end user usage and that we shouldn't waste time that could be better spent optimizing the low-core processes.
  The CAD, video and HTPC use-cases are already solved by the GPU architecture and don't need to be re-solved by inefficient CPU algorithms.
  Your Linux workstation would be a good example, but is a very low user count requirement and can be done at the compiler level and not the core OS level anyway.
  Your Linux gaming machine shouldn't be using more than 3 or 4 CPU cores, and should be handing the heavy grunt work off to the GPU anyway. No need for a 64-core CPU for that one.
  Redesigning what we're already doing successfully with a low number of controller/data-shifting CPU cores managing a large bank of dedicated rendering/physics GPU cores and task-specific ASICs for things like 10Gb networking and 6Gb IO interfaces is pretty pointless. That is what Linus is talking about, not that we only need 4 cores and nothing else.
  
  --
  Trying to become famous by taking photos. Visit my homepage please.
2. Re:Pullin' a Gates? by Urkki · 2015-01-01 20:29 · Score: 5, Insightful
  
  Why not? Currently Firefox has problems rendering (loading) two pages simultaneously, although it should be able to handle tens, using several cores.
  Same with Evince (which is crap anyway), it cannot do anything in parallel, should be able to use tens of cores.
  Javascript? Although the language is the worst I have seen since APL, a smart compiler could at least in some cases parallelize it (maybe with speculative execution or like).
  And so on.
  It will turn out to be as wrong as "640k".
  Javascript is generally used in event driven manner, so it will perform quite well on a single core. Firefox having trouble loading multiple pages simultaneously should still be IO-bound, not CPU-bound, and if the engine has trouble, then it's an SW architecture problem where more cores will not really help.
  Linus' point was that taking a 6-core CPU and replacing 2 of the cores with more cache and more transistors per core should make almost anything on the desktop run faster.
Bad summary, shocking by Urkki · 2015-01-01 20:06 · Score: 5, Interesting

Linus doesn't so much say that parallelism is useless; he's saying that more cache and bigger, more efficient cores are much better. Therefore, increasing the number of cores at the cost of single-core efficiency is just stupid for general-purpose computing. Better to just stick more cache on the die instead of adding a core. Or that is how I read what he says.
I'd say the number of cores should scale with IO bandwidth. You need enough cores to make parallel compilation CPU-bound. Is 4 cores enough for that? Well, I don't know, but if the cores are efficient (highly parallel out-of-order execution) and have large caches, I'd wager IO lags far behind today. Is IO catching up? When will it catch up, if it is? No idea. Maybe someone here does?
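
The "cores should scale with IO bandwidth" idea can be sketched as a back-of-the-envelope model. Everything here is an assumption for illustration: the function names are hypothetical, the model ignores caching and bursty access, and the numbers in the usage section are arbitrary placeholders, not measurements of any real disk or compiler.

```python
import math


def cores_kept_busy(io_mb_per_s: float, per_core_mb_per_s: float) -> int:
    """Number of cores the IO path can keep fully fed with input."""
    if io_mb_per_s < 0 or per_core_mb_per_s <= 0:
        raise ValueError("bandwidths must be positive")
    return math.floor(io_mb_per_s / per_core_mb_per_s)


def bottleneck(cores: int, io_mb_per_s: float, per_core_mb_per_s: float) -> str:
    """Return "CPU" if the cores are the limit, "IO" if storage is.

    When the core count exactly matches what IO can feed, the job is
    reported as CPU-bound, since no core sits idle waiting for data.
    """
    if cores <= cores_kept_busy(io_mb_per_s, per_core_mb_per_s):
        return "CPU"
    return "IO"


if __name__ == "__main__":
    # Placeholder figures only: 400 MB/s of IO, 50 MB/s consumed per core.
    print(cores_kept_busy(400.0, 50.0))
    print(bottleneck(4, 400.0, 50.0))
    print(bottleneck(16, 400.0, 50.0))
```

In this model, adding cores beyond what the IO path can feed buys nothing, which is one way to read the question of whether 4 cores are enough.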
Torvalds is half right by popo · 2015-01-01 20:25 · Score: 5, Insightful

The problem is that Linus is discussing two different things at once and so it sounds like he's making a more inflammatory point than he is.
The issue is not whether parallelism is uniformly better for all tasks. The question is whether parallelism is better for some tasks. And as Torvalds points out, those tasks do exist (graphics being an obvious one).
The nature of the workload required for most workstations is non-uniform processing of large quantities of discrete, irregular tasks. For this, parallelism (as Torvalds correctly notes) is likely not the most efficient approach. To pretend that in some magical future, our processing needs can be homogenized into tasks for which parallel computing is superior is to make a faith-based prediction on how our use of computers will evolve. I would say that the evidence is quite the opposite: tasks will become more discrete and unique.
Some fields, though (finance, science, statistics, weather, medicine, etc.), are rife with computing tasks which ARE well suited to parallel computing. But how many of those tasks happen on workstations? Not many, most likely. So Linus' point is valid.
But I have to take issue with the tone in which Linus downplays "graphics" as being a rather unimportant subset of computing tasks. It's not "graphics". It's "GRAPHICS". That's not a small outlier of a task. Wait until we're all wearing ninth generation Oculus headsets... the trajectory of parallel processing requirements for graphics is already becoming clear -- and it's stratospheric. The issue is this: our desktop processing requirements are actually slowing and, as Linus points out, are probably ill-suited for increased parallelism. But our graphics requirements may be nearly infinite.
Unlike other fields of computing, we know where graphics is going 20 years from now: It's going to the "holodeck".
Keep working on parallel computing guys. Yes, we need it.

--
------ The best brain training is now totally free : )
1. Re:Torvalds is half right by Anonymous Coward · 2015-01-02 00:44 · Score: 4, Informative
  
  AMD have a line of CPUs very much like this, the A Series. It has several conventional multi-purpose x86-64 cores for general-purpose use and a Graphics Processing Unit built-in for those embarrassingly-parallel floating-point operations. Best of all, they're very cheap and perform very well.
2. Re: Torvalds is half right by Half-pint+HAL · 2015-01-02 01:56 · Score: 4, Informative
  
  Verification is the process of checking that software works correctly. The more complex the system, the more complex the process of verification. Rather unfair of the GP to throw that in as a single word after you explicitly said that you're not a computer scientist.
  
  --
  Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
Re:i'm so tired of political correctness by Attila+Dimedici · 2015-01-01 23:00 · Score: 4, Insightful

No, "political correctness" is a thing. It is where someone gets in trouble for using the word "niggardly" because it sounds like another word.

--
The truth is that all men having power ought to be mistrusted. James Madison
Lots of moving parts by m.dillon · 2015-01-02 07:05 · Score: 4, Informative

There are lots of moving parts here. Just adding cores doesn't work unless you can balance it out with sufficient cache and main memory bandwidth to go along with the cores. Otherwise the cores just aren't useful for anything but the simplest of algorithms.
The second big problem is locking. Locks which worked just fine under high concurrent loads on single-socket systems will fail completely on multi-socket systems just from the cache coherency bus bandwidth the collisions cause. For example, on an 8-thread (4-core) single-socket Intel chip, having all 8 threads contending on a single spin lock does not add a whole lot of overhead to the serialization mechanic. A 10ns code sequence might serialize to 20ns. But try to do the same thing on a 48-core Opteron system and suddenly serialization becomes 1000x less efficient. A 10ns code sequence can serialize to 10us or worse. That is how bad it can get.
Even shared locks using simple increment/decrement atomic ops can implode on a system with a lot of cores. Exclusive locks? Forget it.
The only real solution is to redesign algorithms, particularly the handling of shared resources in the kernel, to avoid lock contention as much as possible (even entirely). Which is what we did with our networking stack on DragonFly and numerous other software caches.
Some things we just can't segregate, such as the namecache. Shared locks only modestly improve performance, but it's still a whole lot better than what you get with an exclusive lock.
The namecache is important because something like a bulk build, where we have 48 cores all running gcc at the same time, winds up sharing an enormous number of resources. Not just the shell invocations (where the VM pages are shared massively and there are 300 /bin/sh processes running or sitting due to all the Makefile recursion), but also the namecache positive AND negative hits due to the #include path searches.
Other things, particularly with shared resources, can be solved by making the indexing structures per-cpu but all pointing to the same shared data resource. In DragonFly doing that for seemingly simple things like an interface's assigned IP/MASKs can improve performance by leaps and bounds. For route tables and ARP tables, going per-cpu is almost mandatory if one wants to be able to handle millions of packets per second.
Even something like the fork/exec/exit path requires an almost lockless implementation to perform well on concurrent execs (e.g. /bin/sh in a large parallel make). Before I rewrote those algorithms our 48-core Opteron was limited to around 6000 execs per second. After rewriting it's more like 40,000+ execs per second.
So when one starts working with a lot of cores for general purpose computing, pretty much the ENTIRE operating system core has to be reworked. What worked well with only 12 cores will fall on its face with more.
-Matt
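
The "per-cpu indexing structures all pointing to the same shared data" idea in the comment above can be sketched roughly as follows. This is only an illustration of the layout, not DragonFly's code: the class and method names are hypothetical, the route entry is made-up sample data, and because of Python's global interpreter lock the sketch shows the structure rather than any real performance gain.

```python
import threading
from typing import Any, Dict, Hashable, List, Optional


class PerCpuIndex:
    """One lookup table per CPU, every table referencing the same shared objects."""

    def __init__(self, ncpus: int) -> None:
        self._indexes: List[Dict[Hashable, Any]] = [{} for _ in range(ncpus)]
        # Only the (assumed rare) writers serialize on this lock.
        self._update_lock = threading.Lock()

    def insert(self, key: Hashable, shared_obj: Any) -> None:
        """Publish the same object into every per-CPU table (no copies)."""
        with self._update_lock:
            for index in self._indexes:
                index[key] = shared_obj

    def lookup(self, cpu: int, key: Hashable) -> Optional[Any]:
        """Hot path: touches only the caller's own table and takes no lock."""
        return self._indexes[cpu].get(key)


if __name__ == "__main__":
    routes = PerCpuIndex(ncpus=4)
    entry = {"gateway": "10.0.0.1"}  # made-up sample route entry
    routes.insert("10.0.0.0/8", entry)
    # Every CPU sees the very same object through its private index.
    print(all(routes.lookup(cpu, "10.0.0.0/8") is entry for cpu in range(4)))
```

The design trade is that updates become more expensive (one write per CPU) so that lookups, which dominate in something like packet forwarding, never contend on a shared lock or a shared cache line.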