Slashdot Mirror


How We'll Program 1000 Cores - and Get Linus Ranting, Again

vikingpower writes For developers, 2015 got kick-started mentally by a Linus Torvalds rant about parallel computing being a bunch of crock. Although Linus' rants are deservedly famous for their political incorrectness and (often) for their insight, it may be that Linus has overlooked Gustafson's Law. Back in 2012, the High Scalability blog already ran a post pointing towards new ways to think about parallel computing, especially the ideas of David Ungar, who thinks in the direction of lock-less computing of intermediary, possibly faulty results that are updated often. At the end of this year, we may be thinking differently about parallel server-side computing than we do today.

306 of 449 comments

  1. Mutex lock by Anonymous Coward · · Score: 5, Funny

    All the others ended up in a mutex lock situation, so I had a chance to do the first post

    1. Re:Mutex lock by NoNonAlphaCharsHere · · Score: 4, Funny

      Thanks a lot asshole, a lot of were busy-waiting while you were typing.

    2. Re:Mutex lock by NoNonAlphaCharsHere · · Score: 5, Funny

      I think I a word.

      A lot of US were busy-waiting.

    3. Re:Mutex lock by Z00L00K · · Score: 2

      In any case - a multi-core machine can also handle multiple different tasks simultaneously, it's not always necessary to break down a single task into sub problems.

      The future for computing will be to have a system that can adapt and avoid single resource contention as much as possible.

      --
      If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
    4. Re:Mutex lock by TheRaven64 · · Score: 5, Funny

      That's what happens when you try to write without a lock.

      --
      I am TheRaven on Soylent News
    5. Re:Mutex lock by Anonymous Coward · · Score: 1

      The future for computing will be to have a system that can adapt and avoid single resource contention as much as possible.

      I think this is basically the point LT was making. The core is already dozens of times faster than memory and thousands of times faster than storage, so adding more cores does not really address resource contention. Make more and better caches; make more and better I/O. Having 1000 cores all waiting for 3,000 microseconds while the hard drive rotates to the other side of the platter does not improve performance over 4 cores waiting.

    6. Re:Mutex lock by buckfeta2014 · · Score: 1

      I use SSD, you insensitive clod!

      --
      Buck Feta. You know what to do.
    7. Re:Mutex lock by drinkypoo · · Score: 2

      The core is already dozens of times faster than memory and thousands of times faster than storage

      When you add more cores, you also can add more memory bandwidth, if you couple them closely to memory controllers. This is how multiprocessor PCs work today. Hell, even some processors with more cores in them have more memory buses, it's not just adding chips that gives you more bandwidth.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    8. Re:Mutex lock by Anonymous Coward · · Score: 1

      The future for computing will be to have a system that can adapt and avoid single resource contention as much as possible.

      The future? That's the whole point of out-of-order execution. Execution is governed by availability of input data. OoO microprocessors have been around for 25 years now.

    9. Re:Mutex lock by arth1 · · Score: 2

      I use SSD, you insensitive clod!

      Then after all blocks on the drive have been written to, you wait for a second while the drive moves data away and clears a sector so there's space to write to.
      SSDs have far better average write speeds, but far worse worst case write speeds. Using them for anything timing critical without a battery backed up controller is asking for trouble.

      "Use TRIM", I hear from the peanut gallery. Except that there are no RAID controllers (or software RAIDs) that actually support TRIM in practice. Nor does TRIM work for partitions where there is no file system support. Like raw database partitions or swap. Yep, put a single swap partition on the drive, and you will still be subject to the drive not knowing what blocks are free, and can't write to them unless asked to overwrite them.

      For guaranteed rate I/O, spinning platter drives and pure battery-backed RAM disks are still the way to go. A RAID of short-stroked HDs has a worst case performance far better than modern SSDs, despite the average being much slower. For a desktop user, an occasional or rare "hiccup" of a second might not be noticeable or even a concern if it is, so SSDs are fine, and even great. For real-time data processing, it can very well be a big concern.

    10. Re:Mutex lock by Bengie · · Score: 1

      That's why SSDs keep a reserve of about 10%-30% of the logical storage as pre-TRIM'd. Some of the newer SSDs even reserve another bunch of the drive as a scratch pad for writes.

      Looking at this benchmark, it seems reads are more likely to have a random long access time than writes. http://techreport.com/review/2...

    11. Re:Mutex lock by azav · · Score: 1

      Do you also own cat?

      You use *an* SSD.

      --
      - Zav - Imagine a Beowulf cluster of insensitive clods...
    12. Re:Mutex lock by Jeremi · · Score: 1

      You use *an* SSD.

      He uses an solid state disks?

      --
      I don't care if it's 90,000 hectares. That lake was not my doing.
    13. Re:Mutex lock by Jane+Q.+Public · · Score: 1

      Yes, in fact I've been looking at a lot of the newer SSDs coming out that are reporting sustained random read rates slower than sustained random writes (4k blocks).

    14. Re:Mutex lock by Marillion · · Score: 1

      Will no one think of the dying Dining Philosophers?

      --
      This is a boring sig
  2. Pullin' a Gates? by Tablizer · · Score: 4, Interesting

    "4 cores should be enough for any workstation"

    Perhaps it's an over-simplification, but if it turns out wrong, people will be quoting that for many decades like they do Gates' memory quote.

    1. Re:Pullin' a Gates? by cb88 · · Score: 2

      It already is wrong...

      Linux Workstation: 16 cores = way faster builds than 4 cores.
      CAD workstation: I imagine a lot of geometry processing is parallelized... the less waiting the better (either format conversion or generating demo videos etc. eat up a lot of CPU)
      Video workstation: That's just a blatantly obvious use for multiple cores...
      Linux HTPC: I wanna transcode stuff fast... more cores
      Linux Gaming: These days using at least 4 cores is getting more common...

      Things that I often see that are *broken*: for instance, 200MB work documents that hang the entire system when you scroll (yes Windows, that's bad). Linux isn't much better though; disk IO starvation is a long-time pet peeve there... 4 cores is the wrong place to draw the line currently; maybe 6-8 cores + improved disk IO would be a realistic ideal these days.

      Granted, a lot of programs *ought* to run just fine on my Sparcstation LX @ 50MHz and 128MB RAM... but that isn't the future unless we have a nuclear apocalypse. Also, there is a good chance that a lot of my cores will sit idle; even so, power management is better than it used to be, and more cores can improve latency because now I have more available CPU time even though the individual cores are probably slower. Overall that's a good tradeoff.

    2. Re:Pullin' a Gates? by bruce_the_loon · · Score: 4, Interesting

      If you went and read Linus' rant, then you'll find you are actually reinforcing his argument. He says that except for a handful of edge use-cases, there will be no demand for massively parallel in end user usage and that we shouldn't waste time that could be better spent optimizing the low-core processes.

      The CAD, video and HTPC use-cases are already solved by the GPU architecture and don't need to be re-solved by inefficient CPU algorithms.

      Your Linux workstation would be a good example, but is a very low user count requirement and can be done at the compiler level and not the core OS level anyway.

      Your Linux gaming machine shouldn't be doing more than 3/4 cores of CPU and handing the heavy grunt work off to the GPU anyway. No need for a 64 core CPU for that one.

      Redesigning what we're already doing successfully with a low number of controller/data shifting CPU cores managing a large bank of dedicated rendering/physics GPU cores and task-specific ASICs for things like 10Gb networking and 6Gb/s IO interfaces is pretty pointless. That is what Linus is talking about, not that we only need 4 cores and nothing else.

      --
      Trying to become famous by taking photos. Visit my homepage please.
    3. Re:Pullin' a Gates? by jhol13 · · Score: 3, Insightful

      Why not? Currently Firefox has problems rendering (loading) two pages simultaneously, although it should be able to handle tens, using several cores.
      Same with Evince (which is crap anyway), it cannot do anything in parallel, should be able to use tens of cores.
      Javascript? Although the language is the worst I have seen since APL, a smart compiler could at least in some cases parallelize it (maybe with speculative execution or like).
      And so on.

      It will turn out to be as wrong as "640k".

    4. Re:Pullin' a Gates? by Tablizer · · Score: 1

      The Gates quote is ambiguous. One can read it different ways.

    5. Re:Pullin' a Gates? by ls671 · · Score: 2

      hmmm... Linus sounds right to me too. He specifically said, or almost, that people wanting to load 10 pages in sandboxed firefox process/thread in parallel could find a use for 16 cores ;-)

      --
      Everything I write is lies, read between the lines.
    6. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1

      No, the quote is plain untrue. He supposedly said it in reference to the hybrid 8/16-bit processor (and had he said it about that chip, he would have been right). But no source has ever been found to confirm this, and Gates himself has said that he has said many wrong things and made many bad predictions, but that the one credited to him is false.

    7. Re:Pullin' a Gates? by bloodhawk · · Score: 3, Insightful

      Actually the quote is just an internet myth; at least, no one has ever found a source for it or anyone who even claims to have heard him say it, and Gates denies having said it as well.

    8. Re:Pullin' a Gates? by davmoo · · Score: 2

      Except that Bill Gates never actually said the so-called "quote" that is attributed to him.

      --
      I want a new quote. One that won't spill. One that don't cost too much. Or come in a pill.
    9. Re:Pullin' a Gates? by Rei · · Score: 3, Interesting

      Linus's argument basically boils down to, "Parallel algorithms are sorcery, and the only places they matter are applications that demand performance, which are indeed increasingly using parallelism".

      Of course you don't need, say, a 50-threaded version of vi or alsamixer or whatever. But apps that need performance increasingly have to get it from threading. And there's nothing "magical" about parallelism. Perhaps in Linus's dislike for C++ he's missed how trivially easy it's gotten to launch threads in C++11, but it takes less work now than a for-loop, since std::thread is so simple and you can inline the command with a lambda. And you have a nice clean mutex library including scoped mutexes like std::lock_guard so you don't even have to remember to unlock them.

      It's quite true that having multiple cores needing to read from and write to the same chunk of memory isn't a good thing. But I'd bet you that only in under 5% or so of high performance apps is that the *only* level you can thread at. Because if you have, say, five nested levels of looping, 4 of them can be memory constrained, but so long as at least one can be threaded without heavy reads/writes on shared cache, you can thread to your heart's content with minimal adverse impact. And "heavy" is the key word. So long as you're not doing essentially *constant* heavy reads/writes on shared cache, the overhead cost is minimal.

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    10. Re:Pullin' a Gates? by Urkki · · Score: 5, Insightful

      Why not? Currently Firefox has problems rendering (loading) two pages simultaneously, although it should be able to handle tens, using several cores.
      Same with Evince (which is crap anyway), it cannot do anything in parallel, should be able to use tens of cores.
      Javascript? Although the language is the worst I have seen since APL, a smart compiler could at least in some cases parallelize it (maybe with speculative execution or like).
      And so on.

      It will turn out to be as wrong as "640k".

      Javascript is generally used in event driven manner, so it will perform quite well on a single core. Firefox having trouble loading multiple pages simultaneously should still be IO-bound, not CPU-bound, and if the engine has trouble, then it's an SW architecture problem where more cores will not really help.

      Linus's point was that taking a 6 core CPU and replacing 2 cores with more cache and more transistors per core should make almost anything on the desktop run faster.

    11. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1

      Firefox has problems rendering (loading) two pages simultaneously, although it should be able to handle tens, using several cores.

      What you describe is basic concurrency (doing two mostly independent tasks). Massively parallel computing, as some people imagine it, would magically use all your cores at 100% to load one or more pages in an instant. Most work does not scale that way, just like nine women can't make a child in a month.

    12. Re:Pullin' a Gates? by Urkki · · Score: 2

      It already is wrong...

      Linux Workstation: 16 cores = way faster builds than 4 cores.

      Did the 4 core CPU have 1/4th of the transistor count of the 16 core CPU? Then I'd expect it to be much slower, of course. Linus's point was that a 4 core CPU with the same transistor count as a 16 core CPU (used for more cache, better out-of-order execution logic, more virtual registers, and so on) will be faster on almost every task. So cores beyond 4 (the number Linus threw out as the ballpark count) make sense only if you really cannot spend any more transistors on making those 4 cores faster, but still have die space to spare.

    13. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1

      Why not? Currently Firefox has problems rendering (loading) two pages simultaneously, although it should be able to handle tens, using several cores.
      Same with Evince (which is crap anyway), it cannot do anything in parallel, should be able to use tens of cores.

      You still don't argue against Linus here. The kind of parallel he is talking about is the kind that uses hundreds, not tens of cores and for a longer duration.
      The problem you are talking about only comes from Firefox bloat and bad design.

    14. Re:Pullin' a Gates? by ls671 · · Score: 1

      Nope, I only say that because I already thought the same way before I was aware of his view. It happens all the time.

      --
      Everything I write is lies, read between the lines.
    15. Re:Pullin' a Gates? by im_thatoneguy · · Score: 3, Interesting

      It is a niche which will need specific algorithms tuned for the hardware (GPU or other); the pipeline must be kept busy to observe a performance gain. It doesn't scale to general purpose computing.

      I feel like this is moving the goal posts. "You will never do massively parallel computing on a CPU because if it's massively parallel it's a GPU not a CPU."

      Linus is 100% wrong. What's the "general purpose" computing that we all want? The NCC-1701D's main computer from Star Trek. If I say "Cortana/Siri/Google Now please rough me out a flyer for our yard sale on Saturday." you're going to be looking at a massively parallel task for the neural networks to not only interpret the voice but then make sense of the words and finally produce a printable flyer suitable for hanging. Programming is still a really fancy version of "IF A THEN B". "for X in GROUP do Z". "X = Y". Yeah, if your application is incredibly serial then a serial processor is all that you'll need. When computing advances to the next phase of neural networks, AI and directed (not instructed) computing then it'll need to be more like our brain: massively parallel.

      Now there are two obnoxious tautological arguments against this:
      A) "That's not a "CPU" that's like a NeuroProcessorUnit, an NPU if you will"
      B) "Yes we'll need a giant mainframe, but it'll be a server in the cloud!"

      A is moving the goal posts. Just because the processor isn't an ARM or x86 instruction compatible chip doesn't mean it's not worthy of the label CPU. As mentioned above you can't say that there'll never be a CPU with massive parallelism because as soon as it has massive parallelism it's by definition no longer a CPU. B is just saying that nobody will have a need for computers because we'll have a giant mainframe. Which might be true but you just need a basic DSP not even a CPU if it's just a pure thin client transmitting a video, audio and input stream to the cloud for processing. In which case all of the CPUs in existence... need to be massively parallel AI processors.

    16. Re:Pullin' a Gates? by gnupun · · Score: 1

      If you went and read Linus' rant, then you'll find you are actually reinforcing his argument. He says that except for a handful of edge use-cases, there will be no demand for massively parallel in end user usage and that we shouldn't waste time that could be better spent optimizing the low-core processes.

      So, if someone wants to optimize a critical app-specific operation "foo()" in their app and make it go 4 times faster using 4 cores, they are crazy?
      Your argument implies that other than these so-called "edge cases" there is no need to improve performance of any other type of code.

    17. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1

      We are talking about massively-parallel computing and Linus describes it right. It is a niche...

      ...for now, yet it only takes a single killer app to change that. Ask Google if end-users may benefit from massively-parallel computing. It doesn't reach workstations because it's in their best interest to keep these techniques server side, but the same techniques for natural language processing and deep learning would greatly benefit from massive parallelism, and having them in a machine you own would make it much more private and tailored to you.

      End-user development is a field where many tasks would benefit from this kind of parallelism, and it's the great unknown for many people. Tools for handling unstructured and semi-structured content (like easy web scraping and multiple editing) need to perform a deep analysis of a page's content to infer what the user is trying to do and make their life easier.

      That most developers haven't heard of such tools and don't know how to build them doesn't mean that they're not useful. A copy/paste clipboard tool that worked like Lapis could be used both by developers (I've heard a less powerful "column edit" mode is popular in the Sublime Text editor) and non-developers (batch renaming of files and album tracks is a common task performed by end-users, and they edit names one by one because no one understands the existing pattern-based batch renaming tools).

    18. Re:Pullin' a Gates? by itzly · · Score: 1

      A neural net is a very specific, massively parallel, purpose, not general purpose.

    19. Re:Pullin' a Gates? by itzly · · Score: 1

      No, outside of the edge cases, using 4 smaller cores instead of a single big one will not make foo() go faster.

    20. Re:Pullin' a Gates? by TheRaven64 · · Score: 2

      If you look at a typical web page, you have a load of images, a few iframes with ads, scripts (possibly with multiple web workers). Each one of those really wants to be a separate security domain. You don't want a vulnerability in libpng (something that has happened many times before) to be able to do anything other than break the single image that it's decoding. This kind of fine-grained security is a lot easier if you have the ability to have a load of cheap threads.

      --
      I am TheRaven on Soylent News
    21. Re:Pullin' a Gates? by itzly · · Score: 1

      You may need a bunch of cheap threads, but that doesn't mean they'll run faster on separate cores. Unless you have really fast I/O (most people don't) a single core should handle it just fine.

    22. Re:Pullin' a Gates? by TheRaven64 · · Score: 1

      They're likely to be bursty and when you get the data you want to run most of them in parallel. Add to that, current constraints on CPU design (Dennard Scaling no longer working) mean that adding a load of cores that spend most of their time sleeping is actually quite an easy thing to do.

      --
      I am TheRaven on Soylent News
    23. Re:Pullin' a Gates? by SuricouRaven · · Score: 1

      But by the time you've finished reading the first paragraph of the first page, the other nine are loaded even if you can't parallelise.

    24. Re:Pullin' a Gates? by AK+Marc · · Score: 1

      Chrome, Opera, and others load each tab/page in a separate process. Why isn't everyone doing that? Quad core, 2 threads per core lets me run 100+ pages, and any one or two of them freezing up won't cause a problem.

      It doesn't have to be truly parallel, just separate. There's a difference.

    25. Re:Pullin' a Gates? by SuricouRaven · · Score: 3, Informative

      If massive-neural nets do reach common use (Which isn't that likely, they are somewhat overhyped) then I'd expect to see specific accelerators designed to run them. Probably something like FPGAs: Software writes the net, hardware executes it. A general-purpose processor (Probably x64 or ARM) does the coordinating, but augmented by specialised or semi-specialised hardware for certain tasks. Very much as we have today with hardware acceleration of 3D graphics or video decoding.

      You can see the trend already. 3D acceleration was introduced for graphics, but then repurposed for other things, and followed up with revised graphics architectures designed for non-graphics applications. They are still useless for general-purpose computing, their architecture too limited, but used in conjunction with a general processor they can greatly outperform the processor alone on things like image processing, cryptographic tasks, physics simulation and such. It's now quite common to see even consumer applications, with games using physics simulation to provide much more detailed rigid-body simulation than was previously possible - ie, more bits of shrapnel and chunks of corpse bouncing around when you lob that grenade.

      As for neural nets, you probably won't see much need to simulate huge ones. Small ones work surprisingly well, and their applications are really quite limited - they aren't some magic AI bullet that turns into a functional mind if you make them big enough. They excel at classification tasks, so they are very handy in OCR, handwriting recognition, speech recognition and such. Google made one that can recognise cats, and if you can recognise cats then you can recognise other things, so straight away I'm seeing applications in web filter software.

    26. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1, Informative

      Find a file called "bill-gates-1989.mp3". It's a bit over 80MB and about an hour and a half long. Somewhere in there is where he makes the memory statement while talking about how *he* designed the memory layout of the IBM PC.

    27. Re:Pullin' a Gates? by gnupun · · Score: 1

      What if the cores don't become much smaller while cores are added to your PC? Your general desktop/workstation can have up to 16 cores each of which are more powerful than the previous generation core. Should we still do single-threaded programming for any time-critical foo() and run roughly 10 times slower?

      There's plenty of code that would benefit from the speedup of multi-core programming, not just some niche code.

    28. Re:Pullin' a Gates? by itzly · · Score: 1

      A 3+ GHz single core CPU is easily capable of decoding images that come in at full speed over a typical internet connection. You may be able to use multiple cores, but it's not going to make the overall page loading any quicker than using a single core.

    29. Re:Pullin' a Gates? by itzly · · Score: 1

      Obviously, adding more big cores is better, but that's not what Linus was talking about.

    30. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1
      http://en.wikiquote.org/wiki/B...

      I have to say that in 1981, making those decisions, I felt like I was providing enough freedom for 10 years. That is, a move from 64 K to 640 K felt like something that would last a great deal of time. Well, it didn't - it took about only 6 years before people started to see that as a real problem.
      1989 speech on the history of the microcomputer industry.

      Said in 1989 about a 1981 decision. While it's not "640kB should be enough for anyone", verbatim, it's where that quote seems to come from.

    31. Re:Pullin' a Gates? by bloodhawk · · Score: 1

      exactly, he never made the statement he is quoted as saying. There is a massive difference between what he is quoted as saying and what is said in that presentation. He also discusses in other interviews how he wanted the limit to be higher but was restricted by the chip architecture but thought it would be good enough for the lifetime of the architecture, he was actually pretty close to being right.

    32. Re:Pullin' a Gates? by itzly · · Score: 1

      For a fair comparison, you shouldn't compare a single core with 16 cores, each the size of the single core. Instead, you should keep the number of transistors fixed, and decide whether you want to divide them into 4 big cores, or 16 smaller ones (with smaller caches). And since we're talking about general purpose PCs, you should consider a typical mix of user applications.

    33. Re:Pullin' a Gates? by bloodhawk · · Score: 1

      Microsoft supposedly had a very large influence over which chip went into the IBM PC. Supposedly they are the reason IBM went with a 16-bit chip instead of an 8-bit one, as they talked IBM into changing, and they were also considering a 32-bit chip from Motorola.

    34. Re:Pullin' a Gates? by TheRaven64 · · Score: 2

      First, that's with a single thread and a single security context. If each one is an isolated sandbox it's not the case (trust me on this: it's my research area and we've done a lot of benchmarking). Second, even if it were true, it would be a lot less power efficient. If you can parallelise your workload, then two 1.5GHz cores will use less power than one 3GHz one. Four 750MHz cores will use less still.

      Until a few years ago, most computers had a single core, so there wasn't much point trying to exploit parallelism and the fastest way of implementing many problems was to serialise them. That's no longer an automatic win.

      --
      I am TheRaven on Soylent News
    35. Re:Pullin' a Gates? by jhol13 · · Score: 1

      Several processes in multicore is parallel.

      And it needs to be parallel. For example my current desktop is A8-5500, got it for ~$100. Four times more single thread performance - how much would that cost?

    36. Re:Pullin' a Gates? by visualight · · Score: 1

      Instead of paraphrasing why not just quote him directly? It's not a long article and no one will think 'strawman'.

      "Big caches are efficient. Parallel stupid small cores without caches are horrible unless you have a very specific load that is hugely regular (ie graphics)." ...
      "the crazies talking about scaling to hundreds of cores are just that - crazy."

      In that context, he's right. If you're doing hundreds of dumb cores you should be using a GPU already.

      --
      Samsung took back my unlocked bootloader because Google wants me to rent movies. They're both evil.
    37. Re:Pullin' a Gates? by HuguesT · · Score: 2

      Thanks, interesting document, found here. The audio is really bad at the beginning and fluctuates throughout the talk. The interesting bit that you refer to is at 21 minutes from the start.

      I'm trying to type in what he said directly from the audio:

      The 16-bit design gave us a megabyte of memory. The 8086 has a 20-bit address. It is really a segmented 16-bit data path with segment registers that are really indexes. It is a 1-MB address space. And in this original design I took the upper 384K and tied it to a certain amount to provide for memory video, the ROM and I/O. And that left 640K for general purpose memory. And that leads to today's situation where people talk about the 640K barrier. The limit to how much memory you can put to these machines. I have to say that in 1981 while making those decisions I felt like I was providing enough freedom for 10 years. That is, a move from 64K to 640K felt like something that would last a great deal of time. Well, it didn't. It took only 6 years before people started to see that as a real problem.

      Fortunately, there is a reasonable solution. Intel has moved forward with its chip families: the 286 chip introduced in 1984 moves us to a 24-bit address space (mumbles about segmented indirection, being not that good). That is sort of an intermediate milestone. In 1986 we moved up to the 386 where we get a full 32-bit offset to these segments that have been designed in this architecture. So what we have is a machine that can address 4GB of RAM. And I have to say with all honesty, I believe that it will take us more than 10 years to use up that address space.

      So he never makes that exact quote, however one can understand why people picked it up. Essentially, BG thought in 1981 640K would be enough for everybody for a long while. Note that he was reasonably prudent regarding using up the 32-bit address space (that ship has sailed now).

      Later, regarding memory, he says that computers should have about 1MB of RAM per MIPS. Specifically, he goes on to say that machines with 30-60MB of RAM should be desirable soon (in 1989).

      In this talk he talks about many things, most are pretty insightful in fact: OS design, multitasking, parallelization, multi-processor designs, dynamic linking, object-oriented design. Funnily he talks at length about OS2 in a very positive way. This was before Windows 3 of course. He compares OS2 and Unix, saying that OS2 will take over the desktop and Unix the servers, and all other OSes will die out. He talks about the FSF, saying its task of creating a free Unix-like OS is doomed.

      Some interesting comments on that talk here.

    38. Re: Pullin' a Gates? by Anonymous Coward · · Score: 2, Insightful

      There has been a push back against integrating ANNs into mobile platforms. I think low power real time classification is simply missing an application in the mass market that can't be solved by off loading to a server. We simply assume that we are continuously connected to a sufficiently large data pipe and the problem goes away. Whether the hardware changes on the server side or not is a question of power savings, but I doubt we will see gains in performance over software implemented on server farms.

      That said, if we put our future caps on, is there a point when the amount of data our electronics gather for processing is so large that pushing it into the cloud is cost and time prohibitive? If wearable electronics become a pervasive technology, we may need some on-board, continuously learning classifier cores to locally fuse sensor data rather than sending raw data into the cloud. This is where we could see truly assistive computing without the creepier general intelligence Hassabis and crew are working on at DeepMind.

      Imagine you have a conversation with your wife and she says the kids need to be picked up at 4 on Tuesday. If my phone put a reminder on my calendar for me based on my continuous audio stream, the mental offload would be huge as I could seamlessly continue with my day without managing my calendar, but I don't want to continuously stream my audio to Google nor do they want to continuously process the sound of me typing and sipping coffee... That's what we have the NSA for.

    39. Re:Pullin' a Gates? by itzly · · Score: 1

      trust me on this

      Of course, if you add enough cruft, you can slow anything down to a crawl.

      Second, even if it were true, it would be a lot less power efficient. If you can parallelise your workload, then two 1.5GHz cores will use less power than one 3GHz one

      You need a number of bit flips to solve a problem. Energy is related to the number of bits flipped. If you use twice the bits at half the speed, the energy requirements will be the same. By splitting the workload over multiple cores you have more overhead, so more energy is required.

    40. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1

      It's not a straw man at all - I know how to read. He literally says "magical parallel algorithms" are needed to make use of hundreds of cores. And he says "The only place where parallelism matters is in graphics or on the server side, where we already largely have it", which is nothing more than pointing out that apps that need high performance (aka, graphics and servers) *do* in fact use parallelism.

      Exactly. Because of this...

      What you claim is "Linus's argument" isn't even brought up until the last paragraph. And the only reason that the CPU would "still going be a few core and not many core" is if programmers don't thread their cpu-intensive apps sufficiently. Which, one should note, are mainly graphics and server stuff, the things that Linus notes *are* being threaded.

      Which is precisely the point. Programmers don't thread their cpu-intensive apps sufficiently because (1) a lot of tasks don't parallelize well and (2) programmers have consistently proven to suck at coming up with ways to adequately parallelize programs. Ergo, the stuff that people know will improve is heavily threaded. And the rest remains using one or two cores, max. Honestly, this is the same crap that came up with the Itanium, and someone else pointed out the whole "magical compilers" comment because the whole notion that you'll get it fixed on that end is just as absurd.

      Simply put, sure, on paper 50 cores look nice. But actually keeping them all busy for most computers is just near impossible. Again, you have to point at specific points where it's useful and you end up having a specialty CPU like a GPU for just that task and for which (1) there are known heat issues and (2) most of the time it's hardly used. Golly, just what Linus was saying.

      PS - You see, it's not that parallel algorithms are sorcery. It's that advocates for massively parallel general processing units seem to believe in magical parallel algorithms that work generally and show a non-negligible improvement that negates all the lower clock-rate/processing that's done on each of these micro-cores to deal with power usage/heat issues. And those algorithms just don't exist and apparently there aren't enough genius programmers to use them even when they do exist (unless it's bundled in a library and mostly hidden and ends up 99% of the time under-utilized). See the difference between what does exist and is known to work and pretending that one can extrapolate it to known bad use cases?

    41. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1

      Linus is 100% wrong.

      That single leading statement made your entire post worthless. You know that Linus is not 100% wrong so you knowingly lied. That was stupid of you.

    42. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1

      Also it's a false dichotomy - Linux can support traditional architectures & massively parallel ones at the same time without making the traditional performance worse.
      Amateur packet radio stuff is supported without fighting about it being an edge case.

    43. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1

      Linus is WRONG on this. He may be right for today's tasks, but like the (purported) Gates quote, it will not hold for tomorrow's tasks. It is easy to extrapolate today's usage into tomorrow and turn out being wrong. It is more difficult to try to predict future uses.

      As machines are asked to do more, for example, speech, vision, AI-ish type tasks (e.g. Google's self-driving car), massive parallelization will become critical. There are likely many other uses that will come up that wouldn't come to mind immediately. Perhaps the computer will track your eye movements (more than some do now) to try to anticipate what you are going to do and pre-calculate/prepare something for you. Or your tablet pays attention to its surroundings in more than a very superficial way so that it can be context aware. There are many "little" improvements that may provide large benefits when combined, but that need continuous input, kind of like our hearing and eyesight.

      Think about 40 years ago what the use cases were in 1975. The PC wasn't even a mainstream use and barely available.
      Think about 30 years ago what they were in 1985. Networks were used in colleges, but few thought about the internet being pervasive in the mainstream - certainly in CS departments they did, but primarily for research usage.
      Think about 1995, the Web was still being dismissed as a fad - Krugman said it would have as much impact as the fax machine even a few years later.
      Think about 2005, the cell phone as a miniature computer was considered, but just a miniaturized version of a PC, not as something with a touch interface etc.

      Not to knock Linus, but it is hard to predict what is coming, I don't know, just that it probably will be something that doesn't just involve a higher res screen, and faster CPU, but something that additional processing power will allow that no one considers important now, but will end up being a game-changer when you can have 10 cores each doing 10 different things and still have 900 cores to spare.

      And IAACSWAAD (I am a computer scientist with an advanced degree).
      (Sorry for any typos - continuous spell check is an example that wouldn't have been considered 30 years ago on a phone)

    44. Re:Pullin' a Gates? by drinkypoo · · Score: 1, Flamebait

      So he never makes that exact quote, however one can understand why people picked it up. Essentially, BG thought in 1981 640K would be enough for everybody for a long while.

      "Bill Gates, CEO of Microsoft Corp. a fiercely competitive company(...)" - Microsoft Encarta, 1996
      "Bill Gates, CEO of Microsoft is a contributor to several charitable causes, including...(...)" Microsoft Encarta 2000

      In some other discussion about whether BG ever said 640k should be enough for anyone, evidence was presented including an eye- (and ear-)witness account of when he did say it. But hey, let's not lose any sleep over whether Bill Gates is a liar, because we know he is. The DoJ had him over a barrel. Then the Gates Foundation was created to promote the goals of Big Pharma and Strong IP law, if you look at where they spend their money and the terms under which these foreign nations get aid from the foundation, it's clear what their actual goals are.

      If you have a hard time believing that Gates ever said 640k should be enough for anyone, keep in mind that 1) he has claimed that he personally created the 640k limit, 2) that he could easily have said it meaning that 640k should have been enough for anyone at that moment, justifying the 640k barrier*, and 3) Bill Gates is a liar, and has been proven such in court.

      * The chip could only address 1MB, video memory had to go somewhere, there had to be a split somewhere, the design may well have been completely justified.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    45. Re:Pullin' a Gates? by DarkOx · · Score: 1

      You are correct, a parallel algorithm is going to be more complex, requiring more total operations. In a world of frictionless pulleys and perfectly spherical cattle you are sure to be right.

      We don't live in that world though. In practice higher clock speeds usually require higher voltages for circuits to stabilize. Higher voltage means more current is going to flow, and the battery will be drained quicker.

      It's likely the case that manufacturing and materials constraints are such that we can economically build a 1.5GHz part that uses fewer watt-hours per operation than a 3GHz part. If the overhead of parallelism is kept to a minimum, it's entirely possible two 1.5GHz parts could do the same work as a single 3GHz part in nearly the same amount of time using less power.

      --
      Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
    46. Re:Pullin' a Gates? by buckfeta2014 · · Score: 1

      The layout of the flags in the x87 status register is "interesting", indirectly driven by the desire to map to the x86 flags

      an Intel FPU was designed for an Intel CPU/APU? Who would have thunked it.

      --
      Buck Feta. You know what to do.
    47. Re:Pullin' a Gates? by dj245 · · Score: 1

      Your Linux gaming machine shouldn't be doing more than 3/4 cores of CPU and handing the heavy grunt work off to the GPU anyway. No need for a 64 core CPU for that one.

      I think you are being a little shortsighted here. AI for NPCs could be incredible if each NPC had its own core. Real life people analyze every action you take, no matter how small or insignificant. Real life people discard or take notice of these actions, weigh (rank) the important actions, and then combine the most important actions in consideration of what you are thinking or what you might be likely to do. Real people analyze the actions of all the people around them, and take that into consideration when dealing with a person too. In a computer, each AI thread could do all these things too, but nowadays we normally use tricks and hacks since computing power is in short supply for AI.

      Doing this well takes a large amount of computing power, and there is no reason it can't be parallelized: real life people act in parallel and aren't all part of the same computing "thread". Simulating that doesn't have to be in the same computing thread either, but nowadays it often is because the vast majority of computers are limited to 2 to 8 cores.

      --
      Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
    48. Re:Pullin' a Gates? by Half-pint+HAL · · Score: 2

      Point of Linus was, taking a 6 core CPU, and replacing 2 cores with more cache and more transistors per core should make almost anything on Desktop run faster.

      There's an element of truth to this, but on the other hand, cache space is already big enough that it hits the law of diminishing returns. Yes, the biggest performance hits in current computing are cache misses. But cache misses are already unexpected events, and cache misses are of biggest concern to the user when there are lots of them at once -- ie when iterating through a large bit of data. Text searches on large documents in a complex format (eg MS Word). Making a global change to a large file. These are the situations where performance matters, and these are the situations where you're going to get cache misses. Torvalds dismisses photo editing as a task for "professional photographers", but our amateur cameras are taking phenomenally detailed pictures, and even making fairly simple edits is a compute-intensive task. He may be right, but he may equally be wrong.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    49. Re:Pullin' a Gates? by DarkOx · · Score: 1

      Okay, fine, you want to play word games, have fun. Linus was obviously speaking in the context of the Von Neumann computers most of us are familiar with.

      I suspect if you asked him, does your quote apply to radically different architectural paradigms, he'd say "no".

       

      Programming is still a really fancy version of "IF A THEN B". "for X in GROUP do Z". "X = Y"

      Yes it is. I don't care what language you are using; for all the computing machines in common use, at some point a series of fairly limited branch, jump, add, subtract, multiply, and move-like instructions has to be generated. This may even hold true for the basic units of computation that participate in whatever system is ultimately able to handle very arbitrary requests like "please rough me out a flyer for our yardsale on Saturday."

      I say you are the one moving the goal posts. Linus and *most* of the other people working on parallelism solutions are working/speaking in the context of computers like the ones we know today; you're the guy trying to apply what they say to *any* computer. Linus will probably be proved correct there. Past n cores the fundamental architecture in use today will not scale but for niche cases.

      --
      Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
    50. Re:Pullin' a Gates? by wisnoskij · · Score: 1

      "SW architecture problem"
      And the software is changing. It used to be that no one needed x64, as all software was written for 32-bit so getting a 64-bit processor would not make anything faster. And then Crysis came out with a 64-bit edition, and other applications followed suit. The next game that absolutely blows everyone's minds, and stretches what we think is possible, will be released to take advantage of multiple cores (8-16, or possibly even more), and slowly applications will follow (and this is not far away). Trust me, right now loads of people are working on this problem, and it is a big problem. We need better compilers at least, possibly brand new languages or ways of using existing ones, but everyone who has bought a computer in the last 5 years has at least 2 cores.

      --
      Troll is not a replacement for I disagree.
    51. Re:Pullin' a Gates? by AchilleTalon · · Score: 1

      A single killer app is a niche. They already exist and they do not change anything more than require specialized hardware to get handled. You don't change a general purpose computer for a specialized one unless you are running only a single killer app and nothing else.

      It seems pretty obvious that many people commenting here have no experience at all of parallel programming and parallel architectures. Also, it would help a bit to read not only the article but the other referred links.

      --
      Achille Talon
      Hop!
    52. Re:Pullin' a Gates? by Half-pint+HAL · · Score: 1

      You need a number of bit flips to solve a problem. Energy is related to the number of bits flipped. If you use twice the bits at half the speed, the energy requirements will be the same. By splitting the workload over multiple cores you have more overhead, so more energy is required.

      Energy is related to the number of bits flipped... true. But there are also other factors in the equation. A more efficient processor uses less energy to flip a bit. Slower processors are generally more efficient.

      Now, the GP poster told us he's a researcher working on parallelism and efficiency. What is your qualification that allows you to dismiss his expertise and assume that he's writing crap software to test out his theories?

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    53. Re:Pullin' a Gates? by Half-pint+HAL · · Score: 1

      Well, embedded images, flash animations, iframes etc can all be handled in parallel. They're certainly threadable.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    54. Re:Pullin' a Gates? by chthon · · Score: 1

      The thing is that I remember reading it in Elektor around 1982 or 1983. I think in the context of an electronics show.

    55. Re:Pullin' a Gates? by AchilleTalon · · Score: 1

      A massively parallel system is not necessarily a GPU. GPU are a class of massively parallel systems, not the only one. For the rest of your post, other commenters reflected my thinking about it.

      --
      Achille Talon
      Hop!
    56. Re:Pullin' a Gates? by meta-monkey · · Score: 1

      One of the reasons we started moving to multiple cores was because of increasing chip size and density and the limitations of propagation delay. This was actually the basis of my master's thesis (but this was ten years ago and now I just do software, no computer architecture). Thing is, in a single clock tick of a 1GHz processor, light can only travel .3m. And electrons moving through a wire are about a third that speed. Plus gate delays, wire capacitance, etc. Point is, it was getting to the point where you couldn't get from one side of the chip to the other in a clock cycle. So it made sense to keep signals local, and only pay the propagation delay penalty when you needed to. So you can't necessarily say 4 big cores are better than 16 small cores. Otherwise, we never would have bothered with 4 cores to begin with. We would have just kept making bigger and bigger single-core processors.
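
The distance figures in that paragraph are easy to check. A minimal sketch, assuming signals propagate at one third of c as estimated above; the function names and the 3 GHz data point are illustrative additions, not something from the thesis:

```python
# Back-of-the-envelope check of how far a signal can travel in one clock tick.
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre


def light_distance_per_tick(clock_hz):
    """Distance light covers in vacuum during one clock period, in metres."""
    return SPEED_OF_LIGHT_M_PER_S / clock_hz


def signal_distance_per_tick(clock_hz, velocity_factor=1 / 3):
    """Distance an on-chip signal covers per tick, in metres.

    velocity_factor is an assumption taken from the "about a third" estimate;
    gate delays and wire capacitance are ignored and would shrink it further.
    """
    return velocity_factor * light_distance_per_tick(clock_hz)


for ghz in (1, 3):
    hz = ghz * 1e9
    print(f"{ghz} GHz: light {light_distance_per_tick(hz):.3f} m, "
          f"signal {signal_distance_per_tick(hz):.3f} m per tick")
```

At a few GHz the per-tick budget under these assumptions is down to a few centimetres before any gate delay is counted, which is the reason given for keeping signals local.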

      As for what's "faster" it depends very much on the algorithm. There are some algorithms with coarse-grained parallelism, some with fine-grained parallelism and some with none at all. What you're running, how it's programmed and what your OS is doing matter a lot.

      --
      We don't have a state-run media we have a media-run state.
    57. Re:Pullin' a Gates? by Anonymous Coward · · Score: 2, Interesting

      Perhaps in Linus's dislike for C++ he's missed how trivially easy it's gotten to launch threads in C++11, but it takes less work now than a for-loop, since std::thread is so simple and you can inline the command with a lambda. And you have a nice clean mutex library including scoped mutexes like std::lock_guard so you don't even have to remember to unlock them.

      He doesn't mention C++ or anything like that. What he is talking about is that the overhead for task switching is pretty large, so in cases where a tradeoff is made between the performance of a single core and adding more cores to a CPU, you will typically get more performance gain by having fewer, better cores, since the tasks most users do most of the time are of a nature that doesn't lend itself to parallelization. In those cases where it is easily done it is already delegated to dedicated hardware like the GPU.
      For your typical for-loop that is so easy to launch threads for, the problem is that the overhead for moving the task to another core with another cache is so high that you don't get a performance gain. There are still cases where it makes sense to launch threads, but people who do it without thinking because "parallel is better" are the kind of programmers who jump on every new programming fad.

    58. Re: Pullin' a Gates? by Lije+Baley · · Score: 1

      But I want to read the last paragraph of the last page first!

      Actually though, tab loading seems more dependent on external factors anyway...

      --
      Strange things are afoot at the Circle-K.
    59. Re:Pullin' a Gates? by eth1 · · Score: 1

      Point of Linus was, taking a 6 core CPU, and replacing 2 cores with more cache and more transistors per core should make almost anything on Desktop run faster.

      The real problem is that some desktop tasks really need one thread to run as fast as possible, and others (path finding for 200 drunken Dwarf Fortress denizens, for example) would benefit from having 100 somewhat slower cores. When you buy a desktop CPU, all the cores are the same, and you end up having to compromise between number of cores, single-thread speed, heat, etc.

      Maybe it's time we started designing systems with two separate chips - one dual core chip optimized for running single tasks as fast as possible, and another with 10-50 simpler cores optimized for parallel tasks. I think we're halfway there already, what with GPUs being used that way to some extent, but standardizing it would actually allow non-custom applications to make use of it.

    60. Re:Pullin' a Gates? by eth1 · · Score: 1

      Your Linux gaming machine shouldn't be doing more than 3/4 cores of CPU and handing the heavy grunt work off to the GPU anyway. No need for a 64 core CPU for that one.

      I beg to differ. Games that are trying to run hundreds/thousands of copies of a unit AI or pathfinding (Dwarf Fortress, RTSs, etc.), or are doing tons of physics (KSP, From the Depths, etc.) are what usually end up causing slide shows for me these days, not the graphics. More cores & threads, please. (Yes, I'm aware that a lot of times this is due to the games not taking advantage of existing cores)

    61. Re: Pullin' a Gates? by SuricouRaven · · Score: 1

      More likely outcome: The web-dev has a thirty-core monster workstation and produces a page without any thought for performance, because it works quickly for him. Then you try on your portable device, and it takes two minutes to load all the embedded video advertising.

    62. Re:Pullin' a Gates? by marcosdumay · · Score: 1

      Every program has a very specific purpose, not general purpose.

      A neural net is still a Turing-complete computer, by the way.

    63. Re:Pullin' a Gates? by Rei · · Score: 2

      If your loop contains 10 instructions and loops 5 times

      Duh. And that's obviously not what is being discussed here. Step up a level or 20 in the call stack.

      If that "largely" means you need a lock,

      "largely" meaning "does a bunch of stuff on its own and only briefly needs to lock common data structures to update based on the results of what it's been doing". That is by far the most common case in the real world. If you have a texture loading thread for a game it only needs to briefly lock the texture structure when it's gotten its latest texture loaded and processed. If you have a mesh tweaking function for a 3d editor it only needs to lock the list of meshes briefly to swap out its newly tweaked version for the old version. And on and on and on. The most common case doesn't involve locking to wait for calculations to be done from the other side, it just needs to lock briefly to make sure it doesn't read an incomplete state when the results of calculations are being written out.

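The "work alone, lock briefly to publish" pattern described above can be sketched as follows. This is only an illustration under stated assumptions: `expensive_work`, `worker` and `run_workers` are hypothetical names, the sum of squares stands in for texture or mesh processing, and in CPython the GIL means these threads show the locking pattern rather than a real CPU speedup.

```python
import threading


def expensive_work(n):
    """Stand-in for texture decoding or mesh tweaking; touches no shared state."""
    return sum(i * i for i in range(n))


def worker(key, n, shared, lock):
    result = expensive_work(n)  # the long part runs without holding the lock
    with lock:                  # brief critical section: publish the result
        shared[key] = result


def run_workers(sizes):
    """Start one thread per entry in sizes and collect the published results."""
    shared = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(key, n, shared, lock))
        for key, n in sizes.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


results = run_workers({"a": 10, "b": 100, "c": 1000})
print(results)
```

The lock is held only for the single dictionary write, so workers never wait on each other's calculations, only on each other's updates.
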
      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    64. Re:Pullin' a Gates? by fisted · · Score: 1

      as all software was written for 32-bit so getting a 64-bit processor would not make anything faster.

      Like 64bit software would be somehow faster, rather than potentially slower, on a 64bit CPU, sure.

    65. Re:Pullin' a Gates? by CronoCloud · · Score: 1

      one dual core chip optimized for running single tasks as fast as possible, and another with 10-50 simpler cores optimized for parallel tasks.

      Sounds like "Cell", maybe Sony and IBM had the right idea before its time.

    66. Re:Pullin' a Gates? by phantomfive · · Score: 1

      To add to your point, it's worth remembering that doubling the processor speed will always give you a better return than doubling the number of processors.
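
One way to see where this comes from is Amdahl's law. A minimal sketch, assuming that doubling the clock speeds up the whole program by exactly 2x (an idealization that ignores memory latency) and that `parallel_fraction` is the share of the work that can be spread over cores; the function name is illustrative:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup from running the parallel share on `cores` cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


CLOCK_DOUBLING_SPEEDUP = 2.0  # assumption: everything scales with the clock

for p in (0.5, 0.9, 1.0):
    two_cores = amdahl_speedup(p, 2)
    print(f"parallel fraction {p:.1f}: 2 cores give {two_cores:.2f}x, "
          f"2x clock gives {CLOCK_DOUBLING_SPEEDUP:.2f}x")
```

Under these assumptions two cores only match the doubled clock when the work is perfectly parallel, so "better" here really means "never worse" in speed, and says nothing about cost or power.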

      --
      "First they came for the slanderers and i said nothing."
    67. Re:Pullin' a Gates? by Kjella · · Score: 1

      If you look at a typical web page, you have a load of images, a few iframes with ads, scripts (possibly with multiple web workers). Each one of those really wants to be a separate security domain. You don't want a vulnerability in libpng (something that has happened many times before) to be able to do anything other than break the single image that it's decoding. This kind of fine-grained security is a lot easier if you have the ability to have a load of cheap threads.

      Per-tab security, so that visiting myonlinebank.com and evilmalwaresite.com at the same time won't be a problem, sure, but honestly I don't care if one image can bork just that image or the whole webpage, since from my perspective they are equally untrusted. I request a page from slashdot.org and I don't want it to hose my machine. Slashdot embeds an ad image from their advertising network and it's the same. I suppose you could say that the malicious PNG can now social engineer the whole page or use another exploit in the HTML/Javascript engine to gain even more privileges, but that seems highly theoretical. Particularly since those should be in the same sandbox since you can have bad HTML/Javascript too. I can't imagine the overhead of visiting Google's image search and have it spawn hundreds of security contexts, that seems like a total waste since they're all under Google's control.

      --
      Live today, because you never know what tomorrow brings
    68. Re:Pullin' a Gates? by Zeromous · · Score: 1

      Yup, but there was no telling anyone else that.

      Enterprise CELL: Hey I hear these new blades are really fast! Let's throw the kitchen sink at them and prove to everyone they are garbage!

      Never thinking for one minute that their everyday tasks might have performed far better than X86 if they had managed their processing differently or at least attempted to test this difference. Instead it was, here's the worst workload we can think of for any processor, and then they tested that on CELL to find, meh, it's an underwhelming chip compared to Xeons.

      Well duh! CELL was never about being faster than a Xeon at general computing!

      --
      ---Up Up Down Down Left Right Left Right B A START
    69. Re:Pullin' a Gates? by laird · · Score: 1

      Thinking Machines did this. We had one front-end CPU that ran the sequential process that controlled everything, and thousands of parallel CPUs that did all of the heavy lifting by processing the data in parallel. For large data problems, it worked extremely well. Yes, at any given time some CPUs might not be doing work because they're waiting for other CPUs, but when you're pushing the performance (e.g. processing TB of data, doing PFLOPS) the cost of making a single CPU faster goes up much faster than the performance increase and then becomes impossible, while piling up more CPUs the performance goes up linearly. Of course, some problems don't parallelize in obvious ways, but IMO anything running on large data sets can be parallelized if you look at it right.

      Luckily things like rendering graphics, sorting, searching, running web sites, many crypto problems, simulations, games, image processing, video processing, etc., parallelize really well. Admittedly it takes some cleverness to write a sort algorithm that runs on thousands of CPUs in parallel, but it's valuable to have a constant-time sort (i.e. you can scale hardware linearly with the data size, and sort arbitrary amounts of data in fixed time). The main challenge that parallel computing has, IMO, is that most programmers don't think that way, similar to how most programmers don't think in terms of multi-threading. But that's a matter of education. People used to be terribly confused by event-based programming frameworks, too!

      Once you start thinking in terms of having thousands or millions of (virtual) CPUs, and decomposing problems to run in parallel based on data or actors, pretty much everything becomes highly scalable.

    70. Re:Pullin' a Gates? by laird · · Score: 1

      Faster switching requires more power. Doubling clock speed consumes (roughly) 4x the energy, which is why doing the work in two slower cores is much more power efficient. That's one of the reasons that mobile devices that are power constrained run at slower clock speeds than desktop devices.
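
The rough figure can be reproduced with the usual dynamic power model, P = k * V^2 * f. A minimal sketch, assuming supply voltage has to scale linearly with frequency and ignoring static leakage and parallel overhead; real chips follow neither assumption exactly, and the function name and constant `k` are illustrative:

```python
def dynamic_power(freq_ghz, cores=1, k=1.0):
    """Relative dynamic power, P = k * cores * V^2 * f.

    Assumption: supply voltage is proportional to frequency, so V = freq_ghz
    in arbitrary units. Static leakage is ignored.
    """
    voltage = freq_ghz
    return k * cores * voltage ** 2 * freq_ghz


one_fast = dynamic_power(3.0, cores=1)
two_slow = dynamic_power(1.5, cores=2)

# Both setups deliver the same total cycles per second, so the power ratio
# is also the ratio of energy spent on the same amount of work.
print(f"one 3 GHz core vs two 1.5 GHz cores: {one_fast / two_slow:.1f}x the power")
```

In this model, doubling the clock on a single core raises power 8x and energy per unit of work 4x, which is where a figure like "4x the energy" comes from.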

    71. Re:Pullin' a Gates? by laird · · Score: 1

      It's not a technical issue, it's a "chicken and egg" market issue. Many desktop applications _would_ run very well on massively parallel hardware, but that's not what people have, so it's not what developers target. And since games are written not to use more CPUs, people don't buy computers with many CPUs. And because MPP hardware is a niche, mainstream developers have no idea how to program for them, much less to think about what problems would run well in parallel.

      From a technical perspective, which I think Linus is trying to argue from, many desktop applications could easily take advantage of massive parallelism. Once you start thinking in terms of data parallelism or agent parallelism, almost all problems decompose in ways that parallelize nicely. For example, there are hundreds of AIs and simulation objects in many games, and each could run on a CPU (or process or thread). Video and image processing are "embarrassingly parallel", and now that people edit video at home, they could happily consume all the CPU you have. Sorting, searching, indexing, scrolling in documents, rendering characters to the screen - all very parallel.

      Luckily the "graphics processors" are breaking out of the "chicken and egg" trap. The better GPUs are now not really "graphics processors", they are fully general MPP CPUs, and many applications are taking advantage of them. Interestingly this architecture is similar (at a high level) to the MPP supercomputers from decades ago. The Thinking Machines' Connection Machine had a fast front-end computer, controlling an array of thousands or tens of thousands of CPUs that did the heavy lifting, and now it's your CPU controlling an array of CPUs in your "GPU". So millions of PCs are MPP, even though their owners probably don't think of them that way. And this is leading to more and more applications taking advantage of MPP!

      So I think that Linus is wrong, in that he's missed that what he's dismissing as GPUs are actually MPP co-processors that are astoundingly powerful and are increasingly being taken advantage of by developers when performance matters.

    72. Re:Pullin' a Gates? by laird · · Score: 1

      In the real world the tradeoff is dollars (or power consumption, for mobile devices). So the question is - should you buy a 2x faster CPU for 4x the cost and 4x the power consumption, or should you buy 2 cores for 2x the cost and 2x the power consumption?

      For applications that only run single-threaded, you don't have a choice - you have to buy the fastest CPU you can. But for well-written applications, more cores is a cheaper, more power efficient way to scale performance.

    73. Re:Pullin' a Gates? by laird · · Score: 1

      This is only true if you're unable to use more than one CPU chip in your computer, a hurdle that was overcome 30 years ago. :-) People have been running multiple CPUs to improve performance for a _long_ time.

      The real question is - would you rather have multiple CPUs at the price/performance peak, or one CPU that's a bit faster for a much higher price. Typically getting 2x performance costs 4x or so, making 2 cheap CPUs a much better deal than one really expensive CPU.

    74. Re:Pullin' a Gates? by cb88 · · Score: 1

      I did read it... but he did say 4 cores is enough for most people and I refuted that.

      Even though largely in the context of his rant... he is correct. That single statement is rather horrendous.

      1 core even is "enough" for most tasks... however it doesn't give the best experience; no one wants to wait on their computer more than necessary.

    75. Re:Pullin' a Gates? by cb88 · · Score: 1

      There is pretty good hard data that says.. HTML rendering is embarrassingly parallel... thus Mozilla is working on Servo.

      That is 99% of computer use right there... parallel is here to stay and knowing the web it will get vastly more parallel once a browser engine is out there that can do it.

    76. Re:Pullin' a Gates? by Bengie · · Score: 1

      Larger caches also mean higher latency. If you have a lot of registers, that's great, but if you can't be constantly prefetching data from memory for one reason or another, higher cache latency will make everything slower for a small subset of workloads.

      We're in the transition where throwing more transistors at a single core makes the core slower for most workloads with marginal gains for specific workloads. We need more cores, either of the same type or different types that specialize for certain types of workloads. Maybe we need some cores with large caches and some with smaller caches or some with lots of SIMD or some with few SIMD.

    77. Re:Pullin' a Gates? by Bengie · · Score: 1

      Cutting your frequency in half may reduce your power consumption by 80%. Lots of low frequency cores will kick the crap out of a single high frequency core, efficiency wise.
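
      The 80% figure is plausible under the usual first-order model of dynamic CMOS power, P ~ C * V^2 * f, provided that lowering the clock also lets you lower the supply voltage. The sketch below assumes voltage scales linearly with frequency, which real chips only approximate, and it ignores static (leakage) power entirely.

```python
def relative_dynamic_power(freq_scale, voltage_scale=None):
    """Dynamic power relative to baseline, using P ~ C * V^2 * f.

    freq_scale: new frequency / old frequency.
    voltage_scale: new voltage / old voltage. If omitted, voltage is
    assumed to scale linearly with frequency (an idealisation).
    """
    if voltage_scale is None:
        voltage_scale = freq_scale
    return voltage_scale ** 2 * freq_scale


half_clock = relative_dynamic_power(0.5)
print(f"one core at half frequency: {half_clock:.3f} of baseline power")

# Two half-speed cores: same nominal throughput as one full-speed core
# (only if the work parallelises perfectly), at a fraction of the power.
two_slow_cores = 2 * half_clock
print(f"two half-speed cores: {two_slow_cores:.3f} of baseline power")
```

      With the voltage held constant the saving from halving the clock is only 50% in this model, so the claim depends on voltage scaling.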

    78. Re:Pullin' a Gates? by Bengie · · Score: 1

      Your Linux gaming machine shouldn't be doing more than 3/4 cores of CPU and handing the heavy grunt work off to the GPU anyway. No need for a 64 core CPU for that one.

      Spoken truly like someone who does not understand anything about games or parallel processing. We already have games that can make a 12 core CPU run at 80% load and a quad-GPU about the same. "Pipeline" style multithreading is becoming popular. Each stage of the graphics pipeline is run where it is most efficient, which may be the CPU or the GPU, and data may bounce between the CPU and GPU several times before it is completed.

      Pipelining plays well with streaming, so one of the first jobs is for the CPU to break up the work to be done into many different smaller pieces that can be "streamed". "Streaming" is just breaking up a large object into smaller objects. This allows for the GPU and CPU to be kept busy working on different pieces that are at different points of the pipeline.

      The natural evolution of this is each stage of the pipeline can be a collection of "processing units", each unit capable of working on a unit of the stream. While the pipeline may prefer to use the GPU for certain types of work or the CPU for other types, if the current machine has an unbalanced mixture of CPU and GPU, the pipeline may augment one with the other. While the CPU may not be as good, if the machine has a lot of CPU cores left unused, might as well use them to speed things up.

      We need a framework that allows code to run where it would be best to run, but also have the ability to overflow to less efficient execution units if they are relatively unused. In other words, we need to treat CPUs and GPUs like a pool of computing resources, and *something* needs to manage these resources to crunch data.
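
      As a toy sketch of that "pool with overflow" idea (not any real framework): each piece of work goes to whichever unit would finish it soonest, so the slower unit only gets work once the preferred one has a backlog. The names `ExecutionUnit` and `dispatch`, and the per-tile costs, are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionUnit:
    """A hypothetical compute resource, e.g. a pool of CPU cores or a GPU."""
    name: str
    queued_seconds: float = 0.0              # work already waiting on this unit
    assigned: list = field(default_factory=list)


def dispatch(task_name, cost_by_unit, units):
    """Send a task to the unit that would finish it soonest.

    cost_by_unit maps unit name -> estimated seconds for this task there.
    A slower unit wins only when the preferred one has a long enough queue,
    which is the overflow behaviour described above.
    """
    best = min(units, key=lambda u: u.queued_seconds + cost_by_unit[u.name])
    best.queued_seconds += cost_by_unit[best.name]
    best.assigned.append(task_name)
    return best.name


units = [ExecutionUnit("gpu"), ExecutionUnit("cpu")]
# Assumed costs: each tile takes 1s on the GPU and 3s on the CPU.
cost = {"gpu": 1.0, "cpu": 3.0}
placement = [dispatch(f"tile{i}", cost, units) for i in range(8)]
print(placement)
```

      Here the GPU takes most tiles and the CPU picks up one whenever the GPU's queue is more than three seconds deep. A real scheduler would also have to account for the cost of moving data between the two.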

    79. Re:Pullin' a Gates? by TheRaven64 · · Score: 1

      Per tab security so visiting myonlinebank.com and evilmalwaresite.com at the same time won't be a problem sure, but honestly I don't care if one image can bork just that image or the whole webpage since they from my perspective is equally untrusted.

      You don't use webmail then? Or any web pages that have adverts in them?

      --
      I am TheRaven on Soylent News
    80. Re:Pullin' a Gates? by tibit · · Score: 1

      For iteration, as long as you know a bit in advance what you will need, you can certainly issue prefetch requests. They can often remove cache stalls altogether.

      --
      A successful API design takes a mixture of software design and pedagogy.
    81. Re:Pullin' a Gates? by linuxrocks123 · · Score: 1

      It's certainly not the case that "almost all" problems decompose into data parallelism. Likewise, while there are some tasks that GPUs can do very well, there are others where they don't do well at all. What Linus is arguing is that the cases where they don't do well at all dominate. I concur. Programming this stuff typically locks you deeply into a single GPU's architecture, and, oh, you have zero cache, zero pipelining. There are other problems, too. A good overview of the technology as a whole is here: http://cstar.iiit.ac.in/~kkish...

      GPGPUs have had some impressive successes, but CPUs are still more versatile, and, like Linus says, I don't see Intel giving up single core performance so people can program a bunch of tiny little ant-processors that can't communicate with each other in less than 500 cycles.

      --
      vi ~/.emacs # I'm probably going to Hell for this.
    82. Re:Pullin' a Gates? by linuxrocks123 · · Score: 1

      Games can do all sorts of unnecessary and stupid things, including loading 12 CPUs and consuming 500 watts so Duke Nukem's whiskers waft with the prevailing winds in the virtual environment.

      They are a model for no type of problem except themselves.

      --
      vi ~/.emacs # I'm probably going to Hell for this.
    83. Re:Pullin' a Gates? by Bengie · · Score: 1

      In this case it wasn't. There was no obvious waste of resources, the devs had a nice breakdown with a profiler showing the main things being done. No one part of the system was using vastly more resources than the others, and there were a lot of different parts. Relative to other gaming engines, it was beating them on number of objects rendered by quite a bit.

    84. Re:Pullin' a Gates? by awol · · Score: 1

      More than 20 years ago I had a full and frank exchange with a macweenie friend of mine where I posited that in the vast majority of cases the core "functionality" of the work we were doing was already within the capacity of the processors available at that time and the advances in speed that will come in the future will all be about enhancing the user experience of that core.

      What I meant was that the calculating of the spreadsheet cells or redrawing the document window or .... was already doable by the current processor. It was the handwriting UI, or voice recognition or eye candy (or stuff I couldn't envisage, like parsing my email history to find the right advertisement to display :-) that would consume the CPU advances that were coming. When I say "OK Google what's the weather like today" and my cell phone tells me in a moderately human voice a 2 sentence forecast and displays a detailed weather page for my freakin' suburb. I kinda feel vindicated. When the address I was searching on my desktop is the first entry in the dropdown box on the GPS on my phone when I get in the car later that day. Same. (All points about the invasive nature of that connectivity duly noted).

      The parent poster is absolutely right, this trend is ongoing and the amount of "work" that I can get my compute resources to do via more and more sophisticated interactions is only going to increase and the more encompassing that work becomes the more it can be broken down into smaller discrete and hence parallelizable tasks.

      Having said all that.... my professional expertise is in quite high performance transactional software and Linus's statement is absolutely true. I'll take cache size/control over a proliferation of cores any day. Given a certain number of cores, and within that all the goodness of branch prediction and out-of-order execution, four sounds about right. So much so that we find situations where adding cores actually reduces our performance, we suspect due to caching issues.

      So in essence there are two trends. From Linus's perspective he is right, the time spent on parallelism is not worth it. At a more macro level it is. Perhaps that macro level is an application software level rather than a system software level and hence the difference in viewpoint.

      --
      "The first thing to do when you find yourself in a hole is stop digging."
    85. Re:Pullin' a Gates? by DamnOregonian · · Score: 1

      Are you joking?

      I know that there is a wide variance in performance differences between compiled programs on 32-bit and 64-bit architectures, but I do a fair amount of work in assembler, and I assure you there are very large speedups to be had moving over to x86-64. First and foremost, increased register size and double the amount of general purpose registers.

      If you want to go to a higher level, 64-bit pointers also allow for all kinds of very neat OS syscall latency related tricks like mapping stupidly-large files into memory.

      64-bit is the way, my friend. If your software doesn't run faster in long-mode, it's because either you or your compiler just isn't quite with the program yet.

    86. Re:Pullin' a Gates? by DamnOregonian · · Score: 1

      You're so right :(

      I keep a PS3 non-updated just so I can play around with the Cell in linux.
      It certainly is a shitty ass part if you're just trying to write normal software for it- the HT PPC core in it is a total dog.

      However, if one were bored and wanted to whip up a stupidly parallel task, like computing segments of a mandelbrot- then one could zoom down to precision failure in a second.
      Ever generated a 40,000 x 40,000 mandelbrot? Sure it's not quite general-purpose, but my i7 desktop, Q9650 desktop, i3 work computer, i3 laptop, and i7 laptop running parallel generation software struggle to keep up. (Granted- no GPU assist on those)
      I just wish I had come up with more workloads for it before I got bored.
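
      For anyone wondering why a Mandelbrot is "stupidly parallel": every row (every pixel, really) can be computed without looking at any other. A minimal sketch of that decomposition follows; the function names and the viewing region are my own choices, and the thread pool is only there to show the structure. In CPython, threads will not actually run this arithmetic in parallel, so a process pool (with a top-level function instead of the lambda) would be needed for a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def mandelbrot_row(y, width, height, max_iter=50):
    """Escape-time counts for one row; needs nothing from any other row."""
    row = []
    for x in range(width):
        # Map the pixel to a point c in the region [-2, 1] x [-1.5, 1.5].
        c = complex(-2.0 + 3.0 * x / width, -1.5 + 3.0 * y / height)
        z = 0j
        n = 0
        while n < max_iter and abs(z) <= 2.0:
            z = z * z + c
            n += 1
        row.append(n)
    return row


def mandelbrot(width, height, workers=4):
    """Compute every row as an independent task and reassemble them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda y: mandelbrot_row(y, width, height), range(height)))


image = mandelbrot(60, 30)
print(len(image), "rows of", len(image[0]), "pixels")
```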

    87. Re:Pullin' a Gates? by DamnOregonian · · Score: 1

      The kind he is talking about is also using those cores to load a single page. He's arguing against parallel computing being the answer for mundane tasks. At the end of the day, improving the instructions/cycle (or cycles/second, but I think we're pretty close to tapped out in that department) performance of cores is more important than increasing the cores.
      People arguing against him largely don't understand that's the argument he's making.
      Adding 12 more cores to your quad core is not going to make the desktop perform better.
      However, a 5% increase in instructions/cycle performance *will*

    88. Re:Pullin' a Gates? by AchilleTalon · · Score: 1

      I wonder if the fine guy marking my post offtopic has himself read the fine article and associated documents. /. is full of surprises, people are even moderating subjects they don't give a fuck to read about.

      --
      Achille Talon
      Hop!
    89. Re:Pullin' a Gates? by im_thatoneguy · · Score: 1

      I say you are the one moving the goal posts, Linus and *most* of the other people working on parallelism solutions are working/speaking in the context of computers like the ones we know today, you're the guy trying to apply what they say to *any* computer. Linus will probably be proved correct there. Past n cores the fundamental architecture in use today will not scale but for niche cases.

      Within the context of traditional Von Neumann computers we already today have voice recognition, we already have SLAM 3D positioning, we already have databases like Wolfram Alpha which can give us insights, we already have applications which crunch massive 3D datasets. Some of these run ok on GPGPUs and some need the larger cache sizes of a CPU to run efficiently.

      My point isn't that we need some completely exotic system, my point is that with the very limited amount of applications today for AI-driven solutions there are plenty of applications that can and would use hundreds of cores. Computers were once a "niche" tool for rich people. The internet was once just a niche tool for academics. Only gamers needed a GPU etc etc. All the way back through history when something becomes accessible someone finds an application. Build it and they will come.

    90. Re:Pullin' a Gates? by im_thatoneguy · · Score: 1

      If that whole process takes 3 seconds (which would be amazing) then your computer only performed 1 "operation per second". But computers don't perform "operations" they have to perform millions of sub-actions to accomplish your goal. It would be like saying that "Rendering a game's frame is only a single task so it would be a very serial task without any potential for multithreading." when in reality "rendering a frame" is a massively parallel task of rasterizing millions of triangles (or intersecting rays) and sampling textures, computing lighting values and performing table look ups.

      Take interpreting voice. By applying multiple models simultaneously you can get better results. Seems pretty obvious.
      http://devblogs.nvidia.com/par...

      For the flyer maybe it'll generate 1,000 flyers simultaneously and then compare them to award winning graphic design projects to see which of the 1,000 ideas it had matches historical good ideas.

    91. Re:Pullin' a Gates? by fisted · · Score: 1

      I actually had the caches in mind. One of the reasons why people came up with the x32 idea.

    92. Re:Pullin' a Gates? by im_thatoneguy · · Score: 1

      The point isn't to pick any one approach or technology (say neural nets) the point is that we *already* have an application that comfortably uses more than Linus' mythically adequate 4 cores. A 4 core CPU is fantastic at running a word processor and an email client in the background. But that's not the future of computing. The future of computing is going to be doing the work of the human brain, but better. The human brain is one example of the sort of application we are going to see more of. Improved Microsoft Word is not the future. Improved Chrome is not the future, we see the future in Science Fiction and it's an interface that can communicate with us naturally. Natural human/computer communication means a whole new set of problems, and these are not problems relegated to "niche" marketplaces like research lab super computers. The applications for machine vision are everywhere. The applications for voice recognition are everywhere. The applications for 'common sense' in your interaction are everywhere. These aren't problems that I expect will be solved best with fast linear serial processes. To date all of these classes of problems have been best approached with multi-threaded parallel computing.

      You mention the GPU. It's true the GPU was a custom semi-specialized piece of hardware. In fact the original 3D accelerators weren't even in the display card they were pass-through cards. But you know what else used to be a semi-specialized chip? Math Co-Processors. Even today GPUs are slowly blending back into the CPU. Once something like a math co-processor becomes sufficiently critical to the average user it becomes part of the CPU's die. AMD has already integrated pretty substantial GPUs into their "APUs". By definition SoCs are integrating the GPU. If we do develop a chip that is critical to the average user like AI with a magic AI-chip then they'll just integrate it into the CPU.

      It used to be that video playback was a niche market and now just about every CPU, GPU and combination thereof has integrated video decoding into the chip. So what makes you think they won't integrate AI and call it a "CPU"?

    93. Re:Pullin' a Gates? by im_thatoneguy · · Score: 1

      You assume that task-specific tasks are all that people will come up with. If you have to spin a new ASIC every time you want to improve your software we aren't going to innovate. ASICs are specifically for something like 10Gb networking which is a defined standard. But most tasks aren't defined standards. Changing specs is the norm not the exception outside of core OS functionality like storage or networking. GPUs couldn't keep up so they moved to a compiled per-pixel shading model so that developers could rapidly iterate and invent new uses. In the process GPUs by necessity became pretty general purpose. But GPUs are still frustratingly limited in their general purpose applications. There is a huge domain of problems that need more than 4 cores but need more memory and larger caches than a GPU offers them. You could legitimately call whatever processor manages to handle them a "CPU" or a "GPU".

    94. Re:Pullin' a Gates? by im_thatoneguy · · Score: 1

      Torvalds dismisses photo editing as a task for "professional photographers", but our amateur cameras are taking phenomenally detailed pictures, and even making fairly simple edits is a compute-intensive task. He may be right, but he may equally be wrong.

      Torvalds is being completely ridiculous here. Avid used to be the domain of professional film editors but iMovie is incredibly popular. We even see cell phones these days sporting 4k cameras. My Lumia has a 41 megapixel sensor! I have a RED camera and it's "only" 18 megapixels. In fact the less professional you are the more processing power you need. Photoshop's paint brush can accomplish wonders in the hands of a professional touch-up artist. But Photoshop's Content-Aware-Fill is processor murder and designed specifically to intelligently replace a professional artist. Take something like 3D rendering. You could have someone hand paint every frame. It would without question require a professional artist. But if you want a pretty picture at the push of a button you want raytracing.

      This is actually something that you see happening today in the high-end VFX market. It used to be that raytracing was too compute intensive for films. But for amateurs and non-artists ironically enough ray tracing was fine. The architect only needed to render 3 frames. Waiting a day was perfectly acceptable there wasn't another 100,000 frames that also needed to get rendered. In film there wasn't time for something like Global Illumination and the shortcuts caused unacceptable flickering. Now the film industry is starting to embrace advanced lighting like GI and they're getting all of the bounces and detail that used to take hundreds of lights to fake automatically. It's making artists more productive but it's coming at the cost of increased compute time. Again a professional lighter can as an artist fake global illumination. An amateur could simply position the sun, turn on GI and wait 18 hours.

      The future will be an Automagical button that not only fixes your photo *cough* instagram *cough* but also performs even more advanced editing like "Remove the gray clouds and put in a photorealistic blue sky. Oh yeah, and also change the lighting of the photo to make it look sunny!" That's going to be far more CPU intensive than any photoshop filter currently in existence and it'll be targeted as much at your average cell phone user as at a professional.

    95. Re:Pullin' a Gates? by im_thatoneguy · · Score: 1

      Game developers waste less processor power than just about any other developer I know of short of super-computer developers. When you have 16ms to render a frame and you have to recreate the entire universe in those 16ms you have to be extremely judicious in your use of cycles.

    96. Re:Pullin' a Gates? by ultranova · · Score: 1

      But by the time you've finished reading the first paragraph of the first page, the other nine are loaded even if you can't parallelise.

      No, they haven't. If you can't parallelise, you can't download and render in background. If you try anyway, you end up blocking the UI randomly. With nine not-really-parallel threads competing for various locks with each other and the user, you set the pages to load and go have coffee.

      Parallelism isn't just good for optimal resource utilization, it's also good for "smooth" user experience. Users might not care if a page loading in the background takes a second rather than two, but they do care about being able to scroll or close the current page while it does.

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    97. Re:Pullin' a Gates? by ultranova · · Score: 1

      Maybe it's time we started designing systems with two separate chips - one dual core chip optimized for running single tasks as fast as possible, and another with 10-50 simpler cores optimized for parallel tasks. I think we're halfway there already, what with GPUs being used that way to some extent, but standardizing it would actually allow non-custom applications to make use of it.

      It's standardized - OpenCL is for exactly this - but it's such a pain to program, people usually won't. All of our popular programming languages are designed for sequential execution, and multithreading is just an afterthought. I don't think the problem can be solved through shared, mutable state. Maybe something inspired by physics: every event has its immutable "past light cone" of events whose output it can access, and can't access any data not in this cone?
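
      One way the "past light cone" idea could look in code, purely as a sketch of the concept and not an existing model or library: each event declares its dependencies and is handed a read-only view of the outputs of its ancestors and nothing else. The name `run_events` and the wave-by-wave scheduling are assumptions for illustration.

```python
from types import MappingProxyType


def run_events(events):
    """Run events whose only inputs are the frozen outputs of earlier events.

    events maps name -> (dependency names, function). Each function receives
    a read-only mapping of the outputs in its "past light cone" (its
    dependencies and their ancestors). Events in the same wave cannot see
    each other, so they could be handed to separate cores.
    """
    done, cones, waves = {}, {}, []
    pending = dict(events)
    while pending:
        ready = sorted(name for name, (deps, _) in pending.items()
                       if all(d in done for d in deps))
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        results = {}
        for name in ready:
            deps, fn = pending.pop(name)
            cone = set(deps)
            for d in deps:
                cone |= cones[d]             # ancestors of my dependencies
            cones[name] = cone
            past = MappingProxyType({d: done[d] for d in cone})
            results[name] = fn(past)
        done.update(results)                 # publish only after the wave ends
        waves.append(ready)
    return done, waves


events = {
    "a": ((), lambda past: 2),
    "b": ((), lambda past: 3),
    "sum": (("a", "b"), lambda past: past["a"] + past["b"]),
    "square": (("sum",), lambda past: past["sum"] ** 2),
}
outputs, waves = run_events(events)
print(outputs["square"], waves)
```

      The read-only mapping only stops an event from rebinding entries; the outputs themselves would also have to be immutable values for the guarantee to hold.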

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    98. Re:Pullin' a Gates? by ultranova · · Score: 1

      A 3+ GHz single core CPU is easily capable of decoding images that come in at full speed over a typical internet connection. You may be able to use multiple cores, but it's not going to make the overall page loading any quicker than using a single core.

      If you had 12+ cores, you could keep those images in their compressed form and decompress when they become visible as the user scrolls a page or switches tabs, thus saving a lot of memory. Also, modern webpages tend to be full of "dynamic" content, from animated gifs to ads. Being able to give a separate thread for each of these would do a lot to make the UI more responsive.

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    99. Re: Pullin' a Gates? by ultranova · · Score: 1

      I don't want to continuously stream my audio to Google nor do they want to continuously process the sound of me typing and sipping coffee...

      Of course they do. It's behavioral data, which can be used to target advertizing and perhaps even feedback data valuable to manufacturers. For example, how much coffee do you drink per day? How long do you spend with a single mug? Do you brew a little and often, or use a thermos, or do you simply let it sit in the pot? Are other people around - are your coffee breaks spent alone? What does your chair sound like - is it time to get another?

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    100. Re:Pullin' a Gates? by lucien86 · · Score: 1

      Yep 50 tabs open - assign a thread to sub-maintain each tab. I like it... Mind you it opens up the possibility of more viruses than can be squeezed into a small box..

      --
      Below the speed of light Special Relativity is one of the most accurate theories in physics - above the speed of light..
    101. Re:Pullin' a Gates? by lucien86 · · Score: 1

      One of the great advantages of running multiple threads on a single core is that you can get rid of deep pipelining, taking along a lot of old problems and superfluous complexity with it.. Of course the same thing can be extended to multiple cores - or even many multiple cores. The big problem with having many cores on a single die is the large bottleneck that tends to form between the CPUs and the main memory.

      I think the real breakthrough will come with having the main ram on the same die as the CPU cores . . that will be a dream come true and we will see a massive improvement in performance then.. Just adding more chips will increase your system processing power & memory - as many as you want.. Massive parallelism is definitely the future, it's just a question of when it will arrive...

      --
      Below the speed of light Special Relativity is one of the most accurate theories in physics - above the speed of light..
    102. Re:Pullin' a Gates? by lucien86 · · Score: 1

      So as long as you can keep guessing at least half the balls in the lottery next week you can keep winning..

      If only it was that easy in Strong AI .. but then I suppose anyone could do it. :(

      --
      Below the speed of light Special Relativity is one of the most accurate theories in physics - above the speed of light..
    103. Re:Pullin' a Gates? by aliquis · · Score: 1

      To begin with the Gates quote never was.

      But regardless, what he seems to suggest isn't that technology won't shrink and that you couldn't make more cores.

      What he seems to suggest is that you'd rather not use that to make more cores but rather make even better/bigger cores and caches.

      No?

      And maybe he's right. I mean. The Pentium IV was supposed to be able to go to 10 GHz!
      We had a real mega-hertz race going on then. And then that stopped and we got more cores instead.
      And maybe now people assume more cores is the future because that's what we've seen. But maybe he's right and it's hard to use that efficiently and that's not where we're going at all.

      I don't know :)

    104. Re:Pullin' a Gates? by Tablizer · · Score: 1

      More fuel for the Gates quote debate:

      http://imranontech.com/2007/02...

    105. Re:Pullin' a Gates? by perryizgr8 · · Score: 1

      Adding 12 more cores to your quad core is not going to make the desktop perform better.

      Why, though? Why can't they separate things involved in loading a single page? Like network, static images, css, javascript, WebGL, etc. Each gets their own core. Won't that speed up the loading?

      --
      Wealth is the gift that keeps on giving.
    106. Re:Pullin' a Gates? by perryizgr8 · · Score: 1

      Everyone except Firefox is doing just that. How the mighty have fallen. Firefox was the one that brought down the evil IE.

      --
      Wealth is the gift that keeps on giving.
  3. Linus should try git by MichaelSmith · · Score: 3, Funny

    ...a tool which he may have heard of. It does connectionless, distributed data management, totally without locks.

    1. Re:Linus should try git by phantomfive · · Score: 3, Informative

      In his post, Linus was talking about single, desktop computers, not distributed servers. He specifically said that he could imagine a 1000 core computer might be useful in the server room, but not for a typical user. So if you're going to criticize him, at least criticize what he said.

      Also, git is not totally without locks. Try seeing if you can commit at the same time as someone else. It can't be done, the commits are atomic.

      --
      "First they came for the slanderers and i said nothing."
    2. Re:Linus should try git by MichaelSmith · · Score: 2

      My point is that git knows how to merge. It knows when a merge is required, when it is not, and when it can be done automatically. If you design your data structures properly, the same behaviour can be used in massively parallel systems.

    3. Re:Linus should try git by Half-pint+HAL · · Score: 1

      Git isn't a performance system, though. The timescales it works on are completely different from those of desktop computing.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    4. Re:Linus should try git by phantomfive · · Score: 1

      Git couldn't do any of that without locks. That might sound like a small detail, but it's important......when it comes to multi-processing, the details are important. If you only look at the big picture, you will make poor decisions.

      --
      "First they came for the slanderers and i said nothing."
    5. Re:Linus should try git by MichaelSmith · · Score: 1

      The only locks in git are within single repositories. The locks which control distributed merging are controlled by the hashes which identify change sets. They tell a repo about the origin of the data being merged in. So rather than thinking about a static blob of data which changes sometimes and needs to be preserved while other nodes are working on it, you think of a graph which extends into the future, each node identified by its hash. By working this way it is easier to find places to reintegrate the results of processing which takes place remotely.
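
      To make the "graph of hashes" point concrete, here is a toy model. It is not git's real object format or API, and the `Repo` class and `merge_kind` helper are invented for illustration. Each change set is identified by a hash of its content plus its parents' ids, and a node can decide what kind of integration is needed purely by walking that graph, without coordinating with whoever produced the other branch.

```python
import hashlib


def commit_id(content, parents):
    """Identify a change set by hashing its content together with its parents' ids."""
    h = hashlib.sha1()
    for p in sorted(parents):
        h.update(p.encode())
    h.update(content.encode())
    return h.hexdigest()


class Repo:
    """A toy store of commits: id -> tuple of parent ids."""

    def __init__(self):
        self.parents = {}

    def commit(self, content, parents=()):
        cid = commit_id(content, parents)
        self.parents[cid] = tuple(parents)
        return cid

    def ancestors(self, cid):
        """All commits reachable from cid, including cid itself."""
        seen, stack = set(), [cid]
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(self.parents[c])
        return seen

    def merge_kind(self, ours, theirs):
        """Decide what integrating `theirs` into `ours` requires."""
        if theirs in self.ancestors(ours):
            return "already merged"
        if ours in self.ancestors(theirs):
            return "fast-forward"
        return "real merge"


repo = Repo()
base = repo.commit("base")
left = repo.commit("work done on node A", [base])
right = repo.commit("work done on node B", [base])
print(repo.merge_kind(left, base), "/", repo.merge_kind(base, right), "/", repo.merge_kind(left, right))
```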

  4. Re:GAY NIGGERS can be DEVELOPERS 1000 WHORES! by Tablizer · · Score: 1

    Example of a brain that can't handle parallel thought processing.

  5. Re:Programs people want to use... by Tablizer · · Score: 1

    Grokking and managing parallel programming seems to be the bottleneck. Using mass parallelism can be done, but so far it's been so difficult that it has yet to be worth it for the vast majority of apps (or at least the vast majority of the operations in a given app, for graphics and database calls can sometimes use lots of parallelism).

    It's too early to know if it's just too hard a problem for the human mind in general, or the current generation of programmers is too locked into a way of thinking.

    Regarding the suggestion to follow nature, nature can be unpredictable. Do we want that characteristic in our applications? How do you debug something if you can't faithfully recreate the state? I can see an organic mess being fun for some games, but not for accounting and tracking software.

    We need more pilot projects to experiment with techniques.

  6. Core of the article by phantomfive · · Score: 1

    The article makes the point (which is not correct*) that to have high scalability we need lockless designs, because locking has too much overhead. If you can't imagine trying to get ACID properties in a multithreaded system without locks, well, neither can I. And neither can they: they've decided to give up on reliability. They've decided we need to give up on the idea that the computer always gives the correct answer, and instead gives the correct answer most of the time (correct meaning, of course, doing exactly what the programmer told it to do).

    Here is what the guy says: "The obstacle we shall have to overcome, if we are to successfully program manycore systems, is our cherished assumption that we write programs that always get the exactly right answers." Not only that, we need to give up memory/cache locks within the processor (I don't know a whole lot about those), because when you scale to 1000 processes on a single processor, RAM becomes a bottleneck.

    Now, if he's right, and the only way to get such high performance is by not worrying about whether the computer does what it is told, then he's not going to be able to convince many people.


    *It is not correct in situations where each processor can work on a single chunk for a long time, that is, for problems where resource contention is a small fraction of processor time, like in video encoding. Then the overhead is still small, no matter how many processors you have.
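
    To put rough numbers on "a small fraction", one option is Amdahl's law, treating the contended (locked) part of the work as strictly serial. That is a simplification of how lock contention really behaves, and the fractions below are illustrative rather than measured.

```python
def speedup(serial_fraction, cores):
    """Amdahl's law: the serial part runs on one core, the rest scales perfectly."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


for s in (0.01, 0.001, 0.00001):
    print(f"contended fraction {s:g}: {speedup(s, 1000):.0f}x on 1000 cores")
```

    On this model, 1% contention gives roughly 91x on 1000 cores and 0.1% gives roughly 500x. It takes something like 0.001% to get close to 990x, so "small" has to mean very small indeed at that scale.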

    --
    "First they came for the slanderers and i said nothing."
    1. Re:Core of the article by imgod2u · · Score: 3, Insightful

      The idea isn't that the computer ends up with an incorrect result. The idea is that the computer is designed to be fast at doing things in parallel with the occasional hiccup that will flag an error and re-run in the traditional slow method. How much of a window you can have for "screwing up" will determine how much performance you gain.

      This is essentially the idea behind transactional memory: optimize for the common case where threads that would use a lock don't actually access the same byte (or page, or cacheline) of memory. Elide the lock (pretend it isn't there), have the two threads run in parallel and if they do happen to collide, roll back and re-run in the slow way.
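      A rough software analogy of that elide-and-retry pattern, in Python. This is not hardware transactional memory, just the same shape: do the work without holding a lock, check at commit time whether anyone else got there first, and fall back to plain locking after repeated collisions. `OptimisticCell`, `update` and `max_attempts` are names made up for this sketch.

```python
import threading

class OptimisticCell:
    """Holds one value; writers work without the lock and only lock to publish."""

    def __init__(self, value):
        # (version, value) is replaced as one tuple, so a reader always sees a
        # matching pair with a single attribute read.
        self._state = (0, value)
        self._lock = threading.Lock()

    def read(self):
        return self._state[1]

    def update(self, fn, max_attempts=3):
        """Apply fn to the value. fn must be side-effect free: it may run more than once."""
        for _ in range(max_attempts):
            snapshot = self._state
            new_value = fn(snapshot[1])      # speculative work, no lock held
            with self._lock:                 # brief: validate and publish only
                if self._state is snapshot:  # nobody committed in the meantime
                    self._state = (snapshot[0] + 1, new_value)
                    return
            # Collision: drop new_value (the "roll back") and try again.
        # Too many collisions: fall back to the slow, fully locked path.
        with self._lock:
            version, value = self._state
            self._state = (version + 1, fn(value))
```

      The win depends on `fn` being expensive relative to the commit check and on collisions being rare, which is the "common case" assumption above.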

      We actually see this concept play out in many parts of hardware and software algorithms. Hell, TCP/IP is built on letting packets travel freely and possibly collide or get dropped, with the idea that you can resend them. It ends up speeding up the common case: that packets make it to their destination along one path.

    2. Re:Core of the article by Rei · · Score: 2

      There are cases where getting exactly the right answer doesn't matter - real-time graphics is a good example. It's amazing the level of error you can have on an object if it's flying quickly past your field of view and lots of things are moving around. In "The Empire Strikes Back" they used a bloody potato and a shoe as asteroids and even Lucas didn't notice.

      That said, it's not the general case in computing that one can tolerate random errors. Nor is the concept of tolerating errors anything new. Programmers have, for example, been using approximations for square roots for a long, long time to save compute cycles where precision takes a back seat to "just get the shape of the curve roughly right". There are even a number of lower-precision hardware math methods.

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    3. Re:Core of the article by Rei · · Score: 1

      I'm wondering what he is thinking of for real-world details. For example, a common use case is that one thread searches through a data structure to find an element (as, say, a pointer or an iterator), but before it can dereference it and try to access the memory, some other thread comes along, removes it from the list and frees it. Then your program tries to dereference a pointer or iterator that's no longer valid and it crashes.

      The problem isn't that it's no longer in the list. Clearly the other thread had a good reason to remove it and if your first thread had happened just a split second later it never would have seen the removed entry. The problem is that your program crashes because it's trying to use a freed memory address.

      What sort of implementation details is he thinking of that prevent this sort of problem? I mean, if he actually has a realistic solution, I'd love to see it, it could make for a brilliant extension of STL containers.

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    4. Re:Core of the article by Half-pint+HAL · · Score: 1

      Immutability gets rid of problems of atomicity. Functional Programming is therefore a viable paradigm for parallel computing. Even if Scala is a bad example in terms of implementation, Odersky was right in deciding that scalability worked better with immutability-by-default. Mutability is there if you need it, but it's a conscious choice, which forces you to think carefully about it.
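      A small illustration of the immutability point, in Python rather than Scala. Because each `Account` is frozen, workers can share the inputs freely and every "change" produces a new object instead of touching the old one. The `Account` type, the interest rule and `apply_interest` are invented for this sketch, and CPython threads are used only to show the sharing pattern, not to claim a speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Account:
    owner: str
    balance: int

def with_interest(account, percent):
    """Return a new Account; the original is left untouched."""
    gain = account.balance * percent // 100
    return replace(account, balance=account.balance + gain)

def apply_interest(accounts, percent, workers=4):
    """Process every account concurrently; no locks, since nothing is mutated."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return tuple(pool.map(lambda a: with_interest(a, percent), accounts))
```

      The old tuple of accounts stays valid for any thread still reading it, at the cost of allocating the new objects.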

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    5. Re:Core of the article by phantomfive · · Score: 1

      Immutability gets rid of problems of atomicity. Functional Programming is therefore a viable paradigm for parallel computing.

      Functional programming makes the conceptual load easier, but it's not a magic bullet. The algorithmic problems are still there, because sometimes you need to update the database.

      --
      "First they came for the slanderers and i said nothing."
    6. Re:Core of the article by phantomfive · · Score: 1

      Yeah, in cases where you don't care about precision, then precision doesn't matter. That's different than not being able to perform a transaction.

      --
      "First they came for the slanderers and i said nothing."
    7. Re:Core of the article by phantomfive · · Score: 1

      Yeah, good luck doing transactions without locks.

      --
      "First they came for the slanderers and i said nothing."
    8. Re:Core of the article by Bengie · · Score: 1

      Immutability increases memory usage, which puts pressure on your allocator, which also needs to be thread safe, and garbage collection tends to be a "stop the world" affair, which means all of your threads stop. Depending on how you use "immutability", it could be worse than mutability for parallelism. People need to understand how things work in order to understand how they interact. There is no magic bullet; programmers need to start understanding and stop assuming everything is a black box that just works as desired.

    9. Re:Core of the article by Half-pint+HAL · · Score: 1

      Functional programming makes the conceptual load easier, but it's not a magic bullet. The algorithmic problems are still there, because sometimes you need to update the database.

      OK, functional programming isn't a good paradigm for database implementation, but DBMSes are typically a prewritten black box. Most mutability in large systems is best handled by a database anyway, and the DBMS will then take care of the lion's share of the resource locking.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    10. Re:Core of the article by shutdown+-p+now · · Score: 1

      If you can't imagine trying to get ACID properties in a multithreaded system without locks, well, neither can I.

      Databases have been doing ACID without locks for a while now, with MVCC. As I understand, STM is basically built on the same ideas for running program state.

    11. Re:Core of the article by phantomfive · · Score: 1

      Nah, that still requires locks for writing, it just lets you read while someone else is writing.

      --
      "First they came for the slanderers and i said nothing."
    12. Re:Core of the article by phantomfive · · Score: 1

      Well, I certainly favor functional programming, especially on the server side as business logic... I just don't think it's going to give any real algorithmic improvements like these guys are talking about. It makes multithreading easier, and easier to do without bugs, but not more efficient. That is my main point.

      --
      "First they came for the slanderers and i said nothing."
    13. Re:Core of the article by shutdown+-p+now · · Score: 1

      MVCC? It only requires locks when committing the transaction (and reconciling it with other transactions), which is a much shorter duration than writing itself. You can easily have concurrent writes in two ongoing snapshots without any locking whatsoever, so they don't have to wait on each other.
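      A toy version of that scheme in Python, to make the "lock only at commit" part concrete. Each transaction reads from the snapshot that existed when it began and buffers its writes privately; the lock is taken only for the short validate-and-publish step, with a first-committer-wins rule. `Store`, `Transaction` and `ConflictError` are names invented here, and real databases differ in many details (garbage-collecting old versions, read-write conflicts, durability).

```python
import threading

class ConflictError(Exception):
    """Another transaction committed a write to the same key first."""

class Store:
    def __init__(self):
        self._versions = {}   # key -> list of (commit_version, value), oldest first
        self._version = 0     # version of the latest fully published commit
        self._commit_lock = threading.Lock()

    def begin(self):
        return Transaction(self, self._version)

class Transaction:
    def __init__(self, store, start_version):
        self._store = store
        self._start = start_version
        self._writes = {}     # private buffer, filled without any locking

    def read(self, key, default=None):
        if key in self._writes:
            return self._writes[key]
        # Newest version that was already committed when this transaction began.
        for version, value in reversed(self._store._versions.get(key, ())):
            if version <= self._start:
                return value
        return default

    def write(self, key, value):
        self._writes[key] = value

    def commit(self):
        store = self._store
        with store._commit_lock:          # the only lock, held briefly
            for key in self._writes:
                history = store._versions.get(key)
                if history and history[-1][0] > self._start:
                    raise ConflictError(key)
            new_version = store._version + 1
            for key, value in self._writes.items():
                store._versions.setdefault(key, []).append((new_version, value))
            store._version = new_version  # publish last, so snapshots stay consistent
```

      Two transactions writing different keys never wait on each other until commit; two writing the same key are reconciled by rejecting the later committer.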

    14. Re:Core of the article by phantomfive · · Score: 1

      MVCC? It only requires locks when committing the transaction (and reconciling it with other transactions), which is a much shorter duration than writing itself.

      True.

      --
      "First they came for the slanderers and i said nothing."
    15. Re:Core of the article by Mr+Z · · Score: 1

      Eventual consistency means that the computer eventually computes the right answer if it's quiescent long enough. Intermediate values, though, are an approximation, which is often enough.

      One example that Paul McKenney gives is of a distributed counter built out of per-CPU counters, and CPU-to-CPU events saying how much to update the total by. (Let's assume positive counts only.)

      Each CPU will see update events from other CPUs in different orders, each saying how much to update the count by. All CPUs will eventually see all updates. So, the total seen by any given CPU might differ from the true total in the short run (and may not even be a technically valid total given the original source of events, since events get reordered), but eventually all of the counters will converge on the same total if updates stop pouring in. Also, the totals are still locally monotonic.

      If you required all CPUs to see the same sequence of updates to the count, then you have to take locks and serialize memory accesses, which on a manycore system is an expensive operation that simply doesn't scale well. But, if you relax the constraint to "eventual consistency" and "monotonic updates", then each core can have its local approximation that isn't too far from the real value, knowing that each core is no further from the true value than the backlog of events yet to arrive.

      That's an extremely reasonable model for many types of data.
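      A single-threaded Python simulation of that counter, as a sketch only: the classes and the random delivery order are assumptions made for illustration, not McKenney's code. Each simulated CPU applies its own increments immediately and receives everyone else's as events that may arrive in any order.

```python
import random

class CpuCounter:
    def __init__(self):
        self.total = 0    # this CPU's local approximation of the global count
        self.inbox = []   # increments announced by other CPUs, not yet applied

    def backlog(self):
        return sum(self.inbox)

class Machine:
    def __init__(self, n_cpus, seed=0):
        self.cpus = [CpuCounter() for _ in range(n_cpus)]
        self.true_total = 0            # bookkeeping for the demo only
        self._rng = random.Random(seed)

    def add(self, cpu_index, amount):
        """CPU cpu_index counts `amount` (positive counts only) and tells the others."""
        if amount < 0:
            raise ValueError("positive counts only")
        self.true_total += amount
        for i, cpu in enumerate(self.cpus):
            if i == cpu_index:
                cpu.total += amount
            else:
                cpu.inbox.append(amount)

    def deliver(self, steps):
        """Apply up to `steps` pending events, each picked at random (reordering)."""
        for _ in range(steps):
            pending = [c for c in self.cpus if c.inbox]
            if not pending:
                return
            cpu = self._rng.choice(pending)
            cpu.total += cpu.inbox.pop(self._rng.randrange(len(cpu.inbox)))

    def drain(self):
        """Updates have stopped pouring in: let every CPU catch up."""
        for cpu in self.cpus:
            cpu.total += sum(cpu.inbox)
            cpu.inbox.clear()
```

      At any moment each CPU's `total` is short of `true_total` by exactly its `backlog()`, and it only ever grows; after `drain()` all CPUs agree.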

    16. Re:Core of the article by imgod2u · · Score: 1

      A lot of transactional machines don't have locks. That's not to say they don't have any mutex-like structures altogether but rather, the sequences themselves are treated as locks, thus allowing a finer granularity than normal mutex algorithms.

    17. Re:Core of the article by imgod2u · · Score: 1

      How about graceful seg faults instead of program crashes? Obviously modern architectures don't really support such things, but one can imagine a processor that detects bad pointers instead of letting the program crash. In fact, each program (or even each transaction) could register a pre-determined fault handler.

      What'll happen is:

      1. Thread A sets a "start of code snippet" and programs an address that has a fault handler.
      2. Thread B starts its processing as well.
      3. Thread A at some point tries to dereference a pointer at address X.
      4. Thread B races ahead and deletes the pointer at address X.
      5. Normally, in protected memory, the processor would throw a fit as thread A tries to access an illegal memory address.
      6. Instead, the processor jumps to thread A's custom fault handler.
      7. Thread A's fault handler sees "hey, my code snippet tried to access an illegal address and I, the thread, am not guaranteed to be thread safe". It then rolls back all of the work it's done up until the instruction that faulted.
      8. Thread A tries again starting from 1. It could, at some point, decide to not try the thread unsafe method (if it faults too many times) and actually use the old mutex locking method.
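      The steps above can be imitated in Python, with loud caveats: Python never frees memory out from under you, so the `freed` flag and `StaleReference` exception below merely stand in for the hypothetical hardware fault, and the `interleave` hook exists only so a test can play "thread B" at the worst moment. All names are invented for this sketch.

```python
import threading

class StaleReference(Exception):
    """Stands in for the fault raised on touching freed memory."""

class Node:
    def __init__(self, payload):
        self._payload = payload
        self.freed = False

    def deref(self):
        if self.freed:
            raise StaleReference()
        return self._payload

class Table:
    def __init__(self):
        self._nodes = {}
        self._lock = threading.Lock()

    def insert(self, key, payload):
        with self._lock:
            self._nodes[key] = Node(payload)

    def remove(self, key):
        with self._lock:
            node = self._nodes.pop(key, None)
            if node is not None:
                node.freed = True          # step 4: the pointer is now invalid

    def lookup(self, key, max_optimistic=3, interleave=None):
        for _ in range(max_optimistic):
            node = self._nodes.get(key)    # step 3: find the pointer, no lock held
            if interleave is not None:
                interleave()               # test hook: another thread races ahead
            try:
                return None if node is None else node.deref()
            except StaleReference:
                continue                   # steps 6-7: nothing to undo, just retry
        with self._lock:                   # step 8: give up and use the mutex
            node = self._nodes.get(key)
            return None if node is None else node.deref()
```

      The lookup here has no side effects, so "rolling back" is just discarding the stale reference; a transaction that had already written something would need a real undo step.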

      The idea is that the majority of the time, thread A and thread B don't actually conflict. Or thread A wins the race. In those cases, you have a case of parallel computation speedup.

      It's up to the programmer (or compiler, probably a JIT) to recognize when to exploit this by analyzing the algorithm and the likelihood of conflict. A JIT would probably use profiling information it gets in real time.

      Nobody's saying this will replace 100% of all synchronization methods. But we don't need to. To get a speedup, you only need to technically replace 1 use case. But most likely, you can replace a lot (90%) of use cases.

  7. How parallel does a Word Processor need to be? by Nutria · · Score: 3, Interesting

    Or a spreadsheet? (Sure, a small fraction of people will have monster multi-tab sheets, but they're idiots.)
    Email programs?
    Chat?
    Web browsers get a big win from multi-processing, but not parallel algorithms.

    Linus is right: most of what we do has limited need for massive parallelization, and the work that does benefit from parallelization has been parallelized.

    --
    "I don't know, therefore Aliens" Wafflebox1
    1. Re:How parallel does a Word Processor need to be? by phantomfive · · Score: 1

      Emacs can always use more cores [flame suit on]

      --
      "First they came for the slanderers and i said nothing."
    2. Re:How parallel does a Word Processor need to be? by Anonymous Coward · · Score: 1

      Emacs is a bad OS that can only use one core. If you use Erc (an irc client inside of Emacs), you will notice the real pain of Emacs. While Erc tries to reconnect to a server, you can do absolutely nothing, not even changing to another buffer. You just have to sit there waiting for it to either succeed or to time out.

      Emacs is only alive because of how it can handle some code, but as an OS it is terribly broken as it is single threaded.

    3. Re:How parallel does a Word Processor need to be? by phantomfive · · Score: 1

      Emacs is a bad OS that can only use one core. If you use Erc (an irc client inside of Emacs), you will notice the real pain of Emacs. While Erc tries to reconnect to a server, you can do absolutely nothing, not even changing to another buffer

      Isn't there a version of select() inside emacs? In other words, some kind of non-blocking connect?

      --
      "First they came for the slanderers and i said nothing."
    4. Re:How parallel does a Word Processor need to be? by maccodemonkey · · Score: 2

      Or a spreadsheet? (Sure, a small fraction of people will have monster multi-tab sheets, but they're idiots.)
      Email programs?
      Chat?
      Web browsers get a big win from multi-processing, but not parallel algorithms.

      Linus is right: most of what we do has limited need for massive parallelization, and the work that does benefit from parallelization has been parallelized.

      This is kind of silly. Rendering, indexing and searching get pretty easy boosts from parallelization. That applies to all three cases you've listed above. Web browsers especially love tiled parallel rendering (very rarely these days does your web browser output get rendered into one giant buffer), and that can apply to spreadsheets too.

      A better question is how much parallelization we need for the average user. While the software algorithms should nicely scale to any reasonable processor/thread count, on the hardware side you do have to ask how many cores we really need, especially since a lot of users are happy right now. But targeting these sorts of operations as a single thread is also entirely the wrong approach. It's not power efficient for mobile users, and it drastically limits the gains your code will see on new hardware, while competing source bases pass you up.

    5. Re:How parallel does a Word Processor need to be? by Nutria · · Score: 1

      indexing and searching get pretty easy boosts from parallelization.

      How much indexing and searching does Joe User do? And what percent is already done on a high-core-count server where parallel algorithms have already been implemented in the programs running on that kit?

      Web browsers especially love tiled parallel rendering

      Presuming that just a single tab on a single page is open, how CPU bound are web browsers running on modern 3GHz kit? Or are they really IO (disk and network) bound?

      --
      "I don't know, therefore Aliens" Wafflebox1
    6. Re:How parallel does a Word Processor need to be? by gnasher719 · · Score: 1

      I'll give you an example. I use iBooks to read eBooks. I downloaded two eBooks which are actually each a collection of fifty full-size books. On my MacBook, the one I'm currently reading displays that it's around page 8,000. The total is about ten thousand pages.

      If I change the font size, it recalculates the pages and page breaks for the whole book. One CPU running at 100% for a very long time. For a five hundred page book, no problem. For a ten thousand page book, big problem. I'd love it if the re-pagination process would use all the cores that are available.

    7. Re:How parallel does a Word Processor need to be? by Nutria · · Score: 1

      First thought: Why the hell aren't those two (total of) 10,000 page "books" split into their constituent 50 "actual" books?

      That's the kind of parallelization and work optimization that needs to take place before algorithm changes.

      --
      "I don't know, therefore Aliens" Wafflebox1
    8. Re:How parallel does a Word Processor need to be? by mean+pun · · Score: 1

      That's actually a good example of Linus' point. You can not easily parallelise pagination, because you need to know what fits on one page before you can paginate the next page. Sure, you can do some heuristics (every page contains exactly 1000 words), which is dangerous and at best lowers the quality of the result. You can also try to be clever and for example paginate chapters in parallel and then do the exact numbering afterwards in a separate pass, but before you know it you have turned a fairly simple algorithm into something highly complicated and fragile. And for what? A few corner cases.

    9. Re:How parallel does a Word Processor need to be? by swilver · · Score: 1

      Yes, and it re-renders all the pages as bitmaps at 400% zoom, scales them back down to get proper anti-aliased results, then compresses them with JPEG and stores them in main memory... or how about just recalculating the page that you need to display?

      Parallel processing is not gonna solve stupidity.

    10. Re:How parallel does a Word Processor need to be? by Rei · · Score: 1

      The key question is: what are the most common cases (ranked by frequency times severity) where users' computers have lagged by a perceptible amount that in any way reduced their user experience, or caused the user to forgo features that would otherwise have been desirable? Then you need to look at the cause.

      In the overwhelming majority of cases, you'll find that "more parallelism with more cores" would be a solution. So why not just bloody do it?

      Not everybody suffers performance problems in the same way. But the vast majority of people's performance problems can be solved by the same solution.

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    11. Re:How parallel does a Word Processor need to be? by jellomizer · · Score: 1

      You are stating that Linux is a Desktop OS?
      That we need Linux for these mundane tasks?

      We need Linux for big calculations. Faster Databases (Parallel sorting of data, parallel searches).
      Faster collection for Decision Support Systems... Err um Business Intelligence Systems... Err um... Big Data... Whatever the buzzword of the week for statistical based calculations.

      Some things just need parallelization but others need parallel processing.

      A good parallel algorithm can bring computation speed down by one order of magnitude. A Sort can happen in O(log(n)) time. A search can happen in O(c) time.

      These improvements happen when you have larger sets of data. Not baby toys like Word Processors, Spread Sheets and Email.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    12. Re:How parallel does a Word Processor need to be? by Half-pint+HAL · · Score: 1

      And of course, every HTML document is a tree, and any tree can be parallelised.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    13. Re:How parallel does a Word Processor need to be? by Half-pint+HAL · · Score: 1
      If there's a hard break in the ebook (ie new chapters start on new pages), you can divide-and-conquer at page breaks and number as x+1, x+2,... x+n and then substitute for x when the previous calculation finishes. The GP's problem isn't that he's waiting for page numbers, but that he's waiting to see his book. It's the reader software that's waiting for the page numbers before it allows him to see the current page. From the user's perspective, having the page visible immediately with "page ?? of ??" at the bottom for the next few minutes is infinitely preferable to waiting a few minutes to get everything at once.

      Of course, this doesn't need parallelisation -- it could just as easily be done in a single core:
      1. Search back from current page to previous hard break.
      2. Repaginate x+n until next hard break found
      3. Render to screen as "page ?? of ??"
      4. Repaginate from beginning of text to start of current segment.
      5. Update screen to show (for example) "page 8123 of ??".
      6. Repaginate remainder of document.
      7. Update total page number, eg "page 8123 of 10424".
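
      The ordering in that list can be sketched as a Python generator that yields the footer text at each point where the screen could be updated. The pagination model is deliberately a toy (a fixed number of lines per page, every segment between hard breaks starting on a fresh page), and `pages_in`, `repaginate` and their parameters are hypothetical names, not anything iBooks exposes.

```python
import math

def pages_in(segment_lines, lines_per_page):
    """Pages needed by one segment; a segment always starts on a fresh page."""
    return max(1, math.ceil(segment_lines / lines_per_page))

def repaginate(segments, current_segment, offset_in_segment, lines_per_page):
    """Yield the page footer after each stage of work.

    segments: line counts of the stretches of text between hard breaks
    current_segment: index of the segment the reader is in
    offset_in_segment: how many lines into that segment the reader is
    """
    # Steps 1-3: only the current segment is needed to draw the page.
    local_page = offset_in_segment // lines_per_page + 1
    yield "page ?? of ??"

    # Steps 4-5: everything before the current segment fixes the page number.
    before = sum(pages_in(s, lines_per_page) for s in segments[:current_segment])
    current = before + local_page
    yield f"page {current} of ??"

    # Steps 6-7: the current segment and the rest of the document fix the total.
    rest = sum(pages_in(s, lines_per_page) for s in segments[current_segment:])
    yield f"page {current} of {before + rest}"
```

      A caller would redraw after each yielded value, so the reader sees the page immediately and the numbers fill in as the single core gets to them.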

      That iBooks doesn't do this is a result of design, and mostly based on assumptions about book length. The GP's call for parallelisation is wasted: if the developers consider cases like his rare enough to be negligible, they most likely wouldn't bother parallelising the code on a more parallel computer anyway.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    14. Re:How parallel does a Word Processor need to be? by Nutria · · Score: 1

      And of course, every HTML document is a tree, and any tree can be parallelised.

      No one has answered the question about disk and network slowness.

      What's the human-perceived benefit of rewriting Firefox to get a 1/2 second speedup in page rendering when I'm still waiting 3-4 seconds for some ad server to send me the rest of its crud (ABP needing to be blocked so that videos on ESPN will play)?

      --
      "I don't know, therefore Aliens" Wafflebox1
    15. Re:How parallel does a Word Processor need to be? by laird · · Score: 1

      Pagination can be largely parallelized, because you can do most of the analysis (line layout, font rendering, etc.) in parallel. The only part that's got to be sequential is breaking the lines onto pages. You can then parallelize the rest of the page layout (headers, etc.).
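
      A minimal Python sketch of that split, with line wrapping standing in for the expensive per-paragraph layout work. The function names and the fixed lines-per-page rule are assumptions for illustration; threads are used for brevity, and in CPython a process pool would be needed to get a real speedup on CPU-bound layout.

```python
from concurrent.futures import ThreadPoolExecutor
import textwrap

def layout_paragraph(text, width):
    """Independent per-paragraph work: break the text into lines."""
    return textwrap.wrap(text, width) or [""]

def paginate(paragraphs, width, lines_per_page, workers=4):
    # Parallel part: each paragraph is laid out on its own; map() keeps the order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        laid_out = list(pool.map(lambda p: layout_paragraph(p, width), paragraphs))

    # Sequential part: where a page ends depends on everything before it.
    pages, current = [], []
    for lines in laid_out:
        for line in lines:
            current.append(line)
            if len(current) == lines_per_page:
                pages.append(current)
                current = []
    if current:
        pages.append(current)
    return pages
```

      The sequential pass only moves already-computed lines around, so it is cheap compared with the layout step it depends on.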

    16. Re:How parallel does a Word Processor need to be? by vidnet · · Score: 1

      How parallel does a Word Processor need to be?

      Don't forget the complementary question, "How fast does each individual core have to be to run a single threaded Word Processor at an acceptable speed?"

      Imagine if instead of 4x 3ghz Xeons you had 4,000x 486s or 4,000,000x 286s.

    17. Re:How parallel does a Word Processor need to be? by Moof123 · · Score: 1

      Moore's Law has always been behind the MS Word curve. No matter how fast the processor, or how many cores, Microsoft has managed to use them up and then some, making word processing for the masses slower, and slower...

    18. Re:How parallel does a Word Processor need to be? by Nutria · · Score: 1

      "How fast does each individual core have to be to run a single threaded Word Processor at an acceptable speed?"

      WordPerfect 6.0 ran great on a 286 w/ 640KB, and WordStar was zippy on a 4MHz Z80 with 64KB (it was the floppy disk IO that hurt performance).

      So... the answer to your question is: not very!

      --
      "I don't know, therefore Aliens" Wafflebox1
    19. Re:How parallel does a Word Processor need to be? by Half-pint+HAL · · Score: 1

      You're quite right -- browsing is I/O bound, and the fact that Firefox refuses to render many pages mid-load is a design decision, and nothing to do with either serial-vs-parallel or concurrency. The place where Firefox still stutters on concurrency is where an object in one page crashes all the pages with objects of the same type. The main selling point of the original Google Chrome was isolating each tab (JavaScript, Flash et al.) in its own process so that no one page would kill the browser.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    20. Re:How parallel does a Word Processor need to be? by Agripa · · Score: 1

      How much indexing and searching does Joe User do? And what percent is already done on a high-core-count server where parallel algorithms have already been implemented in the programs running on that kit?

      Indexing (sorting) and searching in my email client are right up there with games, engineering applications, compression/decompression, and error recovery in using 100% of 1 CPU core while not being I/O limited. Firefox is in that list as well.

      So most of the programs that I use which contribute to a slow experience on the user side do not take advantage of multiple cores. The exceptions are video transcoding and error recovery set generation which can use as many cores as I can provide which is currently up to 4.

  8. Re:i'm so tired of political correctness by jones_supa · · Score: 1

    I think the actual problem is that some people are so worked up about political incorrectness that they take pleasure in spewing insulting, angry messages all day long. Lol, look at my freedomz of speechorz. But a clever guy can say things straight without being an upsetting dickhead at the same time.

  9. Bad summary, shocking by Urkki · · Score: 5, Interesting

    Linus doesn't so much say that parallelism is useless; he's saying that more cache and bigger, more efficient cores are much better. Therefore, increasing the number of cores at the cost of single-core efficiency is just stupid for general purpose computing. Better to just stick more cache on the die instead of adding a core. Or that is how I read what he says.

    I'd say, number of cores should scale with IO bandwidth. You need enough cores to make parallel compilation be CPU bound. Is 4 cores enough for that? Well, I don't know, but if the cores are efficient (highly parallel out-of-order execution) and have large caches, I'd wager IO lags far behind today. Is IO catching up? When will it catch up, if it is? No idea. Maybe someone here does?

    1. Re:Bad summary, shocking by Z00L00K · · Score: 1

      Some I/O won't catch up that easily, you can't speed up a keyboard much, and even though we have SSDs we have a limit there too.

      But if you break up the I/O as well into sectors so that I/O contention on one area doesn't impact the I/O on another, by using a NUMA architecture for I/O as well as RAM, then it's theoretically possible to redistribute some processing.

      It won't be a perfect solution, but it will be less sensitive.

      --
      If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
    2. Re:Bad summary, shocking by Half-pint+HAL · · Score: 1

      Cache misses are a problem, but caching is subject to the law of diminishing returns -- it takes an exponential growth in cache size to get a linear reduction in cache misses. Has Torvalds run the numbers and determined where the crossover point of the two lines is? Even then, the best solution depends very heavily on the task in question. General purpose computing has always been a compromise.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
  10. Torvalds is half right by popo · · Score: 5, Insightful

    The problem is that Linus is discussing two different things at once and so it sounds like he's making a more inflammatory point than he is.

    The issue is not whether parallelism is uniformly better for all tasks. The question is, is parallelism better for some tasks. And as Torvalds points out, those tasks do exist (Graphics being an obvious one).

    The nature of the workload required for most workstations is non-uniform processing of large quantities of discrete, irregular tasks. For this, parallelism (as Torvalds correctly notes) is likely not the most efficient approach. To pretend that in some magical future, our processing needs can be homogenized into tasks for which parallel computing is superior is to make a faith-based prediction on how our use of computers will evolve. I would say that the evidence is quite the opposite: That tasks will become more discrete and unique.

    Some fields though: finance, science, statistics, weather, medicine, etc. are rife with computing tasks which ARE well suited to parallel computing. But how many of those tasks happen on workstations? Not many, most likely. So Linus' point is valid.

    But I have to take issue with Linus' tone, in which he downplays "graphics" as being a rather unimportant subset of computing tasks. It's not "graphics". It's "GRAPHICS". That's not a small outlier of a task. Wait until we're all wearing ninth generation Oculus headsets... the trajectory of parallel processing requirements for graphics is already becoming clear -- and it's stratospheric. The issue is this: Our desktop processing requirements are actually slowing and, as Linus points out, are probably ill-suited for increased parallelism. But our graphics requirements may be nearly infinite.

    Unlike other fields of computing, we know where graphics is going 20 years from now: It's going to the "holodeck".

    Keep working on parallel computing guys. Yes, we need it.

     

    --
    ------ The best brain training is now totally free : )
    1. Re:Torvalds is half right by Anonymous Coward · · Score: 1

      The assumption that future workloads are highly parallel isn't faith based. It's simply a matter of recognizing that processing power for parallel workloads is much easier to scale and thus going to be much cheaper than processing power for sequential workloads. If you can get loads more parallel processing power for the same price that you would pay for sequential processing power, many of your processing tasks are going to look parallelizable, because you start looking at them differently.

    2. Re:Torvalds is half right by Oligonicella · · Score: 1

      So you have faith that these processing tasks will actually be 'parallelizable' because you start looking at them differently?

    3. Re:Torvalds is half right by Anonymous Coward · · Score: 4, Informative

      AMD have a line of CPUs very much like this, the A Series. It has several conventional multi-purpose x86-64 cores for general-purpose use and a Graphics Processing Unit built-in for those embarrassingly-parallel floating-point operations. Best of all, they're very cheap and perform very well.

    4. Re:Torvalds is half right by Anonymous Coward · · Score: 2, Insightful

      No, that's not faith. That's an economic argument. I know that many tasks which are considered practically non-parallelizable today can in fact be parallelized. We don't do that today because the additional work doesn't pay off when massive multicore systems are not yet available or not yet capable of running general purpose code. Often it's just a matter of getting the right tools, but sometimes you need to look at problems again and solve them in a different way. With new algorithms and new tools, you will make use of many cores, because if you don't, you will be left in the dust by the people who do.

    5. Re:Torvalds is half right by RabidReindeer · · Score: 1

      There are actually 3 kinds of tasks as far as parallelization goes.

      1. Totally linear tasks. Each step relies on the output of its predecessor. Thus nothing can begin before its time. Obviously, handing this sort of work off to a parallel system is a waste of time.

      2. Simple parallel tasks. This is the case where you can do a lot of trivial operations in parallel. The computer equivalent of a bunch of people using 4-banger calculators. As long as the tasks individually take longer to run than they do to schedule and collect, this is an ideal use case for massively parallel array processors.

      3. Complex parallel tasks. This is simply case #2, but armed with HP advanced function calculators. The individual processors would be more than just basic gate arrays and thus able to perform complex math functions in parallel. Not as cheap to scale up, but better than waiting out linear time.

      Of course, this is for the ideal world. Real-life heavy-computing scenarios may have components for 1, 2, or all 3 of the above.

    6. Re:Torvalds is half right by DutchUncle · · Score: 1

      I don't read "unimportant subset"; I read "subset, not general-purpose". By all means, graphics and parallel computing are/will be important; but look at the history of processor development - the rise of the GPU as a separate device, taking the processing load *off* the CPU. I keep reading posts about using the appropriate language for a task; how much more so, then, is using the appropriate hardware design for a task?

      This is like debating whether your kitchen renovation would be better with an 8 burner stove, or more counter space. A restaurant, cooking separate meals for 4 or 6 people at a time, needs more burners - and usually has more cooks to go with them. At home, with one cook, 4 burners is usually as much as one can handle simultaneously, and the counter space is more useful. Different solutions for different problems.

    7. Re:Torvalds is half right by wisnoskij · · Score: 1

      But is not the real point that we have hit a wall on the hardware front? While we can easily add parallel processing power, we cannot easily add processing power to individual processors. Which is exactly why programs will increasingly, by necessity, be programmed to take advantage of more and more cores. Yes, individual algorithms/tasks sometimes cannot be broken up even into two parallel tasks, but even single programs are 99% a grouping of many different tasks to begin with. Yes, on the big data crunching scientific research end, this will not help, but for 99.9999999% of the uses for a computer there are already a hundred processes running at all times, and most of these could get broken up many, many times further.

      --
      Troll is not a replacement for I disagree.
    8. Re:Torvalds is half right by Half-pint+HAL · · Score: 1

      The issue is not whether parallelism is uniformly better for all tasks. The question is, is parallelism better for some tasks. And as Torvalds points out, those tasks do exist (Graphics being an obvious one).

      I think, as one of the quoted comments in the article said, that current programming languages have a lot to answer for in this debate. If we look at mathematics, the sum (i.e. sigma) and product operators are inherently parallel. Sadly our FOR and WHILE loops are not, as procedural iteration often relies on side-effects. It's no accident that Scala uses the mathematical basis of the functional programming paradigm as a foundation for massively scalable, parallelisable programming (even though it's a mind-twisting wreck of a language in many ways).
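
      To make the sigma point concrete, here is a minimal sketch in C++ (the language of the other snippets in this thread). `parallel_sum` and its `parts` parameter are names made up for illustration, not from any library. Because integer addition is associative and no chunk touches another chunk's data, each partial sum can run as its own task with no locking:

      ```cpp
      #include <algorithm>
      #include <cstddef>
      #include <future>
      #include <numeric>
      #include <vector>

      // Sum a vector by splitting it into up to `parts` independent chunks.
      // Each chunk is summed by its own task and no task writes shared state,
      // so the only coordination is collecting the partial results at the end.
      long long parallel_sum(const std::vector<int>& values, std::size_t parts) {
          if (parts == 0) parts = 1;
          const std::size_t chunk = (values.size() + parts - 1) / parts;  // ceiling division
          std::vector<std::future<long long>> partials;
          for (std::size_t begin = 0; begin < values.size(); begin += chunk) {
              const std::size_t end = std::min(begin + chunk, values.size());
              partials.push_back(std::async(std::launch::async, [&values, begin, end] {
                  return std::accumulate(values.begin() + begin, values.begin() + end, 0LL);
              }));
          }
          long long total = 0;
          for (auto& p : partials) total += p.get();  // combine the partial sums
          return total;
      }
      ```

      A plain FOR loop that adds into one running total computes the same number, but it spells out an order the maths never asked for; the chunked version only relies on the sum being associative.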

      Every non-trivial program will have to crunch through large iterative processes at some point, and even if these are a small percentage of execution time, the reality is that most interactive systems have a lot of idle time anyway, and the delay for the user is only when the program gets stuck in a long iteration. So it follows that even if parallelism reduces the overall performance, it is of no consequence as the perceived performance by the user is improved.

      But I have to take issue of Linus tone in which he downplays "graphics" as being a rather unimportant subset of computing tasks. It's not "graphics". It's "GRAPHICS". That's not a small outlier of a task.

      His point isn't that graphics is a small thing, it's that the GPU already handles that, and that we therefore don't need much parallelism in the CPU. But when I think about games, I think about AI, and the AI has to operate in 3D space, so the AI obviously benefits from the same parallelism as the graphics. Do we steal GPU cycles to run our AI? No, because the way of the market is typically that games sell first and foremost on their looks. So we need more parallel grunt. Plus, of course, as AI has to handle multiple independent agents, AI is an inherently parallel task (multithreaded in concept, regardless of whether a particular game implements it in threads or not)

      But going back to Scala and FP... A lot of the problems of memory locking are nicely sidestepped if you implement your code in FP: FP guarantees immutability of values: you cannot write to an existing value, so you don't need to have notions of "atomic" operations, and hence no need to lock in most circumstances. Caching becomes less of a pain, as nothing is ever going to change, so your cache value cannot be incorrect.
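
      A rough illustration of the no-locks-needed claim, sketched in C++ rather than a functional language, so the guarantee is weaker: `const` only blocks writes through this particular pointer, whereas FP immutability holds everywhere. `Stats` and `summarize` are hypothetical names, and the sketch assumes a non-empty input:

      ```cpp
      #include <algorithm>
      #include <future>
      #include <memory>
      #include <numeric>
      #include <vector>

      struct Stats { int min; int max; long long sum; };

      // The input is shared as pointer-to-const, so none of the tasks can modify it
      // and the concurrent reads need no lock. Each task returns a new value
      // instead of updating shared state. Assumes the vector is non-empty.
      Stats summarize(std::shared_ptr<const std::vector<int>> data) {
          auto lo = std::async(std::launch::async, [data] {
              return *std::min_element(data->begin(), data->end());
          });
          auto hi = std::async(std::launch::async, [data] {
              return *std::max_element(data->begin(), data->end());
          });
          auto sum = std::async(std::launch::async, [data] {
              return std::accumulate(data->begin(), data->end(), 0LL);
          });
          return Stats{lo.get(), hi.get(), sum.get()};
      }
      ```

      Three readers run over the same data at once and nobody needs a mutex, because there is nothing anyone could write to.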

      Some of the comments in the article referred to theoretical extra bugs due to having to think in parallel, but it simply gives more motivation for a programming paradigm that is less bug-prone, and FP is that paradigm. FP has been rejected by programmers far too long, but the simple mechanism of immutability removes that most bothersome of bugs -- the erroneously altered value that you spend a week tracking down. FP should already be easier to reason about than procedural programming, if we learned to do it properly. FP for parallel isn't all that much different from single-threaded FP, so would actually make basic parallel code no more complicated to learn than algebra.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    9. Re:Torvalds is half right by Half-pint+HAL · · Score: 2

      In essence, it's already done that way. A System-on-a-chip (SoC) typically has a couple of general-purpose cores, along with sound and video processors. In a full-sized PC, the graphics processing is usually taken to another chip -- in fact another circuit board entirely. Because most of the work the graphics processor (=GPU) does is largely independent of the main processor (=CPU) (the CPU pushes in the data, says "do X with it", the GPU then churns away through the data) it doesn't need to be closely linked or share a lot of memory. In fact, it's more efficient for them not to share memory, as then they're not getting in each other's way.

      Expanding that system for more types of semi-general-purpose cores would get rather complicated.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    10. Re: Torvalds is half right by Half-pint+HAL · · Score: 4, Informative

      Verification is the process of checking that software works correctly. The more complex the system, the more complex the process of verification. Rather unfair of the GP to throw that in as a single word after you explicitly said that you're not a computer scientist.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    11. Re:Torvalds is half right by Half-pint+HAL · · Score: 1

      Yes, but it can't really do much non-graphics parallel processing at the same time as rendering a game. As I understand it, a lot of the problems with AI in modern AAA titles are down to the fact that they need the parallelism, but in AAA-land, graphics are king, and the AI guys don't get enough cycles to do a decent job.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    12. Re:Torvalds is half right by Euler · · Score: 1

      Even in case #1, there are sometimes things that can be done. For example, speculative execution. If you can boil down to a small number of choices as a result of the first operation, then it may make sense to compute both outcomes. Or there may be some other intermediate value that might be needed in only some outcomes. But this usually requires application-specific knowledge to know exactly what is allowable and what the payoff would be. You wouldn't want to create a situation where executing both cases affects a global resource. So you would need a language expressive enough to hint this information to the compiler.
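
      A minimal sketch of that idea in C++, done by hand rather than by a compiler hint. All three helper functions are hypothetical stand-ins for real work, and they are deliberately pure, so running the branch that ends up unused touches no global resource:

      ```cpp
      #include <future>

      // Hypothetical stand-ins for an application's real work.
      int slow_condition(int x) { return x % 2; }      // step 1: decides which branch is needed
      int branch_if_odd(int x)  { return 3 * x + 1; }  // step 2, candidate A
      int branch_if_even(int x) { return x / 2; }      // step 2, candidate B

      int speculative_step(int x) {
          // Start both candidate follow-ups before the outcome of step 1 is known.
          auto odd  = std::async(std::launch::async, branch_if_odd, x);
          auto even = std::async(std::launch::async, branch_if_even, x);
          const bool is_odd = slow_condition(x) != 0;
          // Wait for both, then keep only the result that step 1 selected.
          const int a = odd.get();
          const int b = even.get();
          return is_odd ? a : b;
      }
      ```

      The cost is that half the work is always thrown away, so this only pays off when there are spare cores and the first step is slow enough to hide the extra computation.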

    13. Re:Torvalds is half right by RabidReindeer · · Score: 1

      Even in case #1, there are sometimes things that can be done. For example, speculative execution. If you can boil down to a small number of choices as a result of the first operation, then it may make sense to compute both outcomes. Or there may be some other intermediate value that might be needed in only some outcomes. But this usually requires application-specific knowledge to know exactly what is allowable and what the payoff would be. You wouldn't want to create a situation where executing both cases affects a global resource. So you would need a language expressive enough to hint this information to the compiler.

      Very good point. A high-level equivalent to the predictive processing at the CPU hardware level!

    14. Re:Torvalds is half right by Lunix+Nutcase · · Score: 2

      Why not design multi-purpose chips that have some cores optimized for some tasks, and other cores optimized for others

      We do have those. Any CPU with an iGPU is such a chip. We've had such CPUs for years and years now. Have you missed out on the last decade of CPU design?

    15. Re:Torvalds is half right by Khyber · · Score: 1

      "The nature of the workload required for most workstations is non-uniform processing of large quantities of discrete, irregular tasks. For this, parallelism (as Torvalds correctly notes) is likely not the most efficient approach."

      Please, he can't even PP his way out of a DX-OGL call/wrap. He's got zero standing ground to talk about parallelism when there are people taking Linux, making it run highly parallel, and it works like a goddamned dream. Being able to do all of those irregular discrete tasks without having to wait for something else to finish first is the goal.

      People have worked on pseudo-parallel code for OoO and whatnot. Minimum 200% increase in performance.

      Meanwhile, Linus still refuses to fix a bug in the kernel, which exists all the way back to before kernel version 2.x ever hit the scene, which allows anyone to hardlock the kernel (and could've been mitigated or entirely prevented by having some fucking parallel-capable code.)

      Linus needs to go crawl in his hole and shut up. People more competent than him have taken over his project, and he's just bitching about it in a nondescript way.

      --
      Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    16. Re:Torvalds is half right by Khyber · · Score: 1

      " Any CPU with an iGPU is such a chip."

      Except the iGPU wasn't put on the same physical package as the CPU for a LONG time, and was only recently happening with AMD and Intel. Used to be the iGPU was on the northbridge.

      --
      Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    17. Re:Torvalds is half right by Baloroth · · Score: 1

      The nature of the workload required for most workstations is non-uniform processing of large quantities of discrete, irregular tasks. For this, parallelism (as Torvalds correctly notes) is likely not the most efficient approach. To pretend that in some magical future, our processing needs can be homogenized into tasks for which parallel computing is superior is to make a faith-based prediction on how our use of computers will evolve. I would say that the evidence is quite the opposite: That tasks will become more discrete and unique.

      Right, but we want to continue the "Moore's Law" speedup of processing year over year. And that simply can't happen with single core processing: clock speed is already near the physical limit (as in we would need to start violating the speed of light to increase it much further), and manufacturing process size can't continue shrinking indefinitely either, no matter how close we are to the actual physical limits there. So unless we invent entirely new computing systems (e.g. quantum computers), the only speed gains in the future will inevitably be from parallelization, and there are (for many cases) still massive speed gains to be made in that field, simply because the software was never designed for any parallelization at all. Granted, that'll hit a wall where you can't split tasks up anymore as well, but in many cases this process hasn't even started.

      You're quite right about the graphics, though: the long-term future of graphics technology is probably ray-tracing, and that takes absolutely massive amounts of completely parallel CPU power.

      --
      "None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
    18. Re:Torvalds is half right by paulpach · · Score: 1

      But I have to take issue of Linus tone in which he downplays "graphics" as being a rather unimportant subset of computing tasks. It's not "graphics". It's "GRAPHICS". That's not a small outlier of a task. Wait until we're all wearing ninth generation Oculus headsets... the trajectory of parallel processing requirements for graphics is already becoming clear -- and it's stratospheric. The issue is this: Our desktop processing requirements are actually slowing and as Linus points out, are probably ill-suited for increased parallelism. But our graphics requirements may be nearly infinite.

      I agree, he dismisses graphics as something a few people do. WTF? In mobile, all the top grossing apps are games. The number 1 thing games do is graphics. If anything, I would argue that graphics is the single most important type of workload in mobile. Gaming (and therefore graphics) might not be quite as big in desktop, but it is still very far from being the niche he pretends it is.

      That said, I think he is right that the fast single threaded big CPUs are not going anywhere. The trend for mobile and desktop has been to do graphics and general processing in separate hardware (GPU + CPU), and I don't see a reason in sight to change that. Even if they were to be combined in a single chip, it would still be different parts of the chip doing the tasks.

    19. Re:Torvalds is half right by RyuuzakiTetsuya · · Score: 1

      Torvalds is half wrong too.

      The problem with Torvalds' assertion is that while he's probably right (and I read a Steve Jobs kinda ethos when he says that end users are fine with 4 cores, which I like a lot), I think there's a pretty practical problem computing is running into.

      We're kind of at the limit with how much work a single core can do. We can't make single cores faster. We can make them cheaper and lower power. Which I think overall is a huge gain for anyone who pays power bills :) But now that we can have a lot of them, we shouldn't be shy about actually using them.

      I think it might be ultimately a fruitless effort, but I think the possible gains make it worth fleshing out.

      I just wonder how much there is to gain by ditching legacy CPU architectures(x86, ARM, etc) and starting from the ground up. Probably not much, but I am hopeful I am wrong.

      --
      Non impediti ratione cogitationus.
    20. Re:Torvalds is half right by lgw · · Score: 1

      How many systems with "totally linear tasks" have only 1 user?

      Smart phones don't need 1000 cores - heck, they only need 2 so you can have a low-power and a high-power core. But servers? 1 core per user seems like a good start - but across how many boxes? The trick is getting that workload to easily scale horizontally, across any number of servers that can only talk to one another slowly, and that's not an easy trick! I expect a lot of work in that area in 2015.

      --
      Socialism: a lie told by totalitarians and believed by fools.
    21. Re:Torvalds is half right by rgbatduke · · Score: 1

      Note well that historically, MOST parallel computers have profited the MOST from parallelizing totally linear tasks. Not the tasks themselves -- embarrassingly parallel tasks, simply running many instances of completely independent code or many instances of code that is extremely coarse grained so that one can run almost all of the task as linear code with only infrequent communications with a "central" controller. Classic examples are plain old multitasking of the operating system with code that doesn't make heavy use of bottlenecked resources (the reason most users see some small benefit from e.g. quad core vs single core processors, as there is often enough work being done to keep 3-4 cores busy at least some of the time without much blocking, and this keeps the processor itself from thrashing by providing the illusion of parallelism through multitasking with time slices). It works best if the cores have independent caches and contexts and if there is sufficient task affinity. Also, classic "master-slave" parallel computing, where e.g. a Monte Carlo computation might spawn N slaves, each one with its own random number generator seed, and run N "independent" samplings of some process that are only infrequently aggregated back to the master. Again, the characteristic is lots of nearly independent serial computation with only short, infrequent, non-blocking, non-synchronous communications back to some collection point. Two programs that often were used to demonstrate the awesome advantages of scaling at the limits of Amdahl's law were parallel povray (rendering can be broken up into nearly independent subtasks in master-slave) and a parallel Mandelbrot set generator/displayer (where each point has to be tested independently, so whole subsets of the relevant parts of the complex plane could be distributed to different processors and independently computed, with the master collecting and displaying the results).
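
      For anyone who hasn't seen the pattern, a minimal sketch of the Monte Carlo case in C++, estimating pi by sampling points in the unit square. `count_hits` and `estimate_pi` are hypothetical names, and the seeding scheme (1000 plus the worker index) is only an assumption for illustration; consecutive seeds are not a guarantee of statistically independent streams, which is why "independent" gets scare quotes above:

      ```cpp
      #include <cstdint>
      #include <future>
      #include <random>
      #include <vector>

      // One worker: a purely serial sampling run with its own seeded generator.
      std::uint64_t count_hits(std::uint64_t samples, std::uint32_t seed) {
          std::mt19937 rng(seed);
          std::uniform_real_distribution<double> unit(0.0, 1.0);
          std::uint64_t hits = 0;
          for (std::uint64_t i = 0; i < samples; ++i) {
              const double x = unit(rng);
              const double y = unit(rng);
              if (x * x + y * y <= 1.0) ++hits;  // point fell inside the quarter circle
          }
          return hits;
      }

      // The controller: spawn the workers, then aggregate exactly once at the end.
      double estimate_pi(unsigned workers, std::uint64_t samples_per_worker) {
          if (workers == 0 || samples_per_worker == 0) return 0.0;  // nothing to sample
          std::vector<std::future<std::uint64_t>> runs;
          for (unsigned w = 0; w < workers; ++w)
              runs.push_back(std::async(std::launch::async, count_hits,
                                        samples_per_worker, 1000u + w));
          std::uint64_t hits = 0;
          for (auto& r : runs) hits += r.get();
          return 4.0 * static_cast<double>(hits) /
                 (static_cast<double>(workers) * static_cast<double>(samples_per_worker));
      }
      ```

      The workers never talk to each other and report back a single integer each, which is about as little communication as a parallel job can have.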

      Sadly (well, not really:-) modern processors are so damn fast you can get to the accessible bottom of the Mandelbrot set with almost no perceptible delay from rubber banding even with a single core, so the latter isn't so dramatic, but the point remains -- quite a lot of work that can be done with multiple cores (arguably MOST of the work that can efficiently and easily be done with multiple cores) is trivial parallelism, not parallel programming. Instance 1 is the richest source of advantage for a parallel system, and tasks that will scale out to 1000 cores are almost certainly ONLY going to be trivially/embarrassingly parallel tasks because Amdahl's law and the complexity of unblocking communications between subtasks are a royal bitch at 1000 processors no matter how you architect things. SETI at home, maybe. Solving a system of partial differential equations on a volume with long range interactions not so much.
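
      For reference, Amdahl's law in its usual form, where p is the fraction of the work that can be parallelized and N is the number of cores. The 95% figure is only an illustrative assumption, not a measurement of any real workload:

      ```latex
      S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad
      S(1000)\Big|_{p = 0.95} = \frac{1}{0.05 + 0.00095} \approx 19.6, \qquad
      \lim_{N \to \infty} S(N) = \frac{1}{1 - p} = 20
      ```

      So a job that is 95% parallel gets under 20x from 1000 cores, and can never get more than 20x from any number of them.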

      The fundamental problem with 2 and 3 is that they have to be hand coded. Really pretty much period. Sure, you can get away with getting some advantage from using e.g. a parallel linear algebra program as a link step in a program that can run on serial resources, but typically the gains you can get are limited and will not scale well, certainly not to anywhere near 1000 cores, even for case 2. To use 1000 cores for a tightly coupled parallel computation where every core talks to every other core per step of the computation -- well, that just isn't going to happen without an incredible (literally) boost in interprocessor communication speed, reduction in communication latency, elimination of resource blocking at both the hardware and kernel level. The problem at some point becomes NP complete (I suspect, of course pending the issue of whether P = NP etc) and simply working out ways for the communications to proceed in a self-avoiding pattern to eliminate collisions or delays due to asynchronicity is itself a "hard problem", forget the problem you're actually trying to solve.

      So I'm largely with Linus on this one. Advantages to parallelism at the OPERATING SYSTEM level

      --
      Even when the experts all agree, they may well be mistaken. --- Bertrand Russell.
    22. Re:Torvalds is half right by Bengie · · Score: 1

      Because most of the work the graphics processor (=GPU) does is largely independent of the main processor (=CPU) (the CPU pushes in the data, says "do X with it", the GPU then churns away through the data) it doesn't need to be closely linked or share a lot of memory.

      There are few GPU+CPU workloads because communication between the GPU and CPU is so slow, not because there is no demand for them. There is a huge class of hybrid workloads that require data ping-ponging back and forth between the GPU and CPU, but no one writes code for that because it's latency sensitive to nanoseconds and GPU to CPU latency is in microseconds. AMD is working on reducing the latency by integrating the GPU and CPU together.

    23. Re:Torvalds is half right by dslbrian · · Score: 1

      The issue is this: Our desktop processing requirements are actually slowing and as Linus points out, are probably ill-suited for increased parallelism.

      Depends on the desktop requirements. I think he is off the mark here. Specifically to quote him from TFA:

      The only place where parallelism matters is in graphics or on the server side, where we already largely have it. Pushing it anywhere else is just pointless.

      Since he points out servers and graphics as largely solved, I assume he is talking about desktop usage. In this he is assuming a standard usage model for a desktop user, a set of apps - web, devel, coding, games, whatever. I think the view is that of a user who can only focus on a single task at a time (with perhaps background OS tasks). But this is a myopic view; the rise of virtualization has enabled a convergence of hardware onto a single machine. This is only possible with the rise of multi-core/parallel computing. VMs are a huge benefit, in terms of power/area efficiency and even being able to create and destroy them on a whim.

      On my desktop machine (8 core) I have two VMs running all the time. These machines used to be physically separate machines, consuming power, taking up floor space, making noise, etc. I could not have run this setup on my previous single/dual core machines. However now they are virtual, and my normal desktop usage doesn't even notice them running (even heavy 3D gaming is not lagged by these VMs).

      There are compounding parallelization factors - having the whole setup on encryption means wanting the cores to handle AES in hardware, so as he points out having hordes of parallel weak cores might be pointless for that. However, multiple powerful cores, I can put those to work.

      IMO the advantages are clearly obvious. Sure for a single-task desktop user, you may only want a few cores for background tasks plus the foreground task. But the ability to consolidate lots of hardware into a single box, I want as much of that as I can get. I can easily think of desktop + VM scenarios that can push beyond 4 cores.

    24. Re:Torvalds is half right by skids · · Score: 1

      FP has been rejected by programmers far too long, but the simple mechanism of immutability removes that most bothersome of bugs

      ...and kills your memory/cache profile. FP is great for a subset of problems, but should not be held up on a pedestal, just appreciated as one tool in the box.

      FP should already be easier to reason about than procedural programming

      Considering it makes many everyday things harder to express, the fact that FP lends itself to easy modeling is offloading the mental effort in the wrong place. You're buying academic ease of manipulation at the expense of increasing the drudgery of everyday tasks, which is why FP is favored for research but not generally accepted for application.

    25. Re:Torvalds is half right by skids · · Score: 1

      Also the future typical user will be using more speech recognition, computer vision, and "AI" experts, all of which scale with parallelism.

    26. Re:Torvalds is half right by Half-pint+HAL · · Score: 1

      My personal view is that the problem comes from looking at a program with a single paradigm. I'd love to see FP as a subsystem, where control code is on the outside in procedural style and all the heavy work is buried in strict functions. At the moment, trying to code in that style is overcomplicated, and in the end usually results in coding in a language that doesn't guarantee immutability and you end up having to hunt down phantom mutation bugs, which kind of defeats the purpose of the exercise.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
  11. Re:Sounds like programmers from 40+ years ago by Paradise+Pete · · Score: 1

    "Nobody will ever need more than 2 digits for a year, so the crazies suggesting years be represented by 4 digits are just that - crazy."

    Even the people who knew it would be an issue still used two digits. Resources were extremely constrained. It wasn't worth spending all of that for a problem that would happen decades later. I used to write complete programs that fit in 8K.

  12. Re:Sounds like programmers from 40+ years ago by jandersen · · Score: 1

    Linus sounds like a programmer from 40 years ago

    Not necessarily a bad thing to sound like, IMO; 40 years ago you had to think and actually be insightful about what you were undertaking, because the tools and resources were so limited. And, as somebody else has already mentioned, Linus isn't against graphics and multi-core, he is against the stupid fad that blindly demands more cores at the expense of producing better cores (as well as the idiocy of wrapping everything in a graphical front-end, when that actually ends up getting in the way of doing the job).

    I think what he says makes a lot of sense - when do you actually benefit from having many cores? Only when you have many, independent tasks; there are large classes of tasks that are serial in nature, which would not benefit from having several cores to run on. And most of the independent processes on the average PC are so lightweight that nothing is gained from having several cores compared to multiprocessing on a single core. Unless you are running a proper server in a data centre or performing large computations, you are likely to just waste your money, if you buy into the multi-core fad.

  13. No locks by ShakaUVM · · Score: 2

    Ungar's idea (http://highscalability.com/blog/2012/3/6/ask-for-forgiveness-programming-or-how-well-program-1000-cor.html) is a good one, but it's also not new. My Master's is in CS/high performance computing, and I wrote about it back around the turn of the millennium. It's often much better to have asymptotically or probabilistically correct code rather than perfectly correct code when perfectly correct code requires barriers or other synchronizing mechanisms, which are the bane of all things parallel.

    In a lot of solvers that iterate over a massive array, only small changes are made at one time. So what if you execute out of turn and update your temperature field before a -.001C change comes in from a neighboring node? You're going to be close anyway. The next few iterations will smooth out those errors, and you'll be able to get far more work done in a far more scalable fashion than if you maintain rigor where it is not exactly needed.
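
    A toy version of that, as a sketch only: a 1-D temperature field with fixed ends, relaxed by several threads that each own a slice and read their neighbours' cells without locks or barriers, so a value from next door may be slightly stale. The function name `relax` and all its parameters are made up for illustration. The cells are `std::atomic<double>` with relaxed ordering purely to keep the unsynchronized reads well-defined in C++, and the join between rounds is a coarse sync point I added so the sketch is guaranteed to make progress even if the scheduler runs the threads one after another:

    ```cpp
    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Repeatedly replace every interior cell by the average of its two neighbours.
    // Workers own disjoint slices; within a round nothing waits for anything.
    std::vector<double> relax(std::size_t cells, double left, double right,
                              unsigned workers, int rounds, int sweeps_per_round) {
        const auto relaxed = std::memory_order_relaxed;
        std::vector<std::atomic<double>> t(cells + 2);  // interior cells plus two fixed ends
        for (auto& c : t) c.store(0.0, relaxed);
        t.front().store(left, relaxed);
        t.back().store(right, relaxed);

        auto sweep_slice = [&t, sweeps_per_round, relaxed](std::size_t begin, std::size_t end) {
            for (int s = 0; s < sweeps_per_round; ++s)
                for (std::size_t i = begin; i < end; ++i)
                    // Neighbour values may be out of date; later sweeps smooth that out.
                    t[i].store(0.5 * (t[i - 1].load(relaxed) + t[i + 1].load(relaxed)), relaxed);
        };

        if (workers == 0) workers = 1;
        const std::size_t chunk = (cells + workers - 1) / workers;  // ceiling division
        for (int r = 0; r < rounds; ++r) {
            std::vector<std::thread> pool;
            for (std::size_t begin = 1; begin <= cells; begin += chunk)
                pool.emplace_back(sweep_slice, begin, std::min(begin + chunk, cells + 1));
            for (auto& th : pool) th.join();  // the only synchronization point
        }

        std::vector<double> out;
        for (auto& c : t) out.push_back(c.load(relaxed));
        return out;
    }
    ```

    For this well-behaved problem the field settles on the straight line between the two end temperatures no matter how the threads interleave; the interleaving only changes how many sweeps it takes to get there.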

    1. Re:No locks by Anonymous Coward · · Score: 1

      The probabilistic synchronisation sounds similar to how we got over the Shannon limit with modems.

      Modems used to compress data before sending it over a line with a bit rate that would be below the Shannon Limit. Modern modems use error correction and send over the Shannon Limit, it is as if the universe has compressed the data more efficiently than a computer could.

    2. Re:No locks by HiThere · · Score: 1

      I prefer message passing through queues, but it clearly depends a lot on what problems you're working with.

      Also, some problems can't be done in parallel, but we won't know how many can until we start trying....and then try for a few decades.

      --

      I think we've pushed this "anyone can grow up to be president" thing too far.
    3. Re:No locks by ShakaUVM · · Score: 1

      >Also, some problems can't be done in parallel, but we won't know how many can until we start trying....and then try for a few decades.

      Right, but there's also a grey area between completely serializable and embarrassingly parallel, in which methods like this will allow scaling algorithms up from "a few" computation nodes to "many", with the optimal numbers depending on the specific algorithms.

      The biggest problems are still the same ones that existed when I got my Master's over a decade ago. Language support for parallelism isn't very good (I personally used MPI, which was awkwardly bolted on top of C++), it requires a certain amount of specialized knowledge to write parallel code that doesn't break or deadlock your machine (and writing optimized code is a bit more advanced than that), and library calls aren't all threadsafe. On the plus side, a lot of frameworks and libraries are now multithreaded by default, which nicely isolates the problems of parallel computing away from people who haven't been trained in it, and gives the benefits of parallel computing with only the downside of having to use a framework. =)

    4. Re:No locks by LateArthurDent · · Score: 1

      So what if you execute out of turn and update your temperature field before a -.001C change comes in from a neighboring node? You're going to be close anyway? The next few iterations will smooth out those errors

      Unless you're dealing with a stiff system and that small error just caused your iterations to start diverging.

      I mean, not to dismiss the approach, because I agree with you there are certainly lots of situations where it'll be fine. However, it's also one of those things that aren't going to replace current paradigms either. We're not going to go all lock free. We're going to add lock-free programming to the toolset.

  14. Re:Programs people want to use... by Rei · · Score: 3, Insightful

    Indeed. There's tons of CPU-intensive tasks that need to be done in a modern computer game, but they're typically done as:

    while (true)
    {
        do_task_1();
        do_task_2();
        ( ... )
        do_task_N();
    }

    Rather than...

    std::thread([&](){ while (true) do_task_1(); }).detach();
    std::thread([&](){ while (true) do_task_2(); }).detach();
    ( ... )
    std::thread([&](){ while (true) do_task_N(); }).detach();

    ... or similar. Because in C and older versions of C++ launching a thread takes significant typing and ugly code, up to and including - in the case of the same function threaded a variable number of times in a loop with more than a trivial argument - having to have a memory-managed threadsafe container to hold your arguments (and in C you don't have STL containers, you have to do all that work yourself too). It's not the end of the world to have to code threads in C or earlier C++, but it's enough work that programmers usually don't do it any more than they're pretty much forced to. "Okay, my game will literally run at half the speed if I don't thread this function" - fine, they'll thread it. But "this function call eats up 3% of my performance, this one 6%, this one 4%, this one 2,5%, this one 3,5%...."? Usually such functions just get stuck into one big main loop.

    I really hope with how easy it's gotten in C++11 that more people will make better use of threads. In the first example code, not only do you relegate all of your tasks to the same core, thus hitting performance, but if any one task hangs, all of them hang. It's a terrible approach, but it's the most common. The only case where threads aren't good is where you're doing heavy concurrent read/writes to the same cached data, but in real world apps there's almost always a level where you can launch the thread where this isn't the case, if it's even an issue to begin with in your particular application. The presumption that concurrent access to cached memory will usually or always be a problem (which seems to be Linus's presumption) requires that A) your threads aren't doing the majority of their work on thread-local memory, AND B) the shared data area being read from / written to concurrently is small enough to be cached, AND C) you can't just migrate your threads up in scope N levels to work around any such issue.
     

    --
    If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
  15. Mmm... Cores... by Greyfox · · Score: 1

    I'll see your cores and raise you your boss strangling all your cores by forcing you to get all the data you were planning to process from NFS shares on 100 megabit LAN connections. Because your developers and IT department, with all the competence of a 14-year-old who just got his hands on a copy of Ruby on Rails, can't figure out how to utilize disk that every fucking machine in the company doesn't have read access to.

    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

  16. Re: make -j lotsandlots by Z00L00K · · Score: 1

    Only if you have a single I/O device and channel.

    NUMA architectures can also apply to disks and other I/O devices.

    Of course - it comes with a new set of problems, but there's no golden solution.

    --
    If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
  17. 'make -j64 bzImage' by hazeii · · Score: 1

    How does Linus compile his kernel? Certainly I use a parallel make across as many cores as possible (well, up to the point where there's a core for every compilation unit).

    --
    All your ghosts are just false positives.
    1. Re:'make -j64 bzImage' by gweihir · · Score: 1

      Wrong question. C compilation has linear speedup as each file can be compiled without knowing the others. The question is how he links his kernel, and the answer is on a single core as there is no other sane way to do it. Fortunately, this problem is almost linear in the input size (assuming good hash-tables), or we would not be having any software the size of the kernel.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    2. Re:'make -j64 bzImage' by itzly · · Score: 1

      But would you rather do a parallel make on 4 cores with big caches, or 64 cores, each with 1/16th of the cache ?

    3. Re:'make -j64 bzImage' by itzly · · Score: 1

      C compilation has linear speedup as each file can be compiled without knowing the others

      As long as I/O bandwidth is infinite.

    4. Re:'make -j64 bzImage' by Gaygirlie · · Score: 1

      The compiler doesn't actually do parallel processing when you're compiling the kernel, it does multi-processing and that's the crux here; when you're compiling the kernel each process that spawns works on its own set of files -- multi-processing, that is -- whereas if it was doing parallel-processing they'd be working on the same files simultaneously. They are two very different concepts and you're confusing them.

    5. Re:'make -j64 bzImage' by gweihir · · Score: 1

      Or you do it on separate machines. But yes. Ideally, it has linear speed-up, if I/O is not a bottleneck. In practice, things are not as nice, although with 4...8 cores and an SSD to feed them you do not notice much.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    6. Re:'make -j64 bzImage' by Rei · · Score: 1

      No, as long as I/O bandwidth is not the limiting factor. The sort of thing you're compiling can have radically different CPU vs. I/O requirements. Some simple but verbose C code with little optimization might be almost entirely IO limited while some heavy templated C++ and full optimization might be almost entirely CPU limited.

      The thing is, there's no way to know what is going to cause a particular person to think "I wish my computer was performing faster". It all depends on the individual and what they use. But one can name various cases that might likely cause a certain subset of users headaches, and so those become use cases for improving system performance. One such use case is clearly compilation that's not IO-limited.

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    7. Re:'make -j64 bzImage' by laird · · Score: 1

      They're all parallel processing, just on different units of storage. Processing 1,000 files in parallel is parallel. Processing 1 file using 1,000 parallel processes is parallel.

  18. weird by drolli · · Score: 1

    The central claim of Linus seem to be that there are many people out there who claim an efficiency increase by parallelism. While I agree that many people claim (IMHO correctly) an increase in performance (reduction of execution time) within the constraints given by a specific technology level by doing symmetric multiprocessing, I have not heard many people claim that efficiency (in terms of power, chip area, component count) is improved by symmetric, general parallelization, and nobody with a good understanding of the information-related aspects of computation.

    I am now speaking as a physicist, I find it disturbingly easy to show the opposite for many cases in the limit of ideal performing systems (that is, resource per implemented gate operation remaining constant with the number of gate operations).

    Having said that, I speculate that there are reasons to introduce parallelism:

    a) The performance you require cannot be achieved without it. An example would be an FPU, or even just an 8-bit full adder. You *can* implement it bit-wise, but you don't like to. The full adder also is an excellent example of how parallelism can increase power consumption (i.e. fast carry look-ahead) and resource usage.

    b) Your implementation simulates operations in a way that requires a significant effort for fetching and decoding the simulated function. The extreme case of an extreme RISC processor with only one-bit operations and a 1-bit ALU is more inefficient for many problems than the processors we use. This means that there probably is an ideal "processing power/RAM (cache)" combination, which is a function of your communication cost (i.e. bus drivers) and your algorithm.

    c) From b) we can actually see that it can be extremely reasonable to create non-symmetric multiprocessing units. For listening for a sensor signal to change, an 8-bit 1MHz microcontroller with less than 100kGates may be an excellent choice (see the ti430 line, for example), since it does not insist on keeping an overkill of ALU persistently on.

    d) Parallel programming is almost never used to increase efficiency (unless you really have a distributed input/output and inherent costs of collecting it), but only for those operations where the efficiency loss due to parallelism is negligible (or zero).

    1. Re:weird by serviscope_minor · · Score: 2

      The central claim of Linus seem to be that there are many people out there who claim an efficiency increase by parallelism.

      They do, and to an extent they are correct.

      On CPUs that have high single thread performance, there is a lot of silicon devoted to that. There's the large, power hungry, expensive out of order unit, with its large hidden register files and reorder buffers.

      There's the huge expensive multipliers which need to complete in a single cycle at the top clock speed and so on.

      If you dispense with that and replace it all with simple, in order, highly pipelined ALUs, you can fit an awful lot more raw arithmetic performance in a given area of silicon.

      So it is much, much more efficient (at certain workloads). The trouble is getting good use out of a huge wodge of simple cores. That's what GPUs do: the cores are simple and wide, but the problem of filling them is "solved" by limiting the workload to something very regular. The result is something vastly more efficient than a general purpose CPU... for those workloads.

      The flops/W of a GPU are very much in excess of a CPU. Great, if you can use them.

      Personally, I still want to have time to play with those AMD HSA chips, they put the cores of both types on the same side of the cache and MMU. Much more like a tightly coupled co-processor then.

      --
      SJW n. One who posts facts.
    2. Re:weird by drinkypoo · · Score: 1

      What's so wrong with having our cake and eating it too? Why can't we have future system architectures like:

      In theory, we could have system architectures like that right now with nothing but OS support and a dedicated VRM for each processor socket, which is not exactly abnormal anyway. (Hell, we used to put them on modules, right next to the socket. I thought that was a great place for them, but maybe the connector caused problems. Certainly it kept the heat off the mainboard.) There's nothing in principle that stops you from making a Hypertransport mesh of disparate processors, for example, except the general lack of processors on a Hypertransport bus outside of AMD.

      However, since each manufacturer has a different bus, we can't have nice things.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    3. Re:weird by SecurityGuy · · Score: 1

      I don't think computer efficiency is the goal in many cases, and it often shouldn't be. I don't have a fixed amount of computer time to burn in my life, I have a fixed amount of time. If I can throw more hardware at a problem and make it finish faster, that's usually good. In some cases, it's mandatory. I could run a big weather model on a single CPU desktop with a small number of cores (and a lot of memory). It'd finish eventually, but long after the weather it was supposed to predict has happened.

      You're right in what you're saying. That single desktop would be more efficient in some sense, but it would also not be at all useful for many real world problems.

    4. Re:weird by serviscope_minor · · Score: 1

      What's so wrong with having our cake and eating it too?

      Why can't we have future system architectures like:
      1x main core, made as fast as we realistically can
      8x secondary cores, each 75% as fast as the main core but using a lot less silicon each
      64x tertiary cores, each 50% as fast as the main core, but again even simpler in terms of silicon consumption

      Nothing's wrong, except you get less silicon to devote to the mega core/simple cores.

      Nonetheless, that seems like what AMD's HSA is trying to do. There's a few high speed complex cores coupled closely with a lot of very simple, very wide floating point processors. In some sense, vector instructions are a bit like that. You get a lot more raw floating point performance at the penalty of less flexibility. The OoO unit still only has to track one element for 4 operations.

      It ought to be much easier to use HSA than GPGPU due to the nanosecond latency and lack of memory transfer scheduling due to sharing the same cache. They use different instruction sets though.

      But instead of making the programmer have to think about all that, why not have a "std::for_each_threaded"

      Well, if you've been looking for that, I hope you'll like this :)

      OpenMP is a C/C++/FORTRAN extension for exactly this kind of thing. It's a language extension rather than a library. Basically, you do:

      #pragma omp parallel for
      for(size_t i=0; i < container.size(); i++)
      { ...
      }

      and it runs different iterations of the for loop in parallel. I think the default is to split it into N chunks (N=threads) for the segments [0,size/N], [size/N+1, 2*size/N], etc. That's simple and low overhead and works well if the loop finishes fast.

      You can certainly specify that it shoves the next iteration into whichever thread is free, which works particularly well if the iterations are slow and rather varied in time.

      I believe you can essentially smoothly go between either of the two by specifying a chunk size, something like:

      #pragma omp parallel for schedule(dynamic, 30)

      It's supported by GCC, ICC and VS. LLVM didn't last time I looked but it does now. Compile and link with -fopenmp on gcc.

      --
      SJW n. One who posts facts.
  19. Shi's Law, Gustafsson's Law, Amdahls Law by amplesand · · Score: 3, Insightful

    Shi's Law

    http://developers.slashdot.org...

    http://spartan.cis.temple.edu/...

    http://slashdot.org/comments.p...

    "Researchers in the parallel processing community have been using Amdahl's Law and Gustafson's Law to obtain estimated speedups as measures of parallel program potential. In 1967, Amdahl's Law was used as an argument against massively parallel processing. Since 1988 Gustafson's Law has been used to justify massively parallel processing (MPP). Interestingly, a careful analysis reveals that these two laws are in fact identical. The well publicized arguments were resulted from misunderstandings of the nature of both laws.

    This paper establishes the mathematical equivalence between Amdahl's Law and Gustafson's Law. We also focus on an often neglected prerequisite to applying the Amdahl's Law: the serial and parallel programs must compute the same total number of steps for the same input. There is a class of commonly used algorithms for which this prerequisite is hard to satisfy. For these algorithms, the law can be abused. A simple rule is provided to identify these algorithms.

    We conclude that the use of the "serial percentage" concept in parallel performance evaluation is misleading. It has caused nearly three decades of confusion in the parallel processing community. This confusion disappears when processing times are used in the formulations. Therefore, we suggest that time-based formulations would be the most appropriate for parallel performance evaluation."
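As a sketch of the equivalence the abstract claims, here are both laws written in terms of run times. The symbols are assumptions for this illustration and are not taken from the paper: $t_s$ is the serial time, $t_p$ the parallelizable time on one processor, and $N$ the processor count.

```latex
\begin{align*}
T_1 &= t_s + t_p, \qquad T_N = t_s + \frac{t_p}{N}, \qquad
S = \frac{T_1}{T_N} = \frac{t_s + t_p}{t_s + t_p/N} \\[4pt]
\text{Amdahl, with } s = \frac{t_s}{T_1}: \quad
S &= \frac{1}{s + (1 - s)/N} \\[4pt]
\text{Gustafson, with } s' = \frac{t_s}{T_N}: \quad
S &= s' + (1 - s')\,N
\end{align*}
```

Substituting the definitions of $s$ and $s'$ reduces both expressions to $T_1 / T_N$. The two "serial percentages" differ only in which run time they are measured against, which is one way to read the abstract's preference for time-based formulations.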




  20. Re:i'm so tired of political correctness by goarilla · · Score: 2

    And some of us just grew up in the sort of nuclear family where offensive expletives are the norm.

  21. Poor slashdot... by Anonymous Coward · · Score: 3, Insightful

    Few are actually people with a real engineering background anymore.

    What Linus means is:
    - Moore's law is ending (go read about mask costs and feature sizes)
    - If you can't geometrically scale transistor counts, you will be transistor count bound (Duh)
    - therefore you have to choose what to use the transistors for
    - anyone with a little experience with how machines actually perform (as one would have to admit Linus does) will know that keeping execution units running is hard.
    - since memory bandwidth has nowhere near scaled with CPU appetite for instructions and data, cache is already a bottleneck

    Therefore, do instruction and register scheduling well, have the biggest on die cache you can, and enough CPUs to deal with common threaded workflows. And this, in his opinion, is about 4 CPUs in common cases. I think we may find that his opinion is informed by looking at real data of CPU usage on common workloads, seeing as how performance benchmarks might be something he is interested in. In other words, based in some (perhaps adhoc) statistics.

    1. Re:Poor slashdot... by gweihir · · Score: 1

      Good summary, and I completely agree with Linus. The limit may go a bit higher, up to say, 8 cores, but not many more. And there is the little problem that for about 2 decades, chips have been interconnect-limited, which is a far harder limit to solve than the transistor-one, so the problem is actually worse.

      All that wishful thinking going on here is just ignorant of the technological facts. The time where your code could be arbitrarily stupid, because CPUs got faster in no time, is over. There may also be other fantasies in there, for example people that do not understand that (true/strong) AI is not a question of cycles, and that have their hopes misplaced in that direction.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    2. Re:Poor slashdot... by laird · · Score: 1

      Having worked on machines with thousands of CPUs, I disagree. The thing that Linus is missing (IMO) is that modern GPUs are no longer "graphics processors" but are actually quite powerful MPP supercomputers, and there are millions of them out there, and applications are increasingly being written to take advantage of them.

      He's right that putting many extremely expensive, power-hungry Intel CPUs in a single box isn't a good tradeoff except in very specific cases. Luckily it's actually quite cheap to add large numbers of cheap, high performance CPUs to a computer, and in fact they're likely already there, so the cost of using them is $0 for hardware, just some developer effort. So the question is simply whether developers should ignore all those CPUs and use only the main CPU, or they should learn how to use the supercomputer sitting on the graphics card.

    3. Re:Poor slashdot... by gweihir · · Score: 1

      What you run on modern GPUs is tiny programs for problems that have zero interaction between the parts, i.e. can be perfectly partitioned. That is not what Linus is talking about and not a typical workload.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  22. Re:i'm so tired of political correctness by Attila+Dimedici · · Score: 4, Insightful

    No, "political correctness" is a thing. It is where someone gets in trouble for using the word "niggardly" because it sounds like another word.

    --
    The truth is that all men having power ought to be mistrusted. James Madison
  23. From personal experience... by gnasher719 · · Score: 1

    Mostly writing code for MacOS X and iOS. All current devices have two or more cores. Writing multi-threaded code is made rather easy through GCD (Grand Central Dispatch), and anything receiving data from a server _must_ be multithreaded, because you never know how long it takes to get a response. So there is an awful lot of multi-threaded code around.

    But the fact that work is distributed to several cores is just secondary for that kind of work. It is also easy to make most work-intensive code use multiple cores. There are calls like sorting an array or searching for an item with multi-threaded variants. With GCD, you can just say "do this task on a background thread", and if you have five things to do, it uses five threads and up to five cores. It's so easy that people do it a lot without measuring how efficient it is. As long as your software is fast enough, it's fine.

    The typical result is an application that uses multiple cores to some degrees, but may have bottlenecks that require a single core. Now on an iPhone with 2 cores, that's fine. (If 30% of your time needs to run on a single core, but you have only two cores, it doesn't matter). On an iMac with 4 cores, it's quite OK. On a monster MacPro with 24 threads it might be a problem. On a hypothetical machine with 100s of cores it _is_ a problem.

    So your typical MacOS X or iOS app written by reasonably competent people will work fine in the current environment, but would need major changes to take advantage of 100s of cores.

  24. Linus is right by gweihir · · Score: 3, Insightful

    Nothing significant will change this year or in the next 10 years in parallel computing. The subject is very hard, and that may very well be a fundamental limit, not one requiring some kind of special "magic" idea. The other problem is that most programmers have severe trouble handling even classical, fully-locked, code in cases where the way to parallelize is rather clear. These "magic" new ways will turn out just as the hundreds of other "magic" ideas to finally get parallel computing to take off: As duds that either do not work at all, or that almost nobody can write code for.

    Really, stop grasping for straws. There is nothing to be gained in that direction, except for a few special problems where the problem can be partitioned exceptionally well. CPUs have reached a limit in speed, and this is a limit that will be with us for a very long time, and possibly permanently. There is nothing wrong with that, technology has countless other hard limits, some of them centuries old. Life goes on.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    1. Re:Linus is right by SpinyNorman · · Score: 1

      Yeah, parallel computing is mostly hard the way most of us are trying to do it today, but advances will be driven by need, and advised by past failures, not limited by them.

      You also argue against yourself by pointing out that CPU's have hit a speed limit - this is of course precisely why the only way to increase processing power is to use parallelism, and provides added incentive to find ways to make use of parallel hardware easier.

      The way massively parallel hardware will be used in the future should be obvious... we'll have domain specific high level libraries that will encapsulate the complexity, just as we do in any other area (and as we do for massively parallel graphics today). Massive parallelism is mostly about SIMD where the programmer basically wants to provide the data ("D") and high level instruction ("I") and have a high level library take on the donkey work of implementing it on a given platform.

      Current parallel computing approaches such as OpenCL, OpenMP, CUDA are all just tools to be used by the library writers or those (which will become increasingly few) whose needs are not met by off-the-shelf high level building blocks. No doubt the tools will get better, but for most programmers it makes no difference as they use libraries rather than write them. Compare for example to all the advances in templates and generic programming in C++11 and later... how many C++ programmers are intimately familiar and proficient in these new facilities, and how many actually need to use them as opposed to enjoying the user-friendly facilities of the STL built atop them?!

    2. Re:Linus is right by gweihir · · Score: 1

      You also argue against yourself by pointing out that CPU's have hit a speed limit - this is of course precisely why the only way to increase processing power is to use parallelism, and provides added incentive to find ways to make use of parallel hardware easier.

      No, I don't. It is pretty clear that processing power for single-thread loads or hard to parallelize ones will _not_ increase much more. Get over it. Wishing limits away does not work.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    3. Re:Linus is right by SpinyNorman · · Score: 1

      The need for massive parallelism will come (already has in the lab) from future applications generally in the area of machine learning/intelligence.

      Saying that "single threaded loads" won't benefit from parallelism is a tautology and anyways irrelevant to Linus's claim.

      FWIW I'd challenge you to come up with more than one or two applications that are compute bound and too slow on existing hardware that could NOT be rewritten to take advantage of some degree of parallelism.

  25. Re:Sounds like programmers from 40+ years ago by gweihir · · Score: 1

    Only if you have zero clue about what he is talking about. Note: It is not possible to deduce validity from the way something sounds. That requires actual insight.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  26. Re:i'm so tired of political correctness by jareth-0205 · · Score: 1

    Fuck you. You can't tell me what I can think or say.

    So, what you're saying is... his right to tell you things is trumped by your wish to not hear things? Freedom of speech does not mean what you think it means...

  27. Re: make -j lotsandlots by Shinobi · · Score: 1

    Something I wish I could have in a workstation again is a full-fledged crossbar switch like the Octane and Octane 2 had.

  28. Two points on Linus' post by Qbertino · · Score: 1

    1.) Linus' wording is pretty moderate.
    2.) He's right. Again.

    --
    We suffer more in our imagination than in reality. - Seneca
  29. Re:i'm so tired of political correctness by Oligonicella · · Score: 1

    You're being pedantic. "You can't tell me" doesn't mean a literal 'you have to not talk', it means you cannot force your will on me to make *me* not think or say things. This was pretty much exactly what the poster he was responding to meant by "you should jsut stop". He's got freedom of speech correct.

  30. Re: make -j lotsandlots by Z00L00K · · Score: 1

    And what most usage is on a computer is actually concurrency.

    Massive parallelism is a special case, and even then you suffer from concurrency.

    --
    If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
  31. The single most significant sentence.. by OneSmartFellow · · Score: 1

    ..is this:

    The obstacle we shall have to overcome, if we are to successfully program manycore systems, is our cherished assumption that we write programs that always get the exactly right answers.

    This is an interesting observation. Let's take graphs for example. We rarely need to solve every possible path and find THE shortest one, we usually only need to find one which is shorter than almost all the other ones.

    Do we always care whether every pixel is the best possible color when compressing images ? No, it usually only has to be close enough so that we can't tell the difference.

    These are classic examples of that statement that have already been implemented in both parallel and linear algorithm design. I'd like to see much more research into understanding why some problems don't require an exact answer, and some do. Maybe we need to change the way we think about what a solution is, rather than how to solve.

    1. Re:The single most significant sentence.. by Shados · · Score: 2

      I remember an issue I had a few months ago... we were doing some image processing using HTML canvas element on a web app... Then we wanted a nightly job to use the same code, so we whip out a node.js script. Once it was done, to make sure it worked the same way, we compared the result...

      They were different. Spent 2 days trying to debug it (they were using the same code for the most part, wtf?).

      At the time, I didn't know about canvas fingerprinting (http://en.wikipedia.org/wiki/Canvas_fingerprinting). Most of the time, different computers will generate equivalent, but different at the binary level, images from html canvas.

      And there's always the good old floating point operations. ie: 0.2 * 3 = 0.6000000000000001

      So it's already everywhere, just not everywhere enough that we've been forced to deal with it (those things are usually just an afterthought and end up in bugs). Soon, they won't be.

  32. If you could only.... by asylumx · · Score: 1

    ...imagine a beowulf cluster of these!

  33. Build a PC: More RAM or more CPUs or more I/O? by DutchUncle · · Score: 1

    If you were putting together a PC (any variety, any era), what would you expect to get the most bang for the buck? Obviously get the fastest current hardware, but then: double the CPU? double the RAM? double the comm (which at this point includes SATA controllers)? My experience all the way back to Z80s has normally been more RAM, the extension of which is more cache close to the CPU, which is one of the things Linus says.

    It's hard to parallelize one application, which is why we all point to a handful of well-understood examples in graphics and that's about it. It's more straightforward - and more understandable - to parallelize multiple applications, like a "server" hearkening back to the old mainframe days. For a *general-purpose* computer doing mostly one or two things at a time with background communication and I/O, more RAM/cache == less thrashing == better *all-around* performance without adding complexity.

    1. Re:Build a PC: More RAM or more CPUs or more I/O? by laird · · Score: 1

      It's not "hard to parallelize one application". It's just a matter of learning to think that way. Once you do, nearly all problems parallelize well.

      For example, consider video games. Most of them have hundreds or thousands of AIs and game objects that can run in parallel. Heck, even word processing renders thousands of characters to the screen, which can be done in parallel. Sorting, searching, indexing, all parallelize. Of course, as long as it's considered "hard" developers won't do it, except in the highest value cases (e.g. video processing, graphics) but that's a matter of tooling. In languages/compilers that are designed for parallelism, it's easy. It's just hard in C++ because as a language it makes parallelism very hard. Compare to FORTRAN 90, or C*.

    2. Re:Build a PC: More RAM or more CPUs or more I/O? by DutchUncle · · Score: 1

      Let's consider one of your examples further: "word processing renders thousands of characters to the screen, which can be done in parallel." The starting position of each character depends on the position and size of the one before it, which in turn depends on the one before *that*, including where the lines break (not to mention ragged-right vs. double-justify). And let's not forget kerning - further interaction between characters. While it would seem, then, that one could treat every individual character as an individual sprite for display calculations, for the practical application of word processing it really makes just as much sense to handle the text file serially - and it's a lot less complex. The way I read Linus' posting, he's arguing that parallelism is overused and overhyped FOR GENERAL USE, and I tend to agree - at the same time that I'm very happy that my 4-core processor seems to overlap all of its network I/O and disk I/O and processing, and I'm very happy with my nice graphics card. From what I've seen over time, the GENERAL-PURPOSE bang for the buck is caching - making more memory faster and closer to the CPU.

  34. Re:i'm so tired of political correctness by drinkypoo · · Score: 1

    And some of us just grew up in the sort of nuclear family where offensive expletives are the norm.

    You mean, low-class? I grew up with that kind of family, but I don't have any illusions about whether obscenity is the crutch of the inarticulate motherfucker.

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  35. Limitations by Anonymous Coward · · Score: 1

    Multi-core CPUs are just a side-step because we can't scale single-core CPU performance to the same levels.

    For example, if there was a choice between a single-core CPU that could do 1000 bogomips or a 4-core CPU that could do 4x250 bogomips, I know I'd rather have the single-core chip because for the vast majority of use cases the single-core chip would destroy the quad.

    This is why modern multi-core CPUs have 'turbo' mode - Intel and AMD both realised that single-core performance is still much more important for individual programs so being able to run that code on one core and boost it at the detriment of the other cores gives a significant edge.

    I still remember when multi-core CPUs first came out - They were limited by TDP so cheaper single-core CPUs would almost always beat them in benchmarks because while they were slightly behind in multi-threading performance, they were far superior on single-core performance.

    One thing I am surprised by is that no CPU manufacturer has come up with a dynamic pipeline system, where you could run a CPU as e.g. a quad core for normal usage, but when presented with highly predictable streaming data, switch to a P4-style long pipeline by e.g. feeding one core into another and running the whole thing at a higher clockspeed.

    1. Re:Limitations by leuk_he · · Score: 1

      You are not wrong, but the point is that a parallel system can scale the number of CPUs from 4 to 1000. However, the same locking mechanisms used for 4-way parallelism are not useful for 1000-way parallelism; you need different techniques then. The Linus rant is pointed at current programming techniques that scale to 4-16 cores but start to lose a lot of efficiency at more cores.

      By the way, a kind of dynamic pipeline has already existed for a long time. Think about hyperthreading: 2 threads share 1 core, and the second thread is optional, keeping the CPU busy when one thread alone could not. Also, cache might be local to one or more cores. This is also a kind of dynamic pipeline.

  36. Re:Programs people want to use... by 0123456 · · Score: 1

    1. Until recently, most PCs had only a dual core CPU.
    2. You're assuming those tasks can trivially be done in parallel. In reality, most can't. You can't render the graphics until the physics are calculated, for example. Yes, you can be calculating physics for the next frame while you're rendering the current one, but then you have to maintain two copies of all the relevant data (current and new), or use a more complex data format which can support multiple threads updating it at the same time. That's a lot more work than just wrapping a thread around the physics calculations.
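
    The "two copies of all the relevant data" option can be sketched as a double buffer. This is only an illustration under assumed names (`WorldState`, `FrameBuffers`); the physics step reads the current state and writes the next one, so a renderer could read `current` at the same time:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-frame simulation state.
struct WorldState {
    std::vector<float> positions;
    std::vector<float> velocities;
};

// Two copies of the state: the renderer reads `current` while
// physics writes `next`; they are swapped once per frame.
struct FrameBuffers {
    WorldState current;
    WorldState next;

    // Physics step: reads only `current`, writes only `next`.
    void step_physics(float dt) {
        next.velocities = current.velocities;
        next.positions.resize(current.positions.size());
        for (std::size_t i = 0; i < current.positions.size(); ++i) {
            next.positions[i] = current.positions[i] + current.velocities[i] * dt;
        }
    }

    // Called at the frame boundary, when neither side is mid-update.
    void flip() { std::swap(current, next); }
};
```

    The flip itself still needs a synchronization point between the two threads, and the memory cost doubles, which is the extra work being pointed out here.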

  37. Re:Programs people want to use... by AuMatar · · Score: 1

    Because in C and older versions of C++ launching a thread takes significant typing and ugly code,

    Bullshit. It takes 1 function call, because if you had a need to do all that repeatedly, you would write the damn call once, turn it into a function, and let it be done. People didn't do it because the tasks weren't parallelizable: they had massive resource contention on memory objects. Contention that would be non-trivial to solve, and would cause using threads to be a minimal gain or even a loss in efficiency.

    Libraries like std::thread don't do anything that people weren't already doing- they just prevent people from going out and writing their own implementations. But any problems that would benefit from them were already being solved with roll your own solutions.

    --
    I still have more fans than freaks. WTF is wrong with you people?
  38. SOME THINGS ARE NOT PARALLELIZABLE by Theovon · · Score: 1

    There are many common algorithms at the heart of important workloads that are not parallelizable. Consider sorting and shortest path algorithms that are important for managing data and route finding. The O(n-squared) versions can be parallelized (Bellman-Ford vs. Dijkstra's), but for any useful input size, the n-log-n version will be faster on a single core than the n-squared on a supercomputer (no hyperbole there). Even for workloads that do have a lot of parallelism, the inter-process communication often dominates. Except for benchmarks with no application to reality, there is always SOMETHING that serializes computation. Amdahl's law always bites you in the ass.

    So much for parallel computing.

    If you have many INDEPENDENT tasks, then sure, parallel computing is great. Web servers with many clients, graphics, etc. But that's for servers.

    On end-user systems, the amount of thread-level parallelism is very limited. Unless you're compiling Gentoo, you're going to top out at a handful of cores. This is not a limitation of the languages people use. It's a practical limitation of the parallelism inherent (or not) in the workloads people run, and it's a hard mathematical limitation of the optimal algorithms people use for common low-level tasks.
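
    To put rough numbers on the two claims above, here is a small sketch. `amdahl_speedup` is the standard Amdahl's law formula; `breakeven_cores` is a hypothetical helper that ignores constant factors and communication costs, so treat its output as an order-of-magnitude estimate only:

```cpp
#include <cmath>

// Amdahl's law: speedup on n cores when a fraction p of the work
// parallelizes perfectly and the remaining (1 - p) stays serial.
double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

// Cores needed for an O(n^2) algorithm, perfectly parallelized, to match
// an O(n log2 n) algorithm on one core (constant factors ignored).
double breakeven_cores(double n) {
    return (n * n) / (n * std::log2(n));
}
```

    With 95% of the work parallel, `amdahl_speedup(0.95, 1000)` stays under 20x no matter how many cores are added, and for a million items `breakeven_cores(1e6)` comes out at roughly 50,000 cores just to tie the single-core n-log-n version.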

    http://crd-legacy.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf
    http://www.davidhbailey.com/dhbpapers/inv3220-bailey.pdf
    http://www.cs.binghamton.edu/~pmadden/pubs/dispelling-ieeedt-2013.pdf

    There are some people in parallel computing who need to go back to school and learn computational complexity.

  39. Re:Programs people want to use... by jandrese · · Score: 1

    The second approach runs into trouble if your tasks aren't independent. Parallel processing works great until you have to start synchronizing state. If one process stalls and the other processes are dependent on it for some data, then the other processes are going to stall anyway. In the real world, most problems are hard to separate cleanly--data dependencies are very very common. So there is a hidden cost to parallelism--the cost of synchronization between the threads, and the cost grows very fast as you add more threads. This is basically Linus's point: outside of specialized domains it's just not possible to cleanly break up most problems into more than just a handful of threads, so having a 1,000 core beast of a processor doesn't help. You would just have 990+ cores waiting on some other core to finish its job, all of the time. Plus there's the fact that debugging multithreaded programs is inherently more difficult than single threaded ones and that all of this is moot if you are I/O bound anyway.

    --

    I read the internet for the articles.
  40. Re:i'm so tired of political correctness by Half-pint+HAL · · Score: 1

    On the other hand, if most people think your word means something different, it's not worth using that word.

    --
    Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
  41. Re:i'm so tired of political correctness by Noah+Haders · · Score: 1

    think whatever you want, man. but when you feel like being an asshole, own it and say i'm an asshole. don't say I am just saying my mind and you're too politically correct if you take offense.

  42. Re:i'm so tired of political correctness by Noah+Haders · · Score: 1

    that's exactly what's changed. there's a group of people who were used to being on top for no reason of their own doing. others are like "god this guy's an asshole and I'm fed up with it because there's no reason he's on top". so he's not on top any more but he's really butthurt about it, which is what #gamergate is all about. so to all of these assholes, I'm saying wake up because the problem is you, not everybody else in the world.

  43. Let's see how that sounds in 5-10 years time ... by SpinyNorman · · Score: 1

    It sounds rather like Bill Gates' [supposed] "640KB is enough for anyone", but there's no denying that Linus said this one!

    Saying that graphics is the only client side app that can utilize large scale parallelism is short sighted bunk, and even ignores what is going on today let alone the future. In 20 years time we'll have handheld devices that would look just as much like science fiction, if available today, as today's devices would have looked 20 years ago.

    I have no doubt whatsoever that in the next few decades we'll see human level AI in handheld devices as well as server-based apps, and you better believe that the computing demands (both processing and memory) will be massive. Even today we're starting to see impressive advances in speech and image recognition and the underlying technology is increasingly becoming (massively parallel) connectionist deep learning architectures, not your grandfather's (or Linus's) traditional approaches. Current deep-learning architectures can be optimized to use significantly less resources for recognition-only deployment vs learning, but no doubt we'll see live learning in the future too as AI advances and technology develops.

    Linus's relegation of parallelism to server side is equally if not more shortsighted than his lack of vision of client-side CPU-sucking applications! If you want systems that are always available, responsive and scalable then that calls for distributed (client side) implementation, not server based. Future devices are not only going to be smart but the smarts are going to be local. Bye-bye server based Siri.

  44. Oblig XKCD? by thebes · · Score: 1
  45. Answering Linus' "Where the hell..." question... by tlambert · · Score: 1

    Answering Linus' "Where the hell..." question:

    "Where the hell do you envision that those magical parallel algorithms would be used?"

    When you have millions of robots running around your body, repairing your telomere length and resetting the cells' Hayflick limit, and repairing other aging-related damage, so you can live another 200+ years healthy and relatively physiologically young.

    You know, unless you actually *want* to be old and decrepit, and die centuries before you actually have to...

  46. Re:Programs people want to use... by Rei · · Score: 1

    Limited data dependencies are common, it's true, but fundamental lockings between tasks are not that common in the real world. Most real world tasks aren't like matrix multiplication or whatnot. Let's say the task is a video game and your tasks are things like:

    1. Get user input
    2. Translate/rotate moving objects
    3. Backcompute armature positions
    4. Calculate mesh data from armatures
    5. Load/unload new scene data
    6. Load/unload textures
    7. Scale objects by level of detail
    8. Process AI
    9. Play sound effects
    10. Play music
    11. Autosave
    12. Read from the network.
    13. Write to the network
    14. Handle special effect animations
    15. Render

    And on and on and on, your average game has a whole laundry list of these sort of things, and each one is made of many subtasks. Some will be trivial, while others warrant threading even at the subtask level.

    Now, when you look at these, of course they're all obviously interconnected in some ways, you obviously have to use mutexes. But the connections are limited. For example, if you're backcomputing how an armature must be configured, it's obviously going to use the same data structure as the thread that deforms mesh data with armatures. But the only real practical limitation is that the thread that changes armature positions has to lock the one armature it's computing briefly while writing the results of its calculations so that the other thread never reads half-written results - that's it. Likewise, rendering (which has tons and tons of subtasks, and is famously parallel) obviously depends on all sorts of texture and model data from different threads. But again, all it needs is that there not be anything half-written, it doesn't have to wait on any particular result. Objects moving will change their needed level of detail, user actions and collisions may cause sound effects, and on and on, but again, the only requirement is that you not have half-written states.

    This is what the vast majority of CPU-intensive tasks in the real world are like. Yes, you have to use mutexes, and you have to be aware of iterator / pointer invalidation on insert / delete into data structures (where applicable), but apart from those sorts of things, they tend to thread very, very well.
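
    The "lock briefly while writing the results" pattern described above might look like this. It is a sketch under assumptions: `Armature`, `Pose` and `solve` are made-up names, and the solver is a stand-in for real inverse kinematics:

```cpp
#include <array>
#include <mutex>

// Hypothetical armature pose: a handful of joint angles.
using Pose = std::array<float, 4>;

class Armature {
public:
    // Writer: the expensive solve happens outside the lock;
    // the lock is held only for the copy that publishes the result.
    void update(float target) {
        Pose scratch = solve(target);           // no lock held
        std::lock_guard<std::mutex> guard(m_);  // brief critical section
        pose_ = scratch;
    }

    // Reader: takes a consistent snapshot, never a half-written pose.
    Pose snapshot() const {
        std::lock_guard<std::mutex> guard(m_);
        return pose_;
    }

private:
    // Stand-in for the real inverse-kinematics computation.
    static Pose solve(float target) {
        return Pose{target, target * 0.5f, target * 0.25f, 0.0f};
    }

    mutable std::mutex m_;
    Pose pose_{};
};
```

    The reader may see the old pose or the new one, but never a mix of the two, which is the only guarantee being asked for.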

    --
    If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
  47. Re:Programs people want to use... by AuMatar · · Score: 1

    Here's pure C, C++ would lead to slightly neater syntax.

    void do_operation_on_all(my_struct *array, int size, threadfunc func){

      for(int i=0; i<size; i++){
         launch_thread(func, array[i]);
      }
    }

    Where launch_thread is a function that calls the correct OS specific function to launch a thread (probably pthread in most cases).

    It would then be called:

    do_operation_on_all(array, size, func);  which is actually even simpler than your solution.

    --
    I still have more fans than freaks. WTF is wrong with you people?
  48. Re:i'm so tired of political correctness by Attila+Dimedici · · Score: 1

    Perhaps not, but should someone get fired because people think a word they used is related to a racial epitaph, when it isn't?

    --
    The truth is that all men having power ought to be mistrusted. James Madison
  49. Re:Linus wrong? Shocking! by Half-pint+HAL · · Score: 2

    Not true, because if the processes are IO bound (and most are), most of the processes will be waiting anyway. But Linus's argument hangs on a more fundamental problem: memory bandwidth. If all the cores are sitting waiting because the data isn't in the cache and the other cores are already trying to use the memory bus, then you'll end up with more unused cycles than if you ran timesliced threads on a single core. The correct answer to this one cannot be made by reasoning and logic from first principles, but only by looking at raw empirical data. I daresay Linus has more of that than most of us here.

    --
    Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
  50. Re:Programs people want to use... by AuMatar · · Score: 1

    And when I said C++ would lead to a nicer syntax, I meant C++ 01 without std::thread and autos. Mainly because you could make it a template function instead of special casing for the type or using void pointers.

    --
    I still have more fans than freaks. WTF is wrong with you people?
  51. Re:i'm so tired of political correctness by Half-pint+HAL · · Score: 1

    Perhaps not, but should someone get fired because people think a word they used is related to a racial epitaph, when it isn't?

    No.

    --
    Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
  52. Ripe for Revolution by Roger+W+Moore · · Score: 2

    Nothing significant will change this year or in the next 10 years in parallel computing.

    You might be right but I'm far less certain of it. The problem we have is that further shrinking of silicon makes it easier to add more cores than to make a single core faster so there is a strong push towards parallelism on the hardware side. At the same time the languages we have are not at all designed to cope with parallel programming.

    The result is that we are using our computing resources less and less efficiently. I'm a physicist on an LHC experiment at CERN and we are acutely aware of how inefficient our serial algorithms are at using modern hardware. What we need is a breakthrough in programming languages to be able to parallel program efficiently, just like object oriented programming allowed us to scale up the size of programs. Until this happens I agree that not much will change, but if there is some clever CS researcher/student out there with a clever idea for a good parallel programming language the conditions are right for a revolution.

    1. Re:Ripe for Revolution by gweihir · · Score: 1

      If you were right, Transputers would have been the really big thing 25 years ago. They fizzled. Basically all massive parallel things have fizzled, because performance is abysmally bad, often worse than a single large CPU. So have all attempts at programming languages supporting larger parallelism. Linus just sums up the results of about 40 years of research. And most relevant problems cannot be parallelized in a meaningful way anyways, and these are fundamental limits, i.e. no clever idea is possible. Really, there are not going to be any breakthroughs, what we have now is what we will have in 100 years, give or take a small factor.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    2. Re:Ripe for Revolution by Roger+W+Moore · · Score: 1

      ...and right up until the invention of the transistor, computers would never be smaller than a large room or a small house. I would not be so sure about there being no clever idea possible unless there is a mathematical proof to support it. Until recently there was no need to go parallel; now there is a growing need to be able to program in parallel, and necessity is the mother of invention. While parallel does incur an overhead, as CPUs become more parallel and less serial this will presumably eventually overcome the cost of the parallel algorithm.

    3. Re:Ripe for Revolution by gweihir · · Score: 1

      There are mathematical proofs for many algorithms that they cannot be efficiently parallelized. There was always a strong effort to get parallel software off the ground, but it failed time and again. There is huge interest from the military and from other communities for simulations, for example. And some things _can_ be parallelized efficiently, like hash-tables (but they are I/O bound, hence Google parallelizes them to different machines in its search engine), while others cannot (like sorting, here parallelization only pays if comparing elements is very expensive).

      This is not a new problem. It has 40 years or so of intense research thrown at it. It even has specialized languages like OCCAM.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  53. Re:Programs people want to use... by bluefoxlucid · · Score: 1

    Most people don't understand lock contention, or lockless code. That's why Dragonfly BSD is ignored, yet is so far ahead: every time someone sees a new problem with parallel computing, with semaphore contention, with threading models, DragonflyBSD is there with a fix from 10 years ago. DragonflyBSD wanted fast semaphores, lockless schedulers, threading models designed to handle running thousands of threads on hundreds of cores, and so on; this was seen, in the early 21st century, as a useless waste of time and a source of complexity. DFBSD is a fork of FreeBSD because the FreeBSD devs wouldn't let the DFBSD guy just do it in FBSD.

    It's one of those things. I expect a long, arduous path to catch up to DragonflyBSD, to Minix, and so on, in the same way that we spent so much time catching up to XFS (ext4 spent years trying to reach feature and performance parity with XFS; it now even has on-the-fly inode allocation as an option). There's always some laughable side project somewhere claiming it will change the world, and there's always a point in the future where everyone else starts imitating that project. Whenever I see something big and long-running like this, I recognize it as some other thing; when people start doing multi-version Linux, I will immediately start talking about NixOS (which I think is implemented like crap, but has the right idea).

  54. Slang? by bjohnso5 · · Score: 1

    "Crock of shit" maybe? "Bunch of crock" doesn't seem like it'd even be a thing.

  55. Slang? by bjohnso5 · · Score: 1

    "Crock of shit" maybe? "Bunch of crock" doesn't seem like it'd even be a thing.

    Replying to myself... apparently it is a thing: http://en.wiktionary.org/wiki/...

  56. Re:Let's see how that sounds in 5-10 years time .. by Junta · · Score: 1

    To be fair, the trend seems to be hitting a ceiling.

    Desktop processors got to quad core and have pretty much sat there. The mobile space has been at quad-core a little less long and there are octo-core implementations moreso than desktop, but it still seems quad core is about where most devices settle. There are more efforts to make GPU style execution cores available for non-graphics use, but in practice a relatively small portion of the market has been able to have meaningful gains exploiting them. As vectorized instructions in cores become more capable, many of those problems actually start coming back to the traditional CPU cores as it works as well as the GPU but with an easier programming model. In short, the marketing results seem to indicate that end user devices might settle around quad core.

    Servers have been going up, with 18 core per socket for 2-socket now available. This shows that the desktop parts have room to grow in that dimension, but it just isn't being bothered with.

    --
    XML is like violence. If it doesn't solve the problem, use more.
  57. Re:Answering Linus' "Where the hell..." question.. by Junta · · Score: 1

    That sounds more like a distributed computing problem rather than applications running on a single 'system'. Even if it were centrally controlled, the computational load being time-shared might mean the best solution is still just a handful of cores. Such nanites would presumably be independent or unused enough that continuous CPU load would likely not even be in the picture. This is very much science fiction, but it still strikes me that the computational load would be negligible compared to the medical/engineering problems overcome. You take 30-40 years to start feeling the effects of aging, so it's not like cells require continual repair to achieve your hypothetical situation, just have to manage to repair everything within 25 years.

    --
    XML is like violence. If it doesn't solve the problem, use more.
  58. driving schools by Duncan+White · · Score: 1

    Http:// www.duncan-white.co.uk

  59. sequential programming mindset (try 64k cores) by the+agent+man · · Score: 1

    I was lucky enough to gather some parallel programming experience on the Connection Machine CM2, a 64k CPU (yes that is 65536 CPUs), 12 dimensional hypercube, a long time ago. The CM2 ultimately failed but we did get many great insights into parallel programming. At the time it was just not feasible for low cost, on your desktop, computing. It is NO problem to keep massive numbers of cores busy doing interesting computing. OK, how to use the 12 dimensions is less clear. At any rate, to claim that there is no need for 100 cores or more is really small minded because, unlike the time when silly "the world does not need more than 5 computers" kinds of comments were made, we already have evidence that there are powerful ways to employ massive parallel computing that can use thousands or even millions of cores.

    Just because we are being caught in a sequential programming mindset does not mean that there is no room for parallel programming. If you are looking at a two dimensional array of data and think of a nested loop you ARE caught in a sequential programming mindset. Additionally, famous people, including Dijkstra, have poopooed some algorithms that are inefficient when executed sequentially to the point where researchers, or programmers, are not even looking any more for good parallel execution. Take bubble sort. Not sure it was Dijkstra but somebody suggested to forbid it. Yes, on a sequential computer bubble sort is indeed inefficient but guess what. If communication does matter and if you are using a massively parallel architecture (i.e., not 4 cores) bubble sort becomes quite efficient because you only need to talk to your data neighbors. Likewise there are AI algorithms that can be shown to behave really well when conceptualized and executed in parallel. Collaborative Diffusion is an example: http://www.cs.colorado.edu/~ra...
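
    The neighbor-only form of bubble sort is usually called odd-even transposition sort. The sketch below simulates it sequentially, as an illustration rather than anything taken from the CM2; the point is that every compare-exchange within one phase touches a disjoint pair, so with one element per processor each phase could be a single parallel step:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Odd-even transposition sort, simulated sequentially.
// Within one phase every compare-exchange touches a disjoint pair of
// neighbours, so the inner loop has no dependencies between iterations.
void odd_even_transposition_sort(std::vector<int>& v) {
    const std::size_t n = v.size();
    for (std::size_t phase = 0; phase < n; ++phase) {
        // Even phases compare (0,1), (2,3), ...; odd phases (1,2), (3,4), ...
        for (std::size_t i = phase % 2; i + 1 < n; i += 2) {
            if (v[i] > v[i + 1]) {
                std::swap(v[i], v[i + 1]);
            }
        }
    }
}
```

    Sorting n elements takes n phases, so with n processors the parallel time grows linearly in n even though the total number of comparisons is still quadratic.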

  60. Re:best quote from the article by Anonymous Coward · · Score: 1

    What's wrong with it? It only said you may recognise him - it didn't say that most or many would.

    Shut up, Dave...

  61. Re:GAY NIGGERS can be DEVELOPERS 1000 WHORES! by gatkinso · · Score: 1

    You make me miss Shampoo.

    --
    I am very small, utmostly microscopic.
  62. Re:i'm so tired of political correctness by Noah+Haders · · Score: 2

    +1 this would make the best gravestone ever.

  63. Re:Let's see how that sounds in 5-10 years time .. by SpinyNorman · · Score: 1

    The trouble is that extrapolating the present isn't a great way to predict the future!

    If computers were never required to do anything much different than they do right now then of course the processing/memory requirements won't change either.

    But... of course things are changing, and one change that has been a long time coming but is finally hitting consumer devices is the arrival of hard "fuzzy" problems like speech recognition, image/object recognition, natural language processing, artificial intelligence... and the computing needs of these types of application are way different than running traditional software. We may start with accelerators for state-of-the-art offline speech recognition, but in time (a few decades) I expect we'll have pretty sophisticated AI (think smart assistant) functionality widely available that may shake up hardware requirements more significantly.

  64. Lots of moving parts by m.dillon · · Score: 4, Informative

    There are lots of moving parts here. Just adding cores doesn't work unless you can balance it out with sufficient cache and main memory bandwidth to go along with the cores. Otherwise the cores just aren't useful for anything but the simplest of algorithms.

    The second big problem is locking. Locks which worked just fine under high concurrent loads on single-socket systems will fail completely on multi-socket systems just from the cache coherency bus bandwidth the collisions cause. For example, on an 8-thread (4 core) single-socket Intel chip, having all 8 threads contending on a single spin lock does not add a whole lot of overhead to the serialization mechanic. A 10ns code sequence might serialize to 20ns. But try to do the same thing on a 48-core Opteron system and suddenly serialization becomes 1000x less efficient. A 10ns code sequence can serialize to 10us or worse. That is how bad it can get.

    Even shared locks using simple increment/decrement atomic ops can implode on a system with a lot of cores. Exclusive locks? Forget it.

    The only real solution is to redesign algorithms, particularly the handling of shared resources in the kernel, to avoid lock contention as much as possible (even entirely). Which is what we did with our networking stack on DragonFly and numerous other software caches.

    Some things we just can't segregate, such as the name cache. Shared locks only modestly improve performance but it's still a whole lot better than what you get with an exclusive lock.

    The namecache is important because something like a bulk build, where we have 48 cores all running gcc at the same time, winds up sharing an enormous number of resources. Not just the shell invocations (where the VM pages are shared massively and there are 300 /bin/sh processes running or sitting due to all the Makefile recursion), but also the namecache positive AND negative hits due to the #include path searches.

    Other things, particularly with shared resources, can be solved by making the indexing structures per-cpu but all pointing to the same shared data resource. In DragonFly doing that for seemingly simple things like an interface's assigned IP/MASKs can improve performance by leaps and bounds. For route tables and ARP tables, going per-cpu is almost mandatory if one wants to be able to handle millions of packets per second.
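
    The per-cpu idea can be illustrated with something much simpler than a route table: a statistics counter split into per-core slots. This is a user-space sketch, not DragonFly code; the `ShardedCounter` name, the caller-supplied core index and the 64-byte cache line size are all assumptions:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// One slot per core, each padded to its own cache line (64 bytes is an
// assumption) so that updates from different cores do not contend.
struct alignas(64) Slot {
    std::atomic<std::uint64_t> value{0};
};

template <std::size_t Shards>
class ShardedCounter {
public:
    // Hot path: each core touches only its own slot.
    void add(std::size_t core, std::uint64_t n) {
        slots_[core % Shards].value.fetch_add(n, std::memory_order_relaxed);
    }

    // Cold path: a reader sums all slots; the total is approximate
    // while writers are still running.
    std::uint64_t total() const {
        std::uint64_t sum = 0;
        for (const Slot& s : slots_) {
            sum += s.value.load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::array<Slot, Shards> slots_{};
};
```

    Writers never touch the same cache line, so the coherency traffic described above is paid only by the occasional reader.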

    Even something like the fork/exec/exit path requires an almost lockless implementation to perform well on concurrent execs (e.g. /bin/sh in a large parallel make). Before I rewrote those algorithms our 48-core Opteron was limited to around 6000 execs per second. After rewriting it's more like 40,000+ execs per second.

    So when one starts working with a lot of cores for general purpose computing, pretty much the ENTIRE operating system core has to be reworked; what worked well with only 12 cores will fall on its face with more.

    -Matt

    1. Re:Lots of moving parts by NovaX · · Score: 1

      Some things we just can't segregate, such as the name cache. Shared locks only modestly improve performance but it's still a whole lot better than what you get with an exclusive lock.

      What is the challenge with the namecache, specifically? If it's due to being LRU then there are approaches to mitigate the lock. A buffering approach like this Java cache batches updates to avoid lock contention. Another technique is to take a random sample to be probabilistically LRU, like Redis does.
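
      A rough sketch of the sampling idea, not Redis's or DragonFly's actual code: `SampledLruCache` is a hypothetical single-threaded class, the sample size of 5 is a placeholder, and locking is left out entirely. What it shows is that a hit only stores a timestamp instead of reordering a shared list, and eviction compares a few random entries:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical cache that approximates LRU by sampling (capacity >= 1).
class SampledLruCache {
public:
    explicit SampledLruCache(std::size_t capacity, std::size_t sample_size = 5)
        : capacity_(capacity), sample_size_(sample_size), rng_(12345) {}

    void put(const std::string& key, int value) {
        if (map_.find(key) == map_.end()) {
            if (map_.size() >= capacity_) evict_one();
            keys_.push_back(key);
        }
        map_[key] = Entry{value, ++clock_};
    }

    bool get(const std::string& key, int& out) {
        auto it = map_.find(key);
        if (it == map_.end()) return false;
        it->second.last_used = ++clock_;  // one store, no shared list to reorder
        out = it->second.value;
        return true;
    }

    std::size_t size() const { return map_.size(); }

private:
    struct Entry {
        int value;
        std::uint64_t last_used;
    };

    // Pick sample_size_ random entries and evict the stalest of them.
    void evict_one() {
        if (keys_.empty()) return;
        std::uniform_int_distribution<std::size_t> pick(0, keys_.size() - 1);
        std::size_t victim = pick(rng_);
        for (std::size_t i = 1; i < sample_size_; ++i) {
            std::size_t candidate = pick(rng_);
            if (map_.at(keys_[candidate]).last_used < map_.at(keys_[victim]).last_used) {
                victim = candidate;
            }
        }
        map_.erase(keys_[victim]);
        keys_[victim] = keys_.back();  // swap-remove keeps keys_ dense
        keys_.pop_back();
    }

    std::size_t capacity_;
    std::size_t sample_size_;
    std::mt19937 rng_;
    std::uint64_t clock_ = 0;
    std::unordered_map<std::string, Entry> map_;
    std::vector<std::string> keys_;
};
```

      The evicted entry is only probably the oldest one, which is the trade being suggested: a slightly worse eviction choice in exchange for no global ordering structure to lock on every hit.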

      --

      "Open Source?" - Press any key to continue
  65. Re:Programs people want to use... by Rei · · Score: 2

    BZZT, fail.

    1) You didn't define launch_thread.
    2) my_struct_array was said, and I quote, "a local-context data structure", so congrats, your data is going to go out of scope on you.
    3) The concept of having to write that is absurd because "for (auto&i : container)" is a "do whatever you want, any number of steps, no matching function signature required, inline, on any container whatsoever" built into C++11, *and* it's something that anyone who knows C++11 will know rather than being something you brewed yourself.

    Again, to repeat, given your failures on #1 and #2:

    " if you're too lazy to do it here, or change the requirements to present yourself with a simpler problem, then I'm going to take it that you're too lazy to do it in your code, too."

    Hence, I'm going to take it that you're likewise too lazy to actually thread your code. And the fact that your code contains a fundamental oversight resulting in a memory leak which wouldn't have caused a compile error is just icing on the cake.

    --
    If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
  66. Re:Programs people want to use... by Rei · · Score: 1

    Hmm, I was thinking of your launch_thread in terms of passing by reference, but I now imagine you meant copy (would have helped if you had actually, you know, defined the function). But then you're just adding an extra and unnecessary copy.

    Let me help you out. Your function is going to have to keep a global data structure of all of the threads' arguments because they're too big to pass as the pthread's argument. Now, your array isn't going to be fixed-size because you don't know how many instances are going to be called (you could limit it and put a hard cap, but you still have to put checks for that). If it's pure C, then you don't have STL containers, so you have to implement all of your memory management overhead. Regardless, you at the very least have to do an additional copy of your passed my_struct into your global arguments structure (2x), versus the one that std::thread needs. Now, there is a way to work around having to keep a global data structure, but it sucks: it's to have your launch_thread function pass a pointer to the local copy of my_struct and then sit around and wait for the thread to start up, copy off of the pointer, and then zero out your copy to alert launch_thread that it's started and has copied the data structure (of course, this involves yet another copy, plus a ton of reads while sitting around and waiting and wasting time). All of this, of course, is on top of all of the overhead imposed by pthread itself, including defining a function (and not in the same place where the code is being used, which reduces clarity), and roughly three lines for the pthread calls themselves.
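
    For comparison, the std::thread route mentioned above might look roughly like this. `MyStruct` and its fields are hypothetical stand-ins for the `my_struct` under discussion, and one thread per element is kept only to mirror the example being argued about, not because it is a sensible design. The part that matters is that `std::thread` stores its own copy of each argument, so no global argument table or handshake is needed:

```cpp
#include <thread>
#include <vector>

// Hypothetical payload: an input value and a place to store the result.
struct MyStruct {
    int input;
    int* output;
};

// Launches one thread per element and waits for all of them.
void do_operation_on_all(const std::vector<MyStruct>& items, void (*func)(MyStruct)) {
    std::vector<std::thread> workers;
    workers.reserve(items.size());
    for (const MyStruct& item : items) {
        // std::thread copies `item` into storage owned by the new thread,
        // so the caller's element may go out of scope safely afterwards.
        workers.emplace_back(func, item);
    }
    for (std::thread& t : workers) {
        t.join();
    }
}
```

    Depending on the toolchain this may need to be built with thread support enabled (for example `-pthread` with GCC or Clang).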

    This is all assuming that you implement it pthread-only and not portable. Otherwise, you have to add in #ifdefs and do a whole different approach for whole different platforms.

    Could you do all this? Of course you could. Would you do it? Clearly you didn't, and I know no amount of badgering would have gotten you to do it (I've tried this experiment before, you're not the first). Could you write it once and then reuse it?** Sure you could. Have you? No, of course you haven't, otherwise you would have just pasted it before. Why haven't you written such a thing before? Because it's too much hassle. Which is the very reason threading is underused.

    ** - kind of. You see, it's actually worse than that because unless you make an even more convoluted and unreadable and type-unsafe function, your thread launcher is going to be only set up for launching this particular case. But one can encounter all kinds of threading needs that would require significant changes. But I digress.

    --
    If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
  67. Re:Sounds like programmers from 40+ years ago by gweihir · · Score: 1

    No, it is not. They tried getting parallel programming off the floor 40 years ago and have consistently been failing since then. Linus sums up the results of the last few decades of R&D perfectly.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  68. Linus Lock by fyngyrz · · Score: 2

    The core is already dozens of times faster than memory

    It isn't, though, except for integer operations and tossing things around. Floating point core elements have a ways to go yet to get to single cycle for everything, and so spreading math among cores still saves time. OS folk like Linus may tend to think in terms of byte-to-BusSize manipulation. A lot of us deal with more nuanced data and operations. I *guarantee* you that a multicore processor will chew up properly designed image manipulation tasks a good deal faster than a single core will, and more flexibly (and more system-friendly) than a GPU can too, although slower for ops that fit in the GPU's memory and for which it offers competence. Software defined radio also makes terrific use of multiple cores. For instance, here, a 3 GHz system with 8 cores is mostly free to do other stuff, while a system with one core running at the same speed is about 90% utilized, which doesn't leave enough horsepower to do much else. With the 8-core, I can run the SDR and do whatever the heck I want. Then there's the "what do you mean by 'core'" question. Does the core have an FPU, or is it one of those profoundly crippled integer-only units? Does the core actually share memory (and therefore memory bandwidth) with other cores, or does it have its own pool of RAM? Is eco throttling choking it half to death? And so on.

    Having 1000 cores all waiting for 3,000 microseconds while the hard drive rotates to the other side of the platter

    What is this "hard drive" thing you describe? Doesn't everyone use boards with terabytes of RAM for near-term storage?

    Seriously, though, we all know (well, the ones who have considered it) that's exactly where we're going. SSDs as they stand today are just the tip of the iceberg; you want to know what's coming, instantiate a ram disk on your machine and run some benchies with it. And when we get to real RAM based storage, or anything of similar speed (or perhaps better... memristors?), we won't have wanted CPU development to have been sitting on laurels planted in a garden made of dead-slow storage in the interim.

    Having 1000 cores all waiting for 3,000 microseconds while the hard drive rotates to the other side of the platter does not improve performance over 4 cores waiting.

    True enough, but of course, that's not what happens, so... Effectively -- of course they can and do switch roles when memory is shared -- one is monitoring your ethernet, several are kicking in and out of httpd threads and/or processes, and so on for hundreds of OS tasks, and if you're like me, more than a few user tasks as well. For every task within a process that isn't hidebound by disk (and there are already a lot of them) having an additional available core is a very worthy thing. And when cores are tied up waiting for high level math operations, memory is (more) free relative to the needs of the available cores, and things simply run smoother, sooner. There's a lot of handwaving in there because of the complexity of caching and lookahead and so on, but the bottom line is that in my 8-core machine, I can do a lot more than in my 2-core machine, even though both have the same amount of memory and run at the same speed. And I apologize for the mangling of terminology. I think the point remains clear:

    Multiple cores are a great thing.

    --
    I've fallen off your lawn, and I can't get up.
    1. Re:Linus Lock by kesuki · · Score: 1

      http://en.wikipedia.org/wiki/AMD_Radeon_Rx_200_Series#Radeon_R9_280X

      has 2048 'stream' processors and only 3GB of ram -- true it only has 32 of what most people would call compute cores, but it is getting data from thousands of threads processed from its stream processing units. i have one of these devices and it was about 75 times faster at altcoin mining (dogecoins specifically) than the general purpose so called 8 core fx 8150 cpu. even though it's 8 threads the wiki doesn't explain why they can call it an 8 core but i know shortly after i built the rig computer parts changed generations and got slower.

      basically speaking programming for thousands of cores exists today by having simple tasks that break up complex or end user desired tasks and make them simple to run in parallel.

    2. Re:Linus Lock by RoLi · · Score: 1

      True enough, but of course, that's not what happens, so... Effectively -- of course they can and do switch roles when memory is shared -- one is monitoring your ethernet, several are kicking in and out of httpd threads and/or processes, and so on for hundreds of OS tasks, and if you're like me, more than a few user tasks as well. For every task within a process that isn't hidebound by disk (and there are already a lot of them) having an additional available core is a very worthy thing.

      Yeah, that's the theory.

      In real life, my 6-core, 32 GB-RAM box swaps even the tiniest process to disk (which is of course SSD) so that even opening the KDE-menu takes ages after some time.

      I think programmers are just too lazy to really use the hardware (which exists already today). For example the smart thing to do would be to make sure that the user interface is never swapped to disk. That would reduce available RAM only slightly but would dramatically improve performance.

      But of course nobody does it because 1) their mind was closed by academia which preaches inefficient but supposedly programmer-friendly things like OO, scripting, one-size-fits-all frameworks etc. and 2) because everybody is hoping to squash every problem with faster hardware.

      So it won't happen.

      In 20 years, we will run huge machines that will slow down everything by running as much as possible on Python and Javascript because that's what is hip and performance be damned. (Isn't the Windows 8 framework - user interface based on CSS and Javascript already?)
      Performance will probably suffer because instead of having fonts on disk (how 20th-century is that?) our computers will load fonts from Google about 10 times per hour.

    3. Re:Linus Lock by davydagger · · Score: 1
      you have 32 GB of RAM

      swapoff bro, just swapoff, and put a # in /etc/fstab

    4. Re:Linus Lock by fyngyrz · · Score: 1

      I think programmers are just too lazy to really use the hardware

      Not everyone. I write in C -- large applications, too -- and I write as close to the metal as I can get. I don't mind assembler, but the processors in use move under our feet too often: there's just no practical way to keep up without a compiler in between my code and the actual CPU instructions.

      For example the smart thing to do would be to make sure that the user interface is never swapped to disk. That would reduce available RAM only slightly but would dramatically improve performance.

      Agreed, that sounds like it'd be worthy. The problem I would anticipate is that a lot of the OS/UI code may be contained in huge "black boxes" that, if all loaded all the time, would consume much more RAM than we might otherwise think would be needed. OTOH, maybe we should all have 32 GB of RAM like you do. It sure has gotten inexpensive. On the OTHER other hand, if we did, the bloody OS would probably balloon to 32 GB, so... lol

      In real life, my 6-core, 32 GB-RAM box swaps even the tiniest process to disk (which is of course SSD) so that even opening the KDE-menu takes ages after some time.

      Concur with davydagger. You're either doing something really resource-intensive you didn't mention, or your OS is configured wrong.

      If you have 32 GB of ram, unless you're running software that makes demands on that scale, you probably don't need swap at all. I've only got 8 GB of ram and my system does really well unless I actually use it up -- although mine's OS X, so your swap algorithms and so forth are different. Still, I'm almost certain you can set the box up to behave better.

      In the past, I know linux had a really annoying bias for using up all the ram with buffers and cache, and would pig out if you actually tried to use that ram yourself once all the RAM was used that way, despite the supposed ability to throw out the cache and the buffers if the RAM was needed, but I am under the impression that time has passed.

      davydagger offered some specifics there... sounds like the right place to at least start reading some man pages. :)

      Isn't the Windows 8 framework - user interface based on CSS and Javascript already?

      No idea. Microsoft is dead to me. :)

      --
      I've fallen off your lawn, and I can't get up.
  69. Re:Programs people want to use... by Bengie · · Score: 1

    It's too early to know if it's just too hard a problem for the human mind in general

    Most user-space parallel problems aren't hard, it's just programmers who use algorithms and data-structures as black-boxes without understanding their implementation or characteristics, or alternatives, or generally being able to think for themselves. I don't know how many times I've glanced at problems that were throughput sensitive, and I immediately saw large potentials for parallelism, but required designs that would be utterly illogical for a serial design.

    Solving code parallelism problems is nearly identical to making well factored code. You need to break down the problem into its atomic parts, then rearrange those parts. Once you understand all atomic parts of a system and all of the data dependencies, parallelism becomes trivial. The problem is most people don't "understand" the system that they're working on, they just mindlessly throw code at a wall and some sticks. Most parallel code really needs to be designed from the beginning. Designing code? What's that?

  70. Re:i'm so tired of political correctness by Bengie · · Score: 1

    Part of growing up is learning to know when you don't know what you're talking about and Linus is calling you on it. Every time that I've looked into why Linus was "wrong", it was because he was wrong in theory, but correct in practice, because in practice, people are idiots and Linus recognizes this.

    I assume Linus is looking at this from a practical standpoint, that jumping the gun to making massive overhauls of the kernel to optimize for our current limited understanding of concurrent software and hardware interactions for a problem that most programmers are too stupid to even take advantage of, would be a bit premature. We should wait for hardware and software to better stabilize before we get locked (pun) into a concurrency regime for the next few decades.

    We've only just recently gained concurrent support for network and storage IO, and hardware has been changing a lot in the past few years as we keep scaling up SSDs and 40Gb+ NICs. We can use work-arounds for the meantime, and once everyone says "yes, this is the best way", we can make large kernel changes.

    Another example, AMD is already working on Mantle. Even if it doesn't fully take off, it's research into a related area, and we'll learn a lot from it. At some point in the future, a Mantle-like system may be incorporated into the kernel, but let's not turn the kernel into a cesspool of ever changing interfaces while they figure this problem out.

  71. Re:i'm so tired of political correctness by u38cg · · Score: 1

    Yes, specifically it's where there's a bit of a stooshie over something silly like niggardly, everyone finally calms down a bit, and then some asshole decides that the correct thing to do is run around shouting "NIGGER NIGGER NIGGER", because political correctness gone mad.

    --
    [FUCK BETA]
  72. Re:i'm so tired of political correctness by Cederic · · Score: 1

    Bury him next to Dr "Got shot for being a paediatrician".

  73. Re:Programs people want to use... by toby · · Score: 1

    It's ridiculous that not only does the article not mention Erlang or Haskell, but no high modded comment does either.

    Sad. Erlang's been around for more than 25 years with its successful lockless model.

    --
    you had me at #!
  74. Agreed; see also MapReduce and Hadoop; Cliff Nass by Paul+Fernhout · · Score: 1

    http://en.wikipedia.org/wiki/M...
    http://en.wikipedia.org/wiki/A...

    I learned MapReduce for use with CouchDB and it is a powerful technique even when not on parallel hardware -- although a bit of a conceptual shift.

    Here is a group using MapReduce with Hadoop for image processing:
    http://hipi.cs.virginia.edu/
    "HIPI is a library for Hadoop's MapReduce framework that provides an API for performing image processing tasks in a distributed computing environment. "

    Linus wrote: "The only place where parallelism matters is in graphics or on the server side, where we already largely have it. Pushing it anywhere else is just pointless." But would Linus really think image processing (like for robots or self-driving cars or using Baxter to sort your kid's Legos) is not an important issue? Sounds a bit like "640K is enough memory for anyone". Failure of the imagination is all too common based on unfamiliarity with some problem domain. Although, to be frank, I thought 32K of RAM on a Commodore PET was more than enough memory for anyone, because I could not imagine writing a program that large at the time. :-)

    Also, agent-based simulations or zone-based simulations can often use as much parallel hardware as you can throw at it, even if there may be occasional short synchronization steps. For example you could have a Minecraft-like game with thousands of active entities like wolves, zombies, pigs, and so on -- as well as processes like erosion or plant growth going on in multiple zones simultaneously. Game design could really change with millions of available general purpose cores. My wife and I created an algorithm for growing botanically accurate plants, but current games like Minecraft can't use it to grow each unique plant because it would be too computationally intensive if you had millions of unique plants all growing at the same time.
    https://github.com/pdfernhout/...

    Congrats on your luck/skill in working with Thinking Machines hardware like the CM2. Around 1984, when a psychology undergrad at Princeton interested in AI, I had developed some software called "Mex" for multiple execution where I ran up to 1000 simulated processors on an IBM mainframe under VMUTS. I was using it to help process some data from a robot vision system I had put together (which itself had three 6502 processors). I was really excited about the idea of linking together lots of 6502 processors. I applied for a job then at Thinking Machines but didn't get an offer. A sociology grad student I knew from then (Clifford Nass) got a job offer there (and that is part of why I applied there) but he didn't take the offer, which is kind of ironic. He's brilliant and innovative as his career shows, but not really a programmer or hardware guy, and not all that interested in AI that I knew of:
    http://adlininc.com/uxpioneers...

    I'm shocked and saddened just now, when checking what he is up to, to see on Wikipedia that Cliff died recently of a heart attack:
    http://en.wikipedia.org/wiki/C...

    What a big loss for Cliff's family as well as the world. And not that long after the sad loss of Professor Jim Beniger, who was an inspiration and good role model to both Cliff and myself in various ways.

    I can see though how Thinking Machines could also have benefited from Cliff's cleverness in thinking about human/machine interaction related to control of a (then) new type of machine. Maybe they'd still be in business if Cliff had gone to work with them? And maybe, being associated with MIT, they did not need yet one more programmer or hardware person, no matter how much they were interested in parallel processing or had done their own projects already on it.

    --
    A 21st century issue: the irony of technologies of abundance in the hands of those still thinking in terms of scarcity.
  75. Re:Programs people want to use... by Tablizer · · Score: 1

    Maybe that's why the banks F'd up mortgage pricing?

  76. Hardware verification, not software QA. by Ungrounded+Lightning · · Score: 1

    Verification is the process of checking that software works correctly. The more complex the system, the more complex the process of verification.

    You said "verification" but you're thinking of "software quality assurance". Though "verification" is sometimes used to describe a step in that process, when used standing alone (at least here in silicon valley), it refers to the analogous process in integrated circuit design.

    Verification is a BIG DEAL in integrated circuit design. A good hardware project will have at least as many verification engineers as designers (and hardware designers will freely act as verification engineers - on OTHER designers' modules - during the later stages of a chip tapeout, without taking a career hit.) It is the limiting factor in when the chip design hits silicon and when it hits the market.

    So IMHO the previous poster is talking about the up-front quality assurance processes and costs of hardware, rather than software, complexity.

    (Releasing a rev to a software product due to a QA issue missed due to added complexity may be costly. But releasing a rev to silicon takes months and millions of dollars of sunk cost. They're not in the same league.)

    --
    Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
  77. Re:Programs people want to use... by psmears · · Score: 1

    Why haven't you written such a thing before? Because it's too much hassle. Which is the very reason threading is underused.

    LOL. Actually there's a better reason such a thread launch facility doesn't commonly get written - which is that, in most circumstances, it really doesn't help performance that much, if at all - and the added complexity makes for a big net minus. There are a number of issues:

    Firstly, spawning threads is expensive. Yes, on Linux it's "cheap", but that's "cheap" compared to other implementations - it's still a lot compared to doing a modest amount of work on the local CPU. (Why is it so expensive? Basically because there's a lot of housekeeping to do. In addition to the kernel creating new kernel structures for the new thread of execution (similar to creating a process), the process's thread library must allocate a stack for the new thread (involving modifying the process's page tables), iterate through all loaded shared libraries in order to allocate any thread-local storage they require, and so on, requiring multiple syscalls, a TLB flush, at least one context switch, and so on.) To some extent the impact of this overhead can be reduced by maintaining a pool of ready-created threads, but this either takes away control of performance (if done automatically by your language/library) or substantially increases complexity (if you implement it yourself, since you then have to synchronise the threads carefully).
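
    To give a feel for the "implement it yourself" option, here is a minimal sketch of a pool of ready-created threads. The ThreadPool class and its submit method are hypothetical names invented for this illustration, and a production pool would need more (exception handling, futures for results, shutdown policy). Even this small version needs a mutex, a condition variable and a careful shutdown sequence, which is the added complexity described above.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool: workers are created once and reused, so the
// per-task cost is a queue push and a wakeup rather than a thread spawn.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        assert(n > 0);
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }

    // Lets the workers drain any queued tasks, then joins every worker.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_all();
        for (auto& w : workers_) w.join();
    }

    ThreadPool(const ThreadPool&) = delete;
    ThreadPool& operator=(const ThreadPool&) = delete;

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        ready_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // only reachable when stopping
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run the task outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

    Typical use is to construct one pool for the life of the program and call submit with small lambdas; destroying the pool waits for everything that was queued.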

    The second problem is that, unless you're very careful, extra threads don't buy you much performance, and can indeed hurt. Take the example you gave - doing some processing on each struct in an array, where each such struct contains an int and a double (16 bytes total, including alignment padding). With 64-byte cache lines (typical on x86), there are 4 such structs per cache line. If you distribute the processing over threads running on different cores, then instead of one core waiting for the cache line to come in from main memory, and then processing the 4 structs very rapidly (since they're now all in cache), you'll have 4 cores each waiting for the data to be available - i.e. up to a 4x slowdown for memory-bound tasks. And that's assuming the structure is only read from; if it's written to as well then the cache line will have to bounce between cores, and the multithreading slowdown will be many times worse. Now, if you ensure that structs in the same cache line get processed by the same core (ideally in sequence, and by the same kernel thread), then you do potentially get a big speedup - provided you don't hit any other gotchas - but the C++ code you're promoting doesn't seem to guarantee this in any way.
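
    As a rough sketch of the "same cache line, same thread" idea: give each thread one contiguous range whose length is a whole number of cache lines' worth of structs. The item struct, the 64-byte line size and process_chunked are assumptions made for this illustration. Note that the chunk boundaries only coincide with real cache-line boundaries if the array itself starts on one, which std::vector does not guarantee, and that nothing here pins threads to cores.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical element type: an int and a double, 16 bytes on typical x86-64.
struct item {
    int n;
    double x;
};

constexpr std::size_t kCacheLine = 64;  // assumed line size, typical on x86
constexpr std::size_t kPerLine = kCacheLine / sizeof(item);
static_assert(kPerLine > 0, "item must fit in one cache line");

// Splits the array into contiguous chunks that are whole multiples of
// kPerLine elements, one chunk per thread, so neighbouring structs are
// handled in sequence by the same thread instead of being scattered.
void process_chunked(std::vector<item>& items, unsigned num_threads) {
    assert(num_threads > 0);
    const std::size_t lines = (items.size() + kPerLine - 1) / kPerLine;
    const std::size_t lines_per_thread = (lines + num_threads - 1) / num_threads;
    const std::size_t chunk = lines_per_thread * kPerLine;

    std::vector<std::thread> threads;
    for (std::size_t begin = 0; begin < items.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, items.size());
        threads.emplace_back([&items, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                items[i].x = items[i].x * items[i].n;  // stand-in for real work
        });
    }
    for (auto& t : threads) t.join();
}
```

    Whether this beats a plain serial loop still depends on how much work is done per element; for something as cheap as one multiply, the thread start-up cost will usually dominate.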

    Third, and perhaps most importantly, data dependencies matter. In your example you're detaching all the threads; this is not realistic, because that means you cannot ever depend on their operations having finished. In the vast majority of cases you do need to know when an operation has finished: you're generally doing work for a reason - i.e. that you're going to use the result - and you can't begin to use that result until you know it has been produced. That, in and of itself, adds complexity: you have to analyse your program's dataflow much more carefully in the presence of threads, because C/C++ will quite happily let you use a variable before another thread has finished assigning to it, without any sort of warning or exception. The analysis can certainly be done, and synchronisation put in place to eliminate the problems - but that is further overhead, both in the program's performance and in the complexity of the program itself, and hence the time taken to write it (and especially to enhance it later, when the synchronisation model may not be so fresh in one's mind).
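
    One way to make such a dependency explicit in modern C++, sketched here with a hypothetical parallel_sum function (not code from this thread), is to hand the work to std::async and only touch the result through the returned std::future. The get() call is the synchronisation point: it blocks until the other thread has finished, so the result cannot be read too early.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical two-part computation: the final sum depends on both halves,
// so the caller must wait for the asynchronous half before combining them.
long long parallel_sum(const std::vector<int>& data) {
    const auto mid =
        data.begin() + static_cast<std::ptrdiff_t>(data.size() / 2);

    // Sum the first half on another thread; std::launch::async requests
    // asynchronous execution rather than deferring the work until get().
    std::future<long long> first =
        std::async(std::launch::async, [&data, mid] {
            return std::accumulate(data.begin(), mid, 0LL);
        });

    // Meanwhile, sum the second half on the calling thread.
    const long long second = std::accumulate(mid, data.end(), 0LL);

    assert(first.valid());
    // get() blocks until the other thread has produced its result.
    return first.get() + second;
}
```

    The cost is exactly the overhead described above: the caller has to know which results it depends on and wait for them, and a detached thread offers no such handle.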

    Used correctly and in the right circumstances, threads on an N-core system can give an N-times speedup (or greater, due to caching effects). Used badly, at best they'll reduce performance, and usually they'll increase complexity and lead to subtle bugs that are hard to debug.

    The new thread features in modern C++ are very cool, but the fact they didn't exist before is not what's been preventing competent programmers from using threads all over the place :)

  78. Limited application by MoarSauce123 · · Score: 1

    There is limited application for making processes faster through parallelism. It only works well for processes that do not rely on the results of any of the other processes. Unfortunately, many real world applications depend on sequential tasks and I/O. That leaves running multiple applications in parallel, but that is different from parallel programming and is a task already handled quite well by current operating systems.

  79. Re:Let's see how that sounds in 5-10 years time .. by Junta · · Score: 1

    The issue is that when processor vendors went to dual and then quad core, people started extrapolating and saying 'oh in a decade, we'll be using hundreds of cores on a random desktop'. Instead it tapered out at about 4 for the most part with focus on reducing the power envelope while minimizing performance loss.

    I would say the discussion presuming massive core counts is based on an extrapolation of older trends of increasing core count, and it's perfectly reasonable to step back and recognize the change in the trend. Sure, tomorrow we could suddenly be back on the path to 256 core desktop solutions for unforeseen reasons, but as it stands, there are no signs of that being the priority of the industry.

    --
    XML is like violence. If it doesn't solve the problem, use more.
  80. Re:Linus wrong? Shocking! by bsdasym · · Score: 1

    It sounds like you're suggesting that memory bus speed will not continue to increase, and thus, we should stop adding bus contention by adding cores. The conclusion there hinges on a rather unsupported premise that is contradicted by the (historical) empirical data. All signs point to memory becoming much faster indeed.

    If Linus' expertise were really relevant here, perhaps Transmeta wouldn't have failed.

  81. Re:Linus wrong? Shocking! by Half-pint+HAL · · Score: 1

    Memory bus speed is increasing, and therefore the cost of cache misses is decreasing. One way or another, that still leaves us with cache misses as a bottleneck. The question is not a straightforward one of "memory bus speeds are increasing so who gives two hoots" -- there is a very subtle equation needed to determine what cache size is optimal with what bus speed, and for which task.

    --
    Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
  82. Re:Let's see how that sounds in 5-10 years time .. by SpinyNorman · · Score: 1

    Well, there's obviously no need to add more cores/parallelism until there's a widespread need for it (unless you are Chinese, when octocore is a must!), but I think the need is coming pretty fast.

    There are all sorts of cool and useful things you can do with high quality speech, image, etc recognition, natural language processing and AI, and these areas are currently making rapid advances in the lab and slowly starting to trickle out into consumer devices (e.g. speech and natural language support both in iOS and Android).

    What is fairly new is that in the lab state of the art results in many of these fields are now coming from deep learning / recurrent neural net architectures rather than traditional approaches (e.g. MFCC + HMM for speech recognition) and these require massive parallelism and compute power. These technologies will continue to migrate to consumer devices as they mature and as the compute requirements become achievable...

    Smart devices (eventually *really* smart) are coming, and the process has already started.

  83. Turkey's most listened-to radio station by arabeskinsesifm · · Score: 1

    Turkey's lifeblood: listen to arabesque radio at www.arabeskinsesi.com