Slashdot Mirror


How We'll Program 1000 Cores - and Get Linus Ranting, Again

vikingpower writes For developers, 2015 got kick-started mentally by a Linus Torvalds rant about parallel computing being a bunch of crock. Although Linus' rants are deservedly famous for their political incorrectness and (often) for their insight, it may be that Linus has overlooked Gustafson's Law. Back in 2012, the High Scalability blog already ran a post pointing towards new ways to think about parallel computing, especially the ideas of David Ungar, who thinks in the direction of lock-less computing of intermediary, possibly faulty results that are updated often. At the end of this year, we may be thinking differently about parallel server-side computing than we do today.
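
For readers who have not met the two laws the summary alludes to, here is a minimal sketch comparing them. The function names and the parameter `p` (the parallelizable fraction of run time) are illustrative choices, not anything from the linked posts. Amdahl's law assumes a fixed problem size; Gustafson's law assumes the problem grows with the machine.

```cpp
#include <cassert>

// Amdahl's law: fixed problem size. `p` is the parallelizable fraction
// of the single-core run time, `n` the number of cores.
double amdahl_speedup(double p, double n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}

// Gustafson's law: the workload is scaled up with the core count. `p` is
// the parallel fraction of the run time as measured on the n-core machine.
double gustafson_speedup(double p, double n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return (1.0 - p) + p * n;
}
```

With p = 0.95 and 1000 cores, `amdahl_speedup` gives roughly 19.6 while `gustafson_speedup` gives roughly 950. Which figure is relevant depends on whether the workload actually grows with the hardware, which is the point under dispute in the comments.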

449 comments

  1. Mutex lock by Anonymous Coward · · Score: 5, Funny

    All the others ended up in a mutex lock situation, so I had the chance to do the first post

    1. Re:Mutex lock by NoNonAlphaCharsHere · · Score: 4, Funny

      Thanks a lot asshole, a lot of were busy-waiting while you were typing.

    2. Re:Mutex lock by NoNonAlphaCharsHere · · Score: 5, Funny

      I think I a word.

      A lot of US were busy-waiting.

    3. Re:Mutex lock by Z00L00K · · Score: 2

      In any case - a multi-core machine can also handle multiple different tasks simultaneously, it's not always necessary to break down a single task into sub problems.

      The future for computing will be to have a system that can adapt and avoid single resource contention as much as possible.

      --
      If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
    4. Re:Mutex lock by TheRaven64 · · Score: 5, Funny

      That's what happens when you try to write without a lock.

      --
      I am TheRaven on Soylent News
    5. Re:Mutex lock by Anonymous Coward · · Score: 1

      The future for computing will be to have a system that can adapt and avoid single resource contention as much as possible.

      I think this is basically the point LT was making. The core is already dozens of times faster than memory and thousands of times faster than storage, so adding more cores does not really address resource contention. Make more and better caches; make more and better I/O. Having 1000 cores all waiting for 3,000 microseconds while the hard drive rotates to the other side of the platter does not improve performance over 4 cores waiting.

    6. Re:Mutex lock by buckfeta2014 · · Score: 1

      I use SSD, you insensitive clod!

      --
      Buck Feta. You know what to do.
    7. Re:Mutex lock by drinkypoo · · Score: 2

      The core is already dozens of times faster than memory and thousands of times faster than storage

      When you add more cores, you also can add more memory bandwidth, if you couple them closely to memory controllers. This is how multiprocessor PCs work today. Hell, even some processors with more cores in them have more memory buses, it's not just adding chips that gives you more bandwidth.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    8. Re: Mutex lock by Anonymous Coward · · Score: 0

      While you were busy bit-banging that core, she was spinning on my lock.

    9. Re:Mutex lock by Anonymous Coward · · Score: 1

      The future for computing will be to have a system that can adapt and avoid single resource contention as much as possible.

      The future? That's the whole point of out-of-order execution. Execution is governed by availability of input data. OoO microprocessors have been around for 25 years now.

    10. Re:Mutex lock by arth1 · · Score: 2

      I use SSD, you insensitive clod!

      Then after all blocks on the drive have been written to, you wait for a second while the drive moves data away and clears a sector so there's space to write to.
      SSDs have far better average write speeds, but far worse worst case write speeds. Using them for anything timing critical without a battery backed up controller is asking for trouble.

      "Use TRIM", I hear from the peanut gallery. Except that there are no RAID controllers (or software RAIDs) that actually support TRIM in practice. Nor does TRIM work for partitions where there is no file system support. Like raw database partitions or swap. Yep, put a single swap partition on the drive, and you will still be subject to the drive not knowing what blocks are free, and can't write to them unless asked to overwrite them.

      For guaranteed rate I/O, spinning platter drives and pure battery-backed RAM disks are still the way to go. A RAID of short-stroked HDs has a worst case performance far better than modern SSDs, despite the average being much slower. For a desktop user, an occasional or rare "hiccup" of a second might not be noticeable or even a concern if it is, so SSDs are fine, and even great. For real-time data processing, it can very well be a big concern.

    11. Re:Mutex lock by Bengie · · Score: 1

      That's why SSDs keep a reserve of about 10%-30% of the logical storage as pre-TRIM'd. Some of the newer SSDs even reserve another bunch of the drive as a scratch pad for writes.

      Looking at this benchmark, it seems reads are more likely to have a random long access time than writes. http://techreport.com/review/2...

    12. Re: Mutex lock by Anonymous Coward · · Score: 0

      At least Samsung provides a manager utility (Magician) for their SSDs, which uses ram for caching and does some other tricks to keep things up to speed.

    13. Re:Mutex lock by azav · · Score: 1

      Do you also own cat?

      You use *an* SSD.

      --
      - Zav - Imagine a Beowulf cluster of insensitive clods...
    14. Re:Mutex lock by Jeremi · · Score: 1

      You use *an* SSD.

      He uses an solid state disks?

      --
      I don't care if it's 90,000 hectares. That lake was not my doing.
    15. Re:Mutex lock by Anonymous Coward · · Score: 0

      He uses a Ess Ess Dee?

    16. Re:Mutex lock by Anonymous Coward · · Score: 0

      A lot of US were busy-waiting.

      A properly tail recursive event at the mornings.

    17. Re:Mutex lock by Jane+Q.+Public · · Score: 1

      Yes, in fact I've been looking at a lot of the newer SSDs coming out that are reporting sustained random read rates slower than sustained random writes (4k blocks).

    18. Re:Mutex lock by Marillion · · Score: 1

      Will no one think of the dying Dining Philosophers?

      --
      This is a boring sig
    19. Re:Mutex lock by Anonymous Coward · · Score: 0

      Most IO sensitive tasks perform few writes, usually compute is waiting on reads. If you're an edge case you man up and use a battery backed ram cache.

    20. Re:Mutex lock by Anonymous Coward · · Score: 0

      If, by worst case, you mean ripping the SSD out of a RAID 0 array by the data cable and throwing it under the treads of an Abrams tank while it's in the middle of running an artificial benchmark writing a petabyte of data to your 32 MB relabeled Tyan drive that you salvaged from your netbook.
      Yeah, you got me. Try filling up your RAIDed 15k Savvios to the hilt and see how fast the RAID card pukes them out of the array like last night's gin.
      I realized you were trolling when I read battery-backed RAM, but I was hooked already. Well played. Very well played.

    21. Re: Mutex lock by Anonymous Coward · · Score: 0

      Sorry,not how English.works.

  2. Won't somebody find his blanket already! by Anonymous Coward · · Score: 0

    This is beyond the pale, and I am beside myself.

  3. Pullin' a Gates? by Tablizer · · Score: 4, Interesting

    "4 cores should be enough for any workstation"

    Perhaps it's an over-simplification, but if it turns out wrong, people will be quoting that for many decades like they do Gates' memory quote.

    1. Re:Pullin' a Gates? by AchilleTalon · · Score: 0, Offtopic

      It won't turn out wrong. Linus is right on this one. We are talking about massively-parallel computing and Linus describes it right. It is a niche which will need specific algorithms tuned for the hardware (GPU or other); the pipeline must be kept busy to observe a performance gain. It doesn't scale to general purpose computing.

      --
      Achille Talon
      Hop!
    2. Re:Pullin' a Gates? by cb88 · · Score: 2

      It already is wrong...

      Linux Workstation: 16cores = way faster builds than 4 cores.
      CAD workstation: I imagine a lot of geometry processing is parallelized... the less waiting the better (either format conversion or generating demo videos etc. eat up a lot of CPU)
      Video workstation: That's just a blatantly obvious use for multiple cores...
      Linux HTPC: I wanna transcode stuff fast... more cores
      Linux Gaming: These days using at least 4 cores is getting more common...

      Things that I often see that are *broken*: for instance, 200MB work documents that hang the entire system when you scroll (yes, Windows, that's bad). Linux isn't much better, though; disk IO starvation is a long-time pet peeve there... 4 cores is the wrong place to draw the line currently; maybe 6-8 cores + improved disk IO would be a realistic ideal these days.

      Granted, a lot of programs *ought* to run just fine on my Sparcstation LX @ 50MHz and 128MB RAM... but that isn't the future unless we have a nuclear apocalypse. Also, there is a good chance that a lot of my cores will sit idle; even so, power management is better than it used to be, and more cores can improve latency because I now have more available CPU time even though the individual cores are probably slower. Overall that's a good tradeoff.

    3. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      If you had bothered to RTFA, you'd know that he never disputed any of those points. What he was actually referring to was the insane push for EVERYTHING to be done with hundreds of cores, and that is wasteful and stupid and insane.

    4. Re:Pullin' a Gates? by bruce_the_loon · · Score: 4, Interesting

      If you went and read Linus' rant, then you'll find you are actually reinforcing his argument. He says that except for a handful of edge use-cases, there will be no demand for massively parallel in end user usage and that we shouldn't waste time that could be better spent optimizing the low-core processes.

      The CAD, video and HTPC use-cases are already solved by the GPU architecture and don't need to be re-solved by inefficient CPU algorithms.

      Your Linux workstation would be a good example, but is a very low user count requirement and can be done at the compiler level and not the core OS level anyway.

      Your Linux gaming machine shouldn't be doing more than 3/4 cores of CPU and handing the heavy grunt work off to the GPU anyway. No need for a 64 core CPU for that one.

      Redesigning what we're already doing successfully with a low number of controller/data shifting CPU cores managing a large bank of dedicated rendering/physics GPU cores and task-specific ASICs for things like 10Gb networking and 6Gb IO interfaces is pretty pointless, which is what Linus is talking about, not that we only need 4 cores and nothing else.

      --
      Trying to become famous by taking photos. Visit my homepage please.
    5. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Considering they misquote Gates on the memory quote, Linus doesn't have much to worry about.

    6. Re:Pullin' a Gates? by jhol13 · · Score: 3, Insightful

      Why not? Currently Firefox has problems rendering (loading) two pages simultaneously, although it should be able to handle tens, using several cores.
      Same with Evince (which is crap anyway), it cannot do anything in parallel, should be able to use tens of cores.
      Javascript? Although the language is the worst I have seen since APL, a smart compiler could at least in some cases parallelize it (maybe with speculative execution or like).
      And so on.

      It will turn out to be as wrong as "640k".

    7. Re:Pullin' a Gates? by Tablizer · · Score: 1

      The Gates quote is ambiguous. One can read it different ways.

    8. Re:Pullin' a Gates? by ls671 · · Score: 2

      hmmm... Linus sounds right to me too. He specifically said, or almost, that people wanting to load 10 pages in sandboxed firefox process/thread in parallel could find a use for 16 cores ;-)

      --
      Everything I write is lies, read between the lines.
    9. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1

      No, the quote is plain untrue. It was supposedly made in reference to the hybrid 8/16-bit processor (and had he said it about that chip, he would indeed have been right). But no source has ever been found to confirm this, and Gates himself has said that he has said many wrong things and made many bad predictions, but that the one credited to him is false.

    10. Re:Pullin' a Gates? by bloodhawk · · Score: 3, Insightful

      Actually the quote is just an internet myth; at least no one has ever found a source for it, or anyone who even claims to have heard him say it, and Gates denies having said it as well.

    11. Re:Pullin' a Gates? by davmoo · · Score: 2

      Except that Bill Gates never actually said the so-called "quote" that is attributed to him.

      --
      I want a new quote. One that won't spill. One that don't cost too much. Or come in a pill.
    12. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      hmmm... Linus sounds right to me too.

      Would you still think that if you didn't know who said it? Or, perhaps, if you thought it was another Bill Gates quote?

    13. Re:Pullin' a Gates? by Rei · · Score: 3, Interesting

      Linus's argument basically boils down to, "Parallel algorithms are sorcery, and the only places they matter are applications that demand performance, which are indeed increasingly using parallelism".

      Of course you don't need, say, a 50-threaded version of vi or alsamixer or whatever. But apps that need performance increasingly have to get it from threading. And there's nothing "magical" about parallelism. Perhaps in Linus's dislike for C++ he's missed how trivially easy it's gotten to launch threads in C++11, but it takes less work now than a for-loop, since std::thread is so simple and you can inline the command with a lambda. And you have a nice clean mutex library including scoped mutexes like std::lock_guard so you don't even have to remember to unlock them.

      It's quite true that having multiple cores needing to read from and write to the same chunk of memory isn't a good thing. But I'd bet you that only in under 5% or so of high performance apps is that the *only* level you can thread at. Because if you have, say, five nested levels of looping, 4 of them can be memory constrained, but so long as at least one can be threaded without heavy reads/writes on shared cache, you can thread to your heart's content with minimal adverse impact. And "heavy" is the key word. So long as you're not doing essentially *constant* heavy reads/writes on shared cache, the overhead cost is minimal.

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
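
A minimal sketch of what Rei describes above: `std::thread` launched with an inline lambda, plus `std::lock_guard` as a scoped lock. The function `parallel_sum` and its chunking scheme are hypothetical illustrations, not code from the thread. Each worker accumulates into a private variable and takes the mutex only once, so shared memory is touched rarely, which is the "not constant heavy reads/writes" condition from the comment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Sum `data` using `num_threads` workers (hypothetical helper).
long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    long long total = 0;
    std::mutex total_mutex;
    std::vector<std::thread> workers;
    // Ceiling division so every element falls into exactly one chunk.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &total, &total_mutex, begin, end] {
            long long local = 0;  // private accumulator: no sharing in the hot loop
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            // Scoped lock: released automatically when `lock` goes out of scope.
            std::lock_guard<std::mutex> lock(total_mutex);
            total += local;
        });
    }
    for (auto& w : workers) w.join();
    return total;
}
```

Calling `parallel_sum(values, 4)` returns the same result as a serial loop. Whether it is faster depends on the data size and on thread start-up cost, so this shows the mechanics rather than a performance claim.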
    14. Re:Pullin' a Gates? by Urkki · · Score: 5, Insightful

      Why not? Currently Firefox has problems rendering (loading) two pages simultaneously, although it should be able to handle tens, using several cores.
      Same with Evince (which is crap anyway), it cannot do anything in parallel, should be able to use tens of cores.
      Javascript? Although the language is the worst I have seen since APL, a smart compiler could at least in some cases parallelize it (maybe with speculative execution or like).
      And so on.

      It will turn out to be as wrong as "640k".

      Javascript is generally used in event driven manner, so it will perform quite well on a single core. Firefox having trouble loading multiple pages simultaneously should still be IO-bound, not CPU-bound, and if the engine has trouble, then it's an SW architecture problem where more cores will not really help.

      Linus's point was that taking a 6 core CPU and replacing 2 cores with more cache and more transistors per core should make almost anything on the desktop run faster.

    15. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1

      Firefox has problems rendering (loading) two pages simultaneously, although it should be able to handle tens, using several cores.

      What you describe is basic concurrency (doing two mostly independent tasks). Massively parallel computing, as some people imagine it, would magically use all your cores at 100% to load one or more pages in an instant. Most work does not scale that way, just like nine women can't make a child in a month.

    16. Re:Pullin' a Gates? by Urkki · · Score: 2

      It already is wrong...

      Linux Workstation: 16cores = way faster builds than 4 cores.

      Did the 4 core CPU have 1/4th of the transistor count of the 16 core CPU? Then I'd expect it to be much slower, of course. Linus's point was that a 4 core CPU with the same transistor count as a 16 core CPU (with the extra transistors used for more cache, better out-of-order execution logic, more virtual registers, and so on) will be faster on almost every task. So cores beyond 4 (the number Linus threw out as the ballpark count) make sense only if you really cannot spend any more transistors on making those 4 cores faster, but still have die space to spare.

    17. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1

      Why not? Currently Firefox has problems rendering (loading) two pages simultaneously, although it should be able to handle tens, using several cores.
      Same with Evince (which is crap anyway), it cannot do anything in parallel, should be able to use tens of cores.

      You still don't argue against Linus here. The kind of parallel he is talking about is the kind that uses hundreds, not tens of cores and for a longer duration.
      The problem you are talking about only comes from Firefox bloat and bad design.

    18. Re:Pullin' a Gates? by ls671 · · Score: 1

      Nope, I only say that because I already thought the same way before I was aware of his view. It happens all the time.

      --
      Everything I write is lies, read between the lines.
    19. Re:Pullin' a Gates? by im_thatoneguy · · Score: 3, Interesting

      It is a niche which will need specific algorithms tuned for the hardware (GPU or other); the pipeline must be kept busy to observe a performance gain. It doesn't scale to general purpose computing.

      I feel like this is moving the goal posts. "You will never do massively parallel computing on a CPU because if it's massively parallel it's a GPU not a CPU."

      Linus is 100% wrong. What's the "general purpose" computing that we all want? The NCC-1701D's main computer from Star Trek. If I say "Cortana/Siri/Google Now please rough me out a flyer for our yardsale on Saturday." you're going to be looking at a massively parallel task for the neural networks to not only interpret the voice but then make sense of the words and finally produce a printable flyer suitable for hanging. Programming is still a really fancy version of "IF A THEN B". "for X in GROUP do Z". "X = Y". Yeah, if your application is incredibly serial then a serial processor is all that you'll need. When computing advances to the next phase of neural networks, AI and directed (not instructed) computing then it'll need to be more like our brain: massively parallel.

      Now there are two obnoxious tautological arguments against this:
      A) "That's not a "CPU" that's like a NeuroProcessorUnit, an NPU if you will"
      B) "Yes we'll need a giant mainframe, but it'll be a server in the cloud!"

      A is moving the goal posts. Just because the processor isn't an ARM or x86 instruction compatible chip doesn't mean it's not worthy of the label CPU. As mentioned above you can't say that there'll never be a CPU with massive parallelism because as soon as it has massive parallelism it's by definition no longer a CPU. B is just saying that nobody will have a need for computers because we'll have a giant mainframe. Which might be true but you just need a basic DSP not even a CPU if it's just a pure thin client transmitting a video, audio and input stream to the cloud for processing. In which case all of the CPUs in existence... need to be massively parallel AI processors.

    20. Re:Pullin' a Gates? by gnupun · · Score: 1

      If you went and read Linus' rant, then you'll find you are actually reinforcing his argument. He says that except for a handful of edge use-cases, there will be no demand for massively parallel in end user usage and that we shouldn't waste time that could be better spent optimizing the low-core processes.

      So, if someone wants to optimize a critical app-specific operation "foo()" in their app and make it to go 4 times faster using 4 cores, they are crazy?
      Your argument implies that other than these so-called "edge cases" there is no need to improve performance of any other type of code.

    21. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1

      We are talking about massively-parallel computing and Linus describes it right. It is a niche...

      ...for now, yet it only takes a single killer app to change that. Ask Google if end-users may benefit from massively-parallel computing. It doesn't reach workstations because it's in their best interest to keep it server side, but the same techniques for natural language processing and deep learning would greatly benefit from massive parallelism, and having them in a machine you own would make it much more private and tailored to you.

      End-user development is a field where many tasks would benefit from this kind of parallelism, and it's the great unknown for many people. Tools for handling unstructured and semi-structured content (like easy web scraping and multiple editing) need to perform a deep analysis of a page's content to infer what the user is trying to do and make their life easier.

      That most developers haven't heard of such tools and don't know how to build them doesn't mean that they're not useful. A copy/paste clipboard tool that worked like Lapis could be used both by developers (I've heard a less powerful "column edit" mode is popular in the Sublime Text editor) and non-developers (batch renaming of files and album tracks is a common task performed by end users, and they edit names one by one because no one understands the existing pattern-based batch renaming tools).

    22. Re:Pullin' a Gates? by itzly · · Score: 1

      A neural net is a very specific, massively parallel, purpose, not general purpose.

    23. Re:Pullin' a Gates? by itzly · · Score: 1

      No, outside of the edge cases, using 4 smaller cores instead of a single big one will not make foo() go faster.

    24. Re:Pullin' a Gates? by TheRaven64 · · Score: 2

      If you look at a typical web page, you have a load of images, a few iframes with ads, scripts (possibly with multiple web workers). Each one of those really wants to be a separate security domain. You don't want a vulnerability in libpng (something that has happened many times before) to be able to do anything other than break the single image that it's decoding. This kind of fine-grained security is a lot easier if you have the ability to have a load of cheap threads.

      --
      I am TheRaven on Soylent News
    25. Re:Pullin' a Gates? by itzly · · Score: 1

      You may need a bunch of cheap threads, but that doesn't mean they'll run faster on separate cores. Unless you have really fast I/O (most people don't) a single core should handle it just fine.

    26. Re:Pullin' a Gates? by TheRaven64 · · Score: 1

      They're likely to be bursty and when you get the data you want to run most of them in parallel. Add to that, current constraints on CPU design (Dennard Scaling no longer working) mean that adding a load of cores that spend most of their time sleeping is actually quite an easy thing to do.

      --
      I am TheRaven on Soylent News
    27. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Please provide proof for such a hefty statement. There is none. You cannot prove that there are no methods that will efficiently use 1000 core processors. All arguments use today's thinking...

    28. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Proving that something does not exist is always quite tricky.

    29. Re:Pullin' a Gates? by SuricouRaven · · Score: 1

      But by the time you've finished reading the first paragraph of the first page, the other nine are loaded even if you can't parallelise.

    30. Re:Pullin' a Gates? by AK+Marc · · Score: 1

      Chrome, Opera, and others load each tab/page in a separate process. Why isn't everyone doing that? Quad core, 2 threads per core lets me run 100+ pages, and any one or two of them freezing up won't cause a problem.

      It doesn't have to be truly parallel, just separate. There's a difference.

    31. Re:Pullin' a Gates? by SuricouRaven · · Score: 3, Informative

      If massive-neural nets do reach common use (Which isn't that likely, they are somewhat overhyped) then I'd expect to see specific accelerators designed to run them. Probably something like FPGAs: Software writes the net, hardware executes it. A general-purpose processor (Probably x64 or ARM) does the coordinating, but augmented by specialised or semi-specialised hardware for certain tasks. Very much as we have today with hardware acceleration of 3D graphics or video decoding.

      You can see the trend already. 3D acceleration was introduced for graphics, but then repurposed for other things, and followed up with revised graphics architectures designed for non-graphics applications. They are still useless for general-purpose computing, their architecture too limited, but used in conjunction with a general processor they can greatly outperform the processor alone on things like image processing, cryptographic tasks, physics simulation and such. It's now quite common to see even consumer applications, with games using physics simulation to provide much more detailed rigid-body simulation than was previously possible - ie, more bits of shrapnel and chunks of corpse bouncing around when you lob that grenade.

      As for neural nets, you probably won't see much need to simulate huge ones. Small ones work surprisingly well, and their applications are really quite limited - they aren't some magic AI bullet that turns into a functional mind if you make them big enough. They excel at classification tasks, so they are very handy in OCR, handwriting recognition, speech recognition and such. Google made one that can recognise cats, and if you can recognise cats then you can recognise other things, so straight away I'm seeing applications in web filter software.

    32. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Linus's argument basically boils down to, "Parallel algorithms are sorcery, and the only places they matter are applications that demand performance, which are indeed increasingly using parallelism".

      No, that's a strawman you're making up. Linus' argument is "heavy parallelism means a lot more cores, which means a lot more simple cores for the same power usage, which just doesn't make any sense except in specific use cases" (he lists graphics (aka GPUs) and servers). Once you get there, you realize that there's nothing wrong with using the GPU for those heavy parallelism edge cases and recognizing that the CPU is still going to have a few cores, and not many mostly idle cores that otherwise just waste power.

    33. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Find a file called "bill-gates-1989.mp3". It's a bit over 80MB and about an hour and a half long. Somewhere in there is where he makes the memory statement while talking about how *he* designed the memory layout of the IBM PC.

    34. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1, Informative

      Find a file called "bill-gates-1989.mp3". It's a bit over 80MB and about an hour and a half long. Somewhere in there is where he makes the memory statement while talking about how *he* designed the memory layout of the IBM PC.

    35. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      considering the quote supposedly has been around since about 1981, that would obviously be where the quote came from.

    36. Re:Pullin' a Gates? by gnupun · · Score: 1

      What if the cores don't become much smaller while cores are added to your PC? Your general desktop/workstation can have up to 16 cores, each of which is more powerful than the previous generation core. Should we still do single-threaded programming for any time-critical foo() and run roughly 10 times slower?

      There's plenty of code that would benefit from the speedup of multi-core programming, not just some niche code.

    37. Re:Pullin' a Gates? by itzly · · Score: 1

      A 3+ GHz single core CPU is easily capable of decoding images that come in at full speed over a typical internet connection. You may be able to use multiple cores, but it's not going to make the overall page loading any quicker than using a single core.

    38. Re:Pullin' a Gates? by itzly · · Score: 1

      Obviously, adding more big cores is better, but that's not what Linus was talking about.

    39. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      No, that is just where he talks about the 640k barrier and why it existed (i.e. because of the architecture of the 8088 chip). The quote supposedly originates almost a decade earlier. By 1989 we already needed more than 640k. Hell, even my cheap arse machine at the time had more than 640k; from memory it had 4x 256k modules.

    40. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1
      http://en.wikiquote.org/wiki/B...

      I have to say that in 1981, making those decisions, I felt like I was providing enough freedom for 10 years. That is, a move from 64 K to 640 K felt like something that would last a great deal of time. Well, it didn't - it took about only 6 years before people started to see that as a real problem.
      1989 speech on the history of the microcomputer industry.

      Said in 1989 about a 1981 decision. While it's not "640kB should be enough for anyone", verbatim, it's where that quote seems to come from.

    41. Re:Pullin' a Gates? by bloodhawk · · Score: 1

      Exactly, he never made the statement he is quoted as saying. There is a massive difference between what he is quoted as saying and what is said in that presentation. He also discusses in other interviews how he wanted the limit to be higher but was restricted by the chip architecture, and thought it would be good enough for the lifetime of the architecture. He was actually pretty close to being right.

    42. Re:Pullin' a Gates? by itzly · · Score: 1

      For a fair comparison, you shouldn't compare a single core with 16 cores, each the size of the single core. Instead, you should keep the number of transistors fixed, and decide whether you want to divide them into 4 big cores, or 16 smaller ones (with smaller caches). And since we're talking about general purpose PCs, you should consider a typical mix of user applications.

    43. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      As if he had a hand in the design of the hardware anyway. Micro-Soft was the second or more choice for providing the OS for the IBM Personal Computer anyway.

    44. Re:Pullin' a Gates? by bloodhawk · · Score: 1

      Microsoft supposedly had a very large influence over which chip went into the IBM PC. Supposedly they are the reason IBM went with a 16-bit chip instead of an 8-bit one, as they talked IBM into changing, and they were also considering a 32-bit chip from Motorola.

    45. Re:Pullin' a Gates? by TheRaven64 · · Score: 2

      First, that's with a single thread and a single security context. If each one is an isolated sandbox it's not the case (trust me on this: it's my research area and we've done a lot of benchmarking). Second, even if it were true, it would be a lot less power efficient. If you can parallelise your workload, then two 1.5GHz cores will use less power than one 3GHz one. Four 750MHz cores will use less still.

      Until a few years ago, most computers had a single core, so there wasn't much point trying to exploit parallelism and the fastest way of implementing many problems was to serialise them. That's no longer an automatic win.

      --
      I am TheRaven on Soylent News
    46. Re:Pullin' a Gates? by jhol13 · · Score: 1

      Several processes in multicore is parallel.

      And it needs to be parallel. For example my current desktop is A8-5500, got it for ~$100. Four times more single thread performance - how much would that cost?

    47. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Actually, using the 68000 series would have been a great time saver for programmers and obviated all the nightmares of near and far pointers which plagued most programmers well into the 90s. Of all the 16-bit chips available, the 8086 was the least 16-bitty of them, with some details directly inherited from the 8008/8080 for easy upgrade of assembly programs. Intel published a guide for the conversions from 8080 to 8086 assembly.
      The only advantage of the segment registers is that they could be used as a very poor man's MMU, but as soon as a program overflowed 64KB code + 64KB data (you had to set DS=ES=SS to be in genuine 8-bit mode compatibility, but CS could be different without much trouble) and you had to use far pointers for data accesses, it became a nightmare and it was slow (32-bit pointers for 20 bits of address space, what a waste!). Other details come from 8080 compatibility: rotates and shifts only affect the carry and overflow flags and not the zero flag, instruction encoding space is wasted on short-form instructions using the accumulator, a separate I/O space lingers despite recommendations to avoid it since the inception of PCI, the LAHF/SAHF instructions (to emulate some 8080 pushes and pops), a parity flag that no one uses (except perhaps for floating point compares, see below; at least the Z80 used the parity flag for overflow for instructions where parity does not make sense) and probably a few other things that I forget.
      Of course, the other nightmare of the x86 crapitecture was the x87 FPU with its register stack, which drove compiler writers insane. The layout of the flags in the x87 status register is "interesting", indirectly driven by the desire to map to the x86 flags, i.e., the 8080 flags through FSTSW and SAHF (again!) in the x86 flags. The parity flag is used to test for unordered comparisons, but most programmers and many compilers get it wrong. Don't get me started on the numerous problems that the intermediate excess precision of the x87 registers caused, with results depending on optimization levels and compiler vagaries.
      In short, IBM made the worst possible choice in 1981 and I shall never forgive them. They have driven us into a monoculture in which Intel has far too much power on the computing landscape.
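
      The near/far pointer complaint above follows from how the 8086 forms a physical address: a 16-bit segment is shifted left by four bits and added to a 16-bit offset, giving a 20-bit result. A minimal C++ sketch of that calculation, assuming real-mode behaviour on the original 8086 (the name physical_address is made up for illustration):

      ```cpp
      #include <cstdint>

      // Hypothetical helper: real-mode 8086 address translation.
      // physical = segment * 16 + offset, truncated to 20 bits because the
      // original 8086 has only 20 address lines (addresses wrap at 1 MB).
      constexpr std::uint32_t physical_address(std::uint16_t segment, std::uint16_t offset) {
          return ((static_cast<std::uint32_t>(segment) << 4) + offset) & 0xFFFFFu;
      }

      // Many segment:offset pairs alias the same byte, which is part of why
      // comparing far pointers was error-prone.
      static_assert(physical_address(0x1234, 0x0010) == physical_address(0x1235, 0x0000),
                    "different far pointers, same physical byte");
      ```

      So 0x1234:0x0010 and 0x1235:0x0000 both land on 0x12350, and a 32-bit far pointer still only reaches 1 MB.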

    48. Re:Pullin' a Gates? by Rei · · Score: 0

      It's not a straw man at all - I know how to read. He literally says "magical parallel algorithms" are needed to make use of hundreds of cores. And he says "The only place where parallelism matters is in graphics or on the server side, where we already largely have it", which is nothing more than pointing out that apps that need high performance (aka, graphics and servers) *do* in fact use parallelism. What you claim is "Linus's argument" isn't even brought up until the last paragraph. And the only reason that the CPU would "still going be a few core and not many core" is if programmers don't thread their CPU-intensive apps sufficiently. Which, one should note, are mainly graphics and server stuff, the things that Linus notes *are* being threaded.

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    49. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Look at what you are saying:
      My SOFTWARE has problems PERFORMING ITS BASIC OPERATIONS, and I know this because OTHER SOFTWARE performs the same basic operations better.
      More cores will solve nothing, if it still takes forever to do a job that is supposed to be instant if not constrained by network lag.

    50. Re:Pullin' a Gates? by visualight · · Score: 1

      Instead of paraphrasing why not just quote him directly? It's not a long article and no one will think 'strawman'.

      "Big caches are efficient. Parallel stupid small cores without caches are horrible unless you have a very specific load that is hugely regular (ie graphics)." ...
      "the crazies talking about scaling to hundreds of cores are just that - crazy."

      In that context, he's right. If you're doing hundreds of dumb cores you should be using a GPU already.

      --
      Samsung took back my unlocked bootloader because Google wants me to rent movies. They're both evil.
    51. Re:Pullin' a Gates? by HuguesT · · Score: 2

      Thanks, interesting document, found here. The audio is really bad at the beginning and fluctuates throughout the talk. The interesting bit that you refer to is at 21 minutes from the start.

      I'm trying to type in what he said directly from the audio:

      The 16-bit design gave us a megabyte of memory. The 8086 has a 20-bit address. It is really a segmented 16-bit data path with segment registers that are really indexes. It is a 1-MB address space. And in this original design I took the upper 384K and tied it to a certain amount to provide for memory video, the ROM and I/O. And that left 640K for general purpose memory. And that leads to today's situation where people talk about the 640K barrier. The limit to how much memory you can put to these machines. I have to say that in 1981 while making those decisions I felt like I was providing enough freedom for 10 years. That is, a move from 64K to 640K felt like something that would last a great deal of time. Well, it didn't. It took only 6 years before people started to see that as a real problem.

      Fortunately, there is a reasonable solution. Intel has moved forward with its chip families, the 286 chip introduced in 1984 moves us to a 24-bit address space (mumbles about segmented indirection, being not that good). That is sort of an intermediate milestone. In 1986 we moved up to the 386 where we get a full 32-bit offset to these segments that have been designed in this architecture. So what we have is a machine that can address 4GB of RAM. And I have to say with all honesty, I believe that it will take us more than 10 years to use up that address space.

      So he never makes that exact quote, however one can understand why people picked it up. Essentially, BG thought in 1981 640K would be enough for everybody for a long while. Note that he was reasonably prudent regarding using up the 32-bit address space (that ship has sailed now).

      Later, regarding memory, he says that computers should have about 1MB of RAM per MIPS. Specifically, he goes on to say that machines with 30-60MB of RAM should be desirable soon (in 1989).

      In this talk he talks about many things, most are pretty insightful in fact: OS design, multitasking, parallelization, multi-processor designs, dynamic linking, object-oriented design. Funnily, he talks at length about OS/2 in a very positive way. This was before Windows 3 of course. He compares OS/2 and Unix, saying that OS/2 will take over the desktop and Unix the servers, and all other OSes will die out. He talks about the FSF, saying its task of creating a free Unix-like OS is doomed.

      Some interesting comments on that talk here.

    52. Re: Pullin' a Gates? by Anonymous Coward · · Score: 2, Insightful

      There has been a push back against integrating ANNs into mobile platforms. I think low power real time classification is simply missing an application in the mass market that can't be solved by offloading to a server. We simply assume that we are continuously connected to a sufficiently large data pipe and the problem goes away. Whether the hardware changes on the server side or not is a question of power savings, but I doubt we will see gains in performance over software implemented on server farms.

      That said, if we put our future caps on, is there a point at which our electronics gather so much data for processing that pushing it into the cloud becomes cost and time prohibitive? If wearable electronics become a pervasive technology, we may need some on-board continuously learning classifier cores to locally fuse sensor data rather than sending raw data into the cloud. This is where we could see truly assistive computing without the creepier general intelligence Hassabis and crew are working on at DeepMind.

      Imagine you have a conversation with your wife and she says the kids need to be picked up at 4 on Tuesday. If my phone put a reminder on my calendar for me based on my continuous audio stream, the mental offload would be huge as I could seamlessly continue with my day without managing my calendar, but I don't want to continuously stream my audio to Google nor do they want to continuously process the sound of me typing and sipping coffee... That's what we have the NSA for.

    53. Re:Pullin' a Gates? by itzly · · Score: 1

      trust me on this

      Of course, if you add enough cruft, you can slow anything down to a crawl.

      Second, even if it were true, it would be a lot less power efficient. If you can parallelise your workload, then two 1.5GHz cores will use less power than one 3GHz one

      You need a number of bit flips to solve a problem. Energy is related to the number of bits flipped. If you use twice the bits at half the speed, the energy requirements will be the same. By splitting the workload over multiple cores you have more overhead, so more energy is required.

    54. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1

      It's not a straw man at all - I know how to read. He literally says "magical parallel algorithms" are needed to make use of hundreds of cores. And he says "The only place where parallelism matters is in graphics or on the server side, where we already largely have it", which is nothing more than pointing out that apps that need high performance (aka, graphics and servers) *do* in fact use parallelism.

      Exactly. Because of this...

      What you claim is "Linus's argument" isn't even brought up until the last paragraph. And the only reason that the CPU would "still going be a few core and not many core" is if programmers don't thread their CPU-intensive apps sufficiently. Which, one should note, are mainly graphics and server stuff, the things that Linus notes *are* being threaded.

      Which is precisely the point. Programmers don't thread their CPU-intensive apps sufficiently because (1) a lot of tasks don't parallelize well and (2) programmers have consistently proven to suck at coming up with ways to adequately parallelize programs. Ergo, the stuff that people know will improve is heavily threaded. And the rest remains using one or two cores, max. Honestly, this is the same crap that came up with the Itanium and someone else pointed out the whole "magical compilers" comment because the whole notion that you'll get it fixed on that end is just as absurd.

      Simply put, sure, on paper 50 cores look nice. But actually keeping them all busy for most computers is just near impossible. Again, you have to point at specific points where it's useful and you end up having a specialty CPU like a GPU for just that task and for which (1) there are known heat issues and (2) most of the time it's hardly used. Golly, just what Linus was saying.

      PS - You see, it's not that parallel algorithms are sorcery. It's that advocates for massively parallel general processing units seem to believe in magical parallel algorithms that work generally and show non-negligible improvement that negates all the lower clock-rate/processing that's done on each of these micro-cores to deal with power usage/heat issues. And those algorithms just don't exist and apparently there aren't enough genius programmers to use them even when they do exist (unless it's bundled in a library and mostly hidden and ends up 99% of the time under-utilized). See the difference between what does exist and is known to work and pretending that one can extrapolate it to known bad use cases?

    55. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1

      Linus is 100% wrong.

      That single leading statement made your entire post worthless. You know that Linus is not 100% wrong so you knowingly lied. That was stupid of you.

    56. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Rendering a web page is fundamentally a serial task. The data structure is hierarchical, but you can't set (say) one core on parsing and rendering <div id=footer> while another parses and renders <div id=header>, because the footer needs to know where the header ends. Now, if there were a way to structure web pages that made multiple elements independent of each other, then you might really benefit from multiple cores. In most cases, though, I would expect that slow rendering of multiple simultaneous pages is more dependent on network latency than on processor saturation.

    57. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1

      Also it's a false dichotomy - Linux can support traditional architectures & massively parallel ones at the same time without making the traditional performance worse.
      Amateur packet radio stuff is supported without fighting about it being an edge case.

    58. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      An i7-4790k has approx 2.5x single thread performance over the latest A10 APUs. so... $329 on sale.

    59. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Let me guess, you don't pay the electricity bill in your household. More big cores = bigger bills even at idle. Linus covered that in the rant too.

    60. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      the language is the worst I have seen since APL

      Blasphemer! Everyone knows that APL is Easy, and that APL is elegant.

    61. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      "Your Linux gaming machine shouldn't be doing more than 3/4 cores of CPU and handing the heavy grunt work off to the GPU anyway. No need for a 64 core CPU for that one. "

      You are wrong.

      a) There is no such thing as a Linux gaming machine. Maybe some day, but currently gaming on Linux isn't a gaming hobby, it's an "I run games on Linux" hobby.

      b) Game coders just love to write bad code. They can easily waste any amount of cores for things like AI. Heck, just give every AI instance its own core. No upper limit visible there. The more you can simulate the more they will. It's the same with GPUs, you are still nowhere near being able to draw the whole scenery to the horizon without huge tricks. Gaming will use every last flop you can give it. The simulation will just get more complex, there is no upper limit. World is not enough in this case.

    62. Re:Pullin' a Gates? by Anonymous Coward · · Score: 1

      Linus is WRONG on this. He may be right for today's tasks, but like the (purported) Gates quote, it will not hold for tomorrow's tasks. It is easy to extrapolate today's usage into tomorrow and turn out being wrong. It is more difficult to try to predict future uses.

      As machines are asked to do more, for example, speech, vision, AI-ish type tasks (e.g. Google's self-driving car), massive parallelization will become critical. There are likely many other uses that will come up that wouldn't come to mind immediately. Perhaps the computer will track your eye movements (more than some do now) to try to anticipate what you are going to do and pre-calculate/prepare something for you. Or your tablet pays attention to its surroundings in more than a very superficial way so that it can be context aware. There are many "little" improvements that may provide large benefits when combined, but that need continuous input, kind of like our hearing and eyesight.

      Think about 40 years ago what the use cases were in 1975. The PC wasn't even a mainstream use and barely available.
      Think about 30 years ago what they were in 1985. Networks were used in colleges, but few thought about the internet being pervasive in the mainstream - certainly in CS departments they did, but primarily for research usage.
      Think about 1995, the Web was still being dismissed as a fad - Krugman said it would have as much impact as the fax machine even a few years later.
      Think about 2005, the cell phone as a miniature computer was considered, but just a miniaturized version of a PC, not as something with a touch interface etc.

      Not to knock Linus, but it is hard to predict what is coming, I don't know, just that it probably will be something that doesn't just involve a higher res screen, and faster CPU, but something that additional processing power will allow that no one considers important now, but will end up being a game-changer when you can have 10 cores each doing 10 different things and still have 900 cores to spare.

      And IAACSWAAD (I am a computer scientist with an advanced degree).
      (Sorry for any typos - continuous spell check is an example that wouldn't have been considered 30 years ago on a phone)

    63. Re:Pullin' a Gates? by drinkypoo · · Score: 1, Flamebait

      So he never makes that exact quote, however one can understand why people picked it up. Essentially, BG thought in 1981 640K would be enough for everybody for a long while.

      "Bill Gates, CEO of Microsoft Corp. a fiercely competitive company(...)" - Microsoft Encarta, 1996
      "Bill Gates, CEO of Microsoft is a contributor to several charitable causes, including...(...)" Microsoft Encarta 2000

      In some other discussion about whether BG ever said 640k should be enough for anyone, evidence was presented including an eye- (and ear-)witness account of when he did say it. But hey, let's not lose any sleep over whether Bill Gates is a liar, because we know he is. The DoJ had him over a barrel. Then the Gates Foundation was created to promote the goals of Big Pharma and Strong IP law, if you look at where they spend their money and the terms under which these foreign nations get aid from the foundation, it's clear what their actual goals are.

      If you have a hard time believing that Gates ever said 640k should be enough for anyone, keep in mind that 1) he has claimed that he personally created the 640k limit, 2) that he could easily have said it meaning that 640k should have been enough for anyone at that moment, justifying the 640k barrier*, and 3) Bill Gates is a liar, and has been proven such in court.

      * The chip could only address 1MB, video memory had to go somewhere, there had to be a split somewhere, the design may well have been completely justified.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    64. Re:Pullin' a Gates? by drinkypoo · · Score: 0

      The CAD, video and HTPC use-cases are already solved by the GPU architecture and don't need to be re-solved by inefficient CPU algorithms.

      If the CPU becomes more like the GPU, then it won't be inefficient. It will be more efficient, because you won't be having to shovel data back and forth between them, nor will you have to have two different actors both trying to access the same data in main memory. The CPU will do all the work.

      We only use the GPU for computing on the desktop because our GPU has so much power. But as computing power improves, we'll find new ways to use it, and eventually the CPU will outstrip the GPU again, because it's easier to program for just one fat core than to have to try to utilize two. At least, that's the pattern of history.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    65. Re:Pullin' a Gates? by DarkOx · · Score: 1

      You are correct, a parallel algorithm is going to be more complex, requiring more total operations. In a world of frictionless pulleys and perfectly spherical cattle you are sure to be right.

      We don't live in that world though. In practice higher clock speeds usually require higher voltages for circuits to stabilize. Higher voltage means more current is going to flow, and the battery will be drained quicker.

      It's likely the case that manufacturing and materials constraints are such that we can economically build a 1.5GHz part that uses fewer watt-hours per operation than a 3GHz part. If the overhead of parallelism is kept to a minimum, it's entirely possible two 1.5GHz parts could do the same work as a single 3GHz part in nearly the same amount of time using less power.
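
      To put rough numbers on the argument above, here is a first-order sketch using the usual CMOS dynamic power approximation. It assumes the same switched capacitance C and activity factor \alpha per core and ignores leakage; V_lo and V_hi are the supply voltages needed at 1.5GHz and 3GHz, and any concrete values for them are assumptions, not measurements of a real part.

      ```latex
      P_{\text{dyn}} \approx \alpha C V^{2} f
      \qquad\Longrightarrow\qquad
      \frac{P_{2 \times 1.5\,\text{GHz}}}{P_{1 \times 3\,\text{GHz}}}
        = \frac{2 \cdot \alpha C \, V_{\text{lo}}^{2} \cdot 1.5\,\text{GHz}}
               {\alpha C \, V_{\text{hi}}^{2} \cdot 3\,\text{GHz}}
        = \left( \frac{V_{\text{lo}}}{V_{\text{hi}}} \right)^{2}
      ```

      At equal voltage the ratio is 1, which is the "same energy" objection. With, say, a hypothetical 0.9V versus 1.2V it is about 0.56, so the saving comes entirely from the lower voltage and has to outweigh whatever the parallel overhead costs.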

      --
      Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
    66. Re:Pullin' a Gates? by buckfeta2014 · · Score: 1

      The layout of the flags in the x87 status register is "interesting", indirectly driven by the desire to map to the x86 flags

      An Intel FPU was designed for an Intel CPU/APU? Who would have thunked it.

      --
      Buck Feta. You know what to do.
    67. Re:Pullin' a Gates? by dj245 · · Score: 1

      Your Linux gaming machine shouldn't be doing more than 3/4 cores of CPU and handing the heavy grunt work off to the GPU anyway. No need for a 64 core CPU for that one.

      I think you are being a little shortsighted here. AI for NPCs could be incredible if each NPC had its own core. Real life people analyze every action you take, no matter how small or insignificant. Real life people discard or take notice of these actions, weigh (rank) the important actions, and then combine the most important actions in consideration of what you are thinking or what you might be likely to do. Real people analyze the actions of all the people around them, and take that into consideration when dealing with a person too. In a computer, each AI thread could do all these things too, but nowadays we normally use tricks and hacks since computing power is in short supply for AI.

      Doing this well takes a large amount of computing power, and there is no reason it can't be parallelized: real life people act in parallel and aren't all part of the same computing "thread". Simulating that doesn't have to be in the same computing thread either, but nowadays it often is because the vast majority of computers are limited to 2 to 8 cores.

      --
      Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
    68. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Javascript? Although the language is the worst I have seen since APL, a smart compiler could at least in some cases parallelize it (maybe with speculative execution or like).
      And so on.

      Typical javascript code cannot easily be parallelized (and no, a raytracer writing stuff in a canvas element is not typical js code). Most of the time, (client-side) javascript code manipulates the DOM, and because of the "nice" property that the DOM is always a tree, this means putting a lock around any reparenting that happens, thus preventing any speed-up. Unless you use very advanced runtime analysis techniques (and even in that case) you'll end up with a GIL around the DOM and that's it (or you need to have a lock-free implementation of the DOM interface and that in itself is already quite the challenge).

      Plus, as others have said, page rendering is more I/O bound than CPU bound these days (for bad reasons sometimes. Like scripts dynamically adding image/font elements while loading the page, which prevents the browser from fetching them early and makes it wait while the resource is downloaded. Also, badly behaved javascript code may cause several page reflows, which means that what appears to be a single "long time to render" event is actually several re-renders in succession).

      Also, if Javascript is the worst language you have seen since APL, I take it that you have never seen PHP. *That* is an example of a server-side language where parallelization could help, but won't because of the brokenness of the language/runtime/everything.

    69. Re:Pullin' a Gates? by Half-pint+HAL · · Score: 2

      Point of Linus was, taking a 6 core CPU, and replacing 2 cores with more cache and more transistors per core should make almost anything on Desktop run faster.

      There's an element of truth to this, but on the other hand, cache space is already big enough that it hits the law of diminishing returns. Yes, the biggest performance hits in current computing are cache misses. But cache misses are already unexpected events, and cache misses are of biggest concern to the user when there are lots of them at once -- ie when iterating through a large bit of data. Text searches on large documents in a complex format (eg MS Word). Making a global change to a large file. These are the situations where performance matters, and these are the situations where you're going to get cache misses. Torvalds dismisses photo editing as a task for "professional photographers", but our amateur cameras are taking phenomenally detailed pictures, and even making fairly simple edits is a compute-intensive task. He may be right, but he may equally be wrong.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    70. Re:Pullin' a Gates? by DarkOx · · Score: 1

      Okay fine you want to play word games have fun. Linus was obviously speaking in the context of the Von Neumann computers most of us are familiar with.

      I suspect if you asked him, does your quote apply to radically different architectural paradigms, he'd say "no".

      Programming is still a really fancy version of "IF A THEN B". "for X in GROUP do Z". "X = Y"

      Yes it is. I don't care what language you are using: for all the computing machines in common use, at some point a series of fairly limited branch, jump, add, subtract, multiply, and move-like instructions has to be generated. This may even hold true for the basic units of computation that participate in whatever system is ultimately able to handle very arbitrary requests like "please rough me out a flyer for our yardsale on Saturday."

      I say you are the one moving the goal posts. Linus and *most* of the other people working on parallelism solutions are working/speaking in the context of computers like the ones we know today; you're the guy trying to apply what they say to *any* computer. Linus will probably be proved correct there. Past n cores the fundamental architecture in use today will not scale but for niche cases.

      --
      Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
    71. Re:Pullin' a Gates? by wisnoskij · · Score: 1

      "SW architecture problem"
      And the software is changing. It used to be that no one needed x64, as all software was written for 32-bit so getting a 64-bit processor would not make anything faster. And then Crysis came out with a 64-bit edition, and other applications followed suit. The next game that absolutely blows everyone's minds, and stretches what we think is possible, will be released to take advantage of multiple cores (8-16, or possibly even more), and slowly applications will follow (and this is not far away). Trust me, right now loads of people are working on this problem, and it is a big problem. We need better compilers at least, possibly brand new languages or ways of using existing ones, but everyone who has bought a computer in the last 5 years has at least 2 cores.

      --
      Troll is not a replacement for I disagree.
    72. Re:Pullin' a Gates? by AchilleTalon · · Score: 1

      A single killer app is a niche. They already exist and they do not change anything more than require specialized hardware to get handled. You don't change a general purpose computer for a specialized one unless you are running only a single killer app and nothing else.

      It seems pretty obvious many people comment here and have no experience at all of parallel programming and parallel architectures. Also, it would help a bit to read not only the article but the other referred links.

      --
      Achille Talon
      Hop!
    73. Re:Pullin' a Gates? by Half-pint+HAL · · Score: 1

      You need a number of bit flips to solve a problem. Energy is related to the number of bits flipped. If you use twice the bits at half the speed, the energy requirements will be the same. By splitting the workload over multiple cores you have more overhead, so more energy is required.

      Energy is related to the number of bits flipped... true. But there are also other factors in the equation. A more efficient processor uses less energy to flip a bit. Slower processors are generally more efficient.

      Now, the GP poster told us he's a researcher working on parallelism and efficiency. What is your qualification that allows you to dismiss his expertise and assume that he's writing crap software to test out his theories?

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    74. Re:Pullin' a Gates? by Half-pint+HAL · · Score: 1

      Well, embedded images, flash animations, iframes etc can all be handled in parallel. They're certainly threadable.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    75. Re:Pullin' a Gates? by chthon · · Score: 1

      The thing is that I remember reading it in Elektor around 1982 or 1983. I think in the context of an electronics show.

    76. Re: Pullin' a Gates? by Anonymous Coward · · Score: 0

      That sounds like a typical Linus rant: arguing against a position that is exaggerated to the point of being an inaccurate straw man.

    77. Re:Pullin' a Gates? by AchilleTalon · · Score: 1

      A massively parallel system is not necessarily a GPU. GPUs are a class of massively parallel systems, not the only one. For the rest of your post, other commenters reflected my thinking about it.

      --
      Achille Talon
      Hop!
    78. Re:Pullin' a Gates? by meta-monkey · · Score: 1

      One of the reasons we started moving to multiple cores was because of increasing chip size and density and the limitations of propagation delay. This was actually the basis of my master's thesis (but this was ten years ago and now I just do software, no computer architecture). Thing is, in a single clock tick of a 1GHz processor, light can only travel .3m. And electrons moving through a wire are about a third that speed. Plus gate delays, wire capacitance, etc. Point is, it was getting to the point where you couldn't get from one side of the chip to the other in a clock cycle. So it made sense to keep signals local, and only pay the propagation delay penalty when you needed to. So you can't necessarily say 4 big cores are better than 16 small cores. Otherwise, we never would have bothered with 4 cores to begin with. We would have just kept making bigger and bigger single-core processors.
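
      The 0.3m figure above is just c/f. A small C++ sketch of the arithmetic, where the function name is made up for illustration and the one-third velocity factor is taken from the comment as an assumption rather than a measured value for any particular process:

      ```cpp
      #include <cassert>

      // Speed of light in vacuum, metres per second (exact by definition).
      constexpr double kSpeedOfLight = 299792458.0;

      // Hypothetical helper: how far a signal gets in one clock period.
      // velocity_factor is the assumed fraction of c at which the signal travels.
      inline double distance_per_tick_m(double clock_hz, double velocity_factor) {
          assert(clock_hz > 0.0 && velocity_factor > 0.0 && velocity_factor <= 1.0);
          return kSpeedOfLight * velocity_factor / clock_hz;
      }
      ```

      distance_per_tick_m(1e9, 1.0) comes out at roughly 0.30m. With the assumed one-third factor it is roughly 0.10m at 1GHz and roughly 0.033m at 3GHz, before gate delays and wire capacitance are counted.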

      As for what's "faster" it depends very much on the algorithm. There are some algorithms with coarse-grained parallelism, some with fine-grained parallelism and some with none at all. What you're running, how it's programmed and what your OS is doing matter a lot.

      --
      We don't have a state-run media we have a media-run state.
    79. Re:Pullin' a Gates? by Anonymous Coward · · Score: 2, Interesting

      Perhaps in Linus's dislike for C++ he's missed how trivially easy it's gotten to launch threads in C++11, but it takes less work now than a for-loop, since std::thread is so simple and you can inline the command with a lambda. And you have a nice clean mutex library including scoped mutexes like std::lock_guard so you don't even have to remember to unlock them.

      He doesn't mention C++ or anything like that. What he is talking about is that the overhead for task switching is pretty large, so in cases where a tradeoff is made between the performance of a single core and adding more cores to a CPU, you will typically get more performance gain from fewer, better cores, since the tasks most users do most of the time are of a nature that doesn't lend itself to parallelization. In those cases where it is easily done, it is already delegated to dedicated hardware like a GPU.
      For your typical for-loop that is so easy to launch threads for, the problem is that the overhead of moving the task to another core with another cache is so high that you don't get a performance gain. There are still cases where it makes sense to launch threads, but people who do it without thinking because "parallel is better" are the kind of programmers who jump on every new programming fad.

    80. Re:Pullin' a Gates? by Rei · · Score: 0

      Which is precisely the point. Programmers don't thread their CPU-intensive apps sufficiently because (1) a lot of tasks don't parallelize well and (2) programmers have consistently proven to suck at coming up with ways to adequately parallelize programs.

      That's simply nonsense. Probably the most common algorithm in the book, what wastes probably 80% of compute cycles in some form or another, simplifies down to:

      for (auto& i : container)
          i.do_something_independently_to_or_with_this_object();

      .... which can be parallelized tremendously, up to container.size(). Probably the next most common cycle waster is along the lines of:

      while (true)
      {
          do_regularly_occurring_largely_independent_task_1();
          do_regularly_occurring_largely_independent_task_2(); // ...
          do_regularly_occurring_largely_independent_task_N();
      }

      Which again, can be parallelized trivially.
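
      As a rough sketch of the first loop split across threads, using nothing beyond C++11's std::thread. parallel_for_each is a made-up helper name, not a standard function, and it assumes the per-element work really is independent (no shared state touched by fn):

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <cstddef>
      #include <thread>
      #include <vector>

      // Applies fn to every element of items. The index range is cut into
      // contiguous chunks and each chunk runs on its own std::thread.
      // fn must not touch shared state unless it does its own locking.
      template <typename T, typename Fn>
      void parallel_for_each(std::vector<T>& items, Fn fn, unsigned num_threads) {
          assert(num_threads > 0);
          const std::size_t n = items.size();
          const std::size_t chunk = (n + num_threads - 1) / num_threads;  // ceil(n / num_threads)
          std::vector<std::thread> workers;
          for (std::size_t begin = 0; begin < n; begin += chunk) {
              const std::size_t end = std::min(begin + chunk, n);
              workers.emplace_back([&items, fn, begin, end] {
                  for (std::size_t i = begin; i < end; ++i) fn(items[i]);
              });
          }
          for (auto& w : workers) w.join();  // wait for every chunk to finish
      }
      ```

      Called as parallel_for_each(container, [](Object& o) { o.do_something(); }, 4), where Object and do_something stand in for whatever the loop body does. Whether this beats the serial loop depends on how much work each element carries relative to the cost of starting the threads.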

      Of course there exist algorithms that can't be easily parallelized. That's not the point. They're not the most common cases. And I'd be glad to demonstrate this to you by, say, pulling up random pieces of source from the Linux kernel or whatnot and showing with real-world examples.

      Why don't people parallelize this sort of stuff? Because they're too lazy and most programming languages (but C in particular) make it too much of a hassle. Hence they only parallelize when forced to, which means that the general case of your software runs fine, but the edge cases run terrible. It has nothing to do with whether the app *can* be parallelized. The overwhelmingly vast percentage of code that could be sped up by parallelizing, 99.99% of the time isn't parallelized.

      And it's not just about performance, it's about general user experiences. Threaded code is simply more pleasant to use. If I'm using, say, Blender and I turn on decimate for a mesh, I have to sit there and wait for the mesh to finish decimating before I can do anything else. Anything at all. I can't go off and start developing a texture or tweak an unrelated object, I have to sit there and wait for that *one* task to finish. And sometimes these time wasters come accidentally or unexpectedly.

      Threads are Good Things(tm), and they're way underused. And seriously, you want to talk about wasted silicon, you have to look no further than all of the huge amount of silicon that gets wasted trying to gain small incremental improvements in serial processing speed.

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    81. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Your fanboy is showing.

    82. Re: Pullin' a Gates? by Lije+Baley · · Score: 1

      But I want to read the last paragraph of the last page first!

      Actually though, tab loading seems more dependent on external factors anyway...

      --
      Strange things are afoot at the Circle-K.
    83. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Oh it's definitely wrong. I'm sure most power users have noticed the gain going from 4 to 8 cores. It's only a matter of time before lame users catch up on resource usage of current power users. That hasn't shown any real sign of slowing down just yet.

      Honestly I can't see a future without parallel processing at its heart. But then I spent several years playing around with GPUs, pushing them as hard as I could just testing crazy ideas that require parallel processing.

    84. Re:Pullin' a Gates? by eth1 · · Score: 1

      Point of Linus was, taking a 6 core CPU, and replacing 2 cores with more cache and more transistors per core should make almost anything on Desktop run faster.

      The real problem is that some desktop tasks really need one thread to run as fast as possible, and others (path finding for 200 drunken Dwarf Fortress denizens, for example) would benefit from having 100 somewhat slower cores. When you buy a desktop CPU, all the cores are the same, and you end up having to compromise between number of cores, single-thread speed, heat, etc.

      Maybe it's time we started designing systems with two separate chips - one dual core chip optimized for running single tasks as fast as possible, and another with 10-50 simpler cores optimized for parallel tasks. I think we're halfway there already, what with GPUs being used that way to some extent, but standardizing it would actually allow non-custom applications to make use of it.

    85. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Yeah, believe it or not I actually worked with the "Father of the IBM PC" here in Florida. Bill Lowe was his name.

      We used to take tons of PC's up from Boca Raton to Jax, FL where our logistics hub was, and then ship them out all over the USA. We were the #1 IBM business partner, shipping more PC's than any other business partner before I actually worked directly with Bill at another company.

      During my discussions with Bill the topic of the 640K memory barrier came up a couple times. Trust me folks, you can thank the level of sophistication available at the time with Intel CPU's, IBM and the team at Boca for this one. Bill Gates had "nothing" to do with it. They did try to help folks out though with the Lotus/Intel/Microsoft spec for swapping memory in and out via memory expanders.

      Higher memory density (heck, all the DRAM at the time was 16 PIN "DIPS"), Page tables and virtualization tech changed all the rules that you take for granted today.

    86. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      See my other post, Gates might like to joke or take credit for the limitations in the early IBM PC's, but last I checked, he had nothing to do with the hardware engineering. The hardware design is what set the limits.

      He just bought a DOS program that he paid some stupid programmer to write for $18k or somesuch while signing away all rights to the IP, the rest is history.

    87. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      A neural net is a method not a purpose.

      A purpose is something like "to make sandwiches for me whenever I want them".

      "Computer... [chirp]... Make me a sammich!"

      If you can immediately follow that request with something like "Computer... [chirp]... where is the nearest restroom?" and get a complete, accurate, and valid response for both things, then you indeed have a general purpose computer (with a general purpose processor at its core).

    88. Re:Pullin' a Gates? by eth1 · · Score: 1

      Your Linux gaming machine shouldn't be doing more than 3/4 cores of CPU and handing the heavy grunt work off to the GPU anyway. No need for a 64 core CPU for that one.

      I beg to differ. Games that are trying to run hundreds/thousands of copies of a unit AI or pathfinding (Dwarf Fortress, RTSs, etc.), or are doing tons of physics (KSP, From the Depths, etc.) are what usually end up causing slide shows for me these days, not the graphics. More cores & threads, please. (Yes, I'm aware that a lot of times this is due to the games not taking advantage of existing cores)

    89. Re: Pullin' a Gates? by SuricouRaven · · Score: 1

      More likely outcome: The web-dev has a thirty-core monster workstation and produces a page without any thought for performance, because it works quickly for him. Then you try on your portable device, and it takes two minutes to load all the embedded video advertising.

    90. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      > What don't people parallelize this sort of stuff? Because they're too lazy and most programming languages (but C in particular) make it too much of a hassle.

      No, because it doesn't work.
      If your loop contains 10 instructions and loops 5 times there is _no way_ with any architecture anyone came up with so far that parallelizing will be fast. It actually will be almost certainly slower.
      And that is ignoring the problem with the "largely" of your "largely_independent_task". If that "largely" means you need a lock, you will now lose even more performance. Worse, you are _guaranteed_ to lose performance compared to a single-threaded implementation in the cases where you need it most: when the machine is already heavily loaded and you get only one core to work on with all your threads.
      This problem is harder than vectorization on loops, and if you ever tried gcc's auto-vectorizations you should know that it will almost always make your code slower.
      And please, do pull those examples from the Linux kernel and show me how you can parallelize them in a way that is not actually slower.
      People have been working on massively parallel architectures for over 30 years to my knowledge; the fact that none has become usable for general purpose tasks should be quite a strong hint that maybe you're wrong. And your point about threading on user-interface things is completely unrelated, as this is only about logical threading: there is no need for any parallelism or multiple cores whatsoever, and it was possible - though harder - to solve in the times of cooperative multi-tasking on single core processors. User-interface related threads will lead to exactly the behaviour where you have at most tiny amounts of time where you manage to have several cores busy. There is no way you will manage to consistently load 500 cores to 100% with that.

    91. Re:Pullin' a Gates? by marcosdumay · · Score: 1

      Every program has a very specific purpose, not a general purpose.

      A neural net is still a turing complete computer, by the way.

    92. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      That actually sounds like the most serial problem i can think of.

      1. first interpret voice
      2. make sense of the words
      3. produce flyer
          -Size
          -Graphics
          -Text
      (all of those need to know the one before it for it to make any sense, although you may exchange graphics and text with each other)

      It's impossible to do in parallel, and even if you could do it in parallel, it would be completed faster in serial.
      The computer cannot make sense of the words before it has interpreted them, and it cannot start the actual task before it knows what it is.

      Correct me if i'm wrong, but i see no parallel operations.

    93. Re:Pullin' a Gates? by Rei · · Score: 2

      If your loop contains 10 instructions and loops 5 times

      Duh. And that's obviously not what is being discussed here. Step up a level or 20 in the call stack.

      If that "largely" means you need a lock,

      "largely" meaning "does a bunch of stuff on its own and only briefly needs to lock common data structures to update based on the results of what it's been doing". That is by far the most common case in the real world. If you have a texture loading thread for a game it only needs to briefly lock the texture structure when it's gotten its latest texture loaded and processed. If you have a mesh tweaking function for a 3d editor it only needs to lock the list of meshes briefly to swap out its newly tweaked version for the old version. And on and on and on. The most common case doesn't involve locking to wait for calculations to be done from the other side, it just needs to lock briefly to make sure it doesn't read an incomplete state when the results of calculations are being written out.

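      A minimal sketch of that pattern with std::mutex and std::lock_guard. SharedMesh and tweak_mesh are invented names for illustration, and the "expensive" work is just a scale so the example stays short:

      ```cpp
      #include <cassert>
      #include <cmath>
      #include <mutex>
      #include <vector>

      // Hypothetical shared structure: vertex data guarded by a mutex.
      struct SharedMesh {
          std::mutex m;
          std::vector<float> vertices;
      };

      // Works on a private copy and holds the lock only to take the
      // snapshot and, later, to swap the finished result in.
      void tweak_mesh(SharedMesh& shared, float scale) {
          assert(std::isfinite(scale));
          std::vector<float> local;
          {
              std::lock_guard<std::mutex> lock(shared.m);  // brief lock: snapshot
              local = shared.vertices;
          }
          for (float& v : local) v *= scale;  // the long computation, no lock held
          {
              std::lock_guard<std::mutex> lock(shared.m);  // brief lock: publish
              shared.vertices.swap(local);
          }
      }
      ```

      One caveat with this sketch: if another thread changes the mesh between the snapshot and the swap, that change is overwritten, so a real editor would need a version check or a single writer per mesh.
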
      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    94. Re:Pullin' a Gates? by fisted · · Score: 1

      as all software was written for 32-bit so getting a 64-bit processor would not make anything faster.

      Like 64bit software would be somehow faster, rather than potentially slower, on a 64bit CPU, sure.

    95. Re:Pullin' a Gates? by CronoCloud · · Score: 1

      one dual core chip optimized for running single tasks as fast as possible, and another with 10-50 simpler cores optimized for parallel tasks.

      Sounds like "Cell", maybe Sony and IBM had the right idea before its time.

    96. Re:Pullin' a Gates? by phantomfive · · Score: 1

      To add to your point, it's worth remembering that doubling the processor speed will always give you a better return than doubling the number of processors.
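
      One way to see this is Amdahl's law, which the comment doesn't name, so treat it as an assumption about the reasoning. A minimal sketch, where p is the fraction of the work that parallelizes perfectly and n is the number of processors:

      ```cpp
      #include <cassert>

      // Amdahl's law: speedup on n processors when a fraction p of the work
      // parallelizes perfectly and the remaining (1 - p) stays serial.
      double amdahl_speedup(double p, unsigned n) {
          assert(p >= 0.0 && p <= 1.0 && n > 0);
          return 1.0 / ((1.0 - p) + p / n);
      }
      ```

      With two processors, amdahl_speedup(0.5, 2) is about 1.33 and amdahl_speedup(0.9, 2) is about 1.82; it only reaches 2 when p is exactly 1. A processor that is twice as fast speeds up the serial part too, under the idealized assumption that nothing else (memory, I/O) is the bottleneck.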

      --
      "First they came for the slanderers and i said nothing."
    97. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      for (auto& i : container)
          i.do_something_independently_to_or_with_this_object();

      .... which can be parallelized tremendously, up to container.size(). Probably the next most common cycle waster is along the lines of:

      Feel free to cite a random example in the Linux source that would present an obvious speed-up.

      while (true)
      {
          do_regularly_occurring_largely_independent_task_1();
          do_regularly_occurring_largely_independent_task_2(); // ...
          do_regularly_occurring_largely_independent_task_N();
      }

      Again, feel free to cite a random example in the Linux source that would present an obvious speed-up that doesn't risk issues of deadlock, livelock, or significant desyncing (presuming the example requires that) over the long term.

      Of course there exist algorithms that can't be easily parallelized. That's not the point. They're not the most common cases. And I'd be glad to demonstrate this to you by, say, pulling up random pieces of source from the Linux kernel or whatnot and showing with real-world examples.

      Please do, including some rough math of where and how there'd be a speed up. I mean more than just hand-waving with "it's not in parallel form, so it must be faster". As another post points out, you have to show there's enough work being simultaneously done that would otherwise bottleneck on one thread and take significantly longer.

      Why don't people parallelize this sort of stuff? Because they're too lazy and most programming languages (but C in particular) make it too much of a hassle. Hence they only parallelize when forced to, which means that the general case of your software runs fine, but the edge cases run terrible.

      Not just terrible but incorrectly. Race conditions, especially involving error cases, are a mess to clean up when you start running many threads simultaneously. Only in trivial examples do you have very little risk of that, and most of the time trivial examples aren't time intensive in the first place. Nevertheless, yes, if it all amounted to the fact that programmers are lazy and it's too much of a hassle, well, you're already signing the coffin on the idea.

      It has nothing to do with whether the app *can* be parallelized. The overwhelmingly vast percentage of code that could be sped up by parallelizing, 99.99% of the time isn't parallelized.

      Assuming the same CPU performance and simply more executable threads? Not just marginal, at best, returns with worst case scenarios that are horrible? And with code that's not much more prone to livelock?

      Sure, I'd love you to provide some real examples. Please do and let me eat my words.

    98. Re:Pullin' a Gates? by Kjella · · Score: 1

      If you look at a typical web page, you have a load of images, a few iframes with ads, scripts (possibly with multiple web workers). Each one of those really wants to be a separate security domain. You don't want a vulnerability in libpng (something that has happened many times before) to be able to do anything other than break the single image that it's decoding. This kind of fine-grained security is a lot easier if you have the ability to have a load of cheap threads.

      Per tab security so visiting myonlinebank.com and evilmalwaresite.com at the same time won't be a problem, sure, but honestly I don't care if one image can bork just that image or the whole webpage, since from my perspective they are equally untrusted. I request a page from slashdot.org and I don't want it to hose my machine. Slashdot embeds an ad image from their advertising network and it's the same. I suppose you could say that the malicious PNG can now social engineer the whole page or use another exploit in the HTML/Javascript engine to gain even more privileges, but that seems highly theoretical. Particularly since those should be in the same sandbox, since you can have bad HTML/Javascript too. I can't imagine the overhead of visiting Google's image search and having it spawn hundreds of security contexts; that seems like a total waste since they're all under Google's control.

      --
      Live today, because you never know what tomorrow brings
    99. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Was Bill Gates part of the team of 12 that designed the IBM PC? I was thinking he had already set up Micro-Soft at the time and was making cards for Apple's home computers. I wasn't aware that he had any influence over IBM during their development phase of the IBM PC.

    100. Re:Pullin' a Gates? by Zeromous · · Score: 1

      Yup, but there was no telling anyone else that.

      Enterprise CELL: Hey I hear these new blades are really fast! Let's throw the kitchen sink at them and prove to everyone they are garbage!

      Never thinking for one minute that their everyday tasks might have performed far better than on X86 if they had managed their processing differently, or at least attempted to test this difference. Instead it was: here's the worst workload we can think of for any processor, and then they tested that on CELL to find, meh, it's an underwhelming chip compared to Xeons.

      Well duh! CELL was never about being faster than a Xeon at general computing!

      --
      ---Up Up Down Down Left Right Left Right B A START
    101. Re:Pullin' a Gates? by laird · · Score: 1

      Thinking Machines did this. We had one front-end CPU that ran the sequential process that controlled everything, and thousands of parallel CPUs that did all of the heavy lifting by processing the data in parallel. For large data problems, it worked extremely well. Yes, at any given time some CPUs might not be doing work because they're waiting for other CPUs, but when you're pushing the performance (e.g. processing TB of data, doing PFLOPS) the cost of making a single CPU faster goes up much faster than the performance increase and then becomes impossible, while piling up more CPUs the performance goes up linearly. Of course, some problems don't parallelize in obvious ways, but IMO anything running on large data sets can be parallelized if you look at it right.

      Luckily things like rendering graphics, sorting, searching, running web sites, many crypto problems, simulations, games, image processing, video processing, etc., parallelize really well. Admittedly it takes some cleverness to write a sort algorithm that runs on thousands of CPUs in parallel, but it's valuable to have a constant-time sort (i.e. you can scale hardware linearly with the data size, and sort arbitrary amounts of data in fixed time). The main challenge that parallel computing has, IMO, is that most programmers don't think that way, similar to how most programmers don't think in terms of multi-threading. But that's a matter of education. People used to be terribly confused by event-based programming frameworks, too!

      Once you start thinking in terms of having thousands or millions of (virtual) CPUs, and decomposing problems to run in parallel based on data or actors, pretty much everything becomes highly scalable.

    102. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      When the PC boots, at which time exactly does ANY of Bill Gates software get to play?

      The first things that execute are the BIOS, which executes other code stored on addon ISA/peripheral cards that contain jump vectors to their initialization routines, etc. Some of that code is setting up video cards, hard disk controllers, etc etc. All of the memory mapped I/O, BIOS tables etc are set in memory long before the operating system loads a single line of code.

    103. Re: Pullin' a Gates? by Anonymous Coward · · Score: 0

      I only need one core to send the packets out for processing at a facility optimized for parallelism.

    104. Re:Pullin' a Gates? by laird · · Score: 1

      Faster switching requires more power. Doubling clock speed consumes (roughly) 4x the energy, which is why doing the work in two slower cores is much more power efficient. That's one of the reasons that mobile devices that are power constrained run at slower clock speeds than desktop devices.

    105. Re:Pullin' a Gates? by laird · · Score: 1

      It's not a technical issue, it's a "chicken and egg" market issue. Many desktop applications _would_ run very well on massively parallel hardware, but that's not what people have, so it's not what developers target. And since games are written not to use more CPUs, people don't buy computers with many CPUs. And because MPP hardware is a niche, mainstream developers have no idea how to program for them, much less to think about what problems would run well in parallel.

      From a technical perspective, which I think Linus is trying to argue from, many desktop applications could easily take advantage of massive parallelism. Once you start thinking in terms of data parallelism or agent parallelism, almost all problems decompose in ways that parallelize nicely. For example, there are hundreds of AIs and simulation objects in many games, and each could run on a CPU (or process or thread). Video and image processing are "embarrassingly parallel", and now that people edit video at home, they could happily consume all the CPU you have. Sorting, searching, indexing, scrolling in documents, rendering characters to the screen - all very parallel.

      Luckily the "graphics processors" are breaking out of the "chicken and egg" trap. The better GPUs are now not really "graphics processors", they are fully general MPP CPUs, and many applications are taking advantage of them. Interestingly this architecture is similar (at a high level) to the MPP supercomputers from decades ago. The Thinking Machines' Connection Machine had a fast front-end computer, controlling an array of thousands or tens of thousands of CPUs that did the heavy lifting, and now it's your CPU controlling an array of CPUs in your "GPU". So millions of PCs are MPP, even though their owners probably don't think of them that way. And this is leading to more and more applications taking advantage of MPP!

      So I think that Linus is wrong, in that he's missed that what he's dismissing as GPUs are actually MPP co-processors that are astoundingly powerful and are increasingly being taken advantage of by developers when performance matters.

    106. Re:Pullin' a Gates? by laird · · Score: 1

      In the real world the tradeoff is dollars (or power consumption, for mobile devices). So the question is - should you buy a 2x faster CPU for 4x the cost and 4x the power consumption, or should you buy 2 cores for 2x the cost and 2x the power consumption?

      For applications that only run single-threaded, you don't have a choice - you have to buy the fastest CPU you can. But for well-written applications, more cores is a cheaper, more power efficient way to scale performance.

    107. Re:Pullin' a Gates? by laird · · Score: 1

      This is only true if you're unable to use more than one CPU chip in your computer, a hurdle that was overcome 30 years ago. :-) People have been running multiple CPUs to improve performance for a _long_ time.

      The real question is - would you rather have multiple CPUs at the price/performance peak, or one CPU that's a bit faster for a much higher price. Typically getting 2x performance costs 4x or so, making 2 cheap CPUs a much better deal than one really expensive CPU.

    108. Re:Pullin' a Gates? by cb88 · · Score: 1

      I did read it... but he did say 4 cores is enough for most people and I refuted that.

      Even though largely in the context of his rant... he is correct. That single statement is rather horrendous.

      1 core even is "enough" for most tasks... however it doesn't give the best experience; no one wants to wait on their computer more than necessary.

    109. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Your Linux gaming machine shouldn't be doing more than 3/4 cores of CPU and handing the heavy grunt work off to the GPU anyway. No need for a 64 core CPU for that one.

      I see: you don't play Dwarf Fortress. Good for you (really: less of your life is being wasted), but what about the rest of us, who are already enslaved?

    110. Re:Pullin' a Gates? by cb88 · · Score: 1

      There is pretty good hard data that says.. HTML rendering is embarrassingly parallel... thus Mozilla is working on Servo.

      That is 99% of computer use right there... parallel is here to stay and knowing the web it will get vastly more parallel once a browser engine is out there that can do it.

    111. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Energy is related to the number of bits flipped. If you use twice the bits at half the speed, the energy requirements will be the same.

      Not true, "flipping bits" at a higher speed requires a higher voltage, which results in a higher energy consumption.

      See: http://en.wikipedia.org/wiki/Dynamic_voltage_scaling#Power
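
      A minimal sketch of the usual textbook model behind this, dynamic power proportional to C * V^2 * f. The function name is made up, and how far voltage can actually drop along with frequency is an idealized assumption that varies by chip:

      ```cpp
      #include <cassert>

      // Relative dynamic (switching) power under the model P ~ C * V^2 * f.
      // Both arguments are ratios against a baseline core (1.0 = unchanged),
      // with switched capacitance C held constant.
      double relative_dynamic_power(double voltage_ratio, double frequency_ratio) {
          assert(voltage_ratio > 0.0 && frequency_ratio > 0.0);
          return voltage_ratio * voltage_ratio * frequency_ratio;
      }
      ```

      If halving the frequency also let you halve the voltage, relative_dynamic_power(0.5, 0.5) gives 0.125, so two such cores would draw about a quarter of the original core's dynamic power for the same number of bit flips per second. With voltage held fixed, relative_dynamic_power(1.0, 0.5) is just 0.5 and the two slower cores break even, which is the case the quoted claim describes.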

    112. Re:Pullin' a Gates? by Bengie · · Score: 1

      Larger caches also mean higher latency. If you have a lot of registers, that's great, but if you can't be constantly prefetching data from memory for one reason or another, higher cache latency will make everything slower for a small subset of workloads.

      We're in the transition where throwing more transistors at a single core makes the core slower for most workloads with marginal gains for specific workloads. We need more cores, either of the same type or different types that specialize for certain types of workloads. Maybe we need some cores with large caches and some with smaller caches or some with lots of SIMD or some with few SIMD.

    113. Re:Pullin' a Gates? by Bengie · · Score: 1

      Cutting your frequency in half may reduce your power consumption by 80%. Lots of low frequency cores will kick the crap out of a single high frequency core, efficiency wise.

    114. Re:Pullin' a Gates? by Bengie · · Score: 1

      Your Linux gaming machine shouldn't be doing more than 3/4 cores of CPU and handing the heavy grunt work off to the GPU anyway. No need for a 64 core CPU for that one.

      Spoken truly like someone who does not understand anything about games or parallel processing. We already have games that can make a 12 core CPU run at 80% load and a quad-GPU about the same. "Pipeline" style multithreading is becoming popular. Each stage of the graphics pipeline is run where it is most efficient, which may be the CPU or the GPU, and data may bounce between the CPU and GPU several times before being completed.

      Pipelining plays well with streaming, so one of the first jobs is for the CPU to break up the work to be done into many different smaller pieces that can be "streamed". "Streaming" is just breaking up a large object into smaller objects. This allows the GPU and CPU to be kept busy working on different pieces that are at different points of the pipeline.

      The natural evolution of this is each stage of the pipeline can be a collection of "processing units", each unit capable of working on a unit of the stream. While the pipeline may prefer to use the GPU for certain types of work or the CPU for other types, if the current machine has an unbalanced mixture of CPU and GPU, the pipeline may augment one with the other. While the CPU may not be as good, if the machine has a lot of CPU cores left unused, might as well use them to speed things up.

      We need a framework that allows code to run where it would be best to run, but also have the ability to overflow to less efficient execution units if they are relatively unused. In other words, we need to treat CPUs and GPUs like a pool of computing resources, and *something* needs to manage these resources to crunch data.
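
      The CPU half of that idea can be sketched as a plain worker pool: pieces of the stream go into one queue and whichever worker is idle takes the next one. WorkerPool is a made-up class for illustration, it knows nothing about GPUs, and a real scheduler that also weighs where each piece runs best would be far more involved:

      ```cpp
      #include <cassert>
      #include <condition_variable>
      #include <functional>
      #include <mutex>
      #include <queue>
      #include <thread>
      #include <utility>
      #include <vector>

      // Minimal CPU-only worker pool: tasks go into one shared queue and
      // any idle worker thread picks up the next one.
      class WorkerPool {
      public:
          explicit WorkerPool(unsigned workers) {
              assert(workers > 0);
              for (unsigned i = 0; i < workers; ++i)
                  threads_.emplace_back([this] { run(); });
          }
          ~WorkerPool() {
              {
                  std::lock_guard<std::mutex> lock(m_);
                  done_ = true;
              }
              cv_.notify_all();
              for (auto& t : threads_) t.join();  // workers drain the queue, then exit
          }
          void submit(std::function<void()> task) {
              {
                  std::lock_guard<std::mutex> lock(m_);
                  tasks_.push(std::move(task));
              }
              cv_.notify_one();
          }
      private:
          void run() {
              for (;;) {
                  std::function<void()> task;
                  {
                      std::unique_lock<std::mutex> lock(m_);
                      cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                      if (tasks_.empty()) return;  // only reached once done_ is set
                      task = std::move(tasks_.front());
                      tasks_.pop();
                  }
                  task();  // run the piece of work outside the lock
              }
          }
          std::mutex m_;
          std::condition_variable cv_;
          std::queue<std::function<void()>> tasks_;
          bool done_ = false;
          std::vector<std::thread> threads_;
      };
      ```

      Each submitted task would be one piece of the stream. The pool's destructor waits for everything already queued to finish, so letting the pool go out of scope acts as the "all pieces done" barrier.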

    115. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Re: "Your Linux gaming machine shouldn't be doing more than 3/4 cores of CPU and handing the heavy grunt work off to the GPU anyway."

      Wow, that's actually pretty stupid. It comes from the "there's only one way to solve a problem" heap of thinking. While we're at it, let's mandate just one OS for everyone. Also, only one application in every problem space will now be mandatory. Linux is fragmented and that must stop, so let the edicts spring forth!

      Furthermore your point is defeated by your own statement. If Linux is OK with 3/4 cores, why cannot it be better with more? I mean, you've already opened the door to parallel processing, so what's the problem? Does parallel offend you? Oh right, parallel processing is 100% OK on the GPU but I guess it's icky on the CPU and must be controlled?

    116. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Please don't let Bill whitewash his inglorious past. Yes, he said it, and it was originally quoted by an IBM executive in a meeting who heard him say it. He also went on to say things like it all the time: "When we set the upper limit of PC-DOS at 640K, we thought nobody would ever need that much memory." — William Gates, chairman of Microsoft

      His PR firms that scrub the Internet can go to hell. Fuck you, Bill.

    117. Re:Pullin' a Gates? by TheRaven64 · · Score: 1

      Per tab security so visiting myonlinebank.com and evilmalwaresite.com at the same time won't be a problem, sure, but honestly I don't care if one image can bork just that image or the whole webpage, since from my perspective they are equally untrusted.

      You don't use webmail then? Or any web pages that have adverts in them?

      --
      I am TheRaven on Soylent News
    118. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      CAD workstation: I imagine a lot of geometry processing is parallelized... the less waiting the better (either format conversion or generating demo videos etc.. eat up a lot of CPU)

      Let me tell you that it is definitely not solved for CAD for design. Quite the opposite. (Not talking about producing renders etc. but design work)

      As an example: CATIA v5 (others and v6 are no better) is so old at its core that it works exactly the same on $40 (non-certified) and $3000 graphics hardware; the only thing that counts is single-core performance. And while it is certainly threaded, it is not multi-core capable. Even worse, management software (such as VPM in the case of CATIA) tends to run on the same CPU core as the main program. CAD is the worst current offender regarding parallelization I know.

    119. Re:Pullin' a Gates? by tibit · · Score: 1

      For iteration, as long as you know a bit in advance what you will need, you can certainly issue prefetch requests. They can often remove cache stalls altogether.
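
      A minimal sketch of the prefetch idea, assuming GCC or Clang (`__builtin_prefetch` is their builtin, not standard C++). The function name and the prefetch distance `kDistance` are hypothetical; a useful distance depends on the hardware and has to be measured.

```cpp
#include <cstddef>
#include <vector>

// Sum the values selected by an index list, prefetching a few iterations ahead.
// kDistance is an assumed tuning knob, not a recommended value.
long long sum_indexed(const std::vector<int>& data,
                      const std::vector<std::size_t>& order) {
    constexpr std::size_t kDistance = 8;
    long long total = 0;
    for (std::size_t i = 0; i < order.size(); ++i) {
        if (i + kDistance < order.size()) {
            // GCC/Clang builtin: hint that this address will be read soon.
            __builtin_prefetch(&data[order[i + kDistance]]);
        }
        total += data[order[i]];
    }
    return total;
}
```

      The index list is known in advance, which is exactly the "know a bit in advance what you will need" condition. The prefetch is only a hint, so the result is the same with or without it.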

      --
      A successful API design takes a mixture of software design and pedagogy.
    120. Re:Pullin' a Gates? by linuxrocks123 · · Score: 1

      It's certainly not the case that "almost all" problems decompose into data parallelism. Likewise, while there are some tasks that GPUs can do very well, there are others where they don't do well at all. What Linus is arguing is that the cases where they don't do well at all dominate. I concur. Programming this stuff typically locks you deeply into a single GPU's architecture, and, oh, you have zero cache, zero pipelining. There are other problems, too. A good overview of the technology as a whole is here: http://cstar.iiit.ac.in/~kkish...

      GPGPUs have had some impressive successes, but CPUs are still more versatile, and, like Linus says, I don't see Intel giving up single core performance so people can program a bunch of tiny little ant-processors that can't communicate with each other in less than 500 cycles.

      --
      vi ~/.emacs # I'm probably going to Hell for this.
    121. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      BULLSHIT, The quote was supposed to have originated at a computer show. No one has EVER come forward and said he heard him say it. There have been several claims that proved to be false. He did many things wrong and said many wrong things, but that wasn't one of them, it is pure myth.

    122. Re:Pullin' a Gates? by linuxrocks123 · · Score: 1

      Games can do all sorts of unnecessary and stupid things, including loading 12 CPUs and consuming 500 watts so Duke Nukem's whiskers waft with the prevailing winds in the virtual environment.

      They are a model for no type of problem except themselves.

      --
      vi ~/.emacs # I'm probably going to Hell for this.
    123. Re:Pullin' a Gates? by Bengie · · Score: 1

      In this case it wasn't. There was no obvious waste of resources, the devs had a nice breakdown with a profiler showing the main things being done. No one part of the system was using vastly more resources than the others, and there were a lot of different parts. Relative to other gaming engines, it was beating them on number of objects rendered by quite a bit.

    124. Re:Pullin' a Gates? by awol · · Score: 1

      More than 20 years ago I had a full and frank exchange with a macweenie friend of mine where I posited that in the vast majority of cases the core "functionality" of the work we were doing was already within the capacity of the processors available at that time and the advances in speed that will come in the future will all be about enhancing the user experience of that core.

      What I meant was that the calculating of the spreadsheet cells or redrawing the document window or .... was already doable by the current processor. It was the handwriting UI, or voice recognition or eye candy (or stuff I couldn't envisage, like parsing my email history to find the right advertisement to display :-) that would consume the CPU advances that were coming. When I say "OK Google what's the weather like today" and my cell phone tells me in a moderately human voice a 2 sentence forecast and displays a detailed weather page for my freakin' suburb. I kinda feel vindicated. When the address I was searching on my desktop is the first entry in the dropdown box on the GPS on my phone when I get in the car later that day. Same. (All points about the invasive nature of that connectivity duly noted).

      The parent poster is absolutely right, this trend is ongoing and the amount of "work" that I can get my compute resources to do via more and more sophisticated interactions is only going to increase and the more encompassing that work becomes the more it can be broken down into smaller discrete and hence parallelizable tasks.

      Having said all that.... my professional expertise is in quite high performance transactional software and Linus's statement is absolutely true. I'll take cache size/control over a proliferation of cores any day. Given a certain number of cores, and within that all the goodness of branch prediction and out-of-order execution, four sounds about right. So much so that we find situations where adding cores actually reduces our performance, we suspect due to caching issues.

      So in essence there are two trends. From Linus's perspective he is right, the time spent on parallelism is not worth it. At a more macro level it is. Perhaps that macro level is an application software level rather than a system software level and hence the difference in viewpoint.

      --
      "The first thing to do when you find yourself in a hole is stop digging."
    125. Re:Pullin' a Gates? by DamnOregonian · · Score: 1

      Are you joking?

      I know that there is a wide variance in performance differences between compiled programs on 32-bit and 64-bit architectures, but I do a fair amount of work in assembler, and I assure you there are very large speedups to be had moving over to x86-64. First and foremost, increased register size and double the amount of general purpose registers.

      If you want to go to a higher level, 64-bit pointers also allow for all kinds of very neat OS syscall latency related tricks like mapping stupidly-large files into memory.
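
      A rough sketch of the file-mapping trick, assuming a POSIX system (`open`, `fstat`, `mmap`). The function name is hypothetical. On a 64-bit build the address space is large enough to map files far bigger than physical RAM, and the kernel pages data in on demand.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Count newline bytes in a file by mapping it read-only (POSIX only).
// Returns -1 on any error. An empty file is treated as zero lines.
long long count_newlines_mmap(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }
    const std::size_t len = static_cast<std::size_t>(st.st_size);
    void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return -1;
    const char* bytes = static_cast<const char*>(p);
    long long n = 0;
    for (std::size_t i = 0; i < len; ++i) n += (bytes[i] == '\n');
    munmap(p, len);
    return n;
}
```

      There is no read() loop and no user-space buffer; the file contents are addressed like an ordinary array.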

      64-bit is the way, my friend. If your software doesn't run faster in long-mode, it's because either you or your compiler just isn't quite with the program yet.

    126. Re:Pullin' a Gates? by DamnOregonian · · Score: 1

      You're so right :(

      I keep a PS3 non-updated just so I can play around with the Cell in linux.
      It certainly is a shitty ass part if you're just trying to write normal software for it- the HT PPC core in it is a total dog.

      However, if one were bored and wanted to whip up a stupidly parallel task, like computing segments of a mandelbrot- then one could zoom down to precision failure in a second.
      Ever generated a 40,000 x 40,000 mandelbrot? Sure it's not quite general-purpose, but my i7 desktop, Q9650 desktop, i3 work computer, i3 laptop, and i7 laptop running parallel generation software struggle to keep up. (Granted- no GPU assist on those)
      I just wish I had come up with more workloads for it before I got bored.
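
      For illustration, a small sketch of why the Mandelbrot set counts as "stupidly parallel": every pixel is independent, so rows can be handed to threads with no locking at all. This is plain C++11 with std::thread, not Cell SPU code; the function names and the fixed viewport are assumptions.

```cpp
#include <algorithm>
#include <complex>
#include <cstddef>
#include <thread>
#include <vector>

// Escape-time iteration count for one point c.
int mandel_iters(std::complex<double> c, int max_iter) {
    std::complex<double> z = 0;
    int i = 0;
    while (i < max_iter && std::norm(z) <= 4.0) { z = z * z + c; ++i; }
    return i;
}

// Render a width x height grid over [-2,1] x [-1.5,1.5], rows interleaved across threads.
std::vector<int> mandelbrot(int width, int height, int max_iter, unsigned n_threads) {
    std::vector<int> out(static_cast<std::size_t>(width) * height);
    n_threads = std::max(1u, n_threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t) {
        pool.emplace_back([&, t] {
            // Thread t handles rows t, t + n_threads, t + 2 * n_threads, ...
            for (int y = static_cast<int>(t); y < height; y += static_cast<int>(n_threads)) {
                for (int x = 0; x < width; ++x) {
                    std::complex<double> c(-2.0 + 3.0 * x / width, -1.5 + 3.0 * y / height);
                    out[static_cast<std::size_t>(y) * width + x] = mandel_iters(c, max_iter);
                }
            }
        });
    }
    for (auto& th : pool) th.join();
    return out;
}
```

      Each thread writes a disjoint set of rows, so the output is identical for any thread count. Interleaving rows is a crude way to balance load, since rows near the set cost more iterations than rows far from it.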

    127. Re:Pullin' a Gates? by DamnOregonian · · Score: 1

      The kind he is talking about is also using those cores to load a single page. He's arguing against parallel computing being the answer for mundane tasks. At the end of the day, improving the instructions/cycle (or cycles/second, but I think we're pretty close to tapped out in that department) performance of cores is more important than increasing the cores.
      People arguing against him largely don't understand that's the argument he's making.
      Adding 12 more cores to your quad core is not going to make the desktop perform better.
      However, a 5% increase in instructions/cycle performance *will*
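
      The trade-off can be put into numbers with Amdahl's law. The sketch below is only arithmetic; the parallel fraction `p` is an assumption you would have to measure for a real desktop workload, and the function names are hypothetical.

```cpp
// Amdahl's law: speedup on n cores when a fraction p of the work parallelizes.
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

// Gain from going from `from` cores to `to` cores at the same parallel fraction.
double relative_gain(double p, int from, int to) {
    return amdahl_speedup(p, to) / amdahl_speedup(p, from);
}
```

      With an assumed p = 0.1, relative_gain(0.1, 4, 16) is about 1.02, so twelve extra cores buy roughly 2%, less than a 5% per-core improvement. With an assumed p = 0.5 the same call gives about 1.18, so the claim holds only for workloads that are mostly serial.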

    128. Re:Pullin' a Gates? by AchilleTalon · · Score: 1

      I wonder if the fine guys marking my post offtopic have themselves read the fine article and associated documents. /. is full of surprises; people are even moderating subjects they don't give a fuck to read about.

      --
      Achille Talon
      Hop!
    129. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      And IBM team members have said this is not true, even though MS had the impression that their input had been influential. When Bill Gates was suggesting they should have a 16-bit chip, they had already made that decision. But it wasn't IBM's style to tell people things they didn't Need To Know (yet). Read about Digital Research's refusal to sign the IBM NDA and secrecy agreements.

    130. Re:Pullin' a Gates? by im_thatoneguy · · Score: 1

      I say you are the one moving the goal posts, Linus and *most* of the other people working on parallelism solutions are working/speaking in the context of computers like the ones we know today, you are the guy trying to apply what they say to *any* computer. Linus will probably be proved correct there. Past n cores the fundamental architecture in use today will not scale but for niche cases.

      Within the context of traditional von Neumann computers we already today have voice recognition, we already have SLAM 3D positioning, we already have databases like Wolfram Alpha which can give us insights, we already have applications which crunch massive 3D datasets. Some of these run ok on GPGPUs and some need the larger cache sizes of a CPU to run efficiently.

      My point isn't that we need some completely exotic system, my point is that with the very limited amount of applications today for AI-driven solutions there are plenty of applications that can and would use hundreds of cores. Computers were once a "niche" tool for rich people. The internet was once just a niche tool for academics. Only gamers needed a GPU etc etc. All the way back through history when something becomes accessible someone finds an application. Build it and they will come.

    131. Re:Pullin' a Gates? by im_thatoneguy · · Score: 1

      If that whole process takes 3 seconds (which would be amazing) then your computer only performed 1 "operation per second". But computers don't perform "operations" they have to perform millions of sub-actions to accomplish your goal. It would be like saying that "Rendering a game's frame is only a single task so it would be a very serial task without any potential for multithreading." when in reality "rendering a frame" is a massively parallel task of rasterizing millions of triangles (or intersecting rays) and sampling textures, computing lighting values and performing table look ups.

      Take interpreting voice. By applying multiple models simultaneously you can get better results. Seems pretty obvious.
      http://devblogs.nvidia.com/par...

      For the flyer maybe it'll generate 1,000 flyers simultaneously and then compare them to award winning graphic design projects to see which of the 1,000 ideas it had matches historical good ideas.

    132. Re:Pullin' a Gates? by fisted · · Score: 1

      I actually had the caches in mind. One of the reasons why people came up with the x32 idea.

    133. Re:Pullin' a Gates? by im_thatoneguy · · Score: 1

      The point isn't to pick any one approach or technology (say neural nets) the point is that we *already* have an application that comfortably uses more than Linus' mythically adequate 4 cores. A 4 core CPU is fantastic at running a word processor and an email client in the background. But that's not the future of computing. The future of computing is going to be doing the work of the human brain, but better. The human brain is one example of the sort of application we are going to see more of. Improved Microsoft Word is not the future. Improved Chrome is not the future, we see the future in Science Fiction and it's an interface that can communicate with us naturally. Natural human/computer communication means a whole new set of problems, and these are not problems relegated to "niche" marketplaces like research lab super computers. The applications for machine vision are everywhere. The applications for voice recognition are everywhere. The applications for 'common sense' in your interaction are everywhere. These aren't problems that I expect will be solved best with fast linear serial processes. To date all of these classes of problems have been best approached with multi-threaded parallel computing.

      You mention the GPU. It's true the GPU was a custom semi-specialized piece of hardware. In fact the original 3D accelerators weren't even in the display card they were pass-through cards. But you know what else used to be a semi-specialized chip? Math Co-Processors. Even today GPUs are slowly blending back into the CPU. Once something like a math co-processor becomes sufficiently critical to the average user it becomes part of the CPU's die. AMD has already integrated pretty substantial GPUs into their "APUs". By definition SOCs are integrating the GPU. If we do develop a chip that is critical to the average user like AI with a magic AI-chip then they'll just integrate it into the CPU.

      It used to be that video playback was a niche market and now just about every CPU, GPU and combination there-of has integrated video decoding into the chip. So what makes you think they won't integrate ai and call it a "CPU"?

    134. Re:Pullin' a Gates? by im_thatoneguy · · Score: 1

      You assume that task-specific tasks are all that people will come up with. If you have to spin a new ASIC every time you want to improve your software we aren't going to innovate. ASICs are specifically for something like 10GB networking which is a defined standard. But most tasks aren't defined standards. Changing specs is the norm not the exception outside of core OS functionality like storage or networking. GPUs couldn't keep up so they moved to a compiled per-pixel shading model so that developers could rapidly iterate and invent new uses. In the process GPUs by necessity became pretty general purpose. But GPUs are still frustratingly limited in their general purpose applications. There is a huge domain of problems that need more than 4 cores but need more memory and larger caches than a GPU offers them. You could legitimately call whatever processor manages to handle them a "CPU" or a "GPU".

    135. Re:Pullin' a Gates? by im_thatoneguy · · Score: 1

      Torvalds dismisses photo editing as a task for "professional photographers", but our amateur cameras are taking phenomenally detailed pictures, and even making fairly simple edits is a compute-intensive task. He may be right, but he may equally be wrong.

      Torvalds is being completely ridiculous here. Avid used to be the domain of professional film editors but iMovie is incredibly popular. We even see cell phones these days sporting 4k cameras. My Lumia has a 41 megapixel sensor! I have a RED camera and it's "only" 18 megapixels. In fact the less professional you are the more processing power you need. Photoshop's paint brush can accomplish wonders in the hands of a professional touch-up artist. But Photoshop's Content-Aware-Fill is processor murder and designed specifically to intelligently replace a professional artist. Take something like 3D rendering. You could have someone hand paint every frame. It would without question require a professional artist. But if you want a pretty picture at the push of a button you want raytracing.

      This is actually something that you see happening today in the high-end VFX market. It used to be that raytracing was too compute intensive for films. But for amateurs and non-artists ironically enough ray tracing was fine. The architect only needed to render 3 frames. Waiting a day was perfectly acceptable there wasn't another 100,000 frames that also needed to get rendered. In film there wasn't time for something like Global Illumination and the shortcuts caused unacceptable flickering. Now the film industry is starting to embrace advanced lighting like GI and they're getting all of the bounces and detail that used to take hundreds of lights to fake automatically. It's making artists more productive but it's coming at the cost of increased compute time. Again a professional lighter can as an artist fake global illumination. An amateur could simply position the sun, turn on GI and wait 18 hours.

      The future will be an Automagical button that not only fixes your photo *cough* instagram *cough* but also performs even more advanced editing like "Remove the gray clouds and put in a photorealistic blue sky. Oh yeah, and also change the lighting of the photo to make it look sunny!" That's going to be far more CPU intensive than any photoshop filter currently in existence and it'll be targeted as much at your average cell phone user as at a professional.

    136. Re:Pullin' a Gates? by im_thatoneguy · · Score: 1

      Game developers waste less processor power than just about any other developer I know of short of super-computer developers. When you have 16ms to render a frame and you have to recreate the entire universe in those 16ms you have to be extremely judicious in your use of cycles.

    137. Re:Pullin' a Gates? by ultranova · · Score: 1

      But by the time you've finished reading the first paragraph of the first page, the other nine are loaded even if you can't parallelise.

      No, they haven't. If you can't parallelise, you can't download and render in background. If you try anyway, you end up blocking the UI randomly. With nine not-really-parallel threads competing for various locks with each other and the user, you set the pages to load and go have coffee.

      Parallelism isn't just good for optimal resource utilization, it's also good for "smooth" user experience. Users might not care if a page loading in the background takes a second rather than two, but they do care about being able to scroll or close the current page while it does.

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    138. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      3/4 cores for gaming!? You do know the consoles have 8 atm? Off-loading to GPU means you need to compensate your graphics, we could use AI chips to off-load the AI and that might keep the required core count down. But as a game designer and programmer 3/4 core limit for games is a foolish statement.

    139. Re:Pullin' a Gates? by ultranova · · Score: 1

      Maybe it's time we started designing systems with two separate chips - one dual core chip optimized for running single tasks as fast as possible, and another with 10-50 simpler cores optimized for parallel tasks. I think we're halfway there already, what with GPUs being used that way to some extent, but standardizing it would actually allow non-custom applications to make use of it.

      It's standardized - OpenCL is for exactly this - but it's such a pain to program, people usually won't. All of our popular programming languages are designed for sequential execution, and multithreading is just an afterthought. I don't think the problem can be solved through shared, mutable state. Maybe something inspired by physics: every event has its immutable "past light cone" of events whose output it can access, and can't access any data not in this cone?

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    140. Re: Pullin' a Gates? by Anonymous Coward · · Score: 0

      Why do people use "parallel" for concurrently running sequential processes that are not related, don't share memory or transfer data, and are written using no parallel techniques?

    141. Re: Pullin' a Gates? by Anonymous Coward · · Score: 0

      Imagine you have a conversation with your wife and she says the kids need to be picked up at 4 on Tuesday. If my phone put a reminder on my calendar for me based on my continuous audio stream, the mental offload would be huge as I could seamlessly continue with my day without managing my calendar,

      A real wife puts it on your calendar for you. And will criticize you if you fail to pick up the kids: "It was on your calendar!"

    142. Re:Pullin' a Gates? by ultranova · · Score: 1

      A 3+ GHz single core CPU is easily capable of decoding images that come in at full speed over a typical internet connection. You may be able to use multiple cores, but it's not going to make the overall page loading any quicker than using a single core.

      If you had 12+ cores, you could keep those images in their compressed form and decompress when they become visible as the user scrolls a page or switches tabs, thus saving a lot of memory. Also, modern webpages tend to be full of "dynamic" content, from animated gifs to ads. Being able to give a separate thread for each of these would do a lot to make the UI more responsive.

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    143. Re: Pullin' a Gates? by ultranova · · Score: 1

      I don't want to continuously stream my audio to Google nor do they want to continuously process the sound of me typing and sipping coffee...

      Of course they do. It's behavioral data, which can be used to target advertising and perhaps even feedback data valuable to manufacturers. For example, how much coffee do you drink per day? How long do you spend with a single mug? Do you brew a little and often, or use a thermos, or do you simply let it sit in the pot? Are other people around - are your coffee breaks spent alone? What does your chair sound like - is it time to get another?

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    144. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      ahhh yes I am sure that is true, they were just telling their contracted programmers they were using 8 bit CPU's to fuck with the development process and cost.

    145. Re:Pullin' a Gates? by lucien86 · · Score: 1

      Yep 50 tabs open - assign a thread to sub-maintain each tab. I like it... Mind you it opens up the possibility of more viruses than can be squeezed into a small box..

      --
      Below the speed of light Special Relativity is one of the most accurate theories in physics - above the speed of light..
    146. Re:Pullin' a Gates? by lucien86 · · Score: 1

      One of the great advantages of running multiple threads on a single core is that you can get rid of deep pipelining, taking along a lot of old problems and superfluous complexity with it.. Of course the same thing can be extended to multiple cores - or even many multiple cores. The big problem with having many cores on a single die is the large bottleneck that tends to form between the CPU's and the main memory.

      I think the real breakthrough will come with having the main RAM on the same die as the CPU cores.. that will be a dream come true and we will see a massive improvement in performance then.. Just adding more chips will increase your system processing power & memory - as many as you want.. Massive parallelism is definitely the future, it's just a question of when it will arrive...

      --
      Below the speed of light Special Relativity is one of the most accurate theories in physics - above the speed of light..
    147. Re:Pullin' a Gates? by lucien86 · · Score: 1

      So as long as you can keep guessing at least half the balls in the lottery next week you can keep winning..

      If only it was that easy in Strong AI .. but then I suppose anyone could do it. :(

      --
      Below the speed of light Special Relativity is one of the most accurate theories in physics - above the speed of light..
    148. Re:Pullin' a Gates? by aliquis · · Score: 1

      To begin with the Gates quote never was.

      But regardless, what he seems to suggest isn't that technology won't shrink and that you couldn't make more cores.

      What he seems to suggest is that you'd rather not use that to make more cores but rather make even better/bigger cores and caches.

      No?

      And maybe he's right. I mean. The Pentium IV was supposed to be able to go to 10 GHz!
      We had a real mega-hertz race going on then. And then that stopped and we got more cores instead.
      And maybe now people assume more cores is the future because that's what we've seen. But maybe he's right and it's hard to use that efficiently and that's not where we're going at all.

      I don't know :)

    149. Re:Pullin' a Gates? by Tablizer · · Score: 1

      More fuel for the Gates quote debate:

      http://imranontech.com/2007/02...

    150. Re:Pullin' a Gates? by Anonymous Coward · · Score: 0

      Thank you!

      One issue has been, and continues to be, that Moore's Law/Dennard Scaling has repeatedly rescued programmers from having to think in new ways. This has led to a certain amount of predictability in the educational curricula, algorithm theory, and all the rest. Give a person a task and they will almost always try to perform it in the most familiar, least uncomfortable, easiest way.

      It's only when you give people a task that is impossible with conventional techniques, that they start trying new things. Things like parallel programming. And even then many software analysts will simply say "that's impossible" and leave it there.

      Reality is, parallel techniques are still in their infancy in the mainstream software world. HPC has done parallel for a long time. All the major OSs have decent parallel support (although MPP support is still routinely treated as a niche specialty, with special OS versions and builds to support it, often at extra cost or more inconvenience). Applications though, are a near desert of concurrency. Only the occasional application will take a serious run at decomposing tasks for parallel execution. And that includes the entire field of GPGPU programming.

      Also, there are multiple examples of OS designs and ecosystems that do (or did) a much, much better job of supporting high levels of parallelism than the most popular OS designs of today. Just off the top of my head, UNICOS, BeOS, OS/400, etc.

      Therefore it is defeatist and premature for Linus to make these statements. It has been done before, repeatedly. Is Linus Torvalds saying that he cannot? If so, then why?

      It's very important to understand that this is a software engineering problem. It is not a matter of violating the laws of physics! Saying that something is impossible, when all it requires is more and better technology and design and implementation, is terribly short-sighted. It's like the talking heads of 100 years ago saying that it was impossible for a human being to withstand travelling more than 60 miles per hour. Which just sounds dumb now. And it's not made better by understanding that they confused speed with acceleration.

      Software engineering is slowly being forced to parallel solutions because Moore's Law is slowing down. That's a powerful forcing mechanism and it's only likely to get stronger over time.

    151. Re:Pullin' a Gates? by perryizgr8 · · Score: 1

      Adding 12 more cores to your quad core is not going to make the desktop perform better.

      Why, though? Why can't they separate things involved in loading a single page? Like network, static images, css, javascript, WebGL, etc. Each gets their own core. Won't that speed up the loading?

      --
      Wealth is the gift that keeps on giving.
    152. Re:Pullin' a Gates? by perryizgr8 · · Score: 1

      Everyone except Firefox is doing just that. How the mighty have fallen. Firefox was the one that brought down the evil IE.

      --
      Wealth is the gift that keeps on giving.
  4. Linus should try git by MichaelSmith · · Score: 3, Funny

    ...a tool which he may have heard of. It does connectionless, distributed data management, totally without locks.

    1. Re:Linus should try git by Anonymous Coward · · Score: 0

      I understand that Linus has tinkered around with git, but he doesn't like it very much.

      Now he uses TFS for pretty much everything, he loves it.

    2. Re:Linus should try git by phantomfive · · Score: 3, Informative

      In his post, Linus was talking about single, desktop computers, not distributed servers. He specifically said that he could imagine a 1000 core computer might be useful in the server room, but not for a typical user. So if you're going to criticize him, at least criticize what he said.

      Also, git is not totally without locks. Try seeing if you can commit at the same time as someone else. It can't be done, the commits are atomic.
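
      Within a single repository, Git serializes updates by creating lock files (for example `index.lock`). The sketch below shows the general lock-file technique on POSIX, not Git's actual code; the function names are hypothetical.

```cpp
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Try to take an exclusive lock by creating `path`. O_CREAT | O_EXCL makes
// creation atomic: only one caller can succeed while the file exists.
bool try_acquire_lock(const std::string& path) {
    int fd = open(path.c_str(), O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0) return false;  // someone else holds the lock (or an error occurred)
    close(fd);
    return true;
}

// Release the lock by removing the file.
void release_lock(const std::string& path) {
    unlink(path.c_str());
}
```

      A second writer that calls try_acquire_lock while the file exists simply fails and has to retry, which is why two commits to the same repository cannot proceed at the same instant.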

      --
      "First they came for the slanderers and i said nothing."
    3. Re:Linus should try git by MichaelSmith · · Score: 2

      My point is that git knows how to merge. It knows when a merge is required, when it is not, and when it can be done automatically. If you design your data structures properly, the same behaviour can be used in massively parallel systems.

    4. Re:Linus should try git by Anonymous Coward · · Score: 0

      My desktop already has a gazillion cores. It is called a DMA, video card, bridges, ... You might say that all these have drivers (which is exactly my point). For video cards there are actually useful options to use it as a processing unit. So where is the distinction? There is none.

      I'll make it simple. There is a CPU (which will take data and a program) and transport of data (via busses, networks or memory transfers: just variations of the same thing). Transport does not modify data (which is why it is interesting why we code so much to not-modify-data).

    5. Re:Linus should try git by Anonymous Coward · · Score: 0

      What complete hogwash! You have no idea what you are talking about.

    6. Re:Linus should try git by Half-pint+HAL · · Score: 1

      Git isn't a performance system, though. The timescales it works on are completely different from those of desktop computing.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    7. Re:Linus should try git by Anonymous Coward · · Score: 0

      But not every algorithm is suitable to parallelism, regardless of data structures. Simple read/compute/write dependency cycles can kill parallelism.

      I do agree with you that organizing data structures and following certain practices, such as using immutable objects, or consumer producer queues, can make certain types of multi-threaded applications easy to write. However other algorithms get grotesquely complicated, particularly if you want the holy grail of concurrency - lock free parallelism. You will eventually run into cases where you HAVE to use a lock, and that will create contention which will of course reduce throughput.
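
      One of the easy cases can be sketched with std::atomic: a shared counter updated by several threads without any mutex. The function name is hypothetical, and this only shows the simplest kind of lock-free update, not a general lock-free data structure.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each of n_threads adds 1 to a shared counter `per_thread` times, with no mutex.
long count_lock_free(unsigned n_threads, long per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&counter, per_thread] {
            for (long i = 0; i < per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);  // atomic read-modify-write
        });
    }
    for (auto& w : workers) w.join();
    return counter.load();
}
```

      The result is always exact, but all threads still fight over one cache line, so even a lock-free update has contention. Anything more involved than a single read-modify-write is where the complexity described above sets in.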

      Some algorithms are trivial to parallelize, and tragically some are impossible.

    8. Re:Linus should try git by phantomfive · · Score: 1

      Git couldn't do any of that without locks. That might sound like a small detail, but it's important......when it comes to multi-processing, the details are important. If you only look at the big picture, you will make poor decisions.

      --
      "First they came for the slanderers and i said nothing."
    9. Re:Linus should try git by Anonymous Coward · · Score: 0

      Linus wrote git.

      If you were being sarcastic, I totally missed it.

    10. Re:Linus should try git by MichaelSmith · · Score: 1

      The only locks in git are within single repositories. The locks which control distributed merging are controlled by the hashes which identify change sets. They tell a repo about the origin of the data being merged in. So rather than thinking about a static blob of data which changes sometimes and needs to be preserved while other nodes are working on it, you think of a graph which extends into the future, each node identified by its hash. By working this way it is easier to find places to reintegrate the results of processing which takes place remotely.
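
      A toy sketch of the hash-identified graph idea. Everything here is an assumption for illustration: `Store`, `add`, `is_ancestor` and `needs_merge` are hypothetical names, and `std::hash` stands in for the cryptographic hash a real system such as Git would use.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Id = std::size_t;

struct Store {
    std::map<Id, std::vector<Id>> parents;  // node id -> parent ids

    // Content-address a node: the id depends on the payload and on every parent id.
    // std::hash is a stand-in here; a real system would use a cryptographic hash.
    Id add(const std::string& payload, const std::vector<Id>& ps) {
        std::string key = payload;
        for (Id p : ps) key += "|" + std::to_string(p);
        Id id = std::hash<std::string>{}(key);
        parents[id] = ps;
        return id;
    }

    // True if `a` is reachable from `b` by following parent links.
    bool is_ancestor(Id a, Id b) const {
        if (a == b) return true;
        auto it = parents.find(b);
        if (it == parents.end()) return false;
        for (Id p : it->second)
            if (is_ancestor(a, p)) return true;
        return false;
    }

    // No merge is needed when one head already contains the other.
    bool needs_merge(Id ours, Id theirs) const {
        return !is_ancestor(ours, theirs) && !is_ancestor(theirs, ours);
    }
};
```

      Because an id is derived from content plus ancestry, two nodes can create results independently and later compare ids to decide whether their work can be adopted directly or has diverged and must be merged.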

    11. Re:Linus should try git by Anonymous Coward · · Score: 0

      And yet how many cores does the graphics card(s) of the average gamer have?

  5. Programs people want to use... by Anonymous Coward · · Score: 0

    .. more cores, don't or can't make use of more cores. We've had lots of cores for a while now but it is a rare game that can use all of them.

    1. Re:Programs people want to use... by Tablizer · · Score: 1

      Grokking and managing parallel programming seems to be the bottleneck. Using mass parallelism can be done, but so far it's been so difficult that it has yet to be worth it for the vast majority of apps (or at least the vast majority of the operations in a given app, for graphics and database calls can sometimes use lots of parallelism).

      It's too early to know if it's just too hard a problem for the human mind in general, or the current generation of programmers is too locked into a way of thinking.

      Regarding the suggestion to follow nature, nature can be unpredictable. Do we want that characteristic in our applications? How do you debug something if you can't faithfully recreate the state? I can see an organic mess being fun for some games, but not for accounting and tracking software.

      We need more pilot projects to experiment with techniques.

    2. Re:Programs people want to use... by Rei · · Score: 3, Insightful

      Indeed. There's tons of CPU-intensive tasks that need to be done in a modern computer game, but they're typically done as:

      while (true)
      {
          do_task_1();
          do_task_2();
          ( ... )
          do_task_N();
      }

      Rather than...

      std::thread([&](){ while (true) do_task_1(); }).detach();
      std::thread([&](){ while (true) do_task_2(); }).detach();
      ( ... )
      std::thread([&](){ while (true) do_task_N(); }).detach();

      ... or similar. Because in C and older versions of C++ launching a thread takes significant typing and ugly code, up to and including - in the case of the same function threaded a variable number of times in a loop with more than a trivial argument - having to have a memory-managed threadsafe container to hold your arguments (and in C you don't have STL containers, you have to do all that work yourself too). It's not the end of the world to have to code threads in C or earlier C++, but it's enough work that programmers usually don't do it any more than they're pretty much forced to. "Okay, my game will literally run at half the speed if I don't thread this function" - fine, they'll thread it. But "this function call eats up 3% of my performance, this one 6%, this one 4%, this one 2,5%, this one 3,5%...."? Usually such functions just get stuck into one big main loop.

      I really hope with how easy it's gotten in C++11 that more people will make better use of threads. In the first example code, not only do you relegate all of your tasks to the same core, thus hitting performance, but if any one task hangs, all of them hang. It's a terrible approach, but it's the most common. The only case where threads aren't good is where you're doing heavy concurrent reads/writes to the same cached data, but in real world apps there's almost always a level where you can launch the thread where this isn't the case, if it's even an issue to begin with in your particular application. The presumption that concurrent access to cached memory will usually or always be a problem (which seems to be Linus's presumption) requires that A) your threads not be doing the majority of their work on thread-local memory, AND B) that the shared data area being read from / written to concurrently is small enough to be cached, AND C) that you can't just migrate your threads up in scope N levels to work around any such issue.
       

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    3. Re:Programs people want to use... by Anonymous Coward · · Score: 0

      Or most programmers use the wrong tool for the job. I'm guessing Erlang makes parallelism a lot less difficult to manage. Haskell as well.

    4. Re:Programs people want to use... by Anonymous Coward · · Score: 0

      "I can see an organic mess being fun for some games, but not for accounting and tracking software."

      Have you ever used accounting and tracking programs? As a user, I really don't care if the mess is organic or inorganic. And no, it's not fun. Maybe an organic mess would be more fun, who knows? What I'm really wondering is who the superior salespersons are who can sell these programs to upper management. I mean, yes, the buyer won't know shit about what they are actually buying or how it will be used, but still.

    5. Re:Programs people want to use... by 0123456 · · Score: 1

      1. Until recently, most PCs had only a dual core CPU.
      2. You're assuming those tasks can trivially be done in parallel. In reality, most can't. You can't render the graphics until the physics are calculated, for example. Yes, you can be calculating physics for the next frame while you're rendering the current one, but then you have to maintain two copies of all the relevant data (current and new), or use a more complex data format which can support multiple threads updating it at the same time. That's a lot more work than just wrapping a thread around the physics calculations.
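
      A minimal sketch of the "two copies" approach described above, assuming a hypothetical `World` state with `step_physics` and `render` stand-ins: physics for the next frame is computed into a back buffer on a second thread while the render pass reads the front buffer, and the buffers are swapped only after both have finished.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

struct World {
    std::vector<double> positions;
    std::vector<double> velocities;
};

// Hypothetical physics step: reads `cur`, writes the next state into `next`.
void step_physics(const World& cur, World& next, double dt) {
    assert(cur.positions.size() == cur.velocities.size());
    next.velocities = cur.velocities;
    next.positions.resize(cur.positions.size());
    for (std::size_t i = 0; i < cur.positions.size(); ++i)
        next.positions[i] = cur.positions[i] + cur.velocities[i] * dt;
}

// Hypothetical render pass: only reads the current state.
double render(const World& cur) {
    double checksum = 0.0;
    for (double p : cur.positions) checksum += p;
    return checksum;
}

// Runs `frames` frames; returns what the last render pass "drew".
double run_frames(World& front, int frames, double dt) {
    World back;
    double drawn = 0.0;
    for (int f = 0; f < frames; ++f) {
        // Physics for the next frame runs concurrently with rendering this one.
        std::thread physics([&] { step_physics(front, back, dt); });
        drawn = render(front);
        physics.join();          // both sides are done before the swap
        std::swap(front, back);  // the new state becomes the front buffer
    }
    return drawn;
}
```

      The cost the comment mentions is visible even in this toy version: the state exists twice, and the join is a synchronization point on every frame.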

    6. Re:Programs people want to use... by AuMatar · · Score: 1

      Because in C and older versions of C++ launching a thread takes significant typing and ugly code,

      Bullshit. It takes 1 function call- because if you had a need to do all that repeatedly, you would write the damn call once, turn it into a function, and let it be done. People didn't do it because the tasks weren't parallelizable- they had massive resource contentions on memory objects. Contentions that would be non-trivial to solve, and would cause using threads to be a minimal gain or even a loss in efficiency.

      Libraries like std::thread don't do anything that people weren't already doing- they just prevent people from going out and writing their own implementations. But any problems that would benefit from them were already being solved with roll your own solutions.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    7. Re:Programs people want to use... by jandrese · · Score: 1

      The second approach runs into trouble if your tasks aren't independent. Parallel processing works great until you have to start synchronizing state. If one process stalls and the other processes are dependent on it for some data, then the other processes are going to stall anyway. In the real world, most problems are hard to separate cleanly--data dependencies are very very common. So there is a hidden cost to parallelism--the cost of synchronization between the threads, and the cost grows very fast as you add more threads. This is basically Linus's point: outside of specialized domains it's just not possible to cleanly break up most problems into more than just a handful of threads, so having a 1,000 core beast of a processor doesn't help. You would just have 990+ cores waiting on some other core to finish its job, all of the time. Plus there's the fact that debugging multithreaded programs is inherently more difficult than single threaded ones and that all of this is moot if you are I/O bound anyway.
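
      The comment does not name it, but one standard way to put numbers on this argument is Amdahl's law: if a fraction p of the work can run in parallel, the best possible speedup on n cores is 1 / ((1 - p) + p / n). A small sketch, where the function name and the sample fractions are illustrative assumptions rather than measurements:

```cpp
#include <cassert>

// Upper bound on speedup when a fraction `parallel_fraction` of the work
// scales perfectly across `cores` and the rest stays serial (Amdahl's law).
double amdahl_speedup(double parallel_fraction, int cores) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0 && cores >= 1);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}
```

      With 90% of the work parallelizable, `amdahl_speedup(0.9, 1000)` comes out just under 10, so most of the 1,000 cores would sit idle; at 99% it is roughly 91.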

      --

      I read the internet for the articles.
    8. Re:Programs people want to use... by Rei · · Score: 0

      Since it's so simple, prove it. Write this in C: Given a data structure - oh, let's say:

      struct my_struct {
      int i;
      double d;
      };

      And a local-context data structure (of your choice) called my_struct_array which contains an arbitrary number of my_struct entries, iterate over all entries and launch a detached threaded function (we'll call it my_function) on each of them in their own thread.

      Here it is in C++11:

      for (auto& i : my_struct_array)
          std::thread([i](){ my_function(i); }).detach();

      It's hardly any longer than the non-threaded version:

      for (auto& i : my_struct_array)
          my_function(i);

      Now, your turn. Show me your "1 function call" C version. Note that this isn't some sort of contrived problem, this sort of thing is one of the most common use cases you'll encounter, so it should be trivial, right? And I'll note, if you're too lazy to do it here, or change the requirements to present yourself with a simpler problem, then I'm going to take it that you're too lazy to do it in your code, too.

      You're on.

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    9. Re:Programs people want to use... by Rei · · Score: 1

      Limited data dependencies are common, it's true, but fundamental lockings between tasks are not that common in the real world. Most real world tasks aren't like matrix multiplication or whatnot. Let's say the task is a video game and your tasks are things like:

      1. Get user input
      2. Translate/rotate moving objects
      3. Backcompute armature positions
      4. Calculate mesh data from armatures
      5. Load/unload new scene data
      6. Load/unload textures
      7. Scale objects by level of detail
      8. Process AI
      9. Play sound effects
      10. Play music
      11. Autosave
      12. Read from the network.
      13. Write to the network
      14. Handle special effect animations
      15. Render

      And on and on and on, your average game has a whole laundry list of these sort of things, and each one is made of many subtasks. Some will be trivial, while others warrant threading even at the subtask level.

      Now, when you look at these, of course they're all obviously interconnected in some ways, you obviously have to use mutexes. But the connections are limited. For example, If you're backcomputing how an armature must be configured, it's obviously going to use the same data structure as the thread that deforms mesh data with armatures. But the only real practical limitation is that the thread that changes armature positions has to lock the one armature it's computing briefly while writing the results of its calculations so that the other thread never reads half-written results - that's it. Likewise, rendering (which has tons and tons of subtasks, and is famously parallel) obviously depends on all sorts of texture and model data from different threads. But again, all it needs is that there not be anything half-written, it doesn't have to wait on any particular result. Objects moving will change their needed level of detail, user actions and collisions may cause sound effects, and on and on, but again, the only requirement is that you not have half-written states.

      This is what the vast majority of CPU-intensive tasks in the real world are like. Yes, you have to use mutexes, and you have to be aware of iterator / pointer invalidation on insert / delete into data structures (where applicable), but apart from those sorts of things, they tend to thread very, very well.

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    10. Re:Programs people want to use... by AuMatar · · Score: 1

      Here's pure C, C++ would lead to slightly neater syntax.

      void do_operation_on_all(my_struct *array, int size, threadfunc func){

        for(int i=0; i<size; i++){
           launch_thread(func, array[i]);
        }
      }

      Where launch thread is a function that calls the correct OS specific function to launch a thread (probably pthread in most cases).

      It would then be called:

      do_operation_on_all(array, size, func);  which is actually even simpler than your solution.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    11. Re:Programs people want to use... by AuMatar · · Score: 1

      And when I said C++ would lead to a nicer syntax- I mean C++ 01 without std::thread and autos. Mainly because you could make it a template function instead of special casing for the type or using void pointers.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    12. Re:Programs people want to use... by bluefoxlucid · · Score: 1

      Most people don't understand lock contention, or lockless code. That's why Dragonfly BSD is ignored, yet is so far ahead: every time someone sees a new problem with parallel computing, with semaphore contention, with threading models, DragonflyBSD is there with a fix from 10 years ago. DragonflyBSD wanted fast semaphores, lockless schedulers, threading models designed to handle running thousands of threads on hundreds of cores, and so on; this was seen, in the early 21st century, as a useless waste of time and a source of complexity. DFBSD is a fork of FreeBSD because the FreeBSD devs wouldn't let the DFBSD guy just do it in FBSD.

      It's one of those things. I expect a long, arduous path to catch up to DragonflyBSD, to Minix, and so on, in the same way that we spent so much time catching up to XFS (ext4 spent years trying to reach feature and performance parity with XFS; it now even has on-the-fly inode allocation as an option). There's always some laughable side project somewhere claiming it will change the world, and there's always a point in the future where everyone else starts imitating that project. Whenever I see something big and long-running like this, I recognize it as some other thing; when people start doing multi-version Linux, I will immediately start talking about NixOS (which I think is implemented like crap, but has the right idea).

    13. Re:Programs people want to use... by Anonymous Coward · · Score: 0

      You are also assuming that your program is made up of a large number of independent tasks. That is, task_2 doesn't rely on the results of task_1, etc. Sure, for some problems this is the case, but the point is not ALL problems can be broken down like that...

    14. Re:Programs people want to use... by Rei · · Score: 2

      BZZT, fail.

      1) You didn't define launch_thread.
      2) my_struct_array was said, and I quote, "a local-context data structure", so congrats, your data is going to go out of scope on you.
      3) The concept of having to write that is absurd because "for (auto&i : container)" is a "do whatever you want, any number of steps, no matching function signature required, inline, on any container whatsoever" built into C++11, *and* it's something that anyone who knows C++11 will know rather than being something you brewed yourself.

      Again, to repeat, given your failures on #1 and #2:

      " if you're too lazy to do it here, or change the requirements to present yourself with a simpler problem, then I'm going to take it that you're too lazy to do it in your code, too."

      Hence, I'm going to take it that you're likewise too lazy to actually thread your code. And the fact that your code contains a fundamental oversight resulting in a memory leak which wouldn't have caused a compile error is just icing on the cake.

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    15. Re:Programs people want to use... by Rei · · Score: 1

      Hmm, I was thinking of your launch_thread in terms of passing by reference, but I now imagine you meant copy (would have helped if you had actually, you know, defined the function). But then you're just adding an extra and unnecessary copy.

      Let me help you out. Your function is going to have to keep a global data structure of all of the threads' arguments because they're too big to pass as the pthread's argument. Now, your array isn't going to be fixed-size because you don't know how many instances are going to be called (you could limit it and put a hard cap, but you still have to put checks for that). If it's pure C, then you don't have STL containers, so you have to implement all of your memory management overhead. Regardless, you at the very least have to do an additional copy of your passed my_struct into your global arguments structure (2x), versus the one that std::thread needs. Now, there is a way to work around having to keep a global data structure, but it sucks: it's to have your launch_thread function pass a pointer to the local copy of my_struct and then sit around and wait for the thread to start up, copy off of the pointer, and then zero out your copy to alert launch_thread that it's started and has copied the data structure (of course, this involves yet another copy, plus a ton of reads while sitting around and waiting and wasting time). All of this, of course, is on top of all of the overhead imposed by pthread itself, including defining a function (and not in the same place where the code is being used, which reduces clarity), and roughly three lines for the pthread calls themselves.

      This is all assuming that you implement it pthread-only and not portable. Otherwise, you have to add in #ifdefs and do a whole different approach for whole different platforms.

      Could you do all this? Of course you could. Would you do it? Clearly you didn't, and I know no amount of badgering would have gotten you to do it (I've tried this experiment before, you're not the first). Could you write it once and then reuse it?** Sure you could. Have you? No, of course you haven't, otherwise you would have just pasted it before. Why haven't you written such a thing before? Because it's too much hassle. Which is the very reason threading is underused.

      ** - kind of. You see, it's actually worse than that because unless you make an even more convoluted and unreadable and type-unsafe function, your thread launcher is going to be only set up for launching this particular case. But one can encounter all kinds of threading needs that would require significant changes. But I digress.

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    16. Re:Programs people want to use... by Bengie · · Score: 1

      It's too early to know if it's just too hard a problem for the human mind in general

      Most user-space parallel problems aren't hard, it's just programmers who use algorithms and data-structures as black-boxes without understanding their implementation or characteristics, or alternatives, or generally being able to think for themselves. I don't know how many times I've glanced at problems that were throughput sensitive, and I immediately saw large potentials for parallelism, but required designs that would be utterly illogical for a serial design.

      Solving code parallelism problems is nearly identical to making well factored code. You need to break down the problem into its atomic parts, then rearrange those parts. Once you understand all atomic parts of a system and all of the data dependencies, parallelism becomes trivial. The problem is most people don't "understand" the system that they're working on, they just mindlessly throw code at a wall and some sticks. Most parallel code really needs to be designed from the beginning. Designing code? What's that?

    17. Re:Programs people want to use... by Anonymous Coward · · Score: 0

      What a massive oversimplification of how threading and games work and how simple it would be for them to use threading. If you think that the reason threading is hard and 'avoided' is because of the typing then you have no idea what you are talking about. Are you even a programmer?

    18. Re:Programs people want to use... by toby · · Score: 1

      It's ridiculous that not only does the article not mention Erlang or Haskell, but no high modded comment does either.

      Sad. Erlang's been around for more than 25 years with its successful lockless model.

      --
      you had me at #!
    19. Re:Programs people want to use... by Anonymous Coward · · Score: 0

      Almost all (multiplayer) games need to be deterministic. Launching all of those tasks in separate threads would require extensive locking which would negate the benefits of being threaded. The final game would likely be slower, not faster. You'd have to completely redesign the engine to be able to do something like what you propose.

      Game developers don't use a lot of threads not because the code is annoying to write, but because the engine complexity goes way, way up with only minor benefits at best. The new bugs would likely far outweigh any other benefits. A different engine architecture could take advantage of it, but gaming companies don't take risks like that and rarely do research. Unless some academic does a thesis on it and provides a detailed architecture doc, it isn't gonna happen. Easy thread creation code isn't some magic bullet you seem to think it is.

    20. Re:Programs people want to use... by Tablizer · · Score: 1

      Maybe that's why the banks F'd up mortgage pricing?

    21. Re:Programs people want to use... by Anonymous Coward · · Score: 0

      There's no memory leak faggot.

      In the code above:
      launch_thread(func, array[i]);

      is a copy cast of array[i]. the definition of array is:

      my_struct *array

      Ie. an array of pointers as an argument. When array[i] is evaluated it is copied as a new struct. If it was written like: launch_thread(func, array+i);

      then it would be a memory leak. Learn some fucking C for fucks sake.

    22. Re:Programs people want to use... by psmears · · Score: 1

      Why haven't you written such a thing before? Because it's too much hassle. Which is the very reason threading is underused.

      LOL. Actually there's a better reason such a thread launch facility doesn't commonly get written - which is that, in most circumstances, it really doesn't help performance that much, if at all - and the added complexity makes for a big net minus. There are a number of issues:

      Firstly, spawning threads is expensive. Yes, on Linux it's "cheap", but that's "cheap" compared to other implementations - it's still a lot compared to doing a modest amount of work on the local CPU. (Why is it so expensive? Basically because there's a lot of housekeeping to do. In addition to the kernel creating new kernel structures for the new thread of execution (similar to creating a process), the process's thread library must allocate a stack for the new thread (involving modifying the process's page tables), iterate through all loaded shared libraries in order to allocate any thread-local storage they require, and so on, requiring multiple syscalls, a TLB flush, at least one context switch, and so on. To some extent the impact of this overhead can be reduced by maintaining a pool of ready-created threads, but this either takes away control of performance (if done automatically by your language/library) or substantially increases complexity (if you implement it yourself, since you then have to synchronise the threads carefully).)

      The second problem is that, unless you're very careful, extra threads don't buy you much performance, and can indeed hurt. Take the example you gave - doing some processing on each struct in an array, where each such struct contains an int and a double (16 bytes total, including alignment padding). With 64-byte cache lines (typical on x86), there are 4 such structs per cache line. If you distribute the processing over threads running on different cores, then instead of one core waiting for the cache line to come in from main memory, and then processing the 4 structs very rapidly (since they're now all in cache), you'll have 4 cores each waiting for the data to be available - i.e. up to a 4x slowdown for memory-bound tasks. And that's assuming the structure is only read from; if it's written to as well then the cache line will have to bounce between cores, and the multithreading slowdown will be many times worse. Now, if you ensure that structs in the same cache line get processed by the same core (ideally in sequence, and by the same kernel thread), then you do potentially get a big speedup - provided you don't hit any other gotchas - but the C++ code you're promoting doesn't seem to guarantee this in any way.
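
      One way to get closer to the cache-friendly distribution described above, sketched under assumptions: instead of one thread per struct, split the array into contiguous blocks, one per worker, so that neighbouring structs are handled by the same thread. `process_in_chunks` is a hypothetical helper and `my_struct` is the type from earlier in the thread. Block boundaries are not aligned to cache lines here, so a line that straddles two blocks can still be shared between cores.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct my_struct {
    int i;
    double d;
};

// Applies `fn` to every element, giving each worker one contiguous block.
template <typename Fn>
void process_in_chunks(std::vector<my_struct>& items, unsigned workers, Fn fn) {
    assert(workers > 0);
    const std::size_t n = items.size();
    const std::size_t chunk = (n + workers - 1) / workers;  // ceiling division
    std::vector<std::thread> pool;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, n);
        pool.emplace_back([&items, begin, end, fn] {
            for (std::size_t k = begin; k < end; ++k) fn(items[k]);
        });
    }
    for (std::thread& t : pool) t.join();  // wait for all blocks to finish
}
```

      A caller might pass `std::thread::hardware_concurrency()` as `workers`; whether this beats a plain loop still depends on how much work `fn` does per element.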

      Third, and perhaps most importantly, data dependencies matter. In your example you're detaching all the threads; this is not realistic, because that means you cannot ever depend on their operations having finished. In the vast majority of cases you do need to know when an operation has finished: you're generally doing work for a reason - i.e. that you're going to use the result - and you can't begin to use that result until you know it has been produced. That, in and of itself, adds complexity: you have to analyse your program's dataflow much more carefully in the presence of threads, because C/C++ will quite happily let you use a variable before another thread has finished assigning to it, without any sort of warning or exception. The analysis can certainly be done, and synchronisation put in place to eliminate the problems - but that is further overhead, both in the program's performance and in the complexity of the program itself, and hence the time taken to write it (and especially to enhance it later, when the synchronisation model may not be so fresh in one's mind).
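
      A minimal sketch of the alternative to detaching, assuming a hypothetical `my_function` that returns a value: `std::async` hands back a `std::future`, and calling `get()` is the point where the caller waits for the result and is then allowed to use it.

```cpp
#include <functional>
#include <future>
#include <vector>

struct my_struct {
    int i;
    double d;
};

// Hypothetical per-element work that produces a result the caller needs.
double my_function(const my_struct& s) {
    return s.i * s.d;
}

double sum_results(const std::vector<my_struct>& items) {
    std::vector<std::future<double>> pending;
    pending.reserve(items.size());
    for (const my_struct& s : items)
        pending.push_back(
            std::async(std::launch::async, my_function, std::cref(s)));

    double total = 0.0;
    for (std::future<double>& f : pending)
        total += f.get();  // blocks until that task has finished
    return total;
}
```

      This still launches one task per element, so the start-up cost and cache concerns from the earlier paragraphs apply unchanged; it only illustrates where the synchronisation point goes.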

      Used correctly and in the right circumstances, threads on an N-core system can give a N-times speedup (or greater, due to caching effects). Used badly, at best they'll reduce performance, and usually they'll increase complexity and lead to subtle bugs that are hard to debug.

      The new thread features in modern C++ are very cool, but the fact they didn't exist before is not what's been preventing competent programmers from using threads all over the place :)

    23. Re:Programs people want to use... by Anonymous Coward · · Score: 0

      Jeez. Just use:
      #pragma omp

      OpenMP will take care of your thread scheduling. Yes, you still have to use #pragma omp correctly. Yes cache contention can be a problem, and sometimes hardware just sucks, but I've managed to get significant speedups on several medium size projects using OpenMP, and it's not that difficult; much simpler than any of the threading libraries I've tried, and you have the advantage that you can quickly ask "is this a threading bug?" by simply rebuilding the software without OpenMP (of course this presumes you've already hit the bug, it says nothing to the proof of absence of bugs).

      Caveat: Clang doesn't (yet) support OpenMP in any release builds.

    24. Re:Programs people want to use... by Anonymous Coward · · Score: 0

      That is why I love OpenMP

      #pragma omp parallel for schedule(dynamic)
      for(size_t i=0; i < tasks.size(); i++)
      {
              tasks[i].execute();
      }

  6. Re:GAY NIGGERS can be DEVELOPERS 1000 WHORES! by Tablizer · · Score: 1

    Example of a brain that can't handle parallel thought processing.

  7. best quote from the article by phantomfive · · Score: 0
    I haven't finished reading yet, but here is the best quote of the article:

    You may recognize David as the co-creator of the Self programming language

    No, actually, I don't think many people will recognize him as that. OK, back to reading the article.

    --
    "First they came for the slanderers and i said nothing."
    1. Re:best quote from the article by Anonymous Coward · · Score: 0

      What's wrong with it? It only said you may recognise him - it didn't say that most or many would.

    2. Re:best quote from the article by Anonymous Coward · · Score: 1

      What's wrong with it? It only said you may recognise him - it didn't say that most or many would.

      Shut up, Dave...

  8. make -j lotsandlots by Anonymous Coward · · Score: 0

    'nuff said. I'll take all the cores you can give me.

    1. Re: make -j lotsandlots by Anonymous Coward · · Score: 0

      Congratulations, now you're IO bound....

    2. Re: make -j lotsandlots by Z00L00K · · Score: 1

      Only if you have a single I/O device and channel.

      NUMA architectures can also apply to disks and other I/O devices.

      Of course - it comes with a new set of problems, but there's no golden solution.

      --
      If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
    3. Re: make -j lotsandlots by Shinobi · · Score: 1

      Something I wish I could have in a workstation again is a full-fledged crossbar switch like the Octane and Octane 2 had.

    4. Re: make -j lotsandlots by Anonymous Coward · · Score: 0

      Only now you, and the other posters in this thread look like the complete idiots you are, because the specific case you are discussing is concurrency, not parallelism.

        Besides the OP is a double idiot for not recognizing that he'll just get a massive deadlock with "lotsandlots" of cores because of cache contention and various congestion issues.

    5. Re: make -j lotsandlots by Z00L00K · · Score: 1

      And what most usage is on a computer is actually concurrency.

      Massive parallelism is a special case, and even then you suffer from concurrency.

      --
      If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
    6. Re: make -j lotsandlots by Anonymous Coward · · Score: 0

      AMD CPUs have had a crossbar switch since the 32-bit Athlons came out. The 64-bit Opteron and descendants, which came out in 2003 and which have evolved into the FX series as well, have NUMA interconnects, which are derived from technology developed at Cray back in the day and subsequently acquired by Sun and SGI, amongst others. Your Octane was very cool back in 1995, but things have moved on.

  9. Clue token? by Anonymous Coward · · Score: 0, Insightful

    Linus doesn't have a clue about much of computing:

    * Floating point? Nope.
    * Graphics? Nope.
    * High performance? Nope.
    * Parallel? Nope.
    * Compiling the Linux kernel? Maybe.

    This is another clear indication he currently lacks the clue token.

    1. Re:Clue token? by Anonymous Coward · · Score: 0

      Linus will be happy as long as he can compile the kernel under a terminal inside a mundane-looking X desktop.

  10. i'm so tired of political correctness by Noah+Haders · · Score: 0, Troll

    i'm so over this idea of political correctness. here's all it means: some assholes want to continue to be unmitigated assholes, just like they remember being "in the good old days." However, people are tired of putting up with their bullshit any more. so the assholes came up with this term "political correctness" to be like "i can say whatever I want, and if you don't like it then you are just being politically correct." maybe you should jsut stop being assholes, k? the world has changed.

    1. Re:i'm so tired of political correctness by Anonymous Coward · · Score: 0

      Fuck you. You can't tell me what I can think or say.

    2. Re:i'm so tired of political correctness by jones_supa · · Score: 1

      I think the actual problem is that some people are so worked up about political incorrectness that they take pleasure in spewing insulting angry messages all day long. Lol, look at my freedomz of speechorz. But a clever guy can say things straight, without being an upsetting dickhead at the same time.

    3. Re:i'm so tired of political correctness by goarilla · · Score: 2

      And some of us just grew up in the sort of nuclear family where offensive expletives are the norm.

    4. Re:i'm so tired of political correctness by Attila+Dimedici · · Score: 4, Insightful

      No, "political correctness" is a thing. It is where someone gets in trouble for using the word "niggardly" because it sounds like another word.

      --
      The truth is that all men having power ought to be mistrusted. James Madison
    5. Re:i'm so tired of political correctness by jareth-0205 · · Score: 1

      Fuck you. You can't tell me what I can think or say.

      So, what you're saying is... his right to tell you things is trumped by your wish to not hear things? Freedom of speech does not mean what you think it means...

    6. Re:i'm so tired of political correctness by Oligonicella · · Score: 1

      You're being pedantic. "You can't tell me" doesn't mean a literal 'you have to not talk', it means you cannot force your will on me to make *me* not think or say things. This was pretty much exactly what the poster he was responding to meant by "you should just stop". He's got freedom of speech correct.

    7. Re:i'm so tired of political correctness by Anonymous Coward · · Score: 0

      > And some of us just grew up in the sort of nuclear family where offensive expletives are the norm.

      Part of growing up is learning how to behave around people who aren't your family.

    8. Re:i'm so tired of political correctness by Anonymous Coward · · Score: 0

      the world has changed

      You would have to be pretty sheltered to believe that. Humans are still tribalistic assholes; if anything, only acceptable targets have changed, when it's not the same old shit dressed up in doublethink.

    9. Re:i'm so tired of political correctness by drinkypoo · · Score: 1

      And some of us just grew up in the sort of nuclear family where offensive expletives are the norm.

      You mean, low-class? I grew up with that kind of family, but I don't have any illusions about whether obscenity is the crutch of the inarticulate motherfucker.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    10. Re:i'm so tired of political correctness by Half-pint+HAL · · Score: 1

      On the other hand, if most people think your word means something different, it's not worth using that word.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    11. Re:i'm so tired of political correctness by Noah+Haders · · Score: 1

      think whatever you want, man. but when you feel like being an asshole, own it and say i'm an asshole. don't say I am just saying my mind and you're too politically correct if you take offense.

    12. Re:i'm so tired of political correctness by Noah+Haders · · Score: 1

      that's exactly what's changed. there's a group of people who were used to being on top for no reason of their own doing. others are like "god this guy's an asshole and I'm fed up with it because there's no reason he's on top". so he's not on top any more but he's really butthurt about it, which is what #gamergate is all about. so to all of these assholes, I'm saying wake up because the problem is you, not everybody else in the world.

    13. Re:i'm so tired of political correctness by Anonymous Coward · · Score: 0

      But how will you get that thrill of self-satisfied superiority?

    14. Re:i'm so tired of political correctness by Anonymous Coward · · Score: 0

      By screaming about how everyone is bigoted.

    15. Re:i'm so tired of political correctness by Attila+Dimedici · · Score: 1

      Perhaps not, but should someone get fired because people think a word they used is related to a racial epitaph, when it isn't?

      --
      The truth is that all men having power ought to be mistrusted. James Madison
    16. Re:i'm so tired of political correctness by Half-pint+HAL · · Score: 1

      Perhaps not, but should someone get fired because people think a word they used is related to a racial epitaph, when it isn't?

      No.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    17. Re:i'm so tired of political correctness by Anonymous Coward · · Score: 0

      epithet, not epitaph. Although, "Got fired for saying niggardly" is one leap away from an epitaph.

    18. Re:i'm so tired of political correctness by Noah+Haders · · Score: 2

      +1 this would make the best gravestone ever.

    19. Re:i'm so tired of political correctness by Anonymous Coward · · Score: 0

      Everything insults someone. Your calling people assholes for disagreeing with me is offensive, maybe you should work on your own fucking delivery, k asshole?

    20. Re:i'm so tired of political correctness by Bengie · · Score: 1

      Part of growing up is learning to know when you don't know what you're talking about and Linus is calling you on it. Every time that I've looked into why Linus was "wrong", it was because he was wrong in theory, but correct in practice, because in practice, people are idiots and Linus recognizes this.

      I assume Linus is looking at this from a practical standpoint: jumping the gun and making massive overhauls of the kernel to optimize for our current limited understanding of concurrent software and hardware interactions, for a problem that most programmers are too stupid to even take advantage of, would be a bit premature. We should wait for hardware and software to better stabilize before we get locked (pun) into a concurrency regime for the next few decades.

      We've only just recently gained concurrent support for network and storage IO, and hardware has been changing a lot in the past few years as we keep scaling up SSDs and 40Gb+ NICs. We can use work-arounds in the meantime, and once everyone says "yes, this is the best way", we can make large kernel changes.

      Another example: AMD is already working on Mantle. Even if it doesn't fully take off, it's research into a related area, and we'll learn a lot from it. At some point in the future, a Mantle-like system may be incorporated into the kernel, but let's not turn the kernel into a cesspool of ever-changing interfaces while they figure this problem out.

    21. Re:i'm so tired of political correctness by u38cg · · Score: 1

      Yes, specifically it's where there's a bit of a stooshie over something silly like niggardly, everyone finally calms down a bit, and then some asshole decides that the correct thing to do is run around shouting "NIGGER NIGGER NIGGER", because political correctness gone mad.

      --
      [FUCK BETA]
    22. Re:i'm so tired of political correctness by Cederic · · Score: 1

      Bury him next to Dr "Got shot for being a paediatrician".

    23. Re:i'm so tired of political correctness by Anonymous Coward · · Score: 0

      i'm so over this idea of political correctness. here's all it means: some assholes want to continue to be unmitigated assholes, just like they remember being "in the good old days."

      Political correctness is an attempt to control communication deemed unacceptable or harmful. This would seem like a laudable goal until one realizes that postmodernist understandings of communication have taken hold within our culture. Increasingly, the intent and/or background of the individual presenting the message is deemed irrelevant to how the message is understood. Among the academics, the history of a term, phrase, or message is more important in understanding the message. Among common folk, the feelings of the recipient of the message are more important in understanding it. How does this play out?

      Say I write an email to you to point out that a "big shipment of books" has arrived. As a common folk, you have a problem with the word "big." Thus, you go off the rails and get angry at me over my use of the word "big." I am informed that I should no longer use that word. That is, my speech is being controlled because it's deemed hurtful. However, I attempt to defend myself against this ridiculous accusation by pointing out what I intended to communicate by the word. You would fire back that my intent is not relevant.

      Sure, the idea sounds ridiculous that anyone would be offended by the word big. That is, it sounds ridiculous until you've spent 10 minutes reading blogs on Tumblr or about the increasingly ridiculous speech control taking place on college campuses. Universities are no longer a place where the open exchange of ideas can take place, but a place where babies are coddled and made to feel "safe" from the big bad ideas that might challenge them. These babied kids are the leaders of tomorrow, and so they will bring a ridiculous form of political correctness to bear on our speech.

    24. Re:i'm so tired of political correctness by Anonymous Coward · · Score: 0

      More like some people are used to being coddled and praised for everything they do, and don't know how to act when someone tells them bluntly that their work is garbage. Sorry, the Real World doesn't give out participation ribbons. Get over it.

    25. Re:i'm so tired of political correctness by Anonymous Coward · · Score: 0

      No, political correctness is simply not being a bigoted asshole. People who complain about "niggardly" are stupid and ignorant but that doesn't make bigots ok.

      Also, the term "political correctness" reflects the intellectual dishonesty and hypocrisy endemic to the right wing, which has very strong strictures on what is politically correct. For instance, you aren't allowed to say that anthropogenic global warming is real and is the consensus view of science, even if you aren't the sort of stupid ignorant git who believes otherwise.

    26. Re:i'm so tired of political correctness by Anonymous Coward · · Score: 0

      You're just being stupid and dishonest.

  11. Core of the article by phantomfive · · Score: 1

    The article makes the point (which is not correct*) that to have high scalability we need lockless designs, because locking has too much overhead. If you can't imagine trying to get ACID properties in a multithreaded system without locks, well, neither can I. And neither can they: they've decided to give up on reliability. They've decided we need to give up on the idea that the computer always gives the correct answer, and instead accept a computer that gives the correct answer most of the time (correct meaning, of course, doing exactly what the programmer told it to do).

    Here is what the guy says: "The obstacle we shall have to overcome, if we are to successfully program manycore systems, is our cherished assumption that we write programs that always get the exactly right answers." Not only that, we need to give up memory/cache locks within the processor (I don't know a whole lot about those), because when you scale to 1000 processes on a single processor, RAM becomes a bottleneck.

    Now, if he's right, and the only way to get such high performance is by not worrying about whether the computer does what it is told, then he's not going to be able to convince many people.


    *It is not correct in situations where each processor can work on a single chunk for a long time, that is, for problems where resource contention is a small fraction of processor time, like in video encoding. Then the overhead is still small, no matter how many processors you have.

    --
    "First they came for the slanderers and i said nothing."
    1. Re:Core of the article by imgod2u · · Score: 3, Insightful

      The idea isn't that the computer ends up with an incorrect result. The idea is that the computer is designed to be fast at doing things in parallel with the occasional hiccup that will flag an error and re-run in the traditional slow method. How much of a window you can have for "screwing up" will determine how much performance you gain.

      This is essentially the idea behind transactional memory: optimize for the common case where threads that would use a lock don't actually access the same byte (or page, or cacheline) of memory. Elide the lock (pretend it isn't there), have the two threads run in parallel and if they do happen to collide, roll back and re-run in the slow way.

      We see this concept play out in many parts of hardware and software algorithms actually. Hell, TCP/IP is built on having packets freely distribute and possibly collide/drop with the idea that you can resend it. It ends up speeding up the common case: that packets make it to their destination along 1 path.
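
      A minimal software sketch of the optimistic pattern described above, not hardware transactional memory itself: the work runs without holding a lock, and a brief validate-and-commit step detects collisions and forces a re-run. The names `VersionedCell`, `optimistic_update` and `run_demo` are hypothetical, chosen for illustration, and Python is assumed only for readability.

```python
import threading


class VersionedCell:
    """Hypothetical shared value with a version counter bumped on every commit."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self._commit_lock = threading.Lock()  # held only for snapshot and commit

    def read(self):
        # Take a consistent (value, version) snapshot.
        with self._commit_lock:
            return self.value, self.version

    def try_commit(self, expected_version, new_value):
        # Commit only if nobody else committed since our snapshot was taken.
        with self._commit_lock:
            if self.version != expected_version:
                return False  # collision: the caller discards its work and retries
            self.value = new_value
            self.version += 1
            return True


def optimistic_update(cell, fn):
    """Apply fn to the cell's value, re-running on collision. Returns the retry count."""
    retries = 0
    while True:
        value, version = cell.read()
        new_value = fn(value)  # speculative work, done without holding the lock
        if cell.try_commit(version, new_value):
            return retries
        retries += 1  # roll back (drop new_value) and try again


def run_demo(threads=4, updates=500):
    cell = VersionedCell(0)

    def worker():
        for _ in range(updates):
            optimistic_update(cell, lambda v: v + 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cell.value
```

      How much this wins depends on how rarely the threads actually collide; with heavy contention the retries can cost more than a plain lock would.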

    2. Re:Core of the article by Rei · · Score: 2

      There are cases where getting exactly the right answer doesn't matter - real-time graphics is a good example. It's amazing the level of error you can have on an object if it's flying quickly past your field of view and lots of things are moving around. In "The Empire Strikes Back" they used a bloody potato and a shoe as asteroids and even Lucas didn't notice.

      That said, it's not the general case in computing that one can tolerate random errors. Nor is the concept of tolerating errors anything new. Programmers have been using, for example, approximations for square roots for a long, long time to save compute cycles where precision takes a back seat to "just get the shape of the curve roughly right". There are even a number of lower-precision hardware math methods.

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    3. Re:Core of the article by Rei · · Score: 1

      I'm wondering about what he is thinking for real-world details. For example, a common use case is one thread does searches through a data structure to find an element (as, say, a pointer or an iterator), but before it can dereference it and try to access the memory, some other thread comes along and removes it from the list and frees it. Then your program tries to dereference a pointer or iterator that's no longer valid and it crashes.

      The problem isn't that it's no longer in the list. Clearly the other thread had a good reason to remove it and if your first thread had happened just a split second later it never would have seen the removed entry. The problem is that your program crashes because it's trying to use a freed memory address.

      What sort of implementation details is he thinking of that prevent this sort of problem? I mean, if he actually has a realistic solution, I'd love to see it, it could make for a brilliant extension of STL containers.

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    4. Re:Core of the article by Half-pint+HAL · · Score: 1

      Immutability gets rid of problems of atomicity. Functional Programming is therefore a viable paradigm for parallel computing. Even if Scala is a bad example in terms of implementation, Odersky was right in deciding that scalability worked better with immutability-by-default. Mutability is there if you need it, but it's a conscious choice, which forces you to think carefully about it.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    5. Re:Core of the article by phantomfive · · Score: 1

      Immutability gets rid of problems of atomicity. Functional Programming is therefore a viable paradigm for parallel computing.

      Functional programming makes the conceptual load easier, but it's not a magic bullet. The algorithmic problems are still there, because sometimes you need to update the database.

      --
      "First they came for the slanderers and i said nothing."
    6. Re:Core of the article by phantomfive · · Score: 1

      Yeah, in cases where you don't care about precision, then precision doesn't matter. That's different than not being able to perform a transaction.

      --
      "First they came for the slanderers and i said nothing."
    7. Re:Core of the article by phantomfive · · Score: 1

      Yeah, good luck doing transactions without locks.

      --
      "First they came for the slanderers and i said nothing."
    8. Re:Core of the article by Bengie · · Score: 1

      Immutability increases memory usage, which puts pressure on your allocator, which also needs to be thread safe, and garbage collecting tends to be a "stop world" issue, which means all of your threads stopping. Depending on how you use "immutability", it could be worse than mutability for parallelism. People need to understand how things work in order to understand how they interact. There is no magic bullet, programmers need to start understanding and stop assuming everything is a blackbox that just works as desired.

    9. Re:Core of the article by Half-pint+HAL · · Score: 1

      Functional programming makes the conceptual load easier, but it's not a magic bullet. The algorithmic problems are still there, because sometimes you need to update the database.

      OK, functional programming isn't a good paradigm for database implementation, but DBMSes are typically a prewritten blackbox. Most mutability in large systems is best done in a database anyway, and then the DBMS will handle the lion's share of the resource locking.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    10. Re:Core of the article by shutdown+-p+now · · Score: 1

      If you can't imagine trying to get ACID properties in a multithreaded system without locks, well, neither can I.

      Databases have been doing ACID without locks for a while now, with MVCC. As I understand, STM is basically built on the same ideas for running program state.

    11. Re:Core of the article by phantomfive · · Score: 1

      Nah, that still requires locks for writing, it just lets you read while someone else is writing.

      --
      "First they came for the slanderers and i said nothing."
    12. Re:Core of the article by phantomfive · · Score: 1

      Well, I certainly favor functional programming, especially on the server side as business logic... I just don't think it's going to give any real algorithmic improvements like these guys are talking about. It makes multithreading easier, and easier to do without bugs, but not more efficient. That is my main point.

      --
      "First they came for the slanderers and i said nothing."
    13. Re:Core of the article by shutdown+-p+now · · Score: 1

      MVCC? It only requires locks when committing the transaction (and reconciling it with other transactions), which is a much shorter duration than writing itself. You can easily have concurrent writes in two ongoing snapshots without any locking whatsoever, so they don't have to wait on each other.
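
      A toy sketch of that shape, assuming a simplified snapshot-isolation scheme rather than any particular database's implementation: each transaction reads from its own snapshot, buffers its writes privately, and takes a lock only for the short commit step. `MVCCStore` and `Transaction` are hypothetical names.

```python
import threading


class MVCCStore:
    """Hypothetical multi-version store: each key keeps a history of committed values."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_version, value), oldest first
        self._clock = 0      # version number of the latest completed commit
        self._commit_lock = threading.Lock()

    def begin(self):
        return Transaction(self, self._clock)

    def read_as_of(self, key, snapshot, default=None):
        # Newest value committed at or before the snapshot; later commits are invisible.
        for commit_version, value in reversed(tuple(self._versions.get(key, ()))):
            if commit_version <= snapshot:
                return value
        return default

    def commit(self, txn):
        with self._commit_lock:  # the only lock, held just for reconciliation
            for key in txn.writes:
                history = self._versions.get(key)
                if history and history[-1][0] > txn.snapshot:
                    return False  # someone else committed this key after we began
            new_version = self._clock + 1
            for key, value in txn.writes.items():
                self._versions.setdefault(key, []).append((new_version, value))
            self._clock = new_version  # publish last, so new snapshots see whole commits
            return True


class Transaction:
    def __init__(self, store, snapshot):
        self.store = store
        self.snapshot = snapshot
        self.writes = {}  # private write buffer, no locking needed

    def read(self, key, default=None):
        if key in self.writes:
            return self.writes[key]
        return self.store.read_as_of(key, self.snapshot, default)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        return self.store.commit(self)
```

      Two transactions that write different keys both commit; two that write the same key conflict at commit time, and the loser has to retry.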

    14. Re:Core of the article by Anonymous Coward · · Score: 0

      The Linux kernel has already solved this problem. Google "read copy update". It doesn't scale so well to STL containers because it relies on details of Linux, in particular that you can set things up so that there is an upper bound on how long you need to keep something around, after which you can be assured that no one is supposed to hold a reference to it any more, so thread A doesn't have to tell thread B that it's safe to dispose of the memory, thread B can figure that out without talking to thread A.
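
      A rough user-space analogy of the read-copy-update idea, not the kernel's implementation: readers grab a reference to an immutable snapshot without locking, and a writer copies, modifies and publishes a new snapshot. `RcuList` is a hypothetical name. In this sketch Python's garbage collector keeps an old snapshot alive while any reader still holds it, which stands in for the grace-period mechanism the kernel uses.

```python
import threading


class RcuList:
    """Hypothetical RCU-style list: lock-free reads, copy-on-write updates."""

    def __init__(self, items=()):
        self._snapshot = tuple(items)          # immutable; readers never see it change
        self._writer_lock = threading.Lock()   # serializes writers only

    def read(self):
        # A single reference read; the returned tuple stays valid for as long
        # as the reader keeps it, even if a writer publishes a newer version.
        return self._snapshot

    def remove(self, item):
        with self._writer_lock:
            old = self._snapshot
            # Copy, update, then publish by swapping the reference.
            self._snapshot = tuple(x for x in old if x != item)

    def append(self, item):
        with self._writer_lock:
            self._snapshot = self._snapshot + (item,)
```

      A reader that found an element just before it was removed still sees a valid (if stale) element instead of freed memory.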

    15. Re:Core of the article by phantomfive · · Score: 1

      MVCC? It only requires locks when committing the transaction (and reconciling it with other transactions), which is a much shorter duration than writing itself.

      True.

      --
      "First they came for the slanderers and i said nothing."
    16. Re:Core of the article by Mr+Z · · Score: 1

      Eventual consistency means that the computer eventually computes the right answer if it's quiescent long enough. Intermediate values, though, are an approximation, which is often enough.

      One example that Paul McKenney gives is of a distributed counter built out of per-CPU counters, and CPU-to-CPU events saying how much to update the total by. (Let's assume positive counts only.)

      Each CPU will see update events from other CPUs in different orders, each saying how much to update the count by. All CPUs will eventually see all updates. So, the total seen by any given CPU might differ from the true total in the short run (and may not even be a technically valid total given the original source of events, since events get reordered), but eventually all of the counters will converge on the same total if updates stop pouring in. Also, the totals are still locally monotonic.

      If you required all CPUs to see the same sequence of updates to the count, then you have to take locks and serialize memory accesses, which on a manycore system is an expensive operation that simply doesn't scale well. But, if you relax the constraint to "eventual consistency" and "monotonic updates", then each core can have its local approximation that isn't too far from the real value, knowing that each core is no further from the true value than the backlog of events yet to arrive.

      That's an extremely reasonable model for many types of data.
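
      A small single-threaded simulation of that counter, as a sketch only: every simulated CPU receives the same update events in its own shuffled order, so the intermediate totals differ between CPUs while the final totals agree. `simulate_counters` and its parameters are hypothetical, and message delivery is modeled as a plain shuffle.

```python
import random


def simulate_counters(num_cpus=4, events_per_cpu=5, seed=0):
    """Deliver the same positive updates to every CPU in a different order."""
    rng = random.Random(seed)
    # Each event is (origin_cpu, positive increment).
    events = [(cpu, rng.randint(1, 9))
              for cpu in range(num_cpus)
              for _ in range(events_per_cpu)]
    true_total = sum(delta for _, delta in events)

    views = [0] * num_cpus                      # each CPU's local approximation
    histories = [[] for _ in range(num_cpus)]   # running totals seen by each CPU
    for cpu in range(num_cpus):
        order = events[:]
        rng.shuffle(order)                      # this CPU's own arrival order
        for _, delta in order:
            views[cpu] += delta
            histories[cpu].append(views[cpu])
    return true_total, views, histories


if __name__ == "__main__":
    total, views, histories = simulate_counters()
    print("true total:", total)
    for cpu, history in enumerate(histories):
        print(f"cpu {cpu} running totals:", history)
```

      Because the increments are positive, each CPU's running total only goes up, and at any moment it is short of the true total by exactly the sum of the events that have not reached it yet.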

    17. Re:Core of the article by imgod2u · · Score: 1

      A lot of transactional machines don't have locks. That's not to say they don't have any mutex-like structures altogether but rather, the sequences themselves are treated as locks, thus allowing a finer granularity than normal mutex algorithms.

    18. Re:Core of the article by imgod2u · · Score: 1

      How about graceful seg faults instead of program crashes? Obviously modern architectures don't really support such things, but one can imagine a processor that detects bad pointers instead of causing the program to crash. In fact, each program, or even each transaction, could register a pre-determined fault handler.

      What'll happen is:

      1. Thread A sets a "start of code snippet" and programs an address that has a fault handler.
      2. Thread B starts its processing as well.
      3. Thread A at some point tries to dereference a pointer at address X.
      4. Thread B races ahead and deletes the pointer at address X.
      5. Normally, in protected memory, the processor would throw a fit as thread A tries to access an illegal memory address.
      6. Instead, the processor jumps to thread A's custom fault handler.
      7. Thread A's fault handler sees "hey, my code snippet tried to access an illegal address and I, the thread, am not guaranteed to be thread safe". It then rolls back all of the work it's done up until the instruction that faulted.
      8. Thread A tries again starting from 1. It could, at some point, decide to not try the thread unsafe method (if it faults too many times) and actually use the old mutex locking method.

      The idea is that the majority of the time, thread A and thread B don't actually conflict. Or thread A wins the race. In those cases, you have a case of parallel computation speedup.

      It's up to the programmer (or compiler, probably a JIT) to recognize when to exploit this by analyzing the algorithm and the likelihood of conflict. A JIT would probably use profiling information it gets in real time.

      Nobody's saying this will replace 100% of all synchronization methods. But we don't need to. To get a speedup, you only need to technically replace 1 use case. But most likely, you can replace a lot (90%) of use cases.
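
      The eight steps above can be sketched in software, with an exception standing in for the hardware fault. This is an illustration under that assumption, not a description of any existing processor feature; `Conflict`, `run_speculatively` and `make_flaky_attempt` are hypothetical names.

```python
import threading


class Conflict(Exception):
    """Stands in for the fault raised when the snippet touches freed memory."""


def run_speculatively(attempt, fallback_lock, locked_path, max_faults=3):
    """Try the thread-unsafe fast path; after too many faults, use the mutex.

    Returns (result, number_of_faults_seen).
    """
    faults = 0
    while faults < max_faults:
        try:
            # Steps 1-3: run the snippet without any lock. It must keep its
            # work local so that discarding it counts as a rollback.
            return attempt(), faults
        except Conflict:
            # Steps 6-7: the "fault handler" drops the partial work.
            faults += 1
    # Step 8: give up on speculation and use the old mutex method.
    with fallback_lock:
        return locked_path(), faults


def make_flaky_attempt(failures):
    """Test helper: an attempt that faults a fixed number of times, then succeeds."""
    state = {"left": failures}

    def attempt():
        if state["left"] > 0:
            state["left"] -= 1
            raise Conflict("entry was freed by another thread")
        return "fast path result"

    return attempt
```

      For example, `run_speculatively(make_flaky_attempt(2), threading.Lock(), lambda: "slow path result")` faults twice and then returns the fast-path result, while an attempt that keeps faulting ends up on the locked path.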

  12. How parallel does a Word Processor need to be? by Nutria · · Score: 3, Interesting

    Or a spreadsheet? (Sure, a small fraction of people will have monster multi-tab sheets, but they're idiots.)
    Email programs?
    Chat?
    Web browsers get a big win from multi-processing, but not parallel algorithms.

    Linus is right: most of what we do has limited need for massive parallelization, and the work that does benefit from parallelization has been parallelized.

    --
    "I don't know, therefore Aliens" Wafflebox1
    1. Re:How parallel does a Word Processor need to be? by phantomfive · · Score: 1

      Emacs can always use more cores [flame suit on]

      --
      "First they came for the slanderers and i said nothing."
    2. Re:How parallel does a Word Processor need to be? by Anonymous Coward · · Score: 1

      Emacs is a bad OS that can only use one core. If you use Erc (an irc client inside of Emacs), you will notice the real pain of Emacs. While Erc tries to reconnect to a server, you can do absolutely nothing, not even changing to another buffer. You just have to sit there waiting for it to either succeed or to time out.

      Emacs is only alive because of how it can handle some code, but as an OS it is terribly broken as it is single threaded.

    3. Re:How parallel does a Word Processor need to be? by phantomfive · · Score: 1

      Emacs is a bad OS that can only use one core. If you use Erc (an irc client inside of Emacs), you will notice the real pain of Emacs. While Erc tries to reconnect to a server, you can do absolutely nothing, not even changing to another buffer

      Isn't there a version of select() inside emacs? In other words, some kind of non-blocking connect?

      --
      "First they came for the slanderers and i said nothing."
    4. Re: How parallel does a Word Processor need to be? by Anonymous Coward · · Score: 0

      Browsers get massive gains from parallel algorithms... Just check out Mozilla's research browser 'Servo' and how parallel layout (amongst other parallel operations) made pages load in 25% of the time. They're aiming for an alpha release of Servo this year.

    5. Re: How parallel does a Word Processor need to be? by Anonymous Coward · · Score: 0

      Remember that some of Servo's gains may be because their implementation isn't quite correct yet. When it's done it probably won't lose all of it, but until they ship a rendering engine that's actually usable against any random tag soup it's not quite the time to look at the performance results.

      In case you actually know about how rendering HTML works... think about the horror that is table layout.

    6. Re:How parallel does a Word Processor need to be? by Anonymous Coward · · Score: 0

      It doesn't have to be parallel. Nothing has to be parallel. But it'd be nice if we could have applications efficiently use multiple cores for the following reasons:

      - Keeping up Moore's law of roughly doubling the computing power every 1.5 years. (We've hit a clock frequency wall with silicon, graphene won't be coming too soon, and even if it does, parallelism applies to graphene just as it does to silicon.) While the computing power keeps increasing, we can't make use of it, because it increases through adding more cores instead of increasing clock frequency like in the past.
      - Fine grained energy control: If we have 1000 cores and applications can efficiently make use of them, then we can shutdown cores one by one depending on how much energy is left in the battery. Instead of what's happening currently where only the clock frequency can be adjusted and only parts of a core can be shut down.
      - Decreased complexity in chip design. Why spend transistors for complex out-of-order execution units, complex branch prediction schemes etc to get small speed ups when you can just design one simple core and then copy & paste it as many times as your transistor budget allows. And besides it's questionable how much speed up you get by these complex schemes and depends a lot on the applications too. With (strong) scaling parallelism it's simple, more cores more computing power.

      Of course it's an illusion that every application can be made to efficiently use multiple cores. More cores won't make computing the Fibonacci sequence faster just like having more runners won't speed up running a marathon. However we're still far away from efficiently parallelizing the parallelizable parts of applications. And that's where current parallel computing research comes into play.
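
      The limit described above is usually stated as Amdahl's law (the post does not name it; it is brought in here as the standard formula). A short sketch, with `amdahl_speedup` as a hypothetical helper name:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the work can run in parallel.

    parallel_fraction: share of the run time that parallelizes (0.0 to 1.0).
    cores: number of cores the parallel share is spread across (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for fraction in (0.5, 0.9, 0.99):
        print(fraction, round(amdahl_speedup(fraction, 1000), 2))
```

      Even with 1000 cores, a program that is 90% parallelizable tops out a little under a 10x speedup, which is why parallelizing more of each application matters as much as adding cores.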

      By the way, Linus is making a good point with the big caches. While the computing power has been able to keep up with Moore's law, it's a bit different when it comes to memory. The size of memory has increased nicely; however, the clock frequency/access times of memory haven't been able to keep up with the clock frequency of CPUs. That's why big caches are so awesome: they hide this discrepancy very well.

    7. Re:How parallel does a Word Processor need to be? by Anonymous Coward · · Score: 0

      Maybe they mean more parallel computing by having the OS give programs their own cores rather than task switch? Otherwise yeah, I agree that there is only so far some programs need to go where any advantages in performance are wasted in building the damn thing!

    8. Re:How parallel does a Word Processor need to be? by maccodemonkey · · Score: 2

      Or a spreadsheet? (Sure, a small fraction of people will have monster multi-tab sheets, but they're idiots.)
      Email programs?
      Chat?
      Web browsers get a big win from multi-processing, but not parallel algorithms.

      Linus is right: most of what we do has limited need for massive parallelization, and the work that does benefit from parallelization has been parallelized.

      This is kind of silly. Rendering, indexing and searching get pretty easy boosts from parallelization. That applies to all three cases you've listed above. Web browsers especially love tiled parallel rendering (very rarely these days does your web browser output get rendered into one giant buffer), and that can apply to spreadsheets too.

      A better question is how much parallelization we need for the average user. While the software algorithms should nicely scale to any reasonable processor/thread count, on the hardware side you do have to ask how many cores we really need, especially since a lot of users are happy right now. But targeting these sorts of operations as a single thread is also the entirely wrong approach. It's not power efficient for mobile users, and it drastically limits the gains your code will see on new hardware, while competing source bases pass you up.

    9. Re:How parallel does a Word Processor need to be? by Nutria · · Score: 1

      indexing and searching get pretty easy boosts from parallelization.

      How much indexing and searching does Joe User do? And what percent is already done on a high-core-count server where parallel algorithms have already been implemented in the programs running on that kit?

      Web browsers especially love tiled parallel rendering

      Presuming that just a single tab on a single page is open, how CPU bound are web browsers running on modern 3GHz kit? Or are they really IO (disk and network) bound?

      --
      "I don't know, therefore Aliens" Wafflebox1
    10. Re:How parallel does a Word Processor need to be? by gnasher719 · · Score: 1

      I'll give you an example. I use iBooks to read eBooks. I downloaded two eBooks which are actually each a collection of fifty full-size books. On my MacBook, the one I'm currently reading displays that it's around page 8,000. The total is about ten thousand pages.

      If I change the font size, it recalculates the pages and page breaks for the whole book. One CPU running at 100% for a very long time. For a five hundred page book, no problem. For a ten thousand page book, big problem. I'd love it if the re-pagination process would use all the cores that are available.

    11. Re:How parallel does a Word Processor need to be? by Nutria · · Score: 1

      First thought: Why the hell aren't those two (total of) 10,000 page "books" split into their constituent 50 "actual" books?

      That's the kind of parallelization and work optimization that needs to take place before algorithm changes.

      --
      "I don't know, therefore Aliens" Wafflebox1
    12. Re:How parallel does a Word Processor need to be? by mean+pun · · Score: 1

      That's actually a good example of Linus' point. You can not easily parallelise pagination, because you need to know what fits on one page before you can paginate the next page. Sure, you can do some heuristics (every page contains exactly 1000 words), which is dangerous and at best lowers the quality of the result. You can also try to be clever and for example paginate chapters in parallel and then do the exact numbering afterwards in a separate pass, but before you know it you have turned a fairly simple algorithm into something highly complicated and fragile. And for what? A few corner cases.
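
      For what it's worth, the chapter-level variant can be sketched briefly if one assumes every chapter starts on a fresh page (that assumption is what removes the dependency between chapters). The names below are hypothetical, paragraphs are reduced to line counts, and threads are used only to show the structure; CPU-bound Python code would need processes to see a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def paginate_chapter(paragraph_lines, lines_per_page):
    """Greedy, inherently sequential pagination of one chapter.

    paragraph_lines: line count of each paragraph, in order.
    Returns a list of pages, each a list of paragraph indices.
    A paragraph longer than a page simply gets a page to itself.
    """
    pages, current, used = [], [], 0
    for index, lines in enumerate(paragraph_lines):
        if current and used + lines > lines_per_page:
            pages.append(current)      # page is full, start the next one
            current, used = [], 0
        current.append(index)
        used += lines
    if current:
        pages.append(current)
    return pages


def paginate_book(chapters, lines_per_page, workers=4):
    """Paginate chapters in parallel, then number pages in a separate pass."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chapter = list(pool.map(
            lambda chapter: paginate_chapter(chapter, lines_per_page), chapters))
    counts = [len(pages) for pages in per_chapter]
    # Second pass: a running sum of page counts gives each chapter's first page.
    first_pages = [1 + offset for offset in accumulate(counts, initial=0)][:-1]
    return per_chapter, first_pages
```

      The numbering pass is a simple running sum, so the extra complexity stays small as long as the fresh-page assumption holds; without it, the fragility described above comes straight back.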

    13. Re:How parallel does a Word Processor need to be? by Anonymous Coward · · Score: 0

      Utterly missing the point. It's not about X cores for a single application, it's about running a whole load of crap at the same time without having to task switch via schedulers and interrupts. You probably have the same lack of understanding on network bandwidth too, always thinking of the singular instance rather than parallel multi-user. Massive case for home users will be music/video creation and the killer application: gaming.

    14. Re:How parallel does a Word Processor need to be? by swilver · · Score: 1

      Yes, and it re-renders all the pages as bitmaps at 400% zoom, scales them back down to get proper anti-aliased results, then compresses them with JPEG and stores them in main memory... or how about just recalculating the page that you need to display?

      Parallel processing is not gonna solve stupidity.

    15. Re:How parallel does a Word Processor need to be? by Rei · · Score: 1

      The key question is, what are as many common example cases one can list (in order of frequency times severity) where users' computers have lagged by a perceptible amount which in any way reduced their user experience, or caused the user to have to forgo features that would otherwise have been desirable? Then you need to look at the cause.

      In the overwhelming majority of cases, you'll find that "more parallelism with more cores" would be a solution. So why not just bloody do it?

      Not everybody suffers performance problems in the same way. But the vast majority of people's performance problems can be solved by the same solution.

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    16. Re:How parallel does a Word Processor need to be? by Rei · · Score: 0

      No, it's a good example of disproving Linus's point. Even if one can't conceive of an algorithm that can paginate in a non-linear fashion, they can certainly well bloody paginate each of the 100 books at the same time, each on their own core.

      And seriously, renumbering pages is supposed to be some sort of complicated task, according to you? *boggle*

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    17. Re:How parallel does a Word Processor need to be? by Anonymous Coward · · Score: 0

      How much indexing and searching does Joe User do?

      A lot. All programming systems provide trees and maps for indexing and searching. Consider web browsing. Every single Javascript object is conceptually a map, with an index against which property references are resolved (search.) Every web page is a composition of thousands of elements that are referenced dynamically by various identifiers including id, class and other features, which is all accelerated by various indexed data structures that must be searched.

      Ordinary web browsing is a cacophony of indexing and searching. One of Mozilla's primary motivations in developing the Rust programming language, which has as its key application the Servo layout engine, is fully utilizing hardware concurrency to implement dynamic layout and rendering of HTML content. Fast and power efficient layout and rendering is both hard and crucial, so when your primary (only?) product is web browsers, you have a keen interest in utilizing however many cores are available.

    18. Re:How parallel does a Word Processor need to be? by Anonymous Coward · · Score: 0

      Excel... imagine a CPU per cell! Real time data/updates :-)

      And yes, that would be a very limited application.

      Besides, the problem isn't necessarily cores, it's the cache... if you have two very efficient threads talking (sharing memory), it's often MUCH slower than having 1 cpu (one cache) do the same exact job...

      So... how many completely distinct (no shared memory) tasks can you think of on an average user Desktop? Perhaps 30-ish? (would it benefit the web-browser to spin off a thread (to run on another CPU) for each page reader, even when it has to invalidate the cache to read the pages? who knows...).

      So yah... 4 may be enough for everyone :-)

    19. Re:How parallel does a Word Processor need to be? by jellomizer · · Score: 1

      You are stating that Linux is a Desktop OS?
      That we need Linux for these mundane tasks?

      We need Linux for big calculations. Faster Databases (Parallel sorting of data, parallel searches).
      Faster collection for Decision Support Systems... Err um Business Intelligence Systems... Err um... Big Data... Whatever the buzzword of the week for statistical based calculations.

      Some things just need parallelization but others need parallel processing.

      A good parallel algorithm can bring computation time down by an order of magnitude. Given enough processors, a sort can happen in O(log(n)) time. A search can happen in O(c) time.

      These improvements happen when you have larger sets of data. Not baby toys like Word Processors, Spread Sheets and Email.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    20. Re:How parallel does a Word Processor need to be? by Half-pint+HAL · · Score: 1

      And of course, every HTML document is a tree, and any tree can be parallelised.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    21. Re:How parallel does a Word Processor need to be? by Half-pint+HAL · · Score: 1
      If there's a hard break in the ebook (ie new chapters start on new pages), you can divide-and-conquer at page breaks and number as x+1, x+2,... x+n and then substitute for x when the previous calculation finishes. The GP's problem isn't that he's waiting for page numbers, but that he's waiting to see his book. It's the reader software that's waiting for the page numbers before it allows him to see the current page. From the user's perspective, having the page visible immediately with "page ?? of ??" at the bottom for the next few minutes is infinitely preferable to waiting a few minutes to get everything at once.

      Of course, this doesn't need parallelisation -- it could just as easily be done in a single core:
      1. Search back from current page to previous hard break.
      2. Repaginate x+n until next hard break found
      3. Render to screen as "page ?? of ??"
      4. Repaginate from beginning of text to start of current segment.
      5. Update screen to show (for example) "page 8123 of ??".
      6. Repaginate remainder of document.
      7. Update total page number, eg "page 8123 of 10424".
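
      The ordering in the steps above can be sketched on a single core as a generator that hands back a new footer string each time more of the document has been paginated. This is only an illustration under simple assumptions: the document is a list of segments separated by hard breaks, each segment is a list of lines, and a page holds `LINES_PER_PAGE` lines. All names are hypothetical.

```python
LINES_PER_PAGE = 30  # assumed page capacity, purely illustrative


def page_count(segment):
    """Pages used by one hard-break-delimited segment (at least one)."""
    return max(1, -(-len(segment) // LINES_PER_PAGE))  # ceiling division


def repaginate(segments, current, line_in_segment):
    """Yield footer strings in the order the steps above describe."""
    # Steps 1-3: only the current segment is needed to show the page itself.
    local_page = line_in_segment // LINES_PER_PAGE + 1  # the "n" in x+n
    yield "page ?? of ??"
    # Steps 4-5: paginate everything before the current segment to learn x.
    pages_before = sum(page_count(s) for s in segments[:current])
    page = pages_before + local_page
    yield f"page {page} of ??"
    # Steps 6-7: count the current segment and the remainder to learn the total.
    total = pages_before + sum(page_count(s) for s in segments[current:])
    yield f"page {page} of {total}"
```

      A reader UI would redraw the footer on each yielded value, so the page is visible after the first step rather than after the last.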

      That iBooks doesn't do this is a result of design, and mostly based on assumptions of book length. The GP's call for parallelisation is wasted: if the developers consider cases like his rare enough to be negligible, they most likely wouldn't bother parallelising the code on a more parallel computer anyway.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    22. Re:How parallel does a Word Processor need to be? by Nutria · · Score: 1

      And of course, every HTML document is a tree, and any tree can be parallelised.

      No one has answered the question about disk and network slowness.

      What's the human-perceived benefit of rewriting Firefox to get a 1/2 second speedup in page rendering when I'm still waiting 3-4 seconds for some ad server to send me the rest of its crud (ABP needing to be blocked so that videos on ESPN will play)?

      --
      "I don't know, therefore Aliens" Wafflebox1
    23. Re:How parallel does a Word Processor need to be? by laird · · Score: 1

      Pagination can be largely parallelized, because you can do most of the analysis (line layout, font rendering, etc.) in parallel. The only part that's got to be sequential is breaking the lines onto pages. You can then parallelize the rest of the page layout (headers, etc.).
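
      A rough sketch of that split, assuming plain text where "line layout" is just word wrapping: paragraphs are wrapped independently in a pool, and only the page-breaking loop runs in order. `WIDTH`, `LINES_PER_PAGE` and the function names are made up for the example, and CPython threads here show the structure rather than a real speedup.

```python
import textwrap
from concurrent.futures import ThreadPoolExecutor

WIDTH = 40          # assumed line width in characters
LINES_PER_PAGE = 3  # assumed page height, kept tiny so the result is easy to check


def layout_paragraph(text):
    """Break one paragraph into lines; independent of every other paragraph."""
    return textwrap.wrap(text, width=WIDTH)


def paginate(paragraphs):
    """Wrap paragraphs in parallel, then break the lines onto pages in order."""
    # Parallel phase: line layout per paragraph.
    with ThreadPoolExecutor() as pool:
        wrapped = list(pool.map(layout_paragraph, paragraphs))
    # Sequential phase: where a page ends depends on everything before it.
    pages, page = [], []
    for lines in wrapped:
        for line in lines:
            page.append(line)
            if len(page) == LINES_PER_PAGE:
                pages.append(page)
                page = []
    if page:
        pages.append(page)
    return pages
```

      The sequential loop only counts lines, so it is cheap compared with the wrapping work that was done in parallel.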

    24. Re:How parallel does a Word Processor need to be? by vidnet · · Score: 1

      How parallel does a Word Processor need to be?

      Don't forget the complementary question, "How fast does each individual core have to be to run a single threaded Word Processor at an acceptable speed?"

      Imagine if instead of 4x 3GHz Xeons you had 4,000x 486s or 4,000,000x 286s.

    25. Re:How parallel does a Word Processor need to be? by Moof123 · · Score: 1

      Moore's Law has always been behind the MS Word curve. No matter how fast the processor, or how many cores, Microsoft has managed to use them up and then some, making word processing for the masses slower, and slower...

    26. Re:How parallel does a Word Processor need to be? by Nutria · · Score: 1

      "How fast does each individual core have to be to run a single threaded Word Processor at an acceptable speed?"

      WordPerfect 6.0 ran great on a 286 w/ 640KB, and WordStar was zippy on a 4MHz Z80 with 64KB (it was the floppy disk IO that hurt performance).

      So... the answer to your question is: not very!

      --
      "I don't know, therefore Aliens" Wafflebox1
    27. Re:How parallel does a Word Processor need to be? by Anonymous Coward · · Score: 0

      - Decreased complexity in chip design. Why spend transistors for complex out-of-order execution units, complex branch prediction schemes etc to get small speed ups when you can just design one simple core and then copy & paste it as many times as your transistor budget allows. And besides it's questionable how much speed up you get by these complex schemes and depends a lot on the applications too. With (strong) scaling parallelism it's simple, more cores more computing power.

      This is exactly Linus' argument: fewer, complex out-of-order cores with more cache beat numerous simple cores with less cache for typical desktop computing workloads. Strong scaling parallelism only happens in graphics and a few other niches. His answer to "Why spend transistors on that?" is "Because it performs better".

    28. Re:How parallel does a Word Processor need to be? by Anonymous Coward · · Score: 0

      That's just a terrible design, whoever came up with it should be sacked.

    29. Re:How parallel does a Word Processor need to be? by Anonymous Coward · · Score: 0

      That's not true at all. Like sorting, there can be algorithms and approaches to rendering pages that are very parallel without being very complex. The fundamental task is arranging thousands of independent text blocks; how could that NOT lend itself to parallel calculation? I think your mistake is in thinking that pages don't exist until the words are there to fill them.

    30. Re:How parallel does a Word Processor need to be? by Half-pint+HAL · · Score: 1

      You're quite right -- browsing is I/O bound, and the fact that Firefox refuses to render many pages mid-load is a design decision, and nothing to do with either serial-vs-parallel or concurrency. The place where Firefox still stutters on concurrency is where an object in one page crashes all the pages with objects of the same type. The main selling point of the original Google Chrome was threading Javascript, Flash et al so that no one page would kill the browser.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    31. Re:How parallel does a Word Processor need to be? by Agripa · · Score: 1

      How much indexing and searching does Joe User do? And what percent is already done on a high-core-count server where parallel algorithms have already been implemented in the programs running on that kit?

      Indexing (sorting) and searching in my email client are right up there with games, engineering applications, compression/decompression, and error recovery in using 100% of 1 CPU core while not being I/O limited. Firefox is in that list as well.

      So most of the programs that I use which contribute to a slow experience on the user side do not take advantage of multiple cores. The exceptions are video transcoding and error recovery set generation which can use as many cores as I can provide which is currently up to 4.

  13. Sounds like programmers from 40+ years ago by Anonymous Coward · · Score: 0

    Linus sounds like a programmer from 40 years ago....."Nobody will ever need more than 2 digits for a year, so the crazies suggesting years be represented by 4 digits are just that - crazy."

    1. Re:Sounds like programmers from 40+ years ago by Paradise+Pete · · Score: 1

      "Nobody will ever need more than 2 digits for a year, so the crazies suggesting years be represented by 4 digits are just that - crazy."

      Even the people who knew it would be an issue still used two digits. Resources were extremely constrained. It wasn't worth spending all of that for a problem that would happen decades later. I used to write complete programs that fit in 8K.

    2. Re:Sounds like programmers from 40+ years ago by jandersen · · Score: 1

      Linus sounds like a programmer from 40 years ago

      Not necessarily a bad thing to sound like, IMO; 40 years ago you had to think and actually be insightful about what you were undertaking, because the tools and resources were so limited. And, as somebody else has already mentioned, Linus isn't against graphics and multi-core, he is against the stupid fad that blindly demands more cores at the expense of producing better cores (as well as the idiocy of wrapping everything in a graphical front-end, when that actually ends up getting in the way of doing the job).

      I think what he says makes a lot of sense - when do you actually benefit from having many cores? Only when you have many, independent tasks; there are large classes of tasks that are serial in nature, which would not benefit from having several cores to run on. And most of the independent processes on the average PC are so lightweight that nothing is gained from having several cores compared to multiprocessing on a single core. Unless you are running a proper server in a data centre or performing large computations, you are likely to just waste your money, if you buy into the multi-core fad.

    3. Re:Sounds like programmers from 40+ years ago by gweihir · · Score: 1

      Only if you have zero clue about what he is talking about. Note: It is not possible to deduce validity from the way something sounds. That requires actual insight.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    4. Re:Sounds like programmers from 40+ years ago by Anonymous Coward · · Score: 0

      Only if you have zero clue about what he is talking about. Note: It is not possible to deduce validity from the way something sounds. That requires actual insight.

      Here's a quote from Linus (from the article) "So give up on parallelism already. It's not going to happen. End users are fine with roughly on the order of four cores,"

      That's what is short-sighted. Just like programmers 40+ years ago saying no one will ever need more than 2 digits to represent a year....which we did need because many of those programs were still in use when we rolled into 2000.

      Never say never....it makes you look stupid, and that is what I was pointing out about Linus. He has zero clue about what end user needs will be in 40 years from now.

    5. Re:Sounds like programmers from 40+ years ago by gweihir · · Score: 1

      No, it is not. They tried getting parallel programming off the floor 40 years ago and have consistently been failing since then. Linus sums up the results of the last few decades of R&D perfectly.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    6. Re:Sounds like programmers from 40+ years ago by Anonymous Coward · · Score: 0

      No, what he is saying is wrong, and if he were right we wouldn't be in the current multi-core world we are in now. We hit a real, tangible limit on what we could achieve with a single powerful core; these are physical limits that are not going to change. He is holding onto an idea that is outdated.

  14. Bad summary, shocking by Urkki · · Score: 5, Interesting

    Linus doesn't so much say that parallelism is useless, he's saying that more cache and bigger, more efficient cores is much better. Therefore, increased number of cores at the cost of single core efficiency is just stupid for general purpose computing. Better just stick more cache to the die, instead of adding a core. Or that is how I read what he says.

    I'd say, number of cores should scale with IO bandwidth. You need enough cores to make parallel compilation be CPU bound. Is 4 cores enough for that? Well, I don't know, but if the cores are efficient (highly parallel out-of-order execution) and have large caches, I'd wager IO lags far behind today. Is IO catching up? When will it catch up, if it is? No idea. Maybe someone here does?

    1. Re:Bad summary, shocking by Z00L00K · · Score: 1

      Some I/O won't catch up that easily, you can't speed up a keyboard much, and even though we have SSDs we have a limit there too.

      But if you break up the I/O as well into sectors so that I/O contention on one area doesn't impact the I/O on another by using a NUMA architecture for I/O as well as RAM then it's theoretically possible to redistribute some processing.

      It won't be a perfect solution, but it will be less sensitive.

      --
      If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
    2. Re:Bad summary, shocking by Anonymous Coward · · Score: 0

      Linus is right, but he might be wrong tomorrow. I'm saying that book printing and electricity were quite useless toys when they were invented (sure, cheap Bibles are nice, but book printing was mostly used for jokes and porn in its early days. It was only after the first encyclopedia was written that book printing started to pay off). It was only later that people figured out what all you can do with them. So I predict that once we start seeing 1000-core CPUs, a few years or decades later we will start seeing some new ways of using them.

    3. Re:Bad summary, shocking by Anonymous Coward · · Score: 0

      This is a problem I have with design right now, they are just like you say, giving up memory and bandwidth for inferior cores in higher numbers.

      More memory and bandwidth for lesser cores would do fine for most tasks people will be doing on a workstation, which can scale better with time as well as we get superior materials to build with.

      For those that DO need more power?
      Parallel Processing boards, aka, what GPUs have become now. Also better and more well known devkits to work with them.
      Another is for people to try to throw out the idea that precision is absolutely needed 100% of the time.
      You can make great speed improvements and general efficiency if you drop a few numbers, or add a few, depending on the situation.
      Let's think of a window's position, for example, as you are animating it. If something is moving fast enough in a short period, you do not need to be precise, you can be precise enough that a person can see a window moving from left to right and it stops. In fact, the unpredictable small movements would look quite nice IMO. This is just a simple example here for graphics.
      Things can accumulate small errors over small periods. You don't need 100% precision. We need to go derper than Float and design cores specifically for fuzzy logic and math. Having the same core copy-pasted all over is STUPID. We DO need to go back to older CPU designs where we had dedicated areas for math, logic, general processing, but this time through core design. (but this case, we'd have dedicated areas for doing faster maths, we won't eliminate the use of math processing in general cores!)

    4. Re:Bad summary, shocking by Half-pint+HAL · · Score: 1

      Cache misses are a problem, but caching is subject to the law of diminishing returns -- it takes an exponential growth in cache size to get a linear reduction in cache misses. Has Torvalds run the numbers and determined where the crossover point of the two lines is? Even then, the best solution depends very heavily on the task in question. General purpose computing has always been a compromise.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    5. Re:Bad summary, shocking by Anonymous Coward · · Score: 0

      I'd say, number of cores should scale with IO bandwidth.

      It seems to do so already for the Intel and AMD servers. More IO-enabling links and more memory channels are available in the larger chips and chiplets. Reading the summary made me almost think Ungar is arguing for Erlang everywhere. But he doesn't. Apparently dynamic heterogeneous parallelism is not something on Ungar's mind.

    6. Re:Bad summary, shocking by Anonymous Coward · · Score: 0

      Having 1 to 64 pthreads memset a total of 128G of memory (equally divided among the threads) a few hundred times on a 64 "cpu" AMD system (four CPU, 16 cores each) shows speed gains (lots from 1 to 4 threads), a valley, and then worsening performance somewhere out past 40 threads. So, yeah, just adding crazy numbers of cores may not be a boon, especially when the actual CPU performance compared to other offerings is... lacking.

    7. Re:Bad summary, shocking by Anonymous Coward · · Score: 0

      I have personally doubled the speed of my keyboard by using both hands. Touch typing with my toes is harder, so I mostly use the keyboard on the floor for mashing CTRL-C. Now if they could only maintain coherency on the keyboard buffers I'd be golden. Out of order execution on keystrokes remains a dream.

  15. DragonflyBSD comes to mind... by Anonymous Coward · · Score: 0

    As having taken a pioneering direction in lockless designs. In doing so they have made their system quite fast. It arguably has the fastest network stack of any operating system.

  16. Torvalds is half right by popo · · Score: 5, Insightful

    The problem is that Linus is discussing two different things at once and so it sounds like he's making a more inflammatory point than he is.

    The issue is not whether parallelism is uniformly better for all tasks. The question is, is parallelism better for some tasks. And as Torvalds points out, those tasks do exist (Graphics being an obvious one).

    The nature of the workload required for most workstations is non-uniform processing of large quantities of discrete, irregular tasks. For this, parallelism (as Torvalds correctly notes) is likely not the most efficient approach. To pretend that in some magical future, our processing needs can be homogenized into tasks for which parallel computing is superior is to make a faith-based prediction on how our use of computers will evolve. I would say that the evidence is quite the opposite: That tasks will become more discrete and unique.

    Some fields though: finance, science, statistics, weather, medicine, etc. are rife with computing tasks which ARE well suited to parallel computing. But how much of those tasks happens on workstations? Not much, most likely. So Linus' point is valid.

    But I have to take issue with Linus' tone in which he downplays "graphics" as being a rather unimportant subset of computing tasks. It's not "graphics". It's "GRAPHICS". That's not a small outlier of a task. Wait until we're all wearing ninth generation Oculus headsets... the trajectory of parallel processing requirements for graphics is already becoming clear -- and it's stratospheric. The issue is this: Our desktop processing requirements are actually slowing and as Linus points out, are probably ill-suited for increased parallelism. But our graphics requirements may be nearly infinite.

    Unlike other fields of computing, we know where graphics is going 20 years from now: It's going to the "holodeck".

    Keep working on parallel computing guys. Yes, we need it.

     

    --
    ------ The best brain training is now totally free : )
    1. Re:Torvalds is half right by Anonymous Coward · · Score: 0

      I am not a computer scientist, so this may sound idiotic...

      Why not design multi-purpose chips that have some cores optimized for some tasks, and other cores optimized for others (parallel computing, for example), then allow flags to be added to the requests for resources when things are actually executed. Maybe it's already done that way (obviously).

    2. Re: Torvalds is half right by Anonymous Coward · · Score: 0

      Verification

    3. Re:Torvalds is half right by Anonymous Coward · · Score: 1

      The assumption that future workloads are highly parallel isn't faith based. It's simply a matter of recognizing that processing power for parallel workloads is much easier to scale and thus going to be much cheaper than processing power for sequential workloads. If you can get loads more parallel processing power for the same price that you would pay for sequential processing power, many of your processing tasks are going to look parallelizable, because you start looking at them differently.

    4. Re: Torvalds is half right by Anonymous Coward · · Score: 0

      that I'm idiotic? or that it's already done that way?

    5. Re:Torvalds is half right by Anonymous Coward · · Score: 0

      It is already, GPU handles graphics, it can also do parallel computing if application is written for that, for example bitcoin mining. CPU handles general computing tasks.

    6. Re:Torvalds is half right by Anonymous Coward · · Score: 0

      We do! Especially in phones. We have also been embedding graphics cores into CPUs.

    7. Re:Torvalds is half right by Oligonicella · · Score: 1

      So you have faith that these processing tasks will actually be 'parallelizable' because you start looking at them differently?

    8. Re:Torvalds is half right by Anonymous Coward · · Score: 4, Informative

      AMD have a line of CPUs very much like this, the A Series. It has several conventional multi-purpose x86-64 cores for general-purpose use and a Graphics Processing Unit built-in for those embarrassingly-parallel floating-point operations. Best of all, they're very cheap and perform very well.

    9. Re:Torvalds is half right by JaredOfEuropa · · Score: 0

      The summary had it right: this is simply a rant. Linus has spoken out on various other subjects before; subjects in which he can hardly be considered an expert or thought leader. I wouldn't put too much stock in his opinion.

      --
      If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
    10. Re:Torvalds is half right by Anonymous Coward · · Score: 2, Insightful

      No, that's not faith. That's an economic argument. I know that many tasks which are considered practically non-parallelizable today can in fact be parallelized. We don't do that today because the additional work doesn't pay off when massive multicore systems are not yet available or not yet capable of running general purpose code. Often it's just a matter of getting the right tools, but sometimes you need to look at problems again and solve them in a different way. With new algorithms and new tools, you will make use of many cores, because if you don't, you will be left in the dust by the people who do.

    11. Re:Torvalds is half right by RabidReindeer · · Score: 1

      There are actually 3 kinds of tasks as far as parallelization goes.

      1. Totally linear tasks. Each step relies on the output of its predecessor. Thus nothing can begin before its time. Obviously, handing this sort of work off to a parallel system is a waste of time.

      2. Simple parallel tasks. This is the case where you can do a lot of trivial operations in parallel. The computer equivalent of a bunch of people using 4-banger calculators. As long as the tasks individually take longer to run than they do to schedule and collect, this is an ideal use case for massively parallel array processors.

      3. Complex parallel tasks. This is simply case #2, but armed with HP advanced function calculators. The individual processors would be more than just basic gate arrays and thus able to perform complex math functions in parallel. Not as cheap to scale up, but better than waiting out linear time.

      Of course, this is for the ideal world. Real-life heavy-computing scenarios may have components for 1, 2, or all 3 of the above.
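
      To make the first two kinds concrete, here is a small sketch. `step` is a made-up stand-in for a unit of work; `linear_chain` is the totally linear case, where each call needs the previous result, and `independent_batch` is the parallel case, where inputs do not depend on each other. With work this trivial the scheduling cost of the pool outweighs the work itself, which is the "longer to run than to schedule and collect" condition from point 2 failing.

```python
from concurrent.futures import ThreadPoolExecutor


def step(x):
    """A stand-in unit of work (hypothetical, chosen only to be deterministic)."""
    return (x * x + 1) % 1_000_003


def linear_chain(seed, n):
    """Kind 1: each step consumes the previous result, so the order is forced."""
    x = seed
    for _ in range(n):
        x = step(x)
    return x


def independent_batch(inputs):
    """Kinds 2 and 3: every input can be processed without the others."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(step, inputs))
```

      `linear_chain(2, 3)` walks 2 -> 5 -> 26 -> 677 one step at a time, while `independent_batch([1, 2, 3])` gives `[2, 5, 10]` in whatever order the workers happen to finish, collected back in input order.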

    12. Re:Torvalds is half right by DutchUncle · · Score: 1

      I don't read "unimportant subset"; I read "subset, not general-purpose". By all means, graphics and parallel computing are/will-be important; but look at the history of processor development - the rise of the GPU as a separate device, taking the processing load *off* the CPU. I keep reading posts about using the appropriate language for a task; how much more so, then, is using the appropriate hardware design for a task?

      This is like debating whether your kitchen renovation would be better with an 8 burner stove, or more counter space. A restaurant, cooking separate meals for 4 or 6 people at a time, needs more burners - and usually has more cooks to go with them. At home, with one cook, 4 burners is usually as much as one can handle simultaneously, and the counter space is more useful. Different solutions for different problems.

    13. Re:Torvalds is half right by wisnoskij · · Score: 1

      But is not the real point that we have hit a wall on the hardware front: while we can easily add parallel processing power, we can not easily add processing power to processors? Which is exactly why programs will increasingly, by necessity, be programmed to take advantage of more and more cores. Yes, individual algorithms/tasks sometimes cannot be broken up even into two parallel tasks, but even single programs are 99% a grouping of many different tasks to begin with. Yes, on the big data crunching scientific research end, this will not help, but for 99.9999999% of the uses for a computer there are already a hundred processes running at all times, and most of these could get broken up many, many times further.

      --
      Troll is not a replacement for I disagree.
    14. Re:Torvalds is half right by Half-pint+HAL · · Score: 1

      The issue is not whether parallelism is uniformly better for all tasks. The question is, is parallelism better for some tasks. And as Torvalds points out, those tasks do exist (Graphics being an obvious one).

      I think, as one of the quoted comments in the article said, that current programming languages have a lot to answer for in this debate. If we look at mathematics, the sum (i.e. sigma) and product operators are inherently parallel. Sadly our FOR and WHILE loops are not, as procedural iteration often relies on side-effects. It's no accident that Scala uses the mathematical basis of the functional programming paradigm as a foundation for massively scalable, parallelisable programming (even though it's a mind-twisting wreck of a language in many ways).
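
      The point about sigma and pi can be illustrated with a chunked reduction: because addition and multiplication are associative, partial results over slices can be computed independently and combined afterwards. This is a sketch in Python rather than Scala, the function names are invented, and it sticks to integers because floating-point sums can differ slightly when regrouped.

```python
from concurrent.futures import ThreadPoolExecutor
from math import prod


def chunks(values, size):
    """Split a sequence into consecutive slices of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values, size=1000):
    """Sum each chunk independently, then sum the partial results."""
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(sum, chunks(values, size)))


def parallel_product(values, size=1000):
    """Same shape for the product operator."""
    with ThreadPoolExecutor() as pool:
        return prod(pool.map(prod, chunks(values, size)))
```

      A FOR loop that accumulates into one shared variable has no such freedom, since each iteration reads what the previous one wrote.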

      Every non-trivial program will have to crunch through large iterative processes at some point, and even if these are a small percentage of execution time, the reality is that most interactive systems have a lot of idle time anyway, and the delay for the user is only when the program gets stuck in a long iteration. So it follows that even if parallelism reduces the overall performance, it is of no consequence as the perceived performance by the user is improved.

      But I have to take issue of Linus tone in which he downplays "graphics" as being a rather unimportant subset of computing tasks. It's not "graphics". It's "GRAPHICS". That's not a small outlier of a task.

      His point isn't that graphics is a small thing, it's that the GPU already handles that, and that we therefore don't need much parallelism in the CPU. But when I think about games, I think about AI, and the AI has to operate in 3D space, so the AI obviously benefits from the same parallelism as the graphics. Do we steal GPU cycles to run our AI? No, because the way of the market is typically that games sell first and foremost on their looks. So we need more parallel grunt. Plus, of course, as AI has to handle multiple independent agents, AI is an inherently parallel task (multithreaded in concept, regardless of whether a particular game implements it in threads or not)

      But going back to Scala and FP... A lot of the problems of memory locking are nicely sidestepped if you implement your code in FP: FP guarantees immutability of values: you cannot write to an existing value, so you don't need to have notions of "atomic" operations, and hence no need to lock in most circumstances. Caching becomes less of a pain, as nothing is ever going to change, so your cache value cannot be incorrect.
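
      A small sketch of the immutability idea, again in Python as an assumption of convenience (the `Account` type and `deposit` function are hypothetical). An "update" builds a new value instead of writing to the old one, so any other thread holding the old value can keep reading it without a lock:

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class Account:
    owner: str
    balance: int


def deposit(account, amount):
    """Return a new Account; the original is left untouched."""
    return replace(account, balance=account.balance + amount)


original = Account("alice", 100)
updated = deposit(original, 50)

# Writing to an existing value is rejected outright.
try:
    original.balance = 0
    mutated = True
except FrozenInstanceError:
    mutated = False
```

      Python only enforces this at runtime and only for the frozen class itself, so it is a weaker guarantee than a language that enforces immutability throughout.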

      Some of the comments in the article referred to theoretical extra bugs due to having to think in parallel, but it simply gives more motivation for a programming paradigm that is less bug-prone, and FP is that paradigm. FP has been rejected by programmers far too long, but the simple mechanism of immutability removes that most bothersome of bugs -- the erroneously altered value that you spend a week tracking down. FP should already be easier to reason about than procedural programming, if we learned to do it properly. FP for parallel isn't all that much different from single-threaded FP, so would actually make basic parallel code no more complicated to learn than algebra.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    15. Re:Torvalds is half right by Half-pint+HAL · · Score: 2

      In essence, it's already done that way. A System-on-a-chip (SoC) typically has a couple of general-purpose cores, along with sound and video processors. In a full-sized PC, the graphics processing is usually taken to another chip -- in fact another circuit board entirely. Because most of the work the graphics processor (=GPU) does is largely independent of the main processor (=CPU) (the CPU pushes in the data, says "do X with it", the GPU then churns away through the data) it doesn't need to be closely linked or share a lot of memory. In fact, it's more efficient for them not to share memory, as then they're not getting in each other's way.

      Expanding that system for more types of semi-general-purpose cores would get rather complicated.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    16. Re: Torvalds is half right by Half-pint+HAL · · Score: 4, Informative

      Verification is the process of checking that software works correctly. The more complex the system, the more complex the process of verification. Rather unfair of the GP to throw that in as a single word after you explicitly said that you're not a computer scientist.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    17. Re:Torvalds is half right by Half-pint+HAL · · Score: 1

      Yes, but it can't really do much non-graphics parallel processing at the same time as rendering a game. As I understand it, a lot of problems with AI in modern AAA titles is down to the fact that they need the parallelism, but in AAA-land, graphics are king, and the AI guys don't get enough cycles to do a decent job.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    18. Re:Torvalds is half right by Euler · · Score: 1

      Even in case #1, there are sometimes things that can be done. For example, speculative execution. If you can boil down to a small number of choices as a result of the first operation, then it may make sense to compute both outcomes. Or there may be some other intermediate value that might be needed in only some outcomes. But this requires application-specific knowledge usually to know exactly what is allowable and what the payoff would be. You wouldn't want to create a situation where executing both cases affects a global resource. So you would need a language expressive enough to hint this information to the compiler.
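
      Done by hand rather than by a compiler, the idea might look like the following Python sketch. All function names here are hypothetical placeholders. Both branches are started before the condition is known, and the loser's result is simply discarded, which is only safe because neither branch touches any shared state:

```python
from concurrent.futures import ThreadPoolExecutor


def slow_condition(x):
    """Stand-in for the first operation whose result picks the branch."""
    return x % 2 == 0


def branch_if_true(x):
    """Pure function: no global resource is read or written."""
    return x * 10


def branch_if_false(x):
    """Pure function: no global resource is read or written."""
    return x + 1


def speculative(x):
    """Run the condition and both outcomes concurrently, keep one result."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        cond = pool.submit(slow_condition, x)
        yes = pool.submit(branch_if_true, x)
        no = pool.submit(branch_if_false, x)
        return yes.result() if cond.result() else no.result()
```

      The cost is that one branch's work is always thrown away, so whether this pays off depends on the application, as the comment says.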

    19. Re:Torvalds is half right by RabidReindeer · · Score: 1

      Even in case #1, there are sometimes things that can be done. For example, speculative execution. If you can boil down to a small number of choices as a result of the first operation, then it may make sense to compute both outcomes. Or there may be some other intermediate value that might be needed in only some outcomes. But this requires application-specific knowledge usually to know exactly what is allowable and what the payoff would be. You wouldn't want to create a situation where executing both cases affects a global resource. So you would need a language expressive enough to hint this information to the compiler.

      Very good point. A high-level equivalent to the predictive processing at the CPU hardware level!

    20. Re:Torvalds is half right by Lunix+Nutcase · · Score: 2

      Why not design multi-purpose chips that have some cores optimized for some tasks, and other cores optimized for others

      We do have those. Any CPU with an iGPU is such a chip. We've had such CPUs for years and years now. Have you missed out on the last decade of CPU design?

    21. Re:Torvalds is half right by Khyber · · Score: 1

      "The nature of the workload required for most workstations is non-uniform processing of large quantities of discreet, irregular tasks. For this, parallelism (as Torvald's correctly notes) is likely not the most efficient approach."

      Please, he can't even PP his way out of a DX-OGL call/wrap. He's got zero standing ground to talk about parallelism when there are people taking Linux, making it run highly parallel, and it works like a goddamned dream. Being able to do all of those irregular discrete tasks without having to wait for something else to finish first is the goal.

      People have worked on pseudo-parallel code for OoO and what not. Minimum 200% increase in performance.

      Meanwhile, Linus still refuses to fix a bug in the kernel, which exists all the way back to before kernel version 2.x ever hit the scene, which allows anyone to hardlock the kernel (and could've been mitigated or entirely prevented by having some fucking parallel-capable code.)

      Linus needs to go crawl in his hole and shut up. People more competent than him have taken over his project, and he's just bitching about it in a non-descript way.

      --
      Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    22. Re:Torvalds is half right by Khyber · · Score: 1

      " Any CPU with an iGPU is such a chip."

      Except the iGPU wasn't put on the same physical package as the CPU for a LONG time, and was only recently happening with AMD and Intel. Used to be the iGPU was on the northbridge.

      --
      Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    23. Re:Torvalds is half right by Baloroth · · Score: 1

      The nature of the workload required for most workstations is non-uniform processing of large quantities of discreet, irregular tasks. For this, parallelism (as Torvald's correctly notes) is likely not the most efficient approach. To pretend that in some magical future, our processing needs can be homogenized into tasks for which parallel computing is superior is to make a faith-based prediction on how our use of computers will evolve. I would say that the evidence is quite the opposite: That tasks will become more discrete and unique.

      Right, but we want to continue the "Moore's Law" speedup of processing year over year. And that simply can't happen with single core processing: clock speed is already near the physical limit (as in we would need to start violating the speed of light to increase it much further), and manufacturing process size can't continue shrinking indefinitely either, no matter how close we are to the actual physical limits there. So unless we invent entirely new computing systems (e.g. quantum computers), the only speed gains in the future will inevitably be from parallelization, and there are (for many cases) still massive speed gains to be made in that field, simply because the software was never designed for any parallelization at all. Granted, that'll hit a wall where you can't split tasks up anymore as well, but in many cases this process hasn't even started.

      You're quite right about the graphics, though: the long-term future of graphics technology is probably ray-tracing, and that takes absolutely massive amounts of completely parallel CPU power.

      --
      "None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
    24. Re:Torvalds is half right by Anonymous Coward · · Score: 0

      ah, yes... AMD, the best option when none of {perf, perf-per-watt, perf-per-watt-per-purchasing-dollar} are your project's priorities. i don't know if they universally win at perf-per-purchasing-dollar either, across the possible perf range.

    25. Re:Torvalds is half right by paulpach · · Score: 1

      But I have to take issue of Linus tone in which he downplays "graphics" as being a rather unimportant subset of computing tasks. It's not "graphics". It's "GRAPHICS". That's not a small outlier of a task. Wait until we're all wearing ninth generation Oculus headsets... the trajectory of parallel processing requirements for graphics is already becoming clear -- and it's stratospheric. The issue is this: Our desktop processing requirements are actually slowing and as Linus points out, are probably ill-suited for increased parallelism. But our graphics requirements may be nearly infinite.

      I agree, he dismisses graphics as something a few people do. WTF? in mobile, all the top grossing apps are games. The number 1 thing games do is graphics. If anything, I would argue that graphics is the single most important type of workload in mobile. Gaming (and therefore graphics) might not be quite as big in desktop, but it is still very far from being the niche he pretends it is.

      That said, I think he is right that the fast single threaded big CPU's are not going anywhere. The trend for mobile and desktop has been to do graphics and general processing in separate hardware (GPU + CPU), I don't see a reason in sight to change that. Even if they were to be combined in a single chip, it would still be different part of the chip doing the tasks.

    26. Re:Torvalds is half right by RyuuzakiTetsuya · · Score: 1

      Torvalds is half wrong too.

      The problem with Torvalds's assertion is that while he's probably right (and I read a Steve Jobs kinda ethos when he says that end users are fine with 4 cores, which I like a lot), I think there's a pretty practical problem computing is running into.

      We're kind of at the limit with how much work a single core can do. We can't make single cores faster. We can make them cheaper and lower power. Which I think overall is a huge gain for anyone who pays power bills :) But now that we can have a lot of them, we shouldn't be shy about actually using them.

      I think it might be ultimately a fruitless effort, but I think the possible gains make it worth fleshing out.

      I just wonder how much there is to gain by ditching legacy CPU architectures(x86, ARM, etc) and starting from the ground up. Probably not much, but I am hopeful I am wrong.

      --
      Non impediti ratione cogitationus.
    27. Re:Torvalds is half right by lgw · · Score: 1

      How many systems with "totally linear tasks" have only 1 user?

      Smart phones don't need 1000 cores - heck, they only need 2 so you can have a low-power and a high-power core. But servers? 1 core per user seems like a good start - but across how many boxes? The trick is getting that workload to easily scale horizontally, across any number of servers that can only talk to one another slowly, and that's not an easy trick! I expect a lot of work in that area in 2015.

      --
      Socialism: a lie told by totalitarians and believed by fools.
    28. Re:Torvalds is half right by Anonymous Coward · · Score: 0

      Keep working on parallel computing guys. Yes, we need it.

      Torvalds's comments are appropriate to anyone using a computer like he or people he knows do.

      The final question is: what really is the typical user use case?

      Not everyone uses just a web browser and a terminal client and a compiler all day. Typical office use is more like a web browser and Microsoft Office $APP or a web browser and a fat email client. Plus video games. Lots of video games.

      Other tasks need parallel today on the desktop.


      • Emulators
      • Encryption
      • Bitcoin Mining
      • Plain-old Java applications
      • Trip Planning (so bad we farm it off to the "cloud" instead of doing it locally)
      • Virtual Machines (really, more of a multiprocess case like the browser)

      and video games! Not every cycle is spent on drawing pixels. Crysis may make your old video card cry, but just try loading up Dwarf Fortress and watch your CPU peg.

    29. Re:Torvalds is half right by rgbatduke · · Score: 1

      Note well that historically, MOST parallel computers have profited the MOST from parallelizing totally linear tasks. Not the tasks themselves -- embarrassingly parallel tasks, simply running many instances of completely independent code or many instances of code that is extremely coarse grained so that one can run almost all of the task as linear code with only infrequent communications with a "central" controller. Classic examples are plain old multitasking of the operating system with code that doesn't make heavy use of bottlenecked resources (the reason most users see some small benefit from e.g. quad core vs single core processors, as there is often enough work being done to keep 3-4 cores busy at least some of the time without much blocking, and this keeps the processor itself from thrashing by providing the illusion of parallelism through multitasking with time slices). It works best if the cores have independent caches and contexts and if there is sufficient task affinity. Also, classic "master-slave" parallel computing, where e.g. a Monte Carlo computation might spawn N slaves, each one with its own random number generator seed, and run N "independent" samplings of some process that are only infrequently aggregated back to the master. Again, the characteristic is lots of nearly independent serial computation with only short, infrequent, non-blocking, non-synchronous communications back to some collection point. Two programs that often were used to demonstrate the awesome advantages of scaling at the limits of Amdahl's law were parallel povray (rendering can be broken up into nearly independent subtasks in master-slave) and a parallel Mandelbrot set generator/displayer (where each point has to be tested independently, so whole subsets of the relevant parts of the complex plane could be distributed to different processors and independently computed, with the master collecting and displaying the results).
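
      The Monte Carlo pattern described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not anyone's production code: the problem (estimating pi by sampling the unit square) and the names `worker` and `estimate_pi` are invented for the example. Each worker gets its own seed and its own generator, runs entirely serial code, and reports back exactly once:

```python
import random
from concurrent.futures import ThreadPoolExecutor


def worker(seed, n_samples):
    """Serial sampling loop with a private random number generator."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits  # the only communication back to the collector


def estimate_pi(n_workers=4, n_samples=20000):
    """Spawn independent samplers and aggregate their counts once."""
    seeds = range(n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        hits = list(pool.map(worker, seeds, [n_samples] * n_workers))
    return 4.0 * sum(hits) / (n_workers * n_samples)


pi_estimate = estimate_pi()
```

      Threads are used here only to keep the sketch short; in CPython a real speedup would need separate processes or machines, which the pattern tolerates well precisely because the workers never talk to each other.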

      Sadly (well, not really:-) modern processors are so damn fast you can get to the accessible bottom of the Mandelbrot set with almost no perceptible delay from rubber banding even with a single core, so the latter isn't so dramatic, but the point remains -- quite a lot of work that can be done with multiple cores (arguably MOST of the work that can efficiently and easily be done with multiple cores) is trivial parallelism, not parallel programming. Instance 1 is the richest source of advantage for a parallel system, and tasks that will scale out to 1000 cores are almost certainly ONLY going to be trivially/embarrassingly parallel tasks because Amdahl's law and the complexity of unblocking communications between subtasks is a royal bitch at 1000 processors no matter how you architect things. SETI at home, maybe. Solving a system of partial differential equations on a volume with long range interactions not so much.
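
      For reference, Amdahl's law gives the speedup on n cores as 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel. Gustafson's law, which the story summary mentions, instead assumes the problem grows with the machine and gives (1 - p) + p * n. A quick Python sketch of both (the function names are just labels for this example):

```python
def amdahl_speedup(p, n):
    """Fixed problem size: the serial fraction (1 - p) caps the speedup."""
    return 1.0 / ((1.0 - p) + p / n)


def gustafson_speedup(p, n):
    """Problem size scaled with n: the parallel part grows, the serial part does not."""
    return (1.0 - p) + p * n


# A task that is 95% parallel, on 1000 cores.
fixed_size = amdahl_speedup(0.95, 1000)
scaled_size = gustafson_speedup(0.95, 1000)
```

      With those numbers the fixed-size speedup comes out just under 20x, while the scaled-size figure is about 950x, which is roughly the gap between the two sides of this argument.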

      The fundamental problem with 2 and 3 is that they have to be hand coded. Really pretty much period. Sure, you can get away with getting some advantage from using e.g. a parallel linear algebra program as a link step in a program that can run on serial resources, but typically the gains you can get are limited and will not scale well, certainly not to anywhere near 1000 cores, even for case 2. To use 1000 cores for a tightly coupled parallel computation where every core talks to every other core per step of the computation -- well, that just isn't going to happen without an incredible (literally) boost in interprocessor communication speed, reduction in communication latency, elimination of resource blocking at both the hardware and kernel level. The problem at some point becomes NP complete (I suspect, of course pending the issue of whether P = NP etc) and simply working out ways for the communications to proceed in a self-avoiding pattern to eliminate collisions or delays due to asynchronicity is itself a "hard problem", forget the problem you're actually trying to solve.

      So I'm largely with Linus on this one. Advantages to parallelism at the OPERATING SYSTEM level

      --
      Even when the experts all agree, they may well be mistaken. --- Bertrand Russell.
    30. Re:Torvalds is half right by Bengie · · Score: 1

      Because most of the work the graphics processor (=GPU) does is largely independent of the main processor (=CPU) (the CPU pushes in the data, says "do X with it", the GPU then churns away through the data) it doesn't need to be closely linked or share a lot of memory.

      There are few GPU+CPU workloads because communication between the GPU and CPU is so slow, not because there is no demand for them. There is a huge class of hybrid workloads that require data ping-ponging back and forth between the GPU and CPU, but no one writes code for that because it's latency sensitive to nanoseconds and GPU to CPU latency is in microseconds. AMD is working on reducing the latency by integrating the GPU and CPU together.

    31. Re:Torvalds is half right by Anonymous Coward · · Score: 0

      Spot on!

      At best, Torvalds is stuck in the "now". He's thinking about today's problems, today's tools, and today's way of doing things. This is a technology and process issue, not a matter of inevitability due to the laws of physics.

      Even Amdahl's Law, which is the most often cited reason why parallelism is limited, is abused in this way. It's not that Amdahl's Law is wrong, it's the fixation with the notion that parallel algorithms are niche exceptions in logic. If every algorithm can be revised to be parallel then Amdahl's Law becomes irrelevant. This is an extreme example--I'd suggest that in time, parallel algorithms will be the norm and serial processing a "worst case scenario" awaiting a rewrite to make it better. Which may never come but the notional idea will exist anyway.

    32. Re:Torvalds is half right by Anonymous Coward · · Score: 0

      in mobile, all the top grossing apps are games.

      Don't you see how the very relevance of "top grossing" is subjective? Some of us think that a list of the top grossing apps tells you nothing about what problems are most important to work on.

      And while I realize you (and plenty, plenty of other people) might disagree with this point of view, surely you're not going to claim it's new to you. This discussion is happening within the context of Linux, after all.

      Everything is niche.

    33. Re:Torvalds is half right by dslbrian · · Score: 1

      The issue is this: Our desktop processing requirements are actually slowing and as Linus points out, are probably ill-suited for increased parallelism.

      Depends on the desktop requirements. I think he is off the mark here. Specifically to quote him from TFA:

      The only place where parallelism matters is in graphics or on the server side, where we already largely have it. Pushing it anywhere else is just pointless.

      Since he points out servers and graphics as largely solved, I assume he is talking about desktop usage. In this he is assuming a standard usage model for a desktop user, a set of apps - web, devel, coding, games, whatever. I think the view is that of a user who can only focus on a single-task at a time (with perhaps background OS tasks). But this is a myopic view, the rise of virtualization has enabled a convergence of hardware onto a single machine. This is only possible with the rise of multi-core/parallel computing. VMs are a huge benefit, in terms of power/area efficiency and even being able to create and destroy them on a whim.

      On my desktop machine (8 core) I have two VMs running all the time. These machines used to be physical separate machines, consuming power, taking up floor space, making noise, etc. I could not have run this setup on my previous single/dual core machines. However now they are virtual, and my normal desktop usage doesn't even notice them running (even heavy 3D gaming is not lagged by these VMs).

      There are compounding parallelization factors - having the whole setup on encryption means wanting the cores to handle AES in hardware, so as he points out having hordes of parallel weak cores might be pointless for that. However, multiple powerful cores, I can put those to work.

      IMO the advantages are clearly obvious. Sure for a single-task desktop user, you may only want a few cores for background tasks plus the foreground task. But the ability to consolidate lots of hardware into a single box, I want as much of that as I can get. I can easily think of desktop + VM scenarios that can push beyond 4 cores.

    34. Re:Torvalds is half right by skids · · Score: 1

      FP has been rejected by programmers far too long, but the simple mechanism of immutability removes that most bothersome of bugs

      ...and kills your memory/cache profile. FP is great for a subset of problems, but should not be held up on a pedestal, just appreciated as one tool in the box.

      FP should already be easier to reason about than procedural programming

      Considering it makes many everyday things harder to express, the fact that FP lends itself to easy modeling is offloading the mental effort in the wrong place. You're buying academic ease of manipulation at the expense of increasing the drudgery of everyday tasks, which is why FP is favored for research but not generally accepted for application.

    35. Re:Torvalds is half right by skids · · Score: 1

      Also the future typical user will be using more speech recognition, computer vision, and "AI" experts, all of which scale with parallelism.

    36. Re:Torvalds is half right by Half-pint+HAL · · Score: 1

      My personal view is that the problem comes from looking at a program with a single paradigm. I'd love to see FP as a subsystem, where control code is on the outside in procedural style and all the heavy work is buried in strict functions. At the moment, trying to code in that style is overcomplicated, and in the end usually results in coding in a language that doesn't guarantee immutability and you end up having to hunt down phantom mutation bugs, which kind of defeats the purpose of the exercise.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
  17. Rant? That's not a rant. by Anonymous Coward · · Score: 0

    It is a well argued opinion.
    And he doesn't say that parallel computing is a bunch of crock.

    Aren't you supposed to actually RTFA before writing the summary?

  18. No locks by ShakaUVM · · Score: 2

    Ungar's idea (http://highscalability.com/blog/2012/3/6/ask-for-forgiveness-programming-or-how-well-program-1000-cor.html) is a good one, but it's also not new. My Master's is in CS/high performance computing, and I wrote about it back around the turn of the millennium. It's often much better to have asymptotically or probabilistically correct code rather than perfectly correct code when perfectly correct code requires barriers or other synchronizing mechanisms, which are the bane of all things parallel.

    In a lot of solvers that iterate over a massive array, only small changes are made at one time. So what if you execute out of turn and update your temperature field before a -.001C change comes in from a neighboring node? You're going to be close anyway? The next few iterations will smooth out those errors, and you'll be able to get far more work done in a far more scalable fashion than if you maintain rigor where it is not exactly needed.
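
    Here is a toy version of that claim in Python, as a sketch under stated assumptions: a 1D rod with fixed end temperatures, where each interior cell relaxes to the average of its neighbours. The function names are invented for this example, and the "out of turn" execution is simulated single-threaded by updating cells in a shuffled order, so a cell may read a mix of neighbours that have and have not yet been updated in the current sweep. The synchronized version reads only values from the previous sweep, which is what a barrier per iteration buys you:

```python
import random


def synchronized_sweeps(u, sweeps):
    """Every cell reads only values from the previous sweep (one barrier per sweep)."""
    u = list(u)
    for _ in range(sweeps):
        prev = list(u)
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (prev[i - 1] + prev[i + 1])
    return u


def relaxed_sweeps(u, sweeps, seed=0):
    """Cells are updated in place in arbitrary order, with no barrier."""
    rng = random.Random(seed)
    u = list(u)
    interior = list(range(1, len(u) - 1))
    for _ in range(sweeps):
        rng.shuffle(interior)
        for i in interior:
            u[i] = 0.5 * (u[i - 1] + u[i + 1])
    return u


n = 11
start = [0.0] * (n - 1) + [100.0]                  # ends held at 0 and 100
exact = [100.0 * i / (n - 1) for i in range(n)]    # steady state is a straight line

a = synchronized_sweeps(start, 500)
b = relaxed_sweeps(start, 500)
max_err_sync = max(abs(x - y) for x, y in zip(a, exact))
max_err_relaxed = max(abs(x - y) for x, y in zip(b, exact))
```

    Both versions settle onto the same steady state for this well-behaved problem. That is a property of this particular smoothing iteration, not a general guarantee.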

    1. Re:No locks by Anonymous Coward · · Score: 1

      The probabilistic synchronisation sounds similar to how we got over the Shannon limit with modems.

      Modems used to compress data before sending it over a line with a bit rate that would be below the Shannon Limit. Modern modems use error correction and send over the Shannon Limit, it is as if the universe has compressed the data more efficiently than a computer could.

    2. Re:No locks by Anonymous Coward · · Score: 0

      I wonder if the no locks actually simulates brownian instability better after all.

    3. Re:No locks by HiThere · · Score: 1

      I prefer message passing through queues, but it clearly depends a lot on what problems you're working with.
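
      A minimal sketch of that style in Python, using the standard library's thread-safe `queue.Queue` (the `worker` and `square_all` names and the squaring task are made up for illustration). The workers share no variables; everything they know arrives as a message on the inbox, and everything they produce leaves as a message on the outbox:

```python
import queue
import threading


def worker(inbox, outbox):
    """Process messages until the None sentinel arrives."""
    while True:
        item = inbox.get()
        if item is None:
            break
        outbox.put(item * item)


def square_all(values, n_workers=3):
    """Fan work out over an inbox queue and collect results from an outbox queue."""
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for v in values:
        inbox.put(v)
    for _ in threads:
        inbox.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    # Results arrive in no particular order, so sort them for a stable answer.
    return sorted(outbox.get() for _ in values)
```

      The locking has not disappeared, it has moved inside the queue implementation, which is usually a better place for it than scattered through application code.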

      Also, some problems can't be done in parallel, but we won't know how many can until we start trying....and then try for a few decades.

      --

      I think we've pushed this "anyone can grow up to be president" thing too far.
    4. Re:No locks by ShakaUVM · · Score: 1

      >Also, some problems can't be done in parallel, but we won't know how many can until we start trying....and then try for a few decades.

      Right, but there's also a grey area between completely serializable and embarrassingly parallel, in which methods like this will allow scaling algorithms up from "a few" computation nodes to "many", with the optimal numbers depending on the specific algorithms.

      The biggest problems are still the same ones that existed when I got my Master's over a decade ago. Language support for parallelism isn't very good (I personally used MPI, which was awkwardly bolted on top of C++), it requires a certain amount of specialized knowledge to write parallel code that doesn't break or deadlock your machine (and writing optimized code is a bit more advanced than that), and library calls aren't all threadsafe. On the plus side, a lot of frameworks and libraries are now multithreaded by default, which nicely isolates the problems of parallel computing away from people who haven't been trained in it, and gives the benefits of parallel computing with only the downside of having to use a framework. =)

    5. Re:No locks by LateArthurDent · · Score: 1

      So what if you execute out of turn and update your temperature field before a -.001C change comes in from a neighboring node? You're going to be close anyway? The next few iterations will smooth out those errors

      Unless you're dealing with a stiff system and that small error just caused your iterations to start going divergent.

      I mean, not to dismiss the approach, because I agree with you there are certainly lots of situations where it'll be fine. However, it's also one of those things that aren't going to replace current paradigms either. We're not going to go all lock free. We're going to add lock-free programming to the toolset.

  19. Mmm... Cores... by Greyfox · · Score: 1

    I'll see your cores and raise you your boss strangling all your cores by forcing you to get all the data you were planning to process from NFS shares on 100 megabit LAN connections. Because your developers and IT department, with all the competence of a 14-year-old who just got his hands on a copy of Ruby on Rails, can't figure out how to utilize disk that every fucking machine in the company doesn't have read access to.

    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

  20. 'make -j64 bzImage' by hazeii · · Score: 1

    How does Linus compile his kernel? Certainly I use a parallel make across as many cores as possible (well, up to the point where there's a core for every compilation unit).

    --
    All your ghosts are just false positives.
    1. Re:'make -j64 bzImage' by gweihir · · Score: 1

      Wrong question. C compilation has linear speedup as each file can be compiled without knowing the others. The question is how he links his kernel, and the answer is on a single core as there is no other sane way to do it. Fortunately, this problem is almost linear in the input size (assuming good hash-tables), or we would not be having any software in the size of the kernel.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    2. Re:'make -j64 bzImage' by itzly · · Score: 1

      But would you rather do a parallel make on 4 cores with big caches, or 64 cores, each with 1/16th of the cache ?

    3. Re:'make -j64 bzImage' by itzly · · Score: 1

      C compilation has linear speedup as each file can be compiled without knowing the others

      As long as I/O bandwidth is infinite.

    4. Re:'make -j64 bzImage' by Gaygirlie · · Score: 1

      The compiler doesn't actually do parallel processing when you're compiling the kernel, it does multi-processing and that's the crux here; when you're compiling the kernel each process that spawns works on its own set of files -- multi-processing, that is -- whereas if it was doing parallel-processing they'd be working on the same files simultaneously. They are two very different concepts and you're confusing them.

    5. Re:'make -j64 bzImage' by gweihir · · Score: 1

      Or you do it on separate machines. But yes. Ideally, it has linear speed-up, if I/O is not a bottleneck. In practice, things are not as nice, although with 4...8 cores and an SSD to feed them you do not notice much.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    6. Re:'make -j64 bzImage' by Rei · · Score: 1

      No, as long as I/O bandwidth is not the limiting factor. The sort of thing you're compiling can have radically different CPU vs. I/O requirements. Some simple but verbose C code with little optimization might be almost entirely IO limited while some heavy templated C++ and full optimization might be almost entirely CPU limited.

      The thing is, there's no way to know what is going to cause a particular person to think "I wish my computer was performing faster". It all depends on the individual and what they use. But one can name various cases that might likely cause a certain subset of users headaches, and so those become use cases for improving system performance. One such use case is clearly compilation that's not IO-limited.

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    7. Re:'make -j64 bzImage' by Anonymous Coward · · Score: 0

      true, but only at an insufficiently-high level of abstraction. look at it from the 60k foot view of "compile my kernel"; and when seeing dozens of cores occupied in that effort, the underlying mechanics of which core is processing which files (or parts thereof), and whether the load is split across threads or processes, is largely inconsequential to the end user.

    8. Re:'make -j64 bzImage' by Anonymous Coward · · Score: 0

      seriously. i think we're mostly talking about gcc here, and there's no way you're going to be io bound, even if your persistent storage is a guy with a stylus punching clay tablets.

    9. Re:'make -j64 bzImage' by laird · · Score: 1

      They're all parallel processing, just on different units of storage. Processing 1,000 files in parallel is parallel. Processing 1 file using 1,000 parallel processes is parallel.

  21. weird by drolli · · Score: 1

    The central claim of Linus seems to be that there are many people out there who claim an efficiency increase by parallelism. While I agree that many people claim (IMHO correctly) an increase in performance (reduction of execution time) within the constraints given by a specific technology level by doing symmetric multiprocessing, I have not heard many people claim that efficiency (in terms of power, chip area, component count) is improved by symmetric, general parallelization; and nobody with a good understanding of information-related aspects of computation.

    Speaking now as a physicist, I find it disturbingly easy to show the opposite for many cases in the limit of ideally performing systems (that is, resource per implemented gate operation remaining constant with the number of gate operations).

    Having said that, I speculate that there are reasons to introduce parallelism:

    a) The performance you require can not be achieved without it. An example would be an FPU, or even just an 8-bit full adder. You *can* implement it bit-wise, but you don't like to. The full adder also is an excellent example of how parallelism can increase power consumption (i.e. fast carry look-ahead) and resource usage.

    b) Your implementation simulates operations in a way which requires a significant effort for fetching and decoding the simulated function. The extreme case of an extreme RISC processor with one-bit operations and only a 1-bit ALU is more inefficient for many problems than the processors we use. This means that there probably is an ideal "processing power/RAM (cache)" combination, which is a function of your communication cost (i.e. bus drivers) and your algorithm.

    c) From b) we can actually see that it can be extremely reasonable to create non-symmetric multiprocessing units. For listening for a sensor signal to change, an 8-bit 1MHz microcontroller with less than 100kGates may be an excellent choice (see the ti430 line, for example), since it does not insist on keeping an overkill of ALU persistently on.

    d) Parallel programming is almost never used to increase efficiency (unless you really have a distributed input/output and inherent costs of collecting it), but only for those operations where the efficiency loss due to parallelism is negligible (or zero).

    1. Re:weird by serviscope_minor · · Score: 2

      The central claim of Linus seems to be that there are many people out there who claim an efficiency increase by parallelism.

      They do, and to an extent they are correct.

      On CPUs that have high single thread performance, there is a lot of silicon devoted to that. There's the large, power hungry, expensive out of order unit, with its large hidden register files and reorder buffers.

      There's the huge expensive multipliers which need to complete in a single cycle at the top clock speed and so on.

      If you dispense with that and replace it all with simple, in order, highly pipelined ALUs, you can fit an awful lot more raw arithmetic performance in a given area of silicon.

      So it is much, much more efficient (at certain workloads). The trouble is getting good use out of a huge wodge of simple cores. That's what GPUs do: the cores are simple and wide, but the problem of filling them is "solved" by limiting the workload to something very regular. The result is something vastly more efficient than a general purpose CPU... for those workloads.

      The flops/W of a GPU are very much in excess of a CPU. Great, if you can use them.

      Personally, I still want to have time to play with those AMD HSA chips, they put the cores of both types on the same side of the cache and MMU. Much more like a tightly coupled co-processor then.

      --
      SJW n. One who posts facts.
    2. Re:weird by Rei · · Score: 0

      Exactly. Put simply, you get far more raw computing bang for your silicon buck with smaller, simpler cores.

      What's so wrong with having our cake and eating it too? Why can't we have future system architectures like:

      1x main core, made as fast as we realistically can
      8x secondary cores, each 75% as fast as the main core but using a lot less silicon each
      64x tertiary cores, each 50% as fast as the main core, but again even simpler in terms of silicon consumption

      Or some such? Your threader can try to keep the most intensive tasks on the main core. In the long run we could even have adaptive threading: all threaded function calls are interpreted by a bayesian thread launcher that does random sampling of how performance of different threads varies when their components are launched in different threading environments (including "completely unthreaded and inline") and changes the odds of launching in different environments accordingly. So the programmer could use threads and futures to their heart's content and all work gets distributed out where it best belongs.

      Another thing to simplify the task for programmers would be changes to standards to even further simplify the launching of threads in common situations. For example, a common programming recipe involves doing an operation on every member of an object. Quite often these can be done in parallel. Now, in C++ one could do a "std::for_each" or a "for (auto& i : container)" and have an iterative std::thread call inside of it. Of course, if you have a million entries, you probably don't want a million threads, so it gets more complicated: you want to break the iteration down into several sub-iterations, each with one thread iterating over a fraction of the total entries. But instead of making the programmer have to think about all that, why not have a "std::for_each_threaded" or "for (auto& i ~: container)" or somesuch that automatically threads your repetitive command as efficiently as possible?

      --
      If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
    3. Re:weird by drinkypoo · · Score: 1

      What's so wrong with having our cake and eating it too? Why can't we have future system architectures like:

      In theory, we could have system architectures like that right now with nothing but OS support and a dedicated VRM for each processor socket, which is not exactly abnormal anyway. (Hell, we used to put them on modules, right next to the socket. I thought that was a great place for them, but maybe the connector caused problems. Certainly it kept the heat off the mainboard.) There's nothing in principle that stops you from making a Hypertransport mesh of disparate processors, for example, except the general lack of processors on a Hypertransport bus outside of AMD.

      However, since each manufacturer has a different bus, we can't have nice things.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    4. Re:weird by SecurityGuy · · Score: 1

      I don't think computer efficiency is the goal in many cases, and it often shouldn't be. I don't have a fixed amount of computer time to burn in my life, I have a fixed amount of time. If I can throw more hardware at a problem and make it finish faster, that's usually good. In some cases, it's mandatory. I could run a big weather model on a single CPU desktop with a small number of cores (and a lot of memory). It'd finish eventually, but long after the weather it was supposed to predict has happened.

      You're right in what you're saying. That single desktop would be more efficient in some sense, but it would also not be at all useful for many real world problems.

    5. Re:weird by serviscope_minor · · Score: 1

      What's so wrong with having our cake and eating it too?

      Why can't we have future system architectures like:

      1x main core, made as fast as we realistically can
      8x secondary cores, each 75% as fast as the main core but using a lot less silicon each
      64x tertiary cores, each 50% as fast as the main core, but again even simpler in terms of silicon consumption

      Nothing's wrong, except you get less silicon to devote to the mega core/simple cores.

      Nonetheless, that seems like what AMD's HSA is trying to do. There's a few high speed complex cores coupled closely with a lot of very simple, very wide floating point processors. In some sense, vector instructions are a bit like that. You get a lot more raw floating point performance at the penalty of less flexibility. The OoO unit still only has to track one element for 4 operations.

      It ought to be much easier to use HSA than GPGPU due to the nanosecond latency and lack of memory transfer scheduling due to sharing the same cache. They use different instruction sets though.

      But instead of making the programmer have to think about all that, why not have a "std::for_each_threaded"

      Well, if you've been looking for that, I hope you'll like this :)

      OpenMP is a C/C++/FORTRAN extension for exactly this kind of thing. It's a language extension rather than a library. Basically, you do:

      #pragma omp parallel for
      for(size_t i=0; i < container.size(); i++)
      { ...
      }

      and it runs different iterations of the for loop in parallel. I think the default is to split it into N chunks (N=threads) for the segments [0,size/N], [size/N+1, 2*size/N], etc. That's simple and low overhead and works well if the loop finishes fast.

      You can certainly specify that it shoves the next iteration into whichever thread is free, which works particularly well if the iterations are slow and rather varied in time.

      I believe you can essentially smoothly go between either of the two by specifying a chunk size, something like:

      #pragma omp parallel for schedule(dynamic, 30)

      It's supported by GCC, ICC and VS. LLVM didn't last time I looked but it does now. Compile and link with -fopenmp on gcc.

      --
      SJW n. One who posts facts.
  22. Shi's Law, Gustafson's Law, Amdahl's Law by amplesand · · Score: 3, Insightful

    Shi's Law

    http://developers.slashdot.org...

    http://spartan.cis.temple.edu/...

    http://slashdot.org/comments.p...

    "Researchers in the parallel processing community have been using Amdahl's Law and Gustafson's Law to obtain estimated speedups as measures of parallel program potential. In 1967, Amdahl's Law was used as an argument against massively parallel processing. Since 1988 Gustafson's Law has been used to justify massively parallel processing (MPP). Interestingly, a careful analysis reveals that these two laws are in fact identical. The well publicized arguments were resulted from misunderstandings of the nature of both laws.

    This paper establishes the mathematical equivalence between Amdahl's Law and Gustafson's Law. We also focus on an often neglected prerequisite to applying the Amdahl's Law: the serial and parallel programs must compute the same total number of steps for the same input. There is a class of commonly used algorithms for which this prerequisite is hard to satisfy. For these algorithms, the law can be abused. A simple rule is provided to identify these algorithms.

    We conclude that the use of the "serial percentage" concept in parallel performance evaluation is misleading. It has caused nearly three decades of confusion in the parallel processing community. This confusion disappears when processing times are used in the formulations. Therefore, we suggest that time-based formulations would be the most appropriate for parallel performance evaluation."



  23. Poor slashdot... by Anonymous Coward · · Score: 3, Insightful

    Few are actually people with a real engineering background anymore.

    What Linus means is:
    - Moore's law is ending (go read about mask costs and feature sizes)
    - If you can't geometrically scale transistor counts, you will be transistor count bound (Duh)
    - therefore you have to choose what to use the transistors for
    - anyone with a little experience with how machines actually perform (as one would have to admit Linus does) will know that keeping execution units running is hard.
    - since memory bandwidth has nowhere near scaled with CPU appetite for instructions and data, cache is already a bottleneck

    Therefore, do instruction and register scheduling well, have the biggest on die cache you can, and enough CPUs to deal with common threaded workflows. And this, in his opinion, is about 4 CPUs in common cases. I think we may find that his opinion is informed by looking at real data of CPU usage on common workloads, seeing as how performance benchmarks might be something he is interested in. In other words, based in some (perhaps adhoc) statistics.

    1. Re:Poor slashdot... by gweihir · · Score: 1

      Good summary, and I completely agree with Linus. The limit may go a bit higher, up to say, 8 cores, but not many more. And there is the little problem that for about 2 decades, chips have been interconnect-limited, which is a far harder limit to solve than the transistor-one, so the problem is actually worse.

      All that wishful thinking going on here is just ignorant of the technological facts. The time where your code could be arbitrarily stupid, because CPUs got faster in no time, is over. There may also be other fantasies in there, for example people that do not understand that (true/strong) AI is not a question of cycles, and that have their hopes misplaced in that direction.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    2. Re:Poor slashdot... by laird · · Score: 1

      Having worked on machines with thousands of CPUs, I disagree. The thing that Linus is missing (IMO) is that modern GPUs are no longer "graphics processors" but are actually quite powerful MPP supercomputers, and there are millions of them out there, and applications are increasingly being written to take advantage of them.

      He's right that putting many extremely expensive, power-hungry Intel CPUs in a single box isn't a good tradeoff except in very specific cases. Luckily it's actually quite cheap to add large numbers of cheap, high performance CPUs to a computer, and in fact they're likely already there, so the cost of using them is $0 for hardware, just some developer effort. So the question is simply whether developers should ignore all those CPUs and use only the main CPU, or they should learn how to use the supercomputer sitting on the graphics card.

    3. Re:Poor slashdot... by gweihir · · Score: 1

      What you run on modern GPUs are tiny programs for problems that have zero interaction between the parts, i.e. can be perfectly partitioned. That is not what Linus is talking about and not a typical workload.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  24. Depends by Anonymous Coward · · Score: 0

    Biggest machine I've run a single image on was 56 cores, by the time I'd finished lock contention was down in the noise.

    Admittedly these were synthetic benchmarks, but my comment would be that there isn't enough RAM to run a single process across 1000 cores (yet); the thread stack size will kill you before lock contention on well-written code does.

  25. From personal experience... by gnasher719 · · Score: 1

    Mostly writing code for MacOS X and iOS. All current devices have two or more cores. Writing multi-threaded code is made rather easy through GCD (Grand Central Dispatch), and anything receiving data from a server _must_ be multithreaded, because you never know how long it takes to get a response. So there is an awful lot of multi-threaded code around.

    But the fact that work is distributed to several cores is just secondary for that kind of work. It is also easy to make most work-intensive code use multiple cores. There are calls like sorting an array or searching for an item with multi-threaded variants. With GCD, you can just say "do this task on a background thread", and if you have five things to do, it uses five threads and up to five cores. It's so easy that people do it a lot without measuring how efficient it is. As long as your software is fast enough, it's fine.

    The typical result is an application that uses multiple cores to some degrees, but may have bottlenecks that require a single core. Now on an iPhone with 2 cores, that's fine. (If 30% of your time needs to run on a single core, but you have only two cores, it doesn't matter). On an iMac with 4 cores, it's quite OK. On a monster MacPro with 24 threads it might be a problem. On a hypothetical machine with 100s of cores it _is_ a problem.

    So your typical MacOS X or iOS app written by reasonably competent people will work fine in the current environment, but would need major changes to take advantage of 100s of cores.

  26. Linus is right by gweihir · · Score: 3, Insightful

    Nothing significant will change this year or in the next 10 years in parallel computing. The subject is very hard, and that may very well be a fundamental limit, not one requiring some kind of special "magic" idea. The other problem is that most programmers have severe trouble handling even classical, fully-locked, code in cases where the way to parallelize is rather clear. These "magic" new ways will turn out just as the hundreds of other "magic" ideas to finally get parallel computing to take off: As duds that either do not work at all, or that almost nobody can write code for.

    Really, stop grasping for straws. There is nothing to be gained in that direction, except for a few special problems where the problem can be partitioned exceptionally well. CPUs have reached a limit in speed, and this is a limit that will be with us for a very long time, and possibly permanently. There is nothing wrong with that, technology has countless other hard limits, some of them centuries old. Life goes on.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    1. Re:Linus is right by SpinyNorman · · Score: 1

      Yeah, parallel computing is mostly hard the way most of us are trying to do it today, but advances will be driven by need, and advised by past failures, not limited by them.

      You also argue against yourself by pointing out that CPU's have hit a speed limit - this is of course precisely why the only way to increase processing power is to use parallelism, and provides added incentive to find ways to make use of parallel hardware easier.

      The way massively parallel hardware will be used in the future should be obvious... we'll have domain specific high level libraries that will encapsulate the complexity, just as we do in any other area (and as we do for massively parallel graphics today). Massive parallelism is mostly about SIMD where the programmer basically wants to provide the data ("D") and high level instruction ("I") and have a high level library take on the donkey work of implementing it on a given platform.

      Current parallel computing approaches such as OpenCL, OpenMP, CUDA are all just tools to be used by the library writers or those (which will become increasingly few) whose needs are not met by off-the-shelf high level building blocks. No doubt the tools will get better, but for most programmers it makes no difference as they use libraries rather than write them. Compare for example to all the advances in templates and generic programming in C++11 and later... how many C++ programmers are intimately familiar and proficient in these new facilities, and how many actually need to use them as opposed to enjoying the user-friendly facilities of the STL built atop them?!

    2. Re:Linus is right by gweihir · · Score: 1

      You also argue against yourself by pointing out that CPU's have hit a speed limit - this is of course precisely why the only way to increase processing power is to use parallelism, and provides added incentive to find ways to make use of parallel hardware easier.

      No, I don't. It is pretty clear that processing power for single-thread loads or hard to parallelize ones will _not_ increase much more. Get over it. Wishing limits away does not work.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    3. Re:Linus is right by SpinyNorman · · Score: 1

      The need for massive parallelism will come (already has in the lab) from future applications generally in the area of machine learning/intelligence.

      Saying that "single threaded loads" won't benefit from parallelism is a tautology and anyways irrelevant to Linus's claim.

      FWIW I'd challenge you to come up with more than one or two applications that are compute bound and too slow on existing hardware that could NOT be rewritten to take advantage of some degree of parallelism.

  27. Pullin' a Gates? by Anonymous Coward · · Score: 0

    Memorable only if stated by someone who both is "in power" and should know better. 640k was much better than 64k, but even in those days, big iron had (and needed) much more than 640k. Lots of people say stupid things, a corporate bigshot saying something stupid within his own field of expertise is a bit different.

  28. Re:GAY NIGGERS can be DEVELOPERS 1000 WHORES! by Barsteward · · Score: 0

    some people just don't deserve to breathe. is it always the same twat always posting this shit or is there more than one single brain cell moron out there?

    --
    "The hands that help are better far than lips that pray." - Robert Ingersoll (1833-1899)
  29. Two points on Linus' post by Qbertino · · Score: 1

    1.) Linus' wording is pretty moderate.
    2.) He's right. Again.

    --
    We suffer more in our imagination than in reality. - Seneca
  30. Linus' rants... by Anonymous Coward · · Score: 0

    ... are not politically incorrect. "Politically incorrect" is a phrase used by douchebags who want to be able to say anything they want about anyone at any time. It used to be used mostly by white male douchebags who wanted to be able to make racist remarks but now it's mostly been co-opted by MRAs.

    Linus sometimes curses and doesn't pull his punches, but that's not "politically incorrect".

  31. The single most significant sentence.. by OneSmartFellow · · Score: 1

    ..is this:

    The obstacle we shall have to overcome, if we are to successfully program manycore systems, is our cherished assumption that we write programs that always get the exactly right answers.

    This is an interesting observation. Let's take graphs for example. We rarely need to solve every possible path and find THE shortest one, we usually only need to find one which is shorter than almost all the other ones.

    Do we always care whether every pixel is the best possible color when compressing images ? No, it usually only has to be close enough so that we can't tell the difference.

    These are classic examples of that statement that have already been implemented in both parallel and linear algorithm design. I'd like to see much more research into understanding why some problems don't require an exact answer, and some do. Maybe we need to change the way we think about what a solution is, rather than how to solve.

    1. Re:The single most significant sentence.. by Shados · · Score: 2

      I remember an issue I had a few months ago... we were doing some image processing using HTML canvas element on a web app... Then we wanted a nightly job to use the same code, so we whip out a node.js script. Once it was done, to make sure it worked the same way, we compared the result...

      They were different. Spent 2 days trying to debug it (they were using the same code for the most part, wtf?).

      At the time, I didn't know about canvas fingerprinting (http://en.wikipedia.org/wiki/Canvas_fingerprinting). Most of the time, different computers will generate equivalent images from HTML canvas that nonetheless differ at the binary level.

      And there's always the good old floating point operations. ie: 0.2 * 3 = 0.6000000000000001

      So it's already everywhere, just not everywhere enough that we've been forced to deal with it (those things are usually just afterthoughts and end up as bugs). Soon, they won't be.

  32. Linus wrong? Shocking! by bsdasym · · Score: 0

    It's one thing to argue against massive parallelism in a single piece of software. Of course that's not the right answer to every problem, or even to most of them. But arguing against many cores at a hardware level, as he seems to be doing, is plain stupid. Of course more cores == better. As long as I have 100+ processes running on my desktop PC, the more cores I have to spread them around, the better.

    No special languages or programmer training required.

  33. If you could only.... by asylumx · · Score: 1

    ...imagine a beowulf cluster of these!

  34. Build a PC: More RAM or more CPUs or more I/O? by DutchUncle · · Score: 1

    If you were putting together a PC (any variety, any era), what would you expect to get the most bang for the buck? Obviously get the fastest current hardware, but then: double the CPU? double the RAM? double the comm (which at this point includes SATA controllers)? My experience all the way back to Z80s has normally been more RAM, the extension of which is more cache close to the CPU, which is one of the things Linus says.

    It's hard to parallelize one application, which is why we all point to a handful of well-understood examples in graphics and that's about it. It's more straightforward - and more understandable - to parallelize multiple applications, like a "server" hearkening back to the old mainframe days. For a *general-purpose* computer doing mostly one or two things at a time with background communication and I/O, more RAM/cache == less thrashing == better *all-around* performance without adding complexity.

    1. Re:Build a PC: More RAM or more CPUs or more I/O? by laird · · Score: 1

      It's not "hard to parallelize one application". It's just a matter of learning to think that way. Once you do, nearly all problems parallelize well.

      For example, consider video games. Most of them have hundreds or thousands of AIs and game objects that can run in parallel. Heck, even word processing renders thousands of characters to the screen, which can be done in parallel. Sorting, searching, indexing, all parallelize. Of course, as long as it's considered "hard" developers won't do it, except in the highest value cases (e.g. video processing, graphics) but that's a matter of tooling. In languages/compilers that are designed for parallelism, it's easy. It's just hard in C++ because as a language it makes parallelism very hard. Compare to FORTRAN 90, or C*.

    2. Re:Build a PC: More RAM or more CPUs or more I/O? by DutchUncle · · Score: 1

      Let's consider one of your examples further: "word processing renders thousands of characters to the screen, which can be done in parallel." The starting position of each character depends on the position and size of the one before it, which in turn depends on the one before *that*, including where the lines break (not to mention ragged-right vs. double-justify). And let's not forget kerning - further interaction between characters. While it would seem, then, that one could treat every individual character as an individual sprite for display calculations, for the practical application of word processing it really makes just as much sense to handle the text file serially - and it's a lot less complex. The way I read Linus' posting, he's arguing that parallelism is overused and overhyped FOR GENERAL USE, and I tend to agree - at the same time that I'm very happy that my 4-core processor seems to overlap all of its network I/O and disk I/O and processing, and I'm very happy with my nice graphics card. From what I've seen over time, the GENERAL-PURPOSE bang for the buck is caching - making more memory faster and closer to the CPU.

  35. Re:Linus wrong? Shocking! by Anonymous Coward · · Score: 0

    You may want to read some other comments here.

  36. Limitations by Anonymous Coward · · Score: 1

    Multi-core CPUs are just a side-step because we can't scale single-core CPU performance to the same levels.

    For example, if there was a choice between a single-core CPU that could do 1000 bogomips or a 4-core CPU that could do 4x250 bogomips, I know I'd rather have the single-core chip because for the vast majority of use cases the single-core chip would destroy the quad.

    This is why modern multi-core CPUs have 'turbo' mode - Intel and AMD both realised that single-core performance is still much more important for individual programs so being able to run that code on one core and boost it at the detriment of the other cores gives a significant edge.

    I still remember when multi-core CPUs first came out - They were limited by TDP so cheaper single-core CPUs would almost always beat them in benchmarks because while they were slightly behind in multi-threading performance, they were far superior on single-core performance.

    One thing I am surprised about is that no CPU manufacturer has come up with a dynamic pipeline system, where you could run a CPU as e.g. a quad core for normal usage, but when presented with highly predictable streaming data, switch to a P4-style long pipeline by e.g. feeding one core into another and running the whole thing at a higher clockspeed.

    1. Re:Limitations by leuk_he · · Score: 1

      You are not wrong, but the point is that a parallel system can scale the number of CPUs from 4 to 1000. However, the same locking mechanisms used for 4-way parallelism are not useful in 1000-way parallelism. You need different techniques then. The Linus rant is pointed at current programming techniques that scale to 4-16 cores, but start to lose a lot of efficiency at more cores.

      By the way, a kind of dynamic pipeline has already existed for a long time. Think about hyperthreading: 2 threads share 1 core. The second thread is optional, to keep the CPU busy when one thread could not. Also, cache might be local to one or more cores. This is also a kind of dynamic pipeline.

  37. One application: Autonomous Car by Anonymous Coward · · Score: 0

    I think there is a little bit of noise on the early warning radar for future computing needs : sensor data processing in autonomous cars!

    That might be a fluke for the rest of our professional careers, but it might also determine where a good chunk of the silicon diffused goes in ten years.

    Since this area will be in constant and violent flux over the next twenty years, a lot of the processing power required for 3D analysis of video and radar sensors in each car will be provided by some sort of general purpose CPU/GPU. Video data could be processed in parallel in segments of the images from multiple cameras, but also in parallel by something like a dozen different algorithms, to get extra safety and confidence out of majority voting.

    Currently, about half of the value of an automobile sold in western markets is not made up from mechanical parts, but electric and electronic devices and the software for them.

    With the advent of the autonomous car, I would think the computing requirements would explode again. Since this would coincide with the petering out of Moore's law, and come well after clock rates have effectively been capped, this will require some sort of parallel computing solution.

    Linus is right, there is little need for massive parallelism outside of niche areas like HPC right now. But I am not so sure this cannot change.

  38. Linus "Ranting" by Anonymous Coward · · Score: 0

    "Anonymous Coward" - That's kind of harsh just because I choose not to register ;-)...

    Some individuals responding to Linus's "rant" aren't fully reading what was said; they quote a couple of the various laws of parallel computing and say he's 100 percent wrong. As Linus stated, "The only place where parallelism matters is in graphics or on the server side, where we already largely have it. Pushing it anywhere else is just pointless." Please note the server side, where these "laws" are relevant, whereas they have very little relevance on the desktop. Many programmers have gotten lazy or aren't being taught very well; some either aren't programming for parallelization or are using it in areas where it negatively impacts performance. I believe programmers need to take more care when utilizing parallelization.

    "Give it up. The whole "parallel computing is the future" is a bunch of crock." I do believe Linus is wrong with this statement; it lacks vision and makes me question his leadership role within the Linux community. I can think of a number of things that would prompt me to make such a statement in his place, where I would misstate my actual feeling on the matter, i.e. bad programming...

    I certainly believe that parallel computing is the future of computing, perhaps not the immediate future. I see multiple people referencing this or similar articles: http://www.cmu.edu/silicon-valley/news-events/seminars/2011/ungar-talk.html

    Aside from individuals saying they already thought of this or some such, who cares, I was thinking something similar as I read Linus's "rant". The point of the article was to point out that we need to go beyond our classic computing model to achieve massively parallel systems that continue to scale, with this being just one example of how to achieve it. I believe we need a whole host of new tools to perform parallel computing efficiently on a massive scale. Look at what the likes of FB and Google are doing with parallel computing and AI: they go so far as to custom build their own HW, yes, partly due to economies of scale but partly because COTS HW doesn't meet their functional requirements, not to mention the full software stack...

  39. SOME THINGS ARE NOT PARALLELIZABLE by Theovon · · Score: 1

    There are many common algorithms at the heart of important workloads that are not parallelizable. Consider sorting and shortest path algorithms that are important for managing data and route finding. The O(n-squared) versions can be parallelized (Bellman-Ford vs. Dijkstra's), but for any useful input size, the n-log-n version will be faster on a single core than the n-squared on a supercomputer (no hyperbole there). Even for workloads that do have a lot of parallelism, the inter-process communication often dominates. Except for benchmarks with no application to reality, there is always SOMETHING that serializes computation. Amdahl's law always bites you in the ass.

    So much for parallel computing.

    If you have many INDEPENDENT tasks, then sure, parallel computing is great. Web servers with many clients, graphics, etc. But that's for servers.

    On end-user systems, the amount of thread-level parallelism is very limited. Unless you're compiling Gentoo, you're going to top out at a handful of cores. This is not a limitation of the languages people use. It's a practical limitation of the parallelism inherent (or not) in the workloads people run, and it's a hard mathematical limitation of the optimal algorithms people use for common low-level tasks.

    http://crd-legacy.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf
    http://www.davidhbailey.com/dhbpapers/inv3220-bailey.pdf
    http://www.cs.binghamton.edu/~pmadden/pubs/dispelling-ieeedt-2013.pdf

    There are some people in parallel computing who need to go back to school and learn computational complexity.
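
    To put numbers on the Amdahl's law point, here is a small sketch of the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the core count. The 95% figure is only an assumed example value, not a measurement of any real workload.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumed example: 95% of the work parallelizes perfectly, 5% is serial.
for cores in (4, 16, 1000):
    print(cores, "cores ->", round(amdahl_speedup(0.95, cores), 2), "x")
```

    With a 5% serial part the speedup can never reach 20x no matter how many cores are added, which is the sense in which the serial part "bites".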

  40. except for ..... by Anonymous Coward · · Score: 0

    So parallel is a crock except for.....
    graphics
    servers
    in other words, if there's a need for multi-core then it happens.

  41. Re:GAY NIGGERS can be DEVELOPERS 1000 WHORES! by Anonymous Coward · · Score: 0

    is it always the same twat always posting this shit ?

    You are an even bigger twat by replying to him. He is modded to -1, and few would even see his post if not for your reply making it more visible.

    NEVER RESPOND TO TROLLS!!!

  42. Let's see how that sounds in 5-10 years time ... by SpinyNorman · · Score: 1

    It sounds rather like Bill Gates' [supposed] "640K is enough for anyone", but there's no denying that Linus said this one!

    Saying that graphics is the only client side app that can utilize large scale parallelism is short sighted bunk, and even ignores what is going on today let alone the future. In 20 years time we'll have handheld devices that would look just as much like science fiction, if available today, as today's devices would have looked 20 years ago.

    I have no doubt whatsoever that in the next few decades we'll see human level AI in handheld devices as well as server-based apps, and you better believe that the computing demands (both processing and memory) will be massive. Even today we're starting to see impressive advances in speech and image recognition and the underlying technology is increasingly becoming (massively parallel) connectionist deep learning architectures, not your grandfather's (or Linus's) traditional approaches. Current deep-learning architectures can be optimized to use significantly less resources for recognition-only deployment vs learning, but no doubt we'll see live learning in the future too as AI advances and technology develops.

    Linus's relegation of parallelism to server side is equally if not more shortsighted than his lack of vision of client-side CPU-sucking applications! If you want systems that are always available, responsive and scalable then that calls for distributed (client side) implementation, not server based. Future devices are not only going to be smart but the smarts are going to be local. Bye-bye server based Siri.

  43. Oblig XKCD? by thebes · · Score: 1
  44. Answering Linus' "Where the hell..." question... by tlambert · · Score: 1

    Answering Linus' "Where the hell..." question:

    "Where the hell do you envision that those magical parallel algorithms would be used?"

    When you have millions of robots running around your body, repairing your telomere length and resetting the cells' Hayflick limit, and repairing other aging-related damage, so you can live another 200+ years healthy and relatively physiologically young.

    You know, unless you actually *want* to be old and decrepit, and die centuries before you actually have to...

  45. Re:Linus wrong? Shocking! by Half-pint+HAL · · Score: 2

    Not true, because if the processes are IO bound (and most are), most of the processes will be waiting anyway. But Linus's argument hangs on a more fundamental problem: memory bandwidth. If all the cores are sitting waiting because the data isn't in the cache and the other cores are already trying to use the memory bus, then you'll end up with more unused cycles than if you ran timesliced threads on a single core. The correct answer to this one cannot be made by reasoning and logic from first principles, but only by looking at raw empirical data. I daresay Linus has more of that than most of us here.

    --
    Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
  46. Ripe for Revolution by Roger+W+Moore · · Score: 2

    Nothing significant will change this year or in the next 10 years in parallel computing.

    You might be right but I'm far less certain of it. The problem we have is that further shrinking of silicon makes it easier to add more cores than to make a single core faster so there is a strong push towards parallelism on the hardware side. At the same time the languages we have are not at all designed to cope with parallel programming.

    The result is that we are using our computing resources less and less efficiently. I'm a physicist on an LHC experiment at CERN and we are acutely aware of how inefficient our serial algorithms are at using modern hardware. What we need is a breakthrough in programming languages to be able to parallel program efficiently, just like object oriented programming allowed us to scale up the size of programs. Until this happens I agree that not much will change, but if there is some clever CS researcher/student out there with a clever idea for a good parallel programming language, the conditions are right for a revolution.

    1. Re:Ripe for Revolution by gweihir · · Score: 1

      If you were right, Transputers would have been the really big thing 25 years ago. They fizzled. Basically all massive parallel things have fizzled, because performance is abysmally bad, often worse than a single large CPU. So have all attempts at programming languages supporting larger parallelism. Linus just sums up the results of about 40 years of research. And most relevant problems cannot be parallelized in a meaningful way anyways, and these are fundamental limits, i.e. no clever idea is possible. Really, there are not going to be any breakthroughs, what we have now is what we will have in 100 years, give or take a small factor.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    2. Re:Ripe for Revolution by Roger+W+Moore · · Score: 1

      ...and right up until the invention of the transistor, computers would never be smaller than a large room or a small house. I would not be so sure about there being no clever idea possible unless there is a mathematical proof to support it. Until recently there was no need to go parallel; now there is a growing need to be able to program in parallel, and necessity is the mother of invention. While parallel does incur an overhead, as CPUs become more parallel and less serial the gain will presumably eventually overcome the cost of the parallel algorithm.

    3. Re:Ripe for Revolution by gweihir · · Score: 1

      There are mathematical proofs for many algorithms that they cannot be efficiently parallelized. There was always a strong effort to get parallel software off the ground, but it failed time and again. There is huge interest from the military and from other communities for simulations, for example. And some things _can_ be parallelized efficiently, like hash-tables (but they are I/O bound, hence Google parallelizes them to different machines in its search engine), while others cannot (like sorting, here parallelization only pays if comparing elements is very expensive).

      This is not a new problem. It has 40 years or so of intense research thrown at it. It even has specialized languages like OCCAM.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  47. Slang? by bjohnso5 · · Score: 1

    "Crock of shit" maybe? "Bunch of crock" doesn't seem like it'd even be a thing.

  48. Slang? by bjohnso5 · · Score: 1

    "Crock of shit" maybe? "Bunch of crock" doesn't seem like it'd even be a thing.

    Replying to myself... apparently it is a thing: http://en.wiktionary.org/wiki/...

  49. Re:Let's see how that sounds in 5-10 years time .. by Junta · · Score: 1

    To be fair, the trend seems to be hitting a ceiling.

    Desktop processors got to quad core and have pretty much sat there. The mobile space has been at quad-core a little less long and there are octo-core implementations moreso than desktop, but it still seems quad core is about where most devices settle. There are more efforts to make GPU style execution cores available for non-graphics use, but in practice a relatively small portion of the market has been able to have meaningful gains exploiting them. As vectorized instructions in cores become more capable, many of those problems actually start coming back to the traditional CPU cores as it works as well as the GPU but with an easier programming model. In short, the marketing results seem to indicate that end user devices might settle around quad core.

    Servers have been going up, with 18 core per socket for 2-socket now available. This shows that the desktop parts have room to grow in that dimension, but it just isn't being bothered with.

    --
    XML is like violence. If it doesn't solve the problem, use more.
  50. Re:Answering Linus' "Where the hell..." question.. by Junta · · Score: 1

    That sounds more like a distributed computing problem rather than applications running on a single 'system'. Even if it were centrally controlled, the computational load being time-shared might mean the best solution is still just a handful of cores. Such nanites would presumably be independent or unused enough that continuous CPU load would likely not even be in the picture. This is very much science fiction, but it still strikes me that the computational load would be negligible compared to the medical/engineering problems overcome. You take 30-40 years to start feeling the effects of aging, so it's not like cells require continual repair to achieve your hypothetical situation, just have to manage to repair everything within 25 years.

    --
    XML is like violence. If it doesn't solve the problem, use more.
  51. driving schools by Duncan+White · · Score: 1

    Http:// www.duncan-white.co.uk

  52. sequential programming mindset (try 64k cores) by the+agent+man · · Score: 1

    I was lucky enough to gather some parallel programming experience on the Connection Machine CM2, a 64k CPU (yes that is 65536 CPUs), 12 dimensional hypercube, a long time ago. The CM2 ultimately failed but we did get many great insights into parallel programming. At the time it was just not feasible for low cost, on your desktop, computing. It is NO problem to keep massive numbers of cores busy doing interesting computing. OK, the 12 dimensions are less clear on how to use them. At any rate, to claim that there is no need for 100 cores or more is really small minded because, unlike the time when silly "the world does not need more than 5 computers" kinds of comments were made, we already have evidence that there are powerful ways to employ massively parallel computing that can use thousands or even millions of cores.

    Just because we are caught in a sequential programming mindset does not mean that there is no room for parallel programming. If you are looking at a two dimensional array of data and think of a nested loop you ARE caught in a sequential programming mindset. Additionally, famous people, including Dijkstra, have poopooed some algorithms that are inefficient when executed sequentially, to the point where researchers and programmers are not even looking any more for good parallel execution. Take bubble sort. Not sure it was Dijkstra but somebody suggested forbidding it. Yes, on a sequential computer bubble sort is indeed inefficient, but guess what: if communication does matter and if you are using a massively parallel architecture (i.e., not 4 cores), bubble sort becomes quite efficient because you only need to talk to your data neighbors. Likewise there are AI algorithms that can be shown to behave really well when conceptualized and executed in parallel. Collaborative Diffusion is an example: http://www.cs.colorado.edu/~ra...
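
    The neighbor-only variant of bubble sort usually goes by the name odd-even transposition sort. The post above does not name it, so treat this as one possible reading of the idea rather than what was actually run on the CM2. In this sketch the compare-exchange pairs within a phase are disjoint, so on parallel hardware each pair could be handled by its own processor; here they are simply looped over.

```python
def odd_even_transposition_sort(values):
    """Sort using alternating phases of neighbor-only compare-exchange steps."""
    data = list(values)
    n = len(data)
    # n phases are enough to sort n elements with this scheme.
    for phase in range(n):
        # Even phases pair (0,1), (2,3), ...; odd phases pair (1,2), (3,4), ...
        start = phase % 2
        for i in range(start, n - 1, 2):
            # Every pair in a phase touches different elements, so these
            # steps are independent and could all run at the same time.
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data


sorted_values = odd_even_transposition_sort([5, 2, 9, 1, 5, 6, 0])
```

    With one processor per element, each phase takes constant time, so the whole sort takes a number of steps proportional to n, using only neighbor communication.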

  53. Re:GAY NIGGERS can be DEVELOPERS 1000 WHORES! by gatkinso · · Score: 1

    You make me miss Shampoo.

    --
    I am very small, utmostly microscopic.
  54. oh boy by Anonymous Coward · · Score: 0

    I imagine future processors internally will look like a spreadsheet with millions or billions of registers 64 - 1gbit in size that can handle both application and graphics rendering (software) like gaming, running at 500 gigahertz.

  55. Re:Let's see how that sounds in 5-10 years time .. by SpinyNorman · · Score: 1

    The trouble is that extrapolating the present isn't a great way to predict the future!

    If computers were never required to do anything much different than they do right now then of course the processing/memory requirements won't change either.

    But... of course things are changing, and one change that has been a long time coming but is finally hitting consumer devices is the arrival of the hard "fuzzy" problems like speech recognition, image/object recognition, natural language processing, artificial intelligence... and the computing needs of these types of application are way different than running traditional software. We may start with accelerators for state-of-the-art offline speech recognition, but in time (a few decades) I expect we'll have pretty sophisticated AI (think smart assistant) functionality widely available that may shake up hardware requirements more significantly.

  56. Lots of moving parts by m.dillon · · Score: 4, Informative

    There are lots of moving parts here. Just adding cores doesn't work unless you can balance it out with sufficient cache and main memory bandwidth to go along with the cores. Otherwise the cores just aren't useful for anything but the simplest of algorithms.

    The second big problem is locking. Locks which worked just fine under high concurrent loads on single-socket systems will fail completely on multi-socket systems just from the cache coherency bus bandwidth the collisions cause. For example, on an 8-thread (4 core) single-chip Intel chip having all 8 threads contending on a single spin lock does not add a whole lot of overhead to the serialization mechanic. A 10ns code sequence might serialize to 20ns. But try to do the same thing on a 48-core opteron system and suddenly serialization becomes 1000x less efficient. A 10ns code sequence can serialize to 10us or worse. That is how bad it can get.

    Even shared locks using simple increment/decrement atomic ops can implode on a system with a lot of cores. Exclusive locks? Forget it.

    The only real solution is to redesign algorithms, particularly the handling of shared resources in the kernel, to avoid lock contention as much as possible (even entirely). Which is what we did with our networking stack on DragonFly and numerous other software caches.

    Some things we just can't segregate, such as the name cache. Shared locks only modestly improve performance but it's still a whole lot better than what you get with an exclusive lock.

    The namecache is important because something like a bulk build, where we have 48 cores all running gcc at the same time, winds up sharing an enormous number of resources. Not just the shell invocations (where the VM pages are shared massively and there are 300 /bin/sh processes running or sitting due to all the Makefile recursion), but also the namecache positive AND negative hits due to the #include path searches.

    Other things, particularly with shared resources, can be solved by making the indexing structures per-cpu but all pointing to the same shared data resource. In DragonFly doing that for seemingly simple things like an interface's assigned IP/MASKs can improve performance by leaps and bounds. For route tables and ARP tables, going per-cpu is almost mandatory if one wants to be able to handle millions of packets per second.
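
    A loose user-space analogy for the per-cpu idea, not DragonFly's kernel code: instead of one lock that every thread fights over, give each worker its own shard and only combine the shards when someone needs the total. The `ShardedCounter` class and the worker numbering are invented for this sketch, and in CPython the GIL means it shows the structure rather than any real speedup.

```python
import threading


class ShardedCounter:
    """Counter split into per-worker shards so writers never share a lock."""

    def __init__(self, shards):
        self._counts = [0] * shards
        self._locks = [threading.Lock() for _ in range(shards)]

    def add(self, shard, amount=1):
        # Each worker only touches its own shard, so this lock is uncontended.
        with self._locks[shard]:
            self._counts[shard] += amount

    def total(self):
        # Readers pay the cost instead: they have to visit every shard.
        result = 0
        for index, lock in enumerate(self._locks):
            with lock:
                result += self._counts[index]
        return result


def run(workers=4, increments=10000):
    """Start one thread per shard and return the combined count."""
    counter = ShardedCounter(workers)

    def work(shard):
        for _ in range(increments):
            counter.add(shard)

    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.total()
```

    The trade-off is the usual one: writes get cheap and independent, while reads that need a global view get more expensive.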

    Even something like the fork/exec/exit path requires an almost lockless implementation to perform well on concurrent execs (e.g. such as /bin/sh in a large parallel make). Before I rewrote those algorithms our 48-core opteron was limited to around 6000 execs per second. After rewriting it's more like 40,000+ execs per second.

    So when one starts working with a lot of cores for general purpose computing, pretty much the ENTIRE operating system core has to be reworked, because what worked well with only 12 cores will fall on its face with more.

    -Matt

    1. Re:Lots of moving parts by NovaX · · Score: 1

      Some things we just can't segregate, such as the name cache. Shared locks only modestly improve performance but it's still a whole lot better than what you get with an exclusive lock.

      What is the challenge with the namecache, specifically? If it's due to being LRU then there are approaches to mitigate the lock. A buffering approach, like the one this Java cache uses, batches updates to avoid lock contention. Another technique is to take a random sample to be probabilistically LRU, like Redis does.
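
      A minimal sketch of the sampling idea, assuming a simple in-memory dictionary cache. The `SampledLRUCache` class, its parameters, and the logical clock are all invented for illustration; this is not Redis's implementation, and it is single-threaded, so it only shows the eviction policy, not the locking benefit.

```python
import random


class SampledLRUCache:
    """Approximate LRU: evict the oldest entry among a small random sample."""

    def __init__(self, capacity, sample_size=5, seed=None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.sample_size = sample_size
        self._data = {}        # key -> value
        self._last_used = {}   # key -> logical time of the last access
        self._clock = 0
        self._rng = random.Random(seed)

    def _touch(self, key):
        # No global ordering structure is updated, only a per-key timestamp.
        self._clock += 1
        self._last_used[key] = self._clock

    def get(self, key, default=None):
        if key in self._data:
            self._touch(key)
            return self._data[key]
        return default

    def put(self, key, value):
        if key not in self._data and len(self._data) >= self.capacity:
            self._evict_one()
        self._data[key] = value
        self._touch(key)

    def _evict_one(self):
        # Copying the keys is O(n) and only acceptable in a sketch; a real
        # implementation would sample without materializing the key list.
        keys = list(self._data)
        sample = self._rng.sample(keys, min(self.sample_size, len(keys)))
        victim = min(sample, key=self._last_used.__getitem__)
        del self._data[victim]
        del self._last_used[victim]
```

      Because an access only updates one timestamp instead of reordering a shared list, there is much less shared state to protect on the hot path.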

      --

      "Open Source?" - Press any key to continue
  57. Linus Lock by fyngyrz · · Score: 2

    The core is already dozens of times faster than memory

    It isn't, though, except for integer operations and tossing things around. Floating point core elements have a ways to go yet to get to single cycle for everything, and so spreading math among cores still saves time. OS folk like Linus may tend to think in terms of byte-to-BusSize manipulation. A lot of us deal with more nuanced data and operations. I *guarantee* you that a multicore processor will chew up properly designed image manipulation tasks a good deal faster than a single core will, and more flexibly (and more system-friendly) than a GPU can too, although slower for ops that fit in the GPU's memory and for which it offers competence. Software defined radio also makes terrific use of multiple cores, for instance here, a 3 GHz system with 8 cores is mostly free to do other stuff, and a system with one core running at the same speed is about 90% utilized, which doesn't leave enough horsepower to do much else. Whereas with the 8-core, I can run the SDR and do whatever the heck I want. Then there's the "what do you mean by 'core'" question. Does the core have an FPU, or is it one of those profoundly crippled integer-only units? Does the core actually share memory (and therefore memory bandwidth) with other cores, or does it have its own pool of RAM? Is eco throttling choking it half to death? And so on.

    Having 1000 cores all waiting for 3,000 microseconds while the hard drive rotates to the other side of the platter

    What is this "hard drive" thing you describe? Doesn't everyone use boards with terabytes of RAM for near-term storage?

    Seriously, though, we all know (well, the ones who have considered it) that's exactly where we're going. SSDs as they stand today are just the tip of the iceberg; you want to know what's coming, instantiate a ram disk on your machine and run some benchies with it. And when we get to real RAM based storage, or anything of similar speed (or perhaps better... memristors?), we won't have wanted CPU development to have been sitting on laurels planted in a garden made of dead-slow storage in the interim.

    Having 1000 cores all waiting for 3,000 microseconds while the hard drive rotates to the other side of the platter does not improve performance over 4 cores waiting.

    True enough, but of course, that's not what happens, so... Effectively -- of course they can and do switch roles when memory is shared -- one is monitoring your ethernet, several are kicking in and out of httpd threads and/or processes, and so on for hundreds of OS tasks, and if you're like me, more than a few user tasks as well. For every task within a process that isn't hidebound by disk (and there are already a lot of them) having an additional available core is a very worthy thing. And when cores are tied up waiting for high level math operations, memory is (more) free relative to the needs of the available cores, and things simply run smoother, sooner. There's a lot of handwaving in there because of the complexity of caching and lookahead and so on, but the bottom line is in my 8 core machine, I can do a lot more than in my 2-core machine, even though both have the same amount of memory and run at the same speed. And I apologize for the mangling of terminology. I think the point remains clear:

    Multiple cores are a great thing.

    --
    I've fallen off your lawn, and I can't get up.
    1. Re:Linus Lock by kesuki · · Score: 1

      http://en.wikipedia.org/wiki/AMD_Radeon_Rx_200_Series#Radeon_R9_280X

      has 2048 'stream' processors and only 3GB of ram -- true it only has 32 of what most people would call compute cores, but it is getting data from thousands of threads processed from its stream processing units. i have one of these devices and it was about 75 times faster at altcoin mining (dogecoins specifically) than the general purpose so called 8 core fx 8150 cpu. even though it's 8 threads the wiki doesn't explain why they can call it an 8 core but i know shortly after i built the rig computer parts changed generations and got slower.

      basically speaking programming for thousands of cores exists today by having simple tasks that break up complex or end user desired tasks and make them simple to run in parallel.

    2. Re:Linus Lock by RoLi · · Score: 1

      True enough, but of course, that's not what happens, so... Effectively -- of course they can and do switch roles when memory is shared -- one is monitoring your ethernet, several are kicking in and out of httpd threads and/or processes, and so on for hundreds of OS tasks, and if you're like me, more than a few user tasks as well. For every task within a process that isn't hidebound by disk (and there are already a lot of them) having an additional available core is a very worthy thing.

      Yeah, that's the theory.

      In real life, my 6-core, 32 GB-RAM box swaps even the tiniest process to disk (which is of course SSD) so that even opening the KDE-menu takes ages after some time.

      I think programmers are just too lazy to really use the hardware (which exists already today). For example the smart thing to do would be to make sure that the user interface is never swapped to disk. That would reduce available RAM only slightly but would dramatically improve performance.

      But of course nobody does it because 1) their mind was closed by academia which preaches inefficient but supposedly programmer-friendly things like OO, scripting, one-size-fits-all frameworks etc. and 2) because everybody is hoping to squash every problem with faster hardware.

      So it won't happen.

      In 20 years, we will run huge machines that will slow down everything by running as much as possible on Python and Javascript because that's what is hip and performance be damned. (Isn't the Windows 8 framework - user interface based on CSS and Javascript already?)
      Performance will probably suffer because instead of having fonts on disk (how 20th-century is that?) our computers will load fonts from Google about 10 times per hour.

    3. Re:Linus Lock by davydagger · · Score: 1
      you have 32 GB of RAM

      swapoff bro, just swapoff, and put a # in /etc/fstab

    4. Re:Linus Lock by fyngyrz · · Score: 1

      I think programmers are just too lazy to really use the hardware

      Not everyone. I write in C -- large applications, too -- and I write as close to the metal as I can get. I don't mind assembler, but the processors in use move under our feet too often: there's just no practical way to keep up without a compiler in between my code and the actual CPU instructions.

      For example the smart thing to do would be to make sure that the user interface is never swapped to disk. That would reduce available RAM only slightly but would dramatically improve performance.

      Agreed, that sounds like it'd be worthy. The problem I would anticipate is that a lot of the OS/UI code may be contained in huge "black boxes" that, if all loaded all the time, would consume much more RAM than we might otherwise think would be needed. OTOH, maybe we should all have 32 GB of RAM like you do. It sure has gotten inexpensive. On the OTHER other hand, if we did, the bloody OS would probably balloon to 32 GB, so... lol

      In real life, my 6-core, 32 GB-RAM box swaps even the tiniest process to disk (which is of course SSD) so that even opening the KDE-menu takes ages after some time.

      Concur with davydagger. You're either doing something really resource-intensive you didn't mention, or your OS is configured wrong.

      If you have 32 GB of ram, unless you're running software that makes demands on that scale, you probably don't need swap at all. I've only got 8 GB of ram and my system does really well unless I actually use it up -- although mine's OS X, so your swap algorithms and so forth are different. Still, I'm almost certain you can set the box up to behave better.

      In the past, I know linux had a really annoying bias for using up all the ram with buffers and cache, and would pig out if you actually tried to use that ram yourself once all the RAM was used that way, despite the supposed ability to throw out the cache and the buffers if the RAM was needed, but I am under the impression that time has passed.

      davydagger offered some specifics there... sounds like the right place to at least start reading some man pages. :)

      Isn't the Windows 8 framework - user interface based on CSS and Javascript already?

      No idea. Microsoft is dead to me. :)

      --
      I've fallen off your lawn, and I can't get up.
  58. NN Architecture by Anonymous Coward · · Score: 0

    Think of the way eyeballs work. Our neurons don't stream full resolution video back down the optic nerve. In fact, a bunch of processing occurs right behind the retina itself. The data is crunched into a radically smaller format by the time it hits the brain. In much the same way, wearables need to crunch/compress in a massively parallel manner down to something that can be reasonably transferred down the pipe.

  59. Agreed; see also MapReduce and Hadoop; Cliff Nass by Paul+Fernhout · · Score: 1

    http://en.wikipedia.org/wiki/M...
    http://en.wikipedia.org/wiki/A...

    I learned MapReduce for use with CouchDB and it is a powerful technique even when not on parallel hardware -- although a bit of a conceptual shift.
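
    For anyone who has not met the pattern, here is a tiny single-process sketch of the map, shuffle, and reduce steps using a word count. It is a generic illustration in Python, not the CouchDB or Hadoop API, and all the function names are made up for the example.

```python
from collections import defaultdict


def map_phase(documents, mapper):
    """Apply the mapper to every document and yield its (key, value) pairs."""
    for doc in documents:
        yield from mapper(doc)


def shuffle(pairs):
    """Group values by key; a real framework would also route keys to nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(doc):
    for word in doc.lower().split():
        yield word, 1


def count_reducer(key, values):
    return sum(values)


docs = ["parallel is the future", "the future is parallel", "so much for parallel"]
counts = reduce_phase(shuffle(map_phase(docs, word_mapper)), count_reducer)
```

    The mapper sees one document at a time and the reducer sees one key at a time, which is why the same code shape spreads across many machines without changes to the logic.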

    Here is a group using MapReduce with Hadoop for image processing:
    http://hipi.cs.virginia.edu/
    "HIPI is a library for Hadoop's MapReduce framework that provides an API for performing image processing tasks in a distributed computing environment. "

    Linus wrote: "The only place where parallelism matters is in graphics or on the server side, where we already largely have it. Pushing it anywhere else is just pointless." But would Linus really think image processing (like for robots or self-driving cars or using Baxter to sort your kid's Legos) is not an important issue? Sounds a bit like "640K is enough memory for anyone". Failure of the imagination is all too common based on unfamiliarity with some problem domain. Although, to be frank, I thought 32K of RAM on a Commodore PET was more than enough memory for anyone, because I could not imagine writing a program that large at the time. :-)

    Also, agent-based simulations or zone-based simulations can often use as much parallel hardware as you can throw at it, even if there may be occasional short synchronization steps. For example you could have a Minecraft-like game with thousands of active entities like wolves, zombies, pigs, and so on -- as well as processes like erosion or plant growth going on in multiple zones simultaneously. Game design could really change with millions of available general purpose cores. My wife and I created an algorithm for growing botanically accurate plants, but current games like Minecraft can't use it to grow each unique plant because it would be too computationally intensive if you had millions of unique plants all growing at the same time.
    https://github.com/pdfernhout/...
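
    A rough sketch of the zone-based pattern, assuming each zone can be advanced one tick without looking at any other zone. The names step_zone and simulate are hypothetical, and the "growth" rule is a toy stand-in, not the plant algorithm linked above. A thread pool keeps the example short; in CPython you would likely swap in a process pool to get real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

def step_zone(zone):
    """Advance one zone by one tick; touches only that zone's own state."""
    # Toy rule (an assumption): every plant grows by one unit per tick.
    return [height + 1 for height in zone]

def simulate(zones, ticks, workers=4):
    """Run `ticks` rounds, updating all zones in parallel each round."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(ticks):
            # Parallel phase: zones are independent, so map them across workers.
            zones = list(pool.map(step_zone, zones))
            # Short synchronization step: collecting the results acts as a
            # barrier. Cross-zone effects (entities migrating) would go here.
    return zones
```

    With two zones, simulate([[0, 1], [5]], ticks=3) advances every plant three times and keeps the zones in their original order.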

    Congrats on your luck/skill in working with Thinking Machines hardware like the CM2. Around 1984, as a psychology undergrad at Princeton interested in AI, I had developed some software called "Mex" for multiple execution where I ran up to 1000 simulated processors on an IBM mainframe under VMUTS. I was using it to help process some data from a robot vision system I had put together (which itself had three 6502 processors). I was really excited about the idea of linking together lots of 6502 processors. I applied for a job then at Thinking Machines but didn't get an offer. A sociology grad student I knew from then (Clifford Nass) got a job offer there (and that is part of why I applied there) but he didn't take the offer, which is kind of ironic. He's brilliant and innovative as his career shows, but not really a programmer or hardware guy, and not all that interested in AI that I knew of:
    http://adlininc.com/uxpioneers...

    I was shocked and saddened just now, when checking what he is up to, to see on Wikipedia that Cliff died recently of a heart attack:
    http://en.wikipedia.org/wiki/C...

    What a big loss for Cliff's family as well as the world. And not that long after the sad loss of Professor Jim Beniger, who was an inspiration and good role model to both Cliff and myself in various ways.

    I can see though how Thinking Machines could also have benefited from Cliff's cleverness in thinking about human/machine interaction related to control of a (then) new type of machine. Maybe they'd still be in business if Cliff had gone to work with them? And maybe, being associated with MIT, they did not need yet one more programmer or hardware person, no matter how much they were interested in parallel processing or had done their own projects already on it.

    --
    A 21st century issue: the irony of technologies of abundance in the hands of those still thinking in terms of scarcity.
  60. Hardware verification, not software QA. by Ungrounded+Lightning · · Score: 1

    Verification is the process of checking that software works correctly. The more complex the system, the more complex the process of verification.

    You said "verification" but you're thinking of "software quality assurance". Though "verification" is sometimes used to describe a step in that process, when used standing alone (at least here in Silicon Valley), it refers to the analogous process in integrated circuit design.

    Verification is a BIG DEAL in integrated circuit design. A good hardware project will have at least as many verification engineers as designers (and hardware designers will freely act as verification engineers - on OTHER designers' modules - during the later stages of a chip tapeout, without taking a career hit.) It is the limiting factor in when the chip design hits silicon and when it hits the market.

    So IMHO the previous poster is talking about the up-front quality assurance processes and costs of hardware, rather than software, complexity.

    (Releasing a rev to a software product due to a QA issue missed due to added complexity may be costly. But releasing a rev to silicon takes months and millions of dollars of sunk cost. They're not in the same league.)

    --
    Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
  61. Limited application by MoarSauce123 · · Score: 1

    There is limited application for making processes faster through parallelism. It only works well for processes that do not rely on the results of any of the other processes. Unfortunately, many real world applications depend on sequential tasks and I/O. That leaves running multiple applications in parallel, but that is different from parallel programming and is a task already accomplished quite well by current OSes.
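
    To put rough numbers on that, here is a small sketch of Amdahl's law (fixed problem size) next to Gustafson's law (problem size grows with core count), which the summary mentions. The function and parameter names are my own; parallel_fraction is the share of the work that can run in parallel.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup for a fixed-size problem: the serial part never shrinks."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

def gustafson_speedup(parallel_fraction, cores):
    """Scaled speedup when the parallel part of the work grows with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * cores
```

    With 90% of the work parallel, amdahl_speedup(0.9, 1000) stays just under 10 no matter how many cores you add, while gustafson_speedup(0.9, 1000) keeps climbing because the workload itself is assumed to scale.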

  62. (sometimes) by Anonymous Coward · · Score: 0

    "Although Linus' rants are deservedly famous for the political incorrectness and (often) for their insight..."

    I would not say (often).

  63. Re:Let's see how that sounds in 5-10 years time .. by Junta · · Score: 1

    The issue is that when processor vendors went to dual and then quad core, people started extrapolating and saying 'oh in a decade, we'll be using hundreds of cores on a random desktop'. Instead it tapered out at about 4 for the most part with focus on reducing the power envelope while minimizing performance loss.

    I would say the discussion presuming massive core counts is based on an extrapolation of older trends of increasing core count, and it's perfectly reasonable to step back and recognize the change in the trend. Sure, tomorrow we could suddenly be back on the path to 256 core desktop solutions for unforeseen reasons, but as it stands, there are no signs of that being the priority of the industry.

    --
    XML is like violence. If it doesn't solve the problem, use more.
  64. Re:Linus wrong? Shocking! by bsdasym · · Score: 1

    It sounds like you're suggesting that memory bus speed will not continue to increase, and thus, we should stop adding bus contention by adding cores. The conclusion there hinges on a rather unsupported premise that is contradicted by the (historical) empirical data. All signs point to memory becoming much faster indeed.

    If Linus' expertise were really relevant here, perhaps Transmeta wouldn't have failed.

  65. Re:Linus wrong? Shocking! by Half-pint+HAL · · Score: 1

    Memory bus speed is increasing, and therefore the cost of cache misses is decreasing. One way or another, that still leaves us with cache misses as a bottleneck. The question is not a straightforward one of "memory bus speeds are increasing so who gives two hoots" -- there is a very subtle equation needed to determine what cache size is optimal with what bus speed, and for which task.
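
    As a first-order illustration only, the textbook average memory access time model shows why a cheaper miss doesn't make misses free. This is the simple AMAT formula, not the full cache-size/bus-speed tradeoff; the function name and the cycle counts in the note below are assumptions for the sake of the example.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate * miss penalty (all times in cycles)."""
    return hit_time + miss_rate * miss_penalty

def slowdown_vs_perfect_cache(hit_time, miss_rate, miss_penalty):
    """How many times slower than a cache that never misses."""
    return average_access_time(hit_time, miss_rate, miss_penalty) / hit_time
```

    With a 1-cycle hit, a 5% miss rate and a 100-cycle penalty, the average access costs about 6 cycles; halving the penalty to 50 cycles still leaves it at about 3.5, well above the 1-cycle ideal.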

    --
    Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
  66. Re:Let's see how that sounds in 5-10 years time .. by SpinyNorman · · Score: 1

    Well, there's obviously no need to add more cores/parallelism until there's a widespread need for it (unless you are Chinese, when octocore is a must!), but I think the need is coming pretty fast.

    There are all sorts of cool and useful things you can do with high quality speech, image, etc recognition, natural language processing and AI, and these areas are currently making rapid advances in the lab and slowly starting to trickle out into consumer devices (e.g. speech and natural language support both in iOS and Android).

    What is fairly new is that, in the lab, state-of-the-art results in many of these fields are now coming from deep learning / recurrent neural net architectures rather than traditional approaches (e.g. MFCC + HMM for speech recognition), and these require massive parallelism and compute power. These technologies will continue to migrate to consumer devices as they mature and as the compute requirements become achievable...

    Smart devices (eventually *really* smart) are coming, and the process has already started.
