Slashdot Mirror


Windows and Linux Not Well Prepared For Multicore Chips

Mike Chapman points out this InfoWorld article, according to which you shouldn't immediately expect much in the way of performance gains from Windows 7 (or Linux) from eight-core chips that come out from Intel this year. "For systems going beyond quad-core chips, the performance may actually drop beyond quad-core chips. Why? Windows and Linux aren't designed for PCs beyond quad-core chips, and programmers are to blame for that. Developers still write programs for single-core chips and need the tools necessary to break up tasks over multiple cores. Problem? The development tools aren't available and research is only starting."

626 comments

  1. The Core? by Anonymous Coward · · Score: 0

    With Linux "My Kung-Fu is Strong!"

    1. Re:The Core? by palegray.net · · Score: 2, Interesting

      Hey, at least we aren't dealing with the lovely world of Cyrix anymore... those were truly fun times with respect to compiler optimizations (or lack thereof, as it turned out). That and the, um, heat "issues."

    2. Re:The Core? by Anonymous Coward · · Score: 0

      True, but the grandchild of the Cyrix cpu (the Nano) exonerates them, in my mind.

    3. Re:The Core? by Flywheel · · Score: 1

      The 6x86 was a great design that gave us much to be thankful for, in many ways - even though the FPU sucked (Funny enough because Cyrix used to make some magnificent FPUs).

      In most cases the heat issues could be minimized by using a better cooler, as long as it was manufactured by IBM - even better sold under the IBM brand

      --
      Live long and prosper...
    4. Re:The Core? by Flywheel · · Score: 1

      I would rather describe the Nano as the grandchild of the WinChip, not the 6x86

      --
      Live long and prosper...
    5. Re:The Core? by palegray.net · · Score: 1

      Agreed on all points, especially the irony of how badly Cyrix 6x86 processors sucked at floating point... they got their start designing math coprocessors, after all :).

  2. Adapt by Dyinobal · · Score: 3, Funny

    Give us a year maybe two.

    1. Re:Adapt by Dolda2000 · · Score: 5, Interesting

      No, it's not about adaptation. The whole approach currently taken is completely, outright on-its-head wrong.

      To begin with, I don't believe the article about the systems being badly prepared. I can't speak for Windows, but I know for sure that Linux is capable of far heavier SMP operation than 4 CPUs.

      But more importantly, many programming tasks simply aren't meaningful to break up into such units of granularity as OS-level threads. Many programs would benefit from being able to run just some small operations (like iterations of a loop) in parallel, but just the synchronization work required to wake up even a thread from a pool to do such a thing would greatly exceed the benefit of it.

      People just think about this the wrong way. Let me re-present the problem for you: CPU manufacturers have been finding it harder to scale the clock frequencies of CPUs higher, and therefore they start adding more functional units to CPUs to do more work per cycle instead. Since the normal OoO parallelization mechanisms don't scale well enough (probably for the same reasons people couldn't get data-flow architectures working at large scales back in the 80's), they add more cores instead.

      The problem this gives rise to, as I stated above, is that the unit of parallelism gained by more CPUs is too large to divide the very small units of work that exist among. What is needed, I would argue, is a way to parallelize instructions in the instruction set itself. HP's/Intel's EPIC idea (which is now Itanium) wasn't stupid, but it has a hard limitation on how far it scales (currently four instructions simultaneously).

      I don't have a final solution quite yet (though I am working on it as a thought project), but the problem we need to solve is getting a new instruction set which is inherently capable of parallel operation, not on adding more cores and pushing the responsibility onto the programmers for multi-threading their programs. This is the kind of thing the compiler could do just fine (even the compilers that exist currently -- GCC's SSA representation of programs, for example, is excellent for these kinds of things), by isolating parts of the code in which there are no dependencies in the data-flow, and which could therefore run in parallel, but they need the support in the instruction set to be able to specify such things.
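
      A rough sketch of the dependency analysis described above, assuming a toy list of single-assignment instructions. The instruction tuples and the `parallel_levels` helper are made up for illustration; this is not how GCC represents or schedules anything.

```python
# Hypothetical three-address instructions: (destination, operation, sources).
# Each destination is assigned once and sources are defined earlier in the list.
PROGRAM = [
    ("a", "load", []),
    ("b", "load", []),
    ("c", "add", ["a", "b"]),
    ("d", "mul", ["a", "a"]),
    ("e", "sub", ["c", "d"]),
]


def parallel_levels(program):
    """Group instructions into levels; instructions within one level have no
    data dependencies on each other and could, in principle, run in parallel."""
    level_of = {}  # destination name -> level index
    levels = []
    for dest, _op, sources in program:
        # An instruction can run one level after its latest-available input.
        level = 1 + max((level_of[s] for s in sources), default=-1)
        level_of[dest] = level
        if level == len(levels):
            levels.append([])
        levels[level].append(dest)
    return levels


print(parallel_levels(PROGRAM))  # [['a', 'b'], ['c', 'd'], ['e']]
```

      Finding the independent groups is the easy part; the argument above is that current instruction sets give the compiler no cheap way to tell the hardware about them.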

    2. Re:Adapt by Cassini2 · · Score: 5, Insightful

      Give us a year maybe two.

      I think this problem will take longer than a year or two to solve. Modern computers are really fast. They solve simple problems, almost instantly. A side-effect of this, is that if you underestimate the computational power required for the problem at hand, then you are likely to be off by large amounts.

      If you implemented an order n-squared algorithm, O(n^2), on a 6502 (Apple II), and n was larger than a few hundred, you were dead. Many programmers wouldn't even try implementing hard algorithms on the early Apple II computers. On the other hand, a modern processor might tolerate O(n^2) algorithms with n larger than 1000. Programmers can try solving much harder problems. However, the programmer's ability to estimate and deal with computational complexity has not changed since the early days of computers. Programmers use generalities. They use ranges: like n will be between 5 and 100, or n will be between 1000 and 100,000. With modern problems, n=1000 might mean the problem can be solved on a netbook, and n=100,000 might require a small multi-core cluster.
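
      To put rough numbers on that range, here is a small sketch that counts steps for an O(n^2) algorithm while ignoring constant factors. The `quadratic_steps` name is made up for the example.

```python
def quadratic_steps(n):
    """Rough step count for an O(n^2) algorithm, ignoring constant factors."""
    return n * n


small, large = 1_000, 100_000
print(quadratic_steps(small))                             # 1000000
print(quadratic_steps(large))                             # 10000000000
print(quadratic_steps(large) // quadratic_steps(small))   # 10000
```

      A 100x larger n means roughly 10,000x more work, which is why an estimate of "somewhere between 1000 and 100,000" spans a netbook and a cluster.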

      There aren't many programming platforms out there that scale smoothly between applications deployed on a desktop, to applications deployed on a multi-core desktop, and then to clusters of multi-core desktops. Perhaps most worrying, is that the new programming languages that are coming out, are not particularly useful for intense data analysis. The big examples of this for me are: .NET and Functional Languages. .NET deployed at about the same time multi-core chips showed up, and has minimal support for it. Functional languages may eventually be the solution, but for any numerically intensive application, tight loops of C code are much faster.

      The other issue with multi-core chips, is that as a programmer, I have two solutions to making my code go faster:
      1. Get out the assembly print outs and the profiler, and figure out why the processor is running slow. Doing this, helps every user of the application, and works well with almost any of the serious compiled languages (C, C++). Sometimes, I can get a 10:1 speed improvement.(*) It doesn't work so well with Java, .NET, or many functional languages, because they use run-time compilers/interpreters and don't generate assembly code.
      2. I recode for a cluster. Why stop at a multi-core computer? If I can get a 2:1 to 10:1 speed up by writing better code, then why stop at a dual or quad core? The application might require a 100:1 speed up, and that means more computers. If I have a really nasty problem, chances are that 100 cores are required, not just 2 or 8. Multi-core processors are nice, because they reduce cluster size and cost, but a cluster will likely be required.

      The problem with both of the above approaches, is that from a tools perspective, they are the worst choice for multi-core optimizations. Approach 1 will force me into using C and C++, which don't even handle threads really well. In particular, C and C++ lack an easy implementation of Software Transactional Memory, NUMA, and clusters. This means that approach 2 may require a complete software redesign, and possibly either a language change or a major change in the compilation environment. Either way, my days of fun loving Java and .NET code are coming to a sudden end.

      I just don't think there is any easy way around it. The tools aren't yet available for easy implementation of fast code that scales between the single-core assumption and the multi-core assumption in a smooth manner.

      Note: * - By default, many programmers don't take advantage of many features that may increase the speed of an algorithm. Built-in special purpose libraries, like MMX, can dramatically speed up certain loops. Sometimes loops contain a great deal of code that can be eliminated. Maybe a function call is present in a tight loop. Anti-virus software can dramatically affect system speed. Many little things can sometimes make big differences.

    3. Re:Adapt by Dolda2000 · · Score: 5, Informative

      Since the normal OoO parallelization mechanisms don't scale well enough

      It hit me that this probably wasn't obvious to everyone, so just to clarify: "OoO", here, stands not for Object-Oriented Something, but for Out-of-Order, as in how current, superscalar CPUs work. See also Dataflow architecture.

    4. Re:Adapt by MCSEBear · · Score: 1

      One of the nice things about Apple is that they follow the philosophy of Wayne Gretzky. Gretzky says: "A good hockey player plays where the puck is. A great hockey player plays where the puck is going to be."

      With the LLVM Compiler and GrandCentral, Apple has been working for years now on a way to better take advantage of machines with many cores. Once again, they are making a leap that Microsoft will not be able to match for many years.

      Of course, with the way multicore architecture has come to the forefront, I kind of wish Be OS had survived since it was designed to be multicore from day one. I have a feeling its pervasively multithreaded nature would kick Apple and Microsoft's ass on modern hardware.

    5. Re:Adapt by Yaa+101 · · Score: 4, Interesting

      The final solution is that the processor measures and decides which part of which program must be run parallel and which are better off left alone.
      What else do we have computers for?

    6. Re:Adapt by tftp · · Score: 5, Insightful

      To dumb your message down, CPU manufacturers act like book publishers who want you to read one book in two different places at the same time just because you happen to have two eyes. But a story can't be read this way, and for the same reason most programs don't benefit from several CPU cores. Books are read page by page because each little bit of story depends on previous story; buildings are constructed one floor at a time because each new floor of a building sits on top of lower floors; a game renders one map at a time because it's pointless to render other maps until the player made his gameplay decisions and arrived there.

      In this particular case CPU manufacturers do what they do simply because that's the only thing they know how to do. We, as users, for most tasks would rather prefer a single 1 THz CPU core, but we can't have that yet.

      There are engineering and scientific tasks that can be easily subdivided - this comes to mind - and these are very CPU-intensive tasks. They will benefit from as many cores as you can scare up. But most computing in the world is done using single-threaded processes which start somewhere and go ahead step by step, without much gain from multiple cores.

    7. Re:Adapt by Sentry21 · · Score: 5, Insightful

      This is the sort of thing I like about Apple's 'Grand Central'. The idea behind is that instead of assigning a task to a processor, it breaks up a task into discrete compute units that can be assigned wherever. When doing processing in a loop, for example, if each iteration is independent, you could make each iteration a separate 'unit', like a packet of computation.

      The end result is that the system can then more efficiently dole out these 'packets' without the programmer having to know about the target machine or vice-versa. For some computation, you could use all manner of different hardware - two dual-core CPUs and your programmable GPU, for example - because again, you don't need to know what it's running on. The system routes computation packets to wherever they can go, and then receives the results.

      Instead of looking at a program as a series of discrete threads, each representing a concurrent task, it breaks up a program's computation into discrete chunks, and manages them accordingly. Some might have a higher priority and thus get processed first (think QoS in networking), without having to prioritize or deprioritize an entire process. If a specific packet needs to wait on I/O, then it can be put on hold until the I/O is done, and the CPU can be put back to work on another packet in the meantime.

      What you get in the end is a far more granular, more practical way of thinking about computation that would scale far better as the number of processing units and tasks increases.
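
      A minimal sketch of the "loop iterations as independent units" idea, using Python's standard `concurrent.futures` pool as a stand-in. This is not Apple's Grand Central API, and the `process_item` and `run_as_units` names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def process_item(x):
    """One independent loop iteration, treated as a self-contained unit of work."""
    return x * x


def run_as_units(items, max_workers=4):
    # The pool, not the caller, decides which worker runs each unit.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() hands out the units and returns results in input order.
        return list(pool.map(process_item, items))


print(run_as_units(range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

      The caller never says which core runs what. Note that in CPython, threads will not speed up CPU-bound work like this because of the interpreter lock; the sketch only shows the shape of the programming model.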

    8. Re:Adapt by Dolda2000 · · Score: 4, Interesting

      As I mentioned briefly in my post, there was research into dataflow architectures in the 70's and 80's, and it turned out to be exceedingly difficult to do such things efficiently in hardware. It may very well be that they still are the final solution, but until such time as they become viable, I think doing the same thing in the compiler, as I proposed, is more than enough. That's still the computer doing it for you.

    9. Re:Adapt by Cassini2 · · Score: 5, Informative

      HP's/Intel's EPIC idea (which is now Itanium) wasn't stupid, but it has a hard limitation on how far it scales (currently four instructions simultaneously). I don't have a final solution quite yet (though I am working on it as a thought project), but the problem we need to solve is getting a new instruction set which is inherently capable of parallel operation, not on adding more cores and pushing the responsibility onto the programmers for multi-threading their programs.

      The problem with very long instruction word (VLIW) architectures like the EPIC and the Itanium, is that the main speed limitations in today's computers are bandwidth and latency. Memory bandwidth and latency can be the dominant performance driver in a modern processor. At a system level, network, I/O (particularly for the video), and hard drive bandwidth and latency can dramatically affect system performance.

      With a VLIW processor, you are taking many small instruction words, and gathering them together into a smaller number of much larger instruction words. This never pays off. Essentially, it is impossible to always use all of the larger instruction words. Even with a normal super-scalar processor, it is almost impossible to get every functional unit on the chip to do something simultaneously. The same problem applies with VLIW processors. Most of the time, a program is only exercising a specific area of the chip. With VLIW, this means that many bits in the instruction word will go unused much of the time.

      In and of itself, wasting bits in an instruction word isn't a big deal. Modern processors can move large amounts of memory simultaneously, and it is handy to be able to link different sections of the instruction word to independent functional blocks inside the processor. The problem is the longer instruction words use memory bandwidth every time they are read. Worse, the longer instruction words take up more space in the processor's cache memory. This either requires a larger cache, increasing the processor cost, or it increases latency, as it translates into fewer cache hits. It is no accident the Itanium is both expensive and has an unusually large on-chip cache.

      The other major downfall of the VLIW architecture is that it cannot emulate a short instruction word processor quickly. This is a problem both for interpreters and for 80x86 emulation. Interpreters are a very popular application paradigm. Many applications contain them. Certain languages, like .NET and Java, use pseudo-interpreters/compilers. 80x86 emulation is a big deal, as the majority of the worlds software is written for an 80x86 platform, which features a complex variable length instruction word. The long VLIW instructions are unable to decode either the short 80x86 instructions, or the Java JIT instruction set, quickly. Realistically, a VLIW instruction processor will be no quicker, on a per instruction basis, than an 80x86 processor, despite the fact the VLIW architecture is designed to execute 4 instructions simultaneously.

      The memory bandwidth problem, and the fact that VLIW processors don't lend themselves to interpreters, really slows down the usefulness of the platform.

    10. Re:Adapt by Anonymous Coward · · Score: 0

      HP's/Intel's EPIC idea (which is now Itanium) wasn't stupid, but it has a hard limitation on how far it scales (currently four instructions simultaneously).

      You might want to clarify this bit a bit...

    11. Re:Adapt by DJRumpy · · Score: 1

      Someone help me out here, but wouldn't even single threaded apps benefit if the OS was able to properly load balance CPU load?

    12. Re:Adapt by init100 · · Score: 3, Insightful

      To begin with, I don't believe the article about the systems being badly prepared. I can't speak for Windows, but I know for sure that Linux is capable of far heavier SMP operation than 4 CPUs.

      My take on the article is that it is referring to applications provided with or at least available for the systems in question, and not actually the systems themselves. In other words, it takes the user view, where the operating system is so much more than just the kernel and the other core subsystems.

      But more importantly, many programming tasks simply aren't meaningful to break up into such units of granularity as OS-level threads.

      Actually, in Linux (and likely other *nix systems), with command lines involving multiple pipelined commands, the commands are executed in parallel, and are thus being scheduled on different processors/cores if available. This is a simple way of using the multiple cores available on concurrent systems, and thus, advanced programming is not always necessary to take advantage of the power of multicore chips.
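
      A small sketch of the same idea as a shell pipeline, written in Python so it does not depend on particular Unix tools being installed. Two child processes are connected by a pipe and run concurrently, so the OS is free to schedule them on different cores. The producer and consumer one-liners are made up for the example.

```python
import subprocess
import sys

# Stage 1 prints numbers; stage 2 reads them from stdin and prints their sum.
PRODUCER = "for i in range(5): print(i)"
CONSUMER = "import sys; print(sum(int(line) for line in sys.stdin))"


def run_pipeline():
    """Equivalent in spirit to `producer | consumer` on a command line."""
    producer = subprocess.Popen(
        [sys.executable, "-c", PRODUCER], stdout=subprocess.PIPE
    )
    consumer = subprocess.Popen(
        [sys.executable, "-c", CONSUMER],
        stdin=producer.stdout,
        stdout=subprocess.PIPE,
        text=True,
    )
    # Close the parent's copy of the read end so the producer gets a
    # broken-pipe signal if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output.strip()


print(run_pipeline())  # 10
```

      Neither stage contains any threading code; the parallelism comes entirely from running them as separate processes.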

    13. Re:Adapt by camperslo · · Score: 4, Funny

      The programmers of Slashdot are ready for multiple cores and threads. There is no problem.

      When performing a number of operations in parallel the key is to simply ignore the results of each operation.
      For operations that would have used the result of another as input simply use what you think the result might be or what you wish it was.

      The programmers of Slashdot already have the needed skills for such programming as the mental processes are the same ones that enable discussion of TFAs without reading them.

    14. Re:Adapt by Trepidity · · Score: 2, Interesting

      The problem is still the efficiency, though. There are lots of ways to mark units of computation as "this could be done separately, but depends on Y"--- OpenMP provides a bunch of them, for example, and there's been proposals dating back to the 80s, probably earlier. The problem is figuring out how to implement that efficiently, though, so that the synchronization overhead doesn't dominate the parallelization gains. Does the system spawn new threads? Maintain a pool of worker threads and feed thunks to them? Some hybrid approach? How does it determine when it's worth the effort of doing anything for a particular bit of computation versus just doing it inline and saving the overhead? Etc.

      Basically Grand Central is yet another in the decades-long line of proposals for specifying parallelizable computations. What's still an open question is whether they've solved the harder part, a way to, as you say, "[route] computation packets to wherever they can go, and then [receive] the results", without that routing and receiving taking inordinate overhead.
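
      One of the simpler answers to the "is it worth the effort" question is a size cutoff: do small batches inline and only hand larger ones to workers. A minimal sketch follows, where the cutoff value and the `map_with_cutoff` name are assumptions for illustration, not anything taken from OpenMP or Grand Central.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cutoff; a real system would have to measure or estimate this.
INLINE_CUTOFF = 64


def map_with_cutoff(func, items, cutoff=INLINE_CUTOFF, max_workers=4):
    """Run small batches inline and hand only larger ones to worker threads."""
    items = list(items)
    if len(items) < cutoff:
        # Too little work to justify waking workers and collecting results.
        return [func(x) for x in items]
    # A long-lived pool would be more realistic; one is created per call
    # here only to keep the sketch short.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


print(map_with_cutoff(abs, [-1, -2, -3]))            # [1, 2, 3], run inline
print(len(map_with_cutoff(abs, range(-500, 500))))   # 1000, run on the pool
```

      A fixed item count is a crude proxy for cost, which is exactly the open problem: the right cutoff depends on how expensive each item is and how expensive the dispatch is on that machine.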

    15. Re:Adapt by oftenwrongsoong · · Score: 1

      Of course, with the way multicore architecture has come to the forefront, I kind of wish Be OS had survived since it was designed to be multicore from day one. I have a feeling its pervasively multithreaded nature would kick Apple and Microsoft's ass on modern hardware.

      I feel the same way. I loved BeOS back in its heyday. Maybe you should check out Haiku. It is supposed to be an open-source re-implementation of BeOS in such a way that provides source and binary compatibility with the last commercial version of BeOS, and then to proceed from there with new research. It finally has GCC 4, as of January, which means that it's not stuck in the "classic" 2.95 days of GCC anymore. This will help speed along development considerably. I hope to see a great comeback!

    16. Re:Adapt by Dolda2000 · · Score: 3, Interesting

      All that which you say is certainly true, but I would still argue that EPIC's greatest problem is its hard parallelism limit. True, it's not as hard as I tried to make it out, since an EPIC instruction bundle has its non-dependence flag, but you cannot, for instance, make an EPIC CPU break off and execute two sub-routines in parallel. Its parallelism lies only in very small spatial window of instructions.

      What I'd like to see is, rather, that the CPU can implement a kind of "micro-thread" function, that would allow two larger codepaths simultaneously -- larger than what EPIC could handle, but quite possibly still far smaller than what would be efficient to distribute on OS-level threads, with all the synchronization and scheduler overhead that would mean.

    17. Re:Adapt by Anonymous Coward · · Score: 2, Insightful

      This is also known as Processor Affinity outside of the Apple box

    18. Re:Adapt by Anonymous Coward · · Score: 5, Insightful

      You're thinking too simply. A single-core system at 5GHz would be less-responsive for most users than a dual-core 2GHz. Here's why:

      While you're playing a game more programs are running in the background - anti-virus, defrag, email, google desktop, etc. Also, any proper, modern game splits its tasks, e.g. game AI, physics, etc.

      So dual-core is definitely a huge step up from single. So, no, users don't want single-core, they want a faster more responsive pc, which NOW is dual-core. In a few years it will be quad core. Most now hardly benefit from quad core.

    19. Re:Adapt by SeekerDarksteel · · Score: 1

      There's actually a large number of task based programming models out there: Intel's TBB, Cilk, and recent versions of OpenMP, for example. A CUDA block is effectively a task (albeit one further composed of multiple threads). There are even proposals for hardware support for task queues. The problem, however, is the chicken-and-the-egg problem. We need better tools to encourage task queue parallelism, but few people want to develop those tools because there isn't a lot of support for them at the moment.

      --
      The laws of probability forbid it!
    20. Re:Adapt by drsmithy · · Score: 1

      With the LLVM Compiler and GrandCentral [appleinsider.com], Apple has been working for years now on a way to better take advantage of machines with many cores. Once again, they are making a leap that Microsoft will not be able to match for many years.

      Say what ? Microsoft built an OS from the ground up for SMP systems more than 15 years ago. It's called Windows NT.

      Apple is the *last* vendor to be getting into the SMP groove. Windows, Linux, Solaris, FreeBSD, and pretty much everyone else would like to welcome them to the party, but the party finished a few days ago. Heck, the biggest machine OS X runs on at all only has 8 processors - you could get off the shelf hardware to do the same with Windows and Linux nearly a decade ago.

    21. Re:Adapt by dollargonzo · · Score: 1

      I think your view of functional languages is pretty backwards. Part of the reason C/C++ are hard to parallelize is because the data flow is so complex. Many functional languages have a) extremely simple data flow, since it's relatively easy to program without side-effects and b) allow you to use the language's higher level optimization features to help with speed, e.g. programming using tail recursion that is easily optimized into a tight loop. Ultimately, multi-core optimization requires good parallelization capabilities, and none of what you are saying really helps with that. Erlang, as mentioned elsewhere, is a great example of a high level functional language which parallelizes much better than C/C++, even when using all the features you are talking about.

      --
      BSD is for people who love UNIX. Linux is for those who hate Microsoft.
    22. Re:Adapt by try_anything · · Score: 4, Insightful

      But most computing in the world is done using single-threaded processes which start somewhere and go ahead step by step, without much gain from multiple cores.

      Yeah, I agree. There are a few rare types of software that are naturally parallel or deal with concurrency out of necessity, such as GUI applications, server applications, data-crunching jobs, and device drivers, but basically every other kind of software is naturally single-threaded.

      Wait....

      Sarcasm aside, few computations are naturally parallelizable, but desktop and server applications carry out many computations that can be run concurrently. For a long time it was normal (and usually harmless) to serialize them, but these days it's a waste of hardware. In a complex GUI application, for example, it's probably fine to use single-threaded serial algorithms to sort tables, load graphics, parse data, and check for updates, but you had better make sure those jobs can run in parallel, or the user will be twiddling his thumbs waiting for a table to be sorted while his quad-core CPU is "pegged" at 25% crunching on a different dataset. Or worse: he sits waiting for a table to be sorted while his CPU is at 0% because the application is trying to download data from a server.
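
      A minimal sketch of that last case, assuming a made-up `fetch_updates` that stands in for a slow server request (the sleep simulates network latency). The sort itself stays a plain serial algorithm; it just no longer waits behind the download.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_updates():
    """Stand-in for a slow server request; sleep simulates network latency."""
    time.sleep(0.2)
    return "updates fetched"


def sort_table(rows):
    return sorted(rows)


def refresh(rows):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_updates)  # the wait happens off to the side
        table = sort_table(rows)              # meanwhile the sort proceeds
        return table, pending.result()


print(refresh([3, 1, 2]))  # ([1, 2, 3], 'updates fetched')
```

      Nothing here is a parallel algorithm. Two ordinary serial jobs are simply allowed to overlap.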

      Your example of building construction is actually a good example in favor of concurrency. Construction is like a complex computation made of smaller computations that have complicated interdependencies. A bunch of different teams (like cores) work on the building at the same time. While one set of workers is assembling steel into the frame, another set of workers is delivering more steel for them to use. Can you imagine how long it would take if these tasks weren't concurrent? Of course, you have to be very careful in coordinating them. You can't have the construction site filled up with raw materials that you don't need yet, and you don't want the delivery drivers sitting idle while the construction workers are waiting for girders. I'm sure the complete problem is complex beyond my imagination. By what point during construction do you need your gas, electric, and sewage permits? Will it cause a logistical clusterfuck (contention) if there are plumbers and electricians working on the same floor at the same time? And so on ad infinitum. Yet the complexity and inevitable waste (people showing up for work that can't be done yet, for example) are well worth having a building up in months instead of years.

    23. Re:Adapt by SpuriousLogic · · Score: 1

      I agree. Apple is very good at innovative consumer products, but it is not an R&D superstar. They are very much more of a consumer electronics company than a tech innovator. You need to look to IBM, Sun, Microsoft and to a lesser extent HP for true tech breakthroughs. Apple is just very good at assembling parts into consumer items and marketing them and is nowhere near the ballpark of the other players (if you even want to consider Apple a player at all).

    24. Re:Adapt by David+Gerard · · Score: 4, Funny

      Three cores to run GNOME, one core to run Firefox.

      --
      http://rocknerd.co.uk
    25. Re:Adapt by Delwin · · Score: 1

      Just like CPU manufacturers have topped out where they can push the clock speed to (for now) there is likewise an upper limit on how many cores are actually useful. Two is much better than one, but 1024 doesn't get you anything more than 512 did.
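
      One common way to model this limit is Amdahl's law, which the comment above does not name but which matches its shape. A minimal sketch, assuming a workload where 95% of the work parallelizes perfectly (that figure is picked purely for illustration):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup when only part of the work can be spread over cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumed workload for illustration: 95% of the work parallelizes perfectly.
for cores in (1, 2, 512, 1024):
    print(cores, round(amdahl_speedup(0.95, cores), 2))
```

      Under that assumption, going from one core to two nearly doubles the speed, while 512 and 1024 cores both land just under the 20x ceiling set by the 5% serial part.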

    26. Re:Adapt by jonbryce · · Score: 1

      I think the complaint is about the compilers, not the operating systems.

      Linux may well be very capable of powering vast beowulf clusters of multi-core machines, but how good is gcc at compiling software for them?

    27. Re:Adapt by Delwin · · Score: 1

      That sounds a whole lot like cell.

    28. Re:Adapt by Anonymous Coward · · Score: 0

      CPU manufacturers have been finding it harder to scale the clock frequencies of CPUs higher, and therefore they start adding more functional units to CPUs to do more work per cycle instead
      Bzzt. Wrong. It's nothing to do with more work per cycle, it's transistor count. They are unable to increase the clock because the ramp becomes vague the faster the clock, so the GHz game is over for the time being. So they increase cores because it's pretty simple to do. The fact that most programs do not need, or lend themselves to parallel work means we're not getting faster machines. Our main bottleneck is I/O. Until harddrives cease being pathetic slugs, and I include top end SSDs, our machines are going nowhere fast.

    29. Re:Adapt by nmb3000 · · Score: 5, Funny

      To dumb your message down, CPU manufacturers act like book publishers [...]

      What is this "books" crap? Pft, I remember when car analogies were good enough for everyone. Now you have to get all fancy. Let me try and explain it more clearly:

      CPUs are like cars. Intel and Friends haven't been able to keep increasing the velocity they can safely and reliably run, so instead of relying on increased speed to get more people from point A to point B, they are instead starting to look at parallelization as a means to achieve better performance.

      Now you are chopped up into 10 pieces and FedEx'd to your destination with 100 other people. Pieces may go by road, rail, air, or ship and thus overall capacity--"bandwidth" you might say--of the lanes of travel has been increased.

      The only problem is that the people who make use of this new technique ("programmers", that is) have a hard time chopping you up in such a way that you can be put back together again. Usually it's a bit of a mess and more trouble than it's worth, thus we just keep driving our old-fashioned cars at normal speeds while adding lanes to the roads.

      --
      "What do you despise? By this are you truly known." --Princess Irulan, Manual of Muad'Dib
      /)
    30. Re:Adapt by Anonymous Coward · · Score: 0

      Best dumbed-down analogy I've heard for multicore scaling problems yet. Bravo.

    31. Re:Adapt by Dolda2000 · · Score: 1, Insightful

      Until harddrives cease being pathetic slugs, and I include top end SSDs, our machines are going nowhere fast.

      You must be using Vista, if you think that modern computers are slow. :)

    32. Re:Adapt by jcaplan · · Score: 1

      Yes, but it depends on your workload. If your workload can be split among multiple copies of your application, such as apache launching several processes to serve web page requests, then your application does not need to be multi-threaded to benefit from multi-core hardware. The key here is that each http request is independent and can be handled by a different process. If your workload depends on a single application, which is not multi-threaded, then the only benefit you get from multi-core hardware is that other applications may get shifted to other available cores.

    33. Re:Adapt by Joce640k · · Score: 1

      At the lowest level, using spinlocks instead of mutexes means threads can come back to life faster after being stalled. Whether this would make a noticeable difference or not is debatable.

      And yes, the returns diminish very quickly. More than four cores is very unlikely to make much difference to an operating system and office/productivity apps. Very few tasks are generally scalable.

      --
      No sig today...
    34. Re:Adapt by Anonymous Coward · · Score: 0

      As for the car analogy:

      Two cars only help you get x people faster from one place to another, if you have more people to transport than fit in one car.

      (I challenge anyone to dumb that down.)

    35. Re:Adapt by MCSEBear · · Score: 1

      All modern OS's support multiple cores. Unfortunately, not all application programmers are smart enough to be able to write code to take advantage of this. Having the OS and its compilers change single threaded code into something that can take advantage of multiple cores *for you* is what Apple is working on.

    36. Re:Adapt by Tiger4 · · Score: 1

      Finally, training from work actually means something in the real world.

      Take a look at the concepts of Theory of Constraints (TOC) specifically the areas of Manufacturing and Program Management. They deal with the problems of Bottlenecks in production. How to optimize a process flow to maximize throughput when one or more resources are both critical and limited. There is a concept in there called Critical Chain. It looks a lot like Critical Path, except it tries to account for the load on the critical resource, not necessarily just time.

      In the case of CPUs and compilers, just as in manufacturing process flows and projects, first you decide what all the tasks and sequence links need to be. Then you work out what resources are actually available to you. If you only have one of everything, you space out the tasks so that one worker/processor can move from task to task easily and efficiently. If you have more than one worker/processor, you can arrange the tasks to be more parallel and everything gets accomplished in shorter time.

      The key lesson is that the basic tasks and links do not need to change. You only change your paradigm of how you assign processing power to each of them as they need to be done. Just like a work manager in a shop might assign one, two, ten, or more mechanics to a job based on priority, the OS in a computer would need a fine-grained understanding of each work unit in every job in execution, as well as control of every processing unit, and a way to switch them in and out. You don't assign more than would be useful, but you do assign enough to get the job done quickly AND to keep available processing power from going idle.

      That would be a big change, but the efficiency and total throughput gained would be huge.

      --
      Behold, this dreamer cometh. Come now, and let us slay him... and we shall see what will become of his dreams.
    37. Re:Adapt by Godji · · Score: 1

      Just to avoid a few hundred whooshes, this here analogy is a joke, and it's grossly inaccurate at that. But it's funny! :)

    38. Re:Adapt by TheRaven64 · · Score: 5, Informative

      This is simply not true. Assuming both cores are fully loaded, which is the best possible case for dual core, then they will still be performing context switches at the same rate as a single chip if you are running more than one process per core. Even if you had the perfect theoretical case for two cores, where you have two independent processes and never context switch, you could run them much faster on the single-core machine. A single-core 5GHz CPU would have to waste 20% of its time on context switching to be slower than a dual-core 2GHz CPU, while a real CPU will spend less than 1% (and even on the dual-core CPU, most of the time your kernel will be preempting the process every 10ms, checking if anything else needs to run, and then scheduling it again, so you don't save much).
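
      The 20% figure falls straight out of the clock speeds: two 2GHz cores deliver at best 4GHz worth of work, so a 5GHz core only loses once more than 1 - 4/5 of its time is wasted. A quick sketch of that arithmetic, assuming perfect scaling across cores and equal work per clock (the class and method names are made up):

```java
// Illustrative only: assumes work scales linearly with clock speed and core count.
final class BreakEven {
    // Fraction of its time a single fast core must waste before it falls
    // behind `cores` slower cores that are kept perfectly busy.
    static double breakEvenOverhead(double singleGHz, int cores, double multiGHz) {
        return 1.0 - (cores * multiGHz) / singleGHz;
    }

    public static void main(String[] args) {
        // The example from the comment: one 5GHz core versus two 2GHz cores.
        System.out.println(breakEvenOverhead(5.0, 2, 2.0));
    }
}
```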

      The only way the dual core processor would be faster in your example would be if it had more cache than the 5GHz CPU and the working set for your programs fitted into the cache on the dual-core 2GHz chip but not on the 5GHz one, but that's completely independent of the number of cores.

      --
      I am TheRaven on Soylent News
    39. Re:Adapt by Anonymous Coward · · Score: 0

      CPU manufacturers have been finding it harder to scale the clock frequencies of CPUs higher, and therefore they start adding more functional units to CPUs to do more work per cycle instead
      Bzzt. Wrong. It's nothing to do with more work per cycle, it's transistor count.

      You're awfully rude for not making any sense. If you agree clock frequencies are not being increased, what do you think the point of having more transistors is, if not doing more work per cycle? Do you think before you speak?

    40. Re:Adapt by TheRaven64 · · Score: 1

      Actually, the problem is that very few workloads need more than a single 1GHz core. Most of those that do are run on dedicated silicon, and the few remaining are generally quite well parallelised.

      It is very hard to write parallel code that is faster than the serial equivalent on a single core, and if your code already runs fast enough on a single core then there is no point making it run in parallel. Unless your code is CPU-bound, making it concurrent is a waste of effort.

      --
      I am TheRaven on Soylent News
    41. Re:Adapt by drsmithy · · Score: 1

      Having the OS and its compilers change single threaded code into something that can take advantage of multiple cores *for you* is what Apple is working on.

      Right. And each copy of Snow Leopard is going to come with a free unicorn.

    42. Re:Adapt by TheRaven64 · · Score: 2, Informative

      Erlang, as mentioned elsewhere, is a great example of a high level functional language which parallelizes much better than C/C++,

      No it isn't. Erlang gains absolutely no benefit in terms of parallelism from being a functional language. All of the concurrency of Erlang comes from the CSP model, while functional languages get theirs via an extension to the lambda calculus.

      The one relevant feature of Erlang when talking about functional languages is that it does not allow mutable data other than the process dictionary. If you want to write parallel code in any language, there is one golden rule you should follow:

      No data shall be both mutable and aliased.

      In Erlang, this is enforced for you; the only mutable data structure is the process dictionary. In functional languages, this is typically handled via something like monads. There is nothing stopping you from enforcing this constraint in an imperative language, however, and if you follow this simple rule then concurrent programming is easy.
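
      A rough sketch of what following that rule by hand can look like in an imperative language, here Java. Everything in it (the Tally class, the Add message, the STOP sentinel) is invented for illustration. The messages are immutable, so they may be shared freely; the map is mutable, so exactly one thread touches it until that thread has finished:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical example: word counting with one owner thread for the mutable state.
final class Tally {
    // Immutable message: safe to alias across threads.
    static final class Add {
        final String key;
        final int amount;

        Add(String key, int amount) {
            this.key = key;
            this.amount = amount;
        }
    }

    private static final Add STOP = new Add("", 0); // sentinel that ends the owner loop

    static Map<String, Integer> count(String[][] batches) {
        BlockingQueue<Add> inbox = new LinkedBlockingQueue<>();
        Map<String, Integer> totals = new HashMap<>(); // mutable: only the owner thread writes it

        Thread owner = new Thread(() -> {
            try {
                for (Add m = inbox.take(); m != STOP; m = inbox.take()) {
                    totals.merge(m.key, m.amount, Integer::sum);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        owner.start();

        // Producers never see the map; they only send immutable messages.
        Thread[] producers = new Thread[batches.length];
        for (int i = 0; i < batches.length; i++) {
            String[] batch = batches[i];
            producers[i] = new Thread(() -> {
                for (String word : batch) {
                    inbox.add(new Add(word, 1));
                }
            });
            producers[i].start();
        }

        try {
            for (Thread p : producers) {
                p.join();
            }
            inbox.add(STOP);
            owner.join(); // after this, the owner is done and the map is safe to read
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return totals;
    }
}
```

      Note that nothing here needs the synchronized keyword; the queue is the only point of contact between threads.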

      --
      I am TheRaven on Soylent News
    43. Re:Adapt by gbjbaanb · · Score: 2, Insightful

      Yeah, I reckon you've got the reason things are "single-threaded" by design. So the solution is to start getting creative with sections of programs and not the whole.

      For example, if you're using OpenMP to introduce parallelisation, you can easily make loops run in multi-core mode, and you'll get compiler errors if you try to parallelise loops that can't be broken down like that.

      Like your building analogy - sure, you have to finish one floor before you can put the next one on, but once the floors are up, you can plumb each room up concurrently. You have to then wait until the plumbing and wiring is done before you can start plastering, and then you have to wait for that to dry before you can decorate - but you can then decorate each room concurrently.
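
      In code, that building schedule is a series of phases with a barrier between them. A minimal sketch in plain Java rather than OpenMP, with invented names throughout (BuildingSite, phase, build): within a phase every room is handled concurrently, and invokeAll() does not return until the whole phase is finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of phased parallelism.
final class BuildingSite {
    // Runs one trade in every room concurrently; returns only when all rooms are done.
    static List<String> phase(ExecutorService pool, String trade, List<String> rooms)
            throws Exception {
        List<Callable<String>> jobs = new ArrayList<>();
        for (String room : rooms) {
            jobs.add(() -> room + ":" + trade);
        }
        List<String> done = new ArrayList<>();
        for (Future<String> f : pool.invokeAll(jobs)) { // blocks until every job finishes
            done.add(f.get());
        }
        return done;
    }

    static List<String> build(List<String> rooms) {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<String> log = new ArrayList<>();
            // The phases themselves stay strictly sequential.
            for (String trade : new String[] {"plumb", "plaster", "decorate"}) {
                log.addAll(phase(pool, trade, rooms));
            }
            return log;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```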

      Stuff like that will allow you to easily set some parts running concurrently, and I reckon that's as good as we're going to get unless we start thinking in full-on functional-style programming designs. (see the wikipedia entry for a good example) But I don't hold out hope for that anytime soon, it's still hard to get right if the task is not simple.

      Besides, who really needs 8 cores anyway - unless there are specialist tasks (and I can think of only a few) the biggest problems we have are memory and IO bandwidth, not CPU performance.

    44. Re:Adapt by maxume · · Score: 1

      So the chip companies are generally going to end up spending process improvements by making chips cheaper, rather than more complex?

      Sounds good to me. As it is, I probably spend more time waiting for Firefox to execute some javascript than anything else, and that is something that is relatively straightforward for the developers to deal with.

      --
      Nerd rage is the funniest rage.
    45. Re:Adapt by gbjbaanb · · Score: 1

      I'd say there are such things already, though they do depend on the programmer. I don't think the compiler can parallelise any but the simplest tasks (though to be fair, neither can most programmers).

      OpenMP, Intel's TBB, etc. all try to make parallelising sections of your program easier, so easy that you might actually succeed in making a correct, concurrent application that doesn't have impossible-to-fix bugs.

    46. Re:Adapt by MillionthMonkey · · Score: 1

      Many programs would benefit from being able to run just some small operations (like iterations of a loop) in parallel, but just the synchronization work required to wake up even a thread from a pool to do such a thing would greatly exceed the benefit of it.

      It's not that hard. In Java for example the VM will map threads to processor cores and boilerplate code to handle this is simple. Say you have a slow method

      List<PainfulOutput> doWork(List<PainfulInput> inputs) throws Exception {...}

      where there's a loop and input i corresponds to output i. You can bust something like this up pretty easily. First move the loop body to a new lower level method that handles objects individually:

      PainfulOutput doBitOfWork(PainfulInput input) {...}.

      Back within doWork(), create an ExecutorService (called threadPool below) with Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors()), or get one from a cache somewhere.

      Define a little inner class to represent a unit of work:

      final class CoreJob implements Callable<PainfulOutput> {
          PainfulInput myInput;
          CoreJob(PainfulInput myin) {myInput=myin;}
          public PainfulOutput call() throws Exception {
              return doBitOfWork(myInput);
          }
      }

      Submit instances of it to the ExecutorService:

      List<Future<PainfulOutput>> futures = new ArrayList<Future<PainfulOutput>>();
      for (PainfulInput in : inputs)
          futures.add(threadPool.submit(new CoreJob(in)));

      Then collect the results, tearing the thread pool down afterward if necessary before returning:

      List<PainfulOutput> goodies = new ArrayList<PainfulOutput>();
      for (Future<PainfulOutput> future : futures)
          goodies.add(future.get());
      threadPool.shutdown();
      return goodies;

      Proper exception handling can be added but it wouldn't have gotten past the crap filter. Note no synchronized keywords either- the main thread will keep getting blocked in that last collection loop until all the worker threads finish.

    47. Re:Adapt by AmiMoJo · · Score: 3, Interesting

      So, we can broadly say that there are three areas where we can parallelise.

      First you have the document level. Google Chrome is a good example of this - first we had the concept of multiple documents open in the same program, now we have the concept of a separate thread for each "document" (or tab in this case). Games are also moving ahead in this area, using separate threads for graphics, AI, sound, physics and so on.

      Then you have the OS level. Say the user clicks to sort a table of data into a new order, the OS can take care of that. It's a standard part of the GUI system, and can be set off as a separate thread. Of course, some intelligence is required here as it's only worth spawning another thread if the sort is going to take some appreciable amount of time.
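
      The "only worth it above a certain size" check can be as small as this. It is a sketch, not how any real GUI toolkit does it: the TableSort class and the threshold value are made up, and the right cut-off would have to be measured. Arrays.parallelSort is the standard JDK call (Java 8+), and as far as I know it already applies a similar cut-off internally.

```java
import java.util.Arrays;

// Hypothetical helper a table widget might call when the user clicks a column header.
final class TableSort {
    // Assumed cut-off, purely illustrative.
    static final int PARALLEL_THRESHOLD = 100_000;

    static void sort(int[] column) {
        if (column.length < PARALLEL_THRESHOLD) {
            Arrays.sort(column);         // small input: stay on the calling thread
        } else {
            Arrays.parallelSort(column); // large input: split the work across cores
        }
    }
}
```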

      At the bottom you have the algorithm level, which is the hard one. So far this level has got a lot of attention, but the others relatively little. The first two are the low hanging fruit, which is where people should be concentrating.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    48. Re:Adapt by DJRumpy · · Score: 1

      But can't the OS micromanage CPU queues? I see what you're saying about distinct URL requests and that makes perfect sense to me, but isn't an app, taken down to a command-by-command level, the same thing? Couldn't it dole out the CPU queue among any number of processors?

      I apologize if these are rather basic questions. I do basic programming but not at the level where I have to actively multithread any apps I write.

    49. Re:Adapt by TheRaven64 · · Score: 4, Insightful

      So the chip companies are generally going to end up spending process improvements by making chips cheaper, rather than more complex?

      Probably. Cheaper, and less power-hungry. For the past 50 years we've had a set of cycles where computers get dedicated hardware for some task, then the general purpose hardware gets fast enough to run it and the dedicated hardware goes away, then the cycle repeats with some other algorithm (sound, 2D video, and so on). The side-effect of this is that it also consumes a lot more power. For any algorithm, you can design dedicated hardware that executes it with less power than a general-purpose CPU. The DSP on something like an OMAP3 can decode MP3 audio in under 50mW; even something like the Atom is going to struggle to get within two orders of magnitude of this.

      This wasn't a problem for desktop PCs, because they were plugged into the mains and no one has itemised electricity bills, so no one notices the difference between a 20W and a 100W CPU. In a laptop or palmtop, the difference between 250mW (a typical ARM Cortex A8 SoC) and 20W (Atom + a cheap chipset) can be several hours of battery life. People are starting to expect 10 hours of battery life from portables, and doing this with a small battery requires a lot of dedicated silicon that can be turned off when not in use and draw small amounts of power when executing the task it was designed for.

      I expect the future of CPUs will be heterogeneous multicore. In a way, that's the present of CPUs too; you can consider the FPU and vector unit as separate, specialised, cores (although they lack separate control instructions, so it's stretching it slightly).

      --
      I am TheRaven on Soylent News
    50. Re:Adapt by beav007 · · Score: 5, Insightful

      It's posts like these that make me think that I'm the only one with 7 programs on the task bar, 12 in the system tray, assorted server processes, and 32 tabs open in Firefox (come on, 1 thread per tab!!). It doesn't much matter to me if each of these parts are not multithreaded, as long as the OS is smart enough to put active threads on different cores.

    51. Re:Adapt by Opyros · · Score: 4, Funny

      Thanks for the explanation -- for a moment, I was actually wondering what OpenOffice.org's parallelization mechanisms had to do with anything!

    52. Re:Adapt by phantomfive · · Score: 1, Insightful

      I keep seeing comments like this, but I'm not sure you've actually thought through the issues here. What sort of applications are you running that are pegging the CPU at 25%? In the application you just described (in your second multiword paragraph), running those things in parallel can actually slow things down. Why? Because by far the thing that is taking all your time is the disk access. Launch Photoshop sometime and see what it is doing on startup: 90% of it is loading palettes, etc., from the disk. A single disk access can take a million CPU cycles, which may be more than the entire rest of the startup code put together, so really it's not even worth optimizing until you deal with getting the disk faster.

      How can it actually take longer if you run them in parallel? There is only one disk, and if you are trying to load from it from two different processes, it will be forced to rapidly switch back and forth between the two. Not good. The fastest way to get info from a disk is sequentially.

      Parallelization is not the easy solution it seems on the surface.

      --
      Qxe4
    53. Re:Adapt by 99BottlesOfBeerInMyF · · Score: 1

      Having the OS and its compilers change single threaded code into something that can take advantage of multiple cores *for you* is what Apple is working on.

      Right. And each copy of Snow Leopard is going to come with a free unicorn.

      Actually, it has been working in OS X since 10.5 for OpenGL applications. It's fairly limited right now, but it does provide real performance improvements for applications that were written without any foreknowledge that Apple was going to add such a feature.

    54. Re:Adapt by bertok · · Score: 2, Insightful

      I think the consensus was that making compilers emit efficient VLIW for a typical procedural language such as C is very hard. Intel spent many millions on compiler research, and it took them years to get anywhere. I heard of 40% improvements in the first year or two, which implies that they were very far from ideal when they started.

      To achieve automatic parallelism, we need a different architecture to classic "x86 style" procedural assembly. Programming languages have to change too, the current crop are too close to the metal. I suspect that in the future, languages will rely on intermediate byte-code more, and become ever more functional as designers realize that functional code is easy to transform due to a lack of side-effects.

      I've heard of automatically parallelized versions of some pure functional languages that can execute almost any code on almost any number of CPUs without the programmer ever having to write a single synchronization instruction! For example, Microsoft is working on "parallel LINQ" in C# 4.0, which is essentially a small island of parallelizable functional code that can be embedded in a procedural language.

    55. Re:Adapt by daemonburrito · · Score: 1

      I think the complaint is about the compilers, not the operating systems.

      You could go further up the stack. Approaching it from the OS level is an attempt to "parallelize" code compiled to run in a serial manner, even if this code was written to take advantage of OS API locks/mutexes; approaching it from the compiler means finding opportunities to parallelize code written in a serial paradigm; approaching it at the application level means writing code in a different paradigm than we've all been using for half a century, but it fixes the problem.

      Aside: When talking about locks, we should probably separate data and process concerns.

      Speaking of... functional programming is finally getting some respect. Many said it was impossible to write a kernel in C; it would be neat if it were possible to implement at least some of it in something like Concurrent Haskell. This is just idle speculation, but it seems like the performance hit of using a functional language near the bottom of the stack would be worth it in the long run, if it makes it possible to use many-core systems that much more efficiently. (Btw, there are purely functional research OS's out there, and there used to be a bunch of LISP machine OS's).

    56. Re:Adapt by mgblst · · Score: 2, Insightful

      A better but less humorous analogy would be to consider that Intel and co can't keep increasing the top speed of a car, so they are putting more seats into your car. This works OK when you have lots of people to transport, but when you only have 1 or two, it doesn't make the journey any faster. The problem is, most journeys only consist of one or two people. What the article is suggesting is that we implement some sort of car-sharing initiative, we stop taking so many cars to the same destination. Or a bus!

    57. Re:Adapt by BigBuckHunter · · Score: 1

      It appears that the author has edited the article and removed the Windows/Linux bit (which the submitter quoted from the article). I agree. Parallelization of most tasks is difficult. It will remain difficult till the end of time.

      BBH

    58. Re:Adapt by mabhatter654 · · Score: 1

      The problem is that most Windows and Linux programs and APIs are not safe to be moved to another core. Ironically, this was a problem BeOS sought to handle in spades, but I haven't seen Haiku check in on this lately. The OS may be SMP capable, but the APIs programmers use don't nicely split "your" program behind the scenes, expecting the individual programmer to babysit this... and they're not doing it right now.

    59. Re:Adapt by giorgist · · Score: 2, Interesting

      You haven't seen a building go up. You don't place a brick, render it, paint it, hang a picture frame, and go to the next one.

      A multi-story building has a myriad of things happening at the same time. If only computers were as good at parallel processing. If you have 100 or 1000 people working on a building, each is an independent process that shares resources.

      It is simple: 8-core CPUs are a solution that arrived before the problem. A good 10-year-old computer can do most of today's office work.

    60. Re:Adapt by FatdogHaiku · · Score: 1

      Thanks for providing the link. When I saw "Out-of-Order" my mind went to the courts... "I object Your Honor, this instruction is Out-of-Order! I demand the CPU be treated as a hostile witness."

      --
      You have the right to remain sentient. If you give up the right to remain sentient, you will be elected to public office
    61. Re:Adapt by pezezin · · Score: 1

      I don't have a final solution quite yet (though I am working on it as a thought project), but the problem we need to solve is getting a new instruction set which is inherently capable of parallel operation, not on adding more cores and pushing the responsibility onto the programmers for multi-threading their programs.

      Like the old Cray machines, or the new NEC SX series? After having studied them, I wonder why manufacturers don't add long vector instructions to current CPUs, they are much more flexible than short vectors (SIMD).

    62. Re:Adapt by ShieldW0lf · · Score: 1

      Wonder if we'll see the "Year of the GNU Hurd"...

      --
      -1 Uncomfortable Truth
    63. Re:Adapt by MichaelSmith · · Score: 1

      Just like CPU manufacturers have topped out where they can push the clock speed to (for now) there is likewise an upper limit on how many cores are actually useful. Two are much better than one, but 1024 doesn't get you anything more than 512 did.

      What about the real time raytracing which people keep talking about for games?

    64. Re:Adapt by try_anything · · Score: 2, Interesting

      Short answer: only one thing I mentioned involved disk I/O, RAM is cheap, and application frameworks typically limit the number of jobs being run at one time.

      If there's really a performance need to serialize tasks involving disk I/O, then go ahead and serialize them. Eclipse, the application framework I'm most familiar with, makes this straightforward: just define a scheduling policy that allows only one job to run at a time and apply that policy to all your disk I/O jobs. Other jobs will continue to be scheduled and run according to the default policy or whatever other policy you specify -- might as well get some work done while you're waiting for the I/O to complete.
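
      Outside Eclipse the same idea can be sketched with plain java.util.concurrent. This is a stand-in, not the Eclipse Jobs API, and the class and method names are invented. Disk jobs go to a single-threaded executor so they run one at a time, while everything else uses a pool sized to the machine. The demo method reports the highest number of "disk" jobs ever seen running at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of one scheduling policy per kind of job.
final class IoPolicyDemo {
    static int peakDiskConcurrency(int diskJobs, int cpuJobs) {
        ExecutorService disk = Executors.newSingleThreadExecutor(); // serialised policy
        ExecutorService cpu = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());        // default policy
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (int i = 0; i < diskJobs; i++) {
                pending.add(disk.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    // ... a real job would read or write a file here ...
                    running.decrementAndGet();
                }));
            }
            for (int i = 0; i < cpuJobs; i++) {
                pending.add(cpu.submit(() -> { /* placeholder for CPU-bound work */ }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for every job, disk and CPU alike
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            disk.shutdown();
            cpu.shutdown();
        }
        return peak.get();
    }
}
```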

    65. Re:Adapt by ILuvRamen · · Score: 1

      Dude, I just wrote a multicore app a few days ago. There are some slightly inefficient but still faster workarounds for not being able to share memory/variables between threads. If you have to process one big, long thing, write some code that splits the calculation into as many equal slices as there are cores (100% divided by the number of cores each), then start them all at the same time. Have each one write its results to the hard drive, since they can't directly share variables very well, then read it all back in when they are all done and combine the results. This doesn't work for all cases of course but that's how my app works.
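
      Roughly the same scheme as a Java sketch, with one deliberate change: the partial results come back in memory through Futures instead of going via the hard drive. The class name and the sum-of-squares workload are placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration: one contiguous slice of the work per core.
final class ChunkedSum {
    // Sums i*i for i in [0, n).
    static long sumOfSquares(int n) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (int c = 0; c < cores; c++) {
                // Slice boundaries; long arithmetic avoids overflow for large n.
                int from = (int) ((long) n * c / cores);
                int to = (int) ((long) n * (c + 1) / cores);
                parts.add(pool.submit(() -> {
                    long local = 0; // private to this task, so nothing is shared
                    for (int i = from; i < to; i++) {
                        local += (long) i * i;
                    }
                    return local;
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // combine once each slice has finished
            }
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```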

      --
      Google's Super Secret Search Algorithm: SELECT @search_results FROM internet WHERE @search_results = 'good'
    66. Re:Adapt by jcaplan · · Score: 1

      No. An application has to execute commands in order. The OS cannot guess the order that you intend the instructions to be executed in or the dependencies between instructions. If the OS executed your instructions in a round-robin fashion, such that instruction 1 was executed on processor 1, instruction 2 on processor 2, ... then the OS would have to keep careful track of whether the result of instruction 1 would affect the execution of instruction 2. Similar problems would exist sending chunks of instructions to various processors. The OS would have to keep track of all of these dependencies, and slow things to a crawl.

      This kind of thing is done in hardware, however - it's called hyperthreading or "simultaneous multithreading" and is done on many modern CPUs within individual cores at the cost of many transistors devoted to doing things like tracking dependencies.

      A single-threaded application could benefit from multi-core architecture, if some of its system calls were asynchronous, so that calling print("Hello") would spawn a thread that caused "Hello" to be (eventually) displayed, and then (immediately) execute your next instruction, without waiting for the printing to complete. Not all system calls are asynchronous, and those that are need to be used thoughtfully. Consider what might happen if you executed: print("Hello, "); print("world!"); and each instruction were sent to be executed on a separate core. You could get a "race condition" where your code sometimes printed "world!Hello, ". It is the programmer's job to consider these conditions and ensure threads are properly synchronized - in this case by verifying that print("Hello, ") completes before executing the next instruction.
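
      A small Java sketch of the fix, using CompletableFuture from the standard library. The OrderedPrint class is made up, and a StringBuffer stands in for the screen. The second "print" is chained onto the first, so it cannot start until the first has completed, no matter which core runs it:

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration of ordering two asynchronous operations.
final class OrderedPrint {
    static String greet() {
        StringBuffer out = new StringBuffer(); // stand-in for the display
        CompletableFuture<Void> done = CompletableFuture
                .runAsync(() -> out.append("Hello, ")) // may run on another core
                .thenRun(() -> out.append("world!"));  // starts only after the first finishes
        done.join(); // wait for the whole chain
        return out.toString();
    }
}
```

      Launching both appends with two independent runAsync() calls instead would reintroduce the race described above.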

    67. Re:Adapt by jd · · Score: 1

      Linux has supported 16-way SMP since the Xeon first came out. Debuggers and latency-monitoring tools that support multicore have existed for Linux for a long time - Intel's VTune being a commercial example, but DAKOTA and KOJAK being open-source examples.

      For compilers based on trivial modifications for existing languages, Cilk++ is basically GCC's C++ with instruction-level parallelism. Those wanting something more sophisticated need look no further than OpenMP. If you prefer all-out hardcore parallelism, use KROC.

      The tools are all out there. The OS provides the mechanisms. In the case of Cilk and OpenMP, the code will work just fine on a regular GCC install with no parallel support and on uniprocessor systems.

      If parallel support is lacking in software, it is not lacking because of any problems in the toolchain or the kernel. It is lacking because there are too many lazy bar stewards who aim only for the lowest common denominator and ignore the needs of anything better.

      The lowest common denominator should indeed not suffer and need not suffer, if code is designed well, but the current practice causes everyone else to suffer on their behalf, which is not good practice.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    68. Re:Adapt by jd · · Score: 4, Funny

      Well, you see, once IBM buys out Sun, Solaris is going to be re-implemented as macros in OpenOffice. Or Emacs. Whichever one they decide to pick as the new OS kernel.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    69. Re:Adapt by Waffle+Iron · · Score: 3, Insightful

      It's posts like these that make me think that I'm the only one with 7 programs on the task bar, 12 in the system tray, assorted server processes, and 32 tabs open in Firefox (come on, 1 thread per tab!!).

      I'd be willing to bet a good deal of money that almost all of those tasks are currently asleep and waiting for input, a timer signal or external I/O. Such processes don't need *any* cores unless and until they wake up.

      (The big exception for most people would be having Flash ads running in those 32 Firefox tabs. The way to solve that problem without adding more cores is by installing Flashblock.)

      Right now "ps" says that my system is running 127 different processes. Current CPU utilization? 0.7%.

    70. Re:Adapt by Anonymous Coward · · Score: 0

      The addition of closures/lambdas to imperative programming languages really does help in providing simpler parallelism constructs for developers. Microsoft is working on something similar which it will release in .NET Framework 4.0 called ParallelFX which builds on the closure/LINQ functionality added in .NET Framework 3.5. By modifying a single line of code you can transform any iterative structure into a discrete task that can be scheduled across a series of queues to be processed by a pool of threads.

      Iterative:

      var items = GetItems();
      foreach(var item in items) {
              Process(item);
      }

      Parallel:

      var items = GetItems();
      Parallel.ForEach(items, item => {
              Process(item);
      });

      It also works on LINQ constructs by adding a single clause:

      Iterative:

      var query = from item in GetItems() where ComplexFunction(item) == true select item.Value;

      Parallel:

      var query = from item in GetItems().AsParallel() where ComplexFunction(item) == true select item.Value;

    71. Re:Adapt by jd · · Score: 4, Funny

      Three Cores for the Gnome kings under the Gtk,
      Seven for the KDE lords in their halls of X,
      Nine for Emacs Men doomed to spawn,

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    72. Re:Adapt by TapeCutter · · Score: 2, Informative

      "a game renders one map at a time because it's pointless to render other maps until the player made his gameplay decisions and arrived there"

      Rendering is perfect for parallel processing, sure you only want one map at a time but each core can render part of the map independently from other parts of the map.

      --
      And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
    73. Re:Adapt by DJRumpy · · Score: 1

      Thank you. A very clear understandable reply.

      Given what you've said here, what if there was some sort of API that was written to allow apps to take advantage of multiple cores so that the api managed timing for non-asynchronous apps? Dump some code for execution onto an API and let the API manage the wait states for any lagging instructions.

      Is that the gist of the article? Suggesting that the OS could take on some of those duties?

    74. Re:Adapt by erroneus · · Score: 2, Interesting

      Multi-core processing is one thing but access to multiple chunks of memory and peripherals are also keeping computers slow. After playing with running machines from PXE boot and NFS rooted machines, I was astounded at how fast those machines performed. Then I realized that the kernel and all wasn't being delayed waiting on local hardware for disk I/O.

      It seems to me, when NAS and SAN are used, things perform a bit better. I wonder what would happen if such control and I/O systems were applied into the same box? Smart RAID controllers are a step in that direction, but they are still accessed as SCSI devices. What might the results be if the secondary storage systems were in a server within the box dedicated to optimized disk I/O? The same sort of thing is being done with GPU processing, but I wonder how much more removed the graphics systems could become?

      Devices need to become smarter and faster to really make things perform at their best speed.

    75. Re:Adapt by beav007 · · Score: 1

      Indeed. If I have 4 cores, I have the ability to give 4 threads near 100% CPU time. I don't need to though, most of the time.

      At the end of the day, most programs that are CPU intensive are naturally threadable. For the rest of the programs, assigning active threads to the core with the least utilisation works fine.

      As long as the OS has a decent scheduling/balancing between cores, more cores will fix most processing resource problems.

    76. Re:Adapt by david.given · · Score: 3, Interesting

      I expect the future of CPUs will be heterogeneous multicore.

      You may be interested to know that, as far as I can tell from the rather fuzzy documentation, the MSM7201A processor used in the G1 smartphone has at least three dissimilar cores, and potentially five:

      • an ARM11 for the application stack
      • an ARM9 for the radio stack
      • a QDSP4000
      • possibly a QDSP5000, the spec is unclear as to whether you get both this and the QDSP4000
      • a PowerVR 3D accelerator unit, although the spec is again unclear as to whether this is actually in silicon and not just a particular firmware load for the DSP

      I gather that it's pretty hard to make them share address spaces, even the two ARMs; so SMP is probably not feasible. Message-passing via specific shared memory segments is the usual approach.

    77. Re:Adapt by NeoStrider_BZK · · Score: 1

      Give us a year and two EIGHT-CORE MACHINES.

      fixed that for you

    78. Re:Adapt by NSIM · · Score: 1

      I don't think the article is really talking about the OS per se, more the applications that run on it. Both Linux and Windows are pretty good in terms of SMP support in the kernel, and scale quite well. The problem is applications, many of which are not written to make best use of multiple cores (or any use at all). Then again, many of the applications we use day to day have limited scope for multi-threading because they simply don't parallelize well, and no amount of compiler trickery or fancy coding is going to help these apps.

    79. Re:Adapt by w0mprat · · Score: 1

      What about the real time raytracing which people keep talking about for games?

      Some tasks love parallelism. Personally I think the future is in FPGA chips http://en.wikipedia.org/wiki/FPGA. Where the architecture changes as needed.

      --
      After logging in slashdot still does not take you back to the page you were on. It's been that way for 20 years.
    80. Re:Adapt by toddestan · · Score: 1

      Most users probably would never notice the difference. You may have two cores, but the disk is still just as slow, you haven't upped the memory bandwidth, the network or USB bus or GPU hasn't gotten any faster. Most things people do today aren't CPU bound, and the CPU spends a lot of its time waiting around for the rest of the computer to give it something to do. If I wanted a responsive computer, I'd take the cheaper of the chips myself, and use the money I save towards a SSD.

    81. Re:Adapt by perryizgr8 · · Score: 1

      you're both saying the same thing. increased transistor count means more work per cycle.

      --
      Wealth is the gift that keeps on giving.
    82. Re:Adapt by ps2os2 · · Score: 1

      To start with, I know little about how Intel chips work (only what I have been told).
      From my vantage point there have to be built-in instructions in the chip in order to really run SAFELY (more in a sentence or two). What has been ingrained in me is that the chip MUST have the ability to serialize access to any storage. That is, if CPU A is running one thread and CPU B is running a different thread, and one (or both) need to update a storage location, they must serialize on it, or the other CPU could (and does) update the same location at the *SAME* time. The first thread (if no serialization takes place) moves a "B" to, say, location 4096 (any storage location; this is an example). Meanwhile the second thread moves a "C" to the same location (it happens *OFTEN*), so the first thread gets control, checks to see what is in location 4096, and suddenly there is a "C" there. It had moved a "B" there in the last instruction, so why isn't it "B"?
      Then the program gets confused, as what should have been there isn't. It could (depending on the architecture) sit there in either a never-ending wait or a loop. Oh yes, BTW, the CPU has to flush its instruction buffer or things could really get bollixed up.
      I know this is an extremely simplified example, but the chip has to do these things in order not to create invalid output. The more threads that run simultaneously, the more the chip has to be multi-CPU aware *AND* able to serialize updates to storage across CPUs.
      I have seen many programs that do not use the correct instructions for serializing storage, and it is a PITA to debug (almost impossible). Unless everyone agrees to use the correct serialization instructions, all bets are off, IMO.
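
      The kind of serialization described above can be sketched at the software level with a lock. This is only an illustration, not the hardware instructions themselves; `SerializedCell` and `hammer` are hypothetical names:

```python
import threading


class SerializedCell:
    """A single shared storage location whose updates are serialized by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Read-modify-write is the critical section: only one thread may be inside at a time.
        with self._lock:
            current = self.value
            self.value = current + 1


def hammer(cell, threads=4, updates=10_000):
    """Have several threads update the same cell, then return the final value."""
    def worker():
        for _ in range(updates):
            cell.increment()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cell.value
```

      With the lock in place the final value should equal threads * updates; take the lock away and updates from one thread can overwrite another's, which is the "why isn't it B?" situation.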

    83. Re:Adapt by Anonymous Coward · · Score: 0

      How many of them are actually doing anything though? I bet the vast majority are just sleeping.

    84. Re:Adapt by fractoid · · Score: 1

      Actually the analogy of a freeway isn't half bad. You have a certain number of commuters (tasks) that need to travel along the freeway (the computer system) for various amounts of time. You can only raise the speed limit so far before you need to make your cars out of exotic materials and it all gets too expensive. You can add extra lanes (processors) to speed things up. You can get the best speedups by either adding special lanes that don't allow lane changing (ie. tasks that parallelise without requiring cross-talk), or using larger vehicles such as busses or trains (SIMD instructions, but everyone has to be going at the same speed to the same place).

      --
      Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
    85. Re:Adapt by fractoid · · Score: 1

      But two cars can't get one person to his destination any faster than one car would, in much the same manner that two women cannot between them have a baby in 4.5 months.

      --
      Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
    86. Re:Adapt by shutdown+-p+now · · Score: 1, Insightful

      The remaining 4 for Flash?

    87. Re:Adapt by SL+Baur · · Score: 3, Informative

      Short answer: only one thing I mentioned involved disk I/O, RAM is cheap.

      Not in modern architectures and it depends. Registers are faster than L1 caches. L1 caches are faster than L2 caches, etc.

      See: http://lwn.net/Articles/250967/ for an excellent discussion about how one can dramatically speed up applications by optimizing memory access.

      And I disagree with the title of this thread - Linux (the kernel at least) is quite well prepared for multicore chips.

    88. Re:Adapt by shutdown+-p+now · · Score: 1

      No data shall be both mutable and aliased.

      There is nothing stopping you from enforcing this constraint in an imperative language, however, and if you follow this simple rule then concurrent programming is easy.

      Yep, but this breaks most existing conventional imperative programming approaches out there, which means it is a no-go for hordes of C++, Java and C# programmers out there. At this point, you might as well just throw Erlang or Haskell at them, because the mental paradigm shift it takes won't be much different by the time you get to "no shared mutable state".

    89. Re:Adapt by SL+Baur · · Score: 1

      That's not a bad analogy. The only part you really got wrong was at the end:

      we just keep driving our old-fashioned cars at normal speeds while adding lanes to the roads.

      That's adding bandwidth at its purest definition.

      +1 great car analogy.

    90. Re:Adapt by fractoid · · Score: 4, Insightful

      This is the sort of thing I like about Apple's 'Grand Central'.

      What's this 'grand central' thing? From a few brief Google searches it appears to be a framework for using graphics shaders to offload number crunching to the video card. It'd be nice if they'd stick (at least for technical audiences) to slightly more descriptive and less grandiose labels.

      <rant>
      That's always been my main peeve with Apple, they give opaque, grandiloquent names to standard technologies, make ridiculous performance claims, then set their foaming fanboys loose to harass those of us who just want to get the job done. Remember "AltiVEC" (which my friend swore could burn a picture of Jesus's toenails onto a piece of toast on the far side of the moon with a laser beam comprised purely of blindingly fast array calculations) which turned out to just be a slightly better MMX-like SIMD addon?

      Or the G3/G4 processors which led us to be breathlessly sprayed with superlatives for years until Apple ditched them for the next big thing - Intel processors! Us stupid, drone-like "windoze" users would never see the genius in using Intel proce... oh wait. No, no wait. We got the same "oooh the Intel Mac is 157 times faster than an Intel PC" for at least six months until 'homebrew' OSX finally proved that the hardware is exactly the friggin same now. For a while, thank God, they've been reduced to lavishing praise on the case design and elegant headphone plug placement. It looks like that's coming to an end, though.
      </rant>

      --
      Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
    91. Re:Adapt by Draek · · Score: 5, Funny

      Three Cores for the Mozilla-kings under the GUI,
      Seven for the Gnome-lords in their halls of X,
      Nine for KDE Men doomed to be flamed,
      One for the Free Scheduler on his free kernel
      In the Land of Linux where the SMP lie.
      One Core to rule them all, One Core to find them,
      One Core to bring them all and in the scheduler bind them
      In the Land of Linux where the SMP lie.

      Which is, of course, what will eventually happen if the number of cores keeps increasing: we'll need one dedicated exclusively to manage what goes where and when. Which is pretty cool when you think about it ;)

      --
      No problem is insoluble in all conceivable circumstances.
    92. Re:Adapt by Anonymous Coward · · Score: 0

      but you cannot, for instance, make an EPIC CPU break off and execute two sub-routines in parallel

      Traditionally, you would create separate threads for these subroutines, depending on the OS overhead.

      What I'd like to see is, rather, that the CPU can implement a kind of "micro-thread" function, that would allow two larger codepaths simultaneously

      Are you talking about speculative loading and execution?

    93. Re:Adapt by electrosoccertux · · Score: 1

      It's funny you say that; I did all that with my overclocked Sempron socket 754 (it had HyperTransport) single-core CPU at a measly 2.3GHz. I had 1.5GB of RAM and it was all I needed to do what you're saying and more. Torrents, lots of Firefox tabs, 8 PDFs open, instant messaging, listening to music, burning a (data) DVD, watching a movie on my second monitor, copying gobs of data to an external hard drive for backups, writing a paper in Office, reading two others, etc etc.

      All this was going on with my single-core CPU running with Cool'n'Quiet downclocked to 1GHz, loaded at 25%.

      So it was taking 250MHz to handle all that stuff at once in Windows XP. Moving to Vista, it was a lot slower, I'll give you that, but now I've got a dual-core 3.4GHz Core 2 series chip and if it weren't for gaming there would be _no point_ in upgrading. I _rarely_ burn video DVDs and when I do it only takes a minute per minute of video. Setting it up to burn+shutdown and then going to bed was never a problem in the first place, either.

      The fact of the matter is, and especially since Windows 7 is supposedly much more efficient than Vista, nobody needs a quad core CPU.

    94. Re:Adapt by KingMotley · · Score: 2, Interesting

      I guess that would be highly dependent on your particular field. First, .NET has functional languages like F# and M. I also find that you dismiss the importance of profiling code in .NET simply because it doesn't generate machine language code. I'm at a loss as to why you would think it is any less important. Determining the areas which are being stressed the hardest and deserve more of your attention is completely unrelated to whether the code generates ASM, ML, or IL.

      You say .NET has "minimal support for it", but I suspect that says more about your understanding of that support. Background Workers are the easy way for highly independent routines to execute in many common scenarios. If that isn't enough or doesn't fit your need, then ThreadPools make highly parallelizable code a snap to implement. Example:

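      ' One wait handle per queued download, so the caller can wait on them later.
      Dim eventhandles As New List(Of EventWaitHandle)
      Try
          For Each site As String In sites
              Dim ewh As New EventWaitHandle(False, EventResetMode.ManualReset)
              eventhandles.Add(ewh)
              ' ThreadData carries the handle and URL that DoDownload receives as its state argument.
              Dim param As New ThreadData(ewh, "http://" & site)
              Threading.ThreadPool.QueueUserWorkItem(AddressOf DoDownload, param)
          Next
      Catch ex As Exception
          ' Errors are swallowed here; real code should log or rethrow.
      End Try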

      Now you are free to write the "DoDownload" routine, that could download some data, validate it, and do some processing on it. With no further changes, the above code would work well on any machine with a single processor, to one with 64 or more cores (I haven't tested more than 64 cores). If more control is needed, you can set the number of worker threads that will execute concurrently based on the number of processors in the machine with a single call, or you could implement your own threadpool, or create your own implementation by overriding specific functions of it. I also left the eventhandle code in the example, although it isn't needed for the example. It does show exactly how easy it is to create and use some more advanced thread synchronization primitives in .NET.

      Lastly, you could also spawn your own threads if you need/want even more control. It's incredibly easy. Example:
      Dim t1 As New Thread(AddressOf DoDownload)
      t1.Start()

      Not hard stuff, really. Of course if you want to get into larger scale outs, then you may want to look into the Azure set of .NET features, which is supposedly specifically designed for large scalability for cloud computing (I myself have no experience in that area).

    95. Re:Adapt by drizek · · Score: 1

      Intel had the right idea with hyperthreading and the Pentium 4. In theory, it was the way of the future. One core, ridiculous clock speeds, multithreading for multitasking. The problem with the P4 is that it was a slow, expensive, inefficient piece of crap. Multicore is not the answer though. Sure, two is better than one, maybe you can make use of three, really heavy users might need 4, but multithreaded octal-core CPUs are just ridiculous. For the vast majority of people they are just a waste of money and electricity.

    96. Re:Adapt by jcaplan · · Score: 1

      You seem to want to see if the programmer can avoid the issues inherent in writing threaded code - a worthy goal, since threaded code is challenging to write correctly (or maybe just easy to get wrong). I don't think passing code to an API that would understand your code would be the way to go, though. Compilers are much better suited to this - they have access to the source code after all and are designed to optimize it to run on particular hardware. The article suggests that part of the solution may lie in better tools, such as compilers that recognize code that can be safely parallelized, compile it to run in multiple threads, and handle all of the nasty "race conditions" that are easy to create and very hard to debug. The article also warns that automated tools may not be the panacea for the multi-core programming problem that some hope. You have to be careful about your threads competing for limited resources, such as memory bandwidth, or needing to talk to each other so much that single-threaded code would have been faster.

      The article mentions OS's not being ready, but the main issue is one of applications tending to be written for single cores. Linux happily supports thousands of cores and Windows can now handle 256 cores. The OS can be optimized for multi-core machines by parallelizing certain system calls or kernel operations, or keeping related threads on cores that are near each other, but the applications that are CPU bound seem to be running application code most of the time, not kernel code, so there is a limit to how much the OS can do to help.

    97. Re:Adapt by spirit+of+reason · · Score: 1

      VLIW and out-of-order superscalars attempt to extract more performance out of the same type of parallelism, actually (instruction-level parallelism). The way I see it, they are just different approaches to the same source of speedup; VLIW just pushes a big chunk of the work from the hardware to the compiler.

    98. Re:Adapt by Anonymous Coward · · Score: 0

      "What is needed, I would argue, is a way to parallelize instructions in the instruction set itself."

      Isn't that the idea behind SIMD? Single Instruction, Multiple Data.

    99. Re:Adapt by windwalkr · · Score: 1

      It sounds like you're asking for threading (say, OpenMP style) which is supported at the hardware level rather than the OS level. The advantage of such a system would be that you could bring up additional threads with marginal start-up/shut-down/message-passing costs, allowing them to be used to accelerate small jobs for which the current OS-hosted threading models are too heavy. To accomplish this, the hardware would need to be able to bring up new threads at the request of the application programmer, with no per-thread OS interaction, and with little to no latency (the cpu effectively forks the current thread without missing a beat.)

    100. Re:Adapt by Anonymous Coward · · Score: 0

      The real reason why we have octuple-core cpu's is to give self-proclaimed Einsteins something to talk about.

    101. Re:Adapt by bytesex · · Score: 1

      Do what mainframes did instead: focus on bigger pipes (bus, IO).

      --
      Religion is what happens when nature strikes and groupthink goes wrong.
    102. Re:Adapt by Anonymous Coward · · Score: 0

      Three cores to run GNOME, one core to run Firefox.

      ... One core to bring them all,

      and in the darkness, give Windows Genuine Advantage....

    103. Re:Adapt by Anonymous Coward · · Score: 0

      Programming tools need to give architects an easy way to create process workflows.
      Once you have a complete solution, the tools should offer a way to isolate groups of code which can, at certain times, be executed in parallel along with all their dependencies. It is similar to the example you gave: all the people on a construction site know when to do what, but only because someone already gave them that order as part of a greater workflow document.
      Hardware should solve this issue in the same way, and shouldn't bother software with any low-level parallel execution and dependencies.

    104. Re:Adapt by bitrex · · Score: 1

      One neat possibility of an FPGA or other "programmable hardware" I was thinking about the other day was an "analog physics processor", that is, some kind of add-on device where, when physics equations need to be solved for a simulation, the FPGA reconfigures itself into the proper analog computer system (integrators, summers, differentiators) for the particular problem, then applies the requisite (scaled) initial conditions. Since you are using analog methods and not a digital mathematical representation of the physical system, many solutions that would require computationally intensive numerical methods in a totally digital system can be solved exactly and in real time; this is as long as you're not necessarily looking for a general solution to the problem but only the solution for some certain initial conditions, which in a simulation is usually the case.

    105. Re:Adapt by bh_doc · · Score: 1

      Why would disk IO be pegging the CPU at full core capacity? What, is it in PIO mode or something?

    106. Re:Adapt by Anonymous Coward · · Score: 0

      The synchronization overhead is only as large as you make it. The best parallel code doesn't need to synch very often. If you have one thread running on each processor, and they only rarely need to communicate with each other, then you can achieve very nearly linear speedup from multiple processors.

      The main question is whether or not you can parallelize your problem, not whether the O/S supports a particular type of thread.

      If you want to calculate the points of a line, you could figure out the number of points in the line, and use 2 threads to calculate the line, one at each end, each working toward the middle. That's not bad, and you'll likely get about a 95% speedup over 1 thread, if you do it correctly.
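
      A rough sketch of the two-ended approach, assuming evenly spaced points along the line (`line_points` is a hypothetical helper, and in CPython the threads show the structure rather than a real speedup). Each thread writes to its own half of a preallocated list, so no synchronization is needed until the final join:

```python
import threading


def line_points(x0, y0, x1, y1, n):
    """Return n evenly spaced points from (x0, y0) to (x1, y1), computed by two threads."""
    points = [None] * n

    def point(i):
        t = i / (n - 1) if n > 1 else 0.0
        return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))

    def fill(indices):
        # Each thread owns a disjoint set of indices, so the writes never collide.
        for i in indices:
            points[i] = point(i)

    mid = n // 2
    front = threading.Thread(target=fill, args=(range(0, mid),))           # start -> middle
    back = threading.Thread(target=fill, args=(range(n - 1, mid - 1, -1),))  # end -> middle
    front.start()
    back.start()
    front.join()
    back.join()
    return points
```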

      Even better, you could design a thread pool to calculate every point, taking into account the specific hardware and the exact number of CPUs, but achieving a linear speedup with this method would be non-trivial, to say the least, unless you used something like a GPU acceleration library. With that comes more problems, however, because communication between the main system and the GPU is very slow.

      I guess what I'm saying is that you don't need better tools to use CPUs effectively, just a better attitude.

    107. Re:Adapt by Rockoon · · Score: 1

      But more importantly, many programming tasks simply aren't meaningful to break up into such units of granularity as OS-level threads. Many programs would benefit from being able to run just some small operations (like iterations of a loop) in parallel, but just the synchronization work required to wake up even a thread from a pool to do such a thing would greatly exceed the benefit of it.

      It's a question of scale. The programs that could not significantly benefit from current parallel execution methods are either bottlenecked on something else (that also needs parallel innovation) or are simply not time-consuming. Really.

      Multi-core tackles volume, not latency. Many of the posters here seem to be stuck on latency, primarily on system responsiveness issues, where multicore isn't really the best solution. Our OSes really aren't designed to maximize responsiveness, and if they were they would have to sacrifice volume in order to do so.

      IMHO, my system's responsiveness is good enough for now. It's the volume of computation that is lacking, and that is why I welcome many-core. I want general-purpose teraflops for a reasonable cost.

      --
      "His name was James Damore."
    108. Re:Adapt by windwalkr · · Score: 1

      This is true, but perhaps an overly simplistic view of the problem. If you can lower the overhead, then certain problems which were too small to be worth threading could suddenly benefit from threading. Almost any code could be threaded, if the overhead is low enough. This is what out-of-order execution is all about, really, but allowing explicit threading of inner loops would be a nice trick if the threading overhead could be made sufficiently low.

    109. Re:Adapt by lpq · · Score: 1

      Pegging CPU at 25% (or at 12.5% on a dual-socket, quad-populated motherboard) is a Windows-ism.

      On Windows, an 8-CPU machine has 100% CPU total.
      Same as a 4, 2, or 1. 25% CPU pegged means 1 core is at 100%, 3 cores are idle.
      Not to do with disk I/O.

      It's one of my standard complaints on some Windows forums.

      I much prefer the unix/linux system of showing cpu usage as % of 1 CPU and allowing one to show
      200, 400 or 800% CPU usage based on number of cores, but Windows has always tried to hide hardware
      and what's really going on from users, so it's consistent in that way...
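
      The arithmetic behind the two conventions, as a small sketch (the function names are made up for illustration):

```python
def whole_machine_to_single_core(total_percent, cores):
    """Convert a 'whole machine = 100%' reading to a 'one core = 100%' reading."""
    return total_percent * cores


def single_core_to_whole_machine(core_percent, cores):
    """Convert a 'one core = 100%' reading to a 'whole machine = 100%' reading."""
    return core_percent / cores
```

      So 25% on a 4-core box and 12.5% on an 8-core box both work out to 100% of one core, and 800% in the per-core convention is a fully loaded 8-core machine.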

    110. Re:Adapt by gmack · · Score: 1

      You are forgetting that the processor is not what stops you from running that many things at once.

      Your biggest bottleneck will be your Drive followed by the RAM. It doesn't matter how fast the chip goes or how many things it can do at once if it's always waiting for data.

    111. Re:Adapt by X0563511 · · Score: 1

      How about something similar to what some supercomputers (and the Cell Broadband Engine) do: have one or more cores (not necessarily the same as or similar to the others) dispatching for the rest?

      I.e., one specialized core that analyzes the program and dispatches out to the other cores. Rather than sending the actual instructions, just send them their instruction pointers and how far to execute before reporting back.

      At the start, putting this in its own socket (or allowing something like microcode updates) would be wise, as this would likely change frequently. Similar to how FPU coprocessors came into play.

      --
      For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
    112. Re:Adapt by Hal_Porter · · Score: 1

      http://www.pluralsight.com/community/blogs/mike/archive/2004/05/25/415.aspx

      Q) Why did the multithreaded chicken cross the road?
      A) to To other side. get the

      Q) Why did the multithreaded chicken cross the road?
      A) other to side. To the get

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
    113. Re:Adapt by smallfries · · Score: 1

      Although Tolkien's knowledge of the FOSS scene was pretty substandard he did, at least, know how to make poetry scan. That is butchery on an industrial scale

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
    114. Re:Adapt by X0563511 · · Score: 1

      Yet other tasks, that tend to be done on desktops, do scale well. Such as rendering.

      Also, audio (and video) production benefits from this very much, but not to the extent as rendering.

      But most things that I can think of relate to content production, rather than the scut work the 'average' user does.

      --
      For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
    115. Re:Adapt by X0563511 · · Score: 1

      How does it determine when it's worth the effort of doing anything for a particular bit of computation versus just doing it inline and saving the overhead?

      The compiler could do this ahead of time. It could spend the time analyzing the program, and write the results out somehow. Assign each chunk a metric of some kind.

      This data could then be read at runtime and used to quickly figure out what to do. Based on the hardware, metrics below a certain level could be combined so that you have nothing below a certain granular level... and there is your breakdown. Start running the chunks.

      There's no reason that a lot of this work couldn't be done in a time-intensive manner once, instead of in a less intensive manner every time it's needed. Think of it as compilation vs interpretation.

      --
      For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
    116. Re:Adapt by SlashWombat · · Score: 1

      Have to agree, most programs do not require more than one to three threads. Where multiple cores will shine is when one runs multiple programs, and the expectation here is that the OS will partition out the load appropriately. (Gee, just like linux/windows/unix ...) so the article is just a load of FUD!

    117. Re:Adapt by mattcasters · · Score: 1

      Most of these threads and processes are consuming next to no CPU time so your reasoning would not hold up. That is, unless you have a huge number of tiny processes running. However, the context switching would then consume more time than the actual processing.

      --
      News about the Kettle Open Source project: on my blog
    118. Re:Adapt by mattcasters · · Score: 1

      Sorting in parallel can indeed be the faster option if you are CPU-bound as described in the example. The point of TFA was that programs are ill-adapted to a new multi-core reality and I have to agree that sorting on a single core is a prime example of that. The fact that there are slow disks and fast disks out there or that there is a thing like "I/O wait" is beside the point. It should be possible to do things in parallel.

      All that being said, I have to violently agree with tftp above: some things just can't be made to run in parallel in a safe way.
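
      For the sorting case, one common shape is to sort chunks independently and then merge them. This is a sketch under assumptions: `parallel_sort` is a hypothetical helper, and CPython threads are used only to show the structure (a process pool or a runtime without a global interpreter lock would be needed for an actual CPU-bound speedup):

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def parallel_sort(items, workers=4):
    """Sort chunks of the input concurrently, then merge the sorted chunks."""
    if not items:
        return []
    size = -(-len(items) // workers)  # ceiling division: at most `workers` chunks
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(sorted, chunks))
    # The merge is the sequential part that no number of cores removes.
    return list(heapq.merge(*sorted_chunks))
```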

      --
      News about the Kettle Open Source project: on my blog
    119. Re:Adapt by Hal_Porter · · Score: 1

      It's not just ray tracing

      This Intel paper on Larrabee (pdf) shows pretty good scaling (90% of linear with 48 processors, linear with 32) with shipping DirectX 9 games like Gears of War, F.E.A.R., and Half Life 2 Episode 2.

      What's impressive is that the games are unmodified - the only change is in the graphics driver which tiles the rendering and allocates one core per tile.

      You could easily imagine a CPU/GPU hybrid that would do this. And you could use the processing power in a server too - imagine a thread pool servicing requests, it's not just for games. Larrabee is x86, but not apparently PC compatible, but you could probably get x86 Windows to run on it with the right HAL.

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
    120. Re:Adapt by Anonymous Coward · · Score: 0

      The instruction set needed is just one instruction: "Copy Data from to ". Really! You don't need anything else if you just memory map the functional units.

      The problem is everyone still thinks a processor should "actively" interpret instructions (a good thing for a human to do but cumbersome for a digital processor) when the processor only needs data routing.

      And no one seems to have understood that there _is_ reconfigurable logic. One doesn't need to reprogram software when one can reprogram hardware.

      Have fun anyway

      Gustavo Rocha

    121. Re:Adapt by Pentium100 · · Score: 1

      No, not really. Single core CPU is the same for all programs, so if you have a 5GHz CPU, it will generally be 5 times faster than a 1GHz* CPU for any application - be it games, video encoding or something else.

      Now, my computer has two sockets, both of which have a dual core CPU, so in total I have 4 cores 2GHz each. My PC would be quite fast if the program was NUMA-aware and had 4 threads. In that case, my computer would probably outperform a 1x8GHz CPU with the same RAM technology. However, as soon as I start playing a game that does not support multithreading, my computer becomes just 2GHz and memory with higher latency than a one single core CPU would have.

      There are tasks that you can't break up into multiple cores (for example the calculation of 2*2+2 can't be broken up because you have to know the answer of the multiplication before you can add anything to it, though 2*2+4*4 could be broken up).
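
      The difference between the two expressions can be sketched like this (hypothetical function names; threads are used purely to show which pieces are independent):

```python
import operator
from concurrent.futures import ThreadPoolExecutor


def dependent():
    """2*2+2: the addition needs the product first, so the steps must run in order."""
    product = 2 * 2
    return product + 2


def independent():
    """2*2+4*4: the two products share no data, so they can be computed concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(operator.mul, 2, 2)
        right = pool.submit(operator.mul, 4, 4)
        # Only the final addition has to wait for both results.
        return left.result() + right.result()
```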

      Also, PCs were pretty responsive with a 100MHz CPU not so long ago, now we need 2x2GHz (40x more) to do the same tasks (word processing etc)?

    122. Re:Adapt by ciderVisor · · Score: 1

      In much the same manner that two women cannot between them have a baby in 4.5 months.

      Maybe not, but I'd pay good money to watch them attempt it.

      --
      Squirrel!
    123. Re:Adapt by Anonymous Coward · · Score: 0

      Hello Great Blog I will definitely bookmark your blog. I am also having a blog related to IT news ( http://itresearchnews.blogspot.com/ ) which gives latest analysis and trends in IT industry in the present recession period. I would appreciate if you could kindly bookmark my blog too

    124. Re:Adapt by dna_(c)(tm)(r) · · Score: 1

      we'll need one dedicated exclusively to manage what goes where and when.

      Single point of failure. One process to crash them all

    125. Re:Adapt by chthon · · Score: 1

      This is 30-year old technology. I have at home an issue of an old electronics magazine where the same is shown for a 30 MFlops computer, built with TTL, which has a dispatching processor and 15 computing processors.

      The real advantage for the desktop would be that they finally start using the ideas from 40 to 50 years ago: that IO needs separate processors where these operations can be offloaded. However, to be really effective, it is time that hard drive manufacturers start to apply RAID internally to their hard drives in order to increase the speed with which data can be transported between the IO processors and the peripheral.

      My daily work entails building large embedded software. My speed is mainly capped by the IO from my harddrives, not by the amount of processing that can be done. If the speed of hard drives can be increased four fold (possible with RAID-5, but too expensive for the people that guard the budgets), then my processors would be utilized much better.

    126. Re:Adapt by Anonymous Coward · · Score: 1, Funny

      Three Cores for the Gnome kings under the Gtk,
      Seven for the KDE lords in their halls of X,
      Nine for Emacs Men doomed to spawn,

      One for the Ballmer, who throws his dark throne,
      In the land of Redmond, where the Windows roam.

    127. Re:Adapt by Anonymous Coward · · Score: 0

      My grandmother told me:
      There are two kinds of programs: IO bound and compute bound. In neither case does multi-core help performance.

    128. Re:Adapt by sergueyz · · Score: 1
      The blocking problem of dynamic dataflow architectures is the amount of content-addressable (or associative) memory.

      It is very easy to overflow any amount of CAM/AM by using just matrix multiplication. For N^2 data items matrix multiplication produces N^3 multiplications that can be executed in parallel. So for reasonable N=1000 we get 10^9 multiplications, stored as 2*10^9 multiplicands in CAM.
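
      The arithmetic above, spelled out as a tiny Python sketch (the function name is made up for illustration; it only counts, it does not model the CAM itself):

```python
def matmul_counts(n):
    """Return (data_items_per_matrix, multiplications, operands_to_hold)."""
    data_items = n ** 2               # entries in one N x N matrix
    multiplications = n ** 3          # one multiply per (i, j, k) triple
    operands = 2 * multiplications    # two multiplicands per multiplication
    return data_items, multiplications, operands


if __name__ == "__main__":
    print(matmul_counts(1000))
```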

      We found a way to overcome that problem while I worked at IPMCE. We use "program time" to sort tokens out of the AM. Those that are "most far away in future" are swapped out to free space for more needed ones. That way we control available parallelism and guarantee that tokens eventually meet their pairs so computation won't get stuck.

      It's like "bubble sort" of tokens with a window instead of single element, so we called it sorting dataflow machine.

      It is not unlike throttling which was used to control parallelism in CM-5/Id90 implementations (AFAIRC). The only difference is that usual dataflow throttling should be applied carefully as it does not guarantee computation progress (it can get stuck).

      To verify our suggestions I developed a model of that architecture: http://thesz.mskhug.ru/browser/hiersort

      The model exhibits some architectural decisions to speed up the whole computation sorting process. Take a look at it as the source files are there (except they are in Haskell programming language;).

      For a first version of a completely new architecture developed in under three months of free time it worked pretty well. It does not stop, it sorts and it provides ~70% of "FPU" load in my testing tasks.

    129. Re:Adapt by dna_(c)(tm)(r) · · Score: 1

      Sure, two is better than one, maybe you can make use of three, really heavy users might need 4, but multithreaded octalcore CPUs are just ridiculous

      So, essentially you're saying that 4 cores should be enough for anybody?

    130. Re:Adapt by OrangeTide · · Score: 1

      "that IO needs separate processors where these operations can be offloaded. "

      Already exists, your computer is full of various processors and controllers. DMA engines, ethernet controllers, harddrive controllers, RAID controllers, graphics processors, audio dsps, system monitoring controllers. Some even have controllers just to monitor some of the I/O buses and alter their speed on demand.

      Also I would go with RAID-6 if I were you. RAID-5 just ain't worth it, doesn't scale and requires holding transactions in nvram to work around the write hole.

      For embedded development I tend to do "disk" I/O over a 25MHz 4-bit bus. So while faster disks would be nice, most of my real problems could be solved with a bit more RAM or a bigger battery. (but those things cost money)

      --
      “Common sense is not so common.” — Voltaire
    131. Re:Adapt by AlecC · · Score: 1

      Which is great - if you rewrite your software for such an architecture. Which is what the original article was saying.

      --
      Consciousness is an illusion caused by an excess of self consciousness.
    132. Re:Adapt by AlecC · · Score: 1

      Fine. Now how do you expect a dumb processor to do what a clever programmer cannot? If we knew how to build machines that could do this, we would have done it. But to allow for parallelism, you have to understand the why of the system, not the what. And that gets lost between design and code.

      --
      Consciousness is an illusion caused by an excess of self consciousness.
    133. Re:Adapt by LiquidCoooled · · Score: 1

      As for the car analogy:

      Two cars only help you get x people faster from one place to another, if you have more people to transport than fit in one car.

      (I challenge anyone to dumb that down.)

      Are we talking about an American or European person?

      What happens if a single person errrr process is too large to fit in a single car?

      Maybe we could use 2 cars with a line between them.
      You could use the tow bar to hold him.

      --
      liqbase :: faster than paper
    134. Re:Adapt by TheNinjaroach · · Score: 2, Insightful

      A single-core system at 5GHz would be less-responsive for most users than a dual-core 2GHz. Here's why:

      Because you're going to claim it takes more than 20% CPU time for the faster core to switch tasks? That's doubtful, I'll take the 5GHz chip any day.

      --
      I went to eat some animal crackers and the box said, "Do not eat if seal is broken." I opened the box and sure enough..
    135. Re:Adapt by log0n · · Score: 1

      Anyone who didn't get Out Of Order from the context of your post should have their programming nerd id badge revoked :)

    136. Re:Adapt by Anonymous Coward · · Score: 0

      The fastest way to get info from a disk is sequentially.

      I disagree (for typical desktop usage)

      I read an interesting article about disk IO in the Google Chrome browser. When the browser starts, it "needs" to load the bookmarks, cache metadata, etc from disk. That would take time, slowing browser startup. So what Chrome does is it starts with no bookmarks and without initialising the cache system, and it loads the data in background threads. By the time you've typed a few characters in the address bar (which is when the data is actually needed), it's been loaded.

      This gives the appearance of the disk I/O taking zero time.

      Incidentally, for desktop usage if you're going to read lots of small files, the thing which takes the time is the seeks. So you really want to submit a large number of parallel requests and allow the OS disk scheduler and/or the disk (assuming NCQ) to optimise by doing elevator seeks - where the heads only move in one direction and don't keep jumping back and forth. Sequential reads are only quicker if you're reading a single large file on a defragmented disk.
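
      A rough Python sketch of why the elevator ordering helps, under simplifying assumptions: seek cost is taken to be just the distance between track numbers, the track numbers are made up, and the helper names are illustrative rather than any real scheduler's API.

```python
def total_seek_distance(start, order):
    """Sum of head movements when servicing `order` starting at `start`."""
    distance, position = 0, start
    for track in order:
        distance += abs(track - position)
        position = track
    return distance


def elevator_order(start, requests):
    """Service everything at or above the head going up, then the rest going down."""
    upward = sorted(t for t in requests if t >= start)
    downward = sorted((t for t in requests if t < start), reverse=True)
    return upward + downward


if __name__ == "__main__":
    requests = [98, 183, 37, 122, 14, 124, 65, 67]  # made-up track numbers
    head = 53
    print("arrival order: ", total_seek_distance(head, requests))
    print("elevator order:", total_seek_distance(head, elevator_order(head, requests)))
```

      The scheduler can only reorder what it has already been handed, which is why submitting many requests at once gives it more to work with than issuing them one at a time.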

    137. Re:Adapt by DrgnDancer · · Score: 1

      That only works for some kinds of problems. The article mentions that Photoshop and several video packages have already been optimized for multiple cores, seeing substantial performance gains. The reason that video and image software have already been substantially optimized, and other kinds of software have not is not really gone into, but seems twofold to my mind.

      1) Signal transformation of various sorts is one of the most highly parallelizable forms of computation around. It seems likely that audio software could be similarly rewritten with a minimum of effort to make better use of multiple cores.

      2) Because it's relatively easy to break down signal transformation problems, and because there are lots of viable commercial reasons to do so in markets that could afford multiprocessing systems early, before they became cheap (Movie and video rendering and editing comes to mind right away); it's one of the better researched forms of multiprocessing problems. I'd venture to guess that 80 or 90% of people who have ever done any work at all in parallel systems have done signal transformation work either as part of their education or part of their work.

      The problem with applying these techniques to consumer software is figuring out how to break down the problem. What CAN be done in parallel (not everything can; if you try you'll get race conditions, interrupt servicing problems, all kinds of stuff), and how can you do it such that breaking the problem down and coordinating the threads isn't MORE work than just solving the problem in a single thread would have been?

      GP's idea is to develop systems (I'd imagine that CPU instructions, kernel modification, and compilers would all be needed) capable of figuring this stuff out for us. Parallelizing the code at run time (or compile time? I wasn't clear) probably requiring a minimum of three threads to do it, the controller and N+1 computational threads you suggest, plus another thread to figure out how to break up the problem in real time and send chunks to the computational threads. I don't think I understand exactly how he wants to handle the real time breakdown, but I think that's the key. The computer figures out how to parallelize the problem for us, rather than forcing us to find the place in the algorithm where we can break the problem up.
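
      For the easy case in point 1, the manual breakdown looks roughly like the Python sketch below. Everything in it is an illustrative assumption: `gain` stands in for a real per-sample transform, and threads are used only to show the structure (in CPython, pure-Python work like this will not actually run faster on threads because of the GIL; real signal code would use processes or native libraries).

```python
from concurrent.futures import ThreadPoolExecutor


def gain(sample, factor=2):
    """A stand-in per-sample transform; each output depends on one input only."""
    return sample * factor


def split(samples, parts):
    """Cut `samples` into up to `parts` contiguous chunks of near-equal size."""
    size = max(1, -(-len(samples) // parts))  # ceiling division
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def transform_chunk(chunk):
    """Apply the transform to every sample in one chunk."""
    return [gain(s) for s in chunk]


def parallel_transform(samples, workers=4):
    """Transform chunks on separate workers and stitch them back in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(transform_chunk, split(samples, workers))
        return [s for chunk in results for s in chunk]
```

      The split is trivial here because no sample depends on any other. The hard part described above is that most consumer software has no such obvious place to cut.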

      --
      I don't need a million points of light, just two points of multi-mode fiber and a 10 Gig-E router.
    138. Re:Adapt by marcosdumay · · Score: 1

      We already have that, it is called scheduler. Try inserting a bug there.

    139. Re:Adapt by jgtg32a · · Score: 1

      I thought that was the car analogy that we used for 32bit vs 64 bit architecture?

    140. Re:Adapt by marcosdumay · · Score: 1

      "can be solved exactly and in real time"

      Well, exactly except for the noise inserted during operations, and in real time if the cut-off frequencies of your circuit are big enough.

      Exact results and fast calculations were the sellers of digital computers when analog ones were common.

    141. Re:Adapt by marcosdumay · · Score: 1

      The problems with hyperthreading are that it was designed before its time, so Intel couldn't replicate some essential parts of the core, and that out-of-order execution can't solve all the problems.

    142. Re:Adapt by marcosdumay · · Score: 1

      Well, CAD software also benefits from parallelization, as do games, and scientific simulations, or economical ones (to the currently desperate). Drawing software doesn't benefit too much (or at least, I think it doesn't), but painting software does. Developers should also benefit, but compiling is already fast enough, so more than 2 cores is overkill for most of them.

      My point is, there is no "average" user. Some people need CPU, and most of those will benefit from massive parallelization. Other people don't need it, and those won't benefit from either parallelization or faster cores. The latter will like cheap, less power hungry chips, though.

    143. Re:Adapt by Acapulco · · Score: 1

      Maybe we are approaching the parallelization thing in a wrong way.

      How about instead of adding more cores, we add more duplicate components to the MoBo? Such that we can access 2 or more I/O operations per cycle, etc.

      I mean, right now we have systems with multiple cores, multiple video cards, multiple hard drives, multiple various other stuff. Why not multiple north/southbridges, and other things I'm not aware of modern-day MoBos?

      How difficult is it to scale down the solutions used for parallelization by big data-centers to a consumer-grade single system? I honestly have no idea, but maybe it's something worth researching. No?

      --
      Slashdot. Unreadable news to annoy nerds. - wonkey_monkey
    144. Re:Adapt by mdwh2 · · Score: 1

      I'm glad that the megahertz myth appears to have passed, but it seems that some people have swallowed the "multicore myth", believing that multiple slower cores are better than single faster cores.

      While you're playing a game more programs are running in the background - anti-virus, defrag, email, google desktop, etc.

      Why does this matter? Computers have been capable of multitasking on a single core for decades. Are you really saying that this consumes more than 1GHz worth of time?

      If you mean the problem where one process can hog 100%, that's a software issue. Theoretically you could have an OS that limited the CPU time allowed to each thread to a maximum of 50%, and then it would behave like your dual core. But I think most people would prefer no such artificial limit.

      Also, any proper, modern game splits its tasks, e.g. game AI, physics, etc.

      And? All this means is that it will take advantage of multiple cores. But this can never mean it runs faster. The best case scenario is that your dual core 2GHz will equal a single core 4GHz, but in practice, it will be less than that even. I want to know what sort of computer you have where running things on two processors makes it more than twice as fast!
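
      The "less than that in practice" part is usually stated as Amdahl's law: if a fraction p of the work can be spread over n cores, the overall speedup is 1 / ((1 - p) + p / n). A small Python sketch, with function names invented for illustration and clock speed treated as the only difference between the chips:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup over one core when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def equivalent_single_core_ghz(core_ghz, cores, parallel_fraction):
    """Clock a single core would need to match, ignoring all other differences."""
    return core_ghz * amdahl_speedup(parallel_fraction, cores)


if __name__ == "__main__":
    # Hypothetical workload where 90% of the work parallelizes perfectly.
    print(equivalent_single_core_ghz(2.0, 2, 0.9))
```

      Only with p = 1 does the dual core 2GHz reach the 4GHz best case; any serial portion pulls it below that.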

      I'm also not sure what you mean by "responsive". My Amiga 500 was responsive, but being responsive is often not simply a CPU issue.

    145. Re:Adapt by mr_mischief · · Score: 2, Insightful

      This is indeed true on a general-purpose desktop most of the time. There are many server and workstation tasks, though, that can take as many cores as you can throw at them.

      A web server, application middleware server, or database server will often run multiple single-threaded programs at once rather than running one huge multi-threaded application.

      People who say that you must have multi-threaded applications to use multiple cores are either incompetent or are looking at a very narrow section of the industry. Not everyone runs a single foreground process with just a virus scanner in the background. An SMP or NUMA server with a hundred application instances running isn't going to run all of them on the first four cores and ignore the rest.

      Some scheduling changes may be necessary to make doling the work out to a really big number of cores, like 128, 256, or 512, work really well, but Linux is already run on HPC clusters much larger than that and Windows HPC is supposed to be capable of it, too.

    146. Re:Adapt by adosch · · Score: 1

      Yes, my analogy is better than yours. That's what more /. posts need. What's next? Who's gonna see whose old man can take who? Swinging dong contests are lame.

    147. Re:Adapt by mr_mischief · · Score: 1

      If you're running a dual-socket, 4-core x86 PC then you're probably using SMP, not NUMA. In SMP, all the processors see all the memory and it's the same cost for any processor to read or write any memory location. NUMA means Non-uniform memory architecture, which means either not all the processors see all the memory or that some of the memory is faster for each processor than the rest of the memory is.

    148. Re:Adapt by mdwh2 · · Score: 1

      Saying that you need an extra core to run a small background program, makes about as much sense as saying CPUs need to get another 2GHz faster in order to run it.

      32 tabs open in Firefox (come on, 1 thread per tab!!).

      Does Firefox multithread each tab? ISTR that this was something that Chrome did, but I wasn't aware of other browsers doing it yet.

    149. Re:Adapt by mr_mischief · · Score: 1

      How many "average" users do you figure are using Slashdot right now? Sure, their browsers might not need a lot of cores each, but do you think Slashdot itself is on a single-core machine at someone's desk? As more stuff gets done via centralized online applications, more people will be using multi-core systems and even clusters. They'll just be using them indirectly.

    150. Re:Adapt by 0xABADC0DA · · Score: 1

      A lot of single-threaded code can be made parallel with just a little bit of OS support:

      int my_cpu = sys_split(max_cpus, &ncpus, amt_work);
      // now executing on ncpus
      for (int i=my_cpu; i < niter; i += ncpus) {
      // work
      }
      sys_combine();
      //now executing on last CPU to finish

      The OS can return you 1 CPU if amt_work is too small for how much time it will take to set up, or after the first split (doesn't make sense to break work into more than available processors). If there's a free CPU running idle thread it can be reassigned right away.

      This covers a lot of the code that could be multithreaded and has minimal synchronization. The program doesn't have to manage anything, and it can be done by library code to automatically speed up many programs. And it's portable, in that systems that are single-CPU make this syscall a no-op, or systems that take a long time to get another CPU working for the program just return 1 CPU unless work is huge, or on arch with CPU threading it can split on even smaller-scale workloads.

      But OS kernel designers will never do this, because they are only concerned about maximum throughput and making each CPU run at 100%, and not users. Honestly what most users want when they have an 8-core system is for a thread to have say 2-4 cores allocated to it if that means it can get even a 10% speedup. Eking out even more performance on a single-threaded program is not hard, but OS designers need to be willing to 'waste' CPU time to do it, for instance by scheduling threads to two cores at the same time so that things like split() are practical... even when most of the time the other core is completely idle.
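
      The strided loop from the snippet above can be approximated in user space today. The Python sketch below is only an analogy: `sys_split` and `sys_combine` are hypothetical syscalls from this post, and here an ordinary thread pool stands in for them, so none of the "OS decides how many CPUs you get" behaviour is modelled. The function names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def strided_worker(my_cpu, ncpus, niter, work):
    """Handle iterations my_cpu, my_cpu + ncpus, my_cpu + 2*ncpus, ..."""
    return [(i, work(i)) for i in range(my_cpu, niter, ncpus)]


def split_and_combine(niter, work, ncpus=4):
    """Run work(i) for i in range(niter) across ncpus workers, results in order."""
    ncpus = max(1, min(ncpus, niter))  # no point in more workers than iterations
    with ThreadPoolExecutor(max_workers=ncpus) as pool:
        parts = pool.map(
            lambda cpu: strided_worker(cpu, ncpus, niter, work), range(ncpus)
        )
        results = dict(pair for part in parts for pair in part)
    return [results[i] for i in range(niter)]


if __name__ == "__main__":
    print(split_and_combine(10, lambda i: i * i, ncpus=3))
```

      As in the original, this only fits loops whose iterations do not depend on each other.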

    151. Re:Adapt by try_anything · · Score: 1

      What I meant by "RAM is cheap" is that on modern systems you'll only start swapping under pathological circumstances. The GP poster was worried that running two jobs concurrently would turn the disk into a bottleneck. His objection was framed around jobs with heavy disk I/O, but I also wanted to address the question of swapping, which was a valid concern back in the days when switching from task #1 to task #2 could mean swapping in a bunch of code and data for task #2. These days, #2's code and data would be in memory somewhere, a much quicker trip than disk.

      And I disagree with the title of this thread - Linux (the kernel at least) is quite well prepared for multicore chips.

      That seems to be the case to me, too. It's the applications that drop the ball. Emacs can get hung opening a large .cpp file if the macros confuse the parser used by the syntax highlighter. Why isn't that done in a separate thread so I can make my changes and close the file while the syntax highlighter flails in the background?

    152. Re:Adapt by HuguesT · · Score: 1

      Well, if you think parallel computation is hard, FPGA programming is yet another completely different ballgame. You have to worry about timing delays and the like. This is not for the faint of heart, and not all algorithms can be efficiently implemented in an FPGA. It works best on flow-like stuff but falls to slow CPU speed if you need to access memory randomly.

      Indeed you can emulate hardware on an FPGA, but not *very* fast.

    153. Re:Adapt by mr_mischief · · Score: 1

      Anyone really concerned with performance shouldn't have just one drive. Even on a desktop, it makes sense to have one drive for the applications and one for the OS, with the data on one or the other. Having one each for OS, apps, and data makes sense, or having three in RAID 5 for that matter.

      This machine has four hard drives, and is data heavy. There's one drive with one partition each for the OS, the distro's applications, swap space, and my separately tracked applications. Then there's the data store, which is a three-disk RAID 5 array. That holds my application data, my important documents, my work projects, and my backups (which then get burned to DVD and stored both on-site and off-site in fire-rated safes, BTW).

      Currently it's all spinning disk technology, but if you've been keeping track of SSD performance the last few weeks, you know that using an Intel X25 or an OCZ Vertex for part of the storage system could speed things up quite a bit. With an SSD (a good one, anyway), random reads are about as fast as sequential reads. Sequential writes are blazing fast, too. Random writes can be a problem area, but on a drive with a good controller they're still faster than a really nice rotating drive. Only the Velociraptor even competes with the better SSDs even on their weak points.

    154. Re:Adapt by Andy+Dodd · · Score: 1

      The article title is clearly misleading. They say Windows and Linux aren't ready, but the article spends all of its time talking about the applications not being ready.

      In fact, after RTFAing, I can't even find any mention of the OS in the linked article. Nothing about Windows or Linux deficiencies whatsoever.

      --
      retrorocket.o not found, launch anyway?
    155. Re:Adapt by mr_mischief · · Score: 1

      No, no, no! With 32 bits vs. 64 bits, it's not the car that gets bigger. It's the bus!

    156. Re:Adapt by donjefe · · Score: 1

      We mac users just don't have to reboot as often, so we don't get a lot of pent up rage :)

    157. Re:Adapt by mr_mischief · · Score: 2, Interesting

      So let the office workers keep the two-core machines. I'll take the 8-core machine since I'm not doing just word processing and spreadsheets.

      BTW, complex spreadsheets are actually an ideal application to break into parallel execution if there aren't too many dependencies in the functions. A slower and more power-efficient multi-core processor could update all the cells in many spreadsheets just as fast as a faster single-core one.
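
      One way to picture "not too many dependencies" is to group cells into batches where nothing in a batch reads anything else in the same batch. The Python sketch below uses the standard library's `graphlib` for that; the sheet layout and the function name are made-up examples, and it only plans the batches rather than evaluating any formulas.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def parallel_batches(dependencies):
    """Group cells into batches; cells within a batch do not depend on each other.

    `dependencies` maps each cell name to the set of cells its formula reads.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular references
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all of these could run at once
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    sheet = {
        "A1": set(), "A2": set(), "A3": set(),
        "B1": {"A1", "A2"},
        "B2": {"A2", "A3"},
        "C1": {"B1", "B2"},
    }
    for batch in parallel_batches(sheet):
        print(batch)
```

      A sheet that is one long chain of cells gives batches of size one, which is the "too many dependencies" case where extra cores do nothing.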

    158. Re:Adapt by Sloppy · · Score: 1

      Actually, in Linux (and likely other *nix systems), with command lines involving multiple pipelined commands...

      I always thought it was funny that I have bash scripts that (if fed sufficiently fast from the source input) are capable of taking advantage of over ten CPUs, but a lot of my code written with a "real" and more "sophisticated" programming language, isn't as capable of taking advantage of parallelism.

      When I look at those bash scripts, I realize that if I had implemented the same thing in a "real" language, I probably would just have one loop that sequentially does lots of little things to a record of data and then outputs it.

      Because of this, now when I'm working in a "real" language, I sometimes get a perverse idea and ask myself, "How would I do this in bash?" ;-) I usually don't follow through and actually do it that way (yet; because I'm actually still using single-core-single-CPU hardware) though.
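
      Doing it "the bash way" in a "real" language might look something like the Python sketch below: one thread per stage, joined by bounded queues that play the role of pipes, so a downstream stage can start on the first record while upstream stages are still producing. The names (`stage`, `run_pipeline`, `DONE`) are invented for illustration, there is no error handling, and in CPython the threads will overlap I/O but not pure-Python computation.

```python
import queue
import threading

DONE = object()  # sentinel marking end of stream


def stage(func, inbox, outbox):
    """Run `func` on each item from `inbox` and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(item))


def run_pipeline(items, funcs, buffer_size=8):
    """Push `items` through `funcs`, one thread per stage, like `a | b | c`."""
    queues = [queue.Queue(maxsize=buffer_size) for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()

    def feed():
        # Feed from a separate thread so a full first queue cannot block the reader.
        for item in items:
            queues[0].put(item)
        queues[0].put(DONE)

    feeder = threading.Thread(target=feed)
    feeder.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        results.append(item)
    for t in threads + [feeder]:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(["a b", "c d e"], [str.upper, str.split, len]))
```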

      --
      As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
    159. Re:Adapt by Anubis350 · · Score: 1

      Actually, I like apple because of the price-point (or did), my previous gen 8-core mac pro was cheaper with my student discount than a comparable system from any other manufacturer or building it myself. Its thermal design also means that it's exceptionally quiet. OSX is also a very nice OS, though ~half the time my tower spends on is in debian (with a small amnt of time in vista for gaming :-) )

      BTW, at the time when the g3/4 was being lavished with praise it *was* far faster, clock for clock, than intel/amd/via/etc. With the innovations of AMD64/EM64T and the stagnation of PPC (because apple is a very small fish to IBM), x86 caught up and surpassed

      --
      "goodbye and hello, as always" ~Prince Corwin, from Zelazny's Amber series
    160. Re:Adapt by Rex1Ballard · · Score: 1
      Both Windows and Linux have a problem with Desktop performance, because much of the performance is determined, not by the CPU speed, but by the disk drive access speed.

      Linux servers run quite nicely on multi-core systems, including IBM's Z-Series. The old 2.2 kernel had some problems with spinlock contention that made it slower than NT 4.0 when used with 4 SCSI controllers and 4 ethernet cards, but Microsoft had to really be creative to come up with that infamous Mindcraft benchmark.

      The 2.6 kernel uses queued events, which eliminates all but a few picoseconds of contention, giving Linux the ability to do some pretty radical performance on multi-core systems.

      Linux also provides better support for clusters and virtualization. The result being that load can be more easily distributed when the application is designed for Linux/Unix in the first place.

      One could make the case that HP-UX or Solaris might be faster or more reliable, but much of this depends on the specific types of benchmarks used, the types of specialized tuning allowed for that benchmark, and the optimization of hardware for that benchmark.

      Where throughput can be pipelined, queued, or otherwise isolated from the hard drives and video, the performance can be quite impressive.

      Unfortunately, for desktops, you can't update the display too fast, because the user can only see about 30-60 frames/second. Databases are more dependent on the rotational speed, number of hard drives, and amount of hard drive cache, rather than the number and speed of CPUs.

      For some applications, such as cracking encryption, or predicting the weather, 8-core Linux systems might be more useful.

      Other applications where 8-core CPUs make sense are applications such as 3D Rendering. But is there a big corporate market for real-time 3D graphics? Maybe we'll see a corporate version of Second Life?

      Perhaps something like predictive models based on real-time data mining, such as real-time display of actual call volumes in the sales department, allowing IT to see the entire economic results of the entire corporation in real-time, updated on a "per transaction" basis.

      To really exploit 8-core desktops, we have to think outside the traditional "Windows" paradigm and programming models. We need to think less in terms of huge binary blobs that gobble up huge amounts of memory, and more in terms of streams, pipelines, queues, and parsers that pass results from standard input to standard output in small chunks, and message routing systems such as MPI and PVM, which eliminate the latency that plagues traditional "Model/View/Control" solutions.

      --
      IBM Certified IT Architect http://www.open4success.org
    161. Re:Adapt by mr_mischief · · Score: 1

      If the compute-bound program has places where it isn't relying on an earlier result, it can be made parallel. If you have multiple inputs and outputs and actually process them through different hardware paths, those can be made parallel.

      See GPUs and SIMD instructions for the former and channel bonding networks or RAID controllers for drives for the latter.

    162. Re:Adapt by fractoid · · Score: 1

      Odd, I always thought you DID (I know I did when I was forced to use a Mac) and that's why you had so much time to pester us while we're trying to work... :P

      I keed, I keed! ;)

      --
      Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
    163. Re:Adapt by Anonymous Coward · · Score: 0

      Opterons have always been NUMA, and I'd expect any i7-based server CPUs to be as well (I think they don't exist yet?)...

    164. Re:Adapt by Anonymous Coward · · Score: 0

      The only way the dual core processor would be faster in your example would be if it had more cache than the 5GHz CPU and the working set for your programs fitted into the cache on the dual-core 2GHz chip but not on the 5GHz one, but that's completely independent of the number of cores.

      Actually that is not truly independent of the number of cores. I do some research on multi-core parallelization, and one of the interesting effects is that the effective L1 cache size can be increased if you are careful.

      For example, imagine one core has a 16 KB L1 cache. If you put two cores on the same chip and are careful with synchronization, you essentially have a 32 KB L1 cache without taking the slowdown that caches typically incur by getting larger.

    165. Re:Adapt by Lord+Ender · · Score: 1

      AJAX-enabled web pages/apps are constantly polling and doing other javascriptiness in the background. It would help for them to have their own cores.

      --
      A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
    166. Re:Adapt by mr_mischief · · Score: 1

      Yeah. You're right, but a desktop PC is probably not running two dual-core Opterons. It could be, but most 2 CPU by 2 core x86 systems one would find on a desktop rather than in a server rack will be Xeons. Those still use a common front-side bus with an external memory controller and are definitely SMP.

      Pentium100's situation may not be your average desktop, but NUMA-awareness is not as important for most people right now as the post might make it seem. I made the mistake of using the generic "you" where it could be taken to be the specific "you". Pentium100's desktop probably is NUMA, since the post shows familiarity with the topic. People reading it don't necessarily have NUMA systems just because they have a similar number of processors.

      The Core i7 is indeed NUMA when in multi-socket systems, BTW. Still, at the present moment, most dual-socket motherboards are not Core i7 nor Opteron boards.

    167. Re:Adapt by poot_rootbeer · · Score: 1

      Remember "AltiVEC" [...] which turned out to just be a slightly better MMX-like SIMD addon?

      Are you proposing that "AltiVEC" was any worse a name for that type of technology than "MMX"?

    168. Re:Adapt by phision · · Score: 1

      The same applies for the 64-bit computing - useless nowadays, but there are promises programs will make use of it some day. Until then we will continue using our Core 2 Duo processors' one core in 32-bit mode. It is a pity we have paid for something we will never use.

    169. Re:Adapt by Pentium100 · · Score: 1

      No, my PC has two Opterons 270 and each of them has its own memory.

    170. Re:Adapt by Anonymous Coward · · Score: 0

      Actually, in Linux (and likely other *nix systems), with command lines involving multiple pipelined commands, the commands are executed in parallel,

      How the heck do you execute something in parallel, if the next command doesn't have the output from the previous one?

    171. Re:Adapt by Anonymous Coward · · Score: 0

      The problem is that analyzing the program takes more time than just running it. Often it takes longer to find an optimal solution and execute it than to execute a sub-optimal one.

      Never underestimate the difficulty of program analysis. Remember, the problem is inherently incomputable.

    172. Re:Adapt by jd · · Score: 1

      Oh, I agree. The Transputer was an excellent example of a distributed computer. For that matter, so was the Commodore PET (IEEE 488 printers and disk drives all had their own CPU and the main system offloaded all work to them). The Intel iWarp system-on-a-chip was another distributed architecture.

      This stuff isn't new, per se, it is merely new in the sense that CPUs became so powerful that everything became hyper-centralized. Only in areas of sound and graphics have they meaningfully decentralized. Networking - well, how many TOE or RDMA engines do you use? There are plenty of Ethernet cards that support both and Linux drivers for those cards, but I'll bet fewer than one in every ten thousand Slashdotters has actually used such a card.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    173. Re:Adapt by ChrisA90278 · · Score: 1

      You are correct. What is needed is a new instruction set. But we don't need new machine level instructions. Source code level is good enough. A programming language or API that works at a higher level is what's needed. Currently you have to get down in the details of synchronization to make threads work. What's needed is a more declarative approach where you can "mark" sections of the program with information about how they interact.

      Object Oriented Programming is not a bad way to write parallel code, if the objects model real-world objects that can each be on their own thread. Problem is that most dumb programmers' idea of an "object" is a GUI widget.

      In the end it is more about educating developers.

    174. Re:Adapt by rackserverdeals · · Score: 1

      This sounds very similar to the turbo mode in opensolaris that Intel has been working on.

      --
      Dual Opteron < $600
    175. Re:Adapt by papna · · Score: 1

      Silly computer scientists and your overloaded acronyms.

    176. Re:Adapt by Kymermosst · · Score: 1

      That analogy doesn't work either.

      Adding seats to a single car either assumes everyone goes to the same destination (all your processes are the same), or your car suddenly has to drive to all of their destinations anyway (all processes wait for identical events at identical times).

      --
      "Alcohol, Tobacco, Firearms, and Explosives" should be a convenience store, not a government agency.
    177. Re:Adapt by Bitmanhome · · Score: 1

      Are you thinking about the right kinds of problems? Math and simulation are inherently easy to parallelize; the hard ones are decision-making algorithms like AI and data compression. If you can solve those, you'll be a hero.

      --
      Not that this wasn't entirely predictable.
    178. Re:Adapt by mr_mischief · · Score: 1

      You might want to note my 10:51 AM response to the AC before your response at 11:26 AM. I pointed out that you, specifically, are probably using something that's not a typical "desktop" configuration.

      I mistakenly used the general "you" referring to the random reader when it could be read as meaning specifically you, Pentium100. Most dual-socket machines intended for general desktop use are Xeons with an SMP front side bus rather than Opterons with ccNUMA integrated memory controllers, after all.

      See post #27299117 where I said pretty much the same thing.

    179. Re:Adapt by LordWoody · · Score: 1

      Even if you 'offload' IO, there is the problem that many programs are still blocked until the IO is completed. Watch top when your system is bogged down and see how much time each processor loses to '%wa'. That represents not how much time is spent by that CPU on IO, but how much time is lost to waiting on IO (is it ready yet?, is it ready yet?...).
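      To put a number on that outside of top, here is a minimal Python sketch (an illustration only, Linux-specific) that compares two snapshots of /proc/stat and reports the share of CPU time counted as iowait. The function names are made up, and the field order (user, nice, system, idle, iowait, ...) is assumed to follow the usual proc(5) layout.

```python
import time

def cpu_counters(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")

def iowait_fraction(before, after):
    """Share of CPU time spent in iowait between two /proc/stat snapshots."""
    # Only the first eight counters (user .. steal) are summed; the guest
    # counters that may follow are assumed to be included in user/nice already.
    first = cpu_counters(before)[:8]
    second = cpu_counters(after)[:8]
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    return deltas[4] / total if total else 0.0  # index 4 is iowait

def sample_iowait(interval=1.0, path="/proc/stat"):
    """Linux only: take two snapshots `interval` seconds apart and compare them."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval)
    with open(path) as f:
        after = f.read()
    return iowait_fraction(before, after)
```

      Calling sample_iowait() on a bogged-down box should give roughly what top shows as %wa, averaged over all CPUs.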

      I also work in an environment developing massive computing embedded systems all based on x86/x86_64. Our smaller systems start with four cores and we currently ship up to 16 although we have a 24 core x86/x86_64 system in house for testing.

      We run into pure processing limits in some cases and in other IO throughput (using high performance PCIe based Areca RAID cards in RAID10 configuration). For absolute performance IO throughput we are looking at Violin-Memory systems which implement storage as an external device that is attached via PCIe. The increased throughput is astounding (up to 5x increased insert record rate into a database over a internally hosted PCIe connected RAID10). Yes the Violins are not cheap, but customers that need that kind of throughput will pay the price for it.

      RAID5 is ultimately a net performance loss over RAID1 and even a single disk. Simple RAID1 mirroring provides better throughput: single-disk write performance, and split-disk read performance if the RAID1 read algorithm is implemented correctly. (On Adaptec's cheaper stuff it appears not, as an example; none of the on-board assisted softRAIDs I have tested optimize RAID1 and RAID0 performance, however Linux softRAID (md) does.) Add striping to the mirror and you get boosted write performance on a RAID system that implements the write algorithm correctly.

      RAID10 (2n disks) will of course increase your storage costs over RAID5 (n+1 disks) because you need twice as many disks as the space you want to offer for storage. But the performance gain will justify the costs if you are truly IO limited.

      --
      Never meddle in the affairs of dragons,
      for you are crunchy and good with catsup.
    180. Re:Adapt by X0563511 · · Score: 1

      Actually, slashdot COULD use more cores.

      This new javascript stuff is pretty CPU intensive, for a web site. And when you have multiple stories loading in tabs... I wish the script engine was threaded.

      --
      For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
    181. Re:Adapt by Pentium100 · · Score: 1

      Yes, I saw that post, but just wanted to clarify that I indeed have a NUMA PC (and not all dual opteron motherboards have memory slots for CPU2, mine does).

      I also have a dual Xeon 700MHz PC, and it is SMP. What I have noticed though is that if I run a single threaded program on the PC with opterons, task manager shows one core at 100% and others at ~0%. If I run that program on the Xeon PC, task manager shows both CPUs at 50%. So I guess this is one way of finding out if you have SMP or NUMA.

    182. Re:Adapt by LordWoody · · Score: 1

      This may sound silly but we do something like this. On performance critical systems, we have one or two 'OS' and communication dedicated processors and then we use the rest (up to 14 more for 16 total) for the actual data crunching.

      Those off-CPU processors may exist but if they bind up, your main processors trying to read or write IO will go into wait states until the IO unbinds. Having your data process unbind data crunching from IO helps quite a bit to smooth out the IO spikes that exceed instantaneous IO throughput capabilities.

      Skip RAID6 and move straight to RAID10. Maintaining redundancy internally, it is as fast as it gets with optimal RAID hardware. And it does scale as more drives are added to the stripe(s) up to the maximum performance of the RAID controller chip and PCI(e) throughput.

      The problem with bigger RAM and more battery is that if you have sustained IO that exceeds your write performance, your buffer no matter how large is eventually going to fill and either bind or drop data.

      More RAM simply allows you to handle larger IO spikes. Sustained performance is an entirely different animal. Only real IO performance helps there.
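      A back-of-the-envelope sketch of that point, in Python. The function name and the example rates are invented for illustration; the only assumption is that inflow and drain rates stay constant.

```python
import math

def seconds_until_full(buffer_bytes, inflow_bps, drain_bps):
    """Time until a write buffer overflows; math.inf if the drain keeps up."""
    surplus = inflow_bps - drain_bps  # bytes per second piling up in the buffer
    if surplus <= 0:
        return math.inf
    return buffer_bytes / surplus

# Hypothetical numbers: 300 MB/s arriving, 200 MB/s written out.
print(seconds_until_full(1e9, 300e6, 200e6))  # 1 GB cache
print(seconds_until_full(8e9, 300e6, 200e6))  # 8 GB cache
```

      Eight times the RAM buys eight times the spike duration, but at any sustained surplus the buffer still fills.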

      In my environment we have to deal with spikes and high baseline sustained throughput.

      --
      Never meddle in the affairs of dragons,
      for you are crunchy and good with catsup.
    183. Re:Adapt by djnewman · · Score: 1

      As to the OS's being ill prepared for multi-core that's probably true but I think the problem is not as described. I'm concerned about the comment that it's the developer's fault. To me, the fault lies with the chip vendor not providing multi threaded compilers that are simple to use. The Computer Science folks have hundreds of methodologies for effective multi threading. It's also pretty trivial for a developer to split the application into multiple parts based on usage: client UI, background tasks etc. When the compilers are available that allow simple and automatic assignment of the thread model to a core, the OS's and everything else will be much faster. This may be simplistic thinking, but it's at the level where the common developer can grasp it and use it. As a developer, I don't want to worry about how cores are being used - I just want my application to work. Until there's push from Microsoft and Apple to get Intel's compiler designers into gear, we'll keep having these discussions.

    184. Re:Adapt by LordWoody · · Score: 1

      That polling is not processor intensive. So adding cores will not help much except to reduce cache misses due to task swapping. But even there that only works if the OS is smart enough to prevent unnecessary process/CPU migration.

      --
      Never meddle in the affairs of dragons,
      for you are crunchy and good with catsup.
    185. Re:Adapt by LordWoody · · Score: 1

      The CPUs could be in a blocked IO stage simply asking in a loop: Is it ready yet?, Is it ready yet? ...

      In Linux watch top and note the amount of time CPUs have allocated to '%wa'. That accounts partly for IO blocked processes. So a CPU can be at 0% idle and still doing nothing effective, just simply trying to write to the disk that is too busy to accept the data.

      --
      Never meddle in the affairs of dragons,
      for you are crunchy and good with catsup.
    186. Re:Adapt by Anonymous Coward · · Score: 0

      Off-topic, but wouldn't it be nice if Slashdot had some sort of end-of-year award ceremony to honour memorable posts? Like a 'Car Analogy of the Year' award.

    187. Re:Adapt by lsatenstein · · Score: 0

      In elementary queuing theory, one learns that if one has a uniprocessor capable of speed 2x, and a dual processor working at speed x, more efficiency and performance will occur from the single processor. Extending that to 3 cpus versus 2 cpus, with each sharing the same clocking circuitry, and with separation, the 2cpu environment would outperform the 3 cpu environment. For quads, you really want to isolate contention. That is, to design new memory management circuitry to eliminate contention, to eliminate interrupt contention, (localized interrupts), and of course a whole new I/O system that is optimized as a multiple server, single queue (like the queue for a bank ATM). Forget Quad cores for home computers, as the underlying hardware is not ready for it, and as the topic suggests, neither are the kernels for linux and windows 7.
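      The first claim (one processor at speed 2x versus two at speed x) can be checked with the textbook M/M/1 and M/M/2 formulas. This is a small Python sketch under assumptions the post does not state: Poisson arrivals, exponential service times, and a single shared queue. The function names are illustrative.

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue (one server)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrivals outpace service")
    return 1.0 / (service_rate - arrival_rate)

def mm2_response_time(arrival_rate, per_server_rate):
    """Mean time in system for an M/M/2 queue (two servers, one shared queue)."""
    rho = arrival_rate / (2.0 * per_server_rate)  # utilisation per server
    if rho >= 1.0:
        raise ValueError("unstable queue: arrivals outpace service")
    return (1.0 / per_server_rate) / (1.0 - rho * rho)

# One job per second arriving; x = 1 job per second per slow processor.
fast_single = mm1_response_time(1.0, 2.0)  # one processor at speed 2x
slow_dual = mm2_response_time(1.0, 1.0)    # two processors at speed x
print(fast_single, slow_dual)
```

      Under these assumptions the ratio works out to 2 / (1 + rho), so the single fast processor wins by up to a factor of two at light load and the gap narrows as the machines approach saturation.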

      --
      Leslie Satenstein Montreal Quebec Canada
    188. Re:Adapt by beav007 · · Score: 1

      No, it doesn't - I'm still waiting for it to come.

    189. Re:Adapt by fractoid · · Score: 1

      Yes. That's one of the less worst among Apple's vacuous marketing terms, but it's still pretty bad. MMX loses points for trying to be cool by using an 'X' for extensions, but at least it's shorter. :P

      --
      Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
    190. Re:Adapt by Cassini2 · · Score: 1

      If you try doing some of the harder data intensive problems, anything that doesn't generate machine code can really suck. But to give you an idea of some of the obstacles, take a simple algorithm:

      for each FileName in BigListOfFileNames {
        OpenFile(FileName);
        for i = 1 to NumberOfElementsInFile {
          for j = i to NumberOfElementsInFile {
            ComputeCovariance(i, j);
          } // for j
        } // for i
      } // for each FileName
      SumAllCovarianceCalcsFromAllRuns();

      Apologies for the pseudo-code, but even a simple numerical routine like this can become a very complex computational nightmare. There is no guarantee that the Covariance matrix fits in RAM. For speed, you want to distribute each file across the network. It is necessary to organize, as best as possible, the RAM used in the routine such that locality is preserved to minimize page hits. Appropriate use of the MMX instructions is desirable. It is also a good idea to ensure the compiler isn't adding function calls in unexpected places, like if you have an IsFinite() function call inside the CoVariance routine. Additionally, the MMX optimizations may only become obvious if the CoVariance routine is pulled out of the function and placed in the main body of the loop, so it can be easily parallelized.

      Even trivial looking statements, like OpenFile are complex. For instance, the OpenFile routine should preferentially process files stored on the local computer first, and then process files sitting on other computers elsewhere on the network. To completely confuse the issue, the path names for the same file are different in Windows, based on whether the file is located locally or on the network. (Pathname translation isn't as much of a problem on Linux.) Finally, for speed, and when the covariance matrix is small, it is desirable to have OpenFile open the file as a memory mapped file, and then have a background process hit all the pages so they are paged into memory. The foreground process can then happily process the data in the file without waiting for page hits. On the other hand, if the covariance matrix doesn't fit in RAM, then you don't want to have the OpenFile read ahead, because the hard drive will be busy doing other work instead.

      Additionally, it is convenient to do some of the operations with software transactional memory (STM). That adds another layer of library complexity.

      From a practical point of view, even the few lines of code above, can quickly become a parallelization and optimization nightmare when applied to cluster scale computing. Incidentally, those lines of code were one of the rate limiting steps in a data processing application that I was working on, so they are worth clustering. But clustering only makes sense, if the inner loop is tight. Otherwise, the hand tuned C code clobbers Java code for speed.

      I don't really mean to pick on Java and .NET so hard, but it is simply amazing how slow code can get if you don't ever look at an assembly print out of what you told the computer to do. It really isn't hard to have 6 or 8 function calls per inner loop of the above code, especially if you don't very carefully keep track of what you just told the compiler to do.

    191. Re:Adapt by Anonymous Coward · · Score: 0

      The only way the dual core processor would be faster in your example would be if it had more cache than the 5GHz CPU and the working set for your programs fitted into the cache on the dual-core 2GHz chip but not on the 5GHz one, but that's completely independent of the number of cores.

      Two issues here, first you assume that a 5GHz system is comparable to dual core 2GHz system. It isn't. A quad 2ghz might be comparable to a 5 GHz system. Maybe even an 8 core. High clock speeds are significantly more expensive to produce than multi cores.

      Similarly, you assume that a 5GHz system will have twice the cache of a dual 2 GHz system. The cache closest to the cpu tends to be closest in speed to the cpu too. Doubling the cache tends to slow it down. With both of these working against you, it is very unlikely that you'll have similar cache.

      Finally, you seem to think that the goal is to run the cpus at 100% the whole time. It isn't. Most systems have a lot of i/o. Multi cores allows you to process multiple slow i/o requests with fairly fast response time. Keeping all of the slower i/o steadily flowing will speed up the overall system more than having them block for longer periods of time and processed quickly when it is their turn.

    192. Re:Adapt by SL+Baur · · Score: 1

      It's the applications that drop the ball.

      Agreed.

      Emacs can get hung opening a large .cpp file if the macros confuse the parser used by the syntax highlighter. Why isn't that done in a separate thread so I can make my changes and close the file while the syntax highlighter flails in the background?

      Sigh. You're not the first person who has asked that question. Short answer: emacs was just never designed with multithreading in mind.

      It took years of coding (and even worse, years of debugging) to convert the internals to support multibyte characters. For whatever it's worth, that's an easier job than converting the guts into something thread safe.

      Realistically, it's just not going to be done. We looked seriously at that in XEmacs a decade ago (also with a mind to replace the custom emacs lisp engine with a thread safe Scheme Lisp engine, or some such) and came to the conclusion that a total rewrite was in order *and* we would lose compatibility with emacs lisp. Not an option.

      It's sad. Emacs (like the language it is mostly written in) is something that has withstood the test of time - in the realm of ideas. The architecture on the outside, Lisp as implementation and extension language, modular additions easy, etc. is as good as it gets. Once you get under the hood a different story emerges.

      If I sound defensive, well, guilty. I wish I could give you a better answer.

      As to your font-locking issue, have you looked at (my personal favorite) fast-lock, or lazy-lock? fast-lock uses caching to speed up initial font-locking, lazy-lock does font-locking on demand as different parts of the buffer become visible.

    193. Re:Adapt by try_anything · · Score: 1

      Thanks for the tips. I'll take a look at lazy-lock. I don't know if the font-locking code even terminates on the problem files and don't have the patience to find out, so caching wouldn't help.

      Just out of curiosity, since you're well-qualified to prognosticate, what do you foresee for Emacs? Will a more modern Lisp-based editor eventually displace Emacs, or will Emacs continue to eat its young until something totally different kills it?

    194. Re:Adapt by KingMotley · · Score: 1

      I'm afraid it's been a few years since I've needed to calculate things like Covariances, so this is going to be a very generic answer.

      It seems the problem is that you are having a hard time coming up with a way to do your calculations that isn't serial. In order to parallelize things you need to be able to break things down into smaller pieces that can run concurrently, while your for-loop in a for-loop in a for-each loop is designed to run things serially. Like:

      for each FileName in BigListOfFileNames {
        OpenFile(FileName);
        Dim eventhandles As New List(Of EventWaitHandle)
        for i = 1 to NumberOfElementsInFile {
          Dim ewh As New EventWaitHandle(False, EventResetMode.ManualReset)
          eventhandles.Add(ewh)
          Dim param As New ThreadData(ewh, i)
          Threading.ThreadPool.QueueUserWorkItem(AddressOf CalculateCovariance, param)
        }
        // wait for every queued work item to signal completion
        For each ewh as EventWaitHandle in eventhandles {
          ewh.WaitOne()
        }
      }

      CalculateCovariance(args as ThreadData)
      {
        DoSomeStuff
        args.ewh.Set()
      }

      This would in effect run multiple threads trying to calculate the Covariance running multiple i's at once, of course that assumes that the DoSomeStuff for CalculateCovariance doesn't depend on prior calculations to be able to do its work. I thought that Covariance would require you to calculate the mean first before you could do any variance/covariance calculations, which would be another way of parallelizing the process, by having the system calculate the mean on one file while calculating the variance/covariance on a different file. I suppose you could also use the unit of work for each thread to be a separate file if you wanted to (instead of each value of "i") and you didn't run into memory constraints trying to process two (or more) files at once.
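      For what it's worth, here is the same one-work-item-per-i idea as a small runnable Python sketch. It is not the poster's .NET code: the data layout (one list per variable), the function names, and the sample covariance with an m-1 divisor are all assumptions made for illustration. The means are computed first, as mentioned above, so each row is independent of the others.

```python
from concurrent.futures import ThreadPoolExecutor

def column_means(columns):
    """Mean of each variable; needed before any covariance term."""
    return [sum(col) / len(col) for col in columns]

def covariance_row(i, columns, means):
    """Covariance of variable i with every variable j >= i (upper triangle)."""
    m = len(columns[i])
    xi, mean_i = columns[i], means[i]
    row = {}
    for j in range(i, len(columns)):
        xj, mean_j = columns[j], means[j]
        total = sum((a - mean_i) * (b - mean_j) for a, b in zip(xi, xj))
        row[(i, j)] = total / (m - 1)
    return row

def covariance_upper(columns, workers=4):
    """Queue one work item per i and merge the rows as they finish."""
    means = column_means(columns)
    result = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = pool.map(lambda i: covariance_row(i, columns, means),
                        range(len(columns)))
        for row in rows:
            result.update(row)
    return result

print(covariance_upper([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]))
```

      In CPython the threads only show the structure, since the interpreter lock keeps pure-Python arithmetic from running on several cores at once; a process pool or a native numerical library would be needed for an actual speedup.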

      Lastly, you are correct. A compiler can't compete with hand tuned assembly code -- if you are good at it, and you are optimizing a very small portion of your code. You seem to want to hand code the stuff yourself (since you specifically mention MMX optimizations, which are processor dependent and not something you "pick" in higher level languages, other than to allow/disallow them), which is fine if you need to wring out the very last bit of performance you can from your machine(s). Just make sure that the methods you chose are the best ones for the job, and there isn't a more efficient method that is available to you that could reduce the amount of calculations required, etc.

    195. Re:Adapt by init100 · · Score: 1

      How the heck do you execute something in parallel, if the next command doesn't have the output from the previous one?

      Because the next command may not need the complete output of the previous command before it can start processing. So e.g. first command 1 generates a line of output. While it generates the next line, command 2 can start processing the first line generated by command 1. This can be extrapolated to many chained commands.

      This is actually a specific case of the generic producer-consumer pattern for writing parallelized software.
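      A minimal Python sketch of that producer-consumer shape, with two small bounded queues standing in for the pipes. The names and the "uppercase each line" stage are invented for illustration; in a real shell pipeline each command is a separate process and the kernel's pipe buffer plays the role of the queue.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream

def producer(out_q, count):
    """Stage 1: emit lines one at a time, like the first command in a pipe."""
    for i in range(count):
        out_q.put(f"line {i}")
    out_q.put(_DONE)

def upper_stage(in_q, out_q):
    """Stage 2: start processing as soon as the first line arrives."""
    while True:
        item = in_q.get()
        if item is _DONE:
            out_q.put(_DONE)
            return
        out_q.put(item.upper())

def run_pipeline(count):
    q1 = queue.Queue(maxsize=4)  # small buffers, so stages must overlap
    q2 = queue.Queue(maxsize=4)
    threads = [
        threading.Thread(target=producer, args=(q1, count)),
        threading.Thread(target=upper_stage, args=(q1, q2)),
    ]
    for t in threads:
        t.start()
    results = []
    while True:  # the caller acts as the final consumer
        item = q2.get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results

print(run_pipeline(3))
```

      With a buffer of four items, the producer cannot run to completion before the next stage starts; it blocks until the consumer drains something, which is the behaviour described above.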

    196. Re:Adapt by Yfrwlf · · Score: 1

      That's sad to think about, since GPCPUs are coming out this year I think. So you're putting everything inside one chip, but still segregating each core to do different tasks. The end result is a computer which, while being smaller possibly, means it's not modular. Consumers of course have more control when they can replace/remove/upgrade different dedicated parts.

      Or, are these new fangled GPCPUs going to be dynamic enough to get over this problem? :/

      --
      Promote true freedom - support standards and interoperability.
    197. Re:Adapt by JasterBobaMereel · · Score: 1

      a 2 core CPU is split between the OS and your program
      a 4 core CPU is split between the OS and a couple of your programs (one is probably the GUI) and one is little used

      an 8 core system 4-5 of the cores are never used ....

      Most programs spend most of their time waiting for resources (keyboard input, Disk, memory etc...) adding more CPUs or cores will not speed up a sleeping process

      More cores only help if the programs can actually use them, or need to ... most of the programs you are running today cannot logically use more than one core (maybe 2) and will spend most of their time waiting for you ....

      The exception is very processor intensive programs like games and render software .... for most other software it would only help to spread the load across cores ...which both Windows and Linux do now ...

      --
      Puteulanus fenestra mortis
    198. Re:Adapt by beegeegee · · Score: 1

      Or the G3/G4 processors which led us to be breathlessly sprayed with superlatives for years until Apple ditched them for the next big thing - Intel processors!

      Not the next "big" thing, the next "available" thing. The reason Apple switched to Intel had nothing to do with the suitability of the powerpc chip which ran much cooler and more efficiently than Intel but the availability. They got sick and tired of hold-ups on manufacturing of the powerpcs.

    199. Re:Adapt by bitrex · · Score: 1

      Well, considering the FPGA or other programmable hardware that the analog computer would be running on were designed to implement digital circuits running at many MHz, one would hope that they would have sufficient bandwidth to work with at least some interesting analog calculations. As for noise, you could help with that by using a Kalman filter on the A/D converted output - of course you're back to the constraints of digital again, but since efficient Kalman filter algorithms are O(n log(n)) if you were working with say a system of differential equations that had a best case solution big-Oh larger than that, you'd still get a speed improvement.

    200. Re:Adapt by somenickname · · Score: 1

      People are still doing data flow things even on commodity hardware. For example, Sun's implementation of LAPACK uses data flow techniques in many of the important functions (LU/QR factorizations and matrix multiplies). I worked on the tool that made this feasible. The initial data flow techniques were done by hand and took a very smart engineer about 6 months to do a matrix multiply. With the tool it took about a week to define and debug a blocked matrix multiply in a data flow manner. Scalability was through the roof.

    201. Re:Adapt by Anonymous Coward · · Score: 0

      How about gzip and tar?

      I'm afraid gzip isn't using my quad core. Therefore backups are far from optimal. CPU is the bottleneck here.
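      One way around that, sketched in Python: split the input into chunks, compress each chunk as its own gzip member on a worker thread, and concatenate the members, since a gzip stream may consist of several members back to back. The function name, chunk size and worker count are arbitrary choices for illustration, and the speedup relies on the assumption that CPython's zlib releases the interpreter lock while compressing. Dedicated tools such as pigz exist for this job.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

def parallel_gzip(data, chunk_size=1 << 20, workers=4):
    """Compress `data` as concatenated gzip members, one per chunk."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        chunks = [b""]  # still emit one valid (empty) member
    with ThreadPoolExecutor(max_workers=workers) as pool:
        members = list(pool.map(gzip.compress, chunks))  # order is preserved
    return b"".join(members)

payload = b"slashdot " * 100_000
packed = parallel_gzip(payload, chunk_size=64 * 1024)
print(len(payload), len(packed), gzip.decompress(packed) == payload)
```

      The trade-off is a slightly worse compression ratio, because each chunk starts with an empty dictionary.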

    202. Re:Adapt by marcosdumay · · Score: 1

      I'm not against putting an analog processor inside computers (better yet, a network of them). It has lots of useful applications, but it isn't the silver bullet your post implied.

    203. Re:Adapt by Cassini2 · · Score: 1

      I picked the Covariance as an example because it is an "embarrassingly parallel" problem.

      The difference between your approach and my approach might relate to difference between Computer Science and Computer Engineering. Try examining this problem from an engineering perspective.

      For example, solving the Covariance matrix is an O(n^2 * m) problem, requiring about 12*n^2 RAM for my application, and data on disk occupies 32*n*m. Assume disk speed is 100MB/s of which only 10% can be easily utilized, RAM is 2 GB/computer, floating point speed is about 1E9/s given the constants from the O(n^2*m) calculation. This gives:
      - If n=1E3, m=1E6, we need 12 MB of RAM, 32 GB of files, 0.9 hours disk time, and 0.3 hours CPU time. Disk is the bottleneck.
      - If n=1E4, m=1E6, we need 1.2 GB of RAM, 320 GB of files, 9 hours disk time, and 27 hours CPU time. CPU and RAM are bottlenecks.
      - If n=1E4, m=1E9, we need 1.2 GB of RAM, 320 TB of files, 9000 hours disk time, and 27000 hours CPU time. Supercomputer/cluster time.
      The goal for this application was n=1E6, m=1E9, so obviously algorithmic improvements are required too.
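      The arithmetic behind those three cases, as a small Python sketch. The constants come straight from the paragraph above (12*n^2 bytes of RAM, 32*n*m bytes on disk, 100 MB/s disk at 10% utilisation, 1E9 operations per second); the function name and the return layout are just for illustration.

```python
def estimate(n, m, disk_bytes_per_s=100e6, disk_utilisation=0.10, ops_per_s=1e9):
    """Return (ram_bytes, file_bytes, disk_hours, cpu_hours) for an n x n covariance over m samples."""
    ram_bytes = 12 * n ** 2
    file_bytes = 32 * n * m
    disk_hours = file_bytes / (disk_bytes_per_s * disk_utilisation) / 3600
    cpu_hours = n ** 2 * m / ops_per_s / 3600
    return ram_bytes, file_bytes, disk_hours, cpu_hours

for n, m in [(10**3, 10**6), (10**4, 10**6), (10**4, 10**9), (10**6, 10**9)]:
    ram, files, disk_h, cpu_h = estimate(n, m)
    print(f"n={n:.0e} m={m:.0e}: RAM {ram:.3g} B, files {files:.3g} B, "
          f"disk {disk_h:.3g} h, CPU {cpu_h:.3g} h")
```

      The last row is the n=1E6, m=1E9 goal, which is where the need for algorithmic improvements becomes obvious.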

      The problem with your approach, is that it obfuscates what the real issues are. For really hard problems, you need to start at the bare metal and work up. Otherwise, you can code your multi-threaded application, watch the user load a real data set, and have the program bomb on out-of-memory before you even encounter the disk space and CPU speed issues.

      In practice, a balance between the top-down (high-level abstraction down to machine code) and bottom-up (machine code to high level) approaches is needed. I just don't see anything even remotely on the horizon that lets a programmer do this. The newer programming languages are making performance problems very non-obvious. In one scenario above, with 2 GB of RAM/computer, we were both CPU and RAM limited, so adding more cores and more threads wouldn't help. Some programmers are shocked to discover that a single-threaded algorithm can outperform a multi-threaded algorithm by a large margin. Sometimes statically allocated variables outperform code using new and/or malloc by a large margin, because new and malloc are expensive calls and zero-initialized static memory is cheap. For any application, critical design trade-offs can become very non-obvious, very easily. Parallelization makes many of these issues much more complex. How are the languages helping us to understand these problems?

    204. Re:Adapt by KingMotley · · Score: 1

      Well you always have to have your mind on the size of the data set that you are expected to be able to work with, that's always a given. It's apparent that the size of the data set you are trying to work with is exponentially larger than the typical program needs to, and that should be apparent before you actually start working on it.

      That said given your assumptions (Which I'd fault somewhat, but I'll go with them for now). Using a data set where n=1e6 and m=1e9 then:
      You need 12TB of RAM, 32PB of files, 900000 hours of disk time, and 277,777,777 hours CPU time.

      Ok first, let's knock out the easy ones. 12TB of ram -- You aren't going to get that in a PC. So either you need to change your approach so that you no longer need to store your entire data set in RAM by using a smarter function that does stuff while its running (Summation), or store your temporary results somewhere else (disk). Considering that your code had sum-something function at the end, I would assume you can sum while running.

      32PB of files. Ok, you aren't going to put that on a (single) Seagate drive -- 100MB/s is fine for today's single drive, but since you can't get a single drive capable of storing that amount of data... Even using 1TB drives, you'd need 32,000 of them. From tape, you'd be looking at something like 32 fully filled T680's, which would have a throughput of about 92GB/s (not that any PC could ever take it that fast). In any case, you are talking about an insane amount of throughput you'd have -- assuming that you actually had 32PB of data you needed to process, much more than the 100MB/s you quoted. Additionally, this is one of the few aspects that isn't an exponential problem. As such, this isn't likely to be the cause of a bottleneck, so I'm going to drop it; further analysis is pointless.

      As for the CPU time, it's apparent that your disk throughput will greatly exceed your capacity to actually process the data in your CPU at your "goal". As such, you will be CPU limited, and very very much so. You would (assuming this would actually become a funded project) be wise to look into massive clusters to process the data, I would suggest starting with something like looking at doing the processing on video cards where you can pack 6+ processors (3 nvidia 295 gtx's) into a single machine, each having 240 cores. You should be able to process in excess of 6.0 teraflops using that, which is about 6,000 times faster than your quote. Even with that, you are likely to need a significant number of machines in a cluster to be able to process that amount of data.

      An interesting exercise I suppose, but I would like to point out something you said that could possibly be wrong (possibly, I'm not sure, but you said it, and it assumes something I don't know to be true):
      [quote]In one scenario above, with 2 GB of RAM/computer, we were both CPU and RAM limited, so adding more cores and more threads wouldn't help.[/quote]
      That is only true if you make the assumption that adding more threads would increase the demand on RAM, which isn't necessarily true if you can coordinate multiple threads to all work on the ram that is currently loaded. Then adding more cores actually helps immensely, as you can then likely use virtual memory to hold many parts of the problem, paging in parts then letting multiple threads work on that "piece", and when all the threads have finished, load another piece. Just something to think about. That would also greatly reduce the "hours disk time" as well, since it's read once, and used multiple times internally (1200+ times if you have 1200+ threads all using it).

      [quote]Some programmers are shocked to discover that a single-threaded algorithm can out perform a multi-threaded algorithm by a large margin. Sometimes statically allocated variables outperform code using new and/or malloc by a large margin, because new and malloc are expensive calls and zero-initialized static memory is cheap.[/quote]
      I think you are trying to make a statement that says that all mu

  3. print page for less ads by A+little+Frenchie · · Score: 2, Informative
    1. Re:print page for less ads by jimktrains · · Score: 1

      "print page for FEWER ads" FTFY

      --
      "You will do foolish things, but do them with enthusiasm." - S. G. Colette
  4. Nothing new to see here... by Microlith · · Score: 5, Insightful

    So basically yet another tech writer finds out that a huge number of applications are still single threaded, and that it will be a while before we have applications that can take advantage of the cores that the OS isn't actively using at the moment. Well, assuming you're running a desktop and not a server.

    This isn't a performance issue with regards to Windows or Linux, they're quite adept at handling multiple cores. They just don't need that much themselves and the applications run these days, individually, don't need much more than that either.

    So yes, applications need parallelization. The tools for it are rudimentary at best. We know this. Nothing to see here.

    1. Re:Nothing new to see here... by thrillseeker · · Score: 2, Interesting

      Did you ever follow the Occam language? It seemed to have parallelization intrinsic, but it never went anywhere.

    2. Re:Nothing new to see here... by phantomfive · · Score: 5, Interesting
      From the article:

      The onus may ultimately lie with developers to bridge the gap between hardware and software to write better parallel programs......They should open up data sheets and study chip architectures to understand how their code can perform better, he said.

      Here's the problem: most programs spend 99% of their time waiting. MOST of that is waiting for user input. Part of it is waiting for disk access (as mentioned in the AnandTech story, the best thing you can do to speed up your computer is get a faster hard drive/SSD). A minuscule part of it is spent in the processor. If you don't believe me, pull out a profiler and run it on one of your programs; it will show you where things can be easily sped up.
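
      As a rough illustration of "pull out a profiler", here is a minimal sketch using Python's standard `cProfile` and `pstats` modules. The functions `wait_for_disk`, `compute` and `handle_request` are hypothetical stand-ins, with a short sleep playing the part of blocking I/O:

```python
import cProfile
import io
import pstats
import time

def wait_for_disk():
    time.sleep(0.05)  # stands in for a blocking disk or network read

def compute():
    return sum(i * i for i in range(10_000))  # the CPU-bound part

def handle_request():
    wait_for_disk()
    return compute()

profiler = cProfile.Profile()
profiler.enable()
value = handle_request()
profiler.disable()

# Collect the five entries with the largest cumulative time into a string.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

      In a run like this the sleep would be expected to dominate the cumulative time, which is the kind of result that tells you extra cores are not the fix.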

      Now, given that the performance of most programs is not processor bound, what is there to gain by parallelizing your program? If the performance gain were really that significant, I would already be writing my program with threads, even with the tools we have now. The fact of the matter is in most cases, there is really no point to writing your program in a parallel manner. This is something a lot of the proponents of Haskell don't seem to understand, that even if their program is easily parallelizable, the performance gain is not likely to be noticeable. Speeding up hard drives will make more of a difference to performance in most cases than adding cores.

      I for one am certainly not going to be reading chip data sheets unless there's some real performance benefit to be found. If there's enough benefit, I may even write parts in assembly, I can handle any ugliness. But only if there's a benefit from doing so.

      --
      Qxe4
    3. Re:Nothing new to see here... by 0123456 · · Score: 2, Informative

      Did you ever follow the Occam language? It seemed to have parallelization intrinsic, but it never went anywhere.

      Occam was heavily tied into the Transputer, and without the Transputer's hardware support for message-passing, it's a bit of a non-starter.

      It also wasn't easy to write if you couldn't break your application down into a series of simple processes passing messages to each other. I suspect it would go down better today now people are used to writing object-oriented code, which is a much better match to the message-passing idea than the C code that was more common at the time.

    4. Re:Nothing new to see here... by ari+wins · · Score: 5, Funny

      I almost modded you Redundant to help get your point across.

      --
      Don't worry if you're a kleptomaniac, you can always take something for it.
    5. Re:Nothing new to see here... by caerwyn · · Score: 2, Insightful

      This is true to a point. The problem is that, in modern applications, when the user *does* do something there's generally a whole cascade of computation that happens in response, and the biggest concern for most applications is that the app appear to have short latency. That is, all of that computation happens as quickly as possible so the app can go back to waiting for user input.

      There's a lot of gain that can be gotten by threading user input responses in many scenarios. Who cares if the user often waits 5 minutes before input? When he *does* do something, he wants it done *immediately*. The fact that it's a tiny percentage by wall time doesn't change the fact that responsiveness here is a massive percentage of user perception.

      --
      The ringing of the division bell has begun... -PF
    6. Re:Nothing new to see here... by jelle · · Score: 1

      As is tradition, I post before reading the article, but...

      IMHO, it's not just that programs should be multithreaded, it's that the model of processes and threads may be too primitive, too coarse, to make full use of the available sea of cores. The concept of threads and processes that each run on one core at a time may be perfect if you're mostly core-limited, but I can think of other OS and compiler-supported parallelism that I would like to see. What if you have 64 cores/threads and only a couple of programs to run? You'll quickly have 40+ idle cores... So why not add a scheduling paradigm where you can dedicate clusters of cores to a program for a timeslice?

      For example, if I would be able to tell the OS that my app would like N cores reserved and ready for me for every timeslice, then the compiler would probably be able to use parallelism at a much finer grain than threads (Perhaps with some hardware-support later on).

      Nowadays, a compiler can do all sorts of optimizations, unroll loops, eliminate unnecessary calculations, etc. The core then takes the instructions and tries to run as many of them as possible, re-ordering and executing in parallel where possible.

      What if my compiler would detect that I have two independent function calls, the results of which are combined after they finish? Why would the compiler then not be able to make a binary that would request 2 cores per timeslice and where core 0 would run the first function in parallel with core 1 running the other function?

      Sure, the two cores won't be 100% utilized, but that second core would sit idle in the system otherwise too, and this way my program will run faster. Most likely my system will be memory-bandwidth limited before it becomes core limited.
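
      For comparison, this is roughly what the "two independent function calls, results combined afterwards" case looks like when done by hand with today's thread tools, sketched in Python. `left_half`, `right_half` and `combined` are hypothetical names, and on standard CPython builds the two threads would not actually run CPU-bound code on two cores at once, so this only shows the shape of the fork and join:

```python
from concurrent.futures import ThreadPoolExecutor

def left_half(values):
    """First independent call: works only on the first half."""
    return sum(values[: len(values) // 2])

def right_half(values):
    """Second independent call: works only on the second half."""
    return sum(values[len(values) // 2:])

def combined(values):
    # Fork: hand each independent call to its own worker.
    with ThreadPoolExecutor(max_workers=2) as pool:
        a = pool.submit(left_half, values)
        b = pool.submit(right_half, values)
        # Join: combine the two results once both have finished.
        return a.result() + b.result()

total = combined(list(range(101)))
print(total)
```

      The submit and result calls are the manual synchronization; the idea above is that a compiler and scheduler would insert the equivalent automatically and at far lower latency.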

      When I try to do that with threads, I need to manually add my synchronization, plus I have to contend with scheduling latency because the OS might not allocate that second core for me on time (it has to learn about the un(b)lock/cond_signal of the thread, find an available core, switch the context, activate the thread, and the core has to load its L1 cache, etc etc...). By assigning more than one core to the process, fully dedicated to it, guaranteed to be available at a single clock cycle latency, I can make much better use of parallelism otherwise lost.

      In the time of gigahertz processors, it's not unthinkable that I would have a million of such possible parallelisms per second in a program that, with threads, would be extremely hard to parallelize.

      Torvalds et al, and gcc team-members, please be inspired and take us all to the next level ;-)

      --
      --- Hindsight is 20/20, but walking backwards is not the answer.
    7. Re:Nothing new to see here... by palegray.net · · Score: 1

      So yes, applications need parallelization. The tools for it are rudimentary at best. We know this. Nothing to see here.

      You're right that we've known this for quite some time now. The issue is the simple fact that this problem remains a hard one to solve, with very little real progress in the last few years, while CPU technology is continuing to pull ahead of effective tools to utilize that power. The problem is getting worse, and getting worse at an increasing rate. That's the story here, and I do consider that something worth drawing more attention to.

    8. Re:Nothing new to see here... by DarkOx · · Score: 1

      I don't know about doing it at the compiler level, but you might do something at the chip level. Branch prediction essentially continues inserting instructions into the pipeline for one branch or the other (or both, on some really advanced CPUs) before a branching instruction like beq can be completed. If the branch taken is different from the predicted one, the pipeline is flushed (stall).

      Suppose you did not expose all the physical cores to the OS, but instead exposed a smaller number of logical cores, with some specialized branch-prediction core-switching unit on the front end. With additional cores you have full sets of registers available, so you can likely execute entire instructions ahead of a slow branch operation completing, by giving each of the operands to two of the spare cores. The branch unit then makes the active core the one that was doing the next instruction based on the result of the branch; the core doing the branch eval and the core doing the wrong branch get the registers cloned from the core doing the right branch and are ready to be used as the work-ahead units for the next branch.

      Now this requires a 3-to-1 physical core to logical core ratio, but if we are talking about 80-core chips that might be reasonable. The branch prediction unit might be a bit of Si in and of itself too, though. Could get expensive.

      --
      Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
    9. Re:Nothing new to see here... by phantomfive · · Score: 1

      There's a lot of gain that can be gotten by threading user input responses in many scenarios. Who cares if the user often waits 5 minutes before input? When he *does* do something, he wants it done *immediately*.

      Exactly. But you are way better off figuring out exactly what is slowing the program down, rather than randomly throwing parallelization at the problem. In most cases it is not actually a problem of not having enough CPU power, it's a problem with disk latency, or something else. That is where the profiler comes in, it helps to determine exactly what is going on. In the rare cases that the bottleneck is various tasks waiting on the CPU when there is another one available, only then is it acceptable to parallelize things. And we have the tools now to do so.

      Often when people try to optimize without using a profiler or without understanding the problem, they just waste time while solving nothing (or making things worse). This is what Knuth meant when he said, "premature optimization is the root of all evil."

      --
      Qxe4
    10. Re:Nothing new to see here... by jelle · · Score: 1

      A lot of these things will be much more effective compile-time than run-time.

      The thing is, for these kinds of analysis/optimizations, a compiler can do everything that hardware can do, plus it can look into a much larger context, and do much more complex analysis, because it runs only the single time when the binary is built, not each time it runs.

      Hardware costs actual silicon real-estate, and logic costs time, slowing down the core's clock. A compiler has, relatively, all the time in the world for all the complexity you can think of.

      A lot of software people quickly think "why don't they 'simply' make an instruction for this", without realizing that in hardware it will cost N transistors (die size, chip cost), T nanoseconds (1/GHz clock speed), and generate P heat (TDP, operating cost, etc)...

      --
      --- Hindsight is 20/20, but walking backwards is not the answer.
    11. Re:Nothing new to see here... by AuMatar · · Score: 1

      There's a problem with Knuth's truism as well. If you don't consider performance early, you'll make architectural decisions which will end up being very non-performant, and will require high amounts of effort to fix (if it's even possible without a complete dump and rewrite). You need to think about performance at every stage of design and development. The trick is that each stage requires a different degree and area of worry, and only experience will teach you how much and when. But blindly following the Knuth doctrine of don't worry about it until the end actually puts you in a worse position than those who wasted time optimizing the wrong thing.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    12. Re:Nothing new to see here... by nschubach · · Score: 1

      True, the hard drive may be a slow point, but if you "thread out" a loader procedure to cache that data in memory for faster access, you can alleviate some of that waiting. The hard part is figuring out what to cache. Another hard part is describing to the user that your program isn't eating up all their memory because you have a leak. The other method is deciding if generation of data is quicker than saving and loading data from disk. What I'm talking about are things like texture generation for games and such. If you can task an 8 core processor to generate textures in memory at a quicker speed than loading it from disk, why not do that?

      I remember from working with my old 386sx that storing COS/SIN calculations for a set range of values in an array was much faster than letting the processor figure it out each time. I haven't tried to test if this is still the case today, but I imagine it's not. My point being, there may be things that people rely on data files for in order to speed up processes when they aren't needed today.
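
      The lookup-table trick described above can be sketched as follows in Python. The one-degree resolution and the names `SIN_TABLE`, `fast_sin` and `fast_cos` are assumptions for illustration; whether the table still beats calling `math.sin` directly is exactly the thing to measure (for example with `timeit`) rather than assume:

```python
import math

TABLE_SIZE = 360  # one entry per whole degree (an arbitrary resolution)

# Pay for the trigonometry once, up front.
SIN_TABLE = [math.sin(math.radians(d)) for d in range(TABLE_SIZE)]

def fast_sin(degrees):
    """Sine of a whole number of degrees via table lookup."""
    return SIN_TABLE[degrees % TABLE_SIZE]

def fast_cos(degrees):
    """Cosine reuses the same table, since cos(x) = sin(x + 90 degrees)."""
    return SIN_TABLE[(degrees + 90) % TABLE_SIZE]

print(fast_sin(30), fast_cos(60))
```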

      It's an interesting world of IT we are coming into.

      --
      Every time I start to have faith in humanity, I ruin it by driving to work between 7 and 8 am.
    13. Re:Nothing new to see here... by phantomfive · · Score: 1

      Sure, but that is related neither to the content of my post nor to the meaning of the Knuth quote.

      --
      Qxe4
    14. Re:Nothing new to see here... by caerwyn · · Score: 3, Interesting

      I don't entirely agree with you here. A lot of current applications *do* suffer from CPU-induced latency after user interactions, and the problem is simple: they don't differentiate between the things that must get done before control is returned to the user, and the things that need to happen in response to the action but can be allowed to happen whenever resources are free. Even when the problem is resource-access latency, multithreading can be a win because that latency no longer contributes to the latency that the user perceives if it happens on a background thread.

      Something as simple as tossing function calls off on a background thread to deal with some of these tasks would do a great deal to improve latency from the user's perspective, and is really quite trivial to implement. Most programmers don't do it, though. Part of that is that in most situations there aren't ready-made solutions: you can't just say "run this function call on a background thread", you've got to go through the pthread creation process, etc. (Apple's Cocoa framework is actually an exception to this with its NSOperation).

      The situation is analogous to that of an interrupt task: Do absolutely as little as possible before returning; everything else should happen on some other thread.

      I agree with you regarding optimization, but it's been my experience that many applications *can* benefit from these sorts of simple multithreading techniques- the programmers just don't do them, either from lack of ability or lack of resources.

      --
      The ringing of the division bell has begun... -PF
    15. Re:Nothing new to see here... by caerwyn · · Score: 1

      This is very true, and I think this is part of the reason that many current applications have such poor performance- they get to the end stage and realize that they can't fix the problems without more resources than they have left, so they just throw up their hands and say "screw it".

      --
      The ringing of the division bell has begun... -PF
    16. Re:Nothing new to see here... by ultranova · · Score: 1

      But you are way better off figuring out exactly what is slowing the program down, rather than randomly throwing parallelization at the problem. In most cases it is not actually a problem of not having enough CPU power, it's a problem with disk latency, or something else.

      Actually, no. What you want is to have the UI react as fast as possible to user input. The easiest way to do that is to have the UI use a thread of its own, which does nothing but block on reading user input (or a screen redraw request) and upon receiving it immediately gives feedback by updating the relevant part of the screen. The alternative is to have the program interrupt any long-running task every n microseconds to check for user input, which is not only more complicated but also needlessly wastes time when no such input is available.

      In the rare cases that the bottleneck is various tasks waiting on the CPU when there is another one available, only then is it acceptable to parallelize things.

      The issue isn't about execution time, it's about latency. The term "bottleneck" isn't applicable here. What we want is that the UI input function gets called as soon as possible when new input arrives; it is impossible to beat a thread that's blocking on read on that regard. It is also the simplest solution which achieves low latency, at least for any program with long-running background tasks. So I'd say that it's not using a separate UI thread that requires justification.

      Often when people try to optimize without using a profiler or without understanding the problem, they just waste time while solving nothing (or making things worse).

      The problem is that it's nigh-impossible to guarantee that the execution time of a (sub)task has an upper bound.

      This is what Knuth meant when he said, "premature optimization is the root of all evil."

      This isn't a matter of optimization, it's about designing the program correctly from the start.

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    17. Re:Nothing new to see here... by phantomfive · · Score: 1

      You are right, that is a useful optimization that can help improve latency, but think about it: on a 2 GHz machine, how many instructions can be executed in a millisecond? A whole lot of them. For the most part it's not worth worrying about. We can argue about how many programs have latency that could be helped with this technique, but unless you want to go out and do a survey of some open source projects and measure the latency of various programs due to CPU time, then we aren't really going to come to a conclusion as to what percentage would be actually helped.

      Make sure you put your stuff in a profiler before you draw conclusions. Drawing conclusions without measuring first is a guaranteed way to be wrong.

      --
      Qxe4
    18. Re:Nothing new to see here... by Coryoth · · Score: 1

      I suspect it would go down better today now people are used to writing object-oriented code, which is a much better match to the message-passing idea than the C code that was more common at the time.

      Indeed, it isn't that hard to take the OO paradigm and simply say that a method call is a message being passed from the caller object to the object for which the method is being applied. Add some niceties to define which objects can be executed in parallel and method preconditions as wait conditions and you have a fairly simple way to program in object-oriented fashion and have the compiler do the work to make it multithreaded. See SCOOP for the basic ideas. Sadly, since it was proposed for Eiffel rather than C++ (and no-one in the C++ world bothers to look at what other languages are doing), I doubt it will actually get any traction and catch on.

    19. Re:Nothing new to see here... by phantomfive · · Score: 1

      I'm not sure you know what you're talking about. You seem to be addressing the idea of polling vs blocking wait, which has some relevance in network and command line programs, but has very little to do with GUI applications. In fact, I'm not convinced you've ever actually programmed a GUI application, because you seem quite ignorant of the event driven model. Most programs are not going to be helped by the model you described, since they spend most of their time blocked waiting for user input anyway. Finally your comments in response to my quotes at times showed a complete lack of understanding of what I wrote originally, from which I conclude you are trolling ignorantly.

      --
      Qxe4
    20. Re:Nothing new to see here... by Raenex · · Score: 1

      Something as simple as tossing function calls off on a background thread to deal with some of these tasks would do a great deal to improve latency from the user's perspective, and is really quite trivial to implement. Most programmers don't do it, though. Part of that is that in most situations there aren't ready-made solutions: you can't just say "run this function call on a background thread", you've got to go through the pthread creation process, etc. (Apple's Cocoa framework is actually an exception to this with its NSOperation).

      That's a recipe for disaster. You can't just run arbitrary functions on background threads. You have to make sure they are thread-safe, and it is NOT trivial to do this. Plus any mistakes you make will wind up as bugs that are hard to reproduce and cause bizarre behavior. Also, even if you verify the code is thread-safe, every maintenance programmer coming after you (and possibly you yourself) has a good chance of writing some code to break this thread-safety.

      Writing concurrent code is hard, and that is why we continue to see articles about it. The best advice when it comes to writing threaded code is to use as little as possible.

    21. Re:Nothing new to see here... by caerwyn · · Score: 1

      It's really not that hard if you've got a decent design. Believe me, I've done a lot of it. Is it trivial? No. But any programmer worth his salt at this point should know how to do it.

      The "threaded programming is hard!" whine is what's got us into the situation where there's a disconnect between the hardware we have available and the design of the software.

      It really isn't that hard. If you have your data reasonably managed and your system reasonably abstracted, then determining the portions that can safely run on other threads (or safely with some synchronization) should be easy. If it's not, your design is borked to begin with and should be rethought.

      Debugging threads really isn't that bad, either, tbh. I've spent more time tracking down obscure memory issues in C/C++ code than I have tracking down threading issues in any language.

      --
      The ringing of the division bell has begun... -PF
    22. Re:Nothing new to see here... by owlstead · · Score: 1

      Using a profiler? Just use a RAM drive for anything that you write, especially log files. I've seen that for more complex situations, logging to a RAM drive really helps. If possible, put the whole thing in RAM and see the speedup. Note that if information is read once, it is normally in the RAM already because of disk caching, so running an application (with the same config) twice and making sure you log/write into RAM drive already does the trick.

    23. Re:Nothing new to see here... by Trogre · · Score: 1

      You've never used Firefox, OpenOffice.org or Python, have you?

      --
      "Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
    24. Re:Nothing new to see here... by Raenex · · Score: 1

      I've done it too, and your experience doesn't match mine. It is far too easy to make a mistake that won't be caught in testing. The compiler won't help you. Testing might not trip it.

      It really isn't that hard. If you have your data reasonably managed and your system reasonably abstracted, then determining the portions that can safely run on other threads (or safely with some synchronization) should be easy. If it's not, your design is borked to begin with and should be rethought.

      Yeah right, and you know what every library call does with respect to thread-safety? Your design won't be accidentally violated later on down the road? Chances are, if you are a heavy user of threads, you have bugs that are not reproducible and people dismiss these bugs as one-offs.

      The single most important rule of threading is to require as little as possible and to isolate your use of it. It's just toxic waste.

    25. Re:Nothing new to see here... by Anonymous Coward · · Score: 0

      I remember programming in Occam on the Transputer in the ol' good uni days. Well, it wasn't that long ago, 10 or so years ago I would say...

    26. Re:Nothing new to see here... by caerwyn · · Score: 1

      If you don't know how your library deals with thread safety, then you either need a new library or you need to read the docs.

      The single most important rule of threading is to have very well defined data interfaces, and never violate them. That's it. Claiming that threading is "toxic waste" is, I'm sorry, simply wrong.

      Everything is moving toward more threading, not less. We need programmers who can begin thinking in that mode rather than those who just write it off. Threading doesn't *have* to be complicated unless it's in a design that's screwed up from the start and there's insufficient data abstraction and isolation. Once there is, if your design has simple and consistent synchronization requirements, following them should be trivial. If it's not, you've screwed up in the design phase.

      The fact that you *can* screw up threaded code is no reason not to use it. You can screw up a lot of things in code that won't be caught in testing or at compile time, but we do them because they give benefit or because they're necessary. Threading's the same way.

      --
      The ringing of the division bell has begun... -PF
    27. Re:Nothing new to see here... by Raenex · · Score: 2, Insightful

      There are plenty of good designs that work in a single-threaded environment that do not in a multi-threaded environment. It's just a completely different ballgame when you allow multiple threads to be running on the same piece of code. With threading, the complexity goes up an order of magnitude and so does the penalty for failure.

      Anyways, I'm out. This is the standard debate about "good" programmers and "good" designs vs dangerous techniques that should be avoided.

    28. Re:Nothing new to see here... by The+Mighty+Buzzard · · Score: 1

      On the up side, adding more cores rather than ramping up clock speed led to Microsoft overestimating how their newest OS would run on the hardware that would be out at release time (Vista). The rampant pissed offedness from that debacle led to a version of Windows (7) that actually is faster than the previous one, rather than just claiming it is.

      Who knows, another few years of idling around 4GHz and they may actually put out something worth using for something besides games.

      --
      Violence is like duct tape. If it doesn't solve the problem, you didn't use enough.
    29. Re:Nothing new to see here... by mattcasters · · Score: 1

      Just because "most programs" are likely desktop applications in YOUR situation, that doesn't mean that this is the most important class of software around with respect to this discussion. Then again, it does explain the evolution in CPU design where CPUs not only need to be capable of releasing a lot of processing power, they also need to be able to lie low and consume as little power as possible when needed.

      --
      News about the Kettle Open Source project: on my blog
    30. Re:Nothing new to see here... by cowbutt · · Score: 1

      I for one am certainly not going to be reading chip data sheets unless there's some real performance benefit to be found. If there's enough benefit, I may even write parts in assembly, I can handle any ugliness. But only if there's a benefit from doing so.

      I agree; reading processor data sheets should only really be necessary for compiler and kernel programmers; for everyone else, pick the right algorithm, implement it efficiently and the compiler should turn it into efficient machine code that runs efficiently under the target OS kernel.

    31. Re:Nothing new to see here... by cowbutt · · Score: 1

      I agree with you regarding optimization, but it's been my experience that many applications *can* benefit from these sorts of simple multithreading techniques- the programmers just don't do them, either from lack of ability or lack of resources.

      There can also be usability issues; if a user plans to do thing A (a slow process), followed by thing B, they may be somewhat perturbed when initiating thing A apparently returns control immediately, and some aspect of thing A that they are monitoring hasn't been done. That looks a lot like the application failing to carry out an operation for some unknown reason. So they'll probably queue up another request of thing A. Eventually, the queue will complete, and they'll get a flurry of duplicate thing A operations which need to be undone before thing B can be done.

    32. Re:Nothing new to see here... by hey! · · Score: 1

      Actually, parallelizing your algorithms isn't the only way to parallelize a software system. If it were, then adding cores would be pretty much a no-brainer.

      The simplest and most common way to parallelize your system is through system architecture. If you have a program which manages data, instead of writing your own reading and writing of files, you let a database management system handle that. On reads, you just do the reads asynchronously. On writes, it's even easier: just tell the database to write the data and assume that not only the writing is done, but any kind of journaling and logging as well.

      Certainly other kinds of services besides data persistence could be handled this way, and a few are (like messaging). We can imagine computationally difficult tasks like image processing, pattern matching, or inference generation being handled by separate processes.

      There are few cases, although they are important, where a task can be effectively broken up into threads that can run largely independent of each other. In many more cases there is some benefit, but dependencies between threads make any advantage something like logarithmic (I'd guess) in the number of cores. But I also suspect that within the context of an entire software system, the otherwise useful practice of handing complex tasks off to somebody else (like a database platform) could provide opportunities for exploiting more cores.

      --
      Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
    33. Re:Nothing new to see here... by Anonymous Coward · · Score: 0

      What you're describing reminds me of the Amiga days. A lot of apps were really just one "task," but the way the UI library (Intuition) did things, most programs were effectively split into two, and just increasing that number from one to two made a huge difference in responsiveness. The user clicked a button and Intuition would show the button going down and then back up instantly, and would be ready to deal with the next user event immediately. Meanwhile, the actual app is handling that event asynchronously, and the user can go on clicking buttons or picking menu options without having to wait for the app to do whatever it's doing, and the events just queued up and got dealt with whenever the app caught up. Things weren't really "tossed off to a background thread" (though they could be, but in practice usually weren't), but from the user's point of view, it sort of looked like they were, making the computer "feel" faster than machines with ten times the processing power.

      Somewhere along the way, even the Amiga lost this as many apps moved to fancier libraries (e.g. MUI) that looked cooler (to keep up with the Joneses) but did all their work in the app's task, so they were less "asyncy" unless the app programmer went to the trouble to split things up (and many programmers were lazy). The OS came with a great solution, and then lazy programmers threw it away. :(

    34. Re:Nothing new to see here... by shmlco · · Score: 1

      If the operation is solely processor-bound that's one thing. But access any external resource like a disk drive or (worse) a network, and you have an entirely different story. Hell, typing a single character into a window can trigger word-wrapping, pagination, spelling correction, force a screen redraw, and more.

      --
      Any sect, cult, or religion will legislate its creed into law if it acquires the political power to do so.
    35. Re:Nothing new to see here... by shmlco · · Score: 1

      "That's a recipe for disaster."

      Actually, it's not. If you have a UI thread that queues up actions to a single background "worker" thread, then you've got a responsive UI and you've arranged for the work to happen in order of execution.
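
      For what it's worth, here is a minimal sketch of that arrangement in Python (my choice of language, not anything from the thread). The names run_worker and demo are made up for illustration, and the main thread stands in for the "UI thread":

```python
import queue
import threading


def run_worker(actions, results):
    """Drain the queue in FIFO order until a None sentinel arrives."""
    while True:
        action = actions.get()
        if action is None:  # sentinel: no more work
            break
        results.append(action())


def demo():
    actions = queue.Queue()
    results = []
    worker = threading.Thread(target=run_worker, args=(actions, results))
    worker.start()
    # The "UI thread" (here, the main thread) only enqueues and returns at once.
    for n in range(5):
        actions.put(lambda n=n: n * n)
    actions.put(None)
    worker.join()
    return results
```

      Because there is exactly one worker pulling from one FIFO queue, the actions run in the order they were queued, which is the whole point of the design.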

      You can extend this out for things like web browsers with a UI thread, a thread for each page/window, and additional threads that queue up resource requests and handle them.

      "The best advice when it comes to writing threaded code is to use as little as possible."

      Or hire someone who's read more than a single book on PHP and who knows what the hell they're doing.

      --
      Any sect, cult, or religion will legislate its creed into law if it acquires the political power to do so.
  5. Just windows and linux? by Anonymous Coward · · Score: 0

    get a mac..

    1. Re:Just windows and linux? by zurtle · · Score: 1

      Actually, you should get a perm.

      --
      Couldn't stand the weather
  6. Dolphins still missing! by pm_rat_poison · · Score: 1

    Is this just me, or is this a classic piece of non-news on a par with the one the post subject is in reference to?
    I mean, isn't it a typical and completely rational technological modus operandi that hardware developments come first and software implementations take some time to emerge (with the possible exception of specialized applications)?
    I mean, imagine software being developed for imaginary or speculative hardware. Sounds like a big waste of time to me...

    1. Re:Dolphins still missing! by Asic+Eng · · Score: 2, Insightful

      Parallel computing and parallel hardware have been around for decades - not on the desktop, but in the supercomputer area. It's a tough problem to solve efficiently - there are some things which are hard to get around. As an example think of the equation y = SQRT(a*b) - you need two mathematical operations there. It doesn't really help if you have two processors, since you need the result of one operation before you can perform the second. The example isn't very interesting, but essentially you always have this problem - if you rely on the result of the previous steps, then you need to do things in order. You can modify your algorithms so that happens less often, but this is hard work and interferes with your desire to write clean readable code.
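
      To make the "modify your algorithms" point concrete, here is a rough Python sketch using a sum instead of the SQRT example (that substitution is mine). Adding numbers left to right is a chain where every step waits on the previous one; adding them pairwise in rounds gives the same total, but the additions inside a round don't depend on each other. The function names are hypothetical, and nothing here actually runs in parallel; it only counts the dependent steps:

```python
def chain_depth_sequential(n):
    """Left-to-right sum of n values: each add waits on the previous one."""
    return max(n - 1, 0)


def pairwise_sum(values):
    """Sum by combining neighbours in rounds; returns (total, rounds)."""
    values = list(values)
    rounds = 0
    while len(values) > 1:
        # All additions in one round are independent of each other,
        # so with enough cores a round could be done in one step.
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # odd element carries over unchanged
        values = paired
        rounds += 1
    return (values[0] if values else 0), rounds
```

      For eight values the chain is seven dependent adds, while the pairwise version needs three rounds. The price is exactly what I described: the code is less obvious than a plain loop.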

  7. Clarification please by cjalmeida · · Score: 1

    Is TFA talking about the Linux or Windows thread and scheduling not good enough for 4+ cores (so your programs no matter how well designed will not benefit from more cores), about being damn hard to split, thread and join tasks, or both?

    1. Re:Clarification please by bucky0 · · Score: 2

      I guess you could read it and find out...

      seriously?

      --

      -Bucky
    2. Re:Clarification please by cjalmeida · · Score: 1

      I did. And it messes things up. That's why I asked.

    3. Re:Clarification please by tepples · · Score: 2, Insightful

      Is TFA talking about the Linux or Windows thread and scheduling not good enough for 4+ cores (so your programs no matter how well designed will not benefit from more cores), about being damn hard to split, thread and join tasks, or both?

      I understood the article to refer to the latter. The programming languages that are popular for desktop applications as of the 2000s don't have the proper tools (such as an unordered for-each loop or a rigorous actor model) to make parallel programming easy.
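
      As a rough illustration of what an unordered for-each could look like, here is a sketch in Python using the standard concurrent.futures module. The helper name unordered_for_each is hypothetical; it is just one way to express "apply this to every item, in whatever order the workers finish":

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def unordered_for_each(func, items, workers=4):
    """Apply func to every item; results come back in completion order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(func, item) for item in items]
        # as_completed yields each future as soon as it finishes,
        # not in the order the items were submitted.
        return [future.result() for future in as_completed(futures)]
```

      The caller has to promise that the iterations don't depend on one another, which is exactly the guarantee an ordinary for loop can't express. Note that in standard CPython, threads won't speed up CPU-bound pure-Python work because of the global interpreter lock; a process pool would be the usual substitute there.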

    4. Re:Clarification please by Anonymous Coward · · Score: 0

      You're the victim of a bait and switch. TFA doesn't even use the words "Linux" or "Windows".

    5. Re:Clarification please by godrik · · Score: 1

      No, the article just states that most applications are still single threaded.

      There is no low-level consideration in the article. However, there is a lot to say. I believe that the current threading model in operating systems does not allow good performance. We need the system scheduler to be aware of the computation done by each thread so that it can optimize the allocation of threads to computational units.

      The article is quite misleading. It tends to make the reader believe that almost no tools exist. However, there are a tremendous number of algorithmic techniques and software stacks for dealing with massively parallel environments. For most problems, OpenMP is efficient enough and it is a well-known tool. Dedicated languages such as Cilk or Cilk++ were developed and they proved to be efficient and easy to use.

      For most of the parallel parts of classical applications, they can lead to good speedups. But everyone should be aware that parallelization is not only a software problem. Not all programs can achieve a linear speedup, and designing parallel algorithms is much more complicated than designing sequential ones.

      The problem is not (only) a software one.

    6. Re:Clarification please by b4dc0d3r · · Score: 1

      Slashdot, where the readers are master switchers and the editors are master baiters.

  8. There's a simple paradigm here by mysidia · · Score: 5, Interesting

    Multiple virtual machines on the same piece of metal, with a workstation hypervisor, and intelligent balancing of apps between backends.

    Multiple OSes sharing the same cores. Multiple apps running on the different OSes, and working together.

    Which can also be used to provide fault tolerance... if one of the worker apps fails, or even one of the OSes fails, your processor capability is reduced, a worker app in a different OS takes over, use checkpointing procedures, and shared state, so the apps don't even lose data.

    You should even be able to shut down a virtual OS for Windows updates without impact, if the apps that arise get designed properly...

    1. Re:There's a simple paradigm here by Anonymous Coward · · Score: 0

      So basically you want a mainframe... ?

  9. Huh? by Samschnooks · · Score: 5, Funny

    ...programmers are to blame for that

    The development tools aren't available and research is only starting."

    Stupid programmers! Not able to develop software without the tools! In my day we wrote our own tools - in the snow, uphill, both ways! We didn't need no stink'n vendor to do it for us - and we liked it that way!

    1. Re:Huh? by Jurily · · Score: 1

      Stupid programmers! Not able to develop software without the tools! In my day we wrote our own tools - in the snow, uphill, both ways! We didn't need no stink'n vendor to do it for us - and we liked it that way!

      Yeah, and we've been crippled with them ever since. Also, their correctness and bug-freeness were the stuff of legends.

      P.S. I did get the joke, thank you.

    2. Re:Huh? by LeafOnTheWind · · Score: 1

      They're not looking for vendor tools to come out of it, they're looking for research to come out of it to help build such a tool.

      http://www.haskell.org/haskellwiki/GHC/Data_Parallel_Haskell

      http://manticore.cs.uchicago.edu/

    3. Re:Huh? by Anonymous Coward · · Score: 0

      And the ls that was written while the Siberian wind blew down our necks spawned a thread for every stat. That was a truly well-scaled program.

    4. Re:Huh? by Anonymous Coward · · Score: 0

      and we did it all with only 64k of ram! nobody should ever need more than 64k of ram!

    5. Re:Huh? by shutdown+-p+now · · Score: 1

      The tools are out there, and have been available for a long time, even for the more conventional languages. OpenMP, to name one. Of the upcoming non-"research" stuff, see Parallel LINQ (for .NET) and PPL (for C++) from Microsoft, and the Java fork-join API.

  10. The article's turning a real problem into FUD. by davecb · · Score: 5, Informative

    Firstly, it's false on the face of it: Ubuntu is certified on the Sun T2000, a 32-thread machine, and Canonical is supporting it.

    Secondly, it's the same FUD as we heard from uniprocessor manufacturers when multiprocessors first came out: this new "symmetrical multiprocessing" stuff will never work, it'll bottleneck on locks.

    The real problem is that some programs are indeed badly written. In most cases, you just run lots of individual instances of them. Others, for grid, are well-written, and scale wonderfully.

    The ones in the middle are the problem, as they need to coordinate to some degree, and don't do that well. It's a research area in computer science, and one of the interesting areas is in transactional memory.

    That's what the folks at the Multicore Expo are worried about: Linux itself is fine, and has been for a while.

    --dave

    --
    davecb@spamcop.net
    1. Re:The article's turning a real problem into FUD. by m1ss1ontomars2k4 · · Score: 1

      I dunno, I'm not feeling particularly fearful or doubtful after reading the article.

    2. Re:The article's turning a real problem into FUD. by cowbutt · · Score: 4, Funny

      I dunno, I'm not feeling particularly fearful or doubtful after reading the article.

      The article has, apparently, sown Uncertainty in you, however, so it was 33.3% successful.

    3. Re:The article's turning a real problem into FUD. by Samschnooks · · Score: 1

      The real problem is that some programs are indeed badly written. In most cases, you just run lots of individual instances of them. Others, for grid, are well-written, and scale wonderfully.

      The article refers to applications programmers so I am assuming you mean applications when referring to "programs".

      Dealing with multiple cores is the operating system's problem - not the application's. If the programmer uses multiple threads or processes, then it should be the OS that worries about allocating resources among the cores.

    4. Re:The article's turning a real problem into FUD. by 99BottlesOfBeerInMyF · · Score: 1

      Firstly, it's false on the face of it: Ubuntu is certified on the Sun T2000, a 32-thread machine, and Canonical is supporting it.

      Actually, the article talks about application development tools for various OS's and their poor support for creating good, multithreaded applications for Linux or Windows. I didn't see much in the way of complaints about the OS using the CPUs, just about the ability of applications to do so given the current toolset.

      I think OS's can do more to make multi-cores work better with existing software. For example, OS X can take any OpenGL program and spawn a thread for feeding data to the GPU, without the software having been rewritten to accommodate that change. In this way some existing applications can theoretically be twice as fast when running on a multi-core system. That sort of innovation at the OS level is as important as changes to developer tools for the transition to larger numbers of cores.

    5. Re:The article's turning a real problem into FUD. by tepples · · Score: 2

      Dealing with multiple cores is the operating system's problem - not the application's. If the programmer uses multiple threads or processes, then it should be the OS that worries about allocating resources among the cores.

      But the problem of TFA is that desktop applications don't use enough threads or processes. If the programmer hasn't split an application into multiple threads or processes, then there usually isn't more than one thread of one process that wants to run at any given time, and there is nothing for the operating system to schedule.

    6. Re:The article's turning a real problem into FUD. by AuMatar · · Score: 1

      And if the app isn't written to assume that there will be a separate thread, you're going to be up shit creek with synchronization problems and race conditions. Unless you mean that OSX provides a library function StartOpenGLThread which the programmer has to call- in which case the app does need to be rewritten to use it.

      Undoubtedly over time libraries will be written that will make some things easier. But doing things "automatically" that the programmer doesn't expect isn't going to work, it's just going to cause bugs.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    7. Re:The article's turning a real problem into FUD. by 99BottlesOfBeerInMyF · · Score: 1

      And if the app isn't written to assume that there will be a separate thread, you're going to be up shit creek with synchronization problems and race conditions.

      Not really. You see, programs using OpenGL already have to hand off data to the GPU, so they are already coded with that in mind. Since they don't know how fast the GPU will process things, they already account for the fact that they don't know how long it will be before it is returned. OS X just starts a process dedicated to feeding the GPU, so that if the program happens to be CPU bound and that is the bottleneck (a fairly common occurrence to some degree), it removes that bottleneck by leveraging another core. Theoretically, this could double the performance of those apps, although realistically you see much more modest gains.

      But doing things "automatically" that the programmer doesn't expect isn't going to work, it's just going to cause bugs.

      Umm, it's been working in OS X for over a year and I don't see any of these bugs you theorize.

    8. Re:The article's turning a real problem into FUD. by cratermoon · · Score: 1

      This really is the core of the issue. An application that was written before multicores were common will not necessarily have partitioned up its work in a way that can take advantage of true parallelism. See Amdahl's Law and Gustafson's Law. In general, writing good parallel algorithms is a Hard Problem, and simply pushing the work onto the O/S only gets a slight speedup, hardly worth the cost and effort.
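
      For anyone who hasn't looked them up, both laws fit in a few lines. This is a quick Python sketch (the function names are mine) of the textbook formulas, where parallel_fraction is the share of the work that can be spread across cores:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl: fixed problem size, the serial part limits the speedup."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


def gustafson_speedup(parallel_fraction, cores):
    """Gustafson: the problem grows with the core count (scaled speedup)."""
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction * cores
```

      With 90% of the work parallelizable and 8 cores, Amdahl gives roughly 4.7x for a fixed-size job, while Gustafson gives 7.3x if you scale the job up to keep every core busy. Desktop apps that were never partitioned sit much closer to the low end.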

    9. Re:The article's turning a real problem into FUD. by Anonymous Coward · · Score: 0

      Aye, and Solaris runs on 128-thread units (dual UltraSparc T2+: 8 cores, 8 execution threads per core, 64 threads per CPU, times 2 CPUs).

    10. Re:The article's turning a real problem into FUD. by Anonymous Coward · · Score: 0

      The article makes it abundantly clear, had you actually read it, that the limitation is within the programs and not in the OS. I believe it named the OSes for scandal value only, as they (or their kernels, at least) are rather irrelevant to its point.

    11. Re:The article's turning a real problem into FUD. by davecb · · Score: 1

      What I said! --dave

      --
      davecb@spamcop.net
    12. Re:The article's turning a real problem into FUD. by Anonymous Coward · · Score: 0

      *cough*

      It did bottleneck on locks.

      http://en.wikipedia.org/wiki/Amdahl%27s_law#Parallelization

    13. Re:The article's turning a real problem into FUD. by davecb · · Score: 1

      Indeed it does! That was the motivation for my comment on transactional memory.

      --
      davecb@spamcop.net
  11. Article not as bad as summary by Protoslo · · Score: 1

    The article doesn't really say that Windows and Linux aren't "designed" for quad+ core chips; it just says that most software is still single threaded. No kidding.

    1. Re:Article not as bad as summary by owlstead · · Score: 1

      The problem with the actual article is that it doesn't say *anything*. This is another Slashdot joke, and it is starting to get irritating. The discussion can be fun none-the-less, but I presume this is the firehose effect? Or can we just squarely blame the editors?

  12. Example: Scripting Languages by mcrbids · · Score: 3, Interesting

    Languages like PHP/Perl, as a rule, are not designed for threading - at ALL. This makes multi-core performance a non-starter. Sure, you can run more INSTANCES of the language with multiple cores, but you can't get any single instance of a script to run any faster than what a single core can do.

    I have, so, so, SOOOO many times wished I could split a PHP script into threads, but it's just not there. The closest you can get is with (heavy, slow, painful) forking and multiprocess communication through sockets or (worse) shared memory.

    Truth be told, there's a whole rash of security issues through race conditions that we'll soon have crawling out of nearly every pore as the development community slowly digests multi-threaded applications (for real!) in the newly commoditized multi-CPU environment.

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.
    1. Re:Example: Scripting Languages by mcrbids · · Score: 1

      PS: Strikes me that this is a good time for us to truly develop cross-core, cross-system, and cross-cluster application development by unifying cross-thread, cross-core, and cross-system communications under a common API.

      Really, as a clustered application developer, why should I have to worry about where a process runs, on what server? The POSIX process scheduler should be extended to run applications on the cluster with the SAME API as launching a thread.

      It's stupid that I have to think about all the details when all I want is:

      A) X done.
      B) In another process/thread.

      Somebody who "gets" this could make a pretty serious dent in the marketplace!

      --
      I have no problem with your religion until you decide it's reason to deprive others of the truth.
    2. Re:Example: Scripting Languages by DrSkwid · · Score: 1

      Forking is only expensive in your environment because of shared libraries.
      Static compilation ftw.

      --
      There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
    3. Re:Example: Scripting Languages by dansmith01 · · Score: 5, Interesting

      Perl has excellent support for building threaded applications. See http://perldoc.perl.org/threads.html . I code multi-threaded apps in perl all the time and they utilize my quad-core very efficiently - in fact, my biggest hassle with multithreading is keeping the CPU cooled! There's also a threads::shared module (http://perldoc.perl.org/threads/shared.html) for handling locks, etc. I'd be hard pressed to imagine better language support for threading. Hardware, operating systems, and a lot of languages support threading. Granted, it isn't always easy/possible/worth it, but as things currently stand, the only bottleneck is programmers who are too lazy to design their algorithms for parallel execution.

    4. Re:Example: Scripting Languages by godrik · · Score: 1

      Well, you could rely on libraries that do the job in parallel, or even use asynchronous function calls as in Cilk or Athapascan. However, I am not sure applications written in scripting languages are the ones looking to use multicore architectures...

    5. Re:Example: Scripting Languages by profplump · · Score: 1

      What exactly makes forking so expensive with shared libraries? The code segment shouldn't be duplicated for shared or static libraries, and shared libraries have the benefit of only being loaded once per system, rather than once per program. Is there some important step I'm missing?

    6. Re:Example: Scripting Languages by amorsen · · Score: 4, Insightful

      Fork isn't slow or painful. And if you think shared memory is a bad way to communicate, you REALLY won't like threads.

      --
      Finally! A year of moderation! Ready for 2019?
    7. Re:Example: Scripting Languages by DrSkwid · · Score: 1
      --
      There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
    8. Re:Example: Scripting Languages by DrSkwid · · Score: 1

      There's also an older long discussion

      I link to that version of it rather than google groups because Linus has X-No-archive as a mail header - the chicken shit bastard

      --
      There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
    9. Re:Example: Scripting Languages by cryptoluddite · · Score: 1

      This makes multi-core performance a non-starter. Sure, you can run more INSTANCES of the language with multiple cores, but you can't get any single instance of a script to run any faster than what a single core can do.

      Take Java for instance. It runs the garbage collector in parallel with the program itself so what you'll find is a lot of single-threaded apps end up using say 110% CPU time or more. Programs never written for multiple cores get a speed-up from running on them.

      There's no reason scripting languages can't also do GC, or speculatively compile into native code on another core.

    10. Re:Example: Scripting Languages by Anonymous Coward · · Score: 0

      Already been done. Look up MOSIX. It's kind of in a non-maintained state now, though. There's a few other projects that do it too. There's also projects like Beowulf, these use a cluster-aware API but it's simple and can run on a single machine too.

    11. Re:Example: Scripting Languages by Anonymous Coward · · Score: 0

      Perl's threads are of the kernel variety. shared memory is slow for IPC?

    12. Re:Example: Scripting Languages by rackserverdeals · · Score: 1

      Languages like PHP/Perl, as a rule, are not designed for threading - at ALL.

      A few years ago I started getting into PHP to see what it was about and have worked on some PHP sites since then, but this is the main reason I decided to stick with Java for web development.

      Page load times can be dramatically reduced by offloading some processing to separate threads. It's great for caching or doing delayed batch writes to the db.

      --
      Dual Opteron < $600
    13. Re:Example: Scripting Languages by shutdown+-p+now · · Score: 1

      Fork isn't slow or painful. And if you think shared memory is a bad way to communicate, you REALLY won't like threads.

      I think that by "shared memory" he means the Unix/POSIX shared memory APIs, not the general idea of sharing memory between threads/processes. I guess he's just used to how global variables are just implicitly shared between all threads, and pointers can be passed around with impunity.

    14. Re:Example: Scripting Languages by Anonymous Coward · · Score: 0

      Try to share network connections (for example: DB connections) between perl "threads".
      You'll see the real problem.
      Perl threads are just a toy, only useful for CPU filling. Very good if that's what you want, but most uses nowadays for threads are to allow pooling of resources between themselves. The only way to true scalability.

    15. Re:Example: Scripting Languages by Anonymous Coward · · Score: 0

      If you have a website running php/perl, user 1 comes in and gets core 1, user 2 comes in and gets core 2, etc. The cores are used!

    16. Re:Example: Scripting Languages by Pinky's+Brain · · Score: 1

      Shared memory is a fine way to communicate, having a language make it the default that you can simultaneously access all shared data is what's the problem.
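
      A small sketch of the alternative, in Python (my choice, and the class and function names are invented for the example): the shared data lives behind an object that owns its own lock, so the only way to touch it is through methods that take the lock first. Sharing is opt-in rather than the default:

```python
import threading


class SharedCounter:
    """A counter whose state can only be reached while holding its lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:  # one thread at a time in here
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def count_in_parallel(threads=4, per_thread=1000):
    counter = SharedCounter()

    def work():
        for _ in range(per_thread):
            counter.increment()

    pool = [threading.Thread(target=work) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value
```

      Four threads doing 1000 increments each always end up at 4000 here, because every read-modify-write happens under the lock. The memory is still shared; what changed is that nothing can reach it by accident.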

  13. How many tools do you need? by Anonymous Coward · · Score: 5, Insightful

    "The development tools aren't available and research is only starting"

    Hardly. Erlang's been around 20 years. Newer languages like Scala, Clojure, and F# all have strong concurrency. Haskell has had a lot of recent effort in concurrency (www.haskell.org/~simonmar/papers/multicore-ghc.pdf).

    If you prefer books there's: Patterns for Parallel Programming, the Art of Multiprocessor Programming, and Java Concurrency in Practice, to name a few.

    All of these are available now, and some have been available for years.

    The problem isn't that tools aren't available, it's that the programmers aren't preparing themselves and haven't embraced the right tools.

    1. Re:How many tools do you need? by Anonymous Coward · · Score: 0

      Instead (on the server-side front) we see all this hype for Ruby on Rails, which to my understanding doesn't even support native threads.

      Scala actors pwnz.

    2. Re:How many tools do you need? by julesh · · Score: 1

      "The development tools aren't available and research is only starting"

      Hardly

      Exactly what I was here to post. I mean, it's not like parallel computation is anything new. The concurrent programming lab at my university was the most important research division when I was back there in the early 90s. And they'd spent real cash on it, too... they had a transputer and everything.

    3. Re:How many tools do you need? by MattBD · · Score: 1

      I did hear that functional programming was likely to become the dominant programming paradigm in the near future because it was well-suited to developing with multiple cores. I'm currently learning Python and I daresay it will take a while to get used to OOP. Is functional programming appreciably harder or is it just different? I'd certainly be open to learning another language.

    4. Re:How many tools do you need? by Anonymous Coward · · Score: 0

      I did hear that functional programming was likely to become the dominant programming paradigm in the near future because it was well-suited to developing with multiple cores.
      I'm currently learning Python and I daresay it will take a while to get used to OOP. Is functional programming appreciably harder or is it just different? I'd certainly be open to learning another language.

      Every approach, be it functional, or OOP, or even crazier stuff like vector programming or logic programming, all fit certain types of applications.

      Pure functional languages are great for math and math-like things like spreadsheets. OOP is great for modular systems that you need to maintain over a long period of time. Vector programming is great for financial applications.

      Probably the best advice is to try a little of everything, then jump on what feels "right" to you. You'll get the most mileage by growing with the technology that fits how you think, until you can really wield it effortlessly.

    5. Re:How many tools do you need? by Kjella · · Score: 1

      The problem isn't that tools aren't available, it's that the programmers aren't preparing themselves and haven't embraced the right tools.

      I've dabbled a little in them. I've played with Qt's concurrency functions, and written some software that's multithreaded. The real problem is that it's hard no matter what language you run in, unless you have one of those really trivially parallelizable problems that might as well be running on a 10000 node cluster. It's not that people don't have experience with multitasking - if you've ever tried to cook anything more advanced than Ramen noodles you probably have - but that people are, despite many other flaws, rather good at resolving deadlocks, race conditions and resource contention. If me and a friend run into each other in a doorway we don't stand there headbutting each other like complete idiots. If we both need the kitchen knife to cut something we agree on who goes first. If each of us got one kitchen mitt we don't come to a standstill and let the roast burn.

      Of course, none of this is exactly new. I know that the computer is dumb as a brick and will do exactly what I tell it even if it's obviously not what I intended. But that is pretty easy to figure out in a single-threaded application: you just step through and find out when it stops doing what I want and there you have it. Multithreaded software just has this inexplicable ability to interact in ways you didn't predict, only showing up under mysterious conditions like release builds, and in general being a pain in the ass to debug because I've still not come across a good tool that'll easily show which threads are messing it up and why. Given that there's finite resources, most of the time a single-threaded application that's functional, stable, maintainable and so forth is better than the alternative.

      --
      Live today, because you never know what tomorrow brings
    6. Re:How many tools do you need? by Anonymous Coward · · Score: 0

      How about you add GCC to that list? It does have OpenMP support, meaning you can parallelize things like the common for loop with just a #pragma.

    7. Re:How many tools do you need? by shutdown+-p+now · · Score: 1

      Web "server side" or rather what is meant by that in this context, is inherently parallelizable because you have multiple requests to handle. You just put each one on its own core, and that's pretty much it.

    8. Re:How many tools do you need? by Aceticon · · Score: 1

      My experience in the Java world is there are very few people that can use multiple threads properly.

      Java is a language where creating a new thread is stupendously simple, thread synchronization was a standard part of the language from day one (although it's quite basic) and many advanced uses of the language (web-applications, EJBs) are by default multi-threaded. It's quite surprising that most developers either don't know how to manage concurrency altogether or default to a "lock everything" strategy.

      Even more surprising is the fact that this remains true even for developers working on pure server applications (where a lot of value can be gained for applications that use multiple threads properly).

      I've worked with recent graduates and almost universally they don't really know how to program efficient multi-threaded applications: maybe Universities aren't teaching the right things???

  14. BeOS by Snowblindeye · · Score: 5, Interesting

    Too bad BeOS died. One of the axioms the developers had was 'the machine is a multi processor machine', and everything was built to support that.

    Seems like they were 15 years ahead of their time. But, on the other hand, too late to establish another OS in a saturated market. Pity, really.

    1. Re:BeOS by yakumo.unr · · Score: 2, Informative

      So you missed Zeta then ? http://www.zeta-os.com/cms/news.php (change to English via the dropdown on the left)

    2. Re:BeOS by Anonymous Coward · · Score: 1, Interesting

      Haiku and Syllable are pretty much trying to continue the model that BeOS used, where threading is heavily used and threads communicate via high-level message passing.

    3. Re:BeOS by Anonymous Coward · · Score: 0

      Actually it's not quite completely dead: http://www.haiku-os.org/ and I believe they just got a version of gcc as well.

    4. Re:BeOS by Anonymous Coward · · Score: 1, Informative

      or Haiku?
      http://www.haiku-os.org/

    5. Re:BeOS by b4dc0d3r · · Score: 2, Informative

      Looks dead to me, a year ago they posted this:

      With immediate effect, magnussoft Deutschland GmbH has stopped the distribution of magnussoft Zeta 1.21 and magnussoft Zeta 1.5. According to the statement of Access Co. Ltd., neither yellowTAB GmbH nor magnussoft Deutschland GmbH are authorized to distribute Zeta.

      http://www.bitsofnews.com/content/view/5498/44/

    6. Re:BeOS by ChunderDownunder · · Score: 1

      So you missed that Zeta doesn't exist then?

      From the site you linked, Friday 06 April 2007, Zeta stopped distribution under the threat of legal action by the copyright holders of BeOS.

      Still, Be's spirit lives on in the Haiku project.

    7. Re:BeOS by verbatim_verbose · · Score: 2, Interesting

      It may have been an axiom, but really, what did BeOS do (or want to do) that Linux doesn't do now?

      The Linux OS has been scaled to thousands of CPUs. Sure, most applications don't benefit from multi-processors, but that'd be true in BeOS, too.

      I'd honestly like to know if there is some design paradigm that was lost with BeOS that isn't around today.

    8. Re:BeOS by Anonymous Coward · · Score: 0

      BeOS didn't die; the Haiku project is still alive.

    9. Re:BeOS by Anonymous Coward · · Score: 0

      Haiku is an open source recreation of BeOS and is even binary compatible with the last released BeOS, R5.

      http://haiku-os.org

    10. Re:BeOS by BenoitRen · · Score: 1

      Long live Haiku!

    11. Re:BeOS by Snowblindeye · · Score: 1

      It may have been an axiom, but really, what did BeOS do (or want to do) that Linux doesn't do now?

      The Linux OS has been scaled to thousands of CPUs. Sure, most applications don't benefit from multi-processors, but that'd be true in BeOS, too.

      I'd honestly like to know if there is some design paradigm that was lost with BeOS that isn't around today.

      I think their application framework was highly threaded. Or, structured to encourage you to use a lot of threads.

      Just creating a window and some task that did something automatically gave you two threads, without you thinking about it much. One would take care of the UI, the other would run your app. That way, even if the app was busy, the UI stayed responsive.

      Of course I don't know how far it would really have scaled, but I think the application framework that came with it probably did as much to enhance that as did the actual OS
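
      A rough sketch of that two-thread arrangement, assuming nothing about the real BeOS API: one thread stands in for the window's event loop, the other runs the app logic, and they only talk through message queues. The names worker and run_ui are hypothetical, and Python is used purely for illustration.

```python
import queue
import threading


def worker(inbox, outbox):
    """Runs the app logic; talks to the UI only via messages."""
    while True:
        msg = inbox.get()
        if msg is None:  # sentinel: shut down
            break
        outbox.put(("done", msg * msg))  # stand-in for slow work


def run_ui(jobs):
    """Stand-in for a window's event loop: posts work and collects replies."""
    to_worker, to_ui = queue.Queue(), queue.Queue()
    t = threading.Thread(target=worker, args=(to_worker, to_ui))
    t.start()
    for job in jobs:
        to_worker.put(job)  # returns immediately; the UI thread is not blocked
    to_worker.put(None)
    results = []
    for _ in jobs:
        _kind, value = to_ui.get()  # a real UI would poll this from its event loop
        results.append(value)
    t.join()
    return results
```

      Because neither side touches the other's state directly, no explicit locks are needed in this sketch.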

    12. Re:BeOS by crispytwo · · Score: 1

      Not true.

      A revitalized OS that may work better than what we can offer using old designs can lend tons to the world of software development and users. Lots of the OSS can run on Haiku, the new rendition of BeOS.

      I am curious how well it will work when it nears the RC1 stage. I will most certainly be trying it out.

    13. Re:BeOS by yakumo.unr · · Score: 1
      Oops, yes, I had just remembered there was a continuation, and Zeta sprang to mind; I didn't check over the page.

      The AC reply got the right one, haiku is the active project, they don't have a stable release yet though, only pre alpha nightly builds.

    14. Re:BeOS by Anonymous Coward · · Score: 0

      Haiku OS ... Successor to BeOS, perhaps not in source code, but definitely sharing the ideas, goals and, importantly, APIs.

      The project is coming along quite well now and gets more interesting each year.

      www.haiku-os.org and dev.haiku-os.org for the bug tracker

    15. Re:BeOS by jgtg32a · · Score: 1

      Wow

    16. Re:BeOS by margaret · · Score: 1

      I loved BeOS and was so sad when it went away.

      The Haiku page says that it's "inspired by" BeOS. So what's that mean exactly - that they're trying to reverse engineer it? What happened to the old Be source code? Seems silly to have to reinvent the wheel if the code is already written and not being used for anything.

    17. Re:BeOS by JasterBobaMereel · · Score: 1

      Database file system ... Like WinFS and GNOME Storage except it has actually been released and works ....

      --
      Puteulanus fenestra mortis
  15. Grand Central by tepples · · Score: 3, Informative
    Anonymous Coward wrote:

    get a mac..

    I assume you're talking about Mac OS X 10.6 (Snow Leopard), whose Grand Central framework is supposed to add some tools to make Mac-exclusive multithreaded apps easier to program.

    1. Re:Grand Central by heffrey · · Score: 1, Flamebait

      Vapourware

    2. Re:Grand Central by Anonymous Coward · · Score: 0

      Explain to me how the release of Snow Leopard which will include "Grand Central" is vaporware?

    3. Re:Grand Central by tepples · · Score: 1

      Mac OS X 10.6 is due in about three months, roughly a year after WWDC 2008, but I think heffrey's trying to say it's vapor until it's published.

    4. Re:Grand Central by heffrey · · Score: 1

      Well, 'cos it hasn't been released yet and there are no meaningful technical details of what it is. So on the face of it, it would appear to be a piece of marketing fluff from the company that is best at that. I just find it very hard to believe that a company which is very good at polishing other people's ideas is going to solve the problem of automatic parallelisation of software.

    5. Re:Grand Central by intheshelter · · Score: 1

      Based on the tone of your comment I would assume that your posts are driven more by a dislike of Apple than by anything based on logic or impartial deduction.

      "it would appear to be a piece of marketing fluff from the company that is best at that."

      "a company which is very good at polishing other people's ideas"

      Has it ever occurred to you that maybe the reason they keep this close to the vest until they release it is because everything they do seems to end up in the next version of Windows? Rather than give Windows more time to copy they wait until the last minute to release details. Jobs learned his lesson decades ago regarding Microsoft and I believe that is one reason Apple is so secretive now.

      But, I digress, if you want to hate Apple then that's fine, but calling it vaporware is a bit of a stretch. If they release Snow Leopard and it's not in the release THEN you'll have a point.

    6. Re:Grand Central by Anonymous Coward · · Score: 0

      If he wasn't I will. Two years from now most people will be talking about this great "new" way Windows or Linux thought up to actually use all cores. Something Mac users are going to get in about two months. It's brilliant, and even if you hate Mac you should check it out regardless. I switched about 6 years ago and have stuck with them because of the level of innovation they offer. Yes, you can complain about a few things, but overall they offer the best system out there and by far the best operating system. --> Yes, I have used every version of Windows and Linux you can think of. None quite match up, although I will admit some Linux flavors come close.

    7. Re:Grand Central by heffrey · · Score: 1

      I don't hate Apple, but I don't think they are the wonderful innovative company that much of the media makes them out to be. They take good ideas and polish them, much like Microsoft do. That's not criticism, that's a compliment.

      Anyway I'm sure there will be something in Snow Leopard but I'll bet it won't be revolutionary.

  16. Multicore not designed for the real world by stox · · Score: 1

    Yes, some problems lend themselves very well to multicore designs. Many others do not. Just because they are building multicore chips does not mean that multicore is the right answer. Current multicore designs have too small a cache and too little memory bandwidth. If my problem is CPU bound, multicore can be a solution. If my problem is memory access bound, multicore is only going to make it worse.

    --
    "To those who are overly cautious, everything is impossible. "
    1. Re:Multicore not designed for the real world by higuita · · Score: 1

      Current multicore designs have too small cache, and too slow memory bandwidth

      No, this is an Intel problem, not a multicore design fault... AMD CPUs/cores require less cache than Intel's and have internal memory access and controllers, so they don't have the memory bandwidth problems of Intel

      --
      Higuita
  17. Not all programmers need threading by js3 · · Score: 1

    The idea that every program needs to support threading is kinda stupid. Most programs barely use any computational power, in fact there are very few programs that require all that computing power to operate and those are certainly well designed.

    --
    did you forget to take your meds?
    1. Re:Not all programmers need threading by Anonymous Coward · · Score: 0

      I think that you are right, with some exceptions such as web browsers and games.

      Web browsers need to run a s-load of applications inside themselves, sort of like an operating system.

      Impressive games need all the resources they can get (otherwise they would probably not be impressive) but it is fairly difficult to analyze the program before it is made to find the optimal parallelization strategy and to realize that strategy in code.

    2. Re:Not all programmers need threading by tepples · · Score: 1

      Impressive games need all the resources they can get (otherwise they would probably not be impressive) but it is fairly difficult to analyze the program before it is made to find the optimal parallelization strategy and to realize that strategy in code.

      The critical path as I understand it is input => physics => graphics and sound. Are modern impressive games bottlenecked in physics or in the CPU side of graphics?

    3. Re:Not all programmers need threading by 0123456 · · Score: 1

      Are modern impressive games bottlenecked in physics or in the CPU side of graphics?

      Do you really think that game developers wouldn't like to include a vastly more sophisticated physics or AI engine if the CPU could handle it?

    4. Re:Not all programmers need threading by Anonymous Coward · · Score: 0

      I think that modern games can be bottlenecked almost anywhere depending on what the player might do to push the limits. Graphics, physics, "AI", network bandwidth, or any sort of overhead running wild.

      A WoW server has 10 000+ players during peak hours. It's not unlikely that half of them would love to get together to fight a huge battle somewhere in the game. What gives first? I don't know. But there are probably several bottlenecks that would give if the main bottleneck wasn't there.

      If they solve these bottlenecks, then each player will eventually want to command a platoon of computer controlled minions in battle. And once that's possible they will want the minions to have better AI.

      Really, really impressive games will always be something of a nightmare to make. =)

  18. Don't imagine. Its name was Java. by tepples · · Score: 4, Informative

    imagine software being developed for imaginary or speculatory hardware.

    I think Sun called it "Java". It was run on emulators long before ARM and others came out with hardware-assisted JVMs such as Jazelle.

    1. Re:Don't imagine. Its name was Java. by pm_rat_poison · · Score: 1

      Yeah, there are also imaginary languages for imaginary processors like mic1 and stuff. But TFA is talking about operating systems

    2. Re:Don't imagine. Its name was Java. by Dahamma · · Score: 2, Insightful

      Yeah, there are also imaginary languages for imaginary processors like mic1 and stuff. But TFA is talking about operating systems

      Don't state what TFA says if you didn't even read TFA.

      There isn't a SINGLE reference to Linux, Windows, or any other operating system in TFA. It was about lack of developer tools to create effective multithreaded applications, and had nothing to do with operating systems.

    3. Re:Don't imagine. Its name was Java. by owlstead · · Score: 1

      Java is supposed to be everywhere, from the start. The fact that you can also implement the VM as hardware (and you can still buy dedicated hardware AFAIK for enterprise servers) is mainly a side effect. So I don't think this compares.

      You can basically implement any VM in hardware actually, and emulate any hardware using a VM. So, eh, basically, what's your point?

    4. Re:Don't imagine. Its name was Java. by tepples · · Score: 1

      So, eh, basically, what's your point?

      To point out that the market disagrees with pm_rat_poison's assessment that "software being developed for imaginary or speculatory hardware [s]ounds like a big waste of time".

    5. Re:Don't imagine. Its name was Java. by owlstead · · Score: 1

      Ah, fair enough. I did not directly see if your article was trying to criticize the parent or trying to mock Sun, and I mistakenly assumed the latter. Sorry about that :) Not all those hardware ventures of Sun regarding Java were a success (and that's probably an understatement).

  19. Well, maybe... by windsurfer619 · · Score: 1

    Maybe I'm just not a multicore user. Ever thought of that?

  20. Another flamebait story by timothy by Anonymous Coward · · Score: 5, Insightful

    The quote presented in the summary is nowhere to be found in the linked article. To make matters worse, the summary claims that linux and windows aren't designed for multicore computers but the linked article only claims that some applications are not designed to be multi-threaded or running multiple processes. Well, who said that every application under the sun must be heavily multi-threaded or spawning multiple processes? Where's the need for a email client to spawn 8 or 16 threads? Will my address book be any better if it spans a bunch of processes?

    The article is bad and timothy should feel bad. Why is he still responsible for any news being posted on slashdot?

    1. Re:Another flamebait story by timothy by godrik · · Score: 1

      Where's the need for a email client to spawn 8 or 16 threads?

      Well, one of the goals of multi-core architecture is to reduce energy consumption. 2 cores at 2 GHz cost less energy than 1 core at 4 GHz.

      The question is not how many threads an application should launch. Threads are just there to exploit the parallelism of the machine. More threads are what turn 2 cores at 2 GHz into 4 GHz of computational power. If you don't need such power (who needs a 4 GHz computer to write some mail?), then the multi-core architecture can be downclocked to reduce the energy consumption. Same performance but better efficiency.
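
      The energy argument can be made concrete with the usual first-order model of CMOS dynamic power. The model is not from the comment itself; it is a standard approximation that ignores leakage. Here \alpha is the activity factor, C the switched capacitance, V the supply voltage and f the clock frequency, and V_hi > V is the (assumed) higher voltage a single core would need to run stably at 2f.

```latex
\begin{align*}
P_{\text{dyn}} &\approx \alpha\, C\, V^{2} f \\
P_{\text{one core at } 2f} &\approx \alpha\, C\, V_{\text{hi}}^{2}\,(2f) \\
P_{\text{two cores at } f} &\approx 2\,\alpha\, C\, V^{2} f \\
\frac{P_{\text{one core at } 2f}}{P_{\text{two cores at } f}} &\approx \left(\frac{V_{\text{hi}}}{V}\right)^{2} > 1
\end{align*}
```

      Under these assumptions the two slower cores deliver the same nominal cycles per second for less power, provided the workload actually splits across them.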

    2. Re:Another flamebait story by timothy by jonbryce · · Score: 1

      Kontact/KMail could certainly use multi-threading when it is searching through the 100 or so new mails for the 2 or 3 that aren't spam.

    3. Re:Another flamebait story by timothy by owlstead · · Score: 1

      "Where's the need for a email client to spawn 8 or 16 threads?"

      Eh? That's one application that definitely needs threads, or at least a split up into parts that can run separately (e.g. Java can run Runnables within the same hardware thread, which is more light-weight).

      It's not so much the speed-up as doing multiple things in the background. E.g. it's stupid if the user has to wait for messages to be fetched from the server, or if the code for parsing/filtering messages is not multi-threaded.

      But an email client is almost too easy to split into threads, and many of the clients are already multi-threaded (e.g. you scroll or even write a message while mail is retrieved).

    4. Re:Another flamebait story by timothy by richlv · · Score: 1

      while article/summary is not the best, i would still want email client to be threaded.
      i want the interface to be responsive and fast while it sends, receives, checks for mail, filters spam, compacts mailboxes and does whatever it needs under the hood. same goes for my audio player, my web browser, everything - if something can slow down my interaction with it, i want it to happen in a separate thread/process.

      --
      Rich
  21. BS by Anonymous Coward · · Score: 0

    What good are multiple cores and threads when you are running an event-driven GUI application? Some applications, especially Java applications, already use too many unnecessary threads. Encouraging thread use where it is unnecessary is stupid. There is only limited room for parallelism in any algorithm.

    Also, "...research is only starting." What BS is this? Multiprocessing and multithreading issues have been researched and solved for several decades now.
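
    The "limited room for parallelism" point is usually stated as Amdahl's law, named here as background rather than as something the comment cites. If a fraction p of a program's run time can be parallelized and n cores are available, the speedup is bounded as follows:

```latex
\begin{align*}
S(n) &= \frac{1}{(1 - p) + \dfrac{p}{n}}, \qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p} \\[4pt]
\text{e.g. } p = 0.5,\ n = 8: \quad
S(8) &= \frac{1}{0.5 + 0.0625} \approx 1.78, \qquad
\lim_{n \to \infty} S(n) = 2
\end{align*}
```

    So a program that is only half parallelizable can never run more than twice as fast, no matter how many cores are added.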

    1. Re:BS by mikael · · Score: 1

      The advantage is when you are using applications like image processing, 3D rendering, video decompression, or downloading, where tasks can be run in the background. Scripts and macros that apply to multiple sets of data can be run in the background. Auto-save functionality has always been desirable, but annoying when the entire application freezes because it is single-threaded.

      --
      Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
  22. Really? by Darkness404 · · Score: 1

    Who would have ever guessed that most software is single-threaded rather than multi-threaded, and the programmers of Linux and Windows don't really feel like optimizing everything for 8-core CPUs that won't be released for quite some time and won't end up in the average user's box for 3 or more years.

    --
    Taxation is legalized theft, no more, no less.
    1. Re:Really? by Bert64 · · Score: 1

      The OS isn't really the problem, both Windows and Linux will happily use more than 8 cores... In the server space boxes with 4 or more quad core chips are very common, and machines with 16 physical cpus have been around for years.
      Linux scales especially well, you can buy a system with 512 dual core cpus from SGI (http://www.sgi.com/products/servers/altix/4000/configs.html) which is designed to run Linux, and Linux accounts for over 80% of the top500 supercomputers list and most of the top 10.

      It's end-user apps that aren't optimized for multiple cores/CPUs, as most server software can easily be multi-threaded by handling multiple user connections at once...

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
  23. How many apps per user? by tepples · · Score: 1

    Multiple virtual machines on the same piece of metal, with a workstation hypervisor, and intelligent balancing of apps between backends.

    But with how many apps can one user interact? I understood the article to be referring to desktop applications, not server applications. In a desktop environment, most applications spend much of their time waiting for an event. For example, a virus scanner blocks until a file is modified or a removable medium is mounted. Or are you envisioning connecting four terminals to one desktop PC and binding one virtual machine to each terminal?

    1. Re:How many apps per user? by mysidia · · Score: 1

      No, I'm envisioning one VM to essentially act as a thin layer and provide display, KB, and mouse to your physical terminal, much like a terminal services client, while the worker processes (and processes displaying UI widgets) get distributed over the various CPUs and VMs, so the OS handling display is super-lightweight and isolated from UI logic, for instance.

      For example, Parallels Desktop's "Coherence", and VMware Fusion's unity feature. Don't need multiple physical terminals.

      You shouldn't have to care which VM is actually doing the interface rendering and the processing for each individual app.

    2. Re:How many apps per user? by tepples · · Score: 1

      [One VM runs the X server,] while the worker processes (and processes displaying UI widgets) get distributed over the various CPUs and VMs

      But if one human being is doing only one thing at one time, only one VM running on one core will be running one worker process.

    3. Re:How many apps per user? by mysidia · · Score: 1

      That goes back to the design of the application.

      But how many humans do you know that only run one or two windows on their PCs at the same time?

      Your average XP user has 4 or 5 windows open at the same time.

      In principle, with an API in place, apps could be made that run multiple worker threads for one application across the multiple VMs; that's what I'm referring to.

      So even if the app's OS doesn't scale well to multiple cores, if your virtualization layer does scale well, and has a decent API the apps can take advantage of, then it can scale independent of the client OS limitations.

      And also, through the API achieve reliability. For example, if one of the running copies of Windows crashes, the Apps with multiple worker threads keep some threads running, and they can detect and recover from that failure (i.e. restore the lost thread on a fresh VM that's still running optimally).

      This allows applications to work around issues like blue screens in certain OSes, or kernel panics in others, so long as they perform sufficient checkpointing to allow restoration of the thread.

      And it would be beneficial for server applications to implement as much as for desktops...

  24. Parallel programming is hard, film at 11. by Troy+Baer · · Score: 5, Informative

    The /. summary of TFA is almost exquisitely bad. It's not Windows or Linux that's not ready for multicore (as both have supported multi-processor machines for on the order of a decade or more), but rather the userspace applications that aren't ready. The reason is simple: Parallel programming is rather hard, and historically most ISVs haven't wanted to invest in it because they could rely on the processors getting faster every year or two... but no longer.

    One area where I disagree with TFA is the claimed paucity of programming models and tools. Virtually every OS out there supports some kind of concurrent programming model, and often more than one depending on what language is used -- pthreads, Win32 threads, Java threads, OpenMP, MPI or Global Arrays on the high end, etc. Most debuggers (even gdb) also support debugging threaded programs, and if those don't have enough heft, there's always Totalview. The problem is that most ISVs have studiously avoided using any of these except when given no other choice.

    --t

    --
    "My life's work has been to prompt others... and be forgotten." --Cyrano de Bergerac
    1. Re:Parallel programming is hard, film at 11. by klossner · · Score: 4, Insightful

      In fact, TFA doesn't even use the words "Linux" or "Windows."

    2. Re:Parallel programming is hard, film at 11. by Ryzzen · · Score: 1

      And Macs are suspiciously missing from the summary. Perhaps this is what Steve Jobs has been doing in his spare time... You can't fool us, "Timothy!"

    3. Re:Parallel programming is hard, film at 11. by javax · · Score: 1

      I have to agree; yet parallel or (at least) multi-threaded algorithms haven't made it into standard libraries. An extreme example is the Standard C library qsort(3): there are multi-threaded versions of it available (http://libmt.sourceforge.net/) but they are not included in the base libraries. And glibc is IMHO an integral part of Linux. I don't know exactly about Windows, but I assume the standard sorting function to be a serial one, too.
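
      For readers wondering what a multi-threaded sort looks like in outline, here is a minimal sketch of the usual split, sort, merge structure. It is not how libmt or any C library does it; parallel_sort is a hypothetical name, and Python is used only to keep it short. Note that in CPython the global interpreter lock means these threads will not actually speed up a pure-Python sort; swapping in ProcessPoolExecutor would be needed for a real gain.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def parallel_sort(items, workers=4):
    """Sort `items` by sorting up to `workers` slices concurrently, then merging."""
    items = list(items)
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if len(items) < 2:
        return items
    size = -(-len(items) // workers)  # ceiling division
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(sorted, chunks))
    # k-way merge of the already sorted slices
    return list(heapq.merge(*sorted_chunks))
```

      The merge step is serial, which is one reason such a sort scales less than linearly with the number of cores.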

    4. Re:Parallel programming is hard, film at 11. by Bert64 · · Score: 1

      The idea of machines getting faster every year has not just harmed parallel processing, it has resulted in slower and more bloated code written in increasingly inefficient languages...

      At http://shootout.alioth.debian.org/ you can see various benchmarks tested in different languages...

      Java seems to perform quite well, but has a significant memory and startup overhead (java -server where the runtime is already loaded performs massively better)..
      Ruby which seems to be fashionable right now performs terribly...

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
  25. Multithreaded applications not always needed by Pascal+Sartoretti · · Score: 2, Insightful

    Developers still write programs for single-core chips and need the tools necessary to break up tasks over multiple cores.

    So what? If I had a 32 core system, at least each running process (even if single-threaded) could have a core just for itself. Only a few basic applications (such as a browser) really need to be designed for multiples threads.

    1. Re:Multithreaded applications not always needed by stevied · · Score: 1

      So long as the OS scheduler is smart enough to keep each process on the same core to avoid cache thrashing, all the while num_procs < num_cores .. This was a problem at one point - I presume it's been fixed by now.

    2. Re:Multithreaded applications not always needed by nbates · · Score: 1

      My dvd burning program is quite slow, it could really use multiple threads...

    3. Re:Multithreaded applications not always needed by Pascal+Sartoretti · · Score: 1

      My dvd burning program is quite slow, it could really use multiple threads...

      I seriously doubt that burning a DVD is CPU-bound, therefore having a multi-threaded DVD burning program would not help much.

    4. Re:Multithreaded applications not always needed by nbates · · Score: 1

      I'm never sure when to put a sarcasm tag. In this case however, I was being completely serious...

  26. Some tasks are embarrassingly parallel by tepples · · Score: 4, Informative

    Most programs barely use any computational power, in fact there are very few programs that require all that computing power to operate and those are certainly well designed.

    Home users do use some apps that could benefit from multiple cores. Video encoding is one of them, but that one is embarrassingly parallel because the encoder could just split the video into quadrants and have each of four cores work on one quadrant.

    1. Re:Some tasks are embarrassingly parallel by DarkOx · · Score: 1

      Good point: video encode / decode should be pretty simple to parallelize. Would it not be much simpler, though, from the perspective of not needing to deal with complicated motion estimation algorithms and such, just to split video work along groups of B-frames? Seems like as long as the video was more than a few frame groups in length you would get just as much gain without even needing to reimplement much, if any, of your existing codec algorithms.

      --
      Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
    2. Re:Some tasks are embarrassingly parallel by evilviper · · Score: 3, Informative

      Video encoding is one of them, but that one is embarrassingly parallel

      This is most certainly not true. While many video codecs have been multi-threading enabled, they always do so at a significant quality reduction.

      because the encoder could just split the video into quadrants and have each of four cores work on one quadrant.

      Many features of H.264 (like GMC) require a whole frame, not a quadrant. In practically all lossy video codecs, motion vectors have to be computed as the differential from the previous. And there are endless other examples. Of course there's little point in going into it, because the next time video encoding comes up on /., dozens of other people will make the exact same uninformed statements...

      Just go visit the x264 mailing list and ask the developers why they stopped using slice-based encoding for multithreaded encoding...

      I used to recommend splitting a 2-hour video into four 30-minute parts and feeding each to a single-threaded encoder.

      That would only make ANY sense with fixed bitrate encoding. It can possibly be used in the second-pass of multipass encoding, but that's not trivial to do by any stretch.

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    3. Re:Some tasks are embarrassingly parallel by Anonymous Coward · · Score: 0

      Good point: video encode / decode should be pretty simple to parallelize. Would it not be much simpler, though, from the perspective of not needing to deal with complicated motion estimation algorithms and such, just to split video work along groups of B-frames? Seems like as long as the video was more than a few frame groups in length you would get just as much gain without even needing to reimplement much, if any, of your existing codec algorithms.

      The only caveat is that the human eye might be able to catch the transitions between 'chunks' (whatever it's called) in the encoded video, since the encoder will probably do slightly different encoding of the end of chunk number n than it does for the beginning of chunk n+1.

      Guessing it can be solved with just a little bit of communication between thread n and n+1?

    4. Re:Some tasks are embarrassingly parallel by drinkypoo · · Score: 1

      If more single programs are going to benefit from large numbers of cores, though, the system itself is going to have to become more parallel, to the point where a single window's decorations might be drawn through the efforts of multiple processors, to say nothing of its contents.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    5. Re:Some tasks are embarrassingly parallel by AndrewStephens · · Score: 1

      I don't do much video encoding and I certainly have never worked on a codec, but wouldn't it be possible to encode a 16 minute video file on 4 processors by breaking it into 4 minute sections and encoding these separately?

      I know that frames often depend on previous frames, but each section could overlap the following one slightly, up to a frame that did not reference previous data. That would make stitching the files back together relatively easy at the cost of encoding a few frames twice.

      Of course this doesn't work with encoding a realtime stream.
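
      A toy sketch of the segment idea, under loud assumptions: encode_chunk is a made-up stand-in for a real single-threaded encoder, and it models only one thing, namely that the first frame of each independently encoded chunk has to be a keyframe ("I") because it cannot reference anything earlier. It says nothing about rate control across segments.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_chunk(frames):
    """Hypothetical encoder stand-in: first frame of a chunk is a keyframe
    ('I'); every later frame is predicted from earlier ones ('P')."""
    return [("I" if i == 0 else "P", f) for i, f in enumerate(frames)]


def encode_in_segments(frames, parts=4):
    """Split `frames` into up to `parts` consecutive chunks, encode each chunk
    independently in its own worker, and stitch the results back in order."""
    frames = list(frames)
    size = max(1, -(-len(frames) // parts))  # ceiling division, at least 1
    chunks = [frames[i:i + size] for i in range(0, len(frames), size)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        encoded = list(pool.map(encode_chunk, chunks))  # results keep input order
    return [frame for chunk in encoded for frame in chunk]
```

      With 16 frames and 4 parts the output carries four keyframes instead of one, which is the extra cost of cutting the stream into independent pieces.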

      --
      sheep.horse - does not contain information on sheep or horses.
    6. Re:Some tasks are embarrassingly parallel by Anonymous Coward · · Score: 1, Interesting

      With 4 30 minute pieces there will only be 3 problematic areas where the pieces are supposed to fit together, This shouldn't cause any major issues.
      The first couple of frames in each block wouldn't be able to use any previous frames as reference frames though, which is a drawback (it raises the bitrate for those frames), but if each block is large enough the negative effect should be negligible.

    7. Re:Some tasks are embarrassingly parallel by Ian_Mi · · Score: 1

      While many video codecs have been multi-threading enabled, they always do so at a significant quality reduction.

      x264 has supported frame-based parallel encoding for a long time now and it definitely does not result in significant quality loss. This works because the motion vectors are usually limited to 16-24 pixels in length so subsequent frames can start encoding after only a small portion of the current frame has finished.

      That would only make ANY sense with fixed bitrate encoding.

      That would also make perfect sense with crf encoding and there's hardly any reason to use 2-pass encoding over crf encoding unless you are still burning your videos to optical media.

    8. Re:Some tasks are embarrassingly parallel by Anonymous Coward · · Score: 0

      While I can think of a number of ways to parallelize video processing, I'm pretty sure doing it that way would result in noticeable quadrant borders -- which would be really distracting.

    9. Re:Some tasks are embarrassingly parallel by BitZtream · · Score: 1

      Or send it to the GPU and do it per pixel in parallel. You'd be amazed how fast a GPU can do transcoding with the right code.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    10. Re:Some tasks are embarrassingly parallel by petermgreen · · Score: 1

      Many video codecs effectively "start again" every so many frames anyway.

      You can sometimes see this with static text superimposed on video and then compressed heavily. The text starts with a load of artifacts, over several frames they clear up, then suddenly the artifacts return as another keyframe hits.

      --
      note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
    11. Re:Some tasks are embarrassingly parallel by Anonymous Coward · · Score: 0

      ... just split the video into quadrants and have each of four cores work on one quadrant.

      I wrote an app for a second-year university assignment (to get a practical handle on threading and forking - ie, it was written both ways) that does essentially what you propose on a single bitmap. s/bitmap/frame/ and iterate through a movie, voila! Now if a second-year uni assignment can teach that, where the hell do these "modern" coders come from?!

    12. Re:Some tasks are embarrassingly parallel by phision · · Score: 1

      Video encoding? I can hear MPAA's army of lawyers knocking on your door...

    13. Re:Some tasks are embarrassingly parallel by evilviper · · Score: 1

      With 4 30 minute pieces there will only be 3 problematic areas where the pieces are supposed to fit together. This shouldn't cause any major issues.

      An assertion on high from someone who probably hasn't actually even encoded more than a handful of videos... Never mind understanding the technical details of video encoding, let alone WRITING a video codec.

      The answer is still NO. Encoding a video in segments SHOULD and DOES cause major issues. Unless you're doing fixed-bitrate encoding, you have no way of knowing what percentage of the overall file-size each "piece" should use. Credits? Action sequence? Slow pan? Do you really think they all require the same bitrate?

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    14. Re:Some tasks are embarrassingly parallel by evilviper · · Score: 1

      and it definitely does not result in significant quality loss.

      Not true. I just don't have the time to argue the point in detail right now... Go do some encodings with/without a large number of threads, with PSNR or another objective metric measurement, and try again.

      And besides that, x264 is really only multithreaded on the SECOND pass. The first pass will barely use more than a single CPU.

      there's hardly any reason to use 2-pass encoding over crf encoding unless you are still burning your videos to optical media.

      Not true. x264 improves on 1-pass encoding, but there are plenty of ways to improve quality that require 2 passes (or a much larger buffer) to work properly.

      http://forum.doom9.org/showpost.php?p=996114&postcount=2

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    15. Re:Some tasks are embarrassingly parallel by evilviper · · Score: 1

      wouldn't it be possible to encode a 16 minute video file on 4 processors by breaking it into 4 minute sections and encoding these separately?

      You have no way of knowing how complex one segment is, versus another, before you've encoded the entire thing, so the allocation of bits won't be optimal. If that didn't matter, nobody would spend the time on 2-pass encoding.

      Plus, other things, like I-frame placement, can have a cascading effect throughout the video... Not knowing what happened in the 600 frames before the start of "segment 2" is a real drawback in efficient placement.
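
      A minimal sketch of the bit-allocation point, not anything a real encoder does: the complexity scores, the budget and the function names below are all made up for illustration. It only shows why an even split across segments differs from a split that knows how hard each segment is.

```python
# Hypothetical per-segment complexity scores, e.g. what a first pass might
# report. All numbers and names here are invented for illustration only.


def split_budget_evenly(total_bits, n_segments):
    """Give every segment the same share, knowing nothing about its content."""
    return [total_bits / n_segments] * n_segments


def split_budget_by_complexity(total_bits, complexity):
    """Give each segment a share proportional to its measured complexity."""
    total = sum(complexity)
    return [total_bits * c / total for c in complexity]


complexity = [1.0, 6.0, 2.0, 1.0]   # credits, action, slow pan, credits
total_bits = 1000.0

even = split_budget_evenly(total_bits, len(complexity))
weighted = split_budget_by_complexity(total_bits, complexity)

print(even)      # every segment gets the same budget
print(weighted)  # the action segment gets most of the budget
```

      With an even split the action segment is starved and the credits are over-fed; getting the weighted split requires having looked at the whole file first, which is the objection being made here.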

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    16. Re:Some tasks are embarrassingly parallel by Ian_Mi · · Score: 1

      Not true. x264 improves on 1-pass encoding, but there are plenty of ways to improve quality that require 2 passes (or a much larger buffer) to work properly.

      The difference between direct spatial and temporal will be trivial. As explained here http://forum.doom9.org/archive/index.php/t-143904.html by Dark Shikari, an active x264 dev, 2-pass encoding is no more efficient than crf.

      "CRF, 1pass, and 2pass all use the same bit distribution algorithm. 2-pass tries to approximate CRF by using the information from the first pass to decide on a constant quality factor. 1-pass tries to approximate CRF by guessing a quality factor over time and varying it to reach the target bitrate."

      Here http://forum.doom9.org/showthread.php?t=134545 he says "2pass is not measurably better than CRF, in general."

      The idea that multiple passes increases quality is left over from the time of mpeg-4 part 2 and part 4 encoders where this was the case.

      As for the sliceless multi-threading used by x264 there should be no significant quality loss unless the number of threads exceeds video_width/mvrange, so it depends on what you mean by a large number of threads and what you consider a reasonable mvrange to be. If you are unhappy with this limitation look at how x264farm works: http://omion.dyndns.org/x264farm/x264farm.html. It splits a video at scene-cuts and allows the scenes to be encoded in parallel as mentioned earlier. And yes it does work with multi-pass encoding.

    17. Re:Some tasks are embarrassingly parallel by Ian_Mi · · Score: 1

      The idea that multiple passes increases quality is left over from the time of mpeg-4 part 2 and part 4 encoders where this was the case.

      Sorry I meant mpeg-4 part 2 and mpeg-2 here.

    18. Re:Some tasks are embarrassingly parallel by evilviper · · Score: 1

      by Dark Shikari, an active x264 dev, 2-pass encoding is no more efficient than crf.

      Just because one x264 developer over-simplified once, doesn't change facts. The link I posted ALSO happens to be from an x264 developer, and he details exactly which options suffer from CRF vs 2-pass.

      As for the sliceless multi-threading used by x264 there should be no significant quality loss

      Ah, good, more assertions...

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
  27. Most undergrad programs focus on single-threaded by mrflash818 · · Score: 1

    When I was studying Comp Sci, I recall that most assignments were to 'understand the concept' and program a solution.

    Usually the programs were single-threaded. Maybe a section of a course was on concurrency (mutexes, threading), but not an entire course or courses.

    As multi-core becomes more the norm, then perhaps there can be an entire course on concurrency and how to design/program with this thinking in mind.

    --
    Uh, Linux geek since 1999.
  28. Anonymous Coward by Anonymous Coward · · Score: 0

    This canard will not fly! In the 1980's we and many others were profitably using shared-memory processors (discrete versions of multi-core chips), 20 CPUs in a Sequent Balance, in my case. We used high-level languages and sophisticated library support.

    The success then came because processor and memory speeds were "balanced"; now they are wildly imbalanced. It's an architectural problem that has not been solved by huge, multi-level caches. For a simple explanation of why, try the classic "Hitting the Memory Wall: Implications of the Obvious" (www.cs.virginia.edu/papers/Hitting_Memory_Wall-wulf94.pdf).

    Split-phase memory operations have been shown to help, but that innovation must be tied to hardware-supported multithreading.

  29. Message classification by tepples · · Score: 1

    Where's the need for an email client to spawn 8 or 16 threads?

    Message classification. An e-mail client could open a process for each message, and the process would analyze the message to see what labels (spam, work, personal, etc.) belong on the message. If you get a lot of mail, I imagine that classifying several hundred downloaded messages might take a while.
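
    A rough sketch of that idea, assuming a toy keyword filter: `RULES`, `classify` and `classify_all` are hypothetical names, not any mail client's API. It uses a thread pool to stay short; the process-per-message variant described above would swap in `ProcessPoolExecutor`, which has the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical keyword rules; a real client would use a trained filter.
RULES = {
    "spam": ("viagra", "lottery"),
    "work": ("meeting", "deadline"),
}


def classify(message):
    """Return the labels whose keywords appear in the message text."""
    text = message.lower()
    labels = [label for label, words in RULES.items()
              if any(word in text for word in words)]
    return sorted(labels) or ["personal"]


def classify_all(messages, workers=8):
    """Classify messages concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(classify, messages))


inbox = [
    "You won the LOTTERY!",
    "Meeting moved, new deadline Friday",
    "Dinner on Sunday?",
]
labels = classify_all(inbox)
print(labels)
```

    Each message is classified independently of the others, so nothing has to be locked. Note that in CPython a thread pool will not spread pure-Python work over several cores, which is one reason to prefer processes for this.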

    1. Re:Message classification by will_die · · Score: 1

      I have an account that is checked every few weeks and has a few hundred messages. It filters everything, and the slowdown is not the filtering but the network and how fast it can fetch the messages.

  30. Would be great, but only for a few things. by Nakor+BlueRider · · Score: 1

    This is a problem, but one specific only to certain programs. Pull up task manager, and take a look at the processes list. Odds are unless you're running something big in the background, you won't see any process taking up more than 50% CPU on your dual-core, or 25% CPU on your quad core. In fact, odds are none will be even close to that.

    Multi-threading can offer little speed increase there (there is theoretically some as code is executed simultaneously, but it's negligible and probably unnoticeable); its value is only truly seen when a program can actually make use of more processor power than any single core has. Video conversion is a good example -- on my dual core at home, most of my video conversion tools hit 50% CPU and run at that until done. It's programs like this that can take advantage of multi-threading and therefore having access to more raw processing power at once (double, in fact).

    I agree that it would be nice to see more tools out there to add ease to coding for multi-core processors, and to see those few, CPU intense programs suddenly see double the processing power. But given that only a very specific selection of software requires it, and moreover a lot of the time that is not software the "average joe" would be using, it's probably just not vital enough to hit the priority lists yet (especially given that there are a few programs out there that do successfully implement multi-threading, and others that mimic it to a lesser extent).

  31. Why a web browser needs threads by tepples · · Score: 4, Insightful

    What good are multiple cores and threads when you are running event driven GUI application?

    Mozilla Firefox is an event-driven GUI application. But if I open a page in a new tab, a big reflow or JavaScript run in that page can freeze the page I'm looking at. You can see this yourself: open this page in multiple tabs, and then try to scroll the foreground page. If Firefox used a thread or process per page like Google Chrome does, the operating system would take care of this. Other applications need to spawn threads when calling an API that blocks, such as gethostbyname() or getaddrinfo(); otherwise, the part of the program that interacts with the user will freeze. But these are the kind of threads that are useful even on a single core, not multicore-specific optimizations.
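
    A small sketch of the blocking-call case, under loud assumptions: `slow_lookup` is a hypothetical stand-in that just sleeps instead of calling getaddrinfo(), the address comes from the 192.0.2.0/24 documentation range, and the `while` loop is only pretending to be a UI event loop.

```python
import threading
import time


def slow_lookup(hostname):
    """Stand-in for a blocking call such as getaddrinfo(); it only sleeps."""
    time.sleep(0.2)
    return hostname + " -> 192.0.2.1"   # documentation-only address


def lookup_in_background(hostname, results):
    """Start the blocking call on a worker thread and return immediately."""
    worker = threading.Thread(
        target=lambda: results.append(slow_lookup(hostname)))
    worker.start()
    return worker


results = []
worker = lookup_in_background("example.org", results)

ui_ticks = 0
while worker.is_alive():    # the pretend UI loop keeps running meanwhile
    ui_ticks += 1
    time.sleep(0.01)
worker.join()

print(results, "UI iterations while waiting:", ui_ticks)
```

    This helps just as much on a single core, because the worker spends its time waiting rather than computing.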

    1. Re:Why a web browser needs threads by Snowblindeye · · Score: 2, Insightful

      open this page in multiple tabs, and then try to scroll the foreground page. If Firefox used a thread or process per page like Google Chrome does, the operating system would take care of this.

      I think you are gravely oversimplifying things. Firefox certainly uses multiple threads. My Firefox process is using 16 threads at the moment. The reason Chrome is using processes is so that when one of them crashes the other ones stay up.

      Also, if you look closely, it doesn't completely lock up while the other tabs are loading. It *does*, however, lock up at some point during the rendering. Which would indicate that some points of the code are synchronizing between threads, or bottlenecking on some resource, and that locks it up.

      Which is part of the problem. It's easy to say people need to use more threads. But the trouble comes when you need to synchronize, when they need to communicate with each other. That's when you introduce performance bottlenecks. It's also one of the reasons why threading is harder than it seems.
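
      To make the synchronization point concrete, here is a toy sketch; `SharedCounter` is a made-up class and has nothing to do with Firefox's code. Four threads update one value, and the lock that keeps the total correct is also the spot where they all queue up behind each other.

```python
import threading


class SharedCounter:
    """A value that several threads update; the lock keeps it consistent."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:        # only one thread at a time gets past here
            self.value += 1


def add_many(counter, n):
    """Increment the shared counter n times."""
    for _ in range(n):
        counter.increment()


counter = SharedCounter()
threads = [threading.Thread(target=add_many, args=(counter, 10_000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value)
```

      The more time threads spend inside the locked section, the closer the whole thing gets to running on one core, however many cores there are.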

    2. Re:Why a web browser needs threads by Anonymous Coward · · Score: 0

      No, Chrome uses multiple processes for a number of different reasons. Stability is not the only one. Responsiveness is another very significant reason.

      The rationale is that pre-emptive multitasking OSes can schedule far more effectively than the average software developer or the user can. Firefox may use multiple threads, but those are almost all I/O/Network. Layout (of web pages) is inherently single-threaded and is all done on the same thread as the UI. This is where your bottlenecks are in a traditional browser architecture. Apps with architectures like Firefox (which is to say, most apps, hence the lament of the OP) can attempt co-operative multi-tasking, but at best the effect will be the same as it was on Windows 3.x or pre OS-X MacOS.

    3. Re:Why a web browser needs threads by electrosoccertux · · Score: 1

      My Firefox process is using 16 threads at the moment.

      Could you clarify this for us? I thought it was common knowledge Firefox did not use multiple threads, and the code would be such a nightmare to try to thread that the developers just get angry at the mention of it (hence hardly anybody uses the gecko engine for the base of a new browser. Or something like that. I'm grossly oversimplifying what I've read from those supposedly in the know on /.) rather than try to deal with the problem.

      How do you know it's running 16 threads?

    4. Re:Why a web browser needs threads by AaronLawrence · · Score: 1

      Simply run task manager, and choose View - Select Columns. Then check Thread Count. You'll now be able to see how many threads each process has.

      --
      For every expert, there is an equal and opposite expert. - Arthur C. Clarke
  32. blah by Anonymous Coward · · Score: 0

    Wait, what? Sorry, the windows kernel might indeed not be prepared for large multicore systems (IIRC a 64-proc limit), but linux ALREADY RUNS most of the world's large HPC systems - some of which are clusters, but some of which are enormous SMP machines - linux runs 1024-proc SGI machines, for example. Linux has O(1) scheduling, good NUMA support, and can handle many, many cores already very well.

    Of course the article has nothing to do with that, but rather userspace.

    But even there, while multicore is new to the bulk of developers and there's a lot of wheel-reinvention going on, the HPC world has been doing parallel programming for DECADES. It's not new, the techniques are well known.

    And just look at a typical linux (or windows) box - it's running quite a few concurrent processes already. Individual applications might not be parallelised, but having multicore sure helps when I want to leave amarok playing in the background while I play sauerbraten while I have bittorrent going.

  33. multithreading not even in C or C++ by Lord+Lode · · Score: 0

    There's not even a way in the C or C++ core language to start a new thread. And with many different third party libraries, there'll never be a reliable standard way to do it. Multithreading is great and everything, but if even such a popular programming language doesn't allow it, how am I supposed to produce programs for 8-core CPU's?

    1. Re:multithreading not even in C or C++ by julesh · · Score: 1

      I believe Boost supports multithreading, and is considered a semi-standard for C++ development these days; in fact, I understand that the next version of the C++ standard will incorporate a number of libraries from Boost. Not sure if the threading library is one of them, though.

    2. Re:multithreading not even in C or C++ by Anonymous Coward · · Score: 0

      Threads? They're nice and all, but I think it's understandable that they're not included in current C/C++, because of how sparse OS support was, and them being such an OS-level concept.

      What can C do? Spawn new processes. exec commands are defined in C and they work pretty well nowadays.

      C++0x defines std::thread. Boost also supports threads.

    3. Re:multithreading not even in C or C++ by Anonymous Coward · · Score: 0

      Not yet, but soon:

      http://en.wikipedia.org/wiki/C%2B%2B0x#Multitasking_memory_model

    4. Re:multithreading not even in C or C++ by Anonymous Coward · · Score: 0

      You could do it in Java.

    5. Re:multithreading not even in C or C++ by hazydave · · Score: 2, Interesting

      Multithreading is a system-level thing, not a language level thing.

      Sure, there have been languages that make threading ubiquitous, but they've never caught on, and it's hardly necessary.

      You'll notice that internet, graphics, and many other programming necessities are not built into C/C++ either. They are higher level functions, and thousands of programmers have no problem understanding C's role here. People have been writing multithreading code in C/C++ for decades... I've personally done it from the 80s until now, under a dozen or so OSs.

      Don't use your chosen language as a crutch for sticking to the level of programming practiced when that language debuted. The whole point of C was not to define much of anything in C itself.. in truth, the language proper doesn't even do I/O... that's handled via a library. So is threading, so is graphics, etc.

      --
      -Dave Haynie
    6. Re:multithreading not even in C or C++ by KiloByte · · Score: 1

      The last time I checked, pthreads weren't exactly non-standard. Every reasonable system has it built-in for over a decade, and there's only one system where you need to get that as an add-on. Guess which...

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    7. Re:multithreading not even in C or C++ by johannesg · · Score: 2, Informative

      There's not even a way in the C or C++ core language to start a new thread. And with many different third party libraries, there'll never be a reliable standard way to do it.

      Never? A standard, reliable way to do it will be part of C++0x - so that's hardly "never"...

    8. Re:multithreading not even in C or C++ by Anonymous Coward · · Score: 0

      errr hook the system threading api perhaps, call me old fashioned but thats how I always do it... ?

      The reason there is no 'native' support in the language (C,c++,VB etc) is because it's not needed... You don't need to use a third party API to do it either, that just adds more layers of gloop and uses up clock cycles...

    9. Re:multithreading not even in C or C++ by Alioth · · Score: 1

      Java never caught on? That's news to me....

    10. Re:multithreading not even in C or C++ by SpinyNorman · · Score: 1

      So why doesn't the difference between the native graphics APIs on Windows and Mac or Linux similarly prevent you or anyone else from writing graphical applications? You either decide what platform you're targeting and use the APIs for that platform, or if you want cross-platform you choose cross-platform (incl. threading) tools. Doh!

      From a conceptual point of view how difficult is it anyway to switch from one platform's implementation of the standard "threads+mutexes+condition variables" model to the exact same model on another platform?!
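
      For anyone who hasn't met that model, a minimal sketch of it in Python's `threading` module, which offers the same three pieces; `Mailbox` is a hypothetical class written for this example, not part of any platform API.

```python
import threading
from collections import deque


class Mailbox:
    """Hand items from one thread to another: mutex plus condition variable."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()   # wraps a lock and a wait queue

    def put(self, item):
        with self._cond:                     # take the mutex
            self._items.append(item)
            self._cond.notify()              # wake one waiting consumer

    def get(self):
        with self._cond:
            while not self._items:           # re-check after every wakeup
                self._cond.wait()            # releases the mutex while asleep
            return self._items.popleft()


box = Mailbox()
received = []


def consumer():
    for _ in range(3):
        received.append(box.get())


t = threading.Thread(target=consumer)
t.start()
for n in (1, 2, 3):
    box.put(n)
t.join()

print(received)
```

      The shape (lock, loop on the predicate, wait, signal) carries over more or less unchanged between threading libraries, which is the point being made above.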

    11. Re:multithreading not even in C or C++ by krischik · · Score: 1

      I have done multithreading in C / C++ and a programming language which supports it natively (Ada). And there was a huge difference in the amount of bugs (aka deadlocks) I produced.

      It is your line of argument which hinders multithreading. Not that your argument is wrong. You are right. But that makes it even more dangerous.

      The point is: There is nothing wrong with making your life easy. And if a programming language which makes multithreading easy and less error-prone had caught on, we would be a lot further along now.

    12. Re:multithreading not even in C or C++ by Lord+Lode · · Score: 1

      C has built in operators to add, subtract, multiply and divide numbers. You can use the CPU and the RAM memory. You don't need a library for that. I think the ability to do computations on the different cores and using multiple threads is just as basic as the computation and memory and should be part of the core language.

  34. It's already there by wurp · · Score: 3, Insightful

    Seriously, no one has brought up functional programming, LISP, Scala or Erlang? When you use functional programming, no data changes and so each call can happen on another thread, with the main thread blocking when (& not before) it needs the return value. In particular, Erlang and Scala are specifically designed to make the most of multiple cores/processors/machines.

    See also map-reduce and multiprocessor database techniques like BSD and CouchDB (http://books.couchdb.org/relax/eventual-consistency).
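
    A minimal sketch of the "block only when the value is needed" idea, written in Python rather than any of the languages named above; `square` and `sum_of_squares` are hypothetical names. Because `square` touches no shared state, the calls can be handed to other threads in any order.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """A pure function: no shared state, so calls can run anywhere."""
    return x * x


def sum_of_squares(values):
    """Start every call up front, then block only when a result is needed."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(square, v) for v in values]   # the "map" step
        return sum(f.result() for f in futures)              # the "reduce" step


total = sum_of_squares(range(1, 6))
print(total)
```

    This shows the structure only. In CPython, threads do not run Python bytecode on several cores at once, so real speedups for CPU-bound work would need processes or a runtime built for it, such as the Erlang and Scala ones mentioned.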

    1. Re:It's already there by Anonymous Coward · · Score: 1, Insightful

      Lisp... has a few problems:

      "Let's take function argument evaluation, as a simple example. Because a function call in Lisp must evaluate all arguments, in order, function calls cannot be parallelized. Even if the arguments could have been computed in parallel, there's no way to know for sure that the evaluation of one argument doesn't cause a side-effect which might interfere with another argument's evaluation. It forces Lisp's hand into doing everything in the exact sequence laid down by the programmer. This isn't to say that things couldn't happen on multiple threads, just that Lisp itself can't decide when it's appropriate to do so. Parallelizing code in Lisp requires that the programmer explicitly demarcate boundaries between threads, and that he use global locks to avoid out-of-order side-effects." - John Wiegley

      But yeah, functional languages can sidestep these things. Erlang, Haskell, Scala, etc.

    2. Re:It's already there by shutdown+-p+now · · Score: 1

      FP is certainly a better way to tackle multi-core, but it's not the silver bullet. It doesn't magically turn inherently non-parallel algorithms into parallel ones, though it does aid in parallelizing the rest.

      By the way, I don't know what Lisp is doing on your list. It's not hardcore FP for one, and most certainly not in a "no data changes" sense - most Lisp programs out there use imperative constructs, including mutable state (CL more so, Scheme less so, but both do).

    3. Re:It's already there by Anonymous Coward · · Score: 0

      It's funny seeing people talking about LISP as a functional language. This says to me, each time I see it, that the programmer studied LISP in college, but did not use it professionally. Professional LISP programmers, few that there are remaining, use LISP imperatively, as an object oriented language, and revere it for the sexpression, literate macros, and other such features. They do not, as you imagine it, spend all their time attempting to determine how to turn all of their code into recursive functions and the like.

      C//

    4. Re:It's already there by wurp · · Score: 1

      In fact, I got my BS in Physics & Mathematics, so I didn't study LISP at all in college. I've done a bit since then just to learn the basics. I've certainly never used it professionally.

      Thanks for the info.

      I have spoken to at least one professional LISP programmer who indicated that he did in fact try to build almost solely functional software. I don't know if he was really telling the truth.

    5. Re:It's already there by SpyPlane · · Score: 1

      Thank you for mentioning erlang!

      Erlang is growing, albeit it is still a small community, but we are growing! Check out github.com for a ton of erlang projects going on. There are some really stellar ones that are making news like Powerset or Nitrogen. Just reading the developer blogs, you get the impression that most developers coming from languages like C++ and Java are almost shocked how easy it is to write scalable and reliable software in Erlang.

      I've since picked up erlang and am trying to convince my boss to let me put some erlang apps in our product. For the time being, it is open source erlang projects for me.

      --
      "We need a fourth law of Robotics: Stop Fingering My Wife"
  35. Research starting? by iliketrash · · Score: 1

    "... and research is only starting."

    Hmmm... I remember people doing research on this subject at the University of Illinois when I was a graduate student there in the 1980s.

    1. Re:Research starting? by flyingfsck · · Score: 1

      Shhhh... you are giving your age away. Nobody on Sloshdat is supposed to know that 1024 processor cubes were already being built 30 years ago.

      --
      Excuse me, but please get off my Pennisetum Clandestinum, eh!
  36. Article = -1 Flamebait by tyler_larson · · Score: 3, Insightful

    If you spend more time assigning blame than you do describing the problem, then clearly you don't have anything insightful to say.

    --
    "With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea...."
    RFC 1925
  37. Virtualization by fermion · · Score: 1

    I have tried to do some stuff for Cray, so I know how hard it is to write these types of applications. Someone correct me if I am wrong, but I thought the solution was virtualization. We know most applications can run on a modest chip. The issue is that if one is running several processes, the chip has to save state, run the new process, save state, restore state, and then run the old process. This happens many times a second, and everything is so fast that most of the time we don't notice. I am watching a movie, writing this, and running a GUI all on the same machine. I don't notice any delays.

    However, there is an issue of overhead with switching, and it seems like running specific processes on specific cores would do enough to help here. I don't see why the average application needs to run on more than one core. It seems like the OS can assign a core a process, and there would be no issue beyond the current multithreading.

    Now, like the stuff written for the Cray, there are some applications that could take advantage of the parallel processing, but I don't see a general need for this. It would be like the original Mac where certain processes were shifted to the graphics processor by the OS. Not that programs are not going to be written differently, but this will happen over time. DOS applications did not become full-fledged Windows applications overnight.

    --
    "She's a scientist and a lesbian. She's not going to let it slide." Orphan Black
    1. Re:Virtualization by jonbryce · · Score: 1

      Virtualisation is mostly a buzzword.

      I am sitting here writing a slashdot comment, which is being spell-checked in the background. I am listening to an mp3. I have some torrents downloading in the background, and my email client periodically checks for new mail, checks it for spam & viruses and notifies me about it as appropriate. I also have an instant messaging client open and connected to a few IM networks.

      Multiple CPU cores help, but there is a limit to what extent they can be used.

      But where does virtualisation come in to this? Other than creating more work for the processor.

  38. There may be a reason for that too by G3ckoG33k · · Score: 1

    'In fact, TFA doesn't even use the words "Linux" or "Windows."'

    Yup. There may be a reason for that too.

    The initial SMP support was added to Linux 1.3.42 on 15 Nov 1995. Linux is clearly well adapted to multicore CPUs. That is one of the reasons why Linux dominates over Windows on www.top500.org. The other argument is cost.

    1. Re: There may be a reason for that too by drsmithy · · Score: 1

      The initial SMP support was added to Linux 1.3.42 on 15 Nov 1995.

      Windows NT supported SMP from its first release in 1993.

    2. Re: There may be a reason for that too by Bert64 · · Score: 1

      The third argument is adaptability...
      A lot of the top10 systems are Power/PPC based, windows doesn't run on that in any form that would be useful for HPC...

      Also Windows won't boot without a videocard present, it has no serial console mode of operation (does HPC 2008 change this?) meaning you need to have video hardware present in all your nodes even tho it will never be used... It still increases costs and consumes power.

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
  39. That's a big leap by SuperKendall · · Score: 3, Insightful

    If you don't believe me, pull out a profiler and run it on one of your programs, it will show you where things can be easily sped up.

    Now, given that the performance of most programs is not processor bound

    That's a pretty big leap, I think.

    Yes, a lot of today's apps are more user bound than anything. But there are plenty of real-world apps that people use that are still pretty processor bound - Photoshop, and image processing in general is a big one. So can be video, which starts out disk bound but is heavily processor bound as you apply effects.

    Even Javascript apps are processor bound, hence Chrome...

    So there's still a big need for understanding how to take advantage of more cores - because chips aren't really getting faster these days so much as more cores are being added.

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
    1. Re:That's a big leap by Anonymous Coward · · Score: 0

      Most of those problems (3d, image, video, audio) are already easy to make parallel with normal things like threads and processes. In fact my LightWave 5.6 from year 1997 is able to use up to 8 threads.

      The hard thing to understand is that we do not need more processor power. There are few times when you need that full 100% oomph, but most of the time the processors hum idly in 20% or less.

      Now, imagine if you could fetch that 100% oomph when needed from a server farm instead of your own computer...

    2. Re:That's a big leap by phantomfive · · Score: 2, Informative

      So there's still a big need for understanding how to take advantage of more cores - because chips aren't really getting faster these days so much as more cores are being added.

      OK, so we can go into more detail. For most programs, parallelization will do essentially nothing. There are a few programs that can benefit from it, as you've mentioned. But those programs are already taking advantage of them, not only do video encoding programs use multiple cores, some can even farm the process out over multiple systems. So it isn't a matter of programmers being lazy, or tools not being available, it's a matter of in most cases, multiple cores won't make a difference. If you run windows, open the task manager and check how often the CPU is completely occupied. Rarely.

      Javascript is an interesting example, because in the last few months we've had something of a competition between browser makers to see who could get the fastest javascript. Now, I'm not going to go read through the changelogs, but I'm willing to bet that the biggest speed ups haven't been from making it multi-threaded, but rather from standard optimization techniques. Basically they went through with a profiler, found what the bottlenecks were, and tried to remove them. This is the normal way to optimize your program. If it happens to turn out that the bottleneck is a bunch of things waiting to use the processor while there is another one available, then you start thinking about making it multi-threaded. If not, then making it multi-threaded will gain you nothing as far as performance.
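
      One way to put numbers on "will gain you nothing" is Amdahl's law, which is my framing and not something named in this thread. The sketch below assumes you already know, from a profiler, what fraction of the run time could be spread over cores; the function name is made up.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Best-case speedup when only part of the run time can use extra cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Nothing parallelizable: eight cores buy nothing.
print(amdahl_speedup(0.0, 8))
# Half the run time parallelizable: well short of 2x even on eight cores.
print(amdahl_speedup(0.5, 8))
# Everything parallelizable: the ideal case.
print(amdahl_speedup(1.0, 8))
```

      So the profiler question comes first: if the hot spot is not waiting for a free core, the parallel fraction is close to zero and so is the gain.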

      --
      Qxe4
    3. Re:That's a big leap by wlt · · Score: 1

      Now, imagine if you could fetch that 100% oomph when needed from a server farm instead of your own computer...

      I think that just shifts the problem from CPU power to bandwidth (which is already a problem area).

      this is why PCs evolved in the first place - mainframes were essentially a server with ... very poor, very high latency bandwidth (driving to the office to get punched cards... that's a 1 day ping time).

      the raw performance figures keep going up, but the relationship shifts back and forth. putting all the smarts in the cloud just makes bandwidth that much more critical. if you can't have the bandwidth, maybe it makes sense to have the smarts at the end nodes.

    4. Re:That's a big leap by davecb · · Score: 4, Informative

      And if you look at a level lower than the profiler, you find your programs are memory-bound, and getting worse. That's a big part of the push toward multithreaded processors.

      To paraphrase another commentator, they make process switches infinitely fast, so one can keep on using the ALU while your old thread is twiddling its thumbs waiting for a cache-line fill.

      --dave

      --
      davecb@spamcop.net
    5. Re:That's a big leap by Anonymous Coward · · Score: 0

      You nicely mention that the CPU is rarely completely occupied. You can sometimes combat this (as you stated, a large portion of time is often spent waiting on some resource to become available) by threading at a level of granularity such that you're not spending most of your time waiting on a tied-up resource, but rather chugging away on Thread A while Thread B waits on a shared resource and then vice versa.
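
      A minimal sketch of that idea in Python, assuming the waits are blocking calls (the `fetch` function and its sleep are stand-ins invented for the example, not a real API):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(resource, delay=0.05):
    # Stand-in for a blocking wait (disk, network, a lock held elsewhere).
    time.sleep(delay)
    return f"{resource}: done"


def fetch_all(resources, workers=4):
    # While one thread sleeps on its resource, the others keep making progress.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(fetch, resources))


if __name__ == "__main__":
    print(fetch_all(["disk", "network", "database"]))
```

      With four workers the three waits overlap instead of adding up, which is the Thread A / Thread B trade-off described above.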

      To me this actually cries out for BETTER (not necessarily read "more") multi-threading instead of the same level. This begs the question: who do we look to for better multi-threading? The developers (those guys mentioned in the summary) who either aren't utilizing available tools or aren't being provided them. Couple this with the fact that most scheduling is immature compared to the current advances in hardware, and here we are today.

    6. Re:That's a big leap by Haeleth · · Score: 1

      Now, imagine if you could fetch that 100% oomph when needed from a server farm instead of your own computer...

      Do you seriously think that in the near future we're all going to start uploading raw video to the Internet in order to run the processor-intensive editing and compression processes on a server farm?

      If so, perhaps I can interest you in my real estate portfolio? It includes several highly affordable bridges and a unique property with unmatched views of Paris.

      Back in the real world, we're going to want powerful PCs for a little while yet.

    7. Re:That's a big leap by Anonymous Coward · · Score: 0

      Complete failure to comprehend. At which point did he say anything about the internet? In no part. You have mocked him for something he didn't even say, and as a result ended up looking like a fool. Way to go.

    8. Re:That's a big leap by BenoitRen · · Score: 1

      Chips are getting faster, just not in MHz or GHz anymore. When chip makers found out that they couldn't increase the magic CPU cycles per second number, they started adding cores.

      The reality is that it's all about architecture. The improvements in chip design make them much more efficient, and thus faster.

    9. Re:That's a big leap by Anonymous Coward · · Score: 0

      There is no evidence that logged-in users read responses either.

    10. Re:That's a big leap by phantomfive · · Score: 1

      They get emails when a new response is added to their comment, or they can see them in their preferences page.

      --
      Qxe4
  40. Not many apps are made for multicore! by Vexorian · · Score: 1
    However, it is not like they need to be. I think the biggest benefit multicore can give is the ability to run many more applications in parallel, rather than the optimization of a single application. Both Windows 7 and Linux (for a longer time) have supported multicore quite well, and there ARE tools/languages that make multi-threading more friendly. However, we don't use them, perhaps because, beyond the logical uses for threads, it is rare to optimize for multiple cores unless you are doing some really speed-critical task, which is not something that home apps do that often... However, at home multicore can still be a lot of help even without apps using it.

    Being able to burn a DVD while encoding a video and playing some game all at the same time, is something that benefits from extra cores and does not require the apps themselves to know about the cores. Of course, this is not the most common situation. Perhaps the IT world is starting to realize home/office users don't really need as much power? For the last 5 years MS has been the only force driving us to require more power upgrades, but now with even them focusing on performance in Windows 7, perhaps it is going to be the "year of Moore's law no longer relevant on the desktop".

    --

    Copyright infringement is "piracy" in the same way DRM is "consumer rape"
    1. Re:Not many apps are made for multicore! by tepples · · Score: 1

      Being able to burn a DVD while encoding a video and playing some game all at the same time, is something that benefits from extra cores

      Burning a DVD is I/O bound. Playing a game is user input bound. Encoding a video is the one task of the three that looks CPU bound, but that benefits from multiple cores only because each core can work on 1/4 of the video.

  41. Here we go again by Seth+Kriticos · · Score: 1

    So here we are at the multiprocessing dilemma again. The summary gets it all wrong. It is referring to operating systems, which are fine with this kind of stuff. UNIX and derivative (Linux) systems have been fine with multiprocessing for decades. Most of the big irons in the top 500 are running multi-core just fine. Even Windows got the hang of it lately (I guess).

    The problem is that most application developers did not learn to wrap their minds around the multiprocessing paradigm. No tool can magically redesign your single-threaded application to work multi-threaded. The developer needs to analyze the program flow, export computationally expensive operations to separate threads and manage to get a good junction control (locks, balanced threading). It's a design paradigm that has to be learned.

    Problem is that you can't get developers who are not used to the idea of multiprocessing paradigms to switch. Another problem is that exactly this group of people is also teaching the new generation, so it is not going to change that fast either.

    It's a bit like a chicken and egg problem: until there is a large distribution of multi-core systems, no one will have the urge to switch. So that's why it is a good thing that these new CPUs are getting out. Once they are there, developers will feel the need to utilize them to stay competitive. Kind of like natural selection and adaptation to new environments.

    Isn't exactly rocket science (well, except if you are writing a rocket guidance system).

  42. what kind of fanboy wrote that article? by DragonTHC · · Score: 1

    He says Windows and Linux but spuriously leaves out Mac?

    Mac suffers from the same damn problem. The OS and most apps weren't written for multi-core processors.

    that's why any true multi-core app is distributed. hence rendering farms, not rendering server.

    --
    They're using their grammar skills there.
    1. Re:what kind of fanboy wrote that article? by hazydave · · Score: 2, Interesting

      That's incorrect, at least in part. Modern MacOS is based on CMU's Mach, which has had lightweight threading support since long before Apple got into the picture. The OS was completely designed for multiple CPUs, down to the very core.

      If modern MacOS apps are not heavily multithreaded (I have no idea, I don't run proprietary hardware anymore, regardless of the OS), that's the fault of programmers not advancing past the days of MacOS 9... it has nothing whatsoever to do with the OS.

      --
      -Dave Haynie
    2. Re:what kind of fanboy wrote that article? by RedK · · Score: 1

      Just to add to what Hazydave said, Apple are also including the Grand Central framework in their next release, which is available to developers right now, in order to give them the tools needed to build multi core aware apps. So I guess Apple is ahead of the game here, probably why the article doesn't mention it.

      --
      "Not to mention all the idiots who use words like boxen."
      Anonymous Coward on Monday August 04, @06:49PM
  43. Functional programming style is not enough by Anonymous Coward · · Score: 1, Interesting

    You need to establish/prove purity to the compiler so it can actually make use of it.

    Lisp, Scala and Erlang don't have that property.

    Haskell does.

    Haskell and other pure languages are where the future of parallelism might lie.

    1. Re:Functional programming style is not enough by SpuriousLogic · · Score: 2, Interesting

      I'm not sure I totally agree that Haskell is the future, although I do think that functional programming right now looks to be the most promising way to deal with multi-cores. Scala has some very strong points that could see its adoption beat the others, specifically being able to run on the JVM and make use of existing Java libraries. You can use the functional aspects of Scala when you need to, but still use Java where you do not need parallelism.

    2. Re:Functional programming style is not enough by SanityInAnarchy · · Score: 1

      Erlang is not pure, true -- it requires programmers to explicitly break tasks up into processes.

      It then makes those processes so efficient that there's hardly a performance hit to doing so.

      The other advantage is that it's trivial to scale this beyond a single machine, let alone a single core. How easy does Haskell make RPC?

      Now, it's true that it helps if your compiler can figure it out for you. The question is whether it's easier to grasp purely functional programming, or the concept of actors. I would argue actors are easier, especially with things like Reia.

      --
      Don't thank God, thank a doctor!
    3. Re:Functional programming style is not enough by Peaker · · Score: 1

      You can use actors with purely functional programming.

      It's just that the types of your actors will accurately depict the effects that they may have.

    4. Re:Functional programming style is not enough by SanityInAnarchy · · Score: 1

      You can use actors with purely functional programming.

      Certainly.

      But my understanding was that Haskell parallelizes by having the compiler analyze your program. I assume actors are not part of the language itself.

      --
      Don't thank God, thank a doctor!
  44. Should I feel shame? by blake182 · · Score: 1

    I've been programming for 30 years or so, and I've been feeling ashamed. I've been feeling like I've done something wrong and that I haven't structured my programs right. That if only I was smart enough I would be able to take advantage of these multicore systems.

    But I think I'm feeling better about myself. If I write rational multithreaded programs and use scalable patterns like producer / consumer, then I'll be pretty much ready to go.
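
    For what it's worth, the producer / consumer pattern really is that small. A minimal Python sketch, where the sentinel marker, queue size and "square each item" work are all assumptions made up for the example:

```python
import queue
import threading

SENTINEL = None  # hypothetical "no more work" marker; items must not be None


def producer(jobs, items):
    for item in items:
        jobs.put(item)  # blocks when the bounded queue is full
    jobs.put(SENTINEL)


def consumer(jobs, results):
    while True:
        item = jobs.get()  # blocks until the producer supplies something
        if item is SENTINEL:
            break
        results.append(item * item)


def run_pipeline(items):
    jobs = queue.Queue(maxsize=8)  # bounded, so the producer cannot run far ahead
    results = []
    threads = [
        threading.Thread(target=producer, args=(jobs, items)),
        threading.Thread(target=consumer, args=(jobs, results)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(5)))
```

    The queue is the only shared state, so neither side needs explicit locks; adding more consumers would need one sentinel per consumer and would give up the output ordering.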

    And it seems like a lot of this isn't really relevant for desktop applications. I mean, there's some amount of keeping the main event thread moving so that your application is responsive, and you do time consuming operations on separate threads. But the only time I've really used a whole lot of threading is in server apps where you have a whole bunch of incoming connections that you're processing concurrently.

    I understand that there is a branch of computer science that surrounds parallel computing, and there are some applications that might benefit from this (image processing being the canonical example). But I think it's another tool in the toolbox. Another way to approach a problem like map / reduce or whatever is in vogue. Some problems will benefit from being solved this way. Some won't. Use the right tool.

    And I don't understand why we need to beat the drum for more efficient use of multicore. It's cool, we'll figure out what to do with all these cores. And then we'll put that in our toolbox and use it when appropriate.

  45. Rub belly and pat head. Now do what else? by tepples · · Score: 1

    So you're watching a movie and writing a Slashdot comment. How many other things can you do at once that would require a core? Even if you have 30 other processes open, most of them would just be waiting for input. There's a limit on the tasks that a single human being can care about at once, and Microsoft doesn't appear ready to bring terminal servers to the home market.

  46. Not tools, developers by Todd+Knarr · · Score: 3, Insightful

    Part of the problem is that tools do very little to help break programs down into parallelizable tasks. That has to be done by the programmer, they have to take a completely different view of the problem and the methods to be used to solve it. Tools can't help them select algorithms and data structures. One good book related to this was one called something like "Zen of Assembly-Language Optimization". One exercise in it went through a long, detailed process of optimizing a program, going all the way down to hand-coding highly-bummed inner loops in assembly. And it then proceeded to show how a simple program written in interpreted BASIC(!) could completely blow away that hand-optimized assembly-language just by using a more efficient algorithm. Something similar applies to multi-threaded programming: all the tools in the world can't help you much if you've selected an essentially single-threaded approach to the problem. They can help you squeeze out fractional improvements, but to really gain anything you need to put the tools down, step back and select a different approach, one that's inherently parallelizable. And by doing that, without using any tools at all, you'll make more gains than any tool could have given you. Then you can start applying the tools to squeeze even more out, but you have to do the hard skull-sweat first.

    And the basic problem is that schools don't teach how to parallelize problems. It's hard, and not everybody can wrap their brain around the concept, so teachers leave it as a 1-week "Oh, and you can theoretically do this, now let's move on to the next subject." thing.

    1. Re:Not tools, developers by EnglishTim · · Score: 2, Insightful

      And the basic problem is that schools don't teach how to parallelize problems. It's hard, and not everybody can wrap their brain around the concept...

      And there's more to it than that; If a problem is hard, it's going to take longer to write and much longer to debug. Often it's just not worth investing the extra time, money and risk into doing something that's only going to make the program a bit faster. If we proceed to a future where desktop computers all have 256 cores, the speed advantage may be worth it but currently it's a lot of effort without a great deal of gain. There's probably better ways that you can spend your time.
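
      One way to put numbers on "a bit faster" is Amdahl's law, which bounds the speedup when only part of the run time can be parallelized. A small Python sketch (the parallel fractions used below are made-up inputs, not measurements of any real program):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the run time parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time; the parallel part is divided up.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for cores in (2, 4, 8, 256):
        print(cores, round(amdahl_speedup(0.5, cores), 3))
```

      If only half of the run time parallelizes, the speedup stays below 2x no matter how many cores you add, so even 256 cores only pay off for programs that are almost entirely parallel.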

  47. No, you shouldn't feel ashamed? by G3ckoG33k · · Score: 1

    "And I don't understand why we need to beat the drum for more efficient use of multicore."

    Huh? It is really simple. Because the industry wishes to perpetuate a need for new products, whether we need them for the moment or not.

    In the meantime, maybe some dude may discover the next killer-application which could actually harness the power at hand.

    Very few pc programs can make the latest quad cores crawl. They typically handle anything you throw at them. Even most 3D games are swallowed.

    So, accept it. The progress is there. If not for the need, so at least because of marketing and market shares.

    Who in their right mind would buy an inferior product, e.g. a CPU, if the competitor was cheaper and faster and consumed less power?

    .

  48. Fortran anticipated out-of-order back in 1990 by goombah99 · · Score: 1

    In the 80's Fortran, which stayed alive and healthy by working in the vector processor community, got all sorts of instructions that are naturally out-of-order block processes. For example, for-loop and where-loop declarations that say the loop counter or loop array can be iterated in any order. It has matrix parallel operation declarations.

    Sun's Fortran variant Fortress (sort of Java meets Fortran) is designed from the start for thread safety so operations don't explicitly have to lock and unlock before expression.

    And the new PGI Fortran compiler has all sorts of compiler directives for automatic parallelization.

    --
    Some drink at the fountain of knowledge. Others just gargle.
    1. Re:Fortran anticipated out-of-order back in 1990 by AHuxley · · Score: 1

      That's because the people who use Fortran are smart and have the math skills to work with many cores or seek another way to solve the problem.
      Life is good when you're well funded, smart and have hardware options to work your problems out on.
      Cores, speed or really fast cores, or a super computer.
      What happens at MS or Apple?
      Some software design 'idea' is passed down to the programmers.
      They work out (in real offices) a time line and budget and see the big picture.
      The plan is transmitted to a communist country, a country with a caste system, a country with a nuclear export problem or ....
      People in cubicles look over the code for any military industrial secrets, see what is needed to be done and promptly out source to the cheapest coders they can find.
      Your software is then worked on, passed back up the chain after getting culturally appropriate comments inserted.
      The software is then presented to the original "software design" (i.e. marketing) team at MS or Apple.
      Alive and healthy today is on time and within a budget on one core.

      --
      Domestic spying is now "Benign Information Gathering"
    2. Re:Fortran anticipated out-of-order back in 1990 by ciderVisor · · Score: 1

      Heavy number-crunching is inherently suited to parallelization. Once you have a desktop scenario with human-computer interaction, accurately predicting the next required 'work package' becomes several orders of magnitude harder.

      --
      Squirrel!
    3. Re:Fortran anticipated out-of-order back in 1990 by Anonymous Coward · · Score: 0

      > And the new PGI fortran compiler has all sorts of compiler directives for automatic parallelization.

      And like everything else you mentioned, and also e.g. OpenMP and to some degree CUDA, it probably doesn't work for anything actually difficult.
      Sure, it's great that we don't have to write a lot of messy code to do something trivial, but parallelizing the really difficult stuff has not got any bit easier.
      Often you end up just using a horribly inefficient algorithm (as in 2x or more slower on a single CPU) just to get it parallel somehow.
      That approach is hardly an option if you develop for desktop CPUs and you mostly have customers with 1 or 2 cores...

    4. Re:Fortran anticipated out-of-order back in 1990 by Anonymous Coward · · Score: 0

      Sure, it's great that we don't have to write a lot of messy code to do something trivial, but parallelizing the really difficult stuff has not got any bit easier.

      If the problem doesn't contain natural data or control parallelism, it is questionable to pursue speedup by these means. Perhaps a DSP, custom instructions in an FPGA, or even an ASIC would provide a preferable solution for speedup in this case.

  49. News Flash: Software takes time to migrate by carlzum · · Score: 1

    Intel releases a new processor this year and the author is surprised that existing software applications aren't immediately taking advantage of it? This isn't a matter of changing a compiler setting or modifying a few methods; parallel computing requires major refactoring and fundamental redesign. And how are Windows 7 and Linux not well prepared? The development tools and applications aren't prepared, not the operating systems.

  50. Nutty by eneville · · Score: 2, Insightful

    I disbelieve this entirely. UNIX/Linux is well designed for multiple core CPUs. Just take the whole single program, single small job approach of a pipeline command and you have your multicore solution ready. Programs that can make use of tasks that are IO bound are frequently written with threading in mind. qmail/apache are both well written for multiple core CPUs. I don't see what the article is trying to say. It's clearly wrong.

  51. OS or userland apps? by gmuslera · · Score: 1

    I think that linux has been used successfully in massive multiprocessor computers (unless most of top 500 computers are mostly single processor ones).

    In a desktop PC, the OS will take care of the multiple CPUs to run the different apps, unless you are talking about heavily CPU-intensive apps, and yes, you can put the blame on those specific apps (at least for Linux, most apps aren't OS specific)... but not on the OS.

  52. Excellent point by wurp · · Score: 1

    I am, unfortunately, not an expert in functional languages. I do remember that LISP isn't pure functional.

    The main point still stands - functional languages do already address this issue. You're absolutely right that LISP doesn't do all it needs to out of the box to address the issue properly.

    I honestly have no idea if Erlang, Scala or Haskell do allow the compiler to identify pure functional calls, although I tend to believe the other AC response that Haskell, at least, does.

    1. Re:Excellent point by Anonymous Coward · · Score: 0

      I honestly have no idea if Erlang, Scala or Haskell do allow the compiler to identify pure functional calls, although I tend to believe the other AC response that Haskell, at least, does.

      Haskell not only "allows" the compiler to "identify" pure functional calls -- it requires that anything that isn't pure functional be explicitly marked as such, and impure functions cannot be called from pure ones. The compiler knows which functions are pure simply by looking at their types.

      (Unless you deliberately subvert the type system using a function that includes the word "unsafe" in its name.)

      ((And technically all the functions are pure, even the ones that do IO, but only via handwaving.))

    2. Re:Excellent point by wurp · · Score: 1

      IMO it would be better for the compiler to identify and tag the calls internally - obviously it can do so, or it couldn't give compilation errors in the way you describe. Then it could allow you to mark a function as pure, for your own information, and give a compilation error if it wasn't.

      That would give you the advantages that the compiler can give for pure functions, without the hassle of updating whole trees of functions when one impure call is made, but it would still let you ensure that critical calls are pure.

      I'm not sure about the quotes - is that intended to be sarcasm? Your sentence reads fine without them, and including them seems to be intended to imply that I'm stupid, which I'm not ;-)

    3. Re:Excellent point by dumael · · Score: 1

      With Erlang, it is "fairly" easy to identify pure function calls, as most non-pure functions live in specific libraries. Erlang also has lightweight threads, so rather than trying to concentrate on small scale parallelism, programmers are mostly free to spawn the necessary number of threads to solve a problem. They obviously aren't totally free, but they are a lot cheaper than requesting a new thread from the underlying OS. Haskell is essentially totally pure. All side effects are specifically spelled out, so the compiler has to actually identify them to compile your code in the first place. Most of the difficulty in parallelizing functional languages is not in identifying what can be made to run in parallel, but in figuring out if it's worth it.

  53. Clueless writer! by Skal+Tura · · Score: 1

    The tools have been here for some 10 years now, multithreading has existed a reaaally long time now, documentation was still lacking in the late 90s, but running multiple threads is child's play now.

    Like someone else stated, mostly programs aren't CPU bound, they spend most of their time waiting for data from HDD etc.

    Applications that benefit from multiple cores are already multithreaded, or a lot of them are. It's not a software paradigm limiting scalability.

    Furthermore Windows 7 is MORE than capable of handling 8 cores; in fact, Windows 7 probably starts to shine at 16 cores with its SMP capabilities. Microsoft spent A LOT of time making sure there's that kind of scalability in the Windows kernel.

    I can't express enough how misinformed TFA writer is, and how clueless and ignorant he is. I'm SHOCKED that this kind of garbage is on Slashdot! Come on, even half-witted self-respecting geeks already know this stuff better.

  54. Multicore = failure. by zymano · · Score: 1

    I championed it here but there is no software that utilizes it and programming for it is difficult as mentioned here in many articles.

    New languages aren't being used to help out multicore or parallel processing with graphics chips.

    A graphics chip computer built for gaming and general use would be amazing. It would cost as much as an entry-level PC with a general-purpose chip but could do 3D GAMES!

    But it would need a parallel processing language.

    http://www.gpgpu.org/
    http://en.wikipedia.org/wiki/GPGPU

    1. Re:Multicore = failure. by SeekerDarksteel · · Score: 1

      The problem is that GPUs are terrible for single threaded performance, resulting in massive slowdowns for everything but those tasks which can take advantage of the parallelism.

      --
      The laws of probability forbid it!
  55. Not if you use your computer well by Anonymous Coward · · Score: 1, Interesting

    A computer that is used in an efficient way will at any time either do nothing (and hopefully switch to standby/hibernate after a couple of minutes) or do several things in parallel. While I read Slashdot, my computer is mostly downloading mail, uploading files to a web server, defragging the disk, encoding a video, doing a background backup, etc. Or if it isn't, it can fold proteins. Modern browsers will also soon be multithreaded, some already are, so every tab, plugin etc. can run on its own core.

    Apps that lack multithreading can also be a blessing - less overhead, and they are restricted to one core, so no matter how bad an app behaves, there will always be a core that isn't affected by the CPU hog so the machine stays responsive. Responsiveness is much more important than raw computing speed.

  56. Macintosh by Bordgious · · Score: 0

    Mac FTW!

  57. Oh Shut Up! by tjstork · · Score: 1

    I like to differentiate myself with threading as a developer but this article is over the line.

    It's absolutely absurd to say that multicore chips won't benefit users of any modern Windows or Linux installation. I think I have like 20 windows open, and quite a few processes. Some of them are active, and some are not. The fact is, these systems, both Windows and Linux, and if anything, Linux, are designed to serve up multiple threads with multiple users on multiple processors. They -are- mainframe operating systems in a consumer role.

    --
    This is my sig.
  58. apples has no 2 core + systems under $2500 so it w by Joe+The+Dragon · · Score: 0, Troll

    Apple has no 2+ core systems under $2500, so we need to hope that 10.6 will work on any PC / work with today's hacks.

  59. This is kinda like XML... by FlyingGuy · · Score: 2, Interesting

    it is the answer to the question that no one asked...

    In a real world application, as others have mentioned, pretty much all of a program's time is spent in an idle loop waiting for something to happen, and in almost all circumstances it is input from the user in whatever form: mouse, keyboard, etc.

    So let's say it is something like Final Cut. Now to be sure, when someone kicks off a render, this is an operation that can be spun off on its own thread or its own process, freeing up the main process loop to respond to other things that the user might be doing. But where the rubber really hits the road is user input. The user could do something that affects the process that was just spun off, whether it runs as a separate thread or process on the same core or on any number of other cores, so you have to keep track of what the user is doing in the context of things that have been farmed out to other cores/processes/threads.
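
    To make the bookkeeping concrete, here is a minimal Python sketch of a render spun off on a worker thread that the main loop can still cancel. It has nothing to do with how Final Cut is actually built; `render`, `start_render` and the frame strings are invented for the example:

```python
import threading


def render(frames, cancel, done):
    # Worker: checks the cancel flag between frames so the UI can interrupt it.
    for frame in frames:
        if cancel.is_set():
            break
        done.append(f"rendered {frame}")


def start_render(frames):
    """Spin the render off and hand back what the main loop must keep track of."""
    cancel = threading.Event()
    done = []
    worker = threading.Thread(target=render, args=(frames, cancel, done))
    worker.start()
    return worker, cancel, done


if __name__ == "__main__":
    worker, cancel, done = start_render(range(3))
    # The main loop stays free here; calling cancel.set() would stop the
    # render at the next frame boundary.
    worker.join()
    print(done)
```

    The `cancel` event and the `done` list are exactly the state the main thread has to track for every job it has farmed out.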

    Enter the OS. Take your pick, since it really does not matter which OS we are talking about; they all do the same basic things, perhaps differently, but they do. How does an OS designer make sure any of, say, 16 cores (dual 8-core processors) are actually well and fairly utilized? Would it be designed to use a core to handle each of the main functions of the OS, let's say drive access, the comm stack (pick your protocol here), video processing, etc., or should it just run a scheduler like those that they run now, which farms out thread processing based on priority? Is there really any priority scheme for multiple cores that could each run, say, hundreds of threads/processes? And what about memory? A single-core machine that is, say, truly 64-bit can handle a very large amount of memory, and that single core controls and has access to all that RAM at its whim (DMA notwithstanding). But what do you do now that you have 16 cores all wanting to use that memory? Do we create a scheduler to schedule access from 16 different demanding stand-alone processors, or do we simply give each core a finite memory space and then have to control the movement of data from one memory space to another, since a single thread (handling the main UI for a program) has to be aware of when something is finished on one core and then get access to that memory to present results, either as data written to, say, a file or written into video memory for display?

    I submit that the current paradigm of SMP is inadequate for these tasks and must be rethought to take advantage of this new hardware. I think a more efficient approach is that each core detected would be fired up with its own monitor stack as a place to start so that the scheduling is based upon the feedback from each core. The monitor program would be able to ensure that the core it is responsible for is optimized for the kind of work that is presented. This concept while complicated could be implemented and serve as a basis for further development in this very complex space.

    In terms of "super computers" this has been dealt with, but with a very different methodology that I do not think lends itself to general computing. Deep Blue, Crays and things like that aren't really relevant in this case, since those are mostly very custom designs built to handle a single purpose and optimized for things like chess, weather modeling or nuclear weapons study, where the problems are already discretely chunked out with a known set of algorithms and processes. General purpose computing, on the other hand, is like trying to herd cats from the OS point of view, since you never really know what is going to be demanded and how.

    OS designers and user space software designers need to really break this down and think it all the way through before we get much further, or all this silicon is not going to be used well or efficiently.

    --
    Hey KID! Yeah you, get the fuck off my lawn!
    1. Re:This is kinda like XML... by bored · · Score: 1

      In a real world application, as others have mentioned, pretty much all of a program's time is spent in an idle loop waiting for something to happen, and in almost all circumstances it is input from the user in whatever form: mouse, keyboard, etc.

      If by that you mean blocked on the message queue waiting for an event.. Or maybe you were talking about the kernel's idle task putting the CPU to sleep.

  60. Multiprocessor issues by JRHelgeson · · Score: 1

    Not all linear problems can be solved with parallel processing.
    It takes 1 woman 9 months to produce a baby - but 9 women cannot produce a baby in a single month...

    Software operates primarily on a linear function: Process A needs to be done before B, and B before C and so forth. The real issue is that dividing a linear process across parallel processors is notoriously difficult: Task "D" is sent to processor 2, however the data it needs to process is already sitting in the cache of processor 0... this slows things down and E finishes before D and the app crashes.

    This is where the design of Microsoft's Hyper-V platform shows real promise. By placing a virtualization layer (Hypervisor) between the OS and the processor, the added abstraction layer can distribute dissimilar or unrelated processes to different cores. It can also assist with non-linear computing tasks that work well with parallel processing and even provide the framework by which

    Look at it this way: there is no way that Microsoft is going to leave spare resources sitting idle. They'll figure out some way to consume every single one. It's the Microsoft way!

    --
    Good security is based upon reality and common sense. Common sense is a function of having common knowledge.
  61. It's better this, realy by V!NCENT · · Score: 1

    The entire idea of multi-core is not that your performance increases, but that performance doesn't decrease.

    I want every thread to run simultaneously instead of timesharing. Imagine all your apps are divided into multiple threads; then you're all timesharing again, and boy, don't you just hate it when your entire computer slows down to a crawl?

    I mean, look at the success of 3D window management: you'll lose a little performance overall, but when a single process jumps to 100% CPU reservation then at least there's no 2D WM lockup.

    --
    Here be signatures
  62. That method might add latency by tepples · · Score: 1

    Would it not be much simpler, though, from a perspective of not needing to deal with complicated motion estimation algorithms and such, just to split video work along groups of B-frames? It seems that as long as the video was more than a few frame groups in length, you would get just as much gain without even needing to reimplement much, if any, of your existing codec algorithms.

    In these discussions about parallelism, I used to recommend splitting a 2-hour video into four 30-minute parts and feeding each to a single-threaded encoder. But that would need more cache and memory bandwidth, something that a lot of PCs with multicore CPUs lack, and it wouldn't work at all for live streaming. Splitting at group-of-picture boundaries might work better, but it would still add more latency to a live stream.

  63. This is incorrect by hazydave · · Score: 3, Funny

    The idea of an OS and/or support tools handling the SMP problem is nothing more than a crutch for bad programming.

    In fact, anyone who grew up with a real multithreaded, multitasking OS is already writing code that will scale just dandy to 8 cores and beyond. When you accept that a thread is nothing more or less than a typical programming construct, you simply write better code. This is no more or less an amazing thing than when regular programmers embraced subroutines or structures.

    This was S.O.P. back in the late 80s under the AmigaOS, and enhanced in the early/mid 90s under BeOS. This is not new, and not even remotely tied to the advent of multicore CPUs.

    The problem here is simple: UNIX and Windows. Windows had fake multitasking for so long, Windows programmers barely knew what you could do when you had "thread" in the same toolkit as "subroutine", rather than it being something exotic. UNIX, as a whole, didn't even have lightweight preemptive threads until fairly recently, and UNIX programmers are only slowly catching up.

    However, neither of these is even slightly an OS problem... it's an application-level problem. If programmers continue to code as if they had a 70s-vintage OS, they're going to think in single threads and suck on 8-core CPUs. If programmers update themselves to state-of-the-1980s thinking, they'll scale to 8-cores and well beyond.

    --
    -Dave Haynie
    1. Re:This is incorrect by Gunstick · · Score: 1

      Unix didn't have threads for a long time because it simply uses complete processes. Process creation does not require much overhead.
      Of course threads are faster and data interchange is better. But in every unix programming class you got to see the inter process communication stuff and how to make several programs cooperate. It's a basic functionality of unix.

      --
      Atari rules... ermm... ruled.
    2. Re:This is incorrect by Todd+Knarr · · Score: 3, Informative

      Unix didn't for a long time have lightweight preemptive threads because it had, from the very beginning, lightweight preemptive processes. I spent a lot of time wondering why Windows programmers were harping on the need for threads to do what I'd been doing for a decade with a simple fork() call. And in fact if you look at the Linux implementation, there are no threads. A thread is simply a process that happens to share memory, file descriptors and such with its parent, and that has some games played with the process ID so it appears to have the same PID as its parent. Nothing new there, I was doing that on BSD Unix back in '85 or so (minus the PID games).

      That was, in fact, one of the things that distinguished Unix from VAX/VMS (which was in a real sense the predecessor to Windows NT, the principal architect of VMS had a big hand in the architecture and internals of NT): On VMS process creation was a massive, time-consuming thing you didn't want to do often, while on Unix process creation was fast and fairly trivial. Unix people scratched their heads at the amount of work VMS people put into keeping everything in a single process, while VMS people boggled at the idea of a program forking off 20 processes to handle things in parallel.
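
      For anyone who wants to see the process/thread distinction described above in action, here is a minimal sketch. It is in Python rather than C purely for brevity, it is POSIX only because it calls os.fork(), and the helper names bump_in_thread and bump_in_fork are made up for illustration. A thread's write to a dict is visible to the caller, while a forked child's write lands in its own copy-on-write copy and the parent never sees it.

```python
import os
import threading


def bump_in_thread(counter):
    """Increment counter["n"] from a thread; return what the caller sees."""
    def bump():
        counter["n"] += 1
    worker = threading.Thread(target=bump)
    worker.start()
    worker.join()
    return counter["n"]


def bump_in_fork(counter):
    """Increment counter["n"] in a forked child.

    Returns (value seen by the child, value seen by the parent).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this write goes to the child's private copy of the data.
        os.close(read_fd)
        counter["n"] += 1
        os.write(write_fd, str(counter["n"]).encode())
        os._exit(0)
    # Parent: collect the child's report, then reap the child.
    os.close(write_fd)
    child_value = int(os.read(read_fd, 32).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)
    return child_value, counter["n"]
```

      Note that the forked version has to use a pipe just to report back what the child saw, which is the explicit-versus-implicit sharing trade-off in a nutshell.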

    3. Re:This is incorrect by shutdown+-p+now · · Score: 1

      I spent a lot of time wondering why Windows programmers were harping on the need for threads ... thread is simply a process that happens to share memory, file descriptors and such with its parent

      There is your explanation right there. Threads make things "easier" because all sharing is implicit. For better or worse, that's another matter...

    4. Re:This is incorrect by bar-agent · · Score: 1

      There is your explanation right there. Threads make things "easier" because all sharing is implicit. For better or worse, that's another matter...

      I say "for worse." What's the point of implicitly sharing everything when you've got to limit access to it regardless? It's better to have specifically declared shared memory with inherently limited access. At the very least, analysis could catch unlocked accesses to known-shared memory.

      --
      i'd hit it so hard, if you pulled me out you'd be the king of britain [bash.org]
    5. Re:This is incorrect by dkf · · Score: 2, Interesting

      It's better to have specifically declared shared memory with inherently limited access. At the very least, analysis could catch unlocked accesses to known-shared memory.

      You're better off going to a message-passing model; they're theoretically much more tractable (there are several schemes that have had decades of work done and even spawned programming languages) and they scale up to multi-machine computing (e.g. cluster-scale) much more easily.

      Shared memory parallelism is just plain nasty. Occasionally useful, but always nasty. Use with care and good taste.
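
      A rough sketch of the message-passing style, assuming Python's standard queue and threading modules as a stand-in for a real message-passing system (worker and square_all are hypothetical names). The workers never touch a shared data structure directly; everything they know arrives as a message and everything they produce leaves as one.

```python
import queue
import threading


def worker(inbox, outbox):
    """Receive numbers as messages and reply with (number, square)."""
    while True:
        msg = inbox.get()
        if msg is None:          # sentinel: no more work for this worker
            break
        outbox.put((msg, msg * msg))


def square_all(numbers, n_workers=4):
    """Fan the numbers out to workers and collect the replies in a dict."""
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for n in numbers:
        inbox.put(n)
    for _ in threads:
        inbox.put(None)          # one sentinel per worker
    for t in threads:
        t.join()
    results = {}
    while not outbox.empty():    # safe: every producer has already exited
        key, value = outbox.get()
        results[key] = value
    return results
```

      The same shape carries over to separate processes or separate machines by swapping the queues for sockets or a message broker, which is the scaling argument made above.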

      --
      "Little does he know, but there is no 'I' in 'Idiot'!"
    6. Re:This is incorrect by bored · · Score: 1

      Unix didn't for a long time have lightweight preemptive threads because it had, from the very beginning, lightweight preemptive processes. I spent a lot of time wondering why Windows programmers were harping on the need for threads to do what I'd been doing for a decade with a simple fork()

      LOL, uh, you are aware that fork() isn't exactly lightweight? In fact it's probably the most heavyweight of any process spawning method. Hence the dozens of modifications to fork (see vfork() for example) on assorted Unixes to make it fast. CreateProcess() in comparison is significantly lighter because it doesn't try to carry process context around. Real threading is generally significantly faster on most hardware arches and OSes.

    7. Re:This is incorrect by Todd+Knarr · · Score: 1

      Uh, you are aware of how fork() worked back in the day, right? When I started using it, it didn't even use copy-on-write; the child process shared the same page tables as its parent. COW came along shortly after that, so I didn't ever have to get much involved with the details of keeping memory separate. vfork() plays other games to avoid the overhead (in particular it doesn't exactly create a new process until exec() gets called).

      As for being heavyweight, perhaps compared to something like Sun green threads, yes. Compared to VMS or Windows process creation it was always much more light-weight because Unix just didn't have the process overhead. Clone the page tables, flag the pages as COW and you're pretty much there. In fact, as I noted, if you look in for instance Linux fork() and pthread_create() are in fact both simply thin wrappers that call the underlying clone() function. And green threads had their own issues, scheduling in particular (since they weren't KSEs).

    8. Re:This is incorrect by bored · · Score: 1

      Uh, you are aware of how fork() worked back in the day, right? When I started using it, it didn't even use copy-on-write; the child process shared the same page tables as its parent.

      Well we must have been using different flavors of unix... The old unix systems I used, that predated COW, actually made physical copies of the process. That was _REALLY_ slow. COW was a huge improvement over that. I remember writing lots of code which manually managed shared memory segments between processes in order to implement process memory sharing, which wasn't the default behavior. Now I just spawn a thread and life is much better.

      Clone the page tables, flag the pages as COW and you're pretty much there. In fact, as I noted, if you look in for instance Linux fork() and pthread_create() are in fact both simply thin wrappers that call the underlying clone() function.

      Uh, on most machines dinking with the page tables is very expensive due to TLB flushing. Later, if you actually access any of those COW pages, the page faults and physical memory copies are also quite expensive. That is why thread creation is so much faster: it doesn't have to mess with the process address space. Also, if you look at http://linux.die.net/man/2/clone, the clone system call has a bunch of flags which control what is "cloned". How heavy it is depends on which flags you set. Linux hasn't exactly done this in the most efficient manner in the past, hence the performance problems back then. I haven't benchmarked it recently, but a couple of years ago Linux thread creation was significantly faster than fork(). I assume that is still true; I expect that the performance difference has probably gotten even larger.

  64. multicore chips and interrupts .. by viralMeme · · Score: 1

    I understand how on a single CPU, the interrupt line is set low and the device puts a unique number on the data bus. How are interrupts handled on these multicore chips?

  65. Natural Process by Aphoxema · · Score: 1

    No one's going to learn how to really work around using multiple cores until they're really out there in the wild where developers can work with them.

    It really is the next logical step, and no one ought to be bitching no one knows how to dig holes when we just found out how to make shovels.

    --
    "Most people, I think, don't even know what a rootkit is, so why should they care about it?"
  66. Parallelism by ledow · · Score: 1

    The simple fact is that most programming tasks are inherently linear. Sure, you can design programs that are better, and you can offload work to other CPU's in clever ways, but at the end of the day, you can't do that much better than a couple of major threads per program, with all of them running on an empty CPU.

    In Office apps, you can't "offload" anything at all, really. Possibly a spellcheck or grammar check on the side, but you're not going to make *any* gains over the simplistic setups. Why? Because 99% of the program is spent waiting for the user to do something and, when they do, 99% of the time you can complete that task in a matter of microseconds.

    In games, you can offload AI, physics, pathfinding, graphics drawing, etc. but at the end of the day you still have to limit interaction to what the user does (i.e. shoots, moves, etc.) and/or the FPS limit. You can get slightly more done by parallelising in that time, purely because the AI is not reliant on the graphics drawing etc., but every 1/60th of a second you have to bring everything to a halt and pass it off to be drawn in order.

    In database apps, you can pass off I/O and tricky queries to other threads and so make gains, but you're just introducing a lot of locks, callbacks and everything else to be able to do that. You can scale with that, but you can't scale that far. And at some point, you've got to read the same data off the same disk as on a single-CPU system and pass that, with *all* its results, to the user.

    In operating systems, you can offload a lot of tasks, but again, most of the time you are looking at waiting for user input to actually do something.

    It's an inherent limitation of the machinery and the uses, not the design of a particular operating system. Sure, you can make gains over what we have now, but the simple fact is that at some point you have to manage and collate all those separate tasks into a result and you can't do it until everything's finished. To use games as an example (because they are a mass-market, hardware-pushing, performance-critical application that will routinely make use of multiple CPU's/GPU's to the full extent), you can't necessarily do the AI until the physics has been done (otherwise bots would walk into moving objects that weren't there a second ago). You can't do the graphics until the AI and physics are both done. And over all that, you have to do SOMETHING every 1/60th of a second whether the other threads have finished or not. And there's only so many ways you can split up tasks. You can do graphics rendering in blocks of pixels, as proposed, but at what point does the locking of memory and random bus access killing the memory cache actually make it *less* efficient than just running from 0 to 1024 and then from 0 to 768 (or whatever)?

    A lot of applications *don't* thread things that they should. On the desktop, asynchronous DNS is a major culprit in my opinion - I should not be able to hang file manager windows, firewalls, browsers, FTP clients, etc. just because my DNS server has gone down or is momentarily inaccessible. And when I click the god-damn Cancel button, then you should CANCEL the other thread as quickly as possible BUT also let me just get on with whatever else I want to do with this app. However, this has nothing to do with multi-core or operating systems, it's to do with single-threaded apps still being made on systems that have reliably handled multi-thread apps (even on single-core machines) for decades. Ideally, EVERY tab in my browser window should be a different thread. It would mean that a tab with a particularly heavy Javascript or particularly slow flash movie will not slow the operation of the browser itself down. It's quite a simple job but a lot of browsers don't do it - there's a reason for that and it's not because GCC or the operating system doesn't include a "pass this off to another thread" function.
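
    To make the DNS complaint concrete, here is a hedged sketch of pushing a lookup onto a worker thread so that the caller can give up after a timeout instead of hanging. It assumes Python's standard concurrent.futures module; resolve_with_timeout and the resolver parameter are hypothetical names, and a real GUI app would hook the result into its event loop rather than block at all.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# A small shared pool: lookups run here, never on the caller's thread.
_pool = ThreadPoolExecutor(max_workers=4)


def resolve_with_timeout(host, timeout, resolver=socket.gethostbyname):
    """Return the resolved address, or None if the lookup takes too long.

    Errors raised by the resolver itself (for example socket.gaierror)
    propagate to the caller unchanged.
    """
    future = _pool.submit(resolver, host)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        # Best effort only: a lookup that is already running cannot be
        # interrupted, but the caller is free to carry on without it.
        future.cancel()
        return None
```

    The abandoned lookup still occupies a pool thread until it returns, so this buys responsiveness rather than a true cancel, which is about as much as a blocking resolver call allows.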

    The problem is not new, it's not exciting, it's not revolutionary, it's not going to lead to a whole new way to pr

  67. Development tools? by phorm · · Score: 1

    Question: Is this looking at a single app or multiple? It seems fairly straightforward to me that most individual apps aren't going to see a huge boost from a hefty amount of cores, but multiple apps or threads/instances would probably see plenty.

    Servers especially should be able to take advantage of this, where individual cores - just like multiple CPU's before - can handle multiple instances of a server daemon.

    From what I've seen thus far, Unixy OS's handle SMP fairly well. I haven't touched windows webservers in awhile, but I'd imagine they might do well enough in that scenario too.

    Translation: Not a big increase for your game/spreadsheet, but still some extra bang for multitasking. IO is still going to be a big bottleneck though.

  68. 8 core machines + Linux is fantastic by snowtigger · · Score: 1

    I am writing this from my 8 core Intel box running Linux with 8GB of memory. This is the FASTEST computer I've ever had and the first time I've noticed a big leap forward. I normally don't care about cpu speeds, graphics cards, etc. Hardware tends to be fast enough for the current generation of software (I run Linux) and that's usually all you need. But this 8 core thing is different.

    I develop and run very heavy graphics applications, where cpu tends to be the bottleneck. In my world, you used to rely on extra cpu from render farms or clusters to get the job done.

    This world is changing. Shorter kind of jobs that require a quick turnaround, can now be done locally instead of sending jobs to the render farm. This is massive. As people start doing more jobs locally, it also frees up space for the longer running batch jobs, so they get done faster too.

    When I first got the machine, it had Windows installed and it felt just as slow as a regular (single or double cpu/core) box. That should be of no surprise to anyone around here. But Linux sure knows how to use the multi core magic.

    1. Re:8 core machines + Linux is fantastic by Anonymous Coward · · Score: 0

      I am writing this from my 8 core Intel box running Linux with 8GB of memory.

      What a waste. :-/

  69. Why blame the programmers? by miffo.swe · · Score: 1

    This is what I hear: "waa waaa please use more cores so people see a need to get a new 8 way CPU!"

    I don't think this is a problem that programmers should solve. Sure, it's nice if they utilize multiple cores as much as possible, but I don't want cores to be used just for the sake of it. If adding another core to an application gives a 30% gain, I'm pretty sure I could use that power for better things in most cases.

    The problem in my mind is that core speed has hit a brick wall, and tossing more cores at the problem is just a desperate attempt to keep the upgrade treadmill going. Beyond four cores I personally wouldn't see any performance gains other than on rare occasions where I browse, watch movies, encode movies and unzip some large file. I would in those cases also hit the HD and I/O much more than the CPU.

    --
    HTTP/1.1 400
  70. Mod parent up! by davecb · · Score: 1

    --dave

    --
    davecb@spamcop.net
    1. Re:MOD PARENT UP! by drizek · · Score: 2, Insightful

      Do you need to dedicate an entire 3ghz CPU core to run your bittorrent, and another to refresh slashdot?

    2. Re:MOD PARENT UP! by Anonymous Coward · · Score: 0

      And what's your total CPU usage?

      Mine's under 2% for 91 processes.

    3. Re:MOD PARENT UP! by Anonymous Coward · · Score: 0

      Well, slashdot refresh takes a whole core in certain browsers, and certain antivirus programs used to generate an ungodly amount of exceptions out of bittorrent traffic..

  71. Say what ? by Space+cowboy · · Score: 3, Informative

    Apple have no 2 core intel systems. Period.

    Even the lowly Mac mini is a dual-core system. Every laptop is a dual-core system. The Mac Pro is either 4-core (with hyperthreading for a virtual 8-core) or 8-core (with hyperthreading for a virtual 16-core) system.

    "Better to keep silent and look the fool, rather than speak and remove all doubt"

    Simon.

    --
    Physicists get Hadrons!
    1. Re:Say what ? by Space+cowboy · · Score: 2, Informative

      Gaah - the < was swallowed in the statement "Apple have no <2 core intel systems. Period."

      Probably obvious, but to save people nit-picking

      --
      Physicists get Hadrons!
    2. Re:Say what ? by toddestan · · Score: 1

      Well, actually if I was going to nit-pick, I would probably point out that the Apple TV runs a single core Intel processor :)

    3. Re:Say what ? by Anonymous Coward · · Score: 0

      You are correct "Better to keep silent and look the fool, rather than speak and remove all doubt"

        Two 2.26GHz Quad-Core Intel Xeon
        Two 2.66GHz Quad-Core Intel Xeon [Add $1,400.00]
        Two 2.93GHz Quad-Core Intel Xeon [Add $2,600.00]

      You should likely revisit Apple Mac Pro configuration before making such remarks. I suggest you go to the store and configure a mac pro.

    4. Re:Say what ? by Space+cowboy · · Score: 1

      Looks like I wasn't obvious enough - see my earlier comment on the swallowing of < in the post. Earlier than both of these cowardly responses, in fact....

      Simon

      --
      Physicists get Hadrons!
    5. Re:Say what ? by Space+cowboy · · Score: 1

      Fairy nuff :)

      --
      Physicists get Hadrons!
    6. Re:Say what ? by Anonymous Coward · · Score: 0

      "Better to keep silent and look the fool, rather than speak and remove all doubt"

      Simon.

      That guy on American Idol came up with that saying???!!!

    7. Re:Say what ? by Anonymous Coward · · Score: 0

      Say what what?

      http://store.apple.com/us/browse/home/shop_mac/family/macbook_pro?mco=MTE3MDE

      Looks to me like Apple has plenty of 2 core Intel systems. Period.

  72. Article is not right... by Anonymous Coward · · Score: 0

    This article is incorrect on two counts:

              1) Linux scales just fine for SMP. Windows doesn't handle the caches on multicore systems quite optimally but also is fine on multicore systems (other than M$ having to update licensing I suppose, since they usually license for 4 cores max now on "normal" windows versions.) If your applications don't keep cores busy that is NOT the OS's problem.

              2) Amdahl's Law. From wikipedia, "The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program." It is NOT sensible to just say "Oh, most people have 8-cores, let's just turn OpenOffice (for example) into a bunch of threads." It just doesn't make sense. It may be *possible* to do, but at the expense of bugs (due to race conditions) and high synchronization overhead (the thread overhead isn't significant if each thread does a ton of work, as is the case with most current multithreaded apps, but if you're splitting an app into threads just because, overhead could be quite high.) AMD and Intel might hate it, but most people just won't have a use for more than 1 or 2 cores. Even me, and I run TV capture and playback.
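
    For reference, the usual statement of Amdahl's law is speedup = 1 / ((1 - p) + p / n), where p is the fraction of run time that can be parallelised and n is the number of cores. A small sketch in Python (the function name amdahl_speedup is just for illustration) makes it easy to plug in numbers:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the run time is parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)
```

    With half the work parallelisable, amdahl_speedup(0.5, 8) comes out at about 1.78, and no number of cores can push it past 2. This bound ignores synchronization overhead, so real results can only be worse.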

              Presently I have entirely single-core systems (Athlon XP 2200+, a 2100+, a Sempron 2500, a P4-3.0, and a Celeron M-1.5). On my busiest system, if I upgraded to a quad or 8-core, 1 core would go for mythbackend (it's got a BT878 so CPU usage is high), Xorg maybe another core, and mythfrontend maybe a 3rd. But the 2nd and 3rd core would not be so busy, I would probably set the kernel core handling to powersave* and only actually use 1-2 cores.

    *"echo 1 > /sys/devices/system/cpu/sched_mc_power_savings" should turn on the multicore power saving scheduler option, which gets a core near 100% busy before powering up the next core instead of dividing up threads evenly between all cores, saving power on systems that can power down cores individually.

    1. Re:Article is not right... by petermgreen · · Score: 1

      (other than M$ having to update licensing I suppose, since they usually license for 4 cores max now on "normal" windows versions.)
      MS limit windows processor support based on the number of "processor units" not the number of cores.

      1 processor unit for Home, 2 for Professional (Business/Enterprise/Ultimate if you are using Vista), 4 for Server Standard, 8 for Server Enterprise and even more (I don't remember the exact number) for Server Datacenter (note: the names of the server editions vary slightly from release to release but it's usually pretty obvious which maps to which).

      --
      note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
  73. God! The guy doesnt even know Linux != GNU/Linux by miknix · · Score: 2, Funny

    Windows and Linux aren't designed for PCs beyond quad-core chips, and programmers are to blame for that. Developers still write programs for single-core chips and need the tools necessary to break up tasks over multiple cores.

    How many times do we have to tell that Linux *IS* the fscking kernel??

    Given that, including Linux and Windows in the same bag doesn't make sense. Which makes the entire post m00t.

    Solutions:
    1) s/Windows/Windows NT kernel/
    2) s/Linux/GNU\/Linux/

    Nice try to get a battle though.

  74. do it the unix way, use pipes! by Gunstick · · Score: 2, Funny

    Unix has for ages run on multi CPU systems. And it does this well. And with easy tools you can harvest the power of all CPUs: the pipe
    Every part of the pipe can run on another CPU.

    I recently came across fslint, which is a example of heavily piped shell.

    In short (leaving out the parameters and options) it runs
    find | sort | tr | sort | bash | merge_hardlinks| uniq | sort | cut | tr | bash | xargs | sort | uniq | cut | sort | tr | xargs | sort | uniq | cut | sort |tr | xargs | sort | uniq | cut | bash | sort

    That's a lot of CPUs :-)
    OK, it's not a great example for CPU-hungry programs. But the progress of modern programming languages, which tend to be monolithic beasts that do everything (perl, php, java), leads to programs not using pipes or other types of inter-process communication because it's just cumbersome.
    The pipe concept enables multi CPU programming without even thinking about how to put tasks on different processors.
    Unfortunately, I have not found a language which sets such a simple concept as the fundamental programming principle.

    See the unix shell, without the pipe you can't really do much.
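
    To show that the same trick is available outside the shell, here is a minimal sketch that wires three separate processes together with OS pipes, the equivalent of `upper | sort | uniq`. It assumes Python's standard subprocess module; the stage definitions and the run_pipeline helper are made up for illustration, and each stage is a tiny stand-alone filter so the example does not depend on which command-line tools are installed.

```python
import subprocess
import sys

# Each stage is a stand-alone filter: read stdin, write stdout.
UPPERCASE = "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())"
SORT = "import sys\nsys.stdout.writelines(sorted(sys.stdin))"
UNIQ = ("import sys\nprev = None\n"
        "for line in sys.stdin:\n"
        "    if line != prev: sys.stdout.write(line)\n"
        "    prev = line")


def run_pipeline(text, stages):
    """Run the stages as separate processes joined by pipes.

    Needs at least one stage. Input is written in one go, so this is
    only suitable for small inputs.
    """
    procs = []
    for code in stages:
        stdin = procs[-1].stdout if procs else subprocess.PIPE
        proc = subprocess.Popen([sys.executable, "-c", code],
                                stdin=stdin, stdout=subprocess.PIPE)
        if procs:
            procs[-1].stdout.close()   # the new child owns that end now
        procs.append(proc)
    procs[0].stdin.write(text.encode())
    procs[0].stdin.close()
    output = procs[-1].stdout.read().decode()
    for proc in procs:
        proc.wait()
    return output
```

    The kernel is free to schedule each stage on a different core, and nothing in the code says which one goes where.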

    --
    Atari rules... ermm... ruled.
    1. Re:do it the unix way, use pipes! by philipgar · · Score: 1

      unfortunately, the overhead of pipe operations render many chains of operations slow enough that the parallelism gained by using multiple cpu's can be eaten up. Especially when some of the pipe operations perform extremely simple operations on data. These don't tend to scale too well, and the overhead will often eat up all the benefit of multithreading. Additionally pipes have their place, but can fail when sub-channel data is needed, or when multiple pipes are needed (multiple input or output streams, pipes work well for problems with a single input source and a single output destination). Also there are some operations in a long pipe chain that are likely to be the most expensive by far of any of the operations. Often these operations could be parallelized, but the program performing the operations doesn't realize it. This leads to inefficiencies, because if 95% of your time is spent in one operation, Amdahl's law says that you could throw an infinite number of processors at the problem, and only speed up your application by 5%.

      What you should look into if the ideas behind pipes seem logical for multiprogramming is the use of stream-based languages. I know StreamIt was one of the early ones developed. These languages tend to take the pipe paradigm to a whole new level, allowing splits in a pipe, joins, etc. Also, the idea of futures (I think they were introduced in Scheme, but I could be wrong) is something that really needs to be enabled in more programming languages. A future is a subroutine or function that is "called" at one point, but the system knows it doesn't have to be evaluated until any of its outputs are requested by the application. With futures you could trivially launch 10 "parallel" jobs at once, and the system decides how to schedule them. Of course, a big problem with these is that they tend to fail pretty bad in a language like C where a compiler doesn't really know where outputs of functions can be later used.
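
      As a rough illustration of the futures idea, here is a sketch using Python's standard concurrent.futures module. One caveat: these futures are eager (a job starts as soon as a worker is free) rather than deferred until the result is requested, so this shows the "launch 10 jobs and let the system schedule them" half of the idea. The names slow_square and launch_jobs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_square(n):
    """Stand-in for an expensive, independent job."""
    return n * n


def launch_jobs(inputs, max_workers=4):
    """Submit every job up front; block only when a result is read."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(slow_square, n) for n in inputs]
        # .result() waits for that one job if it has not finished yet.
        return [f.result() for f in futures]
```

      The caller never names a thread or a core; it only says which results it needs and when.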

      Phil

    2. Re:do it the unix way, use pipes! by Gunstick · · Score: 1

      true.

      What is needed is a language which is effective and also uses streams as its main object/function/communication call, making it easier to pass data via this scheme. That way the classic malloc, file or whatever data structures exist, which do not easily enable multiprocessing, are used less. Programmers are lazy...

      --
      Atari rules... ermm... ruled.
  75. Silicon Graphics did 1024 cores around 2002 by Anonymous Coward · · Score: 0

    Silicon Graphics did systems with 1024 processors around 2002.
    Running their unix and running Linux.

    compilers to use them at 99% of peak exist,
    programmers exist too,

    The problem is the casual programmers have no clue.

    Now Linux kernel can schedule 8 big programs and each uses a cpu (basic scheduling here). What it brings is you can be encoding a video for your ipod, playing a game, having a video conference, etc all at the same time.

    Now a good programmer would do video encoding in parallel: split a film into 8 segments and encode each in a separate thread. I want a program like this on an 8 core machine.

    Plus I want fast memory and fast disk.

  76. A compromise... by spiffmastercow · · Score: 1

    How about rewriting the standard libraries for many procedural languages (this includes OO languages, since OO is really just a style of procedural programming) to use multi-threading whenever appropriate? For instance, any array sorter should use a multi-threaded heapsort instead of a quicksort if the array is above a certain size. The program flow would still be procedural, and the average programmer would not have to deal with parallel programming very often, and the parallel specialists can handle the libraries where it's needed. Of course this won't work for every circumstance, but it would be a great way to get the most out of the code we already have.

    1. Re:A compromise... by Tiger4 · · Score: 1

      If the processing required can be characterized successfully, numeric, memory operation, shift registers, etc. then a compiler might be able to tag them such that the OS can assign as much "power" for that kind of operation as it has available at any given moment.

      No more idle processing units if there is any work anywhere that can use them.

      --
      Behold, this dreamer cometh. Come now, and let us slay him... and we shall see what will become of his dreams.
  77. n! Re:Some tasks are embarrassingly parallel by Anonymous Coward · · Score: 0

    Most programs barely use any computational power, in fact there are very few programs that require all that computing power to operate and those are certainly well designed.

    Home users do use some apps that could benefit from multiple cores. Video encoding is one of them, but that one is embarrassingly parallel because the encoder could just split the video into quadrants and have each of four cores work on one quadrant.

    Or n-sections even where n = number of available cores. Why break into quadrants if you have 8 or 16 cores available?

    1. Re:n! Re:Some tasks are embarrassingly parallel by dkf · · Score: 1

      Why break into quadrants if you have 8 or 16 cores available?

      Because it's often impolite to assume that nothing else is going on on the machine at the same time? (This actually means that the video encoder needs to know how many cores are assigned to it so that it knows how many ways to partition the data. Let the OS worry about keeping some resources back for other activity...)
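
      A small sketch of that idea, assuming Python and, on Linux, os.sched_getaffinity() to ask how many cores this process is actually allowed to use. The helper names available_cores and split_frames are made up for illustration.

```python
import os


def available_cores():
    """Cores this process may run on, which can be fewer than installed."""
    if hasattr(os, "sched_getaffinity"):      # Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def split_frames(total_frames, parts):
    """Split [0, total_frames) into contiguous, near-equal ranges.

    Never returns more ranges than there are frames.
    """
    parts = max(1, min(parts, total_frames))
    base, extra = divmod(total_frames, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if extra > i else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

      An encoder would then call something like split_frames(frame_count, available_cores()) and hand one range to each worker, leaving it to the OS or the admin to shrink the affinity mask when other work needs the cores.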

      --
      "Little does he know, but there is no 'I' in 'Idiot'!"
  78. DB by Anonymous Coward · · Score: 0

    Ummmm.....Grand Central anyone?

    http://www.apple.com/macosx/snowleopard/

  79. Bad, bad developers... by the_raptor · · Score: 1

    ... how dare you not focus your efforts on the 0.0002% of your users who run your desktop app on their server big iron!

    --

    ========
    CINC, 4th Penguin Legion
    1. Re:Bad, bad developers... by petermgreen · · Score: 1

      Thing is it's not just server big iron that is going up in core count. My brother recently got a moderately priced (about £450 inc. VAT and delivery, so not rock bottom end but not hugely expensive either) Dell Vostro 420 desktop and it came with a quad core processor as standard.

      Right now low end desktops have 1 or 2 cores, midrange desktops have 2 or 4 cores, and high end workstations have 4 or 8 cores. I expect in the next couple of years all those figures will double.

      --
      note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
  80. FF does threads, it's just the GUI by Nicolas+MONNET · · Score: 1

    Doing a multithreaded GUI is hard, and very prone to bugs, and on top of that there are quite a number of assumptions made by JS code that force monothreading.
    But anyway, FF does multithreading; in fact in 3.1+ you can spawn new worker threads from unprivileged JavaScript. However, that code won't be able to touch the DOM, only pass and receive JSON-encoded messages.

  81. One of the odd ones that wants single-core apps? by Anonymous Coward · · Score: 0

    Personally, I run workloads which currently consist of many, many, many minimally threaded applications running on systems with lots of cores. However, if those applications suddenly became heavily threaded, I might have an entirely different expectation placed on me. Instead of giving each virtual machine a single thread, I'd have to give them a minimum of two, and instead of running dual quad-core servers, I'd be looking at quad quad-core servers.

    I guess you could say that I can't have my cake and eat it too -- expect the manufacturers to keep spitting out higher core densities, and still expect my users to require no more computing power than they have currently. Yet it would be nice if, with twice the number of cores, I could run twice as many applications, rather than only being able to run the same number of applications that have been "enhanced" to abuse more of my cores.

    If Intel comes out with 8-core processors with hyperthreading, supports quad-socket motherboards, and actually has a decent memory bus behind that, I'll be happy for a while ;-)

  82. Windows 7 given significant tweaks by NameIsDavid · · Score: 1

    Windows 7 will support 256 cores in the 64-bit version. Microsoft has made significant tweaks to the thread dispatcher code to make this possible. A good discussion can be found here: http://channel9.msdn.com/shows/Going+Deep/Mark-Russinovich-Inside-Windows-7/

    1. Re:Windows 7 given significant tweaks by AHuxley · · Score: 0

      You mean like learn to use some obscure subsystem that will be obsolete by the time the code is ready for release in x months?
      That's why on Mac, Linux or Windows you stick with code that will just work on one core. No problems then.
      If you start looking deep into Linux, Mac or Windows and understanding how to write great code, your customers will just expect so much more each upgrade cycle.
      That's more book time, reading, learning, study and bugs for you, for features that may be 'tuned' at any version bump.
      Then Intel or OS X or Windows or Linux changes something and you're so back to alpha testing on a shipping product.
      Keep your code dumb until *all* the OSes catch up; everybody else is.
      Cute PR write-ups from Windows, Apple or some basement-dwelling code person do nothing for real world long term shipping software.

      --
      Domestic spying is now "Benign Information Gathering"
    2. Re:Windows 7 given significant tweaks by fractoid · · Score: 2, Insightful

      [T]hat's why on Mac, Linux or Windows you stick with code that will just work on one core. No problems then.

      That, and the much greater reason that (a) 99% of software these days would run just fine on a single core P4 3GHz, and (b) most programmers are really, really bad and it's much harder to screw up a single-threaded app badly enough that I can't fix it, than it is to screw up a multi-threaded app.

      --
      Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
  83. Smooth continuous functions by omb · · Score: 1

    I sometimes despair at what I read here. sin(x) and cos(x) are both very smooth, i.e. infinitely differentiable, periodic functions, so why would it surprise you that interpolation off a table, with spacing chosen to give the desired accuracy, is quicker than the Taylor series? And since the function is periodic, the table size is bounded.
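
    For anyone who wants to see it, here is a small sketch of table lookup with linear interpolation, assuming Python. TABLE_SIZE and table_sin are names picked for illustration. It only demonstrates the accuracy side: for linear interpolation of sin the error is bounded by h^2/8 with h the table spacing, about 4.7e-6 for 1024 entries per period. The speed argument applies to compiled code or hardware; in Python the built-in math.sin will still win.

```python
import math

TABLE_SIZE = 1024  # hypothetical; finer spacing gives a smaller interpolation error
STEP = 2 * math.pi / TABLE_SIZE
# One extra entry so the last interval can be interpolated without wrapping around.
TABLE = [math.sin(i * STEP) for i in range(TABLE_SIZE + 1)]


def table_sin(x):
    """Approximate sin(x) by linear interpolation in a one-period table."""
    x = math.fmod(x, 2 * math.pi)  # periodicity keeps the table size bounded
    if x < 0:
        x += 2 * math.pi
    pos = x / STEP
    i = min(int(pos), TABLE_SIZE - 1)  # index of the table entry at or below x
    frac = pos - i                     # fractional position inside the interval
    return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])
```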

    Try that with any _ROUGH_ non-periodic function and see where you get,

    e.g. 1/(1-e^x)

  84. OCCAM by omb · · Score: 1

    It wasn't C. If it was BCPL when David Barron developed the Transputer at Southampton I would be surprised, but the world turns and now we have the AMD HyperTransport and Infini-x

  85. Twenty cores should be the functional limit by chrisG23 · · Score: 1

    Three Cores for the MAC kings under the sky
    Seven for the Windows-lords in their halls of stone
    Nine for Linux users doomed to die virgins
    One for the Dark Lord on his dark throne
    In the Land of MS where the Shadows lie.
    One Core to rule them all, One Core to find them,
    One Core to bring them all and in the darkness bind them (with restrictive licensing)
    In the Land of MS where the Shadows lie.

  86. Units of Work by Hecatonchires · · Score: 1

    Sorry if this is straightforward to the hardcore programmers. I'm just a business programming sort of guy. Lots of lists and mailmerges.

    One of the most common tasks in web programming is:

    • Connect to Database
    • Execute SQL
    • Get recordset
    • loop through recordset to build dropdown list

    Couldn't a whole bunch of these be farmed off onto different processors?
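
    In principle, yes, as long as the dropdowns don't depend on each other. A minimal sketch, assuming Python with the standard sqlite3 module and a throwaway database; the schema, table names and helper names are all made up for illustration. Each worker opens its own connection, runs its query, and builds its own list.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def make_demo_db(path):
    """Create a small throwaway database (hypothetical schema)."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE country (name TEXT)")
    con.execute("CREATE TABLE colour (name TEXT)")
    con.executemany("INSERT INTO country VALUES (?)", [("Chile",), ("Australia",)])
    con.executemany("INSERT INTO colour VALUES (?)", [("red",), ("blue",)])
    con.commit()
    con.close()


def build_dropdown(path, table):
    """Connect, execute SQL, loop through the recordset, return <option> tags."""
    con = sqlite3.connect(path)  # one connection per worker thread
    try:
        # Table names come from a fixed list in this sketch, never from user input.
        rows = con.execute(f"SELECT name FROM {table} ORDER BY name")
        return [f"<option>{name}</option>" for (name,) in rows]
    finally:
        con.close()


def build_all(path, tables):
    """Build every dropdown concurrently; returns {table: [option tags]}."""
    with ThreadPoolExecutor(max_workers=len(tables)) as pool:
        results = pool.map(lambda t: build_dropdown(path, t), tables)
        return dict(zip(tables, results))
```

    Whether this buys anything depends on where the time goes: if the page is waiting on the database, the waits overlap nicely; if the database server is the bottleneck, you have just moved the queue.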

    --

    Yay me!

  87. Sensational title about problems found 20 years ag by Anonymous Coward · · Score: 1, Informative

    So what the author is blathering and foaming about are problems found and solved 20+ years ago. Instead of programmers studying anything, the author should study some. NUMA has been in Linux for close to 10 years. It solves the memory bus problem. Multi-threaded applications solve the problem of using more than 1 core. I do it all the time. Did it yesterday, will likely do it tomorrow. Not every program takes advantage of multiple cores. Quite a few do. Those that scream the need for parallel computing use all of the cores (on my Nehalem system it shows up as 8 cores). I do wish authors would do the tiniest squeak of research before describing how the world is going to end. Oh well.

  88. elephant years by epine · · Score: 2, Insightful

    Knuth's maxim is sufficiently pithy to have become, over time, self referential, as evidenced by your misunderstanding.

    The root of all evil used to be deep and singular, now it is broad and shallow. I guarantee you that Knuth did not include choosing the best fundamental algorithm under the label "premature" unless it involves squabbling over log log N terms or stray digits in the exponent term.

    http://www.siam.org/pdf/news/174.pdf

    An unpacked (deoptimized) version of Knuth's maxim is that the transition from program structure and notation which maximizes readability, comprehension, and conviction (concerning its correctness and merit) to one which favours performance should be delayed as long as possible. Ideally until performance becomes the sole remaining success factor.

    (Taking into account the human mind's special capacity to imprint upon evil, Knuth's formulation remains the better one.)

    Originally Knuth meant manually hoisting loop constant expressions (often in ways that later turn out to not be fully general) or manually evaluating constant expressions or manually fusing nested function calls and the kind of rot that a good compiler these days will do on your behalf. Anyone used the "register" keyword lately? Once upon a time it seemed like a good idea.

    While the principle remains the same, the temptations have changed. Such as parallelizing a bad implementation of a poor algorithm in the misguided belief that the underlying task is not sequentially bound.

    That said, projects which do *no* evil typically fail to impress anyone. The ideal is to wrap a large amount of cleanly structured and accessible source code around a nugget of pure, smoldering evil, coked to the last clock cycle.

    Perversely, the worst example of this is TeX itself. The smoldering nugget of pure evil is the single pass parsing regime and data packing eight bit character values.

    I suspect the literature on parallel programming would roughly equal the literature on electro-chemical storage cells. Sheesh, if only those guys were paying attention, we'd have watch batteries powering small cities by now.

    On second thought, how much literature could there really be if you can summon the majority of it onto your screen in 4/10ths of a second for any combination of keywords?

    Parallel programming is a lot like fuel cells. You get some pretty impressive results on selected applications involving pristine apparatus in a controlled setting, dating back to the Apollo program (in both cases).

    Reality on the ground is rarely so forgiving.

    If we hadn't already achieved a pixel processing speed-up between 1980 and 2008 best approximated by a sideways 8, Javascript wouldn't even have entered the conversation.

    It boils down to this: ignoring everything you guys have already accomplished, you've pretty much done nothing. I worked for that kind of company once. The guy in charge put on a Cirque du Soleil of intestinal recursion. That's how I feel about the claim that software developers haven't been paying attention to parallelism for elephant years.

  89. just run multiple separate tasks(like erlang does) by Jessta · · Score: 1

    It's not really a problem. If you can't split a single task over multiple CPUs, you can just run multiple separate tasks (like Erlang does).

    - burning DVDs
    - playing videos
    - generating sweet fractals
    - web browser (recently becoming more CPU intensive)
    - BitTorrent
    - BOINC
    - Indexing files for searching later
    - Rendering frames
    - Compiling updates
    - Compressing backups
    - While still having enough spare resources to remain responsive.

    --
    ...and that is all I have to say about that.
    http://jessta.id.au
  90. The problem is in the wrong spot. by shaitand · · Score: 1

    Optimizing applications for multiple threads is like unrolling loops. Programmers are writing logic not implementation, compilers should be taking care of implementing logic in a way that is optimal for the hardware.

    Now, get back to me when -fuse_threads is a compiler option and implicit when choosing -O3

    1. Re:The problem is in the wrong spot. by tenco · · Score: 1

      Aren't computer scientists writing logic and programmers the implementation?

    2. Re:The problem is in the wrong spot. by shaitand · · Score: 1

      Nope. A computer scientist is just a trumped up programmer who thinks his chit doesn't stink because he has a degree and spends as much time working on theory as practical code. A programmer who calls himself a 'computer scientist' is probably nothing more than a dreamer while a programmer who calls it like it is will be the doer.

    3. Re:The problem is in the wrong spot. by shaitand · · Score: 1

      To answer your question more directly. All programming is writing logic. Those electrical states are meaningless unless a programmer gives them meaning via logic. Computer logic is expressed in code. That's why you can express code in pseudo code, it's merely a language to express logic.

      The compiler reads your logic in whatever language you've expressed it and turns it into actual commands for hardware; thus, implementation. The compiler should be taking care of utilizing the hardware (in this case multiple cores) in order to execute the logic the programmer has expressed.

  91. lack of industry knowledge? by pjr.cc · · Score: 2, Interesting

    It's really quite frustrating to see posts like this. Posts that don't take into account what is needed and focus on what we are incapable of doing - even when they don't need to.

    So let's look at reality for a second. First, most modern OSes scale very, very far past 4 CPUs (not sure what Windows scales to, but Linux certainly has no limitation based on current CPU reality). So the kernels are just dandy for multi-core CPUs, bring it on! 128 cores, we're ready for ya!

    The same is not true at the application level, and that is a fair comment. But don't confuse Linux and Windows with their apps, for crying out loud! From an application point of view we are capable of parallel coding, but it's non-trivial. It's also not something we need a lot of the time.

    For instance, we now buy servers (our cheapest models) with dual CPUs and quad cores, and we're tending to virtualise them into several machines with 1 or 2 CPUs each. Whether you do this because you assume the OS will utilise one CPU and the apps will utilise another (as one person told me) is irrelevant. Suffice it to say, having 2 CPUs is usually quite nice.

    But what requires more than that in reality? Well, your desktop might - after all, there's a lot of things going on at once, right? In some point cases, that's true (there are quite a number of very heavy applications out there, and surprise surprise, they can multitask *GASP*).

    Same at the server: not many things require that many CPUs, and even at the application level we've gotten good at spreading heavily loaded applications across multiple servers (we call it load balancing, was that too sarcastic?). Take mail (whether it's Exchange or postfix or sendmail or whatever), or web servers, etc. Those server applications that do require heavy grunt tend to already be coded with "parallel" in mind, even across multiple servers (think Oracle RAC).

    As for cache contention - well, it sounds like the hardware makers are finally fessing up to the fact they have a problem, Houston!

  92. make by eviljav · · Score: 1

    make -j bitches
    optimized for as many cores as you want

  93. Reinvent VM/CMS by tepples · · Score: 1

    But how many humans do you know that only run one or two windows on their PCs at the same time?

    Your average XP user has 4 or 5 windows open at the same time.

    At the moment, I have three windows open on this Windows XP machine: Firefox, Command Prompt, and Windows Explorer. All three are waiting for user input, including the Firefox window that I'm typing this post into (between keystrokes). If all the apps you have running in the background don't bring your load average above 1, you aren't likely to benefit from multicore optimization.

    And also, through the API achieve reliability. For example, if one of the running copies of Windows crashes, the Apps with multiple worker threads keep some threads running, and they can detect and recover from that failure

    Congratulations: you've reinvented VM/CMS. It'll work at least until the CP crashes.

    This allows applications to work around issues like blue screens in certain OSes, or kernel panics in others

    Blue screens and kernel panics are often caused by defective device drivers. Would your design run these in a VM?

    so long as [applications] perform sufficient checkpointing

    Third parties paying attention to checkpointing? NBL.

  94. Sometimes I prefer single threaded applications by InfiniteLoopCounter · · Score: 1

    First and foremost, if they are going to go with more CPUs they might as well sort out the problem with the extra heat output.

    On a hot day (say, 35C+ or 95F+), I wince when I try to run a multi-threaded application on my dual core machine (Intel Pentium-D 2.67 GHz); or some background process runs at the same time as my foreground process.

    Why is this, you may be asking? Well, it's because it sounds like old-style CD recording gone wrong. It starts off with a low sounding hum and gradually gets louder and louder at increasing pitch with seemingly no end.

    It's fine and dandy on a normal or cool day, but unbearable on hot days. I just wonder how many other people have to use CPU limiters to play certain games for a few weeks a year. Alternatively, I realise I could have damaged the thermal paste over the processor when installing it (I have 3 standard fans inside a roomy case).

  95. Apple plans for the future by Quila · · Score: 1

    This OS will be the included OS before Apple starts selling 4- and 8-core consumer machines, and it will have been out long enough for developers to use Grand Central to leverage those cores.

  96. Adapt security. by Anonymous Coward · · Score: 0

    "That would be a big change, but the efficiency and total throughput gained would be huge."

    It would be, except in secure environments where one has to guarantee that information can't cross boundaries even at the CPU level.

    1. Re:Adapt security. by Tiger4 · · Score: 1

      Exceptions and lockouts for security, timeliness, reliability, etc. can always be made. The general solution is for general purpose computing AND special case applications.

      --
      Behold, this dreamer cometh. Come now, and let us slay him... and we shall see what will become of his dreams.
  97. No, Sorry not that easy, see the Patterson paper. by omb · · Score: 1

    The real problems with exploiting parallelism are (a) a solution is needed since Moore's law has run into a brick wall, excepting major process improvements that the semi-manufacturers don't see, and (b) all current algorithm design going right back to 1948, John von Neumann, has been essentially serial. Threads and multi-cores are an essentially serial solution to a parallel problem.

    Parallel computation is very hard; see how many kernel (OS) developers we have vs. app developers. This is because of problems related to timing and computational order. This produces problems with data sharing and correctness.

    Then there is the problem space: in some problems, easily artificially constructed, the next step depends on the completion of all earlier steps, and the solution cannot be parallelized, e.g. Fibonacci, where you can show a trivial parallel decomposition that wins nothing. In other, more interesting cases, e.g. the Partial Differential Equations of Mathematical Physics, Routing, and Finite Element systems, some to large amounts of parallelism may be possible, particularly with well-thought-out analysis that capitalizes on special features of the problem or the known solution, e.g. multimode solution and boundary condition matching in Navier-Stokes or Elasticity.

    The point is that this is at the algorithmic level; it is not about code optimization or other programming paradigms, so the Kool-Aid of "we need a better tool chain or parallelizing compilers" is hope, hand-waving and optimism, and of course product support.

    Don't get me wrong, I see this as a very good thing: seas of CPUs (core = 1 CPU) will fully solve lots of problems and improve robustness, but they will not help with many problems. 4-16 cores will be generally handy, but 1024 .. 1048576 will need new algorithms, for the first time in 60 years of computing and 500 of Mathematics.

    A last thought, when we get to ~1073741824 cores we may start to make progress with AI and need to worry about the Singularity.

  98. MOD PARENT UP! by PRMan · · Score: 1

    Seriously, I'll never understand article after article about multiple CPUs being wasted when I have 37 processes in Task Manager.

    --
    Peter predicted that you would "deliberately forget" creation 2000 years ago...
  99. Real Dumb by omb · · Score: 2, Insightful

    As has already been explained, Non-Sequential thinking is hard. You postulate double speed, BUT the producer thread, the app, finished and handed off the buffer to the OS to send to the GPU, and you say it threads this. Well fine, so the threaded part can run on another core, but then hardware DMAs the data and waits for a GPU interrupt/done-queue ack, so how does this speed things up on multicore? Not at all: someone has to set up the DMA and wait, not run, while it completes, so unless all cores are at 100% you have saved nothing, and created additional overhead spawning a new thread

    Duh, Marketing Departments

    1. Re:Real Dumb by 99BottlesOfBeerInMyF · · Score: 1

      As has already been explained, Non-Sequential thinking is hard, you postulate double speed...

      No, I state that the design offers the theoretical potential of double speed, under perfect conditions. Realistically it will always be less and in practice it is often more like 10%-30% for applications it helps at all. But we don't need to postulate anything. I'm describing the feature added over a year ago. It works just fine.

      ...how does this speed things up on multicore.

      For some CPU bound applications, where the CPU's ability to process data to send to the GPU takes a significant amount of the processing, splitting that out into a separate process results in significant benefit. Since this was one common bottleneck profile, it worked well.

      ...so unless all cores are at 100% you have saved nothing

      No, just one core (the one running the application's main process) has to be at 100% and causing the bottleneck. Then, any portion of that thread which is dedicated to feeding the GPU can be split out. Usually the application hits another bottleneck anyway, but it does help. Don't you think it is a bit absurd to be arguing it will cause more overhead when it's been working and shown to provide improvement in benchmarks for quite a while now?

  100. Re:Mythical Machine Month by Tiger4 · · Score: 2, Interesting

    2. I recode for a cluster. Why stop at a multi-core computer? If I can get a 2:1 to 10:1 speed up by writing better code, then why stop at a dual or quad core? The application might require a 100:1 speed up, and that means more computers. If I have a really nasty problem, chances are that 100 cores are required, not just 2 or 8. Multi-core processors are nice, because they reduce cluster size and cost, but a cluster will likely be required.

    I think I agree with you, BUT... don't fall into the old trap: If ten machines can do the job in 1 month, 1 machine can do the job in 10 months. But it doesn't necessarily follow that if one machine can do the job in 10 months, 10 machines can do the job in 1 month.
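
    The usual way to put numbers on that trap is Amdahl's law, which the post above doesn't name: if only a fraction p of the job can be spread over n machines, the speed-up is 1 / ((1 - p) + p / n). A small sketch, assuming Python; the function names and the 90% figure are just for illustration.

```python
def amdahl_speedup(parallel_fraction, machines):
    """Speed-up from Amdahl's law: 1 / ((1 - p) + p / n)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / machines)


def months_needed(one_machine_months, parallel_fraction, machines):
    """Wall-clock months once the parallel part is spread over `machines`."""
    return one_machine_months / amdahl_speedup(parallel_fraction, machines)
```

    With a 10-month job that is 90% parallelizable, months_needed(10, 0.9, 10) comes out at roughly 1.9 months, not 1, and no number of extra machines gets it below the 1 month of serial work.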

    Also, the problem with runtime interpreters is not that they don't generate assembly code. The problem is that it is harder to get at the underlying code that is really executing. That code could be optimized if you could see it. But seeing it is just more difficult.

    --
    Behold, this dreamer cometh. Come now, and let us slay him... and we shall see what will become of his dreams.
  101. Quoted article by DavidApi · · Score: 1

    The quoted paragraph in the Slashdot article: does it appear in the InfoWorld article? I can't see it. The link goes to the article no problems, but where is this quote? Words like "blame" don't even appear!

    Am I missing something? Is the link to InfoWorld incorrect?

    The reason I wanted to read the original article was because the Slashdot teaser (quote) mentions Windows and Linux performance, but not Mac OS X, and I wanted to see if the original article mentioned that or not.

    Help?

  102. Development tools are available.... by jamesswift · · Score: 1

    Problem? The development tools aren't available and research is only starting.

    Nonsense. Here are a couple of portable tools and libraries that will solve many developers' problems.
    http://www.threadingbuildingblocks.org/ (C++)
    http://developers.sun.com/sunstudio/downloads/ssx/tha/tha_getting_started.html
    Research is mature and ongoing.

    Education, however, is only starting to reach the mainstream.

    --
    i wish i could stop
  103. I/O is part of the language proper by Trepidity · · Score: 1

    Insofar as the language proper is defined by the language standards, the I/O libraries are part of C, because they're specified in the ANSI C and C99 reports. Any conforming C implementation must have the standard I/O functions, and they must behave in the way the standard specifies. That differs quite a bit from the situation with networking libraries, which are third-party and not covered by the C standard.

  104. Most Linux Programs by TheSimkin · · Score: 1

    Most Linux programs are small and do a small part. Larger apps usually call down to smaller apps. Doesn't this in itself let the OS balance the work across multiple processors? It seems that for anything very intensive (like compiling or video conversion), the work to make the apps run across many processors is already done. I have used gcc to compile across 6 different machines at one point. Transcode uses all available processors. So does Folding@home! The only thing I really think is missing is a software based GLX engine that can use all available cores. Perhaps a higher end video card isn't needed for basic 3D anymore.

  105. If you are right, we aren't very smart by coryking · · Score: 4, Interesting

    But most computing in the world is done using single-threaded processes which start somewhere and go ahead step by step, without much gain from multiple cores.

    The fact that all we do is sequential tasks on our computer means we are still pretty stupid when it comes to "computing". If you look outside your CPU, you'll see the rest of the computers on this planet are massively parallel and do tons and tons of very complex operations far quicker than the computer running on either one of our desks.

    Most of the computers on the planet are organic ones inside of critters of all shapes and sizes. I don't see those guys running around with some context-switching, mega-fast CPU, do you?** All the critters I see are using parallel computers with each "core" being a rather slow set of neurons.

    Basically, evolution of life on earth seems to suggest that the key to success is going parallel. Perhaps we should take the hint from nature.

    ** unless you count whatever the hell consciousness itself is... "thinking" seems to be single-threaded, but uses a bunch of interrupt hooks triggered by lord knows what running under the hood.

    1. Re:If you are right, we aren't very smart by tftp · · Score: 2, Interesting

      If you look outside your CPU, you'll see the rest of the computers on this planet are massively parallel

      You don't even need to look outside of your computer - it has many microcontrollers, each having a CPU, to do disk I/O, video, audio - even a keyboard has its own microcontroller. This is not far from a mouse being able to think about escape and run at the same time - most mechanical functions in critters are highly automated (a headless chicken is an example.) I don't call it multithreading because these functions are independently operated, just as I don't call a 386 computer dual-core because it has an independent ATA controller or an independent network card. The HDD gets written to and the network data is sent without using the main CPU, but these are independent functions performed by independent hardware. IBM/360 had that already.

      Some people (very few) have an ability to do two dissimilar tasks at the same time. That would be a perfect analogy. But the rest of us, all critters included, are single-threaded, just as you mentioned yourself. Logically thinking, any single thought can't be easily parallelized, but why couldn't we think two thoughts at the same time? I wonder why is that? This question is, IMO, very important because a brain should, technically, be capable of that feat - and nevertheless it doesn't do that! I guess this could be because the brain has (or must have?) only one VM to run our consciousness (our persona) on. Since most thoughts [queries] are executed in the volatile context of the owner's persona [database], it could be that allowing two thoughts at the same time, on two copies of the persona, would result in independent modification of both personas, and how do you merge them back then? And if the brain doesn't copy a full database that defines a person for each trivial thought, then running two or more queries in parallel may give unpredictable results (does a brain have semaphores, mutexes and spinlocks? I doubt that; if you are asked to "hold that thought" it takes a considerable effort to separate and memorize the context, and often we fail).

    2. Re:If you are right, we aren't very smart by coryking · · Score: 2, Interesting

      Logically thinking, any single thought can't be easily parallelized, but why couldn't we think two thoughts at the same time?

      Yes, but there is increasing evidence (don't ask me to cite :-) that many of our thoughts are something that some background process has been "thinking about" long (i.e. seconds or minutes) before our actual conscious self does. There are many examples of this in Malcolm Gladwell's "Blink", though I don't feel much like citing them. Part of that book, I think, basically says that we should really trust the underlying parallel part of our brain and "go with our gut" more often than western society often feels comfortable doing.

      Basically, yeah, our train of though it single-threaded, but that doesn't mean our train of though isn't just a byproduct of lower-level processes that have figured stuff out long before "we" become aware of it.

    3. Re:If you are right, we aren't very smart by coryking · · Score: 2, Funny

      our train of though it single-threaded, but that doesn't mean our train of though isn't just a byproduct

      And sometimes, even, our background grammar checker misses things that our background finger-controller mis-types while on auto pilot. thought/though, thing/think are stroke-patterns that my hand-controller mixes up a lot and since this isn't something super-formal, the top-part of my brain never catches.

    4. Re:If you are right, we aren't very smart by vertinox · · Score: 1

      But the rest of us, all critters included, are single-threaded, just as you mentioned yourself. Logically thinking, any single thought can't be easily parallelized, but why couldn't we think two thoughts at the same time? I wonder why is that

      Um.... I don't know about you, but I see, hear, feel, smell, and taste all at the same time and can form perfectly logical thoughts from the experience.

      Also, human thought is relational and contextual. You may be thinking about one particular topic at a time (personally I find myself doing 2 to 8, but any more than that makes me go home early) but those thoughts often latch onto other things such as your peripheral vision or other thoughts that pop in or out of your head.

      That is how you remember things by seeing other things and how you can drive and talk on the cell phone at the same time (well you aren't supposed to but you can).

      --
      "I am the king of the Romans, and am superior to rules of grammar!"
      -Sigismund, Holy Roman Emperor (1368-1437)
    5. Re:If you are right, we aren't very smart by Grishnakh · · Score: 1

      It's true, there's a lot of parallel computation in most animals: our mammalian eyes, for instance, do a lot of signal processing in the neurons of the retina, before that data is transmitted to the brain's visual cortex. Our spinal cords also do a lot of processing, which is why most humans will drop a hot potato before the pain signals from their hand reach their brains.

      However, computers and electronics are a little different, because of economics. It would be a lot more efficient, for instance, to play music on our PCs using dedicated MP3 or Ogg decoding chips, because they could decode those streams with far less power than a dual-core Intel CPU requires to do the same thing in software. However, CPUs are cheap, power is cheap, and software is much easier to write than hardware is to design. Additionally, when there's a bug in software, it's easy to distribute patches and updates. If there's a bug in a chip, it's no small task to replace it for all the past customers.

  106. 64-way parallelism for Blu-ray authoring by benwaggoner · · Score: 1

    For HD DVD and Blu-ray authoring, the CineVision PSE system we designed for VC-1 used a hybrid spatial/temporal model.

    First, the codec itself was 4-way threaded, encoding each 1920x1080 frame as four slices. Then the file was distributed across multiple blades, each processing a section of the video. Since this was for disc-authoring, we knew where chapters were going to be in advance, and so split by chapter; ideally you'd have at least 2x as many chapters as workers.
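
    To make the "2x as many chapters as workers" point concrete, here is a rough sketch of handing chapters to workers. This is not the actual CineVision scheduler; the function name, the frame counts and the greedy longest-first rule are all assumptions for illustration. With only one chapter per worker the longest chapter sets the finish time; with spare chapters the short ones fill the gaps.

```python
import heapq

def assign_chapters(chapter_frames, num_workers):
    """Greedy longest-first assignment of chapters to encode workers.

    chapter_frames: frame count per chapter (hypothetical input).
    Returns one (total_frames, worker_id, [chapter indices]) tuple per worker.
    """
    # Min-heap keyed by the load each worker has been given so far.
    heap = [(0, w, []) for w in range(num_workers)]
    heapq.heapify(heap)
    # Hand out the longest chapters first so short ones can fill the gaps.
    order = sorted(range(len(chapter_frames)), key=lambda i: -chapter_frames[i])
    for idx in order:
        load, worker, chapters = heapq.heappop(heap)
        chapters.append(idx)
        heapq.heappush(heap, (load + chapter_frames[idx], worker, chapters))
    return sorted(heap, key=lambda entry: entry[1])
```

    Calling assign_chapters with eight chapter lengths and 4 workers gives each blade a couple of chapters with roughly even totals, which is the whole point of splitting finer than the worker count.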

    The key to avoiding the "chunk transitions" was aligning along chapters, since they almost always start at a scene change or a black frame, so it'd be easy to see the problem. Also, there is extensive 3rd pass support to manually tweak a transition that could go wrong. There was a fair amount of workflow that had to get baked in to get full advantage of the parallelization, like prepopulating each worker with the source during the 1st pass and keeping it cached for the 2nd and potentially 3rd passes.

    Anyway, it works nicely; that product was used for 90% or so of HD DVD titles and about a third of Blu-ray titles so far. Last I heard, the record for a 2 hour movie encode was about 6 hours for 2 passes. I'm sure it'd be faster yet with more recent processors. That scaled up to 64-128 cores pretty well, given source chapters. With overlapping scene detection in the first pass, it could be scalable well beyond that for long-form content. Of course, with short content you're not so worried about end-to-end encoding time, but full throughput.

    As suggested earlier, live streaming is the hard stuff, since you can't do significant temporal slicing without adding a whole lot of latency.

    We have a similar kind of issue with Smooth Streaming for Silverlight, where we encode the same source in multiple bitrates, and need to make sure GOPs are aligned across all the data rates for seamless switching. For an example of that:

    http://on10.net/blogs/benwagg/Behind-the-Scenes-at-SmoothHDcom-Encoding-Big-Buck-Bunny/
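
    For anyone wondering why GOP alignment matters, a toy sketch (not how the Smooth Streaming encoder actually works; the helper names and the fixed GOP lengths are made up for illustration): a player can only switch bitrates cleanly at frames where every rendition starts a GOP.

```python
def keyframe_positions(total_frames, gop_length):
    """Frame indices where a fixed-length GOP starts (hypothetical helper)."""
    return list(range(0, total_frames, gop_length))

def switch_points(renditions):
    """Frames at which every rendition starts a GOP, so a switch is seamless.

    renditions: dict mapping a bitrate label to its list of keyframe indices.
    """
    common = None
    for frames in renditions.values():
        common = set(frames) if common is None else common & set(frames)
    return sorted(common or [])
```

    If one rendition uses a different GOP length from the others, the set of usable switch points shrinks to the frames they happen to share.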

  107. My Summary of the Article by IronClad · · Score: 1

    Since I see little evidence that timothy or Mr. Chapman read the article, I'll do them a favor so they don't have to click:

    < article paycheck="undeserved" >

    Hi I'm Agam Shah and I'm writing an article about multicore processors, but these concepts are so new to me that I'm putting quotes around "race conditions" like it's frickin' sharks with lasers.

    So then I did a Google search on "parallel programming tools" and it helped me get another paragraph out of the way.

    Oh, and I quote some lamer analyst who has never heard of NUMA or libhoard, so I'll try to fabricate some crisis that the problems they address might never be solved.

    Parallel programming is hard, WAH! WAH!

    Oh, except when it's not, as in that trivial application named Photoshop. I'll write one of those next weekend.

  108. That's funny I've been using 8 cores on Linux.... by Anonymous Coward · · Score: 0

    for the last 2.5 years. I frequently run multithreaded applications across all 8.

  109. Meh by scientus · · Score: 1

    Some languages, like Erlang, have existed as a bunch of threads for years. And event-based designs almost completely solve this problem. Some things like xlib and glib still run as a big ugly loop but there are alternatives like xcb, that at least one desktop manager uses (awesome wm).

    The two things that currently peg a single core for me are Firefox's unified JavaScript loop (this changes a bit in 3.1), and ffmpeg for high def video (multi-threaded is in the works). The fact that people use a lot of different programs at once, and that most programs are not very demanding, also makes this not that big of a problem. Few applications need single-thread programming (all I can think of is compressors like 7zip, video, etc. in their top-quality modes, and certain resource allocators); most things would never hit that single-processor head if they were written decently. I think it's just a legacy application problem.

  110. Don't blame the programmers... by Douglas+Goodall · · Score: 1
    I have been watching this for years now. Intel came out with hyper-threading. HT required chipset and BIOS support to work, and although I paid top dollar for a high end Toshiba notebook, the processor in which should have supported hyper-threading, Toshiba dropped the ball and it didn't work on my machine. The same thing happened to me with an expensive HP desktop machine. A while later, Intel and Microsoft announced HT wasn't the silver bullet and multi-core would be. Manufacturers rushed to build machines around dual-core processors and Intel/AMD competed to provide dual core processors for desktop and server use. The machines hit the street, and for the most part, Microsoft Office and a small handful of other applications actually were optimized to make use of the cores.

    Programmers like myself were waiting for a clear direction in terms of language and compiler support for multi-core development, and of course multi-core debugging is a challenge.

    Now we have quad processors from multiple vendors and there are plenty of choices for hardware, but there is still not a clear winner when it comes to development tools and methodology. Intel has a threaded toolbox, and beyond that we can roll our own. The only support I have seen that made me smile was the multi-core support in Python, which only exists in the more recent versions, and those versions are not ubiquitous yet.

    It is really easy for Intel to unilaterally make a decision to stop processor development at 3GHz and put it on the programmers to reorganize their code in a parallel manner. It is something else again for each software engineer to choose how to do this and commit their clients to those decisions, and the fall out that will last the lifetime of this code. Companies that paid to migrate their applications to hyper threading only got to benefit for a year or two before the environment went away. I am frightened to make a decision today about multi-core that depends on Intel (and AMD) to keep multi-core stable far enough into the future to make development worthwhile.

    It is fairly obvious at this point that multi-core is here to stay, but it will be nothing more than a way to sell more expensive hardware until the powers that be provide a cohesive set of tools and methodologies that make multi-core usable to address our current problems. A friend of mine told me of his experiments configuring a multicore Windows box for gaming using process affinities. He indicated that the Windows operating system used about 1.5 cores itself, which in the case of a dual core machine left about a half core for the game. My experience has shown that we have little control over the way tasks are assigned to specific cores, and multi-core seems to do more for the operating system and environment than the threads of a specific application. After years of effort addressing this problem, it is still not clear to me which tools and methods will be the most stable over time. It looks to me that there has been very little progress on the software side in the last two years.
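
    For reference, this is roughly what the affinity knob looks like from a program. It is only a sketch and it assumes Linux, where Python exposes os.sched_getaffinity and os.sched_setaffinity; the helper names are made up. On Windows the equivalent Win32 call is SetProcessAffinityMask, which is not shown here.

```python
import os

def pin_to_core(core):
    """Restrict the calling process to one core; returns the previous set.

    Uses os.sched_setaffinity, which Python only provides on some platforms
    (Linux among them). Returns None where it is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    previous = os.sched_getaffinity(0)  # 0 means "this process"
    if core not in previous:
        raise ValueError(f"core {core} is not in the allowed set {sorted(previous)}")
    os.sched_setaffinity(0, {core})
    return previous

def restore_affinity(previous):
    """Undo pin_to_core using the set it returned."""
    if previous is not None:
        os.sched_setaffinity(0, previous)
```

    Note that this only limits where the scheduler may run the process. It does not reserve the core, so the OS and everything else can still land on it.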

  111. Quantum computing .... by Jerry · · Score: 1

    Assume we develop affordable 32 bit quantum computers. How does that change this parallelism problem?

    --

    Running with Linux for over 20 years!

  112. So Narrow Sighted! by ryanw · · Score: 1

    Ok, so it points out a flaw with Windows 7 and Linux but completely fails to give the praise to the efforts that Apple is doing with Mac OSX and Snow Leopard!!! OSX is incorporating incredible efforts to leverage GPU and Multi-core solutions for developers. Ignoring these pieces is incredibly ignorant of the "personal computer" and "distributed computing" markets.

    http://www.apple.com/macosx/snowleopard/

    1. Re:So Narrow Sighted! by Ash-Fox · · Score: 1

      Ok, so it points out a flaw with Windows 7 and Linux but completely fails to give the praise to the efforts that Apple is doing with Mac OSX and Snow Leopard!!! OSX is incorporating incredible efforts to leverage GPU and Multi-core solutions for developers.

      With things like a broken OpenGL implementation (code that works fine on Linux, XP, Vista, Seven, Solaris, BSDs - it doesn't work on OS X and in quite a few cases, will lock up the system), buggy graphical drivers (contributes to many OpenGL issues), old generation graphics hardware... And you want to claim OS X is ahead in the GPU? ...What?

      And then you have the audacity to even mention the multi-core solutions in OS X, when they can't even get simple things like signalling right, never mind the occasional screwed up behaviour of POSIX threads.

      I honestly don't believe for a second you have done any complex development on applications in OS X.

      --
      Change is certain; progress is not obligatory.
  113. "Migrate their applications to hyperthreading?" by XanC · · Score: 1

    From an application standpoint, how is hyperthreading any different from multi-core?

    1. Re:"Migrate their applications to hyperthreading?" by Douglas+Goodall · · Score: 1

      hyper-threading is subject to certain security problems that are not present in multiprocessing. Security researchers have noted that it is possible to peek at encryption keys in memory within the HT environment. See http://www.daemonology.net/hyperthreading-considered-harmful/

  114. For what sort of work load are they talking about? by rnturn · · Score: 1

    I would agree that your typical email client or word processor is not going to benefit much from multiple cores. Most business applications running on Windows aren't likely to need even more than two cores to get their work done. (I suppose one would be using the other two to run the anti-virus software to keep that OS reasonably healthy, eh?) But OSes like Linux tend to have users that are doing more of a variety of tasks simultaneously. They'll have an email client frequently checking for new mail, an audio player running, a window where they're downloading patches or new source code onto their system, an editor window or two open, windows to other systems on the local network, a browser with multiple tabs being updated frequently, and who knows what else. Can you run all the same applications simultaneously on Windows? Maybe, though without multiple desktops it's unlikely. Alt-Tabbing through a list of multiple programs makes switching from one program to another incredibly clumsy, so most people I know avoid running more than 2-3 applications at once, making more than dual-core chips mostly overkill. If the extra cores are going to be useful at all in a business environment, I suspect it'll be to run a slew of additional tools used to enslave^Wmanage the desktops centrally. Servers may be a different story, but I believe the extra cores would be used more advantageously by Linux since the servers running it are, more often than not, tasked with running more than a single application at a time; something which Windows servers are still not asked to do in most situations.

    --
    CUR ALLOC 20195.....5804M
  115. Works for me. by MikeFM · · Score: 1

    I have a dual quad-core Xeon server and it keeps all cores busy and is definitely faster than a single or dual-core system. Nothing fancy going on. I run several virtual machines and each of them runs normal software such as web servers.

    --
    At what price learning? At what cost wisdom? The price is a man's peace of mind, and the cost is his life.
  116. Not an OS problem by Casandro · · Score: 1

    It's more a language problem. C(++) was never meant to run on systems with several processors. The programs are meant to be executed in a single thread of execution. If you actually want to use multiple processors it's quite hard to do.

    Object oriented programming might solve some of the problems.

  117. Message Queue by Anonymous Coward · · Score: 0

    At least with Windows it is probably due to the fact that each window is connected to a message queue and message queues are thread owned resources.

    That means that window messages are always executed by the thread that created the window. If a synchronous message is sent from one thread to another, both threads rendezvous until the message has been executed.

    If anything a multi-process design forces the programmer to use asynchronous messages.
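
    A small model of that difference, in Python rather than Win32 (the class and method names are invented for illustration; the real calls are SendMessage, which blocks until the message has been processed, and PostMessage, which returns immediately):

```python
import queue
import threading

class WindowThread(threading.Thread):
    """A thread that owns a message queue, loosely modelled on a UI thread."""

    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()
        self.handled = []

    def run(self):
        while True:
            message, done = self.inbox.get()
            if message is None:           # shutdown marker
                break
            self.handled.append(message)  # the handler runs on the owner thread
            if done is not None:
                done.set()                # wake a sender that is waiting

    def post(self, message):
        """Asynchronous: queue the message and return immediately."""
        self.inbox.put((message, None))

    def send(self, message, timeout=None):
        """Synchronous: block the caller until the owner has handled it."""
        done = threading.Event()
        self.inbox.put((message, done))
        return done.wait(timeout)

    def stop(self):
        self.inbox.put((None, None))
        self.join()
```

    With send() the caller sits idle until the owner thread gets around to the message, so two cores end up doing one core's worth of work. With post() both keep running.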

    Martin

  118. Java is not a multithreading language. by krischik · · Score: 1

    Java does not support multithreading - java.lang.Thread, a library class, does. Have a look here:

    http://en.wikibooks.org/wiki/Ada_Programming/Tasking

    See, not a library but language keywords. Note that I am fluent in Java and Ada and have designed and implemented larger projects in both.

  119. Language vs Library by krischik · · Score: 1

    But there is a difference in the "ease of use" between a language feature and a library feature. That is unless you use a language like smalltalk where everything is library.

    Think of how error prone printf is. If the parameters and the little % stuff do not match, havoc ensues.

    And more so in multithreading. Here the bugs are often sporadic and extremely difficult to find. And a programming language which supports it natively is a great help. See:

    http://en.wikibooks.org/wiki/Ada_Programming/Tasking

    So you know what I mean by "natively".

  120. C++0x by krischik · · Score: 1

    Great - there is still only one compiler to support "export" - almost 10 years after the standard was defined - and you speak about the next standard. So when will we see the first compiler to support your new library?

    Martin

    PS: I know a language where all generics are "export" and all compilers support it - since 1983. So it is possible to implement.

    1. Re:C++0x by johannesg · · Score: 1

      Great - there is still only one compiler to support "export" - almost 10 years after the standard was defined - and you speak about the next standard. So when will we see the first compiler to support your new library?

      Martin

      PS: I know a language where all generics are "export" and all compilers support it - since 1983. So it is possible to implement.

      I know it is not in the nerd mindset, but yes, you can move forward with a new version before completely finishing the previous version of something. Out in the real world it is in fact quite common.

      This is doubly true in this case, since the reality is that export was a badly conceived feature to begin with, that might very well be removed from the next standard because it is so difficult to implement. So the next standard does in fact fix that little problem as well.

      Your implicit assertion that C++ compilers are immature because of this missing feature is, of course, completely laughable. C++ has had very solid compiler support for a very long time now.

    2. Re:C++0x by krischik · · Score: 1

      because it is so difficult to implement.

      ... in C++. As I said, Ada has had the equivalent of exported generics since 1983. All compilers support it, as the standard demands separate compilation.

      MS C++ is also missing "Exception Specifications" and "two-phase name lookup" and a couple of other things.

      So maybe C++ has become so complex that it can't be extended properly any more.

      Martin

    3. Re:C++0x by johannesg · · Score: 1

      MS C++ is also missing "Exception Specifications" and "two-phase name lookup" and a couple of other things.

      Ah, *Microsoft* has done an incomplete implementation, so that means the standard is obviously flawed!

      Gee. That's a lot of flawed standards out there...

      So maybe C++ has become so complex that it can't be extended properly any more.

      That's rich, coming from an ADA guy...

    4. Re:C++0x by krischik · · Score: 1

      So maybe C++ has become so complex that it can't be extended properly any more.

      That's rich, coming from an ADA guy...

      Where shall I start to answer that sentence? A single answer won't help here:

      Answer 1: Actually I am a multi-language guy - I programmed 10 years in C++, 5 years in C and a lot of other stuff.

      Answer 2: The 2003 ISO standard for C++ is 757 pages and the 2007 ISO standard for Ada is 786 pages - that would be 3.8% larger. But the Ada standard contains not only multithreading but real-time multithreading. There is a trend that programming languages might start lightweight but seldom end that way. The first K&R C might have been half the size of Ada 83. But with 537 pages C99 was just as fat as Ada 95 (582 pages). Only: Ada 95 is object-oriented and multithreaded and C isn't. If you put the size of the standard in relation to the features you get, then Ada is the lightest of the three.

      Answer 3: Last but not least: Ada - like its predecessor Pascal - is named after a historic person. Yep, not everything with 3 letters is an acronym.

  121. It SEEMS but it is NOT by Anonymous Coward · · Score: 0

    The study data to date suggests "thinking" is an illusory, emergent property of all those parallel brain/body processes. There is no sequential state machine in there, but just the fiction of one in the introspective processes of the mind. In fact, our perceptual processes have been shown to reorder events and stimuli quite frequently; it is almost a necessity in that our complex parallel cognitive processes mask their own latencies so we can comprehend and act within a somewhat real-time world.

    It is a burden of this illusion that we conceive of computation being serial and needing parallelization, when the physical reality is parallel and requires serialization only to better fit our limited comprehension.

  122. Multicore ? I don't need a stinking multicore by dvhh · · Score: 1

    Sure, devs are able to cram more crap into spare CPU cycles. But looking at the trends now, single core is mostly the way to go with light OSes and optimized software (netbooks) for the end user. As for the more niche markets that find a use for high performance computing (Slashdot readers, scientists), I'm pretty sure that multiprocessor is a better way to go than multicore (which is a way to do cheap multiprocessor design anyway). Anyway, OS HALs have been ready for SMP for a long time, but multicore broke the SMP abstraction because of the resources shared on the die between the cores. Anyway, again, the processor vendors create the supply by putting out faster processors, and the software vendors create the demand by creating bloated software.

  123. This will all be water under the bridge... by PhysicsGeek42 · · Score: 1

    ...in a few years when this replaces electronics as the standard method of switching for computation. There are working 30GHz photonic processors, and it won't be uncommon to see 10 times that in a CPU.

  124. OK, I'll chime in by OneSmartFellow · · Score: 1

    Most experienced developers (that use lower level languages like C/C++), do indeed know how to write multi-threaded applications, DESPITE the poor support for doing so by the compiler. This is usually done through threading libraries, rather than native language features supported by the language preprocessor and compiler and linker.

    This is not actually the problem. The problem is that most applications simply don't need to be multi-threaded, and in fact adding threading frequently introduces more problems than it solves. Most multi-threaded applications would actually perform better as multi-instance applications, where each instance runs on a separate core, in virtual isolation.

    It seems to me that multi-core systems facilitate this paradigm with almost no effort required by the developer. As it should be.
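
    A minimal sketch of the multi-instance idea, assuming the work can be split into independent slices (the worker program and the function name are hypothetical): the "application" stays single-threaded, and you simply start several copies and let the OS scheduler spread them over the cores.

```python
import subprocess
import sys

# Hypothetical single-threaded "application": sums the squares in one slice.
WORKER_SOURCE = """
import sys
start, stop = int(sys.argv[1]), int(sys.argv[2])
print(sum(n * n for n in range(start, stop)))
"""

def run_instances(limit, instances):
    """Start `instances` independent processes and add up their results."""
    step = -(-limit // instances)  # ceiling division
    procs = []
    for i in range(instances):
        start, stop = i * step, min((i + 1) * step, limit)
        procs.append(subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE, str(start), str(stop)],
            stdout=subprocess.PIPE, text=True))
    # Each instance has its own address space; nothing is shared between them.
    total = 0
    for proc in procs:
        out, _ = proc.communicate()
        total += int(out)
    return total
```

    No locks, no shared state, and the only coordination is collecting the output at the end.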

  125. Compilers are not magic by Anonymous Coward · · Score: 0
    The compiler could do this ahead of time. It could spend the time analyzing the program, and write the results out somehow. Assign each chunk a metric of some kind.

    Lots of engineers have said "it's a compiler problem" and failed. Cydra, Tera, and most recently Itanium have all bet on compiler breakthroughs and lost. The problem is that most programs are not automatically parallelizable; expecting a compiler to rewrite a program so that it is more parallel is tantamount to expecting the compiler to be as smart as a human programmer. It may happen some day, but don't hold your breath waiting for it.

    1. Re:Compilers are not magic by X0563511 · · Score: 1

      When I said "by the compiler" perhaps a better explanation of what I'm thinking of is "by the compiler tools".

      This would be less compilation and more analysis. The compiler has already gone and done its job, although tweaks to the compiler (some nonintuitive... not using inline functions) would help. This would be adding metadata to the binary.

      --
      For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
  126. XMOS technology supports parallel processing by Anonymous Coward · · Score: 1, Informative

    XMOS have been experimenting in this area already. Their language which is an extension to C supports code for parallel processing on multi core chips. See http://www.xmos.com/

  127. Hyperthreading by Anonymous Coward · · Score: 0
    What I'd like to see is, rather, that the CPU can implement a kind of "micro-thread" function, that would allow two larger codepaths simultaneously

    That sounds a lot like hyperthreading: a CPU with multiple sets of registers. The idea was that when a thread of execution hit a latency stall, the CPU could still do useful work by switching to another thread. Since the CPU had several sets of registers, the switching was fast. Some models of the Pentium 4 had this capability.

  128. Parent does not understand process profiles by fnj · · Score: 1

    Core utilization has nothing to do with how many threads and processes you have. It has to do with how many threads/processes you have which are active and compute bound from moment to moment. I have 137 processes (ps aux|wc) running in linux right now, and in toto they are consuming 0.8% of two cores (top).

    100 tabs in Firefox should take as much CPU altogether as the one tab you are viewing. That this is not completely so, because some of the background ones are animating CPU-sapping Adobe Flash that no one can see, is a design problem. Even so, I often have more than 100 tabs open with little effect on overall system performance other than Firefox's (and other browsers') absurdly gigantic memory usage.

    How many programs do you run at once which are actually doing serious computing other than the one you are interacting with? Sure, there are times you are doing database jobs and such, but it isn't much for the typical desktop user.

  129. DON'T DO IT! by fnj · · Score: 1

    Not until you've read the replies that have a clue.

  130. Are we talking now about technology or marketing? by Anonymous Coward · · Score: 2, Informative

    If we are talking about technology... The Linux operating system (the monolithic kernel is the operating system) works great on CPUs that have more than 4 cores. If the article writer did not know, the Linux OS powers almost all supercomputers etc. The problem is that applications ain't developed to use so many threads etc. The OS works just fine, but if the applications cannot use multiple threads, you do not gain anything unless you run multiple instances of them.

    If we are talking about marketing lies and misinformation, the "operating system" (actually a _software system_) does not work at all, because usually this "operating system" cannot use multicore CPUs well. Who should we blame?

    Seriously, Linux just works on multicore CPUs, but that is just an operating system. The software systems like Ubuntu, Fedora and Mandriva just ain't working so well.

  131. Re:Are we talking now about technology or marketin by Ash-Fox · · Score: 1

    the Linux OS powers almost all supercomputers etc.

    I only know of one cray model that had Linux actually... Can I get some kind of citation from a trustworthy source on this, as I can't find it on Google?

    --
    Change is certain; progress is not obligatory.
  132. C or C++ and standard compliance by krischik · · Score: 1

    Will that be implemented by the same vendors which implemented "export" in the last 10 years?

    I believe it when MSC++ and G++ have a fully working implementation.

    Martin

  133. Do I need 8-core support at the moment? by Shamenaught · · Score: 1

    Do I have an 8-core machine? No. Will I have one at some point in the future? Probably. Will I be happy if support has improved by then? Yes.

    Seriously, until I have an 8-core machine I'd probably prefer other improvements (stability, for example) arrived before more efficient 8-core support. Also, given the problems with trying to program for too many cores, is it possibly fair to say that Intel are pushing the tech before the software is ready, or possibly even the wrong tech?

    I saw talk earlier in the comments of instruction sets with inherent support of multiple cores. Wouldn't it be better to get something like that out, presumably some form of SIMD-like additions, before pushing the processors with >4 cores?

    --
    mysql> SELECT * FROM `places` WHERE `place` LIKE 'home`; Empty set (0.00 sec)
  134. Most applications aren't house-complex by Anonymous Coward · · Score: 0

    They're more plumbing-complex.

    So you have to wait for the main pipes to be laid before you can start putting the utilities on the end (bath/bog/basin...) and to some extent they can be parallelised. But you're still going to have to wait until the basic pipework is ready. And that will run at the slow 1CPU scale. The bath/bog/basin in one room cannot all be installed if there's not the room for all the workers and the space they need to work. And you cannot really do them in another room, so you can't scale.

    SMP? Drop it. AMP, Asymmetric MultiProcessor, is more worthy. Big CPU for most tasks, a few smaller CPUs to take on threads (or used instead of the big CPU if you're not running much), and a few dedicated processors (swapping versatility for power and simplicity).

  135. Parallel programming in Erlang is still _hard_ by Pinky's+Brain · · Score: 1

    Pure functional programming with only implicit parallelism (no message passing) might be relatively straightforward and it's true that parallelism is easier to extract automatically than with procedural languages ... but this only allows for a subset of parallel algorithms.

    Transactional memory already allows for a little more.

    With message passing (ie. Erlang) you finally have the full deal. Removing aliasing from the equation removes a lot of very nasty problems, but some remain. Deadlock, starvation, livelock (with priorities making all those problems more likely to occur too). In fact Erlang is really too lax a language to be automatically checked for those problems (mostly because of the use of asynchronous message passing).
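
    To show that message passing without shared memory can still deadlock, a tiny sketch in Python (not Erlang; the function names and the timeout are invented, and the timeout exists only so the demo terminates): two workers that each wait to receive before they send will wait forever.

```python
import queue
import threading

def waiter(inbox, peer_inbox, name, log, timeout):
    """Wait for a message before sending one; two of these deadlock."""
    try:
        inbox.get(timeout=timeout)             # receive first...
        peer_inbox.put(f"reply from {name}")   # ...then send
        log.append((name, "ok"))
    except queue.Empty:
        # Without the timeout this thread would simply block forever.
        log.append((name, "stuck"))

def run_pair(timeout=0.2):
    """Run two waiters wired to each other and report how each one ended."""
    a_inbox, b_inbox, log = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=waiter, args=(a_inbox, b_inbox, "a", log, timeout)),
        threading.Thread(target=waiter, args=(b_inbox, a_inbox, "b", log, timeout)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(log)
```

    There is no aliasing and no lock anywhere in that code, and both sides still end up stuck.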

    Modern Occam is better in that regard, although I wouldn't say that makes parallel programming easy either ... it's just as good as it gets.

  136. State of the art by dna_(c)(tm)(r) · · Score: 1

    That's an application of Parkinson's law

  137. The problem is human nature & choice by gooneybird · · Score: 3, Insightful

    "The problem my dear programmer, as you so eloquently put, is one of choice.."

    Seriously. I have been involved with software development from 8-bit PICs to clusters spanning WANs and everything in between for the past 20 years or so.

    Multiprocessing involves coordination between the processes. It doesn't matter (too much) whether it's separate cores or separate silicon. On any given modern OS there are plenty of examples of multiprocessor execution: Hard drives each have a processor, video cards each have a processor, USB controllers have a processor. All of these work because there is a well-defined API between them and the OS - a.k.a device drivers. People that write good device drivers (and kernel code) understand how an OS works. This is not generally true of the broader developer population.

    Developers keep blaming the CPU manufacturers, saying it's their fault. It's not. What prevents parallel processing from becoming mainstream is the lack of a standard inter-process communications mechanism (at the language level) that abstracts a lot of the dirty little details that are needed. Once the mechanism is in place, then people will start using it. I am not referring to semaphores and mutexes. These are synchronization mechanisms, NOT (directly) communication mechanisms... I am not talking about queues either - too much leeway on their use. Sockets would be closer, but most people think of sockets for "network" applications. They should be thinking of them as "distributed applications". As in distributed across cores. As an example, Microsoft just recently started to demonstrate that they "get it", because the next release of VS will have a messaging library.

    choice:

    At this time there are too many different ways to implement multi-threaded/multi-processor aware software. Each implementation has possible bugs - race conditions, lockups, priority inversion, etc. The choices need to be narrowed

    Having a standard (language & OS) API is the key to providing a framework for developers to use, yet still allowing them the freedom to customize for specific needs. So the OS needs an interface for setting CPU/core preferences and the language needs to provide the API. Once there is an API, developers can "wrap their minds" around the concept and then things will "take off". As I stated previously, I prefer the "message box" mechanisms simply because they port easily, are easy to understand and provide for a very loosely coupled interaction. All good tenets of a multi-threaded/multi-processor implementation.
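
    A rough sketch of what I mean by a message box, using Python's standard queue module as a stand-in (the function names are made up, and in standard CPython these threads will not run bytecode in parallel, so this shows the shape of the design rather than a speedup): workers share no state and only talk through their inbox and outbox.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to exit

def squarer(inbox, outbox):
    """Worker with no shared state: read the inbox, write replies."""
    while True:
        item = inbox.get()
        if item is STOP:
            break
        outbox.put(item * item)

def sum_of_squares(numbers, workers=4):
    """Fan numbers out to the workers and add up the replies."""
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=squarer, args=(inbox, outbox))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for n in numbers:
        inbox.put(n)
    for _ in threads:
        inbox.put(STOP)  # one stop message per worker
    for t in threads:
        t.join()
    total = 0
    while not outbox.empty():
        total += outbox.get()
    return total
```

    Swap the in-process queues for sockets or pipes and the worker code barely changes, which is the loose coupling I am after.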

    Danger Will Robinson:

    One thing that I fear is that once the concept catches on, it will be overused or abused. People will start writing threads and processes that don't do enough work to justify the overhead. Everyone who starts writing programs will "advertise" that it's "multi-threaded", as if this somehow automatically indicates quality and/or "better" software...Not.

  138. Anonymous Coward by Anonymous Coward · · Score: 0

    I noticed in the latter versions of Java 6 - it takes care of multiple cores automatically - when executing a loop or something else intensive both cores are loaded almost equally. Well I have only 2 cores so I don't know if Java works as well on more than 2 cores.
    I mean Java on Ubuntu, haven't tested on windows.

  139. Use Haskell (or OCaml)... by Hurricane78 · · Score: 1

    ... because in such languages, multicore-usage is already included from the very beginning.
    In Haskell, you have to explicitly state, that you do not want something to be spread to more than one core.

    With the included total type safety and lazy evaluating, I call that a winner. :)

    At least, if you do not want to program hardware directly.

    --
    Any sufficiently advanced intelligence is indistinguishable from stupidity.
    1. Re:Use Haskell (or OCaml)... by JustNiz · · Score: 1

      That is a perfect example of what not to put in a language.
      It should be a compiler and/or operating system decision.

  140. Flash processing by AlpineR · · Score: 1

    I don't know about Firefox in particular, but many browsers slow or stop Flash in hidden tabs. So you'd have to split those tabs into windows and tile them across the screen to get your CPU working harder.

  141. Processing Power?? by Anonymous Coward · · Score: 1, Informative

    Is this really a concern? How many people are tapping out their CPU? Honestly, 95% of people will never actively use more than 75% of their dual core 2.0 GHz CPUs. RAM has been and will be the limiting factor on most PCs.

    Also, on the redhat servers I admin, we don't seem to have much trouble with 4x4 CPU's. Are people really saying there is a difference between 16 procs and 4 quad cores? As the OS sees them...

    Odd to even be concerned...
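
    For what it's worth, the "as the OS sees them" point is easy to check from user space. A small Python sketch (visible_cpus is a hypothetical helper name): the OS reports a flat count of logical CPUs, with no distinction between 16 single-core sockets and 4 quad-core ones. The affinity call only exists on some platforms such as Linux, so it is guarded.

```python
import os


def visible_cpus():
    """Return (logical CPU count, CPUs this process may run on or None)."""
    logical = os.cpu_count()  # may be None if it cannot be determined
    usable = None
    if hasattr(os, "sched_getaffinity"):  # not available on every platform
        usable = len(os.sched_getaffinity(0))  # 0 means "this process"
    return logical, usable


logical, usable = visible_cpus()
print(f"logical CPUs reported by the OS: {logical}")
print(f"CPUs this process may be scheduled on: {usable}")
```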

  142. Wrong: by Anonymous Coward · · Score: 0

    Two 2.26GHz Quad-Core Intel Xeon "Nehalem" processors
    6GB (six 1GB) memory
    640GB hard drive
    18x double-layer SuperDrive
    NVIDIA GeForce GT 120 with 512MB
    Ships: Within 24hrs
    Free Shipping
    $3,299.00

    Taken directly from the Apple Store Mac Pro section

    There are two Quad core CPU's inside the 8 core systems. Check before you post and criticize. Also, the two dual core system mentioned was just an example.

    "Better to keep silent and look the fool, rather than speak and remove all doubt"

  143. This article is a troll. by Anonymous Coward · · Score: 0

    The article doesn't mention Win7 or Linux; whoever wrote the slashdot headline invented that part.

    Take a look at top500.org and tell me Linux can't handle more than 4 cores. I don't know much about Win7 but I doubt it will have a problem either.

    True, that many _applications_ don't thread well, but that has nothing to do with the OS.

    I expect a performance increase when I get my 8-way cpu. Especially in situations where my 2 cores are maxed out now.

    This stupid generalization does not take into account that not all people use a computer in the same way.

    "Windows and Linux aren't designed for PCs beyond quad-core chips" - Flat wrong, not in the article, and misinformation.

    I call shenanigans on this article.

  144. Thread.Join..... by donjefe · · Score: 1

    Clearly, more applications need to be (correctly) multi-threaded. I'm not talking about World of Warcraft or CMU calculation projects here, but more common applications like IE, Office, etc. As polished as Microsoft software is (shielding head against thrown fruit), often the user is still forced to wait while UI rendering is waiting on some other task (Outlook you fat slow pig). Every time the Visual Studio IDE turns white while loading my project, every time Outlook is half rendered and has locked all my input devices, every time some office app "appears" to be idle, yet is locking my mouse (AAARRRGH!), I am reminded of how little (or poorly written) multi-threading there is for mainline software. I assure my boss that my cubicle produces more than just profanity and desk banging.... I have noticed that Mac software appears to be quite a bit better in this regard (shielding head from raging mac haters from earlier posts), as I am not often pounding my fist on the table while using my Mac. I'm not sure if a better architecture, or more "thread aware" programming is the cause.

  145. Windows 7 IS ready for multicore by Anonymous Coward · · Score: 0

    According to Mark Russinovich "Technical Fellow and Windows Kernel guru", the dispatch scheduler in Windows 7 was reworked to support up to 256 cores (with logical processor groups). Skip to 8:45 for the details.

    http://channel9.msdn.com/shows/Going+Deep/Mark-Russinovich-Inside-Windows-7/

  146. Developers are not the problem. by rdavidson3 · · Score: 1

    Windows and Linux aren't designed for PCs beyond quad-core chips, and programmers are to blame for that.

    Developers are not the problem. The problem lies further upstream with whomever is creating the functional and technical requirements. Developers develop against those requirements, and if there wasn't a specification for 8 cores, then don't expect it.

  147. Re: bus by andrewd18 · · Score: 1

    What the article is suggesting is that we implement some sort of car-sharing initiative, we stop taking so many cars to the same destination. Or a bus!

    But everything's already being transferred on a bus!

  148. So do C/C++ &/or Delphi (multiple thread capab by Anonymous Coward · · Score: 0

    Those aren't the only language tools that can do that (older ones can as well):

    C & C++ + Borland Delphi compilers have had access to the CreateThread API -> http://msdn.microsoft.com/en-us/library/ms682453(VS.85).aspx call since they & the Win32 API came out!

    (&/or, even

    SetProcessorAffinity -> http://msdn.microsoft.com/en-us/library/microsoft.xna.net_cf.system.threading.thread.setprocessoraffinity.aspx

    +

    SetThreadAffinity -> http://www.delphipraxis.net/topic134206.html

    Win32 API calls & that's ALL a body needs (in addition to actual tasks to "spread around" available physical CPUs &/or multiple cores present, once they are detected for, IF you want to go about this manually that is)).

    You CAN do this yourself, OR, let the OS process scheduler kernel subsystem component do that for you, your choice...

    HOWEVER: Just by coding with multiple threads, you CAN just let the OS process scheduler kernel subsystem take care of it, for you, just by allowing the OS to wait until one of the processors or cores present become fully saturated, & then, it will send other child threads of a parent process to the least saturated CPU cores present.
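
    The Win32 calls themselves are not shown here. As a rough analogue only, this Python sketch uses the Linux-only os.sched_setaffinity to illustrate the "do it yourself" option: it pins the current process to a single CPU and then restores the original set. pin_to_first_cpu is a hypothetical helper name; on platforms without the call, or where the call is refused, it returns None. Doing nothing at all is the "let the OS scheduler decide" option.

```python
import os


def pin_to_first_cpu():
    """Pin this process to one CPU, then restore; return the pinned set or None."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. Windows or macOS: no such call in the os module
    original = os.sched_getaffinity(0)  # CPUs the scheduler may currently use
    first = min(original)
    try:
        os.sched_setaffinity(0, {first})  # manual placement: one CPU only
        pinned = os.sched_getaffinity(0)
    except OSError:
        return None  # the environment refused the request
    finally:
        os.sched_setaffinity(0, original)  # hand control back to the scheduler
    return pinned


print(pin_to_first_cpu())
```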

    E.G.-> The OS' process scheduler subsystem in Microsoft's Windows NT-based OS family (Windows NT 3.5x- 4.x, 2000/XP/Server 2003/VISTA/Server 2008/Windows 7) is aware of how many threads (smallest atomic unit of execution on Microsoft OS') an application has &, that is all it needs!

    I.E.-> Even taskmgr.exe can show anyone that much, as to how many independent threads of execution an app has...

    (The OS & its process scheduler core/kernel component subsystem has to know how many there are in order to send threads of execution that an application has across the least saturated CPU (physical, or core) present, assuming the other CPU's present are @ or nearing 100% cpu cycles saturation).

    APK

    P.S.=> This is done by the OS, & for ANY multithreaded application, & no "SetProcessAffinity" type API calls (explicit multithreaded apps that do all the checks for CPU's present, & schedule their own thread executions across them as needed) required... multiple threads of execution designed apps (that use what I call "implicit multithreaded design" that use multiple threads) are really all that is required here (though you can do the processor detections yourself, routines abound galore online for this if needed & then send the threads you have to diff. CPUs/cores yourself, manually, as noted above IF need be)...

    Well, that's "all you need", & GOOD logic (PLUS, applications that actually require more than a single thread to do a particular job, & NOT ALL DO, & that is "part of the problem", because not all do & many others note it here)

    Imo? Well - the article is misleading (it's more about the apps riding on the OS, & not the OS) - however, I have 17 processes running here, and not a single one is single threaded (Dual Core CPU @ present as I look @ this here)...

    I use Windows Server OS', which can use 1-8 cores outta the box ->

    CPU and memory scalability for Exchange Server 2003 and for Exchange 2000 Server

    http://support.microsoft.com/kb/827281

    (Read carefully to the bottom & it lists what OS' it applies to & the article deals in this w/ Exchange Server)

    Windows Server 2003 &/or Windows Server 2008 can use up to 8 CPUs/cores, outta the box, & install as "WorkStation/Pro" models by default (meaning you can install server-class/back-office class apps like IIS later on, IF ever needed onto them)... apk

  149. Linux will do nicely with at least 10 GIGGABYTE by Anonymous Coward · · Score: 0

    Linux runs well

  150. more cores? gimme! by chaircrusher · · Score: 1

    Every time I read one of these 'boo hoo more cores don't make things fasterer' stories I find it strange, since the problem domains with which I'm familiar -- Image Processing and Audio Software -- can and do already take advantage of multiprocessing.

    In the audio world, you're pushing samples through a directed graph from inputs to outputs, and it's unambiguous to split the processing into threads that can keep the CPU fairly busy.

    In Image Processing, and particularly in the Insight Toolkit that I work with daily, image filters are written to run separate threads on regions of the images. It isn't even particularly hard for most tasks, that iterate through a pixel at a time, requiring only read-only access to an input image.
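
    A toy Python sketch of the region-splitting idea. This is not the Insight Toolkit API; invert_rows and invert_parallel are hypothetical names, and the "image" is just a list of rows of 0-255 values. Each worker reads a band of rows from the shared input and returns new rows, so no locking is needed. In CPython the threads will not speed up pure-Python arithmetic, so treat this as showing the structure rather than the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def invert_rows(image, start, stop):
    # Read-only access to the shared input; the result is a fresh list of rows.
    return [[255 - px for px in row] for row in image[start:stop]]


def invert_parallel(image, workers=4):
    height = len(image)
    step = max(1, -(-height // workers))  # ceiling division: rows per region
    bounds = [(s, min(s + step, height)) for s in range(0, height, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields the regions in submission order, so rows stay in order.
        chunks = pool.map(lambda b: invert_rows(image, *b), bounds)
        return [row for chunk in chunks for row in chunk]


image = [[0, 10], [20, 30], [40, 50], [60, 70], [80, 90]]
print(invert_parallel(image, workers=2))
```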

    And for software development, where you run builds and rebuilds all day, make -j 8 makes a hell of a difference in how long you wait to do something.

    Computer games could really use more cores as well, because the view on screen has the same property as most image processing -- each pixel on screen is an independent computation. If you do parallel ray tracing, doubling the cores can nearly double the frame rate. That's why hardcore gamers pay the big bucks for multi-card solutions -- the graphics cards are rendering in parallel.

    Now if you're talking about a spreadsheet or a web browser, it's hard to see the benefit. That's why so many people buy pokey little Atom netbooks -- nothing they do would have taxed a 1GHZ PIII ten years ago particularly.

  151. Re:Mythical Machine Month by Cassini2 · · Score: 1

    I think I agree with you, BUT... don't fall into the old trap: If ten machines can do the job in 1 month, 1 machine can do the job in 10 months. But it doesn't necessarily follow that if one machine can do the job in 10 months, 10 machines can do the job in 1 month.

    Unfortunately, this effect just makes the programmer's job worse. It means that if he can only get the complexity estimate to within a factor of 100 for CPU usage, by the time Amdahl's law is done, his estimate will only be good within a factor of 1000. To me, this screams, if you really need multi-core capability, you probably need a cluster too.

    How likely is it that if a programmer shows a user some code, and the feedback is the code is too slow, that the user will be satisfied with a 2:1 or a 4:1 speedup?
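
    To put numbers on the 2:1 and 4:1 question, here is Amdahl's law as a short Python sketch. The 90% parallel fraction is an assumed figure for illustration, not something measured in this thread.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup when only part of the work can be spread across cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumption: 90% of the run time parallelizes perfectly.
for cores in (2, 4, 8, 16):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
# prints:
# 2 1.82
# 4 3.08
# 8 4.71
# 16 6.4
```

    Under that assumption, going from 4 to 16 cores roughly doubles the speedup, and no number of cores can push it past 10:1.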

  152. Not octo-core ready? by Anonymous Coward · · Score: 0

    Bull. They scale nicely up to about 16 cores. The problem you are going to have on Windows is Licensing, which for XP/SP3 and Vista allows up to 4 cores.
    On Linux, you can rebuild the kernel with a few mods. Unless you are building a cluster and need a fiber backbone, it is not an issue.
    As far as "most programs not ready...", that may be true, but as mentioned elsewhere here, it isn't needed for a lot of applications. Learn pthreads (if you're using C/C++) and you'll be fine, or java threads if you are doing Java.

  153. Re:Are we talking now about technology or marketin by Todd+Knarr · · Score: 1

    Most supercomputers these days aren't single machines, they're clusters. Google "beowulf" for examples. See http://www.cbronline.com/news/linux_x86_clusters_take_over_top_500_supercomputer_ranking, they noticed the trend back in 2004.

  154. Re:Mythical Machine Month by Tiger4 · · Score: 1

    How likely is it that if a programmer shows a user some code, and the feedback is the code is too slow, that the user will be satisfied with a 2:1 or a 4:1 speedup?

    2:1 is probably only just noticeable, assuming it isn't an actual timed test. Anything that highly depends on user responsiveness (i.e. gaming and simulations) needs pretty dramatic pickups before the user will categorically agree it is better.

    Even the more reasonable ones will want a "meaningful" increase. So the time saved has to be enough that they could do something useful with it. i.e. shoot another bad guy, beat the market to a good deal, go out for a smoke break, get home an hour earlier, etc.

    --
    Behold, this dreamer cometh. Come now, and let us slay him... and we shall see what will become of his dreams.
  155. Gonna have to disagree with you there by Anonymous Coward · · Score: 0

    Well, who said that every application under the sun must be heavily multi-threaded or spawning multiple processes? Where's the need for a email client to spawn 8 or 16 threads? Will my address book be any better if it spans a bunch of processes?

    Have you ever used Outlook or Thunderbird when accessing multiple IMAP accounts? No amount of cores will make that tolerable.

  156. UNIX pipes and named pipes by mchnz · · Score: 1

    Twenty years ago processors were slow, but some UNIX boxes had more than one. Where this was the case, pipes and named pipes could be used to keep more than one CPU busy. Such techniques were often used for linking troff, eqn, etc. The skill required was not much more than the ability to break a task down into large sized units that could work independently. Of course not all tasks are amenable to such an approach, but many are.
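
    A small Python sketch of the same pipe technique. Two child processes are connected by an anonymous pipe, so the OS is free to run them on different CPUs. The PRODUCER and CONSUMER one-liners and the run_pipeline name are hypothetical stand-ins for real stages such as eqn and troff.

```python
import subprocess
import sys

# Hypothetical stages: one process emits numbers, the other sums them.
PRODUCER = "for i in range(5): print(i)"
CONSUMER = "import sys; print(sum(int(line) for line in sys.stdin))"


def run_pipeline():
    producer = subprocess.Popen(
        [sys.executable, "-c", PRODUCER],
        stdout=subprocess.PIPE,
    )
    consumer = subprocess.Popen(
        [sys.executable, "-c", CONSUMER],
        stdin=producer.stdout,  # the pipe joins the two processes
        stdout=subprocess.PIPE,
        text=True,
    )
    producer.stdout.close()  # drop the parent's copy so the pipe closes cleanly
    output, _ = consumer.communicate()
    producer.wait()
    return output.strip()


print(run_pipeline())  # prints 10
```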

  157. Firefox + Flash apparently CAN'T use 2 cores... by Anonymous Coward · · Score: 0

    I run a 2.4GHz quad-core, on 4GB ECC ( Phenom II ), in openSUSE 11.0.

    Firefox ENDLESSLY freezes.

    Flash video *almost* ALWAYS fails to play back smoothly.

    Typing into ANY web-form *always* involves having the typing appear intermittently-later than what I'm typing.

    I tried running KSysGuard, to discover what the hell's going on, and found out that

    a) you have to tell KSysGuard that your scale, for
    CPU0-sys
    CPU1-sys
    CPU2-sys
    CPU3-sys
    CPU0-user
    CPU1-user
    CPU2-user
    CPU3-user
    CPU0-nice
    CPU1-nice
    CPU2-nice
    CPU3-nice
    is *400%*, because that's how it thinks.
    ( 100% of each Core, * 4 cores ).

    Stupidly, I'd assumed that 100% would mean 100% of CPU being used...

    WHEN I figured that out, though ( by re-compiling the kernel, with "make -j5" ), then

    b) I SAW that Firefox *never* uses a second core, except for during startup?

    Why the hell can't it send plugins to a separate core?

    Why the hell can't it send tab-open to a separate core?

    Obviously, I'm not a coder, but to have a 2.4GHz quad-core proc *stuttering* on a BROWSER that has a few windows/tabs open, EVERY TIME I open a new tab, every time I view a video, etc, seems incompetent.

    It's like enforcing 1-wheel drive, in a crew-cab pickup-truck!