Slashdot Mirror


SW Weenies: Ready for CMT?

tbray writes "The hardware guys are getting ready to toss this big hairy package over the wall: CMT (Chip Multi Threading) and TLP (Thread Level Parallelism). Think about a chip that isn't that fast but runs 32 threads in hardware. This year, more threads next year. How do you make your code run fast? Anyhow, I was just at a high-level Sun meeting about this stuff, and we don't know the answers, but I pulled together some of the questions."

378 comments

  1. Ready for CMT? Hell no! by iostream_dot_h · · Score: 3, Funny

    Now my hardware will force me to support CMT on my computer? This is as bad as DRM.

  2. Schism Growing by SirCyn · · Score: 2, Insightful

    I see a deep schism growing in the processor industry. There are two main camps, the parallel processors, and the screamin' single processors.

    The parallel ones are used for intense processing: research, servers, clusters, databases; anything that can be divided into many little jobs and run in parallel.

    The other camp is the average user who just wants fast response time and to play Doom 3 at 100+ fps.

    1. Re:Schism Growing by GoatMonkey2112 · · Score: 2, Insightful

      This will go away once there are games that take advantage of multiple processors. Eventually the game user will start to see the advantage of multiple processors. It's already starting to become clear when you look at the architectures of the next generation consoles.

    2. Re:Schism Growing by udderly · · Score: 1

      I see a deep schism growing in the processor industry. There are two main camps, the parallel processors, and the screamin' single processors.

      I would like to have a parallel processor for my servers and a single processor to do video rendering. Is there a downside that I'm missing?

    3. Re:Schism Growing by Anonymous Coward · · Score: 0
      The other camp is the average user who just wants fast response time and to play Doom 3 at 100+ fps.

      I think some of the people in this "camp" that put together gaming PC's to squeeze every frame they can get would not appreciate being referred to as "average".

    4. Re:Schism Growing by philipgar · · Score: 5, Interesting

      Actually from what I've heard, the entire industry is moving in this direction. The whole idea of out of order processors (OOP) has become outdated. OOP was great: it enabled massive single-threaded performance. However, the costs (in terms of area and heat dissipation) are enormous.

      I just came back from the DaMoN workshop where the keynote was delivered by one of the lead P4 developers. He explained the future of microprocessors and said that the 10-15% extra performance that OOP enables just isn't worth it. The Pentium 4 has 3 issue units, but as things are it rarely issues more than 1 instruction per cycle.

      We can squeeze more performance out of them, but not much. The easiest method is to go dual core. However, if an application must be multithreaded to get the best performance, what would you rather have: 2 highly advanced cores, or 8-10 simple cores that can each issue half as many instructions per cycle as a core in the dual core design? Then consider the fact that each core enables 4 threads to run (switch on cache miss/access). It doesn't take a rocket scientist to see that overall performance is improved with this.

      The other option is the hybrid core. A single really fast x86 core combined with multiple simpler x86 cores. That way single threaded apps can run fast (until they're converted) and you can get overall throughput from the system without blowing away your power budget on OOP optimizations.

      Granted, most of this is in the future (within the next 5 years), but IBM's going that way (a la Cell), it's within Intel's roadmap, Sun is pushing that route, etc. I assume AMD has plans to create a supercomputer on a chip . . . unless they wish to be obsoleted.

      Phil

    5. Re:Schism Growing by selderrr · · Score: 1

      l33ts who want Doom3 at 100+fps can also benefit from massive parallelism: the graphics are offloaded to the GPU anyway, so what's left for the CPU is projectile & object positioning, and AI.

      imagine a future PC with 32656 CPUs, all running at a measly 40MHz, but each one dedicated to a single object in the game. All they have to do is calc the position of that single object. Might give some interesting results

    6. Re:Schism Growing by timford · · Score: 4, Interesting

      You're right that the latest generation console CPU architectures reflect the trend of concurrent thread execution. That said, however, there seems to be a parallel trend developing that involves separating the general purpose CPU into independent single-purpose processors.

      The most obvious example of this is the GPU, which has been around for a long time. The latest moves toward this trend rumored to be in development are PPUs, Physics Processing Units. How long until game AI evolves enough that we have the need for AIPUs also?

      This approach obviously doesn't make too much sense in a general purpose computer because the space of possible applications and types of code to be run are just too large. It makes perfect sense in computers that are built especially to run games though, because we have a very good idea of the different kinds of code most games will have to run. This approach allows each type of code to be run on a processor that is most efficient at that type of code, e.g. graphics code being run on processors that provide a ton of parallel pipelines.

    7. Re:Schism Growing by timford · · Score: 1
      Just as a slight correction, the majority of CPU work done in games like Doom3 is also graphics-related, despite the existence of the GPU. The CPU has to take care of setting up all the data to be fed to the GPU... for example calculating shadow volumes, applying bone transformations to skin vertices, etc.

      imagine a future PC with 32656 CPUs, all running at a measly 40MHz, but each one dedicated to a single object in the game. All they have to do is calc the position of that single object. Might give some interesting results

      This would be horrible IMHO. The vast amount of information that would need to be passed among all the processors would dwarf the actual game code.

    8. Re:Schism Growing by rpresser · · Score: 1

      imagine a future PC with 32656 CPUs,

      What are the other 111 processors doing (32656 + 111 = 2^15-1)? Enforcing DRM?

      --
      Why the heck doesn't slashcode let me use <sup> and <sub>?

    9. Re:Schism Growing by trentblase · · Score: 1

      Yeah, the downside you're missing is the cost of having both.

    10. Re:Schism Growing by LWATCDR · · Score: 1

      "The other camp is the average user who just wants fast response time and to play Doom 3 at 100+ fps."
      I am afraid that is NOT the average user. Maybe the average high end gamer but not user.
      Parallel will be of use for the average user. Your typical PC runs about 43 processes. Yes even games will benefit once the game programmers start writing multi threaded code. For your average user I can see where you might even have a bunch of integer processors "sharing" a few blindingly fast CPUs. Sort of like a reverse CELL. Most of your typical code is still integer except for games, simulations, and multi-media. How many people play Doom3, rip a DVD to DiVX, and simulate a star going nova at the same time?

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    11. Re:Schism Growing by janpf · · Score: 0

      The balance might not be so easy. Exploring parallelism is a hard issue for many problems. For instance, most of my time I'm compiling C++ code. Usually I just need to compile one file (the one I changed and want to test), and this is not a parallel process. In this scenario I'd be better off with 2 highly advanced cores. As you mentioned the hybrid option might be the best. But remember that there are apps that cannot be "converted".

    12. Re:Schism Growing by jedidiah · · Score: 1

      Am I the only person that opens up the Windows task mangler and sees hordes of processes running?

      If you're not using an Atari ST, you can probably already exploit a multi-core cpu to your immediate benefit.

      --
      A Pirate and a Puritan look the same on a balance sheet.
    13. Re:Schism Growing by Jeremy+Erwin · · Score: 1

      imagine a future PC with 32656 CPUs, all running at a measly 40MHz,

      Ah, a PC where latency is king. Rather difficult to optimize, I should imagine.

    14. Re:Schism Growing by Anonymous Coward · · Score: 0
      The balance might not be so easy. Exploring parallelism is a hard issue for many problems. For instance, most of my time I'm compiling C++ code. Usually I just need to compile one file (the one I changed and want to test), and this is not a parallel process. In this scenario I'd be better off with 2 highly advanced cores. As you mentioned the hybrid option might be the best. But remember that there are apps that cannot be "converted".

      WTF? I hope your code isn't as obfuscated as your comment.

    15. Re:Schism Growing by MynockGuano · · Score: 2, Funny

      Managing the cooling system and blue case LEDs.

    16. Re:Schism Growing by MynockGuano · · Score: 1

      Heh; I had to read it forwards and backwards, too, but what he's saying is that compiling code, being a process that's not really parallelizable(?), is going to really bog down without powerful single-thread processing available.

    17. Re:Schism Growing by philipgar · · Score: 1

      In this case there isn't really a problem, considering that if you're compiling more than 1 file the compilation is parallelizable. If you just change one file then only that one file has to be compiled. If the code in that file is so complex that a single simple core running at 4GHz takes too long to compile the file, then you need to reexamine how you're writing your code. If it takes long enough that it's annoying, something is wrong.

      However, that being said, I'm not really sure how gcc divides up the building phase. If that's not parallelizable then there is a good chance you may have to wait; I have no idea.

      Phil

    18. Re:Schism Growing by swillden · · Score: 4, Interesting

      Exploring parallelism is a hard issue for many problems. For instance, most of my time I'm compiling C++ code. Usually I just need to compile one file (the one I changed and want to test), and this is not a parallel process.

      You'll still benefit from parallelism in two ways. First, a modern computer is rarely doing just one thing. The OS has some threads managing I/O and performing housekeeping operations, and you're probably also listening to some music, and you probably have some other apps running that occasionally need a little computation. So none of that stuff will impede your compile.

      Second, even a compiler can benefit from multiple threads, though current compilers don't do it. There are multiple stages in compilation, like pre-processing, lexical analysis, syntax analysis, semantic analysis, intermediate code generation, optimization and code generation. The stages don't need to wait until the previous stage has completed its work on the entire file, so the stages can be parallelized to a large extent. It might even make sense to have multiple threads working on different chunks of the code for more computation-intensive stages, like optimization (which becomes even more important without out-of-order execution).
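
      The stage overlap described above can be sketched as a thread-per-stage pipeline. This is only an illustration: the three stage functions are toy stand-ins invented here, not real compiler passes, and in CPython the threads mostly interleave rather than run in parallel. The point is the structure, where each stage starts on a line as soon as the previous stage hands it over.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Run func on each item as soon as it arrives, then pass DONE along."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(item))


def run_pipeline(lines, funcs):
    """Connect the stage functions with queues and stream lines through them."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for w in workers:
        w.start()
    for line in lines:  # feed the first stage one line at a time
        queues[0].put(line)
    queues[0].put(DONE)
    results = []
    while (item := queues[-1].get()) is not DONE:
        results.append(item)
    for w in workers:
        w.join()
    return results


# Toy stand-ins for real compiler stages (assumptions for illustration only).
def preprocess(line):
    return line.split("//")[0].strip()  # drop a trailing comment


def lex(line):
    return line.split()  # whitespace tokenizer


def emit(tokens):
    return " ".join(t.upper() for t in tokens)


if __name__ == "__main__":
    print(run_pipeline(["load a // comment", "add a b"], [preprocess, lex, emit]))
```

      Because every stage is a single FIFO consumer, output order matches input order even though the stages overlap in time.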

      It seems to me that linking could also be done in parallel with compilation, to some degree. To a very large degree if you can guarantee that you don't have any symbols that override library symbols (else a use of a symbol could be linked against a library definition of that symbol before the compiler got around to noticing that you'd provided another definition).

      Perhaps the biggest problem with parallelizing compilation and linking to that degree will be I/O. On second thought, probably not, because modern machines have huge amounts of RAM for caching disk files.

      In an 8+ core machine, it may make sense to dedicate a core to memory management, also. Even with manual memory management (malloc/free), allocating and releasing memory consumes significant CPU cycles, so I could see value in offloading that to another thread. A "free" operation, from a compute thread's point of view, would be nothing more than notifying the memory manager thread that this block is now available for re-use. The memory manager thread would then take care of all of the bookkeeping needed. The manager could also arrange to have a list of blocks of commonly-needed sizes ready for instant allocation, and could even spend some CPU cycles on analyzing the allocation patterns of the compute threads to try to ensure that blocks are always available when needed. Obviously, pushing that idea further leads naturally to full-blown garbage collection, with fewer concerns about GC pauses.
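
      A minimal sketch of the dedicated memory-manager idea above, assuming fixed-size blocks. The class name and its methods are hypothetical, and zeroing the block stands in for whatever bookkeeping a real allocator would do. From the compute thread's side, free() is only a notification; the manager thread recycles the block in the background.

```python
import queue
import threading


class BackgroundFreeList:
    """Toy allocator: compute threads hand freed blocks to a manager thread."""

    def __init__(self, block_size, prefill):
        self.block_size = block_size
        self._ready = queue.Queue()     # blocks ready for instant allocation
        self._released = queue.Queue()  # blocks waiting for bookkeeping
        for _ in range(prefill):
            self._ready.put(bytearray(block_size))
        self._manager = threading.Thread(target=self._manage, daemon=True)
        self._manager.start()

    def alloc(self):
        # Fast path: take a prepared block; otherwise create a fresh one.
        try:
            return self._ready.get_nowait()
        except queue.Empty:
            return bytearray(self.block_size)

    def free(self, block):
        # The caller only notifies the manager and returns immediately.
        self._released.put(block)

    def _manage(self):
        # Manager thread: do the bookkeeping (here, zero the block) and
        # put the block back on the ready list.
        while True:
            block = self._released.get()
            if block is None:  # shutdown request from close()
                return
            block[:] = bytes(len(block))
            self._ready.put(block)

    def close(self):
        self._released.put(None)
        self._manager.join()


if __name__ == "__main__":
    pool = BackgroundFreeList(block_size=64, prefill=2)
    buf = pool.alloc()
    buf[0] = 255
    pool.free(buf)  # returns at once; recycling happens on the manager thread
    pool.close()
```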

      Although it's true that not all computations can be sped up by multi-threading, lots of them can, including lots that we're used to thinking of as inherently serial processes.

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    19. Re:Schism Growing by InvalidError · · Score: 1

      Application-specific architectures are in completely different leagues from general-purpose.

      A genera-purpose CPU can pump out up to two multiplications per cycle assuming the code consists entirely of MUL instructions (unrealistic) and with a 4GHz CPU, this amounts to only 8 billion 32-bit multiplications per second.

      GPUs are optimized to pump out 4x4 matrix multiplications and the current models have an effective throughput well over 100 billion multiply-add operations per second.
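
      The arithmetic behind those two figures, as a quick sketch. The function name is made up here, and the numbers are the peak figures quoted in the comment, not measurements.

```python
def peak_ops_per_second(ops_per_cycle, clock_hz):
    """Peak throughput if every cycle issues the maximum number of operations."""
    return ops_per_cycle * clock_hz


# Figures quoted in the comment: 2 multiplies per cycle at 4 GHz for the CPU,
# and a GPU throughput of (at least) 100 billion multiply-adds per second.
cpu_peak = peak_ops_per_second(2, 4e9)
gpu_quoted = 100e9

if __name__ == "__main__":
    print(f"CPU peak: {cpu_peak:.0f} multiplies/s")
    print(f"GPU advantage: at least {gpu_quoted / cpu_peak}x")
```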

      On a CPU, the coder/compiler has to specify how data is shuffled and transformed. On GPUs, the hardware already knows how the data is going to be loaded and saved so the programmer only needs to define the transformations based on architectural (register) data.

      As for the PhysX accelerator, most of the physics calculations could probably be offloaded to the graphics card's vertex shader so I am much more sceptical about that one.

    20. Re:Schism Growing by bdeclerc · · Score: 1

      No, but you might just be the only person who hasn't noticed most of those processes aren't doing anything, which is something they can do quite fast even on a single-threaded machine...

      SMT and CMT won't gain you anything if there's a single thread requiring 99% of CPU resources, and hundreds each requiring 0.001% of CPU Resources.

    21. Re:Schism Growing by fitten · · Score: 1

      If the code in that file is so complex that a single simple core running at 4GHz takes too long to compile the file, then you need to reexamine how you're writing your code. If it takes long enough that it's annoying, something is wrong.


      Ever compile C++ code that uses a lot of templates?

    22. Re:Schism Growing by sleepingsquirrel · · Score: 1
      Also, don't forget that people have been thinking about this problem for a long time. So be sure to check the literature...
      Parallel computers with tens of thousands of processors are typically programmed in a data parallel style, as opposed to the control parallel style used in multiprocessing. The success of data parallel algorithms - even on problems that at first glance seem inherently serial - suggests that this style of programming has much wider applicability than was previously thought.
    23. Re:Schism Growing by Doctor+Faustus · · Score: 1

      Yes, it would. However, companies (Thinking Machines?) already have made cubes of processors, where each only talks to its immediate neighbors. You could break down the space you're modeling into 32768 smaller cubes (32*32*32), and assign one processor to each. As modeled objects move through the modeled region, their data would be passed from one processor to the next.
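
      A rough sketch of that decomposition, assuming a cubic world split into a 32 x 32 x 32 grid with one processor per cell. All names here are hypothetical, and this models a plain 3D grid, not any particular vendor's interconnect.

```python
SIDE = 32  # 32 * 32 * 32 = 32768 cells, one per processor


def cell_index(x, y, z, side=SIDE):
    """Map integer cell coordinates to a single processor number."""
    return (z * side + y) * side + x


def cell_of(position, world_size, side=SIDE):
    """Cell coordinates of a point inside a cubic world [0, world_size)^3."""
    return tuple(min(side - 1, int(c / world_size * side)) for c in position)


def neighbors(x, y, z, side=SIDE):
    """The up-to-six face-adjacent cells a processor would exchange data with."""
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return [
        (x + dx, y + dy, z + dz)
        for dx, dy, dz in steps
        if 0 <= x + dx < side and 0 <= y + dy < side and 0 <= z + dz < side
    ]


if __name__ == "__main__":
    cell = cell_of((0.0, 50.0, 99.9), world_size=100.0)
    print(cell, cell_index(*cell), neighbors(*cell))
```

      An object crossing a cell boundary only needs to be handed to one of the processors returned by neighbors(), which keeps communication local.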

    24. Re:Schism Growing by Anonymous Coward · · Score: 0

      10-15%? Try 100% in some cases. I've tested simple Mandelbrot set plotting code on an Alpha EV5 and EV6 running at the same clock speed (~500MHz) and the EV6 ran my program twice as fast. The EV6 is an OOO machine whereas the EV5 is a simple in-order processor.

      The inner loop of my program was very tight and not especially parallelizable, but it benefited immensely from dynamic register renaming.

    25. Re:Schism Growing by MemoryDragon · · Score: 1

      The funny thing is, fast response time goes hand in hand with parallelism. You cannot win speedwise in a user interface without pushing as much into the background as possible.

    26. Re:Schism Growing by ajlitt · · Score: 1

      Now, maybe I'm sharing my opinions too freely here... but don't you think that listening to an architect of the P4 decry OOP is like listening to a designer on the Ford Pinto speak out against putting gas tanks in the back of cars? True, a big honkin OOP mechanism is probably not the best use of die space, since compiler scheduling is getting pretty good these days, but let's not forget the Itanium "let the compiler do ALL the work" mess...

    27. Re:Schism Growing by aminorex · · Score: 1

      The point breaks down on its assumption. Compiling code is enormously parallelizable.

      --
      -I like my women like I like my tea: green-
    28. Re:Schism Growing by aminorex · · Score: 1

      > A genera-purpose CPU...

      You mean, a Symbolics Lisp Machine?

      Sorry, couldn't resist.

      --
      -I like my women like I like my tea: green-
    29. Re:Schism Growing by MaGogue · · Score: 1

      The other camp is the average user who just wants fast respons time and to play Doom 3 at 100+ fps.
      Don't forget the even more average user : IE and 16 worms.
      Maybe I'll get a few less calls like 'the pages are loading slower and slower'.

    30. Re:Schism Growing by philipgar · · Score: 4, Interesting

      This is true. On a 500MHz machine OOP makes a huge difference. However when we move to a 4GHz machine that requires 400 cycles to access main memory, 25 cycles to access L2 cache and 4 cycles to access L1 cache, the difference between OOP and in-order starts to fall away. Even the best code on the best processors of today isn't getting a huge speedup from OOP. Also just because the processor is in order doesn't mean a memory/fp/int instruction can't all be run in parallel depending on how it's designed (however they must be retired in order). The primary factor however is the memory hierarchy. If most applications are waiting on main memory or cache half of the time, even the most efficient processing can cut run time by at most 50% (Amdahl's law).

      Phil
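
      A worked version of the Amdahl's law estimate above, assuming exactly half of the run time is memory stalls that a faster core cannot touch. The function name is illustrative.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when a fraction of the run time gets `factor` times faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


if __name__ == "__main__":
    # Assumption from the comment: 50% of the time is spent waiting on memory,
    # so only the remaining 50% benefits from a faster or smarter core.
    for factor in (2, 10, 1000):
        print(f"core {factor}x faster -> overall {amdahl_speedup(0.5, factor):.3f}x")
```

      Under that assumption the overall speedup creeps toward 2x but never reaches it, however fast the core gets, which is the same as saying run time falls by at most half.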

    31. Re:Schism Growing by Anonymous Coward · · Score: 0

      More like your point breaks down on your strawman. He specifically said compiling one file. How do you parallelize that?

    32. Re:Schism Growing by laird · · Score: 1

      Perhaps I'm biased (I used to work at Thinking Machines) but many more problems are parallel than you'd think at first, if you have a sufficiently fine-grained parallelism.

      "even a compiler can benefit from multiple threads, though current compilers don't do it"

      Actually, Parallel Make (i.e. gmake -j, http://developers.sun.com/solaris/articles/parallel_make.html, or pmake, http://www.llnl.gov/icc/lc/DEG/pmake/pmake.html) can make project builds significantly faster.

      Beyond that, any time that you're rendering graphics, or sorting data, or in fact using any large volume of data, or doing more than one thing at a time, multiple processors could help. This is why most graphics cards are highly parallel, and why all high-end databases run well on many processors. Heck, even booting a computer benefits from parallel execution. Of course, it may be hard work expressing the dependencies between processes properly (e.g. parallel makes can break makefiles that have implicit dependencies in instruction order that work serially, but which break when parallelized), but if the problem is worth thinking harder about, it can probably be parallelized.
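
      The point about explicit dependencies can be sketched as follows. This is not how make is implemented; it is a small illustration using Python's standard graphlib module (Python 3.9+), with a hypothetical parallel_build helper and made-up target names. Targets whose dependencies are all finished are built concurrently, and nothing relies on the order in which rules happen to be listed.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def parallel_build(deps, build, jobs=4):
    """Build every target, running independent targets concurrently.

    deps maps each target to the set of targets it depends on.
    Returns the targets in the order they finished.
    """
    order = TopologicalSorter(deps)
    order.prepare()  # raises CycleError on circular dependencies
    finished = []
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        running = {}
        while order.is_active():
            # Start everything whose dependencies are already built.
            for target in order.get_ready():
                running[pool.submit(build, target)] = target
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any build error
                target = running.pop(future)
                finished.append(target)
                order.done(target)  # may unlock dependent targets
    return finished


if __name__ == "__main__":
    deps = {
        "app": {"main.o", "util.o"},
        "main.o": {"config.h"},
        "util.o": {"config.h"},
        "config.h": set(),
    }
    print(parallel_build(deps, lambda target: print("building", target)))
```

      Here main.o and util.o may build in either order or at the same time, but config.h always comes first and app always comes last, because those constraints are written down rather than implied by position.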

    33. Re:Schism Growing by swillden · · Score: 1

      Actually, Parallel Make (i.e. gmake -j, http://developers.sun.com/solaris/articles/parallel_make.html, or pmake, http://www.llnl.gov/icc/lc/DEG/pmake/pmake.html) can make project builds significantly faster.

      Yes, but the poster I was responding to was talking about building a single source file (the one just modified, usually) and linking it. "make -j" doesn't do anything useful if there aren't multiple source files out of date.

      I agree with the rest of your comments.

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    34. Re:Schism Growing by amliebsch · · Score: 1

      But then Doom4 will require a hypercube of processors. May require some new fab techniques.

      --
      If you don't know where you are going, you will wind up somewhere else.
    35. Re:Schism Growing by ciroknight · · Score: 1

      To shed a practical light on what you're saying, Processing Unit specialization isn't really happening as drastically as you may think.

      Many companies have realized that Multithreading is just the next step into building faster machines. Single thread apps are virtually going the way of the dodo as almost every application we use could make use of having multiple threads. And those that don't will co-exist well with other applications that do.

      More specialized chips like the Cell processor being pimped by IBM are really the light of the future in this area. Taking a cue from the graphics card industry, it simply incorporates multiple, highly parallel float/vector units. Your "Physics Processing Unit" could run dearly inside of this, along with nearly any other vector code.

      AIPU's are a bit different; AI code usually deals with lots of branches, which almost would be better handled by its own processor. If a branch miss happens, the entire pipeline being flushed is a disaster. While I believe that this can be dealt with on-chip with the CPU, better branch predictors and trace caches, it's very hard to integrate these features into anything to save power, and thus, it would almost be better off chip.

      Current generation console chips simply reflect that the GHz wars are over. At this point, no matter how much faster we scale up the processors, we can't get the same amount of work done as having 2, 3, or 7+1 cores. We should look forward to newer desktop machines that reflect this processing wall.

      --
      "Victory means exit strategy, and it's important for the President to explain to us what the exit strategy is." G.W.Bush
    36. Re:Schism Growing by jericho4.0 · · Score: 1

      You're correct, but who waits long to compile one file? You're either compiling one smallish file, that will compile and link in seconds, or you trigger a larger compile that can be parallelized.

      --
      "A language that doesn't affect the way you think about programming, is not worth knowing" - Alan Perlis
    37. Re:Schism Growing by zuzulo · · Score: 1

      I hate seeing folks running windows without easy ways to optimize which processes are actively running. When forced to use windows, one application i find extremely useful is a freeware program called EndItAll which allows you easily to set which processes are not allowed to run, which are optional, and which should never be killed. Pretty useful stuff - the original version is freeware, tho i think 2.0 may not be - just google it.

      Once installed, play around and figure out a minimal process set to use as a baseline ...

      --
      "They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety."
    38. Re:Schism Growing by Anonymous Coward · · Score: 0

      If you change one header file that 5000 code modules depend on, then it helps.

    39. Re:Schism Growing by CTho9305 · · Score: 2, Informative

      However when we move to a 4GHz machine that requires 400 cycles to access main memory, 25 cycles to access L2 cache and 4 cycles to access L1 cache, the difference between OOP and in-order starts to fall away.
      Actually, a major point of OO execution is to hide small delays - with the window sizes of modern processors, you can easily hide L1 latencies and possibly hide the latencies of L2, and only pay severely for accesses to main memory. An in-order core is the one that really loses performance as L1 and L2 take a few extra cycles.

      Also just because the processor is in order doesn't mean a memory/fp/int instruction can't all be run in parallel depending on how its designed (however they must be retired in order).
      They're retired in order in out-of-order processors... they just execute out of order. In order to run n instructions in parallel in an in-order processor, you need n consecutive instructions in the program that all have their dependencies met at the start of the cycle. I'd doubt that happens often. Plus, you said yourself L1 is 4 cycles away... that means at every memory access you can't issue any new instructions for 4 cycles. Read some assembly code - memory instructions make up a HUGE portion of the instructions. In order execution is a big sacrifice, and you need to be able to find a lot more parallelism to make up for its loss.

    40. Re:Schism Growing by CTho9305 · · Score: 1

      The OS has some threads managing I/O and performing housekeeping operations, and you're probably also listening to some music, and you probably have some other apps running that occasionally need a little computation. So none of that stuff will impede your compile.
      This is true, but if you look at how much CPU time that stuff actually uses, it's negligible - over 1 week, with winamp playing ~12 hours/day, it's used 20 minutes of CPU time. "System" (where drivers live) has used 1 hour, as has my virus scanner. If we assume all background tasks combined used 4 hours of CPU time (seems unlikely, looking at task manager... probably closer to 3), on average about 2% of the time is going to these tasks. If you add a whole extra processor core just for these 2%, you're going to be wasting a lot of computational power.
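
      The percentage above, worked through. The only inputs are the CPU-hour estimates from the comment and the 168 hours in a week; the function name is illustrative.

```python
HOURS_PER_WEEK = 7 * 24  # 168 wall-clock hours


def background_share(cpu_hours, wall_hours=HOURS_PER_WEEK):
    """Fraction of one core's time consumed by background tasks."""
    return cpu_hours / wall_hours


if __name__ == "__main__":
    # 4 hours is the comment's generous estimate, 3 hours its likelier one.
    for hours in (4, 3):
        print(f"{hours} CPU-hours per week = {background_share(hours):.1%} of one core")
```

      Both estimates land at or just under 2.4%, consistent with the "about 2%" figure.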

    41. Re:Schism Growing by Fembot · · Score: 1

      this sounds rather like old news to me: transputer

    42. Re:Schism Growing by Anonymous Coward · · Score: 0

      How long until the AIPU gets smarter than the CPU and takes over the computer?

    43. Re:Schism Growing by laird · · Score: 1

      "the poster I was responding to was talking about building a single source file (the one just modified, usually) and linking it. "make -j" doesn't do anything useful if there aren't multiple source files out of date."

      You're right, of course. In theory you could parallelize the compilation process, but that would be very hard to do since any piece of code could be dependent on any other (at least until you parse it enough to understand what it does). At least, with multiple files, the makefile makes the dependencies explicit.

      "I agree with the rest of your comments."

      Rock on!

  3. Niagara Myths by turgid · · Score: 4, Insightful
    I am totally not privy to clock-rate numbers, but I see that Paul Murphy is claiming over on ZDNet that it runs at 1.4GHz.
    Whatever the clock rate, multiply it by eight and it's pretty obvious that this puppy is going to be able to pump through a whole lot of instructions in aggregate.

    Ho hum.

    On a good day, with a following wind, Niagara might be able to do 8 integer instructions per second, provided it has 8 independent threads not blocking on I/O to execute.

    It only has one floating-point execution unit attached to one of those 8 cores, so if you have a thread that needs to do some FP, it has to make its way over to that core and then has to be scheduled to be executed, and then it can only do one floating-point instruction.

    Superb.

    The thing is, all of the other CPU vendors will have super-scalar, out-of-order 2- and 4-core 64-bit processors running at two to three times the clock frequency.

    You do the mathematics.

    1. Re:Niagara Myths by Shalda · · Score: 3, Insightful

      Well, as you might expect, Sun has only a server mentality. The typical server runs few floating point instructions. In a lot of ways, Niagara would be very good at crunching through a database or serving up web pages. On the other hand, such a processor would be worthless on a desktop or a research cluster. I'd like to see actual real-world performance on these processors. I'd also like to see what Oracle charges them for a license. :)

    2. Re:Niagara Myths by rwyoder · · Score: 3, Funny
      On a good day, with a following wind, Niagara might be able to do 8 integer instructions per second...
      Uh, I believe they said it was 1.4GHz, not 1Hz.
    3. Re:Niagara Myths by turgid · · Score: 4, Funny
      Uh, I believe they said it was 1.4GHz, not 1Hz.

      Yes, and I corrected myself straight away in another post. In true slashdot style, the post where I corrected myself got modded down to Offtopic.

    4. Re:Niagara Myths by Anonymous Coward · · Score: 1, Insightful

      Sun has stated that the Niagara CMT chips are aimed at web servers and such that do not need a lot of FP. Follow-on chips, late next year I believe, will have the FP stuff.

    5. Re:Niagara Myths by turgid · · Score: 1, Flamebait
      Yes, Sun states a lot of things.

      Judging Sun on its processor track record of the last decade, the follow-on chips will be a year to two years late, under clockspeed and have over-all performance barely comparable with that of the more conventional competition.

      Not that I'm cynical or anything.

      I'm sure intel will be able to cobble together something with 4 pentium-m cores in it to compete, and AMD will have 4-core Opterons. And, as I said in my previous post, they'll be better suited to running general workloads, and will cope equally well with the multi-threaded ones.

      Sun just doesn't seem to get it. Not everyone wants to run Solaris and Java.

      Ask Sun if they will provide the IP necessary to get Linux ported to Niagara and Niagara 2.

    6. Re:Niagara Myths by Stankatz · · Score: 1

      On a good day, with a following wind, Niagara might be able to do 8 integer instructions per second....

      Holy crap! I can do math that fast on a calculator.

    7. Re:Niagara Myths by Anonymous Coward · · Score: 0

      Problem solved.

    8. Re:Niagara Myths by ortcutt · · Score: 1

      It's a server processor. Sun sells servers. Servers tend to run software that either spawns a lot of processes (like apache 1.3, postgresql) or runs a lot of threads (MySQL, etc...). These programs do no or almost no FP. Niagara was designed to be a server processor. It would probably suck for other things, but that's beside the point.

    9. Re:Niagara Myths by rsynnott · · Score: 1

      It is being aimed at a low to mid-end server environment (decidedly low-end, for Sun). How much floating point do YOUR servers do? ;) It's a 64bit processor, by the way. (Sparc v9 compatible)

      --
      Me (Blog)
    10. Re:Niagara Myths by turgid · · Score: 1
      I know exactly what Sun sells.

      My point is, that even a small piece of FP code will completely ruin the performance of a Niagara processor.

      I don't doubt that for server processors, highly multi-threaded CPUs with low-latency context switches (e.g. 4 thread contexts with zero cycle switch latency per core) are the future.

      My point is that Niagara is a bit too simplistic and, despite its radical new architecture, it will merely have comparable performance to the more traditional competition.

      There will be no compelling reason to buy a Niagara 1 system over, say, a 4-core Opteron.

      Niagara 2 allegedly will be much better according to Sun's marketing hype, but can they deliver it in time? Can they afford to prop up their dreadful CPU division financially until then?

    11. Re:Niagara Myths by turgid · · Score: 1
      It is being aimed at a low to mid-end server environment (decidedly low-end, for Sun).

      So are Pentium and Opteron. Can Sun really make a Niagara system cheap and fast enough to compete with them?

      How much floating point do YOUR servers do? ;)

      Not much, but a few instructions deep in a function somewhere for calculating statistics could cause a major disruption on a Niagara processor as the thread containing the FP code gets migrated over to the one core out of 8 that contains the FP unit.

      I suspect that Niagara machines will come with software floating-point emulation in Solaris to cope with such a scenario (so the thread migration doesn't need to happen).

      Niagara 2 will have proper floating-point, so it goes.

    12. Re:Niagara Myths by rsynnott · · Score: 1

      It should apparently be fast enough that you're into at least quad Xeon/Opteron territory. It will also have fancy onboard Ethernet and encryption, which should speed it up. Memory controllers are on-chip, so the motherboard should be cheap enough; the big cost will be the chip. Remember that they won't be trying to make a profit on the hardware; they'll do that by selling support contracts. Erm, as far as I know they have floating point; just not very much of it ;) (one shared unit, not one core having it and the rest not). And really, most server code shouldn't have floating point at all.

      --
      Me (Blog)
  4. Steam Engine - Diesel by kpp_kpp · · Score: 5, Insightful

    Some people have predicted this move for quite some time. I remember hearing about it back in the late '80s and early '90s, and I'm sure it goes way back before then. The analogy was to Steam Engines and why they lost out over Diesels. You can only make a Steam engine so big but you cannot connect them together to get more power. With diesels you can hook many of them together for more power. Chips are finally getting to the same point -- It is more cost efficient to chain them together than to create a monstrous one. I'm surprised it has taken this long to get to this point.

    1. Re:Steam Engine - Diesel by turgid · · Score: 4, Insightful

      The problem has been the cost of software development. It's almost always cheaper to throw more hardware at a problem than invest in cleverer code. Highly parallel designs require very clever code. The Pentium 4 debacle has finally shown that we're now at the stage where we're going to have to bite the bullet and develop that cleverer code. With ubiquitous high-level languages running on virtual machines (e.g. Java) this is becoming more feasible, since a lot of the gory details and dangers can be hidden from the average programmer.

    2. Re:Steam Engine - Diesel by spotvt01 · · Score: 2, Insightful

      It's all about the scalability in processor architecture. And unfortunately, your analogy about diesel engines only goes so far. You can only chain so many pistons together before you have to worry about how efficiently you can transfer the energy to the drive train. There is an upper bound of effectiveness. Concentrating on the number of pistons and ignoring each piston's capabilities will leave you with a lot of horsepower but little torque. The same problem exists in multiple core designs, namely: only so many things can be done in parallel. This is because most programs are sequential in nature and benefit very little from executing their code in parallel. And eventually, you'll get down to something sequential like the bus or access to memory or paging to the hard disk (which is where the real bottleneck is anyway). About the only thing this will help with is if you're doing some sort of mathematical computing (using MPI or something like that, as was previously mentioned) or you're playing Doom3 while you're rendering the special effects for Star Wars III. In which case you need to get out more ;)
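
      A rough way to put numbers on that upper bound is Amdahl's law: if a fraction p of the run time can be spread over n cores, the best-case speedup is 1 / ((1 - p) + p / n). The sketch below is only an illustration; the function name and the 75% figure are made up for the example and are not measurements of any real program.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Best-case speedup when only part of a program can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time regardless of core count;
    # only the parallel part is divided among the cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical program that is 75% parallelisable.
    for cores in (1, 2, 8, 32):
        print(cores, "cores:", round(amdahl_speedup(0.75, cores), 2))
```

      With p = 0.75 the speedup stays below 4x however many hardware threads you add, because the remaining 25% still runs on one of them.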

    3. Re:Steam Engine - Diesel by MajorDick · · Score: 0, Flamebait

      "You can only make a Steam engine so big but you cannot connect them together to get more power"

      That has to be bar none one of the DUMBEST things I have ever heard on slashdot.

      Mind you I understand it was not you that said it, no problem there, but whoever said that originally had best stay away from ANYTHING mechanical.

    4. Re:Steam Engine - Diesel by flaming-opus · · Score: 1

      This has been going on for years. IBM gave up on bigger single CPUs about 1980, and so did Cray, Cyber, and Unisys. Everyone has been doing multiprocessors for decades now. The only new thing is that they are sticking lots of them on a single piece of silicon, instead of one per chip (or multiple chips per CPU, as the case may be).

    5. Re:Steam Engine - Diesel by arkanes · · Score: 3, Informative
      You cannot hide the gory details and also thread for (pure) performance, at least not to any significant degree, and not with our current ability to analyze programs. Some current compilers/languages can squeeze out some parallelism via analysis, but to prevent bugs they must be conservative, so you rarely get significant performance boosts. The key to parallelizing performance is minimizing information sharing, and that's a design/architectural issue that can't really be addressed automatically. It's not simply a matter of higher level languages or cleverer code - the inherent complexities and dangers of multi-threaded programming are quite large, to the point where it's almost impossible to prove the correctness of any significantly multithreaded application while still gaining a performance boost.

      Note that I am talking about pure performance gain here, not perceived performance, such as keeping a GUI responsive during long actions - that kind of MT is generally slower than the single threaded alternative, and is fairly easy to keep correct.

      Gaining performance via multithreading requires you to separate out multiple calculations, with minimal dependencies between them. The number of applications that can benefit from this is much smaller than you might think. I doubt very much that we'll see very many applications get a boost from dual/many core processors, and it's not just a matter of "re-writing legacy apps". What we will see is overall system speed increases on multi-threaded OSes.

    6. Re:Steam Engine - Diesel by turgid · · Score: 1

      You're right. I'm full of shit.

    7. Re:Steam Engine - Diesel by borud · · Score: 1
      Then I must be awfully clever, because most of the software I've written the past decade can easily be partitioned to run on multiple CPUs concurrently and benefit from it. In fact, most of it already does.

      The problem isn't really that this is hard to do, but the fact that mainstream availability and awareness of hardware that does it is a pretty recent phenomenon so it isn't something most software firms prioritize.

      (Note that I am not saying the technology hasn't been available. I've used multi-CPU computers for a long time, and I've run Linux on SMP machines for a decade)

      As has been pointed out again and again: almost all modern desktop machines run tens of processes or even hundreds -- often with two or more processes in runnable state during actual use of the PC. You can benefit *immediately* from more CPUs even without having multithreaded software.

      You are right about Java though. In fact, the exciting stuff, in my humble opinion, of the 1.5/5.0 release isn't so much the generics, autoboxing and enums, as it is the JSR 166 implementation -- ie. the java.util.concurrent classes, which make it even easier to write more easily maintainable, concurrent code.
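
      To give a flavour of the "partition it and run the pieces concurrently" style, here is a minimal sketch in Python, whose concurrent.futures module plays a role loosely similar to the executor classes in java.util.concurrent. The helper names (chunked, parallel_sum_of_squares) are hypothetical, and the workload is a toy. Note that in CPython, threads do not run CPU-bound bytecode in parallel, so this shows the structure rather than a guaranteed speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, n_chunks):
    """Split items into at most n_chunks contiguous, nearly equal slices."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_of_squares(chunk):
    """The per-partition work: no shared state, so no locking is needed."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(numbers, workers=4):
    """Partition the input, process each slice in a pool, then combine."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(sum_of_squares, chunked(numbers, workers))
        return sum(partials)
```

      The only point where the workers' results meet is the final sum, which is what keeps the code easy to reason about.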

    8. Re:Steam Engine - Diesel by Mignon · · Score: 1
      You can only make a Steam engine so big but you cannot connect them together to get more power

      And here's a picture of SMP (Steam Multi Processing) in action.

      I wouldn't have been so hard on the guy - everybody seems to love analogies here even well after they have fallen apart. This one happened to fall apart sooner than most.

      Reversing (and breaking) the analogy, here's what might have been considered a "dual core" steam engine. Ok, forget the analogies, it's an impressive machine.

    9. Re:Steam Engine - Diesel by turgid · · Score: 1
      Then I must be awfully clever, because most of the software I've written the past decade can easily be partitioned to run on multiple CPUs concurrently and benefit from it. In fact, most of it already does.

      Compared to most people who call themselves programmers, you probably are.

    10. Re:Steam Engine - Diesel by gbjbaanb · · Score: 1

      Good point that hardware is cheaper than developing clever code.. but that will not change even now we have multi-core CPUs. Instead, I think we'll see the minimum amount of work done to support the new 'thread-optimised' CPUs. (ie 2 cores.. some apps will have, gosh, 2 threads :-).

      Developing multi-threaded code is hard to do properly; god knows how many bugs I've seen from poor, or not enough, synchronisation. The trouble is that this is not easy, and even languages like Java will make it easier only by making it less efficient - in the old days at least (I haven't done any Java for a while), you slapped the synchronized keyword on a class's methods and all access to it was .. synchronised, i.e. back to single-threaded. So, poor programmers will not make code that works well on multi-core CPUs.

      Add to this mix that every time you lock and switch a thread, you suffer a (small) performance hit. This can mean that the system spends more time thread-switching and sloshing data between the cores than it does performing useful work (I've seen this happen!)

      So whilst I applaud the new multi-core CPUs, I really don't think that we'll get any performance boost at all if they swap raw speed for more cores. (except in specialised applications, specially written).
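
      A minimal sketch of the difference between "lock around everything" and "lock only to merge", written in Python rather than Java. The function names are hypothetical and the work is a toy counter. In CPython the interpreter lock already keeps threads from running bytecode in parallel, so read this as an illustration of the locking pattern, not as a benchmark.

```python
import threading


def _run_threads(worker, n_threads):
    """Start n_threads copies of worker and wait for all of them."""
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def count_with_shared_lock(n_threads, n_items):
    """Every increment takes the same lock, so the threads take turns."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(n_items):
            with lock:  # one lock acquisition per item
                total += 1

    _run_threads(worker, n_threads)
    return total


def count_with_local_totals(n_threads, n_items):
    """Each thread counts privately and takes the lock once, to merge."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        local = 0
        for _ in range(n_items):
            local += 1  # no sharing, no lock
        with lock:  # one lock acquisition per thread
            total += local

    _run_threads(worker, n_threads)
    return total
```

      Both versions give the same answer; the second simply shares far less, which is the part no keyword can do for you.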

    11. Re:Steam Engine - Diesel by furry_wookie · · Score: 1, Informative

      >You can only make a Steam engine so big but you cannot connect them together to get more power.

      Who told you that? That's just plain wrong, and was obviously made up by someone who knows little railroad history.

      It was not uncommon at all to connect 2 or more steam engines together. In fact Union Pacific used to do it for nearly all of their freight trains, especially the express produce delivery trains.

      There are also rail lines in Africa, China and Russia where coal is plentiful and oil is not, and which **still use steam engines** and often use multiple steam engines (heads).

      Search google for "double headed steam" or "triple headed steam" for examples.

      --
      -- Given enough time and money, Microsoft will eventualy invent UNIX.
    12. Re:Steam Engine - Diesel by Neil+Watson · · Score: 1
      turgid:
      It's almost always cheaper to throw more hardware at a problem than invest in cleverer code.

      Cheap for the vendor but not for the customer who is constantly caught in the costly and wasteful upgrade cycle.

    13. Re:Steam Engine - Diesel by kpp_kpp · · Score: 1

      My memory of the exact quote is quite poor. I believe it was from a tech magazine in the early '90s and I have been unable to find it on the internet (although I haven't looked that hard).

      It may have been more along the lines of the locomotives themselves... you cannot hook two steam powered locomotives together and get additional pulling power. I don't know... sorry for the vagueness.

    14. Re:Steam Engine - Diesel by aminorex · · Score: 1
      No, cheaper for everyone. Upgrading software is even more expensive than upgrading hardware.

      Anyone with a load average over 1.0 is going to see the benefits of more parallelism immediately. If you've ever used an SMP desktop, you know it's good.

      --
      -I like my women like I like my tea: green-
    15. Re:Steam Engine - Diesel by iabervon · · Score: 1

      I think the future is really parallelism from analysis, but languages still have to change such that programmers can actually specify what can be done in parallel without making the decision about whether it should be split or not. We need to work out the questions that compilers need programmer input on, and how to phrase these questions such that programmers can answer them reliably from their knowledge of the program.

      I don't think that programmers will ever be able to identify 64 things that could be done at the same time. On the other hand, it is probably possible to specify interference throughout a program, and have the compiler deal with the threads per se. Programmers can't track lots of threads of execution, but they have a good idea of dependencies; compilers don't know the dependencies, but can track arbitrary quantities of threads. The solution is to get the information the programmer has to the compiler to assist it in its analysis.
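
      As a rough sketch of that division of labour: the programmer states only which task depends on which, and a scheduler decides what runs when. This uses Python's standard graphlib and concurrent.futures (Python 3.9+). The function name run_in_dependency_order and the task names in the usage note are hypothetical, and it assumes every name listed as a dependency is also a key in tasks.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_in_dependency_order(tasks, depends_on, max_workers=4):
    """Run callables, starting each as soon as its prerequisites finish.

    tasks:      name -> zero-argument callable
    depends_on: name -> set of names that must finish first
    Returns the task names in the order they completed.
    """
    sorter = TopologicalSorter(
        {name: depends_on.get(name, ()) for name in tasks}
    )
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    finished = []
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Everything whose prerequisites are done can start now.
            for name in sorter.get_ready():
                running[pool.submit(tasks[name])] = name
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any exception from the task
                name = running.pop(future)
                finished.append(name)
                sorter.done(name)  # may unblock dependent tasks
    return finished
```

      For example, with depends_on = {"physics": {"parse"}, "ai": {"parse"}, "render": {"physics", "ai"}}, "parse" runs first, "physics" and "ai" may overlap, and "render" runs last. The caller never mentions threads at all.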

    16. Re:Steam Engine - Diesel by TopSpin · · Score: 2, Insightful

      I doubt very much that we'll see very many applications get a boost from dual/many core processors, and it's not just a matter of "re-writing legacy apps".

      I think this is a foolish thing to doubt. As supercomputing evolved into parallelism the same thing was said; it's too hard, some things can't be done in parallel. Yet solutions have been found for most cases and there is no lack of desire for more parallel capacity today.

      Put enough cores in front of a twenty-something Carmack wannabe and he'll figure out how to parallelize so many spinning triangles we'll all be breathlessly waiting to pay for even more cores. Put eight cores in the hands of a video encoding programmer and he'll refactor, tune and rethink the whole process until those cores stay 99% full and he invents an entirely new paradigm for the practice.

      Perhaps there is something deeper here; isn't the universe fundamentally parallel? So it isn't possible to parallelize the calculation of the next digit of pi; the universe has a way of ultimately requiring you to perform the damn calculation with thousands of pi simultaneously. Determinism gets lost somewhere and parallel computation becomes viable.

      --
      Lurking at the bottom of the gravity well, getting old
    17. Re:Steam Engine - Diesel by davecb · · Score: 1
      I'm inclined to disagree, as it's merely annoying to turn a single-process single-threaded program into a multi-instance program.

      I agree that it rapidly gets hard. Typically it's where you don't keep it brutally simple that causes problems: locks, race conditions, deadly embrace and all their friends come out to play (;-)).

      Better languages make it at least possible to write "clever" code, as the monitors in Java ("protected" classes) provide a discipline that can be used for simple locking schemes. Far better than C or (horrors!) PL/1 processes. But if you get tricky in Java, it's like getting tricky in any language. Caveat Emptor!

      --dave

      --
      davecb@spamcop.net
    18. Re:Steam Engine - Diesel by Anonymous Coward · · Score: 0

      When I open a couple dozen tabs in Firefox, it freezes up for a little while while all of the pages get rendered. This doesn't change when I do it on a multiprocessor system. I still have to go get a cup of coffee while I wait for it. It would change if the Firefox devs would invest in cleverer code to allow me to navigate in one tab while the others are getting rendered concurrently.

    19. Re:Steam Engine - Diesel by arkanes · · Score: 1

      The difficulty is not in the Carmacks and other geniuses managing to parallelize things, it's for the typical programmer to parallelize, especially in large scale applications. Concurrency bugs are amazingly difficult to detect and debug, and highly concurrent programs require a large amount of skill and analysis if you want them to be both fast and correct. The vast majority of multithreaded applications today are *slower* (in absolute terms - again, not responsiveness, but performance) because of multithreading, not faster, and they still have deadlocks and race conditions and memory corruption. There's almost certainly room for improvement in our tools for automatic parallelization, and room for improvement in the tools programmers use to write parallel programs, but I don't believe that typical applications will see significant speedups from multicore machines.

  5. EPIC? by Anonymous Coward · · Score: 1, Insightful

    So does this mean that Intel's gamble with the Itanium was a good one? Or does this mean that we are going to try to teach students a totally new development style for more threads and parallelism?

    1. Re:EPIC? by HidingMyName · · Score: 2, Interesting

      That is hard to say. EPIC is a very long instruction word architecture (VLIW) which supports up to 3 concurrent non-interfering instructions which requires static (compile time) scheduling, since the instructions must be in contiguous memory. Getting efficient scheduling is hard, since the complexity is pushed back on the compiler, which may need to do some serious code reordering. Additionally, EPIC was designed to support speculative execution, which has efficiency issues if the wrong prediction is made. Additionally, EPIC had a new instruction set/core so Intel may not have gotten as much reuse of existing designs that multithreaded (using register bank switching) or multi core designs might have been able to exploit. Modern fabrication and design is so complex, that widely used designs get development resources and new interesting directions often don't get fabricated.

    2. Re:EPIC? by m50d · · Score: 1

      I think tending towards an Itanium-like design is inevitable, but Itanium itself was too far ahead of its time. You can't make the leap straight to the compiler having to do everything while having the 3-instruction parallelism when people have been used to x86's peephole optimisation for so long. I think Intel would do well to mothball it, keep a small team working on the compiler and keeping the design more or less up to date, then reintroduce it when people are more used to these sorts of things. There are lots of very good things about Itanium, and it's better than kludging these sorts of things into x86 - but people will only appreciate it after they've had some experience with them.

      --
      I am trolling
    3. Re:EPIC? by TheRaven64 · · Score: 1
      EPIC is a very long instruction word architecture (VLIW) which supports up to 3 concurrent non-interfering instructions which requires static (compile time) scheduling, since the instructions must be in contiguous memory

      While the instructions are fetched in blocks of 3, there may in fact be an arbitrary number in a bundle, allowing future EPIC chips to execute more of them in parallel.

      --
      I am TheRaven on Soylent News
  6. WTF? by Timesprout · · Score: 4, Funny

    and we don't know the answers, but I pulled together some of the questions."

    What is this now, Questions for Nerds. Stuff we don't know?

    --
    Do not try to read the dupe, thats impossible. Instead, only try to realize the truth
    What truth?
    There is no dupe
  7. well at least he seems to understand the problems by Anonymous Coward · · Score: 5, Interesting

    from TFA:
    "Problem: Legacy Apps You'd be surprised how many cycles the world's Sun boxes spend running decades-old FORTRAN, COBOL, C, and C++ code in monster legacy apps that work just fine and aren't getting thrown away any time soon. There aren't enough people and time in the world to re-write these suckers, plus it took person-centuries in the first place to make them correct.

    Obviously it's not just Sun, I bet every kind of computer you can think of carries its share of this kind of good old code. I guarantee that whoever wrote that code wasn't thinking about threads or concurrency or lock-free algorithms or any of that stuff. So if we're going to get some real CMT juice out of these things, it's going to have to be done automatically down in the infrastructure. I'd think the legacy-language compiler teams have lots of opportunities for innovation in an area where you might not have expected it."

  8. How is this different from having multiple cores? by MichaelSmith · · Score: 1, Offtopic

    ...and isn't this the challenge being addressed by DragonFly BSD?

    Software people use threads already, as long as the VM and OS are up to the task. I don't see why it should matter if some of the threads are implemented in hardware.

  9. Re:Windows Articles, Slashdot and Pragmatism by Rwilson500 · · Score: 0, Offtopic

    I agree with you on using the best tool for the job, but what does this have to do with the actual article?

  10. Vader vs. Brooks? by hraefn · · Score: 3, Funny

    I almost thought this was going to be about Star Wars nerds being forced to watch something on Country Music Television.

    1. Re:Vader vs. Brooks? by kyouteki · · Score: 1

      That's exactly what I thought when I saw the title. Imagine my dismay when I RTFA.

      --
      A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
    2. Re:Vader vs. Brooks? by Aspasia13 · · Score: 2, Insightful

      I almost thought this was going to be about Star Wars nerds being forced to watch something on Country Music Television.

      Look out! It's Garth Vader!

    3. Re:Vader vs. Brooks? by Durinthal · · Score: 1

      Vader vs. Brooks?

      Well, both of them did have another name they were called at one point..

    4. Re:Vader vs. Brooks? by Lovesquid · · Score: 1

      I find ya'lls lack of Faith Hill disturbin'.

  11. Argh! by turgid · · Score: 2, Informative
    Today I have diarrhea in the guts as well as the mind. I should have previewed that before I posted it.

    On a good day, with a following wind, Niagara might be able to do 8 integer instructions per second

    I meant per clock cycle, of course, not per second.

    The thing is, all of the other CPU vendors with have

    I meant "will have" not "with have".

    /me LARTS himself with a big stick.

    1. Re:Argh! by deetsay · · Score: 1
      On a good day, with a following wind, Niagara might be able to do 8 integer instructions per second
      Combine that beast of a processor with Longhorn, and pushing out one KILOBYTE will be just matter of seconds...
      --
      "The looser the waistband, the deeper the quicksand", or so I have read.
  12. One Weenie's Perspective by Anonymous Coward · · Score: 5, Funny

    Well I am a Star Wars weenie, and I am definitely NOT ready for Country Music Television.

    1. Re:One Weenie's Perspective by Anonymous Coward · · Score: 0

      Don't be hating country music.

    2. Re:One Weenie's Perspective by slapout · · Score: 1

      Don't worry. They don't actually play Country Music. They're owned by the same people as MTV. (i.e. they don't play music at all)

      --
      Coder's Stone: The programming language quick ref for iPad
    3. Re:One Weenie's Perspective by -kertrats- · · Score: 1

      Actually, CMT does play music a helluva lot more than MTV, VH1, or any of those channels. I'd guess it's at least 18 hours a day. Have you ever actually watched the channel?

      (now, the quality of the music played is another matter entirely, but that's beside the point).

      --
      The Braying and Neighing of Barnyard Animals Follows.
    4. Re:One Weenie's Perspective by slapout · · Score: 1

      Yes I have. Let me correct myself, they don't play music during prime time. This is their schedule for tomorrow night (all times central):

      6:00 pm Dukes of Hazzard
      7:00 Gretchen Wilson Documentary
      7:30 Class of 1975 (Documentary)
      8:00 American Revolutions (Documentary)
      9:30 Gretchen Wilson Documentary
      10:00 Dukes of Hazzard
      11:00 Small Town Secrets ("Hunting for Bigfoot")
      12:00 CMT Music

      --
      Coder's Stone: The programming language quick ref for iPad
    5. Re:One Weenie's Perspective by -kertrats- · · Score: 1

      There we have it. 18 hours.

      (I get the point).

      --
      The Braying and Neighing of Barnyard Animals Follows.
  13. Not really an issue by MemoryDragon · · Score: 2, Insightful

    given the fact, that I havent programmed a single threaded program in years.

    1. Re:Not really an issue by DigiShaman · · Score: 1

      First off, I'm very ignorant when it comes to programming. That said, couldn't one just let the compiler optimize source code into multi-threaded binary regardless of how the original program was coded? If so, would it work better with higher level programming languages such as Java?

      --
      Life is not for the lazy.
    2. Re:Not really an issue by fizban · · Score: 1

      No, multi-threading vs single-threading has a lot to do with the *type* of problem being solved. A lot of things that programmers write are very serial in nature, in that you have to do A before B and B before C. Only if I can do A, B, and C at the exact same time, with no interdependencies, is multi-threading truly a good solution. A compiler may be able to find some pieces of code that it can put in a separate thread, but usually not. It's up to the programmer to decide the dependencies and structure the program accordingly.

      Now, that being said, some computer languages allow the programmer to more easily structure the program to be threading capable, by easily separating objects and tasks from one another, but it's still up to the programmer to make the decisions.

      True software development is not just about writing code. It's about organizing structured processes for solving problems and understanding the relationships between them.

      --

      +1 Insightful, -1 Troll. What can I say, I'm an Insightful Troll.

    3. Re:Not really an issue by MemoryDragon · · Score: 1

      Many problems are very serial in nature. Sure, there might be possibilities where the compiler can multithread automatically, but this won't bring any significant performance boost, because probably too few of those problems can be multithreaded automatically. But having multithreading at the processor level can speed things up here and now. See, most problems in modern programs have to be multithreaded at least away from the user interface, which means that to keep a responsive UI you have to push things into the background. It is even more so once you are on the middleware and server level, where multithreading is extensively used for serving more than one user. Basically every program in existence nowadays that is more complex than a hello world does multithreading one way or the other. You simply do not see it, because many operating systems hide the threads from the user in the process list (Windows, for instance). But expect the average UI program to have at least two threads, one for messaging and one for the program, and probably around 5-10 depending on the tasks it has to perform. Expect the average server to open and pool threads upon increased load. So having real threading instead of a simulated, time-sliced one would speed things up significantly, no matter where you run into a program.

    4. Re:Not really an issue by veg_all · · Score: 1

      given the fact, that I havent programmed a single threaded program in years.

      So you can write complex, multi-threaded software but you don't know how to use a comma?

      --
      grammar-lesson free since 1999. (rescinded - 2005)
  14. Question: What needs multiple threads? by dostert · · Score: 2, Interesting

    As a scientific programmer, all I know is that this will eventually be a huge benefit to all my MPI and OpenMP codes.

    I really only know the "scientific" programming languages, but most all math specific routines are already written for parallel machines. I'm a bit curious, what else really needs multiple threads? Isn't the benefit of dual-core procs the ability to not have a slow-down when you run two or three apps at a time? Don't games like DOOM III and Half-Life II depend mostly on the GPU (which I'm guessing they can handle multiple core GPU's since the programming should be fairly similar to SLI?)? What is the benefit in games? Just faster level loading times?

    I don't want to sound like I'm whining or anything here... I'm not saying that multiple cores suck. On the contrary they're fantastic for what I do, but I just was hoping you guys could help me understand how common apps and non-mathematical operations can use them.

    1. Re:Question: What needs multiple threads? by Frit+Mock · · Score: 5, Insightful


      In games the AI of non-player-characters (-objects) can profit a lot from threading.

      But for common apps ... I don't expect a big gain from multiple threads. I guess typical apps like browsers, word processors and so on have a hard time utilizing more than 3-4 threads for the most common operations a user does.

    2. Re:Question: What needs multiple threads? by TheKidWho · · Score: 2, Insightful

      umm, better physics and AI for games is what I can think of off the top of my head =)

    3. Re:Question: What needs multiple threads? by James+McP · · Score: 2, Interesting

      The simplest example is OS runs on one, the game another. But it's really not that simple. Let's take a typical Windows box since it's the bulk of the market.

      Thread 1: OS kernel
      Thread 2: firewall
      Thread 3: GUI
      Thread 4: print server
      Thread 5-7: various services (update, power, etc)
      Thread 8: antivirus
      Thread 9: antivirus manager/keep-alive
      Thread 10-16: spyware (I said a typical Windows box)
      Thread 17+: applications

      Yeah, CMT will be handy out of box as long as the OS is aware. I expect it will be wasteful the first couple of iterations but I can't count the number of times I've had to disable antivirus and yank the ethernet while running computationally intense applications.

      --
      I've been on slashdot so long I'm starting to get out of touch with the cool stuff if it ain't on slashdot.
    4. Re:Question: What needs multiple threads? by rayzat · · Score: 1

      This really isn't a "should we do single-core or multi-core design" issue at this point. Because of new issues arising from shrinking dimensions and diminishing returns on new hardware technology, the only real option for future designs is multi-core. If you think about it, any single core system is going to have a max speed; once that max speed is reached, the only way to make it faster is to add more cores. The only way single core systems will be able to remain supreme is with new architectures, and the only way to really utilize them is to re-write your code. So if you have to re-write your code anyway, you might as well re-write it for the system with the most possibility. As for how common apps can use them, some apps probably won't be able to exploit them and will run slower than they would if they had 100% resource allocation on a single core system. However, multiple applications' code could be run concurrently on the different cores, so even though the applications will be running slower you will be getting more done.

    5. Re:Question: What needs multiple threads? by Anonymous Coward · · Score: 0

      but most all math specific routines

      It looks like you're not also a scientific paper writer. It should be "but almost all math-specific routines". Having said that, the "most all" error is far too common. Think about it. Most. All. Most all. It almost sounds contradictory. Really, though, it's just bad style. See Strunk's commonly misused words and phrases.

    6. Re:Question: What needs multiple threads? by timford · · Score: 2, Informative
      You high-and-mighty scientific code snobs looking down on us game programmers! =)

      Actually there is a whole lot more to games like DoomIII and HL2 than what can be run on the GPU. First of all, a lot of the graphics-related code is never run on the GPU; it's run on the CPU (for example shadow-processing code), which then passes the info on to the GPU to do the actual rendering.

      Secondly, multiple-core GPUs don't make that much sense to me. The nature of graphics processing is completely SIMD (like much of your scientific code, probably). Graphics needs parallelism, but it doesn't need different code being run in parallel. It needs so much parallelism because there are millions of vertices and pixel fragments, each of which needs to be handled in very much the same way (that is, with the same shader code). The main reason SLI exists is that there is a limit to how much parallelism we can put on one chip because of transistor limits and all that mumbo jumbo. When there comes a point that we could put multiple cores on one GPU... then we might as well have one core with twice the number of pipelines.

      Finally, games like D3 and HL2 do a lot more than just graphics and level-loading. Physics are getting more and more realistic and therefore computationally intensive (HL2 has particularly good physics). Also I think we're on the brink of game AI becoming much more advanced than the simple state machines present in current games. Then there are more eccentric tasks like UnrealEngine3's "Seamless World Support" which constantly shuffles in and out resources so you can create huge worlds without loading times.

    7. Re:Question: What needs multiple threads? by CrayzyJ · · Score: 1

      While it sounds fine on the surface, your idea really does not work. The cache will thrash like mad, the IO bus will be clogged, and paging may be a bottleneck. What if all 17 threads make a system call at the exact same time? The locking will bring the system to a screeching halt.

      What you propose (not a horrible idea, btw) requires much more than just some threads in the CPU.

      --
      Holy s-, it's Jesus!
    8. Re:Question: What needs multiple threads? by paulpach · · Score: 1

      Most server software can use multiple threads or multiple processes.

      For example apache:

      When two people make a request to apache, you could serve one at a time. In that case, the second person will wait a relatively long time, especially if the first request happens to be a slow one.

      To solve that problem, apache spawns multiple threads or processes (depending on configuration) and serves both requests at the same time.

      Normally the OS alternates CPU between the two tasks. At any given time only one request is being processed by the CPU, but over time both requests appear to be executing at the same time. There is significant overhead jumping between the requests, and if there are cache misses, the CPU just stalls for a little bit.

      Better scenario: Hardware can do multithreading. In this case, there is only one CPU, but the hardware alternates processing between the two requests (as opposed to the OS). This way, the OS does not incur the task-switching overhead, and if there is a cache miss, the hardware automatically switches to the other task (which hopefully won't have a cache miss) without wasting time. This is what Sun is doing here, and what Intel does with hyperthreading.

      Best case scenario: you have multiple cores, and both requests can be processed by different cores truly at the same time. This is what AMD and Intel are doing with dual and quad cores.

      Note multiple core and hardware threading don't have to be mutually exclusive. You can have multiple cores and each core support multiple threads. In fact, this is what you get when you have a dual P4 computer.

      So to answer your question: Almost any server software such as apache, samba, postgresql, mysql, bind, and many others will greatly benefit from hardware threads or multiple cores, so long as the server executes requests in multithreaded or multiprocess fashion.
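
      A minimal sketch of the serve-one-at-a-time versus serve-both-at-once point, in Python rather than anything Apache actually runs. The names handle_request, serve_serially and serve_concurrently are made up for illustration, and the sleep is just a stand-in for a slow request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(request_id, work_seconds):
    """Stand-in for serving one request; the sleep models a slow disk or network wait."""
    time.sleep(work_seconds)
    return f"response {request_id}"


def serve_serially(requests):
    """One request at a time: request 2 waits until request 1 has finished."""
    return [handle_request(rid, secs) for rid, secs in requests]


def serve_concurrently(requests, workers=4):
    """One worker thread per in-flight request, up to `workers` at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(handle_request, rid, secs) for rid, secs in requests]
        # Results are collected in submission order, not completion order.
        return [f.result() for f in futures]


if __name__ == "__main__":
    requests = [(1, 0.2), (2, 0.01)]  # a slow request followed by a fast one
    for serve in (serve_serially, serve_concurrently):
        start = time.perf_counter()
        serve(requests)
        print(serve.__name__, round(time.perf_counter() - start, 2), "seconds")
```

      With the thread pool, the fast request no longer queues behind the slow one. Whether the two threads are interleaved by the OS, by hardware threads, or run on separate cores is exactly the distinction the scenarios above describe; the program looks the same in all three.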

    9. Re:Question: What needs multiple threads? by bradkittenbrink · · Score: 1

      The argument is that GPUs are good for turning polygons into pretty pixels, but not much else. Physics and AI are nice and all, but they probably use only a couple of threads each before you can't parallelize them any more. The truly scalable benefits of multi-core design will come from "procedural generation": if your game is running on a 4-core CPU, you can send, say, 400,000 polygons to the GPU; if it is running on a 32-core CPU, you can send 3,200,000 polygons; if it is running on a 1024-core CPU, well, you get the idea. It's not quite that simple, but that's the general idea. This makes sense because GPUs are inherently parallel and will keep getting more so; to keep the input to the GPU from becoming a bottleneck, we need a way for the CPU to scale in the same way.

    10. Re:Question: What needs multiple threads? by Anonymous Coward · · Score: 1, Interesting

      I have never worked on an embedded product that was not implemented as a collection of threads. Setting priorities properly and dealing with issues of priority inversion and deadlock have been part and parcel for embedded systems for decades. A multi-thread core that allowed you to lock critical threads to a slice of the core would be a hoot if the silicon was affordable.

    11. Re:Question: What needs multiple threads? by James+McP · · Score: 1

      I think you misunderstood. I merely provided an example where the 32+ thread CPUs discussed in the articles prove useful to the common person. You have just provided an argument for why current CPU designs shouldn't have large numbers of threads.

      IANA chip designer but I can see solutions to several of the problems you pointed out. Fast, large caches. Scalable I/O busses with QoS (e.g. HyperTransport). Large address spaces. Multiple memory controllers.

      Of course, it's easy for me to say "just do it", but I'm assuming that if Sun is bothering to build a CPU intended to run large numbers of CMT threads, they've addressed these issues and have some form of solution in place.

      --
      I've been on slashdot so long I'm starting to get out of touch with the cool stuff if it ain't on slashdot.
    12. Re:Question: What needs multiple threads? by willy_me · · Score: 1
      CPUs are already plenty powerful. What we really need is a way for the external bus connecting the GPU to the CPU to scale.

      I read a great article on Ars Technica here that shows how Apple is moving the rendering from the CPU to the GPU. Included are some nice graphs that show the relative available bandwidth between the components. To get to the point, it's not that the CPUs aren't fast enough for rendering, it's that the bandwidth to fill the GPU isn't there. Hence, they're moving the rendering to the GPU.

    13. Re:Question: What needs multiple threads? by bored · · Score: 1

      This is silly, because while I may have 50 processes with 2 to 5 threads each, they are all idle, waiting around for some event. The applications are all waiting for some network or user action, the "services" are all waiting around for the network or applications to access them, the OS has a few housekeeping threads eating up .01% of my CPU, and the rest are running in serialized application contexts. In my case all the CPU time is being eaten up running virtual machines. Maybe vmware can burn up an extra CPU doing some kind of preprocessing, but I fail to see how a virtual machine will benefit from more CPU threads that are slower than one fast one...

    14. Re:Question: What needs multiple threads? by flithm · · Score: 2, Informative

      Actually that's not necessarily true. It's definitely true right now though. Most developers haven't really been taught to think in terms of parallelism when designing software, but that's starting to change.

      It's all about the algorithms. Once multi-core chips have been mainstream for a while, all the algorithms out there will start to get converted to take advantage of parallel processing. And there are already algorithms out there that do this... this page has a small repository of parallel implementations of common algorithms including QuickSort, hashing techniques (for super fast searching), string operations (which every application in existence uses), and more.

      Now I know this isn't always possible, but in many cases it is. Almost every program out there uses search and sort algorithms. Your address book does it, your web browser does it. These algorithms can be implemented to take advantage of having multiple processors.

      A lot of operations can actually be modified to take advantage of this stuff. See the pbzip2 project that achieves a near linear speed up per processor!
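
      A rough sketch of the block-parallel idea behind tools like pbzip2, written in Python with the standard bz2 module. This is not pbzip2's code; compress_blocks, the block size and the worker count are assumptions picked for illustration. It relies on two things that hold for CPython as far as I know: bz2.compress releases the interpreter lock while it works, and bz2.decompress accepts several concatenated streams.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(data, block_size=100_000, workers=4):
    """Split `data` into independent blocks, compress them in parallel, and
    concatenate the resulting bz2 streams in the original order."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the blocks stay in sequence.
        compressed = list(pool.map(bz2.compress, blocks))
    return b"".join(compressed)


if __name__ == "__main__":
    payload = b"the quick brown fox jumps over the lazy dog\n" * 20_000
    packed = compress_blocks(payload)
    assert bz2.decompress(packed) == payload
    print(len(payload), "bytes in,", len(packed), "bytes out")
```

      The blocks share no state, which is why this kind of job can scale with the number of processors; the price is a slightly worse compression ratio than a single stream over the whole input.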

      Almost every algorithm out there can be modified to take advantage of multiple cores. Things like video/audio decoding are prime candidates (a lot of research is currently happening in this area).

      It may take a generation of programmers and then another generation or two of applications to start really taking advantage of parallelism, but mark my words: once this stuff is mainstream, you'll start to really see some performance like never before.

    15. Re:Question: What needs multiple threads? by Anonymous Coward · · Score: 0

      Browsers and word processors have a hard enough time using one processor; they're definitely not the target of the ever-more-mega-hurtz race. There'll be opportunities for doing more stuff in browsers and word processors, but if it's at all compute-heavy, it's likely to be thread-friendly (example: concurrent speech recognition, multimedia plugins, etc.).

      Probably the one common app (that will appeal to a wider audience than just games, which a lot of people just buy consoles for) will be content creation/editing. Video editing/encoding/rendering/filters/special effects/insert feature here are things that are easily parallelizable and suck down as much processing power as you can throw at them for the foreseeable future, and I guarantee you you'll see more "average" users wanting to do these things with their computer or computer-like appliances in the future. HD TV/DVD will probably be a big driver for this sort of thing, although that's pure speculation on my part. For a glimpse into the future, you can look at the Mac, which already has dual CPUs commonly available and is heavily used for multimedia applications.

    16. Re:Question: What needs multiple threads? by James+McP · · Score: 1

      The fact that you are running VMs is indicative you are not a common Windows user. :)

      I'd imagine if a VM could expect to see multiple threads you'd see a decent speed boost. Each of those virtual environments has to be active on a pretty regular basis just to generate the clock timing, so there's 1 thread per VM. If you are actively *using* the virtual machines you could dedicate one thread per virtual CPU.

      I'm not a big VMWare user but last time I played with it, you could emulate machines with clockspeeds of CPU MHz/(n+1), where n is the number of virtual machines. So a 1GHz machine could have 2 VMs of around 333MHz and a 3GHz PC could have two 1GHz VMs. Now pretend that a 5GHz 1-thread CPU and a 2GHz 8-thread CPU both cost $1,000. The 5GHz would let you have 4 1GHz VMs while the 2GHz would let you have 8. That makes the 8-threader twice as cost effective.
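
      The arithmetic above, worked through in Python. The MHz/(n+1) rule is the poster's recollection, not a documented VMware formula, and treating each hardware thread of the 8-thread part as its own 2GHz CPU is an assumption made so the numbers come out as stated. Function names are made up.

```python
def vm_clock_mhz(cpu_mhz, n_vms):
    """Per-VM clock under the recalled rule: CPU MHz / (n + 1)."""
    return cpu_mhz / (n_vms + 1)


def max_vms(cpu_mhz, target_vm_mhz):
    """Largest n for which cpu_mhz / (n + 1) is still >= target_vm_mhz."""
    return max(0, int(cpu_mhz // target_vm_mhz) - 1)


if __name__ == "__main__":
    print(vm_clock_mhz(1000, 2))   # 1GHz host, 2 VMs: about 333 MHz each
    print(vm_clock_mhz(3000, 2))   # 3GHz host, 2 VMs: 1000 MHz each
    print(max_vms(5000, 1000))     # 5GHz single thread: 4 VMs at 1GHz
    # Assumption: each of the 8 hardware threads acts like a 2GHz CPU.
    print(8 * max_vms(2000, 1000)) # 2GHz 8-thread part: 8 VMs at 1GHz
```

      Under those assumptions the same $1,000 buys 4 VMs on the fast chip and 8 on the slow, wide one, which is where the "twice as cost effective" figure comes from.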

      If I'm wrong about the ratio of CPU to VM clockspeeds it only changes the break point. The fact that everyone has switched to a multicore/multithread standard indicates that we have already hit that breakpoint.

      --
      I've been on slashdot so long I'm starting to get out of touch with the cool stuff if it ain't on slashdot.
    17. Re:Question: What needs multiple threads? by Anonymous Coward · · Score: 0

      But for common apps ... I don't expect a big gain from multiple threads. I guess typical apps like browsers, word-processor and so one have a hard time utilizing more than 3-4 threads for the most common operations a user does.

      Right, granted, but... how much faster do you need your word-processor or browser to be? These are apps that are now mostly user-bound, rather than processor-bound. Having Word execute 50% faster is not going to allow you to write your documents 50% faster, is it? These are "legacy" application types now, and I think the consensus in the hardware industry is that development effort is better spent on those application classes / usage models where the performance will be more noticeable.

    18. Re:Question: What needs multiple threads? by CrayzyJ · · Score: 1

      Sure, with enough money thrown at it, any problem is solvable. SMP machines have been fighting the cost vs. cache coherency battle for years. Let's face it, a 1GB cache would be awesome, but...

      --
      Holy s-, it's Jesus!
    19. Re:Question: What needs multiple threads? by bored · · Score: 1
      I'm not a big VMWare user but last time I played with it, you could emulate machines with clockspeeds of CPU Mhz/(n+1) where n is the number of virtual machines. So a 1Ghz machine could have 2VMs of around 333Mhz and a 3Ghz PC could have two

      It's not really that simple. I tend to use VMware as a test and development bed. I've been purchasing MP machines for years, and usually my desktop machines have two processors. Anyway, with most things I use and write there is some level of MP scalability, and vmware isn't any different: when the virtual machine is idle it doesn't take up any physical CPU. With screen savers and such turned off (my virtual screens don't need saving), the virtual machine will consume less than 1% of the physical CPU.



      What really needs to be said is common knowledge among most people who work with MP machines: aside from some of the transaction-based server machines, most programs have limits to how many operations they can run concurrently. Just as it's hard to extract single-thread instruction parallelism past a couple of operations, it's hard to extract multi-thread parallelism past a couple of threads. My current application is scalable to somewhere between 5 and 10 threads; past that, extra processors don't do anything for me. Basically the laws of software haven't changed in the last few years, but the massive scalability of web-based applications has helped people forget that throwing more processors at a problem, even a scalable one, won't always cause the application to scale.



      Usually, somewhere, someone has to consolidate all those transactions going to a DB, or whatever your bottleneck is. The attempt to sell MP machines to the masses is just a cheap and lazy way to speed people's computers up. Problem sets that are processor-bound and scalable are already running on big SMPs. Personally I like SMPs, but given the choice of a CPU 2x as fast or two CPUs, I will take the single proc any day. Having two CPUs each running 2x as fast is of course the best of both worlds, and I'm usually happy to spend a little more to get that.



      I predict that AMD, Intel, etc. will make a few multiproc CPUs, get out to maybe 4 cores per die, and then the race to make faster cores will kick back in. Only now people will expect screaming fast 4-way chips instead of screaming fast 1-way chips.


    20. Re:Question: What needs multiple threads? by TheRaven64 · · Score: 1

      I suspect that CMT will be accompanied by a move back towards micro-kernel style systems. If each kernel subsystem is in its own process, and an efficient zero-copy message passing mechanism is implemented, then the system will scale very well. A similar approach is taken by the new Solaris networking stack and DragonflyBSD.

      --
      I am TheRaven on Soylent News
  15. Re:Ready for CMT? Hell no! by ksheff · · Score: 2, Funny

    CMT is manufactured pop-country music at its worst. Yuck!

    --
    the good ground has been paved over by suicidal maniacs
  16. Good article! by Anonymous Coward · · Score: 0

    I suspect we're all going to have to look to languages that really do support very high levels of parallelism from the get-go. We're going to need a high-performance language and a scripting language. From my early days as a computer scientist, I'd say anything functional will serve us really well, especially languages like CAML and Scheme.

  17. Re:How is this different from having multiple core by root-kun · · Score: 1

    To the desktop user, this really means nothing special. But when we're talking about producing a 1024-node system, or even some high-end 1U racks for SMB markets, the more parallelism on chip, the better.

  18. Big Hairy Package by Tweak232 · · Score: 5, Funny

    "The hardware guys are getting ready to toss this big hairy package over the wall:"

    Vivid imagery...

    1. Re:Big Hairy Package by Anonymous Coward · · Score: 0

      And after talking about weenies, too.

      I think the OP is implying that hardware guys are tops and software guys are bottoms.

      Geekdom and gay S&M are a weird mix. :)

  19. Screw CMT; Time to use wasted CPU by WindBourne · · Score: 1

    Look, if you have 32 threads operating at 1/32 of GHz, or you have 1 thread operating at 2GHz, then it is a basic wash (not really, but close enough).

    I would be far more interested in taking advantage of all the CPU cycles that run all over at Businesses. THink of how much wasted cycles there are running Screen Saver, or a Word document. By distributing the load amongst the systems, then a large number of things can be done.

    --
    I prefer the "u" in honour as it seems to be missing these days.
    1. Re:Screw CMT; Time to use wasted CPU by Tweak232 · · Score: 1

      THink of how much wasted cycles there are running Screen Saver, or a Word document.

      I think that cycles spent running a Word document are not wasted, as they are used for productivity, whereas a screen saver's are. It is not fair to compare the two.

      And of course this wasted capacity has been noticed before; what you are talking about is distributed computing. For example, GIMPS and SETI@home both use unused CPU cycles, so you get 100% of your CPU time going to something important. It would be nice if businesses found a way to distribute their processes on big jobs, but the fact is that most users do not need all the power they have for mundane things such as word editing and e-mail.

      THink of how much wasted cycles there are running Screen Saver, or a Word document.

      Do you mean you want people to always use emacs, or what I said above?

    2. Re:Screw CMT; Time to use wasted CPU by David+McBride · · Score: 2, Interesting

      I would be far more interested in taking advantage of all the CPU cycles that run all over at Businesses.

      Condor.

    3. Re:Screw CMT; Time to use wasted CPU by Anonymous Coward · · Score: 0

      Look, if you have 32 threads operating at 1/32 of GHz, or you have 1 thread operating at 2GHz, then it is a basic wash (not really, but close enough).

      Except that the 2GHz takes major power to run (and then to cool), and if the one thread stalls your CPU is doing nothing.

      With the slower CPU it's lower TCO, as well even if a couple of threads stall on I/O or memory you're still using the CPU for useful work.

    4. Re:Screw CMT; Time to use wasted CPU by WindBourne · · Score: 1

      Most people running Word are only running Word. During that time, the CPU is for all practical purposes at 0% utilization.

      --
      I prefer the "u" in honour as it seems to be missing these days.
    5. Re:Screw CMT; Time to use wasted CPU by WindBourne · · Score: 1
      Except that the 2GHz takes major power to run (and then to cool),

      That is dependent on the CPU. Since nothing is published yet, I am not convinced that the CPU will draw any less than an equivalent single-threaded CPU.

      and if the one thread stalls your CPU is doing nothing.

      The OS runs threads. IOW, the threads are in software rather than in the hardware. All major OSes are threaded nicely with little to no stalling.

      With the slower CPU it's lower TCO, as well even if a couple of threads stall on I/O or memory you're still using the CPU for useful work.

      The lower TCO remains to be seen. The current batches of CPUs (Intel, AMD, etc) have been optimized for dealing with I/O, ram, etc.

      What is funny about this is that I described this approach to a CS prof of mine back in 1992. At the time, it was obvious that a single-threaded CPU would spend far too much effort being efficient, while a multi-mini-CPU design would have major benefits (cross communication would be super fast). His belief was that it would never come about and that CPUs would always be single-threaded.

      --
      I prefer the "u" in honour as it seems to be missing these days.
    6. Re:Screw CMT; Time to use wasted CPU by Draknor · · Score: 1

      What I would much rather see is more widespread use of power/clock-throttling during such usage. It's been common in notebooks for a while, and I'd like to see that migrate more to the desktop (it's happening now, but I don't get the sense that it's all here just yet).

      Think of how much money businesses would save if instead of running 500 or 1000 Dell machines w/ P4's at 200+ W each, they could switch to Pentium M workstations running 100+ W each. Saves money on electricity and saves money on A/C usage & maintenance.

    7. Re:Screw CMT; Time to use wasted CPU by Anonymous Coward · · Score: 0

      Look, if you have 32 threads operating at 1/32 of GHz, or you have 1 thread operating at 2GHz, then it is a basic wash (not really, but close enough).

      That's what the entire article is about: it's a wash if your code is procedural. If you use threads and async calls then you should notice an improvement in speed.

    8. Re:Screw CMT; Time to use wasted CPU by m50d · · Score: 1

      In addition to things others have suggested, take a look at openmosix. Just run the openmosix kernel on all the computers in a business and they can migrate jobs from ones which are busy to those that aren't.

      --
      I am trolling
    9. Re:Screw CMT; Time to use wasted CPU by WindBourne · · Score: 1

      OpenMosix is exactly what I am in favor of. But that has nothing to do with the previous postings. They are all speaking in favor of a multi-threaded CPU whereas I am in favor of using underutilized CPUs which is the model that OM favors.

      --
      I prefer the "u" in honour as it seems to be missing these days.
    10. Re:Screw CMT; Time to use wasted CPU by Tweak232 · · Score: 1

      "But that would be too expensive p-Ms cost more than p4s"

      is what you would likely hear if you brought it up. Consumers (almost) always go for the lowest common denominator. :(

  20. Programming isn't up to it by Toby+The+Economist · · Score: 5, Interesting

    32 threads in hardware on one chip is the same as 32 slow CPUs.

    Current programming languages are insufficiently descriptive to permit compilers to generate usefully multi-threaded code.

    Accordingly, multi-threading is currently handled by the programmer; which by and large doesn't happen, because programmers are not used to it.

    A lot of applications these days are weakly multi-threaded - Windows apps for example often have one thread for the GUI, another for their main processing work.

    This is *weak* multi-threading, because the main work done occurs within a single thread. Strong multi-threading is when the main work is somehow partitioned so that it is processed by several threads. This is difficult, because a lot of tasks are essentially serial by nature: stage A must complete before stage B, which must complete before stage C.

    The main technique I'm aware of for making good use of multi-threading support is that of worker-thread farms. A main thread receives requests for work and farms them out to worker threads. This approach is useful only for a certain subset of problem types, however, and within the processing of *each* worker thread, the work done itself remains essentially serial.
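
    A bare-bones sketch of such a worker-thread farm, in Python. The names worker and run_farm, and the squaring "job", are placeholders invented for illustration; a real server would hand out requests instead of integers.

```python
import queue
import threading


def worker(tasks, results):
    """Pull jobs off the shared queue until the None sentinel arrives.
    Note that each individual job is still processed serially."""
    while True:
        job = tasks.get()
        if job is None:
            break
        results.put((job, job * job))  # placeholder for the real work


def run_farm(jobs, n_workers=4):
    """Main thread: start the workers, farm out the jobs, collect the answers."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for job in jobs:
        tasks.put(job)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker so every thread exits
    for t in threads:
        t.join()
    collected = {}
    while not results.empty():  # safe here: all workers have finished
        job, answer = results.get()
        collected[job] = answer
    return collected


if __name__ == "__main__":
    print(run_farm(range(10)))
```

    The farm only helps when the jobs are independent of each other, which is the "certain subset of problem types" caveat; inside worker, nothing is parallel at all.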

    In other words, clock speeds have hit the wall, transistor counts are still rising, the only way to improve performance is to have more CPUs/threads, but programming models don't yet know how to actually *use* multiple CPU/threads.

    El problemo!

    --
    Toby

    1. Re:Programming isn't up to it by Ann+Elk · · Score: 1
      32 threads in hardware on one chip is the same as 32 slow CPUs.

      So, Sun managed to put an NCR Voyager on a single chip? Uhh... cool?

    2. Re:Programming isn't up to it by dchallender · · Score: 1

      Far too many years ago I remember using Helios "parallel C" on a transputer network (in this case actually a network of PCs, each PC modeling one transputer).

      1. The language "enhancements" available encouraged more parallelism than normal with only tiny changes to coding approach (a lot of the work was done by the compiler, obviously).
      2. A lot of the parallelising work was done by the "controller" that parceled off work to the "transputers" (apologies for the bad terminology; this was around 15 years ago and my memory of those days is hazy).

      I'm sure that these days something similar would be feasible: minor "addons" to common languages to encourage high-level parallelism, coupled with some beefy analysis at compiler level to enable extra parallelism, and coupled to a dynamic run-time analysis tool that could spot parallelism opportunities while the application is running (as all programmers will know, some optimizations are not obvious at the code analysis stage and only become apparent when the code executes).

      I'm currently working on projects that would massively benefit from multi-threaded CPUs. Currently work is farmed out from a central server to multiple processing clients; being able to have multi-threaded CPUs would help this enormously.
      --
      Dave
      Generated by SlashdotRndSig via GreaseMonkey

    3. Re:Programming isn't up to it by flaming-opus · · Score: 4, Interesting

      You are absolutely incorrect.
      Multi-threaded programming is the predominant programming model on servers. Some tasks, such as web serving, mail serving, and to some degree database machines, scale almost linearly with the number of processors. All of the first tier, and some of the second tier server manufacturers have been selling 32+-way SMP boxes for years. They work pretty damn well.

      Sun is not trying to create a chip to supplant Pentiums in desktops. They are not going for the best Doom3 performance. They want to handle SQL transactions and IMAP requests, and most likely are targeting this at JSP in a big way.

      As a user of a slightly aged sun SMP box, I'd rather have those many slow CPUs and the accompanying I/O capability, than a pair of cores that can spin like crazy waiting for memory.

    4. Re:Programming isn't up to it by rabtech · · Score: 1

      It has very little to do with programmers "not being used to it".

      Many problems require the result of operation X to complete operation Y; in other words the algorithms are naturally serial in nature and are not easily amenable to parallelism.

      There are a few clever tricks but in some cases making a serial operation parallel gives vastly decreasing performance gains (i.e. two threads = 110% of one thread, four threads = 105% of two threads, etc).
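
      One common way to put numbers on this is Amdahl's law, which the comment does not name; it is brought in here as an assumption. The parallel fraction of 0.2 is an illustrative value picked because it lands close to the 110% / 105% figures above, not a measurement of anything.

```python
def amdahl_speedup(parallel_fraction, n_threads):
    """Amdahl's law: the serial part (1 - p) never shrinks, and only the
    parallel part p is divided across the threads."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_threads)


if __name__ == "__main__":
    p = 0.2  # assumed: only 20% of the work can run in parallel
    previous = 1.0
    for n in (1, 2, 4, 8, 32):
        s = amdahl_speedup(p, n)
        print(f"{n:2d} threads: {s:.3f}x total, {s / previous:.3f}x over the previous step")
        previous = s
```

      With p = 0.2, two threads give about 1.11x over one, and four threads give about 1.06x over two. No thread count can push the total past 1 / (1 - p) = 1.25x.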

      --
      Natural != (nontoxic || beneficial)
    5. Re:Programming isn't up to it by Dark+Fire · · Score: 3, Interesting

      "Current programming languages are insufficiently descriptive to permit compilers to generate usefully multi-threaded code."

      I agree.

      However, I believe that Functional programming languages would seem to have the best chance of successfully taking advantage of multiple threads of execution. Google has 100,000+ computers doing this now using functional programming ideas.
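
      A small sketch of the functional map-and-reduce style this seems to allude to, in Python. It is not Google's code or API; count_words, merge_counts and word_count are invented names. The point is only structural: the map step is a pure function of its input, so the pieces can be handed to any number of workers in any order.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(chunk):
    """Map step: pure function of one chunk, touches no shared state."""
    return Counter(chunk.split())


def merge_counts(left, right):
    """Reduce step: combining is associative, so partial results can be
    merged in any grouping."""
    return left + right


def word_count(chunks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    chunks = ["the cat sat", "the dog sat", "the end"]
    print(word_count(chunks))
```

      In CPython the threads here will not actually run the counting on several cores at once because of the interpreter lock; swapping in a process pool or separate machines changes where the map step runs without changing count_words or merge_counts, which is the attraction of the style.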

      As pointed out in other posts, not every problem will benefit from parallelism. With research and time, this might change. Many problems can be represented in both procedural constructs and recursive constructs. The procedural has been considered the most comprehendable and implementable for the past three decades. This may have to change in light of the direction the hardware technology is going.

    6. Re:Programming isn't up to it by Toby+The+Economist · · Score: 1

      > Some tasks, such as web serving, mail serving, and
      > to some degree data-base machines scale almost
      > linearly with the number of processors. All of the
      > first tier, and some of the second tier server
      > manufacturers have been selling 32+-way SMP boxes
      > for years. They work pretty damn well.

      I explicitly described this method of multi-threading in my reply.

      I also noted that when the work done by each thread is examined, it is performing serial tasks; i.e. it is internally single-threaded, so we haven't *really* got away from the single-threaded paradigm.

      --
      Toby

    7. Re:Programming isn't up to it by Anonymous Coward · · Score: 0

      You're stupid. Since each thread is executing serially, it's single threaded? That doesn't make any sense at all.

    8. Re:Programming isn't up to it by Anonymous Coward · · Score: 0

      In other words, clock speeds have hit the wall, transistor counts are still rising, the only way to improve performance is to have more CPUs/threads, but programming models don't yet know how to actually *use* multiple CPU/threads.

      Your post misses the fact that these multiple CPU/thread systems are of great benefit in a server environment. Applications in a server environment already take advantage of multiple CPU/threads. Look at any enterprise-class server software (e.g., web server, J2EE application server, security software) and it WILL take advantage of multiple CPU/threads.

      Even on the personal workstation front, there is value to multiple CPU/threads. The applications don't even have to be written to take advantage of the multiple CPU/threads platform, because each single-threaded application running concurrently can have its own CPU.

      You greatly overstate the problem.

    9. Re:Programming isn't up to it by sleepingsquirrel · · Score: 1

      CMT, meet CTM

    10. Re:Programming isn't up to it by Anonymous Coward · · Score: 0

      Totally agree. Many tasks are very dependent on serial execution. Checksums, message digests and digital signatures, for instance. You cannot parallelize these kinds of algorithms. The only things that multiple threads could help with are I/O, trying to crack a message digest, or running multiple different message digests at once.
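
      A sketch of the two cases being contrasted, using Python's hashlib. chained_digest is a toy construction invented here to make the data dependency visible; it is not how SHA-256 is defined internally. digest_many is the "multiple different message digests at once" case.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def chained_digest(blocks):
    """Each step needs the previous step's output, so the loop cannot be
    split across threads: step k has to wait for step k - 1."""
    state = b""
    for block in blocks:
        state = hashlib.sha256(state + block).digest()
    return state


def sha256_hex(message):
    return hashlib.sha256(message).hexdigest()


def digest_many(messages, workers=4):
    """Independent messages share nothing, so each digest can be computed
    on its own thread and the results collected in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sha256_hex, messages))


if __name__ == "__main__":
    print(chained_digest([b"block one", b"block two"]).hex())
    print(digest_many([b"first file", b"second file", b"third file"]))
```

      Extra hardware threads do nothing for chained_digest, but they can keep several digest_many jobs going at once.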

      Because of this, the only real benefit to serial code is in the event of running multiple tasks. But then it really isn't multiple threads: it is multiple processes with different address spaces. I don't know the specs on any of these chips, but threads are a very different beast than processes because of the different address spaces.

      If these HW threads cannot handle multiple address spaces, then it really is not that useful and seems more of a marketing ploy and is a misnomer.

    11. Re:Programming isn't up to it by MikeBabcock · · Score: 1

      No intentional reference here to PS3/XBox360 designs, but I can see how games would benefit from an SMT or SMP design more than many problems would. Large AI systems in video games could be created if you had enough processors to handle the calculations while letting others handle graphics and audio calls.

      A good physics model with semi-intelligent opponent AI would be fun to implement on a 32-way desktop system.

      --
      - Michael T. Babcock (Yes, I blog)
    12. Re:Programming isn't up to it by mikael · · Score: 1

      32 threads in hardware on one chip is the same as 32 slow CPUs.


      A human brain consists of around 100 billion neurons each running at a maximum speed of 500 Hz (a single neuron can depolarize and recharge within 0.002 seconds).

      If that were the case, wouldn't we be better off having a few huge brain cells, rather than billions of small ones?

      Source: How neurons work

      --
      Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
    13. Re:Programming isn't up to it by Jeremi · · Score: 1
      A human brain consists of around 100 billion neurons each running at a maximum speed of 500 Hz


      Yes, but have you ever tried to program one of those things? I can't even patch mine to keep me away from the ice cream....

      --


      I don't care if it's 90,000 hectares. That lake was not my doing.
    14. Re:Programming isn't up to it by biraneto · · Score: 1

      That almost hurts me. Even web interface development feels the urge to use threads (e.g. javascript programming). Any program that is not as simple as a hello world is in need of threads. A programming language that doesn't support threads is not a good one. Stop using delphi... and try reading something about linux for a change :)

    15. Re:Programming isn't up to it by johnhennessy · · Score: 2, Informative

      Could be wrong here, but I thought that the main reason for implementing a CMT chip with "hardware threads" was to make the context switch less painful.

      On single processor systems, when the OS wants to switch between two threads, it usually executes a context switch - it needs to dump one set of registers to memory, load the other set from memory and change the instruction pointer.

      That usually adds up to two separate memory accesses to different parts of memory. What's more, it is not always possible for the processor to accurately predict which two sets of memory addresses will be involved - it all depends on what new thread has been chosen by the scheduler.

      This wouldn't have been a huge problem on Intel's architecture - given the relatively small number of registers it gives to applications.

      In Sparc architectures - you have a lot more registers (and wider 64 bit registers) to worry about.

      What Sun would like to do is remove this overhead by implementing a set of registers for each processing unit. This makes them independent in their own right.
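
      A toy model of the difference described above, not a description of any real SPARC or Intel design. It assumes a hypothetical core with 32 registers and counts only the word-sized loads and stores caused by switching threads; the class names and the schedule are invented for the example.

```python
class SoftwareSwitchedCore:
    """One register file; switching threads spills and reloads it through memory."""

    def __init__(self, num_registers):
        self.num_registers = num_registers
        self.registers = [0] * num_registers
        self.saved = {}           # thread id -> register values parked in memory
        self.current = None
        self.memory_accesses = 0  # word-sized loads and stores caused by switching

    def switch_to(self, thread_id):
        if thread_id == self.current:
            return
        if self.current is not None:
            # Dump the outgoing thread's registers to memory.
            self.saved[self.current] = list(self.registers)
            self.memory_accesses += self.num_registers
        if thread_id in self.saved:
            # Load the incoming thread's registers from memory.
            self.registers = list(self.saved[thread_id])
            self.memory_accesses += self.num_registers
        else:
            # First run of this thread in the toy model: start from zeroed registers.
            self.registers = [0] * self.num_registers
        self.current = thread_id


class HardwareThreadedCore:
    """One register file per hardware thread; switching only changes an index."""

    def __init__(self, num_registers, num_threads):
        self.banks = [[0] * num_registers for _ in range(num_threads)]
        self.current = 0
        self.memory_accesses = 0

    def switch_to(self, thread_id):
        self.current = thread_id  # no registers move through memory


schedule = [0, 1, 0, 1, 0]  # hypothetical order chosen by the scheduler
soft = SoftwareSwitchedCore(num_registers=32)
hard = HardwareThreadedCore(num_registers=32, num_threads=2)
for tid in schedule:
    soft.switch_to(tid)
    hard.switch_to(tid)
```

      With this schedule the software-switched core moves 224 register words through memory and the hardware-threaded one moves none, which is the traffic the per-thread register sets are meant to remove.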

      At the end of the day, the bottleneck in most systems is going to be the RAM-CPU bus. If this can reduce the number of hits that bus takes, then overall system performance will improve - by what margin is usually up to the system architects (i.e. why pick 8 cores instead of 12, why pick 3 Mbit of cache instead of 2 Mbit, etc, etc)

      --
      [ Monday is a terrible way to spend one seventh of your life. ]
    16. Re:Programming isn't up to it by Lovesquid · · Score: 1

      but I can see how games would benefit from SMT or SMP design more than many problems

      I disagree. I think that games would benefit more from many problems.

    17. Re:Programming isn't up to it by laird · · Score: 1

      "when the work done by each thread is examined, it is performing serial tasks; e.g. it is internally single-threaded, so we haven't *really* got away from the single-threaded paradym."

      That's true, but it doesn't matter. The point isn't whether programmers write multi-threaded code, but whether software runs faster.

      Keep in mind that Sun's market is servers, where you have big expensive boxes processing hundreds or thousands of transactions a second. Each transaction can be (and usually is) written as serial code, because that's easier to write and debug. However, it doesn't matter that each transaction is single-threaded, because there are many different transactions, and because the web servers, application servers, and databases _do_ understand threading. So if a CPU can handle 32 different transactions at once, and your OS/web server/app server can distribute and manage those threads properly, your server performance is just as good as if you had one processor running 32x as fast. Of course, you can't make a CPU 32x as fast (ask Cray), so this is a very clever way to make computers faster.
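
      The split described above can be sketched roughly as follows, assuming Python's standard concurrent.futures module. The transaction handler is ordinary serial code and only the hypothetical serve function knows about threads. The order format and the figure of 32 workers are invented for the example, and in standard CPython the global interpreter lock limits the speedup for CPU-bound work, so this shows the shape rather than the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def handle_transaction(order):
    """Plain serial code: no locks, no threads, one request in, one result out."""
    subtotal = sum(price * qty for price, qty in order["items"])
    return {"id": order["id"], "total": subtotal}


def serve(orders, hardware_threads=32):
    """The 'server' side: spread independent transactions over a pool of workers."""
    with ThreadPoolExecutor(max_workers=hardware_threads) as pool:
        return list(pool.map(handle_transaction, orders))


# Hypothetical workload: 100 independent orders.
orders = [{"id": n, "items": [(2, n), (3, 1)]} for n in range(100)]
results = serve(orders)
```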

      Of course, if you're running desktop applications it's a whole different ball of wax, because users aren't doing hundreds or thousands of different things at once -- they're probably doing 1-2 things at a time. So if you're playing a game and that game is single-threaded, then 31 of those virtual CPUs will go nearly unused. But Sun doesn't sell desktop computers.

      Hmm. Since MacOS X runs on SPARC (OK, it used to until recently, and we can assume that Apple has probably kept at it), I wonder if Apple would be interested in selling Xserves running on these nifty new SPARCs -- they're probably better CPUs for servers than Intel, and would make the point that the OS is more important than the CPU.

    18. Re:Programming isn't up to it by be-fan · · Score: 2, Informative

      Multithreading is dominant because it's the only way to wring parallelism out of legacy languages like C. And nobody claims multithreading is easy, natural, or anything but error-prone. The future is really in languages that have formal abstractions for concurrency, so programmers can specify at a high level what tasks can be concurrent and let the compiler do the low-level locking. Basically, you want languages based on a concurrent calculus of computation (eg: Pi-calculus), instead of languages based on lambda calculus, which lacks a formal notion of concurrency.
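
      Python has no Pi-calculus semantics, so the following is only a loose illustration of the message-passing style such calculi formalize, assuming the standard queue and threading modules. The two tasks share nothing but a channel, and the locking lives inside queue.Queue rather than in the application code. The names producer, consumer and DONE are invented for the example.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def producer(channel, n):
    """Send the squares of 0..n-1 down the channel, then signal completion."""
    for i in range(n):
        channel.put(i * i)
    channel.put(DONE)


def consumer(channel, results):
    """Receive values until the sentinel arrives."""
    while True:
        item = channel.get()
        if item is DONE:
            break
        results.append(item)


channel = queue.Queue(maxsize=4)  # bounded, so the producer blocks when it is full
results = []
threads = [
    threading.Thread(target=producer, args=(channel, 10)),
    threading.Thread(target=consumer, args=(channel, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```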

      --
      A deep unwavering belief is a sure sign you're missing something...
    19. Re:Programming isn't up to it by Anonymous Coward · · Score: 0

      While most tasks may be sequential in nature, most power hungry tasks aren't. Rendering? Web serving? Audio/Video? Cracking? FFTs? Kernel compilation? It all sounds very parallel to me. Or maybe I just don't know what people are doing when they run out of oomph.

      Well, actually I can think of some exceptions, like certain cryptographic and compression algorithms, but that's mostly because they haven't been designed with parallelism in mind. e.g. there are other means to be secure besides chaining cipherblocks, several blocks can be unsorted at the same time etc..

    20. Re:Programming isn't up to it by Electroly · · Score: 1

      I was with you until "Since MacOS X runs on SPARC". NeXTSTEP and OPENSTEP ran on SPARC, but no version of Mac OS X ever has.

    21. Re:Programming isn't up to it by lgw · · Score: 1

      So that when something goes wrong you'd have no chance in Hell of debugging it? Sounds like fun! You can't beat a simple lockword for transparency.

      --
      Socialism: a lie told by totalitarians and believed by fools.
    22. Re:Programming isn't up to it by lgw · · Score: 1

      I believe that Functional programming languages would seem to have the best chance of successfully taking advantage of multiple threads of execution

      This is the second post I've seen claiming this. Given that everything that can be done in a functional language can also be done in an iterative language and vice versa, I don't see where this is coming from.

      A function pointer and some associated state information is the same in any language, whether described with "lambda" or "class".

      --
      Socialism: a lie told by totalitarians and believed by fools.
    23. Re:Programming isn't up to it by be-fan · · Score: 2, Informative

      Eh? Locks suck for debugging, nobody in their right mind likes debugging multithreaded code. If anything, this will make it easier to debug parallel code. Once you have a formal model for expressing concurrency, it makes it much easier to reason about the code and figure out where something went wrong. Further, since the compiler can understand a formal concurrency model in a way it cannot understand an ad-hoc concurrency model, the compiler can offer tools to aid in debugging concurrent applications.

      Now, if you're talking about debugging things when the compiler breaks, then yes, it will make debugging harder. On the other hand, the compiler already does some pretty heroic transformations during optimization, and things seem to work pretty smoothly. Certainly, the possibility of hard-to-debug errors caused by a broken optimizer doesn't seem to have prevented people from putting -O2 in their build scripts.

      --
      A deep unwavering belief is a sure sign you're missing something...
    24. Re:Programming isn't up to it by lgw · · Score: 0

      The slowest possible way to find a problem is to "reason about the code"! It's the last resort: do you have any idea how hard it is to find a missing semicolon by "reasoning about the code", or to determine that the API documentation is full of lies?

      In my experience, problems with multi-threaded code almost never come from a lack of understanding of how to write multi-threaded code. Instead, problems come from the usual typos, library calls not performing as documented, or other cases of the code not doing what it looks like it's doing. Every layer of abstraction only makes this worse to sort out. With a lockword, especially one where a pointer to the owning thread is used to indicate a lock, it's not going to be that hard to figure out where things went wrong.
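
      A rough sketch of the kind of lock word described above, assuming Python's standard threading module. The LockWord class is hypothetical; it wraps an ordinary mutex and records the owning thread's name, so the owner can be read off in a debugger. Thread names are assumed to be unique here.

```python
import threading


class LockWord:
    """A mutex that remembers which thread holds it, for inspection while debugging."""

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None  # name of the owning thread, or None when free

    def acquire(self):
        self._lock.acquire()
        self.owner = threading.current_thread().name

    def release(self):
        if self.owner != threading.current_thread().name:
            raise RuntimeError(f"lock held by {self.owner!r}, not by this thread")
        self.owner = None
        self._lock.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()


lock = LockWord()
seen = []


def worker():
    with lock:
        seen.append(lock.owner)  # record who the lock word says the owner is


t = threading.Thread(target=worker, name="worker-1")
t.start()
t.join()
```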

      Certainly "the compiler offering tools to help in debugging concurrent applications" is only a good thing, but you don't need a new language for that, just appropriate attention to the debugger. That's going on now in modern debuggers, as they attempt to rise to the challenge.

      If the optimizer is broken you're pretty screwed, but I've actually seen that before, and it's pretty easy to detect that by accident ("It works fine on my machine." "What's different?" "Well, I'm using a debug build." "You don't suppose ...").

      --
      Socialism: a lie told by totalitarians and believed by fools.
    25. Re:Programming isn't up to it by be-fan · · Score: 2, Informative

      The slowest possible way to find a problem is to "reason about the code"!

      It's the only Right Way (TM) to find a problem. Now, I don't recommend reasoning about the code to find typos, but then again, you shouldn't be making typos anyway.

      In my experience, problems with multi-threaded code almost never come from a lack of understanding of how to write multi-threaded code.

      Most people would disagree. Almost invariably, problems with multithreaded code are the fault of the programmer. Race conditions, deadlocks, etc, are all the result of the programmer not locking something he should, locking something he shouldn't, or locking things in the wrong order. You can throw scalability problems on top of that too: the programmer only locked one thing when he should have locked several. Just look through the changelogs of a highly-multithreaded system like the Linux/BSD kernels. Notice how often races and deadlocks get fixed relative to typos.
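
      To make "locking things in the wrong order" concrete, here is a small sketch assuming Python's standard threading module. The Account class and the transfer function are invented for the example. Both locks are always taken in ascending id order, which is one common way to avoid the deadlock; the function assumes the two accounts are distinct.

```python
import threading


class Account:
    def __init__(self, ident, balance):
        self.ident = ident
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Move money while holding both locks, always taken in ascending id order.

    Assumes src and dst are different accounts.
    """
    first, second = sorted((src, dst), key=lambda account: account.ident)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount


a = Account(1, 1000)
b = Account(2, 1000)


def shuffle_money(src, dst):
    for _ in range(1000):
        transfer(src, dst, 1)


# Two threads transfer in opposite directions; taking src.lock first in both
# would let each hold one lock and wait forever for the other.
threads = [
    threading.Thread(target=shuffle_money, args=(a, b)),
    threading.Thread(target=shuffle_money, args=(b, a)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```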

      is only a good thing, but you don't need a new language for that, just appropriate attention to the debugger.

      You do need a new language for that. Current languages have no formal model for concurrency, so anything the compiler offers you is an ad-hoc solution. Clever debuggers can offer hacks, but nothing complete and reliable. The only way for the compiler to understand concurrency systematically, like the programmer does, is to specify it systematically. To specify it systematically, you need a formal model of concurrency, and a language that allows you to specify concurrency in your application according to that model.

      Let me use an analogy. Current languages treat concurrency the way assembler treats functions. You can do functions in assembler (even very complex ones with closures and everything), but it's all ad-hoc. The assembler doesn't really know anything about functions. It can't tell you if you put your arguments in an incorrect register or if you don't adjust the stack-pointer correctly. Now, an asm debugger can use hacks to try to divine the functions in an asm program, it can read the stack pointer and the base pointer and try to figure out what the functions are, but it'll never be as good as a C debugger that systematically understands what a function is and how it is used. Concurrent calculi do for concurrency what lambda calculus does for functions. They offer a formal way of understanding and specifying concurrency in a program.

      --
      A deep unwavering belief is a sure sign you're missing something...
    26. Re:Programming isn't up to it by Dark+Fire · · Score: 3, Insightful

      From the parent post:

      "Current programming languages are insufficiently descriptive to permit compilers to generate usefully multi-threaded code."

      The portion of importance is:

      "insufficiently descriptive"

      In C, C++, and Java, you must program with concurrency in mind to obtain any benefit from multiple threads of execution. In a functional programming language, the restrictions placed on the behavior of functions often imply concurrency without the programmer necessarily intending that as the result. If you write a C program without concurrency in mind and want to adapt your solution later to take advantage of multiple threads, you may need to code a completely different solution and also locate a compiler that knows how to take advantage of concurrency. In a functional language, you may only need to get an updated version of your compiler/interpreter. This is why C, C++, and Java are in the "insufficiently descriptive" category and functional programming languages are not.
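
      Python does not enforce purity, so this is only an illustration of why side-effect-free functions are safe to evaluate concurrently, not evidence that any compiler does it automatically. It assumes the standard concurrent.futures module, and collatz_steps is an arbitrary pure function picked for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def collatz_steps(n):
    """Pure function: the result depends only on n and nothing outside is modified."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


inputs = list(range(1, 50))

# Sequential evaluation.
sequential = list(map(collatz_steps, inputs))

# Same calls, evaluated by a pool; safe because the calls cannot interfere.
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(collatz_steps, inputs))
```

      Swapping map for pool.map changes nothing in collatz_steps itself, which is the property the parent post attributes to functional code.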

    27. Re:Programming isn't up to it by afabbro · · Score: 1
      But Sun doesn't sell desktop computers.

      Actually, they do. It's just that no one buys them.

      --
      Advice: on VPS providers
    28. Re:Programming isn't up to it by rsynnott · · Score: 1

      For the moment, this is being aimed at SERVERS. Servers have lots of separate threads or processes. It'll be more of a challenge on a desktop (tho Microsoft and Sony at least seem to be counting on it working out for their consoles).

      --
      Me (Blog)
    29. Re:Programming isn't up to it by jelle · · Score: 1

      "32 threads in hardware on one chip is the same as 32 slow CPUs."

      Nope, it's the same as that single-threaded fast chip, but instead of wasting cycles during a pipeline stall, it will switch over to one of the other threads. Add to that the reduced thread&task-switch overhead and you've got a free CPU upgrade.

      Granted, it's not gold spun out of wool, but it removes wasted cycles and costs only a few extra transistors (esp. compared to the cache size, etc).
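
      A toy cycle-level simulation of the argument above, under made-up numbers: each thread computes for 2 cycles and then stalls for 6 cycles waiting on memory, and the core issues from one ready thread per cycle. Real pipelines and real stall lengths differ; the function name and parameters are invented for the sketch.

```python
def utilization(num_threads, work=2, stall=6, cycles=8000):
    """Fraction of cycles in which the core issues an instruction.

    Each thread computes for `work` cycles, then waits `stall` cycles on memory.
    The core issues from one ready thread per cycle; if none is ready the cycle is wasted.
    """
    ready_at = [0] * num_threads       # first cycle at which each thread can run again
    work_left = [work] * num_threads   # compute cycles left before the next stall
    busy = 0
    for cycle in range(cycles):
        for tid in range(num_threads):
            if ready_at[tid] <= cycle:
                busy += 1
                work_left[tid] -= 1
                if work_left[tid] == 0:
                    # Thread stalls; the core can pick another thread next cycle.
                    ready_at[tid] = cycle + 1 + stall
                    work_left[tid] = work
                break
    return busy / cycles


one_thread = utilization(1)
two_threads = utilization(2)
four_threads = utilization(4)
```

      Under these assumptions a single thread keeps the core busy a quarter of the time, two threads half the time, and four threads are enough to fill every cycle. The sketch measures total throughput only; each individual thread still runs no faster than before.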

      --
      --- Hindsight is 20/20, but walking backwards is not the answer.
    30. Re:Programming isn't up to it by lgw · · Score: 1

      It's the only Right Way (TM) to find a problem.

      Wow, judging by your UID you're not some unseasoned guy straight out of school, so I'm mystified. I "reason about code" during the design process, but it's pretty darn useless for finding bugs. Of course I picked the correct algorithm, but that's not necessarily what I typed in.

      Now, I don't recommend reasoning about the code to find typos, but then again, you shouldn't be making typos anyway. ... Most people would disagree. Almost invariably, problems with multithreaded code are the fault of the programmer.

      You shouldn't make typos, but then you shouldn't make any errors, hardly a useful statement. All software problems everywhere are the fault of some programmer, but I don't see how multi-threading is a special case. At least, not in my shop, or anywhere that hires people who know what the Hell they are doing at the job they do every day. Most people install spyware thinking it's helpful, but I was talking about professionals.

      You do need a new language for that.

      I once helped 2 other seasoned engineers (we probably had 50 years combined experience) track down a threading bug for six weeks. Many weeks of very expensive lost programmer effort. The root cause? Microsoft's documentation was in error (including what's available to their premier support folks), but very subtly, and of course the source code was unavailable. This is not a problem that a different language would have fixed.

      You can do functions in assembler (even very complex ones with closures and everything), but it's all ad-hoc. The assembler doesn't really know anything about functions. It can't tell you if you put your arguments in an incorrect register or if you don't adjust the stack-pointer correctly.

      I programmed in assembly for 5 years on large projects. I never had a problem with arguments in an incorrect register (beyond typos, of course, always a menace), or for that matter with lockwords or race conditions. Certainly, a language with innate support for concurrency would eliminate a category of errors caused by carelessness, but when something goes wrong for some deeper reason you're *still* debugging through it in assembler. Ever try doing that with well optimized code? What a mess! Debugging assembly is pretty easy when the original program was in assembly, and bugs have nowhere to hide.

      A competent professional low-level programmer isn't losing much time to errors of carelessness in the first place (though, of course, they do happen). C++ debuggers are getting the hang of easing multi-threaded debugging today. A language with concurrency support would have to have a debugger that was one Hell of a lot better than that if the small amount of time gained in finding each simple error is going to outweigh the large amount of complexity added to a simple lockword when debugging the really hard corner cases.

      I can't think of a single time in history when someone has said "most people aren't good at programming, so we'll make it simple by designing an easy language!" and was right. Hard problems are hard problems. You can reduce carelessness with good language design, but that's about it.

      The lambda calculus did little for formal program specification. It was a new tool to help prove programs correct, but that's remarkably useless in the field. After all, you can prove an algorithm correct if that gives you a warm fuzzy, but you can never prove that the code you typed in implements that algorithm.

      --
      Socialism: a lie told by totalitarians and believed by fools.
    31. Re:Programming isn't up to it by lgw · · Score: 1

      I keep seeing this assertion, but never any evidence.

      A functional programming language is no more or less descriptive than C++. It's all an abstract syntax tree once it's parsed, and the same algorithm looks about the same coming from any language. It's all about the compiler. There are FORTRAN compilers that deliver code very nicely optimized for highly parallel environments, for example, because it's been worth the high cost needed to make that true.

      I can see that a language that has innate support for concurrency is going to do better than one that doesn't, performance wise, but only if you're thinking about concurrency and take advantage of those features to provide better hints to the compiler. There's nothing about functional languages that is any closer to that than iterative languages. You're no more likely to "only need to get an updated version of your compiler" in a functional language than in FORTRAN, if you didn't write your functional language program with concurrency in mind.

      --
      Socialism: a lie told by totalitarians and believed by fools.
    32. Re:Programming isn't up to it by lars_stefan_axelsson · · Score: 1
      I keep seeing this assertion, but never any evidence. A functional programming language is no more or less descriptive than C++.

      Check out e.g. "Four-fold Increase in Productivity and Quality" (pdf link). Erlang is freely available. Regarding C++ vs. functional programming being 'less descriptive' there's Haskell vs. Ada vs. C++ vs. Awk vs. ..., An Experiment in Software Prototyping Productivity (PS link). Now, that doesn't directly address your question about what the compiler can do, but there's about a metric ton of stuff about the higher level optimisations you can do with a declarative language compared to a messy one such as C++, riddled with aliasing problems etc. I haven't got any links handy, but some googling should turn them up (you could start by checking out Urban Boquist's, now quite old, PhD thesis). Please note though that the Erlang references demonstrate that even while they may be slower on micro benchmarks, they always win in the end. Much like C beat out assembler in the eighties.

      Your argument basically boils down to "the languages under discussion are all turing complete". While that's true, that's not really what we're saying. We're saying that given a declarative language the potential (and nowadays practice) for optimisation is much improved compared to e.g. Fortran or C.

      --
      Stefan Axelsson
    33. Re:Programming isn't up to it by be-fan · · Score: 1

      Of course I picked the correct algorithm, but that's not necessarily what I typed in.

      Fine, I'm not going to argue with this. In my experience, bugs are the result of my making logical errors, not handling corner cases, making fencepost errors, etc. YMMV.

      This is not a problem that a different language would have fixed.

      It might have, it might not have. If the API erred in marking something thread safe when it wasn't, well, a concurrent language would fix that, because there would be no need for that documentation in the first place! The concurrency of the function would be expressed in its interface, just like the type of a function is expressed in its interface. The compiler would have checked for that error before the test-cases could have even been run.

      but when something goes wrong for some deeper reason you're *still* debugging through it in assembler.

      You're only debugging through it in assembler because C debuggers suck. In any case, what are you arguing? That using functions from ASM code is just as easy as using functions from C code? How about using classes from C code versus using classes from Java code? Feel free to believe that, but most people would disagree. Strongly-typed programming languages aren't the panacea they were marketed as, but they did reduce typing-related bugs. Concurrent languages won't be a panacea either, but they will reduce deadlocks and race conditions.

      I can't think of a single time in history when someone has said "most people aren't good at programming, so we'll make it simple by designing an easy language!" and was right.

      That's a very stupid statement. Programming is a silly undertaking to begin with. The only reason we do it is because current technology doesn't allow anything better. What you want to be doing is writing specifications, and letting the compiler sort out the rest. Human beings are good at design, computers are good at managing details. The more expressive your specification language becomes, the more work the compiler can do for you, and the less error-prone the whole process is. As you said: almost every error is the fault of the programmer. By extension, the less the programmer does, the less likely there will be errors.

      In any case, this theoretical "competent programmer" is a myth. Most software in existence today blows. If programmers were really so damn competent, it wouldn't. Software wouldn't be full of races and deadlocks, software would be aggressively multithreaded. Hell, just consider the pain people went through trying to get the Mozilla codebase to play nice with BeOS's multithreading. Consider the fact that GTK+ still isn't properly thread-safe, and that Qt is in version 4 before it has reached a decent level of thread-safety. Consider that Darwin, NetBSD, and OpenBSD still have coarse-grained multithreading in their kernel, while FreeBSD and Linux went through tremendous pains to fine-grain multithread their SMP. Consider the huge amount of research going into making locking algorithms (RCU, etc) that don't kill scalability. Saying that the current state of affairs is acceptable is just plain asinine.

      Hard problems are hard problems.

      Hard problems are made easier to solve when you have proper tools to find the solution. Calculus problems are hard. Yet, you can solve many calculus problems knowing nothing more than algebra. However, it's difficult, tedious, and error prone. Once you have the language of calculus, a formalism that you can apply to your problem, well, then it becomes much easier.

      The lambda calculus did little for formal program specification.

      Most existing programming languages are reducible to the lambda calculus. As a result, the lambda calculus is immensely powerful for designing algorithms. Using the techniques of the calculus, you can design algorithms that are provably correct. On hard problems, it's the design that's the hard part. It might take weeks to design the algorithm, but the code for it might only fill a few pages of text. It's easy to use ad-hoc means to make sure your text matches the specification. It's not easy to use ad-hoc means to ensure that your specification is complete and correct.

      --
      A deep unwavering belief is a sure sign you're missing something...
    34. Re:Programming isn't up to it by lgw · · Score: 1
      Fine, I'm not going to argue with this. In my experience, bugs are the result of my making logical errors, not handling corner cases, making fencepost errors, etc. YMMV.

      We may be saying the same thing here. Do you really find it easier to spot a fencepost error by looking at the code instead of stepping through the debugger? If I've made an error in judgement (and not merely a typo) for something like that, I'm unlikely to spot it by re-reading the code. It's pretty darn obvious in the debugger, however.

      If the API erred in marking something thread safe when it wasn't, well, a concurrent language would fix that,

      I can buy that. I do see some wins from the idea. There's an advantage of making concurrency a language feature instead of a standard library in that compiler bugs are pretty rare. There are also, of course, potential performance gains from optimization (which we haven't really been discussing).

      Strongly-typed programming languages aren't the panacea they were marketed as, but they did reduce typing-related bugs. Concurrent languages won't be a panacea either, but they will reduce deadlocks and race conditions.


      It took a long time before strong typing did more good than harm, and it's still a pain in the ass when parsing (or other situations when runtime polymorphism is needed). I guess it's possible to make a language which would be a net improvement, but I'm quite skeptical that would actually happen. It seems far more likely that it would merely add obfuscation to the existing problems. Ultimately, the *hard* problems require watching what's going on at the lowest level, where nothing can hide.

      If a language could abstract concurrency *without* complicating what happens at this level, *and* therefore let you understand how to handle low-level concurrency yourself (if you really had to) without breaking what the language is doing, how could I object? That's a lofty goal however - how many strongly typed languages were there before people realized you really did need void* after all? I'm just quite skeptical someone could actually deliver such a language.

      Programming is a silly undertaking to begin with. The only reason we do it is because current technology doesn't allow anything better. What you want to be doing is writing specifications, and letting the compiler sort out the rest.

      This leads to a really fundamental misunderstanding in programming language design! The same mistake was made with the design of COBOL and many other languages which tried to be closer to natural language. You need a formal language in which to write specifications; natural language is too ambiguous. A "very stupid statement" is

      ADD 1 TO INDEX GIVING INDEX

      (actual COBOL code)! This is just a pile of obfuscation on top of

      index++

      It's not the need to spell out the details that makes programming hard, it's the fact that the details matter. This has been proven over and over in the history of programming language design.

      Strong typing is better precisely because it forces you to spell out more details, for example. It's quicker to write code in a dynamically typed language, but a pain to maintain.

      In any case, this theoretical "competent programmer" is a myth. Most software in existence today blows. If programmers were really so damn competent, it wouldn't

      Sturgeon's Law: 90% of everything is crap. This is just as true in English as it is in programming. That doesn't mean it's a language problem. I'm not saying that you *couldn't* come out ahead by putting concurrency in the language to reduce carelessness, I'm saying you *also* pay a price in hiding what's really going on, and that price is high. It's hard to achieve net advantage. Yes, this is an area novice programmers are prone to screw up, but your language design and debugger would have to be implemented *really* well to com

      --
      Socialism: a lie told by totalitarians and believed by fools.
    35. Re:Programming isn't up to it by lgw · · Score: 1

      Special purpose declarative languages beat out general purpose iterative languages, no doubt, but special purpose C++ compilers found mostly in university research projects are pretty damn good.

      Pattern-matching languages are neat and all, but can be a world full of hurt to debug. There's nothing I see in Erlang that requires functional programming to implement, however. My gut reaction to any high level language is "cut out the bullshit, what's really going on!". Personal preference, I guess, but if I can't tell what the bytes are going to be, it's useless to me.

      Weak dynamic typing is also a huge pain in the ass to support in large systems.

      Let's just say I remain unconvinced.

      --
      Socialism: a lie told by totalitarians and believed by fools.
    36. Re:Programming isn't up to it by lars_stefan_axelsson · · Score: 1
      Let's just say I remain unconvinced.

      As is your prerogative. We'll continue to make billions beating our competition thinking the same thing. You have a point though, Erlang is a weak functional language with great support for concurrency/parallelism (depending on Erlang being functional in the first place). Oh and BTW, Erlang/Haskell etc. are emphatically not "special purpose", less so than C++ I'd say.

      Anyway, why stop at assembly? Why don't you build your own processor from gates/FPGA? I mean a general purpose CPU is just some average kludge to make some imagined problem solvable. If you have domain knowledge then you can always do better. Why bother with someone else's "high level" circuit? Build your own I say!

      --
      Stefan Axelsson
    37. Re:Programming isn't up to it by be-fan · · Score: 1

      We may be saying the same thing here. Do you really find it easier to spot a fencepost error by looking at the code instead of stepping through the debugger?

      Easier? No. But I sleep better at night when I analyze the algorithm to figure out what's wrong instead of using the debugger. It gives me confidence in the rest of the code. It's not just me, though. I've heard of lots of places that frown on using the debugger.

      It took a long time before strong typing did more good than harm, and it's still a pain in the ass when parsing (or other situations when runtime polymorphism is needed).

      You're conflating strong typing with static typing. Strong typing just prevents type errors --- in the face of runtime polymorphism, it will catch type errors at runtime. Static typing prevents you from using runtime polymorphism.
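
      To make the distinction concrete, here is a minimal Python sketch. Python is used only as an example of a strongly but dynamically typed language, and the function name total_length is made up for illustration. The same call site accepts several types at runtime, yet a genuine type error is still caught, just at runtime rather than at compile time.

```python
def total_length(items):
    """Sum the lengths of anything that supports len()."""
    return sum(len(item) for item in items)


# Runtime polymorphism: strings, lists and tuples all work unchanged.
mixed_ok = total_length(["abc", [1, 2], (3,)])

# Strong typing: mixing incompatible types is rejected, not silently coerced.
try:
    "1" + 1
    caught = False
except TypeError:
    caught = True
```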

      Ultimately, the *hard* problems require watching what's going on at the lowest level, where nothing can hide.

      In a properly high-level language, problems that require low-level access should be exceedingly rare. Basically, if you've got memory protection (ie: can't overwrite random memory with garbage), garbage collection, and some concurrency model, there should be very few problems that require getting down to the metal. In those rare occasions, it shouldn't be too difficult to do that. Again to use the analogy of optimizing compilers (especially aggressive ones like you'll find in Scheme or ML), you kind of get to know what the compiler will do. A seasoned ML programmer can tell you exactly what the memory layout of his app is as easily as a C programmer can. Things shouldn't be that much different for concurrent compilers.

      how many strongly typed languages were there before people realized you really did need void* after all?

      Void* is a bad example because in C it's used for two things: dynamically typed access and untyped access. You often need the former, but you almost never need the latter. In languages that have support for the former, you don't need void* at all.
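
      The two uses of void* described here can be sketched in Python, with the caveat that this is only an analogy: describe and reinterpret_float_bits are hypothetical names, and the standard struct module stands in for a C pointer cast.

```python
import struct


def describe(value):
    """Dynamically typed access: inspect the runtime type, then act on it."""
    if isinstance(value, int):
        return "int:%d" % value
    if isinstance(value, str):
        return "str:%s" % value
    raise TypeError("unsupported type: %s" % type(value).__name__)


def reinterpret_float_bits(x):
    """Untyped access: treat the 4 bytes of a float as an unsigned integer."""
    raw = struct.pack("<f", x)           # raw bytes, type information discarded
    return struct.unpack("<I", raw)[0]   # same bytes read back as a 32-bit int
```

      The first function still knows what it is holding; the second throws that knowledge away, which is the case the comment argues is almost never needed.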

      The same mistake was made with the design of COBOL and many other languages which tried to be closer to natural language.

      I'm talking about mathematically precise specifications, not natural-language specifications. Natural languages suck for programs because they are ambiguous. There is no systematic way of expressing precisely what you want. However, with regards to concurrency, languages like C are the same way. There is no systematic way of specifying concurrency in those systems either.

      A good calculus for describing concurrency would certainly be handy for design review, but that doesn't automatically mean it would be better in a programming language.

      All programming languages have semantics that can be mapped to a calculus of computation. If a language was based on a concurrent calculus, not only could the concurrency model for the application be specified during the design, but that model could be imported wholesale into the program. The more you minimize the path between specification and code, the smaller the chance for error.

      --
      A deep unwavering belief is a sure sign you're missing something...
    38. Re:Programming isn't up to it by laird · · Score: 1

      "I was with you until "Since MacOS X runs on SPARC". NeXTSTEP and OPENSTEP ran on SPARC, but no version of Mac OS X ever has."

      OK, I shortened a longer and more complex discussion. Most of the parts of MacOS X run on SPARC. NeXTSTEP and OPENSTEP ran on SPARC, as does WebObjects. Most people don't realize it, but WebObjects includes the Cocoa runtime, which means that Mac Cocoa apps can run on SPARC (over the Cocoa runtime) though Apple doesn't allow developers to license the Cocoa runtime for that purpose. And of course both Mach and BSD run on SPARC, and there are mentions of SPARC in Darwin (for example, http://www.opensource.apple.com/darwinsource/10.3.7/gas-495.8/include/mach/sparc/). Finally, I've also been told by several Apple engineers over the years that Apple maintains NeXTSTEP's portability across more than PPC and x86, specifically mentioning SPARC. So given all of that, it's highly likely that Apple could easily ship MacOS X for SPARC if they wanted to, but I can't prove it. :-)

      I'd still love to see MacOS X Server ship on the high-end SPARCs. Not that there's anything wrong with Solaris 10, of course, but it'd be great for Apple and for its customers for Apple to prove the point that the OS matters more than the CPU, and that they have the best, most portable OS.

      OK, I have no idea how well MacOS X would run on an 8-core SPARC with 4x "hyperthreading" to have a virtual 32 processors. But it'd be a blast to try.

    39. Re:Programming isn't up to it by davecb · · Score: 1
      The reason Sun did the CMT chips was the horrible speed mismatch between CPU and memory.

      It dispatches another thread when the current thread has to wait for a cache load. The number of decoders/registers was set by observing the cache stall behavior of real programs.
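
      A rough back-of-the-envelope model shows why switching threads on a cache stall helps. The sketch below assumes each thread alternates between a fixed number of compute cycles and a fixed stall on memory, and that switching threads is free. The function name and the 10/30-cycle figures are illustrative assumptions, not Sun's numbers.

```python
def core_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles the core does useful work.

    Each thread computes for compute_cycles, then waits stall_cycles for
    memory. While one thread waits, another may run, so utilization grows
    with the thread count until the core is saturated.
    """
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads * per_thread)


one_thread = core_utilization(1, 10, 30)    # core idles while memory is fetched
four_threads = core_utilization(4, 10, 30)  # stalls overlap with other threads' work
```

      Under these assumed numbers a single thread keeps the core busy a quarter of the time, while four threads are enough to saturate it.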

      --dave

      --
      davecb@spamcop.net
    40. Re:Programming isn't up to it by lgw · · Score: 1

      I've heard of lots of places that frown on using the debugger.

      That's just ... broken. How else do you determine that an API is not working as advertised? I can see a new programming shop thinking they can get away with this, but in a mature programming environment -- that is, an environment where you're mostly working on code that was written by your company, but the original authors are gone or don't remember anything about it -- you're *always* working with APIs that don't quite work as advertised!

      You're conflating strong typing with static typing. Strong typing just prevents type errors

      That's a good point! Strong typing is handy that way, because it abstracts "is this argument the right type" into the language, with very little downside. Static typing is wonderfully useful, however (except when the language fails to provide adequately for polymorphism) in that it *forces* type documentation into the API. That was an awesome payoff, once the details had been made correct through years of experimentation.

      Concurrency support in a language that's merely analogous to strong typing doesn't seem worth the downside of having a new language: you should be able to achieve the same results with library code. Concurrency support that's analogous to static typing could be a big win -- again by *forcing* the concurrency description out into the open -- but only after those same years of experimentation to prove that the language facility is adequate.

      Void* is a bad example because in C

      The most important thing about void* was the ability to provide untyped access in those places overlooked by the language designer, where the language's type mechanism wasn't up to the task for your corner case. This is of the utmost importance, because otherwise you just can't use the language to solve that corner case problem, which would mean the language wouldn't be adopted in the first place by many shops.

      There is no systematic way of specifying concurrency in those systems either.

      The problem with C, of course, is that there are too many ways to do any given thing, so there are few "systematic ways" of doing *anything* that the compiler can depend on. This may seem like a bad thing, but it's for this very reason that C was so widely adopted while most new languages languish in research environments.

      The more you minimize the path between specification and code, the smaller the chance for error.

      Sure, no argument. But you can get that same win by simply having a shop standard on how concurrency will be done with C (or, better, with C++, where you can easily encapsulate your way of formally specifying concurrency). You don't get the benefit of the compiler checking for errors, but you get the full benefit of the design-to-code transition.
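
      As a sketch of what a "shop standard" kept in library code might look like (shown in Python rather than C++ purely for brevity; parallel_map is a hypothetical helper, while ThreadPoolExecutor is the standard-library class): every piece of concurrent work goes through one small wrapper, so the concurrency model is stated in one place even though no compiler checks it.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(func, items, workers=4):
    """The one sanctioned way to fan work out across threads.

    Results come back in input order, so callers can treat this like a
    plain map() and never touch threads or locks directly.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


def square(n):
    return n * n


squares = parallel_map(square, range(5))
```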

      --
      Socialism: a lie told by totalitarians and believed by fools.
    41. Re:Programming isn't up to it by lgw · · Score: 1

      Didn't you see the recent Slashdot article on the new generation of supercomputers that build their own processors from gates/FPGA for the purpose at hand? ;)

      As far as the high-level/low-level language thing goes, the advantage of high level languages is only up-front. Any facility you like in a high level language you can write as library code for a low-level language. The problem, of course, is that you have to spend the effort to do that. It's just like the argument for open source: the upside is, you have the source and can maintain it to your needs and schedule, the downside is you have to maintain it!

      The problem with high-level languages, of course, is that when trying to use them for anything "real" (like running the equipment described in that paper) you still have to deal with the bytes-and-bits underpinnings in a low-level way.

      If you have a high level language taking care of the generalities, and C/C++ taking care of the specifics, you could argue all day about which one is the "special purpose" language, but that's just semantics. Are you able to write device-level controls purely in Erlang, or does it wrap low-level libraries that do the actual bit-bashing?

      There's nothing at all wrong with the latter approach, it has many advocates. However, please recognize that the reverse approach is equally valid! There's also nothing wrong with delivering the important abstraction that a given high-level language provides as a library for a low-level language.

      Functional languages are cool and all, but I've yet to see any broad functionality that doesn't make just as much sense (and is just as descriptive) when built on C++, for example.

      --
      Socialism: a lie told by totalitarians and believed by fools.
  21. how much for the best of both worlds? by nounderscores · · Score: 1

    If price was no object, someone could design a chip with more than two cores in it, and each core still ran as fast as any single core chip out there.

    Just the existence of one such device would heal the rift immediately. Everyone would say... aha! It is only a matter of time before blazing speeds and hardware threading come to the desktop.

    1. Re:how much for the best of both worlds? by InvalidError · · Score: 4, Informative

      Hardware threading has been mainstream for more than two years in the form of HyperThreading.

      Simultaneous Multi-Threading is a CPU's ability to concurrently execute mixed instructions from multiple threads. Intel's HT is simply 2-way SMT.

      Chip Multi-Threading is a CPU's ability to hold execution states for multiple threads, executing instructions from only one of them at a time unless the chip is also SMT.

      In Sun's case, the mid-term plan is to eventually offer 8-way SMT with 32-way CMT: the CPU can hold states for up to 32 threads and have in-flight instructions from as many as eight of them.

  22. Don't worry by StupidKatz · · Score: 2, Informative

    You can have your parallel processors and still play DOOM III at insane fps. At worst, it will just take a bit for folks to start writing programs to take advantage of the additional processors/cores.

    BTW, your "average" user hasn't even played DOOM I, let alone DOOM III. Surfing the web and using e-mail doesn't usually put a lot of strain on a PC.

    1. Re:Don't worry by 'nother+poster · · Score: 1

      Obviously you've never seen me web surf. ;) Two or three instances of firefox open with multiple tabs each. Memory use maxed, harddrive clacking away...

    2. Re:Don't worry by Anonymous Coward · · Score: 0

      I always use tabs in single-mode. I suppose I can imagine why someone would want multiple windows open, but that's definitely not "average". ;)
      -
      SK

      (I loathe the anti-bot test.)

  23. Missing the point by Anonymous Coward · · Score: 1, Insightful

    All of these recent articles about multi-cores and multiple pipelines of execution seem to miss the real value of this technology: the provisioning of multiple virtual machines in real time on the same system. While most software will never use the multi-thread, multi-CPU capabilities of even the quad-core AMD, products like VMware are now allowing you to dynamically provision systems on demand to deal with load. Another great use is for server consolidation; instead of 10 1U racks to handle web farming, try a 16-way box that can provide a single point of reliability, management and execution for those services. This is about horizontal scaling in a vertical fashion.

  24. OLTP systems by bunyip · · Score: 2, Informative

    Now of course, the room was full of Sun infrastructure weenies, so if there's something terribly obvious in records management or airline reservations or payroll processing that doesn't parallelize, we might not know about it.

    Well, since I work in airline reservations systems, I'll add my $0.02 worth...

    Most OLTP systems will benefit from CMT and multi-core processors. We had a test server from AMD about a month before the dual-core Opteron was announced, we did some initial testing and then put it in the production cluster and fired it up. No code changes, no recompile, no drama.

    IMHO, the single-user applications, such as games and word processors, will be harder to parallelize.

    Alan.

  25. What a totally vague and useless post, yipee! by tomstdenis · · Score: 2, Insightful

    First off, performance + java != good idea. Not trying to camp fanbois here but if you really need "down to the metal" performance you're writing in C with assembler hotspots.

    So the observation that there is too much locking in Java's standard API is informative but not on-topic. The fact that the standard solution is to use a completely new class [e.g. StringBuilder] is why I laughed at my college profs when they were trying to sell their Java courses by saying "and Java is well supported with over 9000 classes!".

    In the C and C++ world things get extended but also fixed at the same time. We can still use the strncat function which has been around for a while EVEN IN threaded environments...

    Also, he totally fails to point out that extra threads [e.g. register sets] only pay off when the pipeline is empty. So it's a catch-22. You either have a very efficient pipeline that you can cram full of a single thread's instructions or you have a shoddy one where your only hope is to mix in other threads.

    Think about it. If you only have one ALU and 32 threads that means each individual thread works at 1/32 the normal speed. Even if they're a lower/higher priority!

    That then gets into two camps. Are you threading because the performance of the pipeline sucks [e.g. dependencies in the P4] or because you want to interleave instructions [e.g. twice the clock rate but half the performance]? If it's the latter then even if you turn off 31 of 32 threads you still end up with one weak ALU.

    Consider the AMD64 for instance. It usually gets an IPC that is pretty high [usually in the 1.5-2.5 range] which means that it's retiring instructions from a single thread at pretty much the entire capacity of the chip. Adding extra threads doesn't help.

    Consider then the P4. It usually gets an IPC of 0.5 to 1 [for ALU code, which is observable by the fact it's about as fast as a half-clockrate Pentium-M]. This means its two ALUs are not always busy and an additional thread could bump the IPC up to the 1-1.5 range.

    I know [for instance] that with HT turned on my 3.2GHz Prescott compiles LibTomCrypt in close to the same time as my 2.2GHz AMD64 [the P4 takes 5 seconds longer, without HT it takes about 15 seconds longer].

    So the only saving grace is an efficient ALU so that you can run single tasks at least somewhat efficiently. Then tacking on the extra threads doesn't help as an efficient ALU won't have many bubbles where other threads could live.

    So you end up with essentially a hardware register file but still 1/2 the performance. Remember that the goal of multi-processing is closer to 'n' times faster with n processors.

    The BEST a single core multi-thread design can hope for is the performance of a single core single thread design...

    Whoopy...

    Multi-threading is NOT the future. Multi-cell is. Where you have dedicated special purpose [re: space optimized] side-cores that do things like "I can do MULACC/load/store REALLY REALLY QUICK!!!".

    In other words, "yet another press release on /.".

    Tom

    --
    Someday, I'll have a real sig.
    1. Re:What a totally vague and useless post, yipee! by Anonymous Coward · · Score: 0

      The BEST a single core multi-thread design can hope for is the performance of a single core single thread design...

      And what's the best a DUAL core multi-thread design can hope for? Which after all is what the article is talking about (well actually OCTA core)

    2. Re:What a totally vague and useless post, yipee! by Anonymous Coward · · Score: 0
      So the observation that there is too much locking in Java's standard API is informative but not on-topic. The fact that the standard solution is to use a completely new class [e.g. StringBuilder] is why I laughed at my college profs when they were trying to sell their Java courses by saying "and Java is well supported with over 9000 classes!".

      I thought Java was great at uni because I didn't know about Perl. Surely the point of APIs and abstraction is that it shouldn't matter what the guts are, the user just uses the StringBuffer without caring whether the JRE puts locks in or not.

    3. Re:What a totally vague and useless post, yipee! by tomstdenis · · Score: 1

      The idea is adding register sets [re: threads] somehow makes the process more efficient. My comment is that if your ALU pipeline is well stuffed another thread won't have the execution resources it needs [and likely just get in the way anyways].

      So if you make a shoddy ALU that stalls a lot another register set can get you better performance overall [but not for individual threads] and if you make a good ALU your extra register set in hardware buys you VERY LITTLE.

      A dual core CPU is something else. That's two execution engines which can run in parallel. That's not the same thing as multi-threading [in the sense of HT].

      But you still have a space-time trade off.

      Putting 32 really shoddy ALUs in a core doesn't help if you have a single intensive task [e.g. compiling a large file or bzip'ing a tarball].

      And putting 32 really good ALUs isn't feasible [now] as it takes too much power/space to be reliably implemented.

      Tom

      --
      Someday, I'll have a real sig.
    4. Re:What a totally vague and useless post, yipee! by pedantic+bore · · Score: 1
      Well, before you say a post is vague and useless,
      you should at least read it.



      The Niagara chip has 8 cores, each of which runs as
      many as 4 threads. There's not one ALU, there are 8, so there's an improvement. Can they keep 4 threads busy per ALU? Maybe, depending on how often
      each thread must go to memory. Every time one thread stalls on a load or store, the other threads have an opportunity to execute a bunch of ops.



      If your workload consists of one thread, this does you no good. If your

      --
      Am I part of the core demographic for Swedish Fish?
    5. Re:What a totally vague and useless post, yipee! by tomstdenis · · Score: 0, Flamebait

      If you were at uni for "computer science" your profs did you a disservice.

      Tom

      --
      Someday, I'll have a real sig.
    6. Re:What a totally vague and useless post, yipee! by tomstdenis · · Score: 1

      Still a space/time problem. Will your four ALUs be better [in terms of efficiency] than my one high performance ALU?

      If it takes you 2x the area to get 2x the performance you've entered into a "no duh" region.

      If your multi-threaded 1x the area core gets >1x the performance then you have something to talk about.

      In the AMDX2 case the dual-core is faster but nobody is saying "gee whiz that must be some new ideas there!" it's really a bigger chip with more transistors...

      The future of computing lies not in "what's the biggest we can build" but in what's the most efficient.

      Tom

      --
      Someday, I'll have a real sig.
    7. Re:What a totally vague and useless post, yipee! by Anonymous Coward · · Score: 0
      'First off, performance + java != good idea. Not trying to camp fanbois here but if you really need "down to the metal" performance you're writing in C with assembler hotspots.'

      You're clueless (and after previewing I find you're now at +5 Insightful, nice job mods! LOL). There's a ton of heavy-lifting Java server code out there. You should also look at (or better yet, do it) C vs. Java benchmarks with current VMs.

      At any rate, Sun's CMT is aimed at accelerating server applications. If you do a little research, you'll find that "C with assembler hotspots" is just about non-existent in that space.

      The vast majority of cluster-based scientific code contains no assembler, for that matter...though there might be a bit in the math libraries.

    8. Re:What a totally vague and useless post, yipee! by cahiha · · Score: 1

      Not trying to camp fanbois here but if you really need "down to the metal" performance you're writing in C with assembler hotspots.

      And you are going to hand-tune your assembly language hotspots and your C code to work on every single variant of the x86 architecture? I think not. Java isn't the answer to high performance computing, but neither are C or assembly. Some kind of JIT is likely the future, together with a high-level language that sucks less than Java.

    9. Re:What a totally vague and useless post, yipee! by tomstdenis · · Score: 1

      Chances are if I have a very expensive and dedicated task for which I need a $4000 processor ... I'll know which ISA to optimize for.

      Tom

      --
      Someday, I'll have a real sig.
    10. Re:What a totally vague and useless post, yipee! by tomstdenis · · Score: 1

      Because developers are looking for the shortest TTM doesn't mean that hand-crafted assembler is moot.

      The fact of the matter is the VM occupies both cpu time and memory. Unless you implement the VM in hardware in which case what's the difference? You could just write an x86-vm...

      Tom

      --
      Someday, I'll have a real sig.
    11. Re:What a totally vague and useless post, yipee! by swillden · · Score: 1

      Putting 32 really shoddy ALUs in a core doesn't help if you have a single intensive task [e.g. compiling a large file or bzip'ing a tarball].

      But a modern computer is rarely doing just one thing, and many tasks that we currently view as inherently serial (like compilation) can be parallelized, to some degree. Plus, as others have said, this is Sun, and they're thinking about servers. I think desktops can benefit as well, but there's no question that servers benefit from parallelism.

      And putting 32 really good ALUs isn't feasible [now] as it takes too much power/space to be reliably implemented.

      The idea is that you can put more ALUs on the chip if they're simpler and don't have all of the complexity required to support out-of-order execution.

      I don't know that this notion is going to be the future, but it's pretty clear that progress on single-core processors is slowing, so I won't be surprised if we have to start moving to greater parallelism to continue improving system performance.

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    12. Re:What a totally vague and useless post, yipee! by aminorex · · Score: 1

      Does it make you feel important and big to degrade people? Perhaps you should become a federal law enforcement agent or a military interrogator.

      --
      -I like my women like I like my tea: green-
    13. Re:What a totally vague and useless post, yipee! by tomstdenis · · Score: 1

      No, everyone should be pansy little asses, never question anything and always put up with substandard quality.

      I'm sorry, but computer science is not about the latest thing java can hide from you like how to manipulate strings. I'm sorry ... THAT'S WHAT THE PROGRAM IS ABOUT!!!

      It's like saying "oh don't learn calculus, just put the question into Magma and use the result."

      Things like Java and C and what not are good to know, but are not a computer science degree.

      If you walk out of university with no understanding of how a cpu works [at least from the ISA standpoint] or how to implement a sorting algorithm or a searching algorithm or etc... then you're effectively useless...

      People sit and bitch about low quality software all the time [specially on /.] but then never stop to question the people writing it or their half-ass lazy professors from college/university.

      Not to say all school is bad. In college we did learn about searching/sorting/compilers/assembler/etc... so the course load was well rounded. I just hate the new programs which tend to focus solely on say doing everything in Java or using existing libraries for all tasks.

      Tom

      --
      Someday, I'll have a real sig.
    14. Re:What a totally vague and useless post, yipee! by m50d · · Score: 1

      JIT is a) sucky and b) already done, since x86 is usually just RISC pretending.

      --
      I am trolling
    15. Re:What a totally vague and useless post, yipee! by tomstdenis · · Score: 1

      "The idea is that you can put more ALUs on the chip if they're simpler and don't have all of the complexity required to support out-of-order execution."

      Trans...may....tah....

      They got eaten alive on that issue.

      Turns out for general purpose software you do need an OOO engine. The only time you can really get away with it is if you are really hardware specific in your code [e.g. a cell processor] where you know the delays of memory/execution resource and can schedule the code effectively on your own.

      Their thread design may be suited for servers but if the servers are using off the shelf code [e.g. php on apache with some cgi in C/php/etc] then you really need a kickass compiler and tightly bounded hardware or an ALU that can work well on the fly.

      And again, threading is not about duplicating execution resources [e.g. an ALU] but about providing multiple register sets to flow through the core. If your ALU is well built and getting a high IPC [e.g. power efficient] then threading won't help.

      What does help servers is multiple execution cores, tightly coupled memory and high bandwidth low latency disk.

      Tom

      --
      Someday, I'll have a real sig.
    16. Re:What a totally vague and useless post, yipee! by philipgar · · Score: 1

      The problem is that no matter how efficiently you design your ALU, you cannot get around the fact that a significant portion of many applications' runtime is spent waiting on the memory hierarchy. Main memory accesses are expensive (in terms of cycles). Current trends are pushing clock speeds up, and so the processor/memory gap will continue to widen.

      This is true because P ~= A*C*V^{2}*F where A=die area, C=capacitance, V=voltage and F=frequency. Traditionally area was the one constant in this equation (making larger dies doesn't make economic sense). However now P is the real constant. F just has to be scaled back in its growth rate, and within 5-10 years (when V is likely to stop decreasing) it will be scaled back further.
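
      Plugging numbers into that proportionality shows the trade-off. The sketch below just evaluates A*C*V^2*F relative to a baseline chip; the function name and the scaling factors are illustrative assumptions.

```python
def relative_power(area, capacitance, voltage, frequency):
    """Power relative to a baseline chip, using P ~ A * C * V^2 * F.

    Every argument is a scaling factor against the baseline (1.0 = unchanged).
    """
    return area * capacitance * voltage ** 2 * frequency


# Doubling the clock alone doubles power.
faster_clock = relative_power(1.0, 1.0, 1.0, 2.0)

# Two cores (twice the area) at half the clock fit the original power budget.
two_slow_cores = relative_power(2.0, 1.0, 1.0, 0.5)
```

      With P held constant, extra area can be traded for clock rate, which is the case for spending transistors on more cores or threads instead of a faster single core.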

      However we will see increasing processor frequencies. With that will be a larger processor/memory gap. And while some applications are cache aware, most programmers are not going to program with cache in mind, and even when they do L1 through Ln cache levels have a latency associated with them. Allowing 4 threads to quickly shuffle around the core on an L1 cache miss just makes sense. Regardless of how awesome your ALU is this will improve performance. While 4 threads per processor may not yield more performance than 3 or 2 (depending on what the threads are doing), the performance of the system as a whole is no worse than it would otherwise be.

      I for one welcome these new multi-threaded monsters.

      Phil

    17. Re:What a totally vague and useless post, yipee! by Anonymous Coward · · Score: 0
      "Because developers are looking for the shortest TTM doesn't mean that hand-crafted assembler is moot."

      It doesn't mean it's worthwhile either. Many games are coded with no assembler these days, never mind enterprise apps.

      "The fact of the matter is the VM occupies both cpu time and memory. Unless you implement the VM in hardware in which case what's the difference? You could just write an x86-vm..."

      Memory isn't much of a concern these days (for the vast majority of apps) and is becoming less so as things move to 64-bit. If you were up on the current state of the art, you'd find that some of the hotspot optimizations in current Java VMs aren't possible with static compilation, and they do make a significant difference. Run-time inlining for instance...

      As I said before, do some benchmarking.

    18. Re:What a totally vague and useless post, yipee! by Anonymous Coward · · Score: 0

      "What does help servers is multiple execution cores, tightly coupled memory and high bandwidth low latency disk."

      The first two items are what Niagara is all about. The third, well, depends on the site, right?

      I still don't think you understand that CMT is multi-core, each core running multiple threads, with an OS that's optimized for that work, and hardware that's optimized for it. You keep talking as if this is one core with 32 threads; it's not, it's 8 "execution engines", each capable of four threads, with a simple pipeline. You don't have to code specially for it, and as long as the OS has enough threads to throw at it, it will chew them up quite efficiently.

    19. Re:What a totally vague and useless post, yipee! by Anonymous Coward · · Score: 0

      Yeah, a processor that runs eight threads at 2 GIPS will never match a single-threaded 16 GIPS processor that has 1/16 the memory latency of the former.
      The problem is, the latter doesn't exist.

      Without the latency gain the eight 2GHz cores will beat the 16GHz core hands down in parallel applications. There's no way the single core processor can keep as many simultaneous memory accesses on their way unless it has a dedicated set of scatter/gather instructions.

      Of course the multi-cell thingy might still be a better idea..

    20. Re:What a totally vague and useless post, yipee! by lgw · · Score: 1

      Not to disagree with you, but there should be different programs for different careers, no? Some people go off to write OSs, programming languages, or other low-level code, and *need* to understand how everything really works. Most will spend a career nailing together existing Java libraries to solve the same problem for yet another customer. It doesn't hurt to be able to learn that in school as well.

      Ideally the latter would be taught by the business school, to avoid confusion, but you can't have everything.

      --
      Socialism: a lie told by totalitarians and believed by fools.
    21. Re:What a totally vague and useless post, yipee! by rsynnott · · Score: 1

      Have a look at a P4 of some sort, sometime, one of the ones with hyperthreading. For many tasks, that will disprove "The BEST a single core multi-thread design can hope for is the performance of a single core single thread design..."

      --
      Me (Blog)
    22. Re:What a totally vague and useless post, yipee! by argent · · Score: 1

      Turns out for general purpose software you do need an OOO engine.

      An OOO engine lets you extract concurrency from non-concurrent code.

      Multiple register sets and multithreading let you take advantage of concurrency in concurrent code.

      Concurrent code is much harder than linear code, so you get a win on more code from OOO than from CMT.

      One reason that Sun is so big on multiple register sets is that they got really good at them early on. The SPARC design pretty much needs them: otherwise there are so many registers to flush on a context switch, thanks to the register windows.

    23. Re:What a totally vague and useless post, yipee! by pedantic+bore · · Score: 1
      Gadzooks, what did I do to get that formatting?

      Anyway, the last paragraph should be:

      If your workload consists of one thread, this does you no good. If your workload has eight threads, you're golden. If the processor speed is 1.4GHz then this is on the order of 11.2GHz. If your workload has more than eight threads, you'll get more than eight CPUs' worth of speed if there's something to do in between memory accesses.

      --
      Am I part of the core demographic for Swedish Fish?
  26. Re:Windows Articles, Slashdot and Pragmatism by Anonymous Coward · · Score: 0

    Grow up, it's you who constantly take cheap stabs at Linux.

  27. We all are by 3770 · · Score: 1

    We all are.

    If one of your favorite applications happens to be multithreaded then that's gravy.

    But you'll benefit anyway. If you bring up your process list you'll see that you have probably at least 10 processes. These will now be able to run independently.

    Also, the Windows kernel itself can benefit from hardware threads.

    --
    The Internet is full. Go Away!!!
    1. Re:We all are by tomstdenis · · Score: 1

      This is total f'ing hype. If you have an efficient ALU multi-threading won't help crap [in the hardware front, it does in software where you may have blocked threads, etc...].

      Think about it this way. You have one car that can carry you and your buddies to work at 50mph and two cars that can take you and your buddies to work at 30mph.

      Sure the two cars let you do independent things but when you're working on one task [getting to work] you're not ahead.

      In a video game context for instance, you do have multiple threads but the big ones are where 99% of the time is spent [e.g. AI, TL, models]. Giving EQUAL processing resources to something as trivial as audio or network code isn't very smart.

      Hyperthreading only pays off for the Intel P4 because the ALU is so notoriously weak that it has the bubbles in the pipeline that another thread can fill.

      This isn't true about all processors. Sure HT could work with the AMD64 but you'd see such a marginal [if any] improvement that the size increase would make it cost ineffective.

      Tom

      --
      Someday, I'll have a real sig.
    2. Re:We all are by 3770 · · Score: 1

      My point was, you'll benefit from multiple hardware threads (dual cores or more) even if your applications aren't multithreaded.

      Do you disagree with that?

      --
      The Internet is full. Go Away!!!
  28. What doesn't scale (and what does) by davecb · · Score: 1
    Last year I was at a big commercial shop, looking at performance of a bunch of billing-like programs, and noticed:
    • Some older C, C++ and embedded-SQL programs are written without consideration of parallelization: they're single-process single-thread.
    • If the customer is large, the majority of the single-process single-thread programs have been rewritten to allow one to run multiple instances, so they can use more than one CPU.

    The latter can scale on multi-processors, and mostly do. Much of our performance work centered on finding out how many processes to run, and whether to group them all on one processor board to get short memory access times. Plus fixing obvious things, like O(n^2) algorithms.

    In my personal opinion, the considerations for older programs are as follows:

    1. Can you change the start-up of single-process single-thread programs to split up the input data and run multiple instances?
    2. Are there any bad algorithms in use, such as singly-linked lists for large data stores? This has nothing whatsoever to do with CMT on first glance, but turns out to be a limit on the performance you're using multiple instances to achieve!
    3. Is there data shared between the instances? If so, you will have to add locking, which is slowish on large multiprocessors, and arguably faster on CMT processors with very good memory locality.

    So: adding CMT makes it a good idea to parallelize older programs, O(n^2) algorithms in CMT or multi-CPU programs are every bit as bad as in uniprocessor programs, and introducing locking is bad, but locking on CMT needs to be measured against regular multiprocessors to see if it's going to be better (my speculation) or worse.
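
    A minimal sketch of point 1 above, in Python and with a made-up workload. The names split_input, bill, and run_instances are hypothetical and do not come from the billing programs described. The input is cut into contiguous chunks and each chunk goes to its own worker process, standing in for one instance of a single-threaded program.

```python
from concurrent.futures import ProcessPoolExecutor


def split_input(records, n):
    """Cut records into n contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(records), n)
    chunks, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


def bill(chunk):
    """Hypothetical per-instance work: total the amounts in one chunk."""
    return sum(chunk)


def run_instances(records, n):
    """Run one worker process per chunk and combine the partial results."""
    with ProcessPoolExecutor(max_workers=n) as pool:
        return sum(pool.map(bill, split_input(records, n)))
```

    The instance count n is the number to find by measurement. On platforms that start workers by spawning a fresh interpreter, run_instances should be called from under an `if __name__ == "__main__":` guard.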

    --dave

    --
    davecb@spamcop.net
  29. Re:well at least he seems to understand the proble by jstott · · Score: 1
    "Problem: Legacy Apps. You'd be surprised how many cycles the world's Sun boxes spend running decades-old FORTRAN, COBOL, C, and C++ code in monster legacy apps that work just fine and aren't getting thrown away any time soon. There aren't enough people and time in the world to re-write these suckers, plus it took person-centuries in the first place to make them correct."

    Well, the Fortran programs have an easy solution---just recompile with a modern compiler designed for these CPUs. Any loop that can be automatically unrolled can be parallelized instead. Loop parallelization has been a standard Fortran optimization on parallel architectures for decades. Yes, this can be done with other languages as well, but historically it hasn't been (I expect either due to a lack of demand, or because it's harder to accommodate language features [things like strict aliasing], or both).

    -JS

    --
    Vanity of vanities, all is vanity...
  30. Re:Windows Articles, Slashdot and Pragmatism by gabebear · · Score: 1

    Wow, I've been noticing some out of place posts on Slashdot for a couple days now but this one just proves Slashdot has a serious problem.

    I'm sure you didn't mean to, but your post ended up showing up as the first post in an article about CMT. What's really weird is that it showed up after a bunch of other posts...

  31. Re:well at least he seems to understand the proble by strider44 · · Score: 1

    But those decade old apps can easily be done by one core in its spare time. I'm not sure why this is an issue.

  32. Shame by gr8_phk · · Score: 3, Interesting

    That's really a shame about the FP performance. My hobby project is ray tracing, and my code is just waiting to be run on parallel hardware. The preferred system would have multiple cores sharing cache, but separate cache would be fine too. Memory is not the bottleneck, so higher GHz and more cores/threads will be very welcome so long as they each have good performance. The code scales well with multiple CPUs as pixels can be rendered in parallel with zero effort - the code was designed for that. As it sits, I'm hoping my Shuttle (SN95G5v2) will support an AMD64x2 shortly. We're still not up for RT Quake, but interactive (read very jerky 1-2 fps) high-poly scenes are possible today.

    1. Re:Shame by Knetzar · · Score: 3, Insightful

      It sounds like you want a cell.

  33. Am I the only one... by roach2002 · · Score: 1

    Am I the only one who thought a bunch of SoftWare Weenies were going to be ready for Country Music Television?

    (Man I'm having a bad case of the Mondays)

  34. Need a breakthrough in hiding concurrency by argent · · Score: 2, Insightful

    Every time someone exposes concurrency at some layer as a way of improving performance, rather than because you're implementing a process that's inherently concurrent, it's a huge clusterfuck. Doesn't matter whether it's asynchronous I/O, out-of-order execution, multithreaded code, or whatever. Even when you're dealing with a concurrent environment like a graphical user interface the most successful approaches involve breaking the problem down into chunks small enough you can ignore concurrency.

    One of UNIX's most important features is the pipe-and-filter model, and one of the really great things about it is that it lets you build scripts that can automatically take advantage of coarse-grained concurrency. Even on a single-CPU system, a pipeline lets you stream computation and I/O where otherwise you'd be running in lockstep alternating I/O and code.
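
    A rough Python analogue of the pipe-and-filter idea, as a sketch only: each filter runs in its own thread and the queues play the role of pipes. The names stage, run_pipeline, and DONE are invented for this illustration. In standard CPython the threads mainly let I/O in one stage overlap with work in another; they do not by themselves spread CPU-bound work across cores.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Apply func to each item from inbox and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Connect one thread per filter with queues, like `a | b | c`."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

    The caller only lists the filters; which stage runs when is decided underneath, which is the sense in which the concurrency is hidden.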

    That's where the big breakthroughs are needed: mechanisms to let you hide concurrency in a lower layer. Pipelines are great for coarse-grained parallelism, for example, but the kind of fine grain you need for Niagara demands a better design, or the parallelism needs to be shoved down to a deeper level. Intel's IA64 is kind of a lower level approach to the same thing where the compiler and CPU are supposed to find parallelism that the programmer doesn't explicitly specify, but it suffers from the typical Intel kitchen-sink approach to instruction set design.

    1. Re:Need a breakthrough in hiding concurrency by Jeffrey+Baker · · Score: 1

      The pipeline is genius, but you'd still like to have concurrency in a single pipe stage. The best example is GNU sort, which does I/O ... sort ... I/O ... sort ... I/O ... sort, alternating quite inefficiently until a final merge. If GNU sort could take advantage of multiple CPUs it would run quite a bit faster, and sorting by divide-and-conquer is one of the most easily-parallelized processes.
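
      A sketch of the divide-and-conquer structure in Python, not a description of how GNU sort is implemented: sort the chunks independently, then do one k-way merge. The function name parallel_sort and the default of four workers are assumptions. A thread pool keeps the example short; in standard CPython a process pool or native code would be needed to see a real speed-up on CPU-bound sorting.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def parallel_sort(data, workers=4):
    """Sort chunks independently, then combine them with one k-way merge."""
    if not data:
        return []
    size = -(-len(data) // workers)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(sorted, chunks))
    return list(heapq.merge(*runs))
```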

      I'm not too excited about the multi-threading aspects of this new Sun processor, but I'm definitely happy about 8 cores, 3MB cache, and 4-channel memory interface. I think it will be a parallel-sorting monster.

    2. Re:Need a breakthrough in hiding concurrency by argent · · Score: 1

      The best example is GNU sort, which does I/O ... sort ... I/O ... sort ... I/O ... sort, alternating quite inefficiently until a final merge. If GNU sort could take advantage of multiple CPUs it would run quite a bit faster, and sorting by divide-and-conquer is one of the most easily-parallelized processes.

      Sorting is one of those areas where heroic measures are worthwhile, because improvements are such a huge win, and the operation is conceptually simple and widely usable.

      Sorting is also a big problem for the UNIX pipeline because the input has to complete before any output can start. It's kind of an exception to my general rule: concurrency for performance reasons is a nightmare, but some nightmares you just have to deal with.

    3. Re:Need a breakthrough in hiding concurrency by MenTaLguY · · Score: 1

      Basically that breakthrough is already here -- functional programming. Unix pipes, for example, are roughly equivalent to monads in functional languages. Since concurrency (and even order of execution) are largely unspecified in functional programs, the compiler has a tremendous amount of latitude to pursue parallelization.

      Of course, we've not really seen widespread adoption of functional programming (as done in Haskell, Erlang, etc) because most programmers haven't been trained that way, and it's only in recent years that practical tools like monads (and more recently arrows) have been well-understood or available.

      Also, not all algorithms are really very parallelizable.

      --

      DNA just wants to be free...
    4. Re:Need a breakthrough in hiding concurrency by argent · · Score: 1

      Basically that breakthrough is already here -- functional programming.

      Well, sorta. I know intellectually that FP is good, but none of the FP languages I've played with have really grabbed me the way UNIX pipes did. Syntax matters, otherwise people would be just as happy with SQL as with UNIX pipes and filters. After all what's the difference between

      SELECT files FROM filesystem WHERE directory IN (SELECT home FROM passwd WHERE user IN (SELECT user FROM diskhogs));

      and

      fgrep -f diskhogs /etc/passwd | awk -F: '{print $6}' | xargs ls -l

  35. Hdw multi-thread vs multi-CPU by Intron · · Score: 1

    Isn't the big issue cache? On a multi-CPU system running one thread per CPU, each thread has its own cache. On HMT, the cache is shared. Threads running in different sections of code on different data will tend to reduce cache hits, offsetting the performance gain of the multiple threads. The limit on increasing the number of threads is that most of the threads will be waiting on cache misses.

    --
    Intron: the portion of DNA which expresses nothing useful.
    1. Re:Hdw multi-thread vs multi-CPU by Anonymous Coward · · Score: 0

      Threads are used to operate on the same data. Processes are used to operate on different data. At least, that's the idea...
      As long as the kernel schedules threads from the same process on the same CPU and threads from different processes on different CPUs, cache misses are not going to be that big a problem. At least, as long as the programmers don't use threads when they should use processes and vice versa.

    2. Re:Hdw multi-thread vs multi-CPU by Anonymous Coward · · Score: 0

      Multiple threads touching the same data normally have to be protected by mutexes, which prevent all but one from running. As pointed out in the article, this limits performance. The only time multiple threads are useful is when operating on different data.

    3. Re:Hdw multi-thread vs multi-CPU by Anonymous Coward · · Score: 0

      Cache is not a problem. Why? Because when you're one of 32 threads, and you get a cache miss, hopefully the data will be back in cache next time the CPU runs you.

      Cray has been doing this for a long time. As long as your number of threads is greater than the number of cycles for memory access, you don't even need cache!

      dom
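
      The latency-hiding argument can be put as a small idealized model. This is a sketch under stated assumptions: each thread alternates a fixed number of work cycles with a fixed memory stall, thread switches are free, and memory bandwidth and bank conflicts are ignored. The function name and the 31-cycle stall in the test values are made up for illustration.

```python
def pipeline_utilization(threads, work_cycles, stall_cycles):
    """Fraction of cycles the core does useful work in an idealized model.

    Each thread alternates work_cycles of computing with stall_cycles of
    waiting on memory; the core switches between threads at no cost.
    """
    busy = threads * work_cycles
    period = work_cycles + stall_cycles
    return min(1.0, busy / period)
```

      With one work cycle per 31-cycle stall, a single thread keeps the core busy 1/32 of the time, and 32 threads are enough to keep it fully busy in this model.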

    4. Re:Hdw multi-thread vs multi-CPU by jelle · · Score: 1

      That holds only if your memory 'cycles' are pipeline delay cycles that don't lock the memory for more than one cycle. For DRAM, that is not the case on a page boundary or switch (CAS precharge).

      If your external memory doesn't have the bandwidth of your CPU core, you will need cache either way.

      --
      --- Hindsight is 20/20, but walking backwards is not the answer.
  36. Re:well at least he seems to understand the proble by archeopterix · · Score: 1
    from TFA:
    I guarantee that whoever wrote that code wasn't thinking about threads or concurrency or lock-free algorithms or any of that stuff.
    Well, perhaps it's a job for the compiler to make that code thread-aware, at least to some degree. Two consecutive function calls that you (the compiler) know to be independent? Execute them in parallel. A loop running over 10000 independent objects? Split it into k loops, 10000/k objects each.

    Of course the compiler has severe limits as to what it can really guess (the "independent" part can be very hard in this aspect), but at least once you write it, you can run it on all your apps for free.

  37. The bottlenecks by davecb · · Score: 3, Interesting
    CMT is a good approach for dealing with the speed mismatch between CPUs and memory, our current Big Problem.

    I'll misquote Fred Weigel and suggest that the next problem is branching: Samba code seems to generate 5 instructions between branches, so suspending the process and running something else until the branch target is in I-cache seems like A Good Thing (;-)).

    Methinks Samba would really enjoy a CMT processor.

    --dave

    --
    davecb@spamcop.net
    1. Re:The bottlenecks by Anonymous Coward · · Score: 0

      I thought Samba had a problem with reader/writer contention on its internal tables. I thought there was going to be some work putting in a lock-free solution but I don't think anything came of it. With 32 cores the contention will get much much worse.

    2. Re:The bottlenecks by Anonymous Coward · · Score: 0

      Samba 4 has solved the problem in a very elegant way (though the solution is not yet implemented very well).

      It's not lock-free, but contention is MUCH lower.

    3. Re:The bottlenecks by ddebrito · · Score: 1

      Motorola was using CMT technology on their timer channel processor inside the 68332 microcontroller (this was designed back in 1986). Each timer channel basically had its own instruction register (as well as other registers). A round-robin process time-sliced through the timer channel processors. This worked great because instructions were pre-fetched in time for the next execution window. It worked great for parallel timing processes because every channel had the same priority and same time resolution. How it would work for database applications that might interact with each other (e.g. need to block each other) would need to be carefully evaluated.

    4. Re:The bottlenecks by CTho9305 · · Score: 1

      I'll misquote Fred Weigel and suggest that the next problem is branching: Samba code seems to generate 5 instructions between branches, so suspending the process and running something else until the branch target is in I-cache seems like A Good Thing (;-)).

      That's close to average... normal integer applications tend to be about 20% branches. That's why branch predictors are important, and why so much research goes into improving them. You can get around 95% accuracy without too much difficulty (higher with the really fancy predictors). If the branch target isn't in the I-cache, it's not a branch problem, it's a cache problem. You either need a good L1 prefetcher, a bigger L1, or code that isn't so bloated ;). (Of course, applications like OLTP tend to not fit in ANY size cache so you're screwed no matter what, but fortunately they parallelize well so you can just throw more slow CPUs at it).
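
      The arithmetic behind those two figures, as a short sketch. The 20% branch fraction and 95% accuracy are the numbers quoted above; the 20-cycle misprediction penalty used in the note below is an assumption chosen only to make the example concrete, and the function name is invented.

```python
def mispredict_cost(branch_fraction, accuracy, penalty_cycles):
    """Average extra cycles per instruction lost to mispredicted branches."""
    mispredicts_per_instruction = branch_fraction * (1.0 - accuracy)
    return mispredicts_per_instruction * penalty_cycles
```

      With 20% branches and 95% accuracy, about one instruction in a hundred is a mispredicted branch; at an assumed 20 cycles each, that is roughly 0.2 extra cycles per instruction.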

    5. Re:The bottlenecks by davecb · · Score: 1
      In a test using samba, branch prediction only saved me a few cycles deciding that the branch around the if-statement or debug macro was going to be taken, but the delay was indeed from filling the i-cache line for the target, which took many many MANY cycles.

      A conventional data prefetcher didn't buy me anything (:-))

      A typical debug macro or multi-line if will take me to the next i-cache line with more than 80% probability.

      --dave

      --
      davecb@spamcop.net
  38. dead end by cahiha · · Score: 2, Insightful

    Threads are actually one of the simplest forms of parallelism to deal with and we have had decades of experience with them. That's why Sun loves them: it fits in well with their big-iron philosophy and hardware and makes it easy for their customers to migrate to the next generation.

    But the future of high-end computing, both in business and in science, will not look like that. Networks of cheap computing nodes scale better and more cost-effectively. Many manufacturers have already gone over to that for their high-end designs. That's where the real software challenges are, but they are being addressed.

    Processors with lots of thread parallelism will probably be useful in some niche applications, but they will not become a staple of high-end computing.

    1. Re:dead end by Spy+der+Mann · · Score: 1

      Processors with lots of thread parallelism will probably be useful in some niche applications, but they will not become a staple of high-end computing.

      Image rendering would benefit A LOT from parallel processing. People who would benefit: Image designers, game players.

      Yeah, the niche is almost negligible. Who needs high speed rendering, anyway?

    2. Re:dead end by convolvatron · · Score: 1

      in the limit, the packaging, power, network, and thermal costs will drive the individual nodes in any cluster to be 'fatter'. it's the same thing as today, but the crossover point will change. there may be some short term bumps due to volume issues, but i think you're wrong. they aren't disjoint architectures but a continuum

    3. Re:dead end by cahiha · · Score: 1

      in the limit, the packaging, power, network, and thermal costs will drive the individual nodes in any cluster to be 'fatter'.

      Quite to the contrary, once you are willing to use clustering, it makes sense to make the individual nodes a bit leaner: you reduce packaging, power, network, and thermal costs that way.

  39. Re:Ready for CMT? Hell no! by NoData · · Score: 3, Funny

    Seriously! And why foist this garbage on the Star Wars (SW) weenies? Has John Williams gone country?

  40. How to make code run fast? by Apreche · · Score: 3, Interesting

    Easy. In present days there are some assembly instructions that can be executed simultaneously. With a chip like this however, all bets would be off. Instead of just a meager few instructions that could be executed simultaneously you would be able to execute any number of instructions simultaneously.

    So if you have a function that say does 10 additions and 10 moves you would first figure out if any of them needed to be done before or after each other. Then see which ones don't matter. Then write the function to do as many at once as possible.
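
    The "figure out which ones must come before which" step can be sketched as a tiny scheduler. This is an illustration under simplifying assumptions, not how any particular compiler or CPU does it: instructions are (dest, sources) pairs, only read-after-write dependencies are tracked, and each destination is assumed to be written once. The name schedule and the register names in the note below are invented.

```python
def schedule(instructions):
    """Assign each instruction the earliest step at which it can issue.

    instructions: list of (dest, sources) tuples in program order.
    Only read-after-write dependencies are tracked, and each dest is
    assumed to be written once (no register reuse).
    Returns a list of steps, each a list of instruction indices.
    """
    ready_at = {}  # register -> first step at which its value is available
    steps = []
    for index, (dest, sources) in enumerate(instructions):
        step = max((ready_at.get(src, 0) for src in sources), default=0)
        while len(steps) <= step:
            steps.append([])
        steps[step].append(index)
        ready_at[dest] = step + 1
    return steps
```

    For a program where a and b depend only on inputs, c needs a and b, d needs c, and e needs only an input, the result groups a, b, and e in the first step, then c, then d.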

    It really doesn't matter for anyone other than the compiler writers. Those guys will write the compiler to do this kind of assembly level optimization for you. The trick is writing a high level language, or modifying an existing one, so the compiler can tell which things must be executed in order and which can be executed side by side.

    --
    The GeekNights podcast is going strong. Listen!
    1. Re:How to make code run fast? by Anonymous Coward · · Score: 0

      All modern CPUs do something very much like that automatically. Please read up on Out-of-order processing and be enlightened.

    2. Re:How to make code run fast? by m50d · · Score: 1

      It relies entirely on the compiler, yes, but writing a good optimising compiler is *bloody difficult*. As you say, the language can help. Making use of this in Haskell should be fairly easy, because the consequences of each function are clearly defined. Making use of it in C (without requiring the programmer to give hints) may very well be impossible.

      --
      I am trolling
    3. Re:How to make code run fast? by fred+fleenblat · · Score: 1

      The thing is, the multi-core chips that we're talking about are designed to run multiple independent streams of instructions. If the cores are all working on the same stream, the overhead of co-ordinating access to cache and registers utterly destroys any parallelism gains.

      The last couple of generations of chips and compilers are actually pretty smart about instruction re-ordering and making full use of the available registers and ALU/FPU's.

  41. let's go verb hunting! by illtron · · Score: 0, Offtopic

    This year, more threads next year.

    Hmm, I can't seem to find one. For arguably one of the tech-savviest sites on all of the Internet, Slashdot contributors have surprisingly awful grammar.

    We hear a lot about the lack of technical education and preparation for engineering and science careers these days, but sometimes it looks like English instruction is just as bad.

    I'm not looking for perfection, and sometimes, like in comments, speed matters more than grammatical accuracy, but when you're submitting a story, it really can't hurt to read it over to make sure it fits elementary school standards.

    P.S., there is no such word as "virii." There, now this is officially off-topic.

    --
    Slashdot: 24 hours behind every other site or your money back!
    1. Re:let's go verb hunting! by rylin · · Score: 0, Offtopic

      FIRST POST!

    2. Re:let's go verb hunting! by deeej · · Score: 0

      there is in my dialect... btw, incase you didn't get the memo, prescriptivism is dead. embrace the new english that are.

  42. 500W power supply? by sgt+scrub · · Score: 1

    "So, given that CMT chips use less watts per unit of computing, why aren't..."

    I think the "requires a 500W power supply" part should answer this question.

    What will this Cell Based system look like? - Our Speculation
    - MotherBoard supports up to 4 Cell Chips
    - Each Cell Chip will have its own Rambus main memory. The memory will be on plug in strips much like DDR etc
    - The Cell Chips on the motherboard will cooperate by means of FlexIO which is a multilane/serial technology.
    - There will be two slots meant for video cards. Similar to AGP but designed for Rambus, not AGP compatible, 10x faster than AGP.
    - All other I/O will be done by means of FlexIO similar to what is now possible with USB - except the system will boot from FlexIO
    - There will be no legacy hardware support - NO PCI, AGP, USB, serial, parallel, PS/2, Ethernet - nothing
    - The power supply will need to be about 500 watts.
    - power management will allow cell chips and parts of cell chips to be powered down when not in use.
    - There will be 16 FlexIO ports coming out the back. 2 in and 2 out for each Cell Chip.
    - Cluster can be created by stacking Cell Boxes and connecting them with the FlexIO cables.

    http://cellsupercomputer.com/power_pc.php

    --
    Having to work for a living is the root of all evil.
  43. My CPU left me, and the Flatscreen died.... by the_weasel · · Score: 1

    Am I the only person who was wondering why slashdot was talking about Country Music Television for a moment there?

    * crickets *

    Time to hand in my nerd badge I guess, and slink off into the sunset.

    Seriously, though - thanks for clarifying the meaning of CMT in the blurb. A big step forward from the usual Slashdot blurb.

    --
    - sarcasm is just one more service we offer -
    1. Re:My CPU left me, and the Flatscreen died.... by jcuervo · · Score: 1
      Am I the only person who was wondering why slashdot was talking about Country Music Television for a moment there?
      I'll do you one better: I thought they were going to start showing Star Wars on Country Music Television for a sec.
      --
      Assume I was drunk when I posted this.
    2. Re:My CPU left me, and the Flatscreen died.... by the_weasel · · Score: 1

      Yep, and from reading the rest of the comments, it seems you were not alone.

      Scary when the headline generates more interest than the actual article contents :)

      --
      - sarcasm is just one more service we offer -
  44. Does anyone read these? by Anonymous Coward · · Score: 0

    Maybe the need for smaller transistors and wires on chips has been fueling the growing nanotech industry, so maybe we should continue working on smaller and faster chips, though they might not be practical.

  45. Already been done... by Anonymous Coward · · Score: 0
    Chuck Moore (inventor of Forth) has been banging out these kinds of systems for years. Take, for example, his 25x Microcomputer chips, which are essentially 25 computers on a 7 square mm die.

    It was designed using his Forth CAD software, probably running on one of his earlier CPUs.

  46. Re:well at least he seems to understand the proble by daVinci1980 · · Score: 1
    Any loop that can be automatically unrolled can be parallelized instead.
    Please unroll the following loop automatically (not FORTRAN, but simple enough to translate):
    int AccumulateLoopCount(int N) {
        /* Running total; every iteration reads and writes it. */
        int accumulator = 0;
        for (int i = 1; i < N; ++i) {
            accumulator += i;
        }
        return accumulator;
    }
    Now make the code parallel.

    (I realize that this solution could actually be computed at compile-time for any known value of N, and I realize that there is a formula to compute this answer in constant time). My point is that just because a loop can be unrolled automatically (this loop can) does not mean that it can be executed in parallel. Executing this code in parallel would result in a *massive* performance hit or a tremendous memory size explosion.
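
    For reference, a sketch of how a loop like this is often restructured by hand. The running total is a loop-carried dependency, so the iterations cannot simply run side by side, but because integer addition is associative the range can be cut into independent partial sums that are combined at the end. This is an illustration in Python, not an auto-parallelizer; the names partial_sums and reduced and the default of four parts are assumptions, and whether a compiler may make this transformation, and whether it pays off, depends on the language rules and the loop size.

```python
def accumulate_loop_count(n):
    """Sequential reference: sum of 1 .. n-1, as in the loop above."""
    total = 0
    for i in range(1, n):
        total += i
    return total


def partial_sums(n, parts):
    """Split 1 .. n-1 into `parts` contiguous ranges and sum each one.

    Each partial sum depends only on its own range, so the ranges could be
    handed to separate threads or cores; here they run one after another.
    """
    bounds = [1 + (n - 1) * k // parts for k in range(parts + 1)]
    return [sum(range(bounds[k], bounds[k + 1])) for k in range(parts)]


def reduced(n, parts=4):
    """Combine the partial sums; valid because integer addition is associative."""
    return sum(partial_sums(n, parts))
```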
    --
    I currently have no clever signature witticism to add here.
  47. Here is a tip for you hardware guys... by dwalsh · · Score: 0, Troll

    If you want us to accommodate your inability to improve single threaded performance and rearchitect 20 years of software for parallel computing, then how about this:

    DON'T CALL US WEENIES! Ya bunch of Verilog writin', pocket protector wearing misfits, who take six months to implement what we can do in five lines of code, and cannot maintain app. integrity even in a single core non-hyperthreaded CPU! (See here: http://www.comp.nus.edu.sg/~abhik/pdf/pact04.pdf).

    Yours Sincerely,
    A Software Engineer.

    --
    ${YEAR+1} is going to be the year of Linux on the desktop!
    1. Re:Here is a tip for you hardware guys... by Anonymous Coward · · Score: 0

      Dear Mr A.S. Engineer

      We will continue to call you weenies as long as you use ridiculous words like "rearchitect".

  48. Performance by ZuggZugg · · Score: 1

    It's funny Sun claimed 15x performance increase with Niagara about 16 months ago, but they never bothered to put that claim into any context. 15x the then 900 MHz SPARC III, I doubt it seriously. I doubt even 15x their low-end SPARC IIe in their now discontinued blades.

    It appears that Sun engineers have hit a MHz wall sooner than the likes of Intel/AMD/IBM and are going extreme parallelism.

    Based on what I've read the Niagara CPU will only be deployed in a single slot server...the only thing it might be useful for is front-end web servers and light-duty app servers. It doesn't sound like FP performance will be too exciting so I doubt it will find its way into renderfarms.

    I would like to see a showdown between the IBM/Toshiba Cell and Niagara.

    It's my opinion that the Sun engineering team are in serious trouble.

    1. Re:Performance by Anonymous Coward · · Score: 0

      I agree, I don't think many people want to go back 5-7 years in single-thread performance.

  49. Can use, not needs! by try_anything · · Score: 2, Interesting

    If single-threaded performance improvements slow down, and the available computing power is spread out among multiple cores, anyone persisting in writing single-threaded code will fall behind in performance.

    Remember the old days when people used fancy tricks to implement naturally concurrent solutions as single-threaded programs? The future is going to be just the opposite. Any day now we'll see a rush toward languages with special support for quick, clear, safe parallelism, just like we've seen scripting languages catch on for web programming.

  50. Sun Fortress, Haskell and Erlang by Anonymous Coward · · Score: 1, Insightful

    and other such languages will become more popular as this new multithreaded world takes hold because they embed the multithreaded concepts into the language without explicit programmer interaction. C, C++, Java style threading and mutex constructs are error-prone and awkward to use.

  51. What is the problem here? by borud · · Score: 1
    Every time someone mentions systems with more processors or more cores, there is a lot of whining from people who think that making software take advantage of more processors is such a monumental task.

    It isn't. And it isn't just scientific data chugging which would benefit from increased availability of actual concurrent processing in typical desktop computers; there are currently many of these PCs that already do things that can be parallelized.

    For instance, image processing. For many kinds of image processing it isn't even hard to partition the problem so that you can make use of more than one processor. I use my PC for processing pictures taken with a digital SLR. A lot of people I know do video editing on a PC, and some even have small home studios for music production centered around their PCs or Macs.

    Even if you are not running multithreaded applications that are heavily CPU-bound, multiple CPUs or CPU cores are useful. Currently my desktop computer runs 108 processes. On cursory inspection, between 3 and 6 of these processes were marked as "runnable", yet I have only one CPU. I'd probably benefit from another CPU or three, even though right now I'm not really doing anything that requires a lot of CPU grunt.

    There is no problem. It isn't as hard as people say to make use of more processors, more cores or more low-level support for multithreading. If anyone is trying to make you believe there's a big problem, you can safely ignore them.

  52. Re:well at least he seems to understand the proble by Sique · · Score: 1

    Because sometimes the sheer amount of data those applications have to calculate has increased. Or because a calculation that once was done once a week during the weekend on several machines with separate data groups in parallel is now done as an instant report at the fingertip of a clueless manager, who just wants the 'numbers to be up-to-date' (of course THIS calculation can be parallelized, but not in an algorithmic way, but by separating independent data).

    --
    .sig: Sique *sigh*
  53. Compilers and "Events Model" by xtracto · · Score: 1

    The main problem with parallelism for the general application is the current model. The "Event Model" that is used nowadays as the basic processing model for applications specifies that the program will stay idle until the user presses a key or moves the mouse (or pushes buttons).

    With this model it is kind of hard to use the multithreading processors. Of course after the user has triggered an action then the program could make use of the threading capability to improve its performance.

    Next comes the problem of deciding "how many threads" one should allow in a program... if one allows too many threads and the processor has just 2 it will be bad, and also the other way around.
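
    A minimal sketch of one way to settle the thread count at run time rather than at compile time, assuming a C++11 toolchain. The helper name pick_thread_count is made up for illustration; std::thread::hardware_concurrency() is only a hint and may report 0.

    ```cpp
    #include <algorithm>
    #include <thread>

    // Hypothetical helper: choose a worker count from what the hardware reports.
    // hardware_concurrency() may return 0 when the value is unknown, so fall back to 1.
    unsigned pick_thread_count(unsigned work_items) {
        unsigned hw = std::thread::hardware_concurrency();
        if (hw == 0) hw = 1;
        // Never start more threads than there are pieces of work, and always at least one.
        return std::max(1u, std::min(hw, work_items));
    }
    ```

    The same binary would then start 2 workers on a 2-thread chip and 32 on a 32-thread chip, without the compiler having to know either number.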

    I think the compilers must be made "thread aware", so they can take the program code and effectively use the processing power. Of course if the program is compiled natively (C, C++, Pascal, etc) we would have the same problem of thread numbers, but if there is a virtual machine (Java and .NET technology) or even an interpreter, the middle layer needs to be thread aware so it can distribute the processes across different threads.

    Of course, the first applications that can take advantage of multithreading are games, as their model is active, but then again the compiler MUST be aware of the multithreading capacities and it should be able to fit the threads the developer wants onto the processor.

    For the general application I think multithreading could be used by changing (or extending) the events model window paradigm, so that one thread waits for the events and another thread could be used to pro-act; this could be achieved by some kind of artificial intelligence development.

    Just today I was daydreaming about how to replace the totally old and awkward menu bar standard interface, specifically for OpenOffice, which has 10 menus with 30 or more submenus... this is a thing that could be improved by some kind of proactive behaviour from the computer (imagine something like an agent that could predict the options you were looking for while using the program... [no, I am not thinking about the !£%!"£@ "feature" of hiding the menu options from MS Office, Windows et al] ).

    Another way to use multithreading could be from the Operating System, so the programs [that do not require] multithreading won't have to deal with it BUT the operating system would use the multithreading capacities to schedule the processes' execution... in this way we may get [AT LAST] a [REAL] multiprocess OS (and not the illusion we have now by quick process switching).

    --
    Ubuntu is an African word meaning 'I can't configure Debian'
    1. Re:Compilers and "Events Model" by putaro · · Score: 1

      Strangely, in the market that Sun is targeted at, the server market, applications are written to be multi-threaded and do not run off an event model because they do not have GUIs!

      Another way to use multithreading could be from the Operating System, so the programs [that do not require] multithreading won't have to deal with it BUT the operating system would use the multithreading capacities to schedule the processes' execution... in this way we may get [AT LAST] a [REAL] multiprocess OS (and not the illusion we have now by quick process switching).

      This is called a multi-threaded kernel and Windows, Linux and Solaris are all set up to do this.

    2. Re:Compilers and "Events Model" by argent · · Score: 1

      The main problem with parallelism for the general application is the current model. The "Event Model" that is used nowadays as the basic processing model for applications specifies that the program will stay idle until the user presses a key or moves the mouse (or pushes buttons).

      But that's good. If the program has no useful work to do... if it's waiting for the user to do something... then it should be idle so it doesn't use CPU time that other programs may need.

      The world is full of programs that busy-wait already, whether they do it with a thread or by setting timers to put extra actions in the event loop. Most of them are not actually doing anything useful with that little bit of business, they're just badly written.

      If a program DOES have work it needs to do, that can actually make a difference to the user, then it should kick off a background process or thread to do it. But right now there's way too many programs that sit there using a few percent of my CPU obsessively checking on the state of something they should be handling in the main event loop.

      Another way to use multithreading could be from the Operating System, so the programs [that do not require] multithreading won't have to deal with it BUT the operating system would use the multithreading capacities to schedule the processes' execution.

      Um, that's what every modern operating system DOES. The last desktop OS that didn't do native concurrent multitasking was Mac OS 9, and Steve Jobs has not only staked it in the heart, he's cut its head off, stuffed its mouth with garlic, and poured holy water and weedkiller on the corpse. And if that's the only good thing to come from the Mac X86 transition, well, it might even be worth it.

  54. Sun needs more raw performance by PureCreditor · · Score: 1

    An UltraSparc that runs 32 threads of CMT, but delivers merely a few hundred MIPS combined, is worse than an IBM Power or AMD Opteron that requires software context switches, but crunches out thousands of MIPS. Sun needs a clearer server/CPU strategy than throwing a whole new paradigm on the table PER UPGRADE CYCLE.

  55. Oh, it's livejournal, that explains it all.... by Anonymous Coward · · Score: 0
    Holy shit! I've seen goatse, I've seen tubgirl, I thought I'd been around teh intarweb a time or two, but dayaaaaam, that is some messed up sheeeit!

    Kudos, sir troll!

  56. Old news for IBM, this is just Sun catching up by The+Mad+Duke · · Score: 2, Interesting

    IBM started SHIPPING Power5 with SMT capability August 31 of last year - IBM has SMT running on 1.9 GHz processors today. Sun is getting farther and farther behind.

    --
    -The Mad Duke
    1. Re:Old news for IBM, this is just Sun catching up by Anonymous Coward · · Score: 0

      IBM started SHIPPING Power5 with SMT capability August 31 of last year - IBM has SMT running on 1.9 GHz processors today. Sun is getting farther and farther behind.

      SMT is like hyperthreading. Nothing to get excited about. I think what you really mean is dual core chips. Both the POWER5 and the UltraSPARC IV are dual core.

      Niagara will have 8 cores and be in production by the end of the year from what I understand. It will be IBM doing the catching up.

  57. I disagree by NigelJohnstone · · Score: 1

    "Sure the two cars let you do independent things but when you're working on one task [getting to work] you're not ahead."

    But you're not, you never are working on only 1 task.

    Look at the threads running on a PC and it's hundreds: you have file cache threads, communications threads, all kinds of stuff running.

    A whole convoy of cars all sitting in one lane waiting for the car in front.
    You keep the speed limit the same, make the highway 8 lane and 8 times the cars can pass through.

    Also you would save the thread state store/recall overhead, as the processor needs prepped for each thread switch. You have only 1/8th of those happening if the chip can run 8 threads at a time.

    1. Re:I disagree by tomstdenis · · Score: 1

      learn...to...profile...

      Yes there are 96 processes running on my computer.

      At any given load maybe 1 of them is active. If I'm bzipping a 900MB tarball it's a single process [* though you can actually split bzip2 into many processes they're still chained...].

      Like how many times are you doing DSP operations while writing gigabytes to disk and communicating with other threads at the same time? Mostly you're either taxing the I/O or you're taxing the ALU.

      Again it's a space/time tradeoff. I don't know how many times I can say this...

      Sure the Niagara has 4 ALUs but are they as power efficient as the single ALU in the AMD64 or say a PPC even?

      Back to the car analogy, you have 4 people going to work. They can take one car that will burn 1 gallon of fuel and get there in 10 minutes.

      In this case they either burn 4 gallons of fuel and get there in four cars [one each] in 10 minutes or they burn 1 gallon and get there in 40 minutes.

      Where this shines is if they all want to go to four different places. Instead of taking a round trip of 40 minutes [10 each say] they take 10+e time [e == delay because of traffic] so say e=10 so 20 minutes.

      How is this an "improvement"? They're proving that using more resources can get better performance. Yipee.

      Can multiple cores improve throughput? You bet! Do multiple threads really help? No way!

      Keep in mind the scale here. A 1000-cycle task switch is NOTHING compared to the 2.2 million cycles a process has [2.2GHz clock] in the typical 1ms style timeslice. 1000 cycles == about 0.05% of the total execution time. In a proper OS though the timeslices would be larger for tasks that are above priority which means the actual time taken to swap tasks is minimal.

      And even still, hardware assist task swapping and "multi-threading" are not the same thing. I'd rather see an efficient ALU with hardware assist [e.g. a local cache or something for that task] than a multi-threaded ALU with lax performance.

      Tom

      --
      Someday, I'll have a real sig.
  58. Link Warning ! by Anonymous Coward · · Score: 0

    That was worse than goatse - my eyes and my ears have been assaulted!

  59. New job - new tools by el_womble · · Score: 1

    Traditional languages that have had threads bolted on like C/C++ make threading more challenging than it needs to be. Java, as long as you understand the principles of concurrency, makes it a breeze. I would be interested to see whether a well coded JVM / JIT could outperform traditional languages on these new CPUs - especially if you could dedicate a couple of the hardware threads to JIT and GC threads.

    --
    Scared of flying, pointy things snce 1979!
    1. Re:New job - new tools by mmusson · · Score: 1
      Traditional languages that have had threads bolted on like C/C++ make threading more challenging than it needs to be. Java, as long as you understand the principles of concurrency, makes it a breeze.

      This is simply not true. The simplified constructs in Java lead to very poor locking patterns in non-trivial code. Complex Java uses the very same advanced algorithms calling the low level primitives that C/C++ would use. Java is more dangerous in one important way. C++ has destructors, and a common locking pattern is to lock in the constructor and unlock in the destructor. This makes it very hard to have a missing unlock, which makes the code easier to maintain.

      In Java you are left with obsessive use of catch/finally to handle all possible cases manually. The syntactic sugar in Java gives the illusion that the code is easier but it is in fact no easier than any other language.
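
      A minimal sketch of that constructor/destructor locking pattern, written with std::lock_guard from the C++11 standard library (which postdates this discussion; hand-rolled guard classes did the same job). The Counter class is hypothetical and exists only to show the lock being released on both the normal and the exception path.

      ```cpp
      #include <mutex>
      #include <stdexcept>

      // Hypothetical shared counter, used only to illustrate the pattern.
      class Counter {
      public:
          // Adds delta; throws on a negative delta to show the lock is still released.
          void add(int delta) {
              std::lock_guard<std::mutex> guard(mutex_);  // locks in the constructor
              if (delta < 0) throw std::invalid_argument("negative delta");
              value_ += delta;
          }   // guard's destructor unlocks here, on normal return or on exception

          int value() {
              std::lock_guard<std::mutex> guard(mutex_);
              return value_;
          }

      private:
          std::mutex mutex_;
          int value_ = 0;
      };
      ```

      If the unlock were missing on the exception path, the next call to add() would hang; with the guard there is no unlock call to forget.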

      --
      SYS 49152
  60. Re:well at least he seems to understand the proble by babble123 · · Score: 2, Informative
    Can I use OpenMP? I

    int AccumulateLoopCount(int N) {
        int accumulator = 0;
        // Each thread gets a private accumulator; OpenMP sums them after the loop.
        #pragma omp parallel for reduction(+:accumulator)
        for (int i = 1; i < N; ++i) {
            accumulator += i;
        }
        return accumulator;
    }
    (I'm not actually an OpenMP programmer, so this syntax might be wrong...)
  61. The author doesn't understand Java class locking by putaro · · Score: 2, Informative

    From the article:

    The standard APIs that came with the first few versions of Java were thread safe; some might say fanatically, obsessively, thread-safe. Stories abound of I/O calls that plunge down through six layers of stack, with each layer posting a mutex on the way; and venerable standbys like StringBuffer and Vector are mutexed-to-the-max. That means if your app is running on next year's hot chip with a couple of dozen threads, if you've got a routine that's doing a lot of string-appending or vector-loading, only one thread is gonna be in there at a time.

    Classes such as StringBuffer and Vector are locked (synchronized) on a per-object basis. As long as you aren't trying to access the same object from different threads you won't block. And if you are trying to access the same object from different threads you will be happy that they were thread-safe!
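
    A rough C++ analog of that per-object locking, offered as a sketch rather than as how the Java classes are implemented. SyncBuffer and fill_two_buffers are hypothetical names; the point is only that each object carries its own mutex, so threads working on different objects never wait on each other.

    ```cpp
    #include <cstddef>
    #include <mutex>
    #include <string>
    #include <thread>

    // Hypothetical buffer whose methods lock a mutex owned by this one object.
    class SyncBuffer {
    public:
        void append(char c) {
            std::lock_guard<std::mutex> guard(mutex_);  // lock belongs to this object only
            data_.push_back(c);
        }
        std::size_t size() {
            std::lock_guard<std::mutex> guard(mutex_);
            return data_.size();
        }
    private:
        std::mutex mutex_;
        std::string data_;
    };

    // Two threads, two buffers: each thread only ever takes its own buffer's lock,
    // so neither thread blocks the other.
    void fill_two_buffers(SyncBuffer& a, SyncBuffer& b, int count) {
        std::thread ta([&a, count] { for (int i = 0; i < count; ++i) a.append('a'); });
        std::thread tb([&b, count] { for (int i = 0; i < count; ++i) b.append('b'); });
        ta.join();
        tb.join();
    }
    ```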

    The performance problems of having these classes being obsessive about thread safety do not result from the locking forcing singlethreadedness. The performance problems stem from the cost of locking objects.

  62. Re:well at least he seems to understand the proble by doppe1 · · Score: 1

    int AccumulateLoopCount(int N) {
        int accumulator = 0;
        #pragma omp parallel for reduction(+:accumulator)
        for (int i = 1; i < N; ++i) {
            accumulator += i;
        }
        return accumulator;
    }

    Very easy to parallelise this: each thread has its own private accumulator, initialised to zero, and the result from each thread is summed at the end. I don't see where this massive performance or memory hit would come from.

  63. Back to the future by jmichaelg · · Score: 1
    Yes a lot of tasks today are essentially serial. Way back in the 32K days, we had to partition our work so it would fit in 24K. The OS took 8K, the memory hogging pos.... At any rate, our tasks were broken down into little pieces of code that loaded serially one after the other. It was hard to imagine what one would do if they had a machine with 256K of memory. Nobody could ever use it! Reading your post reminded me of those days - some of us were hampered by a lack of imagination.

    Some tasks are serial, others can be parallelized. You don't need fancy languages to do it either. To effectively partition tasks into threads using something as archaic as C, you can either fork or you can load different processes. Either way works. The trick is to shift one's thinking from "tasks are serial..." to "how could I speed my code up if I had multiple cpus available?"

    Encoding music to put on some sort of music player that doesn't have replaceable batteries and is headed for landfill 18 months after it's purchased comes to mind. Partitioning the music can be done either by tunes or within a tune if the encoding scheme can be chunked.

    AI is scalable via threading, which means a well laid out game architecture could scale with more hardware threads. A user with a single processor would only get a few smart enemies, a user with a cpu array could see lots of smart behavior such as some of the enemy deciding to flee while other warriors come charging into battle.

    Folks who play with Photoshop or Gimp can easily soak their cpus. A blur operation is the kind of task that can be partitioned to good effect. It's not programming languages that's keeping this from happening as much as not many people have the requisite hardware.
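
    To make the blur case concrete, here is a small sketch that splits an image into bands of rows and gives each band to its own thread. It assumes C++11 threads, and the Image layout, the 3-tap horizontal box filter and the function names are all made up for illustration; real editors use different filters and data layouts.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    using Image = std::vector<std::vector<int>>;  // rows of grey values; hypothetical layout

    // Horizontal 3-tap box blur of rows [first, last); edges reuse the border pixel.
    void blur_rows(const Image& src, Image& dst, std::size_t first, std::size_t last) {
        for (std::size_t y = first; y < last; ++y) {
            const std::size_t w = src[y].size();
            for (std::size_t x = 0; x < w; ++x) {
                int left  = src[y][x == 0 ? 0 : x - 1];
                int right = src[y][x + 1 == w ? x : x + 1];
                dst[y][x] = (left + src[y][x] + right) / 3;
            }
        }
    }

    // Each worker reads the shared source and writes only its own band of dst,
    // so the workers need no locks.
    Image parallel_blur(const Image& src, unsigned num_threads) {
        assert(num_threads > 0);
        Image dst = src;
        const std::size_t rows = src.size();
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < num_threads; ++t) {
            std::size_t first = rows * t / num_threads;
            std::size_t last = rows * (t + 1) / num_threads;
            workers.emplace_back(blur_rows, std::cref(src), std::ref(dst), first, last);
        }
        for (auto& w : workers) w.join();
        return dst;
    }
    ```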

    Personally, when I'm working, I'll be printing, scanning and reading the scanned input simultaneously. For some unknown reason, the printer driver soaks up all the cpu cycles which slows down my reader and scanner software. Being able to allocate the printer driver its own hardware would make the rest of my workflow smoother.

    Some of us will be able to use the horsepower and some of us won't. Not much has changed in the past 40 years.

  64. Don't listen to Sun by photon317 · · Score: 0, Flamebait


    I heard the same talk under NDA a bit over 6 months ago. They're just hyping their warez, it's nothing special. They're talking about multi-core CPUs like what just came out from AMD, and "hyperthreading" like what Intel has had for a while. They're basically playing catchup, and poorly. If they were smart they'd have dumped future plans for the UltraSparcs a few years ago and started transitioning to Solaris on x86s and especially Opterons, and possibly built some fat custom hardware in a similar vein to the SunFire series servers around the Opteron architecture.

    --
    11*43+456^2
    1. Re:Don't listen to Sun by photon317 · · Score: 1

      I should add - yes of course they are selling x86 (And specifically Opteron) boxes with Solaris and Linux on them. But they still consider it a fringe market for edge devices and small webservers for cheapass customers. They don't "Get it", and they still think their big UltraSparc hardware is king and can stay that way for years to come.

      --
      11*43+456^2
    2. Re:Don't listen to Sun by photon317 · · Score: 1


      Flamebait? Give me a break...

      --
      11*43+456^2
    3. Re:Don't listen to Sun by Anonymous Coward · · Score: 1, Insightful

      Sparc playing catch-up? It's x86 that's playing catch-up to the proprietary RISC vendors. UltraSparc IV processors have multiple cores like the new AMD and Pentiums for the past year or two. POWER4 from IBM started shipping with four cores when it came out several years ago. HP's PA-RISC has been dual-cored for a while. I think POWER4 has SMT, and I know POWER5 does. Even before HP and Compaq merged, the next Alpha chip, the EV8 was going to have some impressive SMT, also.

      The only way that x86 is ahead is clockspeed, due to aggressive production technology.

      How can a true Slashdot geek not be looking forward to this? It's something new and different. I'll never own one and possibly never work with one, but I'm curious to see exactly how such a design performs, because it's a lot different from a single 3.6 GHz Pentium 4. Don't you want to at least see how it does before dismissing it? Unless you have stock in Sun or a bizarre emotional investment in processors, what's the harm in Sun spending their money on this product?

    4. Re:Don't listen to Sun by photon317 · · Score: 1


      Bullshit.

      Sun has not been selling multicore UltraSparcs at all, unless they started within the past 4 months or so. They may have been claiming they exist, but they aren't for sale (I haven't looked in 4 months or so, could be now - but not 1-2 years like you claim).

      I wasn't attacking POWER or PA-RISC, only Sun, so there was no need for the rest of your crap, and I won't respond to it other than to say that PA-RISC is virtually dead in the water, and so is HPUX. IBM's POWER architecture may do well with Linux on it, but AIX will eventually go away, IBM's been planning that transition for a while now.

      Linux + Opteron >>>> Solaris + UltraSparc

      I've been there and done it with fileservers, webservers, and oracle database servers. Sun is in denial.

      --
      11*43+456^2
    5. Re:Don't listen to Sun by Anonymous Coward · · Score: 0

      I just did some googling and dual-core UltraSparcs were announced (at least, earliest news story within the top 3 results) in 2002. The general availability date for an UltraSparc IV system was listed on Sun's web site as May 2004. 1-2 years might have been exaggerating, but not much.

      Since when were we talking about operating systems??? I just thought we were talking about processors. You said something about Sun being late to the multi-core and multi-threading party, and I just pointed out that one could argue x86 is the one that's late with the multi-core features.

      I don't care if Niagara succeeds or fails. That's not the exciting thing to me. What's interesting will be to see how/why such a design as Niagara fails, or how/why it succeeds technically.

      Would RF processor interconnects be of no interest to you, or a working asynchronous processor? Maybe such research makes no business sense, and that's something I didn't address from your first post. My argument is, "Why not take an interest in something new, regardless of whatever negative feelings the 'Sun' moniker evokes?"

    6. Re:Don't listen to Sun by Anonymous Coward · · Score: 0

      Yes, SPARC is playing catch-up. Multiple cores is what you do when you have a die-shrink without the engineering bandwidth to make the core faster (like Intel and AMD do). That's why you have the latest PA-RISC chips with multiple cores and 64MB(!) of cache. They couldn't make the cores faster, so they just added more cache and more cores to use the transistor budget.

      Of course, POWER CPUs have never been a single chip, so putting multiple cores in isn't a problem -- especially when you can charge $10,000 for a CPU and use as much power as you want.

      Sun is playing catch-up because they thought they could engineer a faster core but found out too late that they couldn't. So the only thing to do is take working cores and put lots of them on a chip.

      Intel and AMD have actually managed to squeeze more performance out of a single core, so they haven't needed to go multi-core yet.

      dom

  65. You might want to go back to school... by putaro · · Score: 4, Insightful

    and take some advanced architecture courses.

    The BEST a single core multi-thread design can hope for is the performance of a single core single thread design...

    I'm sorry but that turns out not to be the case.

    When you have a system that is running lots of different threads simultaneously the amount of time that it takes to do a context switch from one thread to another becomes an issue. In the real world, threads often do things like I/O which cause them to block or they wait on a lock. If you can do a fast context switch you get back the time that you would have wasted saving registers off to RAM and pulling back another set. Faster thread switching means that your multi-thread single core now runs its total load (all of the threads) faster than a single core single thread design. Also, things like microkernels become a lot more feasible (microkernels are notorious for being slow because context switches are slow).

    When you have looked beyond your desktop machine maybe you'll have earned the right to sneer at your professors. I don't think you're there yet.

    1. Re:You might want to go back to school... by geekdad · · Score: 1

      What you say is true ... however ... The assumption is that through-put/performance is the only advantage to a multi-threaded approach. A well architected multi-threaded system provides another means of partitioning a problem to improve maintainability and scalability.

  66. I didn't see garbage collection in his list by alispguru · · Score: 2, Interesting

    Those of you who are up on the current state of the art here, please help me out. I was under the impression that multiple threads and automatic storage management were still not on good terms with each other, and that this was a big unsolved problem.

    --

    To a Lisp hacker, XML is S-expressions in drag.
    1. Re:I didn't see garbage collection in his list by Anonymous Coward · · Score: 1, Informative

      Sure they are. Just don't use Boehm style GC which usually requires a "stop the world" to perform GC. See my project atomic-ptr-plus for various forms of SMP/CMT friendly GC. I'm currently sporadically working on a RCU-SMR hybrid that obsoletes everything there. It would be less sporadic but I don't have as good funding as Sun, Intel, IBM et al. have.

  67. Hyping Sun Warez by Anonymous Coward · · Score: 0

    They're just hyping their warez , it's nothing special.

    So what you are saying is that MS Office and Adobe Photoshop are available on Solaris for free, but I have to go to some dodgy Russian web site to get it? :-)

  68. Toss Big Hairy Package over the wall? (gulp) by Anonymous Coward · · Score: 0

    I, for one, welcome our new eunuch overlords!

  69. rethink your apps by Anonymous Coward · · Score: 0
    We'll have to rethink many apps. I have been a fan of SMP and of threads for years. Threads are great for things like web servers, audio (multi-channel), PBXs (Asterisk), etc. However, most software today does not take advantage of threads. Good software design that takes advantage of threads (SMP, CMT, hyperthreading, etc.) is difficult and not what most people do. Debugging is a PITA.


    What do threads and CMT buy you when browsing? How about when playing a game? How about using OpenOffice or Office? All of these can run faster if multithreaded, but are they ready?

  70. Re:well at least he seems to understand the proble by maxwell+demon · · Score: 1
    Now make the code parallel.
    Simple: If there are 4 threads,
    • thread 1 adds 1, 5, 9, ... into accumulator_1,
    • thread 2 adds 2, 6, 10, ... into accumulator_2,
    • thread 3 adds 3, 7, 11, ... into accumulator_3,
    • thread 4 adds 4, 8, 12, ... into accumulator_4.
    After all threads have finished, just do accumulator = accumulator_1 + accumulator_2 + accumulator_3 + accumulator_4.
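
    The split above can be sketched roughly as follows, assuming C++11 threads and summing 1 through n inclusive. The function name strided_sum is made up, and the thread count is a parameter instead of a fixed 4.

    ```cpp
    #include <cassert>
    #include <thread>
    #include <vector>

    // Sums 1..n using num_threads workers. Worker t adds t+1, t+1+num_threads, ...
    // into its own slot, so no locking is needed until the final combine.
    long long strided_sum(int n, int num_threads) {
        assert(num_threads > 0);
        std::vector<long long> partial(num_threads, 0);
        std::vector<std::thread> workers;
        for (int t = 0; t < num_threads; ++t) {
            workers.emplace_back([&partial, t, n, num_threads] {
                long long local = 0;
                for (int i = t + 1; i <= n; i += num_threads) local += i;
                partial[t] = local;  // each thread writes only its own element
            });
        }
        for (auto& w : workers) w.join();
        // Combine step: accumulator = accumulator_1 + accumulator_2 + ...
        long long total = 0;
        for (long long p : partial) total += p;
        return total;
    }
    ```

    Keeping a local sum and storing it once at the end also avoids the threads repeatedly writing to neighbouring array elements while the loop runs.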

    Of course, for this specific piece of code, the perfect optimization would be:
    int AccumulateLoopCount(int N)
    {
        int tmp = N>>1;
        if (N%2)
            return N*(tmp+1);   // N odd: (N+1)/2 == tmp+1
        else
            return tmp*(N+1);   // N even: N/2 == tmp
    }
    This code of course is hardly parallelizable ;-)

    If you want explicitly unparallelizable code, you'll have to use a non-associative operation, e.g.
    double nest(double (*f)(double), int n, double x)
    {
        for (int i=0; i<n; ++i)
            x = f(x);
        return x;
    }
    This can still be (partially) unrolled e.g. like this:
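    double nest(double (*f)(double), int n, double x)
    {
        // Four applications per iteration for the first n/4 rounds.
        for (int i=0; i < (n>>2); ++i)
        {
            x = f(x);
            x = f(x);
            x = f(x);
            x = f(x);
        }
        // Remaining n%4 applications; the cases fall through on purpose.
        switch (n&3)
        {
        case 3:
            x = f(x);
        case 2:
            x = f(x);
        case 1:
            x = f(x);
        case 0:
            break;
        }
        return x;
    }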
    However there's no way to parallelize it.
    --
    The Tao of math: The numbers you can count are not the real numbers.
  71. Uncommon apps will be more common by try_anything · · Score: 1
    There are only a few "common apps" because it takes a huge investment of manpower to write apps like OpenOffice or Firefox. I think the potential winners in this will be people who want to create or simply use non-common, GUI-style apps.

    Here's why: Few people want to put in years of effort learning to make production-quality apps, and many aren't able to because they have a hard enough time already keeping up their output on their "real" job. As a result, most cool things with limited audiences are stillborn. Filesharing, web browsing, and word processing programs make it because they have enormous user bases. Make GUI programming simpler, and more cool things will be created for medium-sized audiences.

    I expect GUI library designers to discover ways to make programming easier by using simpler, less efficient models and using extra threads to make up the loss in efficiency. Ingenious people will find a way. Simplifying the task of GUI programming will mean that more creative tinkerers will build real apps instead of quirky, crash-prone prototypes.

  72. Re:Ready for CMT? Hell no! by slashdot_commentator · · Score: 1

    The blame for this outrage should be put on Berman & Enterprise.

    --
    There is no America. There is no democracy. There is only IBM and AT&T and DuPont, Dow, General Electric, and Exxon
  73. Re:PHP and multi-threading by cosinezero · · Score: 1

    Single-threaded webpages are -terrible-.

    I mean, unless you like your data access code holding up the page rendering.

  74. Legacy multi-threaded code by Anonymous Coward · · Score: 0

    I guarantee that whoever wrote that code wasn't thinking about threads or concurrency or lock-free algorithms or any of that stuff.

    There is legacy multi-threaded code out there. I worked on a couple of projects in the 90's in C and C++ that were multi-threaded. pthreads is a C library, not Java, after all.

    That's not to say that most of the legacy code out there is multithreaded. And multithreaded coding requires some serious discipline. The problem boils down to a simple trade off. You have to lock access to anything that is shared between threads. And locking is expensive, so you want to limit when you are doing it.

    1. Re:Legacy multi-threaded code by lgw · · Score: 1

      Legacy indeed. We didn't call it "threading", but I've supported 30+ year old code that dealt with the same issues. It's not exactly a new idea (and I've never found a language simpler than assembly in which to handle locking issues, everything else leaves me guessing what's really happening).

      --
      Socialism: a lie told by totalitarians and believed by fools.
  75. pointless for scientific codes by cahiha · · Score: 1

    As a scientific programmer, all I know is that this will eventually be a huge benefit to all my MPI and OpenMP codes.

    Unfortunately, these kinds of processors are pointless for most scientific applications (there are some exceptions, but not many). Scientific apps are limited by arithmetic units and memory bandwidth, and these processors do nothing to improve them. The Cell processor at least has multiple FPUs, this one doesn't even have that.

    For your MPI codes, you are much better off using a workstation cluster, because unlike with these kinds of processors, you get a separate memory subsystem and a separate FPU with each thread that way.

  76. Anybody else remember Tera? by Doctor+Memory · · Score: 1

    This sounds like the stuff that Tera was working on with their MTA back in the 90s (see this, or for more technical details, here). Basically, a processor that could handle up to 128 threads at a time, with almost zero-latency switching among threads. These processors could be easily interconnected to scale up to whatever the customer (e.g., Sandia, Los Alamos, LLL) wanted. From perusing Cray's website, though, I don't see any current machines that appear to be using that architecture, so I assume it didn't play out somehow.

    --
    Just junk food for thought...
    1. Re:Anybody else remember Tera? by convolvatron · · Score: 2, Informative

      the technology and architecture were beautiful. the execution and business planning were poor. because it was such a huge and underfunded effort to get the whole thing (os, compiler, processor, network) brought up from scratch, they lagged current technology at both of the two introductions. stability was a problem.

      still, a terrible shame. a testament to the failings of the short-term investment model.

      the compiler did automatic parallelization, but only really well for HPC-style loop nests. if you weren't running parallel code, you really suffered, because the individual thread execution rates were so poor, and they ran uncached (one of the nice things about the model is that they used concurrency to hide memory latency, but if you didn't have it to exploit...)

  77. Why shouldn't compilers be able to profit from it? by maxwell+demon · · Score: 1

    Yes, compilers are not parallelizable. But multithreading doesn't necessarily mean parallelization.

    I could for example imagine the parser to run on one thread, and the single-function optimizer on another thread. Every time the parser has finished parsing a function, it tells the optimizer thread, which then immediately starts optimizing it, while the parser starts parsing the next function. The later inter-function optimization passes then work with the pre-optimized functions which were mostly optimized during parse (given that parsing includes getting the source from disk, which will most likely block, I really wonder if a multithreaded compiler would even be an advantage on a single-core system).
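
    A rough sketch of that hand-off, assuming POSIX threads and a small bounded queue between the two stages. "Parsing" and "optimizing" are stand-ins here (the parser just emits function ids, the optimizer just counts them), and run_pipeline, QUEUE_CAP and struct pipeline are hypothetical names, not taken from any real compiler.

    ```c
    #include <assert.h>
    #include <pthread.h>
    #include <stddef.h>

    #define QUEUE_CAP 4          /* parsed functions waiting for the optimizer */

    struct pipeline {
        int slots[QUEUE_CAP];
        int head, tail, count;   /* ring buffer state, protected by lock */
        int done;                /* set once the parser has nothing left */
        int total_functions;     /* how many functions the parser will emit */
        int optimized;           /* touched only by the optimizer thread */
        pthread_mutex_t lock;
        pthread_cond_t not_empty, not_full;
    };

    /* Stage 1: "parse" each function and hand it to the optimizer. */
    static void *parser_thread(void *arg)
    {
        struct pipeline *p = arg;
        for (int fn = 0; fn < p->total_functions; ++fn) {
            pthread_mutex_lock(&p->lock);
            while (p->count == QUEUE_CAP)
                pthread_cond_wait(&p->not_full, &p->lock);
            p->slots[p->tail] = fn;
            p->tail = (p->tail + 1) % QUEUE_CAP;
            p->count++;
            pthread_cond_signal(&p->not_empty);
            pthread_mutex_unlock(&p->lock);
        }
        pthread_mutex_lock(&p->lock);
        p->done = 1;                         /* tell the optimizer to drain and stop */
        pthread_cond_signal(&p->not_empty);
        pthread_mutex_unlock(&p->lock);
        return NULL;
    }

    /* Stage 2: "optimize" functions as soon as they arrive. */
    static void *optimizer_thread(void *arg)
    {
        struct pipeline *p = arg;
        for (;;) {
            pthread_mutex_lock(&p->lock);
            while (p->count == 0 && !p->done)
                pthread_cond_wait(&p->not_empty, &p->lock);
            if (p->count == 0) {             /* queue empty and parser finished */
                pthread_mutex_unlock(&p->lock);
                return NULL;
            }
            int fn = p->slots[p->head];
            p->head = (p->head + 1) % QUEUE_CAP;
            p->count--;
            pthread_cond_signal(&p->not_full);
            pthread_mutex_unlock(&p->lock);

            (void)fn;                        /* real optimization work would go here, outside the lock */
            p->optimized++;
        }
    }

    /* Runs both stages and returns how many functions were optimized. */
    int run_pipeline(int total_functions)
    {
        struct pipeline p = { .total_functions = total_functions };
        assert(total_functions >= 0);
        pthread_mutex_init(&p.lock, NULL);
        pthread_cond_init(&p.not_empty, NULL);
        pthread_cond_init(&p.not_full, NULL);

        pthread_t parser, optimizer;
        pthread_create(&parser, NULL, parser_thread, &p);
        pthread_create(&optimizer, NULL, optimizer_thread, &p);
        pthread_join(parser, NULL);
        pthread_join(optimizer, NULL);

        pthread_mutex_destroy(&p.lock);
        pthread_cond_destroy(&p.not_empty);
        pthread_cond_destroy(&p.not_full);
        return p.optimized;
    }
    ```

    Calling run_pipeline(1000) should return 1000. The queue is kept small on purpose: the parser blocks when it gets too far ahead, which is roughly what you want if parsed functions are expensive to keep around.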

    --
    The Tao of math: The numbers you can count are not the real numbers.
  78. Partial Evaluation by sleepingsquirrel · · Score: 1
    Why unroll the loop? A sufficiently smart compiler should be able to turn that into...
    int AccumulateLoopCount(int N)
    {
    return N*(N+1)/2;
    }
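
    A small sketch of what that transformation amounts to, assuming the loop in question simply adds up 1 through N (the original loop is not quoted in this comment, so that is a guess). The names sum_loop and sum_closed are made up for illustration.

    ```c
    #include <assert.h>

    /* Assumed shape of the original loop: add up 1..N, one step at a time. */
    int sum_loop(int N)
    {
        assert(N >= 0);
        int total = 0;
        for (int i = 1; i <= N; ++i)
            total += i;
        return total;
    }

    /* What a sufficiently smart compiler could reduce it to: constant work. */
    int sum_closed(int N)
    {
        assert(N >= 0);
        return N * (N + 1) / 2;   /* fine for small N; the product can overflow for large N */
    }
    ```

    Both functions agree for small inputs, e.g. sum_loop(100) and sum_closed(100) both give 5050.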
  79. Re:well at least he seems to understand the proble by Lumpy · · Score: 1

    Yikes you are so right.

    I just was "promoted" to the "programming guru/ lead IT" position here. the last guy left an utter mess in old VB code as well as ASP that is a broken mess that is working but only barely. management freaked when I suggested we rewrite everything correctly and take advantage of the .NET upgrades as well as migrate some of the ASP to PHP or just simply properly written ASP. The 6-18 months to complete the rewrites is "unacceptable" to them. They did not realize how much of a mess they had on their hands, and I refuse to blow smoke up their butts and lie about how long it will take to fix.

    MANY companies are running that old FORTRAN and COBOL code because management refuses to allocate resources to fix the slightly broken and old code/software.

    This "good old code" is not that good. and usually is overlooked to be fixed because "it's working"

    --
    Do not look at laser with remaining good eye.
  80. AH! AMBIGUITY! by Anonymous Coward · · Score: 0

    Have you not programmed anything threaded in years? Or have you not programmed anything single-threaded in years?

    This is what keeps me up at night.

    1. Re:AH! AMBIGUITY! by MemoryDragon · · Score: 1

      Not anything single-threaded. The main reason is that you are usually forced to multithread anyway, to reduce latency, once the program grows beyond a simple hello world.

  81. Re:well at least he seems to understand the proble by daVinci1980 · · Score: 1
    The original loop I posted can be optimized into constant time like this:
    int AccumulateLoopCount(int N) {
    return N * (N + 1) / 2;
    }
    Which was why I said that the example is not really a great one. (It does represent a class of problems that are hard to parallelize efficiently, though).

    The speed and memory tradeoffs come from the two parallel implementations of the code I posted.

    If you do what you suggest, the performance loss is in the step that you sorta gloss over: "wait for all the threads to finish." Semaphores and mutexes are notoriously expensive to lock and unlock. The original code I posted completed on the order of dozens of clocks per loop iteration. Locking a single mutex or incrementing (or decrementing) a semaphore costs thousands of cycles.

    The memory explosion that I refer to comes from the temporary variables that you've added per thread. Of course, there is the phenomenal overhead of the thread itself (which is on the order of several K), but ignoring that there are now extra temporaries that are necessary; one per thread to avoid having to lock a mutex at each step of the loop.

    Of course, in your implementation (with 4 threads) this is only a 4x increase in the memory requirements of my implementation. However, in an N-thread implementation, this is an Nx increase in memory requirements.

    This is not an insignificant number of cycles or memory increase.
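
    For concreteness, here is a sketch of the kind of parallel version being argued about, assuming POSIX threads and assuming the loop adds up 1..N. Each worker gets its own partial-sum slot (the per-thread temporary mentioned above), so the only synchronization is the join at the end. parallel_sum, struct slice and MAX_THREADS are names invented for this illustration, and error checking on pthread_create is left out to keep it short.

    ```c
    #include <assert.h>
    #include <pthread.h>
    #include <stddef.h>

    #define MAX_THREADS 64

    struct slice {
        long long first, last;   /* inclusive range this worker adds up */
        long long partial;       /* per-thread temporary: no mutex needed in the loop */
    };

    static void *sum_slice(void *arg)
    {
        struct slice *s = arg;
        long long acc = 0;
        for (long long i = s->first; i <= s->last; ++i)
            acc += i;
        s->partial = acc;
        return NULL;
    }

    /* Adds up 1..n using nthreads workers; extra memory grows with nthreads, not with n. */
    long long parallel_sum(long long n, int nthreads)
    {
        assert(n >= 0 && nthreads >= 1 && nthreads <= MAX_THREADS);
        pthread_t tid[MAX_THREADS];
        struct slice work[MAX_THREADS];
        long long chunk = n / nthreads;

        for (int t = 0; t < nthreads; ++t) {
            work[t].first = t * chunk + 1;
            /* the last worker also takes the remainder */
            work[t].last = (t == nthreads - 1) ? n : (t + 1) * chunk;
            work[t].partial = 0;
            pthread_create(&tid[t], NULL, sum_slice, &work[t]);
        }

        long long total = 0;
        for (int t = 0; t < nthreads; ++t) {
            pthread_join(tid[t], NULL);   /* the one synchronization point per thread */
            total += work[t].partial;
        }
        return total;
    }
    ```

    With this layout the per-thread cost is one struct slice plus the thread's own stack, and the join is paid once per thread rather than once per loop iteration.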
    --
    I currently have no clever signature witticism to add here.
  82. Re:well at least he seems to understand the proble by Anonymous Coward · · Score: 0

    Ha. My compiler parallelized that loop automatically. (Can yours?)

    You say the performance hit of running this code in parallel would be "*massive*". Why is that? On my computer, the performance just gets better and better when I add processors. You say that parallelizing this loop would cause a "tremendous memory size explosion." Why would that happen? Are you saying the memory requirements would be worse than O(logN)?

  83. multicell isn't the answer either by cahiha · · Score: 1

    Multi-threading is NOT the future. Multi-cell is.

    I agree that multithreading is not the future, for all the reasons you give. But I don't believe multicell is either: you still have memory and I/O bottlenecks.

    In fact, the future of high performance computing is already here: large amounts of commodity hardware. Every box you add automatically adds not only another CPU, but also a separate memory system and I/O.

    Having said that, multicell doesn't have a future as a general purpose parallel computing paradigm, but it does hold the promise of being able to replace GPUs and other special-purpose hardware that litters our machines right now.

  84. more info on the basis and a link by kpp_kpp · · Score: 1

    still haven't found the tech article but this is similar to what it was talking about...

    "There was no salvation in using more than one steam engine on a single train, except in situations where extra power was needed for only a short distance, e.g., climbing a mountain grade. In normal operations, two steam engines would waste energy fighting against each other."

    http://yardlimit.railfan.net/guide/locopaper.html ...it makes the analogy even weaker but oh well...

  85. Hasn't been true for decades by Paul+Crowley · · Score: 1

    Garbage collection technology has been dealing with this well for a long time. Read The Memory Management Reference - multiple threads are assumed, and the single-threaded special case merits barely a mention. The "mutator" threads can keep running while garbage collection is going on, too - memory barriers are used to protect against race conditions.

  86. This is awesome.... by slapout · · Score: 1

    ...now my code can look like this:

    Thread 1....waiting for user input

    Thread 2....waiting for user input

    Thread 3....waiting for user input

    All running at the same time!!!

    --
    Coder's Stone: The programming language quick ref for iPad
  87. Re:Why shouldn't compilers be able to profit from by convolvatron · · Score: 1

    i'm pretty sure you can evaluate conservative fixed point analysis in parallel if you have a fine-grained machine (like an smt one)

  88. Am I the only one? by MrCopilot · · Score: 1

    Who saw Chewbacca in a cowboy hat?

    --
    OSGGFG - Open Source Gamers Guide to Free Games
  89. Can someone elaborate? by SharpNose · · Score: 1

    That is, how would CAML and Scheme play?

    1. Re:Can someone elaborate? by goertzenator · · Score: 1

      The theory is that functional languages are far more parallelizable than imperative languages like C and Java. In pure functional programming you say what you need done, but the runtime system can do what needs to be done in any order it feels like, including chopping it into little pieces and feeding it to many CPUs. I don't know how many of today's functional languages can be parallelized. Scheme and OCaml have a lot of imperative features that would gum up the works. Languages like Haskell show more promise. Personally, I think imperative languages will fall out of the mainstream in the coming years (decades?). C programs like Apache might work fine on 64 cores, but how well will they work on a million-core CPU? Math is forever, von Neumann-style computers aren't.

    2. Re:Can someone elaborate? by lgw · · Score: 1

      There's no important difference between "functional" and "imperative" languages, when it comes down to this sort of optimization. Or, really, any important difference except the convenience of the syntax (is there anything more painful than C++'s "pointer to member function" syntax?).

      Scheme is cool for illustrating concepts in optimization, because the syntax is so simple you don't lose the point you're trying to make about optimization in a mess of parsing logic. It's all an AST inside the compiler, however, and any "real" language optimizes as well as any other. Don't confuse ease of illustration with functionality.

      --
      Socialism: a lie told by totalitarians and believed by fools.
    3. Re:Can someone elaborate? by Anonymous Coward · · Score: 0

      Compiling a functional language is easier to parallelize than compiling an imperative language. So compilers for functional languages can see a performance boost from this.

    4. Re:Can someone elaborate? by lgw · · Score: 1

      No one cares about the time it takes to compile one file.

      --
      Socialism: a lie told by totalitarians and believed by fools.
    5. Re:Can someone elaborate? by Anonymous Coward · · Score: 0

      I think you fail to understand the fundamental difference between imperative and functional languages. Scheme is not purely functional, which it seems you do not recognize. The distinction between imperative and functional languages is hardly related to syntax either. The reason pure functional languages lend themselves to parallelization optimizations is that code can be moved and rearranged and decoupled from the time element that imperative languages are bound by. There are no side effects whatsoever, so the syntactic order of code does not matter. Look up "referentially transparent" on google.

      Although, with a good inference algorithm one could probably optimize Scheme sufficiently by detecting memory references outside the local scope I think you would still be very hard pressed to implement similar optimizations in C, C++, or Java. So Scheme minus "set!" and "begin" would be ideal and R5RS would still remain easier than fully-imperative languages to optimize for parallelization.
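
      To make "referentially transparent" concrete, here is a small C sketch (C only because that is what the code elsewhere in this thread uses). square is pure, so two calls can be evaluated in either order, or on different CPUs, with the same result; next_ticket mutates hidden state, so the order of the calls changes the answer. All of these functions are invented for illustration.

      ```c
      #include <assert.h>

      /* Pure: the result depends only on the argument. */
      int square(int x)
      {
          return x * x;
      }

      static int counter = 0;

      /* Impure: each call changes hidden state, so call order is observable. */
      int next_ticket(void)
      {
          return ++counter;
      }

      /* Computes left - right, evaluating the operands in the requested order. */
      int diff_pure(int left_first)
      {
          int l, r;
          assert(left_first == 0 || left_first == 1);
          if (left_first) { l = square(3); r = square(2); }
          else            { r = square(2); l = square(3); }
          return l - r;
      }

      /* Same shape, but the operands come from the stateful function. */
      int diff_impure(int left_first)
      {
          int l, r;
          assert(left_first == 0 || left_first == 1);
          counter = 0;
          if (left_first) { l = next_ticket(); r = next_ticket(); }
          else            { r = next_ticket(); l = next_ticket(); }
          return l - r;
      }
      ```

      diff_pure gives the same value whichever operand is evaluated first, while diff_impure flips sign. A compiler or runtime that wants to reorder or parallelize has to prove it is in the first situation, which is the hard part in C.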

  90. But Sun make *servers* by NigelJohnstone · · Score: 1

    "At any given load maybe 1 of them is active. "

    The file cache thread *could* be active, file caching is only done during idle time to let the main thread run better, a simple OS tweak could overlap that better.

    Same with the network, it *could* be overlapped better its only not done now to avoid impacting the front thread.

    I assume the same is true through all the drivers and OS subsystems.

    So yes you may have 1 thread mainly running now, but it doesn't mean you can't gain from this.

    But all this misses the point, Sun make servers, the exact sort of boxes that run hundreds of active threads running the same code serving to multiple users. The perfect thing to benefit from this.

    "A 1000-cycle task switch is NOTHING compared to the 2.2 million cycles a process has "
    Fair comment.

    1. Re:But Sun make *servers* by tomstdenis · · Score: 1

      My point though is there is more to running a server than "conns/sec". Throwing more resources at the problem means taking more electricity to run and to cool [via air cond].

      So it can be more advantageous to efficiently run tasks and round-robin than to have N slow ALUs that run in parallel, because you're not doing cache coherency, clock distribution, etc...

      I don't know the specs of this new design. My point was just to raise several comments for food for thought. Threading just doesn't pay off for efficient ALU designs. Multi-ALU does pay off though, but effectively if you double the ALU and double the registers ... you have a dual core...

      Personally I find what AMD is doing is more interesting. They're reducing power while maintaining IPC and their newer designs [e.g. the X2] compare entirely favourably to the latest Intel offerings [e.g. the intel 84x series].

      Tom

      --
      Someday, I'll have a real sig.
  91. Re:PHP and multi-threading by Anonymous Coward · · Score: 0

    Fascinating. Pray tell, how do you read the user's mind regarding exactly what data they want from a search query so you can do your query concurrently with the rendering the search results? I'm very curious about this exciting new technology.

  92. Why the future of SMT is bleak by spockvariant · · Score: 5, Informative

    I'm a researcher working on high performance computing and have used various configurations of Simultaneous Multithreading (aka Hyperthreading aka CMT) (Intel Xeon, IBM POWER5). The result is always the same - at the end, memory latencies and OS overheads kill most of the gains of instruction level parallelism coming from SMT. Look at it this way - the typical latencies of operations on most modern processors are of the order of 1 nanosecond, whereas DRAM latencies are of the order of 200ns. As long as you can't do anything about this latency, there's no point in cutting down on processing times. There's a very nice paper in this year's ACM SIGMETRICS that gives real experimental data to illustrate this fact - http://www.cs.princeton.edu/~yruan/XeonSMT/smt.pdf The paper shows that the speedups obtained using SMT in practice are meagre. The reason that the simulation results coming from the original UWashington research on the subject - http://www.cs.washington.edu/research/smt/ - looked far better was their use of unreasonably large caches in their simulations, and that they completely ignored the OS overhead of enabling SMT - which is non-negligible - and is a thing that has been pointed out often on the Linux Kernel mailing list as well.
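
    To put rough numbers on that, here is the standard average-memory-access-time estimate, using the 1 ns and 200 ns figures above. The miss rates are made-up inputs for illustration, not measurements from either paper, and avg_access_ns is a hypothetical helper.

    ```c
    #include <assert.h>

    /* Average cost of one memory access in nanoseconds: every access pays
     * the hit time, and a fraction miss_rate also pays the trip to DRAM. */
    double avg_access_ns(double hit_ns, double miss_rate, double dram_ns)
    {
        assert(miss_rate >= 0.0 && miss_rate <= 1.0);
        return hit_ns + miss_rate * dram_ns;
    }
    ```

    With a 1 ns hit time and 200 ns DRAM latency, a 1% miss rate gives avg_access_ns(1.0, 0.01, 200.0) of about 3 ns, and a 5% miss rate about 11 ns. Under those assumptions the memory system dominates long before the execution units do.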

    1. Re:Why the future of SMT is bleak by jelle · · Score: 1

      There is much more in the world than web server performance.

      In applications that actually need to mainly use the CPU, instead of mainly do I/O (web servers, file servers, database servers), the OS overhead is negligible, and the gain to be had from avoiding pipeline stalls can be significant.

      btw: If you have a >1gbit internet pipe to serve your static web pages over, then you can also afford a second system to get more speed.

      --
      --- Hindsight is 20/20, but walking backwards is not the answer.
    2. Re:Why the future of SMT is bleak by CTho9305 · · Score: 2, Insightful

      The reason that the simulation results coming from the original UWashington research on the subject - http://www.cs.washington.edu/research/smt/ - looked far better was their use of unreasonably large caches in their simulations, and that they completely ignored the OS overhead of enabling SMT - which is non-negligible - and is a thing that has been pointed out often on the Linux Kernel mailing list as well.

      I didn't read most of the princeton paper... but you're arguing that caches need to be big to get any gains, and that Intel's HT chips show SMT doesn't offer anything. The Intel chips have ridiculously small L1 caches - only 8KB. A quick sampling of washington papers shows they simulate machines with 64-128KB L1 caches, which are entirely reasonable - all AMD processors since the Athlon have had 64KB L1 caches. Both companies are increasing L2 sizes, and 1-2MB is not unreasonable either.

      I don't know anything about OS overhead, but section 2.3 of the princeton paper argues SMP kernels (which SMT requires) are slower, and thus you pay for extra overhead when using SMT vs a non-multithreaded single processor. However, they themselves don't make the same claim for multiprocessors (because you have to pay the OS overhead anyway), and with the introduction of dual core processors at the consumer level, everybody will soon be using the SMP kernels anyway. This point is [rapidly becoming, if not already] moot.

      Their analysis in section 3.3 implied that the memory subsystem becomes the bottleneck in multiprocessor systems with SMT enabled, but before you take that and argue SMT offers nothing, I again point out problems with the Intel implementation: their memory bus is shared among all CPUs, so the per-CPU bandwidth drops with an increase in CPUs, and per-thread bandwidth is half again. AMD's Opterons don't suffer from this same problem due to their NUMA configuration, so a 2-CPU 2-thread SMT with an Opteron-like memory system would get the same per-thread memory bandwidth as a 2-CPU non-SMT Xeon system, while supporting twice as many threads.

    3. Re:Why the future of SMT is bleak by spockvariant · · Score: 1

      The Intel chips have ridiculously small L1 caches - only 8KB. A quick sampling of washington papers shows they simulate machines with 64-128KB L1 caches, which are entirely reasonable - all AMD processors since the Athlon have had 64KB L1 caches. Both companies are increasing L2 sizes, and 1-2MB is not unreasonable either.

      Well, L1 caches don't count for much anymore, since L1 and L2 latencies aren't that far apart (2-5ns for L1 and 3-8ns for L2 on the new Intel chips). L1 caches are most useful in storing the stack, which really benefits from locality and can capitalize on that 10% or so latency improvement. The Washington experiments used between 3MB and 8MB L2 caches, which is not reasonable.

  93. Re:Ready for CMT? Hell no! by Anonymous Coward · · Score: 0

    Yep, and his first song is about how Darth Vader's pickup truck broke down, his wife left him, and he's drinking alone again.

  94. Re:well at least he seems to understand the proble by Doctor+Memory · · Score: 1

    That might be a solution if you've got the source. One of the more terrifying things learned during the great Y2K scare was that there exist a large number of legacy systems that have been patched by directly modifying the binaries. Such systems have no source code anymore, and are not decompilable. Also, let's not forget that much of this code was probably written "oddly" to get another 2% worth of performance out of the original architecture; to do the job properly you'd have to rewrite the fiddly bits in a more standard fashion, then verify that you haven't: a) introduced any new bugs, and b) fixed any bugs the program was depending on. The second may be a non-issue in most cases, but the first is still a non-trivial exercise.

    --
    Just junk food for thought...
  95. Re:well at least he seems to understand the proble by maxwell+demon · · Score: 1
    The original loop I posted can be optimized into constant time like this:
    int AccumulateLoopCount(int N) {
    return N * (N + 1) / 2;
    }

    Actually it can't (because your multiplication might overflow even if the result doesn't). But I already posted what I think is the optimal version (without loops, and without overflow issues).

    Thinking about it, the branch might be more costly than an addition and an xor, so here's another version which also avoids the if:
    int AccumulateLoopCount(int N)
    {
    register int tmp = N&1;
    return ((N>>1)+tmp)*(N+(tmp^1));
    }
    As my previous version, this doesn't suffer from possible overflow.
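
    A quick way to see the difference, assuming a 32-bit int: for N = 65535 the sum is 2147450880, which fits, but the intermediate product N*(N+1) = 4294901760 does not. The sketch below redoes the naive product in 64 bits so that intermediate can be inspected, and checks a branch-free version that halves the even factor before multiplying. naive_product and sum_no_overflow are illustrative names.

    ```c
    #include <assert.h>
    #include <limits.h>

    /* The intermediate N*(N+1), computed in 64 bits so it can be inspected safely. */
    long long naive_product(int N)
    {
        assert(N >= 0);
        return (long long)N * ((long long)N + 1);
    }

    /* Branch-free sum of 1..N: halve whichever of N, N+1 is even before
     * multiplying, so no intermediate value exceeds the final result. */
    int sum_no_overflow(int N)
    {
        assert(N >= 0);
        int odd = N & 1;
        return ((N >> 1) + odd) * (N + (odd ^ 1));
    }
    ```

    For N = 65535, naive_product(N) is larger than INT_MAX while sum_no_overflow(N) still returns the right answer in a plain int.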

    As for the cost of parallelization: It's of course true that it has memory cost. And of course it would be silly to make N threads (which each would do exactly 1 addition, namely adding their value of i to 0, which of course could be easily optimized away), and then add all those values together (which would be exactly the work of the single-threaded version anyway). I'd expect the number of threads to be vastly less than N.

    As for the time cost: This just means that parallelization only makes sense for N much larger than the number of threads. If N is of the order of a billion, then the time to synchronize the four threads should be negligible (after all, you don't synchronize in every loop cycle, but only at the end of each thread). All of course under the (sensible) assumption that N_threads << N.
    --
    The Tao of math: The numbers you can count are not the real numbers.
  96. Re:PHP and multi-threading by cosinezero · · Score: 2, Informative

    It's very simple, actually, I do it quite frequently. Let's say you need to populate a drop-down based on user input in another drop-down.

    At the start of the page you collect user input and fire off the data access code for the original drop and the parameterized drop, each in a separate thread.

    This executes while you're performing other formatting actions, like include headers, menu formatting, and outputting strings to your response (like client scripts, etc).

    All the while, the other threads are formatting the first & second dropdown with the returned data, while your main thread is doing more menial UI tasks like formatting the tables and such that hold your page.

    This is simply a basic example, but anyone who uses data access code even for a single databound table or drop should always be running it in a separate thread and letting the main thread handle the non-data related rendering. There is a TON of work on web pages that you can be doing that is not data related, nor is it required to be performed before or after the data is available.

  97. "big hairy package" by Anonymous Coward · · Score: 0

    Big hairy package..

    Uhlm.. Too much information.

    BROOKLYN

  98. Cilk on a CMT? by Anonymous Coward · · Score: 0

    Looks like an interesting platform to run Cilk on.
    Forget Java, it can't multi-thread its way out of a paper bag.

  99. I will buy into it when...... by Anonymous Coward · · Score: 0

    Sun sends me my free Solaris 10.0 DVD I signed up for years ago. All I got was their spam and phone calls trying to get me to fork down 50k+ for development software.

  100. Dude, you're gettin' a Cell! by spun · · Score: 2, Funny

    Sorry, sorry, sorry...
    I couldn't help it.

    --
    - None can love freedom heartily, but good men; the rest love not freedom, but license. -- John Milton
  101. I CALL BULLSHIT by Anonymous Coward · · Score: 0

    you must only be familiar with shared-state concurrency. because if you weren't, you wouldn't spread this FUD framed from the perspective that threads are the only way to manage concurrency (parallelism) in software.

    the means to do "thoughtless" concurrency has been available for going on 40 years now. look up (http://www.c2.com/) carl hewitt's actors paradigm (predecessor to alan kay, both as student-teacher relationship and actors-OOP) as well as read the successor to SICP, Concepts, Models, and Techniques of Computer Programming (http://www.info.ucl.ac.be/people/PVR/book.html).

    now, understand instead: MESSAGE-PASSING CONCURRENCY. this is the "thoughtless" solution to programming on the Cell chip, CMT, and all these other new buzzwords for concurrent processing of information with multiple cores. (though, i don't think Oz, or MOZart, are the pragmatic languages to do this with; simply because i don't believe in multiple-paradigm languages -- perhaps i'm just really bitter against C++.)

  102. Memory bandwidth is still the bottleneck... by cayle+clark · · Score: 1

    ...and I don't see how on-chip threading helps. Instead of memory serving a single thread's stream of instructions and one set of registers being loaded/stored, you now have multiple threads demanding multiple streams of instructions and loading/storing from multiple register sets.

    Do the CMT chips assume greatly expanded L1 and L2 caches? The more threads, the broader and more scattered the working set. Without parallelism in memory service, multiple threads in multiple cores will just mean even more hardware sitting idle while waiting for a cache line to be loaded. Doing nothing in parallel? Not a win.

    1. Re:Memory bandwidth is still the bottleneck... by Jerry+Coffin · · Score: 1
      Do the CMT chips assume greatly expanded L1 and L2 caches? The more threads, the broader and more scattered the working set.

      Most CMT simulations (at least that I've seen) have assumed larger caches. Most implementations don't match -- which seems to be a large part of why it's not working nearly as well in real-life as simulations.

      Of course, the simulations also often assume a more or less optimal mix of compute- and I/O-bound threads, where improving thread-switching speed helps a lot. Again, real-world loads are rarely so cooperative.

      --
      The universe is a figment of its own imagination.

  103. Multi-threading can be bad for latency by Coward+Anonymous · · Score: 1

    From the article "At one point during the CMT summit, I stuck my hand up and asked: is there anything that in principle doesn't scale with multithreading? There wasn't a lot that leapt to the minds' eyes, except for compiler code."

    Any application where latency is important (high performance network servers and proxies, for instance) will either not gain or often suffer from multi-threading for two reasons:
    • multi-threading requires synchronization which always consumes extra CPU cycles.
    • switching execution from thread to thread is an OS task switch and is generally an expensive operation consuming many CPU cycles. It is possible that CMT mitigates this problem but I doubt it completely solves it.
    1. Re:Multi-threading can be bad for latency by Anonymous Coward · · Score: 0
      As far as synchronization is concerned, the solution is to avoid it and go with lock-free solutions, preferably ones that don't require memory barriers or interlocked primitives such as compare and swap. The big problem with conventional synchronization such as mutexes is the suspend/resume overhead if a thread blocks on a lock. It's a big problem in scalability since everything slows down dramatically and usually doesn't recover.

      Context switching overhead is proportional to the number of threads, not cpu's. Adding extra cpu's splits the overhead among the cpu's, so more is better here.

  104. Re:Ready for CMT? Hell no! by ksheff · · Score: 1

    Nashville doesn't produce music like that any more. If anything, the first song would be Anakin singing about how Padme thinks his astromech droid is sexy.

    --
    the good ground has been paved over by suicidal maniacs
  105. cost/cpu throughput? !:Cache Gb/core? by lpq · · Score: 1
    Seems like one of the next, well, what am I talking about, it's a problem today: how to get the data to the multiple cores & threads? Right now it's a notable performance hit to "have" to go out to "main-memory", let alone wait for a disk read (might as well run that CPU at 800MHz or less).

    Unless they are already planning on many more Gb of on-chip cache, data-starvation will become an even bigger issue than it is today.

    It might be less of a problem for multiple-threads that are executing in the same program, but they are still likely to be operating on different data streams.

    In the case of multiple cores running different programs it will get much worse, unless average program sizes shrink to a 1-2Mb of Resident/Working-Set size. Right now, looking at 2 Desktop systems:

    Entries are numbers of processes.
    RES = Working Set/Resident Size, # of processes >= "X" MB
    VMSIZE = # of processes >= "X" MB

    OS        Total   VMSIZE   RES    RES    RES    RES
                      10M      4M     8M     16M    32M
    WinXP     24      22       15     8      6      2
    KDE-lnx   80      23       21     14     3      1   **

    This would seem to indicate a need for 4-16Mb of L2 cache to keep all these processes from forcing L2-cache misses at 100's to 1000's of context switches/second. These are desktop systems that are not doing much other than email and web browsing. I cannot see it being better with high-load server systems. How many of the new multi-core systems are going to have L2 cache > 8Mb/core? 4Mb/core (for fast cache/low-latency memory)? How many systems will have main memory fast enough to feed 8-32 processes?

    I've read that CPU starvation is already a problem in the faster Intel family processors, will the "system" hardware infrastructure be there to enable multiple cores to be fed?

    They may be lowering the GHz/cpu, but as the Sun article points out, with 8 cores, that's still 8-cores times "N"GHz to be kept fed with data.

    It's going to be a strained design scenario if you need to constrain those 8 cores to using an average of 1Gb/core of cache memory.

    Does anyone know if the new "breed" of multi-core CPU's have a shared cache or if they are going to be limited to separate caches/core? Could cache memory contention become an issue?

    BTW -- does anyone know if disk manufacturers are planning (or are switching to common use) of multiple heads/platter? I could see arrays of 2-4 heads to cut seek latency by 50-75% and disks with heads 90 or 180 degrees out of phase to reduce rotational latency -- perhaps allowing lower RPM disks to consume less power and run with lower noise/cooling requirements. Maybe this is already being done in higher end SCSI disks?

    **-why doesn't "ecode" support spacing? How does one do tables? What are "too many "lame" characters (when I had better table w/more spacing)? Grumble -- took 3x as long to format as write! >;-((

  106. Re:Ready for CMT? Hell no! by sharkey · · Score: 1

    You want to see Ghyslain jump the General Lee over the creek, same as the rest of us. Stop with the denial, man!

    --
    "Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
  107. One side of the schism is crumbling by GunFodder · · Score: 1

    I think you are correct that there are two camps, but I would define them differently. There are scientific/educational/industrial users that are already running multithreaded software. And there are home users that are about to get a lot of new multithreaded software.

    There isn't much that can be done to improve the performance of a single-threaded CPU with current technology. Both Intel and AMD have recently announced or released dual-core chips that will start making their way into home systems. The big iron vendors have been working on this for years. In the very near future home users will notice big performance differences between newer multi-threaded apps and their older software.

  108. Re:Ready for CMT? Hell no! by MarkGriz · · Score: 2, Funny

    "Has John Williams gone country?"

    No, that's his brother, Hank.

    --
    Beauty is in the eye of the beerholder.
  109. Re:well at least he seems to understand the proble by Anonymous Coward · · Score: 0

    Trust me, any app that old isn't going to be able to handle anything approaching large amounts of data. Case in point: the recent implosion of the pilot scheduling system last December. A legacy system not decades old which nevertheless was unable to handle a moderate load.

    At the edge case, big iron wasn't farther than maybe 36 bits or so in terms of memory addressing, so that's probably about as much data as you'll be able to crunch, which is well within current capabilities. Scientific computing has always been designed with parallelism in mind, which is why you still have decades-old Fortran running around in modern simulations, but business computing (transactions, basically) is either designed for multiple CPUs running lots of transactions already, or is some dinky one-off which does minimal processing on its input.

    I won't deny that some fool is probably trying to keep a 30 year old legacy system going because it just works, but it's rather wildly optimistic to think that there are any single-threaded legacy apps which need to crunch multi-TB or even GB data sets. Anyone whose data requirements have grown anywhere close to the pace of Moore's law (and thus need more and more single threaded performance) certainly has the need for a rewrite.

  110. Smaller than you think by phlamingo · · Score: 1

    The class of problems that must be serialized is smaller than we tend to think it is.

    If you look up some of the old research that Thinking Machines did with the Connection Machine (a hyper-cube architecture; a common configuration was 64K processors in 16K nodes of 4 processors each), one of the surprising results was parsing large program files in logarithmic time.

    I do not claim that is a practical result, but I hope that it is enough to make some of us drop our assumptions about whether a given problem is fundamentally serial, with no hope of improving the performance with parallel processing.

    --
    I had forgotten how much cooler teenagers look when they are smoking. Oh, wait ...
  111. Country Music Television? by jon_oner · · Score: 1

    Anyone else read the title as "Ready for Country Music Television?"

    My first thought was: "Nooooooooooooooooo!!!!"

  112. But if the CMT part tells the DRM part... :-) by BerntB · · Score: 1
    Now my hardware will force me to support CMT on my computer? This is as bad as DRM.
    Hmm... maybe the CMT and DRM will cooperate?

    First, the BIOS will download and play CMT stuff. (It would need a microphone to verify that it was played loud enough.)

    After a while, the CMT will tell the DRM in your computer about your criminal behaviour!

    I wonder if the DRM will print the lawsuit on your own printer or send off an email so you get it by snailmail?

    Or maybe the DRM will just demand your credit card number -- or kill all the data on your hard disk?

    I wish I could add a ":-)" here.

    --
    Karma: Excellent (My Karma? I wish...:-( )
  113. Re:well at least he seems to understand the proble by lgw · · Score: 1

    An enormous amount of work has gone into finding ways to optimize legacy FORTRAN code for highly parallel architectures. Universities have been hammering away at the FORTRAN "lobster" (the huge body of well-tested library code used in research and some engineering circles) for close to 15 years now. I remember my college roommate doing this for a summer job in the early 90s, and it wasn't a new effort then.

    --
    Socialism: a lie told by totalitarians and believed by fools.
  114. WTF? by npsimons · · Score: 1

    Why would Star Wars Weenies care about Central Mean Time?

  115. CMT coming by Anonymous Coward · · Score: 0

    For those with an interest in HPC, this would seem to be an attempt to bring Burton Smith's (Tera Computing, now Cray, Inc.) MTA idea to the world of mainstream business computing. Still a far cry from 128 simultaneous threads but working down that road. I do know that they have ported over and parallelized a lot of code for this design.

  116. Favorite parallel language? by sleepingsquirrel · · Score: 1
    Do you have a favorite parallel language you like to recommend?
  117. Yes, exactly by Anonymous Coward · · Score: 0

    This is not multi-threaded programming, it's serial programming run on a farm. The only difference is the farm is at the CPU level rather than the data center level. You can get the same performance benefits by just optically hooking together a bunch of boxes.

  118. Just wait for the benchmarks ... by Anonymous Coward · · Score: 0

    ... they will tell what kinds of workloads Niagara is suited for.

    Thread-happy applications like Java and databases should do well.

  119. Re:well at least he seems to understand the proble by bluGill · · Score: 1

    I have to disagree. The mutex is not significant, because N is very large, at least > 100,000. (Assume a processor large enough to hold the result.) If N is expected to be small, you are correct; if N is very large, for some value of very large, then locking time is not significant.

    When doing big-O analysis in computer science we always assume n large enough to overtake constant losses. If n is known to never be more than 4, then an O(n!) algorithm may run faster than an O(1) algorithm. However the O(1) algorithm will not be any slower when n is 1000, while the O(n!) algorithm will not finish in our lifetime.

    So my code would be in C++: (Sorry about the formatting, but I'm not about to figure out how to make it look nice in allowable html):
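    // Counts the iterations in [start, end); stands in for the real per-item work.
    int loopCountHelper(int start, int end) {
        int result = 0;
        for(int i = start; i < end; i++) {
            result += 1;
        }
        return result;
    }

    // threadList is a placeholder thread-pool helper, not a real library class.
    int AccumulateLoopCount(int N) {
        int accumulator = 0;
        threadList tl;
        // Hand out chunks of THREADFACTOR iterations, one thread per chunk.
        while(N > 0) {
            int start = N > THREADFACTOR ? N - THREADFACTOR : 0;
            tl.newThread(&loopCountHelper, start, N);
            N = start;
        }

        // Collect and sum the partial counts as the threads finish.
        while(tl.moreThreads()) {
            accumulator += tl.waitandNextGetResult();
        }
        return accumulator;
    }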

    It has been a long time since I've done function pointers in C++ so I'm sure I did that all wrong. However I think you get the idea. For that matter now that you see what I'm trying to do I suspect you could come up with a better design. (I know I could if I had a few hours and a good editor to work with)
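
    For reference, here is a minimal sketch of the same chunk-and-sum idea written against std::async and std::future from the modern C++ standard library. That is an assumption on my part: the post above uses a made-up threadList class, and the `chunk` parameter here simply plays the role of THREADFACTOR. Treat it as one possible shape, not a tuned implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

// Counts the iterations in [start, end), standing in for real per-item work.
long loopCountHelper(long start, long end) {
    long result = 0;
    for (long i = start; i < end; ++i) {
        result += 1;
    }
    return result;
}

// Splits [0, n) into chunks of at most `chunk` items, runs each chunk as its
// own asynchronous task, then sums the partial results. `chunk` plays the
// role of THREADFACTOR; a good value depends on the workload.
long accumulateLoopCount(long n, long chunk) {
    assert(chunk > 0);
    std::vector<std::future<long>> tasks;
    for (long start = 0; start < n; start += chunk) {
        long end = std::min(start + chunk, n);
        tasks.push_back(
            std::async(std::launch::async, loopCountHelper, start, end));
    }
    long accumulator = 0;
    for (auto& task : tasks) {
        accumulator += task.get();  // blocks until that chunk has finished
    }
    return accumulator;
}
```

    Nothing is shared between the tasks while they run, so the only synchronization is the get() on each future, once per chunk rather than once per iteration.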

  120. Re:well at least he seems to understand the proble by Sique · · Score: 1

    As someone who has worked on a project to replace 25-year-old legacy applications, I have some insight into how such old applications are used on a daily basis. Believe me. There was this report which ran for 200 processor hours because of the sheer amount of data to be processed. It was run once a month, spread across 18 different processors to keep the elapsed time low enough to fit into a weekend and still leave a chance to restart it if it failed.
    Sometimes the processing time goes with the square or a higher power of the amount of data ;) It's called O(n^m), and this is not even the worst case.
    Addressing often isn't the issue; matrix operations are. And many optimization algorithms use matrix operations.
    With every new generation of hardware more data was thrown at the old report, so it still was run once a month, and it still took about 200 hours of processing time, without changing the base algorithm.
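
    To put a number on that, here is a small sketch under the assumption that run time grows exactly as n^m with no other overhead. The function name is made up for illustration. It answers: if the hardware gets `speedup` times faster, how much more data can go through the same algorithm in the same 200 hours?

```cpp
#include <cassert>
#include <cmath>

// If run time grows as n^m, a machine `speedup` times faster can handle
// speedup^(1/m) times as much data in the same wall-clock budget.
double dataGrowthForSameRuntime(double speedup, double m) {
    assert(speedup > 0.0 && m > 0.0);
    return std::pow(speedup, 1.0 / m);
}
```

    With m = 2, hardware that is 4 times faster only buys twice the data; with m = 3, it takes 8 times faster hardware to double it.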

    And I was also working on the new report, where the processing time condensed to 1:20h, and suddenly the managers were running the report on a daily basis to make sure they don't miss any changing results.

    --
    .sig: Sique *sigh*
  121. Really... by Anonymous Coward · · Score: 0

    15x the then 900 MHz SPARC III, I doubt it seriously

    That is EXACTLY what they are claiming.


    Based on what I've read the Niagara CPU will only be deployed in a single slot server...the only thing it might be useful for is front-end web servers and light-duty app servers.


    Yes, Sun will market Niagara for blades and small servers such as web servers with a single CPU (8 cores, 32 threads in hardware).

    The larger servers will have multiple Fujitsu chips - 2.4GHz 64bit Dual-core SPARC64 with a very large cache.
    They will use Fujitsu chips until their (Sun's) "Rock" processor arrives sometime after 2008 and it will offer both twice the throughput of Niagara and blazing single-threaded performance.


    I would like to see a showdown between the IBM/Toshiba Cell and Niagara.

    Brilliant, now I can test a Playstation 3 vs a Web server.


    It's my opinion that the Sun engineering team are in serious trouble.


    You have clearly been misinformed. See above posts.

  122. Re:PHP and multi-threading by Alt_Cognito · · Score: 0

    Oversimplified. I believe your "It's very simple" as much as I believe there's an easy way to eliminate deadlock, resolve synchronization issues....

    Languages and frameworks do not change the fundamentals of computer science....

  123. Re:PHP and multi-threading by cosinezero · · Score: 1

    I don't have to synclock DA objects running SELECTs... do you? No wonder you think this is hard. There aren't any synch issues in the above if you partition your code properly. The only time locking should come into play is if you have a cache you're dumping the returned data into... which was clearly outside the scope of my simple example provided to prove a point.

  124. Re:well at least he seems to understand the proble by daVinci1980 · · Score: 1
    When doing big-O analysis in computer science we always assume n large enough to overtake constant losses.
    Yes, when doing analysis in CS via O(n), you get to ignore those pesky constants.

    However, in that pesky real world, O(n) is just a guide. The constants are quite meaningful in a lot of cases. Locking a mutex costs *thousands of clocks*. Each time. Incrementing or decrementing a semaphore also costs thousands of clocks. (You should also consider the fact that sometimes for performance metrics, it's more meaningful to use theta(n).)

    Bottom line, if you were trying to get maximum performance out of that simple loop that I posted, you'd care very much about those costs, I assure you.
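
    A minimal sketch of the two extremes being argued about, assuming std::thread and std::mutex from modern C++ (the function names are made up for illustration). Both return the same count; the difference is how often the lock is taken, so whatever a lock costs on a given machine is paid n times in the first version and once per thread in the second.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Every increment takes the lock: correct, but the lock cost is paid n times.
long countLockPerIteration(long n, int threads) {
    assert(threads > 0);
    long total = 0;
    std::mutex m;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            for (long i = t; i < n; i += threads) {
                std::lock_guard<std::mutex> lock(m);
                total += 1;
            }
        });
    }
    for (auto& th : pool) th.join();
    return total;
}

// Each thread sums privately and takes the lock once, at the very end.
long countLockPerThread(long n, int threads) {
    assert(threads > 0);
    long total = 0;
    std::mutex m;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            long local = 0;
            for (long i = t; i < n; i += threads) local += 1;
            std::lock_guard<std::mutex> lock(m);
            total += local;
        });
    }
    for (auto& th : pool) th.join();
    return total;
}
```

    Timing both with a large n on your own hardware is the honest way to see how much the constant matters; no figures are claimed here.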
    --
    I currently have no clever signature witticism to add here.
  125. Re:well at least he seems to understand the proble by bluGill · · Score: 1

    Yes, in the real world you do care. Which is why the solution I presented is best - it uses some factor to determine how many threads to start. Each thread does enough work to make the locking trivial compared to the total run time.

    Note that in the simple loop you posted, the result of one iteration does not depend on the result of the previous one, and I was able to factor all the data accesses out.

    You are correct that theta(n) is often useful, but I posted a solution where theta(n) isn't as important as O(n). Remember the assumption that n is very large, and this algorithm dominates run time. When either case is not true, then my solution isn't of much use.

  126. parallel languages by nuked · · Score: 1

    Just to add, there are languages designed specifically to exploit parallel architectures, effective in both shared-memory and non-shared-memory environments. Bit of a plug, but see KRoC (the Kent Retargetable occam-pi Compiler). And google for CSP (Tony Hoare, Bill Roscoe) for the formal semantics that make such parallel systems "safe" (and understandable/composable).