Slashdot Mirror


NVIDIA Shaking Up the Parallel Programming World

An anonymous reader writes "NVIDIA's CUDA system, originally developed for their graphics cores, is finding migratory uses into other massively parallel computing applications. As a result, it might not be a CPU designer that ultimately winds up solving the massively parallel programming challenges, but rather a video card vendor. From the article: 'The concept of writing individual programs which run on multiple cores is called multi-threading. That basically means that more than one part of the program is running at the same time, but on different cores. While this might seem like a trivial thing, there are all kinds of issues which arise. Suppose you are writing a gaming engine and there must be coordination between the location of the characters in the 3D world, coupled to their movements, coupled to the audio. All of that has to be synchronized. What if the developer gives the character movement tasks its own thread, but it can only be rendered at 400 fps. And the developer gives the 3D world drawer its own thread, but it can only be rendered at 60 fps. There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization.'"

154 comments

  1. need some brains by conan1989 · · Score: 2, Funny

    Where are the MIT CS guys when you need them?

    1. Re:need some brains by JamesRose · · Score: 2, Funny

      Leave them a post it, they'll get back to you ;)

    2. Re:need some brains by mrbluze · · Score: 1

      Leave them a post it, they'll get back to you ;)

      But how do you know it's them and not some kind of AI script?
      --
      Do it yourself, because no one else will do it yourself. [beta blockade 10-17 Feb]
    3. Re:need some brains by definate · · Score: 0, Offtopic

      They turned me into a newt!

      I got better.

      --
      This is my footer. There are many like it, but this one is mine.
    4. Re:need some brains by Anonymous Coward · · Score: 0

      I need better, too. :(

    5. Re:need some brains by Anonymous Coward · · Score: 0

      Plenty of other folks are doing what the parent thread post/article notes (SETI@Home's end-user client has been for years now, iirc, since 2006, & on a GPU)!

      Example/Proof? This use of multiple thread design is only 1 of 1,000's out there:

      ---

      APK Dr. Who Screensaver 2008++:

      http://www.drwhodaily.com/community/index.php?showtopic=386

      ----

      (That single file monolithic .scr Win32 PE multithread designed screensaver that also contains data access pointers for playing its animation files, which are stored literally inside itself as a runtime accessible resource for using its data in memory, not on disk (so it is only "1 moving part" K.I.S.S. designed))

      AND, not just "MIT CS guys" can handle/deal with it!

      (Not that they're "that much 'smarter'" (fact is, I doubt they are) than the rest of the joes out there writing code either, especially experienced devs)

      MOST "ordinary coders" can "handle it", especially IF they "think out/architect" their apps wisely...

      So, as far as it being the "sole province" of "MIT type guys"? Hey, trust me... it's far from that & their "exclusive province of the elite/brainchildren of MIT" (they're just students too, & there is a HUGE diff. between academia level experience, & that of the "real working world" in this "art & science" - experience usually DOES create that effect in a highly technical field!)

      Sure, occasionally, you DO get a "prodigy" that comes outta the academic world, but, the odds are STRONG they discovered it @ an early age too (mostly due to exposure to programming by parents I'd say).

      STILL, in essence - multithreaded apps that are implicitly "SMP ready/Multicore ready" that have to sync sound & animation abound online, such as the one noted above!

      Heck - multithreaded, implicitly SMP/MultiCore Ready apps + Operating Systems exist in fairly large numbers & have for years! Using taskmgr.exe in Windows can show anyone that much, easily (if you have the PROCESS Tab's THREADS column selected as visible! e.g.-> Here, I have 30 apps running, & 28 of those ARE 2-N thread bearing & thus, my system (AMD Athlon64 X2 dualcore) is being taken advantage of via the OS Process Scheduler kernel subsystem shunting off child OR parent threads of app execution to multiple cores, when & IF necessary (especially when the first of N cores gets near to saturated)).

      (The screensaver above is one that's not only "coarse multithreading designed" (meaning diff. threads of execution from the parent thread processing diff. discrete tasks & data), but, also does the "fine grained" multithreading problems this thread notes also (by taking the same set of data & busting it across diff. threads (2-3 of them in this app's case, depending on what's going on @ any given moment during its operations + having to perform synchronization & blocking as needed))).

      Imo & experience - today's (& even last decade's) programming tools do NOT "take a prodigy" to do this level of work, + it's almost as simple as programming using LEGOs (the building blocks toy)... you largely program CONTROLS for more than a decade now (oh, there IS more to it, but it has gotten simpler than when I first started coding Windows apps (Win16 stuff)).

      MY BOTTOM-LINE/POINT, after all that is now "said & aside"?

      Plus, I'd wager NVidia &/or ATI's toolkits make it elegant, & relatively "simple/easy" to do as well (today's programming IDE &/or addon tools such as .OCX/ActiveX controls/OLEServers, or .VCL make things a LOT simpler than it was prior to say, 15 yrs. ago too, using tools like Microsoft Macro Assembler OR even C/C++ dev environments))

      Today & FOR YEARS NOW (more than a decade)?

      With Today's dev. tools?? Put it THIS way - YOU DON'T NEED TO BE "TONY STARK BOY GENIUS INDUSTRIALIST" (ou

    6. Re:need some brains by Eli+Gottlieb · · Score: 1

      Just don't call a string-theory lab!

    7. Re:need some brains by Anonymous Coward · · Score: 0

      Telling everyone to stop their bitching and just use CUDA + pthreads.

  2. Dumbing down by mrbluze · · Score: 5, Funny

    'The concept of writing individual programs which run on multiple cores is called multi-threading. That basically means that more than one part of the program is running at the same time, but on different cores.

    Wow, I bet nobody on slashdot knew that!

    --
    Do it yourself, because no one else will do it yourself. [beta blockade 10-17 Feb]
    1. Re:Dumbing down by Lobais · · Score: 1

      Well, it's a hot copy from tfa.
      If you rtfa you'll notice that it's about "Nvidia's CUDA system, originally developed for their graphics cores, are finding migratory uses into other massively parallel computing applications."

    2. Re:Dumbing down by Lobais · · Score: 2, Informative

      Oh, and CUDA btw. http://en.wikipedia.org/wiki/CUDA

      CUDA ("Compute Unified Device Architecture"), is a GPGPU technology that allows a programmer to use the C programming language to code algorithms for execution on the graphics processing unit (GPU).

    3. Re:Dumbing down by aliquis · · Score: 1

      Yeah, I would have preferred it if the summary had included a little information about how CUDA helps and does it better. As the summary is written now, it's like "Maybe a video card dude will fix it because they need to run more threads", not "This video card dude came up with a new language which made it much easier to handle multiple threads", or whatever.

      HOW does CUDA make it easier? I'm very confident it's not because Nvidia hardware contains lots of stream processors.

      Oh well, guess I need to RTFA, and maybe Wikipedia as well..

    4. Re:Dumbing down by Kawahee · · Score: 2, Funny
      Slow down cowboy, not all of us are as cluey as you. It didn't come together for me until the last sentence!

      There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization
      --
      I'll subscribe to Slashdot when I see a month without a dupe, a typo, or an article the "editors" didn't read.
    5. Re:Dumbing down by smallfries · · Score: 1

      There is no real detail in the article, so I'm dredging my memory for how CUDA works... It probably is because they are stream processors, i.e. a pool of vector processors that are optimised for SIMD. The innovation was that the pool could be split into several chunks working on separate SIMD programs. Rather than threads there are programmable barriers to control the different groups and explicit memory locking to ensure the cache is partitioned between the different groups.

      So to put it another way, the big threading "innovation" in CUDA is to not use threading, but instead to partition the memory and use low-level synchronisation primitives. Something that the supercomputing guys are well aware of, although they prefer to stick a MPI layer on top of it.

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
    6. Re:Dumbing down by elloGov · · Score: 1

      'The concept of writing individual programs which run on multiple cores is called multi-threading. That basically means that more than one part of the program is running at the same time, but on different cores.

      Wow, I bet nobody on slashdot knew that!

      Although your comment is funny on Slashdot, it's a wise-ass, arrogant reply in reality. It's always good to reinforce knowledge and be reminded of it. Meanwhile, thank you for reinforcing the stereotype of a programmer, that of arrogance. :)
    7. Re:Dumbing down by Anonymous Coward · · Score: 0

      Threads are simply OS constructs that allow for multiple paths of execution.

      On single processor systems threads give the illusion of concurrent processing due to timeslicing.

      In multicore environments, threads are often used to logically divide a program into components that are concurrently executed on different cores. After all, why not use a programming model (threads) that everyone is already familiar with?

      However, there is no requirement that one use threads for programming in a multicore environment, so the statement "The concept of writing individual programs which run on multiple cores is called multi-threading" is a tad misleading, and an oversimplification.

    8. Re:Dumbing down by ultranova · · Score: 1

      Although your comment is funny on Slashdot, it's a wise-ass, arrogant reply in reality. It's always good to reinforce knowledge and be reminded of it. Meanwhile, thank you for reinforcing the stereotype of a programmer, that of arrogance. :)

      Why would anyone but a fairly advanced programmer be interested in the new fads in parallel programming? Besides, the summary is misleading, giving the impression that multithreading is exclusive to multicore processors, which is false; it can give huge benefits on a single-core processor to have the UI run in a thread of its own so it won't get blocked by a long-running task, for example.

      While reinforcing knowledge is good, explaining the addition of integers in a university-level mathematics book is ridiculous. So is the summary, and for the same reason. More so because the explanation is wrong.
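      To make the single-core point concrete, here is a minimal Python sketch. The names `long_running_task` and `run_ui_with_worker` are made up for illustration, and the "UI" is just a counter standing in for redraw and input polling. The slow work runs in its own thread, so the loop keeps ticking while it waits, even if only one core is available.

```python
import threading
import time


def long_running_task(result):
    # Hypothetical slow job; sleeping stands in for blocking work.
    time.sleep(0.2)
    result["value"] = 42


def run_ui_with_worker():
    result = {}
    worker = threading.Thread(target=long_running_task, args=(result,))
    worker.start()

    ticks = 0
    # Stand-in "UI loop": keeps running while the worker is busy.
    while worker.is_alive():
        ticks += 1          # pretend to redraw and poll input
        time.sleep(0.01)

    worker.join()
    return ticks, result["value"]


if __name__ == "__main__":
    ticks, value = run_ui_with_worker()
    print("UI ticks while waiting:", ticks, "result:", value)
```

      Had the task been called directly from the loop, the tick counter would not advance until the task returned.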

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    9. Re:Dumbing down by TapeCutter · · Score: 1

      Regardless of your arrogance quotient, you are correct.

      IAACS, multi-threading and parallel processing are two different but related concepts. The hard part is coming up with a parallel algorithm for certain classes of problems; implementing low-level synchronization is trivial by comparison. OTOH I've seen a lot of programmers stab themselves in the eye with forks.

      --
      And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
    10. Re:Dumbing down by Anonymous Coward · · Score: 0

      >> The concept of writing individual programs which run on multiple cores is called multi-threading.
      >> That basically means that more than one part of the program is running at the same time, but on different cores.

      > Wow, I bet nobody on slashdot knew that!

      I hope not, because it's an incorrect definition!

    11. Re:Dumbing down by Anonymous Coward · · Score: 0

      'The concept of writing individual programs which run on multiple cores is called multi-threading. That basically means that more than one part of the program is running at the same time, but on different cores.

      Wow, I bet nobody on slashdot knew that!

      Will somebody give me a car analogy?
    12. Re:Dumbing down by LarsG · · Score: 1

      Is like you being the Plastic Man driving all the cars in a convoy.

      --
      If J.K.R wrote Windows: Puteulanus fenestra mortalis!
    13. Re:Dumbing down by aliquis · · Score: 1

      I don't see what the difference would be between synching shared memory access between threads and synching shared (partitioned) memory between programs running on stream processors, though.

    14. Re:Dumbing down by smallfries · · Score: 1

      If you're syncing shared memory access then you have to worry about memory consistency. This is basically the problem that you would have between separate caches in an SMP system. If you partition the memory then you have a simple form of locking.

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
    15. Re:Dumbing down by kilgor · · Score: 0

      Blah, both TFA and the Wiki are pretty sparse on details.

      This article is a bit better. The third page talks about some real-world applications that are benefitting from CUDA.

    16. Re:Dumbing down by aliquis · · Score: 1

      I still don't get it. If I share the RAM I was expecting all the things which shared it to have the same "knowledge" about it. Like say it's representing an ingame map of the environment and everything knows what things are. No matter if it's the own-unit-have-moved, others-units-have-moved or the thread which draws the map.

    17. Re:Dumbing down by smallfries · · Score: 1

      Ok, so think of it like this. If I have 1GB of ram shared amongst 128 processors then I have two choices: one large image (shared) or multiple smaller images (partitioned). If the whole bank is shared then each memory access has to arbitrate for access to the whole bank. The memory ranges of each processor completely overlap so there is always a cost for arbitration to access the resource.

      If we partition the bank into two pools, with 64 processors accessing each pool then we have just cut the arbitration cost in two so every memory access will be faster. In CUDA there is flexibility for how to split the ram, and it can be done hierarchically. So I might split in two as described, but then in each group of 64 processors I may partition off 32 processors into a subgroup on a smaller partition.

      In practice if I'm doing something at the extremely parallel end of the range (like rendering) then I can split my bank into 128 pieces as each subtask is completely independent and the data does not overlap (shared parameters would be replicated onto each node). Now there is no arbitration costs at all for accessing memory so the entire program receives a boost in performance.
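      A rough CPU-side analogy in Python, not CUDA code: the function names are invented for this sketch, and Python's global interpreter lock means it shows only the structure, not a speedup. In the shared version every update arbitrates for one lock; in the partitioned version each worker owns its slice and its own result slot, so no lock is needed.

```python
import threading


def sum_shared(data, workers):
    total = [0]
    lock = threading.Lock()

    def work(chunk):
        for x in chunk:
            with lock:              # every update arbitrates for the one shared cell
                total[0] += x

    threads = [threading.Thread(target=work, args=(data[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def sum_partitioned(data, workers):
    partial = [0] * workers         # one private slot per worker

    def work(i, chunk):
        s = 0
        for x in chunk:
            s += x                  # no lock: nothing here is shared
        partial[i] = s

    threads = [threading.Thread(target=work, args=(i, data[i::workers]))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)             # combine once, after all workers finish
```

      Both return the same total; the difference is how often the workers have to negotiate access to shared state.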

      The nice idea in CUDA is that I can choose somewhere between these two extremes and set up the memory hierarchy appropriately. Now I'm working from memory and I've only used a 7-series with OpenGL; my knowledge of CUDA is just from the reference docs and working in a similar area. But if I remember correctly you are not locking a single pool of memory: you can specify locks at each level of the memory hierarchy. So now the L1 / L2 caches don't fight between each stream processor and there is no need to handle cache consistency once they are running in separate partitions.

      Although I've never had a series-8 to try it out on... (hmm, suddenly realises there is one in the laptop I'm typing on... must play with that later) my PhD supervisor did some work on partitioned caches about a decade ago. It was one of those typical things where the academic community went "what's the point of that then?" and it was ignored right up until the point that Nvidia "reinvented" it without any credit. But hey, that's life.

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
    18. Re:Dumbing down by volpe · · Score: 1

      Well, I, for one, certainly didn't know that. I used to think that multi-threading could be done on a single core!!

  3. Where's the story? by pmontra · · Score: 4, Informative

    The article sums up the hurdles of parallel programming and says that NVIDIA's CUDA is doing something to solve them, but it doesn't say what. Even the short Wikipedia entry at http://en.wikipedia.org/wiki/CUDA tells more about it.

    1. Re:Where's the story? by mrbluze · · Score: 3, Insightful

      No offence, but I'm perplexed as to how this rubbish made it past the firehose.

      --
      Do it yourself, because no one else will do it yourself. [beta blockade 10-17 Feb]
    2. Re:Where's the story? by pmontra · · Score: 1

      Agreed.

    3. Re:Where's the story? by linRicky · · Score: 0

      Exactly! Not the sort of link I was expecting for a Slashdot article. There are no technical details whatsoever. And how exactly is the CUDA architecture alleviating the present issues with parallel programming?

    4. Re:Where's the story? by harry666t · · Score: 0, Offtopic

      Indeed.

    5. Re:Where's the story? by definate · · Score: 0, Offtopic

      Exactly.

      --
      This is my footer. There are many like it, but this one is mine.
    6. Re:Where's the story? by mrbluze · · Score: 1

      Precisely.

      --
      Do it yourself, because no one else will do it yourself. [beta blockade 10-17 Feb]
    7. Re:Where's the story? by Anonymous Coward · · Score: 0

      It's the middle of the night, they need to post something.

    8. Re:Where's the story? by Anonymous Coward · · Score: 0

      Right on!

    9. Re:Where's the story? by Anonymous Coward · · Score: 0

      Me, too.

    10. Re:Where's the story? by UziBeatle · · Score: 0

      Oh, as if it will help, I rate this thread a 10 out of 10.

      Mega dittos.

      Makes me wonder why I still bother to check into Slashdot. Force of habit now, perhaps.

      I know this is an old saw, but it is true: Slashdot has degraded from the site I recall back in the '90s. Vastly so.

      I hope they are not paying the 'editors' to review and link story submissions to the main page. If so, they surely are not getting their money's worth. The random Brownian motion site, Digg, can do as well, if not better.
      Okay, that last was a bit over the top, but I meant well.

      --
      Something between the lines jumps out and bites your arm off. Soltan Gris / London
    11. Re:Where's the story? by alex4u2nv · · Score: 3, Funny

      I sense a race condition developing

    12. Re:Where's the story? by EvilNTUser · · Score: 1

      In my opinion it doesn't even summarize the hurdles properly. I'm not a game programmer, so I don't know if the article makes sense, but it left me with the following questions. Hopefully someone can clarify.

      -Why would character movement need to run at a certain rate? It sounds like the thread should spend most of its time blocked waiting for user input.

      -What's so special about the audio thread? Shouldn't it just handle events from other threads without communicating back? It can block when it doesn't have anything to do.

      -How do semaphores affect SMP cache efficiency? Is the CPU notified to keep the data in shared cache?

      -What is a "3D world drawer"? Is it where god keeps us in his living room?

      For all I know, I have ridiculous misconceptions about game programming, but this article certainly didn't make anything clearer.

      --
      My Sig: SEGV
    13. Re:Where's the story? by harry666t · · Score: 1

      You really made me lmao (: +8, funny :P

    14. Re:Where's the story? by Yokaze · · Score: 2, Informative

      -Why would character movement need to run at a certain rate? It sounds like the thread should spend most of its time blocked waiting for user input.

      You usually have a game-physics engine running, which practically integrates the movements of the characters (character movement) or generally updates the world model (position and state of all objects). Even without input, the world moves on. The fixed rate is usually taken, because it is simpler than a varying time-step rate.
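      A minimal sketch of that fixed-rate idea in Python, assuming the common "accumulator" pattern (the function name and the toy constant-velocity physics are invented for illustration): rendering frames arrive at whatever pace they arrive, but the world model only ever advances in whole steps of `dt`.

```python
def simulate(frame_times, dt=1.0 / 60.0):
    """Advance a toy world in fixed steps of dt, whatever the frame times are."""
    accumulator = 0.0
    position, velocity = 0.0, 1.0   # hypothetical one-object world
    steps = 0
    for frame_time in frame_times:
        accumulator += frame_time           # real time elapsed this frame
        while accumulator >= dt:            # run as many fixed steps as fit
            position += velocity * dt
            accumulator -= dt
            steps += 1
    return steps, position


if __name__ == "__main__":
    # Uneven frame times, fixed 0.25 s physics step.
    print(simulate([0.5, 0.125, 0.125, 1.0], dt=0.25))
```

      The leftover time stays in the accumulator and is carried into the next frame, so slow and fast frames produce the same sequence of world states.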

      -What's so special about the audio thread? Shouldn't it just handle events from other threads without communicating back?

      Audio is the most sensitive to timing issues: contrary to video (or simulation), you cannot drop arbitrary pieces of sound without the user immediately noticing.

      -How do semaphores affect SMP cache efficiency? Is the CPU notified to keep the data in shared cache?

      Not specially; they are simply a special case of the problem of how to access data.
      Several threads may compete for the same data, but if they are accessing the same data in one cache line, it will lead to lots of communication (thrashing the cache).
      In CUDA, a thread manager is aware of the memory layout and decides which parts of memory will be processed by which shaders/ALUs/CPUs. That also makes it possible to use the caches more efficiently.

      -What is a "3D world drawer"? Is it where god keeps us in his living room?

      Drawer as in "someone, who draws", or 3D world painter. It draws/paints the state of the world as updated by the simulation thread.
      This can happen asynchronously, as you will not notice, if a frame is dropped occasionally.

      --
      "Between strong and weak, between rich and poor [...], it is freedom which oppresses and the law which sets free"
    15. Re:Where's the story? by EvilNTUser · · Score: 1

      Thanks for the reply, but I still don't understand why audio would be a synchronization issue. As you say, it needs a certain amount of CPU time or it'll stutter, but isn't that a performance issue?

      Also, the article would've done better just talking about the thread manager you mention. That makes more sense than the stuff about semaphores affecting performance positively (unless I misunderstood the sentence about the cache no longer being stale).

      And, uh, that drawer comment was a joke...

      --
      My Sig: SEGV
    16. Re:Where's the story? by adonoman · · Score: 1

      One issue is that the audio thread may need priority access to event data. If a lower priority thread has locks on data that the higher priority thread needs, you can end up with a priority inversion where the lower priority thread starves out the high priority threads.

    17. Re:Where's the story? by krelian · · Score: 1

      Agreed. I could understand if it was bashing Microsoft in some way, but it doesn't... /. editors, is that too much to ask?

    18. Re:Where's the story? by KillerCow · · Score: 1

      CUDA is terrible.

      It does nothing to solve the synchronization issues that are the plague of multi-threaded programming, and it makes it all worse by having a very non-uniform memory access model (that hasn't even been abstracted).

      The problem with multi-threaded models is that they are fundamentally harder than a single-threaded model. CUDA does nothing to address this, and it makes it even harder by forcing the programmer to worry about what kind of memory they are using and forcing them to move data in and out of the different types of memory manually.

      CUDA is only seeing uptake because it is the only game in town, not because it is a good solution.

    19. Re:Where's the story? by Anonymous Coward · · Score: 0

      "Why would character movement need to run at a certain rate? It sounds like the thread should spend most of its time blocked waiting for user input."

      That's probably true for Zork, but in a modern 3D game my character can be moved, assaulted by game "bots" or fellow game players, thrown off a cliff, etc. with no input on my part.

    20. Re:Where's the story? by philipgar · · Score: 1
      -How do semaphores affect SMP cache efficiency? Is the CPU notified to keep the data in shared cache?

      Not specially, they are simply a special case of the problem: How to access data
      Several threads may compete for the same data, but if they are accessing the same data in one cache-line, it will lead to lots of communication (thrashing the cache).

      I think you have this wrong. Sharing data in one cache line between processors is not always bad. In fact, in multicores this can be a very good thing. What causes problems is when multiple threads are writing to a cache line. This will happen with the cache line that holds the semaphore, and can cause multiple invalidate messages to be sent out if multiple threads are spinning on a lock and the lock is poorly coded (a proper lock will just perform reads while spinning, which does not cause invalidate messages). In a multicore architecture locks can be done much more efficiently. The data being locked is likely in one processor's L1 cache, and written back to the L2. Reads can be served from the L2 cache to any processor, and cached in its L1. Upon a write to the lock, the local copies are invalidated, and the update is propagated to the L2. As the L2 can be read by all processors relatively fast, this should not cause cache thrashing.
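      The "only read while spinning" lock is usually called test-and-test-and-set. Here is a structural sketch in Python; it cannot show cache traffic, and the inner `threading.Lock` is only a stand-in for a hardware atomic instruction. The class and function names are invented for this example.

```python
import threading
import time


class TTASLock:
    """Test-and-test-and-set spin lock (structure only)."""

    def __init__(self):
        self._held = False
        self._atomic = threading.Lock()   # stand-in for an atomic test-and-set

    def acquire(self):
        while True:
            # "Test": spin on a plain read; no write is attempted here.
            while self._held:
                time.sleep(0)             # yield so the holder can make progress
            # "Test-and-set": only now try the (expensive) atomic update.
            with self._atomic:
                if not self._held:
                    self._held = True
                    return

    def release(self):
        self._held = False


def count_with_lock(threads=4, increments=200):
    lock = TTASLock()
    counter = [0]

    def work():
        for _ in range(increments):
            lock.acquire()
            counter[0] += 1               # critical section
            lock.release()

    pool = [threading.Thread(target=work) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter[0]
```

      A naive lock would attempt the atomic write on every spin iteration, which on real hardware is what generates the stream of invalidate messages.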

      Cache thrashing on a multicore is associated with one thread having a much larger cache footprint than another, causing the other thread's data to be evicted. The big problem with locks and semaphores is that if there is regularly contention for them, performance will suffer greatly due to a thread having to spin (or block to the OS) as it waits to obtain the lock.

      Phil
  4. Thats.. by mastershake_phd · · Score: 4, Funny

    There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization.

    That's called wasted CPU cycles.
    1. Re:Thats.. by aliquis · · Score: 1

      ... and bad planning / lack of effort / simplest solution.

      But we already know it's hard to split up all kinds of work evenly.

      Anyway, what does CUDA do to help with that?

    2. Re:Thats.. by badpazzword · · Score: 1

      That's called wasted CPU cycles.

      BOINC for PS3?
      --
      When ideas fail, words become very handy.
    3. Re:Thats.. by mpbrede · · Score: 1

      There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization. That's called wasted CPU cycles.

      Actually, synchronization and waiting do not necessarily equate to wasted cycles. Only if the waiting thread holds the CPU to the exclusion of other tasks does it equate to waste. In other cases it equates to multiprogramming, a familiar concept.
    4. Re:Thats.. by Machtyn · · Score: 1

      That's called wasted CPU cycles.

      Just like what is happening when I scroll through these comments

      /it's funny, laugh.
    5. Re:Thats.. by GreyWolf3000 · · Score: 1

      Not really...normally, your process goes to sleep during this time. Your CPU spends its cycles doing other things.
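      A small Python sketch of a waiter that sleeps instead of spinning. The thread roles and the function name are invented for illustration. The "audio" thread blocks on an event and uses no CPU until the "renderer" signals it.

```python
import threading


def wait_for_frame():
    frame_ready = threading.Event()
    log = []

    def audio():
        frame_ready.wait()                 # blocks; the thread is descheduled, not spinning
        log.append("audio: frame consumed")

    def renderer():
        log.append("renderer: frame drawn")
        frame_ready.set()                  # wake the waiting thread

    a = threading.Thread(target=audio)
    r = threading.Thread(target=renderer)
    a.start()
    r.start()
    a.join()
    r.join()
    return log


if __name__ == "__main__":
    print(wait_for_frame())
```

      While `audio` is parked in `wait()`, the scheduler is free to give the core to any other runnable thread.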

      --
      Slashdot: Where people pretend to be twice as smart as they really are by behaving like children.
    6. Re:Thats.. by mastershake_phd · · Score: 1

      Not really...normally, your process goes to sleep during this time. Your CPU spends its cycles doing other things.

      From the point of view of the game engine the cycles are wasted.
    7. Re:Thats.. by GreyWolf3000 · · Score: 1

      Well, your operating system can either schedule all those other background tasks to run when the time slice granted to the game's threads is up, or it can stop the game's threads mid-flight. It doesn't matter too much. Plus, the game may have other threads running which are not blocked on the synchronization.

      --
      Slashdot: Where people pretend to be twice as smart as they really are by behaving like children.
  5. Oversimplifying is bad by BattleCat · · Score: 1, Insightful

    The topic is rather interesting, especially for game developers, among whom I sometimes lurk, but what's the point of simplifying descriptions and problems to the point of being meaningless and useless?

    1. Re:Oversimplifying is bad by mrbluze · · Score: 2, Insightful

      but what's the point of simplifying descriptions and problems up to the point of being meaningless and useless ?

      This isn't information, it's advertising. The target audience is teenagers with wealthy parents who will buy the NVIDIA cards for them.
      --
      Do it yourself, because no one else will do it yourself. [beta blockade 10-17 Feb]
  6. just hype and commercialism by MauricioAP · · Score: 1

    This is just hype; it is well known that for real high-performance applications CUDA is compute-bound, i.e. a lot of bandwidth is wasted. CUDA is just another platform for niche applications, never to compete with commodity processors.

    1. Re:just hype and commercialism by Anonymous Coward · · Score: 1, Interesting

      It's definitely not just hype. In our company, we're using it to speed up some image processing algorithms which can now be applied in real time by just utilizing the <$100 video card in the PC. We are quite excited about this, as we would otherwise have to invest in expensive special-purpose hardware accelerators (which are usually obsolete by the time they're designed, so you spend the rest of their lifetime paying off hardware which has already lost its edge).

      Perhaps the CUDA model in itself is a little clumsy; the fact that it opens up commodity hardware to do some impressive work is very nice.


  7. Nothing new by Anonymous Coward · · Score: 0

    I wonder what the hell I've been doing when I needed multiple threads to have a consistent view of state.

    Oh wait, it's called SYNCHRONISATION.

    I'm sure it's exciting to the OP and all, but hell, this is basic CS101 shit.

  8. Er. by Safiire+Arrowny · · Score: 1

    So make it all synchronize to the lowest fps, the video of course. We are talking about one game object, after all.

    In a real application, the audio/video must be calculated for many objects, and it is a static 30 or 60 fps video and always a static samples-per-second audio rate, perhaps CD-quality 44100 samples per second but likely less.

    This synchronization problem is not unsolved. Every slice of game time is budgeted: how many $SampleRate frames of audio there are, divided among the game objects producing audio, and how many triangles are drawn versus how many triangles are possible.

    You take the lowest amount possible per slice of game time (here a second) and call that the target amount. You don't put more game objects in your environment at a time than you have resources for; the AI can know this secret detail and not show up.

    How does having more than one processor core ever help a game object be expressed better? There is only AI left. Use the extra cores to display more objects, give the objects better AI, or give better sounding/looking objects.
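    As a back-of-the-envelope version of that budgeting in Python: only the 44100 samples per second and 60 fps figures come from the text above. The triangle and voice budgets and the function names are made-up numbers for illustration.

```python
def samples_per_frame(sample_rate, fps):
    # Audio samples that must be produced for each video frame.
    return sample_rate // fps


def max_objects(budget, cost_per_object):
    # The scarcest resource decides how many objects fit in one slice.
    return min(budget[name] // cost_per_object[name] for name in budget)


if __name__ == "__main__":
    print(samples_per_frame(44100, 60))                    # 735
    budget = {"triangles": 100000, "voices": 32}           # hypothetical per-slice limits
    cost = {"triangles": 2000, "voices": 1}                # hypothetical per-object cost
    print(max_objects(budget, cost))                       # 32
```

    Here the triangle budget would allow 50 objects, but the audio voices cap it at 32, so 32 is the target.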

  9. NVidia is doing that? an insult to INMOS... by master_p · · Score: 4, Interesting

    Many moons ago, when most slashdotters were nippers, a British company named INMOS provided an extensible hardware and software platform that solved the problem of parallelism, in many ways similar to CUDA.

    Ironically, some of the first demos I saw using transputers were raytracing demos.

    The problem of parallelism and the solutions available are quite old (more than 20 years), but it's only now that limits are reached that we see the true need for it. But the true pioneers are not NVIDIA, because there were others long before them.

    1. Re:NVidia is doing that? an insult to INMOS... by ratbag · · Score: 2, Interesting

      That takes me back. My MSc project in 1992 was visualizing 3D waves on Transputers using Occam. Divide the wave into chunks, give each chunk to a Transputer, pass the edge case between the Transputers and let one of them look after the graphics. Seem to recall there were lots of INs and OUTs. A friend of mine simulated bungee jumps using similar code, with a simple bit of finite element analysis chucked in (the rope changed colour based on the amount of stretch).

      Happy Days at UKC.

    2. Re:NVidia is doing that? an insult to INMOS... by Anonymous Coward · · Score: 1, Interesting

      I worked at INMOS, now I work for NVIDIA (and I'm posting this anon). Believe me when I say, the ways INMOS and NVIDIA solved the problems of parallelism are not the same.

      Transputers were essentially an embodiment of Tony Hoare's CSP (communicating sequential processes), which is quite general and powerfully expressive. CUDA (sketched here in occam-like-ese) has no message passing, but is shared memory in nature.

      SEQ block.num=0 FOR desired.num.of.blocks
          [...]INT global_mem:
          PAR cta=0 FOR num_ctas
              [...]INT local_mem:
              PAR warp_group=0 FOR warps_groups_in_cta
                  SEQ barrier=0 FOR num.barriers[warp_group]
                      PAR warp=0 FOR num_warps[warp_group]
                          PAR thread=0 FOR 16
                              thread.id := warp*16+thread
                              proc[barrier](block.num, thread.id)

      With no barriers (common in many problems), this reduces to...

      SEQ block.num=0 FOR desired.num.of.blocks
          [...]INT global_mem:
          PAR cta=0 FOR num_ctas
              [...]INT local_mem:
              PAR warp=0 FOR num_warps.in.cta
                  PAR thread=0 FOR 16
                      thread.id := warp*16+thread
                      proc(block.num, thread.id)

      The procedure can use barriers, but cannot communicate (in the CSP sense) other than through shared local memory. In many ways CUDA is more like SIMD than CSP, and can be thought of as arrays of cooperating parallel SIMD threads (a cooperating thread array, or CTA).
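
      For readers who don't speak occam, the barrier-free loop nest above can be mimicked in plain Python. This is only a sketch of the index arithmetic: the PAR loops are run one after another here, the function and parameter names are invented for the example, and the 16 threads per warp is simply the figure used in the occam sketch.

```python
def launch(num_blocks, num_ctas, warps_per_cta, proc, threads_per_warp=16):
    """Walk the block / CTA / warp / thread nest and call proc for each thread.

    Everything below the outer loop is PAR in the occam sketch; it is
    serialized here, so this shows which (block_num, thread_id) pairs are
    produced, not how they would be scheduled on real hardware.
    """
    results = []
    for block_num in range(num_blocks):                  # SEQ block.num
        for _cta in range(num_ctas):                     # PAR cta
            for warp in range(warps_per_cta):            # PAR warp
                for thread in range(threads_per_warp):   # PAR thread
                    thread_id = warp * threads_per_warp + thread
                    results.append(proc(block_num, thread_id))
    return results


if __name__ == "__main__":
    # One block, one CTA, two warps: thread ids run from 0 to 31.
    print(launch(1, 1, 2, lambda block_num, thread_id: thread_id))
```

      Note that proc only ever sees its block number and thread id, which matches the point that threads cannot talk to each other except through shared memory.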

      The primary difference is that the flexibility in the general CSP model puts quite a bit of overhead in the scalar part of the implementations, which needs to be replicated for parallelism. In the CUDA model, the use of a SIMD-thread parallel model amortizes the scalar parts of the implementation over multiple threads, reducing the overhead cost per thread.

      The process rendezvous model in CSP is quite general, but it needs to be integrated into the thread scheduling, which makes thread scheduling quite heavyweight (a linked list in the transputer implementation). With the CTA/CUDA model, threads are highly structured and not dynamic. Threads are dispatched in preconfigured arrays, not per-thread, and are not dynamically creatable (yet), so they are very efficient to dispatch. Of course, once they are running, they can still be temporarily stalled independently (sort of like hyperthreading in CPU cores on register data dependencies, or like in the transputer when you loop communicate). For practical implementations, this is the difference between a multi-thread model (the transputer processing several threads efficiently) and a many-thread model (CUDA processing thousands of threads efficiently).

      In short, CUDA is a very structured model that allows for high efficiency. CSP is a very generic model which has lots of flexibility.

      In many ways, INMOS was a pioneer, but as with most pioneers, they got lots of arrows in their back (occam, secret assembly language, proprietary VLSI design process, etc). They eventually corrected most of these (icc C compiler, transputer assembly language book, port to Verilog, etc), but by then the industry had passed them by (not to mention that they stuck with the 3-operand stack model way-way-way too long)...

  10. OMG Slashdot by Anonymous Coward · · Score: 0

    > The concept of writing individual programs which run on multiple cores is called multi-threading

    What the hell has happened to you, dear slashdot? I told you visual basic was bad for your brain...

  11. New programming tools needed by maillemaker · · Score: 3, Insightful

    When I came up through my CS degree, object-oriented programming was new. Programming was largely a series of sequentially ordered instructions. I haven't programmed in many years now, but if I wanted to write a parallel program I would not have a clue.

    But why should I?

    What is needed are new, high-level programming languages that figure out how to take a set of instructions and best interface with the available processing hardware on their own. This is where the computer smarts need to be focused today, IMO.

    All computer programming languages, and even just plain applications, are abstractions from the computer hardware. What is needed are more robust abstractions to make programming for multiple processors (or cores) easier and more intuitive.

    --
    A work that expires before its copyright never enters the public domain and thus enjoys eternal copyright protection.
    1. Re:New programming tools needed by Anonymous Coward · · Score: 1, Insightful

      Erlang?

    2. Re:New programming tools needed by destruk · · Score: 1

      I can agree with that. Any error that crashes 1 out of 20 or so concurrent threads, on multiple cores, using shared cache, is too complex for a mere human to figure out. After 30+ years programming single threaded applications, it will take a lot of new tools to make this happen.

    3. Re:New programming tools needed by Anonymous Coward · · Score: 0

      The programming model of CUDA is SIMD (single instruction multiple data), like SSE, but with much more D. It is not really parallel programming with complex dependencies, but doing the same thing to lots of data at the same time. If your program has complex dependencies between different tasks which do not involve huge gobs of data, then it probably won't translate to an efficient CUDA program.
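
      To make the distinction concrete, here is a small sketch in plain Python (not CUDA) contrasting the two shapes of problem. Both functions and their parameters are invented for illustration.

```python
def brighten(pixels, amount):
    """Data-parallel shape: each output depends only on its own input element,
    so every element could in principle be handled by a separate lane."""
    return [min(255, p + amount) for p in pixels]


def iterate_map(x0, steps, r=3.5):
    """Dependent shape: each step needs the previous step's result, so in this
    form there is nothing to hand out to parallel lanes."""
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x


if __name__ == "__main__":
    print(brighten([0, 100, 250], 10))
    print(iterate_map(0.5, 1))
```

      The first is the kind of work that maps well onto "the same thing to lots of data"; the second is a single chain of dependencies with very little data.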

    4. Re:New programming tools needed by TheRaven64 · · Score: 2, Interesting
      There's only so much that a compiler can do. If you structure your algorithms serially then a compiler can't do much. If you write parallel algorithms then it's relatively easy for the compiler to turn it into parallel code.

      There are a couple of approaches that work well. If you use a functional language, then you can use monads to indicate side effects and the compiler can implicitly parallelise the parts that are free from side effects. If you use a language like Erlang or Pict, based on a CSP or a pi-calculus model, then you split your program into logically independent chunks with a message passing interface between them, and the compiler or runtime can schedule them independently.

      --
      I am TheRaven on Soylent News
    5. Re:New programming tools needed by maraist · · Score: 4, Interesting

      Consider that if you've ever done UNIX programming, you've been doing MT programming all along - just by a different name.. Multi-Processing. Pipelines are, IMO, the best implementation of parallel programming (and UNIX is FULL of pipes). You take a problem and break it up into wholly independent stages, then multi-process or multi-thread the stages. If you can split the problem up using message-passing then you can farm the work out to decoupled processes on remote machines, and you get farming / clustering. Once you have the problem truly clustered, then multi-threading is just a cheaper implementation of multi-processing (less overhead per worker, fewer physical CPUs, etc).

      Consider this parallel programming pseudo-example

      find | tar | compress | remote-execute 'remote-copy | uncompress | untar'

      This is a 7 process FULLY parallel pipeline (meaning non-blocking at any stage - every 512 bytes of data passed from one stage to the next gets processed immediately). This can work with 2 physical machines that have 4 processing units each, for a total of 8 parallel threads of execution.

      Granted, it's hard to construct a UNIX pipe that doesn't block.. The following variation blocks on the xargs, and has less overhead than separate tar/compress stages but is single-threaded

      find name-pattern | xargs grep -l contents-pattern | tar-gzip | remote-execute 'remote-copy | untar-unzip'

      Here the messages being passed are serialized/linearized data.. But that's the power of UNIX.

      In CORBA/COM/GNORBA/Java-RMI/c-RPC/SOAP/HTTP-REST/ODBC, your messages are 'remoteable' function calls, which serialize complex parameters; much more advanced than a single serial pipe/file-handle. They also allow synchronous returns. These methodologies inherently have 'waiting' worker threads.. So it goes without saying that you're programming in an MT environment.

      This class of Remote-Procedure-Calls is mostly for centralization of code or central-synchronization. You can't block on a CPU mutex that's on another physically separate machine.. But if you RPC to a central machine with a single mutex variable then you can.. DB locks are probably more common these days, but it's the exact same concept - remote calls to a central locking service.

      Another benefit in this class of IPC (Inter Process Communication) is that a stage or segment of the problem is handled on one machine.. But a pool of workers exists on each machine.. So while one machine is blocking, waiting for a peer to complete a unit of work, there are other workers completing their stage.. At any given time on every given CPU there is a mixture of pending and processing threads. So while a single task isn't completed any faster, a collection of tasks takes full advantage of every CPU and physical machine in the pool.

      The above RPC-type models involve explicit division of labor. Another class is true opaque messages.. JMS in Java, and even UNIX's 'ipcs' Message Queues. The idea is that you have the same workers as before, but instead of having specific UNIQUE RPC URIs (addresses), you have a common messaging pool with a suite of message-types and message-queue-names. You then have pools of workers that can live ANYWHERE, which listen to their queues and handle an array of types of pre-defined messages (defined by the application designer). So now you can have dozens or hundreds of CPUs, threads, machines all symmetrically passing asynchronous messages back and forth.

      To my knowledge, this is the most scalable type of problem.. You can take most procedural problems and break them up into stages, then define a message-type as the explicit name of each stage, then divide up the types amongst different queues (which would allow partitioning/grouping of computational resources), then receive-message/process-message/forward-or-reply-message. So long as the amount of work far exceeds the overhead of message passing, you can very nicely scale with the amount of hardware you can throw at the problem.
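
      A minimal single-machine sketch of that queue-per-message-type idea, using only Python's standard queue and threading modules rather than JMS or System V message queues. The function name, the handler dictionary and the None sentinel are all assumptions made for the example, and the forward-to-next-stage step is left out to keep it short.

```python
import queue
import threading


def run_stage_workers(messages, handlers, workers_per_queue=2):
    """Route (message_type, payload) pairs to per-type queues served by worker pools.

    handlers maps a message type to the function that processes its payload.
    None is used as a shutdown sentinel, so payloads must not be None.
    """
    queues = {name: queue.Queue() for name in handlers}
    results = queue.Queue()

    def worker(name):
        # receive-message / process-message / reply-message loop
        while True:
            payload = queues[name].get()
            if payload is None:
                break
            results.put((name, handlers[name](payload)))

    threads = [
        threading.Thread(target=worker, args=(name,))
        for name in handlers
        for _ in range(workers_per_queue)
    ]
    for t in threads:
        t.start()
    for msg_type, payload in messages:
        queues[msg_type].put(payload)
    for name in handlers:
        for _ in range(workers_per_queue):
            queues[name].put(None)  # one sentinel per worker
    for t in threads:
        t.join()

    collected = []
    while not results.empty():
        collected.append(results.get())
    return sorted(collected)  # completion order is not deterministic, so sort


if __name__ == "__main__":
    handlers = {"double": lambda x: 2 * x, "square": lambda x: x * x}
    print(run_stage_workers([("double", 1), ("square", 3), ("double", 5)], handlers))
```

      Swapping the in-process queues for a network message broker is what would let the workers "live ANYWHERE"; the worker loop itself would look much the same.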

      --
      -Michael
    6. Re:New programming tools needed by Tim+Browse · · Score: 1

      When I came up through my CS degree, object-oriented programming was new. Programming was largely a series of sequentially ordered instructions. I haven't programmed in many years now, but if I wanted to write a parallel program I would not have a clue.

      But why should I?

      What is needed are new, high-level programming languages that figure out how to take a set of instructions and best interface with the available processing hardware on their own. This is where the computer smarts need to be focused today, IMO.

      Crikey, when was your CS degree? Mine was a long time ago, yet I still learned parallel programming concepts (using the occam language).
    7. Re:New programming tools needed by ultranova · · Score: 1

      Consider that if you've ever done UNIX programming, you've been doing MT programming all along - just by a different name.. Multi-Processing. Pipelines are, IMO, the best implementation of parallel programming (and UNIX is FULL of pipes).

      Unix pipes are a very primitive example of a dataflow language.

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    8. Re:New programming tools needed by Kjella · · Score: 1

      Yes, they could be better but the problem isn't going to go away entirely. When you run a single-threaded application A to Z you only need to consider sequence. When you try to make a multi-threaded application you have to not only tell it about sequence but also the choke points where the state must be consistent. There are already languages that make it a lot easier to fit into the "pool" design pattern, where you have a pool of tasks and a pool of resources (threads) to handle it, which works when you've got static parallelization, but that doesn't cover anywhere near all the issues.
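
      For reference, the "pool" pattern in its simplest form looks something like this in Python, using the standard concurrent.futures module. The function name and the squaring task are placeholders chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pool(tasks, worker, max_workers=4):
    """Hand a fixed list of independent tasks to a fixed pool of threads.

    This only works because the tasks are known up front and do not need
    to talk to each other, i.e. the parallelism is static.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input tasks
        return list(pool.map(worker, tasks))


if __name__ == "__main__":
    print(run_pool(range(5), lambda x: x * x))
```

      Nothing in this shape helps once tasks start depending on each other at run time.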

      Imagine you're doing a physics simulation with each thread handling an object, but if they come too close you need to simulate the interaction as well. Now the parallelism is dynamic, with objects moving in and out of different interactions and plenty of messaging back and forth to do collision detection. The compiler is never going to figure that out - colliding objects and ethereal objects passing through each other are both "valid" solutions as far as it knows.

      --
      Live today, because you never know what tomorrow brings
    9. Re:New programming tools needed by maillemaker · · Score: 1

      I actually finished my degree in 2005, but I took all my CS classes from 1992-1997. I started college in 1988.

      --
      A work that expires before its copyright never enters the public domain and thus enjoys eternal copyright protection.
    10. Re:New programming tools needed by philipgar · · Score: 2, Insightful

      While you make some good points in your comment, there are parts that are off. First, UNIX pipes are not an effective way to parallelize an application. UNIX pipes provide a method that tends to be inefficient, and will involve much "needless" copying of data (from your application to the pipe, the OS will then read in the data and write it to the other process, which will then likely read the data into its address space). Additionally, UNIX pipes work well for steady state, but tend to have problems with starting up and stopping. This is true of many pipeline systems. They also lack the ability to easily communicate "side channel" information. There are other streaming programming languages being tried that will hopefully fix some of these problems; for instance, see the StreamIt project at MIT.

      Additionally, the examples given in your post about UNIX pipes neglect to mention the fact that a pipeline is only as fast as its slowest component. The idea of breaking tar -czf into tar cf|gzip and fully utilizing 2 CPUs is laughable. tar (without compression) is an extremely simple program that doesn't require much CPU time to run, while gzip will likely run for a while on the data. You'll likely get the situation where gzip is running 100% of the time (assuming your I/O is fast enough), and tar is running 5-10% of the time. This is quite far from the 200% number quoted.
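
      A back-of-the-envelope sketch of that utilization argument, in Python. The per-unit costs are made-up numbers picked to match the 5% versus 100% picture above, and the model assumes one CPU per stage with the slowest stage setting the pace.

```python
def stage_utilization(cpu_seconds_per_unit):
    """Fraction of time each pipeline stage is busy at steady state.

    Assumes each stage has its own CPU and that the slowest stage
    determines the rate at which data moves through the pipeline.
    """
    slowest = max(cpu_seconds_per_unit)
    return [cost / slowest for cost in cpu_seconds_per_unit]


def total_cpus_used(cpu_seconds_per_unit):
    """Sum of the per-stage utilizations, in units of whole CPUs."""
    return sum(stage_utilization(cpu_seconds_per_unit))


if __name__ == "__main__":
    # Hypothetical costs: tar needs 0.05 s of CPU per unit of data, gzip 1.0 s.
    costs = [0.05, 1.0]
    print(stage_utilization(costs))
    print(total_cpus_used(costs))
```

      With those numbers the two-stage pipeline keeps barely more than one CPU busy, which is the whole objection.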

      Pipelining is a fairly easy technique to exploit parallelism for some applications, but to really utilize many processors, SPMD (single program multiple data) techniques are necessary, as you hinted at with the worker threads (one way to achieve this). A well parallelized program should take advantage of both types of parallelism to maximize performance.

      Phil

    11. Re:New programming tools needed by Effugas · · Score: 1

      It's an interesting example you raise. Let's take a look at your example:

      find | tar | compress | remote-execute 'remote-copy | uncompress | untar'

      find -- you're sweeping the file system and comparing against rules. Maybe IO-driven CPU, at best.
      tar -- You're appending a couple headers. No work.
      compress -- OK, here there's a CPU bound.
      remote-execute remote-copy -- Throwing stuff onto the network and pulling it off.
      uncompress -- OK, more CPU bound.
      untar -- Now you're adding files to the file system, but only as fast as the network can give them to you. No work here either, really.

      So, in your example, you have two CPU bound processes. Interestingly, the problem itself is a network copy, which conveniently always has two processes. About the only improvement an extra core would add, then, is maybe find going a little faster.

      Parallelize the compressors and decompressors, now, and you're onto something.

    12. Re:New programming tools needed by nietpiet · · Score: 1

      yes, to program such an interface is of course very easy, just as simple as writing a program to check if a thread ever halts.

  12. Artificial Intelligence. by Safiire+Arrowny · · Score: 1

    Like I was saying in another post, since everything per game object must be synchronized to the slowest procedure (video rendering of the object), the way to not waste CPU cycles is to spend them on AI.

    In essence, the faster your CPU (static on consoles), the more time you can devote to making your game objects smarter after you're done with the audio-visual.

  13. I don't understand the point of this article. by destruk · · Score: 1

    This tells me nothing. Why would you want a game (Common single threaded-programmed application) to compete with your divx compression and ray tracing bryce3d application running in the background? Are they (Intel, AMD, IBM) all saying that we need to hook up 8 or 12 or 24 processor cores at 3ghz each to get an actual speed of 4ghz while each one waits around wasting processing cycles to get something to do? That is the lamest thing I've heard in a long time. I'd much rather have a SINGLE CORE Graphene processor at 12Ghz, than quadcore or oct-core at 4ghz.

    1. Re:I don't understand the point of this article. by Safiire+Arrowny · · Score: 1

      Even though I think this is a very speculative and information free article, if you imagine it in the domain of the PS3 console for example, where any time a core is not doing anything useful it is wasting potential, I guess you could see where they're coming from.

      At least that is the idea I had while reading it, I wasn't thinking about running other cpu intensive PC apps at the same time as a game.

    2. Re:I don't understand the point of this article. by rdebath · · Score: 1

      But you can't have a 12GHz, at that speed light goes about ONE INCH per clock cycle in a vacuum, anything else is slower, signals in silicon are a lot slower.

      So much slower that a modern single core processor will have a lot of "execution units" to keep up with the instructions arriving at the 3GHz rate; these instructions are handed off to the units in parallel and the results drop out of the units "a few" clock cycles later. This is good except when the result of UnitA is needed before UnitB can start. At this point UnitB has to wait; Intel discovered that nearly half their execution units were waiting most of the time so they invented HyperThreading.

      With HyperThreading the execution units are shared between two threads that the OS wants to run at the same time which means more of the silicon is used and the machine is faster.

      At this point clock speeds have hit a hard wall; they will continue to go up, but only in inverse proportion to the feature size on the silicon. OTOH the number of gates goes up with the inverse square of the feature size. I would expect the mass retail sale of 16 to 30 cores on a single piece of silicon before we hit 12GHz.

      Of course that brings its own problems; refactoring a program into (say) 10 threads is easy when compared to 100 or 10000!

    3. Re:I don't understand the point of this article. by TheRaven64 · · Score: 2, Informative

      But you can't have a 12GHz, at that speed light goes about ONE INCH per clock cycle in a vacuum, anything else is slower, signals in silicon are a lot slower.

      An inch is a long way on a CPU. A Core 2 die is around 11mm along the edge, so at 12GHz a signal could go all of the way from one edge to the other and back. It uses a 14-stage pipeline, so every clock cycle a signal needs to travel around 1/14th of the way across the die, giving around 1mm. If every signal needs to move 1mm per cycle and travels at the speed of light, then your maximum clock speed is 300GHz.

      Of course, as you say, electric signals travel a fair bit slower in silicon than photons do in a vacuum, and you often have to go a quite indirect route due to the fact that wires can't cross on a CPU, so the practical speed might be somewhat lower.
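
      For anyone who wants to check the arithmetic above, here is a small Python sketch. It uses the vacuum speed of light throughout, so these are upper bounds and not estimates for real silicon; the function names are invented for the example.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre


def distance_per_cycle_mm(clock_hz):
    """How far light travels in a vacuum during one clock cycle, in mm."""
    return SPEED_OF_LIGHT_M_PER_S / clock_hz * 1000.0


def max_clock_ghz(distance_mm):
    """Clock rate at which light covers distance_mm in exactly one cycle."""
    return SPEED_OF_LIGHT_M_PER_S / (distance_mm / 1000.0) / 1e9


if __name__ == "__main__":
    print(distance_per_cycle_mm(12e9))  # compare with the 11mm die edge
    print(11 / 14)                      # die edge divided by pipeline stages, in mm
    print(max_clock_ghz(1.0))           # the 1mm-per-cycle case
```

      At 12GHz the distance comes out at just under 25mm, roughly the inch quoted, and the 1mm case lands just under 300GHz.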

      Intel discovered that nearly half their execution units were waiting most of the time so they invented HyperThreading.

      Minor nitpick, but actually IBM were the first to market with SMT, and they took it from a university research project. Intel didn't discover anything other than that their competitors were getting more instructions per transistor than them.
      --
      I am TheRaven on Soylent News
    4. Re:I don't understand the point of this article. by rdebath · · Score: 1

      Personally I would have guessed the speed to be over the current 3GHz (or so) but CPU companies haven't increased the clock rate for a long time and there are a lot of people (like the OP) who would pay top dollar for a faster clocked CPU.

      SMT: Oh, the DEC Alpha was the first commercial CPU, was it? I thought about checking it after I posted! Still, HyperThreading is a somewhat special variant because only the absolute minimum of hardware is duplicated to allow SMT, making it reasonable to run both with and without the second thread.

    5. Re:I don't understand the point of this article. by smallfries · · Score: 1

      I think that you've oversimplified a tad too much. You are assuming instant switching time on your gates. Sure, light could propagate that fast in a vacuum, and electrons in a wire could do some comparable %. But a pipeline stage may have a combinatorial depth of several hundred gates, and once you subtract their switching time, signal propagation is a serious problem. The current range of Core2s has to use lots of fancy tricks (like asynchronous timing domains) to get around clock-skew at 3GHz on an 11mm square die.

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
    6. Re:I don't understand the point of this article. by rdebath · · Score: 1

      I'll have a minor nitpick too please

      Electrons in a wire move at around 3 inches per hour, it's the signals that move at near lightspeed.

  14. Uh, what a crap by udippel · · Score: 4, Informative

    "News for Nerds, Stuff that matters".
    But not if posted by The Ignorant.

    What if the developer gives the character movement tasks its own thread, but it can only be rendered at 400 fps. And the developer gives the 3D world drawer its own thread, but it can only be rendered at 60 fps. There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization.

    If a student of mine wrote this, a Fail would be the immediate consequence. How can 400 fps be 'only'? And why is threading bad, if the character movement is ready after 1/400 second? There is not 'a lot of waiting'; instead, there are a lot of cycles to calculate something else. And 'waiting' is not 'synchronisation'.
    [The audio-rate of 7000 fps gave the author away; and I stopped reading. Audio does not come in fps.]

    While we all agree on the problem of synchronisation in parallel programming, and maybe especially in the gaming world, we should not allow uninformed blurb on Slashdot.

    1. Re:Uh, what a crap by destruk · · Score: 1

      Samples per second would be more accurate.

    2. Re:Uh, what a crap by maxume · · Score: 1

      I'm pretty sure it means "fixed at 400 fps" rather than "just 400 fps".

      --
      Nerd rage is the funniest rage.
    3. Re:Uh, what a crap by Anonymous Coward · · Score: 0

      [The audio-rate of 7000 fps gave the author away; and I stopped reading. Audio does not come in fps.]

      My guess is that the author tried to find a common base unit to compare computing time required by different tasks. Rather than using time (e.g. milliseconds) per task, he chose to use frames per second which may be easier to understand by the average gamer. Both units indicate how processing intensive a task is.

    4. Re:Uh, what a crap by Quixote · · Score: 1
      While I agree that the "article" was by a nitwit, I do have to quibble about something you wrote.

      How can 400 fps be 'only'?

      You are responding to the following (hypothetical) statement:
      but it can be rendered at only 400 fps

      Which is different from the one written:
      but it can only be rendered at 400 fps

      See the difference?

    5. Re:Uh, what a crap by Anonymous Coward · · Score: 0

      >[The audio-rate of 7000 fps gave the author away; and I stopped reading. Audio does not come in fps.]

      Well not to argue, but digital audio for soundtracks often is expressed in terms of frames.

    6. Re:Uh, what a crap by Anonymous Coward · · Score: 0

      If a student of mine wrote this

      A CS tutor who thinks he knows a lot about the theory, but knows little about the practice? Firstly, the article said:

      but it can only be rendered at 400 fps.

      not:

      but it can be rendered at only 400 fps.

      Might help if you read your students' work properly...

      Secondly, Audio requires high update rates. The reason for this is that to play audio you fill a set of buffers. Once a buffer is filled it is locked and its contents cannot be changed. For playing music, that's not a problem - just fill the buffers up and top them up every now and again.
      For a computer game sound effect, the sound has to respond to the game environment - which may rapidly change. Imagine trying to change the sound a jet engine makes on a flight simulator as the throttle is engaged: if you fill the buffer up too much the sound will 'lag behind', therefore you must use small buffers!
      If the buffers all end up empty, the sound will stop. This results in stuttering which is not acceptable. Therefore, with small buffers, they need to be updated very rapidly. I typically devote an entire thread to it.
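
      To put rough numbers on the buffer trade-off, here is a small Python sketch. The 512-sample buffer and the 44100 Hz rate are example values picked for illustration, not figures from any particular audio API, and the function names are made up.

```python
def buffer_latency_ms(buffer_samples, sample_rate_hz):
    """Worst-case lag before a change in the game is heard, for one full buffer."""
    return buffer_samples / sample_rate_hz * 1000.0


def refills_per_second(buffer_samples, sample_rate_hz):
    """How often a buffer of this size must be topped up to avoid stuttering."""
    return sample_rate_hz / buffer_samples


if __name__ == "__main__":
    for size in (512, 4096):  # hypothetical buffer sizes, in samples
        print(size, buffer_latency_ms(size, 44100), refills_per_second(size, 44100))
```

      Smaller buffers mean less lag but more refills per second, which is why the update rate for audio ends up so much higher than the video frame rate.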

    7. Re:Uh, what a crap by Anonymous Coward · · Score: 0

      Anybody who'd ever done any MP3 decoding would know that! Our "professor" sounds like a typical academic CS guy who thinks he knows something about software engineering. More likely he's nobody, and doesn't know anything about computer science either. How he got upmodded anything other than "Funny" is a mystery.

  15. CUDA helps by... by Joce640k · · Score: 1

    CUDA helps by moving more work to the GPU - where the biggest bottleneck is.

    Um, no, that can't be right... :-(

    --
    No sig today...
  16. hardware encoded timestamps by Anonymous Coward · · Score: 0

    Oh and could you figure out some way to timestamp FPS game captures for the upcoming olympic video games?

  17. yawn by nguy · · Score: 1

    Except for being somewhat more cumbersome to program and less parallel than previous hardware, there is nothing really new about the nVidia parallel programming model. And their graphics-oriented approach means that their view of parallelism is somewhat narrow.

    Maybe nVidia will popularize parallel programming, maybe not. But I don't see any "shake up" or breakthroughs there.

  18. Nvidia should just put out their own OS by Latinhypercube · · Score: 1

    Why not start again with a massively parallel GPU, skipping all the years of catch-up that will be necessary with multi-core CPUs? Make an OS for your chips...

  19. couldn't resist a quick Inmos story... by Fallen+Andy · · Score: 4, Interesting

    Back in the early 80's I was working in Bristol UK for TDI (who were the UCSD p-system licensees) porting it to various machines... Well, we had one customer who wanted a VAX p-system so we trotted off to INMOS's office and sat around in the computer room. (VAX 11/780 I think). At the time they were running Transputer simulations on the machine so the VAX p-system took er... about 30 *minutes* to start. Just for comparison an Apple ][ running IV.x would take less than a minute. Almost an hour to make a tape. (About 15 users running emulation I think). Fond memories of the transputer. Almost bought a kit to play with it... Andy

  20. CUDA = NVIDIA desperate to compete with Intel? by Cordath · · Score: 4, Insightful

    CUDA is an interesting way to utilize NVIDIA's graphics hardware for tasks it wasn't really designed for, but it's not a solution to parallel computing in and of itself. (more on that momentarily) A few people have gotten their nice high end Quadros to do some pretty cool stuff, but to date it's been limited primarily to relatively minor academic purposes. I don't see CUDA becoming big in gaming circles anytime soon. Let's face it, most gamers buy *one* reasonably good video card and leave it at that. Your video card has better things to do than handle audio or physics when your multi-core CPU is probably being criminally underutilized. Nvidia, of course, wants people to buy wimpy CPUs and then load up on massive SLI rigs and then do all their multi-purpose computation in CUDA. Not gonna happen.

    First of all, there are very few general purpose applications that special purpose NVIDIA hardware running CUDA can do significantly better than a real general purpose CPU, and Intel intends to cut even that small gap down within a few product cycles. Second, nobody wants to tie themselves to CUDA when it's built entirely for proprietary hardware. Third, CUDA still has a *lot* of limitations. It's not as easy to develop a physics engine for a GPU using CUDA as it is for a general purpose CPU.

    Now, I haven't used CUDA lately, so I could be way off base here. However, multi-threading isn't the real challenge to efficient use of resources in a parallel computing environment. It's designing your algorithms to be able to run in parallel in the first place. Most multi-threaded software out there still has threads that have to run on a single CPU, and the entire package bottlenecks on the single CPU running that thread even if other threads are free to run on other processors. This sort of bottleneck can only be avoided at the algorithm level. This isn't something CUDA is going to fix.

    Now, I can certainly see why NVIDIA is playing up CUDA for all they're worth. Video game graphics rendering could be on the cusp of a technological singularity. Namely, ray tracing. Ray tracing is becoming feasible to do in real time. It's a stretch at present, but time will change that. Ray tracing is a significant step forward in terms of visual quality, but it also makes coding a lot of other things relatively easy. Valve's recent "Portal" required some rather convoluted hacks to render the portals with acceptable performance, but in a ray tracing engine those same portals only take a couple lines of code to implement and have no impact on performance. Another advantage of ray tracing is that it's dead simple to parallelize. While current approaches to video game graphics are going to get more and more difficult to work with as parallel processing rises, ray tracing will remain simple.

    The real question is whether NVIDIA is poised to do ray-tracing better than Intel in the next few product cycles. Intel is hip to all of the above, and they can smell blood in the water. If they can beef up the floating point performance of their processors then dedicated graphics cards may soon become completely unnecessary. NVIDIA is under the axe and they know it, which might explain all the recent anti-Intel smack-talk. Still, it remains to be seen who can actually walk the walk.

    1. Re:CUDA = NVIDIA desperate to compete with Intel? by smallfries · · Score: 1

      > First of all, there are very few general purpose applications that special purpose NVIDIA hardware running CUDA can do significantly better than a real general purpose CPU, and Intel intends to cut even that small gap down within a few product cycles.

      That's not strictly true. Off the top of my head: sorting, FFTs (or any other dense linear algebra) and crypto (both public key and symmetric) cover quite a lot of range. The only real issue for these applications is the large batch sizes necessary to overcome the latency. Some of this is inherent in warming up that many pipes, but most of it is shit drivers and slow buses.

      The real question is what benefits will CUDA offer when the vector array moves closer to the processor? Most of the papers with the above applications used pre-CUDA hardware with all of the horrors of general-purpose coding running under OpenGL. A couple of the applications would already receive a significant boost from running in CUDA on modern hardware (primarily from latency reduction).

      It doesn't surprise anyone that we are watching the second generation of FPU being folded into the processor. It wouldn't surprise me personally if ten years from now the individual floating EUs inside most chips had disappeared completely leaving small Integer / Control pipes as a front-end to a massive vector array of FP units. There is more at stake than who can trace rays the quickest.
      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
    2. Re:CUDA = NVIDIA desperate to compete with Intel? by Barny · · Score: 1

      As an article earlier this month pointed out they are in fact in the process of porting the CUDA system to CPUs.

      The advantage would be (assuming this is the wonderful solution it claims to be) that you write your task for the CUDA environment once. If your client only has a pile of 1U racks, he can at least run it; if he replaces a few of them with some Tesla racks, things will speed up a lot.

      I did some programming at college, I do not claim to know anything about the workings of Tesla or CUDA, but it sure sounds rosy if this stuff would work.

      --
      ...
      /me sighs
    3. Re:CUDA = NVIDIA desperate to compete with Intel? by Spatial · · Score: 1

      How is a raytracing renderer going to render the view through a portal with "no performance impact"? Magic? It still has to draw the view through the portal just like the render target method used in the normal renderer does, even if it might be more efficient. That isn't free.

      The whole raytracing thing seems like empty hype to me. How is it going to be a significant step forward when we already have proven methods that're capable of graphics bordering on the photo-realistic? It's hard to move forward when you're already at the end of the line.

    4. Re:CUDA = NVIDIA desperate to compete with Intel? by Anonymous Coward · · Score: 0

      > A few people have gotten their nice high end Quadros to do some pretty cool stuff, but to date it's been limited primarily to relatively minor academic purposes.

      From what I've heard, that's not nearly true. There are quite a few hard-core industrial applications that have seen speedups of 200X using CUDA. Stuff to do with complex medical imaging, oil field flow analysis, Monte Carlo computations on whatever.
    5. Re:CUDA = NVIDIA desperate to compete with Intel? by daveisfera · · Score: 1

      The point wasn't that it was free, but that the cost was dramatically less with raytracing than it is with rasterization.

    6. Re:CUDA = NVIDIA desperate to compete with Intel? by Spatial · · Score: 1

      Relatively less, perhaps. But isn't it going to be much slower overall just by virtue of using raytracing? We can get a lot more done with what we've got right now.

    7. Re:CUDA = NVIDIA desperate to compete with Intel? by Rockoon · · Score: 1

      Perhaps you should learn a little something about how raytracing works before you critique it.

      The point isn't that it's free, the point is that it costs no more than any other reflective or refractive surface under raytracing. That's the elegance of raytracing. Such surfaces are cheap in a raytracer, expensive in a rasterizer.

      Raytracing isn't just elegant. Raytracers scale better than rasterizers in the specific task of rendering, outperforming rasterizers when the primitive count grows enormous. Raytracing most definitely is the future of high detail rendering.

      There are still a few hills to climb with raytracers, such as how to maintain O(log n) on intersection tests in highly dynamic environments. The current solutions all have annoying warts, but they are just warts.

      Rasterizers have far more warts than raytracers ever will. Rasterizers use many dirty little tricks to fake what raytracers do naturally. Everything from shadows, reflective surfaces, parallax mapping, radiosity, and just about every other "boasted about" lighting technique.

      If you want to see high quality realtime global illumination in games, then you want raytracing. Rasterizers can't do it unless they cheat and perform raytracing.

      Get it now?

      --
      "His name was James Damore."
    8. Re:CUDA = NVIDIA desperate to compete with Intel? by unsigned+integer · · Score: 1

      "I don't see CUDA becoming big in gaming circles anytime soon." ... until Ageia PhysX is ported to CUDA. Thus enabling every single G80 and higher card to also turn into a physics accelerator. Yeah, gamers won't go for that shit at all.

      "Third, CUDA still has a *lot* of limitations. It's not as easy to develop a physics engine for a GPU using CUDA as it is for a general purpose CPU."

      Guess we'll see.

      http://en.wikipedia.org/wiki/PhysX

    9. Re:CUDA = NVIDIA desperate to compete with Intel? by anss123 · · Score: 1

      Raytracing is itself a cheat. Don't forget that. Rasterizers may cheat more but we got ten years of experience with real time rasterizers now while raytracing is a 'might be but probably not unless problem X is overcome'.

    10. Re:CUDA = NVIDIA desperate to compete with Intel? by Rockoon · · Score: 1

      That's where you are wrong.

      You quote "might be but probably not unless problem X is overcome"

      Who exactly are you quoting? People in your imagination do not count.

      The "problems" that raytracing face HAVE been overcome, decades ago.

      If you don't understand the different algorithmic complexities between rasterization and raytracing then you should do your homework before spouting off. Raytracing dominates rasterizers when using enormous primitive counts. It has always done so, and always will.

      Raytracer complexity grows at O(log n), Rasterizer complexity grows at O(n).

      Any programmer should know exactly what that means.

      Rasterizers are the bubble sort of the rendering world, and while it outperforms raytracing with small (n) just like bubble outperforms quicksort on small (n), it cannot ever outperform raytracing with large (n) just like bubble cannot outperform quicksort on large (n).

      Class dismissed.

      --
      "His name was James Damore."
    11. Re:CUDA = NVIDIA desperate to compete with Intel? by anss123 · · Score: 1

      Solved? Issues such as aliasing and indirect light have just been 'worked around'. That's not solved IMO. Ray tracing is not happy about movement in scenes (think swaying trees). Neither is ray tracing all that accurate with reality, based on the assumption that light travels in the opposite direction than what it actually does (from the eyes to the sun).

      IOW ray tracing is not a silver bullet. It handles dynamic scenes poorly and needs an exponential rise in the number of rays as quality is improved. Pixar for instance used ray tracing first in 'Cars' (a movie with many reflective surfaces and shadows), having relied on scanline rendering for all their previous movies. Even then Pixar outright stated that raytracing was unsuited for very complex scenes, which is the exact opposite of your claim.

      This may be the result of Pixar's implementation, but I'd rather take their word for it than yours.

    12. Re:CUDA = NVIDIA desperate to compete with Intel? by Rockoon · · Score: 1

      Aliasing? It's not solved in strict raytracing, that's why variants of raytracing such as beam-tracing were invented decades ago. Wrong. Solved.

      Indirect lighting? Radiosity at the worst, just like rasterization. Wrong. Solved.

      Not accurate with reality because it's done backwards? Math is hard. Let's go shopping! Wrong. What do you think rasterizers are doing?

      Dynamic scenes? Existing solutions are ugly, which I call warts. They exist though. In fact, many solutions exist, so Wrong. Solved.

      The Cars movie? Small (n). Pixar doesn't do large (n).

      If you want to take Pixar's word over mathematical facts, then so be it. You cannot argue that O(n) is better than O(log n) for large N. Here's an idea.. learn what Big-O notation is, and why it's important.

      (as if Pixar doesn't have a motive for every statement they make..)

      --
      "His name was James Damore."
    13. Re:CUDA = NVIDIA desperate to compete with Intel? by Rockoon · · Score: 1

      Doing you a favor.
      A Power Point Presentation

      Digest until slide 13.

      Notice what happens as size of input increases in magnitude for the O(log n) algorithms.
      Compare with what happens when the size of the input increases in magnitude for the O(n) algorithms.

      There must exist an N where O(log n) blows O(n) out of the water, no matter what per-iteration constants are in play.

      We have a good idea what size N must be for a raytracer to begin to blow a rasterizer out of the water, given the observed per-iteration constants of each algorithm. It's somewhere between 1 million (ideal conditions for the raytracer) and 10 million (doesn't matter how poor the conditions are for the raytracer.)

      Put another way, if a rasterizer can render a scene with 1,000,000 primitives in DT time, then if we double N to 2,000,000 then it takes 2*DT time.

      But if we consider the same problem for a raytracer.. if a raytracer can render 1,000,000 primitives in DT time then doubling N to 2,000,000 only takes 1.05*DT time.

      With large N, doubling the input size is practically free for a raytracer but is always a linear double-the-time-for-double-the-input for the rasterizer.

      Rasterization cannot compete for large N. It's a certainty.
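
      The 2x versus roughly 1.05x figures above can be checked with a few lines of Python. This is only a sketch of the asymptotic argument: it assumes cost is exactly proportional to n for the rasterizer and to log n for the raytracer, and it ignores per-iteration constants, rays per pixel, and the cost of building or updating the acceleration structure. The function names are made up for illustration.

```python
import math


def linear_cost_ratio(n: int, factor: int = 2) -> float:
    """How much an O(n) pass slows down when n grows by `factor`."""
    return (n * factor) / n


def log_cost_ratio(n: int, factor: int = 2) -> float:
    """How much an O(log n) lookup slows down when n grows by `factor`.

    The base of the logarithm cancels out of the ratio.
    """
    return math.log2(n * factor) / math.log2(n)


if __name__ == "__main__":
    n = 1_000_000
    print("O(n)     doubling cost:", linear_cost_ratio(n))
    print("O(log n) doubling cost:", round(log_cost_ratio(n), 3))
```

      For n = 1,000,000 the logarithmic ratio is 1 + 1/log2(n), which is where the "only about 5% more" figure comes from.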

      --
      "His name was James Damore."
    14. Re:CUDA = NVIDIA desperate to compete with Intel? by anss123 · · Score: 1

      Big-O is an annoying notation, I prefer linear instead of O(n) and logarithmic instead of O(log n). That said, regardless of what you want to believe workarounds are workarounds, not solutions. The point anyway was that ray tracing does not reflect reality, not that scanline renderers were superior. Both methods are at the end of the day compromises.

      You call Pixar scenes with tens of thousands of objects simple? Pixar has shown some of the most complex off-line rendering I've seen and you consider that too simple to benefit from ray tracing?!

    15. Re:CUDA = NVIDIA desperate to compete with Intel? by Rockoon · · Score: 0

      I do.

      Modern games deal with way more than "tens of thousands" of primitives, and they do it realtime. Games already juggle millions of polygons, and that count is only going to grow.

      Pixar deals with low numbers of primitives, they don't do it in realtime, and they only deal with what is absolutely necessary given extensive off-line scene analysis.

      They aren't even playing on the same field. Pixar == small n.

      --
      "His name was James Damore."
    16. Re:CUDA = NVIDIA desperate to compete with Intel? by anss123 · · Score: 1

      So you're now saying that if Pixar just skipped doing off-line scene analysis and instead did ray tracing directly they would render faster? And what stops future triangle-heavy games from doing the same pre-scene analysis if that's not the case?

    17. Re:CUDA = NVIDIA desperate to compete with Intel? by Rockoon · · Score: 0

      Triangle-heavy games have one issue that Pixar does not have. An unpredictable user controlling the camera, who is going to move it to every corner of every map. Pre-processing like Pixar does simply isn't feasible for games.

      If you have the computing power to do this sort of preprocessing in realtime, then all your computational arguments against raytracing have just evaporated.

      Why is it so hard for you to understand that you are comparing apples to oranges? These two problem sets are very different and no matter how sly you think you are about asking a loaded question, it's still a loaded question.

      Pixar also does a lot of between-frame rendering operations (such as calculating realistic motion blurring) that also will not work in an interactive manner. Again, games (Crysis) are faking this sort of effect. Raytracers will as well, but will do it more efficiently (rasterizers use alphablending sparingly, because usually alphablended objects need to be sorted back to front in a rasterizer).

      --
      "His name was James Damore."
  21. More investment needed in e.g Erlang by Kupfernigk · · Score: 3, Interesting
    The approach used by Erlang is interesting as it is totally dependent on message passing between processes to achieve parallelism and synchronisation. To get real time performance, the message passing must be very efficient. Messaging approaches are well suited to parallelism where the parallel processes are themselves CPU and data intensive, which is why they work well for cryptography and image processing. From this point of view alone, a parallel architecture using GPUs with very fast intermodule channels looks like a good bet.

    The original Inmos Transputer was designed to solve such problems and relied on fast inter-processor links, and the AMD Hypertransport bus is a modern derivative.

    So I disagree with you. The processing hardware is not so much the problem. If GPUs are small, cheap and address lots of memory, so long as they have the necessary instruction sets they will do the job. The issue to focus on is still interprocessor (and hence interprocess) links. This is how hardware affects parallelism.

    I have on and off worked with multiprocessor systems since the early 80s, and always it has been fastest and most effective to rely on data channels rather than horrible kludges like shared memory with mutex locks. The code can be made clean and can be tested in a wide range of environments. I am probably too near retirement now to work seriously with Erlang, but it looks like a sound platform.
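
    To show the shape of the data-channel style, here is a minimal Python sketch (not Erlang, and not the hardware FIFOs mentioned in this thread). The worker owns no shared state; it only reads messages from one channel and writes results to another. The names `worker` and `run_pipeline` and the `None` shutdown sentinel are assumptions made for the example. Python's `queue.Queue` does use locks internally, so this illustrates the program structure rather than a lock-free implementation.

```python
import queue
import threading


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Square every number received; forward the None sentinel and stop."""
    while True:
        msg = inbox.get()
        if msg is None:          # shutdown message
            outbox.put(None)
            return
        outbox.put(msg * msg)


def run_pipeline(values):
    """Send values to a worker thread over a channel and collect the replies."""
    inbox, outbox = queue.Queue(), queue.Queue()
    thread = threading.Thread(target=worker, args=(inbox, outbox))
    thread.start()
    for v in values:
        inbox.put(v)
    inbox.put(None)              # tell the worker there is no more input

    results = []
    while True:
        reply = outbox.get()
        if reply is None:
            break
        results.append(reply)
    thread.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3]))
```

    Because the two sides only ever touch the channels, the worker can be tested on its own by feeding it a queue, which is the testability point made above.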

    --
    From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."
    1. Re:More investment needed in e.g Erlang by jkndrkn · · Score: 2, Interesting

      > and always it has been fastest and most effective to rely on data channels rather than horrible kludges like shared memory with mutex locks. While shared-memory tools like UPC and OpenMP are gaining ground (especially with programmers), I too feel that they are a step backwards. Message passing languages, especially Erlang, are better designed to cope with the unique challenges of computing on a large parallel computer due to their excellent fault tolerance features.

      You might be interested in some work I did evaluating Erlang on a 16 core SMP machine:

      http://jkndrkn.livejournal.com/205249.html

      Quick summary: Erlang is slow, though using the Array module for data structure manipulation can help matters. Erlang could still be useful as a communications layer or monitoring system for processes written in C.

    2. Re:More investment needed in e.g Erlang by Anonymous Coward · · Score: 0

      Eh, message passing architectures always have higher overhead than the simpler ones. Just look at the many overhauls OSes such as Mach have had (and Mach still doesn't scale to high parallelism). Shared memory IPC is what works fastest in the real world within a machine; message passing is for infrequent communication on very slow channels such as a network.

    3. Re:More investment needed in e.g Erlang by Dr.Ruud · · Score: 1

      See also Perl6.
      For example hyperoperators: Perl6 Hyperoperators

  22. Is this why there's no OpenGL 3.0? by zackhugh · · Score: 1

    NVidia is one of the major voices in the Khronos Group, the organization that promised to release the OpenGL 3.0 API over six months ago. The delay is embarrassing, and many are turning to DirectX.

    It occurs to me that NVidia may not want OpenGL to succeed. Maybe they're holding up OpenGL development to give CUDA a place in the sun. Does anyone else get the same impression?

    1. Re:Is this why there's no OpenGL 3.0? by mikael · · Score: 1

      Delays are mainly due to disagreements between different vendors rather than any one company wanting to slow the show down.

      Look at the early OpenGL registry extension specifications - vendors couldn't even agree on what vector arithmetic instructions to implement.

      --
      Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
    2. Re:Is this why there's no OpenGL 3.0? by johannesg · · Score: 2, Insightful

      NVidia has every reason to want OpenGL to succeed - if it doesn't, Microsoft will rule supreme over the API to NVidia's hardware, and that isn't a healthy situation to be in. As it is, OpenGL gives them some freedom to do their own thing.

      However, having mentioned Microsoft... If *someone* doesn't want OpenGL to succeed, it is them... If and when OpenGL 3.0 ever appears, I bet there will be some talk of some "unknown party" threatening patent litigation...

      Destroying OpenGL is of paramount importance to Microsoft, since it will grant them total dominance over 3D graphics. Apple, Linux, Sony (PS3), and other vendors that rely on OpenGL will completely lose their ability to compete.

    3. Re:Is this why there's no OpenGL 3.0? by Shinobi · · Score: 1

      Actually, there is one company that has actively stalled and effectively sabotaged a lot of OpenGL development: ATI

    4. Re:Is this why there's no OpenGL 3.0? by mikael · · Score: 1

      I'm interested in learning more - do you have any more links? From many slashdot articles, people have a bad opinion of ATI drivers.

      --
      Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
    5. Re:Is this why there's no OpenGL 3.0? by Shinobi · · Score: 1

      No links to any web site or so. What insight I have comes from a former schoolmate and fellow computer club member who was on the ARB as a non-voting member, sat in on a number of meetings, and even gave his own presentation. I'll see if he's willing to talk about it with you.

    6. Re:Is this why there's no OpenGL 3.0? by mikael · · Score: 1

      Thanks. I noticed the competition between ATI and Nvidia was related to the Stanford vs. University of Waterloo rivalry over 3D research. But I can well believe it.

      --
      Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
  23. CUDA is limiting, not liberating by njord · · Score: 4, Informative

    From my experience, CUDA was much harder to take advantage of than multi-core programming. CUDA requires you to use a specific model of programming that can make it difficult to take advantage of the full hardware. The restricted caching scheme makes memory management a pain, and the global synchronization mechanism is very crude - there's a barrier after each kernel execution, and that's it. It took me a week to 'parallelize' and port some simple code I had written to CUDA, whereas it took me an hour or so to add the OpenMP statements to my 'reference' CPU code. Sorry Nvidia - there is no silver bullet. By making some parts of parallel programming easy, you make others hard or impossible.
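
    For anyone who hasn't hit the "barrier after each kernel" restriction, here is a rough CPU-side analogy in Python. It is not CUDA code and uses no CUDA API: `launch`, `square_kernel` and `neighbour_sum_kernel` are hypothetical names, and a thread pool stands in for the GPU. The point it tries to show is that when step two needs every result of step one, the work has to be split into two separate launches, because the end of a launch is the only place where all threads are known to be finished.

```python
from concurrent.futures import ThreadPoolExecutor


def launch(kernel, n_threads, *buffers):
    """Run kernel(i, *buffers) for every index i.

    Returning from this function is the only global barrier: all indices are done.
    """
    with ThreadPoolExecutor(max_workers=4) as pool:
        # list() drains the iterator, so we block until every call has finished.
        list(pool.map(lambda i: kernel(i, *buffers), range(n_threads)))


def square_kernel(i, src, dst):
    # Phase 1: each index works on its own element only.
    dst[i] = src[i] * src[i]


def neighbour_sum_kernel(i, src, dst):
    # Phase 2: needs neighbours' phase-1 results, so it cannot share a launch with phase 1.
    left = src[i - 1] if i > 0 else 0
    right = src[i + 1] if i < len(src) - 1 else 0
    dst[i] = left + src[i] + right


def two_phase(data):
    squared = [0] * len(data)
    summed = [0] * len(data)
    launch(square_kernel, len(data), data, squared)            # barrier at return
    launch(neighbour_sum_kernel, len(data), squared, summed)   # barrier at return
    return summed


if __name__ == "__main__":
    print(two_phase([1, 2, 3]))
```

    With OpenMP on a CPU the same dependency is typically handled by a barrier inside one parallel region, which is part of why the CPU port goes so much faster.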

    1. Re:CUDA is limiting, not liberating by ameline · · Score: 1

      Mod parent up... His is one of the best on this topic.

      --
      Ian Ameline
    2. Re:CUDA is limiting, not liberating by Anonymous Coward · · Score: 1, Interesting

      You make a good point: The data-parallel computing model used in CUDA is very unfamiliar to programmers. You might read the spec sheet and see "128 streaming processors" and think that is the same as having 128 cores, but it is not. CUDA inhabits a world somewhere between SSE and OpenMP in terms of task granularity. I blame part of this confusion on Nvidia's adoption of familiar sounding terms like "threads" and "processors" for things which behave nothing like threads and processors from a multicore programming perspective. Managing people's expectations is an important part of marketing a new tech, and hype can lead to anger.

      That said, for a truly data parallel calculation, CUDA blows any affordable multicore solution out of the water. By removing some of the flexibility, GPUs can spend their transistor budget on more floating point units and bigger memory buses.

      OpenMP is nice, but it will be a while before multicore CPUs can offer you 128 (single precision) floating point units fed by a 70 GB/sec memory bus. :)

      (I'm willing to forgive more of the limitations of CUDA because now putting a $350 card into a workstation gives us a 10x performance improvement in a program we use very frequently. Speed-ups in the range of 10x to 40x are pretty common for the kind of data parallel tasks that CUDA is ideal for. If you only see a 2x or 3x improvement, you are probably better off with OpenMP and/or SSE.)

  24. Why was this greenlit by Anonymous Coward · · Score: 0

    Why was that headline greenlit? Next we'll be like the NY Times and have to avoid those confusing acronyms and spell out Central Processing Unit and Redundant Array of Inexpensive Disk [drives].

  25. Yes, I read your paper by Kupfernigk · · Score: 2, Interesting
    It doesn't surprise me in the slightest. Erlang is designed from the ground up for pattern matching rather than computation, because it was designed for use in messaging systems - telecoms, SNMP, now XMPP. Its integer arithmetic is arbitrary precision, which prevents overflow in integer operations at the expense of performance. Its floating point is limited. My early work on a 3-way system used hand coded assembler to drive the interprocess messaging using hardware FIFOs, for Pete's sake, and that was as high performance as you could get - given the huge limitations of trying to write useful functions in assembler.

    That in a nutshell is why I suggested that investment in Erlang would be a good idea. It's better to start with the right approach and optimise it, than go off into computer science blue sky and try to design a perfect language for paralleling GPUs - which practically nobody will ever really use.

    --
    From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."
  26. The EETimes article is much better by Jeremy+Erwin · · Score: 3, Informative
  27. Blog spam. Link to actual article. Nvidia loss? by Futurepower(R) · · Score: 2, Interesting

    Avoid the blog spam. This is the actual article in EE times: Nvidia unleashes Cuda attack on parallel-compute challenge.

    Nvidia is showing signs of being poorly managed. CUDA is a registered trademark of another hi-tech company.

    The underlying issue is apparently that Nvidia will lose most of its mid-level business when AMD/ATI and Intel/Larrabee begin shipping integrated graphics. Until now, Intel integrated graphics has been so limited as to be useless in many mid-level applications. Nvidia hopes to replace some of that loss with sales to people who want to use their GPUs to do parallel processing.

    1. Re:Blog spam. Link to actual article. Nvidia loss? by Fulcrum+of+Evil · · Score: 1

      Nvidia is showing signs of being poorly managed. CUDA is a registered trademark of another hi-tech company.

      Who cares? Medical equipment != parallel computation.

      --
      "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
  28. No one is going to "solve" the problem by swillden · · Score: 1

    Multi-threaded programming is a fundamentally hard problem, as is the more general issue of maximally-efficient scheduling of any dynamic resource. No one idea, tool or company is going to "solve" it. What will happen is that lots of individual ideas, approaches, tools and companies will individually address little parts of the problem, making it incrementally easier to produce efficient multi-threaded code. Some of these approaches will work together, others will be in opposition, there will be engineering tradeoffs to be made (particularly between efficiency of execution and ease of development) and the incremental improvements will not so much make it easier to do multi-threaded programming as make it feasible to attack more complex problems.

    Pretty much just like the history of every other part of software development.

    --
    Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    1. Re:No one is going to "solve" the problem by Anonymous Coward · · Score: 0

      Exactly! I couldn't agree more...

      Multi-threaded, parallel algorithm development is anything but easy - however, it's a mighty powerful approach to many problems.
      While CUDA may not be the perfect solution to every problem, it is yet another tool which I can use (as a software engineer) to solve complex problems.

      One thing that's nice about NVIDIA putting out CUDA is that it's free. If you already have a high-endish card you can make use of it. Yeah, it only runs on NVIDIA and right now it only works on some of their cards but that will change. As time goes on, every NVIDIA card will be able to make use of it and something like an OpenGL for GPGPU will evolve (boy, wouldn't that be nice). Hell, my OSX laptop can already use it for parallel-applicable work. I now have a bunch of additional processors to use on my machine which I can use for a TON of things and I didn't have to buy anything... Fabulous!

      Life is not synchronous - Parallel is the future..

  29. Reminds me of OLD the stories I used to hear... by JRHelgeson · · Score: 3, Interesting

    I live in Minnesota, home of the legendary Cray Research. I've met with several old timers that developed the technologies that made the Cray Supercomputer what it was. Hearing about the problems that multi-core developers are facing today reminds me of the stories I heard about how the engineers would have to build massive cable runs from processor board to processor board to memory board just to synchronize the clocks and operations so that when the memory was ready to read or write data, it could tell the processor board... half a room away.

    As I recall:
    The processor, as it was sending the data to the bus, would have to tell the memory to get ready to read data through these cables. The "cables hack" was necessary because the cable path was shorter than the data bus path, and the memory would get the signal just a few ms before the data arrived at the bus.

    These were fun stories to hear but now seeing what development challenges we face in parallel programming multi-core processors gives me a whole new appreciation for those old timers. These are old problems that have been dealt with before, just not on this scale. I guess it is true what they say, history always repeats itself.

    --
    Good security is based upon reality and common sense. Common sense is a function of having common knowledge.
    1. Re:Reminds me of OLD the stories I used to hear... by Anonymous Coward · · Score: 0

      I don't think any such hacks were required on a Cray, at least the early ones, as the memory was integral to CPU. The external mass storage was not particularly unusual.

      I think the 'cables hack' was the way all wires in the CPU were exactly the same length, regardless of the distance they were required to traverse. This ensured all signals arrived simultaneously.

      The fun really starts when you combine Williams tube, drum and mercury delay line memories.

    2. Re:Reminds me of OLD the stories I used to hear... by JRHelgeson · · Score: 1

      No, the way I recall the story I heard (and this is all complete hearsay) is that they built up memory and proc cards and then they had everything running in massive parallel structures using the cables to synchronize the reads, writes, etc. That was the only way they could get massive parallel architectures working in those days. Nevertheless, my point is not to debate the finer points of the Cray architecture, but to share the chuckle I had when this article just reinforced the old axiom that "The more things change, the more they stay the same."

      --
      Good security is based upon reality and common sense. Common sense is a function of having common knowledge.
  30. Why?? by Anonymous Coward · · Score: 0

    I always wondered why the parallelization problem couldn't be solved using a concept similar to TCP/IP. If you think of CPU instructions like a "packet" and assign them a sequence number, then the CPU can keep track of what order the instruction results should come out. In addition to L1 and L2 cache, there should be a cache to hold the results of CPU instructions until they can be streamed out in the correct order. In essence, it would look like a single-core CPU to the outside world, but using buffers and sequencing tricks, perform in parallel.
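
    The sequence-number idea can be sketched in a few lines of Python. Out-of-order processors use a structure along these lines, commonly called a reorder buffer, to retire instruction results in program order; the code below is only a toy model of that, with a dictionary standing in for the buffer and a made-up function name `commit_in_order`. It says nothing about the harder part, which is finding instructions that are independent enough to run at the same time.

```python
def commit_in_order(completions):
    """Yield results in sequence-number order.

    `completions` is an iterable of (seq, result) pairs that may arrive in any
    order, with sequence numbers starting at 0 and no gaps.
    """
    pending = {}      # results that finished early and are waiting their turn
    next_seq = 0      # the next sequence number allowed to leave the buffer
    for seq, result in completions:
        pending[seq] = result
        # Release every result that is now contiguous with what was already released.
        while next_seq in pending:
            yield pending.pop(next_seq)
            next_seq += 1


if __name__ == "__main__":
    finished = [(2, "c"), (0, "a"), (1, "b"), (3, "d")]
    print(list(commit_in_order(finished)))
```

    From the outside the results look strictly sequential, even though "c" finished first.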

    1. Re:Why?? by Anonymous Coward · · Score: 0

      Wow. Uhmm. I think you just reinvented hyper threading. For that, you must be killed.

  31. Ah, but that's the point! by nxmehta · · Score: 1

    The entire reason why CUDA works and is powerful is exactly because it is limited. Nvidia knows that there is no silver bullet. They're not claiming that this is one (David Kirk has said so himself at conferences). CUDA is a fairly elegant way of mapping embarrassingly data parallel programs to a large array of single precision FP units. If your problem fits into the model, the performance you get via CUDA will smoke just about anything else (except maybe an FPGA in some scenarios).

    Your notion about particular models making some parts of parallel programming easy while other parts are hard is what people really need to learn to accept about parallel programming. If you're expecting a single model to make everything easy for you, trust me, stop programming right now.

    You need to pick the programming model that matches the parallelism in your application- there will never be one solution. When sitting down to write code, you have to ask yourself: what is the right model for this algorithm? Is it:

    Data parallel (SIMD, Vector)
    Message Passing
    Actors
    Dataflow
    Transactional
    Streaming (pipe and filter)
    Sparse Graph
    Etc...

    There are many models out there, and many languages + hardware substrates for these models that will give you orders of magnitude speedup for parallel programs. The key is just to sit down, think about the problem, and pick the right one (or combinations).
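
    As a small illustration of how the choice of model changes the shape of the code, here is the same toy computation (double each value, then add one) written two ways in Python. Both functions and their names are made up for the example, and neither actually runs in parallel here; the sketch only shows where the parallelism would come from in each model.

```python
def data_parallel(values):
    """Data-parallel form: one operation applied to every element independently.

    Each element could be handed to a different SIMD lane or GPU thread.
    """
    return [v * 2 + 1 for v in values]


def double_stage(stream):
    """Filter 1 of the pipeline: doubles each item as it flows past."""
    for v in stream:
        yield v * 2


def increment_stage(stream):
    """Filter 2 of the pipeline: adds one to each item as it flows past."""
    for v in stream:
        yield v + 1


def streaming(values):
    """Pipe-and-filter form: items flow through a chain of stages.

    Each stage could run on its own core, working on a different item at once.
    """
    return list(increment_stage(double_stage(values)))


if __name__ == "__main__":
    data = [1, 2, 3]
    print(data_parallel(data), streaming(data))
```

    The results are identical; what differs is whether the speedup comes from many elements at once or from many stages at once.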

    The real research focus in parallel programming should be to make a taxonomy of models and start coming up with a unified infrastructure to support intelligent selection of models, mixing and matching, and compilation.

  32. BOINC for PS3 by Anonymous Coward · · Score: 0

    BOINC is only a framework for organizing job-level massive parallelism. It's not an abstraction for parallelism at the application level: when you write a BOINC application, you don't get any parallelism for "free". It's still up to you as the application developer to target your app for a specific platform, let alone hardware, because BOINC simply hands off / manages execution of your application. The app developer must write for x86/Win, x86/Linux, x86/Mac, PPC/Mac, etc. Most critically, that means that you have the privilege and responsibility of exploiting the hardware (x86, amd64, PPC, PS3, etc.) yourself, specific to your application's needs, at the application level. BOINC will then handle job management and scheduling between your server and each instance of the client.

    So your question is actually a bit ill-formed. Instead of asking "could we run a framework on the PS3?", which would provide no free parallelism, you probably meant to ask "could we run BOINC applications on the PS3?". The problem lies not in porting BOINC to PS3 but in having yet another platform which users (application authors such as SETI@H or Einstein@H) would need to target. Some (most...) of those guys are fairly small operations and stick to x86 hardware and often only Windows at that, at least for a while until they get Mac and Linux clients working alongside.

    The Folding@home operation is well-organized and has more resources than most, and they don't run on BOINC. They're the ones who have a PS3 client (which is much tougher to write than an x86 client to exploit the given hardware), and who even support a handful of ATI's recent but disparate GPUs (Windows only I believe). It's not that BOINC on PS3 (or whatever) is impossible; it's that it gains the application developer nothing without a LOT more effort. The question of whether or not it's worth that effort falls to the user and not the authors of BOINC.

  33. Errata by johannesg · · Score: 1

    ...NOT want to succeed. Microsoft does NOT want OpenGL to succeed.

    But you knew that already.

  34. Erlang for multithreaded apps by Anonymous Coward · · Score: 0

    There is a language that makes programming in threads much easier. See erlang.org.

  35. Parallelism hype... by blahplusplus · · Score: 1

    ... programs are still only as fast as their slowest link.
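
    One common way to put a number on this is Amdahl's law, which is a related idea rather than exactly the "slowest link" claim: if a fraction p of the work can be parallelized across n cores, the speedup is 1 / ((1 - p) + p / n). The function name and the values of p and n below are made up for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical program: 90% of the work parallelizes, 10% stays serial.
speedup_16 = amdahl_speedup(0.9, 16)
speedup_many = amdahl_speedup(0.9, 1_000_000)
```

    With 16 cores the predicted speedup is about 6.4x, and no number of cores gets past 10x, because the serial 10% sets the ceiling.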

  36. Use Ada by iliketrash · · Score: 1

    Use Ada.

  37. Re: by clint999 · · Score: 0

    No offence, but I'm perplexed as to how this rubbish made it past the firehose.

  38. Real problem: data staging by Anonymous Coward · · Score: 0

    The real problem is not parallelizing applications. It is staging the data so your CPU isn't spending all of its time waiting for data.

    GPUs deal with this by having so many threads that they can afford to swap in other threads when they are waiting for data, but this has a *huge* overhead in terms of thread state storage, and generates "flocking" effects in the caches.

    The real reason CUDA isn't the revolution NVIDIA wants you to think it is, is simply that the GPUs from NVIDIA only work well if the threads all do the same thing (at least in batches of 16). If you run 1024 different threads on a GPU you get about 1/16th the performance, and now you're not looking much better than Intel, and a lot worse than Cell...
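
    A toy cost model of that claim, assuming a batch of threads runs in lockstep and has to make one pass per distinct code path inside the batch. The batch size of 16 is taken from the comment above (NVIDIA's own documentation describes warps of 32 threads), and `lockstep_passes` is a hypothetical helper, not a CUDA API.

```python
def lockstep_passes(thread_paths, batch_size=16):
    """Count the passes needed when each batch executes one code path at a time."""
    passes = 0
    for start in range(0, len(thread_paths), batch_size):
        batch = thread_paths[start:start + batch_size]
        passes += len(set(batch))  # one pass per distinct path in this batch
    return passes


uniform = ["same"] * 1024      # every thread takes the same path
divergent = list(range(1024))  # every thread takes its own path

uniform_passes = lockstep_passes(uniform)
divergent_passes = lockstep_passes(divergent)
slowdown = divergent_passes / uniform_passes
```

    Under this model the fully divergent case needs 16 times as many passes as the uniform one, which is where the 1/16th figure comes from.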

  39. Has anyone compared this approach to Erlang? by wsgeek · · Score: 1

    I am no Erlang expert, but isn't it supposed to be a language that is inherently parallel, thus allowing programs to "automatically" take advantage of multi-core systems?

  40. In the Entertainment Industry..... by xclr8r · · Score: 1

    When we need different systems to run thousands of cues for a show (lighting, pyrotechnics (with a deadman switch), special effects, audio, video projection and automated staging), we use SMPTE time code. IANACD (I am not a chipset developer), but if you could feed the multiple cores a time code pre-processor, then everything post-processor should sync up on cue in your various outputs.
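
    A rough sketch of the cue idea, assuming non-drop-frame time code in HH:MM:SS:FF form at 30 frames per second. The cue list and function names are invented for illustration; each subsystem would compare the one shared time code against its own cue times instead of waiting on the other subsystems.

```python
def timecode_to_frames(timecode, fps=30):
    """Convert a non-drop-frame HH:MM:SS:FF time code into a frame count."""
    hours, minutes, seconds, frames = (int(part) for part in timecode.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * fps + frames


def due_cues(cues, now, fps=30):
    """Return the names of all cues whose time code is at or before 'now'."""
    now_frames = timecode_to_frames(now, fps)
    return [name for tc, name in cues if timecode_to_frames(tc, fps) <= now_frames]


# Hypothetical show: every subsystem reads the same clock.
cues = [
    ("00:00:01:00", "lights up"),
    ("00:00:02:15", "audio start"),
    ("00:01:00:00", "pyro"),
]
fired = due_cues(cues, "00:00:02:15")
```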

    --
    Beware of those who profit off the docile and persecute the unbelievers.