Slashdot Mirror


NVIDIA Shaking Up the Parallel Programming World

An anonymous reader writes "NVIDIA's CUDA system, originally developed for their graphics cores, is finding migratory uses into other massively parallel computing applications. As a result, it might not be a CPU designer that ultimately winds up solving the massively parallel programming challenges, but rather a video card vendor. From the article: 'The concept of writing individual programs which run on multiple cores is called multi-threading. That basically means that more than one part of the program is running at the same time, but on different cores. While this might seem like a trivial thing, there are all kinds of issues which arise. Suppose you are writing a gaming engine and there must be coordination between the location of the characters in the 3D world, coupled to their movements, coupled to the audio. All of that has to be synchronized. What if the developer gives the character movement tasks its own thread, but it can only be rendered at 400 fps. And the developer gives the 3D world drawer its own thread, but it can only be rendered at 60 fps. There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization.'"
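The waiting described in that quote is often expressed with a barrier: each thread does its share of a frame, then blocks until the others have finished theirs. A minimal Python sketch, assuming three hypothetical subsystems and a fixed frame count (the article names no particular API):

```python
import threading

FRAMES = 5
SUBSYSTEMS = ["movement", "world", "audio"]  # hypothetical subsystem names

barrier = threading.Barrier(len(SUBSYSTEMS))
completed = {name: [] for name in SUBSYSTEMS}  # frames finished by each subsystem
in_step = []                                   # one check per thread per frame


def run(name):
    for frame in range(FRAMES):
        # Before starting this frame, every subsystem must have finished
        # all earlier frames; the barrier below is what guarantees it.
        in_step.append(all(len(done) >= frame for done in completed.values()))
        completed[name].append(frame)  # stand-in for this subsystem's real work
        barrier.wait()                 # block until everyone finishes this frame


threads = [threading.Thread(target=run, args=(name,)) for name in SUBSYSTEMS]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However fast one subsystem could run on its own, it advances only as fast as the slowest one, which is the cost the quote is pointing at.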

10 of 154 comments

  1. NVidia is doing that? an insult to INMOS... by master_p · · Score: 4, Interesting

    Many moons ago, when most slashdotters were nippers, a British company named INMOS provided an extensible hardware and software platform that solved the problem of parallelism, in many ways similar to CUDA.

    Ironically, some of the first demos I saw using transputers were raytracing demos.

    The problem of parallelism and the solutions available are quite old (more than 20 years), but it's only now that limits are reached that we see the true need for it. But the true pioneers are not NVIDIA, because there were others long before them.

    1. Re:NVidia is doing that? an insult to INMOS... by ratbag · · Score: 2, Interesting

      That takes me back. My MSc project in 1992 was visualizing 3D waves on Transputers using Occam. Divide the wave into chunks, give each chunk to a Transputer, pass the edge case between the Transputers and let one of them look after the graphics. Seem to recall there were lots of INs and OUTs. A friend of mine simulated bungee jumps using similar code, with a simple bit of finite element analysis chucked in (the rope changed colour based on the amount of stretch).

      Happy Days at UKC.
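
      The chunk-and-edge scheme described above can be sketched roughly as follows. This is Python, not Occam, and it makes several assumptions: a one-dimensional line of cells, a simple three-point averaging stencil standing in for the real wave update, and chunks visited in a loop rather than on separate processors. The function names are made up for illustration.

```python
def step_serial(u):
    """One stencil update over the whole line; the two end points stay fixed."""
    interior = [(u[i - 1] + u[i] + u[i + 1]) / 3 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def step_chunked(chunks):
    """Same update, but each chunk sees only its own cells plus one edge
    cell from each neighbour (the values a Transputer would receive)."""
    new_chunks = []
    for k, chunk in enumerate(chunks):
        has_left = k > 0
        has_right = k < len(chunks) - 1
        padded = list(chunk)
        if has_left:
            padded.insert(0, chunks[k - 1][-1])  # edge cell from the left neighbour
        if has_right:
            padded.append(chunks[k + 1][0])      # edge cell from the right neighbour
        updated = step_serial(padded)
        # Drop the borrowed edge cells again; they belong to the neighbours.
        start = 1 if has_left else 0
        end = len(updated) - (1 if has_right else 0)
        new_chunks.append(updated[start:end])
    return new_chunks


line = [0.0] * 12
line[5] = 1.0                                   # a single bump in the middle
chunks = [line[0:4], line[4:8], line[8:12]]     # three chunks of four cells

for _ in range(4):
    line = step_serial(line)
    chunks = step_chunked(chunks)

flattened = [cell for chunk in chunks for cell in chunk]
```

      Only one cell per boundary crosses between chunks on each step, which is why the scheme suits processors joined by links rather than by shared memory.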

  2. Re:New programming tools needed by TheRaven64 · · Score: 2, Interesting
    There's only so much that a compiler can do. If you structure your algorithms serially then a compiler can't do much. If you write parallel algorithms then it's relatively easy for the compiler to turn it into parallel code.

    There are a couple of approaches that work well. If you use a functional language, then you can use monads to indicate side effects and the compiler can implicitly parallelise the parts that are free from side effects. If you use a language like Erlang or Pict based on a CSP or a pi-calculus model then you split your program into logically independent chunks with a message passing interface between them, and the compiler or runtime can schedule them independently.

    --
    I am TheRaven on Soylent News
  3. couldn't resist a quick Inmos story... by Fallen+Andy · · Score: 4, Interesting

    Back in the early 80's I was working in Bristol UK for TDI (who were the UCSD p-system licensees) porting it to various machines... Well, we had one customer who wanted a VAX p-system so we trotted off to INMOS's office and sat around in the computer room. (VAX 11/780 I think). At the time they were running Transputer simulations on the machine so the VAX p-system took er... about 30 *minutes* to start. Just for comparison an Apple ][ running IV.x would take less than a minute. Almost an hour to make a tape. (About 15 users running emulation I think). Fond memories of the transputer. Almost bought a kit to play with it... Andy

  4. More investment needed in e.g Erlang by Kupfernigk · · Score: 3, Interesting
    The approach used by Erlang is interesting as it is totally dependent on message passing between processes to achieve parallelism and synchronisation. To get real time performance, the message passing must be very efficient. Messaging approaches are well suited to parallelism where the parallel processes are themselves CPU and data intensive, which is why they work well for cryptography and image processing. From this point of view alone, a parallel architecture using GPUs with very fast intermodule channels looks like a good bet.

    The original Inmos Transputer was designed to solve such problems and relied on fast inter-processor links, and the AMD Hypertransport bus is a modern derivative.

    So I disagree with you. The processing hardware is not so much the problem. If GPUs are small, cheap and address lots of memory, so long as they have the necessary instruction sets they will do the job. The issue to focus on is still interprocessor (and hence interprocess) links. This is how hardware affects parallelism.

    I have on and off worked with multiprocessor systems since the early 80s, and always it has been fastest and most effective to rely on data channels rather than horrible kludges like shared memory with mutex locks. The code can be made clean and can be tested in a wide range of environments. I am probably too near retirement now to work seriously with Erlang, but it looks like a sound platform.

    --
    From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."
    1. Re:More investment needed in e.g Erlang by jkndrkn · · Score: 2, Interesting

      > and always it has been fastest and most effective to rely on data channels rather than horrible kludges like shared memory with mutex locks.

      While shared-memory tools like UPC and OpenMP are gaining ground (especially with programmers), I too feel that they are a step backwards. Message passing languages, especially Erlang, are better designed to cope with the unique challenges of computing on a large parallel computer due to their excellent fault tolerance features.

      You might be interested in some work I did evaluating Erlang on a 16 core SMP machine:

      http://jkndrkn.livejournal.com/205249.html

      Quick summary: Erlang is slow, though using the Array module for data structure manipulation can help matters. Erlang could still be useful as a communications layer or monitoring system for processes written in C.

  5. Re:New programming tools needed by maraist · · Score: 4, Interesting

    Consider that if you've ever done UNIX programming, you've been doing MT programming all along - just by a different name.. Multi-Processing. Pipelines are, IMO, the best implementation of parallel programming (and UNIX is FULL of pipes). You take a problem and break it up into wholly independent stages, then multi-process or multi-thread the stages. If you can split the problem up using message-passing then you can farm the work out to decoupled processes on remote machines, and you get farming / clustering. Once you have the problem truly clustered, then multi-threading is just a cheaper implementation of multi-processing (less overhead per worker, fewer physical CPUs, etc).

    Consider this parallel programming pseudo-example

    find | tar | compress | remote-execute 'remote-copy | uncompress | untar'

    This is a 7 process FULLY parallel pipeline (meaning non-blocking at any stage - every 512 bytes of data passed from one stage to the next gets processed immediately). This can work with 2 physical machines that have 4 processing units each, for a total of 8 parallel threads of execution.

    Granted, it's hard to construct a UNIX pipe that doesn't block.. The following variation blocks on the xargs, and has less overhead than separate tar/compress stages but is single-threaded

    find name-pattern | xargs grep -l contents-pattern | tar-gzip | remote-execute 'remote-copy | untar-unzip'

    Here the messages being passed are serialized/linearized data.. But that's the power of UNIX.
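
    The same staged-pipeline idea can be sketched inside a single program. This is only an illustration in Python, assuming one thread per stage and in-memory queues standing in for the pipes; the stage functions are trivial placeholders, not real tar or compress steps.

```python
import queue
import threading

DONE = object()  # end-of-stream marker, playing the role of EOF on a pipe


def stage(fn, inbox, outbox):
    """Read items from inbox, apply fn, and pass results on as soon as they are ready."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # propagate end-of-stream to the next stage
            return
        outbox.put(fn(item))


q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
workers = [
    threading.Thread(target=stage, args=(str.upper, q1, q2)),           # placeholder stage 1
    threading.Thread(target=stage, args=(lambda s: s + "!", q2, q3)),   # placeholder stage 2
]
for w in workers:
    w.start()

for word in ["find", "tar", "compress"]:
    q1.put(word)
q1.put(DONE)

results = list(iter(q3.get, DONE))  # drain the final queue until end-of-stream
for w in workers:
    w.join()
```

    Each stage starts on an item as soon as the previous stage hands it over, so both stages can be busy at once, and order is preserved because every stage is a single FIFO consumer.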

    In CORBA/COM/GNORBA/Java-RMI/c-RPC/SOAP/HTTP-REST/ODBC, your messages are 'remoteable' function calls, which serialize complex parameters; much more advanced than a single serial pipe/file-handle. They also allow synchronous returns. These methodologies inherently have 'waiting' worker threads.. So it goes without saying that you're programming in an MT environment.

    This class of Remote-Procedure-Calls is mostly for centralization of code or central-synchronization. You can't block on a CPU mutex that's on another physically separate machine.. But if you RPC to a central machine with a single variable mutex then you can.. DB locks are probably more common these days, but it's the exact same concept - remote calls to a central locking service.

    Another benefit in this class of IPC (Inter Process Communication) is that a stage or segment of the problem is handled on one machine.. But a pool of workers exists on each machine.. So while one machine is blocking, waiting for a peer to complete a unit of work, there are other workers completing their stage.. At any given time on every given CPU there is a mixture of pending and processing threads. So while a single task isn't completed any faster, a collection of tasks takes full advantage of every CPU and physical machine in the pool.

    The above RPC type models involve explicit division of labor. Another class is true opaque messages.. JMS in Java, and even UNIX's 'ipcs' Message Queues. The idea is that you have the same workers as before, but instead of having specific UNIQUE RPC URI's (addresses), you have a common messaging pool with a suite of message-types and message-queue-names. You then have pools of workers that can live ANYWHERE which listen to their queues and handle an array of types of pre-defined messages (defined by the application designer). So now you can have dozens or hundreds of CPUs, threads, machines all symmetrically passing asynchronous messages back and forth.

    To my knowledge, this is the most scalable type of problem.. You can take most procedural problems and break them up into stages, then define a message-type as the explicit name of each stage, then divide up the types amongst different queues (which would allow partitioning/grouping of computational resources), then receive-message/process-message/forward-or-reply-message. So long as the amount of work far exceeds the overhead of message passing, you can very nicely scale with the amount of hardware you can throw at the problem.
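
    A rough sketch of that receive/process/forward loop, in Python rather than JMS. Everything named here is an assumption made for illustration: the message types ("parse", "square", "store"), the single shared in-process queue, and the pool of four threads. A real deployment would use a message broker and workers on separate machines.

```python
import queue
import threading

bus = queue.Queue()              # one shared message pool
results = []
results_lock = threading.Lock()


def on_parse(payload):
    return ("square", int(payload))        # forward to the next stage


def on_square(payload):
    return ("store", payload * payload)


def on_store(payload):
    with results_lock:
        results.append(payload)
    return None                            # final stage, nothing to forward


HANDLERS = {"parse": on_parse, "square": on_square, "store": on_store}


def worker():
    while True:
        msg = bus.get()
        if msg is None:                    # shutdown marker
            bus.task_done()
            return
        kind, payload = msg
        follow_up = HANDLERS[kind](payload)
        if follow_up is not None:
            bus.put(follow_up)             # forward before task_done so join() cannot return early
        bus.task_done()


pool = [threading.Thread(target=worker) for _ in range(4)]
for t in pool:
    t.start()
for text in ["1", "2", "3", "4"]:
    bus.put(("parse", text))

bus.join()                                 # wait until every message and follow-up is handled
for _ in pool:
    bus.put(None)
for t in pool:
    t.join()
```

    Any worker can pick up any message type, so adding workers needs no change to the handlers; results arrive in no particular order, which is the usual price of this model.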

    --
    -Michael
  6. Yes, I read your paper by Kupfernigk · · Score: 2, Interesting
    It doesn't surprise me in the slightest. Erlang is designed from the ground up for pattern matching rather than computation, because it was designed for use in messaging systems - telecoms, SNMP, now XMPP. Its integer arithmetic is arbitrary precision, which prevents overflow in integer operations at the expense of performance. Its floating point is limited. My early work on a 3-way system used hand coded assembler to drive the interprocess messaging using hardware FIFOs, for Pete's sake, and that was as high performance as you could get - given the huge limitations of trying to write useful functions in assembler.

    That in a nutshell is why I suggested that investment in Erlang would be a good idea. It's better to start with the right approach and optimise it, than go off into computer science blue sky and try to design a perfect language for paralleling GPUs - which practically nobody will ever really use.
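
    To make the arbitrary-precision point above concrete: Python integers, like Erlang's, grow as needed, so the sketch below uses Python to contrast the exact result with what a fixed 64-bit signed register would hold. The helper wrap_int64 is a made-up name that simulates two's-complement wrap-around; it is not part of Erlang or of any library.

```python
MASK64 = (1 << 64) - 1


def wrap_int64(n):
    """Reduce n to the value a signed 64-bit register would hold."""
    n &= MASK64
    return n - (1 << 64) if n >= (1 << 63) else n


largest = 2**63 - 1                # largest signed 64-bit value
exact = largest + 1                # arbitrary precision: simply keeps growing
wrapped = wrap_int64(largest + 1)  # fixed width: wraps around to the most negative value
```

    The exact version never overflows, but every operation has to allow for numbers that do not fit in one machine word, and that check is where the performance goes.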

    --
    From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."
  7. Blog spam. Link to actual article. Nvidia loss? by Futurepower(R) · · Score: 2, Interesting

    Avoid the blog spam. This is the actual article in EE Times: Nvidia unleashes Cuda attack on parallel-compute challenge.

    Nvidia is showing signs of being poorly managed. CUDA is a registered trademark of another hi-tech company.

    The underlying issue is apparently that Nvidia will lose most of its mid-level business when AMD/ATI and Intel/Larrabee begin shipping integrated graphics. Until now, Intel integrated graphics has been so limited as to be useless in many mid-level applications. Nvidia hopes to replace some of that loss with sales to people who want to use their GPUs to do parallel processing.

  8. Reminds me of the OLD stories I used to hear... by JRHelgeson · · Score: 3, Interesting

    I live in Minnesota, home of the legendary Cray Research. I've met with several old timers that developed the technologies that made the Cray Supercomputer what it was. Hearing about the problems that multi-core developers are facing today reminds me of the stories I heard about how the engineers would have to build massive cable runs from processor board to processor board to memory board just to synchronize the clocks and operations so that when the memory was ready to read or write data, it could tell the processor board... half a room away.

    As I recall:
    The processor, as it was sending the data to the bus, would have to tell the memory to get ready to read data through these cables. The "cables hack" was necessary because the cable path was shorter than the data bus path, and the memory would get the signal just a few nanoseconds before the data arrived at the bus.

    These were fun stories to hear but now seeing what development challenges we face in parallel programming multi-core processors gives me a whole new appreciation for those old timers. These are old problems that have been dealt with before, just not on this scale. I guess it is true what they say, history always repeats itself.

    --
    Good security is based upon reality and common sense. Common sense is a function of having common knowledge.