Slashdot Mirror


An Overview of Parallelism

Mortimer.CA writes with a recently released report from Berkeley entitled "The Landscape of Parallel Computing Research: A View from Berkeley: "Generally they conclude that the 'evolutionary approach to parallel hardware and software may work from 2- or 8-processor systems, but is likely to face diminishing returns as 16 and 32 processor systems are realized, just as returns fell with greater instruction-level parallelism.' This assumes things stay 'evolutionary' and that programming stays more or less how it has done in previous years (though languages like Erlang can probably help to change this)." Read on for Mortimer.CA's summary from the paper of some "conventional wisdoms" and their replacements.
Old and new conventional wisdoms:
  • Old CW: Power is free, but transistors are expensive.
  • New CW is the "Power wall": Power is expensive, but transistors are "free." That is, we can put more transistors on a chip than we have the power to turn on.

  • Old CW: Monolithic uniprocessors in silicon are reliable internally, with errors occurring only at the pins.
  • New CW: As chips drop below 65-nm feature sizes, they will have high soft and hard error rates.

  • Old CW: Multiply is slow, but load and store is fast.
  • New CW is the "Memory wall" [Wulf and McKee 1995]: Load and store is slow, but multiply is fast.

  • Old CW: Don't bother parallelizing your application, as you can just wait a little while and run it on a much faster sequential computer.
  • New CW: It will be a very long wait for a faster sequential computer (see above).

4 of 197 comments (clear)

  1. Hmmm... by ardor · · Score: 5, Interesting

    "but is likely to face diminishing returns as 16 and 32 processor systems are realized"

    Then we are doing something wrong. The human brain provides compelling evidence that massive parallelization works. So: what are we missing?

    --
    This sig does not contain any SCO code.
  2. Re:It's not hard by ardor · · Score: 3, Interesting

    Well the indeterministic nature of multithreading is still a problem. With one thread, debugging is simple: the only thread present will be stopped. But with multiple threads, how is the debugger supposed to handle the problem? Stop all threads? Only the current one? Etc. This is relevant when debugging race conditions.

    Also, the second great problem is that thread problems are hard to find. When I write a class hierarchy, an OOP language can help me with seeing design errors (for example, unnecessary multiple inheritance), or misses in const-correctness. Threading, however, is only present as mutexes, conditions etc.

    One other issue with threads is that they effectively modify the execution sequence. Traditional single-threaded programs have a sequence that looks like a long line. Threading introduces branches and joins, turning the simple line into a net. Obviously, this complicates things. Petri nets can be useful in modeling this.

    --
    This sig does not contain any SCO code.
  3. what about I/O and mosix style clustering? by soldack · · Score: 3, Interesting

    I used to work for SilverStorm (recently purchased by QLogic). They make InfiniBand switches and software for use in high performance computing and enterprise database clustering. The quality of the I/O subsystem of a cluster played a large part in determining the performance of a cluster. Latency (down the microsecond) and bandwidth (over 10 gigabits per second) both mattered.

    Also, we found that sometimes, what made a deal go through was how well your proposed system could run some prexisting software. For example, vendors would publish how well they could run a standard crash test simulation.

    Also, I would like to see more research put into making clustered operating systems like mosix good enough so that developers can stick to what they have learned on traditional SMP systems and have their code just work on large clusters. I don't think that multicore processors eliminate the need for better cluster software.

    --
    -- soldack
  4. There is a way to automate shared-state con/cy!!! by master_p · · Score: 3, Interesting

    There is a way to automate shared state concurrency! every object should be its own thread. Computations that refer to the same object must be executed by the object's thread.

    Here is how it works:

    A computation does not return a result, but a tuple of {key, continuation}. The key is used to locate the thread to pass the continuation to. The computation is stored in the thread's queue and the thread is woken up.

    The tuple {key, continuation} pair can be an 64-bit value (on 32-bit machines) that consists of a pointer to a memory location (the key) and a pointer to code (the continuation).

    The insertion to the thread's queue can be done using lock-free data structures.

    Threads can be user-level so there need not be a switch to kernel space.

    This design can allow for linear scaling of performance: the more cores you put in, the more performance you get (for algorithms that are not linear, that is). Linear algorithms would execute a little slower than usual, but the trade off is acceptable: for many applications that allow for parallelization due to having lots of (relatively) independent objects, the performance boost be tremendous.

    There are many domains of applications that would benefit from such an approach:

    -web servers/application servers that must serve thousands of clients simultaneously.
    -video games with thousands of objects.
    -simulations that have many independent agents that can run in parallel.
    -GUI apps that use the observer pattern and each observable has many observers than can be notified in parallel.

    Note: The above ideas are taken from libasync-mp and lock-free data structure programming.