Slashdot Mirror


More Effective Use of Shared Memory on Linux

An anonymous reader writes "Making effective use of shared memory in high-level languages such as C++ is not straightforward, but it is possible to overcome the inherent difficulties. This article describes, and includes sample code for, two C++ design patterns that use shared memory on Linux in interesting ways and open the door for more efficient interprocess communication."

14 of 280 comments

  1. Re:shmem (soon in Boost!) by maxwell demon · · Score: 5, Interesting
    It can place complex objects and STL-like containers in shared memory.

    Depends on your definition of "complex objects".

    From the documentation:

    Virtuality forbidden

    This is not a specific problem of Shmem; it is a problem for all shared memory object placing mechanisms. The virtual table pointer and the virtual table are in the address space of the process that constructs the object, so if we place a class with virtual function or inheritance, the virtual function pointer placed in shared memory will be invalid for other processes.


    Basically, I would have been surprised if they had found a solution for that, and I suspect it cannot be solved portably. Instead, the system would have to be prepared for it. I could imagine that objects whose classes live in a shared library (so the same code is guaranteed to be shared by both processes) could be placed in shared memory if the compiler/runtime system provided the means for it: say, instead of a pointer to a VMT, the object would contain an offset into the constant data section of the shared library, plus something to identify the library, such as a system-wide unique active-library index generated by the dynamic linker.
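
    A minimal sketch of how the "virtuality forbidden" rule could be enforced at compile time. The names PlainRecord, Shape, is_shm_safe and place_in_segment are hypothetical and are not taken from Shmem or Boost; an ordinary buffer stands in for a mapped segment.

```cpp
#include <cassert>
#include <new>
#include <type_traits>

// Hypothetical example types, not part of Shmem or Boost.
struct PlainRecord {            // no virtual functions: its bytes mean the same in any process
    int id;
    double value;
};

struct Shape {                  // carries a vptr that is only valid in the constructing process
    virtual ~Shape() = default;
    virtual double area() const { return 0.0; }
};

// Assumed convention: only non-polymorphic, trivially copyable types may be shared.
template <typename T>
struct is_shm_safe
    : std::integral_constant<bool,
                             !std::is_polymorphic<T>::value &&
                                 std::is_trivially_copyable<T>::value> {};

// Construct a T inside a caller-supplied buffer (a stand-in for a mapped segment).
template <typename T>
T* place_in_segment(void* segment) {
    static_assert(is_shm_safe<T>::value,
                  "type has a vptr or is not trivially copyable");
    return new (segment) T{};   // value-initializes the object in place
}
```

    With this guard, place_in_segment<Shape>(buf) fails to compile, while place_in_segment<PlainRecord>(buf) is accepted. It does not catch raw pointers stored inside an otherwise plain struct, which have the same address-space problem.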
    --
    The Tao of math: The numbers you can count are not the real numbers.
  2. 10-fold speed improvement - Dekker's mutex! Fast! by Anonymous Coward · · Score: 1, Interesting
    The article is not perfect yet; it is much too slow:

    A 10-fold speed improvement in context switching can be had by avoiding OS calls for semaphores and customizing a set of calls for as many consumer-producers as needed.

    This avoids using any special opcodes or inefficient cache line flushes.

    As long as shared memory is cache coherent, even multiple CPUs will work with Dekker's 1965 algorithm.

    Here is the complete classic code for one CPU of a dual-CPU system or a dual-thread setup:
    INITIALIZATION:
            typedef char boolean;
            enum { AA = 0, BB = 1 };      /* indices of the two CPUs (values assumed) */
            enum { FREE = 0, BUSY = 1 };  /* flag states (values assumed) */
            volatile boolean flags[2];
            volatile int turn;

            turn = AA;
            flags[AA] = FREE;
            flags[BB] = FREE;

    ENTRY PROTOCOL (shown for CPU AA):
            /* Claim the resource */
            flags[AA] = BUSY;

            /* Wait if the other process is using the resource */
            while (flags[BB] == BUSY) {
                    /* If waiting for the resource, also wait our turn */
                    if (turn != AA) {
                            /* Release the resource while waiting */
                            flags[AA] = FREE;
                            while (turn != AA) {
                            }
                            flags[AA] = BUSY;
                    }
            }

    EXIT PROTOCOL (shown for CPU AA):
            /* Pass the turn on, and release the resource */
            turn = BB;
            flags[AA] = FREE;

    Amazing! Unbelievably fast. In fact, it is optimal.

    It's best if the flags are allocated in their own cache lines, so perhaps pad to 32 bytes on PowerPC, for example; other CPUs might use cache lines as small as 16 bytes. This avoids contention and increases coherency for rapid read-writes.
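
    The padding advice can be sketched in C++ as follows. The 64-byte line size is an assumption (it is typical of current x86-64 parts, unlike the 16 and 32 byte figures quoted above), the names kCacheLine, PaddedFlag and DekkerState are hypothetical, and std::atomic<int> is swapped in for the volatile variables of the listing.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Assumed cache line size; adjust for the target CPU.
constexpr std::size_t kCacheLine = 64;

// Each flag is aligned to, and therefore padded out to, one full cache line.
struct alignas(kCacheLine) PaddedFlag {
    std::atomic<int> value{0};
};

static_assert(sizeof(PaddedFlag) == kCacheLine, "flag should fill exactly one line");
static_assert(alignof(PaddedFlag) == kCacheLine, "flag should start on a line boundary");

// The shared state of the two-party protocol, one cache line per variable.
struct DekkerState {
    PaddedFlag flags[2];
    PaddedFlag turn;
};

// Byte distance between the two flags; equals kCacheLine when padding works.
inline std::size_t flag_distance(const DekkerState& s) {
    return static_cast<std::size_t>(
        reinterpret_cast<const char*>(&s.flags[1]) -
        reinterpret_cast<const char*>(&s.flags[0]));
}
```

    Because each CPU writes mostly to its own flag, keeping the flags on separate lines avoids the two CPUs invalidating each other's cached copy on every write.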

    Add Dekker's mutex as I described and the speed in transactions per second will make your head spin in disbelief, even in pathological situations.

    How many people know about this? Nobody! I never read about it anywhere. I invented it myself years ago, before discovering this year that it is called Dekker's algorithm and that Dekker beat me to it in 1965. I tried, unsuccessfully and verbally, to get a PhD in computer science with embedded management experience to believe that it is 100% sound; we argued for 40 minutes. The guy never had a clue. No wonder his company's stock is down over a couple of billion in market cap since the argument.

    Let's not forget the past. Some algorithms are worth remembering.
  3. This is nothing new by Anonymous Coward · · Score: 4, Interesting
    You've been able to do this for a while using process-shared mutexes and condition variables, which let you do the same things across processes that you could do with pthreads and shared memory. The tradeoff is that you get better performance by avoiding syscalls for IPC, but it's less robust. If you get a segfault, you have to assume that the shared memory is in an unknown state and either shut down or restart everything. The other processes can (or will be able to) detect this once robust futex support is in Linux. Idiot programmers will of course ignore this and continue to use the corrupted memory anyway, just like they do now with SysV semaphores used as mutexes with the SEM_UNDO option, which lets the semaphore auto-reset if a process exits without resetting it.
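
    A minimal sketch of the process-shared mutex approach, assuming Linux with anonymous shared mappings. SharedCounter and run_shared_counter are hypothetical names, and error handling is reduced to returning -1.

```cpp
#include <cassert>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical layout of the shared segment: the mutex lives in shared memory too.
struct SharedCounter {
    pthread_mutex_t lock;
    long value;
};

// Parent and child each add `iterations` to a counter in shared memory.
// Returns the final count, or -1 if setup fails.
long run_shared_counter(int iterations) {
    void* mem = mmap(nullptr, sizeof(SharedCounter), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    SharedCounter* sc = static_cast<SharedCounter*>(mem);
    sc->value = 0;

    // Mark the mutex as usable by any process that maps this memory.
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&sc->lock, &attr);
    pthread_mutexattr_destroy(&attr);

    pid_t child = fork();
    if (child < 0) {
        munmap(mem, sizeof(SharedCounter));
        return -1;
    }

    // Both processes run the same loop against the same mapping.
    for (int i = 0; i < iterations; ++i) {
        pthread_mutex_lock(&sc->lock);
        ++sc->value;
        pthread_mutex_unlock(&sc->lock);
    }
    if (child == 0) _exit(0);   // child is done; skip parent cleanup

    waitpid(child, nullptr, 0);
    long result = sc->value;
    pthread_mutex_destroy(&sc->lock);
    munmap(mem, sizeof(SharedCounter));
    return result;
}
```

    In the uncontended case the lock and unlock calls stay in user space, which is where the performance gain comes from. POSIX later standardized robust mutexes (pthread_mutexattr_setrobust, with EOWNERDEAD returned from pthread_mutex_lock), which is the kind of crash detection anticipated above; it is left out here for brevity.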

    Anyway, old stuff. Wake me up when you start talking about the newer tricks with shared memory.

  4. Hardware-enforced sharing: OLD HAT by VernonNemitz · · Score: 4, Interesting

    Quite a few years ago, there was a brief popularity of something called VRAM (video ram) that had memory cells specifically designed with one input line and TWO output lines. The idea was that the part of the hardware needing to construct an image for the screen ONLY needed to read memory, while the system responsible for creating the image needed both read and write access. Ever since then, I've wondered why they don't use this kind of memory in multi-processor systems, for communication between processors, such that Processor A has read/write access to a block of VRAM, to give info to Processor B (it has read-access only), while Processor B has read/write access to a different block of VRAM, to give info to Processor A (it has read-access only).

  5. Doors by Anonymous Coward · · Score: 5, Interesting

    I'm surprised no one has mentioned Solaris Doors. Doors is an IPC mechanism whereby the first process (the client) can hand off any residual time in its timeslice to the second process (the server). Short IPC calls take much less time as a result, since no timeslice time is discarded and there is no wait for the server process to be scheduled (it uses the client's timeslice).

    1. Re:Doors by Anonymous Coward · · Score: 1, Interesting
      Except for the time slice trickery, you can do everything a door call can do with Unix domain sockets. And even there you could probably bump the priority of the threads in the thread pool to get the same effect. Something that would really be nice is a cross address space call like IBM mainframes have: no context switching except for the memory space switch. In contrast, Solaris door calls require a syscall and context switching to and from the door process thread pool, plus some cross memory space copying of input and output data. Not exactly cheap.
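
      A rough sketch of a door-like call over a Unix domain socket pair, assuming Linux or another POSIX system. The "service" (doubling an integer) and the name call_doubler are hypothetical, and a real server would be a long-lived process rather than a child forked per call.

```cpp
#include <cassert>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Sends `request` to a forked server over an AF_UNIX socket pair and
// returns the server's reply (request * 2), or -1 on failure.
int call_doubler(int request) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;

    pid_t server = fork();
    if (server < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    if (server == 0) {                       // server side
        close(sv[0]);
        int arg = 0;
        if (read(sv[1], &arg, sizeof arg) == static_cast<ssize_t>(sizeof arg)) {
            int reply = arg * 2;
            ssize_t n = write(sv[1], &reply, sizeof reply);
            (void)n;                         // nothing useful to do on failure here
        }
        close(sv[1]);
        _exit(0);
    }

    close(sv[1]);                            // client side
    int reply = -1;
    bool ok =
        write(sv[0], &request, sizeof request) == static_cast<ssize_t>(sizeof request) &&
        read(sv[0], &reply, sizeof reply) == static_cast<ssize_t>(sizeof reply);
    close(sv[0]);
    waitpid(server, nullptr, 0);
    return ok ? reply : -1;
}
```

      Every request and reply here costs a syscall and a copy through the kernel, which is the overhead the comment attributes to door calls as well.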

      On comp.programming.threads I proposed process-level smart pointers that would let you use read-only shared segments for IPC; read-only means more robust. For example, you could implement an NSCD (name service cache daemon) that allows in-cache host name resolution to run in microseconds or less, which is a lot faster than a door call could ever be. There has not been a lot of interest so far, so the idea is currently shelved.

    2. Re:Doors by Foolhardy · · Score: 2, Interesting

      That sounds like the same as NT's event pairs, used to implement Quick LPC. An event pair consists of a high and a low event. The server thread waits on the high event and the client thread waits on the low event. Only one event can be signalled at one time, and two software interrupts are provided to toggle the event pair's state: interrupt 0x2C calls the function KiSetLowWaitHighThread() and interrupt 0x2B calls the function KiSetHighWaitLowThread(). When one of these is called and another thread (in another process) is waiting on the other event, the kernel schedules the other thread immediately, continuing the same timeslice of the calling thread. Transfer of data is done through a shared memory section mapped to both processes. Quick LPC was introduced in NT 3.51 to make the out-of-process GUI server faster. Apparently Microsoft didn't think it was fast enough, so they moved much of the GUI server into kernel mode in NT4.

      For more information, scroll down to Quick LPC in Local Procedure Call from Undocumented Windows NT.

  6. Re:shmem (soon in Boost!) by Cyberax · · Score: 2, Interesting

    Well, it's possible to use shmem as a very fast method for marshalling arguments across process boundaries and then use BIL (Boost Interfaces Library) to marshal the actual function calls. It will look like the Local Procedure Call subsystem in Windows NT.

    You can get virtual functions this way and it will be fast enough but not very "nice", of course.

  7. Re:C++ has bigger memory issues by theCoder · · Score: 4, Interesting
    C++ already has a garbage collector. Just allocate your objects on the stack instead of the heap:
    void foo()
    {
      SomeObject obj;

      // other code

      // poof -- obj is deallocated automatically, even if an exception is thrown
    }
    I work on a project that has tens of thousands of C++ classes, and very few "new" and "delete" operations (more "new" than "delete" because we have a class that manages reference counting like a heap garbage collector would do).
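
    A small self-contained check of the "even if an exception is thrown" claim. Tracker, work and live_after_work are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical helper: counts how many Tracker objects are currently alive.
struct Tracker {
    explicit Tracker(int& live) : live_(live) { ++live_; }
    ~Tracker() { --live_; }                  // runs during stack unwinding too
    Tracker(const Tracker&) = delete;
    Tracker& operator=(const Tracker&) = delete;

private:
    int& live_;
};

// Allocates a Tracker on the stack and optionally throws before returning.
void work(int& live, bool fail) {
    Tracker t(live);
    if (fail) throw std::runtime_error("simulated failure");
}

// Returns the number of live Tracker objects after work() has finished.
int live_after_work(bool fail) {
    int live = 0;
    try {
        work(live, fail);
    } catch (const std::runtime_error&) {
        // the destructor has already run by the time we get here
    }
    return live;
}
```

    Both the normal return and the exceptional path leave the live count at zero, with no explicit cleanup code in work().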

    People who think they always need to "new" objects in C++ have spent way too much time using Java.

    Here's another hint -- pass objects to functions as const references:
    void foo(const SomeObject& obj)
    {
      // code
    }
    This way, a copied object isn't allocated for the passing (no memory at all is in fact allocated). The biggest drawback is you can only call "const" methods on the object, but this is outweighed by not using pointers. Not that I don't like pointers, they just increase the complexity and should be used prudently. And as my .sig says, be sure to free those mallocs!
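
    To make the copy cost visible, here is a sketch that counts copy constructor calls. Counted and the helper functions are hypothetical names introduced for the illustration.

```cpp
#include <cassert>

// Hypothetical type that records how often it is copied.
struct Counted {
    static int copies;
    int payload = 7;

    Counted() = default;
    Counted(const Counted& other) : payload(other.payload) { ++copies; }

    int get() const { return payload; }      // const method: callable via const reference
};
int Counted::copies = 0;

int by_value(Counted c) { return c.get(); }            // parameter is a fresh copy
int by_const_ref(const Counted& c) { return c.get(); } // parameter aliases the caller's object

// Number of copy constructor calls caused by one pass-by-value call.
int copies_made_by_value() {
    Counted c;
    int before = Counted::copies;
    (void)by_value(c);
    return Counted::copies - before;
}

// Number of copy constructor calls caused by one pass-by-const-reference call.
int copies_made_by_const_ref() {
    Counted c;
    int before = Counted::copies;
    (void)by_const_ref(c);
    return Counted::copies - before;
}
```

    Passing the named object by value costs one copy; passing it by const reference costs none.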

    --
    "Save the whales, feed the hungry, free the mallocs" -- author unknown
  8. And? by ratboy666 · · Score: 3, Interesting

    Ok, I get it... it's an attempt to exploit shared memory in C++.

    And why is this news? Is it so difficult that nobody has done it? No, that can't be -- the shm stuff can be wrapped. Is it so important that it rates a "design pattern"? Not that either -- the one illustrated isn't the best solution.

    So, just what is this article? Methinks fluff. Sort of in line with the "How to implement co-routines with setjmp/longjmp" thing. Or "Restructuring data to assist processor cache residency". And "How to remove locks from performance critical MP code".

    Except not as interesting or useful.

    Ratboy.

    --
    Just another "Cubible(sic) Joe" 2 17 3061
  9. Re:There are better ways by bogolisk · · Score: 2, Interesting

    1) shm_open(2) is already mentioned in the 2nd post.

    2) don't u know that NPTL is already doing this for u? On the fast path, NPTL's POSIX mutex just does atomic operations and avoids the syscall. Stick to the standard API and let the platform guys (libc, kernel, ...) do the optimization. They're smarter than u.

    3) u don't want to do this, seriously! If futex were that consumable by the public, then why did the glibc guy write a looooooong paper describing how to use futex?

    --
    Bogus
  10. Re:C++ has bigger memory issues by smcdow · · Score: 2, Interesting

    Like Java, right?

    Getting back to the original premise of the story, can you even do OS-level shared memory (SysV or POSIX) with Java? OS-level semaphores? Any meaningful kind of IPC? OS-level anything? I mean without godawful JNI nonsense.

    --
    In the course of every project, it will become necessary to shoot the scientists and begin production.
  11. Cache coherency is NOT sufficient by Krischi · · Score: 2, Interesting

    You don't get it about out-of-order writes, do you? Simple scenario, according to your algorithm:

    CPU AA:

    resource = produce_something();
    turn = BB;
    flags[AA] = FREE;

    CPU BB:

    flags[BB] = BUSY; /* CPU AA clears its BUSY flag at this point in time, so, the while (flags[AA] == BUSY) terminates immediately */
    consume(resource);

    The problem is that AA is free to reorder its writes. So, the actual order could be:

    flags[AA] = FREE; /* from AA */
    flags[BB] = BUSY; /* from BB */
    consume(resource); /* BB uses the resource */
    resource = result of produce_something() call /* writeback from AA is too late */

    Oops. BB accesses the resource before AA writes back the current state. Cache coherency does not solve this problem - the problem is that the write to the resource is still pending. That is what the memory barrier is there for.

    Argue with facts, don't hide behind oh-so-impressive credentials.
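
    For comparison, a sketch of Dekker's algorithm with the ordering made explicit. It replaces the volatile variables of the listing in comment 2 with std::atomic and relies on the default sequentially consistent ordering, which is what forbids the reordering described above. DekkerLock and run_dekker_demo are hypothetical names, and the yield calls are an addition to be polite to the scheduler.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Two-party lock; `self` is 0 or 1. All atomic operations use the default
// memory_order_seq_cst, so neither compiler nor CPU may reorder them.
class DekkerLock {
public:
    DekkerLock() : turn_(0) {
        flags_[0].store(false);
        flags_[1].store(false);
    }

    void lock(int self) {
        const int other = 1 - self;
        flags_[self].store(true);                 // claim the resource
        while (flags_[other].load()) {            // the other side wants it too
            if (turn_.load() != self) {
                flags_[self].store(false);        // back off while it is not our turn
                while (turn_.load() != self) std::this_thread::yield();
                flags_[self].store(true);
            }
            std::this_thread::yield();
        }
    }

    void unlock(int self) {
        turn_.store(1 - self);                    // pass the turn on
        flags_[self].store(false);                // release the resource
    }

private:
    std::atomic<bool> flags_[2];
    std::atomic<int> turn_;
};

// Two threads increment a plain (non-atomic) counter under the lock.
// Returns the final count; with correct mutual exclusion it is 2 * iterations.
long run_dekker_demo(int iterations) {
    DekkerLock lock;
    long counter = 0;
    auto worker = [&](int self) {
        for (int i = 0; i < iterations; ++i) {
            lock.lock(self);
            ++counter;
            lock.unlock(self);
        }
    };
    std::thread a(worker, 0);
    std::thread b(worker, 1);
    a.join();
    b.join();
    return counter;
}
```

    The writes to counter are ordinary stores, like the write to resource in the scenario above; they are only safe because the release of the flag in unlock() cannot become visible before them.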

  12. Re:C++ has bigger memory issues by kbw · · Score: 3, Interesting

    C++ is more than just an OO language. It provides direct support for the procedural paradigm too.

    STL, for example, is not an OO library. Yet it has proved to be immensely useful.

    One place where garbage collected languages fall down is in the management of resources. Limited resources such as files or sockets must be explicitly released by the programmer. This demonstrates that you simply cannot ignore the lifetime of objects with a garbage collector. And I also assert here that memory is a limited resource too.
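
    In C++ the usual answer to this is to tie the resource to an object's lifetime. A minimal sketch for a file descriptor, assuming a POSIX system; FileDescriptor and descriptor_closed_after_scope are hypothetical names.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical RAII wrapper: owns one descriptor and closes it on destruction.
class FileDescriptor {
public:
    explicit FileDescriptor(int fd) : fd_(fd) {}
    ~FileDescriptor() {
        if (fd_ >= 0) ::close(fd_);
    }
    FileDescriptor(const FileDescriptor&) = delete;             // one owner only
    FileDescriptor& operator=(const FileDescriptor&) = delete;

    int get() const { return fd_; }
    bool valid() const { return fd_ >= 0; }

private:
    int fd_;
};

// Opens /dev/null inside a scope and reports whether the descriptor
// has been closed once the scope ends.
bool descriptor_closed_after_scope() {
    int raw = -1;
    {
        FileDescriptor fd(::open("/dev/null", O_RDONLY));
        if (!fd.valid()) return false;
        raw = fd.get();
        // ... use the descriptor here ...
    }   // fd goes out of scope: close() runs here, with no explicit call
    return ::fcntl(raw, F_GETFD) == -1;   // fails because raw is no longer open
}
```

    The release happens at a known point, the closing brace, rather than whenever a collector decides to run a finalizer.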

    That silly singleton thing in the example is a demonstration of the disregard for the lifetime of that particular object. Does it really need to live for the lifetime of the application? Does it need to be cleanly released?

    I think C++'s memory management model is sufficient. One can hardly say that about garbage collected languages.