Slashdot Mirror


More Effective Use of Shared Memory on Linux

An anonymous reader writes "Making effective use of shared memory in high-level languages such as C++ is not straightforward, but it is possible to overcome the inherent difficulties. This article describes, and includes sample code for, two C++ design patterns that use shared memory on Linux in interesting ways and open the door for more efficient interprocess communication."

21 of 280 comments

  1. SysV IPC is obsolete by bogolisk · · Score: 4, Informative

    Someone should tell the authors to RTFM.

    $ man shm_open
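
    For readers who have not used the POSIX interface that man page describes, here is a minimal sketch of creating and mapping a shared memory object. The helper names (map_shared_region, release_shared_region) and the 0600 permissions are assumptions made for illustration; they are not taken from the article.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: creates (or opens) a POSIX shared memory object and
// maps `size` bytes of it read/write. Returns nullptr on failure.
// `name` should start with '/', for example "/demo_region".
void* map_shared_region(const char* name, std::size_t size) {
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd == -1) return nullptr;
    // A new object has length zero, so give it a size before mapping it.
    if (ftruncate(fd, static_cast<off_t>(size)) == -1) {
        close(fd);
        return nullptr;
    }
    void* addr = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return addr == MAP_FAILED ? nullptr : addr;
}

// Hypothetical helper: unmaps the region and removes the name, so the object
// is destroyed once every process has unmapped it.
void release_shared_region(const char* name, void* addr, std::size_t size) {
    munmap(addr, size);
    shm_unlink(name);
}
```

    A second process that calls map_shared_region with the same name sees the same bytes. On older glibc versions this needs -lrt at link time.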

    --
    Bogus
  2. shmem (soon in Boost!) by Cyberax · · Score: 4, Informative

    There is a great C++ library for shared memory support: SHMEM. It can place complex objects and STL-like containers in shared memory. And it is cross-platform (POSIX and Windows are supported).

    And it will soon (hopefully) be a part of Boost!

  3. Re:C++ has bigger memory issues by Cyberax · · Score: 2, Informative

    No, but I think reading about ownership policies, and why they almost always make GC unnecessary, should be compulsory.

  4. CML by putko · · Score: 3, Informative

    For concurrent applications, it is hard to beat Reppy's CML.
    http://portal.acm.org/citation.cfm?id=113470

    In particular, the things you synchronize on are first-class. Also you can speculatively send/receive things. Normal "select" is only for reading. You don't have to manage your memory either.

    There are other concurrent languages, but CML is nice in that it has a formal semantics, so unlike typical languages like "C", "C++", Erlang or Java, a program has a meaning other than "whatever the program does when I run it."

    You can implement the primitives of CML in your favorite higher-order language, so you don't have to be limited by ML. That's what's in Reppy's book.

    A proper implementation can achieve speeds that are about 30x faster than pthreads for typical tests like "ping/pong".

    --
    http://www.thebricktestament.com/the_law/when_to_stone_your_children/dt21_18a.html
  5. Re:10 fold speed improvement - Dekkers mutex ! fas by Anonymous Coward · · Score: 1, Informative

    in my above example i noticed slashcode converting some single ascii space white space into

    AMPERSAND POUND-SIGN 160 SEMICOLON

    just swap those back to spaces.

    "POUND-SIGN" is defined as octothorpe, not pound-sign the english monetary unit glyph

    I would have typed OCTOTHORPE above but i was just letting usa people understand a little more clearly

    anyway ignore the AMPERSAND POUND-SIGN 160 SEMICOLON ... the source code is correct

    I wanted to post this earlier, but the slashcode crap makes me wait FOREVER to post a correction! What crap. It's been 6 minutes and it says "Slow Down Cowboy! It's been 6 minutes since you last successfully posted a comment". I wonder why engineers like me even bother trying to help out on Slashdot anymore.

  6. Re:Microsoft code? by maxwell+demon · · Score: 2, Informative

    Doesn't MS code use CWhatever? I guess the author is coming from a Java background (I for Interface).

    BTW, the very first file is not valid C++: All identifiers which contain double underscores are everywhere and under all circumstances reserved for the implementation. This also includes __COMMON_H__. Change it to e.g. COMMON_H to get valid C++.
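
    A sketch of the suggested fix, assuming the file is called common.h as the guard name implies. The function inside is hypothetical and only gives the header some content.

```cpp
// common.h (hypothetical file name, following the guard name above)
#ifndef COMMON_H  // no double underscore and no leading underscore, so not reserved
#define COMMON_H

// Hypothetical declaration, present only so the header is not empty.
constexpr int common_version() { return 1; }

#endif  // COMMON_H
```

    Names that begin with an underscore followed by an uppercase letter are reserved as well, so _COMMON_H would not be a fix either.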

    Well, at least his main function returns int.

    --
    The Tao of math: The numbers you can count are not the real numbers.
  7. Re:Hardware-enforced sharing: OLD HAT by TheRaven64 · · Score: 3, Informative

    Because it's a really ugly hack that only works in a specific domain. If the frame buffer is being read and updated at the same time, you might get some screen corruption. This is irritating, but not really a problem. If a general piece of memory is being read from and written to at the same time, then the reading process gets garbage data. This means that you need to put locks around your memory (which you do anyway) to prevent this, which eliminates the advantage of the second read line.

    --
    I am TheRaven on Soylent News
  8. Re:10 fold speed improvement - Dekkers mutex ! fas by Anonymous Coward · · Score: 5, Informative

    Yes, some algorithms are worth remembering...

    This one is worth remembering as one to avoid -- it's based on the idea of a busy-wait. Look at the while(test) { /* do nothing */ } loop and the outer while loop. This should not be done. Semaphores might be slower in the specific case, but overall system performance will benefit from using best practices.

    There's a reason this algorithm rests in academic journals: it's only useful as a teaching tool.
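
    For contrast with the busy-wait, here is a minimal sketch of a blocking wait. It uses std::condition_variable between two threads rather than a semaphore, and the names ReadyFlag and run_blocking_demo are made up for the example; the point is only that the waiter sleeps instead of spinning.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// A one-shot "ready" signal. wait() blocks until notify() has been called,
// so the waiting thread burns no CPU time in the meantime.
class ReadyFlag {
public:
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return ready_; });  // handles spurious wakeups
    }
    void notify() {
        {
            std::lock_guard<std::mutex> lock(m_);
            ready_ = true;
        }
        cv_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool ready_ = false;
};

// Hypothetical demo: a producer thread publishes a value, the caller blocks
// until it is available, and the value seen is returned.
int run_blocking_demo() {
    ReadyFlag flag;
    int value = 0;
    std::thread producer([&] {
        value = 42;     // written before notify(), so the waiter sees it
        flag.notify();
    });
    flag.wait();
    int seen = value;
    producer.join();
    return seen;
}
```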

  9. Fanboy mods... by mangu · · Score: 2, Informative

    Why is the parent +1 Insightful and not -1 Troll (or Flamebait)?


    There are some subjects that draw fanboy clubs here in /.


    Some examples: Java, AMD, Apple, Ruby.


    Try criticizing any of them here, you'll be down-moderated to (-1) pretty quickly. OTOH, praise any of those and you'll get moderated up, no matter how stupid or inconsistent the comment is.

    1. Re:Fanboy mods... by bogolisk · · Score: 2, Informative

      you forgot xxxBSD and Gentoo.

      --
      Bogus
  10. Re:C++ has bigger memory issues by Curien · · Score: 2, Informative

    Two reasons:
    a) The called function _cannot_ modify the argument. This becomes important to the code surrounding the function call.
        T x(...);
        foo(x); // did x get changed?

    If foo is declared "void foo(T const&)", then you *know* that x has not changed. If instead it's declared as taking a plain reference, you can't know.

    b) You can pass const objects or objects with limited lifetimes.

        foo(T()); // legal if foo takes a T const&, but not a T&
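
    Both points can be checked with a small compilable sketch. The function names here (length_of, append_mark, const_ref_demo) are invented for the example and std::string stands in for T.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Takes its argument by const reference: it cannot modify the caller's object.
std::size_t length_of(const std::string& s) { return s.size(); }

// Takes a plain reference: the caller has to assume the argument may change.
void append_mark(std::string& s) { s += "!"; }

int const_ref_demo() {
    std::string x = "abc";
    length_of(x);           // (a) x is guaranteed unchanged by this call
    assert(x == "abc");
    append_mark(x);         // with a plain reference, x may change (and does)
    assert(x == "abc!");

    const std::string c = "const";
    // (b) a const object and a temporary both bind to a const reference
    std::size_t n = length_of(c) + length_of(std::string("tmp"));
    // append_mark(c);                   // would not compile: c is const
    // append_mark(std::string("tmp"));  // would not compile: temporary to T&
    return static_cast<int>(n);
}
```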

    --
    It's always a long day... 86400 doesn't fit into a short.
  11. Re:10 fold speed improvement - Dekkers mutex ! fas by at_18 · · Score: 3, Informative

    I tried unsuccessfully, verbally, to get a Phd in comp Sci with embedded management experience to believe me it is 100% sound.... argued for 40 minutes. The guy never had a clue.

    The guy had a clue. Your algorithm is a busy-wait loop, so your CPUs will be maxed at 100% while waiting, and the thread will be pushed by the scheduler to lower priority, and so on...

  12. There are better ways by photon317 · · Score: 4, Informative


    A lot of shared memory synchronization and/or caching problems can be solved on Linux through the effective use of a few simple things:

    1) shm_open (for separately started processes which need to coordinate in shared memory), or mmap(MAP_SHARED|MAP_ANONYMOUS) for a process which will fork children which need to communicate/share between themselves and the parent.

    2) Use the "atomic_t" integer type within that shared memory array (atomic_t* my_shm_array = mmap(....)). The atomic_t type has several functions defined in its header for atomic read, write, increment, etc. for the Linux hardware platform at hand. On most sane (cache-coherent) SMP architectures, reading and writing are already atomic operations, so this basically devolves to just setting and getting integers like normal, with a little bit of syntactic sugar (struct { volatile int val; }) to make sure the C compiler doesn't optimize away things that it shouldn't. And you can implement a whole lot of sane algorithms using nothing but shared memory integer reads and writes with no locking or special atomic increment ops.

    3) If you need more advanced or complex locking on the shared memory for synchronization, use Linux's futexes. They're in the man pages, and they're really fast.
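
    A rough sketch of points 1 and 2 combined: an anonymous shared mapping, inherited across fork(), holding an atomic counter. Note the swap: this uses the standard std::atomic<int> rather than the kernel's atomic_t, and the function name count_across_processes is hypothetical. Futexes (point 3) are not shown.

```cpp
#include <atomic>
#include <cassert>
#include <new>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Forks `children` processes that each add `per_child` to a counter living in
// anonymous shared memory. Returns the final counter value, or -1 on error.
int count_across_processes(int children, int per_child) {
    void* mem = mmap(nullptr, sizeof(std::atomic<int>), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;

    // Construct the atomic in place inside the shared mapping.
    auto* counter = new (mem) std::atomic<int>(0);
    // Only a lock-free atomic is safe to share between processes this way.
    if (!counter->is_lock_free()) {
        munmap(mem, sizeof(std::atomic<int>));
        return -1;
    }

    int started = 0;
    for (int i = 0; i < children; ++i) {
        pid_t pid = fork();
        if (pid == 0) {  // child: the mapping is shared, not copied
            for (int j = 0; j < per_child; ++j) counter->fetch_add(1);
            _exit(0);
        }
        if (pid > 0) ++started;
    }
    for (int i = 0; i < started; ++i) wait(nullptr);

    int total = counter->load();
    munmap(mem, sizeof(std::atomic<int>));
    return started == children ? total : -1;
}
```

    With a plain int and ++ in place of fetch_add, increments from different children could be lost, which is why the atomic increment matters here even on cache-coherent hardware.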

    --
    11*43+456^2
  13. yeah, fast, and 10-fold chance of odd failures by Krischi · · Score: 5, Informative

    Yeah, this algorithm is fast. Too bad that it does not work. This kind of design is a common mistake by people who do not understand the intricacies of multithreaded programming. In short, it fails miserably when the CPUs are allowed to reorder loads and stores, a.k.a. pretty much any modern CPU. You need a memory barrier between setting and testing of a shared variable.

    Google for Dekker's algorithm and memory barrier - you will find better explanations of the problem there than I could type up in my limited time here right now.
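
    To make the ordering requirement concrete, here is a sketch of Dekker's algorithm written with C++11 atomics. It is not the code from the post being replied to. Every atomic access below uses the default sequentially consistent ordering, which is what stops the "set my flag" store from being reordered after the "test the other flag" load. The names DekkerLock and run_dekker_demo are assumptions, and the busy-wait objection raised elsewhere in this thread still applies.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Dekker's algorithm for exactly two threads, with ids 0 and 1.
struct DekkerLock {
    std::atomic<bool> wants[2];
    std::atomic<int> turn;

    DekkerLock() : turn(0) {
        wants[0] = false;
        wants[1] = false;
    }

    void lock(int me) {
        const int other = 1 - me;
        wants[me].store(true);            // announce intent (seq_cst store) ...
        while (wants[other].load()) {     // ... then test the other flag
            if (turn.load() != me) {
                wants[me].store(false);   // back off while it is the other's turn
                while (turn.load() != me) std::this_thread::yield();
                wants[me].store(true);
            }
        }
    }

    void unlock(int me) {
        turn.store(1 - me);
        wants[me].store(false);
    }
};

// Two threads increment a plain int under the lock; returns the final count.
int run_dekker_demo(int iterations) {
    DekkerLock lock;
    int counter = 0;  // not atomic, protected only by the lock
    auto worker = [&](int id) {
        for (int i = 0; i < iterations; ++i) {
            lock.lock(id);
            ++counter;
            lock.unlock(id);
        }
    };
    std::thread a(worker, 0), b(worker, 1);
    a.join();
    b.join();
    return counter;
}
```

    Replacing the atomics with plain or volatile variables gives the broken variant described above: the compiler and the CPU are then both free to let the load pass the earlier store.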

  14. Re:High-level language? by psykocrime · · Score: 2, Informative

    I've seen this before, but why is C/C++ marked as a high-level language (as in the summary)? C/C++ are LOW-level languages

    I've never heard that before... everyone I know, and all the literature I've read that described programming languages, considers assembler as "low level" and anything at a higher level of abstraction as "high level." With the exception of a few folks who try to describe C as a "mid level language" or as "high level assembler."

    Calling C++ a "low level language" is absolutely a mistake anyway. It's really a mixed-paradigm language, which includes "low-level" abstractions and some very "high-level" ones. Just because you have the option of dropping down to a low-level close to the hardware doesn't mean you have to.

    --
    // TODO: Insert Cool Sig
  15. Not much experience on this but... by hdante · · Score: 2, Informative

    The mutex doesn't seem to be shared between processes. This would make the code incorrect. Can anyone confirm this?

  16. not really usefull by vtoroman · · Score: 2, Informative

    The code shown is using pthread mutex for sync-ing. The mutex works only for synchronization of threads, not processes so the code is useless (even dangerous) for inter process communication (IPC). In the case of threads another question is just screaming for an answer:
          Why would someone use a shared memory block for threads which are all running in the same memory space anyway?

    We come to the conclusion that the code is quite useless for inter-thread communication too. All in all - useless.

    1. Re:not really usefull by vtoroman · · Score: 2, Informative

      The only way to make mutexes interprocess is to set the process-shared attribute with pthread_mutexattr_setpshared. This hasn't been done in the article's code, so the mutex used there has an inter-thread scope, not an inter-process scope.
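
      A minimal sketch of what that looks like, assuming the mutex lives in an anonymous shared mapping inherited across fork(). The struct and function names are hypothetical and not taken from the article's code.

```cpp
#include <cassert>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical layout: the mutex sits in shared memory next to its data.
struct SharedCounter {
    pthread_mutex_t lock;
    long value;
};

// Maps a SharedCounter into anonymous shared memory and initialises its mutex
// with PTHREAD_PROCESS_SHARED. Returns nullptr on failure.
SharedCounter* create_shared_counter() {
    void* mem = mmap(nullptr, sizeof(SharedCounter), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return nullptr;
    auto* c = static_cast<SharedCounter*>(mem);

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    int rc = pthread_mutex_init(&c->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    if (rc != 0) {
        munmap(mem, sizeof(SharedCounter));
        return nullptr;
    }
    c->value = 0;
    return c;
}

// Forks one child; parent and child each add 1 to the counter `n` times under
// the shared mutex. Returns the final value, or -1 on error.
long increment_from_two_processes(long n) {
    SharedCounter* c = create_shared_counter();
    if (!c) return -1;
    pid_t pid = fork();
    if (pid < 0) {
        munmap(c, sizeof(SharedCounter));
        return -1;
    }
    for (long i = 0; i < n; ++i) {  // runs in both parent and child
        pthread_mutex_lock(&c->lock);
        ++c->value;
        pthread_mutex_unlock(&c->lock);
    }
    if (pid == 0) _exit(0);
    waitpid(pid, nullptr, 0);
    long total = c->value;
    pthread_mutex_destroy(&c->lock);
    munmap(c, sizeof(SharedCounter));
    return total;
}
```

      On older glibc versions this needs -pthread when compiling and linking.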

  17. Re:10 fold speed improvement - The Phd was idiot by LO0G · · Score: 2, Informative

    The PhD is STILL right.

    That code makes a huge fundamental assumption: that write order is preserved. In other words, if you do:

    Write to location 3 on processor 1 (take the lock)
    Read from location 30 on processor 1 (do stuff with the lock held)
    Read from location 3 on processor 2 (check the lock)

    that the reads and writes will appear in order. On ALL modern processors, this assumption is not true; it's possible for the write to location 3 to occur AFTER the read from location 3 on processor 2. It works great on single processor machines, but fails on MP machines.

    In order to make the code work, you need to put a memory barrier after the write to location 3; this forces the write to become visible to the other processor before any later reads go ahead.

  18. Re:This is nothing new by Anonymous Coward · · Score: 1, Informative

    If you get a segfault, you have to assume that the shared memory is in an unknown state and either shutdown or restart everything.

    Actually, you don't. There are multiple ways of handling this. Three approaches are critical sections, journalling, and partial rebuilds.

    If you get a segfault in some function, memory does not start magically going haywire; it is not changed after that point. Another process can attach to your shared memory and look at the data structure that was being manipulated and repair it. If the operation is journalled, just replay it (of course you have to fix whatever the NULL pointer is). Or you can avoid full journalling by doing a mini-journal and if interrupted during a write operation, rebuild just that one data structure. If using critical sections, if interrupted, revert to a previous copy of the data. These approaches work not just for segfaults but for kill -9, power going out, etc.
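
    The "revert to a previous copy" idea can be sketched roughly as follows. This is a single-process illustration of the logic only, with made-up names (Account, JournaledAccount, begin_update, commit_update, recover). A real shared-memory version would also need memory barriers or atomics so that the flag and the data are written in the intended order.

```cpp
#include <cassert>

// Hypothetical record kept in shared memory.
struct Account {
    long balance;
};

// The record plus a tiny journal: a backup copy and an "update in progress" flag.
struct JournaledAccount {
    Account current;
    Account backup;    // copy taken just before an update starts
    bool in_progress;  // true while `current` may be half-written
};

// Call before modifying `current`.
void begin_update(JournaledAccount& a) {
    a.backup = a.current;
    a.in_progress = true;
}

// Call once `current` is consistent again.
void commit_update(JournaledAccount& a) {
    a.in_progress = false;
}

// Run by whichever process attaches after a crash. If an update was
// interrupted, restores the previous copy and returns true.
bool recover(JournaledAccount& a) {
    if (!a.in_progress) return false;
    a.current = a.backup;
    a.in_progress = false;
    return true;
}
```

    A writer that dies between begin_update and commit_update leaves in_progress set, so the next process to attach can tell that current is suspect and roll it back.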

    Multiple processes plus shared memory is a much more robust solution than threading. For one, pthreads itself has historically been a source of much bugginess, like Sendmail. For another, if one thread dies they all die; with separate processes, that isn't the case. Just add locks/mutexes to your shared data structures. This also simplifies some things, because if it's NOT in shared memory, it's local to only one process, so you don't have to lock every single thing.

  19. This is a mediocre way to get IPC. by Animats · · Score: 3, Informative
    For historical reasons, most of the UNIX-like operating systems have terrible interprocess communication mechanisms. Early UNIX only had pipes. This started a tradition that interprocess communication works like I/O, leading to named pipes, sockets, and domain sockets. The result is a set of rather slow interprocess communication mechanisms. (One can do worse. In the old MacOS, interprocess communication could only pass one message per vertical refresh time, and this wasn't documented.)

    On top of those mechanisms, even slower interprocess communication systems are typically implemented, such as OpenRPC and CORBA. (For even more inefficiency, there's XPC. In Perl. But I digress.)

    Because of this history, there's a perception that interprocess communication has to be slow. It doesn't.

    What you really want looks more like what QNX has - fast interprocess messaging that interacts properly with the scheduler. QNX has to have interprocess communication done right, because it does everything through it, including all I/O. This works out quite well. You take a performance hit (maybe 20% for this), but you get much of that back because the higher levels become more efficient when built on good IPC.

    The QNX messaging primitives are available for Linux, although the implementation isn't good enough for inclusion in the standard kernel. That work should be redone for the current kernel.

    IPC/scheduler interaction really matters. If you get it wrong, each interprocess transaction results in an extra pass through the scheduler, or worse, both the sending process and the receiving process lose their turn at the CPU. This is easy to test. Start up two processes that communicate using your IPC mechanism. Measure the performance. Then start up a compute-bound process and measure again. If the IPC rate drops by much more than a factor of 2, something is wrong. Don't be surprised if it drops by two orders of magnitude. That's an indication that IPC/scheduler interaction was botched.
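
    A rough sketch of such a measurement, using a pair of pipes as a stand-in for whatever IPC mechanism is under test. The function name pipe_round_trips_per_second is hypothetical. Run it once on an idle machine and once with a compute-bound process running, then compare the two rates.

```cpp
#include <cassert>
#include <chrono>
#include <sys/wait.h>
#include <unistd.h>

// Bounces one byte between parent and child over two pipes `rounds` times.
// Returns round trips per second, or -1.0 on error.
double pipe_round_trips_per_second(int rounds) {
    int to_child[2], to_parent[2];
    if (pipe(to_child) == -1) return -1.0;
    if (pipe(to_parent) == -1) {
        close(to_child[0]);
        close(to_child[1]);
        return -1.0;
    }

    pid_t pid = fork();
    if (pid < 0) return -1.0;
    if (pid == 0) {  // child: echo every byte back until the parent closes its end
        close(to_child[1]);
        close(to_parent[0]);
        char b;
        while (read(to_child[0], &b, 1) == 1) {
            if (write(to_parent[1], &b, 1) != 1) break;
        }
        _exit(0);
    }

    close(to_child[0]);
    close(to_parent[1]);
    char b = 'x';
    bool ok = true;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds && ok; ++i) {
        ok = write(to_child[1], &b, 1) == 1 && read(to_parent[0], &b, 1) == 1;
    }
    auto end = std::chrono::steady_clock::now();

    close(to_child[1]);  // the child sees end-of-file and exits
    close(to_parent[0]);
    waitpid(pid, nullptr, 0);

    double seconds = std::chrono::duration<double>(end - start).count();
    return (ok && seconds > 0.0) ? rounds / seconds : -1.0;
}
```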

    Sun addressed this in the mid-1990s with their "Doors" interface in Solaris, which had roughly the right primitives. But that idea never caught on.

    The article here implements a message-passing system via shared memory, which is not exactly a new idea, even for UNIX. I think it first appeared in MERT, in the 1970s. It's an attempt to solve at the user level something that the OS should be doing for you.

    Shared memory is a hack. It's hard to make it work right. With it, one process can crash other processes in hard-to-debug ways. Sometimes you need it because you're moving vast amounts of data (by which I mean more than just a video stream), but that's rarely the case.