More Effective Use of Shared Memory on Linux
An anonymous reader writes "Making effective use of shared memory in high-level languages such as C++ is not straightforward, but it is possible to overcome the inherent difficulties. This article describes, and includes sample code for, two C++ design patterns that use shared memory on Linux in interesting ways and open the door for more efficient interprocess communication."
Depends on your definition of "complex objects".
From the documentation:
Virtuality forbidden
This is not a specific problem of Shmem; it is a problem for all mechanisms that place objects in shared memory. The virtual table pointer and the virtual table are in the address space of the process that constructs the object, so if we place a class with virtual functions or virtual base classes, the virtual table pointer placed in shared memory will be invalid for other processes.
Basically, I would have been surprised if they had found a solution for that. But I guess it cannot be portably solved. Instead, the system would have to be prepared for it. I could imagine that objects in a shared library (so the same code is guaranteed to be shared to both processes) could be placed in shared memory, if the compiler/runtime system provided the means for it (say, instead of the pointer to a VMT, it would contain an offset into the constant data section of the shared library, and something to identify the library with, say a system-wide unique active library index which is generated by the dynamic linker).
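A minimal sketch of the restriction described above, assuming Linux, an anonymous shared mapping, and a forked child. Sample, place_in_shared_memory and shared_sample_demo are hypothetical names, not part of Shmem. Note that a forked child inherits the parent's address layout, so the vtable problem really bites between unrelated processes; the compile-time check is there to show the rule, not to prove it.

```cpp
#include <cassert>
#include <new>
#include <type_traits>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical payload: plain data, no virtual functions, no raw pointers.
struct Sample {
    int id;
    double value;
};

// Construct a T inside a fresh shared anonymous mapping.
// Polymorphic types are rejected at compile time, because their vtable
// pointer is only meaningful in the process that constructed the object.
template <typename T>
T* place_in_shared_memory() {
    static_assert(!std::is_polymorphic<T>::value,
                  "types with virtual functions cannot be shared this way");
    static_assert(std::is_trivially_copyable<T>::value,
                  "keep shared types trivially copyable");
    void* mem = mmap(nullptr, sizeof(T), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return nullptr;
    return new (mem) T();  // value-initialized in place
}

// The child writes into the shared object; the parent reads it back.
// Returns the id written by the child, or -1 on failure.
int shared_sample_demo() {
    Sample* s = place_in_shared_memory<Sample>();
    if (!s) return -1;
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        s->id = 7;
        s->value = 1.5;
        _exit(0);
    }
    waitpid(pid, nullptr, 0);  // child has exited, so its writes are visible
    int result = (s->id == 7 && s->value == 1.5) ? s->id : -1;
    munmap(s, sizeof(Sample));
    return result;
}
```

Adding a virtual member function to Sample makes the first static_assert fire at compile time.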
The Tao of math: The numbers you can count are not the real numbers.
A tenfold speed improvement in context switching can be had by avoiding OS calls for semaphores and customizing a set of calls for as many consumer-producers as needed.
This avoids using any special opcodes or inefficient cache line flushes.
As long as shared memory is cache coherent, even multiple CPUs will work with Dekker's 1965 algorithm.
Here is the complete classic code for one cpu of a dual cpu design system or a dual thread setup
Amazing! Unbelievably fast. In fact, it is optimal.
It's best if the flags are allocated in their own cache lines, so perhaps pad to 32 bytes on PowerPC, for example; other CPUs might use cache lines as small as 16 bytes. This avoids contention between the flags during rapid read-writes.
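The padding advice can be sketched in C++11 with alignas. The 64-byte line size here is an assumption typical of current x86 parts, not a figure from this post; PaddedFlag, Flags and on_separate_lines are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; change it to match the target CPU.
constexpr std::size_t kCacheLine = 64;

// Each flag is aligned to, and therefore padded out to, a full cache line.
struct alignas(kCacheLine) PaddedFlag {
    std::atomic<int> value{0};
};

// The two per-CPU flags and the turn variable each get their own line.
struct Flags {
    PaddedFlag flag[2];
    PaddedFlag turn;
};

// True when the two addresses fall in different cache-line-sized blocks.
inline bool on_separate_lines(const void* a, const void* b) {
    auto x = reinterpret_cast<std::uintptr_t>(a) / kCacheLine;
    auto y = reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
    return x != y;
}
```

Without the alignas, the three variables would normally share one line, and every write by one CPU would invalidate the copy the other CPU is spinning on.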
Add Dekker's mutex as I described and the number of transactions per second will make your head spin in disbelief, even in pathological situations.
How many people know about this? Nobody! I never read about it anywhere. I invented it myself years ago, before I discovered this year that it is called Dekker's algorithm and that Dekker beat me to it in 1965. I tried unsuccessfully, in conversation, to get a PhD in computer science with embedded management experience to believe me that it is 100% sound. We argued for 40 minutes. The guy never had a clue. No wonder his company's stock is down over a couple billion in market cap since the argument.
Let's not forget the past. Some algorithms are worth remembering.
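The code from this post did not survive, so what follows is not the poster's version. It is a hedged sketch of the textbook two-party Dekker algorithm, run between a parent and a forked child over an anonymous shared mapping on Linux. One deliberate difference: it uses std::atomic with the default sequentially consistent ordering, which does emit fences or locked instructions on most CPUs, whereas the post claims none are needed. DekkerShared, dekker_lock, dekker_unlock and dekker_demo are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <new>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct DekkerShared {
    std::atomic<bool> wants[2];  // wants[i]: party i wants the critical section
    std::atomic<int> turn;       // who must back off when both want in
    long counter;                // plain data protected by the lock
};

void dekker_lock(DekkerShared& s, int self) {
    const int other = 1 - self;
    s.wants[self].store(true);
    while (s.wants[other].load()) {
        if (s.turn.load() != self) {
            s.wants[self].store(false);          // back off
            while (s.turn.load() != self) { }    // spin until it is our turn
            s.wants[self].store(true);
        }
    }
}

void dekker_unlock(DekkerShared& s, int self) {
    s.turn.store(1 - self);       // give priority to the other party
    s.wants[self].store(false);
}

// Parent and child each increment the shared counter `iterations` times.
// Returns the final counter value, or -1 on failure.
long dekker_demo(int iterations) {
    void* mem = mmap(nullptr, sizeof(DekkerShared), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    DekkerShared* s = new (mem) DekkerShared;
    s->wants[0].store(false);
    s->wants[1].store(false);
    s->turn.store(0);
    s->counter = 0;

    pid_t pid = fork();
    if (pid < 0) return -1;
    const int self = (pid == 0) ? 1 : 0;
    for (int i = 0; i < iterations; ++i) {
        dekker_lock(*s, self);
        s->counter = s->counter + 1;  // critical section
        dekker_unlock(*s, self);
    }
    if (pid == 0) _exit(0);

    waitpid(pid, nullptr, 0);
    long result = s->counter;
    munmap(mem, sizeof(DekkerShared));
    return result;
}
```

With plain int flags instead of std::atomic, nothing stops the compiler or the CPU from reordering the flag writes against the counter update, and the final count is no longer guaranteed.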
Anyway, old stuff. Wake me up when you start talking about the newer tricks with shared memory.
Quite a few years ago, there was a brief popularity of something called VRAM (video ram) that had memory cells specifically designed with one input line and TWO output lines. The idea was that the part of the hardware needing to construct an image for the screen ONLY needed to read memory, while the system responsible for creating the image needed both read and write access. Ever since then, I've wondered why they don't use this kind of memory in multi-processor systems, for communication between processors, such that Processor A has read/write access to a block of VRAM, to give info to Processor B (it has read-access only), while Processor B has read/write access to a different block of VRAM, to give info to Processor A (it has read-access only).
I'm surprised no-one has mentioned Solaris Doors. Doors is an IPC mechanism whereby the first process (client) can hand off any residual time in its timeslice to the second process (server) resulting in short IPC calls running much less time as there is no discarded timeslice time and no wait for the server process to be scheduled (since it uses the client's timeslice).
Well, it's possible to use shmem as a very fast method for marshalling arguments across process boundaries and then use BIL (Boost Interfaces Library) to marshal the actual function calls. It will look like the Local Procedure Call subsystem in Windows NT.
You can get virtual functions this way and it will be fast enough but not very "nice", of course.
People who think they always need to "new" objects in C++ have spent way too much time using Java.
Here's another hint -- pass objects to functions as const references: This way, no copy of the object is made for the call (no memory at all is in fact allocated). The biggest drawback is that you can only call "const" methods on the object, but this is outweighed by not using pointers. Not that I don't like pointers; they just increase the complexity and should be used prudently. And as my
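The example that went with the const-reference hint is missing, so here is a small stand-in rather than the original. Tracked, by_value and by_const_ref are hypothetical names; the copy counter exists only to make the difference visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type that counts how often it is copied.
struct Tracked {
    static int copies;
    std::vector<int> data;

    explicit Tracked(std::size_t n) : data(n, 1) {}
    Tracked(const Tracked& other) : data(other.data) { ++copies; }

    // const member function: callable through a const reference.
    std::size_t size() const { return data.size(); }
};
int Tracked::copies = 0;

// Pass by value: the argument is copy-constructed, vector and all.
std::size_t by_value(Tracked t) { return t.size(); }

// Pass by const reference: no copy, no allocation, read-only access.
std::size_t by_const_ref(const Tracked& t) { return t.size(); }
```

The caller can keep the object on the stack (Tracked t(1000);) with no "new" anywhere; by_const_ref(t) leaves Tracked::copies untouched, while by_value(t) bumps it by one.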
"Save the whales, feed the hungry, free the mallocs" -- author unknown
Ok, I get it... it's an attempt to exploit shared memory in C++.
And why is this news? Is it so difficult that nobody has done it? No, that can't be -- the shm stuff can be wrapped. This is so important that it rates a "design pattern"? Not it either -- the one illustrated isn't the best solution.
So, just what is this article? Methinks fluff. Sort of in line with "How to implement co-routines with setjmp/longjmp" thing. Or, "Restructuring data to assist processor cache residency". And "How to remove locks from performance critical MP code".
Except not as interesting or useful.
Ratboy.
Just another "Cubible(sic) Joe"
1) shm_open(2) is already mentioned in the 2nd post.
2) Don't you know that NPTL is already doing this for you? On the fast path, NPTL's POSIX mutex just does atomic operations and avoids the syscall. Stick to the standard API and let the platform guys (libc, kernel, ...) do the optimization. They're smarter than you.
3) You don't want to do this, seriously! If futex were that consumable by the public, then why did the glibc guy write a looooooong paper describing how to use futex?
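For point 2, a hedged sketch of the standard-API route: a POSIX mutex marked PTHREAD_PROCESS_SHARED, placed in an anonymous shared mapping and used by a parent and a forked child. SharedCounter and shared_mutex_demo are hypothetical names; older glibc versions may need -pthread at link time.

```cpp
#include <cassert>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct SharedCounter {
    pthread_mutex_t lock;  // lives in shared memory next to the data it guards
    long value;
};

// Parent and child each add `iterations` to the counter under the mutex.
// Returns the final value, or -1 on failure.
long shared_mutex_demo(int iterations) {
    void* mem = mmap(nullptr, sizeof(SharedCounter), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    SharedCounter* c = static_cast<SharedCounter*>(mem);
    c->value = 0;

    // Without PROCESS_SHARED the mutex is only valid within one process.
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&c->lock, &attr);
    pthread_mutexattr_destroy(&attr);

    pid_t pid = fork();
    if (pid < 0) return -1;
    for (int i = 0; i < iterations; ++i) {
        pthread_mutex_lock(&c->lock);    // uncontended case stays in user space
        c->value = c->value + 1;
        pthread_mutex_unlock(&c->lock);
    }
    if (pid == 0) _exit(0);

    waitpid(pid, nullptr, 0);
    long result = c->value;
    pthread_mutex_destroy(&c->lock);
    munmap(mem, sizeof(SharedCounter));
    return result;
}
```

Unlike a spin loop, a contended lock here puts the waiter to sleep in the kernel instead of burning its timeslice.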
Bogus
Like Java, right?
Getting back to the original premise of the story, can you even do OS-level shared memory (SysV or POSIX) with Java? OS-level semaphores? Any meaningful kind of IPC? OS-level anything? I mean without godawful JNI nonsense.
In the course of every project, it will become necessary to shoot the scientists and begin production.
You don't get it about out-of-order writes, do you? Simple scenario, according to your algorithm:
CPU AA:
resource = produce_something();
turn = BB;
flags[AA] = FREE;
CPU BB:
flags[BB] = BUSY;
/* CPU AA clears its BUSY flag at this point in time, so the while (flags[AA] == BUSY) terminates immediately */
consume(resource);
The problem is that AA is free to reorder its writes. So, the actual order could be:
flags[AA] = FREE; /* from AA */
flags[BB] = BUSY; /* from BB */
consume(resource); /* BB uses the resource */
resource = result of produce_something() call /* writeback from AA is too late */
Oops. BB accesses the resource before AA writes back the current state. Cache coherency does not solve this problem - the problem is that the write to the resource is still pending. That is what the memory barrier is there for.
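One way to express that barrier in C++11, as a sketch rather than a patch to the algorithm above: AA publishes with a release store and BB waits with an acquire load, so the write to the resource cannot be made visible after the flag. The producer is a forked child and the consumer is the parent, sharing an anonymous mapping on Linux. Channel and publish_demo are hypothetical names, and the value 42 is arbitrary.

```cpp
#include <atomic>
#include <cassert>
#include <new>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct Channel {
    std::atomic<bool> ready;  // plays the role of flags[AA] == FREE
    int resource;             // plain data handed from AA to BB
};

// Returns the value the consumer observed, or -1 on failure.
int publish_demo() {
    void* mem = mmap(nullptr, sizeof(Channel), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    Channel* ch = new (mem) Channel;
    ch->ready.store(false);
    ch->resource = 0;

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        // Producer (AA): write the data first, then publish it.
        ch->resource = 42;
        // Release: earlier writes may not be reordered after this store.
        ch->ready.store(true, std::memory_order_release);
        _exit(0);
    }

    // Consumer (BB): acquire pairs with the release store above, so once
    // `ready` reads true, the write to `resource` is guaranteed visible.
    while (!ch->ready.load(std::memory_order_acquire)) { }
    int seen = ch->resource;

    waitpid(pid, nullptr, 0);
    munmap(mem, sizeof(Channel));
    return seen;
}
```

With a plain bool for ready, both the compiler and a weakly ordered CPU are free to produce exactly the bad interleaving shown above.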
Argue with facts, don't hide behind oh-so-impressive credentials.
C++ is more than just an OO language. It provides direct support for the procedural paradigm too.
STL, for example, is not an OO library. Yet it has proved to be immensely useful.
One place where the garbage collected languages fall down is in the management of resources. Limited resources such as files or sockets must be explicitly released by the programmer. This demonstrates that you simply cannot ignore the lifetime of objects with a garbage collector. And I also assert here that memory is a limited resource too.
That silly singleton thing in the example is a demonstration of the disregard for the lifetime of that particular object. Does it really need to live for the lifetime of the application? Does it need to be cleanly released?
I think C++'s memory management model is sufficient. One can hardly say that about garbage collected languages.
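To make the lifetime point concrete, here is a minimal sketch of scope-bound cleanup for a file descriptor. FdGuard and fd_closed_after_scope are hypothetical names, and the pipe is only there to provide descriptors to close.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Owns a file descriptor and closes it when the guard goes out of scope.
class FdGuard {
public:
    explicit FdGuard(int fd) : fd_(fd) {}
    ~FdGuard() { if (fd_ >= 0) close(fd_); }
    FdGuard(const FdGuard&) = delete;             // exactly one owner
    FdGuard& operator=(const FdGuard&) = delete;
    int get() const { return fd_; }
private:
    int fd_;
};

// True if the descriptor was open inside the scope and closed after it.
bool fd_closed_after_scope() {
    int fds[2];
    if (pipe(fds) != 0) return false;
    const int raw = fds[0];
    bool open_inside = false;
    {
        FdGuard reader(fds[0]);
        FdGuard writer(fds[1]);
        open_inside = (fcntl(reader.get(), F_GETFD) != -1);
    }   // both descriptors are released here, deterministically
    const bool closed_after =
        (fcntl(raw, F_GETFD) == -1 && errno == EBADF);
    return open_inside && closed_after;
}
```

The release happens at a known point in the program, with no finalizer and no explicit close call at the use site.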