More Effective Use of Shared Memory on Linux
An anonymous reader writes "Making effective use of shared memory in high-level languages such as C++ is not straightforward, but it is possible to overcome the inherent difficulties. This article describes, and includes sample code for, two C++ design patterns that use shared memory on Linux in interesting ways and open the door for more efficient interprocess communication."
now you see i'm sitting here with an opportunity to get a first post but i cant think what to write because this story is so damn not at all interesting.
some1 should tell the authors to rtfm.
$ man shm_open
Bogus
Le cochon dans le maïs
Demain, je porte plainte contre l'Amérique
J'ai bouffé du maïs et du chou transgénique
Je n'bande plus ça reste mou comme une chique
José Bové a dit que c'était allergique
Il paraît qu'on est plusieurs dans le le même cas-ca
On ira plus manger au Ricain car c'est caca
Le seul remède c'est une assiette de foie gras
Un verre de rouge, du Roquefort et pas de soda
Demain, je porte plainte contre l'Amérique
L'Amérique
Nous ce qu'on veut c'est du bon et du biologique
Attends tu vas voir, on va leur faire la nique
Faire la nique
Si les Ricains nous embrouillent con:
On leur mettra le cochon
Le cochon dans le maïs
Et on mettra les glaçons
Les glaçons dans le pastis
{x4}
Demain, je porte plainte contre L'Amérique
J'ai bouffé du poireau transformé génétique
Et depuis ma carotte n'est plus énergique
Pourtant ma femme tu verrais comme elle l'astique
Il paraît qu'on est plusieurs dans le même cas-ca
On ira plus manger au Ricain car c'est caca
Le seul remède c'est une assiette de foie gras
Un verre de rouge, du Roquefort et pas de soda
Demain, je porte plainte contre l'Amérique
L'Amérique
Nous ce qu'on veut c'est du bon et du biologique
Attends, tu vas voir, on va leur faire la nique
Faire la nique
Si les Ricains nous embrouillent con:
On leur mettra le cochon
Le cochon dans le maïs
Et on mettra les glaçons
Les glaçons dans le pastis
{ad libitum}
Smile, don't click...
There is a great C++ library for shared memory support: SHMEM. It can place complex objects and STL-like containers in shared memory. And it is crossplatform (POSIX and Windows are supported).
And it will soon (hopefully) be a part of Boost!
In fact, forget it; just use an actual OO language instead.
TWW
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
This looks like Microsoft code. The classes are named like ISomething or IAnother, like the Hungarian stuff in MSDN.
And then there's the backwards comparison, that teachers who come directly from Visual Basic advise to students because the teachers themselves aren't comfortable with C++ syntax.
Please, let the guy finish his first semester, and then learn to write readable code.
Unix shared memory paradigm has been around for a long time. There is no need to
re-invent the wheel. BOOST has something, ACE has something, The person that posted
this article should be modded redundant permanently.
"1 comment" ;)
No more I say.
Yup they released it and it sucks. Although somewhat related, a distributed object cache is a solution to a totally different set of problems. Why mention it here?
http://www-128.ibm.com/developerworks/i/p-sudas.jpg
I suppose everything marked const could be shared.
A 10-fold speed improvement in context switching can be had by avoiding OS calls for semaphores and customizing a set of calls for as many consumer-producers as needed.
This avoids using any special opcodes or inefficient cache line flushes.
As long as shared memory is cache coherent, even multiple CPUs will work with Dekker's 1965 algorithm.
Here is the complete classic code for one cpu of a dual cpu design system or a dual thread setup
amazing! unbelievably fast. In fact is optimal.
It's best if the flags are allocated in their own cache lines, so perhaps pad to 32 bytes on PowerPC for example, and other CPUs might use cache lines as small as 16 bytes. This avoids contention and increases coherency for rapid read-writes.
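A minimal C++ sketch of the one-flag-per-cache-line layout described here. The 64-byte line size and the names `kCacheLine`, `PaddedFlag`, `FlagPair` and `flag_distance` are assumptions made for illustration; the post itself mentions 32 and 16 bytes, so adjust the constant for the target CPU.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Assumption: 64-byte cache lines (common on current x86-64 parts).
constexpr std::size_t kCacheLine = 64;

// alignas pads and aligns the struct so each flag owns a whole line.
struct alignas(kCacheLine) PaddedFlag {
    std::atomic<bool> busy{false};
};

// Hypothetical container for the two flags of a two-party lock.
struct FlagPair {
    PaddedFlag flag[2];
};

static_assert(sizeof(PaddedFlag) == kCacheLine, "one flag per cache line");

// Byte distance between the two flags; equals kCacheLine when padded.
std::size_t flag_distance() {
    FlagPair p;
    auto* a = reinterpret_cast<const char*>(&p.flag[0]);
    auto* b = reinterpret_cast<const char*>(&p.flag[1]);
    return static_cast<std::size_t>(b - a);
}
```

Padding only addresses false sharing between the flags; it says nothing about the ordering of the loads and stores themselves.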
Add Dekker's mutex as I described and the speed of transactions per second will make your head spin in disbelief even in pathological situations.
How many people know about this? Nobody! I never read about it anywhere. I invented it myself years ago, before I discovered this year it was called Dekker's, and Dekker beat me to it in 1965. I tried unsuccessfully, verbally, to get a PhD in Comp Sci with embedded management experience to believe me it is 100% sound.... argued for 40 minutes. The guy never had a clue. No wonder that his company's stock is down over a couple billion in market cap since the argument.
Let's not forget the past. Some algorithms are worth remembering.
Anyway, old stuff. Wake me up when you start talking about the newer tricks with shared memory.
For concurrent applications, it is hard to beat Reppy's CML.
http://portal.acm.org/citation.cfm?id=113470
In particular, the things you synchronize on are first-class. Also you can speculatively send/receive things. Normal "select" is only for reading. You don't have to manage your memory either.
There are other concurrent languages, but CML is nice in that it has a formal semantics, so unlike typical languages like "C", "C++", Erlang or Java, a program has a meaning other than "whatever the program does when I run it."
You can implement the primitives of CML in your favorite higher-order language, so you don't have to be limited by ML. That's what's in Reppy's book.
A proper implementation can achieve speeds that are about 30x faster than pthreads for typical tests like "ping/pong".
http://www.thebricktestament.com/the_law/when_to_
Quite a few years ago, there was a brief popularity of something called VRAM (video ram) that had memory cells specifically designed with one input line and TWO output lines. The idea was that the part of the hardware needing to construct an image for the screen ONLY needed to read memory, while the system responsible for creating the image needed both read and write access. Ever since then, I've wondered why they don't use this kind of memory in multi-processor systems, for communication between processors, such that Processor A has read/write access to a block of VRAM, to give info to Processor B (it has read-access only), while Processor B has read/write access to a different block of VRAM, to give info to Processor A (it has read-access only).
in my above example i noticed slashcode converting some single ascii space white space into
... the source code is correct
AMPERSAND POUND-SIGN 160 SEMICOLON
just swap those back to spaces.
"POUND-SIGN" is defined as octothorpe, not pound-sign the english monetary unit glyph
I would have typed OCTOTHORPE above but i was just letting usa people understand a little more clearly
anyway ignore the AMPERSAND POUND-SIGN 160 SEMICOLON
i wanted to post this earlier, but slashcode crap makes me wait FOREVER to post a correction! What crap. it's been 6 minutes and it says "Slow Down Cowboy! It's been 6 minutes since you last successfully posted a comment" I wonder why engineers like me even bother trying to help out on slashdot anymore.
News for Nerds. Stuff that matters. ????
I'm surprised no-one has mentioned Solaris Doors. Doors is an IPC mechanism whereby the first process (client) can hand off any residual time in its timeslice to the second process (server) resulting in short IPC calls running much less time as there is no discarded timeslice time and no wait for the server process to be scheduled (since it uses the client's timeslice).
Too many people here are willing to make inane useless comments about honest work efforts. If you have a better way, offer it. If you merely want to say something nasty about someone else's work, save it for the coffee house.
"If all the American people want is security, let them live in prisons." Eisenhower
"Now let's look at using shared memory and caching of events for interprocess communications. If the events are cached within the shared object, they can be fired later. The receiving process will have to query the shared object for events. Thus, by sticking to a synchronous model, interprocess communication can be achieved. This is the motivation behind developing the following design pattern."
Que? How is that different from an asynchronous message event queue? I think they've misused the word synchronous. Presumably the calling process is not waiting for the return from the process it called, so the two processes are running asynchronously. So this is just asynchronous event message passing.
That's funny. This was one of the first synchronization algorithms we learned in my operating systems class.
kids and newbies, welcome to all new slashdot newbie poo site. it teaches u...u guessed it, to puke a lot of patterns into ur manager's pocket. if u thought programming was an art, forget it. the guy who thought that up was a fool. pls feel free to learn BS here and close ur mind plz; /. will provide u with all the programming patterns u need.
WTF? u disagree...oh...it is IBM branded puke!!!
> How many people know about this? Nobody!
Only those who know about futexes.
"Between strong and weak, between rich and poor [...], it is freedom which oppresses and the law which sets free"
Yes, some algorithms are worth remembering...
This one is worth remembering as one to avoid -- it's based on the idea of a busy-wait. Look at the while(test) { /* do nothing */ } loop and outer while loop. This should not be done. Semaphores might be slower in the specific case, but overall system performance will benefit from using best-practices.
There's a reason this algorithm lies in rest in academic journals: it's only useful as a teaching tool.
There are some subjects that draw fanboy clubs here in
Some examples: Java, AMD, Apple, Ruby.
Try criticizing any of them here, you'll be down-moderated to (-1) pretty quickly. OTOH, praise any of those and you'll get moderated up, no matter how stupid or inconsistent the comment is.
This thread goes out to all the HTML turned PHP hackers who whine that Comp Sci is a useless major
Nothing against PHP or HTML of course
MOD the parent of this comment up!!!
IMHO this algorithm is not a panacea because :
- It does busy waiting. If one thread holds the 'mutex' for a long time, the other thread will take a lot of CPU for nothing.
If you really need to take the resource as soon as it is available without giving up the proc, then have a look at "spin locks".
- It is not very scalable.
First, you need one version of the algorithm for mono proc, one for bi proc, etc. Of course you could put them all in a shared lib and select one at runtime.
Second, the algo seems to be O(N), N being the number of processors. Therefore the algo slows down when the number of procs increases.
- And last: it is unclear to me how you pass the turn when there are more than 2 procs involved. Does this algorithm work when there are more than 2 processors?
The D programming language is shaping up to be a nice alternative.
you fool! Duh!
Are you an idiot! naturally that is where you would implement sleeping or thread suspension! the point is to show the relevant FAST part.
The routines used by linux and even BSD historically (before 1994) were slow, and the current stuff is still not as fast as this, by far.
naturally you would implement your sleep or poll in that section, if more than a microsecond or two elapsed.
the code is a skeleton demonstrating that the massive hit hidden in special opcodes or special os calls can be avoided.
So, what would it take to satisfy your criteria on being a proper successor to C++, if none of C#, Java, Pike, Python, and many others qualify?
now we need to go OSS in diesel cars
I tried unsuccessfully, verbally, to get a Phd in comp Sci with embedded management experience to believe me it is 100% sound.... argued for 40 minutes. The guy never had a clue.
The guy had a clue. Your algorithm is a busy-wait loop, so your CPUs will be maxed at 100% while waiting, and the thread will be pushed by the scheduler to lower priority, and so on...
Unix Domain Sockets use shared memory to transfer data between applications. How does this compare to other shared memory methods in performance?
Ok, I get it... it's an attempt to exploit shared memory in C++.
And why is this news? Is it so difficult that nobody has done it? No, that can't be -- the shm stuff can be wrapped. This is so important that it rates a "design pattern"? Not it either -- the one illustrated isn't the best solution.
So, just what is this article? Methinks fluff. Sort of in line with "How to implement co-routines with setjmp/longjmp" thing. Or, "Restructuring data to assist processor cache residency". And "How to remove locks from performance critical MP code".
Except not as interesting or useful.
Ratboy.
Just another "Cubible(sic) Joe" 2 17 3061
It can be tweaked to avoid busy waiting, obviously. The code is a clear skeleton example. You would sleep, or suspend, or force a context switch in the busy wait section after a certain amount of microseconds elapses. Also one can assume that the mutex itself is controlling a single shared VARIABLE or RECORD, and that the entire time waiting is nanoseconds at maximum in best case. THINK!!! This is a special tool, and can be made slower or bloated if needed, but not necessarily nonoptimal even in skeleton design!
secondly :
IT IS SCALABLE for a few tasks. I would not set up more than 20 of them, but many situations only need 3 or fewer tasks sharing one message bin. Yes of course you need to make X copies of specially crafted code depending on the design.
Remember the point, we are talking over 10 times faster... that is all that counts, and that is still an inarguable fact compared to the linux example code in the article.
finally : as for how to pass the turn, instead of two opcodes to exit, you would add an increment, and the special code for the final participant would merely set back to 0
FAST FAST FAST!!!! potentially over 500 million transactions per second protecting a tiny data structure on normal machines
You've just gotta love Dekkers
I've seen this before, but why is C/C++ marked as a high-level language (as in the summary)? C/C++ are LOW-level languages - closer to the hardware, while languages like Visual Basic are HIGH-level languages since they abstract hardware-related tasks like memory management away.
-- Proof by analogy is fraud.
A lot of shared memory synchronization and/or caching problems can be solved on Linux through the effective use of a few simple things:
1) shm_open (for separately-started processes which need to coordinate in shared memory), or mmap(MAP_SHARED|MAP_ANONYMOUS) for a process which will fork children which need to communicate/share between themselves and the parent.
2) Use 's "atomic_t" integer type within that shared memory array (atomic_t* my_shm_array = mmap(....)). The atomic_t type has several functions defined in that header for atomic read, write, increment, etc for the linux hardware platform at hand. On most sane (cache-coherent) SMP architectures, reading and writing are already atomic operations, so this basically devolves to just setting and getting integers like normal (with a little bit of syntactic sugar (struct { volatile int val }) to make sure the C compiler doesn't optimize things away that it shouldn't. And you can implement a whole lot of sane algorithms using nothing but shared memory integer reads and writes with no locking or special atomic increment ops.
3) If you need more advanced or complex locking on the shared memory for synchronization, use Linux's futexes. They're in the man pages, and they're really fast.
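Here is a rough C++ sketch of points 1 and 2 for the fork case. It swaps in `std::atomic<int>` from the C++ standard library in place of the kernel-style "atomic_t" mentioned above, and the function name `count_across_processes` is made up for illustration. It assumes Linux, C++17, and a lock-free `std::atomic<int>`.

```cpp
#include <atomic>
#include <cassert>
#include <new>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// A lock-free atomic is needed for this to make sense across processes.
static_assert(std::atomic<int>::is_always_lock_free,
              "std::atomic<int> must be lock-free");

// Hypothetical helper: forks `children` processes, each of which bumps a
// shared counter `per_child` times. Returns the final count, or -1 if the
// mapping could not be created.
int count_across_processes(int children, int per_child) {
    // Anonymous shared mapping: stays shared with children after fork().
    void* mem = mmap(nullptr, sizeof(std::atomic<int>),
                     PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;

    // Construct the atomic inside the shared region.
    auto* counter = new (mem) std::atomic<int>(0);

    for (int c = 0; c < children; ++c) {
        pid_t pid = fork();
        if (pid < 0) break;  // could not fork; count what we have
        if (pid == 0) {
            for (int i = 0; i < per_child; ++i)
                counter->fetch_add(1, std::memory_order_relaxed);
            _exit(0);  // child is done
        }
    }

    // Reap every child before reading the result.
    while (wait(nullptr) > 0) {
    }

    int total = counter->load();
    munmap(mem, sizeof(std::atomic<int>));
    return total;
}
```

For separately-started processes the same idea applies with shm_open plus mmap on the returned descriptor instead of MAP_ANONYMOUS.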
11*43+456^2
The article is about shared-mem and synchronization across process boundaries! In Java that would mean: objects that are shared between VMs; methods that are serialized across the VM boundary.
Bogus
Yeah, this algorithm is fast. Too bad that it does not work. This kind of design is a common mistake by people who do not understand the intricacies of multithreaded programming. In short, it fails miserably when the CPUs are allowed to reorder loads and stores, a.k.a. pretty much any modern CPU. You need a memory barrier between setting and testing of a shared variable.
Google for Dekker's algorithm and memory barrier - you will find better explanations of the problem there than I could type up in my limited time here right now.
WRONG!
The Phd had no argument based on efficiency... he claimed it was physically impossible to avoid special hardware support (opcodes, mem control, etc).
By the way it is 100% efficient if two cpus are sharing a single cacheline of memory as a small communications status record.
imagine that reads or writes are bracketed around the cacheline with a call to the mutex.
as soon as the byte or bytes are read or written, the mutex is dropped.
idiots like you would take that pristine 100% optimal example of 500 million transactions per second and slow it down over 100 times slower probably!
Do you ever think about what you are claiming! I am sorry i tried to teach anyone anything here.... its like talkign to retards. I have a respected Com Sci degree and it was not clearly discussed in the older days when i got my worth-crap diploma. instead people rely on HARDWARE based semaphores in machines. THERE IS NO NEED FOR THEM and that is the point.
Well... i would rather recommend a good Haskell compiler as it's the successor of CML and a bit cleaner too. ;)
Or use Ocaml if you like it lightning fast. (meaning: nearly beating raw C) Even in bytecode! (Beating java by far!)
But it's not as clean as Haskell as it allows *shudder* side effects...
Any sufficiently advanced intelligence is indistinguishable from stupidity.
The mutex doesn't seem to be shared between processes. This would make the code incorrect. Can anyone confirm this ?
The neat thing about "die" in Perl is that it instantly converts into an exception just by wrapping the function call inside an "eval { }"
The code shown is using pthread mutex for sync-ing. The mutex works only for synchronization of threads, not processes so the code is useless (even dangerous) for inter process communication (IPC). In the case of threads another question is just screaming for an answer:
Why would someone use a shared memory block for threads which are all running in the same memory space anyway?
We come to the conclusion that the code is quite useless for inter-thread communication too. All in all - useless.
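For reference, POSIX does define a process-shared mutex attribute, so a pthread mutex can be used between processes if it lives in shared memory and is initialized with PTHREAD_PROCESS_SHARED. Whether the article's code does this is not something this thread confirms. A minimal sketch, with `SharedState` and `locked_count_across_processes` as made-up names and Linux assumed:

```cpp
#include <cassert>
#include <new>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical layout: the mutex and the data it guards share one mapping.
struct SharedState {
    pthread_mutex_t lock;
    long counter;  // deliberately not atomic; the mutex protects it
};

// Forks `children` processes that each increment the counter `per_child`
// times under the mutex. Returns the final count, or -1 on setup failure.
long locked_count_across_processes(int children, int per_child) {
    void* mem = mmap(nullptr, sizeof(SharedState), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    auto* st = new (mem) SharedState{};

    // The default attribute is process-private; ask for process-shared.
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&st->lock, &attr);
    pthread_mutexattr_destroy(&attr);

    for (int c = 0; c < children; ++c) {
        pid_t pid = fork();
        if (pid < 0) break;
        if (pid == 0) {
            for (int i = 0; i < per_child; ++i) {
                pthread_mutex_lock(&st->lock);
                ++st->counter;
                pthread_mutex_unlock(&st->lock);
            }
            _exit(0);
        }
    }
    while (wait(nullptr) > 0) {
    }

    long total = st->counter;
    pthread_mutex_destroy(&st->lock);
    munmap(mem, sizeof(SharedState));
    return total;
}
```

A mutex that sits in ordinary process memory, or one created with default attributes, would indeed only synchronize threads of a single process, which is the concern raised above.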
The PhD is STILL right.
That code makes a huge fundamental assumption, that write order is preserved. In other words, if you do:
Write to location 3 on processor 1 (take the lock)
Read from location 30 on processor 1 (do stuff with the lock held)
Read from location 3 on processor 2 (check the lock)
that the reads and writes will appear in order. On ALL modern processors, this assumption is not true, it's possible for the write to location 3 to occur AFTER the read from location 3 on processor 2. It works great on single processor machines, but fails on MP machines.
In order to make the code work, you need to put a memory write barrier after the write to location 3, this will force the write to be flushed from the cache.
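One way to get the required ordering in portable C++ is to make the flags and the turn variable `std::atomic` and leave them at the default sequentially consistent ordering, which makes the compiler emit whatever barriers the target needs. The sketch below is a textbook two-party Dekker lock written that way, run between a parent and a forked child over shared memory. It is not the original poster's code (which did not survive in this thread), and `DekkerState`, `dekker_lock`, `dekker_unlock` and `dekker_count` are names invented here.

```cpp
#include <atomic>
#include <cassert>
#include <new>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

static_assert(std::atomic<bool>::is_always_lock_free &&
                  std::atomic<int>::is_always_lock_free,
              "lock-free atomics are needed across processes");

// Hypothetical shared layout for a two-party Dekker lock.
struct DekkerState {
    std::atomic<bool> wants[2];  // wants[i]: party i wants the lock
    std::atomic<int> turn;       // who yields when both want it
    long counter;                // plain data guarded by the lock
};

void dekker_lock(DekkerState& s, int me) {
    const int other = 1 - me;
    s.wants[me].store(true);             // seq_cst store, not reordered
    while (s.wants[other].load()) {      // past the load that follows
        if (s.turn.load() != me) {
            s.wants[me].store(false);    // back off while it is their turn
            while (s.turn.load() != me) {
                // busy-wait, as in the algorithm under discussion
            }
            s.wants[me].store(true);
        }
    }
}

void dekker_unlock(DekkerState& s, int me) {
    s.turn.store(1 - me);
    s.wants[me].store(false);
}

// Parent (party 0) and child (party 1) each do `per_process` increments.
long dekker_count(int per_process) {
    void* mem = mmap(nullptr, sizeof(DekkerState), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    auto* s = new (mem) DekkerState{};
    s->wants[0].store(false);
    s->wants[1].store(false);
    s->turn.store(0);
    s->counter = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(mem, sizeof(DekkerState));
        return -1;
    }
    const int me = (pid == 0) ? 1 : 0;
    for (int i = 0; i < per_process; ++i) {
        dekker_lock(*s, me);
        ++s->counter;
        dekker_unlock(*s, me);
    }
    if (pid == 0) _exit(0);

    waitpid(pid, nullptr, 0);
    long total = s->counter;
    munmap(mem, sizeof(DekkerState));
    return total;
}
```

With plain `volatile` variables instead of sequentially consistent atomics, the store to `wants[me]` and the following load of `wants[other]` may be reordered by the hardware, which is the failure described above.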
This is of course a textbook algorithm. Every textbook will also tell you that it has a fatal flaw (apart from the busy waiting, which the textbook we were using (Ben-Ari) wasn't even interested in): if a process that holds the turn crashes or terminates, the whole system deadlocks. That makes the whole thing, unfortunately, rather useless.
It would still result in massive CPU usage, compounded with potentially lots of MOESI traffic on the caches. These are some of the reasons why it isn't used anymore. If it were the end-all, be-all as you claim then why wouldn't this be used everywhere? I mean, it's not like the algorithm hasn't been around since 1965 (as you also stated).
Basically, the reason is this: Polling, especially in a single CPU environment, is a BadThing(tm) if you want your machine to handle load well. If you have no load and have nothing else useful for your CPUs to be doing, then polling is about as good as anything else. Considering that all modern OSs are pre-emptive and are doing lots of things during any given second, few would meet that criteria.
You don't get it about out-of-order writes, do you? Simple scenario, according to your algorithm:
CPU AA:
resource = produce_something();
turn = BB;
flags[AA] = FREE;
CPU BB:
flags[BB] = BUSY;
/* CPU AA clears its BUSY flag at this point in time, so, the while (flags[AA] == BUSY) terminates immediately */
consume(resource);
The problem is that AA is free to reorder its writes. So, the actual order could be:
flags[AA] = FREE; /* from AA */
flags[BB] = BUSY; /* from BB */
consume(resource); /* BB uses the resource */
resource = result of produce_something() call /* writeback from AA is too late */
Oops. BB accesses the resource before AA writes back the current state. Cache coherency does not solve this problem - the problem is that the write to the resource is still pending. That is what the memory barrier is there for.
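In C++ terms, the fix for this particular hand-off is a release store on the producer side paired with an acquire load on the consumer side. A small sketch, assuming Linux and C++17; `Mailbox` and `publish_and_consume` are invented names, and the single `ready` flag stands in for the flag hand-off in the scenario above rather than reproducing the full lock.

```cpp
#include <atomic>
#include <cassert>
#include <new>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

static_assert(std::atomic<bool>::is_always_lock_free,
              "lock-free atomic needed across processes");

// Hypothetical shared record.
struct Mailbox {
    int resource;             // plain data, like `resource` above
    std::atomic<bool> ready;  // the hand-off flag
    int seen;                 // what the consumer observed
};

// Producer (parent) publishes `value`; consumer (child) reads it.
// Returns what the consumer saw, or -1 on setup failure.
int publish_and_consume(int value) {
    void* mem = mmap(nullptr, sizeof(Mailbox), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    auto* m = new (mem) Mailbox{};
    m->resource = 0;
    m->ready.store(false);
    m->seen = -1;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(mem, sizeof(Mailbox));
        return -1;
    }
    if (pid == 0) {
        // Consumer: the acquire load pairs with the release store below,
        // so the write to `resource` is visible once `ready` reads true.
        while (!m->ready.load(std::memory_order_acquire)) {
            // spin
        }
        m->seen = m->resource;
        _exit(0);
    }

    // Producer: write the data first, then publish with release ordering
    // so the data write cannot be reordered after the flag write.
    m->resource = value;
    m->ready.store(true, std::memory_order_release);

    waitpid(pid, nullptr, 0);
    int seen = m->seen;
    munmap(mem, sizeof(Mailbox));
    return seen;
}
```

If `ready` were an ordinary variable, nothing would stop the flag write from becoming visible before the write to `resource`, which is the "writeback from AA is too late" case.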
Argue with facts, don't hide behind oh-so-impressive credentials.
Personally I think actions have no business whatsoever inside the condition clause of an "if", "while", etc. "x = foo()" can easily go on another line. And if someone is sick enough to really want a side-effect inside of a boolean clause, they should use parens to make the order of operations easy to read.
As for the more general issue of "backwards comparisons"... The big thing I guess is that it helps people remember not to type "if (x = 3)" or whatever... not that that'll help them if both values being compared are variables.
---GEC
I'm but the humble pupil, seeking to snatch the scratchbuilt pebble from the master's fully articulated hand
Does that count as a replacement?
"How many people know about this? Nobody! I never read about it anywhere. I invented it myself years ago, .."
Turn to page 55 of your OS design and implementation by Tanenbaum. See where he says, "For a discussion of Dekker's algorithm, see Dijkstra (1965)."? How do you get through a proper comp sci honours degree to the point where you can take a masters and then a PhD without reading Dijkstra?
How about you crack open that copy of Operating Systems (4th ed) by William Stallings, which has a discussion of concurrency and Dekker's on pages 208-213? How can you get past a 2nd/3rd-year introductory operating systems class without having gone over this topic?
You are a troll. A troll preying on the fact that most of the moderators here have no idea about computer science, and have not taken a whiff of a real operating systems class.
For the record, Peterson's algorithm (published in 1981) is a much simpler solution to your problem. It's on page 56 of the Tanenbaum book, and also discussed in Stallings on page 213. There's a new 5th edition of the Stallings book, but the index will take you to the correct chapter/page in short order.
--
Internet Explorer (n): Another bug -- that is, a feature that can't be turned off -- in Windows.
There is another assumption that may not hold (but usually does).
The code assumes that writes are atomic. This will almost always hold for 8 processes and usually for 32, but if the flags array is larger than a word, atomic writes go out the window.
However, given that these people are IBM engineers, I'll give them the benefit of the doubt -- can somebody explain what I'm missing here?
I think it's perfectly fine to subscribe to the OP's take on this: a language is purely object oriented if all instances of all types are objects, except for the ones that aren't.
Are you adequate?
Experience demonstrates that, by and large, even very good programmers commit a sizeable number of these errors. Not to mention that ensuring proper security and memory management takes time; and time is money.
Are you adequate?
Even with thread suspension, you need to wake up every so often to check the status of the flags. This leads to one of two outcomes.
1) You use a long sleep timeout. This means your responsiveness is lower than using an OS semaphore, because you won't be woken immediately upon thread B finishing.
2) You use a short sleep timeout. This means you will wake up and check the flag repeatedly. That means incurring a context switch into the program for every poll, which is far more costly than the switch to kernel mode.
I still have more fans than freaks. WTF is wrong with you people?
You're hoping that the compiler decides to use a bitfield for booleans then. This does not have to be the case- a compiler is free to implement them as whatever they wish. I would expect chars to be a common choice. You'd need to rewrite the code as bitfields to ensure that.
I still have more fans than freaks. WTF is wrong with you people?
In addition to the problems of write ordering and busy looping, I have one more- your solution works only if you know how many threads will be accessing the resource beforehand. You have to choose an array size for the flags, and hope that you guessed a high enough number. That may be ok for some subset of problems, but not for a general solution.
I still have more fans than freaks. WTF is wrong with you people?
0x1234... Amazing that's the same code I use on my luggage
Precisely, I think he was saying that you cannot recover from segfaults when using threads, and that this is yet one more reason why threads are a bad idea in Unix (for user apps). In any case, it is common wisdom that SysV IPC methods and threads do not mix well at all.
Well, darn, I thought UNION was the best way to share memory...guess it's time to upgrade my skill set
never bring a twinkie to a food fight.
On top of those mechanisms, even slower interprocess communication systems are typically implemented, such as OpenRPC and CORBA. (For even more inefficiency, there's XPC. In Perl. But I digress.)
Because of this history, there's a perception that interprocess communication has to be slow. It doesn't.
What you really want looks more like what QNX has - fast interprocess messaging that interacts properly with the scheduler. QNX has to have interprocess communication done right, because it does everything through it, including all I/O. This works out quite well. You take a performance hit (maybe 20% for this), but you get much of that back because the higher levels become more efficient when built on good IPC.
The QNX messaging primitives are available for Linux, although the implementation isn't good enough for inclusion in the standard kernel. That work should be redone for the current kernel.
IPC/scheduler interaction really matters. If you get it wrong, each interprocess transaction results in an extra pass through the scheduler, or worse, both the sending process and the receiving process lose their turn at the CPU. This is easy to test. Start up two processes that communicate using your IPC mechanism. Measure the performance. Then start up a compute-bound process and measure again. If the IPC rate drops by much more than a factor of 2, something is wrong. Don't be surprised if it drops by two orders of magnitude. That's an indication that IPC/scheduler interaction was botched.
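A rough sketch of the measurement described above, using a pair of pipes as the IPC mechanism under test. The function name `pingpong_rate` is made up, pipes are only a stand-in for whatever mechanism you actually want to evaluate, and Linux is assumed.

```cpp
#include <cassert>
#include <chrono>
#include <sys/wait.h>
#include <unistd.h>

// Bounces one byte between parent and child `round_trips` times and
// returns completed round trips per second, or -1.0 on setup failure.
double pingpong_rate(int round_trips) {
    int to_child[2];
    int to_parent[2];
    if (pipe(to_child) != 0) return -1.0;
    if (pipe(to_parent) != 0) {
        close(to_child[0]);
        close(to_child[1]);
        return -1.0;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(to_child[0]);
        close(to_child[1]);
        close(to_parent[0]);
        close(to_parent[1]);
        return -1.0;
    }
    if (pid == 0) {
        // Child: echo every byte back until the parent closes its end.
        close(to_child[1]);
        close(to_parent[0]);
        char b;
        while (read(to_child[0], &b, 1) == 1) {
            if (write(to_parent[1], &b, 1) != 1) break;
        }
        _exit(0);
    }

    close(to_child[0]);
    close(to_parent[1]);

    char b = 'x';
    int done = 0;
    auto start = std::chrono::steady_clock::now();
    for (; done < round_trips; ++done) {
        if (write(to_child[1], &b, 1) != 1) break;
        if (read(to_parent[0], &b, 1) != 1) break;
    }
    auto stop = std::chrono::steady_clock::now();

    close(to_child[1]);  // EOF ends the child's loop
    close(to_parent[0]);
    waitpid(pid, nullptr, 0);

    double secs = std::chrono::duration<double>(stop - start).count();
    return secs > 0.0 ? done / secs : 0.0;
}
```

To follow the test as described, call it once on an idle machine and once while a compute-bound process is running, then compare the two rates.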
Sun addressed this in the mid-1990s with their "Doors" interface in Solaris, which had roughly the right primitives. But that idea never caught on.
The article here implements a message-passing system via shared memory, which is not exactly a new idea, even for UNIX. I think it first appeared in MERT, in the 1970s. It's an attempt to solve at the user level something that the OS should be doing for you.
Shared memory is a hack. It's hard to make it work right. With it, one process can crash other processes in hard-to-debug ways. Sometimes you need it because you're moving vast amounts of data, (by which I mean more than just a video stream) but that's rarely the case.
Hi!
I have a prob with shared memory in PHP and C++ I thought the /. crowd would help me...
I have a server written in C++ and my webpages are in PHP. The PHP has to communicate with the server using shared memory. This was working fine on the server running FC-1 with php-4.3.8. We recently migrated to CentOS 4.1 (Equivalent to RHEL 4.1) running php-4.3.9. The error it displays is as follows:
shmop_open(): unable to attach or create shared memory segment in /var/www/html/sharedmem.php on line 2
The server opens the shm in 666 (originally was 644) even then it was not working. I can see the shared mem open using 'ipcs' command.
The source code of PHP is as follows:
";
# print $shm_key;
$shm_id = shmop_open($shm_key, "a",0,0) or die("FATAL ERROR:: Unable to Access Shared Memory");
// These are fine
/*$shm_size = shmop_size($shm_id);
DEBUG:: print ("Shared Memory Block Size: " . $shm_size."\n");
*/
// Now lets read the string back
$data = shmop_read($shm_id, 0, $shm_size);
if (!$data) {
echo "FATAL ERROR:: Couldn't read from shared memory\n";
exit;
}
?>
Both the configs say that 'shmop' is enabled.
Can some one help me with this, I am in desperate need of this, if this fails I might have to search for an alternative and the project has to go live in a week or so. I am in desperate need of help, can any one help pls?
Regards,
Yaswanth
I can't believe I am even bothering to rebut you again... did you READ what you wrote above carefully?
Do you understand cacheline coherency? It does not matter the ORDER of the bytes written out.... first of all only ONE byte in the cacheline is intended to be used or defined. Each flag ought to be in its own cacheline (ignore the C syntax for the declaration and assume the "boolean" is padded into a structure large enough to guarantee one flag per cacheline).
secondly... if a process requests a cacheline and it is coherent and not opaque in L1 L2 or possibly L3 caches... then the REAL DATA is resolved and fetched.... not stale corrupt old data.
thirdly the algorithm itself can be immune to out of order writes on the bus, even without this because it uses TWO flags and both have to be logical before it can advance... since the other cpu is coherent to its own access to the "turn" variable, it knows if it is its own turn or not.
the order the bytes touch physical ram is irrelevant to Dekker's algorithm... all that matters is that the accesses of the flags are cache coherent among processors or processes.
we are talking about coherent cache. if coherency cannot be guaranteed, then merely allocating the page uncachable or setting the cacheline uncacheable can be done on all normal hardware.
as for the order the bytes hit memory.... it's irrelevant and does not hurt a wisely implemented Dekker's algorithm.
the reason my drivers are closed source is because I am tired of teaching programmers in india how to write drivers and firmware... just as I hate showing people how to make things 10 times faster.
This entire exercise has reminded me not only how stupid the PhD in Comp Sci was to refute the validity of Dekker's.... but now it seems the same non critical thinking buffoons infest slashdot.
Slashdot would be a less noisy and inaccurate place if engineers like you did not bother trying to help out on slashdot any more.
HTHHAND.
Dude, you DO realize they teach this to first year CS students at Drexel, right?
110100 1101000 1101000 1100110 0 1101111 1101000 1100011 1
You would use this only to guard very short sections. If the sections were long, this optimization wouldn't matter much, anyway. If they are *very* short, e.g. inserting an element into a list, even busy waiting would do. A short sleep timeout obviously wouldn't hurt for cases where multiple threads keep using this section.