Existing "loose" implementations in most cases follow the same pattern:
- read some state and make a copy of it *
- modify the copy *
- perform an interlocked operation
- retry if it fails

(* optional: depends on the data structure / algorithm)
The last bit is eerily similar to a spinlock. In fact, it is a basic spinlock. :)
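In C#, that retry loop typically looks something like the sketch below; the counter is just my illustrative example, not code from the question, and `Interlocked.CompareExchange` plays the role of the "try to update" step:

```csharp
using System.Threading;

// Minimal sketch of the read / copy / interlocked-update / retry pattern.
class LockFreeCounter
{
    private int _value;

    public int Increment()
    {
        while (true)
        {
            int observed = Volatile.Read(ref _value); // read current state
            int desired = observed + 1;               // modify a copy
            // the "interlocked operation": publish only if _value is unchanged
            if (Interlocked.CompareExchange(ref _value, desired, observed) == observed)
                return desired;
            // another thread got there first: retry
        }
    }
}
```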
I agree with @nobugz on this: the cost of the interlocked operations used in lock-free multithreading is dominated by the cache and memory-coherency work they must carry out.
What you gain with a "lock-free" data structure, however, is that your "locks" are very fine-grained. This decreases the chance that two concurrent threads access the same "lock" (memory location).
The trick most of the time is that you do not have dedicated locks - instead you treat, e.g., all elements in an array or all nodes in a linked list as a "spin-lock". You read, modify, and try to update if there was no update since your last read. If there was, you retry.
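For instance, a lock-free stack push treats the head reference exactly this way - here is a minimal Treiber-stack sketch (my example, following the same read / modify / CAS / retry pattern):

```csharp
using System.Threading;

// The head reference plays the role of the per-structure "spin-lock".
class LockFreeStack<T>
{
    private class Node { public T Value; public Node Next; }
    private Node _head;

    public void Push(T value)
    {
        var node = new Node { Value = value };
        while (true)
        {
            Node observed = Volatile.Read(ref _head); // read the current head
            node.Next = observed;                     // modify our private copy
            // swing the head only if no push/pop happened since our read
            if (Interlocked.CompareExchange(ref _head, node, observed) == observed)
                return;
            // lost the race: retry
        }
    }
}
```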
This makes your "lock" (oh, sorry, does not block :) very fine-grained, without introducing additional memory or resource requirements.
The finer the grain, the less likely it is to wait. Making it as small as possible without additional resource requirements is great, isn't it?
Most of the fun, however, comes from ensuring correct load/store ordering.
Contrary to intuition, CPUs are free to reorder memory reads/writes - they are very smart, by the way: you will have a hard time observing this from a single thread. You will, however, run into issues when you start multi-threading on multiple cores. Your intuition will break down: just because an instruction comes earlier in your code does not mean it will actually happen earlier. CPUs can process instructions out of order, and they especially like to do this with memory accesses, to hide main-memory latency and make better use of their cache.
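A classic way to see this (my sketch, not from the original discussion): with plain stores and loads, both threads below can end up reading zero, because each CPU may move the load ahead of the preceding store:

```csharp
using System.Threading;

class ReorderingDemo
{
    private int _x, _y, _r1, _r2;

    public void ThreadA()
    {
        _x = 1;
        // without a fence, the CPU may effectively move this load
        // ahead of the store to _x above
        _r1 = _y;
    }

    public void ThreadB()
    {
        _y = 1;
        _r2 = _x;  // same hazard, mirrored
    }

    // If both methods run concurrently, _r1 == 0 && _r2 == 0 is a legal
    // outcome even on x86/x64 (store-load reordering). Inserting
    // Thread.MemoryBarrier() between the store and the load in each
    // method rules it out.
}
```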
It certainly runs against intuition that a sequence of code does not flow "top-down" but instead runs as if there were no sequence at all - it could well be called the "devil's playground". I believe it is infeasible to give an exact answer as to which load/store reorderings will take place. Instead, one always speaks in terms of mays and mights and cans, and prepares for the worst. "Oh, the CPU might reorder this read to come before that write, so it is best to put a memory barrier right here."
Matters are complicated by the fact that even these mays and mights differ across CPU architectures. It may be the case, for example, that something guaranteed not to happen on one architecture does happen on another.
To get lock-free multi-threading right, you have to understand memory models.
Getting the memory model and its guarantees right is not trivial, however, as demonstrated by the story in which Intel and AMD made corrections to the documentation of MFENCE, causing some stir among JVM developers. As it turned out, the documentation developers had relied on from the beginning was not that precise in the first place.
Locks in .NET result in an implicit memory barrier, so you are safe using them (most of the time, that is... see for example this Joe Duffy - Brad Abrams - Vance Morrison greatness on lazy initialization, locks, volatiles and memory barriers. :) (Be sure to follow the links on that page.)
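For a taste of what those posts dissect, here is the double-checked locking idiom in a hedged sketch; `volatile` plus the lock's implicit barriers are what make it safe under the .NET memory model (in modern code you would simply reach for `Lazy<T>`):

```csharp
class LazySingleton
{
    private static volatile LazySingleton _instance;
    private static readonly object _gate = new object();

    public static LazySingleton Instance
    {
        get
        {
            if (_instance == null)           // fast path: no lock taken
            {
                lock (_gate)                 // lock entry/exit imply barriers
                {
                    if (_instance == null)   // re-check under the lock
                        _instance = new LazySingleton();
                }
            }
            return _instance;
        }
    }
}
```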
As an added bonus, you will get introduced to the .NET memory model on a side quest. :)
There is also an "oldie but goldie" from Vance Morrison: What Every Dev Should Know About Multithreaded Applications .
... and of course, as @Eric mentioned, Joe Duffy is a definitive read on the subject.
A good STM can get as close to fine-grained locking as it gets and will probably provide performance close to or on par with a hand-made implementation. One of them is STM.NET from the DevLabs projects of MS.
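To convey the programming model, here is a sketch against a hypothetical atomic-block API - this is not STM.NET's actual surface, just an illustration of what an STM gives you:

```csharp
using System;

// Hypothetical stand-in: a real STM logs reads/writes inside the block,
// validates at commit, and retries on conflict - the CAS loop writ large.
static class Stm
{
    public static void Atomic(Action body) => body(); // stand-in: just runs it
}

class Account
{
    public int Balance;

    public static void Transfer(Account from, Account to, int amount)
    {
        Stm.Atomic(() =>
        {
            from.Balance -= amount; // tentative write, buffered by the STM
            to.Balance += amount;   // whole block commits atomically or retries
        });
    }
}
```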
If you are not a .NET-only zealot, Doug Lea did some great work in JSR-166. Cliff Click has an interesting take on hash tables that does not rely on lock-striping - as the Java and .NET concurrent hash tables do - and it seems to scale well to 750 CPUs.
If you are not afraid to venture into Linux territory, the following article provides more insight into the internals of current memory architectures and how cache-line sharing can destroy performance: What every programmer should know about memory.
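As a small illustration of the cache-line problem (my example): two hot counters written by different threads should not share a 64-byte line, or the line will ping-pong between cores:

```csharp
using System.Runtime.InteropServices;

// Padding the counters one cache line apart avoids false sharing.
[StructLayout(LayoutKind.Explicit)]
struct PaddedCounters
{
    [FieldOffset(0)] public long CounterA;   // thread 1 writes here
    [FieldOffset(64)] public long CounterB;  // thread 2 writes here, one line away
}
```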
@Ben made many comments about MPI: I sincerely agree that MPI can shine in some areas. An MPI-based solution can be easier to reason about, easier to implement and less error-prone than a half-baked locking implementation that tries to be smart. (Subjectively, however, the same is true of an STM-based solution.) I would also bet that it is light-years easier to correctly write a decent distributed application in e.g. Erlang, as many successful examples suggest.
MPI, however, has its own costs and its own troubles when run on a single, multi-core system. E.g. in Erlang there are issues to be solved around the synchronization of process scheduling and of message queues.
Also, at their core, MPI systems usually implement a kind of cooperative N:M scheduling for "lightweight processes". This means, for example, that there is an inevitable context switch between lightweight processes. True, it is not a "classic context switch" but mostly a user-space operation, and it can be made fast - yet I sincerely doubt it can be brought under the 20-200 cycles an interlocked operation takes. User-mode context switching is certainly slower, even in the Intel McRT library. N:M scheduling with lightweight processes is not new: LWPs were there in Solaris for a long time. They were abandoned. There were fibers in NT. They are mostly a relic now. There were "activations" in NetBSD. They were abandoned. Linux had its own take on N:M threading. It seems to be somewhat dead by now.
From time to time, there are new contenders: for example McRT from Intel, or most recently User-Mode Scheduling together with ConCRT from Microsoft.
At the lowest level, they do what an N:M MPI scheduler does. Erlang - or any MPI system - might benefit greatly on SMP systems by exploiting the new UMS.
I guess the OP's question is not about the merits of, or subjective arguments for/against, any particular solution, but if I had to answer it, I suppose it depends on the task: for building low-level, high-performance basic data structures that run on a single system with many cores, either low-lock/lock-free techniques or an STM will yield the best results in terms of performance, and would probably beat an MPI solution any time performance-wise, even if the above wrinkles get ironed out, e.g. in Erlang.
For building anything moderately more complex that runs on a single system, I would perhaps go with classic coarse-grained locking or, if performance is of great concern, an STM.
For building a distributed system, an MPI system would probably be the natural choice.
Note that there are MPI implementations for .NET as well (though they seem not to be as active).