Existing "loose" implementations in most cases follow the same pattern:
- read some state and make a copy of it *
- modify the copy *
- perform an interlocked operation
- retry if it fails

(* optional: depends on the data structure / algorithm)
The last bit is eerily similar to a spinlock. In fact, it is a basic spinlock. :)
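In C#, that retry loop typically looks something like the sketch below; the counter is just my illustrative example, not code from the question, and `Interlocked.CompareExchange` plays the role of the "try to update" step:

```csharp
using System.Threading;

// Minimal sketch of the read / copy / interlocked-update / retry pattern.
class LockFreeCounter
{
    private int _value;

    public int Increment()
    {
        while (true)
        {
            int observed = Volatile.Read(ref _value); // read current state
            int desired = observed + 1;               // modify a copy
            // the "interlocked operation": publish only if _value is unchanged
            if (Interlocked.CompareExchange(ref _value, desired, observed) == observed)
                return desired;
            // another thread got there first: retry
        }
    }
}
```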
I agree with @nobugz on this: the cost of the interlocked operations used in lock-free multithreading is dominated by the cache and memory-coherency work they must carry out.
What you gain with a "lock-free" data structure, however, is that your "locks" are very fine-grained. This decreases the chance that two concurrent threads access the same "lock" (memory location).
The trick most of the time is that you do not have dedicated locks - instead you treat, e.g., all elements in an array or all nodes in a linked list as a "spin-lock". You read, modify, and try to update if there was no update since your last read. If there was, you retry.
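For instance, a lock-free stack push treats the head reference exactly this way - here is a minimal Treiber-stack sketch (my example, following the same read / modify / CAS / retry pattern):

```csharp
using System.Threading;

// The head reference plays the role of the per-structure "spin-lock".
class LockFreeStack<T>
{
    private class Node { public T Value; public Node Next; }
    private Node _head;

    public void Push(T value)
    {
        var node = new Node { Value = value };
        while (true)
        {
            Node observed = Volatile.Read(ref _head); // read the current head
            node.Next = observed;                     // modify our private copy
            // swing the head only if no push/pop happened since our read
            if (Interlocked.CompareExchange(ref _head, node, observed) == observed)
                return;
            // lost the race: retry
        }
    }
}
```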
This makes your "lock" (oh, sorry, does not block :) very fine-grained, without introducing additional memory or resource requirements.
The finer the grain, the less likely it is to wait. Making it as small as possible without additional resource requirements is great, isn't it?
Most of the fun, however, comes from ensuring correct load/store ordering.
Contrary to intuition, CPUs are free to reorder memory reads/writes - they are very smart, by the way: you will have a hard time observing this from a single thread. You will, however, run into issues when you start multi-threading on multiple cores. Your intuition will break down: just because an instruction comes earlier in your code does not mean it will actually happen earlier. CPUs can process instructions out of order, and they especially like to do this with memory accesses, to hide main-memory latency and make better use of their cache.
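A classic way to see this (my sketch, not from the original discussion): with plain stores and loads, both threads below can end up reading zero, because each CPU may move the load ahead of the preceding store:

```csharp
using System.Threading;

class ReorderingDemo
{
    private int _x, _y, _r1, _r2;

    public void ThreadA()
    {
        _x = 1;
        // without a fence, the CPU may effectively move this load
        // ahead of the store to _x above
        _r1 = _y;
    }

    public void ThreadB()
    {
        _y = 1;
        _r2 = _x;  // same hazard, mirrored
    }

    // If both methods run concurrently, _r1 == 0 && _r2 == 0 is a legal
    // outcome even on x86/x64 (store-load reordering). Inserting
    // Thread.MemoryBarrier() between the store and the load in each
    // method rules it out.
}
```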
It certainly runs against intuition that a sequence of code does not flow "top-down" but instead runs as if there were no sequence at all - it could well be called the "devil's playground". I believe it is infeasible to give an exact answer as to which load/store reorderings will take place. Instead, one always speaks in terms of mays and mights and cans, and prepares for the worst. "Oh, the CPU might reorder this read to come before that write, so it is best to put a memory barrier right here."
Matters are complicated by the fact that even these mays and mights differ across CPU architectures. It may be the case, for example, that something guaranteed not to happen on one architecture does happen on another.
To get lock-free multi-threading right, you have to understand memory models.
Getting the memory model and its guarantees right is not trivial, however, as demonstrated by the story in which Intel and AMD made corrections to the documentation of MFENCE, causing some stir among JVM developers. As it turned out, the documentation developers had relied on from the beginning was not that precise in the first place.
Locks in .NET result in an implicit memory barrier, so you are safe using them (most of the time, that is... see for example this Joe Duffy - Brad Abrams - Vance Morrison greatness on lazy initialization, locks, volatiles and memory barriers. :) (Be sure to follow the links on that page.)
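For a taste of what those posts dissect, here is the double-checked locking idiom in a hedged sketch; `volatile` plus the lock's implicit barriers are what make it safe under the .NET memory model (in modern code you would simply reach for `Lazy<T>`):

```csharp
class LazySingleton
{
    private static volatile LazySingleton _instance;
    private static readonly object _gate = new object();

    public static LazySingleton Instance
    {
        get
        {
            if (_instance == null)           // fast path: no lock taken
            {
                lock (_gate)                 // lock entry/exit imply barriers
                {
                    if (_instance == null)   // re-check under the lock
                        _instance = new LazySingleton();
                }
            }
            return _instance;
        }
    }
}
```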
As an added bonus, you will get introduced to the .NET memory model on a side quest. :)
There is also an "oldie but goldie" from Vance Morrison: What Every Dev Should Know About Multithreaded Applications .
... and of course, as @Eric mentioned, Joe Duffy is a definitive read on the subject.
A good STM can get as close to fine-grained locking as it gets and will probably provide performance close to or on par with a hand-made implementation. One of them is STM.NET from the DevLabs projects of MS.
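To convey the programming model, here is a sketch against a hypothetical atomic-block API - this is not STM.NET's actual surface, just an illustration of what an STM gives you:

```csharp
using System;

// Hypothetical stand-in: a real STM logs reads/writes inside the block,
// validates at commit, and retries on conflict - the CAS loop writ large.
static class Stm
{
    public static void Atomic(Action body) => body(); // stand-in: just runs it
}

class Account
{
    public int Balance;

    public static void Transfer(Account from, Account to, int amount)
    {
        Stm.Atomic(() =>
        {
            from.Balance -= amount; // tentative write, buffered by the STM
            to.Balance += amount;   // whole block commits atomically or retries
        });
    }
}
```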
If you are not a .NET-only zealot, Doug Lea did some great work in JSR-166. Cliff Click has an interesting take on hash tables that does not rely on lock-striping - as the Java and .NET concurrent hash tables do - and it seems to scale well to 750 CPUs.
If you are not afraid to venture into Linux territory, the following article provides more insight into the internals of current memory architectures and how cache-line sharing can destroy performance: What every programmer should know about memory.
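As a small illustration of the cache-line problem (my example): two hot counters written by different threads should not share a 64-byte line, or the line will ping-pong between cores:

```csharp
using System.Runtime.InteropServices;

// Padding the counters one cache line apart avoids false sharing.
[StructLayout(LayoutKind.Explicit)]
struct PaddedCounters
{
    [FieldOffset(0)] public long CounterA;   // thread 1 writes here
    [FieldOffset(64)] public long CounterB;  // thread 2 writes here, one line away
}
```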
@Ben made many comments about MPI: I sincerely agree that MPI can shine in some areas. An MPI-based solution can be easier to reason about, easier to implement and less error-prone than a half-baked locking implementation that tries to be smart. (Subjectively, however, the same is true of an STM-based solution.) I would also bet that it is light-years easier to correctly write a decent distributed application in e.g. Erlang, as many successful examples suggest.
MPI, however, has its own costs and its own troubles when run on a single, multi-core system. E.g. in Erlang there are issues to be solved around the synchronization of process scheduling and of message queues.
Also, at their core, MPI systems usually implement a kind of cooperative N:M scheduling for "lightweight processes". This means, for example, that there is an inevitable context switch between lightweight processes. True, it is not a "classic context switch" but mostly a user-space operation, and it can be made fast - yet I sincerely doubt it can be brought under the 20-200 cycles an interlocked operation takes. User-mode context switching is certainly slower, even in the Intel McRT library. N:M scheduling with lightweight processes is not new: LWPs were there in Solaris for a long time. They were abandoned. There were fibers in NT. They are mostly a relic now. There were "activations" in NetBSD. They were abandoned. Linux had its own take on N:M threading. It seems to be somewhat dead by now.
From time to time, there are new contenders: for example McRT from Intel, or most recently User-Mode Scheduling together with ConCRT from Microsoft.
At the lowest level, they do what an N:M MPI scheduler does. Erlang - or any MPI system - might benefit greatly on SMP systems by exploiting the new UMS.
I guess the OP's question is not about the merits of, or subjective arguments for/against, any particular solution, but if I had to answer it, I suppose it depends on the task: for building low-level, high-performance basic data structures that run on a single system with many cores, either low-lock/lock-free techniques or an STM will yield the best results in terms of performance, and would probably beat an MPI solution any time performance-wise, even if the above wrinkles get ironed out, e.g. in Erlang.
For building anything moderately more complex that runs on a single system, I would perhaps go with classic coarse-grained locking or, if performance is of great concern, an STM.
For building a distributed system, an MPI system would probably be the natural choice.
Note that there are MPI implementations for .NET as well (though they seem not to be as active).