False sharing is the result of multiple cores with separate caches accessing the same region of physical memory (although not the same address; that would be true sharing).
To understand false sharing, you need to understand caches. On most processors, each core has its own L1 cache, which holds recently accessed data. Caches are organized in "lines", which are aligned chunks of data, usually 32 or 64 bytes long (depending on your processor). When you read from an address that is not in the cache, the entire line is read from main memory (or an L2 cache) into L1. When you write to an address in the cache, the line containing that address is marked dirty.
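To make the line layout concrete, here is a minimal sketch of how an address maps to its cache line. The 64-byte line size and the helper names `line_base` and `same_line` are assumptions for illustration; real hardware does this in the cache controller, not in software.

```python
LINE_SIZE = 64  # assumed line size in bytes; some processors use 32

def line_base(addr, line_size=LINE_SIZE):
    """Return the start address of the aligned line that contains addr."""
    return addr - (addr % line_size)

def same_line(a, b, line_size=LINE_SIZE):
    """True if both addresses fall inside the same cache line."""
    return line_base(a, line_size) == line_base(b, line_size)

print(hex(line_base(0x1234)))      # 0x1200
print(same_line(0x1200, 0x123F))   # True: first and last byte of one line
print(same_line(0x123F, 0x1240))   # False: 0x1240 starts the next line
```

Reading any byte in the range 0x1200 to 0x123F would bring that whole 64-byte line into L1 under this assumption.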
Here is where the sharing aspect comes in. If several cores read from the same line, each of them can have a copy of the line in its L1. However, if a copy is marked dirty, the line is invalidated in the other caches. If this did not happen, writes made on one core might not be visible to other cores until much later. So the next time another core reads from that line, its cache misses and it has to fetch the line again.
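The invalidation step can be sketched with a toy model. This is not how any particular processor implements coherence; the class `ToyCoherence`, its methods, and the 64-byte line size are all hypothetical, and the model only tracks which cores hold a valid copy of each line.

```python
LINE_SIZE = 64  # assumed line size in bytes

class ToyCoherence:
    """Toy write-invalidate model: tracks which cores hold a valid copy of a line."""

    def __init__(self, num_cores):
        # One set of valid line numbers per core, standing in for each private L1.
        self.valid = [set() for _ in range(num_cores)]

    def read(self, core, addr):
        """Return 'hit' or 'miss'; a miss fetches the line into this core's cache."""
        line = addr // LINE_SIZE
        if line in self.valid[core]:
            return "hit"
        self.valid[core].add(line)
        return "miss"

    def write(self, core, addr):
        """Fetch the line if needed, then invalidate every other core's copy."""
        line = addr // LINE_SIZE
        result = "hit" if line in self.valid[core] else "miss"
        self.valid[core].add(line)
        for other, lines in enumerate(self.valid):
            if other != core:
                lines.discard(line)
        return result

model = ToyCoherence(num_cores=2)
trace = [
    model.read(0, 0x100),   # core 0 loads the line
    model.read(1, 0x100),   # core 1 loads its own copy
    model.read(0, 0x100),   # still cached on core 0
    model.write(1, 0x100),  # core 1 dirties the line; core 0's copy is invalidated
    model.read(0, 0x100),   # core 0 has to fetch the line again
]
print(trace)  # ['miss', 'miss', 'hit', 'hit', 'miss']
```

The last read is the interesting one: core 0 had the line cached, but the write on core 1 forced it to miss.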
False sharing occurs when cores read and write different addresses on the same line. Although they are not sharing data, the caches act as if they were, because the addresses are so close together.
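A rough way to see the cost is to count misses in a toy simulation. The sketch below assumes two cores that strictly alternate, each doing a read-modify-write on its own address, with 64-byte lines; `count_misses` is a hypothetical helper, and strict alternation is the worst case rather than what a real scheduler produces.

```python
def count_misses(addr_a, addr_b, iterations=1000, line_size=64):
    """Two cores alternately update their own address; return total cache misses."""
    valid = [set(), set()]  # valid line numbers held by core 0 and core 1
    misses = 0
    for _ in range(iterations):
        for core, addr in ((0, addr_a), (1, addr_b)):
            line = addr // line_size
            if line not in valid[core]:
                misses += 1            # the line has to be fetched (again)
                valid[core].add(line)
            # The write invalidates this line in the other core's cache.
            valid[1 - core].discard(line)
    return misses

shared_line_misses = count_misses(0, 8)     # two adjacent 8-byte counters
separate_line_misses = count_misses(0, 64)  # counters placed on different lines
print(shared_line_misses, separate_line_misses)  # 2000 2
```

In this model the adjacent counters miss on every single access even though neither core ever touches the other's data, while the separated counters miss only on first use.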
This effect is highly dependent on your processor architecture. If you had a single-core processor, you would not see the effect at all, because there would be no sharing. If your cache lines were longer, you would see the effect in both the "bad" and "good" cases, since the addresses would still be close together. If your cores did not share an L2 cache (which I assume they do), you might see a 300-400% difference, as you said, since they would have to go all the way to main memory on a cache miss.
You may also like to know that it matters that each thread both reads and writes (+= instead of =). Some processors have write-through caches, which means that if a core writes to an address that is not in the cache, it does not miss and fetch the line from memory. Contrast this with write-back caches, which do miss on writes.
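The difference between `+=` and `=` can be sketched the same way. In this toy model the `write_allocate` flag stands in for the behaviour described above: `False` approximates a cache that does not fetch the line on a write miss, `True` approximates one that does. Mapping those two settings onto write-through and write-back caches is an assumption of the sketch, and `count_fetches` is a hypothetical helper.

```python
def count_fetches(read_before_write, write_allocate, iterations=1000):
    """Two cores alternately update different addresses on one shared line."""
    has_line = [False, False]  # does each core hold a valid copy of the line?
    fetches = 0
    for _ in range(iterations):
        for core in (0, 1):
            if read_before_write and not has_line[core]:
                fetches += 1            # the read half of `+=` misses
                has_line[core] = True
            if write_allocate and not has_line[core]:
                fetches += 1            # a write miss pulls the line in
                has_line[core] = True
            has_line[1 - core] = False  # the write invalidates the other copy
    return fetches

print(count_fetches(read_before_write=True, write_allocate=False))   # 2000
print(count_fetches(read_before_write=False, write_allocate=False))  # 0
print(count_fetches(read_before_write=False, write_allocate=True))   # 2000
```

Under these assumptions, plain stores only show the false-sharing penalty when write misses fetch the line, while `+=` shows it either way because the read always needs the line.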
Jay Conrod