False sharing and pthreads

I have the following task to demonstrate false sharing and wrote a simple program:

    #include <sys/times.h>
    #include <time.h>
    #include <stdio.h>
    #include <pthread.h>

    long long int tmsBegin1, tmsEnd1, tmsBegin2, tmsEnd2, tmsBegin3, tmsEnd3;

    int array[100];

    void *heavy_loop(void *param) {
        int index = *((int *)param);
        int i;
        for (i = 0; i < 100000000; i++)
            array[index] += 3;
        return NULL;  /* was missing; a pthread start routine must return void* */
    }

    int main(int argc, char *argv[]) {
        int first_elem = 0;
        int bad_elem = 1;    /* same cache line as first_elem */
        int good_elem = 32;  /* far enough away to land on another line */
        long long time1;
        long long time2;
        long long time3;
        pthread_t thread_1;
        pthread_t thread_2;

        /* sequential */
        tmsBegin3 = clock();
        heavy_loop((void *)&first_elem);
        heavy_loop((void *)&bad_elem);
        tmsEnd3 = clock();

        /* two threads, same cache line (false sharing) */
        tmsBegin1 = clock();
        pthread_create(&thread_1, NULL, heavy_loop, (void *)&first_elem);
        pthread_create(&thread_2, NULL, heavy_loop, (void *)&bad_elem);
        pthread_join(thread_1, NULL);
        pthread_join(thread_2, NULL);
        tmsEnd1 = clock();

        /* two threads, different cache lines */
        tmsBegin2 = clock();
        pthread_create(&thread_1, NULL, heavy_loop, (void *)&first_elem);
        pthread_create(&thread_2, NULL, heavy_loop, (void *)&good_elem);
        pthread_join(thread_1, NULL);
        pthread_join(thread_2, NULL);
        tmsEnd2 = clock();

        printf("%d %d %d\n", array[first_elem], array[bad_elem], array[good_elem]);
        time1 = (tmsEnd1 - tmsBegin1) * 1000 / CLOCKS_PER_SEC;
        time2 = (tmsEnd2 - tmsBegin2) * 1000 / CLOCKS_PER_SEC;
        time3 = (tmsEnd3 - tmsBegin3) * 1000 / CLOCKS_PER_SEC;
        printf("%lld ms\n", time1);
        printf("%lld ms\n", time2);
        printf("%lld ms\n", time3);
        return 0;
    }

I was very surprised when I saw the results (I ran it on my i5-430M processor).

  • With false sharing it took 1020 ms.
  • Without false sharing it took 710 ms, only about 30% faster instead of the 300-400% improvement that some sites describe.
  • Without pthreads (sequential) it took 580 ms.

Please point out my mistake, or explain why this is happening.

2 answers

Briefly, about the clock() function in C: it gives you the amount of CPU time consumed from start to finish, summed over all threads of the process. So when you run two threads in parallel, the reported figure is thread 1's CPU time plus thread 2's CPU time, not the elapsed time. What you want here is a real (wall-clock) timer; use clock_gettime() instead and you should get the expected result.
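
For reference, a minimal wall-clock helper you could drop into the program might look like this (a sketch; the now_ms name is my own, and on older glibc you may need to link with -lrt for clock_gettime):

    #include <time.h>

    /* Wall-clock time in milliseconds. CLOCK_MONOTONIC is not affected by
       how much CPU time the threads consume, so parallel sections are
       measured by elapsed time rather than summed CPU time. */
    static long long now_ms(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    }

Replace each clock() call with now_ms() and drop the *1000/CLOCKS_PER_SEC conversion, since now_ms() already returns milliseconds.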

I ran your code using clock_gettime() and got:

  • with false sharing: 874.587381 ms
  • without false sharing: 331.844278 ms
  • sequential computation: 604.160276 ms


False sharing is the result of multiple cores with separate caches accessing the same region of physical memory (though not the same address; that would be true sharing).

To understand false sharing, you need to understand caches a little. On most processors, each core has its own L1 cache, which holds recently accessed data. Caches are organized in "lines", which are aligned chunks of data, usually 32 or 64 bytes long (depending on your processor). When you read from an address that is not in the cache, the whole line is fetched from main memory (or the L2 cache) into L1. When you write to an address that is in the cache, the line containing that address is marked dirty.
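
As an aside, on Linux with glibc you can query the line size at runtime. A minimal sketch (this assumes _SC_LEVEL1_DCACHE_LINESIZE is available, which is a glibc extension):

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* May return 0 or -1 if the platform does not report it. */
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        printf("L1 data cache line size: %ld bytes\n", line);
        return 0;
    }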

Here is where the sharing part comes in. If multiple cores read from the same line, each of them can keep a copy of it in its L1. However, when one copy is marked dirty, it is invalidated in all the other caches; if this did not happen, writes made on one core might not become visible to the other cores until much later. So the next time another core reads from that line, its cache misses and it has to fetch the line all over again.

False sharing occurs when the cores read and write different addresses that happen to lie on the same line. Even though they are not actually sharing data, the caches behave as if they were, because the addresses are so close together.
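
The usual fix is to give each thread its own line, either by spacing elements apart (which is what good_elem = 32 does in your program: 32 ints at 4 bytes each puts it 128 bytes away, at least two 64-byte lines) or by padding explicitly. A sketch, assuming 64-byte lines and GCC/Clang for the aligned attribute:

    /* One counter per thread, each padded out to a full cache line so
       that updates from different threads never touch the same line. */
    #define LINE_SIZE 64

    struct padded_counter {
        int value;
        char pad[LINE_SIZE - sizeof(int)];
    };

    struct padded_counter counters[2] __attribute__((aligned(LINE_SIZE)));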

This effect is highly dependent on the architecture of your processor. If you had a single core, you would not see the effect at all, because there would be no sharing. If your cache lines were longer, you would see the effect in both the "bad" and the "good" case, since index 32 would still be close enough to share a line. If your cores did not share an L2 cache (which I am guessing they do), you might see the 300-400% difference you mentioned, since a miss would have to go all the way to main memory.

You may also like to know that it matters that each thread both reads and writes (+= rather than =). Some processors have write-through caches, which means that if a core writes to an address that is not in its cache, it does not miss and fetch the line from memory. Contrast this with write-back caches, which do miss on writes.
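
If you want to see whether the read matters on your machine, you could compare the += loop against a store-only variant (a sketch reusing array from the question; store_loop is a hypothetical name):

    /* Pure stores: the line is never read before being written. */
    void *store_loop(void *param) {
        int index = *(int *)param;
        long i;
        for (i = 0; i < 100000000; i++)
            array[index] = 3;
        return NULL;
    }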

