OpenMP and C++ for loop: why does my code slow down when using OpenMP?

I have a simple question about using OpenMP (with C++) that I was hoping someone could help me with. The small example below illustrates my problem.

    #include <iostream>
    #include <vector>
    #include <ctime>
    #include <omp.h>
    using namespace std;

    int main() {
        srand(time(NULL));    // Seed random number generator
        vector<int> v;        // Vector to hold random numbers in interval [0,9]
        vector<int> d(10, 0); // Vector to hold counts of each integer, initialized to 0
        for (int i = 0; i < 1e9; ++i)
            v.push_back(rand() % 10); // Push back random numbers in [0,9]
        clock_t c = clock();
        #pragma omp parallel for
        for (int i = 0; i < v.size(); ++i)
            d[v[i]] += 1; // Count number stored at v[i]
        cout << "Seconds: " << (clock() - c) / CLOCKS_PER_SEC << endl;
        for (vector<int>::iterator i = d.begin(); i != d.end(); ++i)
            cout << *i << endl;
        return 0;
    }

In the code above, a vector v is created containing one billion random numbers in the range [0,9]. The code then iterates over v, counting how many instances of each integer it contains (i.e., how many ones are found in v, how many twos, and so on).

Each time a particular integer is encountered, it is counted by incrementing the corresponding element of the vector d. So d[0] counts how many zeros occur, d[6] counts how many sixes, and so on. Does that make sense?

My problem is that when I try to make this counting loop parallel. Without the OpenMP #pragma, my code takes 20 seconds, but with the pragma it takes 60 seconds.

Clearly I have misunderstood some concept relating to OpenMP (perhaps how data is shared/accessed?). Could someone explain my mistake, or point me towards some insightful literature with relevant keywords to help my search?

+4
2 answers

Your code suffers from:

  • race conditions due to unsynchronized access to a shared variable
  • false (and true) sharing cache problems
  • wrong measurement of the run time

Race conditions arise because you are updating the same elements of the vector d concurrently from several threads. Comment out the srand() call and run your code several times with the same number of threads (but with more than one thread). Compare the results from different runs.

False sharing occurs when two threads write to memory locations that lie close enough together to end up on the same cache line. This causes the cache line to bounce constantly from core to core (or from CPU to CPU on multi-socket systems) and generates an excess of cache coherency messages. With 32 bytes per cache line, 8 elements of the vector can sit on the same cache line; with 64 bytes per cache line, the entire vector d fits on a single cache line. This makes the code slow on Core 2 processors and somewhat slower (but not as slow as on Core 2) on Nehalem and post-Nehalem processors (e.g. Sandy Bridge).

True sharing occurs when the same element is accessed by two or more threads at the same time. You should either put the increment in an OpenMP atomic construct (slow), use an array of OpenMP locks to protect access to the elements of d (faster or slower, depending on your OpenMP runtime), or accumulate local values and then perform a final synchronized reduction (fastest). The first option is implemented as follows:

    #pragma omp parallel for
    for (int i = 0; i < v.size(); ++i) {
        #pragma omp atomic
        d[v[i]] += 1; // Count number stored at v[i]
    }

The second is implemented as follows:

    omp_lock_t locks[10];
    for (int i = 0; i < 10; i++)
        omp_init_lock(&locks[i]);

    #pragma omp parallel for
    for (int i = 0; i < v.size(); ++i) {
        int vv = v[i];
        omp_set_lock(&locks[vv]);
        d[vv] += 1; // Count number stored at v[i]
        omp_unset_lock(&locks[vv]);
    }

    for (int i = 0; i < 10; i++)
        omp_destroy_lock(&locks[i]);

(include omp.h to get access to the omp_* functions)

I leave it to you to come up with an implementation of the third option.
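For reference, a minimal sketch of that third option (assuming the same v and d as in the question) keeps a private histogram in each thread and merges it into d once at the end:

    #pragma omp parallel
    {
        int local[10] = {0}; // private to each thread, so no locking is needed

        #pragma omp for nowait
        for (int i = 0; i < v.size(); ++i)
            local[v[i]] += 1;

        // One short synchronized merge per thread instead of one per element.
        #pragma omp critical
        for (int j = 0; j < 10; ++j)
            d[j] += local[j];
    }

The critical section executes only once per thread, so its cost is negligible next to the billion-iteration counting loop.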

You measure the elapsed time with clock(), but it measures CPU time, not wall-clock time. If you have one thread running at 100% CPU usage for 1 second, then clock() will indicate an increase in CPU time of 1 second. If you have 8 threads running at 100% CPU usage for 1 second, clock() will indicate an increase in CPU time of 8 seconds (that is, 8 threads times 1 CPU second each). Use omp_get_wtime() or gettimeofday() (or some other high-resolution timer API) instead.
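For example, a minimal sketch using omp_get_wtime() (declared in omp.h; the loop body stands in for whichever synchronized variant you use):

    double t0 = omp_get_wtime(); // wall-clock time in seconds
    #pragma omp parallel for
    for (int i = 0; i < v.size(); ++i) {
        // ... synchronized counting work ...
    }
    double t1 = omp_get_wtime();
    cout << "Seconds: " << (t1 - t0) << endl;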

+5

EDIT: Once your race conditions are resolved with correct synchronization, the following paragraph applies; before that, the data races unfortunately invalidate any speed comparison:

Your program is slowing down because you have only 10 possible output elements (d[0] through d[9]) inside the pragma section, and they are accessed in random order. OpenMP cannot update any of these elements without locking (which you would have to provide via synchronization), and as a result the locking causes more overhead for your threads than you gain from counting in parallel.

The solution that makes this fast is instead to give each OpenMP thread a local array that counts all the values 0-9 that the particular thread has seen, and then sum these into the main count vector d afterwards. This is easy to parallelize and much faster, since the threads do not need to contend for a shared write vector. I would expect a speedup close to N×, where N is the number of OpenMP threads, since very little locking is required. This solution also avoids many of the race conditions currently in your code.

See http://software.intel.com/en-us/articles/use-thread-local-storage-to-reduce-synchronization/ for more information on thread-local storage with OpenMP.
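As a sketch of this idea, assuming a compiler with OpenMP 4.5 support (array-section reductions), the runtime can even create and merge the per-thread copies for you:

    // counts is a plain array because the reduction clause takes an array section.
    int counts[10] = {0};
    #pragma omp parallel for reduction(+:counts[0:10])
    for (int i = 0; i < v.size(); ++i)
        counts[v[i]] += 1; // each thread increments its own private copy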

+1
