OpenMP: sharing arrays between threads

Good day to all!

I am modeling molecular dynamics, and I recently started trying to parallelize the code. At first glance everything looked quite simple: put the #pragma omp parallel for directive before the longest loops. But the functions in these loops work on arrays, or, more precisely, on arrays that belong to an object of a class of mine that holds all the information about the particle system and the functions that operate on it. So when I added that #pragma before one of the longest loops, the computation time actually increased several times, even though my 2-core, 4-thread processor was fully loaded.
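To give an idea of what I mean, here is a simplified sketch of that kind of loop. The names (System, force, computeForces) are made up for illustration, not my real code:

#include <vector>

class System {
public:
    std::vector<double> x, f;   // per-particle data owned by the object

    double force(int i) const { return x[i] * 0.5; }  // stand-in for the real physics

    void computeForces() {
        // The kind of loop I tried to parallelize; force() reads the
        // object's arrays, and each iteration writes its own element of f.
        #pragma omp parallel for
        for (int i = 0; i < (int)x.size(); i++)
            f[i] = force(i);
    }
};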

To get to the bottom of this, I wrote another, simpler program. This test program runs two identical loops, one in parallel and one sequentially, and measures the time each of them takes. The results surprised me: when the first loop ran in parallel, its time did drop compared to the sequential run (1500 vs. 6000 ms), but the time of the second (sequential) loop increased sharply (15 000 ms, versus 6000 ms when the whole program runs sequentially).

I tried using the private() and firstprivate() clauses, but the results were the same. Shouldn't every variable defined and initialized before the parallel region be shared automatically? The second loop's runtime returns to normal if it operates on another vector, vec2, but creating a new vector for every run is obviously not an option. I also tried putting the actual vec1 update inside an #pragma omp critical section, but that was no good either. Adding the shared(vec1) clause did not help.
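For reference, these are the clause combinations I experimented with, sketched on a simplified version of the loop (constants as in the test program below):

const int N1 = 1000, N2 = 4000, dim = 1000;
std::vector<double> vec1(dim, 1.0);
double temp;

// temp is declared before the region, so it would be shared by default;
// I tried making it private/firstprivate and making vec1 explicitly shared:
#pragma omp parallel for private(temp) shared(vec1)
for (int i = 0; i < N1; i++) {
    temp = 0.0;          // private copies start with an indeterminate value
    for (int j = 0; j < N2; j++)
        temp += j;
    vec1[i] += temp;     // each iteration touches a distinct element, no race
}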

I would appreciate it if you could point out my mistakes and show me the right approach.

Do I even need that private(i) in the code?

Here is the test program:

#include "stdafx.h" #include <omp.h> #include <array> #include <time.h> #include <vector> #include <iostream> #include <Windows.h> using namespace std; #define N1 1000 #define N2 4000 #define dim 1000 int main(){ vector<int>res1,res2; vector<double>vec1(dim),vec2(N1); clock_t t, tt; int k=0; for( k = 0; k<dim; k++){ vec1[k]=1; } t = clock(); #pragma omp parallel { double temp; int i,j,k; #pragma omp for private(i) for( i = 0; i<N1; i++){ for(j = 0; j<N2; j++){ for( k = 0; k<dim; k++){ temp+= j; } } vec1[i]+=temp; temp = 0; } } tt = clock(); cout<<tt-t<<endl; for(int k = 0; k<dim; k++){ vec1[k]=1; } t = clock(); for(int g = 0; g<N1; g++){ for(int h = 0; h<N2; h++){ for(int y = 0; y<dim; y++){ vec1[g]+=h; } } } tt = clock(); cout<<tt-t<<endl; getchar(); } 

Thank you for your time!

PS I am using Visual Studio 2012, and my processor is an Intel Core i3-2370M. The assembly listing comes in two parts:

http://pastebin.com/suXn35xj

http://pastebin.com/EJAVabhF

1 answer

Congratulations! You have discovered yet another bad OpenMP implementation, kindly provided by Microsoft. My initial theory was that the problem comes from the partitioned L3 cache in Sandy Bridge and later Intel processors. But the timing of the second loop when run over only the first half of the vector did not confirm that theory. So it must be something in the code generator that kicks in when OpenMP is enabled. The assembly output confirms this.

Basically, the compiler does not optimize the serial loop when compiling with OpenMP enabled. That is where the slowdown comes from. Part of the problem is also of your own making, since the second loop is not identical to the first. In the first loop you accumulate intermediate values in a temporary variable, which the compiler optimizes into a register variable, while in the second loop you invoke operator[] on each iteration. When you compile without OpenMP enabled, the code optimizer transforms the second loop into something very similar to the first loop, so you get almost the same runtime for both.
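To illustrate, here is roughly the transformation the optimizer applies to your second loop when OpenMP is off. This is a sketch of the idea, not the actual generated code:

// Your second loop: vec1[g] is re-read and re-written on every iteration.
for (int g = 0; g < N1; g++)
    for (int h = 0; h < N2; h++)
        for (int y = 0; y < dim; y++)
            vec1[g] += h;

// What the optimizer effectively turns it into: the accumulator is kept
// in a register and written back to memory once per g, as in your first loop.
for (int g = 0; g < N1; g++) {
    double acc = vec1[g];
    for (int h = 0; h < N2; h++)
        for (int y = 0; y < dim; y++)
            acc += h;
    vec1[g] = acc;
}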

When you enable OpenMP, the code optimizer fails to optimize the second loop, and hence it runs much slower. The fact that a parallel block executes before it has nothing to do with the slowdown. My guess is that the code optimizer is unable to grasp that vec1 lies outside the scope of the OpenMP parallel region and therefore no longer needs to be treated as a shared variable, which would allow the loop to be optimized. This is obviously a "feature" introduced in Visual Studio 2012, since the code generator in Visual Studio 2010 is able to optimize the second loop even with OpenMP enabled.

One possible solution would be to switch to Visual Studio 2010. Another (hypothetical, since I don't have VS2012 to test with) solution would be to extract the second loop into a separate function and pass the vector to it by reference. Hopefully the compiler is smart enough to optimize the code in the separate function.
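Something along these lines, reusing the constants from your test program (again untested with VS2012):

// Moving the serial loop into its own function, so the optimizer sees
// the vector only as an ordinary reference parameter rather than a
// variable that is also visible to an OpenMP region.
static void runSerialLoop(std::vector<double>& v)
{
    for (int g = 0; g < N1; g++)
        for (int h = 0; h < N2; h++)
            for (int y = 0; y < dim; y++)
                v[g] += h;
}

Then in main() you would simply replace the second loop with runSerialLoop(vec1);.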

This is a very bad trend. Microsoft has virtually abandoned OpenMP support in Visual C++. Their implementation still conforms (almost) only to OpenMP 2.0 (so no explicit tasks and none of the other goodies of OpenMP 3.0+), and bugs like this one do not improve the situation. I would recommend that you switch to another compiler with OpenMP support (the Intel C/C++ compiler, GCC, anything non-Microsoft) or switch to a compiler-independent threading paradigm such as Intel Threading Building Blocks. Microsoft is clearly pushing its own parallel library for .NET, and that is where all the development is going.
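For instance, assuming you drop the MSVC-specific includes (stdafx.h and Windows.h), your test program should build with GCC as:

g++ -O2 -fopenmp test.cpp -o test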


Big Fat Warning

Do not use clock() to measure elapsed wall-clock time! It only works as expected on Windows. On most Unix systems (including Linux), clock() actually returns the total CPU time consumed by all threads of the process since it was created. This means clock() can return values several times larger than the elapsed wall-clock time (if the program runs many busy threads) or several times shorter than it (if the program sleeps or waits on I/O between the measurements). Instead, use the portable timer routine omp_get_wtime() in OpenMP programs.
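For example, the timing in your test program would become:

#include <omp.h>
#include <iostream>

// omp_get_wtime() returns wall-clock seconds as a double and is
// unaffected by how many threads are busy.
double t0 = omp_get_wtime();
// ... the code being measured ...
double t1 = omp_get_wtime();
std::cout << "elapsed: " << (t1 - t0) << " s" << std::endl;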
