In your code example, only the outer loop is parallel. You can test this by printing omp_get_thread_num() in the inner loop: for a given i, the thread number is always the same (of course this test is suggestive rather than conclusive, since different runs can give different results). For example, with:
```c
#include <stdio.h>
#include <omp.h>

#define dimension 4

int main() {
    #pragma omp parallel for
    for (int i = 0; i < dimension; i++)
        for (int j = 0; j < dimension; j++)
            printf("i=%d, j=%d, thread = %d\n",
                   i, j, omp_get_thread_num());
}
```
I get:
```
i=1, j=0, thread = 1
i=3, j=0, thread = 3
i=2, j=0, thread = 2
i=0, j=0, thread = 0
i=1, j=1, thread = 1
i=3, j=1, thread = 3
i=2, j=1, thread = 2
i=0, j=1, thread = 0
i=1, j=2, thread = 1
i=3, j=2, thread = 3
i=2, j=2, thread = 2
i=0, j=2, thread = 0
i=1, j=3, thread = 1
i=3, j=3, thread = 3
i=2, j=3, thread = 2
i=0, j=3, thread = 0
```
As for the rest of your code, you might want to add more details to a new question (it's hard to tell from a small sample), but, for example, you cannot put private(j) when j is only declared later; it is automatically private in my example above. I assume diff is a variable we cannot see in your sample. Also, the loop variable i is automatically made private (from the version 2.5 spec; the wording is the same in the version 3.0 spec):
The loop iteration variable in the for-loop of a for or parallel for construct is private in that construct.
Edit: all of the above applies to the code both you and I showed, but you may be interested in the following. As of OpenMP version 3.0 (available in gcc 4.4, for example, but not in 4.3) there is a collapse clause, which lets you write the code exactly as you have it, but with #pragma omp parallel for collapse(2) to parallelize both for loops (see the specification).
Edit: OK, I downloaded gcc 4.5.0 and ran the code above with collapse(2), and got the following output, showing that the inner loop is now parallelized:
```
i=0, j=0, thread = 0
i=0, j=2, thread = 1
i=1, j=0, thread = 2
i=2, j=0, thread = 4
i=0, j=1, thread = 0
i=1, j=2, thread = 3
i=3, j=0, thread = 6
i=2, j=2, thread = 5
i=3, j=2, thread = 7
i=0, j=3, thread = 1
i=1, j=1, thread = 2
i=2, j=1, thread = 4
i=1, j=3, thread = 3
i=3, j=1, thread = 6
i=2, j=3, thread = 5
i=3, j=3, thread = 7
```
The comments here (search for "Workarounds") are also relevant if you want workarounds for parallelizing both loops under version 2.5, but the version 2.5 specification quoted above is pretty clear on the matter (see the non-conforming examples in section A.35).