I am trying to test a very simple program that uses GCC 5 to offload through OpenMP 4.0 directives. My goal is to write two independent tasks, with one task executed on the accelerator (i.e., the Intel MIC emulator) and the other executed concurrently on the host CPU.
Here is the code:
    #include <omp.h>
    #include <stdio.h>

    #define limit 100000

    int main(int argc, char** argv)
    {
        int cpu_prime, acc_prime;

        #pragma omp task shared(acc_prime)
        {
            #pragma omp target map(tofrom: acc_prime)
            {
                printf("mjf-dbg >> acc computation\n");
                int i, j;
                acc_prime=0;
                for(i=0; i<limit; i++){
                    for(j=2; j<=i; j++){
                        if(i%j==0)
                            break;
                    }
                    if(j==i)
                        acc_prime = i;
                }
                printf("mjf-dbg << acc computation\n");
            }
        }

        #pragma omp task shared(cpu_prime)
        {
            int i, j;
            cpu_prime=0;
            printf("mjf-dbg >> cpu computation\n");
            for(i=0; i<limit; i++){
                for(j=2; j<=i; j++){
                    if(i%j==0)
                        break;
                }
                if(j==i)
                    cpu_prime = i;
            }
            printf("mjf-dbg << cpu computation\n");
        }

        #pragma omp taskwait

        printf("cpu prime: %d \n", cpu_prime);
        printf("gpu prime: %d \n", acc_prime);
    }
With this code, I was expecting the following execution flow:
- The master thread (MT) encounters the first explicit task region, is bound to this task, and starts executing it.
- When MT encounters the target directive, it offloads the target block to the accelerator and reaches a scheduling point.
- MT returns to the implicit task region.
- MT encounters the second explicit task region, is bound to this task, and starts executing it.
- MT performs the computation on the host in parallel with the offloaded computation on the accelerator device.
- MT returns to the implicit task region and reaches the scheduling point implied by the taskwait directive.
- MT returns to the first explicit task region and waits for the offloaded block to finish.
Compile and run:
    gcc -fopenmp -foffload="-march=knl" overlap.c -o overlap
    OFFLOAD_EMUL_RUN="sde -knl --" ./overlap
Output:
    mjf-dbg >> acc computation
    mjf-dbg << acc computation
    mjf-dbg >> cpu computation
    mjf-dbg << cpu computation
    cpu prime: 99991
    gpu prime: 99991

This is not the result I expected, since it means that the master thread waits for the offloaded computation to complete before scheduling the host task. Instead, I was looking for something like this:
    mjf-dbg >> acc computation
    mjf-dbg >> cpu computation
    mjf-dbg << cpu computation
    mjf-dbg << acc computation
    cpu prime: 99991
    gpu prime: 99991
The offload emulator works correctly: at run time I can see the _offload_target process go to 100% CPU usage while the program executes the target block.
So the question is: does anyone have an idea why the two tasks are serialized instead of executing in parallel (one in the host process and the other in the _offload_target emulation process)?
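One variant that might be worth comparing against, sketched under assumptions rather than verified on this GCC 5 / MIC emulator setup: the program above creates both tasks outside any `parallel` region, so only one thread is available to run them. The sketch below creates the same two tasks inside `parallel` + `single`, so that a second thread can pick up the host task while the first one is inside the `target` region. `largest_prime_below` and `run_overlapped` are hypothetical helper names, not part of the original program, and whether the two tasks actually overlap still depends on the OpenMP runtime.

```c
#include <assert.h>
#include <stdio.h>

/* Same trial-division search as in the question, factored into a helper.
   "declare target" lets the function be called inside the target region. */
#pragma omp declare target
static int largest_prime_below(int limit)
{
    int best = 0;
    for (int i = 0; i < limit; i++) {
        int j;
        for (j = 2; j <= i; j++) {
            if (i % j == 0)
                break;
        }
        if (j == i)
            best = i;
    }
    return best;
}
#pragma omp end declare target

/* Hypothetical helper (an assumption, not from the original program):
   generates both tasks from one thread of a two-thread team, so the
   other thread is free to execute whichever task is still queued. */
static void run_overlapped(int limit, int *cpu_prime, int *acc_prime)
{
    assert(cpu_prime != NULL && acc_prime != NULL);
    int cpu = 0, acc = 0;

    #pragma omp parallel num_threads(2)
    #pragma omp single
    {
        /* Task 1: offloaded computation (falls back to the host if no
           device is available). */
        #pragma omp task shared(acc)
        {
            #pragma omp target map(tofrom: acc)
            {
                printf("mjf-dbg >> acc computation\n");
                acc = largest_prime_below(limit);
                printf("mjf-dbg << acc computation\n");
            }
        }

        /* Task 2: host computation. */
        #pragma omp task shared(cpu)
        {
            printf("mjf-dbg >> cpu computation\n");
            cpu = largest_prime_below(limit);
            printf("mjf-dbg << cpu computation\n");
        }

        /* Wait for both child tasks before leaving the single region. */
        #pragma omp taskwait
    }

    *cpu_prime = cpu;
    *acc_prime = acc;
}
```

Calling `run_overlapped(100000, &cpu_prime, &acc_prime)` from `main` and printing both values would give the same final two lines as the original program; the order of the `mjf-dbg` lines is what would show whether the tasks overlapped.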