Update: the while() condition below is optimized away by the compiler, so both threads just skip the condition and enter the CS, even with the -O0 flag. Does anyone know why the compiler does this? By the way, declaring the global variables volatile makes the program freeze for some odd reason ...
I read the CUDA programming guide, but I still don't quite understand how CUDA handles memory consistency with respect to global memory. (This is different from the memory hierarchy.) Basically, I am running tests trying to break sequential consistency. The algorithm I use is Peterson's algorithm for mutual exclusion between two threads inside a kernel function:
flag[threadIdx.x] = 1;
turn = 1-threadIdx.x;
while(flag[1-threadIdx.x] == 1 && turn == (1-threadIdx.x));
shared_gloabl_variable_x++;
flag[threadIdx.x] = 0;
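For context, this is roughly how the snippet sits in a full kernel. The __device__ declarations and the <<<1, 2>>> launch are guesses at the surrounding setup; only flag, turn and shared_gloabl_variable_x come from the snippet itself:

__device__ int flag[2] = {0, 0};              // one request flag per thread
__device__ int turn = 0;                      // whose turn it is to wait
__device__ int shared_gloabl_variable_x = 0;  // counter protected by the CS

__global__ void peterson_test()               // hypothetical kernel name
{
    flag[threadIdx.x] = 1;                    // announce intent to enter the CS
    turn = 1 - threadIdx.x;                   // politely give the turn away
    while (flag[1 - threadIdx.x] == 1 && turn == (1 - threadIdx.x))
        ;                                     // spin until it is safe to enter
    shared_gloabl_variable_x++;               // critical section
    flag[threadIdx.x] = 0;                    // leave the CS
}

// launched as peterson_test<<<1, 2>>>(); i.e. one block with exactly two threads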
It is pretty simple. Each thread requests the critical section by setting its flag to one and, being polite, giving the turn to the other thread. At the evaluation of the while(), if the other thread has not set its flag, the requesting thread can safely enter the critical section. Now, the subtle problem with this approach is that the compiler may reorder the writes so that the write to turn is executed before the write to flag. If this happens, both threads end up in the CS at the same time. This is fairly easy to demonstrate with regular Pthreads, since most processors do not implement sequential consistency. But what about GPUs?
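For what it is worth, the kind of variant I have in mind when talking about volatile (the __threadfence() calls are just my guess at how one would try to enforce write ordering on the GPU; whether they are the right tool is part of my question) looks like this:

__device__ volatile int flag[2] = {0, 0};     // volatile so the loads are not cached in registers
__device__ volatile int turn = 0;
__device__ int shared_gloabl_variable_x = 0;

__global__ void peterson_fenced()             // hypothetical variant of the kernel above
{
    flag[threadIdx.x] = 1;                    // announce intent
    __threadfence();                          // order the flag write before the turn write
    turn = 1 - threadIdx.x;                   // yield the turn
    __threadfence();                          // order both writes before the reads in the loop
    while (flag[1 - threadIdx.x] == 1 && turn == (1 - threadIdx.x))
        ;                                     // volatile forces a fresh read on every iteration
    shared_gloabl_variable_x++;               // critical section
    __threadfence();                          // publish the update before releasing
    flag[threadIdx.x] = 0;                    // release
}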
Both of these threads are in the same warp, so they execute their statements in lock-step. But when they reach the write to turn, they are writing to the same variable, so the intra-warp execution becomes serialized (it does not matter in which order). Now, does the thread that wins proceed to the while, or does it wait for the other thread to finish its write, so that both of them evaluate the while() at the same time? At the while() the paths diverge again, since only one of them wins while the other waits.
After running the code, I consistently manage to break SC. The value I read back is always 1, which means that both threads somehow enter the C.S. every single time. How is this possible (GPUs execute instructions in order)? (Note: I compiled with -O0, so no compiler optimizations, and hence no use of volatile.)
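In case it matters, the host side of the test is nothing special; it is something along these lines (simplified, error checking omitted, assuming the peterson_test kernel and the globals sketched above):

#include <cstdio>

int main()
{
    peterson_test<<<1, 2>>>();                // one block, two threads
    cudaDeviceSynchronize();

    int result = 0;
    cudaMemcpyFromSymbol(&result, shared_gloabl_variable_x, sizeof(int));
    // With working mutual exclusion this should print 2 (one increment per thread);
    // what I actually see is always 1.
    printf("shared_gloabl_variable_x = %d\n", result);
    return 0;
}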