CUDA shuffle instruction does not work

I am having a problem with the shuffle instruction in CUDA 5.0.

This is a fragment of my kernel; it runs inside a loop. The printf calls are there only for debugging, because I cannot use a regular debugger:

...
tex_val = tex2D(srcTexRef, threadIdx.x + w, y_pos);
if (threadIdx.x == 0)
{
    left = left_value[y_pos];
}
else
{
    printf("thread %d; shfl value: %f \n", threadIdx.x, __shfl_up(value, 1));
    left = __shfl_up(value, 1);
}

printf("thread %d; value: %f; tex_val: %f; left: %f \n", threadIdx.x, value, tex_val, left);
...

From this, I get this output:

l0:  ITERATION 1
l1:  thread 0; value: 0; tex_val: 1; left: 4
l2: 
l3:  ITERATION 2
l4:  thread 1; shfl value: 0
l5:  thread 0; value: 5; tex_val: 1; left: 5
l6:  thread 1; value: 0; tex_val: 1; left: 0
l7: 
l8:  ITERATION 3
l9:  thread 1; shfl value: 0
l10: thread 2; shfl value: 1
l11: thread 0; value: 6; tex_val: 1; left: 6
l12: thread 1; value: 1; tex_val: 1; left: 0
l13: thread 2; value: 2; tex_val: 1; left: 1
...

From the output, I see that thread 1 never gets the value from thread 0 in any iteration, although thread 0 clearly has one (line 4: the shfl value is 0; line 5: the value is 5). Threads 2 and up do get the value from the thread below them. Where am I mistaken? Is this due to branching?

1 answer

Yes, this is due to branching. Quoting from section B.14.2 of the CUDA Programming Guide:

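The __shfl() intrinsics permit exchanging of a variable between threads within a warp without use of shared memory. The exchange occurs simultaneously for all active threads within the warp,...

Threads may only read data from another thread which is actively participating in the __shfl() command. If the target thread is inactive, the retrieved value is undefined.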

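An inactive thread is one that, because of branching, is not executing the instruction at that point. In your code thread 0 takes the other branch and never executes the shuffle, so the value thread 1 reads from it is undefined.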
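One way to avoid the problem, sketched here under some assumptions, is to run the shuffle unconditionally so that every lane takes part, and to branch only when choosing which value to keep. The kernel name left_neighbor, the host helper run_left_neighbor, and the use of threadIdx.x as a stand-in for value are hypothetical and not taken from the question. The sketch uses __shfl_up_sync, which replaces __shfl_up from CUDA 9 onward; on CUDA 5.0 the equivalent call would be __shfl_up(value, 1).

```cuda
#include <cuda_runtime.h>

// Hypothetical demo kernel: each thread picks up the value held by the
// thread to its left. Intended to be launched with one full warp (32 threads).
__global__ void left_neighbor(const float *left_value, float *out)
{
    // Stand-in for the per-thread value computed in the question's kernel.
    float value = (float)threadIdx.x;

    // All 32 lanes execute the shuffle, so every source lane is active.
    // On CUDA 5.0 this line would read: __shfl_up(value, 1)
    float shifted = __shfl_up_sync(0xffffffffu, value, 1);

    // Branch only after the exchange. Lane 0 has no left neighbour in the
    // warp, so it takes the boundary value instead of the shuffle result.
    float left = (threadIdx.x == 0) ? left_value[0] : shifted;

    out[threadIdx.x] = left;
}

// Hypothetical host helper: runs one warp and copies the 32 results into
// host_out. Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t run_left_neighbor(float boundary, float *host_out)
{
    float *d_left = nullptr;
    float *d_out = nullptr;

    cudaError_t err = cudaMalloc((void **)&d_left, sizeof(float));
    if (err == cudaSuccess)
        err = cudaMalloc((void **)&d_out, 32 * sizeof(float));
    if (err == cudaSuccess)
        err = cudaMemcpy(d_left, &boundary, sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        left_neighbor<<<1, 32>>>(d_left, d_out);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host_out, d_out, 32 * sizeof(float), cudaMemcpyDeviceToHost);

    // cudaFree on a null pointer is a no-op, so this is safe on early failure.
    cudaFree(d_left);
    cudaFree(d_out);
    return err;
}
```

Calling run_left_neighbor(4.0f, out) with a 32-element host array should leave the boundary value in out[0] and the left neighbour's value in every other slot. The design point is only that the shuffle sits outside the if/else; the selection afterwards can diverge freely.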
