Edit: This question is a reworded version of the original, so the first few answers may no longer be relevant.
I am wondering what effect a forced no-inline device function call has on synchronization inside a device function. I have a simple test kernel that illustrates the behavior in question.
The kernel takes a buffer and passes it to a device function, along with a shared buffer and an indicator variable that identifies a single thread as the "boss" thread. The device function contains highly divergent code: the boss thread first wastes time performing trivial operations on the shared buffer and then writes to the global buffer. After a synchronization call, all threads write to the global buffer. After the kernel call, the host prints the contents of the global buffer. Here is the code:
CUDA CODE:
test_main.cu
#include <cutil_inline.h>
#include "test_kernel.cu"

int main()
{
    int scratchBufferLength = 100;
    int *scratchBuffer;
    int *d_scratchBuffer;

    int b = 1;
    int t = 64;

    // copy scratch buffer to device
    scratchBuffer = (int *)calloc(scratchBufferLength, sizeof(int));
    cutilSafeCall( cudaMalloc(&d_scratchBuffer, sizeof(int) * scratchBufferLength) );
    cutilSafeCall( cudaMemcpy(d_scratchBuffer, scratchBuffer, sizeof(int) * scratchBufferLength, cudaMemcpyHostToDevice) );

    // kernel call
    testKernel<<<b, t>>>(d_scratchBuffer);
    cudaThreadSynchronize();

    // copy data back to host
    cutilSafeCall( cudaMemcpy(scratchBuffer, d_scratchBuffer, sizeof(int) * scratchBufferLength, cudaMemcpyDeviceToHost) );

    // print results
    printf("Scratch buffer contents: \t");
    for (int i = 0; i < scratchBufferLength; ++i)
    {
        if (i % 25 == 0)
            printf("\n");
        printf("%d ", scratchBuffer[i]);
    }
    printf("\n");

    // cleanup
    cudaFree(d_scratchBuffer);
    free(scratchBuffer);

    return 0;
}
test_kernel.cu
#ifndef __TEST_KERNEL_CU
#define __TEST_KERNEL_CU

#define IS_BOSS() (threadIdx.x == blockDim.x - 1)

__device__ __noinline__ void testFunc(int *sA, int *scratchBuffer, bool isBoss)
{
    if(isBoss) {      // produces unexpected output -- "broken" code
    //if(IS_BOSS()) { // produces expected output -- "working" code
        for (int c = 0; c < 10000; c++)
        {
            sA[0] = 1;
        }
    }

    if(isBoss)
    {
        scratchBuffer[0] = 1;
    }

    __syncthreads();

    scratchBuffer[threadIdx.x] = threadIdx.x;

    return;
}

__global__ void testKernel(int *scratchBuffer)
{
    __shared__ int sA[4];
    bool isBoss = IS_BOSS();

    testFunc(sA, scratchBuffer, isBoss);

    return;
}

#endif
I compiled this code within the CUDA SDK so that I could use the "cutilSafeCall()" functions in test_main.cu, but of course these could be taken out if you want to compile outside the SDK. I compiled with CUDA Driver/Toolkit version 4.0 for compute capability 2.0, and the code was run on a GeForce GTX 480, which has the Fermi architecture.
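In case anyone wants to build this outside the SDK, here is a rough stand-in for the "cutilSafeCall()" error checking; the macro name checkCuda is only a placeholder of my own for this sketch, not something from the SDK or the toolkit:

// Minimal stand-in for cutilSafeCall() when building outside the CUDA SDK.
// The name "checkCuda" is only a placeholder for this sketch.
#include <cstdio>
#include <cstdlib>

#define checkCuda(call)                                                 \
    do {                                                                \
        cudaError_t err__ = (call);                                     \
        if (err__ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err__), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Replace each cutilSafeCall(...) in test_main.cu with checkCuda(...),
// drop the cutil_inline.h include, and compile with something like:
//   nvcc -arch=sm_20 test_main.cu -o test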
Expected Result:
0 1 2 3 ... blockDim.x-1
However, the output I actually get is
1 1 2 3 ... blockDim.x-1
This seems to indicate that the boss thread executed the conditional "scratchBuffer[0] = 1;" statement AFTER all threads executed the "scratchBuffer[threadIdx.x] = threadIdx.x;" statement, even though the two are separated by the __syncthreads() barrier.
This happens even if the boss thread is instructed to write a sentinel value into the buffer position of a thread in its own warp; the sentinel ends up being the final value in that buffer position, rather than the corresponding threadIdx.x.
One modification that causes the code to generate the expected result is to change the conditional statement
if (isBoss) {
to
if (IS_BOSS()) {
; i.e., change the divergence control variable from being stored in a parameter register to being computed in a macro function. (Note the comments on the corresponding lines in the source code.) This is the particular change I focused on while trying to track down the problem. Comparing the disassembled .cubins of the kernel with the isBoss conditional (i.e., the broken code) and with the IS_BOSS() conditional (i.e., the working code), the most noticeable difference in the instructions seems to be the absence of an SSY instruction in the disassembled broken code.
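For clarity, the "working" variant of the device function (i.e., with the commented-out line from the listing above swapped in for the first conditional, everything else unchanged) looks like this:

// "Working" variant of testFunc: the divergence condition for the first branch
// is recomputed inside the callee via IS_BOSS() instead of being read from the
// isBoss parameter register. This version produces the expected output.
__device__ __noinline__ void testFunc(int *sA, int *scratchBuffer, bool isBoss)
{
    if (IS_BOSS())    // recomputed from threadIdx.x / blockDim.x
    {
        for (int c = 0; c < 10000; c++)
        {
            sA[0] = 1;
        }
    }

    if (isBoss)
    {
        scratchBuffer[0] = 1;
    }

    __syncthreads();

    scratchBuffer[threadIdx.x] = threadIdx.x;
}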
Here are the disassembled kernels, generated by disassembling the .cubin files with "cuobjdump -sass test_kernel.cubin". Everything up to the first "EXIT" is the kernel, and everything after that is the device function. The only differences are in the device function.
DISASSEMBLED OBJECT CODE:
"broken" code
code for sm_20
    Function : _Z10testKernelPi
    MOV R1, c [0x1] [0x100];
    MOV R2, c [0x0] [0x8];
    S2R R0, SR_Tid_X;
    MOV R5, RZ;
    MOV R4, c [0x0] [0x0];
    IADD R2, R2, 0xfffff;
    MOV R7, c [0x0] [0x24];
    MOV R6, c [0x0] [0x20];
    ISET.EQ.U32.AND R0, R0, R2, pt;
    I2I.S32.S32 R8, -R0;
    CAL 0x60;
    EXIT;
    I2I.S32.S8 R0, R8;
    ISETP.NE.AND P0, pt, R0, RZ, pt;
    @!P0 BRA 0xa8;
    MOV R0, RZ;
    IADD R0, R0, 0x1;
    MOV32I R2, 0x1;
    ISETP.NE.AND P1, pt, R0, 0x2710, pt;
    ST.E [R4], R2;
    @P1 BRA 0x80;
    @P0 MOV32I R0, 0x1;
    @P0 ST.E [R6], R0;
    BAR.RED.POPC RZ, RZ;
    S2R R0, SR_Tid_X;
    IMAD.U32.U32 R4.CC, R0, 0x4, R6;
    IMUL.U32.U32.HI R2, R0, 0x4;
    IADD.X R5, R7, R2;
    ST.E [R4], R0;
    RET;
    .................................
"working" code
code for sm_20
    Function : _Z10testKernelPi
    MOV R1, c [0x1] [0x100];
    MOV R2, c [0x0] [0x8];
    S2R R0, SR_Tid_X;
    MOV R5, RZ;
    MOV R4, c [0x0] [0x0];
    IADD R2, R2, 0xfffff;
    MOV R7, c [0x0] [0x24];
    MOV R6, c [0x0] [0x20];
    ISET.EQ.U32.AND R0, R0, R2, pt;
    I2I.S32.S32 R8, -R0;
    CAL 0x60;
    EXIT;
    MOV R2, c [0x0] [0x8];
    S2R R3, SR_Tid_X;
    I2I.S32.S8 R0, R8;
    SSY 0xd0;
    IADD R2, R2, 0xfffff;
    ISETP.NE.U32.AND P0, pt, R3, R2, pt;
    @P0 BRA 0xc8;
    MOV R2, RZ;
    IADD R2, R2, 0x1;
    MOV32I R8, 0x1;
    ISETP.NE.AND P0, pt, R2, 0x2710, pt;
    ST.E [R4], R8;
    @P0 BRA 0xa0;
    ISETP.EQ.AND.S P0, pt, R0, RZ, pt;
    @!P0 MOV32I R0, 0x1;
    @!P0 ST.E [R6], R0;
    BAR.RED.POPC RZ, RZ;
    IMAD.U32.U32 R4.CC, R3, 0x4, R6;
    IMUL.U32.U32.HI R2, R3, 0x4;
    S2R R0, SR_Tid_X;
    IADD.X R5, R7, R2;
    ST.E [R4], R0;
    RET;
    .................................
The "SSY" command is present in the working code, but not in the broken code. The cuobjdump user guide describes the instruction using "Set the synchronization point used before potentially diverging instructions." This makes me think that for some reason the compiler does not recognize the possibility of a discrepancy in the broken code.
I also found that if I comment out the __noinline__ directive, the code produces the expected result, and in fact the assembly generated for the otherwise "broken" and "working" versions is exactly identical. This makes me think that when a variable is passed via the call stack, that variable cannot be used to control divergence followed by a synchronization call; the compiler does not seem to recognize the possibility of divergence in this case, and therefore does not insert the "SSY" instruction. Does anyone know whether this is a legitimate limitation of CUDA, and if so, whether it is documented anywhere?
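For reference, the other change that produces correct output on my setup is simply dropping __noinline__ and keeping both if (isBoss) conditionals; as noted above, the call then gets inlined and the generated assembly matches the "working" version exactly:

// Same device function with __noinline__ removed: the call is inlined into
// testKernel, the generated SASS matches the "working" version, and the
// output is correct.
__device__ void testFunc(int *sA, int *scratchBuffer, bool isBoss)
{
    if (isBoss)
    {
        for (int c = 0; c < 10000; c++)
        {
            sA[0] = 1;
        }
    }

    if (isBoss)
    {
        scratchBuffer[0] = 1;
    }

    __syncthreads();

    scratchBuffer[threadIdx.x] = threadIdx.x;
}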
Thanks in advance.