I am trying to track register usage and came across an interesting scenario. Consider the following source:
#define OL 20
#define NHS 10

__global__ void loop_test( float ** out, const float ** in, int3 gdims, int stride){

    const int idx = blockIdx.x*blockDim.x + threadIdx.x;
    const int idy = blockIdx.y*blockDim.y + threadIdx.y;
    const int idz = blockIdx.z*blockDim.z + threadIdx.z;

    const int index = stride*gdims.y*idz + idy*stride + idx;
    int i = 0, j = 0;
    float sum = 0.f;
    float tmp;
    float lf;
    float u2, tW;

    u2 = 1.0;
    tW = 2.0;

    float herm[NHS];

    for(j = 0; j < OL; ++j){
        for(i = 0; i < NHS; ++i){
            herm[i] += in[j][index];
        }
    }

    for(j = 0; j < OL; ++j){
        for(i = 0; i < NHS; ++i){
            tmp = sum + herm[i]*in[j][index];
            sum = tmp;
        }
        out[j][index] = sum;
        sum = 0.f;
    }
}
As a side note about the source: I am aware that the running sum could be written with +=, but I was experimenting with whether writing it this way changes register usage (it seems it does not; it just adds an extra mov instruction). Also, this source is oriented toward accessing memory mapped in 3D space.
Counting from the declarations, there seem to be 22 registers: 11 scalar locals plus 11 for herm[NHS] (I assume that a float[N] occupies N + 1 registers; please correct me if I'm wrong).
However, compiling with:
nvcc -cubin -arch=sm_20 -Xptxas="-v" src/looptest.cu
gives:
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 25 registers, 72 bytes cmem[0]
So, the number is different from what is "expected." Also, if compiled with:
nvcc -cubin -arch=sm_13 -Xptxas="-v" src/looptest.cu
register usage is much lower, 8 fewer to be exact (perhaps because sm_20 adheres more strictly than sm_13 to the IEEE floating-point standard?):
ptxas info : Compiling entry function '_Z9loop_testPPfPPKfS2_4int3i' for 'sm_13'
ptxas info : Used 17 registers, 40+16 bytes smem, 8 bytes cmem[1]
As a final note, change the OL macro definition to 40 and suddenly:
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 28 registers, 72 bytes cmem[0]
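For cross-checking the ptxas figures from inside a program, the runtime call cudaFuncGetAttributes reports the per-thread register count along with local, shared, and constant memory sizes. The following is only a minimal sketch: demo_kernel is a hypothetical stand-in that is not part of my source, and loop_test would be passed in its place. A nonzero local memory size would suggest that an array such as herm ended up in local memory rather than in registers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel (an assumption for illustration only);
// substitute loop_test to inspect the kernel from the question.
__global__ void demo_kernel(float *out, const float *in, int n) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = 2.0f * in[idx];
    }
}

int main() {
    cudaFuncAttributes attr;

    // Query the compiled kernel's resource usage; no launch is required.
    cudaError_t err = cudaFuncGetAttributes(&attr, demo_kernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // numRegs should correspond to the "Used N registers" line from ptxas
    // for the architecture that was actually loaded.
    std::printf("registers per thread:            %d\n", attr.numRegs);
    std::printf("local memory per thread (bytes): %zu\n", attr.localSizeBytes);
    std::printf("static shared memory (bytes):    %zu\n", attr.sharedSizeBytes);
    std::printf("constant memory (bytes):         %zu\n", attr.constSizeBytes);
    return 0;
}
```

This only reports totals; it does not say which variables occupy which registers, so it complements the ptxas output rather than explaining it.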
In conclusion, I would like to know where the registers are being used, and what explains the couple of observations I have made.
I don't have enough experience with assembly to work through the cuobjdump output. The answer is surely buried in there, so perhaps someone can enlighten me on what I should look for, or point me to a guide on how to approach the assembly dump.