Cuda kernel register usage tracking

Question

Cuda kernel register usage tracking

I am trying to track registry usage and came across an interesting scenario. Consider the following source:

#define OL 20 #define NHS 10 __global__ void loop_test( float ** out, const float ** in,int3 gdims,int stride){ const int idx = blockIdx.x*blockDim.x + threadIdx.x; const int idy = blockIdx.y*blockDim.y + threadIdx.y; const int idz = blockIdx.z*blockDim.z + threadIdx.z; const int index = stride*gdims.y*idz + idy*stride + idx; int i = 0,j =0; float sum =0.f; float tmp; float lf; float u2, tW; u2 = 1.0; tW = 2.0; float herm[NHS]; for(j=0; j < OL; ++j){ for(i = 0; i < NHS; ++i){ herm[i] += in[j][index]; } } for(j=0; j<OL; ++j){ for(i=0;i<NHS; ++i){ tmp = sum + herm[i]*in[j][index]; sum = tmp; } out[j][index] = sum; sum =0.f; } }

As a side note about the source - the current amount I could do + =, but played with how to change this use of the effect register (it seems this is not so - it just adds an additional mov command). In addition, this source focuses on access to memory displayed in 3D space.

Counting the registers, there seem to be 22 registers (I assume that float [N] occupies N + 1 registers - please correct me if I'm wronge) based on declarations.

However, compiling with:

 nvcc -cubin -arch=sm_20 -Xptxas="-v" src/looptest.cu

gives:

 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 25 registers, 72 bytes cmem[0]

So, the number is different from what is "expected." Also, if compiled with:

 nvcc -cubin -arch=sm_13 -Xptxas="-v" src/looptest.cu

The use of the register is much less - 8 to be exact (apparently due to a stronger adherence in sm_20 than sm_13 to IEEE floating point mathematical standards?):

 ptxas info : Compiling entry function '_Z9loop_testPPfPPKfS2_4int3i' for 'sm_13' ptxas info : Used 17 registers, 40+16 bytes smem, 8 bytes cmem[1]

As a final note, change the OL macro definition to 40 and suddenly:

 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 28 registers, 72 bytes cmem[0]

In conclusion, I would like to know where the registers are burned, and what leads to the observations of the pair that I made.

I don’t have enough experience with the assembly to get through cuobjdump - of course I buried the answer there - maybe someone can enlighten me on what I should look for, or show me a guide on how to approach the dump assembly .

+7

optimization c cuda

Marm0t Mar 15 '12 at 15:56

source share

2 answers

harrism · Answer 1 · 2012-09-17T01:23:59+0000

sm_20 and sm_13 are very different architectures, with a very different instruction set (ISA). The main difference that causes an increase in the use of registers is that sm_1x has special address registers, while sm_2x and later do not. Instead, addresses are stored in general registers, as are values, which means that most programs require more registers on sm_2x than on sm_1x.

sm_20 also has the sm_13 register file double the size to compensate for this effect.

Roger dahl · Answer 2 · 2012-03-16T16:07:24+0000

Using the registry does not necessarily have a close correlation with the number of variables.

The compiler tries to evaluate the advantage of the speed of storing a variable in a register between two points of code usage by comparing the potential gain in one core with the cost for all simultaneously working cores due to the lack of available registers in the register pool. (A Fermi SM has 32768 registers). Therefore, it is not surprising if a code change causes unexpected fluctuations in the number of registers used.

You really need to worry only about using the register, if the profiler says that your activity is limited to using the register. In this case, you can use the --maxrregcount option to reduce the number of registers used by a single kernel to see if it improves overall execution speed.

To reduce the number of registers used by the kernel, you can try to use the variable as local as possible. For example, if you do:

 set variable 1 set variable 2 use variable 1 use variable 2

This can lead to the use of 2 registers. Although, if you:

 set variable 1 use variable 1 set variable 2 use variable 2

This can lead to the use of 1 register.

Cuda kernel register usage tracking

More articles: