Effect of __forceinline__ on CUDA C __device__ functions

There are many guidelines on when to use inline functions, and when to avoid them, in ordinary host code. What effect does __forceinline__ have on CUDA C __device__ functions? Where should it be used, and where should it be avoided?

Tags: c, gpgpu, cuda, nvidia
1 answer

Typically, the nvcc device code compiler makes its own decisions about when to inline a particular __device__ function, and generally speaking you probably don't need to worry about overriding that with the __forceinline__ decorator/directive.
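To make the usage concrete, here is a minimal sketch of how the qualifier is written. The function and kernel names are illustrative, not from the question:

```cuda
// Hypothetical example: a small __device__ helper marked __forceinline__.
__forceinline__ __device__ float scale(float x, float s)
{
    return x * s;
}

__global__ void scaleKernel(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = scale(data[i], s);  // this call is forced to be inlined
}
```

Without the qualifier, the compiler is still free to inline `scale` on its own; __forceinline__ only removes its discretion.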

cc 1.x devices lack some of the hardware capabilities of newer devices (in particular, support for a proper function-call ABI), so the compiler nearly always inlines __device__ functions for those targets automatically.

I think the reasons for specifying __forceinline__ are essentially the same as what you learned for host C code. It is normally an optimization, used when the compiler might not otherwise inline the function (for example, on cc 2.x or newer devices). The savings (i.e., the function-call overhead) may be negligible if the function is called once, but if it is called inside a loop, making sure it is inlined can noticeably improve code execution.
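The loop case can be sketched as follows (illustrative names; a helper invoked once per iteration, so any call overhead would be paid n times):

```cuda
// Hypothetical example: forcing inlining of a helper called in a loop.
__forceinline__ __device__ float accumulate(float acc, float x)
{
    return acc + x * x;
}

__global__ void sumSquares(const float *in, float *out, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc = accumulate(acc, in[i]);  // inlined: no per-iteration call cost
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *out = acc;
}
```

In practice the compiler would very likely inline a function this small anyway; the qualifier simply guarantees it.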

As a counter-example, inlining and recursion are generally at odds. For a recursive function that calls itself, I don't see how the compiler could handle arbitrary recursion depth while also honoring strict inlining. So if you intend to use a function recursively (supported on cc 2.x and later), you probably should not specify __forceinline__ .
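A minimal sketch of the recursive case (illustrative function, not from the question), where omitting __forceinline__ lets the compiler emit a real call, which arbitrary-depth recursion requires:

```cuda
// Hypothetical example: a recursive __device__ function (cc 2.x and later).
// No __forceinline__ here: the self-call must be an actual call, since the
// recursion depth is not known at compile time.
__device__ int factorial(int n)
{
    return (n <= 1) ? 1 : n * factorial(n - 1);
}
```

Note that recursion also consumes per-thread stack space, so deep recursion may require raising the stack limit with cudaDeviceSetLimit(cudaLimitStackSize, ...).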

In general, I think you should let the compiler manage this for you. It will usually make sensible decisions about whether to inline a function.

