Normally, the nvcc device code compiler makes its own decisions about when to inline a particular __device__ function, and generally speaking, you probably don't need to worry about overriding that with the __forceinline__ decorator/directive.
cc 1.x devices don't have all the same hardware capabilities as newer devices, so very often the compiler will automatically inline functions for those devices.
I think the reason for specifying __forceinline__ is the same as what you may have learned about host C code. It is usually used as an optimization when the compiler might not otherwise inline the function (for example, on cc 2.x or newer devices). The benefit (i.e., avoiding the function call overhead) may be negligible if you call the function once, but if you call it in a loop, for example, making sure it is inlined may give a noticeable improvement in execution time.
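To illustrate the loop case, here is a minimal sketch. The names scale_add and apply_repeatedly are hypothetical, chosen only for this example, and whether __forceinline__ actually changes the generated code here depends on what the compiler would have done on its own:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical small helper. __forceinline__ asks the compiler to inline it
// at every call site instead of leaving the decision to its heuristics.
__device__ __forceinline__ float scale_add(float x, float a, float b)
{
    return a * x + b;
}

// Hypothetical kernel that calls the helper on every loop iteration,
// which is where per-call overhead would add up if it were not inlined.
__global__ void apply_repeatedly(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = data[i];
    for (int k = 0; k < iters; ++k) {
        v = scale_add(v, 0.5f, 1.0f);
    }
    data[i] = v;
}

int main()
{
    const int n = 256;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));   // all elements start at 0.0f

    apply_repeatedly<<<(n + 127) / 128, 128>>>(d, n, 4);

    float first = 0.0f;
    cudaError_t err = cudaMemcpy(&first, d, sizeof(float), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("data[0] = %f\n", first);

    cudaFree(d);
    return 0;
}
```

If you want to know whether the qualifier made any difference, compare the generated code (for example with cuobjdump -sass) or time the kernel with and without it, rather than assuming a speedup.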
As a counterexample, inlining and recursion generally don't mix. For a recursive function that calls itself, I don't think it's possible to handle arbitrary recursion and also strict inlining. So if you intend to use a function recursively (supported on cc 2.x and later), you probably would not want to specify __forceinline__.
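As a rough sketch of the recursive case, the function below is declared as a plain __device__ function with no inlining qualifier, so the compiler is free to emit a real call. The names factorial and factorial_kernel are hypothetical and only serve this example:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical recursive device function. Note the absence of
// __forceinline__: the recursion depth depends on a runtime value,
// so the call cannot be fully expanded at compile time.
__device__ unsigned int factorial(unsigned int n)
{
    return (n <= 1u) ? 1u : n * factorial(n - 1u);
}

// Single-thread kernel that stores the result of the recursive call.
__global__ void factorial_kernel(unsigned int *out, unsigned int n)
{
    *out = factorial(n);
}

int main()
{
    unsigned int *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(unsigned int));

    factorial_kernel<<<1, 1>>>(d_out, 5u);

    unsigned int result = 0;
    cudaError_t err = cudaMemcpy(&result, d_out, sizeof(result), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("factorial(5) = %u\n", result);

    cudaFree(d_out);
    return 0;
}
```

nvcc may warn that the stack size for a recursive function cannot be statically determined. For deeper recursion than this toy case, the per-thread stack limit can be raised with cudaDeviceSetLimit(cudaLimitStackSize, ...).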
In general, I think you should let the compiler manage this for you. It will intelligently decide whether to inline a function.
Robert Crovella