A kernel launch is asynchronous. This means the launch returns control to the CPU thread immediately after starting the work on the GPU, before the kernel has finished executing.
So what happens next in this case? The application exits.
When the application exits, its ability to write to standard output is torn down by the OS.
Any output the kernel generates after that point has nowhere to go, so you never see it.
On the other hand, if you call cudaDeviceSynchronize(), the kernel is guaranteed to finish (and its output will find a standard output queue still waiting for it) before the application is allowed to exit.
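As a minimal sketch of the point above (the kernel name hello_kernel and the launch configuration are made up for illustration, not taken from the question), a program along these lines should show the kernel's printf output only because the host waits before returning from main:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: each thread prints its own index.
__global__ void hello_kernel()
{
    printf("Hello from thread %d\n", threadIdx.x);
}

int main()
{
    // The launch is asynchronous: control returns to the host right away.
    hello_kernel<<<1, 4>>>();

    // Report launch-time errors (for example, an invalid configuration).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Block the host until the kernel has finished. This is also a point
    // at which the device-side printf buffer is flushed to stdout.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    return 0;
}
```

If the cudaDeviceSynchronize() call is removed, main can return before the kernel has run, and the output may never appear.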
Robert Crovella