Short version:
The main cause of the problem when class A is created outside the main one is that the specific routine procedure that is required to initialize the CUDA runtime library with your kernels is not executed before the class A constructor is called. This is because there are no guarantees regarding the order in which static objects are created and initialized in the C ++ execution model. Your global visibility class is created for global scope objects that are initialized by the CUDA installation. Your kernel code is never loaded into the context before it is called, and a runtime error occurs.
As far as I can tell, this is a genuine limitation of the CUDA runtime API, and not something easily fixed in the user code. In your trivial example, you can replace the kernel call with a call to cudaMemset or one of the non-character based memset API functions, and it will work. This issue is completely limited to user kernels or device symbols loaded at runtime through the runtime API. For this reason, an empty default constructor will also solve your problem. From a design point of view, I would very much doubt any template that calls the kernels in the constructor. Adding a specific method for installing / debugging the GPU class, which is independent of the default constructor or destructor, will be much cleaner and less error prone design, IMHO.
More details:
There is an internal generated procedure ( __cudaRegisterFatBinary ) that must be run to load and register kernels, textures, and statically defined device symbols contained in the grease payload of any runtime API program with the CUDA API before the kernel can be without mistakes. This is part of the context initialization function of the lazy API. You can confirm this for yourself as follows:
Here is the gdb tag of the revised example you posted. Note. I am inserting a breakpoint in __cudaRegisterFatBinary , and this is not achieved before you call static constructor A and the kernel crashes:
talonmies@box:~$ gdb a.out GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04 Copyright (C) 2012 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: <http://bugs.launchpad.net/gdb-linaro/>... Reading symbols from /home/talonmies/a.out...done. (gdb) break '__cudaRegisterFatBinary' Breakpoint 1 at 0x403180 (gdb) run Starting program: /home/talonmies/a.out [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Scalar Field [New Thread 0x7ffff5a63700 (LWP 10774)] Class A Kernel : invalid device function [Thread 0x7ffff5a63700 (LWP 10774) exited] [Inferior 1 (process 10771) exited with code 0377]
Here is the same procedure, this time with an A instance inside main (which is guaranteed to happen after the objects that perform lazy configuration have been initialized):
talonmies@box:~$ cat main.cu
If this is really a problem for you, I would recommend contacting NVIDIA developer support and reporting an error.
talonmies
source share