Problems Starting CUDA Cores from Static Initialization Code

Question

Problems Starting CUDA Cores from Static Initialization Code

I have a class that calls the kernel in its constructor, as shown below:

"ScalarField.h"

#include <iostream> void ERROR_CHECK(cudaError_t err,const char * msg) { if(err!=cudaSuccess) { std::cout << msg << " : " << cudaGetErrorString(err) << std::endl; std::exit(-1); } } class ScalarField { public: float* array; int dimension; ScalarField(int dim): dimension(dim) { std::cout << "Scalar Field" << std::endl; ERROR_CHECK(cudaMalloc(&array, dim*sizeof(float)),"cudaMalloc"); } };

"classA.h"

 #include "ScalarField.h" static __global__ void KernelSetScalarField(ScalarField v) { int index = threadIdx.x + blockIdx.x * blockDim.x; if (index < v.dimension) v.array[index] = 0.0f; } class A { public: ScalarField v; A(): v(ScalarField(3)) { std::cout << "Class A" << std::endl; KernelSetScalarField<<<1, 32>>>(v); ERROR_CHECK(cudaGetLastError(),"Kernel"); } };

"main.cu"

 #include "classA.h" A a_object; int main() { std::cout << "Main" << std::endl; return 0; }

If I instantiate this class on main ( A a_object; ), I get no errors. However, if I create an instance outside the main one, immediately after its definition ( class A {...} a_object; ), I get the error "incorrect device function" when starting the kernel. Why is this happening?

EDIT

Updated code to provide a more complete example.

EDIT 2

Following the advice in Raxvan's comment, I wanted to say that the dimensions variable used in the ScalarField constructor is also defined (in another class) outside the main, but above all. Maybe this is an explanation? The debugger showed the correct value for dimensions , though.

+8

c ++ global-variables cuda

Noel Jul 21 '14 at 15:43

source share

1 answer

talonmies · Accepted Answer · 2014-07-22T09:29:32+0000

Short version:

The main cause of the problem when class A is created outside the main one is that the specific routine procedure that is required to initialize the CUDA runtime library with your kernels is not executed before the class A constructor is called. This is because there are no guarantees regarding the order in which static objects are created and initialized in the C ++ execution model. Your global visibility class is created for global scope objects that are initialized by the CUDA installation. Your kernel code is never loaded into the context before it is called, and a runtime error occurs.

As far as I can tell, this is a genuine limitation of the CUDA runtime API, and not something easily fixed in the user code. In your trivial example, you can replace the kernel call with a call to cudaMemset or one of the non-character based memset API functions, and it will work. This issue is completely limited to user kernels or device symbols loaded at runtime through the runtime API. For this reason, an empty default constructor will also solve your problem. From a design point of view, I would very much doubt any template that calls the kernels in the constructor. Adding a specific method for installing / debugging the GPU class, which is independent of the default constructor or destructor, will be much cleaner and less error prone design, IMHO.

More details:

There is an internal generated procedure ( __cudaRegisterFatBinary ) that must be run to load and register kernels, textures, and statically defined device symbols contained in the grease payload of any runtime API program with the CUDA API before the kernel can be without mistakes. This is part of the context initialization function of the lazy API. You can confirm this for yourself as follows:

Here is the gdb tag of the revised example you posted. Note. I am inserting a breakpoint in __cudaRegisterFatBinary , and this is not achieved before you call static constructor A and the kernel crashes:

 talonmies@box:~$ gdb a.out GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04 Copyright (C) 2012 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: <http://bugs.launchpad.net/gdb-linaro/>... Reading symbols from /home/talonmies/a.out...done. (gdb) break '__cudaRegisterFatBinary' Breakpoint 1 at 0x403180 (gdb) run Starting program: /home/talonmies/a.out [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Scalar Field [New Thread 0x7ffff5a63700 (LWP 10774)] Class A Kernel : invalid device function [Thread 0x7ffff5a63700 (LWP 10774) exited] [Inferior 1 (process 10771) exited with code 0377]

Here is the same procedure, this time with an A instance inside main (which is guaranteed to happen after the objects that perform lazy configuration have been initialized):

 talonmies@box:~$ cat main.cu #include "classA.h" int main() { A a_object; std::cout << "Main" << std::endl; return 0; } talonmies@box:~$ nvcc --keep -arch=sm_30 -g main.cu talonmies@box:~$ gdb a.out GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04 Copyright (C) 2012 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: <http://bugs.launchpad.net/gdb-linaro/>... Reading symbols from /home/talonmies/a.out...done. (gdb) break '__cudaRegisterFatBinary' Breakpoint 1 at 0x403180 (gdb) run Starting program: /home/talonmies/a.out [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Breakpoint 1, 0x0000000000403180 in __cudaRegisterFatBinary () (gdb) cont Continuing. Scalar Field [New Thread 0x7ffff5a63700 (LWP 11084)] Class A Main [Thread 0x7ffff5a63700 (LWP 11084) exited] [Inferior 1 (process 11081) exited normally]

If this is really a problem for you, I would recommend contacting NVIDIA developer support and reporting an error.

Problems Starting CUDA Cores from Static Initialization Code

More articles: