I have been working on a CUDA program that crashes fairly often with an "unspecified launch failure" error. Through careful debugging I localized which kernel was failing, and I also noticed that the failure occurred only if certain transcendental functions were called from within the CUDA kernel (for example, sinf() or atanhf()).
This led me to write a much simpler program (see below) to confirm that the transcendental functions really are the cause, and it appears they are. When I compile and run the code below, which simply repeats a kernel call using tanh and atanh several times, sometimes the program runs fine, and sometimes it prints "Error with kernel" along with a message from the driver that says:
NVRM: Xid (0000:01:00): 13 0002 000000 000050c0 00000368 00000000 0000080
As for frequency, the executable runs successfully maybe 50% of the time.
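For reference, the "unspecified launch failure" text itself comes from the CUDA error code returned after the failed synchronize; a minimal sketch of how it can be printed (just the standard error-query pattern, assuming the same includes as the test program below):

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        // cudaGetErrorString() maps the error code to a readable name,
        // e.g. "unspecified launch failure"
        cerr << "Error with kernel: " << cudaGetErrorString(err) << "\n";
        exit(-1);
    }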
From what I have read online, Xid 13 seems to be roughly analogous to a host-based segfault. However, given how the array is indexed, I don't see how that could happen here. In addition, the program does not crash if I replace the transcendental functions in the kernel with other operations (for example, floating-point subtraction and addition). That is, I do not get the Xid error message, and the program ultimately returns the correct value of atanh(0.7).
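To be concrete about the substitution, the kind of kernel I tried instead looks roughly like this (only simple float arithmetic in the loop, no transcendentals; this is a sketch from memory, not the exact replacement I ran):

    __global__ void test_no_trans(float *a, int length)
    {
        if ((threadIdx.x + blockDim.x * blockIdx.x) < length) {
            float temp = 0.7f;
            // same loop structure as the failing kernel, but only adds and subtracts
            for (int i = 0; i < 100; i++) {
                temp = temp + 0.1f;
                temp = temp - 0.1f;
            }
            a[threadIdx.x + blockDim.x * blockIdx.x] = temp;
        }
    }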
I am running CUDA 5.0 on an Ubuntu 11.10 x64 desktop. The driver version is 304.54, and I am using a GeForce 9800 GTX.
I am inclined to say this is either a hardware problem or a driver bug. What is strange is that the NVIDIA sample applications work fine, perhaps because they do not use the affected transcendental functions.
The last bit of potentially important information is that if I run either my main project or this test program under cuda-memcheck, it reports no errors and never fails. Honestly, I could just run my project under cuda-memcheck all the time, but the performance hit makes that impractical.
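For completeness, the invocation is nothing special (the executable name is simply what I compile the test program below to):

    # no errors are reported, and the Xid message never appears under the tool
    cuda-memcheck ./test_trans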
Thanks in advance for any help or insight. If someone with a 9800 GTX is willing to run this code and see whether it fails, I would be very grateful.
#include <iostream>
#include <stdlib.h>

using namespace std;

__global__ void test_trans(float *a, int length)
{
    if ((threadIdx.x + blockDim.x * blockIdx.x) < length) {
        float temp = 0.7;
        // repeatedly apply atanh/tanh; mathematically temp should stay at 0.7
        for (int i = 0; i < 100; i++) {
            temp = atanh(temp);
            temp = tanh(temp);
        }
        a[threadIdx.x + blockDim.x * blockIdx.x] = atanh(temp);
    }
}

int main()
{
    float *array_dev;
    float *array_host;
    unsigned int size = 10000000;

    if (cudaSuccess != cudaMalloc((void **)&array_dev, size * sizeof(float))) {
        cerr << "Error with memory Allocation\n";
        exit(-1);
    }
    array_host = new float[size];

    // launch the kernel several times, synchronizing after each launch
    for (int i = 0; i < 10; i++) {
        test_trans<<<size / 512 + 1, 512>>>(array_dev, size);
        if (cudaSuccess != cudaDeviceSynchronize()) {
            cerr << "Error with kernel\n";
            exit(-1);
        }
    }

    cudaMemcpy(array_host, array_dev, sizeof(float) * size, cudaMemcpyDeviceToHost);
    cout << array_host[size - 1] << "\n";
}
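For anyone willing to try it, this is roughly how I build and run it; the -arch flag reflects the 9800 GTX being a compute capability 1.1 part, so adjust it for your own card (the file name is just what I happened to call it):

    nvcc -arch=sm_11 test_trans.cu -o test_trans
    ./test_trans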
Edit: I shelved this project for several months, but yesterday, after updating to driver version 319.23, I no longer see this problem. I think the issue I described was a driver bug that has since been fixed. Hopefully this helps someone else.