CUDA uncorrectable ECC error

My environment

  • Windows 7 x64
  • MATLAB 2012a x64
  • CUDA SDK 4.2
  • Tesla C2050 GPU

I am having trouble figuring out why my GPU crashes with an "uncorrectable ECC error". The error occurs only when I use 512 threads or more. I cannot publish the kernel, but I will try to describe what it does.

In general, the kernel takes a number of parameters and produces two complex matrices whose size is determined by the number of threads, M, and another number, N. The returned matrices are therefore MxN. A typical configuration is 512x512, but the two numbers are independent and can each vary up or down. The kernel works when the dimensions are 256x256.

Each thread extracts a vector of size 999 from a 2D array of size 999xM, based on its thread ID, and then loops over the row (0 .. N-1) of the output matrices to do the calculation. A number of intermediate parameters are computed using only pow, sin, and cos in addition to the + - * / operators. To compute one of the output matrices, an additional loop is needed to sum the contributions of the 999-element vector extracted earlier. This loop performs some intermediate calculations to determine the range of values that will contribute. Each contribution is then scaled by a factor determined by the cos and sin of a computed fractional value. This is where it crashes. If I substitute a constant value such as 1.0 for the cos and sin results, the kernel runs without problems; however, when even one of the calls (cos or sin) is included, the kernel crashes.

Below is some pseudo-code:

    kernel()
    {
        /* Extract 999 vector from 2D array 999xM - one 999 vector for each thread. */
        for (int i = 0; i < 999; i++)
        {
            .....
        }

        /* Cycle through the 2nd dimension of the output matrices */
        for (int j = 0; j < N; j++)
        {
            /* Calculate some intermediate variables */

            /* Calculate the real and imaginary components of the first output matrix */
            /* real = cos(value), imaginary = sin(value) */

            /* Construct the first output matrix from some intermediate variables and the real and imaginary components */

            /* Calculate some more intermediate variables */

            /* Cycle through the extracted vector (0 .. 998) */
            for (int k = 0; k < 999; k++)
            {
                /* Calculate some more intermediate variables */

                /* Determine the range of allowed values to contribute to the second output matrix. */

                /* Calculate the real and imaginary components of the second output matrix */
                /* real = cos(value), imaginary = sin(value) */
                /* This is where it crashes, unless real and imaginary are constant values (1.0) */

                /* Sum up the contributions of the extracted vector to the second output matrix */
            }

            /* Construct the second output matrix from some intermediate variables and the real and imaginary components */
        }
    }

I thought this might be due to the register limit, but the occupancy calculator indicates that it is not: I use fewer than 32,768 registers with 512 threads. Can anyone suggest what might be causing this?

Here is the ptxas info:

    ptxas info : Compiling entry function '_Z40KerneliidddddPKdS0_S0_S0_iiiiiiiiiPdS1_S1_S1_S1_S1_S1_S1_S1_S1_' for 'sm_20'
    ptxas info : Function properties for _Z40KerneliidddddPKdS0_S0_S0_iiiiiiiiiPdS1_S1_S1_S1_S1_S1_S1_S1_S1_
        8056 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info : Function properties for __internal_trig_reduction_slowpathd
        40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info : Used 53 registers, 232 bytes cmem[0], 144 bytes cmem[2], 28 bytes cmem[16]
    tmpxft_00001d70_00000000-3_MexFunciton.cudafe1.cpp
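For reference, the register check mentioned earlier can be written out as a small host-side sketch. It assumes the 512 threads run as a single block, takes the 53 registers per thread from the ptxas output and the 32,768-register budget from the figure quoted in the question, and ignores the hardware's register allocation granularity. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: register budget per multiprocessor, as quoted in the question.
const std::uint32_t kRegisterBudget = 32768;

// Registers needed by one block, ignoring allocation granularity.
inline std::uint32_t registersPerBlock(std::uint32_t regsPerThread,
                                       std::uint32_t threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return regsPerThread * threadsPerBlock;
}

// True when one block of this size fits within the register budget.
inline bool fitsRegisterBudget(std::uint32_t regsPerThread,
                               std::uint32_t threadsPerBlock) {
    return registersPerBlock(regsPerThread, threadsPerBlock) <= kRegisterBudget;
}
```

With these numbers, `registersPerBlock(53, 512)` is 27,136, which is under the budget, while a 1024-thread block would need 54,272 registers and would not fit.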
1 answer

An "uncorrectable ECC error" usually points to a hardware failure. ECC stands for error-correcting code, a technique for detecting and correcting errors in the bits stored in RAM. A stray cosmic ray may flip a single bit stored in RAM once in a great while, but an "uncorrectable ECC error" indicates that several bits are coming out of memory wrong, too many for ECC to recover the original values.

This may mean that you have bad or marginal memory in the GPU's device memory.
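To make the difference between a correctable and an uncorrectable error concrete, here is a toy single-error-correct, double-error-detect (SECDED) code over 4 data bits. It is only a sketch: the ECC in real GPU memory protects much wider words and its exact code is not described here, and every name below (`eccEncode`, `eccDecode`, `EccStatus`) is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

enum class EccStatus { Clean, Corrected, Uncorrectable };

struct EccResult {
    EccStatus status;
    std::uint8_t data;  // decoded 4-bit value; not meaningful when Uncorrectable
};

// Parity (XOR of all bits) of an 8-bit word.
inline unsigned parity8(std::uint8_t w) {
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return w & 1u;
}

// Encode 4 data bits. Bit i of the codeword is Hamming position i (1..7);
// bit 0 is an overall parity bit that gives the whole word even parity.
inline std::uint8_t eccEncode(std::uint8_t data) {
    assert(data < 16);
    unsigned d0 = data & 1u, d1 = (data >> 1) & 1u;
    unsigned d2 = (data >> 2) & 1u, d3 = (data >> 3) & 1u;
    unsigned p1 = d0 ^ d1 ^ d3;  // checks positions 1, 3, 5, 7
    unsigned p2 = d0 ^ d2 ^ d3;  // checks positions 2, 3, 6, 7
    unsigned p4 = d1 ^ d2 ^ d3;  // checks positions 4, 5, 6, 7
    std::uint8_t w = static_cast<std::uint8_t>(
        (p1 << 1) | (p2 << 2) | (d0 << 3) | (p4 << 4) |
        (d1 << 5) | (d2 << 6) | (d3 << 7));
    return static_cast<std::uint8_t>(w | parity8(w));
}

inline EccResult eccDecode(std::uint8_t w) {
    // Syndrome: XOR of the positions of all set bits; zero for a valid codeword,
    // and equal to the flipped position after a single-bit error.
    unsigned syndrome = 0;
    for (unsigned pos = 1; pos <= 7; ++pos)
        if ((w >> pos) & 1u) syndrome ^= pos;
    bool parityOk = (parity8(w) == 0);

    EccStatus status = EccStatus::Clean;
    if (syndrome != 0 && !parityOk) {         // one bit flipped in positions 1..7
        w ^= static_cast<std::uint8_t>(1u << syndrome);
        status = EccStatus::Corrected;
    } else if (syndrome == 0 && !parityOk) {  // only the overall parity bit flipped
        status = EccStatus::Corrected;
    } else if (syndrome != 0 && parityOk) {   // two bits flipped: detected, not fixable
        return {EccStatus::Uncorrectable, 0};
    }
    std::uint8_t data = static_cast<std::uint8_t>(
        ((w >> 3) & 1u) | (((w >> 5) & 1u) << 1) |
        (((w >> 6) & 1u) << 2) | (((w >> 7) & 1u) << 3));
    return {status, data};
}
```

Flipping any one bit of an encoded word still decodes to the original value with status `Corrected`; flipping two bits yields `Uncorrectable`, which is the situation the error message reports.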

Marginal circuits of any kind may not fail 100% of the time, but they are more likely to fail under the stress of heavy use and the rise in temperature that comes with it.
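One possible reason a larger launch exercises more memory: the ptxas output in the question reports an 8056-byte stack frame per thread, which is about what a private copy of a 999-element vector of doubles would need. That the frame holds this vector, and that its elements are doubles, are assumptions on my part. Stack frames live in local memory, which is backed by the GPU's device memory, so the footprint grows with the number of threads. A rough host-side sketch of the arithmetic, with made-up function names:

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed for one thread's private copy of the extracted vector,
// assuming the elements are doubles.
inline std::size_t vectorBytes(std::size_t elements) {
    return elements * sizeof(double);
}

// Total stack (local memory) bytes for `threads` resident threads, given the
// per-thread stack frame size reported by ptxas.
inline std::size_t totalStackBytes(std::size_t frameBytes, std::size_t threads) {
    assert(threads > 0);
    return frameBytes * threads;
}
```

Here `vectorBytes(999)` is 7,992 bytes, and `totalStackBytes(8056, 512)` is 4,124,672 bytes, twice what 256 threads would touch. This does not prove anything about the fault, but it is consistent with the larger configuration reaching memory that the 256x256 run never uses.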

There are diagnostic utilities around for stress-testing all the RAM banks of a PC to confirm or identify which chip is failing, but I don't know of an equivalent for testing the memory banks of a GPU.

If you have access to another computer with a similar GPU, try running your application on it to see how it behaves. If you do not get an ECC error on the second machine, that confirms the problem is almost certainly in the hardware of the first machine. If you get the same ECC error on the second machine, ignore everything I wrote here and keep looking for a bug in your software. Unless your code is actually damaging the hardware, the odds of two machines having the same hardware fault are extremely small.
