CUDA cores give different results on two different GPUs (GeForce 8600M GT vs Quadro FX 770M)

Question

CUDA cores give different results on two different GPUs (GeForce 8600M GT vs Quadro FX 770M)

I am working on an AES CUDA application, and I have a kernel that performs ECB encryption on the GPU. To ensure that the logic of the algorithm does not change when running in parallel, I send a known input test vector provided by NIST, and then from the host code, compare the output with the know test output vector provided by NIST using assert. I tested this test on my NVIDIA GPU, which is the 8600M GT. This works under Windows 7, and the driver version is 3.0. In this scenario, everything works perfectly, and the statement succeeds.

Now that the application is launching on the Quadro FX 770M. The same application starts, the same test vectors are sent, but the result is incorrect and the statement does not work !!. This works on Linux with the same driver version. Kernels run on 256 threads. In the kernels and to skip arithmetic pre-calculated lookup tables, 256 elements are used. These tables are initially loaded into global memory, 1 stream of 256 threads starting the colaborate kernel when 1 element of the search table is loaded and moves the element to a new search table in shared memory, so access latency is reduced.

Initially, I was thinking about synchronization problems due to the difference in clock frequency between the GPUs. Thus, threads can use values that have not yet been loaded into shared memory, or some values that have not been processed yet, making the conclusion spoil and finally get it wrong.

Well-known test vectors are declared here, so basically they are sent to AES_set_encrption, which is responsible for tuning the kernel

void test_vectors () { unsigned char testPlainText[] = {0x6b, 0xc1, 0xbe, 0xe2, 0x2e, 0x40, 0x9f, 0x96, 0xe9, 0x3d, 0x7e, 0x11, 0x73, 0x93, 0x17, 0x2a}; unsigned char testKeyText[] = {0x60, 0x3d, 0xeb, 0x10, 0x15, 0xca, 0x71, 0xbe, 0x2b, 0x73, 0xae, 0xf0, 0x85, 0x7d, 0x77,0x1f, 0x35, 0x2c, 0x07, 0x3b, 0x61, 0x08, 0xd7, 0x2d, 0x98, 0x10, 0xa3, 0x09, 0x14, 0xdf, 0xf4}; unsigned char testCipherText[] = {0xf3, 0xee, 0xd1, 0xbd, 0xb5, 0xd2, 0xa0, 0x3c, 0x06, 0x4b, 0x5a, 0x7e, 0x3d, 0xb1, 0x81, 0xf8}; unsigned char out[16] = {0x0}; //AES Encryption AES_set_encrption( testPlainText, out, 16, (u32*)testKeyText); //Display encrypted data printf("\n GPU Encryption: "); for (int i = 0; i < AES_BLOCK_SIZE; i++) printf("%x", out[i]); //Assert that the encrypted output is the same as the NIST testCipherText vector assert (memcmp (out, testCipherText, 16) == 0); }

Here, the tuning function is responsible for allocating memory, calling the kernel, and sending the results back to hos. Notice that I synchronize before sending back to the host, so at this point everything should be completed, which makes me think that the problem is in the kernel.

 __host__ double AES_set_encrption (... *input_data,...*output_data, .. input_length, ... ckey ) //Allocate memory in the device and copy the input buffer from the host to the GPU CUDA_SAFE_CALL( cudaMalloc( (void **) &d_input_data,input_length ) ); CUDA_SAFE_CALL( cudaMemcpy( (void*)d_input_data, (void*)input_data, input_length, cudaMemcpyHostToDevice ) ); dim3 dimGrid(1); dim3 dimBlock(THREAD_X,THREAD_Y); // THREAD_X = 4 & THREAD_Y = 64 AES_encrypt<<<dimGrid,dimBlock>>>(d_input_data); cudaThreadSynchronize(); //Copy the data processed by the GPU back to the host cudaMemcpy(output_data, d_input_data, input_length, cudaMemcpyDeviceToHost); //Free CUDA resources CUDA_SAFE_CALL( cudaFree(d_input_data) ); }

And finally, in the kernel, I have the set of AES rounds computed. Since I thought the synchhonization problem was then in the kernel, I installed __syncthreads (); after each round or computational operation to ensure that all threads are moving at the same time, so no uncertain migth values are computed. But still, this did not solve the problem.

Here is the result when I use the 8600M GT GPU, which works great:

AES 256-bit key

NIST test vectors:

PlaintText: 6bc1bee22e409f96e93d7e117393172a

Key: 603deb1015ca71be2b73aef0857d7781

CipherText: f3eed1bdb5d2a03c64b5a7e3db181f8

GPU Encryption: f3eed1bdb5d2a03c64b5a7e3db181f8

Test Status: Passed

And so when I use the Quadro FX 770M and it doesn’t work out !!

AES 256-bit key NIST test vectors:

PlaintText: 6bc1bee22e409f96e93d7e117393172a

Key: 603deb1015ca71be2b73aef0857d7781

CipherText: f3eed1bdb5d2a03c64b5a7e3db181f8

GPU encryption: c837204eb4c1063ed79c77946893b0

General argument memcmp (out, testCipherText, 16) == 0 caused an error

Test Status: Failure

What could be the reason why 2 GPUs calculate different results, even if they process the same cores ??? I would appreciate any hint or troubleshooting that any of you may have to give me or any step to fix this problem.

Thanks in advance!

+4

c parallel-processing gpgpu cuda nvidia

Bartzilla Oct 20 '10 at 12:19

source share

1 answer

jmilloy · Answer 1 · 2010-10-22T17:27:20+0000

disclaimer: I don't know anything about AES encryption.

Do you use double precision? You probably know, but to be sure, I believe that both cards that you use are the computing ability 1.1, which does not support double precision. Perhaps maps or platforms are converted to the same accuracy in different ways ...? Somebody knows? In truth, IEEE floating point deviations are well indicated, so I will be surprised.

CUDA cores give different results on two different GPUs (GeForce 8600M GT vs Quadro FX 770M)

Here is the result when I use the 8600M GT GPU, which works great:

And so when I use the Quadro FX 770M and it doesn’t work out !!

More articles: