Since nothing changes in the GPU code when switching a project from x86 to x64, the difference must come from the way the multiplication is done on the CPU. There are some subtle differences between floating-point processing in x86 and x64 modes, and the biggest one is this: since every x64 processor also supports SSE and SSE2, SSE is used by default for mathematical operations in 64-bit mode on Windows.
The HD4770 GPU performs all calculations in single-precision floating-point units. Modern x64 processors, on the other hand, have two types of functional units that process floating-point numbers:
- the x87 FPU, which works internally with a much higher 80-bit extended precision
- the SSE FPU, which works with 32-bit and 64-bit precision and is very compatible with how other CPUs handle floating-point numbers
In 32-bit mode, the compiler cannot assume that SSE is available and generates plain x87 FPU code to do the math. In that case, operations like data[i] * data[i] are performed internally with the much higher 80-bit precision. A comparison of the form if (results[i] == data[i] * data[i]) is carried out as follows:
- data[i] is pushed onto the x87 FPU stack using FLD DWORD PTR data[i]
- data[i] * data[i] is calculated using FMUL DWORD PTR data[i]
- result[i] is pushed onto the x87 FPU stack using FLD DWORD PTR result[i]
- both values are compared using FUCOMPP
That is the problem: data[i] * data[i] sits in an x87 FPU stack element with 80-bit precision, while result[i] comes from the GPU with 32-bit precision. The two numbers will most likely differ, because data[i] * data[i] has many more significant digits, whereas result[i], viewed at 80-bit precision, has many trailing zeros!
In 64-bit mode, things work differently. The compiler knows that your processor is SSE-capable and uses SSE instructions to do the math. The same comparison statement is executed as follows on x64:
- data[i] is loaded into an SSE register using MOVSS XMM0, DWORD PTR data[i]
- data[i] * data[i] is calculated using MULSS XMM0, DWORD PTR data[i]
- result[i] is loaded into another SSE register using MOVSS XMM1, DWORD PTR result[i]
- both values are compared using UCOMISS XMM1, XMM0
In this case, the squaring is performed with the same 32-bit precision as on the GPU, and no 80-bit intermediate result is ever created. That is why the results are the same.
This is very simple to verify, even without involving the GPU at all. Just run the following small program:
```c
#include <stdlib.h>
#include <stdio.h>

float mysqr(float f)
{
    f *= f;
    return f;
}

int main(void)
{
    int i, n;
    float f;

    srand(1);
    for (i = n = 0; n < 1000000; n++) {
        f = rand() / (float)RAND_MAX;
        if (mysqr(f) != f * f)
            i++;
    }
    printf("%d of %d squares differ\n", i, n);
    return 0;
}
```
mysqr is written this way on purpose, so that the 80-bit intermediate result gets converted back to 32-bit float precision on return. If you compile and run it in 64-bit mode, the output is:
0 of 1000000 squares differ
If you compile and run in 32-bit mode, the output is:
999845 of 1000000 squares differ
In principle, you should be able to change the floating-point model in 32-bit mode (project properties → Configuration Properties → C/C++ → Code Generation → Floating Point Model), but at least with VS2010 nothing changes: intermediate results are still kept in the FPU. What you can do instead is save and reload the calculated square, so that it is rounded to 32-bit precision before it is compared with the result from the GPU. In the simple example above, this is achieved by changing:
if (mysqr(f) != f*f) i++;
to
if (mysqr(f) != (float)(f*f)) i++;
After this change, the 32-bit code outputs:
0 of 1000000 squares differ