OpenCL Floating Point Accuracy

I found a discrepancy between the floating-point results on the host and on the OpenCL device. The problem is that the floating-point values calculated by OpenCL do not match the values calculated by my Visual Studio 2010 compiler when compiling for x86. However, when compiled for x64, they do match. I know it must have something to do with this: http://www.viva64.com/en/b/0074/

The source that I used during testing was: http://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism When I ran the program in x86, it gave me 202 equal numbers, while the kernel and the C++ program were squaring 1269760 numbers. However, in the x64 build, all 1269760 numbers were correct, in other words 100%. Furthermore, I found that the error between the OpenCL result and the x86 C++ result was 5.5385384e-014, which is a very small fraction, but still not small enough compared to the epsilon of the number, which was 2.92212543378266922312416e-19.

The error has to be smaller than epsilon for the program to recognize the two numbers as identical. Of course, normally no one would compare floats for equality in the first place, but it is good to know that the float limits differ. And yes, I tried setting flt:static, but got the same error.
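
For context, the check in the linked example is essentially an exact equality test between the GPU output and a CPU-side square. Here is a minimal self-contained sketch of that kind of loop (variable names such as data, results, and COUNT are assumed for illustration, not taken verbatim from the article):

 #include <stdio.h>

 #define COUNT 1024

 int main(void)
 {
     /* data is the input buffer; results stands in for the buffer
        read back from the GPU (here filled on the CPU just to show
        the comparison itself) */
     float data[COUNT], results[COUNT];
     int i, correct = 0;

     for (i = 0; i < COUNT; i++) {
         data[i] = i / (float)COUNT;
         results[i] = data[i] * data[i];
     }
     for (i = 0; i < COUNT; i++)
         if (results[i] == data[i] * data[i])  /* exact equality, as in the article */
             correct++;
     printf("%d of %d squares matched\n", correct, COUNT);
     return 0;
 }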

So, I would like an explanation of this behavior. Thanks in advance for all answers.

+4
2 answers

Since nothing changes in the GPU code when switching the project from x86 to x64, it all has to do with how the multiplication is performed on the CPU. There are some subtle differences between floating-point handling in x86 and x64 modes, and the biggest one is that since any x64 CPU also supports SSE and SSE2, these are used by default for math operations in 64-bit mode on Windows.

The HD4770 GPU performs all calculations using single-precision floating-point units. Modern x64 CPUs, on the other hand, have two kinds of functional units that handle floating-point numbers (a quick way to check which one a given build uses is shown after the list):

  • the x87 FPU, which works with a much higher extended precision of 80 bits
  • the SSE FPU, which works with 32-bit and 64-bit precision and is much more compatible with the way other CPUs handle floating-point numbers
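
One quick way to see which evaluation mode a given build uses is the C99 FLT_EVAL_METHOD macro, if the compiler provides it; a minimal sketch:

 #include <stdio.h>
 #include <float.h>

 int main(void)
 {
     /* C99: 0 = expressions evaluated in the operand's own type (typical
        of SSE code generation), 2 = evaluated in long double precision
        (typical of x87 code generation) */
 #ifdef FLT_EVAL_METHOD
     printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
 #else
     printf("FLT_EVAL_METHOD is not defined by this compiler\n");
 #endif
     return 0;
 }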

In 32-bit mode, the compiler does not assume that SSE is available and generates regular x87 FPU code to do the math. In this case, operations like data[i] * data[i] are performed internally with the much higher 80-bit precision. A comparison of the form if (results[i] == data[i] * data[i]) is performed as follows:

  • data[i] is pushed onto the x87 FPU stack using FLD DWORD PTR data[i]
  • data[i] * data[i] is computed using FMUL DWORD PTR data[i]
  • result[i] is pushed onto the x87 FPU stack using FLD DWORD PTR result[i]
  • both values are compared using FUCOMPP

That is the problem: data[i] * data[i] sits in an x87 FPU stack element with 80-bit precision, while result[i] comes from the GPU with 32-bit precision. The two numbers will most likely differ, since data[i] * data[i] has more significant digits, whereas result[i] (viewed in 80-bit precision) has many trailing zeros.
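
A small illustration of this effect, assuming a compiler where long double maps to the 80-bit x87 format (e.g. GCC on x86; MSVC treats long double as 64-bit, so it will not show the difference):

 #include <stdio.h>

 int main(void)
 {
     float f = 0.1f;  /* stored as 0.100000001490116... in binary32 */

     /* 80-bit product: exact, since a 24-bit x 24-bit product fits in
        the 64-bit significand of the extended format */
     long double wide = (long double)f * (long double)f;
     /* 32-bit product: the assignment rounds the result to float precision */
     float narrow = f * f;

     printf("wide   = %.25Lf\n", wide);
     printf("narrow = %.25f\n", (double)narrow);
     printf("equal: %s\n", (long double)narrow == wide ? "yes" : "no");
     return 0;
 }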

In 64-bit mode, things happen differently. The compiler knows that your CPU is SSE capable and uses SSE instructions to do the math. The same comparison is performed as follows on x64:

  • data[i] is loaded into an SSE register using MOVSS XMM0, DWORD PTR data[i]
  • data[i] * data[i] is computed using MULSS XMM0, DWORD PTR data[i]
  • result[i] is loaded into another SSE register using MOVSS XMM1, DWORD PTR result[i]
  • both values are compared using UCOMISS XMM1, XMM0

In this case, the squaring is performed with the same 32-bit precision as on the GPU, and no intermediate result with 80-bit precision is ever created. That is why the results are the same.
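
This can also be seen directly with SSE intrinsics; here is a minimal sketch (assuming an SSE-capable compiler and a build where the plain C product is also rounded to 32 bits, e.g. an x64 build):

 #include <stdio.h>
 #include <xmmintrin.h>

 int main(void)
 {
     float f = 0.1f;
     __m128 x = _mm_set_ss(f);

     /* MULSS: single-precision multiply, rounded to 32 bits */
     float sse_prod = _mm_cvtss_f32(_mm_mul_ss(x, x));
     /* in an x64 build the compiler emits MULSS here as well */
     float c_prod = f * f;

     printf("equal: %s\n", sse_prod == c_prod ? "yes" : "no");
     return 0;
 }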

This is very easy to verify, even with no GPU involved. Just run the following simple program:

 #include <stdlib.h>
 #include <stdio.h>

 /* squaring through a float variable: the store into f forces the
    80-bit intermediate to be rounded to 32-bit precision */
 float mysqr(float f)
 {
     f *= f;
     return f;
 }

 int main(void)
 {
     int i, n;
     float f;

     srand(1);
     for (i = n = 0; n < 1000000; n++)
     {
         f = rand() / (float)RAND_MAX;
         /* compare the 32-bit square with the (possibly 80-bit) f*f */
         if (mysqr(f) != f*f) i++;
     }
     printf("%d of %d squares differ\n", i, n);
     return 0;
 }

mysqr is written specifically so that the intermediate 80-bit result gets converted to a float with 32-bit precision. If you compile and run in 64-bit mode, the output is:

 0 of 1000000 squares differ 

If you compile and run in 32-bit mode, the output is:

 999845 of 1000000 squares differ 

In principle, you should be able to change the floating-point model in 32-bit mode (project properties → Configuration Properties → C/C++ → Code Generation → Floating Point Model), but with VS2010 doing so changes nothing: intermediate results are still kept in the FPU. What you can do instead is store and then reload the computed square, so that it is rounded to 32-bit precision before it is compared with the result from the GPU. In the simple example above, this is achieved by changing:

 if (mysqr(f) != f*f) i++; 

to

 if (mysqr(f) != (float)(f*f)) i++; 

After this change, the output of the 32-bit code is:

 0 of 1000000 squares differ 
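
Applied to the host loop from the question, the analogous change would be (a sketch following the same pattern, with the same assumed variable names):

 if (results[i] == (float)(data[i] * data[i])) correct++; 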
+8

In my case

 (float)(f*f) 

did not help. I used

  correct = 0;
  for (unsigned int i = 0; i < count; i++)
  {
      /* volatile forces the square to be stored to memory as a 32-bit
         float and reloaded, discarding any extra x87 precision */
      volatile float sqr = data[i] * data[i];
      if (results[i] == sqr) correct++;
  }

instead.

-1
